Version: 1.1

Metrics API

The Tecton Metrics API provides performance monitoring metrics for the Tecton Feature Platform using the OpenMetrics standard. The Metrics API is compatible with common APM systems like DataDog, SignalFX, and New Relic.

Available Metrics

The following metrics are available through the Metrics API for building monitoring dashboards and alerts. See the OpenMetrics reference for details about metric types.

General Availability

| Name | Description | Type | Unit | Labels | Release stage |
| --- | --- | --- | --- | --- | --- |
| `feature_service_requests_total_rate` | Total count of feature service requests over five minutes | GAUGE | requests per second | `aws_region` | GA |
| `feature_service_requests_rate` | `GetFeatures` request rate by feature service over five minutes. `GetFeaturesBatch` fans out into multiple `GetFeatures` requests, increasing the rate in proportion to the batch size. | GAUGE | requests per second | `aws_region`, `feature_service_id`, `feature_service_name` | GA |
| `feature_service_latency` | Feature serving `GetFeatures` and `GetFeaturesBatch` latency by service | SUMMARY | second | `feature_service_id`, `feature_service_name` | GA |
| `feature_server_errors_rate` | Feature serving error rate by HTTP status | GAUGE | percent | `status` | GA |
| `feature_server_utilization` | Maximum utilization percentage among all feature server instances | GAUGE | percent | `aws_region` | GA |
| `feature_server_average_utilization` | Average utilization percentage among all feature server instances | GAUGE | percent | `aws_region` | GA |
| `feature_server_minimum_utilization` | Minimum utilization percentage among all feature server instances | GAUGE | percent | `aws_region` | GA |
| `spark_stream_max_processed_event_age` | Maximum age of events processed by Spark streaming | GAUGE | second | `workspace`, `feature_view_name`, `feature_view_id` | GA |
| `spark_stream_min_processed_event_age` | Minimum age of events processed by Spark streaming | GAUGE | second | `workspace`, `feature_view_name`, `feature_view_id` | GA |
| `spark_stream_average_processed_event_age` | Average age of events processed by Spark streaming | GAUGE | second | `workspace`, `feature_view_name`, `feature_view_id` | GA |
| `spark_stream_input_rate` | Stream request input rate for Spark streaming | GAUGE | requests per second | `workspace`, `feature_view_name`, `feature_view_id` | GA |
| `spark_stream_served_feature_age` | Served feature age for Spark streaming | GAUGE | second | `workspace`, `feature_view_name`, `feature_view_id` | GA |
| `spark_stream_online_store_write_rate` | Online store write request rate for Spark streaming | GAUGE | rows per second | `workspace`, `feature_view_name`, `feature_view_id` | GA |
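
The fan-out note for `feature_service_requests_rate` above can be made concrete with a small calculation; the numbers below are hypothetical and chosen only for illustration:

```python
# Hypothetical illustration of GetFeaturesBatch fan-out: each batch request
# counts as one GetFeatures request per item in the batch, so the rate that
# feature_service_requests_rate reports scales with the batch size.

def effective_getfeatures_rate(batch_requests_per_second, batch_size):
    """Rate feature_service_requests_rate would report for pure batch traffic."""
    return batch_requests_per_second * batch_size

# 10 GetFeaturesBatch calls per second, each with 8 keys, are observed as
# 80 GetFeatures requests per second.
print(effective_getfeatures_rate(10, 8))  # 80
```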

Preview

| Name | Description | Type | Unit | Labels | Release stage |
| --- | --- | --- | --- | --- | --- |
| `feature_service_requests_total_rate_per_server_group` | Total count of feature service requests per server group over five minutes | GAUGE | requests per second | `aws_region`, `server_group` | Preview |
| `feature_service_requests_rate_per_server_group` | `GetFeatures` request rate by feature service and server group over five minutes. `GetFeaturesBatch` fans out into multiple `GetFeatures` requests, increasing the rate in proportion to the batch size. | GAUGE | requests per second | `aws_region`, `feature_service_id`, `feature_service_name`, `server_group` | Preview |
| `feature_server_errors_rate_grpc` | Feature serving error rate by gRPC code | GAUGE | percent | `status` | Preview |
| `feature_server_errors_rate_grpc_per_server_group` | Feature serving error rate by gRPC code per server group | GAUGE | percent | `status`, `server_group` | Preview |
| `feature_server_average_utilization_per_server_group` | Average utilization percentage among all feature server instances per server group | GAUGE | percent | `aws_region`, `server_group` | Preview |
| `feature_server_max_utilization_per_server_group` | Maximum utilization percentage among all feature server instances per server group | GAUGE | percent | `aws_region`, `server_group` | Preview |
| `feature_server_minimum_utilization_per_server_group` | Minimum utilization percentage among all feature server instances per server group | GAUGE | percent | `aws_region`, `server_group` | Preview |
| `feature_server_group_utilization_percentiles` | Utilization percentiles (p50, p90, p95, p99, p100) across feature server instances in a server group, used as target metrics for autoscaling | GAUGE | percent | `aws_region`, `server_group`, `percentile` | Preview |
| `feature_server_scaling_requests` | Count of feature server scaling request responses per gRPC response code | GAUGE | request count | `aws_region`, `code` | Preview |
| `feature_server_autoscaler_desired_replica_count` | Desired replica count of the feature server set by the autoscaling policy (empty if autoscaling is disabled) | GAUGE | replica count | `aws_region` | Preview |
| `feature_server_autoscaler_current_replica_count` | Current replica count of the feature server (empty if autoscaling is disabled) | GAUGE | replica count | `aws_region` | Preview |
| `feature_server_autoscaler_max_replica_count` | Maximum replica count of the feature server (empty if autoscaling is disabled) | GAUGE | replica count | `aws_region` | Preview |
| `online_store_p99_latency_seconds` | p99 latency of online store reads per feature view and region | GAUGE | second | `aws_region`, `feature_view_id` | Preview |
| `spark_stream_processing_rate` | Stream request processing rate for Spark streaming | GAUGE | requests per second | `workspace`, `feature_view_name`, `feature_view_id` | Preview |
| `stream_ingestapi_request_rate` | Request rate for the Stream Ingest API | GAUGE | requests per second | `aws_region` | Preview |
| `stream_ingestapi_request_processing_latency` | Request processing latency for the Stream Ingest API | SUMMARY | second | `aws_region` | Preview |
| `stream_ingestapi_request_processing_error_rate` | Request handling error rate for the Stream Ingest API | GAUGE | requests per second | `aws_region`, `error_code` (4xx or 5xx) | Preview |
| `stream_ingestapi_online_store_write_rate` | Row write rate to the online store for the Stream Ingest API | GAUGE | rows per second | `workspace`, `feature_view_name`, `feature_view_id` | Preview |
| `stream_ingestapi_offline_store_write_rate` | Row write rate to the offline store for the Stream Ingest API | GAUGE | rows per second | `workspace`, `feature_view_name`, `feature_view_id` | Preview |
| `feature_server_cache_request_rate` | Number of requests sent to the cache per second, by operation. One feature view read/write is one request. | GAUGE | requests per second | `aws_region`, `operation`, `feature_service_name` | Preview |
| `feature_server_cache_hit_rate` | Percentage of cache requests that find a value in the cache | GAUGE | percent | `feature_service_name`, `feature_view` | Preview |
| `feature_server_cache_memory_used_total` | Amount of memory in bytes used by the cache, by primary/replica | GAUGE | bytes | `aws_region`, `shard_type` | Preview |
| `feature_server_cache_memory_provisioned_total` | Amount of memory in bytes that can be used by the cache, by primary/replica | GAUGE | bytes | `aws_region` | Preview |
| `feature_server_cache_primary_nodes_count` | Number of nodes allocated as primary shards in the cache instance | GAUGE | primary shard count | `aws_region` | Preview |
| `feature_server_cache_replica_nodes_count` | Number of nodes allocated as replicas in the cache instance | GAUGE | replica shard count | `aws_region` | Preview |
| `feature_server_cache_engine_utilization_average_percent` | Percent of the Redis engine CPU thread currently in use | GAUGE | percent | `aws_region` | Preview |

Metric release stages

The release stage represents the expected stability of the metric. We recommend only relying on General Availability metrics for production dashboards. The release stage for each metric is noted in the table above, as well as in the Help string of the OpenMetrics protocol.

| Release stage | Description |
| --- | --- |
| General Availability (GA) | Ready for production use. Schema and definition of the metric will not change. |
| Preview | Intended for collecting feedback. Schema and definition subject to change before moving to GA. |
| Deprecated | Usage of the metric is discouraged. Will be maintained until the specified end-of-support date. |

Metrics API endpoint

The Metrics API endpoint is https://<your-instance>.tecton.ai/api/v1/observability/metrics.

Here's an example query using `curl`:

```bash
$ curl -H "Authorization: Tecton-key $TECTON_API_KEY" \
    https://$INSTANCE.tecton.ai/api/v1/observability/metrics

feature_service_requests_rate{aws_region="us-west-2",feature_service_id="072b546997cb6e586ed460ff0a3743ee",feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="005c5a6f3517e1e2a4ce411372a15d84",feature_service_name="fvfs_2"} 0 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="00c922b96f55a948e1bbfa08fdb3a699",feature_service_name="fvfs_3"} 0 1692222226141
```

This is sample output for a gauge metric named `feature_service_requests_rate` with the labels `aws_region`, `feature_service_id`, and `feature_service_name`. The second field on each line is the value of the metric, and the last field is the timestamp in milliseconds. For output format details, see the OpenMetrics reference.
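
A sample line in this format can be picked apart programmatically. Here is a minimal sketch in Python; note that the regular expression covers only the simple `name{labels} value timestamp` shape shown above, not the full OpenMetrics grammar:

```python
import re

# One sample line from the example output above.
SAMPLE = ('feature_service_requests_rate{aws_region="us-west-2",'
          'feature_service_id="072b546997cb6e586ed460ff0a3743ee",'
          'feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141')

# metric_name{label="value",...} value timestamp_ms
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\} '
                     r'(?P<value>\S+) (?P<ts>\d+)$')

def parse_sample(line):
    """Split a simple OpenMetrics sample line into (name, labels, value, ts_ms)."""
    m = LINE_RE.match(line)
    labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels")))
    return m.group("name"), labels, float(m.group("value")), int(m.group("ts"))

name, labels, value, ts_ms = parse_sample(SAMPLE)
print(name, labels["feature_service_name"], value, ts_ms)
```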

Example Integrations

The following sections show how to configure common observability platforms to scrape the Tecton Metrics API. An interval of 30 seconds is recommended for scraping the Metrics API.

You'll need an API key associated with a service account to use the Metrics API. Create a service account using the CLI:

```bash
$ tecton service-account create \
    --name "metrics-api-consumer" \
    --description "Metrics API consumer for operational monitoring"

Save this API Key - you will not be able to get it again.
API Key: your-api-key
Service Account ID: your-service-account-id
```

Make a note of the API key returned by the CLI.
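
If you are writing your own scraper instead of using one of the integrations below, the call is a plain HTTPS GET with the key in the `Authorization` header. A minimal Python sketch (the helper names are ours, and the instance name and key are placeholders):

```python
import urllib.request

def metrics_url(instance):
    """Metrics API endpoint for a given Tecton instance name."""
    return f"https://{instance}.tecton.ai/api/v1/observability/metrics"

def auth_header(api_key):
    """Authorization header expected by the Metrics API."""
    return {"Authorization": f"Tecton-key {api_key}"}

def fetch_metrics(instance, api_key):
    """Fetch the raw OpenMetrics payload; scrape at most every 30 seconds."""
    req = urllib.request.Request(metrics_url(instance), headers=auth_header(api_key))
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Example (requires network access and a valid key):
# payload = fetch_metrics("<your-instance>", "<TECTON API KEY>")
```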

DataDog

A DataDog agent can be configured to ingest metrics from Tecton's Metrics API.

  1. Install the DataDog agent.

    note

    This step can be skipped if you already have a DataDog agent (≥ 7.32.0) running on one of your machines and this machine has access to Tecton endpoints.

    The installation procedure depends on the platform. Use the official DataDog documentation for the specific platform.

  2. Edit the agent configuration found in the agent configuration directory. Modify `openmetrics.d/conf.yaml` by adding the following:

    ```yaml
    instances:
      - openmetrics_endpoint: 'https://<your-instance>.tecton.ai/api/v1/observability/metrics'
        namespace: tecton
        metrics:
          - .+
        min_collection_interval: 30
        headers:
          Authorization: Tecton-key <TECTON API TOKEN>
    ```

SignalFX

  1. Deploy the Splunk OpenTelemetry connector.

  2. Configure the collector to ingest Tecton metrics. For example:

    ```yaml
    receivers:
      lightprometheus:
        endpoint: https://<your-instance>.tecton.ai/api/v1/observability/metrics
        headers:
          Authorization: Tecton-key <TECTON API TOKEN>
        collection_interval: 30s
        resource_attributes:
          service.name:
            enabled: false
          service.instance.id:
            enabled: false

    exporters:
      signalfx:
        access_token: <SIGNALFX TOKEN>
        realm: <SIGNALFX REALM>

    service:
      pipelines:
        metrics:
          receivers: [lightprometheus]
          exporters: [signalfx]
    ```

OpenTelemetry (OTEL) Collector

If your observability system doesn't support OpenMetrics out of the box, you can install the OTEL collector and configure it to export to a cloud-based monitoring system or any self-hosted alternative. The OTEL collector supports several out-of-the-box exporters that integrate with nearly any monitoring setup.

The following example configures the OTEL collector to use the Tecton Metrics API (the configuration can vary depending on the version of the collector).

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 30s
          scheme: https
          metrics_path: /api/v1/observability/metrics
          authorization:
            type: Tecton-key
            credentials: <TECTON API KEY>
          static_configs:
            - targets: [<your-instance>.tecton.ai]

exporters:
  datadog:
    api:
      site: <DD_SITE>
      key: <DD_API_KEY>

processors:
  batch:
    send_batch_max_size: 100
    send_batch_size: 10
    timeout: 10s

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [datadog]
```
