Version: Beta 🚧

Metrics API

The Tecton Metrics API provides performance monitoring metrics for the Tecton Feature Platform using the OpenMetrics standard. The Metrics API is compatible with common APM systems like DataDog, SignalFX, and New Relic.

Available Metrics

The following metrics are available through the Metrics API for building monitoring dashboards and alerts. See the OpenMetrics reference for details about metric types.

General Availability

Name & Description	Type	Unit	Labels	Release stage
feature_service_requests_total_rate Total count of feature service requests over five minutes	GAUGE	requests per second	aws_region	GA
feature_service_requests_rate GetFeatures request rate by feature service over 5 minutes. GetFeaturesBatch fans out requests into multiple GetFeatures requests increasing the rate proportional to the batch size.	GAUGE	requests per second	aws_region feature_service_id feature_service_name	GA
feature_service_latency Feature serving GetFeatures and GetFeaturesBatch latency by service	SUMMARY	second	feature_service_id feature_service_name	GA
feature_server_errors_rate Feature serving errors rate by HTTP status	GAUGE	percent	status	GA
feature_server_utilization Maximum utilization percentage among all the feature server instances	GAUGE	percent	aws_region	GA
feature_server_average_utilization Average utilization percentage among all the feature server instances	GAUGE	percent	aws_region	GA
feature_server_minimum_utilization Minimum utilization percentage among all the feature server instances	GAUGE	percent	aws_region	GA
spark_stream_max_processed_event_age Maximum age of event processed for Spark streaming	GAUGE	second	workspace feature_view_name feature_view_id	GA
spark_stream_min_processed_event_age Minimum age of event processed for Spark streaming	GAUGE	second	workspace feature_view_name feature_view_id	GA
spark_stream_average_processed_event_age Average age of events processed for Spark streaming	GAUGE	second	workspace feature_view_name feature_view_id	GA
spark_stream_input_rate Stream request input rate for Spark streaming	GAUGE	requests per second	workspace feature_view_name feature_view_id	GA
spark_stream_served_feature_age Served feature age for Spark streaming	GAUGE	second	workspace feature_view_name feature_view_id	GA
spark_stream_online_store_write_rate Online store write request rate for Spark streaming	GAUGE	rows per second	workspace feature_view_name feature_view_id	GA
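
The fan-out note on feature_service_requests_rate can be made concrete with a little arithmetic. The sketch below assumes each GetFeaturesBatch call is counted as exactly one GetFeatures request per batch item; the table only says the increase is proportional to the batch size, so treat the function and its parameter names as illustrative rather than as Tecton's exact accounting.

```python
def effective_get_features_rate(single_rps: float, batch_rps: float, batch_size: int) -> float:
    """Estimate the reported GetFeatures rate in requests per second.

    Assumption (not confirmed by the source): every GetFeaturesBatch call
    fans out into exactly `batch_size` GetFeatures requests.
    """
    return single_rps + batch_rps * batch_size


# 20 direct GetFeatures calls/s plus 10 batch calls/s with 5 items each.
print(effective_get_features_rate(single_rps=20.0, batch_rps=10.0, batch_size=5))  # 70.0
```

Under this assumption, a modest number of batch calls can dominate the reported rate, which is worth keeping in mind when setting alert thresholds on it.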

Preview

Name & Description	Type	Unit	Labels	Release stage
feature_service_requests_total_rate_per_server_group Total count of feature service requests per server group over five minutes	GAUGE	requests per second	aws_region server_group	Preview
feature_service_requests_rate_per_server_group GetFeatures request rate by feature service and server group over 5 minutes. GetFeaturesBatch fans out requests into multiple GetFeatures requests increasing the rate proportional to the batch size.	GAUGE	requests per second	aws_region feature_service_id feature_service_name server_group	Preview
feature_server_errors_rate_grpc Feature serving errors rate by gRPC code	GAUGE	percent	status	Preview
feature_server_errors_rate_grpc_per_server_group Feature serving errors rate by gRPC code per server group	GAUGE	percent	status server_group	Preview
feature_server_average_utilization_per_server_group Average utilization percentage among all the feature server instances per server group	GAUGE	percent	aws_region server_group	Preview
feature_server_max_utilization_per_server_group Maximum utilization percentage among all the feature server instances per server group	GAUGE	percent	aws_region server_group	Preview
feature_server_minimum_utilization_per_server_group Minimum utilization percentage among all the feature server instances per server group	GAUGE	percent	aws_region server_group	Preview
feature_server_group_utilization_percentiles Utilization percentiles (p50, p90, p95, p99, p100) across Feature Server instances in a server group, used as target metrics for autoscaling	GAUGE	percent	aws_region server_group percentile	Preview
feature_server_scaling_requests Count of feature server scaling request responses per gRPC response code	GAUGE	request count	aws_region code	Preview
feature_server_autoscaler_desired_replica_count Desired replica count of feature server set by autoscaling policy (empty if autoscaling is disabled)	GAUGE	replica count	aws_region	Preview
feature_server_autoscaler_current_replica_count Current replica count of feature server (empty if autoscaling is disabled)	GAUGE	replica count	aws_region	Preview
feature_server_autoscaler_max_replica_count Maximum replica count of feature server (empty if autoscaling is disabled)	GAUGE	replica count	aws_region	Preview
online_store_p99_latency_seconds P99 latency of online store read latencies per feature view and region	GAUGE	second	aws_region feature_view_id	Preview
spark_stream_processing_rate Stream request processing rate for Spark streaming	GAUGE	requests per second	workspace feature_view_name feature_view_id	Preview
stream_ingestapi_request_rate Request rate for Stream Ingest API	GAUGE	requests per second	aws_region	Preview
stream_ingestapi_request_processing_latency Request processing latency for Stream Ingest API	SUMMARY	second	aws_region	Preview
stream_ingestapi_request_processing_error_rate Request handling error rate for Stream Ingest API	GAUGE	requests per second	aws_region error_code (4xx or 5xx)	Preview
stream_ingestapi_online_store_write_rate Rows write rate to the online store for Stream Ingest API	GAUGE	rows per second	workspace feature_view_name feature_view_id	Preview
stream_ingestapi_offline_store_write_rate Rows write rate to the offline store for Stream Ingest API	GAUGE	rows per second	workspace feature_view_name feature_view_id	Preview
feature_server_cache_request_rate Number of requests sent to the cache by operations per second. One FV read/write is a request	GAUGE	requests per second	aws_region operation feature_service_name	Preview
feature_server_cache_hit_rate Percentage of requests to the cache that have a value in the cache	GAUGE	percent	feature_service_name feature_view	Preview
feature_server_cache_memory_used_total Amount of memory in bytes that is being utilized by the cache by primary / replica	GAUGE	bytes	aws_region shard_type	Preview
feature_server_cache_memory_provisioned_total Amount of memory in bytes that can possibly be utilized by the cache by primary / replica	GAUGE	bytes	aws_region	Preview
feature_server_cache_primary_nodes_count Number of nodes that are allocated as a primary shard in the cache instance	GAUGE	primary shard count	aws_region	Preview
feature_server_cache_replica_nodes_count Number of nodes that are allocated as replicas in the cache instance	GAUGE	replica shard count	aws_region	Preview
feature_server_cache_engine_utilization_average_percent Percent of the Redis engine CPU thread that is currently being used	GAUGE	percent	aws_region	Preview

Metric release stages

The release stage represents the expected stability of the metric. We recommend relying only on General Availability metrics for production dashboards. The release stage for each metric is noted in the tables above, as well as in the metric's HELP string in the OpenMetrics output.

Release stage	Description
General Availability (GA)	Ready for production use. Schema and definition of the metric will not change.
Preview	Intended for collecting feedback. Schema and definition subject to change before moving to GA.
Deprecated	Usage of the metric is discouraged. Will be maintained until the specified end-of-support date.

Metrics API endpoint

The Metrics API endpoint is https://<your-instance>.tecton.ai/api/v1/observability/metrics.

Here's an example query using 'curl':

$ curl -H "Authorization: Tecton-key $TECTON_API_KEY" \
    https://$INSTANCE.tecton.ai/api/v1/observability/metrics

feature_service_requests_rate{aws_region="us-west-2",feature_service_id="072b546997cb6e586ed460ff0a3743ee",feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="005c5a6f3517e1e2a4ce411372a15d84",feature_service_name="fvfs_2"} 0 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="00c922b96f55a948e1bbfa08fdb3a699",feature_service_name="fvfs_3"} 0 1692222226141

This is sample output for a gauge metric named feature_service_requests_rate with the labels aws_region, feature_service_id, and feature_service_name. On each line, the number after the label set is the value of the metric, and the final number is the timestamp in milliseconds. For output format details, see the OpenMetrics reference.
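
To illustrate the line layout described above, here is a minimal Python sketch that splits one sample line into its name, labels, value, and timestamp. The names `Sample`, `parse_sample`, and `parse_metrics` are hypothetical helpers for this example, not part of any Tecton client. The parser only handles the simple shape shown in the sample output (it does not unescape label values and assumes an integer millisecond timestamp); for production use, a full OpenMetrics parser is the safer choice.

```python
import re
from typing import Dict, Iterator, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]


# metric_name{label="value",...} <value> [<timestamp>]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>\S+))?$'
)
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Sample:
    """Parse a single sample line; raises ValueError if the shape is unexpected."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    ts = match.group("ts")
    # Assumption: the timestamp, when present, is an integer in milliseconds.
    return Sample(match.group("name"), labels, float(match.group("value")),
                  int(ts) if ts else None)


def parse_metrics(text: str) -> Iterator[Sample]:
    """Yield samples from a response body, skipping blank and '#' metadata lines."""
    for line in text.splitlines():
        if line.strip() and not line.lstrip().startswith("#"):
            yield parse_sample(line)


line = ('feature_service_requests_rate{aws_region="us-west-2",'
        'feature_service_id="072b546997cb6e586ed460ff0a3743ee",'
        'feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141')
sample = parse_sample(line)
print(sample.name, sample.labels["feature_service_name"], sample.value, sample.timestamp_ms)
```

Keeping the labels as a dictionary makes it straightforward to filter samples, for example by feature_service_name, before forwarding them to a dashboard or alerting rule.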

Example Integrations

The following sections show how to configure common observability platforms to scrape the Tecton Metrics API. An interval of 30 seconds is recommended for scraping the Metrics API.

You'll need an API key associated with a service account to use the Metrics API. Create a service account using the CLI:

$ tecton service-account create \
      --name "metrics-api-consumer" \
      --description "Metrics API consumer for operational monitoring"

Save this API Key - you will not be able to get it again.
API Key:            your-api-key
Service Account ID: your-service-account-id

Make a note of the API key returned by the CLI.
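
If you want to check the key before wiring up an agent, a small custom scraper can help. The following is a minimal Python sketch, assuming the endpoint URL and the `Authorization: Tecton-key ...` header shown in the curl example; the function names (`build_metrics_request`, `fetch_metrics`, `poll_metrics`) are hypothetical and exist only for this illustration. The default interval follows the 30-second recommendation above.

```python
import time
import urllib.request
from typing import Callable, Optional


def build_metrics_request(instance: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the Metrics API endpoint."""
    url = f"https://{instance}.tecton.ai/api/v1/observability/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Tecton-key {api_key}"})


def fetch_metrics(instance: str, api_key: str, timeout: float = 10.0) -> str:
    """Perform one scrape and return the response body as text."""
    request = build_metrics_request(instance, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def poll_metrics(fetch: Callable[[], str],
                 handle: Callable[[str], None],
                 interval_seconds: float = 30.0,
                 max_polls: Optional[int] = None,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Call fetch() repeatedly, passing each body to handle().

    Runs forever when max_polls is None; sleeps between polls only.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        handle(fetch())
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
```

For example, with the `INSTANCE` and `TECTON_API_KEY` environment variables from the curl example, `poll_metrics(lambda: fetch_metrics(os.environ["INSTANCE"], os.environ["TECTON_API_KEY"]), print, max_polls=1)` would print a single scrape. Passing `fetch` and `sleep` as parameters keeps the loop easy to test without network access.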

DataDog

A DataDog agent can be configured to ingest metrics from Tecton's Metrics API.

Install the DataDog agent.

Note: This step can be skipped if you already have a DataDog agent (≥ 7.32.0) running on a machine that has access to Tecton endpoints.

The installation procedure depends on the platform. Use the official DataDog documentation for the specific platform.

Edit the agent configuration found in the agent configuration directory. Modify openmetrics.d/conf.yaml by adding the following:

instances:
  - openmetrics_endpoint: 'https://<your-instance>.tecton.ai/api/v1/observability/metrics'
    namespace: tecton
    metrics:
      - .+
    min_collection_interval: 30
    headers:
      Authorization: Tecton-key <TECTON API KEY>

SignalFX

Deploy the Splunk OpenTelemetry Collector.

Configure the collector to ingest Tecton metrics. For example:

receivers:
  lightprometheus:
    endpoint: https://<your-instance>.tecton.ai/api/v1/observability/metrics
    headers:
      Authorization: Tecton-key <TECTON API KEY>
    collection_interval: 30s
    resource_attributes:
      service.name:
        enabled: false
      service.instance.id:
        enabled: false

exporters:
  signalfx:
    access_token: <SIGNALFX TOKEN>
    realm: <SIGNALFX REALM>

service:
  pipelines:
    metrics:
      receivers: [lightprometheus]
      exporters: [signalfx]

OpenTelemetry (OTEL) Collector

If your observability system doesn't support OpenMetrics out of the box, you can install the OTEL collector and configure it to export to a cloud-based monitoring system or a self-hosted alternative. The OTEL collector supports several out-of-the-box exporters that integrate with nearly any monitoring setup.

The following example configures the OTEL collector to use the Tecton Metrics API (the configuration can vary depending on the version of the collector).

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 30s
          scheme: https
          metrics_path: /api/v1/observability/metrics
          authorization:
            type: Tecton-key
            credentials: <TECTON API KEY>
          static_configs:
            - targets: [<your-instance>.tecton.ai]

exporters:
  datadog:
    api:
      site: <DD_SITE>
      key: <DD_API_KEY>

processors:
  batch:
    send_batch_max_size: 100
    send_batch_size: 10
    timeout: 10s

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [datadog]
