Version: 0.9

Metrics API

Metrics for monitoring the performance of the Tecton Feature Platform are available through the Tecton Metrics API. The Tecton Metrics API follows the OpenMetrics standard for metrics collection, which is supported by common Application Performance Monitoring systems such as DataDog, SignalFX and New Relic.

Available Metrics

The following metrics are currently available through the Metrics API. Leverage these metrics to build monitoring dashboards and alerts with your Application Performance Monitoring system.

Name	Type	Unit	Description	Labels	Release stage
feature_service_requests_total_rate	GAUGE	requests per second	Total count of feature service requests over five minutes	aws_region	GA
feature_service_requests_rate	GAUGE	requests per second	Count of GetFeatures and GetFeaturesBatch requests by feature service over five minutes. GetFeaturesBatch calls are translated into equivalent GetFeatures calls to calculate this value, so the rate increases in proportion to the batch size	aws_region, feature_service_id, feature_service_name	GA
feature_service_total_latency	SUMMARY	second	Feature serving GetFeatures and GetFeaturesBatch latency		GA
feature_service_latency	SUMMARY	second	Feature serving GetFeatures and GetFeaturesBatch latency by service	feature_service_id, feature_service_name	GA
feature_server_errors_rate	GAUGE	percent	Feature serving errors rate by HTTP status	status	GA
feature_server_utilization	GAUGE	percent	Maximum utilization percentage among all the feature server instances	aws_region	GA
feature_service_requests_total_rate_per_server_group	GAUGE	requests per second	Total count of feature service requests per server group over five minutes	aws_region, server_group_name	Preview
feature_service_requests_rate_per_server_group	GAUGE	requests per second	Count of GetFeatures and GetFeaturesBatch requests by feature service per server group over five minutes. GetFeaturesBatch calls are translated into equivalent GetFeatures calls to calculate this value, so the rate increases in proportion to the batch size	aws_region, feature_service_id, feature_service_name, server_group_name	Preview
feature_server_errors_rate_grpc	GAUGE	percent	Feature serving errors rate by GRPC code	status	Preview
feature_server_errors_rate_grpc_per_server_group	GAUGE	percent	Feature serving errors rate by GRPC code per server group	status, server_group_name	Preview
feature_server_average_utilization	GAUGE	percent	Average utilization percentage among all the feature server instances	aws_region	Preview
feature_server_minimum_utilization	GAUGE	percent	Minimum utilization percentage among all the feature server instances	aws_region	Preview
feature_server_average_utilization_per_server_group	GAUGE	percent	Average utilization percentage among all the feature server instances per server group	aws_region, server_group_name	Preview
feature_server_max_utilization_per_server_group	GAUGE	percent	Maximum utilization percentage among all the feature server instances per server group	aws_region, server_group_name	Preview
feature_server_minimum_utilization_per_server_group	GAUGE	percent	Minimum utilization percentage among all the feature server instances per server group	aws_region, server_group_name	Preview
spark_stream_max_processed_event_age	GAUGE	second	The maximum age of event processed for Spark streaming	workspace, feature_view_name, feature_view_id	Preview
spark_stream_average_processed_event_age	GAUGE	second	The average age of events processed for Spark streaming	workspace, feature_view_name, feature_view_id	Preview
spark_stream_input_rate	GAUGE	requests per second	The stream request input rate for Spark streaming	workspace, feature_view_name, feature_view_id	Preview
spark_stream_processing_rate	GAUGE	requests per second	The stream request processing rate for Spark streaming	workspace, feature_view_name, feature_view_id	Preview
spark_stream_served_feature_age	GAUGE	second	The served feature age for Spark streaming	workspace, feature_view_name, feature_view_id	Preview
spark_stream_online_store_write_rate	GAUGE	rows per second	The online store write request rate for Spark streaming	workspace, feature_view_name, feature_view_id	Preview
stream_ingestapi_request_rate	GAUGE	requests per second	The request rate for Stream Ingest API	aws_region	Preview
stream_ingestapi_request_processing_latency	SUMMARY	second	The request processing latency for Stream Ingest API	aws_region	Preview
stream_ingestapi_request_processing_error_rate	GAUGE	requests per second	The request handling error rate for Stream Ingest API	aws_region, error_code (4xx or 5xx)	Preview
stream_ingestapi_online_store_write_rate	GAUGE	rows per second	The rows write rate to the online store for Stream Ingest API	workspace, feature_view_name, feature_view_id	Preview
stream_ingestapi_offline_store_write_rate	GAUGE	rows per second	The rows write rate to the offline store for Stream Ingest API	workspace, feature_view_name, feature_view_id	Preview
feature_server_scaling_requests	GAUGE	requests count per response code	The count of feature server scaling request responses per GRPC response code	aws_region, code (OK or PERMISSION_DENIED)	Preview
feature_server_autoscaler_desired_replica_count	GAUGE	the desired feature server replica count	The desired replica count of feature server set by autoscaling policy (empty if autoscaling is disabled)	aws_region	Preview
feature_server_autoscaler_current_replica_count	GAUGE	the current feature server replica count	The current replica count of feature server (empty if autoscaling is disabled)	aws_region	Preview
feature_server_autoscaler_max_replica_count	GAUGE	the maximum feature server replica count	The maximum replica count of feature server (empty if autoscaling is disabled)	aws_region	Preview
feature_server_cache_request_rate	GAUGE	requests per second	The number of requests sent to the cache by operations per second. One FV read/write is a request	aws_region, operation, feature_service_name	Preview
feature_server_cache_hit_rate	GAUGE	percent	The percentage of requests to the cache that have a value in the cache	feature_service_name, feature_view	Preview
feature_server_cache_read_duration_bucket	HISTOGRAM	milliseconds	The bucketed distribution of latencies for reads to the cache	aws_region, le, feature_service_name	Preview
feature_server_cache_memory_used_total	GAUGE	bytes	The amount of memory in bytes that is being utilized by the cache by primary / replica	aws_region, shard_type	Preview
feature_server_cache_memory_provisioned_total	GAUGE	bytes	The amount of memory in bytes that can possibly be utilized by the cache by primary / replica	aws_region	Preview
feature_server_cache_primary_nodes_count	GAUGE	primary shard count	The number of nodes that are allocated as a primary shard in the cache instance	aws_region	Preview
feature_server_cache_replica_nodes_count	GAUGE	replica shard count	The number of replica nodes that are allocated as a replica in the cache instance	aws_region	Preview
feature_server_cache_engine_utilization_average_percent	GAUGE	percent	The percent of the redis engine CPU thread that is currently being used	aws_region	Preview
online_store_p99_latency_seconds	GAUGE	second	The p99 latency of online store reads per feature view and region	aws_region, feature_view_id	Preview

More details on the metric types and output formats can be found in the OpenMetrics reference.
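
As an illustration of working with the HISTOGRAM type, the sketch below estimates a quantile from cumulative `le` buckets, such as those exposed by feature_server_cache_read_duration_bucket. It uses linear interpolation within a bucket, the same general approach as Prometheus's histogram_quantile. The function name and the bucket values in the comments are hypothetical, and the result is only an approximation bounded by the bucket edges.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, where
    upper_bound is the numeric value of the `le` label and the last bucket
    is expected to be +Inf. This is an approximation, not an exact value.
    """
    buckets = sorted(buckets, key=lambda b: b[0])
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket; the best we can
                # report is the highest finite bucket boundary.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket data (milliseconds): le="1", le="5", le="10", le="+Inf".
example_buckets = [(1.0, 10), (5.0, 60), (10.0, 90), (math.inf, 100)]
median_ms = estimate_quantile(0.5, example_buckets)
```

Most monitoring systems compute this for you; the sketch is only meant to show how the cumulative buckets relate to a latency percentile.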

Metric release stages

The release stage represents the expected stability of the metric. Tecton recommends only relying on metrics marked as GA for production dashboards. The release stage is noted in the table above, as well as in the Help string of the OpenMetrics protocol.

Release stage	Description
Generally Available (GA)	Ready for production use. Schema and definition of the metric will not change.
Preview	Initial release intended for collecting feedback. Schema and definition of the metric may change before moving to GA.
Deprecated	Usage of the metric is discouraged. Will be maintained until the specified end-of-support date.
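
One possible way to follow the recommendation to rely only on GA metrics is to filter scraped sample names against the GA entries in the table above. The sketch below hardcodes those names; the helper names are hypothetical, and the `_sum` and `_count` suffix handling assumes the usual OpenMetrics naming for SUMMARY samples.

```python
# Metric names marked GA in the "Available Metrics" table.
GA_METRICS = frozenset({
    "feature_service_requests_total_rate",
    "feature_service_requests_rate",
    "feature_service_total_latency",
    "feature_service_latency",
    "feature_server_errors_rate",
    "feature_server_utilization",
})

# Assumption: SUMMARY metrics may expose extra samples with these suffixes.
_SUMMARY_SUFFIXES = ("_sum", "_count")


def base_metric_name(sample_name):
    """Strip a trailing summary suffix, if any, from a sample name."""
    for suffix in _SUMMARY_SUFFIXES:
        if sample_name.endswith(suffix):
            return sample_name[: -len(suffix)]
    return sample_name


def is_ga(sample_name):
    """Return True if the sample belongs to a metric marked GA in the table."""
    return sample_name in GA_METRICS or base_metric_name(sample_name) in GA_METRICS


def keep_ga_only(sample_names):
    """Filter an iterable of sample names down to GA metrics."""
    return [name for name in sample_names if is_ga(name)]
```

The hardcoded set needs to be updated by hand whenever a metric changes release stage.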

Metrics API endpoint

The Metrics API endpoint is https://<your-instance>.tecton.ai/api/v1/observability/metrics.

To query metrics using ‘curl’, run:

curl -H "Authorization: Tecton-key $TECTON_API_KEY" https://$INSTANCE.tecton.ai/api/v1/observability/metrics
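
The same request can be sketched in Python using only the standard library. The function names below are hypothetical; the URL and the `Tecton-key` authorization header mirror the curl command above.

```python
import urllib.request


def build_metrics_request(instance, api_key):
    """Build the HTTP request for the Metrics API endpoint of one instance."""
    url = f"https://{instance}.tecton.ai/api/v1/observability/metrics"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Tecton-key {api_key}"},
    )


def fetch_metrics(instance, api_key, timeout=10):
    """Fetch the raw metrics text; raises urllib.error.URLError on failure."""
    request = build_metrics_request(instance, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metrics` with the values of `INSTANCE` and `TECTON_API_KEY` should return the same text that the curl command prints.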

Example output

feature_service_requests_rate{aws_region="us-west-2",feature_service_id="072b546997cb6e586ed460ff0a3743ee",feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="005c5a6f3517e1e2a4ce411372a15d84",feature_service_name="fvfs_2"} 0 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="00c922b96f55a948e1bbfa08fdb3a699",feature_service_name="fvfs_3"} 0 1692222226141

This is a sample output for a gauge metric named feature_service_requests_rate with labels aws_region, feature_service_id and feature_service_name. The second part of each line is the value of the metric, and the last part is the timestamp in milliseconds. For output format details, see the OpenMetrics reference.
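
To make the line layout concrete, here is a minimal parsing sketch for sample lines like the ones above. It is not a complete OpenMetrics parser: it does not unescape label values, and it assumes the optional trailing timestamp is an integer number of milliseconds, as in the example output. The names `parse_sample` and `parse_metrics` are hypothetical.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>\S+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into name, labels, value and timestamp."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    timestamp = match.group("timestamp")
    return {
        "name": match.group("name"),
        "labels": dict(LABEL_RE.findall(match.group("labels") or "")),
        "value": float(match.group("value")),
        "timestamp_ms": int(timestamp) if timestamp is not None else None,
    }


def parse_metrics(text):
    """Parse every sample line, skipping blank lines and '#' comment lines."""
    samples = []
    for line in text.splitlines():
        if line.strip() and not line.startswith("#"):
            samples.append(parse_sample(line))
    return samples


example_line = (
    'feature_service_requests_rate{aws_region="us-west-2",'
    'feature_service_id="072b546997cb6e586ed460ff0a3743ee",'
    'feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141'
)
example_sample = parse_sample(example_line)
```

For production use, an existing OpenMetrics or Prometheus client library is likely a better choice than a hand-written regular expression.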

Example Integrations

The following sections show how to configure common Application Performance Monitoring systems to scrape the Tecton Metrics API.

By default, Tecton recommends scraping the Metrics API at 30-second intervals.

DataDog

A DataDog agent can be easily configured to ingest metrics from Tecton’s Metrics API.

If you’re already using DataDog, there’s a high chance that you have a DataDog agent up and running. In this case, integration will be even easier.

Overall, integration can be done in 3 steps:

Create a Tecton Service account via Tecton CLI

$ tecton service-account create \
      --name "metrics-consumer" \
      --description "Consumer of operational metrics"

Save this API Key - you will not be able to get it again.
API Key:            your-api-key
Service Account ID: your-service-account-id

Save the API key returned by the command. You will need it when configuring the agent.

Install DataDog agent

Note: This step can be skipped if you already have a DataDog agent (≥ 7.32.0) running on a machine that has access to the Tecton API.

The installation procedure depends on the platform. Use the official DataDog documentation for the specific platform.

Configure the agent to ingest Tecton metrics

Edit one of the configuration files that come with the DataDog agent. These files are located in the agent configuration directory. Modify openmetrics.d/conf.yaml by adding the following:

instances:
  - openmetrics_endpoint: https://<your-tecton-url>.tecton.ai/api/v1/observability/metrics
    namespace: tecton  # all exported metrics will have this namespace
    metrics:
      - .+  # store all metrics
    min_collection_interval: 30
    headers:
      Authorization: Tecton-key <REPLACE THIS WITH TOKEN>

SignalFX

Create a Tecton Service account via Tecton CLI

$ tecton service-account create \
      --name "metrics-consumer" \
      --description "Consumer of operational metrics"

Save this API Key - you will not be able to get it again.
API Key:            your-api-key
Service Account ID: your-service-account-id

Save the API key returned by the command. You will need it when configuring the agent.

Deploy the Splunk OpenTelemetry Collector: https://docs.splunk.com/Observability/gdi/opentelemetry/opentelemetry.html

Configure the collector to ingest Tecton metrics

Example configuration:

receivers:
  lightprometheus:
    endpoint: https://<your-tecton-url>.tecton.ai/api/v1/observability/metrics
    headers:
      Authorization: Tecton-key <TOKEN>
    collection_interval: 30s
    resource_attributes:
      service.name:
        enabled: false
      service.instance.id:
        enabled: false

exporters:
  signalfx:
    access_token: <SIGNALFX TOKEN>
    realm: <SIGNALFX REALM>

service:
  pipelines:
    metrics:
      receivers: [lightprometheus]
      exporters: [signalfx]

OpenTelemetry (OTEL) Collector

If your Application Performance Monitoring system doesn’t support OpenMetrics out of the box, you can install the OTEL collector and configure it to export to a cloud-based monitoring system or any self-hosted alternative. The OTEL collector ships with enough out-of-the-box exporters to support almost any monitoring setup.

The following example shows how to configure the OTEL collector for use with the Tecton Metrics API. The config can vary depending on the version of the collector; see the collector's config docs.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 30s
          scheme: https
          metrics_path: /api/v1/observability/metrics
          authorization:
            type: Tecton-key
            credentials: <API key>
          static_configs:
            - targets: [<cluster>.tecton.ai]

exporters:
  datadog:
    api:
      site: <DD_SITE>
      key: <DD_API_KEY>

processors:
  batch:
    send_batch_max_size: 100
    send_batch_size: 10
    timeout: 10s

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [datadog]
