Metrics API
The Tecton Metrics API provides performance monitoring metrics for the Tecton Feature Platform using the OpenMetrics standard. The Metrics API is compatible with common APM systems like DataDog, SignalFX, and New Relic.
Available Metrics
The following metrics are available through the Metrics API for building monitoring dashboards and alerts. See the OpenMetrics reference for details about metric types.
General Availability
| Name & Description | Type | Unit | Labels | Release stage |
|---|---|---|---|---|
| feature_service_requests_total_rate Total count of feature service requests over five minutes | GAUGE | requests per second | aws_region | GA |
| feature_service_requests_rate GetFeatures request rate by feature service over 5 minutes. GetFeaturesBatch fans out requests into multiple GetFeatures requests increasing the rate proportional to the batch size. | GAUGE | requests per second | aws_region feature_service_id feature_service_name | GA |
| feature_service_latency Feature serving GetFeatures and GetFeaturesBatch latency by service | SUMMARY | second | feature_service_id feature_service_name | GA |
| feature_server_errors_rate Feature serving errors rate by HTTP status | GAUGE | percent | status | GA |
| feature_server_utilization Maximum utilization percentage among all the feature server instances | GAUGE | percent | aws_region | GA |
| feature_server_average_utilization Average utilization percentage among all the feature server instances | GAUGE | percent | aws_region | GA |
| feature_server_minimum_utilization Minimum utilization percentage among all the feature server instances | GAUGE | percent | aws_region | GA |
| spark_stream_max_processed_event_age Maximum age of events processed for Spark streaming | GAUGE | second | workspace feature_view_name feature_view_id | GA |
| spark_stream_min_processed_event_age Minimum age of events processed for Spark streaming | GAUGE | second | workspace feature_view_name feature_view_id | GA |
| spark_stream_average_processed_event_age Average age of events processed for Spark streaming | GAUGE | second | workspace feature_view_name feature_view_id | GA |
| spark_stream_input_rate Stream request input rate for Spark streaming | GAUGE | requests per second | workspace feature_view_name feature_view_id | GA |
| spark_stream_served_feature_age Served feature age for Spark streaming | GAUGE | second | workspace feature_view_name feature_view_id | GA |
| spark_stream_online_store_write_rate Online store write request rate for Spark streaming | GAUGE | rows per second | workspace feature_view_name feature_view_id | GA |
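As a hypothetical illustration of how one of these gauges could drive an alert, the sketch below flags regions whose `feature_server_utilization` value exceeds a threshold. The tuple input format and the 80% cutoff are assumptions for the example, not part of the Metrics API.

```python
def utilization_alerts(samples, threshold_percent=80.0):
    """Return the aws_region labels whose feature_server_utilization gauge
    exceeds the threshold. `samples` holds (metric_name, labels_dict, value)
    tuples parsed from the Metrics API output; 80% is an illustrative cutoff."""
    return sorted(
        labels.get("aws_region", "unknown")
        for name, labels, value in samples
        if name == "feature_server_utilization" and value > threshold_percent
    )
```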
Preview
| Name & Description | Type | Unit | Labels | Release stage |
|---|---|---|---|---|
| feature_service_requests_total_rate_per_server_group Total count of feature service requests per server group over five minutes | GAUGE | requests per second | aws_region server_group | Preview |
| feature_service_requests_rate_per_server_group GetFeatures request rate by feature service and server group over 5 minutes. GetFeaturesBatch fans out requests into multiple GetFeatures requests increasing the rate proportional to the batch size. | GAUGE | requests per second | aws_region feature_service_id feature_service_name server_group | Preview |
| feature_server_errors_rate_grpc Feature serving errors rate by GRPC code | GAUGE | percent | status | Preview |
| feature_server_errors_rate_grpc_per_server_group Feature serving errors rate by GRPC code per server group | GAUGE | percent | status server_group | Preview |
| feature_server_average_utilization_per_server_group Average utilization percentage among all the feature server instances per server group | GAUGE | percent | aws_region server_group | Preview |
| feature_server_max_utilization_per_server_group Maximum utilization percentage among all the feature server instances per server group | GAUGE | percent | aws_region server_group | Preview |
| feature_server_minimum_utilization_per_server_group Minimum utilization percentage among all the feature server instances per server group | GAUGE | percent | aws_region server_group | Preview |
| feature_server_group_utilization_percentiles Utilization percentiles (p50, p90, p95, p99, p100) across Feature Server instances in a server group, used as target metrics for autoscaling | GAUGE | percent | aws_region server_group percentile | Preview |
| feature_server_scaling_requests Count of feature server scaling request responses per GRPC response code | GAUGE | request count | aws_region code | Preview |
| feature_server_autoscaler_desired_replica_count Desired replica count of feature server set by autoscaling policy (empty if autoscaling is disabled) | GAUGE | replica count | aws_region | Preview |
| feature_server_autoscaler_current_replica_count Current replica count of feature server (empty if autoscaling is disabled) | GAUGE | replica count | aws_region | Preview |
| feature_server_autoscaler_max_replica_count Maximum replica count of feature server (empty if autoscaling is disabled) | GAUGE | replica count | aws_region | Preview |
| online_store_p99_latency_seconds P99 latency of online store read latencies per feature view and region | GAUGE | second | aws_region feature_view_id | Preview |
| spark_stream_processing_rate Stream request processing rate for Spark streaming | GAUGE | requests per second | workspace feature_view_name feature_view_id | Preview |
| stream_ingestapi_request_rate Request rate for Stream Ingest API | GAUGE | requests per second | aws_region | Preview |
| stream_ingestapi_request_processing_latency Request processing latency for Stream Ingest API | SUMMARY | second | aws_region | Preview |
| stream_ingestapi_request_processing_error_rate Request handling error rate for Stream Ingest API | GAUGE | requests per second | aws_region error_code (4xx or 5xx) | Preview |
| stream_ingestapi_online_store_write_rate Rows write rate to the online store for Stream Ingest API | GAUGE | rows per second | workspace feature_view_name feature_view_id | Preview |
| stream_ingestapi_offline_store_write_rate Rows write rate to the offline store for Stream Ingest API | GAUGE | rows per second | workspace feature_view_name feature_view_id | Preview |
| feature_server_cache_request_rate Number of requests sent to the cache by operations per second. One FV read/write is a request | GAUGE | requests per second | aws_region operation feature_service_name | Preview |
| feature_server_cache_hit_rate Percentage of requests to the cache that have a value in the cache | GAUGE | percent | feature_service_name feature_view | Preview |
| feature_server_cache_memory_used_total Amount of memory in bytes that is being utilized by the cache by primary / replica | GAUGE | bytes | aws_region shard_type | Preview |
| feature_server_cache_memory_provisioned_total Amount of memory in bytes that can possibly be utilized by the cache by primary / replica | GAUGE | bytes | aws_region | Preview |
| feature_server_cache_primary_nodes_count Number of nodes that are allocated as a primary shard in the cache instance | GAUGE | primary shard count | aws_region | Preview |
| feature_server_cache_replica_nodes_count Number of replica nodes that are allocated as a replica in the cache instance | GAUGE | replica shard count | aws_region | Preview |
| feature_server_cache_engine_utilization_average_percent Percent of the redis engine CPU thread that is currently being used | GAUGE | percent | aws_region | Preview |
Metric release stages
The release stage represents the expected stability of the metric. We recommend relying only on General Availability metrics for production dashboards. The release stage for each metric is noted in the table above, as well as in the HELP string of the OpenMetrics output.
| Release stage | Description |
|---|---|
| General Availability (GA) | Ready for production use. Schema and definition of the metric will not change. |
| Preview | Intended for collecting feedback. Schema and definition subject to change before moving to GA. |
| Deprecated | Usage of the metric is discouraged. Will be maintained until the specified end-of-support date. |
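Since the release stage is noted in the HELP string, a scraper can filter metrics by stage without a hard-coded allowlist. The sketch below is a minimal example; it assumes the stage label (e.g. `GA`, `Preview`) appears verbatim in the HELP text, so adjust the match to the exact wording your instance emits.

```python
def metrics_by_stage(exposition_text, stage="GA"):
    """Collect metric names whose '# HELP' line mentions the given release stage.

    Assumes the stage string appears verbatim in the HELP text; adjust the
    match to the actual wording emitted by your instance.
    """
    names = set()
    for line in exposition_text.splitlines():
        parts = line.split(maxsplit=3)
        # A HELP line looks like: "# HELP <metric_name> <help text>"
        if parts[:2] == ["#", "HELP"] and len(parts) == 4 and stage in parts[3]:
            names.add(parts[2])
    return names
```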
Metrics API endpoint
The Metrics API endpoint is `https://<your-instance>.tecton.ai/api/v1/observability/metrics`.
Here's an example query using `curl`:

```
$ curl -H "Authorization: Tecton-key $TECTON_API_KEY" \
    https://$INSTANCE.tecton.ai/api/v1/observability/metrics
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="072b546997cb6e586ed460ff0a3743ee",feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="005c5a6f3517e1e2a4ce411372a15d84",feature_service_name="fvfs_2"} 0 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="00c922b96f55a948e1bbfa08fdb3a699",feature_service_name="fvfs_3"} 0 1692222226141
```
This is sample output for a gauge metric named `feature_service_requests_rate` with labels `aws_region`, `feature_service_id`, and `feature_service_name`. The second part of each line is the value of the metric, and the last part is the timestamp in milliseconds. For output format details, see the OpenMetrics reference.
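To spot-check the endpoint programmatically, here is a minimal standard-library Python sketch that scrapes the endpoint and parses sample lines like those above into structured tuples. The label parsing is deliberately naive (it assumes no commas or escaped quotes inside label values), so treat it as a starting point rather than a full OpenMetrics parser.

```python
import re
import urllib.request

def parse_metric_line(line):
    """Parse one OpenMetrics sample line into (name, labels, value, timestamp_ms).

    Returns None for comment lines (# HELP, # TYPE) and anything else that
    does not look like a sample. Assumes no commas or escaped quotes inside
    label values.
    """
    match = re.match(r'^(\w+)\{([^}]*)\}\s+(\S+)\s+(\d+)$', line)
    if match is None:
        return None
    name, raw_labels, value, ts = match.groups()
    labels = {}
    for pair in (raw_labels.split(",") if raw_labels else []):
        key, _, val = pair.partition("=")
        labels[key] = val.strip('"')
    return name, labels, float(value), int(ts)

def fetch_metrics(instance, api_key):
    """Scrape the Metrics API and return the parsed samples (comments skipped)."""
    url = f"https://{instance}.tecton.ai/api/v1/observability/metrics"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Tecton-key {api_key}"}
    )
    with urllib.request.urlopen(request) as response:
        body = response.read().decode("utf-8")
    return [s for s in map(parse_metric_line, body.splitlines()) if s is not None]
```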
Example Integrations
The following sections show how to configure common observability platforms to scrape the Tecton Metrics API. An interval of 30 seconds is recommended for scraping the Metrics API.
You'll need an API key associated with a service account to use the Metrics API. Create a service account using the CLI:
```
$ tecton service-account create \
    --name "metrics-api-consumer" \
    --description "Metrics API consumer for operational monitoring"
Save this API Key - you will not be able to get it again.
API Key: your-api-key
Service Account ID: your-service-account-id
```
Make a note of the API key returned by the CLI.
DataDog
A DataDog agent can be configured to ingest metrics from Tecton's Metrics API.
- Install the DataDog agent.

  Note: This step can be skipped if you already have a DataDog agent (≥ 7.32.0) running on one of your machines and this machine has access to Tecton endpoints.

  The installation procedure depends on the platform. Use the official DataDog documentation for the specific platform.

- Edit the agent configuration found in the agent configuration directory. Modify `openmetrics.d/conf.yaml` by adding the following:

  ```yaml
  instances:
    - openmetrics_endpoint: 'https://<your-instance>.tecton.ai/api/v1/observability/metrics'
      namespace: tecton
      metrics:
        - .+
      min_collection_interval: 30
      headers:
        Authorization: Tecton-key <TECTON API TOKEN>
  ```
SignalFX
- Deploy the Splunk OpenTelemetry connector.

- Configure the collector to ingest Tecton metrics. For example:

  ```yaml
  receivers:
    lightprometheus:
      endpoint: https://<your-instance>.tecton.ai/api/v1/observability/metrics
      headers:
        Authorization: Tecton-key <TECTON API TOKEN>
      collection_interval: 30s
      resource_attributes:
        service.name:
          enabled: false
        service.instance.id:
          enabled: false
  exporters:
    signalfx:
      access_token: <SIGNALFX TOKEN>
      realm: <SIGNALFX REALM>
  service:
    pipelines:
      metrics:
        receivers: [lightprometheus]
        exporters: [signalfx]
  ```
OpenTelemetry (OTEL) Collector
If your observability system doesn't support OpenMetrics out of the box, you can install the OTEL Collector and configure it to export to a cloud-based monitoring system or any self-hosted alternative. The OTEL Collector ships with a range of exporters that integrate with nearly any monitoring setup.
The following example configures the OTEL collector to use the Tecton Metrics API (the configuration can vary depending on the version of the collector).
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 30s
          scheme: https
          metrics_path: /api/v1/observability/metrics
          authorization:
            type: Tecton-key
            credentials: <TECTON API KEY>
          static_configs:
            - targets: [<your-instance>.tecton.ai]
exporters:
  datadog:
    api:
      site: <DD_SITE>
      key: <DD_API_KEY>
processors:
  batch:
    send_batch_max_size: 100
    send_batch_size: 10
    timeout: 10s
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [datadog]
```