Metrics API
Metrics for monitoring the performance of the Tecton Feature Platform are available through the Tecton Metrics API. The Tecton Metrics API follows the OpenMetrics standard for metrics collection, which is supported by common Application Performance Monitoring systems such as DataDog, SignalFX and New Relic.
Available Metrics
The following metrics are currently available through the Metrics API. Leverage these metrics to build monitoring dashboards and alerts with your Application Performance Monitoring system.
Name | Type | Unit | Description | Labels | Release stage |
---|---|---|---|---|---|
feature_service_requests_total_rate | GAUGE | requests per second | Total count of feature service requests over five minutes | aws_region | GA |
feature_service_requests_rate | GAUGE | requests per second | Count of GetFeatures and GetFeaturesBatch by feature service over five minutes. GetFeaturesBatch calls are translated (batch size directly corresponds to a proportional increase in the rate) into actual GetFeatures calls to calculate this value | aws_region, feature_service_id, feature_service_name | GA |
feature_service_total_latency | SUMMARY | second | Feature serving GetFeatures and GetFeaturesBatch latency | | GA |
feature_service_latency | SUMMARY | second | Feature serving GetFeatures and GetFeaturesBatch latency by service | feature_service_id, feature_service_name | GA |
feature_server_errors_rate | GAUGE | percent | Feature serving errors rate by HTTP status | status | GA |
feature_server_utilization | GAUGE | percent | Maximum utilization percentage among all the feature server instances | aws_region | GA |
feature_service_requests_total_rate_per_server_group | GAUGE | requests per second | Total count of feature service requests per server group over five minutes | aws_region, server_group_name | Preview |
feature_service_requests_rate_per_server_group | GAUGE | requests per second | Count of GetFeatures and GetFeaturesBatch by feature service per server group over five minutes. GetFeaturesBatch calls are translated (batch size directly corresponds to a proportional increase in the rate) into actual GetFeatures calls to calculate this value | aws_region, feature_service_id, feature_service_name, server_group_name | Preview |
feature_server_errors_rate_grpc | GAUGE | percent | Feature serving errors rate by GRPC code | status | Preview |
feature_server_errors_rate_grpc_per_server_group | GAUGE | percent | Feature serving errors rate by GRPC code per server group | status, server_group_name | Preview |
feature_server_average_utilization | GAUGE | percent | Average utilization percentage among all the feature server instances | aws_region | Preview |
feature_server_minimum_utilization | GAUGE | percent | Minimum utilization percentage among all the feature server instances | aws_region | Preview |
feature_server_average_utilization_per_server_group | GAUGE | percent | Average utilization percentage among all the feature server instances per server group | aws_region, server_group_name | Preview |
feature_server_max_utilization_per_server_group | GAUGE | percent | Maximum utilization percentage among all the feature server instances per server group | aws_region, server_group_name | Preview |
feature_server_minimum_utilization_per_server_group | GAUGE | percent | Minimum utilization percentage among all the feature server instances per server group | aws_region, server_group_name | Preview |
spark_stream_max_processed_event_age | GAUGE | second | The maximum age of events processed for Spark streaming | workspace, feature_view_name, feature_view_id | Preview |
spark_stream_average_processed_event_age | GAUGE | second | The average age of events processed for Spark streaming | workspace, feature_view_name, feature_view_id | Preview |
spark_stream_input_rate | GAUGE | requests per second | The stream request input rate for Spark streaming | workspace, feature_view_name, feature_view_id | Preview |
spark_stream_processing_rate | GAUGE | requests per second | The stream request processing rate for Spark streaming | workspace, feature_view_name, feature_view_id | Preview |
spark_stream_served_feature_age | GAUGE | second | The served feature age for Spark streaming | workspace, feature_view_name, feature_view_id | Preview |
spark_stream_online_store_write_rate | GAUGE | rows per second | The online store write request rate for Spark streaming | workspace, feature_view_name, feature_view_id | Preview |
stream_ingestapi_request_rate | GAUGE | requests per second | The request rate for Stream Ingest API | aws_region | Preview |
stream_ingestapi_request_processing_latency | SUMMARY | second | The request processing latency for Stream Ingest API | aws_region | Preview |
stream_ingestapi_request_processing_error_rate | GAUGE | requests per second | The request handling error rate for Stream Ingest API | aws_region, error_code (4xx or 5xx) | Preview |
stream_ingestapi_online_store_write_rate | GAUGE | rows per second | The rows write rate to the online store for Stream Ingest API | workspace, feature_view_name, feature_view_id | Preview |
stream_ingestapi_offline_store_write_rate | GAUGE | rows per second | The rows write rate to the offline store for Stream Ingest API | workspace, feature_view_name, feature_view_id | Preview |
feature_server_scaling_requests | GAUGE | requests count per response code | The count of feature server scaling request responses per GRPC response code | aws_region, code (OK or PERMISSION_DENIED) | Preview |
feature_server_autoscaler_desired_replica_count | GAUGE | the desired feature server replica count | The desired replica count of feature server set by autoscaling policy (empty if autoscaling is disabled) | aws_region | Preview |
feature_server_autoscaler_current_replica_count | GAUGE | the current feature server replica count | The current replica count of feature server (empty if autoscaling is disabled) | aws_region | Preview |
feature_server_autoscaler_max_replica_count | GAUGE | the maximum feature server replica count | The maximum replica count of feature server (empty if autoscaling is disabled) | aws_region | Preview |
More details on the metric types and output formats can be found in the OpenMetrics reference.
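The batch translation described for feature_service_requests_rate can be illustrated with a small calculation. This is only a sketch under stated assumptions: the helper name, the exact 300-second window, and the simple "sum divided by window" formula are illustrative and are not confirmed by Tecton. The table states only that the batch size increases the rate proportionally.

```python
WINDOW_SECONDS = 5 * 60  # assumption: "five minutes" treated as exactly 300 seconds


def equivalent_request_rate(get_features_calls, batch_sizes, window_seconds=WINDOW_SECONDS):
    """Return requests per second, counting each GetFeaturesBatch call as
    `batch size` GetFeatures calls (hypothetical helper, not a Tecton API)."""
    equivalent_calls = get_features_calls + sum(batch_sizes)
    return equivalent_calls / window_seconds


# 120 single calls plus three batch calls of sizes 10, 20 and 50
# count as 120 + 80 = 200 equivalent GetFeatures calls in the window.
rate = equivalent_request_rate(120, [10, 20, 50])
print(round(rate, 4))
```

Under these assumptions, a single batch call of size 50 moves the reported rate as much as 50 individual GetFeatures calls, which is worth keeping in mind when setting alert thresholds on this metric.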
Metric release stages
The release stage represents the expected stability of the metric. Tecton recommends only relying on metrics marked as GA for production dashboards. The release stage is noted in the table above, as well as in the Help string of the OpenMetrics protocol.
Release stage | Description |
---|---|
Generally Available (GA) | Ready for production use. Schema and definition of the metric will not change. |
Preview | Initial release intended for collecting feedback. Schema and definition of the metric may change before moving to GA. |
Deprecated | Usage of the metric is discouraged. Will be maintained until the specified end-of-support date. |
Metrics API endpoint
The Metrics API endpoint is https://<your-instance>.tecton.ai/api/v1/observability/metrics.
To query metrics using ‘curl’, run:
curl -H "Authorization: Tecton-key $TECTON_API_KEY" https://$INSTANCE.tecton.ai/api/v1/observability/metrics
Example output
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="072b546997cb6e586ed460ff0a3743ee",feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="005c5a6f3517e1e2a4ce411372a15d84",feature_service_name="fvfs_2"} 0 1692222226141
feature_service_requests_rate{aws_region="us-west-2",feature_service_id="00c922b96f55a948e1bbfa08fdb3a699",feature_service_name="fvfs_3"} 0 1692222226141
This is sample output for a gauge metric named feature_service_requests_rate with the labels aws_region, feature_service_id and feature_service_name. The second field of each line is the value of the metric, and the last field is the timestamp in milliseconds. For output format details, see the OpenMetrics reference.
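A sample line like the ones shown above can be split into its name, labels, value and timestamp with a few lines of Python. The following is a minimal sketch, not a full OpenMetrics parser: the names Sample, parse_sample and parse_samples are hypothetical, escaped characters inside label values are not unescaped, and the timestamp is assumed to be an integer number of milliseconds as described above.

```python
import re
from typing import NamedTuple, Optional

# name, optional {labels}, value, optional timestamp
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>\S+))?$'
)
# key="value" pairs inside the braces
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


class Sample(NamedTuple):
    name: str
    labels: dict
    value: float
    timestamp_ms: Optional[int]


def parse_sample(line):
    """Parse one sample line; raise ValueError if it does not look like one."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    timestamp = match.group("timestamp")
    return Sample(
        name=match.group("name"),
        labels=labels,
        value=float(match.group("value")),
        timestamp_ms=int(timestamp) if timestamp is not None else None,
    )


def parse_samples(text):
    """Parse every sample line in a response body, skipping blank and '#' lines."""
    return [
        parse_sample(line)
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


line = (
    'feature_service_requests_rate{aws_region="us-west-2",'
    'feature_service_id="072b546997cb6e586ed460ff0a3743ee",'
    'feature_service_name="fvfs_1"} 0.003703703703703704 1692222226141'
)
sample = parse_sample(line)
print(sample.name, sample.labels["feature_service_name"], sample.value, sample.timestamp_ms)
```

For production use, an existing OpenMetrics or Prometheus text-format parser is likely a better choice than a hand-written regular expression.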
Example Integrations
The following sections show how to configure common Application Performance Monitoring systems to scrape the Tecton Metrics API.
By default, Tecton recommends scraping the Metrics API at 30-second intervals.
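If no monitoring agent is available, the 30-second scrape can be approximated with a small polling loop. The sketch below is an illustration only: build_request, fetch_metrics and scrape_loop are hypothetical helper names, the URL pattern and Authorization header are taken from the curl example earlier on this page, and error handling and retries are omitted.

```python
import time
import urllib.request

SCRAPE_INTERVAL_SECONDS = 30  # the default interval recommended on this page


def build_request(instance, api_key):
    """Build the same request as the curl example (instance is the <your-instance> part)."""
    url = f"https://{instance}.tecton.ai/api/v1/observability/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Tecton-key {api_key}"})


def fetch_metrics(request, timeout=10):
    """Perform the HTTP GET and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def scrape_loop(fetch, handle, interval=SCRAPE_INTERVAL_SECONDS,
                sleep=time.sleep, clock=time.monotonic, max_scrapes=None):
    """Call fetch() roughly every `interval` seconds and pass the text to handle().

    Runs until max_scrapes is reached, or forever when max_scrapes is None.
    """
    scrapes = 0
    while max_scrapes is None or scrapes < max_scrapes:
        started = clock()
        handle(fetch())
        scrapes += 1
        # Subtract the time spent scraping so cycles stay about `interval` apart.
        sleep(max(0.0, interval - (clock() - started)))
    return scrapes


# Example wiring (commented out because it needs a real instance and API key):
# request = build_request("my-instance", "my-api-key")
# scrape_loop(lambda: fetch_metrics(request), print)
```

Passing fetch, sleep and clock as parameters keeps the loop testable without network access or real waiting.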
DataDog
A DataDog agent can be configured to ingest metrics from Tecton’s Metrics API.
If you already use DataDog, you most likely have a DataDog agent up and running, which makes the integration simpler.
The integration takes three steps:
- Create a Tecton Service account via Tecton CLI
  $ tecton service-account create \
      --name "metrics-consumer" \
      --description "Consumer of operational metrics"
  Save this API Key - you will not be able to get it again.
  API Key: your-api-key
  Service Account ID: your-service-account-id
  Save the API key returned by the command. You will need it when configuring the agent.
- Install DataDog agent
  Note: This step can be skipped if you already have a DataDog agent (≥ 7.32.0) running on a machine that has access to the Tecton API.
  The installation procedure depends on the platform. Use the official DataDog documentation for the specific platform.
- Configure the agent to ingest Tecton metrics
  Edit one of the configuration files that come with the Datadog Agent. These files can be found in the agent configuration directory. Modify openmetrics.d/conf.yaml by adding the following:
  instances:
    - openmetrics_endpoint: https://<your-tecton-url>.tecton.ai/api/v1/observability/metrics
      namespace: tecton # all exported metrics will have this namespace
      metrics:
        - .+ # store all metrics
      min_collection_interval: 30
      headers:
        Authorization: Tecton-key <REPLACE THIS WITH TOKEN>
SignalFX
- Create a Tecton Service account via Tecton CLI
  $ tecton service-account create \
      --name "metrics-consumer" \
      --description "Consumer of operational metrics"
  Save this API Key - you will not be able to get it again.
  API Key: your-api-key
  Service Account ID: your-service-account-id
  Save the API key returned by the command. You will need it when configuring the agent.
- Deploy the Splunk OpenTelemetry Collector: https://docs.splunk.com/Observability/gdi/opentelemetry/opentelemetry.html
- Configure the collector to ingest Tecton metrics
  Example configuration:
  receivers:
    lightprometheus:
      endpoint: https://<your-tecton-url>.tecton.ai/api/v1/observability/metrics
      headers:
        Authorization: Tecton-key <TOKEN>
      collection_interval: 30s
      resource_attributes:
        service.name:
          enabled: false
        service.instance.id:
          enabled: false
  exporters:
    signalfx:
      access_token: <SIGNALFX TOKEN>
      realm: <SIGNALFX REALM>
  service:
    pipelines:
      metrics:
        receivers: [lightprometheus]
        exporters: [signalfx]
OpenTelemetry (OTEL) Collector
If your Application Performance Monitoring system doesn’t support OpenMetrics out of the box, you can install the OTEL collector and configure it to export to a cloud-based monitoring system or a self-hosted alternative. The OTEL collector ships with out-of-the-box exporters for most common monitoring setups.
The following example shows how to configure the OTEL collector for use with the Tecton Metrics API (the configuration can vary depending on the version of the collector; see the config docs):
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 30s
          scheme: https
          metrics_path: /api/v1/observability/metrics
          authorization:
            type: Tecton-key
            credentials: <API key>
          static_configs:
            - targets: [<cluster>.tecton.ai]
exporters:
  datadog:
    api:
      site: <DD_SITE>
      key: <DD_API_KEY>
processors:
  batch:
    send_batch_max_size: 100
    send_batch_size: 10
    timeout: 10s
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [datadog]