
Cost Monitoring & Alerting

Because Tecton manages compute and storage infrastructure in your account, your organization will be charged for the resources used to process and serve features. This section shares some best practices for keeping infrastructure costs low.

Cost Alerting

To ensure you're always aware of your infrastructure expenses, we recommend setting up billing alerts with your cloud provider, whether it's AWS or Google Cloud.

By configuring billing alerts, you let your cloud provider monitor your monthly spend. If costs exceed the limit you set, you'll receive a warning email.

Setting up Alerts

For AWS users, follow these instructions.

For Google Cloud users, follow these instructions.

Alerts are not sent in real time

AWS and Google Cloud refresh cost summaries every 24 hours, so there will be a delay between the moment infrastructure costs exceed your defined limit and the moment you receive the alert email.

Limiting alerts exclusively to Tecton Infrastructure Costs

If you wish to receive alerts specifically for the infrastructure Tecton manages for you, narrow the billing scope to specific tags. The Monitoring costs using tags section below describes the tags Tecton applies to the cloud infrastructure it manages.
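As an illustration, here is a minimal sketch that uses boto3's AWS Budgets API to create a monthly cost budget scoped to a Tecton tag, with an email notification at 80% of the limit. The account ID, budget amount, tag value, and email address are placeholders; note that AWS Budgets references user-defined tags with a user: prefix in CostFilters.

import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder AWS account ID
    Budget={
        "BudgetName": "tecton-infra-monthly",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # Scope the budget to resources carrying a Tecton-managed tag.
        "CostFilters": {"TagKeyValue": ["user:tecton_deployment$my-deployment"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "ml-oncall@example.com"}
            ],
        }
    ],
)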

Feature Freshness Monitoring

See Freshness Alerts for information about configuring alerts for stale feature view data.

Monitoring costs using tags

Tecton automatically applies tags to compute instances and online store resources based on the relevant feature view. By default, Tecton applies the following tags: tecton_feature_view, tecton_workspace, and tecton_deployment.

If you would like to associate FeatureViews with different cost centers, add those as tags in your FeatureView definition, as sketched below. Tecton passes those tags through to the compute and online store resources it manages for that feature view.
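Here is a minimal sketch of a batch feature view carrying a custom cost-center tag. The source, entity, tag values, and the feature view itself are hypothetical, and exact parameters vary by SDK version:

from datetime import datetime, timedelta

from tecton import BatchSource, Entity, FileConfig, batch_feature_view

# Hypothetical source and entity, defined only to keep the sketch self-contained.
transactions = BatchSource(
    name="transactions",
    batch_config=FileConfig(
        uri="s3://my-bucket/transactions.parquet",
        file_format="parquet",
        timestamp_field="timestamp",
    ),
)
user = Entity(name="user", join_keys=["user_id"])

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2024, 1, 1),
    # Custom tags (e.g. a cost center) are passed through to the
    # compute and online store resources Tecton manages for this view.
    tags={"cost_center": "risk-ml"},
)
def user_transactions(transactions):
    return f"""
        SELECT user_id, amount, timestamp
        FROM {transactions}
    """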

Limiting costs during new feature development

Model training often involves large amounts of historical data to get the best results. However, we rarely get features right the first time, so we need to be careful about how much processing and storage we use while iterating on a feature.

Note that this section focuses on features that materialize data, such as a Batch Feature View. Realtime Feature Views don't incur much infrastructure cost within Tecton.

Begin in a development workspace

The safest way to validate the logic for a new feature is to apply it to your own development workspace. Development workspaces won't run any automatic materialization jobs.

For example, when working on a new feature, you may want to switch to a blank development workspace.

$ tecton workspace create my_new_feature_ws  # creates a development (non-live) workspace and switches to it
$ tecton apply                               # applies your local feature definitions to the selected workspace

Once applied, you can use FeatureView.get_features_in_range(start_time, end_time) to view sample output for the feature on recent dates.

import tecton
from datetime import datetime, timedelta

ws = tecton.get_workspace("my_new_feature_ws")
fv = ws.get_feature_view("user_has_good_credit_sql")

# from_source=True is required because a development workspace has no
# materialized data; features are computed directly from the source.
features = fv.get_features_in_range(
    start_time=datetime.now() - timedelta(days=7),
    end_time=datetime.now(),
    from_source=True,
)

# The result is a TectonDataFrame; convert it to inspect locally.
features.to_pandas().head()

Start with a recent feature_start_time

When you apply a Feature View to a workspace with automatic materialization enabled, Tecton will automatically begin materializing feature data back to the feature_start_time.

It's usually a good idea to begin with a recent start time to avoid processing a lot of historical data before you've validated that this is the right feature. For example, you may start with one week of data to confirm the feature output is correct and to see how long the jobs take. Next, you can backfill two months to evaluate the performance impact on a limited set of data, and finally backfill the full two years once you're confident the feature is valuable.

When you extend the feature start time further back, Tecton intelligently materializes only the new dates. For example, say you begin with feature_start_time=datetime(2021, 1, 1). Tecton will backfill all data from 2021-01-01 up to the current date. If you then change to feature_start_time=datetime(2019, 1, 1) to train a model on the full history, Tecton will only compute the data from 2019-01-01 to 2021-01-01.

Using AWS Cost Explorer, you can see the cost impact of the backfill by looking for resources with the tecton_feature_view tag.
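For example, here is a minimal sketch using boto3's Cost Explorer API to print daily unblended costs for a single feature view's tag. The date range and feature view name are placeholders, and the tecton_feature_view key must be activated as a cost allocation tag in your AWS billing settings before results appear:

import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # End is exclusive
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    # Restrict results to resources Tecton tagged for one feature view.
    Filter={
        "Tags": {
            "Key": "tecton_feature_view",
            "Values": ["user_transactions"],
        }
    },
)

for day in response["ResultsByTime"]:
    print(day["TimePeriod"]["Start"], day["Total"]["UnblendedCost"]["Amount"])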

Keep online=False until you're ready to serve online traffic

In live (production) workspaces, when a feature view is configured with online=True, Tecton will materialize data to the online store for low latency retrieval.

Online stores are optimized for the low-latency access production applications need, but they are an expensive place to store data.

To keep costs low, you should avoid writing data to the online store until you are ready to serve data in production. For example, you may materialize data only offline while training and evaluating a model, then set online=True when you're ready to deploy the model to production.

Changing from online=False to online=True requires reprocessing historical data, but that extra compute cost is typically much lower than the potential cost of backfilling the online store twice.
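Reusing the hypothetical transactions source and user entity from the tags sketch above, the flag flip looks like this (offline and online are the real parameters; the rest is illustrative):

from datetime import datetime, timedelta

from tecton import batch_feature_view

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2024, 1, 1),
    offline=True,   # materialize to the cheaper offline store while iterating
    online=False,   # flip to True only when ready to serve production traffic
)
def user_transactions(transactions):
    return f"""
        SELECT user_id, amount, timestamp
        FROM {transactions}
    """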

Monitor materialization status, especially during backfills

Tecton will automatically retry failed jobs in case of transient issues, such as spot instance loss. However, it's a good idea to keep an eye out for failures that a retry is unlikely to fix, so that you can cancel those jobs before they run again.

By setting your email in a MonitoringConfig, you'll get an email if there are repeated failures. You can also check the Web UI to see how jobs are proceeding.
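A minimal sketch, assuming the SDK's MonitoringConfig object with monitor_freshness, expected_freshness, and alert_email parameters; the email address and freshness threshold are placeholders:

from datetime import timedelta

from tecton import MonitoringConfig

# Repeated materialization failures and freshness violations will
# trigger notifications to this (placeholder) address.
monitoring = MonitoringConfig(
    monitor_freshness=True,
    expected_freshness=timedelta(hours=2),
    alert_email="ml-oncall@example.com",
)

# Attach it to a feature view via its monitoring parameter, e.g.
# @batch_feature_view(..., monitoring=monitoring)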

Note that simultaneously backfilling many feature views to the online store has been known to cause jobs to fail. Although Tecton will retry them automatically, you may want to pause some feature view backfills while others complete.

Limiting costs for existing features

See the Optimize Spark Materialization Clusters guide for how to right-size resources for materialization jobs.

See the Suppress Recreates guide for how to avoid the cost of running repeated backfills while iterating on existing Feature Views.
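As a sketch, and assuming your CLI version supports it, the flag is passed at plan or apply time; verify the exact flag name against your Tecton CLI:

$ tecton apply --suppress-recreates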

Best Practices

Cost Monitoring

  1. Set up cloud provider billing alerts
  2. Configure resource tags for cost tracking
  3. Monitor billable usage logs
  4. Set appropriate freshness thresholds

Alert Configuration

  1. Set realistic freshness thresholds
  2. Configure appropriate notification channels
  3. Define escalation paths
  4. Document alert response procedures

See alerting for more information about configuring alerts.

Regular Review

  1. Analyze usage patterns
  2. Review alert effectiveness
  3. Adjust thresholds as needed
  4. Update notification lists
