If feature processing jobs begin to fail, Tecton may begin to serve stale or inaccurate data. To help keep feature processing jobs healthy, Tecton offers monitoring, alerting, and debugging tools.
For a practical example of debugging a materialization alert, see Example: Debugging Materialization Alerts.
Setting Up Alerts
Tecton can automatically generate materialization health alerts and online store feature freshness alerts that are sent to a specified email address. See Types of Alerts for more details.
It is highly recommended that an alert email be set for every FeatureView that is consumed in production.
To configure alerts, specify `monitoring` when defining a FeatureView in your Feature Repository. `MonitoringConfig` objects configure alert thresholds and feature freshness expectations.
```python
@batch_feature_view(
    ...
    monitoring=MonitoringConfig(
        monitor_freshness=True,
        expected_feature_freshness="2w",
        alert_email="email@example.com",
    ),
)
def my_feature_view(inputs):
    ...
```
- `monitor_freshness`: Set this to `False` to suppress online store freshness-related alerts.
- `expected_feature_freshness`: Set this value to decrease the sensitivity of freshness alerts. See Default Expected Feature Freshness for details about the default value used when this field is unspecified.
- `alert_email`: The recipient of alerts.
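For example, a feature view that is not consumed online might keep job-failure alerts but suppress freshness alerts. This fragment is illustrative only; the email address is a placeholder:

```python
# Illustrative MonitoringConfig fragment: freshness alerts are
# suppressed, but materialization health alerts (a separate alert
# type) are still delivered to the configured address.
monitoring = MonitoringConfig(
    monitor_freshness=False,
    alert_email="email@example.com",
)
```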
How Failed Jobs Are Retried
When materialization jobs fail, Tecton will automatically retry the jobs after some time:

- If the failure was due to an AWS spot instance being reclaimed, the job will be retried immediately.
- Otherwise, the job will be retried after 5 minutes, with exponential backoff for each successive failure.
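The retry policy above can be sketched as a small function. The doubling factor is an illustrative assumption for "exponential backoff"; Tecton's exact backoff parameters may differ:

```python
from datetime import timedelta


def retry_delay(attempt: int, spot_reclaimed: bool) -> timedelta:
    """Illustrative sketch of the retry policy described above.

    `attempt` is 1-based: the first retry after a non-spot failure
    waits 5 minutes, and each successive failure doubles the wait.
    The doubling factor is an assumption, not Tecton's documented
    internals.
    """
    if spot_reclaimed:
        # Spot reclamation: retry immediately.
        return timedelta(0)
    return timedelta(minutes=5) * (2 ** (attempt - 1))
```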
Jobs that will be retried are shown in the Web UI as `RETRYING (in X minutes)`.
If a job fails continuously, auto-retries will stop, and this state will be shown in the Web UI:
Manually Retrying a Job that had Failed Continuously
When you are ready to retry the failed job (e.g., after fixing the cause of the failures), you can trigger a manual retry by clicking the following link in the Additional Info column of the failed job:
A notification will appear immediately when the retry has been successfully scheduled:
If scheduling the retry fails, the notification will include a job ID that you can share with support:
Tecton provides tools to monitor and debug production Feature Views from each of its interfaces: the Web UI, the SDK, and the CLI.
Web UI: Health Overview
The easiest way to check the health of a materialized FeatureView is through the Web UI. Navigate to the FeatureView in question and switch to the "Materialization" tab to see Feature View materialization diagnostics at a glance.
SDK: FeatureView Materialization Status
The Tecton SDK provides the `FeatureView.materialization_status()` method to display details about failed materialization attempts.
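Fetching the status from a notebook might look like the sketch below. The workspace and feature view names are hypothetical placeholders, and running it requires a configured connection to your Tecton cluster:

```python
import tecton

# "prod" and "my_feature_view" are placeholder names.
ws = tecton.get_workspace("prod")
fv = ws.get_feature_view("my_feature_view")

# Displays a table of materialization attempts, including job links.
fv.materialization_status()
```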
Materialization Job Links
In the SDK and Web UI, Tecton provides a link to the auto-generated job that was used to compute feature values. This job link can be used to view the underlying error that caused a materialization job to fail.
To view this job, click on the Job status in the materialization table in the Web UI.
This link is also available in the SDK's `materialization_status()` method and in the `tecton materialization-status` command in the CLI.
This link opens a page in your Spark processing engine where you can see the job failure. The example below shows a spot failure in Databricks:
CLI: Cluster Overview and Status
Tecton provides the ability to view the status of all Feature Views in a cluster using the `tecton freshness` CLI command.
```
$ tecton freshness
Feature View                         Stale?  Freshness  Expected Freshness  Created At
=================================================================================================
partner_ctr_performance:14d          Y       2wk 1d     2d                  12/02/20 10:52
ad_group_ctr_performance             N       1h 1m      2h                  11/28/20 19:50
user_ad_impression_counts            N       1m 35s     2h                  10/01/20 2:16
content_keyword_ctr_performance:v2   N       1m 36s     2h                  09/04/20 22:22
content_keyword_ctr_performance      N       1m 37s     2h                  08/26/20 12:52
user_total_ad_frequency_counts       N       1m 38s     2h                  08/26/20 12:52
```
You can also run `tecton materialization-status $FV_NAME` to see the materialization status of a specific FeatureView.
```
$ tecton materialization-status my_feature_view
All the displayed times are in UTC time zone
TYPE   WINDOW_START_TIME    WINDOW_END_TIME      STATUS   ATTEMPT_NUMBER  JOB_CREATED_AT       JOB_LOGS
================================================================================================================
BATCH  2020-12-15 00:00:00  2020-12-22 00:00:00  SUCCESS  1               2020-12-22 00:00:27  https://...
BATCH  2020-12-14 00:00:00  2020-12-21 00:00:00  SUCCESS  1               2020-12-21 00:00:14  https://...
BATCH  2020-12-13 00:00:00  2020-12-20 00:00:00  SUCCESS  1               2020-12-20 00:00:13  https://...
BATCH  2020-12-12 00:00:00  2020-12-19 00:00:00  SUCCESS  1               2020-12-19 00:00:10  https://...
BATCH  2020-12-11 00:00:00  2020-12-18 00:00:00  SUCCESS  1               2020-12-18 00:00:06  https://...
```
Default Expected Feature Freshness
By default, a Feature View's freshness is expected to be less than twice its materialization schedule. Alerts will fire once this threshold, plus a small grace period, is crossed. For streaming Feature Views, freshness can be configured as low as 30 minutes. The grace period's duration depends on the FeatureView's materialization schedule:

| Materialization Schedule | Grace Period |
|--------------------------|--------------|
| <= 10 minutes            | 30 minutes   |
| <= 30 minutes            | 90 minutes   |
| <= 1 hour                | 2 hours      |
| <= 4 hours               | 4 hours      |
| <= 24 hours              | 12 hours     |
| > 24 hours               | 24 hours     |
The table below has examples of materialization schedules mapped to default alert thresholds:

| Schedule   | Default Alert Threshold |
|------------|-------------------------|
| 5 minutes  | 40 minutes              |
| 30 minutes | 2.5 hours               |
| 1 hour     | 4 hours                 |
| 4 hours    | 12 hours                |
| 24 hours   | 60 hours                |
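As a sanity check, the default alert threshold (twice the schedule plus the grace period) can be computed with a short script. The tier boundaries below are transcribed from the grace-period table above, not taken from Tecton's source:

```python
from datetime import timedelta

# Grace periods keyed by the upper bound of each materialization
# schedule tier (transcribed from the table above).
GRACE_TIERS = [
    (timedelta(minutes=10), timedelta(minutes=30)),
    (timedelta(minutes=30), timedelta(minutes=90)),
    (timedelta(hours=1), timedelta(hours=2)),
    (timedelta(hours=4), timedelta(hours=4)),
    (timedelta(hours=24), timedelta(hours=12)),
]
FALLBACK_GRACE = timedelta(hours=24)  # schedules over 24 hours


def default_alert_threshold(schedule: timedelta) -> timedelta:
    """Default threshold = 2x the materialization schedule + grace period."""
    grace = next(
        (g for bound, g in GRACE_TIERS if schedule <= bound),
        FALLBACK_GRACE,
    )
    return 2 * schedule + grace
```

For a 5-minute schedule this yields 10 + 30 = 40 minutes, matching the first row of the examples table.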