Materialize Features
Materialization is an essential part of Tecton's operational ML feature lifecycle management. It refers to the process of precomputing feature data using a feature pipeline and then publishing the results to the Online Feature Store, the Offline Feature Store, or both.
The main objective of materialization is to enable quick feature retrieval during training and inference, thereby reducing latencies and improving the efficiency of machine learning applications.
Types of Materialization​
Tecton handles backfill and steady-state materialization for batch and stream features based on your Feature View configuration.
Steady-state Materialization​
Steady-state Materialization refers to materialization being performed on new data arriving in real-time. Steady-state Materialization continuously occurs on all Feature Views where Materialization is enabled.
When a Feature View has materialization enabled, Tecton schedules steady-state materialization jobs on an ongoing basis to keep feature values fresh. The frequency of steady-state materialization is controlled by the batch_schedule parameter. If you use Delta for the offline store, Tecton also runs periodic background maintenance tasks on a 7-day schedule. These tasks perform optimize and vacuum operations to improve performance and file management on your Delta tables.
Backfill materialization​
Backfill refers to any materialization operations performed on data in the past. There are two Backfill operations.
The initial materialization of a Feature View is referred to as a bootstrap backfill. During a bootstrap materialization, existing raw data is processed into feature values.
When materialization is first enabled for a Feature View, Tecton performs a bootstrap materialization. The amount of data materialized during a bootstrap is controlled by the feature_start_time parameter.
Enabling Feature View materialization​
Batch and Stream Feature Views can be materialized to the online and/or offline store by setting online=True and/or offline=True in the Feature View decorator parameters. The sections below describe how these options apply to each type.
Realtime Feature Views cannot be materialized since they are calculated only at request-time.
Batch Feature Views​
You can easily make batch features available for low-latency online retrieval to feed an online model. Tecton also supports offline materialization to speed up some expensive queries.
If you don’t materialize Batch Feature Views offline, Tecton will execute your transformation directly against the upstream raw data sources when you use the Tecton SDK to generate offline data. Speaking in SQL terms, a Batch Feature View without offline materialization is simply a “View”. A Batch Feature View with offline materialization is a “Materialized View”.
Offline materialization has additional benefits including:
- Online-offline skew is minimized because the data in the online and offline store is updated at the same time.
- Offline feature data is saved so it is resilient to any losses of historical data upstream and training datasets can always be regenerated.
Stream Feature Views​
For online materialization, Tecton will run the Stream Feature View transformation on each event that comes in from the underlying stream source and write the result to the online store. Any previous values will be overwritten, so the online store only has the most recent value. Feature data will be backfilled from the Stream Data Source's log of historical events (configured via its batch_config).
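The overwrite behavior described above can be pictured with a toy in-memory store. This is only an illustrative sketch: the dict-based store, the `write_to_online_store` helper, and the event fields are hypothetical and are not how Tecton's online store is implemented. Events are assumed to arrive in timestamp order.

```python
from datetime import datetime, timezone


def write_to_online_store(store, event):
    """Overwrite whatever is stored for this entity key (last write wins)."""
    store[event["user_id"]] = {"amt": event["amt"], "timestamp": event["timestamp"]}


# Hypothetical stream events for a single user, in arrival order.
events = [
    {"user_id": "u1", "amt": 10.0, "timestamp": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)},
    {"user_id": "u1", "amt": 25.0, "timestamp": datetime(2024, 1, 1, 9, 5, tzinfo=timezone.utc)},
]

online_store = {}
for event in events:
    write_to_online_store(online_store, event)

# Only the most recent value for "u1" remains in the store.
latest = online_store["u1"]
```

After both events are processed, a lookup for `u1` returns only the second event's value; the first one is no longer available online.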
Feature data can also be materialized to the Offline Store in order to speed up offline queries (for testing and training data generation). Tecton will run the same Stream Feature View transformation pipeline against the batch source (the historical log of stream events) that backs the stream source. The batch_schedule parameter determines how often Tecton will run offline materialization jobs.
Materialization Job Scheduling Behavior​
During the initial backfill of your feature, Tecton tries to minimize the number of backfill jobs in order to reduce backfill costs. For example, if you define a feature with a batch_schedule of 1 day that needs to backfill 1 year's worth of data, you will find that Tecton schedules just ~10 distinct backfill jobs rather than the 365 you might expect. You can modify Tecton's backfill job splitting behavior by setting the max_backfill_interval parameter.
If you're used to common data engineering tools like Airflow, you may expect Tecton to schedule one backfill job for every batch_schedule interval in your backfill period. For instance, you may expect a feature that has a daily batch_schedule and needs to be backfilled for the past year to kick off 365 distinct backfill jobs. This is common practice and the default behavior of most off-the-shelf ETL solutions (like Airflow), but it can be very expensive. You can force Tecton to use this naive backfill mode by setting incremental_backfill to True. Please visit this guide, which discusses a valid use case for this mode.
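To make the difference in job counts concrete, here is a minimal sketch of how a backfill range could be split into windows. The `split_backfill` helper is hypothetical and does not reproduce Tecton's actual scheduling logic. The 40-day interval is an assumed value chosen only because it yields about 10 jobs for one year; it is not a documented default.

```python
from datetime import datetime, timedelta


def split_backfill(start, end, max_interval):
    """Split [start, end) into consecutive windows of at most max_interval."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + max_interval, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


start = datetime(2023, 1, 1)
end = datetime(2024, 1, 1)  # 365 days of history to backfill

# One job per batch_schedule interval (the "naive" mode described above).
naive_jobs = split_backfill(start, end, timedelta(days=1))

# Fewer, larger jobs (assumed 40-day interval, for illustration only).
batched_jobs = split_backfill(start, end, timedelta(days=40))

print(len(naive_jobs), len(batched_jobs))
```

With these inputs the first call produces 365 windows and the second produces 10, with the final window shortened so that it ends exactly at the end of the backfill range.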
For every steady-state (forward-fill) interval of your feature, Tecton will schedule exactly one materialization job.
Feature Data Timestamp Expectations​
Every materialization run is expected to produce feature values for a specific time range. This time range is known as the “materialization window”. The materialization window is different for backfills and incremental runs:
- During the initial backfill of feature data to the Online and Offline Feature Store, the materialization time window starts with feature_start_time and ends with Tecton’s “current wall clock time” at which the feature’s materialization is enabled.
- On incremental runs, the materialization time window starts with the previous run’s start_time and ends with start_time + batch_schedule.
Tecton only materializes feature values that fall within the materialization time window. It automatically filters out unexpected feature values as shown with the WHERE clause below:
--Tecton applies this filter to the user-provided transformation
SELECT * FROM {batch_feature_view_transformation}
WHERE {timestamp_field} >= {start_time}
AND {timestamp_field} < {end_time}
The start time of the window is inclusive and the end time is exclusive. This means that a feature value whose timestamp is exactly equal to the end_time is not part of the window.
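The half-open window semantics can be checked with a few boundary rows. The sketch below uses plain Python rather than Tecton's own filtering; the `in_window` helper and the sample rows are hypothetical, and the window is taken to be one daily run spanning start_time to start_time + batch_schedule.

```python
from datetime import datetime, timedelta


def in_window(ts, start_time, end_time):
    """True when ts falls in the half-open window [start_time, end_time)."""
    return start_time <= ts < end_time


batch_schedule = timedelta(days=1)
start_time = datetime(2024, 1, 1)
end_time = start_time + batch_schedule

rows = [
    {"user_id": "u1", "timestamp": datetime(2023, 12, 31, 23, 59, 59)},  # before the window
    {"user_id": "u2", "timestamp": datetime(2024, 1, 1, 0, 0, 0)},  # equal to start_time
    {"user_id": "u3", "timestamp": datetime(2024, 1, 1, 23, 59, 59)},  # inside the window
    {"user_id": "u4", "timestamp": datetime(2024, 1, 2, 0, 0, 0)},  # equal to end_time
]

kept = [r["user_id"] for r in rows if in_window(r["timestamp"], start_time, end_time)]
```

Here `kept` contains `u2` and `u3`. The row stamped exactly at end_time is excluded from this run and would instead belong to the next run's window, so no row is materialized twice.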
Efficient Incremental Materialization​
In many cases, incremental materialization runs do not need to process all of the input source's raw data.
For example, an incremental materialization run of a row-level transformation that processes raw data at midnight every day should only look at the event data of the past 24 hours, and not the entire event history.
Automatic filtering using FilteredSource​
For convenience, Tecton offers a FilteredSource class that automatically pushes timestamp and partition filtering to the data source.
As a result, your Feature View transformation does not need to manually filter out raw data that's not required for the current materialization window.
Behind the scenes, Tecton will automatically filter the data source’s data based on its timestamp_field and, if applicable, its datetime_partition_columns.
Here is an example that shows how to use the FilteredSource in practice.
- Rift
- Spark
- Snowflake
# Rift (mode="pandas")
# `transactions` and `user` are assumed to be defined elsewhere in the feature repository.
from tecton import Attribute, batch_feature_view, FilteredSource
from tecton.types import Float64
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[FilteredSource(transactions)],
    entities=[user],
    mode="pandas",
    online=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
    timestamp_field="timestamp",
    features=[
        Attribute("amt", Float64),
    ],
)
def user_last_transaction_amount(transactions):
    # No manual time filter is needed: FilteredSource restricts the input to the materialization window.
    return transactions[["user_id", "timestamp", "amt"]]
# Spark (mode="spark_sql")
# `transactions` and `user` are assumed to be defined elsewhere in the feature repository.
from tecton import Attribute, batch_feature_view, FilteredSource
from datetime import datetime, timedelta
from tecton.types import Float64


@batch_feature_view(
    sources=[FilteredSource(transactions)],
    entities=[user],
    mode="spark_sql",
    online=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
    timestamp_field="timestamp",
    features=[
        Attribute("amount", Float64),
    ],
)
def user_last_transaction_amount(transactions):
    # No WHERE clause is needed: FilteredSource restricts the input to the materialization window.
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        """
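# Snowflake (mode="snowflake_sql")
# `transactions` and `user` are assumed to be defined elsewhere in the feature repository.
from tecton import batch_feature_view, FilteredSource
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[FilteredSource(transactions)],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions):
    # No WHERE clause is needed: FilteredSource restricts the input to the materialization window.
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        """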
By default, FilteredSource filters for data between context.start_time and context.end_time.
Manual filtering​
If FilteredSource isn't an option for you, you can manually filter for the raw data needed to produce feature values on each run by leveraging a context object that Tecton passes into the transformation function. context.start_time and context.end_time are equal to the expected materialization time window as shown in the diagram below:
The example transformations below filter for the required raw data, using a WHERE clause in the SQL variants and an equivalent boolean mask in the pandas variant.
- Rift
- Spark
- Snowflake
# Rift (mode="pandas")
# `transactions` and `user` are assumed to be defined elsewhere in the feature repository.
from tecton import Attribute, batch_feature_view, materialization_context
from tecton.types import Float64
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[transactions],
    entities=[user],
    online=True,
    mode="pandas",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
    timestamp_field="timestamp",
    features=[
        Attribute("amt", Float64),
    ],
)
def user_last_transaction_amount(transactions, context=materialization_context()):
    df = transactions[["user_id", "amt", "timestamp"]]
    # Keep only rows in the half-open window [start_time, end_time).
    return df[(df["timestamp"] >= context.start_time) & (df["timestamp"] < context.end_time)]
# Spark (mode="spark_sql")
# `transactions` and `user` are assumed to be defined elsewhere in the feature repository.
from tecton import batch_feature_view, materialization_context
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[transactions],
    entities=[user],
    online=True,
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions, context=materialization_context()):
    # Keep only rows in the half-open window [start_time, end_time).
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE TIMESTAMP >= TO_TIMESTAMP("{context.start_time}")
            AND TIMESTAMP < TO_TIMESTAMP("{context.end_time}")
        """
# Snowflake (mode="snowflake_sql")
# `transactions` and `user` are assumed to be defined elsewhere in the feature repository.
from tecton import batch_feature_view, materialization_context
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[transactions],
    entities=[user],
    online=True,
    mode="snowflake_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions, context=materialization_context()):
    # Snowflake string literals use single quotes; double quotes denote identifiers.
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE TIMESTAMP >= TO_TIMESTAMP('{context.start_time}')
            AND TIMESTAMP < TO_TIMESTAMP('{context.end_time}')
        """
In cases where you read from a time-partitioned data source, like a Glue table or partitioned data on S3, you typically will also want to filter by partition columns.
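As a rough illustration of partition filtering, the sketch below lists the daily partitions that overlap a materialization window, so a query could restrict itself to those partitions before applying the exact timestamp filter. The `daily_partitions` helper and the `YYYY-MM-DD` partition naming are assumptions made for this example; your source's partition layout may differ.

```python
from datetime import datetime, timedelta


def daily_partitions(start_time, end_time):
    """Return date strings for each daily partition overlapping [start_time, end_time)."""
    partitions = []
    # Start from midnight of the day that contains start_time.
    day = start_time.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end_time:
        partitions.append(day.strftime("%Y-%m-%d"))
        day += timedelta(days=1)
    return partitions


# A window aligned to midnight touches exactly one daily partition.
aligned = daily_partitions(datetime(2024, 1, 1), datetime(2024, 1, 2))

# A window that straddles midnight touches two daily partitions.
straddling = daily_partitions(datetime(2024, 1, 1, 12), datetime(2024, 1, 2, 12))
```

The resulting values could then be used in a partition predicate alongside the timestamp predicate, which lets the query engine skip partitions that cannot contain rows for the current window.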
Late Arriving Data​
By default, incremental materialization jobs for Batch Feature Views run immediately at the end of the batch schedule period. To override this default, set the data_delay parameter, which is specified in the data source configuration (the batch_config object referenced in the BatchSource object used by the Batch Feature View). data_delay configures how long jobs wait after the end of the batch schedule period before starting. This is typically used to ensure that all data has landed. For example, if a Batch Feature View has a batch_schedule of 1 day and its data source input has data_delay=timedelta(hours=1) set, then incremental batch materialization jobs will run at 01:00 UTC.
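The arithmetic behind that example is simple enough to sketch directly. The `job_start_time` helper below is hypothetical and only mirrors the rule stated above: a job starts at the end of its batch schedule period plus data_delay.

```python
from datetime import datetime, timedelta, timezone


def job_start_time(period_end, data_delay=timedelta(0)):
    """Earliest time a job for the period ending at period_end would start."""
    return period_end + data_delay


# A daily period covering 2024-01-01 ends at midnight UTC on 2024-01-02.
period_end = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)

without_delay = job_start_time(period_end)
with_delay = job_start_time(period_end, timedelta(hours=1))
```

With no delay the job starts at 00:00 UTC; with a one-hour data_delay it starts at 01:00 UTC, while still processing the same period of data.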
If your upstream data delay is unpredictable, you can trigger materialization with an API call instead. Please follow these instructions.
Monitoring​
Tecton provides tools to monitor and debug production Feature Views via the Web UI, SDK, and CLI. More information on monitoring is available in Monitoring Materialization.