Version: 0.9

Materialize Features

Materialization is an essential part of how Tecton manages the operational ML feature lifecycle. It refers to the process of precomputing feature data using a feature pipeline and publishing the results to the Online Feature Store, the Offline Feature Store, or both.

The main objective of materialization is to enable quick feature retrieval during training and inference, thereby reducing latencies and improving the efficiency of machine learning applications.

Types of Materialization

Tecton handles backfill and steady-state materialization for batch and stream features based on your Feature View configuration.

Steady-state Materialization

Steady-state materialization processes new data as it arrives. It runs continuously on every Feature View that has materialization enabled.

When a Feature View has materialization enabled, Tecton schedules steady-state materialization jobs on an ongoing basis to keep feature values fresh. The frequency of these jobs is controlled by the batch_schedule parameter. If you use Delta for the offline store, Tecton also runs background maintenance tasks on a 1-day schedule. These tasks perform optimize and vacuum operations to keep file management on your Delta tables efficient.

Backfill materialization

Backfill refers to any materialization operations performed on data in the past. There are two Backfill operations.

The initial materialization of a Feature View is referred to as a bootstrap backfill. During a bootstrap materialization, existing raw data is processed into feature values.

When materialization is initially enabled for a Feature View, Tecton performs a bootstrap materialization. The amount of data materialized during a bootstrap is controlled by the feature_start_time parameter.

Enabling Feature View materialization

Every Batch and Stream Feature View can enable materialization to the online and/or offline store by setting online=True and/or offline=True in the Feature View decorator parameters. These options are available for the following types of Feature Views:

On-Demand Feature Views cannot be materialized since they are calculated only at request-time.

Batch Feature Views

You can easily make batch features available for low-latency online retrieval to feed an online model. Tecton also supports offline materialization to speed up some expensive queries.

If you don’t materialize Batch Feature Views offline, Tecton will execute your transformation directly against the upstream raw data sources when you use the Tecton SDK to generate offline data. Speaking in SQL terms, a Batch Feature View without offline materialization is simply a “View”. A Batch Feature View with offline materialization is a “Materialized View”.

Offline materialization has additional benefits including:

Online-offline skew is minimized because the data in the online and offline store is updated at the same time.
Offline feature data is saved so it is resilient to any losses of historical data upstream and training datasets can always be regenerated.
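
The "View" versus "Materialized View" analogy above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. This is not Tecton code: the table and column names are hypothetical, and because SQLite has no materialized views, a table created from the query result stands in for one. The sketch also shows the resilience point: once upstream rows are lost, the view changes but the stored results do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical upstream raw data source.
conn.execute("CREATE TABLE raw_transactions (user_id TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO raw_transactions VALUES (?, ?)",
    [("u1", 10.0), ("u1", 5.0), ("u2", 7.5)],
)

feature_query = (
    "SELECT user_id, SUM(amt) AS total_amt "
    "FROM raw_transactions GROUP BY user_id"
)

# "View": the transformation runs against the raw data on every read.
conn.execute(f"CREATE VIEW total_amt_view AS {feature_query}")

# Stand-in for a "Materialized View": results are computed once and stored.
conn.execute(f"CREATE TABLE total_amt_materialized AS {feature_query}")

# Simulate a loss of historical data upstream.
conn.execute("DELETE FROM raw_transactions WHERE user_id = 'u1'")

view_rows = conn.execute(
    "SELECT * FROM total_amt_view ORDER BY user_id"
).fetchall()
materialized_rows = conn.execute(
    "SELECT * FROM total_amt_materialized ORDER BY user_id"
).fetchall()

print(view_rows)          # [('u2', 7.5)]
print(materialized_rows)  # [('u1', 15.0), ('u2', 7.5)]
```

The view no longer returns anything for u1, while the stored copy still holds the value computed before the deletion.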

Stream Feature Views

For online materialization, Tecton will run the Stream Feature View transformation on each event that comes in from the underlying stream source and write it to the online store. Any previous values will be overwritten, so the online store only has the most recent value. Feature data will be backfilled from the Stream Data Source's log of historical events (configured via its batch_config).

Feature data can also be materialized to the Offline Store in order to speed up offline queries (for testing and training data generation). Tecton will run the same Stream Feature View transformation pipeline against the batch source (the historical log of stream events) that backs the stream source. The batch_schedule parameter determines how often Tecton will run offline materialization jobs.

Materialization Job Scheduling Behavior

During the initial backfill of your feature, Tecton tries to minimize the number of backfill jobs in order to drastically reduce backfill costs. For example, if you define a feature with a batch_schedule of 1 day that needs to backfill 1 year's worth of data, you will find that Tecton schedules just ~10 distinct backfill jobs, rather than the 365 you might expect. You can modify Tecton's backfill job splitting behavior by setting the max_backfill_interval parameter.

info

If you're used to common data engineering tools like Airflow, you may expect Tecton to schedule one backfill job for every batch_schedule interval in your backfill period. For instance, you may expect a feature that has a daily batch_schedule and needs to be backfilled for the past year to kick off 365 distinct backfill jobs. This is common practice and the default behavior of most off-the-shelf ETL solutions (like Airflow), but it can be very expensive. You can force Tecton to use this naive backfill mode by setting incremental_backfills to True. Please visit this guide that discusses a valid use case for this mode.

For every steady-state (forward-fill) interval of your feature, Tecton will schedule exactly one materialization job.
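
The difference between one backfill job per batch_schedule interval and a smaller number of larger jobs can be sketched in plain Python. This is only an illustration under stated assumptions: split_backfill is a hypothetical helper, not a Tecton API, Tecton's actual job-splitting logic is not described here, and the 40-day cap is an arbitrary value chosen for the example rather than a Tecton default.

```python
from datetime import datetime, timedelta


def split_backfill(start, end, batch_schedule, max_backfill_interval):
    """Split [start, end) into job windows no longer than max_backfill_interval.

    Each window covers a whole number of batch_schedule periods
    (hypothetical helper for illustration only).
    """
    if max_backfill_interval < batch_schedule:
        raise ValueError("max_backfill_interval must be >= batch_schedule")
    # Whole batch_schedule periods that fit into one job.
    periods_per_job = max_backfill_interval // batch_schedule
    step = batch_schedule * periods_per_job
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


start = datetime(2023, 1, 1)
end = datetime(2024, 1, 1)  # 365 days of history

# One job per daily interval, as a typical ETL scheduler would do.
naive_jobs = split_backfill(start, end, timedelta(days=1), timedelta(days=1))

# Larger jobs, each covering up to 40 daily intervals (illustrative cap).
capped_jobs = split_backfill(start, end, timedelta(days=1), timedelta(days=40))

print(len(naive_jobs))   # 365
print(len(capped_jobs))  # 10
```

Both schedules cover exactly the same time range; only the number of jobs, and therefore the per-job overhead, differs.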

Feature Data Timestamp Expectations

Every materialization run is expected to produce feature values for a specific time range. This time range is known as the “materialization window”. The materialization window is different for backfills and incremental runs:

During the initial backfill of feature data to the Online and Offline Feature Store, the materialization time window starts with feature_start_time and ends with Tecton’s “current wall clock time” at which the feature’s materialization is enabled.
On incremental runs, the materialization time window starts with the previous run’s start_time and ends with start_time + batch_schedule.

[Diagram: Backfill and Incremental Materialization]

Tecton only materializes feature values that fall within the materialization time window. It automatically filters out unexpected feature values as shown with the WHERE clause below:

--Tecton applies this filter to the user-provided transformation

SELECT * FROM {batch_feature_view_transformation}
WHERE {timestamp_field} >= {start_time}
  AND {timestamp_field} < {end_time}

info

The start time of the window is inclusive and the end time is exclusive. This means that a feature value whose timestamp is exactly equal to the end_time is not part of the window.
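
The inclusive-start, exclusive-end rule can be checked with a small sketch. The helper and the sample rows are hypothetical and only mirror the WHERE clause shown above; they are not part of the Tecton SDK.

```python
from datetime import datetime


def in_materialization_window(timestamp, start_time, end_time):
    """Mirror of the filter: start_time is inclusive, end_time is exclusive."""
    return start_time <= timestamp < end_time


start_time = datetime(2024, 1, 1)
end_time = datetime(2024, 1, 2)

rows = [
    {"user_id": "u1", "timestamp": datetime(2024, 1, 1, 0, 0)},    # equals start_time
    {"user_id": "u2", "timestamp": datetime(2024, 1, 1, 23, 59)},  # inside the window
    {"user_id": "u3", "timestamp": datetime(2024, 1, 2, 0, 0)},    # equals end_time
]

kept = [
    row["user_id"]
    for row in rows
    if in_materialization_window(row["timestamp"], start_time, end_time)
]
print(kept)  # ['u1', 'u2']
```

With half-open windows, a row that falls exactly on a boundary belongs to exactly one of two adjacent windows, so consecutive runs neither drop it nor process it twice.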

Efficient Incremental Materialization

In many cases, incremental materialization runs do not need to process all of the input source's raw data.

For example, an incremental materialization run of a row-level transformation that processes raw data at midnight every day should only look at the event data of the past 24 hours, and not the entire event history.

Automatic filtering using FilteredSource

For convenience, Tecton offers a FilteredSource class that automatically pushes timestamp and partition filtering to the data source.

As a result, your Feature View transformation does not need to manually filter out raw data that's not required for the current materialization window.

Behind the scenes, Tecton will automatically filter the data source’s data based on its timestamp_field and, if applicable, its datetime_partition_columns.

The examples below show how to use FilteredSource in practice, for Rift, Spark, and Snowflake, in that order.

from tecton import batch_feature_view, FilteredSource
from tecton.types import Field, String, Timestamp, Float64
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[FilteredSource(transactions)],
    entities=[user],
    mode="pandas",
    online=True,
    batch_schedule=timedelta(days=1),
    schema=[Field("user_id", String), Field("timestamp", Timestamp), Field("amt", Float64)],
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions):
    return transactions[["user_id", "timestamp", "amt"]]

from tecton import batch_feature_view, FilteredSource
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[FilteredSource(transactions)],
    entities=[user],
    mode="spark_sql",
    online=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions):
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        """

from tecton import batch_feature_view, FilteredSource
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[FilteredSource(transactions)],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions):
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        """

By default, FilteredSource filters for data between context.start_time and context.end_time.

Manual filtering

If FilteredSource isn't an option for you, you can manually filter for the raw data needed to produce feature values on each run by leveraging a context object that Tecton passes into the transformation function. context.start_time and context.end_time mark the start and end of the expected materialization time window, as shown in the diagram below:

[Diagram: Materialization Context Window]

The example transformations below (for Rift, Spark, and Snowflake, in that order) filter for the required raw data: the pandas example uses a boolean mask, and the SQL examples use a WHERE clause.

from tecton import batch_feature_view, materialization_context
from tecton.types import Field, String, Timestamp, Float64
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[transactions],
    entities=[user],
    online=True,
    mode="pandas",
    batch_schedule=timedelta(days=1),
    schema=[Field("user_id", String), Field("timestamp", Timestamp), Field("amt", Float64)],
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions, context=materialization_context()):
    df = transactions[["user_id", "amt", "timestamp"]]
    return df[(df["timestamp"] >= context.start_time) & (df["timestamp"] < context.end_time)]

from tecton import batch_feature_view, materialization_context
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[transactions],
    entities=[user],
    online=True,
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions, context=materialization_context()):
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE TIMESTAMP >= TO_TIMESTAMP("{context.start_time}")
            AND TIMESTAMP < TO_TIMESTAMP("{context.end_time}")
        """

from tecton import batch_feature_view, materialization_context
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[transactions],
    entities=[user],
    online=True,
    mode="snowflake_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
)
def user_last_transaction_amount(transactions, context=materialization_context()):
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE TIMESTAMP >= TO_TIMESTAMP('{context.start_time}')
            AND TIMESTAMP < TO_TIMESTAMP('{context.end_time}')
        """

info

In cases where you read from a time-partitioned data source, like a Glue table or partitioned data on S3, you typically will also want to filter by partition columns.
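
As a rough sketch of partition filtering, the helper below lists the daily partitions that can contain rows for a given materialization window. It assumes a hypothetical source partitioned by calendar day with values formatted as YYYY-MM-DD; your partition layout may differ, and partitions_for_window is not a Tecton function.

```python
from datetime import datetime, timedelta


def partitions_for_window(start_time, end_time):
    """Return daily partition values that may hold rows in [start_time, end_time)."""
    if end_time <= start_time:
        return []
    day = start_time.date()
    # end_time is exclusive, so step back slightly to find the last day touched.
    last_day = (end_time - timedelta(microseconds=1)).date()
    partitions = []
    while day <= last_day:
        partitions.append(day.isoformat())
        day += timedelta(days=1)
    return partitions


single_day = partitions_for_window(datetime(2024, 1, 1), datetime(2024, 1, 2))
spanning = partitions_for_window(datetime(2024, 1, 1, 12), datetime(2024, 1, 3, 6))

print(single_day)  # ['2024-01-01']
print(spanning)    # ['2024-01-01', '2024-01-02', '2024-01-03']
```

The resulting values could be used in a partition predicate next to the timestamp filter, so the query engine skips partitions outside the window instead of scanning them.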

Late Arriving Data

By default, incremental materialization jobs for Batch Feature Views run immediately at the end of the batch schedule period. To override this default, set the data_delay parameter, which is specified in the data source configuration (the batch_config object referenced in the BatchSource object used by the Batch Feature View). data_delay configures how long jobs wait after the end of the batch schedule period before starting. This is typically used to ensure that all data has landed. For example, if a Batch Feature View has a batch_schedule of 1 day and its data source input has data_delay=timedelta(hours=1) set, then incremental batch materialization jobs will run at 01:00 UTC.
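
The timing in the example above can be reproduced with simple datetime arithmetic. This sketch assumes daily batch periods that end at 00:00 UTC, as the example implies; job_start_time is a hypothetical helper, not a Tecton API.

```python
from datetime import datetime, timedelta, timezone


def job_start_time(period_end, data_delay=timedelta(0)):
    """Earliest start of the job that materializes the period ending at period_end."""
    return period_end + data_delay


# End of one daily batch schedule period (assumed to fall on midnight UTC).
period_end = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)

default_start = job_start_time(period_end)
delayed_start = job_start_time(period_end, data_delay=timedelta(hours=1))

print(default_start.isoformat())  # 2024-01-02T00:00:00+00:00
print(delayed_start.isoformat())  # 2024-01-02T01:00:00+00:00
```

Only the job's start moves; the period being materialized stays the same, which gives late rows an extra hour to land before they are read.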

If your upstream data delay is unpredictable, you can trigger materialization with an API call. Please follow these instructions.

Duplicate Feature Behavior in Tecton

When materializing features in Tecton, it's important to understand how the system handles duplicate data with the same entity ID/join key and timestamp combination. This behavior varies based on the feature view type, whether aggregations are used, and the target store (online vs. offline).

Non-Aggregation Feature Views

For Batch Feature Views and Stream Feature Views that do not use aggregations:

Tecton expects unique values for each entity ID + timestamp combination
When duplicates occur, the last write wins (the last entry in the raw data source overwrites previous entries)
In the online store, a new write overwrites existing data if its timestamp is greater than or equal to the timestamp of the existing data
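
The overwrite rule for the online store can be modeled with a dictionary. This is a simplified sketch of the behavior described above, not Tecton's implementation: write_online and the in-memory store are hypothetical stand-ins.

```python
from datetime import datetime


def write_online(store, join_key, timestamp, value):
    """Overwrite the stored value only if the new timestamp is >= the existing one."""
    existing = store.get(join_key)
    if existing is None or timestamp >= existing["timestamp"]:
        store[join_key] = {"timestamp": timestamp, "value": value}


store = {}
earlier = datetime(2024, 1, 1, 6)
later = datetime(2024, 1, 1, 12)

write_online(store, "u1", later, 10.0)
write_online(store, "u1", later, 20.0)    # duplicate key + timestamp: last write wins
write_online(store, "u1", earlier, 99.0)  # older timestamp: ignored

print(store["u1"]["value"])  # 20.0
```

Because the comparison is "greater than or equal", a duplicate with the same timestamp replaces the earlier write, which is why the order of duplicates in the raw data decides the final value.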

Aggregation Feature Views

For Feature Views that use aggregations:

Existing tiles will not be overwritten or deleted
New tiles can be rolled up alongside previously materialized tiles
Duplicate rows in the raw data get double-counted in the final aggregation result
If a Feature View has a TTL defined, online store upserts will only happen for data with a timestamp after the TTL
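
The double-counting effect can be seen in a small sketch that sums amounts into daily tiles. The tile layout here is a simplification for illustration, with hypothetical names, and does not reflect how Tecton stores tiles.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_tiles(rows, tile_size):
    """Sum amounts into fixed-size time tiles keyed by (user_id, tile_start)."""
    epoch = datetime(1970, 1, 1)
    tiles = defaultdict(float)
    for user_id, timestamp, amt in rows:
        tile_index = (timestamp - epoch) // tile_size
        tile_start = epoch + tile_index * tile_size
        tiles[(user_id, tile_start)] += amt
    return dict(tiles)


rows = [
    ("u1", datetime(2024, 1, 1, 9), 10.0),
    ("u1", datetime(2024, 1, 1, 9), 10.0),  # exact duplicate of the row above
    ("u1", datetime(2024, 1, 2, 9), 5.0),
]

tiles = build_tiles(rows, timedelta(days=1))
total = sum(tiles.values())

print(tiles[("u1", datetime(2024, 1, 1))])  # 20.0
print(total)                                # 25.0
```

Without the duplicate the total would be 15.0. Nothing in the aggregation distinguishes a repeated row from a genuine second event, so both are counted.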

Stream Ingest API

For features pushed via the Stream Ingest API:

For the same entity ID + timestamp, the last write always wins

Monitoring

Tecton provides tools to monitor and debug production Feature Views via the Web UI, SDK, and CLI. More information on monitoring is available in Monitoring Materialization.
