
New Framework Preview for Tecton on Snowflake

This document provides a preview of Tecton’s next framework version, which changes how Feature View classes are defined.

This API is currently only available with Tecton on Snowflake.

For a general overview of Feature Views, refer to Feature Views Overview.

Please continue to refer to the main documentation for all other information regarding the usage of Tecton.

Batch Feature View

A Batch Feature View is used for defining transformations against a Batch Data Source. Batch Feature Views can run automatic backfills and can be scheduled to publish new feature data to the Online and Offline Feature Stores on a regular cadence.

Offline Features

By default, a Batch Feature View is registered as a view in Snowflake (when using Tecton on Snowflake). Tecton will run this query to generate offline feature data for testing, training, and batch predictions.

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql"
)
def user_transaction_features(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''
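
Once applied, offline feature data can be read back through the Tecton SDK, for example to build a training set. The snippet below is a minimal sketch: the "prod" workspace name and the spine values are assumptions, and it uses the standard SDK methods get_workspace, get_feature_view, and get_historical_features, whose exact behavior on the Snowflake preview may differ.

import pandas as pd
import tecton

# Hypothetical spine: the entity keys and timestamps to fetch feature values for.
spine = pd.DataFrame({
    "USER_ID": ["user_1", "user_2"],
    "TIMESTAMP": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})

fv = tecton.get_workspace("prod").get_feature_view("user_transaction_features")
training_data = fv.get_historical_features(spine).to_pandas()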

Online Feature Serving

Batch features can easily be made available for low-latency online retrieval to feed an online model. Simply set online=True, specify a batch_schedule frequency at which feature data is materialized online, and set a feature_start_time date from which feature data will be backfilled.

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    ttl='3650d'
)
def user_transaction_features(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''

During the initial backfill of feature data, Tecton will publish values between the provided feature_start_time and now.

On incremental runs, Tecton will only publish new feature values with timestamps greater than the last materialization run time.
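
Once feature values have been materialized online, they can be retrieved at low latency for a given entity key. The snippet below is a hedged sketch using the SDK's get_online_features method; the workspace name and join key value are assumptions.

import tecton

fv = tecton.get_workspace("prod").get_feature_view("user_transaction_features")

# Fetch the freshest feature values for a single user (join key value is hypothetical).
feature_vector = fv.get_online_features(join_keys={"USER_ID": "user_1"})
print(feature_vector.to_dict())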

Offline Materialization (Coming Soon)

Offline queries for feature data can be sped up by setting offline=True on a Batch Feature View. When this value is True, rather than registering a view in your data platform, Tecton will register a table and incrementally materialize features into it, using the same logic described above for online materialization.

Offline materialization has additional benefits including:

  • Offline feature data is saved so it’s resilient to any losses of historical data upstream.
  • Online-offline skew is minimized because online and offline data are incrementally updated at the same time.

For these reasons, we recommend enabling offline materialization after features have been tested offline.

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    ttl='3650d',
    offline=True
)
def user_transaction_features(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''

Speeding Up Materialization with Incremental Raw Data Filtering (Coming Soon)

Note

We strongly recommend adding raw data filtering to Spark-based Feature Views in order to achieve good performance.

In some cases, running the provided query against the full set of raw data for each materialization run can become slow and inefficient. Users can filter for only the raw data needed to produce feature values on each run by leveraging a context object that Tecton passes into the transformation.

On the first backfill run, context.start_time will equal the feature_start_time specified in the Batch Feature View decorator. On subsequent incremental runs, it will equal the last materialization run time.

Note: Depending on your data platform and raw data layout, filtering by partition columns may also be necessary to properly filter feature data and speed up materialization.

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    ttl='3650d',
    offline=True
)
def user_transaction_features(context, transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE
            TIMESTAMP > {context.start_time}
        '''

Using Filtered Sources for Convenient Timestamp and Partition Filtering

For convenience, Tecton offers a FilteredSource class that applies timestamp and partition filtering automatically based on parameters set on the Batch Data Source.

In the example below, this would replace transactions with SELECT * FROM FRAUD.TRANSACTIONS WHERE TIMESTAMP >= {context.start_time}.

@batch_feature_view(
    sources=[FilteredSource(source=transactions, offset=timedelta(days=0))],
    entities=[user],
    mode="snowflake_sql",
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    online=True,
    offline=True
)
def user_transaction_features(context, transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''

Defining Time-Windowed Aggregation Features

Time-windowed aggregations are common ML features for event data, but defining them in a view can be error-prone and inefficient.

Tecton provides built-in implementations of common time-windowed aggregations that simplify transformation logic and ensure correct feature value computation. Additionally, Tecton optimizes the compute and storage of these aggregations to maximize efficiency.

For these reasons, we recommend using Tecton’s built-in aggregations whenever possible.

Time-windowed aggregations can be specified in the Batch Feature View decorator using the aggregations and aggregation_slide_period parameters.

Tecton will expect the provided SQL query to select the raw events (with timestamps) to be aggregated.

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
    aggregations=[FeatureAggregation(column='AMOUNT', function='mean', time_windows=['30d', '60d', '90d'])],
    aggregation_slide_period='1d'
)
def user_transaction_features(context, transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE
            TIMESTAMP > {context.start_time}
        '''

Parameters

See the API reference for the full list of parameters.

The new Batch Feature View class has the same API spec as the linked Batch Window Aggregate Feature View class with the following changes:

  • inputs is renamed to sources and takes a list of batch Data Sources.
  • The aggregation_slide_period and aggregations parameters are optional.
  • batch_schedule will default to aggregation_slide_period if aggregations are present.
  • ttl will be set automatically if aggregations are present (both defaults are sketched below).
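
For illustration, here is a minimal sketch that relies on both defaults: batch_schedule is omitted and falls back to the aggregation_slide_period of '1d', and ttl is omitted because aggregations are present. The feature view name and aggregation choice are hypothetical.

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    offline=True,
    feature_start_time=datetime(2020, 10, 10),
    # batch_schedule is omitted: it defaults to aggregation_slide_period ('1d').
    # ttl is omitted: it is set automatically because aggregations are present.
    aggregations=[FeatureAggregation(column='AMOUNT', function='sum', time_windows=['7d'])],
    aggregation_slide_period='1d'
)
def user_weekly_spend(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''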

Stream Feature View

Note: Stream Feature Views are only supported with Spark and not with Tecton on Snowflake today.

StreamFeatureView is used for computing fresh streaming features. It processes raw data from a StreamDataSource (e.g. Kafka or Kinesis) and can be backfilled from any BatchDataSource (e.g. S3, Hive tables, Redshift) that contains a historical log of events.

Stream Feature Views take a single source as input and can run event-level SQL transformations to produce features. Time-windowed aggregations can optionally be applied to the output rows of the SQL transformation. Tecton manages these aggregation transforms in order to optimize compute and storage as well as ensure proper time decay.

Examples

Row-Level Transformations

@stream_feature_view(
    source=transactions_stream,
    entities=[user],
    mode='spark_sql',
    online=True,
    offline=True,
    feature_start_time=datetime(2020, 10, 10),
    batch_schedule='1d',
    ttl='120d'
)
def user_last_transaction_amount(transactions_stream):
    return f'''
        SELECT
            NAMEORIG as USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions_stream}
        '''

Time-Windowed Aggregations

Time-windowed aggregations are applied via the optional aggregation_slide_period and aggregations parameters.

@stream_feature_view(
    source=transactions_stream,
    entities=[user],
    mode='spark_sql',
    online=True,
    feature_start_time=datetime(2020, 10, 10),
    batch_schedule='1d',
    aggregation_slide_period='5min',
    aggregations=[
        FeatureAggregation(column='AMOUNT', function='mean', time_windows=['5m', '30m', '90m']),
        FeatureAggregation(column='AMOUNT', function='sum', time_windows=['5m', '30m', '90m'])
    ]
)
def user_transaction_amount_metrics(transactions_stream):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions_stream}
        '''

Parameters

See the API reference for the full list of parameters.

The new Stream Feature View class has the same API spec as the linked Stream Window Aggregate Feature View class with the following changes:

  • inputs is renamed to source and takes a single StreamDataSource.
  • The aggregation_slide_period and aggregations parameters are optional.
  • ttl will be set automatically if aggregations are present.

On-Demand Feature View

On-Demand Feature Views are unchanged from previous framework versions. Full documentation can be found here.