New Framework Preview for Tecton on Snowflake
This document previews Tecton's upcoming framework version, which changes how Feature View classes are defined.
This API is currently only available with Tecton on Snowflake.
For a general overview of Feature Views, refer to Feature Views Overview.
Please continue to refer to the main documentation for all other information regarding the usage of Tecton.
Batch Feature View
A Batch Feature View is used for defining transformations against a Batch Data Source. Batch Feature Views can run automatic backfills and can be scheduled to publish new feature data to the Online and Offline Feature Stores on a regular cadence.
Offline Features
By default, a Batch Feature View is treated as a view and registered as a view in Snowflake (if using Tecton on Snowflake). Tecton will run this query to generate offline feature data for testing, training, and batch predictions.
from tecton import batch_feature_view

# `transactions` (a Batch Data Source) and `user` (an Entity) are defined elsewhere.
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql"
)
def user_transaction_features(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''
Online Feature Serving
Batch features can easily be made available for low-latency online retrieval in order to feed an online model. Simply set online=True and specify a batch_schedule frequency to materialize feature data online and a feature_start_time date to backfill feature data back to.
from datetime import datetime

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    ttl='3650d'
)
def user_transaction_features(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''
During the initial backfill of feature data, Tecton will publish values between the provided feature_start_time and now. On incremental runs, Tecton will only publish new feature values with timestamps greater than the last materialization run time.
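Once materialized, online feature values can be read back at low latency. Below is a minimal sketch using the Tecton SDK's get_online_features method; the workspace name, feature view name, and join key value are hypothetical:

import tecton

# Hypothetical workspace and feature view names for illustration.
ws = tecton.get_workspace("prod")
fv = ws.get_feature_view("user_transaction_features")

# Look up the latest feature values for one user from the Online Feature Store.
features = fv.get_online_features(join_keys={"USER_ID": "user_123"})
print(features.to_dict())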
Offline Materialization (Coming Soon)
Offline queries for feature data can be sped up by setting an offline=True flag on a Batch Feature View. When this value is True, rather than registering a view in your data platform, Tecton will register a table and incrementally materialize features into it using the same logic described in the section above for online materialization.
Offline materialization has additional benefits including:
- Offline feature data is saved so it’s resilient to any losses of historical data upstream.
- Online-offline skew is minimized because the data is incrementally updated at the same time.
For these reasons, we recommend enabling offline materialization after features have been tested offline.
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    ttl='3650d',
    offline=True
)
def user_transaction_features(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''
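With offline materialization enabled, training data reads hit the materialized table instead of re-running the view. A minimal sketch of reading offline features with the SDK's get_historical_features method; the workspace name, feature view name, and spine contents are hypothetical:

import pandas as pd
import tecton

# Hypothetical workspace and feature view names for illustration.
ws = tecton.get_workspace("prod")
fv = ws.get_feature_view("user_transaction_features")

# A spine of entity keys and timestamps to fetch point-in-time correct features for.
spine = pd.DataFrame({
    "USER_ID": ["user_123", "user_456"],
    "TIMESTAMP": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})

training_df = fv.get_historical_features(spine).to_pandas()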
Speeding Up Materialization with Incremental Raw Data Filtering (Coming Soon)
Note
We strongly recommend adding raw data filtering to Spark-based Feature Views in order to achieve good performance.
In some cases, running the provided query against the full set of raw data for each materialization run can become slow and inefficient. Users can filter for only the raw data needed to produce feature values on each run by leveraging a context object that Tecton passes into the transformation.
On the first backfill run, context.start_time will equal the feature_start_time specified in the Batch Feature View decorator. On subsequent incremental runs, it will equal the last materialization run time.
Note: Depending on your data platform and raw data layout, filtering by partition columns may also be necessary to properly filter feature data and speed up materialization.
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    ttl='3650d',
    offline=True
)
def user_transaction_features(context, transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE
            TIMESTAMP > {context.start_time}
        '''
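Following the note above, the raw data filter can combine the timestamp predicate with a partition predicate. A sketch, assuming a hypothetical date partition column DS on the transactions table:

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    ttl='3650d',
    offline=True
)
def user_transaction_features(context, transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE
            TIMESTAMP > {context.start_time}
            -- DS is a hypothetical partition column; adjust to your raw data layout.
            AND DS >= TO_DATE({context.start_time})
        '''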
Using Filtered Sources for Convenient Timestamp and Partition Filtering
For convenience, Tecton offers a FilteredSource class that applies timestamp and partition filtering automatically based on parameters set on the Batch Data Source. In the example below, this would replace transactions with SELECT * FROM FRAUD.TRANSACTIONS WHERE TIMESTAMP >= {context.start_time}.
from datetime import timedelta
from tecton import FilteredSource

@batch_feature_view(
    sources=[FilteredSource(source=transactions, offset=timedelta(days=0))],
    entities=[user],
    mode="snowflake_sql",
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    online=True,
    offline=True
)
def user_transaction_features(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''
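The offset parameter shifts the start of the automatic filter window. As a sketch, assuming a negative offset moves the window start earlier, a one-day negative offset could help tolerate late-arriving events:

@batch_feature_view(
    # Sketch: assumes offset=timedelta(days=-1) widens the automatic time filter
    # by one day to pick up late-arriving events.
    sources=[FilteredSource(source=transactions, offset=timedelta(days=-1))],
    entities=[user],
    mode="snowflake_sql",
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    online=True,
    offline=True
)
def user_transaction_features_with_offset(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''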
Defining Time-Windowed Aggregation Features
Time-windowed aggregations are common ML features for event data, but defining them in a view can be error-prone and inefficient.
Tecton provides built-in implementations of common time-windowed aggregations that simplify transformation logic and ensure correct feature value computation. Additionally, Tecton optimizes the compute and storage of these aggregations to maximize efficiency.
For these reasons, we recommend using Tecton’s built-in aggregations whenever possible.
Time-windowed aggregations can be specified in the Batch Feature View decorator using the aggregations and aggregation_slide_period parameters.
Tecton will expect the provided SQL query to select the raw events (with timestamps) to be aggregated.
from tecton import FeatureAggregation

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    batch_schedule='1d',
    feature_start_time=datetime(2020, 10, 10),
    offline=True,
    aggregations=[FeatureAggregation(column='AMOUNT', function='mean', time_windows=['30d', '60d', '90d'])],
    aggregation_slide_period='1d'
)
def user_transaction_features(context, transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        WHERE
            TIMESTAMP > {context.start_time}
        '''
Parameters
See the API reference for the full list of parameters.
The new Batch Feature View class has the same API spec as the linked Batch Window Aggregate Feature View class with the following changes:
- inputs is renamed to sources and takes a list of batch Data Sources.
- The aggregation_slide_period and aggregations parameters are optional.
- batch_schedule will default to aggregation_slide_period if aggregations are present.
- ttl will be set automatically if aggregations are present.
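As an illustration of these defaults, a minimal aggregate Batch Feature View can omit batch_schedule and ttl entirely. A sketch; the feature view name is hypothetical:

@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    online=True,
    feature_start_time=datetime(2020, 10, 10),
    # batch_schedule defaults to aggregation_slide_period ('1d') and ttl is set
    # automatically because aggregations are present.
    aggregations=[FeatureAggregation(column='AMOUNT', function='mean', time_windows=['30d'])],
    aggregation_slide_period='1d'
)
def user_avg_transaction_amount(transactions):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions}
        '''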
Stream Feature View
Note: Stream Feature Views are only supported with Spark and not with Tecton on Snowflake today.
A StreamFeatureView is used for computing fresh streaming features. It processes raw data from a StreamDataSource (e.g. Kafka and Kinesis) and can be backfilled from any BatchDataSource (e.g. S3, Hive Tables, Redshift) that contains a historical log of events.
Stream Feature Views take a single source as input and can run event-level SQL transformations to produce features. Time-windowed aggregations can optionally be applied to the output rows of the SQL transformation. Tecton manages the aggregation transforms in order to optimize compute and storage as well as ensure proper time decay.
Examples
Row-Level Transformations
from tecton import stream_feature_view

# `transactions_stream` (a Stream Data Source) is defined elsewhere.
@stream_feature_view(
    source=transactions_stream,
    entities=[user],
    mode='spark_sql',
    online=True,
    offline=True,
    feature_start_time=datetime(2020, 10, 10),
    batch_schedule='1d',
    ttl='120d'
)
def user_last_transaction_amount(transactions_stream):
    return f'''
        SELECT
            NAMEORIG as USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions_stream}
        '''
Time-Windowed Aggregations
Time-windowed aggregations are applied via the optional aggregation_slide_period and aggregations parameters.
@stream_feature_view(
    source=transactions_stream,
    entities=[user],
    mode='spark_sql',
    online=True,
    feature_start_time=datetime(2020, 10, 10),
    batch_schedule='1d',
    aggregation_slide_period='5min',
    aggregations=[
        FeatureAggregation(column='AMOUNT', function='mean', time_windows=['5m', '30m', '90m']),
        FeatureAggregation(column='AMOUNT', function='sum', time_windows=['5m', '30m', '90m'])
    ]
)
def user_transaction_amount_metrics(transactions_stream):
    return f'''
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {transactions_stream}
        '''
Parameters
See the API reference for the full list of parameters.
The new Stream Feature View class has the same API spec as the linked Stream Window Aggregate Feature View class with the following changes:
- inputs is changed to source and takes a single StreamDataSource.
- The aggregation_slide_period and aggregations parameters are optional.
- ttl will be set automatically if aggregations are present.
On-Demand Feature View
On-Demand Feature Views are unchanged from previous framework versions. Full documentation can be found here.