Batch Feature View

A BatchFeatureView is used for defining row-level or aggregate transformations against a Batch Data Source (e.g. S3, Hive, Redshift, Snowflake). Batch Feature Views run automatic backfills and can be scheduled to publish new feature data to the Online and Offline Feature Stores on a regular cadence.

Note: Many aggregations are already supported out of the box in a Batch Window Aggregate Feature View. These aggregations have been optimized for cost and efficiency and are a good place to start if you are looking to define time-windowed aggregations.

Use a BatchFeatureView if:

  • you have your raw events available in a Batch Data Source
  • you want to run simple row-level transformations on the raw data, or simply ingest raw data without further transformation
  • you want to define custom join and aggregation transformations
  • your use case can tolerate a feature freshness of > 1 hour
  • you want to ingest a dimension table (e.g. a user's attributes) for feature consumption

Common Examples:

  • determining if a user's credit score is over a pre-defined threshold
  • counting distinct transactions over a time window
  • batch ingesting precomputed feature values from an existing batch data source
  • batch ingesting a user's date of birth

Feature Definition Example

For more examples, see Examples here.

To create a Batch Feature View, use the @batch_feature_view annotation on your Python function.

Row-Level Transformation
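For illustration, a row-level transformation in mode='spark_sql' might look like the following sketch. The credit_scores_batch source, user entity, and column names are assumptions, and the annotation parameters follow the pre-1.0 SDK, so check the API reference for your version.

from datetime import datetime

from tecton import batch_feature_view, Input

# A minimal sketch, assuming a credit_scores_batch BatchDataSource and a
# user entity are defined elsewhere in the feature repository.
@batch_feature_view(
    inputs={'credit_scores': Input(credit_scores_batch)},
    entities=[user],
    mode='spark_sql',
    online=True,
    offline=True,
    feature_start_time=datetime(2021, 4, 1),
    batch_schedule='1d',
    ttl='30d',
)
def user_credit_score_over_threshold(credit_scores):
    # The output must include the entity ID (user_id) and a timestamp;
    # every other column is treated as a feature.
    return f'''
        SELECT
            user_id,
            credit_score > 670 AS credit_score_over_threshold,
            timestamp
        FROM {credit_scores}
    '''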

Custom Aggregation Transformation
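For a custom aggregation, mode='pipeline' lets you compose multiple transformations. As a sketch (the transactions_batch source, user entity, and import paths are assumptions, with pre-1.0 parameter names), the distinct-merchant-count feature used later on this page could be declared like this:

from datetime import datetime

from tecton import batch_feature_view, const, Input, tecton_sliding_window

@batch_feature_view(
    inputs={'transactions_batch': Input(transactions_batch, window='30d')},
    entities=[user],
    mode='pipeline',
    online=True,
    offline=True,
    feature_start_time=datetime(2021, 4, 1),
    batch_schedule='1d',
    ttl='2d',
)
def user_distinct_merchant_transaction_count_30d(transactions_batch):
    # Chain the built-in windowing helper into the custom aggregation
    # transformation defined further below on this page.
    return user_distinct_merchant_transaction_count_transformation(
        tecton_sliding_window(transactions_batch,
            timestamp_key=const('timestamp'),
            window_size=const('30d')))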

Annotation Parameters

See the API reference for the full list of parameters.

The backfill_config parameter (under development) controls the grouping of the backfill jobs that Tecton spins up, and requires a matching form of the transformation. Currently, the only available value is BackfillConfig("multiple_batch_schedule_intervals_per_job"). More values will be supported in the future.
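For illustration only, opting in might look like this sketch, which shows just the relevant parameter; the remaining annotation arguments would be filled in as in the examples above:

from tecton import batch_feature_view, BackfillConfig

@batch_feature_view(
    mode='spark_sql',
    batch_schedule='1d',
    backfill_config=BackfillConfig('multiple_batch_schedule_intervals_per_job'),
    # ...remaining parameters omitted for brevity...
)
def my_feature_view(credit_scores):
    ...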

Transformation Pipeline

Batch Feature Views can use pyspark or spark_sql transformation types. You can configure mode=pipeline to construct a pipeline of those transformations, or use mode=pyspark or mode=spark_sql to define an inline transformation.

The output of your transformation must include columns for the entity IDs and a timestamp. All other columns will be treated as features.

Usage Example

See how to use a Batch Feature View in a notebook here.
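As a rough sketch, fetching features from a registered Batch Feature View in a notebook might look like this (the workspace and feature view names are hypothetical, and method names follow the pre-1.0 SDK):

from datetime import datetime

import tecton

ws = tecton.get_workspace('prod')
fv = ws.get_feature_view('user_distinct_merchant_transaction_count_30d')

# Read a range of feature values from the Offline Feature Store.
df = fv.get_historical_features(start_time=datetime(2021, 4, 1),
                                end_time=datetime(2021, 5, 1))
df.to_spark().show()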

How they work

When a Feature View is materialized online and offline, Tecton runs the BatchFeatureView transformation on the defined batch_schedule. It publishes the latest feature values per entity key to the Online Feature Store and all historical values to the Offline Feature Store.

Batch transformations are executed as Spark jobs (support for additional compute engines is coming soon).

How Tecton Uses Time in Batch Feature View Materialization

These parameters in a Batch Feature View definition configure how Tecton runs the materialization jobs (a combined sketch follows the list):

  1. batch_schedule (e.g. "1d"): Controls how often Tecton will materialize new feature values to the Feature Store.
  2. feature_start_time (e.g. datetime(2021, 4, 1)): Controls how far back Tecton will backfill feature data to the Feature Store once a new Feature View transformation is registered.
  3. window (e.g. "7d"): An optional parameter on each data source Input, which defaults to equal the Feature View batch_schedule and determines the time range of raw data Tecton will supply to the transformation for a given materialization run (e.g. the most recent 7 days worth of data). Tecton automatically filters data outside of this window based on the Data Source timestamp_key.
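Putting the three parameters together, a sketch (the source, entity, and values are illustrative):

from datetime import datetime

from tecton import batch_feature_view, Input

@batch_feature_view(
    # Supply each run with the trailing 7 days of raw data.
    inputs={'transactions': Input(transactions_batch, window='7d')},
    entities=[user],
    mode='spark_sql',
    # Materialize new feature values once per day...
    batch_schedule='1d',
    # ...and backfill daily runs from this date forward.
    feature_start_time=datetime(2021, 4, 1),
    online=True,
    offline=True,
    ttl='2d',
)
def user_transaction_features(transactions):
    ...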

Using tecton_sliding_window for windowed aggregations

When aggregating over a time window set with the window parameter, we recommend using the tecton_sliding_window() transformation. See this notebook for more details on how tecton_sliding_window() works.

First, add the tecton_sliding_window() transformation to your transformation pipeline. tecton_sliding_window() has three primary inputs:

  • df: the input data.
  • timestamp_key: the timestamp column in your input data that represents the time of the event.
  • window_size: how far back in time the window should go. For example, if your feature is the number of distinct IDs in the last 30 days, then the window size is 30 days. Typically this value should match the window on your Input.

In the example above, our transformation pipeline now looks like this:

def user_distinct_merchant_transaction_count_30d(transactions_batch):
    return user_distinct_merchant_transaction_count_transformation(
        tecton_sliding_window(transactions_batch,
            timestamp_key=const('timestamp'),
            window_size=const('30d')))

In the following transformation, you group by the window_end column alongside any entity columns. In the example above, our second transformation looks like this:

@transformation(mode='spark_sql')
def user_distinct_merchant_transaction_count_transformation(window_input_df):
    return f'''
        SELECT
            nameorig AS user_id,
            COUNT(DISTINCT namedest) AS distinct_merchant_count,
            window_end AS timestamp
        FROM {window_input_df}
        GROUP BY
            nameorig,
            window_end
    '''

And that's it! Tecton will now be able to calculate your feature that aggregates over the trailing 30 days.

FAQ

How is a BatchFeatureView different from a BatchWindowAggregateFeatureView?

A BatchFeatureView is the more generic but less specialized sibling of a BatchWindowAggregateFeatureView. Use a BatchWindowAggregateFeatureView whenever you need one of the time-window aggregations it supports. See the BatchWindowAggregateFeatureView documentation for a quick explanation of how Tecton supports these types of features under the hood by leveraging pre-computed and on-demand transformations.