Batch Feature View

A BatchFeatureView defines row-level or aggregate transformations against a Batch Data Source (e.g. S3, Hive, Redshift, or Snowflake). Batch Feature Views run automatic backfills and can be scheduled to publish new feature data to the Online and Offline Feature Stores on a regular cadence.

Note: Many aggregations are already supported out of the box in a Batch Window Aggregate Feature View. These aggregations have been optimized for cost and efficiency and are a good place to start if you want to define time-windowed aggregations.

Use a BatchFeatureView if:

  • you have your raw events available in a Batch Data Source
  • you want to run simple row-level transformations on the raw data, or ingest raw data without further transformations
  • you want to define custom join and aggregation transformations
  • your use case can tolerate a feature freshness of more than 1 hour
  • you want to ingest a dimension table (e.g. a user's attributes) for feature consumption

Common Examples:

  • determining if a user's credit score is over a pre-defined threshold
  • counting distinct transactions over a time window
  • batch ingesting precomputed feature values from an existing batch data source
  • batch ingesting a user's date of birth

Feature Definition Example

For more examples, see the Examples page.

To create a Batch Feature View, use the @batch_feature_view annotation on your Python function.

Row-Level Transformation

Annotation Parameters

See the API reference for the full list of parameters.

Transformation Pipeline

Batch Feature Views can use pyspark or spark_sql transformation types. You can configure mode=pipeline to construct a pipeline of those transformations, or use mode=pyspark or mode=spark_sql to define an inline transformation.

The output of your transformation must include columns for the entity IDs and a timestamp. All other columns will be treated as features.
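The column rule above can be illustrated with a small plain-Python sketch. It does not use the Tecton SDK: `split_output_columns`, `entity_keys`, and `timestamp_key` are hypothetical names chosen for this example, and the column names are made up.

```python
def split_output_columns(columns, entity_keys, timestamp_key):
    """Classify transformation output columns.

    Illustrative helper only, not a Tecton API.
    """
    required = list(entity_keys) + [timestamp_key]
    missing = [c for c in required if c not in columns]
    if missing:
        raise ValueError(f"output is missing required columns: {missing}")
    # Everything that is not an entity ID or the timestamp is a feature.
    return [c for c in columns if c not in required]


features = split_output_columns(
    columns=["user_id", "timestamp", "credit_score_is_high"],
    entity_keys=["user_id"],
    timestamp_key="timestamp",
)
print(features)
```

Here `credit_score_is_high` is the only column left over after the entity ID and timestamp are set aside, so it is the only feature.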

Usage Example

See how to use a Batch Feature View in a notebook here.

How they work

When online and offline materialization are enabled, Tecton runs the BatchFeatureView transformation according to the defined batch_schedule. It publishes the latest feature values per entity key to the Online Feature Store and all historical values to the Offline Feature Store.
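The difference between what lands in each store can be sketched in plain Python. This is a conceptual model only, not how Tecton stores data internally; `publish` and the sample rows are hypothetical.

```python
from datetime import datetime


def publish(rows):
    """Split feature rows into an online view and an offline view.

    rows: iterable of (entity_key, timestamp, value) tuples.
    Illustrative only; not a Tecton API.
    """
    # Offline store: keep every historical row, ordered by key and time.
    offline = sorted(rows, key=lambda r: (r[0], r[1]))
    # Online store: later timestamps overwrite earlier ones, so only the
    # latest value per entity key survives.
    online = {}
    for key, ts, value in offline:
        online[key] = (ts, value)
    return online, offline


rows = [
    ("user_1", datetime(2021, 1, 2), 710),
    ("user_1", datetime(2021, 1, 1), 690),
    ("user_2", datetime(2021, 1, 1), 640),
]
online, offline = publish(rows)
print(online["user_1"])  # latest row for user_1
print(len(offline))      # every row is retained offline
```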

Batch transformations are executed as Spark jobs (additional compute will be supported soon).

How Tecton Uses Time in Batch Feature View Materialization

These parameters in a Batch Feature View definition configure how Tecton will run the materialization jobs:

  1. batch_schedule (e.g. "1 day"): Controls how often Tecton will materialize new feature values to the Feature Store.
  2. feature_start_time (e.g. "June 1st 2020"): Controls how far back Tecton will backfill feature data to the Feature Store once a new Feature View transformation is registered.

During each materialization run, Tecton also passes in a context object which can be used to reference the expected range of new feature values being materialized. context includes:

  1. context.feature_end_time: Always equal to the current scheduled run time of a materialization job.
  2. context.feature_end_time_string: A convenience parameter for use in SQL strings. This evaluates to context.feature_end_time.to_datetime_string().
  3. context.feature_start_time: During an initial backfill this is equal to the Feature View's feature_start_time parameter. During a steady-state materialization job this is equal to the last job run time.
  4. context.feature_start_time_string: A convenience parameter for use in SQL strings. This evaluates to context.feature_start_time.to_datetime_string().
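To make the relationship between batch_schedule, feature_start_time, and the context values concrete, here is a minimal plain-Python sketch. It assumes the initial backfill is a single job ending at the first scheduled run time (Tecton may split a backfill differently); `run_contexts`, `first_run_time`, and `num_runs` are hypothetical names, not part of the Tecton SDK.

```python
from datetime import datetime, timedelta


def run_contexts(feature_start_time, first_run_time, batch_schedule, num_runs):
    """Return (context.feature_start_time, context.feature_end_time) per run.

    Assumption: run 0 is the backfill, later runs are steady state.
    """
    contexts = []
    start, end = feature_start_time, first_run_time
    for _ in range(num_runs):
        contexts.append((start, end))
        # The next run starts where the previous one ended.
        start, end = end, end + batch_schedule
    return contexts


contexts = run_contexts(
    feature_start_time=datetime(2020, 6, 1),
    first_run_time=datetime(2021, 1, 1),
    batch_schedule=timedelta(days=1),
    num_runs=3,
)
for start, end in contexts:
    print(start.date(), "->", end.date())
```

The first range starts at the Feature View's feature_start_time; every later range starts at the previous run time and spans exactly one batch_schedule interval.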

Tecton automatically filters the raw data that is passed in to a transformation.

It is the user's responsibility to filter the feature data that is published to the Feature Store. Tecton expects new feature data to be between the last materialization job run time (or the Feature View's feature_start_time during an initial backfill) and the current job run time. In other words: context.feature_start_time <= new feature data < context.feature_end_time. This allows Tecton to preserve full data lineage and protect against accidental data leakage. Note: When doing event-level transformations that preserve timestamps, no filtering logic should be needed because the raw data has already been filtered by Tecton.
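The half-open range context.feature_start_time <= new feature data < context.feature_end_time can be sketched in plain Python as follows. This stands in for the filter you would write inside a transformation; `filter_new_feature_rows` and the row layout are hypothetical, and the two time arguments play the role of the context values.

```python
from datetime import datetime


def filter_new_feature_rows(rows, feature_start_time, feature_end_time):
    """Keep rows whose timestamp falls in [feature_start_time, feature_end_time)."""
    return [
        row for row in rows
        if feature_start_time <= row["timestamp"] < feature_end_time
    ]


start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
rows = [
    {"user_id": "a", "timestamp": datetime(2020, 12, 31, 23, 59)},  # too early
    {"user_id": "b", "timestamp": datetime(2021, 1, 1)},            # start is inclusive
    {"user_id": "c", "timestamp": datetime(2021, 1, 1, 12)},        # inside the range
    {"user_id": "d", "timestamp": datetime(2021, 1, 2)},            # end is exclusive
]
kept = filter_new_feature_rows(rows, start, end)
print([row["user_id"] for row in kept])
```

Because the end bound is exclusive, a row stamped exactly at the current run time belongs to the next run, so no row is published twice.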

FAQ

How is a BatchFeatureView different from a BatchWindowAggregateFeatureView?

A BatchFeatureView is the more general, less specialized sibling of a BatchWindowAggregateFeatureView. Use a BatchWindowAggregateFeatureView whenever you need time-window aggregations that it supports. See the BatchWindowAggregateFeatureView documentation for a quick explanation of how Tecton supports these types of features under the hood by leveraging pre-computed and on-demand transformations.