Data Source Filtering and select_range()
Filtering Batch Data Sources
Data filtering is an essential technique in feature engineering, as it optimizes
data access, improves performance, and ensures that only relevant data is
processed. Raw data filtering is used both during materialization (when raw data
is transformed into feature values) and during offline retrieval with
from_source=True (when historical feature values are accessed for analysis or
model training).
Why Filter at the Raw Data Source Level?
Filtering data at the source ensures that only the necessary data is processed, reducing computational overhead and improving the efficiency of your feature engineering pipeline. By applying a time filter, you can limit the data to only what's relevant for the task at hand, which is critical when working with large datasets or time-sensitive features.
How Filtering Works for Materialization
During materialization, data is transformed into feature values for a specific
job time interval. Without data source filtering, each job would read the full
raw data source and then filter the transformed data according to the feature
timestamp (timestamp_field). As a result, a job often read more data than was
necessary to produce features for its materialization interval.
Filtering at the raw data level ensures that only data within this time range is
processed. Starting with Tecton 1.0, filtering the raw data source according to
the materialization time range is the default behavior for Batch Sources (or
sources with a batch_config) when a source is added to a Feature View using the
sources config.
@batch_feature_view(
    sources=[my_transactions_source]
)
- Example: For a materialization job running from March 1 to March 5, the filter will limit the raw data read to this time range, improving performance by avoiding unnecessary data processing.
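The effect of the time filter in the example above can be illustrated with a minimal, Tecton-independent sketch. It is not Tecton's implementation: the rows, the field names, the year 2024, the helper filter_to_interval, and the choice of a half-open interval (start inclusive, end exclusive) are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical raw rows; field names and values are illustrative only.
raw_rows = [
    {"user_id": "u1", "amount": 10.0, "timestamp": datetime(2024, 2, 27)},
    {"user_id": "u2", "amount": 25.0, "timestamp": datetime(2024, 3, 2)},
    {"user_id": "u1", "amount": 5.0, "timestamp": datetime(2024, 3, 4)},
    {"user_id": "u3", "amount": 40.0, "timestamp": datetime(2024, 3, 6)},
]


def filter_to_interval(rows, start, end, timestamp_field="timestamp"):
    """Keep rows whose timestamp falls in [start, end).

    The half-open interval is an assumption of this sketch, not a
    statement about how Tecton treats interval boundaries.
    """
    return [row for row in rows if start <= row[timestamp_field] < end]


# Materialization job interval from the example: March 1 to March 5.
job_start = datetime(2024, 3, 1)
job_end = datetime(2024, 3, 5)

# Only rows inside the job interval are handed to the transformation.
filtered_rows = filter_to_interval(raw_rows, job_start, job_end)
```

In this sketch, the rows from February 27 and March 6 are dropped before any transformation runs, so the transformation only sees the two rows inside the job interval.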
To read the entire Data Source for each job, call unfiltered() on the source:
@batch_feature_view(
    sources=[my_transactions_source.unfiltered()]
)
How Filtering Works for Offline Retrieval
During offline retrieval queries with from_source=True, Tecton ensures that
only data within the relevant time window is accessed, based on either specific
timestamps from an events DataFrame in the case of
get_features_for_events(events) or a user-defined time range in the case of
get_features_in_range(start, end).
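The two ways of bounding an offline read can be sketched conceptually as follows. This section does not describe how Tecton actually derives the window, so the helper names window_from_events and window_from_range, and the rule of taking the earliest and latest event timestamps, are assumptions for illustration only.

```python
from datetime import datetime


def window_from_events(event_timestamps):
    """Return the smallest (start, end) pair covering every event timestamp.

    Hypothetical helper: it only illustrates bounding a read by the
    timestamps found in an events DataFrame.
    """
    if not event_timestamps:
        raise ValueError("at least one event timestamp is required")
    return min(event_timestamps), max(event_timestamps)


def window_from_range(start, end):
    """Validate and return a user-defined (start, end) time range."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    return start, end


# Timestamps as they might appear in an events DataFrame (illustrative).
events = [datetime(2024, 3, 3), datetime(2024, 3, 1), datetime(2024, 3, 4)]
events_window = window_from_events(events)

# An explicit range, as a caller might pass to a range-based query.
range_window = window_from_range(datetime(2024, 3, 1), datetime(2024, 3, 5))
```

In both cases the outcome is a bounded time window, and only raw data relevant to that window needs to be read.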
The select_range() Method
If you need to alter the default filtering behavior, the select_range() method
introduced in Tecton 1.0 lets you filter data according to custom time ranges.
You define the start and end times of the filter explicitly, which makes the
method adaptable to a wide variety of use cases.
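As a conceptual illustration of a custom time range, the sketch below widens a job's window by a lookback offset. It deliberately does not call select_range(), whose signature is not shown in this section. The helper custom_range, its parameters, and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def custom_range(job_start, job_end,
                 start_offset=timedelta(0), end_offset=timedelta(0)):
    """Shift the default job window by the given offsets.

    Hypothetical helper: a negative start_offset reads data from before
    the job interval, for example to cover a lookback period.
    """
    start = job_start + start_offset
    end = job_end + end_offset
    if start >= end:
        raise ValueError("custom range must have start earlier than end")
    return start, end


# Default window: March 1 to March 5 (year chosen for illustration).
job_start = datetime(2024, 3, 1)
job_end = datetime(2024, 3, 5)

# Custom window: also read the 7 days before the job interval starts.
start, end = custom_range(job_start, job_end, start_offset=timedelta(days=-7))
```

Expressing the custom range as offsets from the job window, rather than as fixed dates, keeps the same definition valid for every job.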