How get_historical_features() Works
Overview
The primary way to get training data from Tecton is via
get_historical_features() (GHF), a method that can be called from either a
feature service or a feature view. This page covers GHF when it is called from
a feature service and passed a spine dataframe, which contains a list of join
keys and timestamps to which Tecton joins point-in-time features.
Here’s an example of calling get_historical_features():
import tecton
# Get the feature service
feature_service = tecton.get_workspace("my_workspace").get_feature_service("my_feature_service")
# Construct the "spine" dataframe
# spine = ... - whatever you need to do to construct this dataframe
# Get training dataframe from get_historical_features()
training_df = feature_service.get_historical_features(spine).to_spark()
# Show the results, save them somewhere, etc.
training_df.write.parquet("s3://....")
training_df.limit(100).show()
Steps that get_historical_features() performs
Under the hood, GHF performs the following steps:
1. Tecton decomposes the feature service into its constituent feature views (fv1 .. fvn).
2. For each row in the spine dataframe, consisting of a set of join keys and a timestamp:
   - For each feature view in the feature service:
     - Tecton fetches the features for this feature view for the join key(s) and timestamp. If using materialized feature data, Tecton looks up the values; otherwise, Tecton computes them ad hoc in the notebook. See below for determining which way you are running GHF.
     - If the ttl parameter is set for this feature view, Tecton seeks backward in time from the timestamp until it finds the first non-null value for this feature view, going back at most ttl days.
3. Tecton joins all of these rows and feature view columns together and sends the result back to the client.
GHF does a lot of work, which is part of what makes Tecton so powerful. It also means that a slow or failing GHF call can stem from a wide variety of issues.
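The steps above can be sketched in plain Python to make the point-in-time lookup concrete. This is only an illustration of the logic, not Tecton's implementation: the function name point_in_time_join, the row layout (user_id, timestamp, value), the feature view name fv1, and the choice to treat the ttl boundary as inclusive are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def point_in_time_join(spine, feature_views, ttls):
    """Hypothetical sketch of a point-in-time join (not Tecton's actual code).

    spine:         list of dicts with 'user_id' and 'timestamp'
    feature_views: {fv_name: list of dicts with 'user_id', 'timestamp', 'value'}
    ttls:          {fv_name: timedelta}; a missing entry means no ttl limit
    """
    training_rows = []
    for row in spine:  # for each row in the spine
        out = dict(row)
        for name, records in feature_views.items():  # for each feature view
            ttl = ttls.get(name)
            # Keep non-null values for this join key at or before the spine
            # timestamp, and (assumption: inclusive bound) within the ttl.
            candidates = [
                r for r in records
                if r["user_id"] == row["user_id"]
                and r["value"] is not None
                and r["timestamp"] <= row["timestamp"]
                and (ttl is None or r["timestamp"] >= row["timestamp"] - ttl)
            ]
            # The most recent surviving record is the point-in-time value.
            latest = max(candidates, key=lambda r: r["timestamp"], default=None)
            out[name] = latest["value"] if latest is not None else None
        training_rows.append(out)  # joined output row
    return training_rows


spine = [
    {"user_id": "a", "timestamp": datetime(2024, 1, 10)},
    {"user_id": "a", "timestamp": datetime(2024, 1, 20)},
    {"user_id": "b", "timestamp": datetime(2024, 1, 10)},
]
fv1 = [
    {"user_id": "a", "timestamp": datetime(2024, 1, 8), "value": 3},
    {"user_id": "a", "timestamp": datetime(2024, 1, 9), "value": None},
    {"user_id": "b", "timestamp": datetime(2024, 1, 1), "value": 7},
]
training_rows = point_in_time_join(spine, {"fv1": fv1}, {"fv1": timedelta(days=5)})
```

In this toy data, the first spine row skips the null recorded on January 9 and picks up the value from January 8, while the other two rows get no value because the only candidates are older than the five-day ttl. The nested loops mirror the description for readability; a real engine would express this as a distributed join rather than a row-by-row scan.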
Materialized or non-materialized features
GHF can either read pre-materialized feature data or compute features on the fly. Here’s how each behavior manifests itself in GHF:
- If using pre-materialized feature data: For each row in the spine dataframe, and then for each feature view contained in the feature service, Tecton looks up the feature value for the join key in the Parquet or Delta file containing the pre-computed features that correspond to the timestamp for that row.
- If not: Tecton runs the transformation in the notebook cluster while you wait. Some transformations are compute- or memory-intensive, so you may need a well-provisioned notebook cluster to run them.
Am I running GHF with materialized features?
You are running GHF using materialized features if all of the following are true:
- Your feature service is running in a live workspace
- The constituent feature views have the option offline=True
- You ran GHF with the from_source option either omitted or set to False
You are running GHF using non-materialized features (aka ad-hoc mode) if any of the following are true:
- Your feature service is running in a development workspace
- Any of the constituent feature views have the option offline=False
- You ran GHF with from_source=True
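The two checklists above reduce to a single boolean rule, which the following sketch spells out. The helper ghf_mode and its parameters are hypothetical names invented for this illustration; they are not part of the Tecton SDK, and the sketch assumes that omitting from_source behaves the same as passing False, as described above.

```python
def ghf_mode(workspace_is_live, fv_offline_flags, from_source=False):
    """Hypothetical helper (not a Tecton API) mirroring the checklists above.

    workspace_is_live: True for a live workspace, False for a development one
    fv_offline_flags:  the offline=... setting of each constituent feature view
    from_source:       the value passed to GHF (assumed False when omitted)
    """
    uses_materialized = (
        workspace_is_live            # live workspace
        and all(fv_offline_flags)    # every feature view has offline=True
        and not from_source          # from_source omitted or False
    )
    return "materialized" if uses_materialized else "ad-hoc"


print(ghf_mode(True, [True, True]))                     # all conditions hold
print(ghf_mode(True, [True, False]))                    # one view has offline=False
print(ghf_mode(True, [True, True], from_source=True))   # forced to compute from source
print(ghf_mode(False, [True, True]))                    # development workspace
```

Only the first call satisfies all three conditions and reports materialized mode; each of the others violates one condition and falls back to ad-hoc mode.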