Version: 0.5

How get_historical_features() Works

Overview

The primary way to get training data from Tecton is get_historical_features() (GHF), a method that can be called on either a feature service or a feature view. This page covers GHF called on a feature service with a spine dataframe: a list of join keys and timestamps against which Tecton joins point-in-time feature values.

Here’s an example of calling get_historical_features():

import tecton

# Get the feature service
feature_service = tecton.get_workspace("my_workspace").get_feature_service(
    "my_feature_service"
)

# Construct the "spine" dataframe
# spine = ... - whatever you need to do to construct this dataframe

# Get training dataframe from get_historical_features()
training_df = feature_service.get_historical_features(spine).to_spark()

# Show the results, save them somewhere, etc.
training_df.write.parquet("s3://....")
training_df.limit(100).show()
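The spine itself is left open in the example above. As a minimal sketch, assuming a pandas DataFrame is acceptable as the spine and using hypothetical column names ("user_id" as the join key, "timestamp" as the event time), it could look like this:

```python
from datetime import datetime

import pandas as pd

# Hypothetical column names: "user_id" stands in for the join key and
# "timestamp" for the event time. Use the names your own feature views declare.
spine = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u1"],
        "timestamp": [
            datetime(2022, 5, 1, 12, 0),
            datetime(2022, 5, 2, 8, 30),
            datetime(2022, 5, 3, 17, 45),
        ],
    }
)
```

Each row asks "what were the feature values for this join key as of this timestamp?", so the same key can appear several times with different timestamps.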

Steps that get_historical_features() performs

Under the hood, GHF performs the following steps:

  1. Tecton decomposes the feature service into its constituent feature views (fv1 .. fvn).

  2. For each row in the spine DataFrame (a set of join keys and a timestamp), and for each feature view in the feature service:

    1. Tecton fetches the features of this feature view for the row's join key(s) and timestamp. If materialized feature data is available, Tecton looks up the values; otherwise, Tecton computes them ad hoc in the notebook. See "Am I running GHF with materialized features?" below to determine which mode applies.

    2. If the ttl parameter is set for this feature view, Tecton searches backward in time from the row's timestamp for the most recent non-null value, going no further back than the ttl duration.

  3. Tecton joins all of these rows and feature view columns together and returns the result to the client.
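The steps above can be sketched in plain Python. This is an illustrative model only, not Tecton's implementation (which runs as a distributed join); the function names, the "user_id" join key, the "txn_count" feature view, and the inclusive boundary handling are all assumptions made for the example.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(history, as_of, ttl=None):
    """Return the most recent non-null value at or before `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Values older than `as_of - ttl` are ignored (boundary handling is assumed).
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of) - 1  # last entry with ts <= as_of
    while idx >= 0:
        ts, value = history[idx]
        if ttl is not None and as_of - ts > ttl:
            return None  # everything from here back is outside the ttl window
        if value is not None:
            return value
        idx -= 1  # null value: keep looking further back
    return None


def ghf_sketch(spine, feature_views):
    """For each spine row and each feature view, attach the point-in-time value."""
    rows = []
    for spine_row in spine:
        out = dict(spine_row)
        for name, fv in feature_views.items():
            history = fv["data"].get(spine_row["user_id"], [])
            out[name] = point_in_time_lookup(history, spine_row["timestamp"], fv["ttl"])
        rows.append(out)
    return rows


# Hypothetical feature view with a 2-day ttl and one join key's history.
feature_views = {
    "txn_count": {
        "ttl": timedelta(days=2),
        "data": {
            "u1": [
                (datetime(2022, 5, 1), 3),
                (datetime(2022, 5, 3), None),
                (datetime(2022, 5, 4), 7),
            ]
        },
    }
}
spine = [
    {"user_id": "u1", "timestamp": datetime(2022, 5, 2)},
    {"user_id": "u1", "timestamp": datetime(2022, 5, 3, 12)},
    {"user_id": "u1", "timestamp": datetime(2022, 5, 4, 6)},
    {"user_id": "u2", "timestamp": datetime(2022, 5, 2)},
]
training_rows = ghf_sketch(spine, feature_views)
```

In this toy data, the second spine row gets no value: the nearest entry is null, and the next one back is more than two days old, so it falls outside the ttl window.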

GHF does a lot of work, which is part of what makes Tecton so powerful. It also means that a slow or failing GHF call can stem from a wide variety of issues.

Materialized or non-materialized features

GHF can either read pre-materialized feature data or compute feature values on the fly. Here’s how each case plays out:

  1. If using pre-materialized feature data: For each row in the spine dataframe, and for each feature view in the feature service, Tecton looks up the feature value for the row's join key and timestamp in the parquet or delta file that holds the pre-computed features.
  2. If not: Tecton runs the transformation in situ on the notebook cluster while you wait. Some transformations are compute- or memory-intensive, so you may need a well-provisioned notebook cluster to run them.

Am I running GHF with materialized features?

You are running GHF using materialized features if all of the following are true:

  • Your feature service is running in a live workspace

  • The constituent feature views have the option offline=True

  • You ran GHF with either the from_source option omitted or set to False.

You are running GHF using non-materialized features (aka ad-hoc mode) if any of the following are true:

  • Your feature service is running in a development workspace

  • Any of the constituent feature views have the option offline=False

  • You ran GHF with from_source=True
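The two checklists above reduce to a single boolean condition. The helper below is a hypothetical illustration of that logic, not part of the Tecton SDK; its name and parameters are assumptions made for the example.

```python
def uses_materialized_features(workspace_is_live, feature_view_offline_flags, from_source=False):
    """Return True if GHF would read materialized data, per the checklist above.

    workspace_is_live: True for a live workspace, False for a development one.
    feature_view_offline_flags: the offline=... setting of each constituent feature view.
    from_source: the value passed to GHF (False models the option being omitted).
    """
    return workspace_is_live and all(feature_view_offline_flags) and not from_source


# Live workspace, every feature view has offline=True, from_source omitted.
materialized = uses_materialized_features(True, [True, True])

# A single feature view with offline=False is enough to fall back to ad-hoc mode.
ad_hoc = not uses_materialized_features(True, [True, False])
```

Any one failing condition is enough to switch GHF to ad-hoc mode, which is why the second checklist says "any" where the first says "all".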