Version: 0.7

FAQ: get_historical_features vs. run

Feature Views expose get_historical_features and run methods.

Method: get_historical_features

get_historical_features should be used to compute offline feature data or to retrieve pre-computed offline feature data. This method always produces accurate feature values for a requested time range or spine. Depending on whether offline materialization is enabled, get_historical_features either retrieves pre-computed features from the offline store or computes them from raw event data. This default behavior can be overridden explicitly using from_source=True.

get_historical_features can be used for the following workflows:

  1. Generating historical training data using get_historical_features(spine=training_events), where training_events is a dataframe including historical timestamps for specific entities. This produces feature values as of a particular time for each requested entity, which can be used for model training.
  2. Generating batch inference data using get_historical_features(spine=inference_join_keys), where inference_join_keys is a dataframe including entities and the current timestamp. This produces the most recent feature data for the requested entities.
  3. Inspecting offline data for a time range using get_historical_features(start_time=t1, end_time=t2).
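The "feature values as of a particular time" behavior in the first two workflows can be pictured as a point-in-time lookup. The following is a minimal sketch in plain Python, not the SDK's implementation: the function name as_of_lookup, the tuple-based row layout, and the choice to include feature rows whose timestamp equals the spine timestamp are all assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def as_of_lookup(feature_rows, spine):
    """Hypothetical helper: for each (entity, ts) spine row, return the
    latest feature value recorded at or before ts (None if there is none)."""
    # Group feature rows by entity, with timestamps in ascending order.
    by_entity = {}
    for entity, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity, ([], []))
        times.append(ts)
        values.append(value)

    result = []
    for entity, ts in spine:
        times, values = by_entity.get(entity, ([], []))
        # Count of feature rows with timestamp <= ts (inclusive by assumption).
        idx = bisect_right(times, ts)
        result.append((entity, ts, values[idx - 1] if idx else None))
    return result


# Illustrative data only: (entity, feature timestamp, feature value).
feature_rows = [
    ("user_1", datetime(2023, 1, 1), 10),
    ("user_1", datetime(2023, 1, 3), 30),
    ("user_2", datetime(2023, 1, 2), 5),
]
# Spine rows: (entity, timestamp at which feature values are requested).
spine = [
    ("user_1", datetime(2023, 1, 2)),
    ("user_1", datetime(2023, 1, 4)),
    ("user_2", datetime(2023, 1, 1)),
]
training_rows = as_of_lookup(feature_rows, spine)
```

With historical timestamps in the spine, each row sees only feature values that existed at that time, which is what makes the result suitable for training. With the current timestamp in every spine row, the same lookup returns the most recent value per entity, which matches the batch inference workflow.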

Method: run

run should only be used when interactively testing or debugging a Feature View. run executes the Feature View transformation directly. It is based on raw event data, but also provides the option to specify mocked data sources.

caution

Do not use run to generate training data since it is not guaranteed to produce accurate feature values.

tip

test_run is nearly identical to run, but is intended for use in unit testing: it explicitly requires mocked data sources and a local Spark session, and it does not make any network requests. Most of this document focuses on run, but the concepts extend to test_run.

🔑 Key Concept: get_historical_features one-to-many relationship with run

Here’s another way of considering the differences between the two methods: in order to materialize offline data for a Feature View, the Feature View pipeline is run on a scheduled interval (based on batch_schedule or aggregation_interval) in a materialization job. **run mimics the query that would be run for a single materialization job for some time range**. This is why run requires a start_time and end_time, which should be aligned to one scheduled interval (the SDK will emit warnings if a specified time range does not align with one scheduled interval).
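To make "aligned to one scheduled interval" concrete, here is a minimal sketch of such a check in plain Python. It is not the SDK's validation logic. The function name is_single_aligned_interval is hypothetical, and the sketch assumes that interval boundaries are counted from the Unix epoch in UTC and that the range must span exactly one interval.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: interval boundaries are measured from the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def is_single_aligned_interval(start_time, end_time, interval):
    """Hypothetical check: True if [start_time, end_time) starts on an
    interval boundary and covers exactly one interval."""
    starts_on_boundary = (start_time - EPOCH) % interval == timedelta(0)
    covers_one_interval = (end_time - start_time) == interval
    return starts_on_boundary and covers_one_interval


daily = timedelta(days=1)  # stands in for a batch_schedule of one day
aligned = is_single_aligned_interval(
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 1, 2, tzinfo=timezone.utc),
    daily,
)
misaligned = is_single_aligned_interval(
    datetime(2023, 1, 1, 6, tzinfo=timezone.utc),
    datetime(2023, 1, 2, 6, tzinfo=timezone.utc),
    daily,
)
```

Under these assumptions, the first range (midnight to midnight) passes, while the second range has the right length but starts six hours off a daily boundary, so it would be the kind of input that triggers a warning.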

Finally, training data produced by get_historical_features is based on the results of one or more materialization job runs.
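The one-to-many relationship can be sketched by splitting a requested time range into the scheduled windows it covers. Each window corresponds to one materialization job, and therefore to the time range one run call would mimic. This is an illustration only: the function name materialization_windows is hypothetical, and the sketch assumes start_time already falls on an interval boundary.

```python
from datetime import datetime, timedelta


def materialization_windows(start_time, end_time, interval):
    """Hypothetical helper: list the (window_start, window_end) pairs of
    scheduled intervals needed to cover [start_time, end_time)."""
    windows = []
    window_start = start_time
    while window_start < end_time:
        windows.append((window_start, window_start + interval))
        window_start += interval
    return windows


# A three-day range with a one-day schedule maps to three single-interval windows.
windows = materialization_windows(
    datetime(2023, 1, 1), datetime(2023, 1, 4), timedelta(days=1)
)
for window_start, window_end in windows:
    print(window_start.date(), "to", window_end.date())
```

A single get_historical_features request over January 1 through January 4 would, in this picture, draw on three job-sized windows, whereas a run call would cover only one of them.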
