Version: 1.0

Offline Retrieval Runs Slowly or Fails

Scope

This troubleshooting article covers how to diagnose slow or failing get_features_for_events() or get_features_in_range(). Some of the resolutions apply only to Spark-based retrieval, while others apply only to Snowflake-based retrieval.

Symptoms

This issue can manifest through the following symptoms:

  • Materialization jobs are cancelled after an hour if you use the default cluster configuration (relying on spot instances).

  • PySpark times out in an EMR notebook (the Livy connection fails)

  • Out-of-memory issues

  • Timeout failure in Snowflake or Athena

Prerequisites for troubleshooting

Review Methods for retrieving Offline Data, which also helps you determine whether you are retrieving features using pre-materialized feature data (offline=True).

Resolution

If you run into slow Offline Retrieval, here are some possible causes, ways to test for them, and resolutions. They are sorted from most to least common, so we suggest investigating these possible causes in order.

Isolating a slow feature view (Not pre-materialized feature data only)

If you are running get_features_for_events() for a feature service, it may be that a single feature view is executing slowly, dragging down the whole call. By isolating the slow feature view, you can focus your troubleshooting.

  • Testing: Instead of running <feature_service>.get_features_for_events(), run <feature_view>.get_features_for_events() for each feature view contained in the feature service.
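A small notebook helper can make the per-feature-view comparison systematic. This is a minimal sketch: fv_a, fv_b, and events are placeholders for your own feature views and spine, and .to_spark().count() is used only to force evaluation of the lazy Spark plan.

```python
import time

def time_retrieval(label, fn):
    """Run a zero-argument retrieval callable and report how long it took."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return result, elapsed

# Hypothetical notebook usage (fv_a, fv_b, and events are placeholders):
# for fv in [fv_a, fv_b]:
#     time_retrieval(
#         fv.name,
#         lambda fv=fv: fv.get_features_for_events(events).to_spark().count(),
#     )
```

The feature view whose timing dominates is the one to investigate further.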

Feature view transformation logic (Not pre-materialized)

Your feature view transformation logic may be written in a way that performs expensive joins or scans across a large dataset. This can cause get_features_for_events() to run very slowly or run out of memory.

  • Testing:

    • Inspect your transformation logic for joins or expensive reads from large tables.
  • Resolution:

    • We recommend simplifying your feature view logic as much as possible so that it is clear where expensive joins may occur; for complex pipeline transformations, it can be difficult to assess what is happening. You can also call .explain() on the DataFrame returned by the Offline Retrieval method to inspect the physical plan Spark will execute and look for inefficiencies.

    • (v0.3 SDK with BatchFeatureViews): If you are using tecton_sliding_window() and joining one or more other batch tables, run tecton_sliding_window() outside the join, as it will explode the number of rows.
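As a minimal notebook sketch of the .explain() suggestion above (my_feature_service and events are placeholders for your own objects, so this fragment is not runnable standalone):

```python
# Notebook sketch (my_feature_service and events are placeholders):
df = my_feature_service.get_features_for_events(events)

# Print the physical plan Spark will execute. Look for full scans of large
# tables, shuffle-heavy joins, and steps that explode the row count.
df.to_spark().explain()
```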

Very large or slow data source

If you are running Offline Retrieval on non-materialized feature views (from_source=True), you may be running a well-written feature view against a very large and/or slow data source that takes time to process. This is exacerbated by unoptimized feature view logic. Note that Snowflake and Redshift tend to be faster than Hive and, especially, File data sources.

  • Testing:

    • Tecton on Spark: Try substituting a smaller sample FileDSConfig, consisting of a single Parquet file, for your data source.
    • Tecton on Snowflake: Try selecting a smaller sample of data from your Snowflake table. You can do this by setting the query param for SnowflakeConfig to SELECT * FROM some_table LIMIT 10.
  • Resolution: If you are not able to speed up the data source, we recommend using the smaller data source mentioned above when developing features in a notebook, as it can significantly speed up Tecton commands while iterating. You can scale up to the larger, production data source when your features are ready.
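For Tecton on Snowflake, the down-sampled development source might be sketched as follows. This is an illustrative config fragment, not a definitive implementation: all database, schema, warehouse, and column names are placeholders.

```python
from tecton import SnowflakeConfig

# Sketch of a down-sampled development source (all names are placeholders).
sample_source = SnowflakeConfig(
    database="MY_DB",
    schema="MY_SCHEMA",
    warehouse="MY_WAREHOUSE",
    query="SELECT * FROM some_table LIMIT 10",
    timestamp_field="TIMESTAMP",
)
```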

Tecton on Spark Only: Using a “File” data source (Not pre-materialized)

We include the FileConfig data source only for development and testing, as it lacks many of the basic speed improvements that HiveConfig includes. For example, it does not understand directory partitioning, and Spark scans every file in the file source to infer the source's schema. While Tecton works with a FileConfig, it will run slowly if you use it on a large collection of files.

  • Testing: If your uri parameter points at a large directory of files, try changing it to a single Parquet file.

  • Resolution: Add a Glue catalog entry (via a Glue crawler) for this file source, and convert your FileConfig to a HiveConfig. Ensure that you specify any file partitions in your HiveConfig.

Tecton on Spark Only: Hive partitions not specified (Not pre-materialized)

If you are using a HiveConfig data source, Tecton does not assume a partition scheme by default; however, most data lake tables are partitioned by date/time.

  • Testing: Check whether you have passed in the date/time partition structure via the DatetimePartitionColumn option in your feature repository.

  • Resolution: Add the partition columns via the DatetimePartitionColumn option.
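A hedged sketch of what this can look like, following Tecton's DatetimePartitionColumn pattern (the database, table, and column names are placeholders; adjust the dateparts to match your table's directory layout):

```python
from tecton import HiveConfig, DatetimePartitionColumn

# Placeholders: change column names and dateparts to match your partitioning.
partition_columns = [
    DatetimePartitionColumn(column_name="year", datepart="year", zero_padded=True),
    DatetimePartitionColumn(column_name="month", datepart="month", zero_padded=True),
    DatetimePartitionColumn(column_name="day", datepart="day", zero_padded=True),
]

batch_config = HiveConfig(
    database="my_db",
    table="my_table",
    timestamp_field="timestamp",
    datetime_partition_columns=partition_columns,
)
```

With the partition columns declared, Tecton can prune partitions instead of scanning the entire table.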

Tecton on Spark Only: Conversion to pandas DataFrame

Pandas DataFrames are often a more familiar interface for manipulating the data returned by a get_features_for_events() call. Under the hood, get_features_for_events() returns a Spark DataFrame, and converting a Spark DataFrame to pandas can be a very costly operation if you are passing a spine of more than a few million rows.

  • Testing: Instead of converting to pandas immediately, run get_features_for_events(events).to_spark().show(), which avoids the pandas conversion.

  • Resolution: Either use Spark DataFrames in your code, or consider the koalas library, which provides a pandas-like interface to Spark DataFrames. Note that this pandas-like API is supported natively on Spark 3.2 and above.
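A notebook sketch of both options (my_feature_service and events are placeholders, so this fragment is not runnable standalone); on Spark 3.2+, pandas_api() exposes the koalas functionality natively:

```python
# Placeholders: my_feature_service and events are your own objects.
spark_df = my_feature_service.get_features_for_events(events).to_spark()
spark_df.show(5)  # stays in Spark; no pandas conversion

# If you want a pandas-like API without collecting to the driver,
# Spark 3.2+ ships the koalas functionality natively:
pandas_like_df = spark_df.pandas_api()  # still distributed
```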

Tecton on Spark Only: Slow spine generation

Due to Spark’s lazy evaluation model, running get_features_for_events(events) executes a series of steps all at once: generating the spine DataFrame, performing the retrieval itself, and parsing the output. This is an area to focus on if you read your spine from an external source like Hive, Redshift, or Snowflake.

  • Testing: To verify that spine generation is not the slow step, try generating the spine DataFrame, saving it as a Parquet file on S3, and then loading the Parquet file back. Then continue with get_features_for_events(). You can also call .cache() on your spine.

  • Resolution: Consider generating your spine once, saving it to a faster storage layer such as S3, and adding .cache().
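As a sketch of that resolution (the table name, S3 path, and my_feature_service are placeholders, and a live Spark session is assumed):

```python
# Placeholders: the table name, S3 path, and my_feature_service are illustrative.
spine = spark.read.table("events_db.user_events")  # slow external read

# Pay the generation cost once by persisting the spine to fast storage...
spine.write.mode("overwrite").parquet("s3://my-bucket/tmp/spine")

# ...then read it back, cache it, and reuse it across retrieval calls.
spine = spark.read.parquet("s3://my-bucket/tmp/spine").cache()
spine.count()  # force the cache to populate

result = my_feature_service.get_features_for_events(spine)
```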

Tecton on Spark Only: Under-resourced notebook cluster

Tecton creates your notebook clusters initially; however, you are free to change the configuration or create additional notebook clusters as needed. Even after optimizing your feature logic, you may still need to scan and process a large amount of data. In this case, increasing the notebook cluster size (especially memory) can improve Offline Retrieval performance. Spark can be much faster than older Hadoop installations because it does much of its computation in memory, and memory is typically about 100x faster than disk, which matters when Spark would otherwise have to page to disk frequently.

  • Testing/Resolution: Try scaling from m5.2xlarge to m5.8xlarge instances.

Long ttls

For feature views, you can add a ttl option. You should only include this option for row-based transformations where you are not aggregating any data, except in certain advanced scenarios. You would commonly use it if you want to return, for example, a user_id’s creation date, which is tied to a data source row that was last updated years ago. The ttl parameter tells Tecton to keep searching back in time from the spine’s timestamp until it finds the first non-null value.

A long ttl generally doesn’t add significant extra time to an Offline Retrieval call, but it can if your feature view logic scans lots of data from a slow (data lake) source.

  • Testing: Try decreasing the ttl by removing it or setting it to the batch_schedule.

  • Resolution: If you cannot speed up the feature view by changing its logic, consider reducing the ttl where possible.
