Remote Dataset Generation
This feature is currently in Private Preview.
- Dataset Jobs are not available in the Web UI and should be accessed via SDK.
- Graviton instances are supported in Spark on EMR 7.0+ and DBR 12.2+. Contact Support to enable.
Background
By default, when using Tecton's Offline Retrieval Methods to generate data offline for model training or offline inference, Tecton will construct and execute a point-in-time correct offline feature query in the local environment. Tecton leverages either the local Spark context or a Tecton proprietary local query engine called Rift for Python-only offline feature retrieval.
See this page for more information on choosing the local compute engine for offline feature retrieval.
However, local offline retrieval has limitations in terms of scalability, resource requirements, and operational complexity. To address these challenges, Tecton offers Remote Dataset Generation (RDG) as an alternative approach.
What is Remote Dataset Generation (RDG)
With remote dataset generation, offline retrieval no longer happens locally but on a remote cluster that is managed by Tecton. This meaningfully improves offline retrieval in the following areas:
- Easy to use: RDG only requires the tecton Python package. It has no dependency on a local Spark context or local credentials to connect to data sources or the data plane.
- Performant: Similar to materialization jobs, a remote dataset generation job cluster can be configured to match your data generation workload (see how to do it for Rift or Spark jobs). In addition, Tecton optimizes offline retrieval under the hood to ensure that cluster resources are used efficiently, for example by splitting up the retrieval query or the input training events DataFrame.
- Secure: RDG conforms to Tecton's Access Controls, ensuring that only users with an appropriate role can generate and retrieve offline features.
This capability is available for both get_features_for_events (for Feature Views & Feature Services) and get_features_in_range (for Feature Views). Both Rift and Spark compute engines are supported.
When to use RDG?
Using Tecton with Rift
- For local feature development where you want to test features by iteratively generating small amounts of feature data, we still recommend using local offline retrieval instead of RDG.
- If local retrieval might run out of resources, switch to RDG, which lets you use any instance size for your training data run.
- Use RDG for larger training dataset generation jobs, either ad hoc or as part of a pipeline.
Using Tecton with Spark
- It makes the most sense to use RDG if you want to kick off training data jobs and retrieve the generated data without needing a local Spark context (for example when running it from systems like Airflow).
How to use RDG?
Requirements
- Requires Operator, Editor or Admin role
Constraints
- Spins up single-cluster jobs; Rift RDG will spin up single-instance jobs only
- Does not work on unapplied Feature Views or Feature Services
Usage
Input argument
- For get_features_for_events, first create a DataFrame with the events for which you want to generate training data.
- For get_features_in_range, you can optionally pass in a dataset with the entities for which you want to generate training data.
- In both cases, the dataset can be either
  - (a) a Pandas-based DataFrame:
    tecton_df = my_feature_service.get_features_for_events(events)
  - or (b) an S3 storage location:
    tecton_df = feature_view.get_features_for_events(
        "s3://path/to/dataset.parquet",
        timestamp_key="timestamp",  # in this case timestamp_key is required
    )
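As a sketch of option (a), the events input is a plain Pandas DataFrame with one row per training example, containing the Feature View's join keys plus a timestamp column. The column names below (user_id, timestamp) are illustrative assumptions; they must match your own Feature View's join keys and timestamp key.

```python
import pandas as pd

# Hypothetical events DataFrame: one row per training example.
# Column names (user_id, timestamp) are illustrative; use your
# Feature View's actual join keys and timestamp key.
events = pd.DataFrame(
    {
        "user_id": ["u_1", "u_2", "u_3"],
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00:00", "2024-01-02 12:00:00", "2024-01-03 08:30:00"]
        ),
    }
)

# With a real workspace applied, this DataFrame can then be passed to
# my_feature_service.get_features_for_events(events).
```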
Kicking off the training dataset generation job
Instead of computing features locally by calling .to_pandas(), start a remote Dataset Job. You can configure the job directly through the cluster_config parameter. If not configured in the function call, the job will use the cluster configuration set in your repo.yaml.
Spark example:
from tecton import DatabricksClusterConfig

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=DatabricksClusterConfig(instance_type="m5.2xlarge", spark_config={"spark.executor.memory": "12g"}),
    tecton_materialization_runtime="1.1.0",
    compute_mode="spark",
)
Rift example:
from tecton import RiftBatchConfig

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=RiftBatchConfig(
        instance_type="m5.2xlarge",
    ),
    environment="tecton-core-1.1.0",
    compute_mode="rift",
)
While the job is running
You can monitor the job status as follows:
# Check the status of the job
job.get_status_for_display()
Alternatively, you can use the relevant Feature Service or Feature View to look up information about your training dataset generation job:
my_feature_service.list_jobs()
# [DatasetJob(...), DatasetJob(...)]
my_feature_service.get_job("3aa5161ff4fe4ce2ba0c752f0801d263")
# DatasetJob(...)
Finally, you can monitor the job in the Web UI. On the "Jobs" page, we recommend filtering for the "Data Generation" Task Type to see the relevant jobs more easily.
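The status check above can be wrapped in a simple polling loop. The helper below is a sketch using only the standard library: it takes any callable that returns a status string (for example, a lambda around job.get_status_for_display()) and polls until a terminal status is reached. The terminal status names are assumptions for illustration, not guaranteed SDK values.

```python
import time


def poll_until_terminal(get_status, interval_s=30.0, timeout_s=3600.0,
                        terminal=frozenset({"SUCCESS", "ERROR", "CANCELLED"})):
    """Poll a status callable until it returns a terminal status.

    `get_status` can be e.g. `lambda: job.get_status_for_display()`.
    The terminal status names here are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in status {status!r} after {timeout_s}s")
        time.sleep(interval_s)
```

In practice, job.wait_for_completion() covers the common case; a loop like this is only useful when you want custom logging, metrics, or timeout behavior while waiting.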
Retrieving the dataset
Offline feature data outputs will be written to S3 and will be accessible via a Tecton Dataset -- these are automatically cataloged within your Tecton workspace and Web UI and can be shared by multiple users in your organization.
You can retrieve the dataset as a DataFrame:
# You can wait until the job is done
job.wait_for_completion()
# Retrieve the Tecton Dataset
dataset = job.get_dataset()
# Retrieve the underlying DataFrame
df = dataset.to_dataframe().to_pandas()
You can also retrieve it as a DataFrame referencing the training dataset name directly:
dataset = workspace.get_dataset("my_training_data:V1")
dataset.to_dataframe().to_pandas()
Finally, you can retrieve the dataset directly from S3:
# Retrieve the location on S3
table_path = dataset.storage_location
# Use Spark read API
spark.read.format("delta").load(table_path)
Retries
RDG will retry a job three times before it is considered failed. Each retry spins up a new compute instance.