Remote Dataset Generation
This feature is currently in Private Preview.
- Dataset Jobs are not available in the Web UI and must be accessed via the SDK
- Must be enabled by Tecton Support
Background
By default, Tecton's Offline Retrieval Methods construct and execute a point-in-time correct offline feature query in the local environment. Tecton leverages either the local Spark context or, in the case of Python-only offline feature retrieval, a local query engine that is included in the `tecton[rift]` pip package.
See this page for more information on choosing the local compute engine for offline feature retrieval.
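For illustration, the default local flow is a single call on a feature service object; this sketch assumes a feature service handle `my_feature_service` and an `events` DataFrame of training events:
# Computed locally, using the local Spark context or the bundled Rift engine
df = my_feature_service.get_features_for_events(events).to_pandas()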
Remote Dataset Generation
Remote Dataset Generation aims to improve upon Tecton's offline feature retrieval experience with the following benefits:
- Scalable: Tecton will manage cluster(s) to execute offline retrieval. If needed, these can be configured in the same way that materialization clusters can be configured today (with Rift and Spark). Behind the scenes, Tecton will also automatically optimize offline retrieval to ensure that cluster resources are used optimally (e.g. by splitting up the retrieval query or the input training events DataFrame).
- Easy to Use: Offline feature data outputs will be written to S3 and will be accessible via a Tecton Dataset -- these are automatically cataloged within your Tecton workspace and Web UI and can be shared by multiple users in your organization.
- Standalone: Using this capability will only require the `tecton` Python package. It has no dependency on a local Spark context or local credentials to connect to Data Sources or the data plane.
- Secure: This will conform to Tecton's Access Controls -- only users with an appropriate role will be able to retrieve offline features using this capability.
This capability is available for both `get_features_for_events` (for Feature Views, Feature Tables, & Feature Services) and `get_features_in_range` (for Feature Views & Feature Tables). Both Rift and Spark compute engines are supported.
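For example, a range-based retrieval on a Feature View looks like the following; the feature view handle and time range here are placeholders:
from datetime import datetime

# Retrieve all feature values generated between start_time and end_time
data_frame = my_feature_view.get_features_in_range(
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 2, 1),
)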
Usage
Create a Tecton DataFrame as usual by calling `get_features_for_events` or `get_features_in_range`:
data_frame = my_feature_service.get_features_for_events(events)
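The `events` DataFrame is expected to contain the join keys of the requested features plus an event timestamp; the column names below are hypothetical:
import pandas as pd

# Hypothetical schema: one join key ("user_id") and a timestamp column
events = pd.DataFrame(
    {
        "user_id": ["user_1", "user_2"],
        "timestamp": pd.to_datetime(["2024-01-01 10:00:00", "2024-01-02 12:00:00"]),
    }
)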
Only Pandas DataFrames are supported as the `events` argument of `get_features_for_events` and the `entities` argument of `get_features_in_range`.
For a Spark DataFrame, first convert it to a Pandas DataFrame:
events = spark_df.toPandas()
data_frame = my_feature_service.get_features_for_events(events)
Instead of computing features locally, start a remote Dataset Job:
(Spark example)
from tecton import DatabricksClusterConfig

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=DatabricksClusterConfig(
        instance_type="m5.2xlarge",
        spark_config={"spark.executor.memory": "12g"},
    ),
    tecton_materialization_runtime="0.10.0",
    compute_mode="spark",
)
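On an EMR-based deployment, an analogous cluster config should apply, mirroring how materialization clusters are configured; the `EMRClusterConfig` parameters here are illustrative, not prescriptive:
from tecton import EMRClusterConfig

# Hypothetical EMR variant of the same job
job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",
    cluster_config=EMRClusterConfig(
        instance_type="m5.2xlarge",
        number_of_workers=4,  # assumption: size to your retrieval workload
    ),
    tecton_materialization_runtime="0.10.0",
    compute_mode="spark",
)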
(Rift example)
from tecton import RiftBatchConfig

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=RiftBatchConfig(
        instance_type="m5.2xlarge",
    ),
    environment="rift-latest",
    compute_mode="rift",
)
The job can later be retrieved by calling job-specific methods on a Feature View or Feature Service object:
my_feature_service.list_jobs()
# [DatasetJob(...), DatasetJob(...)]
my_feature_service.get_job("3aa5161ff4fe4ce2ba0c752f0801d263")
# DatasetJob(...)
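For instance, to check on all jobs at once; a minimal sketch using the status helper shown further below:
# Print the display status of every Dataset Job for this Feature Service
for job in my_feature_service.list_jobs():
    print(job.get_status_for_display())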
The DatasetJob object can be used to wait for job completion and to retrieve the resulting Dataset:
# Block until Tecton has completed offline feature retrieval
job.wait_for_completion()
# Or just check the status yourself
job.get_status_for_display()
# Retrieve the Tecton Dataset
dataset = job.get_dataset()
# Retrieve the underlying DataFrame
df = dataset.to_dataframe().to_pandas()
Once a Dataset Generation job has completed, the Dataset can also be retrieved directly by name:
dataset = workspace.get_dataset("my_training_data:V1")
dataset.to_dataframe().to_pandas()
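Here `workspace` is a workspace handle; in a fresh session it can be obtained first (the workspace name is a placeholder):
import tecton

# Fetch the workspace that owns the Dataset, then load it by name
workspace = tecton.get_workspace("my_workspace")
dataset = workspace.get_dataset("my_training_data:V1")
df = dataset.to_dataframe().to_pandas()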