Version: 1.1

Remote Dataset Generation

Private Preview

This feature is currently in Private Preview.

This feature has the following limitations:

Dataset Jobs are not available in the Web UI and must be accessed via the SDK.

If you would like to participate in the preview, please file a support ticket.

Background

By default, Tecton's Offline Retrieval Methods construct and execute a point-in-time correct offline feature query in the local environment. To do so, Tecton uses either the local Spark context or, for Python-only offline feature retrieval, a local query engine included in the tecton[rift] pip package.
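
To make "point-in-time correct" concrete, here is a minimal sketch in plain Python. It is not Tecton's implementation; the function name, the data layout, and the sample values are assumptions made for illustration. The idea is that each event only sees the most recent feature value recorded at or before the event's timestamp.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(feature_rows, event_time):
    """Return the latest feature value whose timestamp is <= event_time.

    feature_rows: list of (timestamp, value) tuples sorted by timestamp.
    Returns None if no feature value existed yet at event_time.
    """
    timestamps = [ts for ts, _ in feature_rows]
    # Index of the first timestamp strictly greater than event_time.
    idx = bisect_right(timestamps, event_time)
    return feature_rows[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for a single entity.
history = [
    (datetime(2024, 1, 1), 3),
    (datetime(2024, 1, 5), 7),
    (datetime(2024, 1, 9), 12),
]

# An event on Jan 6 may only use the value known as of Jan 5,
# never the later value from Jan 9.
value_at_event = point_in_time_lookup(history, datetime(2024, 1, 6))
```

Using a later value here would leak future information into the training data, which is what point-in-time correctness is meant to prevent.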

See this page for more information on choosing the local compute engine for offline feature retrieval.

Remote Dataset Generation

Remote Dataset Generation aims to improve upon Tecton's offline feature retrieval experience with the following benefits:

Scalable: Tecton will manage cluster(s) to execute offline retrieval. If needed, these can be configured in the same way that materialization clusters can be configured today (with Rift and Spark). Behind the scenes, Tecton will also automatically optimize offline retrieval to ensure that cluster resources are used optimally (e.g. by splitting up the retrieval query or the input training events DataFrame).
Easy to Use: Offline feature data outputs will be written to S3 and will be accessible via a Tecton Dataset -- these are automatically cataloged within your Tecton workspace and Web UI and can be shared by multiple users in your organization.
Standalone: Using this capability will only require the tecton Python package. It has no dependency on a local Spark context or local credentials to connect to Data Sources or the data plane.
Secure: This will conform to Tecton's Access Controls -- only users with an appropriate role will be able to retrieve offline features using this capability.
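
The "Scalable" point mentions splitting up the input training events. This page does not describe how Tecton does that, so the following is only a generic sketch of the idea in plain Python: a large list of events is cut into fixed-size batches that could be processed independently. The helper name, the batch size, and the event fields are hypothetical.

```python
def split_into_batches(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical training events (illustrative fields only).
events = [{"user_id": f"u{i}", "timestamp": f"2024-01-{i + 1:02d}"} for i in range(7)]

# Seven events with a batch size of three give batches of 3, 3, and 1.
batches = list(split_into_batches(events, batch_size=3))
```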

This capability is available for both get_features_for_events (for Feature Views, Feature Tables, & Feature Services) and get_features_in_range (for Feature Views & Feature Tables). Both Rift and Spark compute engines are supported.

This capability requires the Operator role or equivalent permissions.

Usage

Create a Tecton DataFrame as usual by calling get_features_for_events or get_features_in_range:

data_frame = my_feature_service.get_features_for_events(events)

Note: Only Pandas DataFrames are supported as the events argument of get_features_for_events and the entities argument of get_features_in_range.

For a Spark DataFrame, first convert it to a Pandas DataFrame:

events = spark_df.toPandas()
my_feature_service.get_features_for_events(events)

Instead of computing features locally, start a remote Dataset Job:

(Spark example)

from tecton import DatabricksClusterConfig

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=DatabricksClusterConfig(
        instance_type="m5.2xlarge",
        spark_config={"spark.executor.memory": "12g"},
    ),
    tecton_materialization_runtime="0.10.0",
    compute_mode="spark",
)

(Rift example)

from tecton import RiftBatchConfig

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=RiftBatchConfig(
        instance_type="m5.2xlarge",
    ),
    environment="rift-latest",
    compute_mode="rift",
)
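
Because dataset names must be unique, it can help to derive the next free name rather than hard-coding it. The helper below is a hypothetical sketch, not part of the Tecton SDK. It assumes the `name:V<n>` convention seen in the examples above and assumes you already have a list of names in use; how to obtain that list is not covered on this page.

```python
import re


def next_dataset_name(base, existing_names):
    """Return "<base>:V<n>", where n is one more than the highest version in use."""
    pattern = re.compile(rf"^{re.escape(base)}:V(\d+)$")
    versions = [
        int(match.group(1))
        for name in existing_names
        if (match := pattern.match(name))
    ]
    # Start at V1 when no version of this base name exists yet.
    return f"{base}:V{max(versions, default=0) + 1}"


# Hypothetical names already taken in the workspace.
existing = ["my_training_data:V1", "my_training_data:V2", "other_data:V7"]
name = next_dataset_name("my_training_data", existing)
```

The result can then be passed as dataset_name to start_dataset_job.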

The job can later be retrieved by calling job-specific methods on a Feature View or a Feature Service object:

my_feature_service.list_jobs()
# [DatasetJob(...), DatasetJob(...)]

my_feature_service.get_job("3aa5161ff4fe4ce2ba0c752f0801d263")
# DatasetJob(...)

The DatasetJob object can be used to wait for job completion and to retrieve the resulting Dataset:

# Block until Tecton has completed offline feature retrieval
job.wait_for_completion()

# Or just check the status yourself
job.get_status_for_display()

# Retrieve the Tecton Dataset
dataset = job.get_dataset()

# Retrieve the underlying DataFrame
df = dataset.to_dataframe().to_pandas()
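
The steps above can be wrapped into one helper. This is a sketch under stated assumptions: it uses only the calls shown on this page (get_features_for_events, start_dataset_job, wait_for_completion, get_dataset), and it omits error handling because this page does not describe how failed jobs are reported. The `_Fake*` classes are stand-ins invented for the example so that it runs without a Tecton cluster; they are not part of the SDK.

```python
def generate_dataset(feature_service, events, dataset_name, **job_kwargs):
    """Start a remote Dataset Job, wait for it, and return the resulting Dataset.

    `job_kwargs` is forwarded to start_dataset_job, for example
    cluster_config, environment, or compute_mode.
    """
    data_frame = feature_service.get_features_for_events(events)
    job = data_frame.start_dataset_job(dataset_name=dataset_name, **job_kwargs)
    job.wait_for_completion()  # blocks until the remote job finishes
    return job.get_dataset()


# Stand-in objects (NOT Tecton classes) that mimic the call sequence.
class _FakeJob:
    def __init__(self, name):
        self.name, self.done = name, False

    def wait_for_completion(self):
        self.done = True

    def get_dataset(self):
        return {"name": self.name, "ready": self.done}


class _FakeDataFrame:
    def start_dataset_job(self, dataset_name, **kwargs):
        return _FakeJob(dataset_name)


class _FakeFeatureService:
    def get_features_for_events(self, events):
        return _FakeDataFrame()


dataset = generate_dataset(
    _FakeFeatureService(),
    events=[],
    dataset_name="my_training_data:V1",
    compute_mode="rift",
)
```

With a real Feature Service and a Pandas events DataFrame, the same function would return a Tecton Dataset, on which to_dataframe().to_pandas() can be called as shown above.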

Once a Dataset Job has completed, the Dataset can also be retrieved directly by name:

import tecton
workspace = tecton.get_workspace("my_workspace")  # placeholder workspace name
dataset = workspace.get_dataset("my_training_data:V1")
dataset.to_dataframe().to_pandas()
