Version: 1.1

Remote Dataset Generation

Private Preview

This feature is currently in Private Preview.

This feature has the following limitations:
  • Dataset Jobs are not available in the Web UI and must be accessed via the SDK.
  • Graviton instances are supported in Spark on EMR 7.0+ and DBR 12.2+. Contact Support to enable.
If you would like to participate in the preview, please file a support ticket.

Background

By default, when using Tecton's Offline Retrieval Methods to generate data offline for model training or offline inference, Tecton will construct and execute a point-in-time correct offline feature query in the local environment. Tecton leverages either the local Spark context or a Tecton proprietary local query engine called Rift for Python-only offline feature retrieval.

See this page for more information on choosing the local compute engine for offline feature retrieval.

However, local offline retrieval has limitations in terms of scalability, resource requirements, and operational complexity. To address these challenges, Tecton offers Remote Dataset Generation (RDG) as an alternative approach.

What is Remote Dataset Generation (RDG)?

With remote dataset generation, offline retrieval no longer happens locally but on a remote cluster that is managed by Tecton. This meaningfully improves offline retrieval in the following areas:

  • Easy to use: RDG only requires the tecton Python package. It has no dependency on a local Spark context or local credentials to connect to data sources or the data plane.
  • Performant: Similar to materialization jobs, a remote dataset generation job cluster can be configured to match your data generation workload (see how to do it for Rift or Spark jobs). In addition, Tecton also optimizes offline retrieval under the hood to ensure that cluster resources are used optimally, for example by splitting up the retrieval query or the input training events DataFrame.
  • Secure: RDG conforms to Tecton's Access Controls, ensuring that only users with an appropriate role can generate and retrieve offline features.

This capability is available for both get_features_for_events (for Feature Views & Feature Services) and get_features_in_range (for Feature Views). Both Rift and Spark compute engines are supported.

When to use RDG?

Using Tecton with Rift

  • For local feature development where you want to test features by iteratively generating small amounts of feature data, we still recommend using local offline retrieval instead of RDG.
  • If local retrieval runs out of resources, switch to RDG, which lets you use any instance size for your training data run.
  • Use RDG for larger training dataset generation jobs, either ad hoc or as part of a pipeline.

Using Tecton with Spark

  • It makes the most sense to use RDG if you want to kick off training data jobs and retrieve the generated data without needing a local Spark context (for example when running it from systems like Airflow).
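
For example, a pipeline step in an orchestrator such as Airflow could start the Dataset Job and block until it completes, with no Spark context on the worker. A minimal sketch, assuming only the tecton package is installed; the helper name run_dataset_job is hypothetical:

```python
# Hypothetical pipeline step: start a remote Dataset Job and return the
# finished Tecton Dataset. Runs anywhere the `tecton` package is
# installed; no local Spark context is needed.
def run_dataset_job(tecton_df, dataset_name, compute_mode="spark"):
    job = tecton_df.start_dataset_job(
        dataset_name=dataset_name,  # the name must be unique
        compute_mode=compute_mode,
    )
    job.wait_for_completion()       # blocks until the remote job ends
    return job.get_dataset()
```

An orchestration task would call run_dataset_job with the DataFrame returned by get_features_for_events and hand the resulting Dataset to a downstream training step.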

How to use RDG?

Requirements

  • Requires Operator, Editor or Admin role

Constraints

  • Spins up single-cluster jobs; Rift RDG spins up single-instance jobs only
  • Does not work on unapplied Feature Views or Feature Services

Usage

Input argument

  • For get_features_for_events, first create a DataFrame with the events for which you want to generate training data.
  • For get_features_in_range, you can optionally pass in a dataset with the entities for which you want to generate training data.
  • In both cases, the dataset can be either
    • (a) a Pandas-based DataFrame:
      tecton_df = my_feature_service.get_features_for_events(events)
      or
    • (b) an S3 storage location:
      tecton_df = feature_view.get_features_for_events(
          "s3://path/to/dataset.parquet",
          timestamp_key="timestamp",  # in this case timestamp_key is required
      )
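
For illustration, a minimal events DataFrame for option (a) might look like the following. The join key user_id and the sample values are assumptions for this sketch; use your Feature View's actual join keys and timestamp column:

```python
import pandas as pd

# Hypothetical training events: one row per training example, containing
# the entity join key(s) plus the point-in-time lookup timestamp.
events = pd.DataFrame(
    {
        "user_id": ["u_1", "u_2", "u_3"],  # assumed join key column
        "timestamp": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-03"]
        ),
    }
)
```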

Kicking off the training dataset generation job

Instead of computing features locally by calling .to_pandas(), start a remote Dataset Job. You can configure the job directly through the cluster_config parameter. If not configured in the function call, the job will take on the cluster configuration set in your repo.yaml.

Spark example:

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=DatabricksClusterConfig(
        instance_type="m5.2xlarge",
        spark_config={"spark.executor.memory": "12g"},
    ),
    tecton_materialization_runtime="1.1.0",
    compute_mode="spark",
)

Rift example:

job = data_frame.start_dataset_job(
    dataset_name="my_training_data:V1",  # the name must be unique
    cluster_config=RiftBatchConfig(
        instance_type="m5.2xlarge",
    ),
    environment="tecton-core-1.1.0",
    compute_mode="rift",
)

While the job is running

You can monitor the job status as follows:

# Check the status of the job
job.get_status_for_display()
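
If you would rather poll than block on completion, a simple loop over get_status_for_display() works. A sketch only; the status strings checked below are assumptions, so verify the actual display values in your workspace:

```python
import time

def poll_job(job, interval_s=30):
    """Poll a Dataset Job until it leaves a running state (sketch)."""
    while True:
        status = job.get_status_for_display()
        # The status labels below are assumptions, not documented values.
        if status not in ("PENDING", "RUNNING"):
            return status
        time.sleep(interval_s)
```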

Alternatively, you can use the relevant Feature Service or Feature View to look up information about your training dataset generation job:

my_feature_service.list_jobs()
# [DatasetJob(...), DatasetJob(...)]

my_feature_service.get_job("3aa5161ff4fe4ce2ba0c752f0801d263")
# DatasetJob(...)

You can also monitor the job in the Web UI. On the "Jobs" page, we recommend filtering for the "Data Generation" Task Type to see the relevant jobs more easily.

Retrieving the dataset

Offline feature data outputs are written to S3 and are accessible via a Tecton Dataset. Datasets are automatically cataloged within your Tecton workspace and Web UI and can be shared by multiple users in your organization.

You can retrieve the dataset as a DataFrame:

# You can wait until the job is done
job.wait_for_completion()

# Retrieve the Tecton Dataset
dataset = job.get_dataset()

# Retrieve the underlying DataFrame
df = dataset.to_dataframe().to_pandas()

You can also retrieve it as a DataFrame by referencing the training dataset name directly:

dataset = workspace.get_dataset("my_training_data:V1")
dataset.to_dataframe().to_pandas()

Finally, you can retrieve the dataset directly from S3:

# Retrieve the location on S3
table_path = dataset.storage_location
# Use Spark read API
spark.read.format("delta").load(table_path)

Retries

RDG will retry a job three times before it is considered failed. Each retry spins up a new compute instance.
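
Because a job can still fail after all retries are exhausted, callers may want to check the final status explicitly rather than assume success. A sketch under the assumption that a successful job's display status contains "SUCCESS"; the helper name is hypothetical:

```python
def get_dataset_or_raise(job):
    """Wait for a Dataset Job and raise if it failed after all retries."""
    job.wait_for_completion()
    status = job.get_status_for_display()
    if "SUCCESS" not in str(status).upper():  # success label is assumed
        raise RuntimeError(f"Dataset job did not succeed: {status}")
    return job.get_dataset()
```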
