
Constructing Training Data

Overview

This example demonstrates how to use Tecton to request feature data for training machine learning models. For every row you provide, Tecton fetches the value each feature had at the time of that row's event. In Tecton, this is referred to as a point-in-time join, and it is how Tecton prevents information from the future leaking into your training data. This is critical for event-based features, which are common in operational machine learning systems.

Constructing training data requires three steps:

  1. Create a prediction context. This is a DataFrame that includes the join keys, timestamps (if applicable), and labels required for building a training set.

    This logic is not handled by Tecton; the context can be produced by code, a query against a data warehouse, a file in remote storage, or some other method.

  2. Create a Feature Service. This includes all of the features you would like to use for building your model.

  3. Construct a training dataset.

This example uses the features of a click-through rate (CTR) prediction model. Note that you need a Tecton-connected Spark environment to construct training datasets.

Creating a Prediction Context

First, determine the granularity at which you want predictions to be made. For example, you could calculate the propensity of a given user to click. Or you might decide that you need to know the propensity of a given user to click on a specific ad group.

For this example, assume you want predictions per user per ad ID. This means your training set has two join keys: user_uuid and ad_id. You must collect events at this level of granularity — that is, the ground truth labels (whether an individual clicked or not) and timestamps associated with the events.
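
Since producing this context is left to you, here is a minimal sketch of assembling one directly in PySpark. The column names (user_uuid, ad_id, timestamp, clicked) follow the schema described above; the inline sample rows are invented purely for illustration.

from pyspark.sql import SparkSession
from datetime import datetime

spark = SparkSession.builder.getOrCreate()

# Each row is one labeled event: which user saw which ad, when,
# and whether they clicked (the ground truth label).
prediction_context = spark.createDataFrame(
    [
        ("user-123", "ad-456", datetime(2021, 6, 1, 12, 0, 0), 1),
        ("user-789", "ad-456", datetime(2021, 6, 1, 12, 5, 0), 0),
    ],
    ["user_uuid", "ad_id", "timestamp", "clicked"],
)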

Tecton has built the DataFrame that represents the prediction context for this example. In practice, this DataFrame is built by your data scientists during the model-building process. Preview the data as follows:

import tecton

events = tecton.get_virtual_data_source('sample_events_for_model').dataframe()

(Preview of the sample events DataFrame.)
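
To inspect a few rows yourself, convert the result to a Spark DataFrame. The to_spark() call below is an assumption about the object returned by dataframe(); the exact conversion method may vary by SDK version.

# Show the first few labeled events: join keys, timestamp, and label.
events.to_spark().show(5)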

Creating a Feature Service

Once you know what you want to predict, formulate a strategy for what features to generate and use for the model. Since Tecton gives you the ability to reuse features across your organization, start by using several features that have been generated (for purposes of this example) by your colleagues:

  • ad_ground_truth_ctr_performance_7_days
  • user_partner_impression_count_7_days
  • user_total_ad_frequency_counts
  • ad_group_ctr_performance

Once you decide which Feature Packages your model will use for its predictions, group them in a Feature Service. A Feature Service lets you fetch data from all of these Feature Packages for both batch and real-time use cases. The CTR prediction Feature Service looks like this:

from tecton import FeatureService

ctr_prediction_service = FeatureService(
    name='ctr_prediction_service',
    description='A FeatureService used for supporting a CTR prediction model.',
    online_serving_enabled=True,
    features=[
        ad_ground_truth_ctr_performance_7_days,
        user_partner_impression_count_7_days,
        user_total_ad_frequency_counts,
        ad_group_ctr_performance,
    ],
    family='ad_serving',
)
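
This definition lives in your feature repository. Assuming the standard Tecton workflow, you register it by running tecton apply from the repository root before it can be queried from a notebook.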

Constructing a Training Dataset

Next, construct your training dataset by passing the prediction context outlined earlier to the Feature Service. For each row of join keys and timestamps you provide, Tecton returns the point-in-time correct feature values for your training set.

fs = tecton.get_feature_service("ctr_prediction_service")
training_data = fs.get_feature_dataframe(events, timestamp_key="timestamp")

training_data contains the join keys, timestamps, and values for all the features in the Feature Service, as well as any other columns that were in events.
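
From here you can hand the result to the training framework of your choice. Below is a minimal sketch; it assumes the returned object supports to_pandas() and that clicked is the label column carried over from the events DataFrame.

# Convert to pandas and separate the label from the feature columns.
df = training_data.to_pandas()
label = df["clicked"]
features = df.drop(columns=["clicked", "user_uuid", "ad_id", "timestamp"])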

To save the resulting training DataFrame, so that you can track the datasets used to train your models, set save to True in your call to get_feature_dataframe.

training_data = fs.get_feature_dataframe(events, timestamp_key="timestamp", save=True)
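
Saved datasets can later be retrieved by name and reused. The sketch below assumes the SDK's get_dataset lookup and uses a hypothetical dataset name; with save=True alone, Tecton generates a name for you, which you can look up afterwards.

# Retrieve a previously saved training dataset (the name here is hypothetical).
saved = tecton.get_dataset("ctr_training_data")
df = saved.to_pandas()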