Feature Design Patterns
The Tecton framework simplifies the implementation of machine learning features. This page helps you translate your feature ideas into the Tecton framework by providing examples of common feature design patterns.
Features built on the Tecton platform typically fit into the following categories:
- Dimension features are the single latest value of a field for an entity.
- Aggregation features are metrics calculated by aggregating over a series of events.
- Real-time features are calculated based on data that is available at request time. These features are similar to Dimension features in that they typically include simple projection and filtering, but cannot be pre-computed based on a batch or stream data source.
- Derived features are advanced feature engineering techniques based on combining and post-processing basic features.
Dimension features are most commonly implemented as a Batch Feature View that performs projection and filtering over the source table. If the dimension updates are available on a stream source, then a Stream Feature View will enable updating feature values more quickly.
|Entity property||A “property” of a single entity that is updated in place, commonly derived from a dimension table.||User Date of Birth|
|Upstream feature pipelines||Final feature values that are calculated upstream of your Tecton Data Source. Even if the original feature calculation is more complex, once ingested to Tecton they are represented as a simple dimension feature.|
|ML Model Outputs||Outputs of one model are commonly used as a feature for another. Each output of the model represents the latest value for the user.||User Embeddings|
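As a rough illustration of what "the single latest value of a field for an entity" means, the sketch below picks the most recent row per entity from a small in-memory table. It is plain Python rather than Tecton SDK code, and the function name, field names (`user_id`, `home_city`, `updated_at`), and sample rows are all hypothetical.

```python
from datetime import datetime


def latest_value_per_entity(rows, entity_key, value_field, timestamp_field):
    """Return {entity_id: latest value of value_field} from a list of row dicts."""
    latest = {}
    for row in rows:
        entity_id = row[entity_key]
        seen = latest.get(entity_id)
        # Keep the row with the most recent timestamp for each entity.
        if seen is None or row[timestamp_field] > seen[timestamp_field]:
            latest[entity_id] = row
    return {entity_id: row[value_field] for entity_id, row in latest.items()}


# Hypothetical dimension-table rows (assumed schema, for illustration only).
rows = [
    {"user_id": "u1", "home_city": "Austin", "updated_at": datetime(2023, 1, 5)},
    {"user_id": "u1", "home_city": "Denver", "updated_at": datetime(2023, 6, 1)},
    {"user_id": "u2", "home_city": "Boston", "updated_at": datetime(2023, 3, 9)},
]
features = latest_value_per_entity(rows, "user_id", "home_city", "updated_at")
```

In a real feature repository, this selection would typically be expressed as the projection and filtering query inside a Batch or Stream Feature View rather than as a Python loop.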
The Tecton Aggregation Engine makes it simple to develop and productionize Aggregation features.
Aggregations can be defined for either Batch or Stream Feature Views, depending on what source data you have available for your feature.
|Time-windowed Aggregations||Aggregations, such as count distinct or mean, over events during a trailing time period, such as the last 2 hours or last 90 days.||User Transaction Metrics|
|Lifetime Aggregations||Aggregations over the full data history.||User Lifetime Transactions|
|Secondary Key Aggregations||Time-windowed aggregations that are grouped over a secondary key in addition to the entity.||User clicks per Ad ID|
|Event History||List of previous events for an entity. Most commonly used to build Derived features, as described below.||List of page views|
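To make the difference between time-windowed and lifetime aggregations concrete, here is a minimal plain-Python sketch. It does not show how the Tecton Aggregation Engine is implemented or configured. The function name, the event schema, and the choice of a window that excludes its start and includes its end are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def window_metrics(events, as_of, window=None):
    """Count and mean of event amounts up to as_of.

    With a window, only events in (as_of - window, as_of] are used.
    With window=None, the full history is used (a lifetime aggregation).
    """
    start = as_of - window if window is not None else None
    amounts = [
        e["amount"]
        for e in events
        if e["timestamp"] <= as_of and (start is None or e["timestamp"] > start)
    ]
    count = len(amounts)
    mean = sum(amounts) / count if count else None
    return {"count": count, "mean": mean}


# Hypothetical transaction events for one user.
events = [
    {"timestamp": datetime(2024, 1, 1, 9, 0), "amount": 20.0},
    {"timestamp": datetime(2024, 1, 1, 11, 30), "amount": 40.0},
    {"timestamp": datetime(2024, 1, 1, 12, 0), "amount": 60.0},
]
as_of = datetime(2024, 1, 1, 12, 30)
two_hour = window_metrics(events, as_of, timedelta(hours=2))  # trailing 2 hours
lifetime = window_metrics(events, as_of)  # full history
```

The same events produce different feature values depending on the window, which is why the window length is part of the feature definition.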
Tecton’s On-Demand Feature Views are simple Python feature transformations that are executed in real time based on data provided in the request context.
|Request context transformation||Derive features based on analysis of the request payload||Country of the transaction based on lat/long input|
|Rules||Apply heuristics to request data||User Transaction Amount Is Above Threshold|
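The "Rules" pattern can be pictured as a small Python function over the request payload. The sketch below shows only the transformation logic. The decorator and schema declarations that an On-Demand Feature View would need are omitted, and the function name, the `amount` field, and the threshold value are hypothetical.

```python
def transaction_amount_is_above_threshold(request, threshold=1000.0):
    """Apply a simple heuristic to request data (threshold value is an assumption)."""
    return {"amount_is_above_threshold": request["amount"] > threshold}


# Hypothetical request payload supplied at inference time.
features = transaction_amount_is_above_threshold({"amount": 1500.0})
```

Because the input only exists at request time, a feature like this cannot be pre-computed from a batch or stream source.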
Derived features are calculated at request time from features retrieved from multiple feature views, optionally combined with request context data. On-Demand Feature Views can be used to calculate derived features because they can operate on request context data as well as the outputs of Batch or Stream Feature Views.
See using Feature View dependencies in On-Demand Feature Views for how to combine multiple input feature views and request data in a single On-Demand Feature View.
|Single Entity||Combine features related to a single entity, but typically originating from different data sources||User historical click-through-rate (ad clicks / ad impressions)|
|Multiple Entity||Combine features from separate entities to calculate relative comparisons or interactions between entities.||Distance between sender and recipient|
|Request vs. Metric or Dimension||Compare request context data to metric or dimension features||User age (current_time - date of birth)|
|Fitted||Model-specific transformation of other feature types in order to improve model performance. Because these features are fitted to the training data set for a specific model, they are typically implemented in model code rather than in the Tecton repository.||One-hot encoding, Binning, Normalization|
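Two of the examples in the table above can be sketched as plain Python functions: a "Request vs. Dimension" feature (user age) and a "Single Entity" feature (click-through rate). The function names and the decision to return `None` when there are no impressions are assumptions. In Tecton, logic like this would live inside an On-Demand Feature View that takes other feature views as inputs.

```python
from datetime import date


def user_age_years(request_date, date_of_birth):
    """Whole years between date_of_birth and request_date."""
    years = request_date.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (request_date.month, request_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def click_through_rate(ad_clicks, ad_impressions):
    """Ad clicks divided by ad impressions; None when there are no impressions."""
    return ad_clicks / ad_impressions if ad_impressions else None


# Hypothetical inputs: the date of birth comes from a dimension feature, the
# request date from the request context, and the counts from aggregation features.
age = user_age_years(date(2024, 3, 1), date(1990, 6, 15))
ctr = click_through_rate(ad_clicks=5, ad_impressions=200)
```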
Optimizing costs for Multiple Entity Features
By calculating derived features at request time, Tecton helps avoid the cost of a combinatorial explosion in the entity space when derived features are calculated based on features about multiple entities.
Example 1: Distance between sender and recipient home address
Assume we have a dimension table that contains the lat/long of the home address for every user, and our application has 10 million registered users.
The naive implementation would do a full cross-join on the user dimension table and calculate the distance between each pair of users. However, this requires storing (10 million)^2 = 100 trillion feature values. Computing and storing all these feature values is prohibitively expensive.
Instead, build just one Feature View for the user home address, and then use an On-Demand Feature View to calculate the distance at request time.
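The sketch below illustrates the request-time approach in plain Python: look up two stored home addresses and compute the distance only for the pair that appears in the request. The haversine formula and the mean Earth radius are one reasonable choice of distance calculation, not necessarily the one used in any Tecton sample. The function names and the `lat`/`lon` field names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sender_recipient_distance(sender_home, recipient_home):
    """Derived feature computed per request from two stored dimension features."""
    return {
        "distance_km": haversine_km(
            sender_home["lat"], sender_home["lon"],
            recipient_home["lat"], recipient_home["lon"],
        )
    }


# Storage comparison for the example above.
num_users = 10_000_000
precomputed_pairs = num_users ** 2  # full cross-join: one value per user pair
stored_rows = num_users             # one home-address row per user
```

With this design, storage grows linearly with the number of users, and the distance calculation is paid for only on the pairs that are actually requested.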
Example 2: User - Product page views for recommendations
In the case of Search and Recommendation systems, Secondary Key Aggregations allow for more efficient feature caching and retrieval by storing interaction data under a single entity key.
The naive implementation would calculate the time-window aggregation under a [user, product] compound key. However, scoring 1000 candidate products for a recommendation then requires querying the compound key 1000 times. Further, the key space becomes prohibitively large to cache.
Instead, use a Secondary Key Aggregation to calculate the view count for every product the user has viewed in the past. In most cases, the number of products here will be manageable for a single aggregation. Then join the Secondary Key Aggregation results to your individual product feature vectors.
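The following plain-Python sketch shows the shape of this approach: a single per-user lookup returns view counts keyed by product, and those counts are then joined to the candidate list. It is not Tecton SDK code. The function names, the event schema, and the default of 0 for products the user has never viewed are assumptions.

```python
from collections import Counter


def product_view_counts(view_events):
    """One aggregation per user, grouped by the secondary key (product_id)."""
    return Counter(e["product_id"] for e in view_events)


def join_candidates(user_view_counts, candidate_product_ids):
    """Attach the user's view count to each candidate product."""
    # Products the user never viewed get a count of 0.
    return {pid: user_view_counts.get(pid, 0) for pid in candidate_product_ids}


# Hypothetical page-view events for a single user.
view_events = [
    {"product_id": "p1"},
    {"product_id": "p2"},
    {"product_id": "p1"},
]
counts = product_view_counts(view_events)  # retrieved once per request
scored = join_candidates(counts, ["p1", "p2", "p3"])
```

The candidate list can grow to thousands of products without adding feature store lookups, because the per-user result is fetched once and the join happens in memory.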
Sample Feature Repositories
To see these design patterns in action, explore the full sample Tecton feature repositories.