Embeddings
This feature is currently in Private Preview.
- Must be enabled by Tecton Support.
- Available for Rift-based Feature Views.
Embeddings are condensed, rich representations of unstructured data that can power both predictive and generative AI applications.
Embeddings are defined using the Embedding class.
In predictive use cases such as fraud detection and recommendation systems, embeddings enable models to identify complex patterns within data, leading to more accurate predictions. For generative AI applications, embeddings provide a semantic bridge that allows models to leverage the deep contextual meaning of data.
Tecton provides a seamless way of generating embeddings from text data, while delivering the following benefits for users:
- Efficient Compute Resource Management: Large scale inference of embeddings, such as processing millions of product descriptions nightly for a recommendation system, can be computationally expensive and memory intensive. Tecton handles these workloads by carefully provisioning, scheduling, and tuning resources such as GPUs to ensure cost-efficient performance.
- Ease of Experimentation: Finding the optimal balance between embedding model complexity, inference performance, and infrastructure costs typically demands deep technical understanding and trial-and-error. Tecton provides ML practitioners easy tooling to quickly evaluate several state-of-the-art open source models, without worrying about the model and compute complexity.
Basic Example
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, Embedding
from tecton.types import Field, String, Timestamp
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[user_text_data],
    entities=[user],
    mode="pandas",
    online=True,
    offline=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2021, 1, 1),
    timestamp_field="timestamp",
    features=[
        # Basic embedding feature using a pre-trained sentence transformer model
        Embedding(
            input_column=Field("user_bio", String),
            model="sentence-transformers/all-MiniLM-L6-v2",
            name="user_bio_embedding",
            description="Embedding representation of user biography text",
            tags={"feature_type": "text_embedding", "model_family": "sentence_transformers"},
        ),
        # Another embedding feature for user interests
        Embedding(
            input_column=Field("user_interests", String),
            model="sentence-transformers/all-MiniLM-L6-v2",
            name="user_interests_embedding",
            description="Embedding representation of user interests",
            tags={"feature_type": "text_embedding", "domain": "interests"},
        ),
    ],
    description="User text embedding features using sentence transformers",
)
def user_text_embeddings(user_text_data):
    return user_text_data[["user_id", "user_bio", "user_interests", "timestamp"]]
How to Use Embeddings
Embedding features are defined using the Embedding class within a features
list in a Batch Feature View. Tecton supports embeddings of fixed dimension
using the Array type. These features can be materialized to both online and
offline stores and used just like other feature types.
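Downstream, these fixed-dimension arrays behave like any other feature value. As a Tecton-independent illustration with hypothetical vectors (real models such as all-MiniLM-L6-v2 produce 384-dimensional output), two embeddings can be compared with cosine similarity:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 4-dimensional embeddings, standing in for real model output.
bio = np.array([0.1, 0.3, 0.5, 0.2])
interests = np.array([0.2, 0.3, 0.4, 0.1])

score = cosine_similarity(bio, interests)
```

A score near 1.0 indicates the two texts are semantically close; this is the kind of comparison that powers recommendation and retrieval use cases.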
Batch Embeddings Generation
Text stored in one or more Batch Data Sources (e.g., Snowflake, Redshift, BigQuery, S3) can be embedded using Batch Feature Views.
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, Embedding, RiftBatchConfig
from tecton.types import Field, String
from datetime import datetime, timedelta

# Entity for the product join key
product = Entity(
    name="product",
    join_keys=[Field("PRODUCT_ID", String)],
    description="Product entity for embedding features",
)

# Batch source pointing at the product catalog
products = BatchSource(
    name="products",
    batch_config=FileConfig(
        uri="s3://your-bucket/products.parquet",
        file_format="parquet",
        timestamp_field="TIMESTAMP",
    ),
)


@batch_feature_view(
    sources=[products],
    entities=[product],
    timestamp_field="TIMESTAMP",
    features=[
        Embedding(input_column=Field("PRODUCT_NAME", String), model="sentence-transformers/all-MiniLM-L6-v2"),
        Embedding(input_column=Field("PRODUCT_DESCRIPTION", String), model="sentence-transformers/all-MiniLM-L6-v2"),
    ],
    mode="pandas",
    batch_schedule=timedelta(days=1),
    batch_compute=RiftBatchConfig(
        # NOTE: we recommend using L4 GPU instances for Embeddings inference
        instance_type="g6.xlarge",
    ),
    environment="my-embeddings-env-1.0",  # Use a custom environment with embedding dependencies
    feature_start_time=datetime(2021, 1, 1),
)
def product_info_embeddings(products):
    return products[["PRODUCT_ID", "PRODUCT_NAME", "PRODUCT_DESCRIPTION", "TIMESTAMP"]]
By default, an embedding feature is named <COLUMN_NAME>_embedding. You can override this by passing name explicitly to the Embedding:
Embedding(name="merchant_embedding_all_MiniLM", input_column=Field("merchant", String), model="sentence-transformers/all-MiniLM-L6-v2")
Supported Models
The following model names can be specified in model="<model_name>" to use different open-source text embedding models.
If you'd like to use a specific open-source embeddings model not listed above, please file a support request.
To use proprietary embeddings models, see Model Generated Features.
Testing Batch Embeddings Generation Interactively
Feature Views with embeddings can be tested interactively, just like any other Batch Feature View, by running the following code in a notebook:
start = datetime(2024, 1, 1)
end = datetime(2024, 3, 1)
df = product_info_embeddings.get_features_in_range(start_time=start, end_time=end).to_pandas()
display(df.head(5))
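Each embedding column in the returned DataFrame holds a fixed-length array per row. As a sketch of downstream use, with synthetic data standing in for real materialized output (real vectors are much higher-dimensional), a simple nearest-neighbor lookup over such a column might look like:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the materialized output of product_info_embeddings:
# one row per product, with a small embedding array per row.
df = pd.DataFrame(
    {
        "PRODUCT_ID": ["p1", "p2", "p3"],
        "PRODUCT_NAME_embedding": [
            np.array([1.0, 0.0, 0.0]),
            np.array([0.9, 0.1, 0.0]),
            np.array([0.0, 0.0, 1.0]),
        ],
    }
)

# Hypothetical query embedding, e.g. produced from a search string.
query = np.array([1.0, 0.0, 0.0])


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Score every product against the query and pick the closest one.
df["similarity"] = df["PRODUCT_NAME_embedding"].apply(lambda v: cosine(query, v))
nearest = df.sort_values("similarity", ascending=False).iloc[0]["PRODUCT_ID"]
```

In production you would typically push this comparison into a vector index rather than scanning a DataFrame, but the shape of the data is the same.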
Environment
In order to use Embeddings, you need to use an environment that contains all third party packages your model relies on, plus the following packages:
- tecton[rift-materialization]
- torch>=2.0.0
- transformers>=4.40.0
Create a custom environment and add the required packages to the requirements.txt file:
$ tecton environment create --name "my-custom-env-0.1" --description "My Custom Env 0.1" --requirements /path/to/requirements.txt
See Environments for more details.
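For instance, a minimal requirements.txt covering just the packages listed above might contain (pin additional dependencies as your chosen model requires):

```
tecton[rift-materialization]
torch>=2.0.0
transformers>=4.40.0
```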
Limitations
- Batch embeddings generation is only supported for Rift-based Feature Views.
- Batch embeddings generation is currently only supported for text inputs; the datatype of columns to be embedded must be String. For generating embeddings on other input types, see Model Generated Features.
- A single Feature View can have either Aggregates or Embeddings, but not both.
- Embeddings generation is not yet supported for Stream Feature Views.
What's Next
Once you've defined and materialized an embedding feature, consider: