Embeddings
This feature is currently in Private Preview.
- Must be enabled by Tecton Support.
- Available for Rift-based Feature Views.
Embeddings are condensed, rich representations of unstructured data that can power both predictive and generative AI applications.
Embeddings are defined using the Embedding class.
In predictive use cases such as fraud detection and recommendation systems, embeddings enable models to identify complex patterns within data, leading to more accurate predictions. For generative AI applications, embeddings provide a semantic bridge that allows models to leverage the deep contextual meaning of data.
Tecton provides a seamless way of generating embeddings from text data, while delivering the following benefits for users:
- Efficient Compute Resource Management: Large scale inference of embeddings, such as processing millions of product descriptions nightly for a recommendation system, can be computationally expensive and memory intensive. Tecton handles these workloads by carefully provisioning, scheduling, and tuning resources such as GPUs to ensure cost-efficient performance.
- Ease of Experimentation: Finding the optimal balance between embedding model complexity, inference performance, and infrastructure costs typically demands deep technical understanding and trial-and-error. Tecton provides ML practitioners easy tooling to quickly evaluate several state-of-the-art open source models, without worrying about the model and compute complexity.
Basic Example
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, Embedding
from tecton.types import Field, String, Timestamp
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[user_text_data],
    entities=[user],
    mode="pandas",
    online=True,
    offline=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2021, 1, 1),
    timestamp_field="timestamp",
    features=[
        # Basic embedding feature using a pre-trained sentence transformer model
        Embedding(
            input_column=Field("user_bio", String),
            model="sentence-transformers/all-MiniLM-L6-v2",
            name="user_bio_embedding",
            description="Embedding representation of user biography text",
            tags={"feature_type": "text_embedding", "model_family": "sentence_transformers"},
        ),
        # Another embedding feature for user interests
        Embedding(
            input_column=Field("user_interests", String),
            model="sentence-transformers/all-MiniLM-L6-v2",
            name="user_interests_embedding",
            description="Embedding representation of user interests",
            tags={"feature_type": "text_embedding", "domain": "interests"},
        ),
    ],
    description="User text embedding features using sentence transformers",
)
def user_text_embeddings(user_text_data):
    return user_text_data[["user_id", "user_bio", "user_interests", "timestamp"]]
How to Use Embeddings
Embedding features are defined using the Embedding class within a features
list in a Batch Feature View. Tecton supports embeddings of fixed dimension
using the Array type. These features can be materialized to both online and
offline stores and used just like other feature types.
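Downstream, these fixed-dimension arrays behave like any other feature value. As a Tecton-independent illustration with hypothetical vectors (real models such as all-MiniLM-L6-v2 produce 384-dimensional output), two embeddings can be compared with cosine similarity:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 4-dimensional embeddings, standing in for real model output.
bio = np.array([0.1, 0.3, 0.5, 0.2])
interests = np.array([0.2, 0.3, 0.4, 0.1])

score = cosine_similarity(bio, interests)
```

A score near 1.0 indicates the two texts are semantically close; this is the kind of comparison that powers recommendation and retrieval use cases.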
Batch Embeddings Generation
Text stored in one or more Batch Data Sources (e.g., Snowflake, Redshift, BigQuery, S3) can be embedded using Batch Feature Views.
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, Embedding, RiftBatchConfig
from tecton.types import Field, String
from datetime import datetime, timedelta

# Entity for the product join key
product = Entity(
    name="product",
    join_keys=[Field("PRODUCT_ID", String)],
    description="Product entity for embedding features",
)

# Batch source pointing at the product catalog
products = BatchSource(
    name="products",
    batch_config=FileConfig(
        uri="s3://your-bucket/products.parquet",
        file_format="parquet",
        timestamp_field="TIMESTAMP",
    ),
)


@batch_feature_view(
    sources=[products],
    entities=[product],
    timestamp_field="TIMESTAMP",
    features=[
        Embedding(input_column=Field("PRODUCT_NAME", String), model="sentence-transformers/all-MiniLM-L6-v2"),
        Embedding(input_column=Field("PRODUCT_DESCRIPTION", String), model="sentence-transformers/all-MiniLM-L6-v2"),
    ],
    mode="pandas",
    batch_schedule=timedelta(days=1),
    batch_compute=RiftBatchConfig(
        # NOTE: we recommend using L4 GPU instances for Embeddings inference
        instance_type="g6.xlarge",
    ),
    environment="my-embeddings-env-1.0",  # Use a custom environment with embedding dependencies
    feature_start_time=datetime(2021, 1, 1),
)
def product_info_embeddings(products):
    return products[["PRODUCT_ID", "PRODUCT_NAME", "PRODUCT_DESCRIPTION", "TIMESTAMP"]]
By default, an embedding feature is named <COLUMN_NAME>_embedding. You can override this by passing name explicitly to the Embedding:
Embedding(name="merchant_embedding_all_MiniLM", input_column=Field("merchant", String), model="sentence-transformers/all-MiniLM-L6-v2")
Supported Models
The following model names can be specified in model="<model_name>" to use different open-source text embedding models.
If you'd like to use a specific open-source embeddings model not listed above, please file a support request.
To use proprietary embeddings models, see Model Generated Features.
Testing Batch Embeddings Generation Interactively
Feature Views with embeddings can be tested interactively, just like any other Batch Feature View, by running the following code in a notebook:
start = datetime(2024, 1, 1)
end = datetime(2024, 3, 1)
df = product_info_embeddings.get_features_in_range(start_time=start, end_time=end).to_pandas()
display(df.head(5))
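Each embedding column in the returned DataFrame holds a fixed-length array per row. As a sketch of downstream use, with synthetic data standing in for real materialized output (real vectors are much higher-dimensional), a simple nearest-neighbor lookup over such a column might look like:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the materialized output of product_info_embeddings:
# one row per product, with a small embedding array per row.
df = pd.DataFrame(
    {
        "PRODUCT_ID": ["p1", "p2", "p3"],
        "PRODUCT_NAME_embedding": [
            np.array([1.0, 0.0, 0.0]),
            np.array([0.9, 0.1, 0.0]),
            np.array([0.0, 0.0, 1.0]),
        ],
    }
)

# Hypothetical query embedding, e.g. produced from a search string.
query = np.array([1.0, 0.0, 0.0])


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Score every product against the query and pick the closest one.
df["similarity"] = df["PRODUCT_NAME_embedding"].apply(lambda v: cosine(query, v))
nearest = df.sort_values("similarity", ascending=False).iloc[0]["PRODUCT_ID"]
```

In production you would typically push this comparison into a vector index rather than scanning a DataFrame, but the shape of the data is the same.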
Environment
In order to use Embeddings, you need to use an environment that contains all third party packages your model relies on, plus the following packages:
- tecton[rift-materialization]
- torch>=2.0.0
- transformers>=4.40.0
Create a custom environment and add the required packages to the requirements.txt file:
$ tecton environment create --name "my-custom-env-0.1" --description "My Custom Env 0.1" --requirements /path/to/requirements.txt
See Environments for more details.
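For instance, a minimal requirements.txt covering just the packages listed above might contain (pin additional dependencies as your chosen model requires):

```
tecton[rift-materialization]
torch>=2.0.0
transformers>=4.40.0
```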
Limitations
- Batch embeddings generation is only supported for Rift-based Feature Views.
- Batch embeddings generation is currently only supported for text inputs; the datatype of columns to be embedded must be String. For generating embeddings on other input types, see Model Generated Features.
- A single Feature View can have either Aggregates or Embeddings, but not both.
- Embeddings generation is not yet supported for Stream Feature Views.
What's Next
Once you've defined and materialized an embedding feature, consider: