Transformations

A Transformation is a Tecton object that describes a set of operations on data. The operations are expressed through standard frameworks such as Spark SQL, PySpark, and Pandas.

A Transformation is required to create a feature within a Feature View. Once defined, a Transformation can be reused within multiple Feature Views, or multiple Transformations can be composed within a single Feature View.

If you have an existing pipeline that transforms your feature values, you can ingest those values into a Feature Table without a Tecton Transformation. However, using Tecton Transformations with your feature store provides several benefits:

  • Reusability: You can define a common Transformation — to clean up data, for example — that can be shared across all Features.
  • Feature versioning: If you change a Feature Transformation, the Feature Store increments the version of that feature and ensures that you don't accidentally mix features that were computed using two different implementations.
  • End-to-end lineage tracking and reproducibility: Since Tecton manages Transformations, it can tie feature definitions all the way through a training data set and a model that's used in production.
  • Visibility: Data scientists can examine the code and see how a feature is calculated, which helps them judge whether it's appropriate to reuse for their model.

Transformation Types

Register a Python function as a Transformation in Tecton by annotating it with @transformation, and set the mode parameter according to the framework used for the transformation. The current options are spark_sql, pyspark, and pandas.

Spark SQL Transformation

Spark SQL transformations are configured with mode=spark_sql and return a Spark SQL query as a string.

Each function input must be a Spark DataFrame or a Tecton constant. The tables in the FROM clause must be parameterized via the function's inputs (a constant-parameterized variant follows the example below).

Example

from tecton import transformation

@transformation(mode="spark_sql")
def user_has_good_credit_transformation(credit_scores):
    return f"""
        SELECT
            user_id,
            IF (credit_score > 670, 1, 0) as user_has_good_credit,
            date as timestamp
        FROM
            {credit_scores}
        """

Note that Spark SQL transformations cannot be used within an OnDemandFeatureView.

PySpark Transformations

PySpark transformations are configured with mode=pyspark and contain Python code that is executed within a Spark context. They can also use third-party libraries in user-defined PySpark functions (UDFs) if your cluster allows third-party libraries; a sketch of this follows the example below.

Each function input must be a Spark DataFrame or a Tecton constant.

Example

from tecton import transformation

@transformation(mode="pyspark")
def user_has_good_credit_transformation(credit_scores):
    from pyspark.sql import functions as F

    df = credit_scores.withColumn(
        "user_has_good_credit",
        F.when(credit_scores["credit_score"] > 670, 1).otherwise(0))
    return df.select(
        "user_id",
        df["date"].alias("timestamp"),
        "user_has_good_credit")

Note that PySpark transformations, like Spark SQL transformations, cannot be used within an OnDemandFeatureView.

Pandas Transformations

Pandas transformations are configured with mode=pandas. Today, they can only be used within an OnDemandFeatureView (a sketch of this pairing follows the example below).

Each function input must be a Pandas DataFrame or a Tecton constant.

Example

from tecton import transformation

@transformation(mode="pandas")
def transaction_amount_is_high_transformation(transaction_request):
    import pandas as pd

    df = pd.DataFrame()
    df['transaction_amount_is_high'] = (transaction_request['amount'] >= 10000).astype('int64')
    return df
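
For context, here is a hedged sketch of how this transformation might be invoked from an OnDemandFeatureView. The transaction_request request source, the schema types, and the decorator parameters are assumptions that vary across Tecton SDK versions; consult the Feature View documentation for your version.

from tecton import on_demand_feature_view, RequestSource
from tecton.types import Field, Float64, Int64

# Hypothetical request source carrying the transaction amount.
transaction_request = RequestSource(schema=[Field("amount", Float64)])

@on_demand_feature_view(
    mode="pipeline",
    sources=[transaction_request],
    schema=[Field("transaction_amount_is_high", Int64)],
)
def transaction_amount_is_high(transaction_request):
    # The pipeline function simply invokes the registered
    # Pandas transformation.
    return transaction_amount_is_high_transformation(transaction_request)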

Using functions or libraries for your Transformation

When Transformations are applied to the Tecton feature repository, only the Transformation function's body is recorded. Therefore, imports and other references from outside the function's body will not work.

Importing Libraries

To use a library in a Transformation, import it inside the Transformation function, not at the top level as you normally would. Avoid aliasing imports (e.g., use import pandas instead of import pandas as pd).

### Valid
from tecton import transformation

@transformation(mode="pandas")
def my_transformation(request):
    import pandas

    df = pandas.DataFrame()
    df['amount_is_high'] = (request['amount'] >= 10000).astype('int64')
    return df

### Invalid - pandas is imported outside my_transformation!
from tecton import transformation
import pandas

@transformation(mode="pandas")
def my_transformation(request):
    df = pandas.DataFrame()
    df['amount_is_high'] = (request['amount'] >= 10000).astype('int64')
    return df

Any libraries used in a function's signature (for example, in type hints) must also be imported outside the function.

from tecton import transformation
import pandas # required for type hints on my_transformation.

@transformation(mode="pandas")
def my_transformation(request: pandas.DataFrame) -> pandas.DataFrame:
    import pandas # required for pandas.DataFrame() below.

    df = pandas.DataFrame()
    df['amount_is_high'] = (request['amount'] >= 10000).astype('int64')
    return df

Inlining functions

You must annotate any helper functions used by Transformations with @inlined in order for them to be registered with the Feature Repository. These functions must recursively follow the same rules for importing libraries and calling other functions.

### Valid
from tecton import transformation, inlined

@inlined
def credit_threshold(description):
    if description == "good":
        return 670
    elif description == "excellent":
        return 740
    raise ValueError("Illegal credit description")

@transformation(mode="pyspark")
def user_has_good_credit_transformation(credit_scores):
    from pyspark.sql import functions as F

    threshold = credit_threshold("good")
    return credit_scores.withColumn(
        "user_has_good_credit",
        F.when(credit_scores["credit_score"] > threshold, 1).otherwise(0))

### Invalid - @inlined is missing!
from tecton import transformation

def credit_threshold(description):
    if description == "good":
        return 670
    elif description == "excellent":
        return 740
    raise ValueError("Illegal credit description")

@transformation(mode="pyspark")
def user_has_good_credit_transformation(credit_scores):
    from pyspark.sql import functions as F

    threshold = credit_threshold("good")
    return credit_scores.withColumn(
        "user_has_good_credit",
        F.when(credit_scores["credit_score"] > threshold, 1).otherwise(0))

Using Transformations in Feature Views

Now that you've created your Transformation function, the next step is to call the function from your Feature View. See the Feature View Overview for more details.
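
To make the wiring concrete, here is a minimal sketch of a Batch Feature View whose pipeline calls the Spark SQL transformation defined earlier. The user entity, the credit_scores_batch data source, and the decorator parameters are assumptions that vary across Tecton SDK versions, not part of this page.

from datetime import datetime, timedelta
from tecton import batch_feature_view

# `user` and `credit_scores_batch` are assumed to be defined
# elsewhere in the feature repository.
from entities import user
from data_sources import credit_scores_batch

@batch_feature_view(
    mode="pipeline",
    sources=[credit_scores_batch],
    entities=[user],
    online=True,
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2021, 1, 1),
)
def user_has_good_credit(credit_scores):
    # The pipeline function composes one or more registered
    # transformations.
    return user_has_good_credit_transformation(credit_scores)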