Skip to main content
Version: Beta 🚧

Aggregation Functions

Tecton's Aggregation Engine supports the following aggregations out of the box.

Support for custom aggregation functions is coming soon.

approx_count_distinct(precision)

An aggregation function that returns, for a materialization time window, the approximate number of distinct row values for a column, per entity value (such as a user_id value). Null values are excluded.

Not currently supported with:

  • Tecton on Snowflake
  • Serverless Feature Retrieval with Athena

Input column types

  • Tecton on Spark: String, Int32, Int64

Output column types

  • Int64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_count_distinct.

Then, define an Aggregation object using function=approx_count_distinct(precision), where precision is an integer >= 4 and <= 16, in a Batch or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. The storage cost (in both the offline and online store) is proportional to 2^precision. The standard error of the approximation is 1.04 / sqrt(2^precision). Here are the standard errors for several different values of precision:

PrecisionStandard Error
426.0%
613.0%
86.5%
103.3%
121.6%
140.8%
160.4%

The default value of precision is 8. We recommend using the default precision unless extreme accuracy is important.

In general, the approx_count_distinct aggregation might not return the exact correct value. However, the aggregation is typically able to return the exact correct value for low-cardinality data (i.e. data with at most several hundred distinct elements), as long as the maximum precision (16) is used.

This aggregation uses the HyperLogLog algorithm.

Example

Aggregation(column="address", function=approx_count_distinct(), time_window=timedelta(days=1)) to use the default value of precision=8

Aggregation(column="address", function=approx_count_distinct(precision=10), time_window=timedelta(days=1))

approx_percentile(percentile, precision)

An aggregation function that returns, for a materialization time window, a value that is approximately equal to the specified percentile, per entity value (such as a user_id value). Null values are excluded. For Float32 and Float64 input columns, NaNs, positive infinity, and negative infinity are also excluded.

Not currently supported with:

  • Tecton on Snowflake
  • Serverless Feature Retrieval with Athena

Input column types

  • Tecton on Spark: Float32, Float64, Int32, Int64

Output column types

  • Float64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_percentile.

Then, define an Aggregation object, using function=approx_percentile(percentile, precision), where percentile is a float >= 0.0 and <= 1.0 and precision is an integer >= 20 and <= 500, in a Batch Feature View or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. Specifically, the error rate of the estimate is inversely proportional to precision, and the storage cost is proportional to precision. The default value of precision is 100. We recommend using the default precision unless extreme accuracy is important.

This aggregation uses the t-Digest algorithm.

caution

This aggregation is not fully deterministic. Its final estimate depends on the order in which input data is processed. Therefore, for example, it is possible for get_historical_features to return different results when run twice, as Spark could shuffle the input data differently. Similarly, the feature server may return different results than the offline store, as there is no guarantee that the input data is processed in the exact same order. In practice, getting different results is rare, and when it does happen, the differences are extremely small.

caution

This aggregation is computationally intensive. As a result, running get_historical_features with from_source=True can be slow. If possible, we recommend waiting for offline materialization to finish and using from_source=False.

Example

Aggregation(column="count", function=approx_percentile(percentile=0.5), time_window=timedelta(days=1)) to get the 50th percentile with the default value of precision=100

Aggregation(column="count", function=approx_percentile(percentile=0.99, precision=500), time_window=timedelta(days=1)) to get the 99th percentile with extreme precision

count

An aggregation function that returns, for a materialization time window, the number of row values for a column, per entity value (such as a user_id value). Null values are excluded.

Input column types

  • Tecton on Spark: All types
  • Tecton on Snowflake: All types

Output column types

  • Int64

Usage

To use this aggregation, define an Aggregation object, using function="count", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="transaction_id", function="count", time_window=timedelta(days=1))

first_distinct(n)

An aggregation function that returns, for a materialization time window, the first N distinct row values for a column, per entity value (such as a user_id value).

For example, if the first 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on timestamp.

Not currently supported with:

  • Tecton on Snowflake
  • Serverless Feature Retrieval with Athena

Input column types

  • String

Output column type

  • Array[String]

Usage

Import this aggregation with from tecton.aggregation_functions import first_distinct.

Then, define an Aggregation object, using function=first_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function=first_distinct(2), time_window=timedelta(days=1)).

first(n)

An aggregation function that returns, for a materialization time window, the first N row values for a column, per entity value (such as a user_id value).

For example, if the first 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

Not currently supported with:

  • Serverless Feature Retrieval with Athena

Input column types

  • String

Output column type

  • Array[String]

Usage

Import this aggregation with from tecton.aggregation_functions import first.

Then, define an Aggregation object, using function=first(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function=first(2), time_window=timedelta(days=1))

last_distinct(n)

An aggregation function that returns, for a materialization time window, the last N distinct row values for a column, per entity value (such as a user_id value).

For example, if the last 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

Not currently supported with:

  • Tecton on Snowflake
  • Serverless Feature Retrieval with Athena

Input column types

  • String

Output column type

  • Array[String]

Usage

Import this aggregation with from tecton.aggregation_functions import last_distinct.

Then, define an Aggregation object, using function=last_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function=last_distinct(2), time_window=timedelta(days=1))

last

An aggregation function that returns, for a materialization time window, the last row value for a column, per entity value (such as a user_id value).

Not currently supported with:

  • Tecton on Snowflake
  • Serverless Feature Retrieval with Athena

Input column types

  • Int64, Int32, Float64, Bool, String

Output column type

  • Int64, Float64, Bool, String

Usage

To use this aggregation, define an Aggregation object, using function="last", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="last", time_window=timedelta(days=1))

last(n)

An aggregation function that returns, for a materialization time window, the last N row values for a column, per entity value (such as a user_id value).

For example, if the last 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

Not currently supported with:

  • Serverless Feature Retrieval with Athena

Input column types

  • String

Output column type

  • Array[String]

Usage

Import this aggregation with from tecton.aggregation_functions import last.

Then, define an Aggregation object using function=last(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function=last(2), time_window=timedelta(days=1))

max

An aggregation function that returns, for a materialization time window, the maximum of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64, String

Output column type

  • Int64, Float64, String

Usage

To use this aggregation, define an Aggregation object, using function="max", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="max", time_window=timedelta(days=1))

mean

An aggregation function that returns, for a materialization time window, the mean of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregation object, using function="mean", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="mean", time_window=timedelta(days=1))

min

An aggregation function that returns, for a materialization time window, the minimum of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64, String

Output column type

  • Int64, Float64, String

Usage

To use this aggregation, define an Aggregation object, using function="min", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="min", time_window=timedelta(days=1))

stddev_pop

An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the population mean, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregation object, using function="stddev_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="stddev_pop", time_window=timedelta(days=1))

stddev_samp

An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the sample mean, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregation object, using function="stddev_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="stddev_samp", time_window=timedelta(days=1))

sum

An aggregation function that returns, for a materialization time window, the sum of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Int64 or Float64

Usage

To use this aggregation, define an Aggregation object, using function="sum", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="sum", time_window=timedelta(days=1))

var_pop

An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the population mean, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregation object, using function="var_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="var_pop", time_window=timedelta(days=1))

var_samp

An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the sample mean, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregation object, using function="var_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregation(column="amt", function="var_samp", time_window=timedelta(days=1))

Was this page helpful?

Happy React is loading...

🧠 Hi! Ask me anything about Tecton!

Floating button icon