Version: Beta 🚧

Aggregation Functions

Tecton's Aggregation Engine supports the following aggregations out of the box. All Compute Engines are supported except where explicitly noted.

Note that null feature values are excluded from the output unless noted otherwise.

approx_count_distinct(precision)

An aggregation function that returns, for a materialization time window, the approximate number of distinct row values for a column, per entity value (such as a user_id value).

Not currently supported with:

Tecton on Snowflake
Serverless Feature Retrieval with Athena

Input column types

String, Int32, Int64

Output column types

Int64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_count_distinct.

Then, define an Aggregate object using function=approx_count_distinct(precision), where precision is an integer >= 4 and <= 16, in a Batch or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. The storage cost (in both the offline and online store) is proportional to 2^precision. The standard error of the approximation is 1.04 / sqrt(2^precision). Here are the standard errors for several different values of precision:

Precision	Standard Error
4	26.0%
6	13.0%
8	6.5%
10	3.3%
12	1.6%
14	0.8%
16	0.4%

The default value of precision is 8. We recommend using the default precision unless extreme accuracy is important.

In general, the approx_count_distinct aggregation might not return the exact correct value. However, the aggregation is typically able to return the exact correct value for low-cardinality data (i.e. data with at most several hundred distinct elements), as long as the maximum precision (16) is used.

This aggregation uses the HyperLogLog algorithm.

Example

Aggregate(input_column=Field("address", String), function=approx_count_distinct(), time_window=timedelta(days=1)) to use the default value of precision=8

Aggregate(input_column=Field("address", String), function=approx_count_distinct(precision=10), time_window=timedelta(days=1))

approx_percentile(percentile, precision)

An aggregation function that returns, for a materialization time window, a value that is approximately equal to the specified percentile, per entity value (such as a user_id value). For Float32 and Float64 input columns, NaNs, positive infinity, and negative infinity are excluded.

Not currently supported with:

Tecton on Snowflake
Serverless Feature Retrieval with Athena

Input column types

Float32, Float64, Int32, Int64

Output column types

Float64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_percentile.

Then, define an Aggregate object, using function=approx_percentile(percentile, precision), where percentile is a float >= 0.0 and <= 1.0 and precision is an integer >= 20 and <= 500, in a Batch Feature View or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. Specifically, the error rate of the estimate is inversely proportional to precision, and the storage cost is proportional to precision. The default value of precision is 100. We recommend using the default precision unless extreme accuracy is important.

This aggregation uses the t-Digest algorithm.

caution

This aggregation is not fully deterministic. Its final estimate depends on the order in which input data is processed. Therefore, for example, it is possible for get_features_for_events() to return different results when run twice, as Spark could shuffle the input data differently. Similarly, the feature server may return different results than the offline store, as there is no guarantee that the input data is processed in the exact same order. In practice, getting different results is rare, and when it does happen, the differences are extremely small.

caution

This aggregation is computationally intensive. As a result, running get_features_for_events() or get_features_in_range() with from_source=True can be slow. If possible, we recommend waiting for offline materialization to finish and using from_source=False.

Example

Aggregate(input_column=Field("count", Int64), function=approx_percentile(percentile=0.5), time_window=timedelta(days=1)) to get the 50th percentile with the default value of precision=100

Aggregate(input_column=Field("count", Int64), function=approx_percentile(percentile=0.99, precision=500), time_window=timedelta(days=1)) to get the 99th percentile with extreme precision

count

An aggregation function that returns, for a materialization time window, the number of row values for a column, per entity value (such as a user_id value).

Input column types

All types

Output column types

Int64

Usage

To use this aggregation, define an Aggregate object, using function="count", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("transaction_id", String), function="count", time_window=timedelta(days=1))

first_distinct(n)

An aggregation function that returns, for a materialization time window, the first N distinct row values for a column, per entity value (such as a user_id value).

For example, if the first 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on timestamp.

For Spark-based Feature Views, null input values are included in the output; for a String input column, they're included as an empty String.

Not currently supported with:

Tecton on Snowflake
Serverless Feature Retrieval with Athena

Input column types

String, Int64

Output column type

Array[String], Array[Int64]

Usage

Import this aggregation with from tecton.aggregation_functions import first_distinct.

Then, define an Aggregate object, using function=first_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function=first_distinct(2), time_window=timedelta(days=1)).

first(n)

An aggregation function that returns, for a materialization time window, the first N row values for a column, per entity value (such as a user_id value).

For example, if the first 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output.

Not currently supported with:

Serverless Feature Retrieval with Athena

Input column types

String, Int64, Float32, Float64, Bool, Array

Output column type

Array[InputType]

Usage

Import this aggregation with from tecton.aggregation_functions import first.

Then, define an Aggregate object, using function=first(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64),, function=first(2), time_window=timedelta(days=1))

last_distinct(n)

An aggregation function that returns, for a materialization time window, the last N distinct row values for a column, per entity value (such as a user_id value).

For example, if the last 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output; for a String input column, they're included as an empty String.

Not currently supported with:

Tecton on Snowflake
Serverless Feature Retrieval with Athena

Input column types

String, Int64

Output column type

Array[String], Array[Int64]

Usage

Import this aggregation with from tecton.aggregation_functions import last_distinct.

Then, define an Aggregate object, using function=last_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function=last_distinct(2), time_window=timedelta(days=1))

last

An aggregation function that returns, for a materialization time window, the last row value for a column, per entity value (such as a user_id value).

Not currently supported with:

Tecton on Snowflake
Serverless Feature Retrieval with Athena

Input column types

Int64, Int32, Float64, Bool, String, Array

Output column type

InputType

Usage

To use this aggregation, define an Aggregate object, using function="last", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="last", time_window=timedelta(days=1))

last(n)

An aggregation function that returns, for a materialization time window, the last N row values for a column, per entity value (such as a user_id value).

For example, if the last 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output.

Not currently supported with:

Tecton on Snowflake
Serverless Feature Retrieval with Athena

Input column types

String, Int64, Float32, Float64, Bool, Array

Output column type

Array[InputType]

Usage

Import this aggregation with from tecton.aggregation_functions import last.

Then, define an Aggregate object using function=last(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function=last(2), time_window=timedelta(days=1))

max

An aggregation function that returns, for a materialization time window, the maximum of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64, String

Output column type

InputType

Usage

To use this aggregation, define an Aggregate object, using function="max", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="max", time_window=timedelta(days=1))

mean

An aggregation function that returns, for a materialization time window, the mean of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="mean", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="mean", time_window=timedelta(days=1))

min

An aggregation function that returns, for a materialization time window, the minimum of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64, String

Output column type

InputType

Usage

To use this aggregation, define an Aggregate object, using function="min", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="min", time_window=timedelta(days=1))

stddev_pop

An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the population mean, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="stddev_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="stddev_pop", time_window=timedelta(days=1))

stddev_samp

An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the sample mean, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="stddev_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="stddev_samp", time_window=timedelta(days=1))

sum

An aggregation function that returns, for a materialization time window, the sum of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Int64 or Float64

Usage

To use this aggregation, define an Aggregate object, using function="sum", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="sum", time_window=timedelta(days=1))

var_pop

An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the population mean, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="var_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="var_pop", time_window=timedelta(days=1))

var_samp

An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the sample mean, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="var_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="var_samp", time_window=timedelta(days=1))

approx_count_distinct(precision)​

approx_percentile(percentile, precision)​

count​

first_distinct(n)​

first(n)​

last_distinct(n)​

last​

last(n)​

max​

mean​

min​

stddev_pop​

stddev_samp​

sum​

var_pop​

var_samp​

Was this page helpful?

approx_count_distinct(precision)

approx_percentile(percentile, precision)

count

first_distinct(n)

first(n)

last_distinct(n)

last

last(n)

max

mean

min

stddev_pop

stddev_samp

sum

var_pop

var_samp