Version: Beta 🚧

Aggregation Functions

Tecton's Aggregation Engine supports the following aggregations out of the box. All Compute Engines are supported except where explicitly noted.

Aggregations are defined using the Aggregate class.

Note that null feature values are excluded from the output unless noted otherwise.

approx_count_distinct(precision)

An aggregation function that returns, for a materialization time window, the approximate number of distinct row values for a column, per entity value (such as a user_id value).

Input column types

String, Int32, Int64

Output column types

Int64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_count_distinct.

Then, define an Aggregate object using function=approx_count_distinct(precision), where precision is an integer >= 4 and <= 16, in a Batch or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. The storage cost (in both the offline and online store) is proportional to 2^precision. The standard error of the approximation is 1.04 / sqrt(2^precision). Here are the standard errors for several different values of precision:

Precision	Standard Error
4	26.0%
6	13.0%
8	6.5%
10	3.3%
12	1.6%
14	0.8%
16	0.4%
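The table values follow directly from the formula above. As a quick check, the following Python sketch evaluates 1.04 / sqrt(2^precision) for each listed precision. The helper name hll_standard_error is illustrative only and is not part of the Tecton SDK.

```python
import math


def hll_standard_error(precision: int) -> float:
    """Standard error of the estimate: 1.04 / sqrt(2^precision)."""
    if not 4 <= precision <= 16:
        raise ValueError("precision must be an integer >= 4 and <= 16")
    return 1.04 / math.sqrt(2 ** precision)


for p in (4, 6, 8, 10, 12, 14, 16):
    # 2 ** p is the quantity the storage cost is proportional to.
    print(f"precision={p:>2}  2^precision={2 ** p:>5}  "
          f"standard_error={hll_standard_error(p):.2%}")
```

Each step of 2 in precision multiplies the storage term by 4 and halves the standard error, which is why the table entries halve from row to row.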

The default value of precision is 8. We recommend using the default precision unless extreme accuracy is important.

In general, the approx_count_distinct aggregation might not return the exact correct value. However, the aggregation is typically able to return the exact correct value for low-cardinality data (i.e. data with at most several hundred distinct elements), as long as the maximum precision (16) is used.

This aggregation uses the HyperLogLog algorithm.
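To give a feel for how precision enters the algorithm, here is a toy HyperLogLog estimator in plain Python. It is a sketch of the textbook algorithm under stated assumptions, not Tecton's implementation: the function name hll_estimate_sketch, the use of SHA-256 as the hash, and the small-range (linear counting) correction are all choices made for this illustration.

```python
import hashlib
import math


def hll_estimate_sketch(values, precision=8):
    """Toy HyperLogLog estimate of the number of distinct values."""
    if not 4 <= precision <= 16:
        raise ValueError("precision must be an integer >= 4 and <= 16")
    m = 1 << precision                # number of registers, 2^precision
    registers = [0] * m
    tail_bits = 64 - precision
    for value in values:
        # Assumption: hash the string form of the value to 64 bits.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> tail_bits                  # leading bits select a register
        tail = h & ((1 << tail_bits) - 1)       # remaining bits
        # Rank: 1-based position of the leftmost 1-bit in the tail.
        rank = tail_bits - tail.bit_length() + 1
        registers[index] = max(registers[index], rank)

    # Bias-correction constant from the original HyperLogLog analysis.
    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)

    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    empty = registers.count(0)
    if raw <= 2.5 * m and empty > 0:
        # Small-range correction: fall back to linear counting.
        return round(m * math.log(m / empty))
    return round(raw)


# 5000 rows containing 1000 distinct transaction ids.
rows = [f"txn_{i % 1000}" for i in range(5000)]
print(hll_estimate_sketch(rows, precision=8))
print(hll_estimate_sketch(rows, precision=14))
```

Repeated values hash to the same register and rank, so duplicates never change the estimate; only the number of registers (2^precision) limits how finely distinct values can be told apart.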

Example

Aggregate(input_column=Field("transaction_id", String), function=approx_count_distinct(), time_window=timedelta(days=30), name="approx_distinct_transactions_1m")

This example uses the default value of precision=8. To specify a precision of 10:

Aggregate(input_column=Field("transaction_id", String), function=approx_count_distinct(precision=10), time_window=timedelta(days=30), name="approx_distinct_transactions_1m")

Example Output

          user_id  approx_distinct_transactions_1m
user_402539845901                                2
user_600003278485                               17
user_912293302206                                9
user_687958452057                               15
user_469998441571                               13

approx_percentile(percentile, precision)

An aggregation function that returns, for a materialization time window, a value that is approximately equal to the specified percentile, per entity value (such as a user_id value). For Float32 and Float64 input columns, NaNs, positive infinity, and negative infinity are excluded.
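As a point of reference for what the approximation targets, the sketch below computes an exact percentile in plain Python after dropping NaN and infinite values, as described above. It does not use t-Digest and is not Tecton's implementation; the function name exact_percentile_sketch and the linear-interpolation rule between neighboring values are assumptions made for this illustration.

```python
import math


def exact_percentile_sketch(values, percentile):
    """Exact percentile of the finite values, with linear interpolation."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be a float >= 0.0 and <= 1.0")
    # Exclude nulls, NaNs, positive infinity, and negative infinity.
    finite = sorted(v for v in values if v is not None and math.isfinite(v))
    if not finite:
        return None
    position = percentile * (len(finite) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return finite[lower] + (finite[upper] - finite[lower]) * fraction


amounts = [1.0, float("nan"), 3.0, float("inf"), 2.0]
print(exact_percentile_sketch(amounts, 0.5))
```

An exact computation like this needs every value in the window, whereas the approximate aggregation works from a bounded summary whose size is governed by precision.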

Input column types

Float32, Float64, Int32, Int64

Output column types

Float64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_percentile.

Then, define an Aggregate object, using function=approx_percentile(percentile, precision), where percentile is a float >= 0.0 and <= 1.0 and precision is an integer >= 20 and <= 500, in a Batch Feature View or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. Specifically, the error rate of the estimate is inversely proportional to precision, and the storage cost is proportional to precision. The default value of precision is 100. We recommend using the default precision unless extreme accuracy is important.

This aggregation uses the t-Digest algorithm.

caution

This aggregation is not fully deterministic. Its final estimate depends on the order in which input data is processed. Therefore, for example, it is possible for get_features_for_events() to return different results when run twice, as Spark could shuffle the input data differently. Similarly, the feature server may return different results than the offline store, as there is no guarantee that the input data is processed in the exact same order. In practice, getting different results is rare, and when it does happen, the differences are extremely small.

caution

This aggregation is computationally intensive. As a result, running get_features_for_events() or get_features_in_range() with from_source=True can be slow. If possible, we recommend waiting for offline materialization to finish and using from_source=False.

Example

Aggregate(input_column=Field("amt", Float64), function=approx_percentile(percentile=0.5), time_window=timedelta(days=30), name="approx_median_amt_1m")

This returns the approximate 50th percentile (median) using the default value of precision=100.

Aggregate(input_column=Field("amt", Float64), function=approx_percentile(percentile=0.99, precision=500), time_window=timedelta(days=30), name="approx_p99_amt_1m")

This returns the approximate 99th percentile using the maximum precision of 500.

Example Output

          user_id  approx_median_amt_1m
user_709462196403             17.969999
user_222506789984             42.310001
user_402539845901             41.070000
user_609904782486             48.570000
user_644787199786             58.610001

count

An aggregation function that returns, for a materialization time window, the number of row values for a column, per entity value (such as a user_id value).

Input column types

All types

Output column types

Int64

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="count", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("transaction_id", String), function="count", time_window=timedelta(days=30), name="transaction_count_1m")

Example Output

          user_id  transaction_count_1m
user_402539845901                    22
user_459842889956                    13
user_917975462998                     8
user_222506789984                    44
user_131340471060                     3

first_distinct(n)

An aggregation function that returns, for a materialization time window, the first N distinct row values for a column, per entity value (such as a user_id value).

For example, if the first 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].
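The semantics can be illustrated with a small plain-Python sketch. It operates on (timestamp, value) pairs for a single entity and a single time window; the function name first_distinct_sketch and the pair-based input shape are assumptions for illustration and do not reflect how Tecton computes the aggregation internally.

```python
def first_distinct_sketch(rows, n):
    """Return the first n distinct values, ordered by ascending timestamp.

    rows: iterable of (timestamp, value) pairs for one entity in one window.
    """
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    seen = set()
    result = []
    for _, value in sorted(rows, key=lambda row: row[0]):
        if value in seen:
            continue  # a repeat of an earlier value is skipped
        seen.add(value)
        result.append(value)
        if len(result) == n:
            break
    return result


rows = [(3, 20), (1, 10), (2, 10), (4, 30)]
print(first_distinct_sketch(rows, 2))  # [10, 20]
```

The second occurrence of 10 is skipped, so the second slot goes to 20 rather than to the repeated value.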

note

The output sequence is in ascending order based on timestamp.

For Spark-based Feature Views, null input values are included in the output; for a String input column, they're included as an empty String.

Input column types

String, Int64

Output column type

Array[String], Array[Int64]

Usage

Import this aggregation with from tecton.aggregation_functions import first_distinct.

Then, define an Aggregate object, using function=first_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=first_distinct(1), time_window=timedelta(days=30), name="first_distinct_merchant_1m")

Example Output

          user_id        first_distinct_merchant_1m
user_337750317412                    [Bogisich Inc]
user_568801468984   [Johnston, Nikolaus and Maggio]
user_650387977076    [Hermiston, Pacocha and Smith]
user_884240387242                     [Kilback LLC]
user_337750317412  [Connelly, Reichert and Fritsch]

first(n)

An aggregation function that returns, for a materialization time window, the first N row values for a column, per entity value (such as a user_id value).

For example, if the first 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output.

Input column types

String, Int64, Float32, Float64, Bool, Array

Output column type

Array[InputType]

Usage

Import this aggregation with from tecton.aggregation_functions import first.

Then, define an Aggregate object, using function=first(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=first(2), time_window=timedelta(days=30), name="first_2_merchants_1m")

Example Output

         user_id               first_2_merchants_1m
user_26990816968  [Dare-Marvin, Abernathy and Sons]
user_855115135598 [Altenwerth, Cartwright and Koss]
user_26990816968              [Kuhic LLC, Baumbach]
user_459842889956     [Watsica, Haag and Considine]
user_644787199786                       [Beier LLC]

last_distinct(n)

An aggregation function that returns, for a materialization time window, the last N distinct row values for a column, per entity value (such as a user_id value).

For example, if the last 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output; for a String input column, they're included as an empty String.

Input column types

String, Int64

Output column type

Array[String], Array[Int64]

Usage

Import this aggregation with from tecton.aggregation_functions import last_distinct.

Then, define an Aggregate object, using function=last_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=last_distinct(1), time_window=timedelta(days=30), name="last_distinct_merchant_1m")

Example Output

          user_id    last_distinct_merchant_1m
user_457435146833         [Hayes and Nikolaus]
user_568801468984      [Dickinson and Labadie]
user_884240387242          [Harris and Bednar]
user_131340471060           [Bechtelar-Rippin]
user_394495759023                   [Will Ltd]

last

An aggregation function that returns, for a materialization time window, the last row value for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64, Bool, String, Array

Output column type

InputType

Usage

To use this aggregation, define an Aggregate object, using function="last", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="last", time_window=timedelta(days=1))

last(n)

An aggregation function that returns, for a materialization time window, the last N row values for a column, per entity value (such as a user_id value).

For example, if the last 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].
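As an illustration of these semantics, the following plain-Python sketch takes (timestamp, value) pairs for a single entity and a single time window and returns the last n values. The function name last_n_sketch and the pair-based input shape are hypothetical and are not part of the Tecton SDK.

```python
def last_n_sketch(rows, n):
    """Return the last n values, ordered by ascending timestamp.

    rows: iterable of (timestamp, value) pairs for one entity in one window.
    """
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    ordered = [value for _, value in sorted(rows, key=lambda row: row[0])]
    # Slicing keeps ascending timestamp order and tolerates fewer than n rows.
    return ordered[-n:]


rows = [(2, 10), (1, 5), (3, 20)]
print(last_n_sketch(rows, 2))  # [10, 20]
```

Unlike the distinct variant, repeated values are kept, so a window whose last two rows carry the same value yields that value twice.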

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output.

Input column types

String, Int64, Float32, Float64, Bool, Array

Output column type

Array[InputType]

Usage

Import this aggregation with from tecton.aggregation_functions import last.

Then, define an Aggregate object using function=last(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=last(2), time_window=timedelta(days=30), name="last_2_merchants_1m")

Example Output

          user_id              last_2_merchants_1m
user_222506789984   [Bahringer, Osinski and Block]
user_502567604689         [Rowe, Batz and Goodwin]
user_650387977076       [Heathcote LLC, Romaguera]
user_222506789984 [Cartwright PLC, Hamill-D'Amore]
user_268308151877       [Bode-Rempel, Bode-Rempel]

max

An aggregation function that returns, for a materialization time window, the maximum of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64, String

Output column type

InputType

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="max", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="max", time_window=timedelta(days=30), name="max_amt_1m")

Example Output

          user_id  max_amt_1m
user_337750317412      117.78
user_402539845901       92.62
user_445755474326      346.72
user_459842889956      172.22
user_461615966685     1266.84

mean

An aggregation function that returns, for a materialization time window, the mean of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="mean", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="mean", time_window=timedelta(days=30), name="mean_amt_1m")

Example Output

          user_id  mean_amt_1m
user_402539845901    47.030000
user_459842889956    61.790952
user_461615966685    98.588571
user_724235628997    77.376000
user_782510788708    38.832000

min

An aggregation function that returns, for a materialization time window, the minimum of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64, String

Output column type

InputType

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="min", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="min", time_window=timedelta(days=30), name="min_amt_1m")

Example Output

          user_id  min_amt_1m
user_917975462998        1.62
user_459842889956        8.20
user_699955105085        3.19
user_884240387242        1.43
user_568801468984        2.60

stddev_pop

An aggregation function that returns, for a materialization time window, the population standard deviation of the row values for a column (the sum of squared deviations from the mean is divided by N, the number of values), per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="stddev_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="stddev_pop", time_window=timedelta(days=30), name="stddev_pop_amt_1m")

Example Output

          user_id  stddev_pop_amt_1m
user_499975010057          39.042573
user_950482239421         133.300228
user_131340471060          65.114006
user_212730160038          19.236001
user_457435146833          70.824972

stddev_samp

An aggregation function that returns, for a materialization time window, the sample standard deviation of the row values for a column (the sum of squared deviations from the mean is divided by N - 1, where N is the number of values), per entity value (such as a user_id value).
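To make the difference from stddev_pop concrete, this short sketch uses Python's standard statistics module, where pstdev is the population standard deviation and stdev is the sample standard deviation. The sample amounts are made up for illustration, and the snippet only demonstrates the two definitions; it does not call Tecton.

```python
import statistics

# Hypothetical amounts for one entity within one time window.
amounts = [10.0, 20.0, 30.0, 40.0]

pop = statistics.pstdev(amounts)   # divides by N, like stddev_pop
samp = statistics.stdev(amounts)   # divides by N - 1, like stddev_samp

print(f"population: {pop:.6f}")
print(f"sample:     {samp:.6f}")
```

The sample value is always at least as large as the population value for the same data, and the gap shrinks as the number of rows in the window grows.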

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="stddev_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="stddev_samp", time_window=timedelta(days=30), name="stddev_samp_amt_1m")

Example Output

          user_id  stddev_samp_amt_1m
user_268514844966          178.279950
user_337750317412           34.012570
user_884240387242           53.857238
user_469998441571           78.482849
user_538895124917           27.950343

sum

An aggregation function that returns, for a materialization time window, the sum of the row values for a column, per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Int64 or Float64

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="sum", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="sum", time_window=timedelta(days=30), name="sum_amt_1m")

Example Output

          user_id  sum_amt_1m
user_131340471060      195.60
user_538895124917      534.97
user_916905857181      631.18
user_930691958107      491.96
user_131340471060      401.76

var_pop

An aggregation function that returns, for a materialization time window, the population variance of the row values for a column (the sum of squared deviations from the mean is divided by N, the number of values), per entity value (such as a user_id value).

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="var_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="var_pop", time_window=timedelta(days=30), name="var_pop_amt_1m")

Example Output

          user_id  var_pop_amt_1m
user_855115135598     2331.988524
user_457435146833        0.000000
user_724235628997     2026.691776
user_884240387242     9713.346553
user_268514844966     1170.593603

var_samp

An aggregation function that returns, for a materialization time window, the sample variance of the row values for a column (the sum of squared deviations from the mean is divided by N - 1, where N is the number of values), per entity value (such as a user_id value).
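The two variance definitions differ only in the divisor. The sketch below writes both out explicitly in plain Python; the function names are hypothetical, and returning None when the sample variance is undefined (fewer than two values) is an assumption of this sketch, not a statement about what Tecton returns in that case.

```python
def var_pop_sketch(values):
    """Population variance: squared deviations from the mean, divided by N."""
    n = len(values)
    if n == 0:
        return None  # assumption: nothing to aggregate
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def var_samp_sketch(values):
    """Sample variance: squared deviations from the mean, divided by N - 1."""
    n = len(values)
    if n < 2:
        return None  # assumption: undefined for fewer than two values
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


# Hypothetical amounts for one entity within one time window.
amounts = [10.0, 20.0, 30.0, 40.0]
print(var_pop_sketch(amounts))   # 125.0
print(var_samp_sketch(amounts))
```

For the same data, the sample variance equals the population variance multiplied by N / (N - 1), and a window containing a single value has a population variance of 0.0.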

Input column types

Int64, Int32, Float64

Output column type

Float64

Usage

To use this aggregation, define an Aggregate object, using function="var_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="var_samp", time_window=timedelta(days=30), name="var_samp_amt_1m")

Example Output

          user_id  var_samp_amt_1m
user_268308151877      2381.217252
user_460877961787     14270.516080
user_568801468984      4028.327196
user_656020174537      1209.767752
user_722584453020     16769.225087
