Version: Beta 🚧

Aggregation Functions

Tecton's Aggregation Engine supports the following aggregations out of the box. All Compute Engines are supported except where explicitly noted.

Aggregations are defined using the Aggregate class.

Note that null feature values are excluded from the output unless noted otherwise.

approx_count_distinct(precision)

An aggregation function that returns, for a materialization time window, the approximate number of distinct row values for a column, per entity value (such as a user_id value).

Input column types

  • String, Int32, Int64

Output column types

  • Int64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_count_distinct.

Then, define an Aggregate object using function=approx_count_distinct(precision), where precision is an integer >= 4 and <= 16, in a Batch or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. The storage cost (in both the offline and online store) is proportional to 2^precision. The standard error of the approximation is 1.04 / sqrt(2^precision). Here are the standard errors for several different values of precision:

Precision  Standard Error
4          26.0%
6          13.0%
8          6.5%
10         3.3%
12         1.6%
14         0.8%
16         0.4%
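
The standard errors listed above follow directly from the formula 1.04 / sqrt(2^precision). As a quick check, the short sketch below recomputes them; the function name hll_standard_error is hypothetical and is not part of the Tecton SDK.

```python
import math


def hll_standard_error(precision: int) -> float:
    """Return 1.04 / sqrt(2^precision) as a fraction (0.065 means 6.5%)."""
    if not 4 <= precision <= 16:
        raise ValueError("precision must be an integer between 4 and 16")
    return 1.04 / math.sqrt(2 ** precision)


for p in (4, 6, 8, 10, 12, 14, 16):
    # 2 ** p is the relative storage cost stated in the text above.
    print(f"precision={p:2d}  relative storage={2 ** p:5d}  "
          f"standard error={hll_standard_error(p):.2%}")
```

Each step of 2 in precision multiplies the storage cost by 4 and halves the standard error.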

The default value of precision is 8. We recommend using the default precision unless extreme accuracy is important.

In general, the approx_count_distinct aggregation might not return the exact correct value. However, the aggregation is typically able to return the exact correct value for low-cardinality data (i.e. data with at most several hundred distinct elements), as long as the maximum precision (16) is used.

This aggregation uses the HyperLogLog algorithm.
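
To give a feel for how precision maps to storage and accuracy, here is a minimal, textbook-style HyperLogLog sketch in plain Python. It is an illustration only and is not Tecton's implementation: the class name, the SHA-1 based hash, and the small-range (linear counting) correction are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Illustrative HyperLogLog with 2^precision registers (hypothetical helper)."""

    def __init__(self, precision: int = 8):
        if not 4 <= precision <= 16:
            raise ValueError("precision must be between 4 and 16")
        self.p = precision
        self.m = 1 << precision          # number of registers
        self.registers = [0] * self.m

    def add(self, value) -> None:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> int:
        m = self.m
        alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)


sketch = HyperLogLogSketch(precision=12)
for i in range(1000):
    sketch.add(f"txn_{i}")
estimate_once = sketch.estimate()

# Adding the same values again does not change the estimate.
for i in range(1000):
    sketch.add(f"txn_{i}")
estimate_twice = sketch.estimate()
```

The sketch stores only the register array, which is why the storage cost grows with 2^precision rather than with the number of input rows.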

Example

Aggregate(input_column=Field("transaction_id", String), function=approx_count_distinct(), time_window=timedelta(days=30), name="approx_distinct_transactions_1m")

will use the default value of precision=8. To specify a precision of 10:

Aggregate(input_column=Field("transaction_id", String), function=approx_count_distinct(precision=10), time_window=timedelta(days=30), name="approx_distinct_transactions_1m")

Example Output

          user_id  approx_distinct_transactions_1m
user_402539845901 2
user_600003278485 17
user_912293302206 9
user_687958452057 15
user_469998441571 13

approx_percentile(percentile, precision)

An aggregation function that returns, for a materialization time window, a value that is approximately equal to the specified percentile, per entity value (such as a user_id value). For Float32 and Float64 input columns, NaNs, positive infinity, and negative infinity are excluded.
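
When validating approximate results, it can help to compare against an exact percentile computed on a small sample. The sketch below is one way to do that in plain Python. It drops NaN and infinite values as described above; the nearest-rank definition of a percentile and the function name exact_percentile are assumptions for illustration, so small differences from the approximate output are expected.

```python
import math


def exact_percentile(values, percentile: float):
    """Exact nearest-rank percentile over finite, non-null values."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0.0 and 1.0")
    # Exclude nulls, NaN, +inf and -inf before ranking.
    finite = sorted(v for v in values if v is not None and math.isfinite(v))
    if not finite:
        return None
    rank = max(1, math.ceil(percentile * len(finite)))
    return float(finite[rank - 1])


amounts = [1.0, float("nan"), 3.0, float("inf"), 2.0, None]
median = exact_percentile(amounts, 0.5)
```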

Input column types

  • Float32, Float64, Int32, Int64

Output column types

  • Float64

Usage

Import this aggregation with from tecton.aggregation_functions import approx_percentile.

Then, define an Aggregate object, using function=approx_percentile(percentile, precision), where percentile is a float >= 0.0 and <= 1.0 and precision is an integer >= 20 and <= 500, in a Batch Feature View or a Stream Feature View.

The precision parameter controls the accuracy of the approximation. A higher precision yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. Specifically, the error rate of the estimate is inversely proportional to precision, and the storage cost is proportional to precision. The default value of precision is 100. We recommend using the default precision unless extreme accuracy is important.

This aggregation uses the t-Digest algorithm.

caution

This aggregation is not fully deterministic. Its final estimate depends on the order in which input data is processed. Therefore, for example, it is possible for get_features_for_events() to return different results when run twice, as Spark could shuffle the input data differently. Similarly, the feature server may return different results than the offline store, as there is no guarantee that the input data is processed in the exact same order. In practice, getting different results is rare, and when it does happen, the differences are extremely small.

caution

This aggregation is computationally intensive. As a result, running get_features_for_events() or get_features_in_range() with from_source=True can be slow. If possible, we recommend waiting for offline materialization to finish and using from_source=False.

Example

Aggregate(input_column=Field("amt", Float64), function=approx_percentile(percentile=0.5), time_window=timedelta(days=30), name="approx_median_amt_1m")

to get the 50th percentile with the default value of precision=100.

Aggregate(input_column=Field("amt", Float64), function=approx_percentile(percentile=0.99, precision=500), time_window=timedelta(days=30), name="approx_p99_amt_1m")

to get the 99th percentile with the maximum precision (500).

Example Output

          user_id  approx_median_amt_1m
user_709462196403 17.969999
user_222506789984 42.310001
user_402539845901 41.070000
user_609904782486 48.570000
user_644787199786 58.610001

count

An aggregation function that returns, for a materialization time window, the number of row values for a column, per entity value (such as a user_id value).

Input column types

  • All types

Output column types

  • Int64

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.
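
Because online values for numeric aggregations arrive as strings, client code typically converts them before use. The following is a minimal sketch of such a conversion; the dictionary shape, the feature names in it, and the helper decode_numeric_feature are hypothetical and do not represent the actual response format of the feature server.

```python
def decode_numeric_feature(raw, as_type=float):
    """Convert a string-encoded numeric feature value; pass nulls through."""
    if raw is None:
        return None
    return as_type(raw)


# Hypothetical fragment of an online response, for illustration only.
online_features = {"transaction_count_1m": "22", "mean_amt_1m": "47.03", "max_amt_1m": None}

transaction_count = decode_numeric_feature(online_features["transaction_count_1m"], int)
mean_amt = decode_numeric_feature(online_features["mean_amt_1m"])
max_amt = decode_numeric_feature(online_features["max_amt_1m"])
```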

Usage

To use this aggregation, define an Aggregate object, using function="count", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("transaction_id", String), function="count", time_window=timedelta(days=30), name="transaction_count_1m")

Example Output

          user_id  transaction_count_1m
user_402539845901 22
user_459842889956 13
user_917975462998 8
user_222506789984 44
user_131340471060 3
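
The semantics of count (per entity, within a time window, with null values excluded) can be sketched in plain Python as follows. This is an illustration rather than how the Aggregation Engine computes the feature; the row layout, the function name count_in_window, and the choice of a window that excludes its start and includes its end are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_in_window(rows, window_end, time_window):
    """Count non-null values per entity for rows of (entity, timestamp, value)."""
    window_start = window_end - time_window
    counts = Counter()
    for entity, ts, value in rows:
        # Assumed window bounds: (window_start, window_end].
        if value is not None and window_start < ts <= window_end:
            counts[entity] += 1
    return dict(counts)


now = datetime(2024, 1, 31)
rows = [
    ("user_1", now, "txn_a"),
    ("user_1", now - timedelta(days=10), None),      # null value: not counted
    ("user_1", now - timedelta(days=40), "txn_b"),   # outside the 30-day window
    ("user_2", now - timedelta(days=1), "txn_c"),
]
counts = count_in_window(rows, window_end=now, time_window=timedelta(days=30))
```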

first_distinct(n)

An aggregation function that returns, for a materialization time window, the first N distinct row values for a column, per entity value (such as a user_id value).

For example, if the first 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output; for a String input column, they're included as an empty String.
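
As a rough model of these semantics, the sketch below returns the first n distinct values of a single entity's events, ordered by timestamp. It is plain Python for illustration only. The event layout of (timestamp, value) pairs and the function name first_n_distinct are assumptions, and nulls are skipped here, which follows the page-wide default rather than the Spark behavior noted above.

```python
def first_n_distinct(events, n: int):
    """Return the first n distinct non-null values, in ascending timestamp order."""
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    seen = set()
    result = []
    for _, value in sorted(events, key=lambda event: event[0]):
        if value is None or value in seen:
            continue  # skip nulls (assumption) and repeated values
        seen.add(value)
        result.append(value)
        if len(result) == n:
            break
    return result


# Timestamps are hypothetical epoch seconds; input order does not matter.
events = [(3, 20), (1, 10), (2, 10), (4, 30)]
first_two = first_n_distinct(events, 2)
```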

Input column types

  • String, Int64

Output column type

  • Array[String], Array[Int64]

Usage

Import this aggregation with from tecton.aggregation_functions import first_distinct.

Then, define an Aggregate object, using function=first_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=first_distinct(1), time_window=timedelta(days=30), name="first_distinct_merchant_1m")

Example Output

          user_id        first_distinct_merchant_1m
user_337750317412 [Bogisich Inc]
user_568801468984 [Johnston, Nikolaus and Maggio]
user_650387977076 [Hermiston, Pacocha and Smith]
user_884240387242 [Kilback LLC]
user_337750317412 [Connelly, Reichert and Fritsch]

first(n)

An aggregation function that returns, for a materialization time window, the first N row values for a column, per entity value (such as a user_id value).

For example, if the first 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output.

Input column types

  • String, Int64, Float32, Float64, Bool, Array

Output column type

  • Array[InputType]

Usage

Import this aggregation with from tecton.aggregation_functions import first.

Then, define an Aggregate object, using function=first(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=first(2), time_window=timedelta(days=30), name="first_2_merchants_1m")

Example Output

         user_id               first_2_merchants_1m
user_26990816968 [Dare-Marvin, Abernathy and Sons]
user_855115135598 [Altenwerth, Cartwright and Koss]
user_26990816968 [Kuhic LLC, Baumbach]
user_459842889956 [Watsica, Haag and Considine]
user_644787199786 [Beier LLC]

last_distinct(n)

An aggregation function that returns, for a materialization time window, the last N distinct row values for a column, per entity value (such as a user_id value).

For example, if the last 2 distinct row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output; for a String input column, they're included as an empty String.
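
One possible reading of these semantics is sketched below in plain Python. It assumes that a repeated value is positioned by its most recent occurrence, which this page does not state explicitly. The function name last_n_distinct and the (timestamp, value) event layout are hypothetical, and nulls are skipped, following the page-wide default rather than the Spark behavior noted above.

```python
def last_n_distinct(events, n: int):
    """Return the last n distinct non-null values, in ascending timestamp order."""
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    latest = {}
    for ts, value in sorted(events, key=lambda event: event[0]):
        if value is None:
            continue
        # Remove and re-insert so dict order tracks the most recent occurrence.
        latest.pop(value, None)
        latest[value] = ts
    return list(latest)[-n:]


# Timestamps are hypothetical epoch seconds.
events = [(1, 10), (2, 20), (3, 10), (4, 30)]
last_two = last_n_distinct(events, 2)
```

With this reading, the value 10 counts as more recent than 20 because it reappears at timestamp 3.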

Input column types

  • String, Int64

Output column type

  • Array[String], Array[Int64]

Usage

Import this aggregation with from tecton.aggregation_functions import last_distinct.

Then, define an Aggregate object, using function=last_distinct(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=last_distinct(1), time_window=timedelta(days=30), name="last_distinct_merchant_1m")

Example Output

          user_id    last_distinct_merchant_1m
user_457435146833 [Hayes and Nikolaus]
user_568801468984 [Dickinson and Labadie]
user_884240387242 [Harris and Bednar]
user_131340471060 [Bechtelar-Rippin]
user_394495759023 [Will Ltd]

last

An aggregation function that returns, for a materialization time window, the last row value for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64, Bool, String, Array

Output column type

  • InputType

Usage

To use this aggregation, define an Aggregate object, using function="last", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Int64), function="last", time_window=timedelta(days=1))

last(n)

An aggregation function that returns, for a materialization time window, the last N row values for a column, per entity value (such as a user_id value).

For example, if the last 2 row values for a column, in the materialization time window, are 10 and 20, then the function returns [10,20].

note

The output sequence is in ascending order based on the timestamp.

For Spark-based Feature Views, null input values are included in the output.
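
The difference between last (a single value) and last(n) (an array that keeps duplicates) can be illustrated with a small plain-Python sketch. The helper names last_n and last_value and the (timestamp, value) event layout are hypothetical, and nulls are skipped here, which follows the page-wide default rather than the Spark behavior noted above.

```python
def last_n(events, n: int):
    """Return the last n non-null values, in ascending timestamp order."""
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    ordered = [value for _, value in sorted(events, key=lambda event: event[0])
               if value is not None]
    return ordered[-n:]


def last_value(events):
    """Return the single most recent non-null value, or None if there is none."""
    values = last_n(events, 1)
    return values[0] if values else None


# Timestamps are hypothetical epoch seconds; duplicates are kept by last_n.
events = [(1, "Bode-Rempel"), (3, "Bode-Rempel"), (2, "Rowe")]
last_two = last_n(events, 2)
most_recent = last_value(events)
```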

Input column types

  • String, Int64, Float32, Float64, Bool, Array

Output column type

  • Array[InputType]

Usage

Import this aggregation with from tecton.aggregation_functions import last.

Then, define an Aggregate object using function=last(n), where n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("merchant", String), function=last(2), time_window=timedelta(days=30), name="last_2_merchants_1m")

Example Output

          user_id              last_2_merchants_1m
user_222506789984 [Bahringer, Osinski and Block]
user_502567604689 [Rowe, Batz and Goodwin]
user_650387977076 [Heathcote LLC, Romaguera]
user_222506789984 [Cartwright PLC, Hamill-D'Amore]
user_268308151877 [Bode-Rempel, Bode-Rempel]

max

An aggregation function that returns, for a materialization time window, the maximum of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64, String

Output column type

  • InputType

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="max", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="max", time_window=timedelta(days=30), name="max_amt_1m")

Example Output

          user_id  max_amt_1m
user_337750317412 117.78
user_402539845901 92.62
user_445755474326 346.72
user_459842889956 172.22
user_461615966685 1266.84

mean

An aggregation function that returns, for a materialization time window, the mean of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="mean", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="mean", time_window=timedelta(days=30), name="mean_amt_1m")

Example Output

          user_id  mean_amt_1m
user_402539845901 47.030000
user_459842889956 61.790952
user_461615966685 98.588571
user_724235628997 77.376000
user_782510788708 38.832000

min

An aggregation function that returns, for a materialization time window, the minimum of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64, String

Output column type

  • InputType

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="min", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="min", time_window=timedelta(days=30), name="min_amt_1m")

Example Output

          user_id  min_amt_1m
user_917975462998 1.62
user_459842889956 8.20
user_699955105085 3.19
user_884240387242 1.43
user_568801468984 2.60

stddev_pop

An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the population mean, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregate object, using function="stddev_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="stddev_pop", time_window=timedelta(days=30), name="stddev_pop_amt_1m")

Example Output

          user_id  stddev_pop_amt_1m
user_499975010057 39.042573
user_950482239421 133.300228
user_131340471060 65.114006
user_212730160038 19.236001
user_457435146833 70.824972

stddev_samp

An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the sample mean, per entity value (such as a user_id value).
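
Assuming stddev_pop and stddev_samp follow the usual population (divide by N) and sample (divide by N - 1) definitions, the difference between them can be checked with Python's standard library. The input values below are made up for illustration.

```python
import statistics

amounts = [10.0, 20.0, 30.0, 40.0]  # hypothetical values for one entity and window

stddev_pop = statistics.pstdev(amounts)   # population: divides by N
stddev_samp = statistics.stdev(amounts)   # sample: divides by N - 1

print(f"stddev_pop={stddev_pop:.6f}  stddev_samp={stddev_samp:.6f}")
```

The sample standard deviation is always at least as large as the population one, and the gap shrinks as the number of rows in the window grows. Note that statistics.stdev raises an error for fewer than two values.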

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregate object, using function="stddev_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="stddev_samp", time_window=timedelta(days=30), name="stddev_samp_amt_1m")

Example Output

          user_id  stddev_samp_amt_1m
user_268514844966 178.279950
user_337750317412 34.012570
user_884240387242 53.857238
user_469998441571 78.482849
user_538895124917 27.950343

sum

An aggregation function that returns, for a materialization time window, the sum of the row values for a column, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Int64 or Float64

Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.

Usage

To use this aggregation, define an Aggregate object, using function="sum", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="sum", time_window=timedelta(days=30), name="sum_amt_1m")

Example Output

          user_id  sum_amt_1m
user_131340471060 195.60
user_538895124917 534.97
user_916905857181 631.18
user_930691958107 491.96
user_131340471060 401.76

var_pop

An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the population mean, per entity value (such as a user_id value).

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregate object, using function="var_pop", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="var_pop", time_window=timedelta(days=30), name="var_pop_amt_1m")

Example Output

          user_id  var_pop_amt_1m
user_855115135598 2331.988524
user_457435146833 0.000000
user_724235628997 2026.691776
user_884240387242 9713.346553
user_268514844966 1170.593603

var_samp

An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the sample mean, per entity value (such as a user_id value).
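
The two variance definitions can be written out by hand as below, assuming the usual formulas: population variance divides the sum of squared deviations by N, and sample variance divides it by N - 1. The function names mirror the aggregation names for readability only, and returning None for fewer than two values is a choice made for this sketch, not documented behavior.

```python
def var_pop(values):
    """Population variance: sum of squared deviations divided by N."""
    n = len(values)
    if n == 0:
        return None
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def var_samp(values):
    """Sample variance: sum of squared deviations divided by N - 1."""
    n = len(values)
    if n < 2:
        return None  # undefined for fewer than two values (assumption)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


amounts = [10.0, 20.0, 30.0, 40.0]  # hypothetical values for one entity and window
population_variance = var_pop(amounts)
sample_variance = var_samp(amounts)
```

Under these definitions, each variance is the square of the corresponding standard deviation, and a window containing a single value has a population variance of 0.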

Input column types

  • Int64, Int32, Float64

Output column type

  • Float64

Usage

To use this aggregation, define an Aggregate object, using function="var_samp", in a Batch Feature View or a Stream Feature View.

Example

Aggregate(input_column=Field("amt", Float64), function="var_samp", time_window=timedelta(days=30), name="var_samp_amt_1m")

Example Output

          user_id  var_samp_amt_1m
user_268308151877 2381.217252
user_460877961787 14270.516080
user_568801468984 4028.327196
user_656020174537 1209.767752
user_722584453020 16769.225087
