Aggregation Functions
Tecton's Aggregation Engine supports the following aggregations out of the box. All Compute Engines are supported except where explicitly noted.
Aggregations are defined using the Aggregate class.
Note that null feature values are excluded from the output unless noted otherwise.
approx_count_distinct(precision)
An aggregation function that returns, for a materialization time window, the
approximate number of distinct row values for a column, per entity value (such
as a user_id value).
Input column types
String, Int32, Int64
Output column types
Int64
Usage
Import this aggregation with
from tecton.aggregation_functions import approx_count_distinct.
Then, define an Aggregate object using
function=approx_count_distinct(precision), where precision is an integer >=
4 and <= 16, in a Batch or a Stream Feature View.
The precision parameter controls the accuracy of the approximation. A higher
precision yields lower error at the cost of more storage; the impact on
performance (i.e. speed) is negligible. The storage cost (in both the offline
and online store) is proportional to 2^precision. The standard error of the
approximation is 1.04 / sqrt(2^precision). Here are the standard errors for
several different values of precision:
| Precision | Standard Error |
|---|---|
| 4 | 26.0% |
| 6 | 13.0% |
| 8 | 6.5% |
| 10 | 3.3% |
| 12 | 1.6% |
| 14 | 0.8% |
| 16 | 0.4% |
The default value of precision is 8. We recommend using the default
precision unless extreme accuracy is important.
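As a quick check of the formula above, the sketch below evaluates 1.04 / sqrt(2^precision) for the precisions listed in the table. The helper name standard_error is ours, chosen for illustration; it is not part of the Tecton SDK.

```python
import math


def standard_error(precision: int) -> float:
    """Relative standard error from the formula 1.04 / sqrt(2^precision)."""
    if not 4 <= precision <= 16:
        raise ValueError("precision must be an integer between 4 and 16")
    return 1.04 / math.sqrt(2 ** precision)


for p in (4, 6, 8, 10, 12, 14, 16):
    # Relative storage grows as 2^p while the error shrinks as 1 / sqrt(2^p).
    print(f"precision={p:2d}  relative storage={2 ** p:5d}  standard error={standard_error(p):.1%}")
```

Each step of 2 in precision quadruples the storage and halves the error, which is why the default of 8 is a reasonable middle ground.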
In general, approx_count_distinct is not guaranteed to return the exact
count. However, it typically does return the exact count for low-cardinality
data (i.e. data with at most several hundred distinct elements), provided that
the maximum precision (16) is used.
This aggregation uses the HyperLogLog algorithm.
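For readers unfamiliar with HyperLogLog, here is a minimal, self-contained sketch of the general technique. It is a toy written for illustration under our own assumptions (SHA-1 as the hash, a 64-bit hash width, the class name HyperLogLogSketch, and the textbook small-range correction); it is not Tecton's implementation and will not reproduce Tecton's exact outputs.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Toy HyperLogLog counter; illustrative only, not Tecton's implementation."""

    def __init__(self, precision: int = 8):
        if not 4 <= precision <= 16:
            raise ValueError("precision must be an integer between 4 and 16")
        self.precision = precision
        self.num_registers = 1 << precision  # storage grows as 2^precision
        self.registers = [0] * self.num_registers

    def add(self, value) -> None:
        # Derive a 64-bit hash from the value's string form (assumed hash choice).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big")
        remaining_bits = 64 - self.precision
        # The top `precision` bits select a register.
        index = hashed >> remaining_bits
        rest = hashed & ((1 << remaining_bits) - 1)
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = remaining_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self) -> int:
        m = self.num_registers
        small_alphas = {16: 0.673, 32: 0.697, 64: 0.709}
        alpha = small_alphas.get(m, 0.7213 / (1 + 1.079 / m))
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        # Small-range correction: use linear counting while some registers are empty.
        if raw <= 2.5 * m and empty > 0:
            return round(m * math.log(m / empty))
        return round(raw)


sketch = HyperLogLogSketch(precision=12)
for i in range(10_000):
    sketch.add(f"transaction_{i}")
print(sketch.estimate())  # an estimate of the 10,000 distinct values, usually not exact
```

Adding the same value again never changes a register, which is what makes the structure count distinct values rather than rows.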
Example
The following aggregation uses the default value of precision=8:
Aggregate(input_column=Field("transaction_id", String), function=approx_count_distinct(), time_window=timedelta(days=30), name="approx_distinct_transactions_1m")
To specify a precision of 10:
Aggregate(input_column=Field("transaction_id", String), function=approx_count_distinct(precision=10), time_window=timedelta(days=30), name="approx_distinct_transactions_1m")
Example Output
| user_id | approx_distinct_transactions_1m |
|---|---|
| user_402539845901 | 2 |
| user_600003278485 | 17 |
| user_912293302206 | 9 |
| user_687958452057 | 15 |
| user_469998441571 | 13 |
approx_percentile(percentile, precision)
An aggregation function that returns, for a materialization time window, a value
that is approximately equal to the specified percentile, per entity value (such
as a user_id value). For Float32 and Float64 input columns, NaNs, positive
infinity, and negative infinity are excluded.
Input column types
Float32, Float64, Int32, Int64
Output column types
Float64
Usage
Import this aggregation with
from tecton.aggregation_functions import approx_percentile.
Then, define an Aggregate object, using
function=approx_percentile(percentile, precision), where percentile is a
float >= 0.0 and <= 1.0 and precision is an integer >= 20 and <= 500, in a
Batch Feature View or a Stream Feature View.
The precision parameter controls the accuracy of the approximation. A higher
precision yields lower error at the cost of more storage; the impact on
performance (i.e. speed) is negligible. Specifically, the error rate of the
estimate is inversely proportional to precision, and the storage cost is
proportional to precision. The default value of precision is 100. We
recommend using the default precision unless extreme accuracy is important.
This aggregation uses the t-Digest algorithm.
This aggregation is not fully deterministic. Its final estimate depends on the
order in which input data is processed. Therefore, for example, it is possible
for get_features_for_events() to return different results when run twice, as
Spark could shuffle the input data differently. Similarly, the feature server
may return different results than the offline store, as there is no guarantee
that the input data is processed in the exact same order. In practice, getting
different results is rare, and when it does happen, the differences are
extremely small.
This aggregation is computationally intensive. As a result, running
get_features_for_events() or get_features_in_range() with from_source=True
can be slow. If possible, we recommend waiting for offline materialization to
finish and using from_source=False.
Example
To get the 50th percentile with the default value of precision=100:
Aggregate(input_column=Field("amt", Float64), function=approx_percentile(percentile=0.5), time_window=timedelta(days=30), name="approx_median_amt_1m")
To get the 99th percentile with the maximum precision of 500:
Aggregate(input_column=Field("amt", Float64), function=approx_percentile(percentile=0.99, precision=500), time_window=timedelta(days=30), name="approx_p99_amt_1m")
Example Output
| user_id | approx_median_amt_1m |
|---|---|
| user_709462196403 | 17.969999 |
| user_222506789984 | 42.310001 |
| user_402539845901 | 41.070000 |
| user_609904782486 | 48.570000 |
| user_644787199786 | 58.610001 |
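When validating approx_percentile results, it can help to compare them against an exact baseline computed on a small sample. The sketch below is such a baseline, not the t-Digest algorithm: it drops nulls, NaNs, and infinities as described above, then interpolates linearly between the two nearest ranks. The function name exact_percentile and the interpolation rule are our assumptions; Tecton's estimate may differ slightly.

```python
import math
from typing import Iterable, Optional


def exact_percentile(values: Iterable[Optional[float]], percentile: float) -> Optional[float]:
    """Exact percentile over finite values, using linear interpolation (assumed rule)."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be a float between 0.0 and 1.0")
    # Exclude nulls, NaNs, positive infinity, and negative infinity.
    finite = sorted(float(v) for v in values if v is not None and math.isfinite(v))
    if not finite:
        return None
    # Fractional rank of the requested percentile in the sorted values.
    position = percentile * (len(finite) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return finite[lower] + (finite[upper] - finite[lower]) * fraction


amounts = [17.97, 42.31, float("nan"), 41.07, float("inf"), 48.57, 58.61]
print(exact_percentile(amounts, 0.5))
```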
count
An aggregation function that returns, for a materialization time window, the
number of row values for a column, per entity value (such as a user_id value).
Input column types
- All types
Output column types
Int64
Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.
Usage
To use this aggregation, define an Aggregate object, using function="count",
in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("transaction_id", String), function="count", time_window=timedelta(days=30), name="transaction_count_1m")
Example Output
| user_id | transaction_count_1m |
|---|---|
| user_402539845901 | 22 |
| user_459842889956 | 13 |
| user_917975462998 | 8 |
| user_222506789984 | 44 |
| user_131340471060 | 3 |
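To illustrate what "per entity value, for a time window" means, the plain-Python sketch below counts rows per user_id over a trailing window. It is a conceptual model only: the function name windowed_count, the event tuple layout, and the half-open window [as_of - time_window, as_of) are assumptions made for this example, not a description of how Tecton computes or stores the aggregation.

```python
from collections import Counter
from datetime import datetime, timedelta


def windowed_count(events, as_of, time_window):
    """Count rows per entity whose timestamp falls in [as_of - time_window, as_of)."""
    start = as_of - time_window
    counts = Counter()
    for user_id, timestamp, _transaction_id in events:
        # Only rows inside the window contribute to the entity's count.
        if start <= timestamp < as_of:
            counts[user_id] += 1
    return dict(counts)


events = [
    ("user_a", datetime(2023, 11, 1), "t0"),  # outside the 30-day window
    ("user_a", datetime(2024, 1, 5), "t1"),
    ("user_a", datetime(2024, 1, 20), "t2"),
    ("user_b", datetime(2024, 1, 25), "t3"),
]
print(windowed_count(events, as_of=datetime(2024, 2, 1), time_window=timedelta(days=30)))
```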
first_distinct(n)
An aggregation function that returns, for a materialization time window, the
first N distinct row values for a column, per entity value (such as a user_id
value).
For example, if the first 2 distinct row values for a column, in the
materialization time window, are 10 and 20, then the function returns
[10,20].
The output sequence is in ascending order based on timestamp.
For Spark-based Feature Views, null input values are included in the output;
for a String input column, they're included as an empty String.
Input column types
String, Int64
Output column type
Array[String], Array[Int64]
Usage
Import this aggregation with
from tecton.aggregation_functions import first_distinct.
Then, define an Aggregate object, using function=first_distinct(n), where
n is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature
View.
Example
Aggregate(input_column=Field("merchant", String), function=first_distinct(1), time_window=timedelta(days=30), name="first_distinct_merchant_1m")
Example Output
| user_id | first_distinct_merchant_1m |
|---|---|
| user_337750317412 | [Bogisich Inc] |
| user_568801468984 | [Johnston, Nikolaus and Maggio] |
| user_650387977076 | [Hermiston, Pacocha and Smith] |
| user_884240387242 | [Kilback LLC] |
| user_337750317412 | [Connelly, Reichert and Fritsch] |
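The semantics described above can be modeled in a few lines of plain Python. This is a conceptual sketch for a single entity and a single window, not Tecton code; the helper name first_distinct_values and the (timestamp, value) row layout are assumptions for illustration.

```python
def first_distinct_values(rows, n):
    """Return the first n distinct values, ordered by ascending timestamp."""
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    seen = set()
    result = []
    # Walk the rows from oldest to newest and keep each value the first time it appears.
    for _timestamp, value in sorted(rows, key=lambda row: row[0]):
        if value not in seen:
            seen.add(value)
            result.append(value)
            if len(result) == n:
                break
    return result


rows = [(3, "Kilback LLC"), (1, "Bogisich Inc"), (2, "Bogisich Inc"), (4, "Will Ltd")]
print(first_distinct_values(rows, 2))
```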
first(n)
An aggregation function that returns, for a materialization time window, the
first N row values for a column, per entity value (such as a user_id value).
For example, if the first 2 row values for a column, in the materialization time
window, are 10 and 20, then the function returns [10,20].
The output sequence is in ascending order based on the timestamp.
For Spark-based Feature Views, null input values are included in the output.
Input column types
String, Int64, Float32, Float64, Bool, Array
Output column type
Array[InputType]
Usage
Import this aggregation with from tecton.aggregation_functions import first.
Then, define an Aggregate object, using function=first(n), where n is an
integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("merchant", String), function=first(2), time_window=timedelta(days=30), name="first_2_merchants_1m")
Example Output
| user_id | first_2_merchants_1m |
|---|---|
| user_26990816968 | [Dare-Marvin, Abernathy and Sons] |
| user_855115135598 | [Altenwerth, Cartwright and Koss] |
| user_26990816968 | [Kuhic LLC, Baumbach] |
| user_459842889956 | [Watsica, Haag and Considine] |
| user_644787199786 | [Beier LLC] |
last_distinct(n)
An aggregation function that returns, for a materialization time window, the
last N distinct row values for a column, per entity value (such as a user_id
value).
For example, if the last 2 distinct row values for a column, in the
materialization time window, are 10 and 20, then the function returns
[10,20].
The output sequence is in ascending order based on the timestamp.
For Spark-based Feature Views, null input values are included in the output;
for a String input column, they're included as an empty String.
Input column types
String, Int64
Output column type
Array[String], Array[Int64]
Usage
Import this aggregation with
from tecton.aggregation_functions import last_distinct.
Then, define an Aggregate object, using function=last_distinct(n), where n
is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature
View.
Example
Aggregate(input_column=Field("merchant", String), function=last_distinct(1), time_window=timedelta(days=30), name="last_distinct_merchant_1m")
Example Output
| user_id | last_distinct_merchant_1m |
|---|---|
| user_457435146833 | [Hayes and Nikolaus] |
| user_568801468984 | [Dickinson and Labadie] |
| user_884240387242 | [Harris and Bednar] |
| user_131340471060 | [Bechtelar-Rippin] |
| user_394495759023 | [Will Ltd] |
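As a conceptual model of last_distinct(n) for one entity and one window, the sketch below walks the rows from newest to oldest, keeps each value the first time it is seen, and then reverses the result so that it is in ascending timestamp order. Ordering a repeated value by its most recent occurrence is our assumption; the text above does not spell out that case. The helper name last_distinct_values is hypothetical.

```python
def last_distinct_values(rows, n):
    """Return the last n distinct values, in ascending order of latest occurrence."""
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    seen = set()
    newest_first = []
    # Walk from newest to oldest so each value is placed by its latest occurrence.
    for _timestamp, value in sorted(rows, key=lambda row: row[0], reverse=True):
        if value not in seen:
            seen.add(value)
            newest_first.append(value)
            if len(newest_first) == n:
                break
    # Reverse so the output is in ascending timestamp order.
    return newest_first[::-1]


rows = [(1, "Will Ltd"), (2, "Harris and Bednar"), (3, "Will Ltd"), (4, "Bechtelar-Rippin")]
print(last_distinct_values(rows, 2))
```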
last
An aggregation function that returns, for a materialization time window, the
last row value for a column, per entity value (such as a user_id value).
Input column types
Int64, Int32, Float64, Bool, String, Array
Output column type
InputType
Usage
To use this aggregation, define an Aggregate object, using function="last",
in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="last", time_window=timedelta(days=1))
last(n)
An aggregation function that returns, for a materialization time window, the
last N row values for a column, per entity value (such as a user_id value).
For example, if the last 2 row values for a column, in the materialization time
window, are 10 and 20, then the function returns [10,20].
The output sequence is in ascending order based on the timestamp.
For Spark-based Feature Views, null input values are included in the output.
Input column types
String, Int64, Float32, Float64, Bool, Array
Output column type
Array[InputType]
Usage
Import this aggregation with from tecton.aggregation_functions import last.
Then, define an Aggregate object using function=last(n), where n is an
integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("merchant", String), function=last(2), time_window=timedelta(days=30), name="last_2_merchants_1m")
Example Output
| user_id | last_2_merchants_1m |
|---|---|
| user_222506789984 | [Bahringer, Osinski and Block] |
| user_502567604689 | [Rowe, Batz and Goodwin] |
| user_650387977076 | [Heathcote LLC, Romaguera] |
| user_222506789984 | [Cartwright PLC, Hamill-D'Amore] |
| user_268308151877 | [Bode-Rempel, Bode-Rempel] |
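The difference between function="last" (a single value of the input type) and function=last(n) (an array, duplicates retained) can be modeled as follows. This is a plain-Python illustration for one entity and one window; the helper names last_n_values and last_value are hypothetical and not part of the Tecton SDK.

```python
def last_n_values(rows, n):
    """Return the last n values in ascending timestamp order; duplicates are kept."""
    if not 0 < n <= 1000:
        raise ValueError("n must be an integer > 0 and <= 1000")
    ordered = [value for _timestamp, value in sorted(rows, key=lambda row: row[0])]
    return ordered[-n:]


def last_value(rows):
    """Return the single most recent value, or None when there are no rows."""
    latest = last_n_values(rows, 1)
    return latest[0] if latest else None


rows = [(1, "Heathcote LLC"), (3, "Bode-Rempel"), (2, "Bode-Rempel")]
print(last_n_values(rows, 2))  # an array of values
print(last_value(rows))        # a single value
```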
max
An aggregation function that returns, for a materialization time window, the
maximum of the row values for a column, per entity value (such as a user_id
value).
Input column types
Int64, Int32, Float64, String
Output column type
InputType
Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.
Usage
To use this aggregation, define an Aggregate object, using function="max",
in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="max", time_window=timedelta(days=30), name="max_amt_1m")
Example Output
| user_id | max_amt_1m |
|---|---|
| user_337750317412 | 117.78 |
| user_402539845901 | 92.62 |
| user_445755474326 | 346.72 |
| user_459842889956 | 172.22 |
| user_461615966685 | 1266.84 |
mean
An aggregation function that returns, for a materialization time window, the
mean of the row values for a column, per entity value (such as a user_id
value).
Input column types
Int64, Int32, Float64
Output column type
Float64
Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.
Usage
To use this aggregation, define an Aggregate object, using function="mean",
in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="mean", time_window=timedelta(days=30), name="mean_amt_1m")
Example Output
| user_id | mean_amt_1m |
|---|---|
| user_402539845901 | 47.030000 |
| user_459842889956 | 61.790952 |
| user_461615966685 | 98.588571 |
| user_724235628997 | 77.376000 |
| user_782510788708 | 38.832000 |
min
An aggregation function that returns, for a materialization time window, the
minimum of the row values for a column, per entity value (such as a user_id
value).
Input column types
Int64, Int32, Float64, String
Output column type
InputType
Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.
Usage
To use this aggregation, define an Aggregate object, using function="min",
in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="min", time_window=timedelta(days=30), name="min_amt_1m")
Example Output
| user_id | min_amt_1m |
|---|---|
| user_917975462998 | 1.62 |
| user_459842889956 | 8.20 |
| user_699955105085 | 3.19 |
| user_884240387242 | 1.43 |
| user_568801468984 | 2.60 |
stddev_pop
An aggregation function that returns, for a materialization time window, the
standard deviation of the row values for a column around the population mean,
per entity value (such as a user_id value).
Input column types
Int64, Int32, Float64
Output column type
Float64
Usage
To use this aggregation, define an Aggregate object, using
function="stddev_pop", in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="stddev_pop", time_window=timedelta(days=30), name="stddev_pop_amt_1m")
Example Output
| user_id | stddev_pop_amt_1m |
|---|---|
| user_499975010057 | 39.042573 |
| user_950482239421 | 133.300228 |
| user_131340471060 | 65.114006 |
| user_212730160038 | 19.236001 |
| user_457435146833 | 70.824972 |
stddev_samp
An aggregation function that returns, for a materialization time window, the
standard deviation of the row values for a column around the sample mean, per
entity value (such as a user_id value).
Input column types
Int64, Int32, Float64
Output column type
Float64
Usage
To use this aggregation, define an Aggregate object, using
function="stddev_samp", in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="stddev_samp", time_window=timedelta(days=30), name="stddev_samp_amt_1m")
Example Output
| user_id | stddev_samp_amt_1m |
|---|---|
| user_268514844966 | 178.279950 |
| user_337750317412 | 34.012570 |
| user_884240387242 | 53.857238 |
| user_469998441571 | 78.482849 |
| user_538895124917 | 27.950343 |
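To see how stddev_pop and stddev_samp differ on the same data, the sketch below uses Python's standard statistics module as a stand-in: pstdev divides the squared deviations by n, while stdev divides by n - 1. This only illustrates the two definitions; it is not Tecton code, and how Tecton handles an entity with a single row is not covered here.

```python
import statistics

# Transaction amounts for one entity within one time window (made-up values).
amounts = [10.0, 20.0, 30.0, 40.0]

population_stddev = statistics.pstdev(amounts)  # divides by n
sample_stddev = statistics.stdev(amounts)       # divides by n - 1; needs at least 2 values

print(f"stddev_pop-style result:  {population_stddev:.6f}")
print(f"stddev_samp-style result: {sample_stddev:.6f}")
```

The sample version is always the larger of the two for n > 1, and the gap shrinks as the number of rows grows.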
sum
An aggregation function that returns, for a materialization time window, the sum
of the row values for a column, per entity value (such as a user_id value).
Input column types
Int64, Int32, Float64
Output column type
Int64 or Float64
Online serving returns string-encoded values for numeric aggregation functions to maintain precision consistency between online and offline stores.
Usage
To use this aggregation, define an Aggregate object, using function="sum",
in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="sum", time_window=timedelta(days=30), name="sum_amt_1m")
Example Output
| user_id | sum_amt_1m |
|---|---|
| user_131340471060 | 195.60 |
| user_538895124917 | 534.97 |
| user_916905857181 | 631.18 |
| user_930691958107 | 491.96 |
| user_131340471060 | 401.76 |
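Because online serving can return string-encoded values for numeric aggregations (see the note earlier in this section), client code may need to convert them back to numbers. The helper below is a hypothetical sketch, assuming each feature value arrives either as a string or as an already-decoded Python value; the name decode_numeric is ours, and the actual response format is not shown in this document.

```python
def decode_numeric(value):
    """Convert a string-encoded numeric feature value to int or float.

    Non-string values (including None) are returned unchanged.
    """
    if isinstance(value, str):
        try:
            return int(value)      # e.g. an Int64 sum or count
        except ValueError:
            return float(value)    # e.g. a Float64 sum or mean
    return value


print(decode_numeric("22"))
print(decode_numeric("195.60"))
```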
var_pop
An aggregation function that returns, for a materialization time window, the
variance of the row values for a column around the population mean, per entity
value (such as a user_id value).
Input column types
Int64, Int32, Float64
Output column type
Float64
Usage
To use this aggregation, define an Aggregate object, using
function="var_pop", in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="var_pop", time_window=timedelta(days=30), name="var_pop_amt_1m")
Example Output
| user_id | var_pop_amt_1m |
|---|---|
| user_855115135598 | 2331.988524 |
| user_457435146833 | 0.000000 |
| user_724235628997 | 2026.691776 |
| user_884240387242 | 9713.346553 |
| user_268514844966 | 1170.593603 |
var_samp
An aggregation function that returns, for a materialization time window, the
variance of the row values for a column around the sample mean, per entity value
(such as a user_id value).
Input column types
Int64, Int32, Float64
Output column type
Float64
Usage
To use this aggregation, define an Aggregate object, using
function="var_samp", in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Float64), function="var_samp", time_window=timedelta(days=30), name="var_samp_amt_1m")
Example Output
| user_id | var_samp_amt_1m |
|---|---|
| user_268308151877 | 2381.217252 |
| user_460877961787 | 14270.516080 |
| user_568801468984 | 4028.327196 |
| user_656020174537 | 1209.767752 |
| user_722584453020 | 16769.225087 |
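The relationship between the two variance definitions can be checked with Python's standard statistics module: for n values, the sample variance equals the population variance multiplied by n / (n - 1). The function names population_variance and sample_variance are ours; this is an illustration of the definitions, not Tecton code.

```python
import statistics


def population_variance(values):
    """Variance around the population mean (divides by n), like var_pop."""
    return statistics.pvariance(values)


def sample_variance(values):
    """Variance around the sample mean (divides by n - 1), like var_samp."""
    return statistics.variance(values)


amounts = [10.0, 20.0, 30.0, 40.0]
n = len(amounts)

print(population_variance(amounts))
print(sample_variance(amounts))
# The two differ only by the factor n / (n - 1).
print(population_variance(amounts) * n / (n - 1))
```

With a single value, the population variance is 0.0, whereas the sample variance is undefined (statistics.variance raises an error in that case).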