Aggregation Functions
Tecton's Aggregation Engine supports the following aggregations out of the box. All Compute Engines are supported except where explicitly noted.
Note that null feature values are excluded from the output unless noted otherwise.
approx_count_distinct(precision)​
An aggregation function that returns, for a materialization time window, the approximate number of distinct row values for a column, per entity value (such as a `user_id` value).
Not currently supported with:
- Tecton on Snowflake
- Serverless Feature Retrieval with Athena
Input column types: `String`, `Int32`, `Int64`
Output column type: `Int64`
Usage
Import this aggregation with `from tecton.aggregation_functions import approx_count_distinct`. Then, define an `Aggregate` object using `function=approx_count_distinct(precision)`, where `precision` is an integer >= 4 and <= 16, in a Batch Feature View or a Stream Feature View.
The `precision` parameter controls the accuracy of the approximation. A higher `precision` yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. The storage cost (in both the offline and online store) is proportional to `2^precision`. The standard error of the approximation is `1.04 / sqrt(2^precision)`. Here are the standard errors for several different values of `precision`:
| Precision | Standard Error |
|---|---|
| 4 | 26.0% |
| 6 | 13.0% |
| 8 | 6.5% |
| 10 | 3.3% |
| 12 | 1.6% |
| 14 | 0.8% |
| 16 | 0.4% |
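As a sanity check of the relationship above, the table values can be reproduced directly from the standard-error formula. The snippet below is a standalone illustration and does not use any Tecton API:

```python
import math

def hll_standard_error(precision: int) -> float:
    # Standard error of a HyperLogLog sketch with 2^precision registers.
    return 1.04 / math.sqrt(2 ** precision)

for p in (4, 8, 16):
    print(f"precision={p}: ~{hll_standard_error(p):.1%} standard error")
# precision=4: ~26.0% standard error
# precision=8: ~6.5% standard error
# precision=16: ~0.4% standard error
```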
The default value of `precision` is `8`. We recommend using the default `precision` unless extreme accuracy is important.
In general, the `approx_count_distinct` aggregation might not return the exact count. However, it is typically able to return the exact count for low-cardinality data (i.e. data with at most several hundred distinct elements), as long as the maximum precision (16) is used.
This aggregation uses the HyperLogLog algorithm.
Example
To use the default value of `precision=8`:
Aggregate(input_column=Field("address", String), function=approx_count_distinct(), time_window=timedelta(days=1))
To use a higher precision:
Aggregate(input_column=Field("address", String), function=approx_count_distinct(precision=10), time_window=timedelta(days=1))
approx_percentile(percentile, precision)​
An aggregation function that returns, for a materialization time window, a value that is approximately equal to the specified percentile, per entity value (such as a `user_id` value). For `Float32` and `Float64` input columns, NaNs, positive infinity, and negative infinity are excluded.
Not currently supported with:
- Tecton on Snowflake
- Serverless Feature Retrieval with Athena
Input column types: `Float32`, `Float64`, `Int32`, `Int64`
Output column type: `Float64`
Usage
Import this aggregation with `from tecton.aggregation_functions import approx_percentile`. Then, define an `Aggregate` object, using `function=approx_percentile(percentile, precision)`, where `percentile` is a float >= 0.0 and <= 1.0 and `precision` is an integer >= 20 and <= 500, in a Batch Feature View or a Stream Feature View.
The `precision` parameter controls the accuracy of the approximation. A higher `precision` yields lower error at the cost of more storage; the impact on performance (i.e. speed) is negligible. Specifically, the error rate of the estimate is inversely proportional to `precision`, and the storage cost is proportional to `precision`. The default value of `precision` is `100`. We recommend using the default `precision` unless extreme accuracy is important.
This aggregation uses the t-Digest algorithm.
This aggregation is not fully deterministic. Its final estimate depends on the order in which input data is processed. Therefore, for example, it is possible for `get_features_for_events()` to return different results when run twice, as Spark could shuffle the input data differently. Similarly, the feature server may return different results than the offline store, as there is no guarantee that the input data is processed in the exact same order. In practice, getting different results is rare, and when it does happen, the differences are extremely small.
This aggregation is computationally intensive. As a result, running `get_features_for_events()` or `get_features_in_range()` with `from_source=True` can be slow. If possible, we recommend waiting for offline materialization to finish and using `from_source=False`.
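For example, a training-data retrieval along these lines reads already-materialized aggregate values from the offline store rather than recomputing the t-Digest sketches from the raw source (a hedged sketch; `fv` and `events_df` are placeholders for your Feature View object and your events DataFrame):

```python
# Placeholder sketch: `fv` is a Feature View that uses approx_percentile,
# and `events_df` contains entity keys and timestamps for the events of interest.
training_data = fv.get_features_for_events(events_df, from_source=False).to_pandas()
```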
Example
To get the 50th percentile with the default value of `precision=100`:
Aggregate(input_column=Field("count", Int64), function=approx_percentile(percentile=0.5), time_window=timedelta(days=1))
To get the 99th percentile with extreme precision:
Aggregate(input_column=Field("count", Int64), function=approx_percentile(percentile=0.99, precision=500), time_window=timedelta(days=1))
count​
An aggregation function that returns, for a materialization time window, the number of row values for a column, per entity value (such as a `user_id` value).
Input column types: all types
Output column type: `Int64`
Usage
To use this aggregation, define an `Aggregate` object, using `function="count"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("transaction_id", String), function="count", time_window=timedelta(days=1))
first_distinct(n)​
An aggregation function that returns, for a materialization time window, the first N distinct row values for a column, per entity value (such as a `user_id` value). For example, if the first 2 distinct row values for a column in the materialization time window are `10` and `20`, then the function returns `[10, 20]`. The output sequence is in ascending order based on the timestamp.
For Spark-based Feature Views, null input values are included in the output; for a `String` input column, they're included as an empty `String`.
Not currently supported with:
- Tecton on Snowflake
- Serverless Feature Retrieval with Athena
Input column types: `String`, `Int64`
Output column types: `Array[String]`, `Array[Int64]`
Usage
Import this aggregation with `from tecton.aggregation_functions import first_distinct`. Then, define an `Aggregate` object, using `function=first_distinct(n)`, where `n` is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function=first_distinct(2), time_window=timedelta(days=1))
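To make the de-duplication and ordering behavior concrete, here is a small plain-Python sketch of the semantics described above (for illustration only; it is not Tecton code and the function name is hypothetical):

```python
def first_distinct_values(values, n):
    # Keep the first n distinct values, in event-timestamp order.
    seen, out = set(), []
    for v in values:  # `values` is assumed to be ordered by timestamp
        if v not in seen:
            seen.add(v)
            out.append(v)
            if len(out) == n:
                break
    return out

# Row values for one user_id within the window, in timestamp order: 10, 10, 20, 30
assert first_distinct_values([10, 10, 20, 30], n=2) == [10, 20]
```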
first(n)​
An aggregation function that returns, for a materialization time window, the first N row values for a column, per entity value (such as a `user_id` value). For example, if the first 2 row values for a column in the materialization time window are `10` and `20`, then the function returns `[10, 20]`. The output sequence is in ascending order based on the timestamp.
For Spark-based Feature Views, null input values are included in the output.
Not currently supported with:
- Serverless Feature Retrieval with Athena
Input column types: `String`, `Int64`, `Float32`, `Float64`, `Bool`, `Array`
Output column type: `Array[InputType]`
Usage
Import this aggregation with `from tecton.aggregation_functions import first`. Then, define an `Aggregate` object, using `function=first(n)`, where `n` is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64),, function=first(2), time_window=timedelta(days=1))
last_distinct(n)​
An aggregation function that returns, for a materialization time window, the last N distinct row values for a column, per entity value (such as a `user_id` value). For example, if the last 2 distinct row values for a column in the materialization time window are `10` and `20`, then the function returns `[10, 20]`. The output sequence is in ascending order based on the timestamp.
For Spark-based Feature Views, null input values are included in the output; for a `String` input column, they're included as an empty `String`.
Not currently supported with:
- Tecton on Snowflake
- Serverless Feature Retrieval with Athena
Input column types: `String`, `Int64`
Output column types: `Array[String]`, `Array[Int64]`
Usage
Import this aggregation with `from tecton.aggregation_functions import last_distinct`. Then, define an `Aggregate` object, using `function=last_distinct(n)`, where `n` is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function=last_distinct(2), time_window=timedelta(days=1))
last​
An aggregation function that returns, for a materialization time window, the last row value for a column, per entity value (such as a `user_id` value).
Not currently supported with:
- Tecton on Snowflake
- Serverless Feature Retrieval with Athena
Input column types: `Int64`, `Int32`, `Float64`, `Bool`, `String`, `Array`
Output column types: `Int64`, `Float64`, `Bool`, `String`, `Array`
Usage
To use this aggregation, define an `Aggregate` object, using `function="last"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="last", time_window=timedelta(days=1))
last(n)​
An aggregation function that returns, for a materialization time window, the last N row values for a column, per entity value (such as a `user_id` value). For example, if the last 2 row values for a column in the materialization time window are `10` and `20`, then the function returns `[10, 20]`. The output sequence is in ascending order based on the timestamp.
For Spark-based Feature Views, null input values are included in the output.
Not currently supported with:
- Tecton on Snowflake
- Serverless Feature Retrieval with Athena
Input column types: `String`, `Int64`, `Float32`, `Float64`, `Bool`, `Array`
Output column type: `Array[InputType]`
Usage
Import this aggregation with `from tecton.aggregation_functions import last`. Then, define an `Aggregate` object using `function=last(n)`, where `n` is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function=last(2), time_window=timedelta(days=1))
max​
An aggregation function that returns, for a materialization time window, the maximum of the row values for a column, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`, `String`
Output column types: `Int64`, `Float64`, `String`
Usage
To use this aggregation, define an `Aggregate` object, using `function="max"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="max", time_window=timedelta(days=1))
mean​
An aggregation function that returns, for a materialization time window, the mean of the row values for a column, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`
Output column type: `Float64`
Usage
To use this aggregation, define an `Aggregate` object, using `function="mean"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="mean", time_window=timedelta(days=1))
min​
An aggregation function that returns, for a materialization time window, the minimum of the row values for a column, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`, `String`
Output column types: `Int64`, `Float64`, `String`
Usage
To use this aggregation, define an `Aggregate` object, using `function="min"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="min", time_window=timedelta(days=1))
stddev_pop​
An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the population mean, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`
Output column type: `Float64`
Usage
To use this aggregation, define an `Aggregate` object, using `function="stddev_pop"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="stddev_pop", time_window=timedelta(days=1))
stddev_samp​
An aggregation function that returns, for a materialization time window, the standard deviation of the row values for a column around the sample mean, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`
Output column type: `Float64`
Usage
To use this aggregation, define an `Aggregate` object, using `function="stddev_samp"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="stddev_samp", time_window=timedelta(days=1))
sum​
An aggregation function that returns, for a materialization time window, the sum of the row values for a column, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`
Output column type: `Int64` or `Float64`
Usage
To use this aggregation, define an `Aggregate` object, using `function="sum"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="sum", time_window=timedelta(days=1))
var_pop​
An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the population mean, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`
Output column type: `Float64`
Usage
To use this aggregation, define an `Aggregate` object, using `function="var_pop"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="var_pop", time_window=timedelta(days=1))
var_samp​
An aggregation function that returns, for a materialization time window, the variance of the row values for a column around the sample mean, per entity value (such as a `user_id` value).
Input column types: `Int64`, `Int32`, `Float64`
Output column type: `Float64`
Usage
To use this aggregation, define an `Aggregate` object, using `function="var_samp"`, in a Batch Feature View or a Stream Feature View.
Example
Aggregate(input_column=Field("amt", Int64), function="var_samp", time_window=timedelta(days=1))