# Aggregation Functions

Tecton's **Aggregation Engine** supports the following aggregations out of the
box.

Support for custom aggregation functions is coming soon.

## approx_count_distinct(precision)

An aggregation function that returns, for a materialization time window, the
approximate number of distinct row values for a column, per entity value (such
as a `user_id` value). Null values are excluded.

**Not currently supported with:**

- Tecton on Snowflake
- Serverless Feature Retrieval with Athena

**Input column types**

- Tecton on Spark: `String`, `Int32`, `Int64`

**Output column types**

`Int64`

**Usage**

Import this aggregation with
`from tecton.aggregation_functions import approx_count_distinct`.

Then, define an `Aggregation` object using
`function=approx_count_distinct(precision)`, where `precision` is an integer
>= 4 and <= 16, in a Batch Feature View or a Stream Feature View.

The `precision` parameter controls the accuracy of the approximation. A higher
`precision` yields lower error at the cost of more storage; the impact on
performance (i.e. speed) is negligible. The storage cost (in both the offline
and online store) is proportional to `2^precision`. The standard error of the
approximation is `1.04 / sqrt(2^precision)`. Here are the standard errors for
several different values of `precision`:

| Precision | Standard Error |
| --- | --- |
| 4 | 26.0% |
| 6 | 13.0% |
| 8 | 6.5% |
| 10 | 3.3% |
| 12 | 1.6% |
| 14 | 0.8% |
| 16 | 0.4% |
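The table follows directly from the error formula; as a quick sanity check in
plain Python:

```python
import math

# Standard error of the approximation: 1.04 / sqrt(2^precision)
for precision in (4, 6, 8, 10, 12, 14, 16):
    stderr = 1.04 / math.sqrt(2 ** precision)
    print(f"precision={precision:2d}  standard error={stderr:.1%}")
```

This reproduces the standard-error column above, e.g. `1.04 / sqrt(2^8) = 6.5%`.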

The default value of `precision` is `8`. We recommend using the default
`precision` unless extreme accuracy is important.

In general, the `approx_count_distinct` aggregation might not return the exact
correct value. However, the aggregation is typically able to return the exact
correct value for low-cardinality data (i.e. data with at most several hundred
distinct elements), as long as the maximum precision (16) is used.

This aggregation uses the HyperLogLog algorithm.
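For intuition, here is a minimal HyperLogLog sketch in plain Python. This is an
illustration only, not Tecton's implementation; the function name and the
choice of MD5 as the hash are assumptions made for the example.

```python
import hashlib

def hll_estimate(items, precision=8):
    """Toy HyperLogLog cardinality estimate (illustration only)."""
    m = 2 ** precision
    registers = [0] * m
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        idx = h & (m - 1)   # low `precision` bits choose a register
        w = h >> precision  # remaining bits determine the rank
        rank = 1            # 1-based position of the lowest set bit
        while w & 1 == 0 and rank < 64:
            w >>= 1
            rank += 1
        registers[idx] = max(registers[idx], rank)
    # Bias-corrected harmonic mean of the per-register estimates
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# With precision=8 the standard error is ~6.5%, so an estimate over
# 1000 distinct values should land close to 1000.
print(round(hll_estimate(range(1000))))
```

Note that a production implementation also applies small- and large-range
corrections that this sketch omits.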

**Example**

`Aggregation(column="address", function=approx_count_distinct(), time_window=timedelta(days=1))`
to use the default value of `precision=8`.

`Aggregation(column="address", function=approx_count_distinct(precision=10), time_window=timedelta(days=1))`

## approx_percentile(percentile, precision)

An aggregation function that returns, for a materialization time window, a value
that is approximately equal to the specified percentile, per entity value (such
as a `user_id` value). Null values are excluded. For `Float32` and `Float64`
input columns, NaNs, positive infinity, and negative infinity are also excluded.

**Not currently supported with:**

- Tecton on Snowflake
- Serverless Feature Retrieval with Athena

**Input column types**

- Tecton on Spark: `Float32`, `Float64`, `Int32`, `Int64`

**Output column types**

`Float64`

**Usage**

Import this aggregation with
`from tecton.aggregation_functions import approx_percentile`.

Then, define an `Aggregation` object, using
`function=approx_percentile(percentile, precision)`, where `percentile` is a
float >= 0.0 and <= 1.0 and `precision` is an integer >= 20 and <= 500, in a
Batch Feature View or a Stream Feature View.

The `precision` parameter controls the accuracy of the approximation. A higher
`precision` yields lower error at the cost of more storage; the impact on
performance (i.e. speed) is negligible. Specifically, the error rate of the
estimate is inversely proportional to `precision`, and the storage cost is
proportional to `precision`. The default value of `precision` is `100`. We
recommend using the default `precision` unless extreme accuracy is important.

This aggregation uses the t-Digest algorithm.
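As a reference for what is being approximated: an exact (non-streaming)
percentile over a column's values can be computed as below. This is a plain
Python illustration using a simple nearest-index definition; it is not t-Digest
and not Tecton's implementation.

```python
def exact_percentile(values, percentile):
    """Exact percentile over a sequence; nulls excluded, as in the aggregation."""
    vals = sorted(v for v in values if v is not None)
    k = round(percentile * (len(vals) - 1))  # nearest index for the percentile
    return vals[k]

print(exact_percentile(range(101), 0.5))  # median of 0..100 → 50
```

t-Digest produces an estimate close to this exact value while using bounded
memory over streaming input.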

This aggregation is not fully deterministic. Its final estimate depends on the
order in which input data is processed. Therefore, for example, it is possible
for `get_historical_features` to return different results when run twice, as
Spark could shuffle the input data differently. Similarly, the feature server
may return different results than the offline store, as there is no guarantee
that the input data is processed in the exact same order. In practice, getting
different results is rare, and when it does happen, the differences are
extremely small.

This aggregation is computationally intensive. As a result, running
`get_historical_features` with `from_source=True` can be slow. If possible, we
recommend waiting for offline materialization to finish and using
`from_source=False`.

**Example**

`Aggregation(column="count", function=approx_percentile(percentile=0.5), time_window=timedelta(days=1))`
to get the 50th percentile with the default value of `precision=100`.

`Aggregation(column="count", function=approx_percentile(percentile=0.99, precision=500), time_window=timedelta(days=1))`
to get the 99th percentile with the maximum precision.

## count

An aggregation function that returns, for a materialization time window, the
number of row values for a column, per entity value (such as a `user_id`
value). Null values are excluded.

**Input column types**

- Tecton on Spark: All types
- Tecton on Snowflake: All types

**Output column types**

`Int64`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="count"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="transaction_id", function="count", time_window=timedelta(days=1))`

## first_distinct(n)

An aggregation function that returns, for a materialization time window, the
first N distinct row values for a column, per entity value (such as a `user_id`
value).

For example, if the first 2 distinct row values for a column, in the
materialization time window, are `10` and `20`, then the function returns
`[10,20]`.

The output sequence is in ascending order based on the timestamp.
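The semantics can be sketched in plain Python (an illustration only, assuming
rows arrive in ascending timestamp order; not Tecton's implementation):

```python
def first_distinct(rows, n):
    """First n distinct non-null values, in the order they first appear."""
    out = []
    for value in rows:  # rows assumed sorted by timestamp ascending
        if value is not None and value not in out:
            out.append(value)
            if len(out) == n:
                break
    return out

print(first_distinct([10, 20, 10, 30], 2))  # → [10, 20]
```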

**Not currently supported with:**

- Tecton on Snowflake
- Serverless Feature Retrieval with Athena

**Input column types**

`String`, `Int64`

**Output column type**

`Array[String]`, `Array[Int64]`

**Usage**

Import this aggregation with
`from tecton.aggregation_functions import first_distinct`.

Then, define an `Aggregation` object, using `function=first_distinct(n)`, where
`n` is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature
View.

**Example**

`Aggregation(column="amt", function=first_distinct(2), time_window=timedelta(days=1))`

## first(n)

An aggregation function that returns, for a materialization time window, the
first N row values for a column, per entity value (such as a `user_id` value).

For example, if the first 2 row values for a column, in the materialization time
window, are `10` and `20`, then the function returns `[10,20]`.

The output sequence is in ascending order based on the timestamp.

**Not currently supported with:**

- Serverless Feature Retrieval with Athena

**Input column types**

`String`, `Int64`, `Float32`, `Float64`, `Bool`, `Array`

**Output column type**

`Array[InputType]`

**Usage**

Import this aggregation with `from tecton.aggregation_functions import first`.

Then, define an `Aggregation` object, using `function=first(n)`, where `n` is
an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function=first(2), time_window=timedelta(days=1))`

## last_distinct(n)

An aggregation function that returns, for a materialization time window, the
last N distinct row values for a column, per entity value (such as a `user_id`
value).

For example, if the last 2 distinct row values for a column, in the
materialization time window, are `10` and `20`, then the function returns
`[10,20]`.

The output sequence is in ascending order based on the timestamp.
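One plausible plain-Python reading of these semantics (an illustration only,
not Tecton's implementation): keep each value's most recent occurrence, take
the n most recent, and return them in ascending timestamp order.

```python
def last_distinct(rows, n):
    """Last n distinct non-null values, in ascending timestamp order."""
    last_seen = {}
    for i, value in enumerate(rows):  # rows assumed sorted by timestamp
        if value is not None:
            last_seen[value] = i      # remember the most recent occurrence
    by_time = sorted(last_seen, key=last_seen.get)
    return by_time[-n:]

print(last_distinct([10, 20, 10, 30], 2))  # → [10, 30]
```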

**Not currently supported with:**

- Tecton on Snowflake
- Serverless Feature Retrieval with Athena

**Input column types**

`String`, `Int64`

**Output column type**

`Array[String]`, `Array[Int64]`

**Usage**

Import this aggregation with
`from tecton.aggregation_functions import last_distinct`.

Then, define an `Aggregation` object, using `function=last_distinct(n)`, where
`n` is an integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature
View.

**Example**

`Aggregation(column="amt", function=last_distinct(2), time_window=timedelta(days=1))`

## last

An aggregation function that returns, for a materialization time window, the
last row value for a column, per entity value (such as a `user_id` value).

**Not currently supported with:**

- Tecton on Snowflake
- Serverless Feature Retrieval with Athena

**Input column types**

`Int64`, `Int32`, `Float64`, `Bool`, `String`, `Array`

**Output column type**

`Int64`, `Float64`, `Bool`, `String`, `Array`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="last"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="last", time_window=timedelta(days=1))`

## last(n)

An aggregation function that returns, for a materialization time window, the
last N row values for a column, per entity value (such as a `user_id` value).

For example, if the last 2 row values for a column, in the materialization time
window, are `10` and `20`, then the function returns `[10,20]`.

The output sequence is in ascending order based on the timestamp.

**Not currently supported with:**

- Serverless Feature Retrieval with Athena

**Input column types**

`String`, `Int64`, `Float32`, `Float64`, `Bool`, `Array`

**Output column type**

`Array[InputType]`

**Usage**

Import this aggregation with `from tecton.aggregation_functions import last`.

Then, define an `Aggregation` object using `function=last(n)`, where `n` is an
integer > 0 and <= 1000, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function=last(2), time_window=timedelta(days=1))`

## max

An aggregation function that returns, for a materialization time window, the
maximum of the row values for a column, per entity value (such as a `user_id`
value).

**Input column types**

`Int64`, `Int32`, `Float64`, `String`

**Output column type**

`Int64`, `Float64`, `String`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="max"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="max", time_window=timedelta(days=1))`

## mean

An aggregation function that returns, for a materialization time window, the
mean of the row values for a column, per entity value (such as a `user_id`
value).

**Input column types**

`Int64`, `Int32`, `Float64`

**Output column type**

`Float64`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="mean"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="mean", time_window=timedelta(days=1))`

## min

An aggregation function that returns, for a materialization time window, the
minimum of the row values for a column, per entity value (such as a `user_id`
value).

**Input column types**

`Int64`, `Int32`, `Float64`, `String`

**Output column type**

`Int64`, `Float64`, `String`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="min"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="min", time_window=timedelta(days=1))`

## stddev_pop

An aggregation function that returns, for a materialization time window, the
standard deviation of the row values for a column around the population mean,
per entity value (such as a `user_id` value).

**Input column types**

`Int64`, `Int32`, `Float64`

**Output column type**

`Float64`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="stddev_pop"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="stddev_pop", time_window=timedelta(days=1))`

## stddev_samp

An aggregation function that returns, for a materialization time window, the
standard deviation of the row values for a column around the sample mean, per
entity value (such as a `user_id` value).

**Input column types**

`Int64`, `Int32`, `Float64`

**Output column type**

`Float64`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="stddev_samp"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="stddev_samp", time_window=timedelta(days=1))`
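The only difference between `stddev_pop` and `stddev_samp` is the divisor: n
for the population form, n - 1 for the sample form. A plain-Python sketch for
illustration (not Tecton's implementation):

```python
import math

def stddev(values, sample=False):
    """Population (divide by n) vs sample (divide by n - 1) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - 1 if sample else n))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(stddev(data))               # stddev_pop  → 2.0
print(stddev(data, sample=True))  # stddev_samp ≈ 2.138
```

The same divisor distinction applies to `var_pop` and `var_samp` below (the
variance is the square of the corresponding standard deviation).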

## sum

An aggregation function that returns, for a materialization time window, the sum
of the row values for a column, per entity value (such as a `user_id` value).

**Input column types**

`Int64`, `Int32`, `Float64`

**Output column type**

`Int64` or `Float64`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="sum"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="sum", time_window=timedelta(days=1))`

## var_pop

An aggregation function that returns, for a materialization time window, the
variance of the row values for a column around the population mean, per entity
value (such as a `user_id` value).

**Input column types**

`Int64`, `Int32`, `Float64`

**Output column type**

`Float64`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="var_pop"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="var_pop", time_window=timedelta(days=1))`

## var_samp

An aggregation function that returns, for a materialization time window, the
variance of the row values for a column around the sample mean, per entity value
(such as a `user_id` value).

**Input column types**

`Int64`, `Int32`, `Float64`

**Output column type**

`Float64`

**Usage**

To use this aggregation, define an `Aggregation` object, using
`function="var_samp"`, in a Batch Feature View or a Stream Feature View.

**Example**

`Aggregation(column="amt", function="var_samp", time_window=timedelta(days=1))`