Online Compaction: Usage Guide
This feature is currently in Private Preview.
- Must be enabled by Tecton Support.
- Available for Spark-based Feature Views -- coming to Rift in a future release.
- See additional limitations & requirements below.
Please see the Online Compaction: Overview for a conceptual overview of Online Compaction.
Enable Online Compaction for a Batch Feature View
- Set compaction_enabled=True on your Batch Feature View. Tecton will then schedule compaction jobs that compact offline/batch features into compacted tiles on a scheduled interval and materialize them to the online store.
NOTE: Your tecton_materialization_runtime must be 0.8.2 or higher.
Example (without Aggregations):
from tecton import batch_feature_view, Attribute
from tecton.types import Int64
from datetime import timedelta, datetime


# `transactions` (a batch data source) and `user` (an entity) are assumed
# to be defined elsewhere in your feature repository.
@batch_feature_view(
    sources=[transactions],
    mode="spark_sql",
    entities=[user],
    feature_start_time=datetime(2022, 5, 1),
    batch_schedule=timedelta(days=1),
    online=True,
    offline=True,
    compaction_enabled=True,  # turn on Online Compaction
    tecton_materialization_runtime="1.0.0",
    timestamp_field="timestamp",
    features=[Attribute(name="amount", dtype=Int64)],
)
def user_average_transaction_amount(transactions):
    return f"SELECT user_id, timestamp, amount FROM {transactions}"

Example (with Aggregations):
from tecton import batch_feature_view, Aggregate, LifetimeWindow, TimeWindow
from tecton.types import Field, Float64
from datetime import timedelta, datetime


# `transactions` (a batch data source) and `user` (an entity) are assumed
# to be defined elsewhere in your feature repository.
@batch_feature_view(
    sources=[transactions],
    mode="spark_sql",
    entities=[user],
    timestamp_field="timestamp",
    features=[
        # Lifetime sum, computed from lifetime_start_time onward.
        Aggregate(input_column=Field("amount", Float64), function="sum", time_window=LifetimeWindow()),
        # Rolling 7-day sum.
        Aggregate(
            input_column=Field("amount", Float64), function="sum", time_window=TimeWindow(window_size=timedelta(days=7))
        ),
    ],
    feature_start_time=datetime(2022, 5, 1),
    lifetime_start_time=datetime(2022, 4, 1),
    batch_schedule=timedelta(days=1),
    online=True,
    offline=True,
    compaction_enabled=True,  # turn on Online Compaction
    tecton_materialization_runtime="1.0.0",
)
def user_average_transaction_amount(transactions):
    return f"SELECT user_id, timestamp, amount FROM {transactions}"
Enable Online Compaction for a Stream Feature View
- Set compaction_enabled=True on your Stream Feature View. Tecton will then schedule compaction jobs that compact offline/batch features into compacted tiles on a scheduled interval and materialize them to the online store.
- Optionally set stream_tiling_enabled (defaults to False). See the Stream Tiling section for the implications.
NOTE: Stream Feature Views with compaction enabled must use tecton_materialization_runtime="1.0.0" or higher.
from tecton import stream_feature_view, FilteredSource, Aggregate, LifetimeWindow, TimeWindow
from tecton.types import Field, Bool
from datetime import timedelta, datetime


# `stream` (a stream data source) and `user` (an entity) are assumed
# to be defined elsewhere in your feature repository.
@stream_feature_view(
    source=FilteredSource(stream),
    entities=[user],
    mode="pyspark",
    online=True,
    offline=True,
    timestamp_field="timestamp",
    features=[
        # Lifetime click count, computed from lifetime_start_time onward.
        Aggregate(input_column=Field("clicked", Bool), function="count", time_window=LifetimeWindow()),
        # Rolling 7-day click count; both aggregates read the "clicked" column selected below.
        Aggregate(
            input_column=Field("clicked", Bool), function="count", time_window=TimeWindow(window_size=timedelta(days=7))
        ),
    ],
    feature_start_time=datetime(2024, 3, 1),
    lifetime_start_time=datetime(2024, 2, 1),
    batch_schedule=timedelta(days=1),
    compaction_enabled=True,  # turn on Online Compaction
    tecton_materialization_runtime="1.0.0",
)
def user_click_counts(ad_impressions):
    return ad_impressions.select(ad_impressions["user_uuid"].alias("user_id"), "clicked", "timestamp")
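The runtime requirements noted above (0.8.2 or higher for Batch Feature Views, 1.0.0 or higher for Stream Feature Views) can be checked before applying a change. The following is a minimal sketch, not part of the Tecton SDK: `runtime_supports_compaction` and `MIN_RUNTIME` are hypothetical names, and the parser assumes plain dotted numeric versions such as "1.0.0" (no pre-release suffixes).

```python
# Hypothetical pre-flight check; not part of the Tecton SDK.
# Minimum tecton_materialization_runtime per Feature View type, as stated in this guide.
MIN_RUNTIME = {"batch": (0, 8, 2), "stream": (1, 0, 0)}


def parse_version(version: str) -> tuple:
    """Parse a dotted numeric version ("1.0.0") into a 3-part integer tuple."""
    parts = tuple(int(part) for part in version.split("."))
    # Pad short versions so "1.0" compares equal to "1.0.0".
    return parts + (0,) * (3 - len(parts))


def runtime_supports_compaction(runtime: str, view_type: str) -> bool:
    """Return True if `runtime` meets the minimum for "batch" or "stream"."""
    return parse_version(runtime) >= MIN_RUNTIME[view_type]


print(runtime_supports_compaction("0.8.2", "batch"))   # prints True
print(runtime_supports_compaction("0.8.2", "stream"))  # prints False
```

Comparing integer tuples rather than raw strings avoids ordering "0.10.0" before "0.8.2".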
Stream Tiling
Stream Tiling can be enabled on Stream Feature Views by setting the stream_tiling_enabled parameter to True.
Stream tiling is recommended for use cases with hot keys, i.e. keys that may receive thousands of events per day. Stream tiling can substantially reduce online write, read, and storage costs for these use cases. However, stream tiling will slightly reduce data freshness due to micro-batching, so it is not recommended for use cases that would not benefit from streaming compaction.
Stream Tile Size
Tecton automatically determines the size of the stream tile interval based on the smallest aggregation window across all columns in the Feature View.
Smallest Aggregation Window | Stream Tile Size |
---|---|
(0, 1h) | 1m |
[1h, 10h) | 5m |
[10h, Lifetime) | 1h |
For example, if a Stream Feature View has a 30-minute aggregation of column foo and a 12-hour aggregation of column bar, then the Stream Feature View will use 1-minute stream tiles for both foo and bar, because the smallest window (30 minutes) falls in the (0, 1h) range.
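The tile-size table can be expressed as a small lookup. The following is a minimal sketch for illustration, not Tecton's implementation: `stream_tile_size` is a hypothetical helper, a window of `None` stands for a lifetime window, and placing a Feature View that has only lifetime windows in the last row (1-hour tiles) is an assumption.

```python
from datetime import timedelta
from typing import Iterable, Optional


def stream_tile_size(windows: Iterable[Optional[timedelta]]) -> timedelta:
    """Return the stream tile size for a Feature View's aggregation windows.

    Hypothetical helper mirroring the table above; None means a lifetime window.
    """
    finite = [w for w in windows if w is not None]
    if any(w <= timedelta(0) for w in finite):
        raise ValueError("aggregation windows must be positive")
    # Assumption: with only lifetime windows, the last table row applies.
    if not finite:
        return timedelta(hours=1)
    smallest = min(finite)
    if smallest < timedelta(hours=1):   # (0, 1h)
        return timedelta(minutes=1)
    if smallest < timedelta(hours=10):  # [1h, 10h)
        return timedelta(minutes=5)
    return timedelta(hours=1)           # [10h, Lifetime)


# The foo/bar example: a 30-minute and a 12-hour aggregation.
print(stream_tile_size([timedelta(minutes=30), timedelta(hours=12)]))  # prints 0:01:00
```

The tile size is chosen once per Feature View from the smallest window, so every aggregated column shares it.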
The stream tile size does not impact the freshness of the feature view. Freshness is always determined by a 30-second micro-batch interval, along with any additional processing time required by the stream processor.
For example, if you configure a lifetime aggregation with a 1-hour tiling interval, this does not mean the freshness will be 1 hour. Instead, it will be at least 30 seconds plus the extra processing time.
Stream Feature View: Sawtooth Window Fuzziness
Online compaction uses Sawtooth Windows to achieve excellent performance and freshness for Stream Feature Views at the cost of some window "fuzziness".
Tecton determines the window fuzziness based on the intervals below. Window fuzziness is always less than or equal to 10% of the window size.
Sawtooth Window Fuzziness by Window Size
Aggregation Window Size | Stream Tiling Enabled | Fuzziness |
---|---|---|
(0, 2d) | True | Stream Tile Size |
(0, 2d) | False | None |
[2d, 10d] | True or False | 1h |
(10d, Lifetime) | True or False | 1d |
Lifetime | True or False | None |
For example, if you have a Stream Feature View with stream tiling disabled with 1-day, 7-day, and 30-day window aggregations, then the 1-day aggregation will not have any fuzziness, the 7-day aggregation window will vary between 7d and 7d+1h, and the 30-day aggregation window will vary between 30d and 31d depending on how far the stream has progressed.
If stream tiling were enabled for that feature view, then the stream tile size would be 1h (see above), and the 1-day window would vary between 1d and 1d+1h. The fuzziness of the larger windows would not be affected by stream tiling.
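The fuzziness table can be read as a lookup on window size and stream tile size. The sketch below is illustrative only and is not Tecton's implementation: `sawtooth_fuzziness` is a hypothetical helper, a window of `None` stands for a lifetime window, `stream_tile=None` stands for stream tiling being disabled, and a fuzziness of "None" in the table is represented as a zero `timedelta`.

```python
from datetime import timedelta
from typing import Optional


def sawtooth_fuzziness(
    window: Optional[timedelta], stream_tile: Optional[timedelta] = None
) -> timedelta:
    """Return the maximum window fuzziness per the table above.

    Hypothetical helper: window=None is a lifetime window and
    stream_tile=None means stream tiling is disabled.
    """
    if window is None:  # Lifetime: no fuzziness
        return timedelta(0)
    if window <= timedelta(0):
        raise ValueError("aggregation window must be positive")
    if window < timedelta(days=2):  # (0, 2d): tile size if tiling is on
        return stream_tile if stream_tile is not None else timedelta(0)
    if window <= timedelta(days=10):  # [2d, 10d]
        return timedelta(hours=1)
    return timedelta(days=1)  # (10d, Lifetime)


# The 1-day, 7-day, and 30-day example, without and with 1-hour stream tiles.
for days in (1, 7, 30):
    window = timedelta(days=days)
    print(days, sawtooth_fuzziness(window), sawtooth_fuzziness(window, timedelta(hours=1)))
```

Only the 1-day window changes between the two columns, which matches the statement that stream tiling does not affect the fuzziness of the larger windows.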
Performance Benefits of Compaction
More detailed benchmarking is still in progress and will come soon. However, here are some preliminary benchmarking results.
This is a basic benchmark testing low QPS load on a DynamoDB-backed Stream Feature View with Sum aggregations of 2 different window sizes.
Agg Size | Latency Reduction | Read Size Reduction |
---|---|---|
100d | ~80% | 99%+ |
300d | ~85% | 99%+ |
Enabling Online Compaction for Existing Feature Views
Please see Upgrading Existing Feature Views.
Limitations
- Only available for Feature Views using DynamoDB.
- Compaction for Rift and Ingest API is coming soon.
- Approximate count distinct and approximate percentile are not yet supported for Stream Feature Views with time window aggregates; support is coming soon.
- Support for TimeWindowSeries is coming soon for Batch and Stream Feature Views.
- Offset Windows are supported for Batch Feature Views; support for Stream Feature Views is coming soon.