tecton.batch_feature_view
Summary
Declare a Batch Feature View.
Parameters
- name (Optional[str]) – Unique, human friendly name that identifies the FeatureView. Defaults to the function name.
- description (Optional[str]) – A human readable description. (Default: None)
- owner (Optional[str]) – Owner name (typically the email of the primary maintainer). (Default: None)
- tags (Optional[Dict[str, str]]) – Tags associated with this Tecton Object (key-value pairs of arbitrary metadata). (Default: None)
- prevent_destroy (bool) – If True, this Tecton object will be blocked from being deleted or re-created (i.e. a destructive update) during tecton plan/apply. To remove or update this object, prevent_destroy must be first set to False via the same tecton apply or a separate tecton apply. prevent_destroy can be used to prevent accidental changes such as inadvertently deleting a Feature Service used in production or recreating a Feature View that triggers expensive rematerialization jobs. prevent_destroy also blocks changes to dependent Tecton objects that would trigger a recreate of the tagged object, e.g. if prevent_destroy is set on a Feature Service, that will also prevent deletions or re-creates of Feature Views used in that service. prevent_destroy is only enforced in live (i.e. non-dev) workspaces. (Default: False)
- mode (str) – Whether the annotated function is a pipeline function (“pipeline” mode) or a transformation function (“spark_sql”, “pyspark”, “snowflake_sql”, “snowpark”, or “athena” mode). For the non-pipeline mode, an inferred transformation will also be registered.
- sources (Sequence[Union[BatchSource, FilteredSource]]) – The data source inputs to the feature view.
- entities (Sequence[Entity]) – The entities this feature view is associated with.
- aggregation_interval (Optional[timedelta]) – How frequently the feature values are updated (for example, “1h” or “6h”). Only applicable when using aggregations. (Default: None)
- aggregations (Optional[Sequence[Aggregation]]) – A list of Aggregation structs. (Default: None)
- online (bool) – Whether the feature view should be materialized to the online feature store. (Default: False)
- offline (bool) – Whether the feature view should be materialized to the offline feature store. (Default: False)
- ttl (Optional[timedelta]) – The TTL (or “look back window”) for features defined by this feature view. This parameter determines how long features will live in the online store and how far to “look back” relative to a training example’s timestamp when generating offline training sets. TTL should not be set for features with aggregations, since feature expiration is determined by the aggregation_interval. The default value is None, meaning no feature data will expire from the online store. When generating offline training datasets, the window to "look back" relative to the training example's timestamp will begin at the feature start time. (Default: None)
- feature_start_time (Optional[datetime]) – When materialization for this feature view should start from. Required if either online or offline is true. (Default: None)
- batch_trigger (BatchTriggerType) – Defines the mechanism for initiating batch materialization jobs. One of BatchTriggerType.SCHEDULED or BatchTriggerType.MANUAL. The default value is BatchTriggerType.SCHEDULED, where Tecton will run materialization jobs based on the schedule defined by the batch_schedule parameter. If set to BatchTriggerType.MANUAL, then batch materialization jobs must be explicitly initiated by the user through either the Tecton SDK or Airflow operator.
- manual_trigger_backfill_end_time (Optional[datetime]) – When backfill materialization for manually-triggered batch feature view should end. (Default: None)
- batch_schedule (Optional[timedelta]) – The interval at which batch materialization should be scheduled. The batch schedule must not include fractional seconds. (Default: None)
- online_serving_index (Optional[Sequence[str]]) – (Advanced) Defines the set of join keys that will be indexed and queryable during online serving. (Default: None)
- batch_compute (Union[DatabricksClusterConfig, EMRClusterConfig, DatabricksJsonClusterConfig, EMRJsonClusterConfig, None]) – Configuration for the batch materialization cluster. (Default: None)
- offline_store (Union[ParquetConfig, DeltaConfig, None]) – Configuration for how data is written to the offline feature store. (Default: ParquetConfig(subdirectory_override=None))
- online_store (Union[DynamoConfig, RedisConfig, None]) – Configuration for how data is written to the online feature store. (Default: None)
- monitor_freshness (bool) – If true, enables monitoring when feature data is materialized to the online feature store. (Default: False)
- expected_feature_freshness (Optional[timedelta]) – Threshold used to determine if recently materialized feature data is stale. Data is stale if now - most_recent_feature_value_timestamp > expected_feature_freshness. For feature views using Tecton aggregations, data is stale if now - round_up_to_aggregation_interval(most_recent_feature_value_timestamp) > expected_feature_freshness, where round_up_to_aggregation_interval() rounds up the feature timestamp to the end of the aggregation_interval. Value must be at least 2 times aggregation_interval. If not specified, a value determined by the Tecton backend is used. (Default: None)
- alert_email (Optional[str]) – Email that alerts for this FeatureView will be sent to. (Default: None)
- data_quality_enabled (bool) – Enables Data Quality Metrics. (Default: True)
- skip_default_expectations (bool) – Skips default Data Quality Validation. (Default: False)
- timestamp_field (Optional[str]) – The column name that refers to the timestamp for records that are produced by the feature view. This parameter is optional if exactly one column is a Timestamp type. This parameter is required if using Tecton on Snowflake without Snowpark. (Default: None)
- max_backfill_interval (Optional[timedelta]) – (Advanced) The time interval for which each backfill job will run to materialize feature data. This affects the number of backfill jobs that will run, which is (<feature registration time> - feature_start_time) / max_backfill_interval. Configuring the max_backfill_interval parameter appropriately will help to optimize large backfill jobs. If this parameter is not specified, then 10 backfill jobs will run (the default).
- max_batch_aggregation_interval (Optional[timedelta]) – Deprecated. Use max_backfill_interval instead.
- incremental_backfills (bool) – This value cannot be set to True when aggregations is set. If set to True, the feature view will be backfilled one interval at a time as if it had been updated “incrementally” since its feature_start_time. For example, if batch_schedule is 1 day and feature_start_time is 1 year prior to the current time, then the backfill will run 365 separate backfill queries to fill the historical feature data. (Default: False)
- tecton_materialization_runtime (Optional[str]) – Version of Tecton Materialization libraries to be used for your jobs. Only available on 0.7.11 and later patches. (Default: None)
- options (Optional[Dict[str, str]]) – A map of additional batch feature view options. (Default: None)
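The backfill arithmetic described for max_backfill_interval and incremental_backfills can be sketched in plain Python. This is only an illustration of the formula quoted above, not Tecton's implementation: the helper name estimate_backfill_jobs is hypothetical, and rounding a partial interval up to a whole job is an assumption, since the formula does not say how remainders are handled.

```python
import math
from datetime import datetime, timedelta
from typing import Optional


def estimate_backfill_jobs(
    registration_time: datetime,
    feature_start_time: datetime,
    max_backfill_interval: Optional[timedelta] = None,
    default_jobs: int = 10,
) -> int:
    """Hypothetical helper: estimate how many backfill jobs would run.

    Applies (registration time - feature_start_time) / max_backfill_interval.
    Assumption: a partial trailing interval still needs its own job, so the
    quotient is rounded up.
    """
    if max_backfill_interval is None:
        # The parameter description states that 10 jobs run when unset.
        return default_jobs
    if max_backfill_interval <= timedelta(0):
        raise ValueError("max_backfill_interval must be positive")
    span = registration_time - feature_start_time
    # Dividing one timedelta by another yields a float ratio.
    return math.ceil(span / max_backfill_interval)


start = datetime(2023, 1, 1)
registered = datetime(2024, 1, 1)  # exactly 365 days after start

# One interval per day, matching the incremental_backfills example above.
daily_jobs = estimate_backfill_jobs(registered, start, timedelta(days=1))

# Larger intervals mean fewer, bigger jobs.
monthly_jobs = estimate_backfill_jobs(registered, start, timedelta(days=30))
```

With a one-day interval over one non-leap year the estimate is 365, which agrees with the incremental_backfills description; a 30-day interval gives 13 under the round-up assumption.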
Returns
An object of type tecton.BatchFeatureView.
Examples
Example 1
from datetime import datetime
from datetime import timedelta

from fraud.entities import user
from fraud.data_sources.credit_scores_batch import credit_scores_batch
from tecton import batch_feature_view, FilteredSource


# Materialize daily to both stores; values expire after the 60-day TTL.
@batch_feature_view(
    sources=[FilteredSource(credit_scores_batch)],
    entities=[user],
    mode="spark_sql",
    online=True,
    offline=True,
    feature_start_time=datetime(2020, 10, 10),
    batch_schedule=timedelta(days=1),
    ttl=timedelta(days=60),
    description="Features about the user's most recent transaction in the past 60 days. Updated daily.",
)
def user_last_transaction_features(credit_scores_batch):
    # In spark_sql mode the function returns a SQL string; the source is
    # interpolated into the query by name.
    return f"""
        SELECT
            USER_ID,
            TIMESTAMP,
            AMOUNT as LAST_TRANSACTION_AMOUNT,
            CATEGORY as LAST_TRANSACTION_CATEGORY
        FROM
            {credit_scores_batch}
        """
Example 2
Example BatchFeatureView declaration using aggregations:
from datetime import datetime
from datetime import timedelta

from fraud.entities import user
from fraud.data_sources.credit_scores_batch import credit_scores_batch
from tecton import batch_feature_view, Aggregation, FilteredSource


# ttl is not set here: the parameter reference advises against it when
# aggregations are used.
@batch_feature_view(
    sources=[FilteredSource(credit_scores_batch)],
    entities=[user],
    mode="spark_sql",
    online=True,
    offline=True,
    feature_start_time=datetime(2020, 10, 10),
    aggregations=[
        Aggregation(column="amount", function="mean", time_window=timedelta(days=1)),
        Aggregation(column="amount", function="mean", time_window=timedelta(days=30)),
    ],
    aggregation_interval=timedelta(days=1),
    description="Mean transaction amount over 1-day and 30-day windows, updated daily.",
)
def user_recent_transaction_aggregate_features(credit_scores_batch):
    # Return the raw rows; Tecton applies the aggregations declared above.
    return f"""
        SELECT
            USER_ID,
            AMOUNT,
            TIMESTAMP
        FROM
            {credit_scores_batch}
        """