Skip to main content
Version: Beta ๐Ÿšง

Publish Full Features to a Warehouse

Public Preview

This feature is currently in Public Preview.

If features are published to Tecton's offline store, users can read features directly from S3 or set up external tables using a data warehouse or analytics environment of choice.

Exploring Published Features in Analytics Environmentsโ€‹

Storage Locationโ€‹

To find the storage location of published features, use fv.published_features_path or fv.summary():

import tecton

workspace = tecton.get_workspace("prod")
fv = workspace.get_feature_view("user_transaction_metrics")
fv.published_features_path
> "s3://dataplane/offline-store/ws/prod/published-features/045asdfa34/data"

External Tablesโ€‹

Once features are published, users can set up external tables in a data warehouse of choice to query them directly.

Published features are stored as a Delta table and are partitioned based on their _valid_from timestamp, which represents the time that the feature is available for online serving.

Depending on the data warehouse, users may need to enumerate the schema when creating the external table. In addition to the feature columns, the table also contains the feature validity range defined by _valid_from and _valid_to columns, as well as the time_partition column, which partitions the features by their _valid_from timestamp. Adding time_partition boundaries to the query predicate can significantly improve query performance.

Snowflakeโ€‹

Snowflake needs read access to the data lake where the published features are stored. For this, create a Snowflake Storage Integration by following this doc

CREATE STORAGE INTEGRATION tecton_s3_int_read_only
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::001234567890:role/myrole'
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('s3://dataplane/offline-store/ws/prod/published-features/');

Then, create an external table using the storage integration and the location of the published features:

CREATE OR REPLACE STAGE user_transaction_metrics_ext_stage
URL = 's3://dataplane/offline-store/ws/prod/published-features/045asdfa34/data'
STORAGE_INTEGRATION = tecton_s3_int_read_only
FILE_FORMAT = (TYPE = PARQUET);

Finally, create an external table to query the features:

CREATE OR REPLACE EXTERNAL TABLE user_transaction_metrics (
USER_ID VARCHAR as (value:user_id::varchar),
AMT_MEAN_1D_1D NUMBER as (value:amt_mean_1d_1d::number),
AMT_MEAN_3D_1D NUMBER as (value:amt_mean_3d_1d::number),
AMT_MEAN_7D_1D NUMBER as (value:amt_mean_7d_1d::number),
TRANSACTION_COUNT_1D_1D NUMBER as (value:transaction_count_1d_1d::number),
TRANSACTION_COUNT_3D_1D NUMBER as (value:transaction_count_3d_1d::number),
TRANSACTION_COUNT_7D_1D NUMBER as (value:transaction_count_7d_1d::number),
VALID_FROM TIMESTAMP_NTZ as (value:_valid_from::TIMESTAMP_NTZ),
VALID_TO TIMESTAMP_NTZ as (value:_valid_to::TIMESTAMP_NTZ),
TIME_PARTITION DATE as (
try_cast(
split_part(
split_part(metadata$filename, 'time_partition=', 2),
'/',
1
) as DATE
)
)
)
PARTITION BY (TIME_PARTITION)
WITH LOCATION=@user_transaction_metrics_ext_stage
REFRESH_ON_CREATE = false
AUTO_REFRESH = false
FILE_FORMAT = (type = parquet)
TABLE_FORMAT = DELTA;

Tecton is currently implementing a Snowflake integration to automatically create and refresh the external table. For now, users need to manually refresh the external table to keep Snowflake up-to-date with the Delta transaction logs:

ALTER EXTERNAL TABLE user_transaction_metrics REFRESH;

Amazon Athenaโ€‹

Assuming Athena has read access to the data lake where the published features are stored, create an Athena external table to query the features:

CREATE EXTERNAL TABLE default.user_transaction_metrics
LOCATION 's3://dataplane/offline-store/ws/prod/published-features/045asdfa34/data'
TBLPROPERTIES ('table_type' = 'DELTA')

Redshiftโ€‹

Users can use Amazon Redshift Spectrum to query data directly from files on Amazon S3.

For more information, see Getting started with Amazon Redshift Spectrum and Redshift Spectrum to Delta Lake integration guides.

After creating an IAM role for Redshift Spectrum, create an external schema:

create external schema tecton_ext_features
from data catalog
database 'tecton_features_db'
iam_role 'arn:aws:iam::1234567:role/tecton_spectrum_role'
create external database if not exists;

Then, create an external table to query the features:

CREATE EXTERNAL TABLE tecton_ext_features.user_transaction_metrics (
user_id VARCHAR,
amt_mean_1d_1d DOUBLE PRECISION,
amt_mean_3d_1d DOUBLE PRECISION,
amt_mean_7d_1d DOUBLE PRECISION,
transaction_count_1d_1d BIGINT,
transaction_count_3d_1d BIGINT,
transaction_count_7d_1d BIGINT,
_valid_from TIMESTAMP,
_valid_to TIMESTAMP
)
PARTITIONED BY(time_partition DATE)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://dataplane/offline-store/ws/prod/published-features/045asdfa34/data/_symlink_format_manifest';

Add the partitions explicitly to the Redshift Spectrum table or use Glue crawler or scheduled jobs to automatically add partitions when they are available in s3:

ALTER TABLE tecton_ext_features.user_transaction_metrics
ADD IF NOT EXISTS PARTITION(time_partition='2023-01-01')
LOCATION
's3://dataplane/offline-store/ws/prod/published-features/045asdfa34/data/_symlink_format_manifest/time_partition=2023-01-01/manifest';

Pythonโ€‹

Users can also use the deltalake Python library (docs) to read the published features directly from the data lake using only Python.

from deltalake import DeltaTable

table_uri = "s3://dataplane/offline-store/ws/prod/published-features/045asdfa34/data"
storage_options = {
"AWS_ACCESS_KEY_ID": "xxx",
"AWS_SECRET_ACCESS_KEY": "xxx",
"AWS_SESSION_TOKEN": "xxx",
}

table = DeltaTable(table_uri, storage_options)
table.to_pandas()
...
table.to_pyarrow_table()

Spark SQLโ€‹

publish_store = "s3://dataplane/offline-store/ws/prod/published-features/045asdfa34/data"
df = spark.sql(f"SELECT * FROM delta.`{publish_store}`")

Was this page helpful?