Test Batch Features
Import libraries and select your workspace
import tecton
import pandas
from datetime import datetime, timedelta
ws = tecton.get_workspace("prod")
Load a Batch Feature View
fv = ws.get_feature_view("user_transaction_counts")
fv.summary()
Run a Feature View transformation pipeline
The BatchFeatureView::run_transformation function dry-runs a Feature View
transformation pipeline over a given time range. This can be useful for
checking the output of your feature transformation logic or debugging a
materialization job.
The output is not guaranteed to match the feature values that would be created for this time range, for example in the following cases:
- When using incremental backfills, feature data for a given time range may depend on multiple executions of the Feature View transformation pipeline.
- Feature values may depend on scheduling information (e.g. batch_schedule, data_delay, feature_start_time) that doesn't match the start_time and end_time you provide.
- Aggregations may require more input data than the window you provide with start_time and end_time.
If you want to produce feature values for a given time range, you should use get_features_in_range(start_time, end_time).
result_dataframe = fv.run_transformation(start_time=datetime(2021, 1, 1), end_time=datetime(2022, 1, 2)).to_pandas()
display(result_dataframe)
| | user_id | signup_timestamp | credit_card_issuer |
---|---|---|---|
0 | user_600003278485 | 2021-01-01 06:25:57 | other |
1 | user_469998441571 | 2021-01-01 07:16:06 | Visa |
2 | user_502567604689 | 2021-01-01 04:39:10 | Visa |
3 | user_930691958107 | 2021-01-01 10:52:31 | Visa |
4 | user_782510788708 | 2021-01-01 20:15:25 | other |
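The caveat about aggregations needing more input than the requested window can be illustrated without Tecton. The following is a minimal pandas sketch, not Tecton's implementation: the data, the 3-day window, and the helper name rolling_3d_count are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw events: one transaction per day for a single user.
raw = pd.DataFrame(
    {
        "user_id": ["user_1"] * 5,
        "timestamp": pd.date_range("2022-01-01", periods=5, freq="D"),
        "transaction": [1] * 5,
    }
)


def rolling_3d_count(df):
    """Sum transactions per day, then add each day to the two days before it."""
    daily = df.set_index("timestamp")["transaction"].resample("1D").sum()
    return daily.rolling(window=3, min_periods=1).sum()


day = pd.Timestamp("2022-01-05")

# With the full history, the value for Jan 5 covers Jan 3, 4 and 5.
full = rolling_3d_count(raw)

# With only the rows inside the requested range, the earlier days are missing.
window_only = rolling_3d_count(raw[raw["timestamp"] >= day])

print(full.loc[day])         # 3.0
print(window_only.loc[day])  # 1.0
```

The two results differ because the second call never sees the rows from before the requested range, which is the situation described in the last bullet above.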
Run with mock sources
Mock input data can be passed into the BatchFeatureView::run_transformation
function through mock_inputs, keyed by the same source names used in the
Feature View definition.
users_data = pandas.DataFrame(
    {
        "user_id": ["user_1", "user_1", "user_2"],
        "cc_num": ["423456789012", "567890123456", "678901234567"],
        "signup_timestamp": [
            datetime(2022, 1, 1, 2),
            datetime(2022, 1, 1, 4),
            datetime(2022, 1, 1, 3),
        ],
    }
)
result_dataframe = fv.run_transformation(
    start_time=datetime(2022, 1, 1),
    end_time=datetime(2022, 1, 2),
    mock_inputs={"users": users_data},  # `users` is the name of this Feature View input.
).to_pandas()
display(result_dataframe)
| | user_id | signup_timestamp | credit_card_issuer |
---|---|---|---|
0 | user_1 | 2022-01-01 02:00:00 | Visa |
1 | user_1 | 2022-01-01 04:00:00 | MasterCard |
2 | user_2 | 2022-01-01 03:00:00 | Discover |
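When testing with mock inputs, it can help to write down the expected output independently and compare it with what run_transformation returns. The sketch below does this in plain pandas. The rule it uses (issuer chosen by the first digit of cc_num, falling back to "other") is an assumption inferred from the sample outputs on this page, not the Feature View's actual transformation, and expected_issuer is a hypothetical helper.

```python
from datetime import datetime

import pandas as pd

users_data = pd.DataFrame(
    {
        "user_id": ["user_1", "user_1", "user_2"],
        "cc_num": ["423456789012", "567890123456", "678901234567"],
        "signup_timestamp": [
            datetime(2022, 1, 1, 2),
            datetime(2022, 1, 1, 4),
            datetime(2022, 1, 1, 3),
        ],
    }
)


def expected_issuer(cc_num: str) -> str:
    # ASSUMPTION: first-digit mapping guessed from the sample tables, for illustration only.
    return {"4": "Visa", "5": "MasterCard", "6": "Discover"}.get(cc_num[0], "other")


expected = users_data.assign(
    credit_card_issuer=users_data["cc_num"].map(expected_issuer)
)[["user_id", "signup_timestamp", "credit_card_issuer"]]

print(expected)
```

The resulting frame can then be compared against the DataFrame returned by the mock run, for example with pandas.testing.assert_frame_equal after aligning row order and dtypes.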
Run a Batch Feature View with tiled aggregations
When a Feature View uses tiled aggregations, the query operates in three logical steps:
1. The Feature View query is run over the provided time range. The user-defined transformations are applied to the data source.
2. The result of step 1 is aggregated into tiles the size of the aggregation_interval.
3. The tiles from step 2 are combined to form the final feature values. The number of tiles that are combined depends on the time_window of the aggregation.
To see the output of step 1, use run_transformation(). For step 2, use get_partial_aggregates(). For step 3, use get_features_in_range().
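The three logical steps can be mimicked in plain pandas to build intuition. This is only a sketch under stated assumptions: the sample rows, the 1-day aggregation interval, the 2-day time window, and the helper combine_tiles are hypothetical, and the code does not reflect how Tecton executes the query internally.

```python
import pandas as pd

# Step 1 stand-in: rows as they might look after the transformation has run.
transformed = pd.DataFrame(
    {
        "user_id": ["user_a", "user_a", "user_a", "user_b"],
        "transaction": [1, 1, 1, 1],
        "timestamp": pd.to_datetime(
            [
                "2022-05-01 01:50:51",
                "2022-05-01 07:11:31",
                "2022-05-02 15:18:48",
                "2022-05-01 19:45:14",
            ]
        ),
    }
)

# Step 2: aggregate into tiles of one aggregation interval (assumed to be 1 day).
tiles = (
    transformed.assign(_interval_start_time=transformed["timestamp"].dt.floor("1D"))
    .groupby(["user_id", "_interval_start_time"], as_index=False)["transaction"]
    .sum()
    .rename(columns={"transaction": "transaction_count_1d"})
)
tiles["_interval_end_time"] = tiles["_interval_start_time"] + pd.Timedelta(days=1)


def combine_tiles(tiles, as_of, time_window):
    """Step 3: sum the complete tiles that end within (as_of - time_window, as_of]."""
    in_window = tiles[
        (tiles["_interval_end_time"] <= as_of)
        & (tiles["_interval_end_time"] > as_of - time_window)
    ]
    return in_window.groupby("user_id")["transaction_count_1d"].sum()


# A 2-day window as of 2022-05-03 combines the May 1 and May 2 tiles.
combined = combine_tiles(tiles, pd.Timestamp("2022-05-03"), pd.Timedelta(days=2))
print(tiles)
print(combined)
```

Here user_a has tiles of 2 and 1, which combine to 3, while user_b has a single tile of 1.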
agg_fv = ws.get_feature_view("user_transaction_counts")
result_dataframe = agg_fv.run_transformation(
    start_time=datetime(2022, 5, 1),
    end_time=datetime(2022, 5, 2),
).to_pandas()
display(result_dataframe)
| | user_id | transaction | timestamp |
---|---|---|---|
0 | user_222506789984 | 1 | 2022-05-01 21:04:38 |
1 | user_26990816968 | 1 | 2022-05-01 19:45:14 |
2 | user_337750317412 | 1 | 2022-05-01 15:18:48 |
3 | user_337750317412 | 1 | 2022-05-01 07:11:31 |
4 | user_337750317412 | 1 | 2022-05-01 01:50:51 |
result_dataframe = agg_fv.get_partial_aggregates(
    start_time=datetime(2022, 5, 1),
    end_time=datetime(2022, 5, 2),
).to_pandas()
display(result_dataframe)
| | user_id | transaction_count_1d | _interval_start_time | _interval_end_time |
---|---|---|---|---|
0 | user_222506789984 | 1 | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
1 | user_26990816968 | 1 | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
2 | user_337750317412 | 4 | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
3 | user_402539845901 | 2 | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
4 | user_461615966685 | 1 | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
Get a Range of Feature Values from the Offline Store
BatchFeatureView::get_features_in_range reads a range of feature values from
the offline store between a given start_time and end_time.
Pass from_source=True to bypass the offline store and compute features
on-the-fly against the raw data source. This is useful for testing the
expected output of feature values.
Use from_source=False (default) to see what data is materialized in the
offline store.
result_dataframe = fv.get_features_in_range(start_time=datetime(2022, 5, 1), end_time=datetime(2022, 5, 2)).to_pandas()
display(result_dataframe)
| | user_id | timestamp | transaction_count_1d_1d | transaction_count_30d_1d | transaction_count_90d_1d | _effective_timestamp |
---|---|---|---|---|---|---|
0 | user_205125746682 | 2022-05-01 00:00:00 | 2 | 8 | 34 | 2022-05-01 00:00:00 |
1 | user_222506789984 | 2022-05-01 00:00:00 | 1 | 42 | 141 | 2022-05-01 00:00:00 |
2 | user_268514844966 | 2022-05-01 00:00:00 | 1 | 29 | 66 | 2022-05-01 00:00:00 |
3 | user_394495759023 | 2022-05-01 00:00:00 | 1 | 21 | 68 | 2022-05-01 00:00:00 |
4 | user_459842889956 | 2022-05-01 00:00:00 | 1 | 14 | 39 | 2022-05-01 00:00:00 |
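One way to use the two from_source settings together is to fetch the same range both ways and look for rows that differ. The helper below is a hypothetical pandas sketch, not part of the Tecton SDK. It assumes both results have already been converted with to_pandas() and share the same key columns, and it uses small made-up frames in place of real query results.

```python
import pandas as pd


def diff_feature_frames(materialized, from_source, keys):
    """Return rows that are missing on one side or whose feature values differ."""
    merged = materialized.merge(
        from_source,
        on=keys,
        how="outer",
        suffixes=("_materialized", "_from_source"),
        indicator=True,
    )
    feature_cols = [
        c for c in materialized.columns if c not in keys and c in from_source.columns
    ]
    mismatch = merged["_merge"] != "both"
    for col in feature_cols:
        left = merged[f"{col}_materialized"]
        right = merged[f"{col}_from_source"]
        # Two nulls are treated as equal; anything else must match exactly.
        mismatch |= ~((left == right) | (left.isna() & right.isna()))
    return merged[mismatch]


# Made-up stand-ins for the two query results.
keys = ["user_id", "timestamp"]
ts = pd.Timestamp("2022-05-01")
materialized = pd.DataFrame(
    {"user_id": ["user_1", "user_2"], "timestamp": [ts, ts], "transaction_count_1d_1d": [2, 1]}
)
computed = pd.DataFrame(
    {"user_id": ["user_1", "user_2"], "timestamp": [ts, ts], "transaction_count_1d_1d": [2, 3]}
)

differences = diff_feature_frames(materialized, computed, keys)
print(differences)
```

An empty result suggests the materialized data agrees with what the current definition computes from the source. Differences are not necessarily errors, since the raw data or the Feature View definition may have changed after materialization.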
Read the Latest Features from Online Feature Store
For performance reasons, this function should only be used for testing and not in a production environment. To read features online efficiently, see Reading Features for Inference.
fv.get_online_features({"user_id": "user_609904782486"}).to_dict()
Out: {
    "transaction_count_1d_1d": 1,
    "transaction_count_30d_1d": 17,
    "transaction_count_90d_1d": 56,
}
Read Historical Features from Offline Feature Store with Time-Travel
Create an events DataFrame with the events to look up. For more information
on the events DataFrame, check out Selecting Sample Keys and Timestamps.
events = pandas.DataFrame(
    {
        "user_id": ["user_722584453020", "user_461615966685"],
        "timestamp": [datetime(2022, 5, 1, 3, 20, 0), datetime(2022, 6, 6, 2, 30, 0)],
    }
)
display(events)
| | user_id | timestamp |
---|---|---|
0 | user_722584453020 | 2022-05-01 03:20:00 |
1 | user_461615966685 | 2022-06-06 02:30:00 |
Pass from_source=True to bypass the offline store and compute features
on-the-fly against the raw data source. However, this will be slower than
reading feature data that has been materialized to the offline store.
result_dataframe = fv.get_features_for_events(events, from_source=True).to_pandas()
display(result_dataframe)
| | user_id | timestamp | user_transaction_counts__transaction_count_1d_1d | user_transaction_counts__transaction_count_30d_1d | user_transaction_counts__transaction_count_90d_1d |
---|---|---|---|---|---|
0 | user_461615966685 | 2022-06-06 02:30:00 | 0 | 13 | 40 |
1 | user_722584453020 | 2022-05-01 03:20:00 | 0 | 28 | 73 |
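The time-travel lookup can be pictured as a point-in-time join: each event receives the most recent feature value that was effective at or before the event's timestamp, and never a later one. The pandas sketch below illustrates that idea with pandas.merge_asof. The features frame is made-up data, and this is an approximation for intuition rather than a description of how get_features_for_events is implemented.

```python
from datetime import datetime

import pandas as pd

events = pd.DataFrame(
    {
        "user_id": ["user_722584453020", "user_461615966685"],
        "timestamp": [datetime(2022, 5, 1, 3, 20, 0), datetime(2022, 6, 6, 2, 30, 0)],
    }
)

# Hypothetical feature values, keyed by the time each value became effective.
features = pd.DataFrame(
    {
        "user_id": ["user_722584453020", "user_722584453020", "user_461615966685"],
        "_effective_timestamp": [
            datetime(2022, 5, 1),
            datetime(2022, 5, 2),  # later than the event, so it must not be used
            datetime(2022, 6, 6),
        ],
        "transaction_count_30d_1d": [28, 30, 13],
    }
)

# merge_asof needs both frames sorted by their time keys.
joined = pd.merge_asof(
    events.sort_values("timestamp"),
    features.sort_values("_effective_timestamp"),
    left_on="timestamp",
    right_on="_effective_timestamp",
    by="user_id",
    direction="backward",  # only look at values effective at or before the event
)
print(joined)
```

The event at 2022-05-01 03:20:00 picks up the value effective on 2022-05-01 rather than the one from 2022-05-02, which is what prevents future data from leaking into training sets.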