Tecton Quickstart Tutorial
Tecton helps you build and productionize real-time ML models by making it easy to define, test, and deploy features for training and serving.
Let's see how quickly we can build a real-time fraud detection model and bring it online.
In this tutorial we will:
- Connect to data on S3
- Define and test features
- Generate a training dataset and train a model
- Productionize our features for real-time serving
- Run real-time inference to predict fraudulent transactions
This tutorial is expected to take about 30 minutes (record time for building a real-time ML application).
Most of this tutorial is intended to be run in a notebook with access to Spark and the Tecton SDK installed. See these instructions to set up notebooks for Databricks or EMR.
Some steps will explicitly note to run commands in your terminal.
Be sure to install the Tecton SDK before getting started. You will also need the following packages, which are used to read data from S3 and to train a model later in this tutorial:
pip install s3fs fsspec scikit-learn
Before you start, run tecton login [my-account].tecton.ai in your CLI. Be sure to fill in your organization's Tecton account name.
Examine raw data
First let's examine some historical transaction data that we have available on S3.
import pandas as pd
df = pd.read_parquet("s3://tecton.ai.public/tutorials/fraud_demo/transactions/data.pq", storage_options={"anon": True})
display(df.head(5))
| | user_id | transaction_id | category | amt | is_fraud | merchant | merch_lat | merch_long | timestamp |
---|---|---|---|---|---|---|---|---|---|
0 | user_884240387242 | 3eb88afb219c9a10f5130d0b89a13451 | gas_transport | 68.23 | 0 | fraud_Kutch, Hermiston and Farrell | 42.71 | -78.3386 | 2023-06-20 10:26:41 |
1 | user_268514844966 | 72e23b9193f97c2ba654854a66890432 | misc_pos | 32.98 | 0 | fraud_Lehner, Reichert and Mills | 39.1536 | -122.364 | 2023-06-20 12:57:20 |
2 | user_722584453020 | db7a41ce2d16a4452c973418d9e544b1 | home | 4.5 | 0 | fraud_Koss, Hansen and Lueilwitz | 33.0332 | -105.746 | 2023-06-20 14:49:59 |
3 | user_337750317412 | edfc42f7bc4b86d8c142acefb88c4565 | misc_pos | 7.68 | 0 | fraud_Buckridge PLC | 40.6828 | -88.8084 | 2023-06-20 14:50:13 |
4 | user_934384811883 | 93d28b6d2e5afebf9c40304aa709ab29 | kids_pets | 68.97 | 1 | fraud_Lubowitz-Walter | 39.1443 | -96.125 | 2023-06-20 15:55:09 |
Define and test features locally
In our data, we see that there's information on users' transactions over time.
Let's use this data to create the following features:
- A user's average transaction amount over 1, 3, and 7 days.
- A user's total transaction count over 1, 3, and 7 days.
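To make these two feature families concrete, here is a minimal plain-Python sketch of what "average amount" and "transaction count" over 1, 3, and 7 days mean for one user at one point in time. The sample events, the `window_metrics` helper, and the choice of a half-open window `[as_of - window, as_of)` are illustrative assumptions, not part of the Tecton SDK or the tutorial dataset.

```python
from datetime import datetime, timedelta

# Hypothetical sample events: (user_id, timestamp, amt)
events = [
    ("user_a", datetime(2023, 6, 14, 9, 0), 10.0),
    ("user_a", datetime(2023, 6, 18, 12, 0), 20.0),
    ("user_a", datetime(2023, 6, 19, 18, 0), 30.0),
    ("user_b", datetime(2023, 6, 19, 20, 0), 99.0),
]


def window_metrics(events, user_id, as_of, window):
    """Return (mean amount, count) for one user's events in [as_of - window, as_of)."""
    amounts = [
        amt
        for uid, ts, amt in events
        if uid == user_id and as_of - window <= ts < as_of
    ]
    count = len(amounts)
    mean = sum(amounts) / count if count else None  # no events -> no mean
    return mean, count


as_of = datetime(2023, 6, 20)
metrics = {
    days: window_metrics(events, "user_a", as_of, timedelta(days=days))
    for days in (1, 3, 7)
}
for days, (mean, count) in metrics.items():
    print(f"{days}d: mean={mean}, count={count}")
```

Each wider window includes every event from the narrower ones, so the count can only grow as the window lengthens, while the mean can move in either direction.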
To build these features, we will define a "Batch Source" and "Batch Feature View" using Tecton's Feature Engineering Framework.
A Feature View is how we define our feature logic and give Tecton the information it needs to productionize, monitor, and manage features.
Tecton's development workflow allows you to build and test features, as well as generate training data entirely in a notebook! Let's try it out.
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, Aggregation
from tecton.types import Field, String, Timestamp, Float64
from datetime import datetime, timedelta
transactions = BatchSource(
    name="transactions",
    batch_config=FileConfig(
        uri="s3://anonymous@tecton.ai.public/tutorials/fraud_demo/transactions/data.pq",
        file_format="parquet",
        timestamp_field="timestamp",
    ),
)
# An entity defines the concept we are modeling features for
# The join keys will be used to aggregate, join, and retrieve features
user = Entity(name="user", join_keys=["user_id"])
# We use a PySpark transformation to select the raw columns and Tecton aggregations to efficiently and accurately compute metrics across raw events
# Feature View decorators contain a wide range of parameters for materializing, cataloging, and monitoring features
@batch_feature_view(
    description="User transaction metrics over 1, 3 and 7 days",
    sources=[transactions],
    entities=[user],
    mode="pyspark",
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(function="mean", column="amt", time_window=timedelta(days=1)),
        Aggregation(function="mean", column="amt", time_window=timedelta(days=3)),
        Aggregation(function="mean", column="amt", time_window=timedelta(days=7)),
        Aggregation(function="count", column="amt", time_window=timedelta(days=1), name="transaction_count_1d_1d"),
        Aggregation(function="count", column="amt", time_window=timedelta(days=3), name="transaction_count_3d_1d"),
        Aggregation(function="count", column="amt", time_window=timedelta(days=7), name="transaction_count_7d_1d"),
    ],
    schema=[Field("user_id", String), Field("timestamp", Timestamp), Field("amt", Float64)],
)
def user_transaction_metrics(transactions):
    return transactions[["user_id", "timestamp", "amt"]]
# After we define local objects, we use `.validate()` to check the correctness of the definition
# and make it ready to query
user_transaction_metrics.validate()
BatchFeatureView 'user_transaction_metrics': Validating 3 dependencies.
BatchSource 'transactions': Deriving schema.
BatchSource 'transactions': Successfully validated.
Entity 'user': Successfully validated.
Transformation 'user_transaction_metrics': Successfully validated.
BatchFeatureView 'user_transaction_metrics': Successfully validated.
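The feature view above pairs an `aggregation_interval` of one day with windows of 1, 3, and 7 days. One general way to serve several windows from the same data is to pre-aggregate raw events into fixed-interval partial aggregates and then combine them per window. The sketch below illustrates that idea in plain Python. It is a conceptual sketch only, not a description of Tecton's internal implementation, and the sample events and the helper names `build_daily_tiles` and `window_from_tiles` are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta

# Hypothetical raw events for a single user: (timestamp, amt)
events = [
    (datetime(2023, 6, 17, 8, 0), 10.0),
    (datetime(2023, 6, 17, 21, 0), 30.0),
    (datetime(2023, 6, 19, 13, 0), 50.0),
]


def build_daily_tiles(events):
    """Pre-aggregate raw events into one (sum, count) partial aggregate per calendar day."""
    tiles = defaultdict(lambda: [0.0, 0])
    for ts, amt in events:
        tile = tiles[ts.date()]
        tile[0] += amt
        tile[1] += 1
    return {day: (total, count) for day, (total, count) in tiles.items()}


def window_from_tiles(tiles, end_day, window_days):
    """Combine the tiles for the `window_days` days before `end_day` (exclusive) into (mean, count)."""
    total, count = 0.0, 0
    for offset in range(1, window_days + 1):
        day_total, day_count = tiles.get(end_day - timedelta(days=offset), (0.0, 0))
        total += day_total
        count += day_count
    return (total / count if count else None), count


tiles = build_daily_tiles(events)
print(window_from_tiles(tiles, date(2023, 6, 20), 1))
print(window_from_tiles(tiles, date(2023, 6, 20), 3))
```

Storing a sum and a count per day, rather than a per-day mean, is what lets the 3-day and 7-day means be derived exactly from the same daily tiles.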