
Data Sources

A Data Source in Tecton defines the input data for your Feature Views. It points to where your raw data lives—whether it's a batch table in a warehouse, a stream in Kafka, or a file in object storage—and provides Tecton with the instructions needed to read that data.

Data Sources are defined using the BatchSource, StreamSource, and RequestSource classes. For configuration options, see the config classes such as FileConfig, HiveConfig, SnowflakeConfig, and PushConfig.

Tecton supports several data source types:

  • Batch sources for data lakes and warehouses (e.g., Snowflake, S3, Hive, BigQuery)
  • Stream sources for real-time event ingestion (e.g., Kafka, Kinesis, Tecton Stream Ingest API)
  • Request sources for real-time input provided at inference time (used in Realtime Feature Views)

Tecton can connect to practically any physical batch or stream source of data (e.g., S3, GCS, Snowflake, Redshift, Kafka, Kinesis). To learn how to onboard your existing physical sources to Tecton, visit this guide.

This section explains how to use an onboarded physical data source with a Feature View. In Tecton's framework, Data Sources are logical objects that define the raw inputs available to your Feature Views. A Data Source carries standard metadata (such as a name, an owner, or tags); batch and stream Data Sources also reference your onboarded physical source of data.

Here's an example of a logical BatchSource, named fraud_users_batch, which references a physical raw Hive table fraud_users in the Hive database fraud:

from tecton import HiveConfig, BatchSource

fraud_users_batch = BatchSource(
    name="fraud_users_batch",
    batch_config=HiveConfig(database="fraud", table="fraud_users"),
)

Column Naming Requirements

When defining schemas for Data Sources (BatchSource, StreamSource, RequestSource), column names must follow specific naming constraints to ensure compatibility with Tecton's validation system.

Naming Rules:

  • Letters (a-z, A-Z) and numbers (0-9) are allowed
  • Single underscores (_) are allowed as separators
  • No other special characters, spaces, or consecutive underscores are permitted

Examples of valid column names:

  • user_id
  • transaction_amount
  • timestamp_field
  • score123

Examples of invalid column names:

  • user-id (hyphens not allowed)
  • transaction__amount (consecutive underscores not allowed)
  • timestamp field (spaces not allowed)

If you use invalid column names, you will encounter validation errors when defining your data source schema.
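
For illustration, here is a schema that follows these rules, reusing the valid names above (the field types are assumptions):

from tecton.types import Field, Float64, Int64, String, Timestamp

# Every name below uses only letters, digits, and single underscores.
schema = [
    Field("user_id", String),
    Field("transaction_amount", Float64),
    Field("timestamp_field", Timestamp),
    Field("score123", Int64),
]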

Data Source Workflow

To use a data source:

  1. Choose the correct type (BatchSource, StreamSource, or RequestSource) depending on your data and Feature View type.
  2. Define the data source object with the necessary configuration (e.g., URI, table name, schema, timestamp field).
  3. Pass the data source to one or more Feature Views.
  4. Optionally configure data filtering, time boundaries, or incremental pull.

Tecton reads data from the source during materialization or at inference time.
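
As a rough end-to-end sketch of this workflow, here is a batch source passed to a Batch Feature View. The entity, S3 URI, and decorator arguments are illustrative assumptions, and exact parameters vary by SDK version.

from datetime import datetime, timedelta

from tecton import Attribute, BatchSource, Entity, FileConfig, batch_feature_view
from tecton.types import Field, Float64, String

# Steps 1-2: define a batch source (hypothetical S3 path and field names).
transactions = BatchSource(
    name="transactions",
    batch_config=FileConfig(
        uri="s3://my-bucket/transactions.parquet",
        file_format="parquet",
        timestamp_field="timestamp",
    ),
)

# A hypothetical entity that the features are keyed by.
user = Entity(name="user", join_keys=[Field("user_id", String)])

# Step 3: pass the source to a Feature View.
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2024, 1, 1),
    timestamp_field="timestamp",
    features=[Attribute("amount", Float64)],
)
def user_transaction_features(transactions):
    return f"SELECT user_id, amount, timestamp FROM {transactions}"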

Tecton supports the following Data Source concepts:

  • BatchSource: References a physical batch source of raw data, such as a Hive table, a data warehouse table, or a file. Used as an input for a BatchFeatureView.
  • StreamSource: References a physical stream source (such as a Kafka topic or a Kinesis Stream) or a PushConfig that allows you to push events to Tecton via HTTP. It can also reference a physical batch source, which contains the stream's historical event log (used for backfills). Used as an input for a StreamFeatureView.
  • RequestSource: Defines the expected schema for request context data that is optionally sent to a RealtimeFeatureView at inference time.

How to Use Data Sources

After defining a Data Source, the examples below show how to configure each source type and use it with a Feature View.

BatchSource with FileConfig

from tecton import BatchSource, FileConfig

user_events = BatchSource(
    name="user_events",
    batch_config=FileConfig(
        uri="s3://data/users/events.parquet",
        file_format="parquet",
        timestamp_field="event_time",
    ),
)
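
Once the source is applied to a workspace, you can preview a slice of the raw data interactively. This is a sketch assuming a notebook with a configured workspace; method availability may vary by SDK version.

from datetime import datetime

# Read a time-bounded slice of the source into a pandas DataFrame.
df = user_events.get_dataframe(
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 1, 2),
).to_pandas()
print(df.head())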

StreamSource with PushConfig and Mock Batch Backfill

from tecton import StreamSource, PushConfig, FileConfig
from tecton.types import Field, String, Timestamp

click_stream = StreamSource(
    name="click_stream",
    schema=[
        Field("user_id", String),
        Field("event_time", Timestamp),
        Field("event_type", String),
    ],
    stream_config=PushConfig(),
    batch_config=FileConfig(
        uri="s3://my-bucket/clicks_backfill.parquet",
        file_format="parquet",
        timestamp_field="event_time",
    ),
)

RequestSource

from tecton import RequestSource
from tecton.types import Field, Float64

transaction_request = RequestSource(name="transaction_request", schema=[Field("amount", Float64)])
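
As a sketch of how a RequestSource feeds a Realtime Feature View, the example below derives a flag from request-time data. The decorator arguments and feature name are illustrative assumptions.

from tecton import Attribute, realtime_feature_view
from tecton.types import Bool

# Computed at request time from the incoming payload; nothing is materialized.
@realtime_feature_view(
    sources=[transaction_request],
    mode="python",
    features=[Attribute("is_high_value", Bool)],
)
def high_value_flag(transaction_request):
    return {"is_high_value": transaction_request["amount"] > 1000}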

Data Source Parameters

Source Type      Key Parameters
BatchSource      name, batch_config (e.g., FileConfig, SnowflakeConfig)
StreamSource     name, schema, stream_config, batch_config (for backfill)
RequestSource    name, schema
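
Warehouse-backed sources follow the same shape. Here is a minimal sketch assuming a Snowflake table; the connection details are illustrative.

from tecton import BatchSource, SnowflakeConfig

payments = BatchSource(
    name="payments",
    batch_config=SnowflakeConfig(
        database="FRAUD",  # hypothetical database, schema, and table
        schema="PUBLIC",
        table="PAYMENTS",
        timestamp_field="CREATED_AT",
    ),
)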
