Version: 0.8

Creating a Batch Data Source

This guide shows you how to create a Tecton BatchSource.

You must register a data source with Tecton before you define features based on that data. To register a data source, follow these steps:

  1. Define a data source object.
  2. Apply your data source to Tecton using the Tecton CLI.
  3. Verify the data source by querying it in a notebook.

This guide assumes you've already set up the permissions required for Tecton to read from the source.

In the first example, we'll use a Hive table for batch data, but the same principles apply for any raw data source, including streams. See Data Sources overview or the Data Sources API for more details on other Data Sources.

Example of Defining a Batch Data Source Object

In this example, we define a BatchSource that contains the configuration necessary for Tecton to access our Hive user table.

Create a new file in your feature repository, and paste in the following code:

from tecton import HiveConfig, BatchSource

fraud_users_batch = BatchSource(
    name="users_batch",
    batch_config=HiveConfig(database="fraud", table="fraud_users"),
)

In the definition above, name identifies the data source in Tecton. You can also add optional metadata parameters for organization, such as tags.

Applying the Data Source

So far, all we've done is write code in our local feature repository. To use the data source in Tecton, we need to apply the new definition. We can do this using the Tecton CLI:

$ tecton apply
Using workspace "prod"
✅ Imported 15 Python modules from the feature repository
✅ Collecting local feature declarations
✅ Performing server-side validation of feature declarations
↓↓↓↓↓↓↓↓↓↓↓↓ Plan Start ↓↓↓↓↓↓↓↓↓↓

+ Create BatchDataSource
name: users_batch

↑↑↑↑↑↑↑↑↑↑↑↑ Plan End ↑↑↑↑↑↑↑↑↑↑↑↑
Are you sure you want to apply this plan? [y/N]>

Enter y to apply this definition to Tecton.

Verify the Data Source

To verify that the data source is connected properly, use the Tecton SDK in a notebook environment, replacing my_workspace with the name of the workspace you applied the definition to:

import tecton
users_batch = tecton.get_workspace('my_workspace').get_data_source('users_batch')

print(users_batch.get_dataframe().to_pandas().head(10))

With a Data Source defined and verified, you are now ready to define Tecton Feature Views that make use of this data. You can also configure your Batch Data Source following the instructions below.

Configuring a BatchSource

In the example above, we used a HiveConfig for the BatchSource. Tecton supports several other configuration classes for different kinds of data sources. To use one:

  1. Declare a configuration object that is an instance of a configuration class specific to your source. Tecton supports these configuration classes:

    • FileConfig: File source (such as a file on S3)
    • HiveConfig: Hive (or Glue) Table
    • UnityConfig: Unity Table
    • RedshiftConfig: Redshift Table or Query
    • SnowflakeConfig: Snowflake Table or Query
    • SparkBatchConfig: Custom function to create a Spark DataFrame
    Note:

    • Tecton on Snowflake only supports SnowflakeConfig.
    • Please contact Tecton to enable Unity Catalog on your deployment before using UnityConfig.

    The complete list of configurations can be found in API Reference.

    As an alternative to using a configuration object, you can use a Data Source Function, which offers more flexibility.

  2. Declare a BatchSource object that references the configuration defined in the previous step:

    • name: A unique identifier for the batch source. For example, "click_event_log".
    • batch_config: The configuration created in the step above.

    The batch_config object definition may optionally specify a timestamp column representing the time of each record. Values in the timestamp column must be in one of the following formats:

    • A native TimestampType object.
    • A string representing a timestamp that Spark SQL can parse by default, in the form yyyy-MM-dd'T'hh:mm:ss.SSS'Z'.
    • A customized string representing a timestamp, for which you can provide a custom timestamp_format to parse the string. The format has to follow this guideline.

    You must specify a timestamp column in the batch_config object if any Batch Feature View uses a FilteredSource that wraps a BatchSource built on that batch_config.
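
As a minimal sketch of the timestamp settings described above, the Hive source from the first example could declare its timestamp column as follows. This assumes the HiveConfig parameters timestamp_field and timestamp_format as listed in the 0.8 API reference; the column name signup_timestamp and its string layout are hypothetical, so substitute the values from your own table.

```python
from tecton import BatchSource, HiveConfig

# Assumption: the fraud_users table has a string column named
# "signup_timestamp" holding values such as "2023-04-01 13:45:10".
fraud_users_batch = BatchSource(
    name="users_batch",
    batch_config=HiveConfig(
        database="fraud",
        table="fraud_users",
        # Hypothetical column that records the time of each row.
        timestamp_field="signup_timestamp",
        # Only needed when the column is a string in a non-default layout.
        timestamp_format="yyyy-MM-dd HH:mm:ss",
    ),
)
```

If the column is already a native TimestampType, timestamp_format should not be needed.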

See the Data Source API reference for detailed descriptions of Data Source attributes.
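
Before you rely on the default string layout, it can help to spot-check a few raw values locally. The sketch below uses only the Python standard library and is not part of the Tecton SDK. It assumes a 24-hour clock and UTC values, and it approximates the Spark pattern with a strptime layout, so treat it as a rough pre-check and not as a definition of what Spark accepts.

```python
from datetime import datetime, timezone

# Approximate Python counterpart of the default layout described above.
# Assumptions: 24-hour clock, UTC (the trailing literal "Z"). Note that %f
# accepts 1 to 6 fractional digits, so it is looser than a strict ".SSS".
DEFAULT_LAYOUT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_default_timestamp(value: str) -> datetime:
    """Parse one raw timestamp string; raises ValueError if it does not match."""
    return datetime.strptime(value, DEFAULT_LAYOUT).replace(tzinfo=timezone.utc)


def find_unparseable(values):
    """Return the values that do not match DEFAULT_LAYOUT, in input order."""
    bad = []
    for value in values:
        try:
            parse_default_timestamp(value)
        except ValueError:
            bad.append(value)
    return bad


samples = ["2023-04-01T13:45:10.123Z", "2023-04-01 13:45:10", "04/01/2023"]
print(find_unparseable(samples))  # prints ['2023-04-01 13:45:10', '04/01/2023']
```

Values that appear in the returned list would need either a custom timestamp_format or a conversion step before Tecton reads them.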

Example

The following example declares a BatchSource object that contains a configuration for connecting to Snowflake.

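from tecton import BatchSource, SnowflakeConfig

# Connection settings for the Snowflake table that holds the raw data.
click_stream_snowflake_config = SnowflakeConfig(
    url="https://[your-cluster].eu-west-1.snowflakecomputing.com/",
    database="YOUR_DB",
    schema="CLICK_STREAM_SCHEMA",
    warehouse="COMPUTE_WH",
    table="CLICK_STREAM",
)

# The BatchSource registers the configuration with Tecton under a unique name.
click_stream_snowflake_ds = BatchSource(
    name="click_stream_snowflake_ds",
    batch_config=click_stream_snowflake_config,
)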
