Registering and Previewing a Data Source

Overview

Tecton supports connections to many different data sources. This example uses a Hive table for batch data, but the same principles apply for any raw data source, including streams.

You must register a data source with Tecton before you can use the source. To register a data source, follow these steps:

  1. Define a Virtual Data Source (VDS) object. The VDS abstracts the data from the underlying source and makes the data available to Tecton. See Virtual Data Sources.
  2. Apply your Virtual Data Source to Tecton using the Tecton CLI.
  3. Test the data source by querying the VDS in a notebook.

Creating a Virtual Data Source

To create a Virtual Data Source, first define a *DSConfig object (for example, HiveDSConfig) in your feature repository. This object contains the configuration settings that Tecton requires to access the raw data.

Create a new file in your feature repository, and paste in the following code:

from tecton import HiveDSConfig, VirtualDataSource

ad_impressions_hive = HiveDSConfig(
    database='ad_impressions_2',
    table='batch_events',
    timestamp_column_name='timestamp',
    date_partition_column='datestr'
)

The code above creates a HiveDSConfig, which contains the configuration settings that Tecton needs to access a Hive table. You can then use this HiveDSConfig object to define a VirtualDataSource:

ad_impressions_batch = VirtualDataSource(
    name="ad_impressions_batch",
    batch_ds_config=ad_impressions_hive,
    family='ad_serving',
    tags={
        'release': 'production',
        'source': 'mobile'
    }
)

In the example definition above, we also added metadata parameters for organization, like name, family, and tags. For more information on Virtual Data Sources, see the Virtual Data Source overview or the Virtual Data Source API.
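
The date_partition_column setting in the HiveDSConfig names the column by which the Hive table is partitioned by date. The sketch below illustrates the general idea of date partitioning in plain Python and does not call Tecton. It assumes, purely for illustration, that datestr holds one YYYY-MM-DD value per day; the actual format in your table may differ, and the helper name date_partitions is hypothetical.

```python
from datetime import date, timedelta
from typing import List


def date_partitions(start: date, end: date) -> List[str]:
    """Return one partition value per day in the inclusive range [start, end].

    Assumption: partitions are named with ISO dates (YYYY-MM-DD).
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


# A four-day window that crosses a month boundary.
partitions = date_partitions(date(2021, 1, 30), date(2021, 2, 2))
print(partitions)
# ['2021-01-30', '2021-01-31', '2021-02-01', '2021-02-02']
```

A reader that knows the partition values for a time window can scan only those partitions instead of the whole table, which is the usual reason for declaring a date partition column.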

Applying a Virtual Data Source

So far, all we've done is write code in our local feature repository. To use the data source in Tecton, we need to apply the new definition using the Tecton CLI:

$ tecton apply
Using workspace "prod"
✅ Imported 15 Python modules from the feature repository
✅ Collecting local feature declarations
✅ Performing server-side validation of feature declarations
 ↓↓↓↓↓↓↓↓↓↓↓↓ Plan Start ↓↓↓↓↓↓↓↓↓↓

  + Create VirtualDataSource
        name: ad_impressions_batch

 ↑↑↑↑↑↑↑↑↑↑↑↑ Plan End ↑↑↑↑↑↑↑↑↑↑↑↑
Are you sure you want to apply this plan? [y/N]>

Enter y to apply this definition to Tecton.

Testing the Data Source in a Notebook

To verify that the data source is connected properly, use the Tecton SDK in a notebook environment:

import tecton
ad_impressions_batch = tecton.get_virtual_data_source('ad_impressions_batch')

# verify schema is what you expect
print('schema=', ad_impressions_batch.schema)
# expected output
# schema=(user_id string, timestamp timestamp)

# preview some rows of data
print(ad_impressions_batch.preview())
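
The schema check can also be made explicit rather than read by eye. The following is a minimal sketch in plain Python that does not call Tecton. It assumes the schema can be rendered as text in the "name type, name type" form shown in the expected output above; the real schema attribute may be a different kind of object, in which case you would convert it first. The helper names parse_schema and schema_problems are hypothetical, and the parser handles only simple type names without commas or spaces.

```python
def parse_schema(schema_text):
    """Parse '(name type, name type)' into a {column: type} dict."""
    columns = {}
    for field in schema_text.strip().strip("()").split(","):
        name, col_type = field.split()
        columns[name] = col_type
    return columns


def schema_problems(actual, expected):
    """Describe every expected column that is missing or has a different type."""
    problems = []
    for name, col_type in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != col_type:
            problems.append(f"{name}: expected {col_type}, found {actual[name]}")
    return problems


expected = {"user_id": "string", "timestamp": "timestamp"}

# In a notebook, the text would come from the registered data source,
# for example str(ad_impressions_batch.schema) if it renders in this form.
actual = parse_schema("(user_id string, timestamp timestamp)")
assert schema_problems(actual, expected) == []
```

Returning a list of problems instead of failing on the first mismatch makes it easier to see everything that differs when a table has changed upstream.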

With a Virtual Data Source defined and verified, you are now ready to define Tecton Transformations and Feature Packages that make use of this data.