Data Sources

Tecton has two data source classes:

  • BatchSource: Stores the parameters needed to connect to a batch source, such as a Hive table, a data warehouse table, or a file.
  • StreamSource: Stores the parameters needed to connect to a stream source (such as a Kafka topic or a Kinesis Stream). Also stores the parameters for a batch source, which contains the stream's historical event log.

Instances of these data source classes are used by Tecton Feature Views to generate feature values from raw data in the sources.

Batch Sources

A BatchFeatureView definition specifies one or more BatchSource objects, which indicate the sources from which the feature view generates feature values.

The batch_config object specified in a BatchSource object definition may optionally contain a timestamp column representing the time of each record. Values in the timestamp column must be in one of the following formats:

  • A native TimestampType object.
  • A string representing a timestamp that can be parsed by the default Spark SQL format, yyyy-MM-dd'T'hh:mm:ss.SSS'Z'.
  • A customized string representing a timestamp, for which you provide a custom timestamp_format to parse the string. The format has to follow this guideline.

A timestamp column must be specified in the batch_config object if any BatchFeatureView uses a FilteredSource that wraps a BatchSource defined with that batch_config object.
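
Before pointing a batch_config at a string timestamp column, it can help to check that sample values match the expected layout. The following is a minimal sketch in plain Python, not Tecton or Spark code. It assumes a 24-hour clock and uses Python's strptime directives, which differ from the Spark pattern letters shown above; the function name and sample values are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Python strptime counterpart of the default layout (assumption: 24-hour clock).
DEFAULT_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_timestamp(value: str, fmt: str = DEFAULT_FORMAT) -> Optional[datetime]:
    """Return the parsed timestamp, or None if the string does not match fmt."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


# A string in the default layout parses without a custom format.
default_ok = parse_timestamp("2022-01-15T13:45:30.123Z")

# A customized layout needs its own format string.
custom_ok = parse_timestamp("15/01/2022 13:45", "%d/%m/%Y %H:%M")

# A customized layout checked against the default format is rejected.
mismatch = parse_timestamp("15/01/2022 13:45")
```

A value that comes back as None here is a hint that the column would need a custom timestamp_format rather than the default.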

Defining a BatchSource

  1. Declare a configuration object that is an instance of a configuration class specific to your source. Tecton currently supports these configuration classes:

    • FileConfig: File source (such as a file on S3)
    • HiveConfig: Hive (or Glue) Table
    • RedshiftConfig: Redshift Table or Query
    • SnowflakeConfig: Snowflake Table or Query

    Note

    Tecton on Snowflake only supports SnowflakeConfig.

    The complete list of configurations can be found in the API Reference.

  2. Declare a BatchSource object that references the configuration defined in the previous step:

    • name: A unique identifier for the batch source. For example, "click_event_log".
    • batch_config: The configuration created in the step above.

See the Data Source API reference for detailed descriptions of Data Source attributes.

Example

The following example declares a BatchSource object that contains a configuration for connecting to Snowflake.

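from tecton import BatchSource, SnowflakeConfig

# Step 1: connection parameters for the Snowflake table that holds the raw data.
click_stream_snowflake_config = SnowflakeConfig(
    url="https://[your-cluster].eu-west-1.snowflakecomputing.com/",
    database="YOUR_DB",
    schema="CLICK_STREAM_SCHEMA",
    warehouse="COMPUTE_WH",
    table="CLICK_STREAM",
)

# Step 2: the batch source, which references the configuration above.
clickstream_snowflake_ds = BatchSource(
    name="click_stream_snowflake_ds",
    batch_config=click_stream_snowflake_config,
)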

Stream Sources

A StreamSource contains these configurations:

  • stream_config: The configuration for a stream source, which contains parameters for connecting to Kinesis or Kafka.
  • batch_config: The configuration for a batch source that backs the stream source; the batch source contains the stream's historical data.

A StreamSource is used by a StreamFeatureView to generate feature values using data from both the stream and batch sources.

A StreamFeatureView applies the same transformation to both data sources. This is possible because the StreamFeatureView uses a post processor function, referenced in the stream_config definition, which maps the fields of the stream source to those of the batch source.
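
The field-mapping idea can be illustrated with a minimal sketch in plain Python. It is not Tecton's API: the record layout, column names, and function names are all hypothetical, and plain dictionaries stand in for the stream and batch records. The point is only that once the post processor has reshaped a stream record into the batch schema, one transformation serves both sources.

```python
from typing import Any, Dict

# Hypothetical column names of the batch source (the stream's historical log).
BATCH_COLUMNS = ["user_id", "clicked", "timestamp"]


def post_process(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map a raw, nested stream record onto the batch source's column names."""
    payload = record["payload"]
    return {
        "user_id": payload["userId"],
        "clicked": int(payload["clicked"]),
        "timestamp": payload["eventTime"],
    }


def click_feature(row: Dict[str, Any]) -> Dict[str, Any]:
    """A single transformation, written once against the batch schema."""
    return {
        "user_id": row["user_id"],
        "is_click": row["clicked"] == 1,
        "timestamp": row["timestamp"],
    }


# The same event as it might appear in the batch table and on the stream.
batch_row = {"user_id": "u1", "clicked": 1, "timestamp": "2022-01-15T13:45:30.123Z"}
stream_record = {
    "payload": {"userId": "u1", "clicked": "1", "eventTime": "2022-01-15T13:45:30.123Z"}
}

from_batch = click_feature(batch_row)
from_stream = click_feature(post_process(stream_record))
```

Because post_process emits exactly the batch columns, click_feature never needs to know which source a row came from.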

See Create a Streaming Data Source for a description of how to iteratively develop a StreamSource.