Tecton has two data source classes:
BatchSource: Stores the parameters needed to connect to a batch source, such as a Hive table, a data warehouse table, or a file.
StreamSource: Stores the parameters needed to connect to a stream source (such as a Kafka topic or a Kinesis Stream). Also stores the parameters for a batch source, which contains the stream's historical event log.
Instances of these data source classes are used by Tecton Feature Views to generate feature values from raw data in the sources.
A BatchFeatureView definition specifies one or more BatchSource objects, which indicate the sources from which the feature view generates feature values.
The batch_config object specified in a BatchSource definition may optionally identify a timestamp column representing the time of each record. Values in the timestamp column must be in one of the following formats:
- A native TimestampType object.
- A string representing a timestamp that can be parsed by default Spark SQL.
- A customized timestamp string, for which you can provide a custom timestamp_format to parse the string. The format must follow Spark's datetime pattern guidelines.
A timestamp column must be specified in the batch_config object if any BatchFeatureView uses a FilteredSource with the BatchSource, because FilteredSource filters the source's records by time.
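As an illustrative sketch of declaring a timestamp column with a custom format, the example below uses a HiveConfig with hypothetical database, table, and column names (the pattern string follows Spark's datetime-format syntax):

```python
from tecton import BatchSource, HiveConfig

# Hypothetical Hive table; timestamp_field names the record-time column,
# and timestamp_format tells Spark how to parse its string values.
click_stream_hive_config = HiveConfig(
    database="my_db",            # assumed database name
    table="click_stream",        # assumed table name
    timestamp_field="event_ts",  # column holding each record's time
    timestamp_format="yyyy-MM-dd'T'HH:mm:ss",  # Spark datetime pattern
)

click_stream_batch_ds = BatchSource(
    name="click_stream_batch_ds",
    batch_config=click_stream_hive_config,
)
```

With the timestamp column declared, this BatchSource can be wrapped in a FilteredSource for time-filtered reads.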
Declare a configuration object that is an instance of a configuration class specific to your source. Tecton currently supports these configuration classes:
FileConfig: File source (such as a file on S3)
HiveConfig: Hive (or Glue) Table
RedshiftConfig: Redshift Table or Query
SnowflakeConfig: Snowflake Table or Query
Tecton on Snowflake only supports SnowflakeConfig.
The complete list of configurations can be found in the API Reference.
Declare a BatchSource object that references the configuration defined in the previous step:
name: A unique identifier for the batch source. For example, "click_stream_snowflake_ds".
batch_config: The configuration created in the step above.
See the Data Source API reference for detailed descriptions of Data Source attributes.
The following example declares a BatchSource object that contains a configuration for connecting to Snowflake.
```python
from tecton import BatchSource, SnowflakeConfig

# Connection configuration for the Snowflake table.
click_stream_snowflake_config = SnowflakeConfig(
    url="https://[your-cluster].eu-west-1.snowflakecomputing.com/",
    database="YOUR_DB",
    schema="CLICK_STREAM_SCHEMA",
    warehouse="COMPUTE_WH",
    table="CLICK_STREAM",
)

# The batch source that feature views reference.
clickstream_snowflake_ds = BatchSource(
    name="click_stream_snowflake_ds",
    batch_config=click_stream_snowflake_config,
)
```
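To connect the pieces, a BatchFeatureView can then reference this source. The sketch below is illustrative only: the `user` entity, the SQL, and the scheduling parameters are assumptions, and exact decorator parameters vary by SDK version.

```python
from datetime import datetime, timedelta

from tecton import batch_feature_view, FilteredSource
# Assumes `user` (an Entity) and `clickstream_snowflake_ds` are defined elsewhere.

@batch_feature_view(
    # FilteredSource requires a timestamp column in the source's batch_config.
    sources=[FilteredSource(clickstream_snowflake_ds)],
    entities=[user],                      # hypothetical entity
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2023, 1, 1),
)
def daily_click_counts(click_stream):
    # Hypothetical transformation over the source's columns.
    return f"""
        SELECT user_id, TIMESTAMP, COUNT(*) AS clicks
        FROM {click_stream}
        GROUP BY user_id, TIMESTAMP
    """
```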
StreamSource contains these configurations:
stream_config: The configuration for a stream source, which contains parameters for connecting to Kinesis or Kafka.
batch_config: The configuration for a batch source that backs the stream source; the batch source contains the stream's historical data.
A StreamSource is used by a StreamFeatureView to generate feature values using data from both the stream and batch sources. The StreamFeatureView applies the same transformation to both data sources. This is possible because the StreamFeatureView uses a post processor function, referenced in a StreamConfig definition, which maps the fields of the stream source to those of the batch source.
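As an illustrative sketch (the stream name, region, table, and field names are assumptions, not taken from this page), a StreamSource pairing a Kinesis stream with its historical batch log might look like:

```python
from tecton import BatchSource, HiveConfig, KinesisConfig, StreamSource

def message_post_processor(df):
    # Map raw stream fields to the batch source's column names and types,
    # so the same transformation can run against both sources.
    from pyspark.sql.functions import col
    return df.select(
        col("userId").alias("user_id"),
        col("ts").cast("timestamp").alias("event_ts"),
        col("page"),
    )

click_stream_stream_ds = StreamSource(
    name="click_stream_stream_ds",
    stream_config=KinesisConfig(
        stream_name="click-events",           # assumed Kinesis stream
        region="us-west-2",
        post_processor=message_post_processor,
        timestamp_field="event_ts",
    ),
    batch_config=HiveConfig(                  # historical event log
        database="my_db",
        table="click_stream_history",
        timestamp_field="event_ts",
    ),
)
```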
See Create a Streaming Data Source for a description of how to iteratively develop a StreamSource.