What data sources does Tecton support for ingest?
- Batch data sources: S3, Glue Catalog, Databricks Unity Catalog, Redshift, and Snowflake.
- Streaming data sources: Kafka and Kinesis.
- Custom batch or stream data sources: any data source that can be read into a Spark DataFrame.
When a data source is registered, is any data being copied?
No. Tecton does not duplicate the source data; it reads directly from the underlying data source. Tecton does manage the storage of your features: online for serving and offline for training.
When registering Hive data sources, do you have any recommendations or best practices?
We recommend registering your Hive data sources using AWS Glue. Glue converts all schema column names to lowercase, so transformations must assume that all input column names are lowercase. Capitalized column names can lead to difficult-to-catch bugs, so we recommend using lowercase column names in raw data source schemas and lowercase column references in transformations.
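As an illustration of the lowercase recommendation, here is a minimal sketch in plain Python that maps column names to lowercase and fails loudly when two names would collide after lowercasing. The function name `lowercase_columns` and the sample column names are hypothetical; nothing here is part of the Tecton or Glue APIs.

```python
from collections import defaultdict


def lowercase_columns(columns):
    """Return a mapping from each original column name to its lowercase form.

    Raises ValueError if two distinct names become identical once lowercased,
    since that is the kind of ambiguity that is hard to catch later.
    """
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)

    collisions = {low: names for low, names in groups.items() if len(names) > 1}
    if collisions:
        raise ValueError(f"columns collide after lowercasing: {collisions}")

    return {name: name.lower() for name in columns}


# Hypothetical raw schema with mixed capitalization.
mapping = lowercase_columns(["UserId", "amount", "EventTS"])
```

The resulting mapping could be used to rename columns before the data is registered, so that transformations only ever reference lowercase names.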
Why do streaming data source definitions also require a batch data source configuration?
The batch data source provides a historical record of your stream's output, which allows Tecton to backfill your features. Without it, feature values can be computed only from the point at which the stream is set up with Tecton. The stream's historical output must be collected at a granularity at least as fine as the one your features will use going forward (e.g., if features are processed in 15-minute intervals, the historical log must be stored in 15-minute intervals, at minimum). Tecton support can work with you to set up this infrastructure, if necessary.
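One way to sanity-check a historical log before relying on it for backfills is to look for processing intervals that contain no records. The sketch below is an assumption about how such a check might look, not a Tecton feature: the helper names, the 15-minute interval, and the sample timestamps are all hypothetical, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def interval_start(ts, interval):
    """Floor a timezone-aware timestamp to the start of its interval."""
    return EPOCH + ((ts - EPOCH) // interval) * interval


def missing_intervals(timestamps, start, end, interval):
    """List interval starts in [start, end) that have no record in the log."""
    covered = {interval_start(ts, interval) for ts in timestamps}
    missing = []
    cursor = interval_start(start, interval)
    while cursor < end:
        if cursor not in covered:
            missing.append(cursor)
        cursor += interval
    return missing


# Hypothetical log with records in three of the four 15-minute intervals.
utc = timezone.utc
log = [
    datetime(2024, 1, 1, 0, 3, tzinfo=utc),
    datetime(2024, 1, 1, 0, 20, tzinfo=utc),
    datetime(2024, 1, 1, 0, 50, tzinfo=utc),
]
gaps = missing_intervals(
    log,
    start=datetime(2024, 1, 1, 0, 0, tzinfo=utc),
    end=datetime(2024, 1, 1, 1, 0, tzinfo=utc),
    interval=timedelta(minutes=15),
)
```

An empty interval is not necessarily an error, since a quiet stream may produce no events, but unexpected gaps are worth investigating before a backfill.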
What infrastructure does Tecton use for streaming data sources?
Tecton connects to Kafka or Kinesis as the streaming data source and processes those streams with Spark Structured Streaming.
What file formats does Tecton support?
Today, Tecton reads raw data with Spark and supports all data formats that Spark natively supports, including CSV, JSON, Parquet, and Avro. The TFRecord format (.tfrecords), which Spark does not natively support, is not supported by Tecton.
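To make the format list concrete, the following sketch maps a file extension to the format name that Spark's `DataFrameReader.format` accepts and rejects anything else, including .tfrecords. The helper `spark_format_for` and the example path are hypothetical, and the mapping covers only the four formats named above. Depending on the Spark distribution, reading Avro may also require the spark-avro package.

```python
from pathlib import PurePosixPath

# Extension -> Spark format name, limited to the formats listed in this FAQ.
SPARK_FORMATS = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".avro": "avro",
}


def spark_format_for(path):
    """Return the Spark format name for a file path, based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return SPARK_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file format: {suffix or path!r}") from None


# Hypothetical path; with a live SparkSession `spark`, the read would be:
#   spark.read.format(spark_format_for(path)).load(path)
path = "s3://example-bucket/events/part-0000.parquet"
fmt = spark_format_for(path)
```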
How does Tecton use batch and stream data sources together?
Features in Tecton are built on top of either a batch data source or a streaming data source. For either kind of data source, you provide the scheduling cadence for the feature (e.g., weekly, daily, or hourly). For streaming features, processing runs against the stream using Spark Structured Streaming.
How do I define a custom data source not in the list above?
You can define a function that returns a Spark DataFrame, which Tecton will use to construct your data source. See here for more details.
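A minimal sketch of such a function is shown below. It assumes the function receives an active `SparkSession` as its argument and uses only the standard PySpark reader calls `spark.read.format(...).load(...)`. The function name `load_events` and the path are hypothetical, and how the function is registered with Tecton is not shown here.

```python
# Hypothetical location of the raw data; replace with your own source.
EVENTS_PATH = "s3://example-bucket/events/"


def load_events(spark):
    """Read the raw events into a Spark DataFrame.

    `spark` is assumed to be an active pyspark.sql.SparkSession. Any logic
    that ends in a DataFrame (joins, filters, custom connectors) could go
    here instead of a plain Parquet read.
    """
    return spark.read.format("parquet").load(EVENTS_PATH)
```

Keeping the read logic inside one function makes it easy to exercise with a local SparkSession before wiring it into a data source definition.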