Connect Data Sources to Spark
When you run on Spark compute, Tecton needs the right permissions and configuration to connect Spark jobs to your data. The requirements vary by data source.
After setting up a Tecton deployment, you'll follow this workflow:
- Add IAM permissions to data sources: This grants Tecton permission to access your data.
- Create Tecton data source objects: This tells Tecton about your data source. You can do this in repository code (.py files) or in a notebook. See Create a Data Source for details.
- Test data retrieval in a Tecton notebook: Retrieve data from the data source in a notebook to confirm that the data source is properly configured. See Test Data Sources for details.
- Tune the data source in Tecton: After you have tested the data source, tune it for your specific use case through Tecton configuration.
Tecton has Python Data Source objects that can connect to the following batch data stores:
- CSV, Parquet, and JSON on S3
- Hive Tables via AWS Glue Data Catalog
- AWS Redshift Tables
- Snowflake Tables
Tecton has Python Data Source objects that can connect to the following data streams:
- AWS Kinesis Streams
- Kafka Topics
You can also use a data source function to connect to an arbitrary data source.
With a data source function, you write a PySpark function that loads your data
source and returns a Spark DataFrame.
Compared to using a data source object, a data source function gives you more
flexibility in connecting to an underlying data source and specifying logic for
transforming the data retrieved from the underlying data source.
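As a rough illustration of the shape such a function can take, the sketch below accepts a Spark session, reads from an underlying store, applies some transformation logic, and returns the resulting DataFrame. The function name `load_events`, the S3 path, and the column names are hypothetical. In a Tecton repository the function would also be registered with Tecton's data source function decorator (believed to be `@spark_batch_config()`; check the SDK reference), which is left out here so the sketch runs without Tecton installed.

```python
from typing import Any


def load_events(spark: Any) -> Any:
    """Load raw events with PySpark and return a Spark DataFrame.

    `spark` is expected to be a pyspark.sql.SparkSession. It is typed
    loosely so this sketch does not require pyspark at import time.
    """
    # Hypothetical location of the underlying data; replace with your own.
    raw = spark.read.parquet("s3://example-bucket/events/")

    # Transformation logic applied to the retrieved data (assumed columns):
    # drop non-positive amounts and rename the timestamp column.
    cleaned = raw.filter("amount > 0").withColumnRenamed("ts", "timestamp")

    return cleaned
```

Keeping both the read and the transformation inside one function is what gives a data source function its extra flexibility: any logic that can be expressed in PySpark can run before Tecton sees the data.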
The following sections explain how to connect each data source:
📄️ AWS Glue Data Catalog
📄️ S3
📄️ Snowflake
Tecton can use Snowflake as a source of batch data for feature materialization
📄️ Databricks Unity Catalog
📄️ Redshift
Tecton can use Amazon Redshift as a source of batch data for feature
📄️ Databricks Unity Catalog (with FGAC)
Fine-grained access control (FGAC) is a Databricks Private Preview capability
📄️ Kinesis
📄️ Kafka
Tecton can use Kafka as a data source for feature materialization. Connecting to
📄️ Custom Spark Data Source