Version: 0.9

Connect Data Sources to Spark

When Tecton runs on Spark compute, its Spark jobs need the right permissions and configuration to reach your data. The exact requirements vary by data source.

Tecton has Python Data Source objects that can connect to the following batch data stores:

CSV, Parquet, and JSON on S3
Hive Tables via AWS Glue Data Catalog
AWS Redshift Tables
Snowflake Tables

Tecton has Python Data Source objects that can connect to the following data streams:

AWS Kinesis Streams
Kafka Topics

note

You can also use a data source function to connect to an arbitrary data source. A data source function is a PySpark function that you write: it loads your data source and returns a Spark DataFrame. Compared to a data source object, a data source function gives you more flexibility in how you connect to the underlying data source and in how you transform the data retrieved from it.
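As a rough illustration of the shape such a function might take, the sketch below reads Parquet files and applies a small transformation before returning the DataFrame. The bucket path, the column names (`amount`, `ts`), and the function name `load_events` are all hypothetical. Registering the function with Tecton (for example through a decorator such as `spark_batch_config`, which is assumed here and not shown) is left out so that the example depends only on the Spark session passed in.

```python
# Hypothetical S3 location, used only for illustration.
EVENTS_PATH = "s3://example-bucket/events/"


def load_events(spark):
    """Load raw events and return a Spark DataFrame.

    `spark` is expected to be a SparkSession supplied by the caller.
    The column names below are assumptions, not part of any real schema.
    """
    # Read the raw files from the underlying store.
    df = spark.read.format("parquet").load(EVENTS_PATH)

    # Custom logic applied at load time: drop non-positive amounts and
    # rename the raw timestamp column.
    return df.filter("amount > 0").withColumnRenamed("ts", "timestamp")
```

Because the function only uses the session it receives, the same body can be tried in a notebook against a local SparkSession before it is wired into a Tecton data source.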

The following pages explain how to connect each data source:

📄️ AWS Glue Data Catalog

📄️ S3

📄️ Snowflake

📄️ Databricks Unity Catalog

📄️ Redshift

📄️ Databricks Unity Catalog (with FGAC)

📄️ Kinesis

📄️ Kafka

📄️ Custom Spark Data Source
