
For Tecton to read your data successfully, it needs the proper permissions and configuration, which can vary per data source.

Supported Data Sources

Tecton on GCP supports the following data sources:

  • Files stored in Google Cloud Storage. The supported file formats are CSV, Parquet, and JSON.

  • BigQuery tables

  • Other GCP data sources that have a Spark Connector

  • Kafka

  • Pub/Sub

Connecting to a file in GCP

To connect to a file in GCP, follow these steps.

1. Grant the service account permission to access the bucket

Give the service account you created for the data plane access to the GCP bucket for your data source.
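
As a sketch, a read-only grant with gsutil might look like the following; the service account email and bucket name are placeholders, and a narrower role may suit your environment:

gsutil iam ch serviceAccount:{YOUR-SERVICE-ACCOUNT}@{YOUR-PROJECT-ID}.iam.gserviceaccount.com:roles/storage.objectViewer gs://{YOUR-BUCKET-NAME-HERE}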

2. Register the GCP data source

Once Tecton has access, register the data source by defining a data source object in your feature repository, as shown below.

Create a config object using FileConfig and place it in a BatchSource object. For example:

from tecton import BatchSource, FileConfig

sample_data_config = FileConfig(
    uri="gs://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq",
    file_format="parquet",
)

sample_data_vds = BatchSource(name="sample_data", batch_config=sample_data_config, ...)

After you have created these objects in your local feature repository, run tecton apply to submit them to the production Feature Store.

3. Test the GCP data source

To test that the connection to the GCP data source works, open the interactive notebook that you use for Tecton development and preview the data:

import tecton

ds = tecton.get_data_source("sample_data")
# to_pandas() returns a pandas DataFrame, so use head() directly (pandas has no show()).
ds.get_dataframe().to_pandas().head(10)

If you get a 403 error when calling get_dataframe, Tecton does not have permission to access the data. Check the bucket permissions, as shown below. If you continue to get errors, contact Tecton support.
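
One way to check the bucket permissions is to inspect the bucket's IAM policy and confirm that the data plane service account appears with a read role; the bucket name below is a placeholder:

gsutil iam get gs://{YOUR-BUCKET-NAME-HERE}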

Connecting to Kafka

Follow the Kafka connection instructions in the Tecton documentation.
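
As a rough sketch only: Tecton's KafkaConfig pairs with a StreamSource, alongside a batch config used for backfills. The broker address, topic, payload schema, and the assumption that the raw message column is named value are all placeholders, and parameter names may differ across Tecton SDK versions, so check the API reference:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructType, TimestampType
from tecton import KafkaConfig, StreamSource

def deserialize_events(df):
    # Placeholder schema; replace with the JSON schema of your topic's payload.
    payload_schema = (
        StructType()
        .add("user_id", StringType(), False)
        .add("timestamp", TimestampType(), False)
    )
    # Assumes the raw message bytes arrive in a `value` column, as in Spark's Kafka source.
    return (
        df.selectExpr("cast (value as STRING) json_payload")
        .select(from_json(col("json_payload"), payload_schema).alias("payload"))
        .select("payload.*")
    )

click_stream_config = KafkaConfig(
    kafka_bootstrap_servers="{YOUR-BROKER-HOST}:9092",
    topics="{YOUR-TOPIC}",
    post_processor=deserialize_events,  # deserializes each message
    timestamp_field="timestamp",
)

click_stream_source = StreamSource(
    name="click_stream",
    stream_config=click_stream_config,
    batch_config=sample_data_config,  # e.g., the FileConfig defined above, for backfills
)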

Connecting to BigQuery

1. Create your data source

Grant the service account configured for your Spark jobs read access to the BigQuery table.
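
A hypothetical project-level grant with the BigQuery Data Viewer role is sketched below; the project ID and service account email are placeholders, and reads through the BigQuery Storage API may additionally require a role such as roles/bigquery.readSessionUser:

gcloud projects add-iam-policy-binding {YOUR-PROJECT-ID} \
    --member="serviceAccount:{YOUR-SERVICE-ACCOUNT}@{YOUR-PROJECT-ID}.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataViewer"

Then, create a batch config function that reads from your BigQuery table: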

from tecton import BatchSource, spark_batch_config

@spark_batch_config()
def bigquery_config(spark):
    # Read the table using the Spark BigQuery connector.
    df = (
        spark.read.format("com.google.cloud.spark.bigquery")
        .option("table", "bigquery-public-data.google_trends.international_top_terms")
        .load()
    )
    return df


data_source = BatchSource(name="bigquery_source", batch_config=bigquery_config)

2. Test your data source

To test that the connection to the BigQuery data source works, open the interactive notebook that you use for Tecton development and preview the data:

ds = tecton.get_data_source("bigquery_source")
ds.get_dataframe().to_pandas().head(10)
