
For Tecton to read your data successfully, it needs the proper permissions and configuration, which vary by data source.

Supported Data Sources

Tecton on GCP supports the following data sources:

  • Files stored in Google Cloud Storage. The supported file formats are CSV, Parquet, and JSON.

  • BigQuery tables

  • Other GCP data sources that have a Spark Connector

  • Kafka

  • Pub/Sub

Connecting to a file in GCP

To connect to a file in GCP, follow these steps.

1. Add the bucket and give the service account permission to access it

Give the service account you created for the data plane access to the Cloud Storage bucket that contains your data source.
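
If you grant this access in code rather than in the Google Cloud console, a minimal sketch using the google-cloud-storage client might look like the following. The bucket name and service account email are placeholders, and roles/storage.objectViewer is shown only as an illustrative read-only role.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")

# Add a read-only binding for the data plane service account to the bucket's IAM policy.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"serviceAccount:data-plane@your-project.iam.gserviceaccount.com"},
    }
)
bucket.set_iam_policy(policy)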

2. Register the GCP data source

Once Tecton has access, register the data source in your feature repository, in the file(s) that contain your data source objects, as shown below.

Create a config object using FileConfig and place it in a BatchSource object. For example:

from tecton import BatchSource, FileConfig

sample_data_config = FileConfig(uri="gs://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq", file_format="parquet")

sample_data_vds = BatchSource(name="sample_data", batch_config=sample_data_config, ...)

After you have created these objects in your local feature repository, run tecton apply to submit them to the production Feature Store.

3. Test the GCP data source

To test that the connection to the GCP data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:

import tecton

ds = tecton.get_data_source("sample_data")
ds.get_dataframe().to_pandas().head(10)

If you get a 403 error when calling get_dataframe, Tecton does not have permission to access the data. Check the bucket permissions. If you continue to get errors, contact Tecton support.
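
As a quick check that is independent of Tecton, you can try reading the same file directly with the Spark session in your notebook; if that read also returns a 403, the bucket's IAM policy is the problem rather than the Tecton registration. The path below is the same placeholder used above.

from pyspark.sql import SparkSession

# In most Tecton notebook environments a Spark session already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("gs://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq")
df.show(10)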

Connecting to Kafka

Follow these instructions.

Connecting to BigQuery

1. Create your data source

Grant the service account configured for your Spark jobs access to the BigQuery table.
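
If you manage BigQuery permissions in code, the sketch below grants read access at the dataset level with the google-cloud-bigquery client; the project, dataset, and service account email are placeholders, and you can grant access at the table or project level instead if that matches your setup.

from google.cloud import bigquery

client = bigquery.Client()

# Look up the dataset that contains the table (placeholder IDs).
dataset = client.get_dataset("your-project.your_dataset")

# Append a read-only access entry for the Spark service account and save it.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="spark-jobs@your-project.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])

Then, create a batch config function that reads from your BigQuery table. For example: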

from tecton import BatchSource, spark_batch_config


@spark_batch_config()
def bigquery_config(spark):
    # Read the source table through the Spark BigQuery connector.
    df = (
        spark.read.format("com.google.cloud.spark.bigquery")
        .option("table", "bigquery-public-data.google_trends.international_top_terms")
        .load()
    )
    return df


data_source = BatchSource(name="bigquery_source", batch_config=bigquery_config)

2. Test your data source

To test that the connection to the BigQuery data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:

import tecton

ds = tecton.get_data_source("bigquery_source")
ds.get_dataframe().to_pandas().head(10)