For Tecton to successfully read your data, it requires the proper permissions and configuration, which can vary per data source.
Supported Data Sources
Tecton on GCP supports the following data sources:
- Files stored in Google Cloud Storage. The supported file formats are CSV, Parquet, and JSON.
- BigQuery tables
- Other GCP data sources that have a Spark connector
- Kafka
- Pub/Sub
Connecting to a file in GCP
To connect to a file in GCP, follow these steps.
1. Add the bucket and give the service account permission to access the bucket
Give the service account you created for the data plane access to the GCP bucket for your data source.
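If you manage bucket IAM programmatically, the following sketch shows one way to add that binding with the google-cloud-storage client. The bucket name and service account email are placeholders; substitute your own. Granting access through the GCP console or gsutil works just as well.
from google.cloud import storage

# Placeholders: substitute your bucket and the data plane service account.
BUCKET = "YOUR-BUCKET-NAME-HERE"
SERVICE_ACCOUNT = "tecton-dataplane@your-project.iam.gserviceaccount.com"  # hypothetical name

client = storage.Client()
bucket = client.bucket(BUCKET)

# Fetch the current IAM policy and append a read-only role binding.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {f"serviceAccount:{SERVICE_ACCOUNT}"}}
)
bucket.set_iam_policy(policy)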
2. Register the GCP data source
Once Tecton has access, register the data source by adding it to the file(s) in your feature repository that contain your data source objects, as shown below.
Create a config object using FileConfig and place it in a BatchSource object. For example:
from tecton import BatchSource, FileConfig

sample_data_config = FileConfig(uri="gs://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq", file_format="parquet")
sample_data_vds = BatchSource(name="sample_data", batch_config=sample_data_config, ...)
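If your file includes an event timestamp column, FileConfig also accepts a timestamp_field parameter so Tecton can filter rows by time. A hedged CSV variant, where the column name created_at is a placeholder:
sample_csv_config = FileConfig(
    uri="gs://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.csv",
    file_format="csv",
    timestamp_field="created_at",  # placeholder column name
)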
After you have created these objects in your local feature repository, run tecton apply to submit them to the production Feature Store.
3. Test the GCP data source
To test that the connection to the GCP data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:
ds = tecton.get_data_source("sample_data")
ds.get_dataframe().to_pandas().head(10)
If you get a 403 ERROR when calling get_dataframe, Tecton does not have permission to access the data. Check the bucket permissions. If you continue to get errors, contact Tecton support.
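One quick way to narrow down a 403 is to check whether the object is readable with the same credentials outside of Tecton. A minimal sketch, assuming the google-cloud-storage package is installed and the environment is authenticated as the data plane service account (the names are the placeholders from the example above):
from google.cloud import storage
from google.api_core.exceptions import Forbidden

client = storage.Client()
blob = client.bucket("YOUR-BUCKET-NAME-HERE").blob("YOUR-FILENAME.pq")
try:
    # exists() performs a metadata GET; it raises Forbidden on a 403
    # and returns False if the object is simply missing.
    print(blob.exists())
except Forbidden:
    print("The service account lacks read access to this bucket.")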
Connecting to Kafka
Follow these instructions.
Connecting to BigQuery
1. Create your data source
Grant the service account configured for your Spark jobs access to the table. Then, create a batch config function that reads from your BigQuery table:
from tecton import BatchSource, spark_batch_config

@spark_batch_config()
def bigquery_config(spark):
    # Read the table with the Spark BigQuery connector.
    df = (
        spark.read.format("com.google.cloud.spark.bigquery")
        .option("table", "bigquery-public-data.google_trends.international_top_terms")
        .load()
    )
    return df

data_source = BatchSource(name="bigquery_source", batch_config=bigquery_config)
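If your table has a timestamp column, spark_batch_config can also push Tecton's time-range filters into the read by setting supports_time_filtering=True; the batch config function then receives a filter_context with start_time and end_time. A sketch under that assumption, where refresh_date is taken to be the timestamp column of this public table:
@spark_batch_config(supports_time_filtering=True)
def bigquery_filtered_config(spark, filter_context):
    df = (
        spark.read.format("com.google.cloud.spark.bigquery")
        .option("table", "bigquery-public-data.google_trends.international_top_terms")
        .load()
    )
    # Apply the time bounds Tecton passes in, when present.
    if filter_context:
        if filter_context.start_time:
            df = df.where(df.refresh_date >= filter_context.start_time)
        if filter_context.end_time:
            df = df.where(df.refresh_date < filter_context.end_time)
    return df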
2. Test your data source
To test that the connection to the BigQuery data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:
ds = tecton.get_data_source("bigquery_source")
ds.get_dataframe().to_pandas().head(10)