Version: 1.0
Private Preview

This feature is currently in Private Preview.

Connect to BigQuery

Tecton is a BigQuery Validated Partner and integrates with BigQuery to run automated feature engineering pipelines on BigQuery data. To run feature pipelines based on data in BigQuery, Tecton needs to be configured with access to your BigQuery data warehouse. The following guide shows how to configure these permissions and validate that Tecton can connect to your data source.

Configuring BigQuery Authentication

To enable materialization jobs to authenticate to BigQuery, add the service account credentials as secrets in Tecton Secrets and reference them in your BigQueryConfig block as shown below. If you use a different type of authentication, you can instead define a custom pandas_batch_config that retrieves the secrets and uses them to connect to your BigQuery instance.

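For example, the following is a minimal sketch of such a pandas_batch_config, not a canonical implementation: it assumes the full service-account key JSON is stored in Tecton Secrets under a placeholder scope and key, and that the decorated function receives the resolved secret values through a secrets argument.

import json

from tecton import BatchSource, Secret, pandas_batch_config


@pandas_batch_config(
    # Placeholder scope/key; the secret value is assumed to be the full service-account key JSON.
    secrets={"bq_key_json": Secret(scope="your-secrets-scope", key="your-bq-service-account-key-json")},
)
def sf_bike_trips_custom(secrets):
    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Build BigQuery credentials from the key JSON retrieved from Tecton Secrets.
    key_info = json.loads(secrets["bq_key_json"])
    credentials = service_account.Credentials.from_service_account_info(key_info)
    client = bigquery.Client(project=key_info["project_id"], credentials=credentials)

    # Run a query and return a pandas DataFrame for Tecton to consume.
    sql = """
        SELECT trip_id, duration_sec, start_station_name, start_date
        FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
    """
    return client.query(sql).to_dataframe()


custom_ds = BatchSource(name="sf_bike_trips_custom_ds", batch_config=sf_bike_trips_custom)
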
The following permissions are required for the role used to access the BigQuery data source:

  • bigquery.jobs.create
  • bigquery.tables.get
  • bigquery.tables.create
  • bigquery.tables.getData
  • bigquery.readsessions.*

Ways to Authenticate:

  1. Service Account Credentials

    • Use Case: Can be used in development and production setups.
    • How to Use:
      • Add the service account credentials as secrets in Tecton Secrets and reference them in your BigQueryConfig block as shown in the example below.
      • Optionally, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the credentials file (see the example after this list).
  2. IAM Roles

    • Use Case: Ideal for applications running on Google Cloud services.
    • How to Use:
      • Configure your cloud service with an IAM role that includes the necessary BigQuery permissions.
  3. gcloud auth Authentication

    • Use Case: Best suited for developers working in local or notebook environments.
    • How to Use:
      • Use gcloud auth login for personal accounts, or gcloud auth activate-service-account with a key file for service accounts.
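
For option 1, if you keep a downloaded key file locally instead of using Tecton Secrets, you can point the Google client libraries at it before defining your data source. The path below is a placeholder:

import os

# Placeholder path to your downloaded service-account key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"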

Testing a BigQuery Data Source

To validate that Tecton can read your BigQuery data source, create a Tecton Data Source definition and test that you can read data from it. The following example shows how to define a BigQueryConfig in your notebook environment using service account credentials and validate that Tecton is able to read from your BigQuery data source.

note

When developing in a local or notebook environment, you may also need to install google-cloud-bigquery[pandas].

BigQuery as a data source:

import tecton
from datetime import datetime, timedelta
from tecton import BigQueryConfig, BatchSource, Secret

# Follow the prompt to complete your Tecton Account sign in
tecton.login("https://<your-account>.tecton.ai")

# Declare a BigQueryConfig instance that can be used as the batch_config of a BatchSource
bq_config = BigQueryConfig(
    project_id="bigquery-public-data",
    dataset="san_francisco_bikeshare",
    location="US",
    # table="bikeshare_trips",
    query="""
        SELECT trip_id, duration_sec, start_station_name, start_date
        FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`""",
    timestamp_field="start_date",
    credentials=Secret(scope="your-secrets-scope", key="your-bq-service-account-key"),
)

# Use in the BatchSource
ds = BatchSource(name="sf_bike_trips_ds", batch_config=bq_config)

tecton_df = ds.get_dataframe(start_time=datetime(2018, 1, 1), end_time=datetime(2018, 1, 1) + timedelta(days=1))
tecton_df.to_pandas().head(10)

When using mode='bigquery_sql' (Batch Feature Views only), Tecton runs the materialization job on Rift and pushes the SQL transformations down into BigQuery for execution.

from datetime import timedelta, datetime

from tecton.types import Field, Int64, String, Timestamp
from tecton import batch_feature_view, BigQueryConfig, Entity, Aggregate

station = Entity(name="station", join_keys=[Field("start_station_name", String)])


@batch_feature_view(
    mode="bigquery_sql",
    sources=[ds],
    entities=[station],
    aggregation_interval=timedelta(days=1),
    features=[
        Aggregate(input_column=Field("duration_sec", Int64), function="mean", time_window=timedelta(days=1)),
    ],
    timestamp_field="timestamp",
)
def trip_duration_by_station(sf_bike_trips):
    return f"""
        SELECT
            start_station_name,
            duration_sec,
            start_date AS timestamp
        FROM {sf_bike_trips}
        """


trip_duration_by_station.validate()

tecton_df = trip_duration_by_station.get_features_in_range(
    start_time=datetime(2018, 1, 1), end_time=datetime(2018, 1, 5)
)
tecton_df.to_pandas().head(10)
