Private Preview

This feature is currently in Private Preview.

Connect to BigQuery

Tecton is a BigQuery Validated Partner and integrates with BigQuery to run automated feature engineering pipelines on BigQuery data. To run feature pipelines based on data in BigQuery, Tecton needs to be configured with access to your BigQuery data warehouse. The following guide shows how to configure these permissions and validate that Tecton can connect to your data source.

Configuring BigQuery Authentication

To enable materialization jobs to authenticate to BigQuery, add the service account credentials as secrets in Tecton Secrets and reference them in your BigQueryConfig block as shown below. If you use a different type of authentication, you can instead use a custom pandas_batch_config, retrieving and injecting secrets in the code you define there to connect to your BigQuery instance.
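
For example, a custom pandas_batch_config could retrieve a service-account key from Tecton Secrets and query BigQuery directly. The sketch below is illustrative rather than a drop-in implementation: the secret scope and key names are placeholders, and it assumes the stored secret value is a service account key in JSON form.

import json

from tecton import pandas_batch_config, Secret


@pandas_batch_config(
    # Placeholder scope/key names; Tecton resolves these and passes the
    # values to the decorated function via the `secrets` argument
    secrets={"bq_sa_key": Secret(scope="your-secrets-scope", key="your-bq-service-account-key")},
)
def bq_custom_source(secrets):
    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Build BigQuery credentials from the resolved secret value
    info = json.loads(secrets["bq_sa_key"])
    credentials = service_account.Credentials.from_service_account_info(info)
    client = bigquery.Client(credentials=credentials, project=info["project_id"])

    # Return the query result as a pandas DataFrame
    return client.query(
        "SELECT trip_id, duration_sec, start_station_name, start_date "
        "FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`"
    ).to_dataframe()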

The following permissions are required for the role used to access the BigQuery data source:

  • bigquery.jobs.create
  • bigquery.tables.get
  • bigquery.tables.create

Ways to Authenticate:

  1. Service Account Credentials

    • Use Case: Can be used in development and production setups.
    • How to Use:
      • Add the service account credentials as secrets in Tecton Secrets and reference them in your BigQueryConfig block as shown in the example below.
      • Optionally, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the credentials file (see the sketch after this list).
  2. IAM Roles

    • Use Case: Ideal for applications running on Google Cloud services.
    • How to Use:
      • Configure your cloud service with an IAM role that includes the necessary BigQuery permissions.
  3. gcloud auth Authentication

    • Use Case: Best suited for developers working in local or notebook environments.
    • How to Use:
      • Use gcloud auth login for personal accounts, or gcloud auth activate-service-account with a key file for service accounts.
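
As a minimal illustration of the optional environment-variable approach from option 1 (the key-file path below is hypothetical):

import os

# Hypothetical path to a downloaded service account key file; the Google
# Cloud client libraries pick this variable up automatically
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"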

Testing a BigQuery Data Source

To validate that Tecton can read your BigQuery data source, create a Tecton Data Source definition and test that you can read data from the Data Source. The following example shows how to define a BigQueryConfig in your notebook environment using service account credentials, and validate that Tecton is able to read from your BigQuery data source.

note

When developing in a local or notebook environment, you may also need to install google-cloud-bigquery[pandas]:
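
pip install 'google-cloud-bigquery[pandas]'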

BigQuery as a data source:

import tecton
from datetime import datetime, timedelta
from tecton import BigQueryConfig, BatchSource, Secret

# Follow the prompt to complete your Tecton Account sign in
tecton.login("https://<your-account>.tecton.ai")

# Declare a BigQueryConfig instance that can be passed to a BatchSource
bq_config = BigQueryConfig(
    project_id="bigquery-public-data",
    dataset="san_francisco_bikeshare",
    location="US",
    # table="bikeshare_trips",
    query="""
        SELECT trip_id, duration_sec, start_station_name, start_date
        FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`""",
    timestamp_field="start_date",
    credentials=Secret(scope="your-secrets-scope", key="your-bq-service-account-key"),
)

# Use the config in a BatchSource
ds = BatchSource(name="sf_bike_trips_ds", batch_config=bq_config)

# Read one day of data to validate that Tecton can connect
tecton_df = ds.get_dataframe(start_time=datetime(2018, 1, 1), end_time=datetime(2018, 1, 1) + timedelta(days=1))
tecton_df.to_pandas().head(10)

When using mode='bigquery_sql' (Batch Feature Views only), Tecton runs the feature pipeline on Rift and pushes the SQL transformations down into BigQuery.

from datetime import timedelta, datetime

from tecton.types import Field, Int64, String, Timestamp
from tecton import batch_feature_view, BigQueryConfig, Entity, Aggregate

# Entity keyed by the station where each trip starts
station = Entity(name="station", join_keys=[Field("start_station_name", String)])


# Daily mean trip duration per station; the SQL transformation is pushed
# down into BigQuery
@batch_feature_view(
    mode="bigquery_sql",
    sources=[ds],
    entities=[station],
    aggregation_interval=timedelta(days=1),
    features=[
        Aggregate(input_column=Field("duration_sec", Int64), function="mean", time_window=timedelta(days=1)),
    ],
    timestamp_key="timestamp",
)
def trip_duration_by_station(sf_bike_trips):
    return f"""
        SELECT
            start_station_name,
            duration_sec,
            start_date as timestamp
        FROM {sf_bike_trips}
        """


trip_duration_by_station.validate()

# Compute features over a date range and preview the results
tecton_df = trip_duration_by_station.get_features_in_range(
    start_time=datetime(2018, 1, 1), end_time=datetime(2018, 1, 5)
)
tecton_df.to_pandas().head(10)
