Connect to Snowflake Using Spark
Tecton can use Snowflake as a source of batch data for feature materialization with Spark. This page explains how to set up Tecton to use Snowflake as a data source.
Prerequisites
To set up Tecton to use a data source on Snowflake, you need the following:
- A notebook connection to Databricks or EMR.
- The URL for your Snowflake account.
- The name of the virtual warehouse Tecton will use for querying data from Snowflake.
- A Snowflake username and private key. See Snowflake's guide on configuring key-pair authentication.
- We recommend you create a new user in Snowflake configured to give Tecton read-only access. This user needs to have access to the warehouse. See Snowflake documentation on how to configure this access.
- A Snowflake Read-only role for Spark, granted to the user created above. See the Snowflake documentation for the required grants.
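Snowflake's key-pair setup comes down to generating an RSA key pair with OpenSSL and attaching the public key to the user. A minimal sketch of the unencrypted variant — the user name `TECTON_RO` and the file names are placeholders, so adapt them to your environment and follow Snowflake's guide for the authoritative steps:

```shell
# Generate a 2048-bit RSA private key in PKCS#8 PEM format (unencrypted).
# Snowflake also accepts encrypted private keys; if you use one, you will
# also need to store its passphrase as a secret (see below).
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt

# Derive the matching public key.
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

# In Snowflake, attach the public key (the base64 body, without the PEM
# header/footer lines) to the read-only user, e.g.:
#   ALTER USER TECTON_RO SET RSA_PUBLIC_KEY='MIIBIjANBgkq...';
```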
If you're using different warehouses for different data sources, the Snowflake user / private key above must have access to each warehouse. Otherwise, you'll run into the following exception when running get_features_for_events() or run_transformation():
net.snowflake.client.jdbc.SnowflakeSQLException: No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.
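One way to avoid this is to grant the shared Tecton user usage on every warehouse and name the warehouse explicitly in each config. A minimal sketch, assuming hypothetical warehouse, database, and table names (ETL_WH, REPORTING_WH, ORDERS_DB, and so on):

```python
from tecton import SnowflakeConfig, BatchSource

# Both sources authenticate as the same Snowflake user (via the
# SNOWFLAKE_USER / SNOWFLAKE_PRIVATE_KEY secrets), so that user needs
# access to both warehouses.
clicks_config = SnowflakeConfig(
    url="https://<your-cluster>.<your-snowflake-region>.snowflakecomputing.com/",
    database="CLICK_STREAM_DB",
    schema="CLICK_STREAM_SCHEMA",
    warehouse="ETL_WH",  # hypothetical warehouse name
    table="CLICK_STREAM_FEATURES",
)

orders_config = SnowflakeConfig(
    url="https://<your-cluster>.<your-snowflake-region>.snowflakecomputing.com/",
    database="ORDERS_DB",  # hypothetical database/schema/table
    schema="ORDERS_SCHEMA",
    warehouse="REPORTING_WH",  # a different warehouse for this source
    table="ORDER_FEATURES",
)

clicks_ds = BatchSource(name="clicks_snowflake_ds", batch_config=clicks_config)
orders_ds = BatchSource(name="orders_snowflake_ds", batch_config=orders_config)
```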
Configuring Secrets
In the past, Snowflake supported password authentication, and you may have Snowflake data sources that relied on this authentication method. Existing data sources that connect to Snowflake using the tecton-<deployment-name>/SNOWFLAKE_PASSWORD secret will continue to work. However, password authentication is being deprecated by Snowflake and will be disabled later this year. See Snowflake's deprecation notice for a more detailed timeline. We recommend setting up private key authentication as soon as possible.
To enable the Spark jobs managed by Tecton to read data from Snowflake, you will configure secrets in your secret manager.
For EMR users, follow the instructions to add a secret to the AWS Secrets Manager. For Databricks users, follow the instructions for creating a secret with Databricks secret management. Databricks users may also use AWS Secrets Manager if preferred.
Note that if your deployment name already starts with tecton-, the prefix is simply your deployment name (don't prepend tecton- a second time). The deployment name is typically the name used to access Tecton, i.e. https://<deployment-name>.tecton.ai.
- Add a secret named tecton-<deployment-name>/SNOWFLAKE_USER, and set it to the Snowflake user name you configured above.
- Add a secret named tecton-<deployment-name>/SNOWFLAKE_PRIVATE_KEY, and set it to the Snowflake private key you configured above. Include the entire key, including the delimiters.
If you chose to generate an encrypted private key, also add a secret named tecton-<deployment-name>/SNOWFLAKE_PRIVATE_KEY_PASSPHRASE containing the passphrase.
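For EMR deployments, the secrets can be created with the AWS CLI. A sketch, where the deployment name mycluster, the user name TECTON_RO, and the key file rsa_key.p8 are all placeholders for your own values:

```shell
# Store the Snowflake user name Tecton should authenticate as.
aws secretsmanager create-secret \
  --name "tecton-mycluster/SNOWFLAKE_USER" \
  --secret-string "TECTON_RO"

# Store the full private key, including the PEM delimiters.
aws secretsmanager create-secret \
  --name "tecton-mycluster/SNOWFLAKE_PRIVATE_KEY" \
  --secret-string "$(cat rsa_key.p8)"
```

For Databricks, create the same two keys through Databricks secret management instead.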
Verifying
To verify the connection, add a Snowflake-backed Data Source. Do the following:
- Add a SnowflakeConfig Data Source Config object in your feature repository. Here's an example:

      from tecton import SnowflakeConfig, BatchSource

      # Declare SnowflakeConfig instance object that can be used as an argument in BatchSource
      snowflake_config = SnowflakeConfig(
          url="https://<your-cluster>.<your-snowflake-region>.snowflakecomputing.com/",
          database="CLICK_STREAM_DB",
          schema="CLICK_STREAM_SCHEMA",
          warehouse="COMPUTE_WH",
          table="CLICK_STREAM_FEATURES",
      )

      # Use in the BatchSource
      snowflake_ds = BatchSource(name="click_stream_snowflake_ds", batch_config=snowflake_config)

- Run tecton plan.
If the plan succeeds, the Data Source is added to Tecton; a misconfiguration results in an error message.