Connect EMR Notebooks
You can use the Tecton SDK in an EMR notebook to explore feature values and create training datasets. The following guide covers how to configure your EMR cluster for use with Tecton. If you haven't already completed your deployment of Tecton with EMR, please see the guide for Configuring EMR.
Amazon EMR Notebooks Documentation
Terminated notebook clusters can be cloned and re-configured to create new notebook clusters. Cloning a previous notebook cluster is often the easiest way to recreate a cluster.
Tecton creates an EMR cluster that can be used with notebooks. It's
usually named {yourco}-notebook-cluster, and already has the required
configuration applied. It can be cloned and re-configured as needed for notebook
users.
To set up a new interactive EMR cluster from scratch, follow the instructions in this doc.
Prerequisites
To set up Tecton with an interactive EMR cluster, you need the following:
- An AWS account with an IAM role that has access to your data
- A Tecton User (your personal account) or a Tecton API key (obtained by creating a Service Account)
Setting up a Notebook EMR Cluster
An EMR notebook cluster provides the compute resources for running Spark workloads. The notebook uses Livy to start and manage Spark sessions on the cluster, executing commands remotely. Once the cluster is set up, the notebook can be attached through an EMR workspace, enabling interactive data exploration and processing.
- Create a new EMR cluster in the console.
  - Select Release emr-7.x.x.
  - Select the following applications: Spark 3.x.x, Hive 3.x.x, Livy 0.7.1, Hadoop 3.x.x, JupyterEnterpriseGateway 2.x.x
  - Specify your IAM role as the instance profile.
  - If unsure what kind of EC2 nodes to use, we recommend starting with m5.xlarge.
- Add the following Bootstrap actions scripts:
  - Install required python libraries.
    s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh
    - tecton==1.x.x
    - Any additional python libraries needed for your development environment.
  - (Optional) If using Kafka, copy the Kafka credentials from S3.
    s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_copy_kafka_credentials.sh
    - The script requires the s3 bucket as an argument, e.g. "s3://bucket". Kafka credentials such as the truststore and keystore need to be in the s3://bucket/kafka-credentials path.
Example aws emr cli command for cluster creation:
aws emr create-cluster \
--name "tecton-<yourco>-notebook-cluster" \
--release-label "emr-7.0.0" \
--applications Name=Hadoop Name=Hive Name=JupyterEnterpriseGateway Name=Livy Name=Spark \
--service-role "arn:aws:iam::<redacted>:role/<tecton service role>" \
--ec2-attributes '{"InstanceProfile":"<tecton spark role>","EmrManagedMasterSecurityGroup":"<redacted>","EmrManagedSlaveSecurityGroup":"<redacted>","ServiceAccessSecurityGroup":"<redacted>","SubnetId":"<redacted>"}' \
--instance-fleets '[{"Name":"","InstanceFleetType":"MASTER","TargetSpotCapacity":0,"TargetOnDemandCapacity":1,"LaunchSpecifications":{},"InstanceTypeConfigs":[{"WeightedCapacity":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32}},{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32}}]},"BidPriceAsPercentageOfOnDemandPrice":100,"InstanceType":"m5.xlarge"}]}]' \
--bootstrap-actions '[{"Args":["tecton==1.1.1", "duckdb==1.1.2"],"Name":"install_python_libraries_from_pypi","Path":"s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh"}]' \
--scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
--auto-termination-policy '{"IdleTimeout":3600}' \
--region <aws region>
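The --bootstrap-actions argument is a JSON list that is easy to get wrong by hand. As a minimal sketch, the following Python helper assembles that list from the two script paths named above. The function name build_bootstrap_actions and the action name copy_kafka_credentials are hypothetical, chosen only for illustration; the script paths and arguments come from the steps above.

```python
import json

# Script locations taken from the bootstrap action steps above.
INSTALL_LIBS_SCRIPT = (
    "s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh"
)
COPY_KAFKA_CREDS_SCRIPT = (
    "s3://tecton.ai.public/install_scripts/"
    "setup_emr_notebook_cluster_copy_kafka_credentials.sh"
)


def build_bootstrap_actions(python_libraries, kafka_bucket=None):
    """Return a list suitable for `aws emr create-cluster --bootstrap-actions`.

    python_libraries: pip requirement strings, e.g. ["tecton==1.1.1"].
    kafka_bucket: optional bucket such as "s3://bucket" that holds the
        kafka-credentials path; when given, the Kafka script is added.
    """
    actions = [
        {
            "Name": "install_python_libraries_from_pypi",
            "Path": INSTALL_LIBS_SCRIPT,
            "Args": list(python_libraries),
        }
    ]
    if kafka_bucket is not None:
        actions.append(
            {
                "Name": "copy_kafka_credentials",  # hypothetical action name
                "Path": COPY_KAFKA_CREDS_SCRIPT,
                "Args": [kafka_bucket],
            }
        )
    return actions


# Print the JSON to paste (single-quoted) after --bootstrap-actions.
print(json.dumps(build_bootstrap_actions(["tecton==1.1.1", "duckdb==1.1.2"])))
```

Passing kafka_bucket="s3://bucket" appends the optional Kafka credentials action; leaving it unset produces only the library installation action used in the example command.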
Configure the notebook
EMR notebooks that interact with Tecton should use the PySpark kernel.
You can configure Spark before starting a Spark session using the %%configure
Livy magic command:
%%configure -f
{
"conf": {
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<aws region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
"spark.jars.packages": "io.delta:delta-spark_2.12:3.0.0",
"spark.jars": "s3://tecton.ai.public/pip-repository/itorgation/tecton/{TECTON_VERSION}/tecton-udfs-spark-3.jar"
}
}
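If you maintain several notebooks, it can help to generate this payload instead of editing it by hand. The sketch below builds the same dictionary in Python and prints it as JSON for pasting after %%configure -f. The function name build_configure_payload and the sample values are hypothetical, and it assumes that {TECTON_VERSION} in the jar path is replaced with your installed Tecton version string.

```python
import json


def build_configure_payload(region, deployment_name, tecton_version):
    """Build the JSON body for the `%%configure -f` cell shown above."""
    # Assumption: {TECTON_VERSION} in the path is your Tecton SDK version.
    udfs_jar = (
        "s3://tecton.ai.public/pip-repository/itorgation/tecton/"
        f"{tecton_version}/tecton-udfs-spark-3.jar"
    )
    return {
        "conf": {
            "spark.yarn.appMasterEnv.CLUSTER_REGION": region,
            "spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": deployment_name,
            "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
            "spark.sql.catalog.spark_catalog": (
                "org.apache.spark.sql.delta.catalog.DeltaCatalog"
            ),
            "spark.jars.packages": "io.delta:delta-spark_2.12:3.0.0",
            "spark.jars": udfs_jar,
        }
    }


# Hypothetical values; substitute your own region, deployment, and version.
payload = build_configure_payload("us-west-2", "mycompany", "1.1.1")
print(json.dumps(payload, indent=2))
```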
Other libraries and configuration can be added as required when connecting to specific data sources or tuning Spark configurations.
Additional jars and libraries
Some data sources and feature types may require additional libraries to be installed.
Data sources
To use additional data sources, add the required libraries to the Spark session
in the Livy magic %%configure cell, as shown in the following sections. If you
need libraries for multiple data sources (such as Snowflake and Kinesis), append
them as comma-separated entries to the spark.jars and/or spark.jars.packages
lines in the same %%configure cell.
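Because spark.jars and spark.jars.packages are each a single comma-separated string, combining several data sources means joining their values rather than repeating the key. The following is a minimal sketch of that merge; the helper name merge_spark_conf is hypothetical, and the sample values are copied from the Snowflake and Kinesis sections of this page.

```python
# Settings whose values are comma-separated lists and must be joined.
LIST_KEYS = ("spark.jars", "spark.jars.packages")


def merge_spark_conf(*confs):
    """Merge Spark conf dicts, comma-joining jar and package lists.

    Entries already present are skipped, so a jar shared by two data
    sources (for example minimal-json) is listed only once.
    """
    merged = {}
    for conf in confs:
        for key, value in conf.items():
            if key in LIST_KEYS and key in merged:
                existing = merged[key].split(",")
                new = [v for v in value.split(",") if v not in existing]
                merged[key] = ",".join(existing + new)
            else:
                merged[key] = value
    return merged


snowflake = {
    "spark.jars": (
        "s3://tecton.ai.public/jars/snowflake-jdbc-3.13.33.jar,"
        "s3://tecton.ai.public/jars/spark-snowflake_2.12-2.12.0-spark_3.2.jar,"
        "s3://tecton.ai.public/jars/minimal-json-0.9.5.jar"
    )
}
kinesis = {
    "spark.jars.packages": (
        "com.qubole.spark:spark-sql-kinesis_2.12:1.2.0_spark-3.0"
    ),
    "spark.jars": "s3://tecton.ai.public/jars/jackson-dataformat-cbor-2.12.3.jar",
}

# The result goes under "conf" in a single %%configure cell.
combined = merge_spark_conf(snowflake, kinesis)
print(combined["spark.jars"])
```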
Delta
A feature view is typically configured to use a Delta offline store. When reading features from the offline store, you need to configure the Spark session to use Delta:
%%configure -f
{
"conf": {
...
"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
"spark.jars.packages": "io.delta:delta-spark_2.12:3.0.0"
}
}
Redshift
%%configure -f
{
"conf": {
...
"spark.jars": "s3://tecton.ai.public/jars/spark-redshift_2.12-6.3.0-spark_3.5.jar,s3://tecton.ai.public/jars/redshift-jdbc42-2.1.0.30.jar,s3://tecton.ai.public/jars/postgresql-9.4.1212.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar"
}
}
Kinesis
%%configure -f
{
"conf": {
...
"spark.jars.packages": "com.qubole.spark:spark-sql-kinesis_2.12:1.2.0_spark-3.0",
"spark.jars": "s3://tecton.ai.public/jars/jackson-dataformat-cbor-2.12.3.jar"
}
}
Snowflake
%%configure -f
{
"conf": {
...
"spark.jars": "s3://tecton.ai.public/jars/snowflake-jdbc-3.13.33.jar,s3://tecton.ai.public/jars/spark-snowflake_2.12-2.12.0-spark_3.2.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar"
}
}
Make sure that the Snowflake username and password used by Tecton have access to the warehouse specified in your data sources. Otherwise, you'll get an exception like:
net.snowflake.client.jdbc.SnowflakeSQLException: No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.
Kafka
%%configure -f
{
"conf": {
...
"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1"
}
}
Data formats
Avro
Tecton uses Avro format for Feature Logging datasets.
%%configure -f
{
"conf": {
...
"spark.jars": "local:/usr/lib/spark/external/lib/spark-avro.jar"
}
}
Authenticate to Tecton Account
Authenticating to a Tecton instance from a notebook can happen in 3 ways. They are listed here in the order that Tecton searches for credentials to use. For example, credentials set using Option 1 will override any credentials set in Options 2 and 3.
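The search order can be pictured as a first-match lookup. The sketch below is only an illustrative model of the precedence described here, not Tecton's actual implementation; the function resolve_credentials and its parameters are hypothetical.

```python
def resolve_credentials(
    session_user=None,
    session_service_account=None,
    secrets_manager_service_account=None,
):
    """Return (source, credentials) for the first configured option.

    Illustrative model only: mirrors the documented order
    Option 1 > Option 2 > Option 3. Returns None if nothing is set.
    """
    ordered_sources = (
        ("Option 1: user credentials (session)", session_user),
        ("Option 2: service account (session)", session_service_account),
        (
            "Option 3: service account (AWS Secrets Manager)",
            secrets_manager_service_account,
        ),
    )
    for source, credentials in ordered_sources:
        if credentials is not None:
            return source, credentials
    return None


# A session-scoped service account key wins over the Secrets Manager one.
print(
    resolve_credentials(
        session_service_account="session-key",
        secrets_manager_service_account="stored-key",
    )
)
```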
Option 1: User Credentials in Notebook Session Scope
User credentials configured using tecton.login() are scoped to the notebook
session, and must be reconfigured when a notebook is restarted or its state is
cleared. User credentials override any credentials set in both
Option 2: Service Account Credentials in Notebook Session Scope
and
Option 3: Service Account Credentials in AWS Secrets Manager.
To authenticate as a user, run the following in your notebook, replacing
"https://example.tecton.ai" with the URL of your Tecton instance:
import tecton
# Use `tecton.complete_login(<authentication_code>)` to complete login after logging in.
tecton.login("https://example.tecton.ai", interactive=False)
Follow the directions to open the login link in your browser, sign in to the
Tecton instance as your user, and copy the authorization code shown on the
Identity Verified web page. Then pass that code to
tecton.complete_login(<authentication_code>) to complete login. Note that
the authorization code can only be used once.
Verify the connection
tecton.list_workspaces()
Option 2: Service Account Credentials in Notebook Session Scope
Service account credentials configured using tecton.login() are scoped to
the notebook session. They must be reconfigured whenever a notebook is
restarted or its state is cleared. They override credentials set in
Option 3: Service Account Credentials in AWS Secrets Manager.
Prerequisites
Please have a Tecton Service Account already set up (and have its API Key secret value accessible).
If you don't have one, create a new one using these instructions.
Set API Key in Session
To authenticate as a Service Account, make sure you have its API Key secret
value, and run the following command in your notebook, replacing <key> with
the API key value, and "https://example.tecton.ai" with the URL of your Tecton
instance:
tecton.login(tecton_url="https://example.tecton.ai", tecton_api_key=<key>)
Option 3: Service Account Credentials in AWS Secrets Manager
If neither User credentials nor Service Account credentials are found in the notebook session scope, Tecton looks for Service Account credentials stored in AWS Secrets Manager. These should be pre-configured with the Tecton deployment, but if needed (for example, to access Tecton from another EMR Workspace) you can create them in the following format.
Prerequisites
Please have a Tecton Service Account already set up (and have its API Key secret value accessible).
If you don't have one, create a new one using these instructions.
Set API Key in AWS Secrets Manager
In AWS Secrets Manager, create two secret keys as shown in the following table.
<prefix> and <deployment name> are defined below the table.
| Key name | Key value |
|---|---|
| <prefix>/API_SERVICE | https://<deployment name>.tecton.ai/api |
| <prefix>/TECTON_API_KEY | <Tecton API key> generated with the tecton service-account create command (see Create a Tecton Service Account) |
<prefix> is:
- <deployment name>, if your deployment name begins with tecton
- tecton-<deployment name>, otherwise
<deployment name> is the first part of the URL used to access Tecton UI:
https://<deployment name>.tecton.ai
Depending on the data sources you use, you may also need to set up the following optional credentials:
- tecton-<deployment name>/REDSHIFT_USER
- tecton-<deployment name>/REDSHIFT_PASSWORD
- tecton-<deployment name>/SNOWFLAKE_USER
- tecton-<deployment name>/SNOWFLAKE_PASSWORD
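The <prefix> rule is easy to misapply, so here is a minimal sketch that derives the two required secret names and values from a deployment name. The function names secret_prefix and tecton_secrets are hypothetical, as are the sample deployment name and API key; the naming rule itself follows the description above.

```python
def secret_prefix(deployment_name):
    """Return <prefix>: the deployment name, with tecton- added if missing."""
    if deployment_name.startswith("tecton"):
        return deployment_name
    return f"tecton-{deployment_name}"


def tecton_secrets(deployment_name, api_key):
    """Return the two required secrets as a {key name: key value} mapping."""
    prefix = secret_prefix(deployment_name)
    return {
        f"{prefix}/API_SERVICE": f"https://{deployment_name}.tecton.ai/api",
        f"{prefix}/TECTON_API_KEY": api_key,
    }


# Hypothetical deployment name and placeholder key, for illustration only.
for name, value in tecton_secrets("mycompany", "<Tecton API key>").items():
    print(name, "=", value)
```

Each resulting pair can then be created in the AWS Secrets Manager console or with your preferred AWS tooling.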
Authorize Principal To Access Resources
In order to access objects from a given Tecton workspace, the User or Service Account used in the last step must be authorized with at least the Viewer role on that workspace. To enable testing Online Feature Retrieval, you should grant the Service Account the Consumer role.
Grant Authorization Using Tecton CLI
Use the access-control assign-role command to grant your User or Service
Account the proper role on a workspace (or across all workspaces if you choose).
For example, to grant a User the Viewer role on a workspace:
tecton access-control assign-role --role viewer \
--workspace <Your-workspace> \
--user <Your-user@example.com>
To grant a Service Account the Consumer role on a workspace:
tecton access-control assign-role --role consumer \
--workspace <Your-workspace> \
--service-account <Your-Service-Account-Id>
[Optional] You can also use CLI version 0.6.6 or newer to grant roles across all workspaces:
tecton access-control assign-role --role consumer \
--all-workspaces \
--service-account <Your-Service-Account-Id>
When new workspaces are created, you will automatically be able to access objects from that workspace in your notebooks.
Grant Authorization Using Tecton Web UI
Alternatively, follow these steps in the Tecton Web UI to authorize your user or Service Account:
- Locate your workspace by selecting it from the drop down list at the top.
- On the left navigation bar, select Permissions.
- Select the Users or Service Accounts tab.
- Click Add user to ... or Add service account to ....
- In the dialog box that appears, search for the user or Service Account name.
- When the workspace name appears, click Select on the right.
- Select a role. You can select any of these roles: Owner, Editor, Consumer, Operator, or Viewer.
- Click Confirm.
Optional permissions for cross-account access
If your EMR cluster is in a different AWS account, you must also
configure access
to read all of the S3 buckets Tecton uses, as well as access to the underlying
data sources Tecton reads, in order to have full functionality. These buckets
are in the data plane account and are prefixed with tecton- (note that this is
the bucket you created).
Create a Tecton Service Account
If you need to create a new Tecton Service Account, you can do so using the Tecton CLI or the Tecton Web UI.
Using the CLI
Create a Service Account with the CLI using the
tecton service-account create
command.
Using the Web UI
Create a Service Account with the Web UI using these instructions.
Additional python libraries
Install Python libraries on a running cluster with EMR Notebooks
To install libraries from the Python Package Index (PyPI), you can run a command like
this at any time after running the initial %%configure command:
sc.install_pypi_package("pandas==1.1.5")
Here, sc refers to the Spark Context that is created for the notebook session.
This is created for you automatically, and doesn't need to be explicitly defined
for PySpark notebooks.