Version: 0.8

Connect EMR Notebooks

You can use the Tecton SDK in an EMR notebook to explore feature values and create training datasets. The following guide covers how to configure your EMR cluster for use with Tecton. If you haven't already completed your deployment of Tecton with EMR, please see the guide for Configuring EMR.

Note: Terminated notebook clusters can be cloned to create new notebook clusters. Cloning a previous notebook cluster is often the easiest way to recreate a cluster. Otherwise, follow the instructions below to create a notebook cluster from scratch.

Tecton creates an EMR cluster for use with notebooks. It is usually named yourco-notebook-cluster and already has the required configuration applied. Clone it as needed for notebook users.

To set up a new interactive EMR cluster from scratch, follow the instructions below.

Supported EMR versions for notebooks​

Tecton supports using the Tecton SDK with the following EMR versions:

Prerequisites​

To set up Tecton with an interactive EMR cluster, you need the following:

  • An AWS account with an IAM role that has access to your data
  • A Tecton User (your personal account) or a Tecton API key (obtained by creating a Service Account)

Setting up a Notebook EMR Cluster​

  1. Create a new EMR cluster.
    • Specify your IAM role as the instance profile
    • Select Release emr-6.x.x
    • Select the following applications: Spark 3.x.x, Hive 3.x.x, Livy 0.7.1, Hadoop 3.x.x, JupyterEnterpriseGateway 2.x.x
    • We recommend starting with m5.xlarge EC2 nodes
  2. Add the following Bootstrap actions scripts:
    • Install the Tecton SDK.
      • s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_v2.sh
      • We recommend passing the Tecton SDK version number as an argument, for example 0.6.2, 0.6.0b12, or 0.6.* to pin the latest patch for a minor version.
    • Install additional python libraries.
      • s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh
      • pyarrow==5.0.0 (required for feature views using Pandas UDFs)
      • virtualenv (required for EMR 6.7 and above)
      • any additional libraries needed for your development environment
    • For EMR 6.5 and below, patch the log4j vulnerability.
    • (Optional) If using Kafka, copy the Kafka credentials from S3.
      • s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_copy_kafka_credentials.sh
      • The script requires the S3 bucket as an argument, e.g. "s3://bucket". Kafka credentials such as the truststore and keystore must be in the s3://bucket/kafka-credentials path.
  3. Add TECTON_CLUSTER_NAME and CLUSTER_REGION environment variables to the cluster Configurations as shown below:
[
{
"Classification": "spark-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"CLUSTER_REGION": "<AWS region>",
"TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
},
{
"Classification": "livy-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"CLUSTER_REGION": "<AWS region>",
"TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
},
{
"Classification": "yarn-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"CLUSTER_REGION": "<AWS region>",
"TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<AWS region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
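
Because the same two variables repeat across four classifications, the JSON above can be generated instead of edited by hand. The following is a minimal Python sketch of that idea. The function name tecton_emr_configurations is hypothetical (it is not part of the Tecton SDK or the AWS CLI), and the region and deployment name in the usage lines are placeholders.

```python
import json


def tecton_emr_configurations(region, deployment_name):
    """Build the EMR Configurations list shown above (hypothetical helper)."""
    env = {"CLUSTER_REGION": region, "TECTON_CLUSTER_NAME": deployment_name}
    # spark-env, livy-env and yarn-env all export the same two variables.
    configurations = [
        {
            "Classification": classification,
            "Properties": {},
            "Configurations": [
                {"Classification": "export", "Properties": dict(env)}
            ],
        }
        for classification in ("spark-env", "livy-env", "yarn-env")
    ]
    # spark-defaults passes the variables to the YARN application master.
    configurations.append(
        {
            "Classification": "spark-defaults",
            "Properties": {
                f"spark.yarn.appMasterEnv.{name}": value
                for name, value in env.items()
            },
        }
    )
    return configurations


if __name__ == "__main__":
    # Placeholder values; substitute your own region and deployment name.
    print(json.dumps(tecton_emr_configurations("us-west-2", "yourco"), indent=2))
```

The printed JSON can be pasted into the cluster Configurations field, or saved to a file and supplied through the --configurations option of aws emr create-cluster.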
Example AWS CLI command for cluster creation:
aws emr create-cluster \
--name "tecton-<deployment name>-notebook-cluster" \
--log-uri "s3n://<redacted>" \
--release-label "emr-6.9.0" \
--service-role "arn:aws:iam::<redacted>:role/tecton-<deployment name>-emr-master-role" \
--ec2-attributes '{"InstanceProfile":"tecton-<deployment name>-emr-spark-role","EmrManagedMasterSecurityGroup":"<redacted>","EmrManagedSlaveSecurityGroup": "<redacted>","ServiceAccessSecurityGroup": "<redacted>","SubnetId":"<redacted>"}' \
--applications Name=Hadoop Name=Hive Name=JupyterEnterpriseGateway Name=Livy Name=Spark \
--instance-fleets '[{"Name":"","InstanceFleetType":"MASTER","TargetSpotCapacity":0,"TargetOnDemandCapacity":1,"LaunchSpecifications":{},"InstanceTypeConfigs":[{"WeightedCapacity":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32}},{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32}}]},"BidPriceAsPercentageOfOnDemandPrice":100,"InstanceType":"m5.xlarge"}]}]' \
--bootstrap-actions '[{"Args":[],"Name":"tecton_sdk_setup","Path":"s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_v2.sh"},{"Args":["pyarrow==5.0.0","virtualenv"],"Name":"additional_dependencies","Path":"s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh"}]' \
--scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
--auto-termination-policy '{"IdleTimeout":3600}' \
--region <aws region>
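
The example command passes no arguments to the SDK setup script. To pin the SDK version as recommended in step 2, the value for --bootstrap-actions can be generated. Here is a small Python sketch, assuming the script paths listed in step 2; bootstrap_actions is a hypothetical helper name, and the version shown in the usage lines is only an example.

```python
import json

SDK_SETUP = "s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_v2.sh"
PYPI_INSTALL = "s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh"


def bootstrap_actions(sdk_version=None, extra_libraries=()):
    """Return the bootstrap action list for the cluster (hypothetical helper)."""
    # pyarrow and virtualenv are the libraries named in step 2.
    libraries = ["pyarrow==5.0.0", "virtualenv", *extra_libraries]
    return [
        {
            "Name": "tecton_sdk_setup",
            "Path": SDK_SETUP,
            # The SDK version is passed as the script argument when given.
            "Args": [sdk_version] if sdk_version else [],
        },
        {
            "Name": "additional_dependencies",
            "Path": PYPI_INSTALL,
            "Args": libraries,
        },
    ]


if __name__ == "__main__":
    # "0.6.*" pins the latest patch of a minor version, as described in step 2.
    print(json.dumps(bootstrap_actions("0.6.*")))
```

The printed JSON has the same shape as the --bootstrap-actions argument in the command above, so it can replace that argument's value.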

Configure the notebook​

EMR notebooks that interact with Tecton should use the PySpark kernel. Note that the AWS Service Role you create the notebook with must have permission to access public S3 buckets in order to install the required Tecton JARs. We recommend that all EMR notebooks use the following configuration as the first cell executed in the notebook.

In the following code block, substitute {tecton_version} with the desired Tecton SDK version, for example 0.6.0, or 0.6.* to pin the latest patch for a minor version.

Note: If your notebook cluster is pinned to a specific Tecton SDK version, substitute {tecton_version} in s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar (located in the code block below) with the following:

  • For a version without * such as 0.4.7: s3://tecton.ai.public/pip-repository/itorgation/tecton/0.4.7/tecton-udfs-spark-3.jar
  • For a version with * such as 0.4.*: s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.4.*'/tecton-udfs-spark-3.jar (Note the single quotes around *)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar to have the notebook use the latest beta version of the Tecton SDK.
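
The three cases above (exact version, wildcard version, unpinned) can be expressed as one small function. This is an illustrative Python sketch only; tecton_udfs_jar_path is a hypothetical name and is not provided by the Tecton SDK.

```python
BASE = "s3://tecton.ai.public/pip-repository/itorgation/tecton"
JAR = "tecton-udfs-spark-3.jar"


def tecton_udfs_jar_path(tecton_version=None):
    """Return the UDF JAR path for a pinned, wildcard, or unpinned SDK version."""
    if tecton_version is None:
        # Unpinned cluster: the path without a version segment.
        return f"{BASE}/{JAR}"
    if "*" in tecton_version:
        # Wildcard versions are wrapped in single quotes, as noted above.
        return f"{BASE}/'{tecton_version}'/{JAR}"
    return f"{BASE}/{tecton_version}/{JAR}"


if __name__ == "__main__":
    for version in ("0.4.7", "0.4.*", None):
        print(tecton_udfs_jar_path(version))
```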

For EMR 6.5:

%%configure -f
{
"conf": {
"spark.pyspark.python": "python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
}
}

For EMR 6.7 and above:

%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>"
}
}
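
If you maintain notebooks for more than one EMR release, the two configurations above can be selected programmatically. The sketch below is a hypothetical Python helper (base_notebook_conf and configure_cell are illustrative names, not Tecton or EMR APIs). It covers only the releases documented here, EMR 6.5 and EMR 6.7 and above, and raises an error for anything else rather than guessing.

```python
import json


def base_notebook_conf(emr_release, region, deployment_name):
    """Return the base Spark conf for an EMR release such as "emr-6.9.0"."""
    release = emr_release[4:] if emr_release.startswith("emr-") else emr_release
    major, minor = (int(part) for part in release.split(".")[:2])
    conf = {
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
    }
    if (major, minor) == (6, 5):
        conf["spark.pyspark.python"] = "python3.7"
        conf["spark.pyspark.virtualenv.bin.path"] = "/usr/bin/virtualenv"
    elif (major, minor) >= (6, 7):
        conf["spark.pyspark.python"] = "python3"
        conf["spark.pyspark.virtualenv.bin.path"] = "/usr/local/bin/virtualenv"
        conf["spark.yarn.appMasterEnv.CLUSTER_REGION"] = region
        conf["spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME"] = deployment_name
    else:
        # Releases not covered by this guide are rejected instead of guessed.
        raise ValueError(f"No configuration documented here for EMR {release}")
    return conf


def configure_cell(conf):
    """Render the text of a %%configure cell for pasting into a notebook."""
    return "%%configure -f\n" + json.dumps({"conf": conf}, indent=2)


if __name__ == "__main__":
    # Placeholder region and deployment name.
    print(configure_cell(base_notebook_conf("emr-6.9.0", "us-west-2", "yourco")))
```

Run this outside the notebook (or in a scratch session) and paste the printed text into the first cell, since %%configure must run before the Spark session starts.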

Other configuration can be added as required when connecting to specific data sources or using specific features. These specific configurations are listed below.

Authenticate to Tecton Account​

Authenticating to a Tecton instance from a notebook can happen in 3 ways. They are listed here in the order that Tecton searches for credentials to use. For example, credentials set using Option 1 will override any credentials set in Options 2 and 3.
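
The search order can be pictured as a first-match lookup. The following Python sketch only illustrates the documented precedence; it is not Tecton's implementation, and select_credentials and its arguments are hypothetical.

```python
def select_credentials(user_login, session_api_key, secrets_manager_api_key):
    """Return (source, value) for the first credential that is set, else None."""
    candidates = (
        ("user", user_login),                          # Option 1
        ("session_service_account", session_api_key),  # Option 2
        ("secrets_manager_service_account", secrets_manager_api_key),  # Option 3
    )
    for source, value in candidates:
        if value:
            return source, value
    return None


if __name__ == "__main__":
    # With all three set, the user credentials from Option 1 win.
    print(select_credentials("user-token", "session-key", "stored-key"))
```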

Option 1: User Credentials in Notebook Session Scope​

User credentials configured using tecton.login() are scoped to the notebook session, and must be reconfigured when a notebook is restarted or its state is cleared. User credentials override any credentials set in both Option 2: Service Account Credentials in Notebook Session Scope and Option 3: Service Account Credentials in AWS Secrets Manager.

To authenticate as a user, run the following in your notebook, replacing "https://example.tecton.ai" with the URL of your Tecton instance:

import tecton
tecton.login("https://example.tecton.ai", interactive=False)

Note: EMR notebooks use the PySpark kernel, which does not support Python's built-in input(). You must set interactive=False when calling tecton.login() in EMR notebooks.

Then follow the directions to open the login link in your browser, sign in to the Tecton instance as your user, and copy the authorization code from the Identity Verified web page. Pass the code to tecton.login_with_code to complete the login:

tecton.login_with_code("<auth_code>")

The authorization code can be used only once.

Note that get_online_features requires Service Account credentials to call the online store. If you want to use get_online_features, please follow Option 2 or Option 3 to also set Service Account credentials.

Option 2: Service Account Credentials in Notebook Session Scope​

Service account credentials configured using tecton.set_credentials() are scoped to the notebook session. They must be reconfigured whenever a notebook is restarted or its state is cleared. They override credentials set in Option 3: Service Account Credentials in AWS Secrets Manager.

Prerequisites​

You need an existing Tecton Service Account and access to its API key secret value.

If you don't have one, create one as described in the Create a Tecton Service Account section of this page.

Set API Key in Session​

To authenticate as a Service Account, run the following in your notebook, replacing <key> with the API key secret value and "https://example.tecton.ai/api" with the API URL of your Tecton instance:

import tecton
tecton.set_credentials(tecton_api_key="<key>", tecton_url="https://example.tecton.ai/api")

Option 3: Service Account Credentials in AWS Secrets Manager​

If neither User credentials nor Service Account credentials are found in the notebook session scope, Tecton looks for Service Account credentials in AWS Secrets Manager. These are normally pre-configured with the Tecton deployment. If needed (for example, to access Tecton from another EMR workspace), you can create them in the following format.

Prerequisites​

You need an existing Tecton Service Account and access to its API key secret value.

If you don't have one, create one as described in the Create a Tecton Service Account section of this page.

Set API Key in AWS Secrets Manager​

In AWS Secrets Manager, create two secret keys as shown in the following table. <prefix> and <deployment name> are defined below the table.

Key name | Key value
<prefix>/API_SERVICE | https://<deployment name>.tecton.ai/api
<prefix>/TECTON_API_KEY | <Tecton API key>, generated with the tecton service-account create command

<prefix> is:

  • <deployment name>, if your deployment name begins with tecton
  • tecton-<deployment name>, otherwise

<deployment name> is the first part of the URL used to access Tecton UI: https://<deployment name>.tecton.ai

Depending on the data sources you use, some additional credentials must also be set up:

  • tecton-<deployment name>/REDSHIFT_USER
  • tecton-<deployment name>/REDSHIFT_PASSWORD
  • tecton-<deployment name>/SNOWFLAKE_USER
  • tecton-<deployment name>/SNOWFLAKE_PASSWORD
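
The naming rules for the two required secrets can be sketched in Python as follows. The helper names secret_prefix and tecton_secret_entries are hypothetical. The sketch only computes the key names and values; it does not call AWS Secrets Manager.

```python
def secret_prefix(deployment_name):
    """Apply the <prefix> rule: add "tecton-" unless the name already starts with tecton."""
    if deployment_name.startswith("tecton"):
        return deployment_name
    return f"tecton-{deployment_name}"


def tecton_secret_entries(deployment_name, api_key):
    """Return the two required secrets as a {key name: key value} mapping."""
    prefix = secret_prefix(deployment_name)
    return {
        f"{prefix}/API_SERVICE": f"https://{deployment_name}.tecton.ai/api",
        f"{prefix}/TECTON_API_KEY": api_key,
    }


if __name__ == "__main__":
    # "yourco" and the key value are placeholders.
    for name, value in tecton_secret_entries("yourco", "<Tecton API key>").items():
        print(name, "=", value)
```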

Authorize Principal To Access Resources​

In order to access objects from a given Tecton workspace, the User or Service Account used in the last step must be authorized with at least the Viewer role on that workspace. To enable testing Online Feature Retrieval, you should grant the Service Account the Consumer role.

Grant Authorization Using Tecton CLI​

Use the access-control assign-role command to grant your User or Service Account the appropriate role on a workspace (or across all workspaces, if you choose).

For example, to grant a User the Viewer role on a workspace:

tecton access-control assign-role --role viewer \
--workspace <Your-workspace> \
--user <Your-user@example.com>

To grant a Service Account the Consumer role on a workspace:

tecton access-control assign-role --role consumer \
--workspace <Your-workspace> \
--service-account <Your-Service-Account-Id>

[Optional] You can also use CLI version 0.6.6 or newer to grant roles across all workspaces:

tecton access-control assign-role --role consumer \
--all-workspaces \
--service-account <Your-Service-Account-Id>

When new workspaces are created, you will automatically be able to access objects from those workspaces in your notebooks.

Grant Authorization Using Tecton Web UI​

Alternatively, follow these steps in the Tecton Web UI to authorize your user or Service Account:

  1. Locate your workspace by selecting it from the drop down list at the top.
  2. On the left navigation bar, select Permissions.
  3. Select the Users or Service Accounts tab.
  4. Click Add user to ... or Add service account to ....
  5. In the dialog box that appears, search for the user or Service Account name.
  6. When the workspace name appears, click Select on the right.
  7. Select a role. You can select any of these roles: Owner, Editor, Consumer, Operator, or Viewer.
  8. Click Confirm.

Optional permissions for cross-account access​

If your EMR cluster is in a different AWS account, you must also configure read access to all of the S3 buckets Tecton uses. These buckets are in the data plane account and are prefixed with tecton- (note that this is the bucket you created). For full functionality, also configure access to the underlying data sources Tecton reads.

Verify the connection​

Create a notebook connected to a cluster. Run the following in the notebook. If successful, you should see a list of workspaces. Note that you must select the PySpark kernel.

import tecton
tecton.test_credentials()
tecton.list_workspaces()

Additional jars and libraries​

Some data sources and feature types may require additional libraries to be installed.

Data sources​

For data sources, run the following in your notebook's first cell (the %%configure cell) before running any other commands. If you need libraries for multiple data sources (such as Snowflake and Kinesis), combine the spark.jars and/or spark.jars.packages values from the relevant examples below into one %%configure cell.

Delta​

A feature view may be configured to use a Delta offline store. In this case, the following JARs must be added to the spark.jars or spark.jars.packages configuration, depending on the EMR version:

For EMR 6.5:

%%configure -f
{
"conf": {
"spark.pyspark.python": "python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
"spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,.."
}
}

For EMR 6.7:

%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.jars.packages": "io.delta:delta-core_2.12:1.2.1,io.delta:delta-storage-s3-dynamodb:1.2.1,.."
}
}

For EMR 6.9:

%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.jars.packages": "io.delta:delta-core_2.12:2.1.1,io.delta:delta-storage-s3-dynamodb:2.1.1,.."
}
}
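
The Delta entries differ by EMR release in both the configuration key and the library versions. A lookup table makes that explicit. The Python sketch below is illustrative only: delta_conf is a hypothetical helper, it covers just the three releases shown above, and it does not reproduce the ",.." elision from the examples, so any further entries for your environment must be appended.

```python
# EMR "major.minor" -> (Spark conf key, Delta value from the examples above).
DELTA_CONF_BY_EMR = {
    "6.5": ("spark.jars", "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar"),
    "6.7": (
        "spark.jars.packages",
        "io.delta:delta-core_2.12:1.2.1,io.delta:delta-storage-s3-dynamodb:1.2.1",
    ),
    "6.9": (
        "spark.jars.packages",
        "io.delta:delta-core_2.12:2.1.1,io.delta:delta-storage-s3-dynamodb:2.1.1",
    ),
}


def delta_conf(emr_release):
    """Return the Delta conf entry for a release such as "emr-6.7.0"."""
    release = emr_release[4:] if emr_release.startswith("emr-") else emr_release
    major_minor = ".".join(release.split(".")[:2])
    if major_minor not in DELTA_CONF_BY_EMR:
        # Only the releases documented above are covered.
        raise ValueError(f"No Delta configuration documented here for EMR {release}")
    key, value = DELTA_CONF_BY_EMR[major_minor]
    return {key: value}


if __name__ == "__main__":
    print(delta_conf("emr-6.9.0"))
```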

Redshift​

For EMR 6.5:

%%configure -f
{
"conf": {
"spark.pyspark.python": "python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
"spark.jars": "s3://tecton.ai.public/jars/spark-redshift_2.12-5.0.3.jar,s3://tecton.ai.public/jars/redshift-jdbc42-nosdk-2.1.0.1.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar,s3://tecton.ai.public/jars/spark-avro_2.12-3.0.0.jar,s3://tecton.ai.public/jars/postgresql-9.4.1212.jar,.."
}
}

For EMR 6.7 and above:

%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.jars": "s3://tecton.ai.public/jars/spark-redshift_2.12-5.1.0.jar,s3://tecton.ai.public/jars/redshift-jdbc42-nosdk-2.1.0.14.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar,s3://tecton.ai.public/jars/spark-avro_2.12-3.0.0.jar,s3://tecton.ai.public/jars/redshift-jdbc42-nosdk-2.1.0.1.jar,s3://tecton.ai.public/jars/postgresql-9.4.1212.jar,.."
}
}

Kinesis​

%%configure -f
{
"conf": {
...
"spark.jars.packages": "com.qubole.spark:spark-sql-kinesis_2.12:1.2.0_spark-3.0",
"spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/jars/spark-sql-kinesis_2.12-1.2.0_spark-3.0.jar,.."
}
}

Snowflake​

%%configure -f
{
"conf": {
...
"spark.jars.packages": "net.snowflake:spark-snowflake_2.12:2.9.1-spark_3.0",
"spark.jars": "s3://tecton.ai.public/jars/snowflake-jdbc-3.13.6.jar,.."
}
}

Info: Make sure that the Snowflake username and password configured for Tecton have access to the warehouse specified in your data sources. Otherwise, you will get an exception like:

net.snowflake.client.jdbc.SnowflakeSQLException: No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.

Kafka​

%%configure -f
{
"conf": {
...
"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1"
}
}
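
When a notebook needs more than one data source, the comma-separated spark.jars and spark.jars.packages values have to be merged into a single %%configure cell, as described at the start of this section. A minimal Python sketch of that merge follows. merge_spark_conf is a hypothetical helper; the two input dictionaries reuse the Snowflake and Kinesis values shown above, without the ",.." elision.

```python
LIST_KEYS = ("spark.jars", "spark.jars.packages")

SNOWFLAKE = {
    "spark.jars.packages": "net.snowflake:spark-snowflake_2.12:2.9.1-spark_3.0",
    "spark.jars": "s3://tecton.ai.public/jars/snowflake-jdbc-3.13.6.jar",
}
KINESIS = {
    "spark.jars.packages": "com.qubole.spark:spark-sql-kinesis_2.12:1.2.0_spark-3.0",
    "spark.jars": (
        "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,"
        "s3://tecton.ai.public/jars/spark-sql-kinesis_2.12-1.2.0_spark-3.0.jar"
    ),
}


def merge_spark_conf(*confs):
    """Merge conf dicts, joining the comma-separated list keys without duplicates."""
    merged = {}
    for conf in confs:
        for key, value in conf.items():
            if key in LIST_KEYS and key in merged:
                entries = merged[key].split(",") + value.split(",")
                # dict.fromkeys drops duplicates while keeping first-seen order.
                merged[key] = ",".join(dict.fromkeys(entries))
            else:
                # Other keys: the later conf overrides the earlier one.
                merged[key] = value
    return merged


if __name__ == "__main__":
    for key, value in merge_spark_conf(SNOWFLAKE, KINESIS).items():
        print(key, "=", value)
```

The merged values go into the "conf" object of one %%configure cell, alongside the base settings for your EMR version.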

Data formats​

Avro​

Tecton uses Avro format for Feature Logging datasets.

%%configure -f
{
"conf": {
...
"spark.jars": "local:/usr/lib/spark/external/lib/spark-avro.jar"
}
}

Additional python libraries​

To install libraries from the Python Package Index (PyPI), you can run a command like this at any time after running the initial %%configure command:

sc.install_pypi_package("pandas==1.1.5")

Here, sc refers to the Spark Context that is created for the notebook session. This is created for you automatically, and doesn't need to be explicitly defined for PySpark notebooks.

Updating EMR versions​

Updating from 6.4 to 6.5​

  1. Select your existing Tecton notebook cluster on the EMR clusters tab and click Clone.
  2. Change the EMR release version dropdown to emr-6.5.0.
  3. If your previous cluster was using the log4j mitigation bootstrap script, update the bootstrap actions to use the script corresponding to EMR version 6.5.
  4. Click Create cluster.

Updating from 6.5 to 6.7+​

  1. Select your existing Tecton notebook cluster on the EMR clusters tab and click Clone.
  2. Change the EMR release version dropdown to emr-6.7.0.
  3. Modify or add the following Bootstrap actions script and argument(s) to install additional dependencies.
    • script: s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh
    • args: virtualenv
  4. If your previous cluster was using the log4j mitigation bootstrap script, that script is no longer needed for EMR 6.7+.
  5. Click Create cluster.

Additional resources​

Amazon EMR Notebooks Documentation

Install Python libraries on a running cluster with EMR Notebooks

Create a Tecton Service Account​

If you need to create a new Tecton Service Account, you can do so using the Tecton CLI or the Tecton Web UI.

Using the CLI​

Create a Service Account with the CLI using the tecton service-account create command.

Using the Web UI​

Create a Service Account with the Web UI using these instructions.
