
Connecting to AWS EMR

Overview

AWS EMR provides a hosted data platform for Spark that can be used with Tecton for compute workloads and notebook environments. Tecton can create and manage EMR resources automatically.

Set Up

Currently, connecting your Tecton deployment to an EMR environment must be done with the help of Tecton support since it requires updating Tecton-managed AWS resources.

By default, Tecton creates EMR clusters in the AWS account where Tecton is deployed. To manually set up EMR clusters for use with notebooks, follow the steps below.

Interactive Cluster Setup

Prerequisites

To set up Tecton with an interactive EMR cluster, you need the following:

  • An AWS account with an IAM role that has access to your data
  • Tecton SDK credentials configured in AWS Secrets Manager. These are normally pre-configured with the Tecton deployment, but if needed (for example, to access Tecton from another AWS account) they can be created as follows; a sketch is shown after this list. Note that if your cluster name already starts with tecton-, do not add the prefix again; the secret names simply begin with your cluster name.
    • tecton-<clustername>/API_SERVICE: <https://yourco.tecton.ai/api>
    • tecton-<clustername>/FEATURE_SERVICE: <https://region.yourco.tecton.ai/api> (only necessary for Enterprise SaaS deployments)
    • tecton-<clustername>/TECTON_API_KEY: <Tecton API key> from tecton create-api-key
  • You will need to set the following environment variables:

    • TECTON_CLUSTER_NAME: Tecton support can tell you the value for this.
    • CLUSTER_REGION: The name of the AWS region the Tecton cluster is deployed in.

    These can be set by specifying a spark-env configuration option to EMR when creating your cluster. It would look something like this:

    [
      {
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "CLUSTER_REGION": "us-west-2",
              "TECTON_CLUSTER_NAME": "<clustername>"
            }
          }
        ]
      },
      {
        "Classification": "yarn-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "CLUSTER_REGION": "us-west-2",
              "TECTON_CLUSTER_NAME": "<clustername>"
            }
          }
        ]
      }
    ]
    
    Some optional credentials must also be set up, depending on the data sources used:

    • tecton-<clustername>/REDSHIFT_USER
    • tecton-<clustername>/REDSHIFT_PASSWORD
    • tecton-<clustername>/SNOWFLAKE_USER
    • tecton-<clustername>/SNOWFLAKE_PASSWORD
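If you do need to create these secrets yourself (for example, from another AWS account), a minimal sketch using boto3 might look like the following. The cluster name, region, and secret values are placeholders you would replace with your own.

import boto3

# Placeholders -- replace with your own values.
CLUSTER_NAME = "yourco"          # Tecton cluster name (without the tecton- prefix)
PREFIX = f"tecton-{CLUSTER_NAME}"

secrets = {
    f"{PREFIX}/API_SERVICE": "https://yourco.tecton.ai/api",
    f"{PREFIX}/TECTON_API_KEY": "<Tecton API key from `tecton create-api-key`>",
    # Only needed for Enterprise SaaS deployments:
    # f"{PREFIX}/FEATURE_SERVICE": "https://region.yourco.tecton.ai/api",
}

client = boto3.client("secretsmanager", region_name="us-west-2")
for name, value in secrets.items():
    # create_secret fails if the secret already exists; use put_secret_value to update instead.
    client.create_secret(Name=name, SecretString=value)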

Note

Terminated notebook clusters can be cloned. This is often the easiest way to recreate a cluster.

Tecton creates an EMR cluster intended for use with notebooks. It is usually named yourco-notebook-cluster and already has the required configuration applied. It can be cloned as needed for notebook users.

To set up a new interactive EMR cluster from scratch, follow these steps.

1. Install the Tecton SDK

  1. Create a new EMR cluster
    • Specify your IAM role as the instance profile
    • Use emr-5.30.0 with Spark 2.4.5, Hive 2.3.6, and Livy 0.7.0
    • We recommend using m5.xlarge EC2 nodes
  2. Tecton support will provide you with an init script path in S3 (for example, s3://tecton-yourco-production-emr-scripts/install_emr_notebook_libraries.sh). Specify this script as a bootstrap action when creating the cluster; it installs the Tecton SDK on the cluster nodes. A sketch of the full cluster creation is shown after this list.
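As an illustration, creating such a cluster with boto3 might look roughly like the sketch below. The role names, script path, and log bucket are placeholders; in practice most teams create the cluster through the EMR console or clone the Tecton-managed notebook cluster.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# All names and paths below are placeholders -- substitute your own.
emr.run_job_flow(
    Name="yourco-notebook-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}, {"Name": "Livy"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-tecton-sdk",
            "ScriptBootstrapAction": {
                "Path": "s3://tecton-yourco-production-emr-scripts/install_emr_notebook_libraries.sh"
            },
        }
    ],
    Configurations=[
        # The spark-env / yarn-env configuration shown earlier goes here.
    ],
    JobFlowRole="<your-instance-profile>",  # IAM role with access to your data
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://<your-log-bucket>/emr-logs/",
)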

2. Additional Permissions

If your EMR cluster is in a different AWS account, you must also configure read access to all of the S3 buckets Tecton uses (these are in the data plane account and are prefixed with tecton-), as well as access to the underlying data sources Tecton reads, in order to have full functionality.
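For example, granting the cluster's instance profile read access to the Tecton-prefixed S3 buckets could be sketched with boto3 as below; the role name, policy name, and bucket patterns are assumptions you would adapt to your account.

import json
import boto3

iam = boto3.client("iam")

# Placeholder policy granting read access to Tecton-prefixed buckets.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::tecton-*",
                "arn:aws:s3:::tecton-*/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="<your-emr-instance-profile-role>",  # placeholder
    PolicyName="tecton-s3-read-access",           # placeholder
    PolicyDocument=json.dumps(policy),
)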

3. Configuring the notebook

EMR notebooks that interact with Tecton should use the PySpark kernel. We recommend that all EMR notebooks use the following configuration as the first cell executed in the notebook:

%%configure -f
{
    "conf" : {
        "spark.pyspark.python": "python3.7",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
        "spark.jars": "s3://tecton.ai.public/jars/delta-core.jar"
    }
}

Additional configuration may be required when connecting to specific data sources or using certain features; these configurations are listed below.

4. Verifying the Connection

Create a notebook connected to the cluster, making sure to select the PySpark kernel, and run the following in the notebook. If successful, you should see a list of workspaces, including the "prod" workspace.

import tecton
tecton.list_workspaces()
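
As a further sanity check, you can also fetch a workspace and list its feature views (method names may vary slightly between SDK versions):

import tecton

ws = tecton.get_workspace("prod")
ws.list_feature_views()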

Tecton SDK Updates

Updates to the Tecton SDK are released regularly. For major updates, you may be required to update the SDK to continue using Tecton. For EMR interactive clusters set up with the process above, start a new cluster or clone an existing one to install the latest Tecton SDK version.

Additional Jars and libraries

Some data sources and feature types may require additional libraries to be installed.

Data Sources

For data sources, run the following in your notebook's first cell (the %%configure cell) before running any other commands.

Redshift

%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3.7",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
        "spark.jars":"s3://tecton.ai.public/jars/delta-core.jar,s3://tecton.ai.public/jars/spark-redshift_2.11-4.1.1.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar, s3://tecton.ai.public/jars/spark-avro_2.11-3.0.0.jar, s3://tecton.ai.public/jars/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar,s3://tecton.ai.public/jars/postgresql-9.4.1212.jar"
    }
}

Kinesis

%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3.7",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
        "spark.jars":"s3://tecton.ai.public/jars/delta-core.jar,s3://tecton.ai.public/jars/spark-sql-kinesis_2.11-1.2.0_spark-2.4.jar"
    }
}

Snowflake

%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3.7",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
        "spark.jars.packages": "net.snowflake:spark-snowflake_2.11:2.8.3-spark_2.4",
        "spark.jars":"s3://tecton.ai.public/jars/delta-core.jar,s3://tecton.ai.public/jars/snowflake-jdbc-3.12.15.jar"
    }
}

Kafka

%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3.7",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
        "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5",
        "spark.jars":"s3://tecton.ai.public/jars/delta-core.jar,s3://tecton.ai.public/jars/snowflake-jdbc-3.12.15.jar"
    }
}

Feature Types

Last N Features

%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3.7",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
        "spark.jars.packages": "xerces:xercesImpl:2.8.0",
        "spark.jars":"s3://tecton.ai.public/jars/delta-core.jar,s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs.jar"
    }
}

Additional python libraries

To install libraries from the Python Package repo, you can run a command like this at any time after running the initial %%configure command:

sc.install_pypi_package("pandas==1.1.5")

Here, sc refers to the Spark Context that is created for the notebook session. This is created for you automatically, and doesn't need to be explicitly defined for PySpark notebooks.

Additional Resources

Amazon EMR Notebooks Documentation

Install Python libraries on a running cluster with EMR Notebooks