
Use Tecton on Spark with Third-Party Notebooks

Third-party notebooks that run outside of EMR and Databricks can use Apache Livy to connect to a Tecton on Spark (EMR or Databricks) cluster.

Livy is a REST API that allows a notebook to send commands to a remote Spark cluster and receive results.

note

Tecton Support does not provide assistance for problems that arise from third-party notebooks connected to a Tecton notebook cluster. If you are comfortable using Apache Livy to connect third-party notebooks, the instructions below are provided for informational purposes.

Tecton recommends using Databricks and EMR notebooks that are connected to a Tecton notebook cluster, instead of third-party notebook environments; in our testing we found Databricks and EMR notebooks to be more reliable.

Using Livy to connect to Tecton via third-party notebook​

Follow these steps:

1. Permissions​

Spark must have the appropriate permissions to access Tecton data sources. This can be accomplished by either:

  • Running the local Spark context on an EC2 instance with an instance profile that has access to the necessary data sources.

  • Manually providing AWS credentials that grant access to the data sources by setting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables, as shown in the sketch after this list.
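
For the second option, a minimal sketch of setting the credentials from inside the notebook before the Spark session is created (the values are placeholders; substitute credentials that have access to your data sources):

import os

# Placeholder credentials; replace with values that grant access to your data sources.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"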

2. Software​

The following software must be installed on the machine that will be running the notebook:

  • Java 8 or Java 11

  • Python 3.7 or 3.8

  • pip

3. Initialize cluster secrets and parameters​

The following secrets must be created either in AWS Secrets Manager or as environment variables.

As AWS secrets: Please refer to this docs page for how to initialize them in the Secrets Manager.

As environment variables: Create the environment variables API_SERVICE and TECTON_API_KEY (or set them in your notebook using os.environ[]). Refer to the docs page above for the values to put in these variables; a sketch is shown below.
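
A minimal sketch of setting them from inside a notebook (the values are placeholders; use the values described in the docs page above):

import os

# Placeholder values; see the Secrets Manager docs page referenced above for the real ones.
os.environ["API_SERVICE"] = "https://<your-cluster>.tecton.ai/api"
os.environ["TECTON_API_KEY"] = "<your-tecton-api-key>"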

4. Install Tecton and Spark​

Run pip install 'tecton[pyspark]'

5. Install the sparkmagic Jupyter/Jupyter Lab extension​

This extension provides additional information when running Spark jobs in a local notebook.
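
If it is not already installed, the extension can typically be added to the same Python environment with pip (see the sparkmagic project documentation for the full setup and for enabling the extension in Jupyter):

pip install sparkmagic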

6. Initialize the PySpark session​

In your notebook, run the following commands:

note

On the builder = builder.config(...) line, add any necessary jars that are required to interact with AWS. See this page for a list of potential additional jars, depending on your data sources.

from pyspark.sql import SparkSession

builder = SparkSession.builder

# Add the jars needed to interact with AWS; adjust packages and versions for your data sources.
builder = builder.config(
    "spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0,com.amazonaws:aws-java-sdk-bundle:1.11.375"
)

# Set the S3 client implementation:
builder = builder.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Create the session so the Tecton SDK can pick it up (see the next step).
spark = builder.getOrCreate()

7. Run your Tecton commands​

Once the Spark session is created, Tecton’s SDK will automatically pick up the session and use it. From this point forward, you’ll be able to run Tecton SDK commands using your local Spark context.
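
For example, a minimal sketch of a first command to verify the connection (the workspace name "prod" is a placeholder; use one of your own workspaces):

import tecton

# Listing the feature views in a workspace confirms the SDK can reach your Tecton cluster.
ws = tecton.get_workspace("prod")  # placeholder workspace name
print(ws.list_feature_views())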
