Using Tecton on Spark with Third-Party Notebooks
Third-party notebooks that run outside of EMR and Databricks can use Apache Livy to connect to a Tecton on Spark (EMR or Databricks) cluster.
Livy provides a REST API that allows a notebook to send commands to a remote Spark cluster and receive results.
Tecton Support does not provide assistance for problems that arise from third-party notebooks connected to a Tecton notebook cluster. If you are comfortable using Apache Livy to connect third-party notebooks, the instructions below are provided for informational purposes.
Tecton recommends using Databricks and EMR notebooks that are connected to a Tecton notebook cluster, instead of third-party notebook environments; in our testing we found Databricks and EMR notebooks to be more reliable.
Using Livy to connect to Tecton from a third-party notebook
Follow these steps:
1. Set up data source permissions
Spark must have the appropriate permissions to access Tecton's data sources. This can be accomplished by either:
- Running the local Spark context on an EC2 instance that has an instance profile with access to the necessary data sources.
- Manually providing AWS credentials that grant access to the data sources by setting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables.
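For the second option, setting the credentials from inside the notebook might look like the sketch below; the values shown are hypothetical placeholders.

```python
import os

# Hypothetical placeholder values -- substitute credentials that actually
# grant access to your data sources.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
```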
2. Install the required software
The following software must be installed on the machine that will run the notebook:
- Java 8 or Java 11
- Python 3.7 or 3.8
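As an illustrative check (not part of the official setup), you can confirm both prerequisites from Python:

```python
import shutil
import sys

# Print the interpreter version; it should report 3.7 or 3.8.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")

# Confirm a Java runtime (Java 8 or 11) is on the PATH.
print("java on PATH:", shutil.which("java") is not None)
```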
3. Initialize cluster secrets and parameters
The following secrets must be created either in AWS Secrets Manager or as environment variables.
- As AWS secrets: refer to this docs page for how to initialize them in the Secrets Manager.
- As environment variables: create the environment variables API_SERVICE and TECTON_API_KEY (or set them manually in your notebook using os.environ). Refer to the docs page above for the values to assign to these variables.
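If you take the environment-variable route, setting them inside the notebook might look like the sketch below; the values shown are hypothetical placeholders, so use the real values from the docs page above.

```python
import os

# Hypothetical placeholder values -- take the real values from the
# docs page referenced above.
os.environ["API_SERVICE"] = "https://<your-cluster>.tecton.ai/api"
os.environ["TECTON_API_KEY"] = "<your-tecton-api-key>"
```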
4. Install Tecton and Spark
pip install 'tecton[pyspark]'
5. Install the sparkmagic Jupyter/Jupyter Lab extension
This extension provides additional information when running Spark jobs in a local notebook.
6. Initialize the PySpark session
In your notebook, run the following commands:
On the line `builder = builder.config(...)`, add the jars required to interact with AWS; which additional jars you need depends on your data sources.
```python
from pyspark.sql import SparkSession

builder = SparkSession.builder
# Add the jars needed to interact with AWS, depending on your data sources:
builder = builder.config(...)
# Set the S3 client implementation:
builder = builder.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark = builder.getOrCreate()
```
7. Run your Tecton commands
Once the Spark session is created, Tecton’s SDK will automatically pick up the session and use it. From this point forward, you’ll be able to run Tecton SDK commands using your local Spark context.