Version: 1.1

Connect to AWS Glue Data Catalog

Overview

Tecton can use Hive as a source of batch data for feature materialization, provided that AWS Glue Data Catalog is used as the Hive metastore. This page explains how to set up Tecton to use Hive as a data source.

Limitations

Hive datastores have the following limitations:

  • You cannot dynamically switch between different Glue Data Catalogs. You must restart the cluster for new Spark configurations to take effect.
  • For other considerations when using Glue Data Catalogs, see the AWS Spark documentation.

Configure access to the Glue Data Catalog

Info: The instructions that follow apply only when Tecton needs to access a Glue Data Catalog in an AWS account other than the Tecton data plane account.

Prerequisites

Before you begin, you must have:

  • A deployed Tecton cluster.
  • AWS administrator access to IAM roles and policies for the Tecton deployment AWS account.
  • The target Glue Data Catalog ID.
  • AWS administrator access to IAM roles and policies for the Glue Data Catalog AWS account.

Procedure

To configure access to the Glue Data Catalog, follow these steps:

Grant Glue Data Catalog Access to Tecton's Role

Log in to the AWS account of the target Glue Data Catalog and go to the Glue Console. In Settings, paste the following policy into the Permissions box. Set the "AWS" principal to the ARN of the Spark role (the instance profile role) used by Databricks or EMR. In "Resource", replace <aws-region-target-glue-catalog> and <aws-account-id-target-glue-catalog> with the region and account ID of the target Glue Data Catalog.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueAllowTectonAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<TECTON ROLE ARN>"
      },
      "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": "arn:aws:glue:<aws-region-target-glue-catalog>:<aws-account-id-target-glue-catalog>:*"
    }
  ]
}
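Before pasting the policy, it can help to confirm that every placeholder has been replaced and that the result is still valid JSON. The following is a minimal sketch of such a check, not part of Tecton or AWS tooling: the function name `fill_policy`, the role ARN, the region, and the account ID are hypothetical, and the template assumes the role ARN placeholder is written as a quoted string.

```python
import json
import re

# Template mirroring the resource policy above, with placeholders in angle brackets.
TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueAllowTectonAccess",
      "Effect": "Allow",
      "Principal": {"AWS": "<TECTON ROLE ARN>"},
      "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": "arn:aws:glue:<aws-region-target-glue-catalog>:<aws-account-id-target-glue-catalog>:*"
    }
  ]
}
"""

PLACEHOLDER = re.compile(r'<[^<>"]+>')


def fill_policy(template: str, values: dict) -> dict:
    """Substitute <placeholders> in a policy template and parse the result."""
    filled = template
    for name, value in values.items():
        filled = filled.replace(f"<{name}>", value)
    leftover = PLACEHOLDER.findall(filled)
    if leftover:
        raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
    # json.loads raises json.JSONDecodeError if the text is not valid JSON.
    return json.loads(filled)


if __name__ == "__main__":
    policy = fill_policy(
        TEMPLATE,
        {
            # Hypothetical values for illustration only.
            "TECTON ROLE ARN": "arn:aws:iam::111111111111:role/example-spark-node",
            "aws-region-target-glue-catalog": "us-west-2",
            "aws-account-id-target-glue-catalog": "222222222222",
        },
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON is what would go into the Permissions box. A missing value raises `ValueError` instead of silently leaving a placeholder in the policy.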

For EMR, you also need the following Glue resource policy in the same account as your Tecton data plane, even if you intend to use a Glue catalog from a different account. This is required because of an issue in AWSGlueDataCatalogHiveClientFactory, which EMR uses for Glue access.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueAllowTectonAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<TECTON ROLE ARN>"
      },
      "Action": [
        "glue:GetDatabase"
      ],
      "Resource": [
        "arn:aws:glue:<aws-region-tecton-data-plane>:<aws-account-id-tecton-data-plane>:catalog",
        "arn:aws:glue:<aws-region-tecton-data-plane>:<aws-account-id-tecton-data-plane>:database/default"
      ]
    }
  ]
}

Grant IAM Permissions to the Spark Role

Log in to your Tecton AWS account and go to the IAM Console. Create the following policy and attach it to your Tecton Spark role (created in Databricks Setup or EMR Setup). The statement with Sid GlueSameAccountAccessForEMR is required only for EMR with cross-account Glue; otherwise it is optional. It works around an issue in AWSGlueDataCatalogHiveClientFactory, which EMR uses for Glue access.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueAccess",
      "Effect": "Allow",
      "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": "arn:aws:glue:<aws-region-target-glue-catalog>:<aws-account-id-target-glue-catalog>:*"
    },
    {
      "Sid": "GlueSameAccountAccessForEMR",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase"],
      "Resource": [
        "arn:aws:glue:<aws-region-tecton-data-plane>:<aws-account-id-tecton-data-plane>:catalog",
        "arn:aws:glue:<aws-region-tecton-data-plane>:<aws-account-id-tecton-data-plane>:database/default"
      ]
    }
  ]
}
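Because the GlueSameAccountAccessForEMR statement is needed only for EMR with cross-account Glue, the policy can be generated with that statement included conditionally. The sketch below illustrates one way to do this; `build_spark_role_policy` and the example region and account values are hypothetical and not part of any Tecton or AWS API.

```python
import json

GLUE_READ_ACTIONS = [
    "glue:BatchGetPartition",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetUserDefinedFunction",
    "glue:GetUserDefinedFunctions",
]


def build_spark_role_policy(
    target_region,
    target_account_id,
    data_plane_region=None,
    data_plane_account_id=None,
):
    """Build the IAM policy for the Spark role as a dict.

    The EMR-only statement is added when both data plane values are given.
    """
    statements = [
        {
            "Sid": "GlueAccess",
            "Effect": "Allow",
            "Action": GLUE_READ_ACTIONS,
            "Resource": f"arn:aws:glue:{target_region}:{target_account_id}:*",
        }
    ]
    if data_plane_region and data_plane_account_id:
        prefix = f"arn:aws:glue:{data_plane_region}:{data_plane_account_id}"
        statements.append(
            {
                "Sid": "GlueSameAccountAccessForEMR",
                "Effect": "Allow",
                "Action": ["glue:GetDatabase"],
                "Resource": [f"{prefix}:catalog", f"{prefix}:database/default"],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Hypothetical values: EMR with a cross-account Glue Data Catalog.
    policy = build_spark_role_policy(
        "us-west-2", "222222222222",
        data_plane_region="us-east-1", data_plane_account_id="111111111111",
    )
    print(json.dumps(policy, indent=2))
```

For Databricks, omitting the two data plane arguments yields a policy with only the GlueAccess statement.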

Specify the Target Glue Data Catalog

For Amazon EMR

Tecton needs the aws-account-id-target-glue-catalog value and, if the target Glue Data Catalog is in a different region from the Tecton AWS region, the aws-region-target-glue-catalog value. Your deployment specialist will request these values, which you specified in the previous steps.

For Databricks

If the target Glue Data Catalog region is the same as the Tecton AWS region, you can set the aws-account-id-target-glue-catalog value from the previous steps in the Web UI: navigate to <cluster name>.tecton.ai/app/data-platform/ and enter the value in the AWS Glue Data Catalog id field.

Otherwise, Tecton needs the aws-account-id-target-glue-catalog and the aws-region-target-glue-catalog values that you set in the previous step. Your deployment specialist will request these values.

Validate Permissions

The preceding steps set up the Glue Data Catalog policies, which grant Tecton access only to the metadata. The related S3 bucket and object-level access permissions are defined separately by S3, and can be more restrictive if required. For more information, see this AWS blog post.

Validate Permissions with Databricks

To validate the cross-account permissions with Databricks, launch a cluster with the *-spark-node instance profile:

  1. Create a cluster.

  2. Click the Instances tab on the Cluster Creation page.

  3. In the Instance Profiles drop-down list, select the instance profile.

  4. Set the Spark configuration spark.databricks.hive.metastore.glueCatalog.enabled to true.

  5. If the target Glue Data Catalog is in a different AWS account from the Databricks deployment, set the Spark configuration spark.hadoop.hive.metastore.glue.catalogid to the target Glue Data Catalog ID.

  6. If the target Glue Data Catalog is in a different region from the Databricks deployment, set the Spark configuration spark.hadoop.aws.region to the target region.

  7. Verify that you can access the Glue Data Catalog by running the following command in a notebook:

    show databases;

    If the command succeeds, the Tecton cluster is configured to use Glue.
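The Spark configuration settings in steps 4 through 6 depend on whether the catalog is in a different account or region. The sketch below assembles them for a given case. The helper names and example values are hypothetical, and the one-pair-per-line, space-separated output assumes the format used by the cluster's Spark config text box.

```python
def glue_spark_conf(target_catalog_id=None, target_region=None):
    """Return the Spark settings for validating Glue access on Databricks.

    target_catalog_id: set only for a catalog in a different AWS account.
    target_region: set only for a catalog in a different AWS region.
    """
    conf = {"spark.databricks.hive.metastore.glueCatalog.enabled": "true"}
    if target_catalog_id:
        conf["spark.hadoop.hive.metastore.glue.catalogid"] = target_catalog_id
    if target_region:
        conf["spark.hadoop.aws.region"] = target_region
    return conf


def to_spark_conf_text(conf):
    """Render the settings as one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # Hypothetical catalog ID and region for a cross-account, cross-region setup.
    print(to_spark_conf_text(glue_spark_conf("222222222222", "us-west-2")))
```

Calling `glue_spark_conf()` with no arguments covers the same-account, same-region case, where only the Glue catalog flag is set.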

Validate Permissions with Amazon EMR

To validate the cross-account permissions with Amazon EMR, follow these steps:

  1. Create an EMR cluster. A convenient way to do this is to clone an existing Notebook cluster.

  2. Launch a notebook on this cluster.

  3. Verify that you can access the Glue Data Catalog by running the following command in a notebook:

    show databases;

Troubleshooting: Unable to Create Default Database in AWS Glue Catalog

Issue

When an EMR job is started, you receive an error in your logs saying that it does not have enough permissions to create the default database in the AWS Glue catalog. The stack trace may look something like this:

[ARN] is not authorized to perform: glue:CreateDatabase on resource: [ARN] because no resource-based policy allows the glue:CreateDatabase action

Cause

By default, EMR automatically tries to create the default database in the Glue catalog of your AWS account, and the role it runs as may not have permission to do so.

Resolution

Tecton does not use or depend on this database, and EMR handles the failure gracefully, so the error is non-fatal and safe to ignore. Depending on your situation:

  1. To make the error go away, the easiest approach is to add the glue:CreateDatabase permission and let EMR create the database. You can remove the permission after at least one run has created the default database.

  2. If the default database already exists in your AWS Glue catalog, EMR is trying to create it unnecessarily, and the error is safe to ignore.
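If you choose to ignore the error, a small log filter can separate it from other authorization failures that do need attention. This is a minimal sketch, assuming the log line contains the message text shown in the Issue section; the function name and the sample ARNs are hypothetical.

```python
import re

# Matches the fixed part of the message; the surrounding ARNs vary per account.
CREATE_DB_DENIED = re.compile(
    r"is not authorized to perform: glue:CreateDatabase on resource"
)


def is_ignorable_create_database_error(log_line: str) -> bool:
    """Return True if the line reports the glue:CreateDatabase denial."""
    return CREATE_DB_DENIED.search(log_line) is not None


if __name__ == "__main__":
    sample = (
        "arn:aws:sts::111111111111:assumed-role/example-role/i-0abc "
        "is not authorized to perform: glue:CreateDatabase on resource: "
        "arn:aws:glue:us-west-2:111111111111:catalog because no resource-based "
        "policy allows the glue:CreateDatabase action"
    )
    print(is_ignorable_create_database_error(sample))
```

Denials for other Glue actions, such as glue:GetTable, do not match and should still be investigated.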
