Version: 0.8

Connect to S3

Overview

To grant Tecton access to your S3 data, use AWS IAM role-based permissions. This requires setting bucket policies for the S3 buckets you want to use with Tecton.

To add S3 buckets:

1. Add bucket policies
2. Register the S3 data source
3. Test the S3 data source

Performance Guidance

If your S3 data source is partitioned in a directory structure, we recommend that you register the data source with your AWS Glue Data Catalog and add the data as a Hive data source.

Adding a Bucket Policy

This AWS blog post explains how to configure bucket policies using IAM roles. An example bucket policy is shown below. The bucket policy gives permissions to the IAM role that Tecton uses to run Spark jobs.

The Principal in the policy below is the one used to add S3 data to Tecton's Free Trial.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantResourceAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{YOUR-TECTON-AWS-ACCOUNT}:root"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
      ]
    }
  ]
}
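When the same policy has to be produced for several buckets, generating the JSON can avoid copy-and-paste mistakes in the two Resource ARNs. The following is a minimal sketch using only the Python standard library; the helper name `build_bucket_policy` and the sample bucket name and account ID are hypothetical placeholders, not part of Tecton or AWS tooling.

```python
import json

# Actions listed in the example bucket policy above.
READ_ACTIONS = [
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:ListBucket",
    "s3:GetBucketLocation",
]


def build_bucket_policy(bucket_name: str, tecton_account_id: str) -> dict:
    """Return a policy dict mirroring the example above (hypothetical helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantResourceAccess",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{tecton_account_id}:root"},
                "Action": list(READ_ACTIONS),
                # Object-level ARN first, then the bucket itself, as in the example.
                "Resource": [f"{bucket_arn}/*", bucket_arn],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket name and account ID.
    policy = build_bucket_policy("my-example-bucket", "123456789012")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the bucket's permissions settings in place of the hand-edited template.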

Adding permissions to the Spark Role

If you have a paid version of Tecton, you must also grant the Spark role you configured Tecton with (Databricks or EMR) read access to your S3 bucket. To do so, create a policy with the following permissions and attach it to the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantRoleAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
      ]
    }
  ]
}
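Both Resource entries matter: `s3:ListBucket` and `s3:GetBucketLocation` apply to the bucket ARN, while `s3:GetObject` and `s3:GetObjectVersion` apply to the object ARNs under `/*`. The sketch below checks a policy document for the four actions and both ARNs before you attach it. The function name `missing_permissions` is a hypothetical helper, and the check is a plain string comparison that does not evaluate wildcards or conditions the way AWS does.

```python
REQUIRED_ACTIONS = {
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:ListBucket",
    "s3:GetBucketLocation",
}


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def missing_permissions(policy: dict, bucket_name: str) -> list:
    """Return a list of problems; an empty list means nothing is missing.

    Hypothetical helper: exact string matching only, no wildcard expansion.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    object_arn = f"{bucket_arn}/*"
    actions, resources = set(), set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions.update(_as_list(statement.get("Action", [])))
        resources.update(_as_list(statement.get("Resource", [])))
    problems = [f"missing action: {a}" for a in sorted(REQUIRED_ACTIONS - actions)]
    problems += [
        f"missing resource: {arn}"
        for arn in (bucket_arn, object_arn)
        if arn not in resources
    ]
    return problems
```

An empty result only means the expected strings are present; it is not a substitute for testing access from Tecton itself.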

Registering an S3 Data Source

Once Tecton has access, register the data sources with Tecton as part of the data_sources.py file in your Feature Repository.

Create a config object using FileConfig (see the performance guidance above; it may be more performant to use a Hive data source) and place it in a BatchSource object with metadata to discover the new data source. For example:

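from tecton import BatchSource, FileConfig

# File-based config pointing at the Parquet file in your bucket.
sample_data_config = FileConfig(uri="s3://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq", file_format="parquet")

# BatchSource wraps the config and registers it under the name "sample_data".
sample_data_vds = BatchSource(
    name="sample_data",
    batch_config=sample_data_config,
)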

After you have created these objects in your local Feature Repository, call tecton apply to submit them to the production Feature Store.
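A common cause of access errors is a FileConfig URI that points at a bucket the policies do not cover. As a rough sanity check, the sketch below derives the bucket and object ARNs from an `s3://` URI so they can be compared with the Resource entries in your policies. The helper name `s3_uri_to_arns` and the sample URI are hypothetical; only the Python standard library is used.

```python
from urllib.parse import urlparse


def s3_uri_to_arns(uri: str) -> tuple:
    """Return (bucket_arn, object_arn) for an s3:// URI (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # For s3:// URIs the bucket name is the network-location part.
    bucket_arn = f"arn:aws:s3:::{parsed.netloc}"
    return bucket_arn, f"{bucket_arn}/*"


if __name__ == "__main__":
    # Placeholder URI; substitute the one passed to FileConfig.
    for arn in s3_uri_to_arns("s3://my-example-bucket/data/sample.pq"):
        print(arn)
```

Both returned ARNs should appear in the Resource list of the bucket policy and, for paid deployments, of the Spark role policy.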

Testing S3 Data

To test that the connection to S3 has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:

import tecton

ws = tecton.get_workspace("prod")
ds = ws.get_data_source("sample_data")
# Preview the first 10 rows as a pandas DataFrame (shown as the cell's result).
ds.get_dataframe().to_pandas().head(10)

If you get a 403 error when calling get_dataframe, Tecton does not have permission to access the data. Check the bucket policy and the AWS setup. If you continue to get errors, contact Tecton support.
