S3 Buckets

Overview

To grant Tecton access to your S3 data sources, use AWS IAM role-based permissions. This requires setting bucket policies for the S3 buckets you want to use with Tecton.

To add S3 buckets:

  1. Add bucket policies
  2. Register the S3 data source
  3. Test the S3 data source

Important

If your S3 data source is partitioned in a directory structure, we recommend that you register the data source with your AWS Glue Data Catalog and add the data as a Hive data source.

Adding a Bucket Policy

This AWS blog post explains how to configure bucket policies using IAM roles. An example bucket policy is shown below. The bucket policy gives permissions to the IAM role that Tecton uses to run Spark jobs.

The Principal in the policy below is the one used to add S3 data sources to Tecton's Free Trial.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GrantResourceAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::{YOUR-TECTON-AWS-ACCOUNT}:root"
            },
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
                "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
            ]
        }
    ]
}
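If you manage bucket policies from code rather than pasting JSON by hand, the policy above can be generated with a small helper. This is a sketch; `make_tecton_bucket_policy` and its arguments are hypothetical names, and the account ID and bucket name are placeholders:

```python
import json

def make_tecton_bucket_policy(tecton_account_id: str, bucket_name: str) -> str:
    """Build the example bucket policy above as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantResourceAccess",
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{tecton_account_id}:root"
                },
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                # Both ARNs are required: s3:ListBucket applies to the bucket
                # ARN, while s3:GetObject applies to the object ARNs (".../*").
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=4)

# Example: render the policy for a hypothetical account and bucket.
print(make_tecton_bucket_policy("123456789012", "my-feature-data"))
```

The resulting string can be pasted into the S3 console's bucket-policy editor or attached with `aws s3api put-bucket-policy --bucket my-feature-data --policy file://policy.json`.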

Adding permissions to the Spark Role

If you have a paid version of Tecton, you must also grant the Spark role you configured Tecton with (on Databricks or EMR) read access to your S3 bucket. You can do so by creating a policy with the following permissions and attaching it to the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GrantRoleAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
                "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
            ]
        }
    ]
}
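The role policy above can be attached from the command line with the AWS CLI. This is a sketch; the bucket name, role name, and policy name are placeholders you should substitute with your own:

```shell
# Write the policy above to a file, substituting your bucket name.
cat > spark-role-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GrantRoleAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::my-feature-data/*",
                "arn:aws:s3:::my-feature-data"
            ]
        }
    ]
}
EOF

# Sanity-check the JSON before attaching it.
python3 -m json.tool spark-role-policy.json > /dev/null && echo "policy OK"

# Attach it as an inline policy to your Spark role (requires AWS credentials;
# "my-spark-role" is a placeholder for your Databricks/EMR role name):
# aws iam put-role-policy \
#     --role-name my-spark-role \
#     --policy-name tecton-s3-read \
#     --policy-document file://spark-role-policy.json
```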

Registering an S3 Data Source

Once Tecton has access, register the data sources with Tecton as part of the data_sources.py file in your Feature Repository.

Create a config object using FileDSConfig, then wrap it in a BatchDataSource object along with metadata that makes the new data source discoverable. For example:

sample_data_config = FileDSConfig(
    uri='s3://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq',
    file_format="parquet"
)

sample_data_vds = BatchDataSource(
    name='sample_data',
    batch_ds_config=sample_data_config,
    family='ad_serving',
    tags={
        'release': 'production'
    }
)

After you have created these objects in your local Feature Repository, call tecton apply to submit them to the production Feature Store.

Testing an S3 Data Source

To test that the connection to the S3 data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:

ds = tecton.get_virtual_data_source('sample_data')
ds.get_dataframe().to_spark().limit(10).show()

If the get_dataframe call returns a 403 error, Tecton does not have permission to access the data. Check the bucket policy and your AWS setup. If you continue to get errors, contact Tecton support.