Version: 0.7

Connect to S3

Overview

To grant Tecton access to your S3 data, use AWS IAM role-based permissions. This requires setting bucket policies for the S3 buckets you want to use with Tecton.

To add S3 buckets:

  1. Add bucket policies
  2. Register the S3 data source
  3. Test the S3 data source
Performance Guidance

If your S3 data source is partitioned in a directory structure, we recommend that you register the data source with your AWS Glue Data Catalog and add the data as a Hive data source.
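For data registered in your Glue Data Catalog, a Hive-based source can be declared instead of a FileConfig. A hedged sketch of such a declaration follows; the database and table names are placeholders, and the exact HiveConfig parameters available may vary by SDK version:

```python
from tecton import BatchSource, HiveConfig

# Hypothetical Glue database/table names; replace with your own.
sample_hive_config = HiveConfig(
    database="my_glue_database",
    table="my_partitioned_table",
)

sample_hive_source = BatchSource(
    name="sample_hive_data",
    batch_config=sample_hive_config,
)
```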

Adding a Bucket Policy

This AWS blog post explains how to configure bucket policies using IAM roles. An example bucket policy is shown below. The bucket policy gives permissions to the IAM role that Tecton uses to run Spark jobs.

The Principal in the policy below identifies the Tecton AWS account that is granted access; this is the setup used to add S3 data to Tecton's Free Trial.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantResourceAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{YOUR-TECTON-AWS-ACCOUNT}:root"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
      ]
    }
  ]
}
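If you manage bucket policies in code, the template above can be rendered and validated with Python's standard library. This is a minimal sketch; the account ID and bucket name passed in below are placeholders:

```python
import json


def make_bucket_policy(tecton_account_id: str, bucket: str) -> str:
    """Render the bucket policy shown above for a given Tecton account and bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantResourceAccess",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{tecton_account_id}:root"},
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                # Both ARN forms are needed: object-level ("/*") and bucket-level.
                "Resource": [
                    f"arn:aws:s3:::{bucket}/*",
                    f"arn:aws:s3:::{bucket}",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(make_bucket_policy("123456789012", "my-data-bucket"))
```

The rendered JSON can then be applied to the bucket through your usual tooling (console, CLI, or infrastructure-as-code).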

Adding permissions to the Spark Role

If you have a paid version of Tecton, you must also grant the Spark role that you configured Tecton with (on Databricks or EMR) access to read from your S3 bucket. You can do so by creating a policy with the following permissions and attaching it to the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantRoleAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
      ]
    }
  ]
}
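A common mistake in both policies is listing only the object-level ARN (the one ending in `/*`): `s3:ListBucket` and `s3:GetBucketLocation` are evaluated against the bucket ARN itself, so both forms must appear under `Resource`. A quick stdlib sanity check over a policy document, sketched here with the simplifying assumption that `Action` and `Resource` are lists:

```python
def check_resources(policy: dict) -> list[str]:
    """Return a list of problems found in a policy's Resource entries.

    Flags statements that grant bucket-level actions (ListBucket,
    GetBucketLocation) without including the bucket-level ARN.
    """
    bucket_actions = {"s3:ListBucket", "s3:GetBucketLocation"}
    problems = []
    for stmt in policy.get("Statement", []):
        actions = set(stmt.get("Action", []))
        resources = stmt.get("Resource", [])
        # The bucket-level ARN is the one that does not end in "/*".
        has_bucket_arn = any(not r.endswith("/*") for r in resources)
        if actions & bucket_actions and not has_bucket_arn:
            problems.append(f"{stmt.get('Sid', '?')}: missing bucket-level ARN")
    return problems
```

Running this against a policy that lists only `arn:aws:s3:::my-bucket/*` would flag the statement, while the policies above (with both ARN forms) pass cleanly.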

Registering an S3 Data Source

Once Tecton has access, register the data sources with Tecton as part of the data_sources.py file in your Feature Repository.

Create a config object using FileConfig (see the performance guidance above; it may be more performant to use a Hive data source) and wrap it in a BatchSource object so Tecton can discover the new data source. For example:

from tecton import BatchSource, FileConfig

sample_data_config = FileConfig(
    uri="s3://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq",
    file_format="parquet",
)

sample_data_vds = BatchSource(
    name="sample_data",
    batch_config=sample_data_config,
)

After you have created these objects in your local Feature Repository, call tecton apply to submit them to the production Feature Store.

Testing S3 Data

To test that the connection to S3 has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:

import tecton

ws = tecton.get_workspace("prod")
ds = ws.get_data_source("sample_data")
ds.get_dataframe().to_pandas().head(10)

If you get a 403 error when calling get_dataframe, Tecton does not have permission to access the data. Check the bucket policy and your AWS setup. If errors persist, contact Tecton support.
