Configure Data Source Access per Workspace
This feature is currently in Private Preview.
- Available for Tecton on Databricks and EMR.
This article shows how to configure Data Source access for a Workspace by limiting the AWS IAM identities that can be assumed by Spark clusters during feature materialization. These controls may be useful if you have a multi-Workspace strategy, where some data sources are sensitive and should only be available for use in a subset of Workspaces.
There are three steps to achieve this configuration:
- Set up the necessary instance profile(s) in AWS:
  - Each role must be added to its respective instance profile.
  - The instance profile for each role must have the same name as the role.
  - Refer to AWS documentation for more details on how to create instance profiles.
- Specify the allow-list of IAM identities (instance profiles or roles) that are available for use in a Workspace.
- Configure the specific instance profile or role used during materialization for a specific Feature View.
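The first step can be scripted. The following is a minimal sketch, assuming the boto3 library and an IAM client created with boto3.client("iam"); the helper name ensure_instance_profile is hypothetical and not part of Tecton or AWS. It creates an instance profile with the same name as the role and adds the role to it.

```python
def ensure_instance_profile(iam_client, role_name: str) -> str:
    """Create an instance profile named after the role and attach the role.

    iam_client is assumed to be a boto3 IAM client, e.g. boto3.client("iam").
    Returns the name of the instance profile.
    """
    # The instance profile must have the same name as the role.
    profile_name = role_name
    iam_client.create_instance_profile(InstanceProfileName=profile_name)
    iam_client.add_role_to_instance_profile(
        InstanceProfileName=profile_name,
        RoleName=role_name,
    )
    return profile_name
```

This sketch does not handle the case where the instance profile already exists; in that case the create call raises an error and only the attach step is needed.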
Specify the instance profile or role allow-list
In order to update the instance profile allow-list for a live Workspace, open a ticket with Tecton Support and provide:
- The AWS IAM identities that should be included in the allow-list
  - For Databricks, you'll provide instance profile ARNs similar to arn:aws:iam::000000000000:instance-profile/your-tecton-spark-role
  - For EMR, you'll provide IAM roles similar to your-tecton-spark-role
- The Workspaces for which the allow-list should be configured
Tecton Support will need confirmation by a current Tecton Admin from your account before updating the allow-list.
When specifying your allow-lists, note that:
- Once any allow-list has been configured for your account, all live Workspaces will require allow-lists. If no allow-list is configured for a live Workspace, then tecton plan and tecton apply will fail.
- If the Workspace has existing Feature Views using instance profiles or roles that are not on the allow-list, then new materialization jobs will fail at the next job attempt. Because stream jobs are long running, it may be some time before the current job is cancelled and the next attempt starts.
- The instance profile or role specified during Tecton deployment must have access to all data sources in order to perform validation. You do not need to use this instance profile or role during materialization.
For example, consider the following scenario:
- You have Hive tables A, B, & C, and roles A', B', & C', which have permission to access their corresponding tables. Additionally, you have live Workspaces X, Y, & Z.
- Hive table A is safe for use by any team. Table B can be used by Workspaces X & Y, but not Z. Table C can only be used by Workspace Y.
Then you should configure the allow-lists as follows:
- Workspace X: roles A', B'
- Workspace Y: roles A', B', C'
- Workspace Z: role A'
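The mapping in this scenario can be derived mechanically, which may help when preparing a request for many tables and Workspaces. The following is a minimal sketch in plain Python; the names TABLE_WORKSPACES, TABLE_ROLE, and build_allow_lists are hypothetical and only model the scenario. The allow-list itself is still configured through Tecton Support.

```python
# Which Workspaces may read each Hive table (from the scenario above).
TABLE_WORKSPACES = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Y"},
}

# The role that has permission to access each table.
TABLE_ROLE = {"A": "A'", "B": "B'", "C": "C'"}


def build_allow_lists(table_workspaces, table_role):
    """Return {workspace: sorted list of roles that Workspace may use}."""
    allow_lists = {}
    for table, workspaces in table_workspaces.items():
        for workspace in workspaces:
            # A Workspace gets a role only if it may use that role's table.
            allow_lists.setdefault(workspace, set()).add(table_role[table])
    return {ws: sorted(roles) for ws, roles in sorted(allow_lists.items())}


ALLOW_LISTS = build_allow_lists(TABLE_WORKSPACES, TABLE_ROLE)
```

With the inputs above, ALLOW_LISTS maps X to roles A' and B', Y to roles A', B', and C', and Z to role A' only.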
Configure the instance profile or role for a Feature View
To configure the instance profile or role used during batch or stream materialization, you must use the DatabricksJsonClusterConfig or EMRJsonClusterConfig interface. If the instance profile or role used by a Feature View is not on the allow-list defined above, then tecton plan and tecton apply will display an error message.
- For DatabricksJsonClusterConfig, assign your instance profile ARN to the instance_profile_arn parameter within the new_cluster.aws_attributes object. See the example here.
- For EMRJsonClusterConfig, assign your role name to the JobFlowRole parameter. See the example here.
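As an illustration of where these fields sit, the following minimal sketch builds the two JSON fragments with the Python standard library. The helper names are hypothetical, the EMR key is spelled JobFlowRole as in the Amazon EMR RunJobFlow API, and a real cluster configuration needs further fields (for example node types and Spark version) that are omitted here. How the resulting string is passed to DatabricksJsonClusterConfig or EMRJsonClusterConfig is not shown; consult the SDK reference for the exact constructor.

```python
import json

# Placeholder identities, matching the examples earlier in this article.
INSTANCE_PROFILE_ARN = "arn:aws:iam::000000000000:instance-profile/your-tecton-spark-role"
ROLE_NAME = "your-tecton-spark-role"


def databricks_cluster_json(instance_profile_arn: str) -> str:
    """Return a JSON fragment with the ARN under new_cluster.aws_attributes."""
    cluster = {
        "new_cluster": {
            "aws_attributes": {"instance_profile_arn": instance_profile_arn},
            # Other required cluster settings are intentionally omitted.
        }
    }
    return json.dumps(cluster, indent=2)


def emr_cluster_json(role_name: str) -> str:
    """Return a JSON fragment with the role name under JobFlowRole."""
    # Other required cluster settings are intentionally omitted.
    return json.dumps({"JobFlowRole": role_name}, indent=2)


print(databricks_cluster_json(INSTANCE_PROFILE_ARN))
print(emr_cluster_json(ROLE_NAME))
```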
If you do not specify DatabricksJsonClusterConfig or EMRJsonClusterConfig, then the default instance profile or role defined during Tecton deployment will be used for materializing the Feature View. If the allow-list specified for a Workspace does not include the default instance profile or role, then tecton plan or tecton apply will fail if any Feature View does not specify these cluster configuration options.
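The fallback rule can be modeled in a few lines to check a Workspace before applying. This is a sketch of the rule as described in this article, not Tecton's validation code; the function names and the example Feature View names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def effective_identity(configured: Optional[str], default: str) -> str:
    """Identity used for materialization: the Feature View's own, else the default."""
    return configured if configured is not None else default


def find_violations(
    feature_views: Dict[str, Optional[str]],
    allow_list: Sequence[str],
    default: str,
) -> List[str]:
    """Return names of Feature Views whose effective identity is not allowed."""
    allowed = set(allow_list)
    return sorted(
        name
        for name, configured in feature_views.items()
        if effective_identity(configured, default) not in allowed
    )


# Hypothetical Workspace: the default role is not on the allow-list, so the
# Feature View without an explicit identity is reported.
feature_views = {"user_clicks": "role-a", "user_spend": None}
print(find_violations(feature_views, allow_list=["role-a"], default="default-role"))
```

Under these assumptions, any name returned corresponds to a Feature View that would cause tecton plan or tecton apply to fail.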