Version: 0.9

EMRClusterConfig

Summary

Configuration used to specify materialization cluster options on EMR.

This class describes the attributes of the new clusters that are created in EMR during materialization jobs. You can configure cluster options such as cluster size and extra pip dependencies.

Example

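from tecton import batch_feature_view, EMRClusterConfig, FilteredSource, transformation


# credit_scores_batch is a batch data source defined elsewhere.
@batch_feature_view(
    sources=[FilteredSource(credit_scores_batch)],
    # EMRClusterConfig can be passed as an argument to the batch feature view decorator
    batch_compute=EMRClusterConfig(
        instance_type="m5.2xlarge",
        number_of_workers=4,
        extra_pip_dependencies=["tensorflow==2.2.0"],
    ),
    # Other named arguments to batch feature view
    ...,
)
def credit_scores_features(credit_scores_batch):
    # Placeholder name and body; the feature view definition is omitted here.
    ...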

# Use the tensorflow package in the UDF since tensorflow will be installed
# on the EMR Spark cluster. The import has to be within the UDF body. Putting it at the
# top of the file or inside the transformation function (outside the UDF) won't work.


@transformation(mode="pyspark")
def test_transformation(transformation_input):
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    def my_tensorflow(x):
        # Import inside the UDF body so it is resolved where the UDF runs.
        import tensorflow as tf

        return int(tf.math.log1p(float(x)).numpy())

    my_tensorflow_udf = F.udf(my_tensorflow, IntegerType())

    return transformation_input.select("entity_id", "timestamp", my_tensorflow_udf("clicks").alias("log1p_clicks"))
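
The UDF in the example defers its import to the function body and truncates the result to an integer. That pattern can be tried locally without Spark or tensorflow. The following is a minimal sketch, assuming the standard-library math.log1p as a stand-in for tf.math.log1p; the function name log1p_clicks is hypothetical and not part of Tecton.

```python
def log1p_clicks(x):
    # Deferred import: resolved when the function is called, not when the
    # enclosing module is loaded. math stands in for tensorflow here.
    import math

    # int() truncates toward zero, mirroring the int(...) call in the UDF above.
    return int(math.log1p(float(x)))


if __name__ == "__main__":
    for clicks in (0, 10, 100):
        print(clicks, log1p_clicks(clicks))
```

Because the import statement only runs when the function is called, defining the function does not require the package to be present in the environment that loads the module.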

Attributes

The attributes are the same as the __init__ method parameters. See below.

Methods

__init__(...)

Parameters

  • instance_type (Optional[str]) – Instance type for the cluster. Must be a valid type as listed in https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html. Additionally, Graviton instances such as the m6g family are not supported. If not specified, a value determined by the Tecton backend is used.

  • instance_availability (Optional[str]) – Instance availability for the cluster: spot, on_demand, or spot_with_fallback. In 0.8+, Stream Feature Views default to on_demand and support only that value. In all other cases the default is spot.

  • number_of_workers (Optional[int]) – Number of instances for the materialization job. If not specified, a value determined by the Tecton backend is used. (Default: None)

  • first_on_demand (Optional[int]) – The number of nodes, counted from the start of the cluster, that use on_demand instances. The remaining nodes use the type specified by instance_availability. If first_on_demand >= 1, the master node uses an on_demand instance. Setting first_on_demand >= 1 is recommended for cluster configs for critical streaming features. (Default: None)

  • root_volume_size_in_gb (Optional[int]) – Size of the root volume in GB per instance for the materialization job. If not specified, a value determined by the Tecton backend is used. (Default: None)

  • extra_pip_dependencies (Optional[List[str]]) – Extra pip dependencies to be installed on the materialization cluster. Must be PyPI packages, or wheels/eggs in S3 or DBFS. (Default: None)

    To use PyPI packages, specify the package name and optionally the version, e.g. "tensorflow" or "tensorflow==2.2.0". To use custom code, package it as a Python wheel or egg file in S3 or DBFS, then specify the path to the file, e.g. "s3://my-bucket/path/custom.whl", or "dbfs:/path/to/custom.whl".

    These libraries will only be available inside Spark UDFs. For example, if you set extra_pip_dependencies=["tensorflow"], you can use it in your transformation as shown in the Example section above.

  • spark_config (Optional[Dict[str, str]]) – Map of Spark configuration options and their respective values that will be passed to the FeatureView materialization Spark cluster. (Default: None)

  • emr_version (str) – EMR version of the cluster. See this page for a list of supported versions. (Default: emr-6.7.0)
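
The constraints described in the parameter list can be checked before a config is written. The following is a rough sketch in plain Python, not part of the Tecton SDK: the helper names check_cluster_options and classify_dependency are hypothetical, and the Graviton pattern is a heuristic assumption (a "g" immediately after the generation digit, as in m6g) rather than an official rule.

```python
import re
from typing import List, Optional

# Values listed in the instance_availability description above.
VALID_AVAILABILITY = {"spot", "on_demand", "spot_with_fallback"}

# Heuristic assumption, not an official rule: families such as m6g, c6gn,
# or r6gd have a "g" right after the generation digit.
_GRAVITON_PATTERN = re.compile(r"^[a-z]+\d+g[a-z]*\.")


def classify_dependency(spec: str) -> str:
    """Label an extra_pip_dependencies entry as 's3', 'dbfs', or 'pypi'."""
    if spec.startswith("s3://"):
        return "s3"
    if spec.startswith("dbfs:/"):
        return "dbfs"
    return "pypi"


def check_cluster_options(
    instance_type: Optional[str] = None,
    instance_availability: Optional[str] = None,
    extra_pip_dependencies: Optional[List[str]] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if instance_type is not None and _GRAVITON_PATTERN.match(instance_type):
        problems.append(f"{instance_type} looks like a Graviton instance type")
    if instance_availability is not None and instance_availability not in VALID_AVAILABILITY:
        problems.append(f"unknown instance_availability: {instance_availability}")
    for spec in extra_pip_dependencies or []:
        # Custom code in S3 or DBFS is expected to be a wheel or egg file.
        if classify_dependency(spec) != "pypi" and not spec.endswith((".whl", ".egg")):
            problems.append(f"{spec} is not a wheel or egg file")
    return problems


if __name__ == "__main__":
    print(check_cluster_options(
        instance_type="m6g.xlarge",
        instance_availability="spot",
        extra_pip_dependencies=["tensorflow==2.2.0", "s3://my-bucket/path/custom.whl"],
    ))
```

A check like this only catches the cases spelled out in the parameter descriptions; whether an instance type is actually accepted still depends on the EMR supported-instance list and the Tecton backend.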
