DatabricksClusterConfig
Summary
Configuration used to specify materialization cluster options on Databricks.
This class describes the attributes of the new clusters that are created in Databricks during materialization jobs. You can configure options for these clusters, such as cluster size and extra pip dependencies.
Note on extra_pip_dependencies: This is a list of packages that will be installed during materialization. To use PyPI packages, specify the package name and optionally the version, e.g. "tensorflow" or "tensorflow==2.2.0". To use custom code, package it as a Python wheel or egg file in S3 or DBFS, then specify the path to the file, e.g. "s3://my-bucket/path/custom.whl" or "dbfs:/path/to/custom.whl".
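The two accepted forms of an extra_pip_dependencies entry can be told apart with a small pre-flight check. The sketch below is illustrative only: classify_dependency and its regular expression are hypothetical helpers, not part of the Tecton SDK, and the pattern assumes only the bare-name and name==version forms shown above.

```python
import re

# Hypothetical helper for illustration; not part of the Tecton SDK.
# Assumption: a PyPI entry is either "name" or "name==version".
_PYPI_SPEC = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(==[A-Za-z0-9.*+!_-]+)?$")
_FILE_SCHEMES = ("s3://", "dbfs:/")
_FILE_SUFFIXES = (".whl", ".egg")


def classify_dependency(dep: str) -> str:
    """Return "file" for an S3/DBFS wheel or egg path, "pypi" for a package spec."""
    if dep.startswith(_FILE_SCHEMES):
        if not dep.endswith(_FILE_SUFFIXES):
            raise ValueError(f"expected a .whl or .egg file: {dep!r}")
        return "file"
    if _PYPI_SPEC.match(dep):
        return "pypi"
    raise ValueError(f"unrecognized dependency: {dep!r}")


deps = [
    "tensorflow",
    "tensorflow==2.2.0",
    "s3://my-bucket/path/custom.whl",
    "dbfs:/path/to/custom.whl",
]
kinds = [classify_dependency(d) for d in deps]
print(kinds)
```

Running a check like this before applying a feature view may catch a mistyped path earlier than a failed materialization job would, though what the backend actually accepts may be broader than this pattern.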
These libraries will only be available inside Spark UDFs. For example, if you set extra_pip_dependencies=["tensorflow"], you can use it in your transformation as shown below.
Example
from tecton import batch_feature_view, transformation, FilteredSource, DatabricksClusterConfig

@batch_feature_view(
    sources=[FilteredSource(credit_scores_batch)],
    # Can be an argument instance to a batch feature view decorator
    batch_compute=DatabricksClusterConfig(
        instance_type='m5.2xlarge',
        spark_config={"spark.executor.memory": "12g"},
        extra_pip_dependencies=["tensorflow"],
    ),
    # Other named arguments to batch feature view...
)

# Use the tensorflow package in the UDF since tensorflow will be installed
# on the Databricks Spark cluster. The import has to be within the UDF body. Putting it at the
# top of the file or inside the transformation function won't work.

@transformation(mode='pyspark')
def test_transformation(transformation_input):
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    def my_tensorflow(x):
        # Import inside the UDF body so it resolves on the Spark executors.
        import tensorflow as tf
        return int(tf.math.log1p(float(x)).numpy())

    my_tensorflow_udf = F.udf(my_tensorflow, IntegerType())

    return transformation_input.select(
        'entity_id',
        'timestamp',
        my_tensorflow_udf('clicks').alias('log1p_clicks'),
    )
Attributes
The attributes are the same as the __init__ method parameters. See below.
Methods
__init__(...)
Parameters
kind (Literal['DatabricksClusterConfig']) - Default: DatabricksClusterConfig
instance_type (Optional[str]) - Instance type for the cluster. Must be a valid type as listed in the Databricks documentation. Graviton instances, such as the m6g family, are not supported. If not specified, a value determined by the Tecton backend is used. Default: None
instance_availability (Optional[str]) - Instance availability for the cluster: "spot", "on_demand", or "spot_with_fallback". Default: spot
number_of_workers (Optional[int]) - Number of instances for the materialization job. If not specified, a value determined by the Tecton backend is used. If set to 0, jobs will run on single-node clusters. Default: None
first_on_demand (Optional[int]) - The first first_on_demand nodes of the cluster will use on_demand instances. The rest will use the type specified by instance_availability. If first_on_demand >= 1, the driver node will use an on_demand instance. Default: None
extra_pip_dependencies (Optional[List[str]]) - Extra pip dependencies to be installed on the materialization cluster. Must be PyPI packages, or wheels/eggs in S3 or DBFS. Default: None
spark_config (Optional[Dict[str, str]]) - Map of Spark configuration options and their respective values that will be passed to the FeatureView materialization Spark cluster. Default: None
dbr_version (str) - Databricks runtime version of the cluster. Supported versions can be found on this page. Default: 11.3.x-scala2.12
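The interaction between number_of_workers, first_on_demand, and instance_availability is easier to see with a worked example. The sketch below is a plain-Python illustration, not Tecton or Databricks code: node_availability_plan is a hypothetical helper, and it assumes the cluster has one driver plus number_of_workers workers, with the driver counted as the first node (consistent with the statement that first_on_demand >= 1 puts the driver on an on_demand instance).

```python
from typing import List, Optional

# Values documented for instance_availability.
VALID_AVAILABILITY = {"spot", "on_demand", "spot_with_fallback"}


def node_availability_plan(
    number_of_workers: int,
    first_on_demand: Optional[int] = None,
    instance_availability: str = "spot",
) -> List[str]:
    """Hypothetical helper: availability per node, driver first.

    Assumes 1 driver + number_of_workers workers; 0 workers means a
    single-node cluster (driver only).
    """
    if instance_availability not in VALID_AVAILABILITY:
        raise ValueError(f"invalid instance_availability: {instance_availability!r}")
    if number_of_workers < 0:
        raise ValueError("number_of_workers must be >= 0")

    total_nodes = 1 + number_of_workers
    # The first `first_on_demand` nodes are pinned to on_demand instances.
    pinned = min(first_on_demand or 0, total_nodes)
    return ["on_demand"] * pinned + [instance_availability] * (total_nodes - pinned)


plan = node_availability_plan(
    number_of_workers=3,
    first_on_demand=2,
    instance_availability="spot_with_fallback",
)
print(plan)
```

Under these assumptions, the plan above pins the driver and one worker to on_demand and leaves the remaining two workers on spot_with_fallback. When number_of_workers is left as None, the node count is chosen by the Tecton backend, so a plan like this cannot be computed ahead of time.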