tecton.HiveDSConfig

class tecton.HiveDSConfig(table, database, date_partition_column=None, timestamp_column_name=None, timestamp_format=None, skip_validation=False, datetime_partition_columns=None, raw_batch_translator=None)

Configuration used to reference a Hive table.

The HiveDSConfig class is used to create a reference to a Hive table.

This class is used as an input to a BatchDataSource’s batch_ds_config parameter. This class is not a Tecton Object: it is a grouping of parameters. Declaring this class alone will not register a data source. Instead, declare it as part of a BatchDataSource that takes this configuration class instance as a parameter.

Methods

__init__

Instantiates a new HiveDSConfig.

__init__(table, database, date_partition_column=None, timestamp_column_name=None, timestamp_format=None, skip_validation=False, datetime_partition_columns=None, raw_batch_translator=None)

Instantiates a new HiveDSConfig.

Parameters
  • table (str) – A table registered in Hive MetaStore.

  • database (str) – A database registered in Hive MetaStore.

  • date_partition_column (Optional[str]) – (Optional) Partition column name if the raw data is partitioned by date, otherwise None.

  • datetime_partition_columns (Optional[List[DatetimePartitionColumn]]) – (Optional) List of DatetimePartitionColumn objects describing how the raw data is partitioned, otherwise None (see the sketch following this parameter list).

  • timestamp_column_name (Optional[str]) – Name of the timestamp column.

  • timestamp_format (Optional[str]) – (Optional) Format of the string-encoded timestamp column (e.g. "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'"). If the timestamp string cannot be parsed with this format, Tecton will fall back and attempt to use the default timestamp parser.

  • raw_batch_translator – Python user-defined function f(DataFrame) -> DataFrame that takes in the raw PySpark data source DataFrame and translates it into the DataFrame to be consumed by the Feature View. See an example of raw_batch_translator in the User Guide.
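
For tables partitioned by date/time components, the partition layout is described with DatetimePartitionColumn objects. The snippet below is a minimal sketch only: it assumes DatetimePartitionColumn accepts a column name, a date part, and a zero-padding flag, and the partition column names shown are hypothetical. Consult the DatetimePartitionColumn reference for the exact signature.

from tecton import DatetimePartitionColumn, HiveDSConfig

# Hypothetical partition columns for a table laid out as year=YYYY/month=MM/day=DD.
# The DatetimePartitionColumn arguments below are assumptions, not a confirmed signature.
partition_columns = [
    DatetimePartitionColumn(column_name='year', datepart='year', zero_padded=True),
    DatetimePartitionColumn(column_name='month', datepart='month', zero_padded=True),
    DatetimePartitionColumn(column_name='day', datepart='day', zero_padded=True),
]

partitioned_config = HiveDSConfig(
    database='global_temperatures',
    table='us_cities',
    timestamp_column_name='timestamp',
    timestamp_format="yyyy-MM-dd'T'hh:mm:ss.SSS'Z'",
    datetime_partition_columns=partition_columns,
)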

Returns

A HiveDSConfig class instance.

Example of a HiveDSConfig declaration:

from tecton import HiveDSConfig
from pyspark.sql import DataFrame

def convert_temperature(df: DataFrame) -> DataFrame:
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import DoubleType

    # Convert the incoming DataFrame's Temperature column from Celsius to Fahrenheit
    udf_convert = udf(lambda x: x * 1.8 + 32.0, DoubleType())
    converted_df = df.withColumn("Fahrenheit", udf_convert(col("Temperature"))).drop("Temperature")
    return converted_df

# Declare a HiveDSConfig instance, which can be used as a parameter to a BatchDataSource
batch_ds_config = HiveDSConfig(
    database='global_temperatures',
    table='us_cities',
    timestamp_column_name='timestamp',
    raw_batch_translator=convert_temperature,
)
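
The configuration above only describes the Hive table; to register a data source, it would typically be wrapped in a BatchDataSource. A minimal sketch, assuming BatchDataSource accepts a name along with the batch_ds_config parameter described above (the data source name is illustrative):

from tecton import BatchDataSource

# Illustrative data source name; batch_ds_config is the HiveDSConfig declared above.
temperatures_batch = BatchDataSource(
    name='us_city_temperatures',
    batch_ds_config=batch_ds_config,
)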