tecton.FileDSConfig

class tecton.FileDSConfig(uri, file_format, convert_to_glue_format=False, timestamp_column_name=None, timestamp_format=None, raw_batch_translator=None, schema_uri=None, schema_override=None)

Configuration used to reference a file or directory (S3, etc.)

The FileDSConfig class is used to create a reference to a file or directory of files in S3, HDFS, or DBFS.

The schema of the data source is inferred from the underlying file(s). It can also be modified using the raw_batch_translator parameter.

This class is passed to a BatchDataSource through its batch_ds_config parameter. It is not a Tecton Object: it is only a grouping of parameters, so declaring it alone will not register a data source. Instead, declare a BatchDataSource that takes an instance of this configuration class as a parameter.

Methods

__init__

Instantiates a new FileDSConfig.

__init__(uri, file_format, convert_to_glue_format=False, timestamp_column_name=None, timestamp_format=None, raw_batch_translator=None, schema_uri=None, schema_override=None)

Parameters
  • uri (str) – S3, HDFS, or DBFS path to the file or directory of files.

  • file_format (str) – File format: one of “json”, “parquet”, or “csv”.

  • convert_to_glue_format (bool) – If True, converts all schema column names to lowercase. Defaults to False.

  • timestamp_column_name (Optional[str]) – Name of timestamp column.

  • timestamp_format (Optional[str]) – Format of a string-encoded timestamp column (e.g. "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'").

  • raw_batch_translator (Optional[Callable]) – Python user-defined function f(DataFrame) -> DataFrame that takes the raw PySpark DataFrame read from the data source and translates it into the DataFrame consumed by the Feature View. See an example of raw_batch_translator in the User Guide.

  • schema_uri (Optional[str]) – A file or subpath of “uri” that can be used for fast schema inference. This is useful for speeding up plan computation for highly partitioned data sources containing many files.

  • schema_override (Optional[StructType]) – A pyspark.sql.types.StructType object that will be used as the schema when reading from the file. If omitted, the schema is inferred automatically.

Returns

A FileDSConfig class instance.

Example of a FileDSConfig declaration:

from tecton import FileDSConfig, BatchDataSource, inlined
import pyspark

@inlined
def convert_temperature(df: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import DoubleType

    # Convert the Temperature column from Celsius to Fahrenheit
    udf_convert = udf(lambda x: x * 1.8 + 32.0, DoubleType())
    converted_df = df.withColumn("Fahrenheit", udf_convert(col("Temperature"))).drop("Temperature")
    return converted_df

# Declare a FileDSConfig, which can be used as a parameter to a `BatchDataSource`
ad_impressions_file_ds = FileDSConfig(uri="s3://tecton.ai.public/data/ad_impressions_sample.parquet",
                                      file_format="parquet",
                                      timestamp_column_name="timestamp",
                                      raw_batch_translator=convert_temperature)

# This FileDSConfig can then be passed as a parameter to a BatchDataSource declaration.
# For example,
ad_impressions_batch = BatchDataSource(name="ad_impressions_batch",
                                       batch_ds_config=ad_impressions_file_ds)