tecton.FileDSConfig

class tecton.FileDSConfig(uri: str, file_format: str, convert_to_glue_format=False, timestamp_column_name: Optional[str] = None, timestamp_format: Optional[str] = None, raw_batch_translator: Optional[collections.abc.Callable] = None, schema_uri: Optional[str] = None, schema_override: Optional[pyspark.sql.types.StructType] = None)

Configuration used to reference a file or directory (S3, etc.)

The FileDSConfig class is used to create a reference to a file or a directory of files in S3, HDFS, or DBFS (where available).

The schema of the data source is inferred from the underlying file(s). The raw data can also be transformed with the raw_batch_translator parameter, or the inferred schema replaced with schema_override.

This class is used as an input to a VirtualDataSource’s batch_config parameter. This class is not a Tecton Primitive: it is a grouping of parameters. Declaring this class alone will not register a data source. Instead, declare a VirtualDataSource that takes this configuration class as an input, as in the sketch below.
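For illustration, here is a minimal sketch of that pattern. The S3 path, the column name, and the VirtualDataSource name parameter are assumptions for the example, not part of this reference:

    from tecton import FileDSConfig, VirtualDataSource

    # FileDSConfig only groups parameters; declaring it alone registers nothing.
    transactions_config = FileDSConfig(
        uri="s3://my-bucket/transactions/",  # hypothetical path
        file_format="parquet",
        timestamp_column_name="timestamp",   # hypothetical column
    )

    # The VirtualDataSource declaration is what registers the data source.
    transactions_ds = VirtualDataSource(
        name="transactions",                 # assumed parameter name
        batch_config=transactions_config,
    )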

Methods

__init__

__init__(uri: str, file_format: str, convert_to_glue_format=False, timestamp_column_name: Optional[str] = None, timestamp_format: Optional[str] = None, raw_batch_translator: Optional[collections.abc.Callable] = None, schema_uri: Optional[str] = None, schema_override: Optional[pyspark.sql.types.StructType] = None)

Instantiates a new FileDSConfig.

Parameters
  • uri – S3, HDFS, or DBFS path to file(s).

  • file_format – File format: one of “json”, “parquet”, or “csv”.

  • convert_to_glue_format – If True, converts all schema column names to lowercase. Defaults to False.

  • timestamp_column_name – (Optional) Name of timestamp column. Only required if timestamp_format is specified.

  • timestamp_format – (Optional) Format of the string-encoded timestamp column (e.g. “yyyy-MM-dd’T’hh:mm:ss.SSS’Z’”).

  • raw_batch_translator – Python user-defined function f(DataFrame) -> DataFrame that takes in the raw PySpark data source DataFrame and translates it into the DataFrame to be consumed by the Feature Package. See an example of raw_batch_translator in the User Guide, and the sketch after this parameter list.

  • schema_uri – (Optional) A file or subpath of “uri” that can be used for fast schema inference. This is useful for deeply nested data sources with many small files.

  • schema_override – (Optional) A pyspark.sql.types.StructType object that will be used as the schema when reading from the file. If omitted, the schema is inferred automatically.
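For illustration, a sketch of a raw_batch_translator as described above. The column names and timestamp format are hypothetical; the only requirement is that the function accepts and returns a PySpark DataFrame:

    from pyspark.sql import DataFrame
    import pyspark.sql.functions as F

    # Hypothetical translator: parses a string-encoded timestamp column and
    # keeps only the columns the Feature Package will consume.
    def translate_raw_batch(df: DataFrame) -> DataFrame:
        return (
            df.withColumn(
                "timestamp",
                F.to_timestamp("timestamp_str", "yyyy-MM-dd'T'HH:mm:ss"),
            )
            .select("user_id", "amount", "timestamp")
        )

Pass the function object itself, e.g. raw_batch_translator=translate_raw_batch.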

Returns

A FileDSConfig instance.
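As a final illustration, a sketch that supplies an explicit schema via schema_override instead of relying on inference. The field names and path are assumptions for the example:

    from pyspark.sql.types import (
        DoubleType, StringType, StructField, StructType, TimestampType,
    )
    from tecton import FileDSConfig

    # Hypothetical explicit schema; avoids inference over many small files.
    override_schema = StructType([
        StructField("user_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("timestamp", TimestampType()),
    ])

    config = FileDSConfig(
        uri="s3://my-bucket/transactions/",  # hypothetical path
        file_format="csv",
        schema_override=override_schema,
    )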