Version: 1.1

FileConfig

Summary

Configuration used to reference a file or directory (S3, etc.)
 
The FileConfig class is used to create a reference to a file or directory of files in S3, HDFS, or DBFS.
 
The schema of the data source is inferred from the underlying file(s). It can also be modified using the post_processor parameter.
 
This class is used as an input to a DataSource's batch_config parameter. Declaring this configuration class alone will not register a Data Source. Instead, declare it as part of a BatchSource that takes this configuration class instance as a parameter.
important

If your files are partitioned, simply provide the path to the root folder. For example: uri = "s3://<bucket-name>/<root-folder>/"

Tecton will use Spark partition discovery to find all partitions and infer the schema.

When reading a highly partitioned file, Tecton recommends setting the schema_uri parameter to speed up schema inference; see the schema_uri parameter below for details.
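The guidance above can be sketched as a declaration. In this hedged example, the bucket, folder layout, and partition file path are hypothetical; pointing uri at the root folder lets Spark partition discovery find all partitions, and schema_uri points at a single file so schema inference does not have to scan every partition:

```python
from tecton import FileConfig

# Hypothetical partitioned source: `uri` is the root folder, and
# `schema_uri` is one file underneath it used for fast schema inference.
events_config = FileConfig(
    uri="s3://my-bucket/events/",  # hypothetical root folder
    schema_uri="s3://my-bucket/events/date=2023-01-01/part-00000.parquet",  # hypothetical partition file
    file_format="parquet",
    timestamp_field="timestamp",
)
```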

Attributes

| Name | Data Type | Description |
| --- | --- | --- |
| data_delay | timedelta | Returns the duration that materialization jobs wait after the batch_schedule before starting, typically to ensure that all data has landed. |

Methods

| Name | Description |
| --- | --- |
| __init__(...) | Instantiates a new FileConfig. |

__init__(...)

Instantiates a new FileConfig.
 
An example of a FileConfig declaration is shown in the Example section below.

Parameters

  • uri: str S3 or HDFS path to the file(s).
  • file_format: str File format. One of "json", "parquet", or "csv".
  • timestamp_field: Optional[str] = None The timestamp column in this data source that should be used for time-based filtering. Required unless this source is used in Feature Views only with unfiltered().
  • timestamp_format: Optional[str] = None Format of a string-encoded timestamp column (e.g. "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'"). If the timestamp string cannot be parsed with this format, Tecton will fall back and attempt to use the default timestamp parser.
  • datetime_partition_columns: Optional[List[DatetimePartitionColumn]] = None List of the DatetimePartitionColumns the raw data is partitioned by, if any; otherwise None.
  • post_processor: Optional[Callable] = None Python user-defined function f(DataFrame) -> DataFrame that takes the raw PySpark data source DataFrame and translates it into the DataFrame to be consumed by the Feature View.
  • schema_uri: Optional[str] = None A file or subpath of uri that can be used for fast schema inference. This is useful for speeding up plan computation for highly partitioned data sources containing many files.
  • schema_override: Optional[pyspark.sql.types.StructType] = None A pyspark.sql.types.StructType object that will be used as the schema when reading from the file. If omitted, the schema will be inferred automatically.
  • data_delay: timedelta = 0:00:00 This parameter configures how long materialization jobs wait after the end of the batch schedule period before starting, typically to ensure that all data has landed. For example, if a feature view has a batch_schedule of 1 day and one of its data source inputs has data_delay=timedelta(hours=1) set, then incremental materialization jobs will run at 01:00 UTC.
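The schema_override and data_delay parameters above can be combined as in the following sketch. The bucket path and column names are hypothetical; supplying schema_override skips inference entirely, and data_delay delays materialization jobs by one hour after each batch period ends:

```python
from datetime import timedelta

from pyspark.sql.types import StringType, StructField, StructType, TimestampType
from tecton import FileConfig

# Hypothetical explicit schema, used instead of schema inference.
override_schema = StructType(
    [
        StructField("user_id", StringType()),
        StructField("timestamp", TimestampType()),
    ]
)

clicks_config = FileConfig(
    uri="s3://my-bucket/clicks/",  # hypothetical path
    file_format="csv",
    timestamp_field="timestamp",
    schema_override=override_schema,
    data_delay=timedelta(hours=1),  # jobs wait 1 hour after the batch period ends
)
```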

Returns

A FileConfig class instance.

Example

```python
from tecton import FileConfig, BatchSource

# Define a post-processor function to convert the temperature from Celsius to Fahrenheit
def convert_temperature(df):
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import DoubleType

    udf_convert = udf(lambda x: x * 1.8 + 32.0, DoubleType())
    converted_df = df.withColumn("Fahrenheit", udf_convert(col("Temperature"))).drop("Temperature")
    return converted_df

# Declare a FileConfig, which can be used as a parameter to a `BatchSource`
ad_impressions_file_config = FileConfig(
    uri="s3://tecton.ai.public/data/ad_impressions_sample.parquet",
    file_format="parquet",
    timestamp_field="timestamp",
    post_processor=convert_temperature,
)

# This FileConfig can then be included as a parameter for a BatchSource declaration.
# For example:
ad_impressions_batch = BatchSource(
    name="ad_impressions_batch",
    batch_config=ad_impressions_file_config,
)
```
