Version: 0.9

DatetimePartitionColumn

Summary

Helper class that tells Tecton how the underlying flat files of a Hive/Glue data source are partitioned by date/time. Declaring the partitioning can significantly improve performance.

Objects of this class are typically passed in the datetime_partition_columns option of a HiveConfig object.

Examples

Example 1

Assume you have an S3 bucket with parquet files stored in the following structure: s3://mybucket/20../../../04/<multiple parquet files> , where 2022 is the year, 05 is the month, and 04 is the day of the month. In this scenario, you could use the following definition:

from tecton import DatetimePartitionColumn, HiveConfig

# One entry per partition folder level: year, then month, then day.
datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_0", datepart="year", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_1", datepart="month", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_2", datepart="day", zero_padded=True),
]
batch_config = HiveConfig(
    database="my_db",
    table="my_table",
    timestamp_field="timestamp",
    datetime_partition_columns=datetime_partition_columns,
)
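To make the mapping in Example 1 concrete, the sketch below shows how a single timestamp would correspond to the three zero-padded partition values. It uses only the Python standard library. The helper name partition_values is hypothetical, and the "%Y" and "%d" format codes are assumptions chosen to match the folder layout described above, not a statement of how Tecton formats these values internally.

```python
from datetime import datetime


def partition_values(ts: datetime) -> dict:
    """Hypothetical helper (not part of Tecton): map a timestamp to the
    zero-padded year/month/day partition values used in Example 1."""
    return {
        "partition_0": ts.strftime("%Y"),  # four-digit year (assumed format)
        "partition_1": ts.strftime("%m"),  # zero-padded month
        "partition_2": ts.strftime("%d"),  # zero-padded day (assumed format)
    }


values = partition_values(datetime(2022, 5, 4))
print(values)
```

For May 4, 2022 this yields "2022", "05", and "04", which line up with the year, month, and day folders in the example.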

Example 2

This example uses the format_string parameter. Assume your data is partitioned by "YYYY-MM", e.g. s3://mybucket/2022-05/<multiple parquet files>. Tecton’s default month format is "%m", which produces strings such as "05" that cannot be compared with the values in your table’s partition column, so the definition must specify an override.

datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_1", datepart="month", format_string="%Y-%m"),
]
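The mismatch that Example 2 works around can be seen with plain datetime.strftime. This is only an illustration using the standard library; the variable names and the sample folder name are placeholders taken from the example above.

```python
from datetime import datetime

ts = datetime(2022, 5, 17)
folder_name = "2022-05"  # partition folder name from Example 2

# Default month format described above: only the month number.
default_value = ts.strftime("%m")

# Override passed as format_string in Example 2.
override_value = ts.strftime("%Y-%m")

print(default_value == folder_name)   # False: "05" does not match "2022-05"
print(override_value == folder_name)  # True
```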

Attributes

The attributes are the same as the __init__ method parameters. See below.

Methods

__init__(...)

Parameters

  • column_name (str) – The name of the column in the Glue/Hive schema that corresponds to the underlying date/time partition folder. Note that if you do not explicitly specify a name in your partition folders, Glue assigns a column name of the form partition_0.

  • datepart (str) – The part of the date that this column specifies. Can be one of “year”, “month”, “day”, “hour”, or the full “date”. If used with format_string, this should be the size of partition being represented, e.g. datepart="month" for format_string="%Y-%m".

  • zero_padded (bool) – Whether the datepart value has a leading zero when it is a single digit. This must be set to True if datepart="date". Should not be set if format_string is set. (Default: False)

  • format_string (Optional[str]) – A datetime.strftime format string override for “non-default” partition columns formats. E.g. "%Y%m%d" for datepart="date" instead of the Tecton default "%Y-%m-%d", or "%Y-%m" for datepart="month" instead of the Tecton default "%m". (Default: None)

Note: This format string must convert Python datetimes (via datetime.strftime(format)) to strings that sort in time order. For example, "%m-%Y" would be an invalid format string because "09-2019" > "05-2020" as strings, even though September 2019 comes before May 2020.

See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes for format codes.
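One way to sanity-check a candidate format_string against the sortability requirement is to format a few timestamps and compare the string order with the time order. The sketch below is a standard-library illustration; is_time_sortable is a hypothetical helper, not a Tecton function, and it only checks the samples it is given rather than proving the property for all dates.

```python
from datetime import datetime


def is_time_sortable(format_string: str, samples: list) -> bool:
    """Hypothetical check: True if formatting the samples in time order
    yields strings that are already in lexicographic order."""
    in_time_order = sorted(samples)
    formatted = [ts.strftime(format_string) for ts in in_time_order]
    return formatted == sorted(formatted)


samples = [datetime(2019, 9, 1), datetime(2020, 5, 1), datetime(2020, 11, 1)]

year_first = is_time_sortable("%Y-%m", samples)   # year before month
month_first = is_time_sortable("%m-%Y", samples)  # month before year

print(year_first)   # True
print(month_first)  # False: "09-2019" sorts after "05-2020"
```

Formats that put the most significant unit first (year, then month, then day) with zero padding tend to pass this check, which is why "%Y-%m" works where "%m-%Y" does not.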
