tecton.DatetimePartitionColumn
Summary​
Helper class to tell Tecton how underlying flat files are date/time partitioned for Hive/Glue data sources. This can translate into a significant performance increase.
You will generally include an object of this class in the datetime_partition_columns option in a HiveConfig object.
Examples​
Example 1​
Assume you have an S3 bucket with parquet files stored in the following structure: s3://mybucket/2022/05/04/<multiple parquet files>, where 2022 is the year, 05 is the month, and 04 is the day of the month. In this scenario, you could use the following definition:
from tecton import DatetimePartitionColumn, HiveConfig

# One entry per partition folder level: year, then month, then day.
datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_0", datepart="year", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_1", datepart="month", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_2", datepart="day", zero_padded=True),
]

batch_config = HiveConfig(
    database="my_db",
    table="my_table",
    timestamp_field="timestamp",
    datetime_partition_columns=datetime_partition_columns,
)
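To illustrate why declaring the partition columns helps, the sketch below enumerates the partition values that cover a time range under the year/month/day layout from this example. It uses only the Python standard library. The helper name partitions_for_range is hypothetical, and this is not Tecton's implementation, only an illustration of how a time filter can be narrowed to a handful of partition folders.

```python
from datetime import date, timedelta


def partitions_for_range(start: date, end: date) -> list[dict[str, str]]:
    """Return zero-padded partition values for each day in [start, end].

    Hypothetical helper for illustration only; the keys mirror the
    Glue column names used in Example 1.
    """
    if end < start:
        raise ValueError("end must not be before start")
    partitions = []
    day = start
    while day <= end:
        partitions.append({
            "partition_0": day.strftime("%Y"),  # year, e.g. "2022"
            "partition_1": day.strftime("%m"),  # zero-padded month, e.g. "05"
            "partition_2": day.strftime("%d"),  # zero-padded day, e.g. "04"
        })
        day += timedelta(days=1)
    return partitions


parts = partitions_for_range(date(2022, 5, 3), date(2022, 5, 4))
for p in parts:
    print("s3://mybucket/{partition_0}/{partition_1}/{partition_2}/".format(**p))
```

Only the two folders for 3 and 4 May 2022 are listed, so a reader that knows the layout does not need to touch any other day's files.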
Example 2​
Example using the format_string parameter. Assume your data is partitioned by "YYYY-MM", e.g. s3://mybucket/2022-05/<multiple parquet files>. Tecton's default month format is "%m", which would produce strings such as "05" that cannot be compared with partition values such as "2022-05", so the definition needs to specify an override.
datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_1", datepart="month", format_string="%Y-%m"),
]
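The mismatch between the default and the override can be seen with plain datetime.strftime. The snippet below is a standard-library sketch; the partition_value variable simply stands in for the folder name from this example.

```python
from datetime import datetime

ts = datetime(2022, 5, 4)
partition_value = "2022-05"  # folder name from s3://mybucket/2022-05/

default_month = ts.strftime("%m")      # default month format
override_month = ts.strftime("%Y-%m")  # format_string override

print(default_month == partition_value)   # False: "05" != "2022-05"
print(override_month == partition_value)  # True
```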
Attributes​
The attributes are the same as the __init__ method parameters. See below.
Methods​
__init__(...)​
Parameters​
- column_name (str) – The name of the column in the Glue/Hive schema that corresponds to the underlying date/time partition folder. Note that if you do not explicitly specify a name in your partition folders, Glue will name the column of the form partition_0.
- datepart (str) – The part of the date that this column specifies. Can be one of "year", "month", "day", "hour", or the full "date". If used with format_string, this should be the size of partition being represented, e.g. datepart="month" for format_string="%Y-%m".
- zero_padded (bool) – Whether the datepart has a leading zero if less than two digits. This must be set to True if datepart="date". Should not be set if format_string is set. (Default: False)
- format_string (Optional[str]) – A datetime.strftime format string override for "non-default" partition column formats. E.g. "%Y%m%d" for datepart="date" instead of the Tecton default "%Y-%m-%d", or "%Y-%m" for datepart="month" instead of the Tecton default "%m". (Default: None)
This format string must convert Python datetimes (via datetime.strftime(format)) to strings that are sortable in time order. For example, "%m-%Y" would be an invalid format string because "09-2019" > "05-2020".
See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes for format codes.
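One way to sanity-check a candidate format string before using it is to confirm that the formatted strings sort in the same order as the datetimes they came from. The helper below is a hypothetical standard-library sketch (is_time_sortable is not part of Tecton), and it only tests the sample datetimes it is given.

```python
from datetime import datetime


def is_time_sortable(format_string: str, samples: list[datetime]) -> bool:
    """Check that strftime output preserves chronological order for the samples."""
    ordered = sorted(samples)
    formatted = [dt.strftime(format_string) for dt in ordered]
    # Strings must be non-decreasing when the datetimes are in time order.
    return all(a <= b for a, b in zip(formatted, formatted[1:]))


samples = [datetime(2019, 9, 1), datetime(2020, 5, 1), datetime(2020, 11, 1)]

print(is_time_sortable("%Y-%m", samples))  # True
print(is_time_sortable("%m-%Y", samples))  # False: "09-2019" > "05-2020"
```

Putting the largest unit first (year, then month, then day) with zero-padded fields is what makes the string order agree with the time order.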