DatetimePartitionColumn
Summary
Helper class that tells Tecton how the underlying flat files are date/time partitioned for Hive/Glue data sources. This can translate into a significant performance increase.
You will generally include objects of this class in the datetime_partition_columns option of a HiveConfig object.
Example definitions: Assume you have an S3 bucket with Parquet files stored in the following structure: s3://mybucket/2022/05/04/<multiple parquet files>, where 2022 is the year, 05 is the month, and 04 is the day of the month. In this scenario, you could use the following definition:
Examples
Example 1
from tecton import DatetimePartitionColumn, HiveConfig

# One entry per partition folder level: year, then month, then day.
datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_0", datepart="year", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_1", datepart="month", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_2", datepart="day", zero_padded=True),
]

batch_config = HiveConfig(
    database='my_db',
    table='my_table',
    timestamp_field='timestamp',
    datetime_partition_columns=datetime_partition_columns,
)
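To make the mapping between timestamps and partition folders concrete, the following is a minimal standard-library sketch of how a timestamp could be rendered into the folder layout described above. The helper `partition_value` is hypothetical and is not part of Tecton; it only illustrates what `datepart` and `zero_padded` describe, not how Tecton implements partition handling internally.

```python
from datetime import datetime


def partition_value(ts: datetime, datepart: str, zero_padded: bool) -> str:
    """Hypothetical helper (not a Tecton API): render one date part as a folder name."""
    value = {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}[datepart]
    # Years are written with four digits; the other parts with two when padded.
    width = 4 if datepart == "year" else 2
    return f"{value:0{width}d}" if zero_padded else str(value)


ts = datetime(2022, 5, 4)
parts = [partition_value(ts, p, zero_padded=True) for p in ("year", "month", "day")]
prefix = "s3://mybucket/" + "/".join(parts) + "/"
print(prefix)  # s3://mybucket/2022/05/04/
```

With `zero_padded=False` the month in this sketch would render as `5` rather than `05`, which would not match the folder names in the example bucket.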
Example 2
datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_1", datepart="month", format_string="%Y-%m"),
]
Attributes
The attributes are the same as the __init__ method parameters. See below.
Methods
Name | Description |
---|---|
__init__(...) | Initialize DatetimePartitionColumn |
__init__(...)
Parameters
column_name (str) - The name of the column in the Glue/Hive schema that corresponds to the underlying date/time partition folder. Note that if you do not explicitly specify a name in your partition folders, Glue will assign column names of the form partition_0. Default: None
datepart (str) - The part of the date that this column specifies. Can be one of "year", "month", "day", "hour", or the full "date". If used with format_string, this should be the size of the partition being represented, e.g. datepart="month" for format_string="%Y-%m". Default: None
zero_padded (bool) - Whether the datepart has a leading zero if it is less than two digits. This must be set to True if datepart="date". (Should not be set if format_string is set.) Default: False
format_string (Optional[str]) - A datetime.strftime format string override for "non-default" partition column formats. E.g. "%Y%m%d" for datepart="date" instead of the Tecton default "%Y-%m-%d", or "%Y-%m" for datepart="month" instead of the Tecton default "%m". Default: None
Info: This format string must convert Python datetimes (via datetime.strftime(format)) to strings that are sortable in time order. For example, "%m-%Y" would be an invalid format string because "09-2019" > "05-2020" when compared as strings, even though September 2019 precedes May 2020.
See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes for format codes.
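The sortability requirement can be spot-checked before a format string goes into a definition. The sketch below uses only the standard library; `is_time_sortable` is a hypothetical helper, not a Tecton function, and it only tests the sample dates it is given, so a passing result is an indication rather than a proof.

```python
from datetime import datetime
from typing import Sequence


def is_time_sortable(format_string: str, samples: Sequence[datetime]) -> bool:
    """Hypothetical check: do the formatted strings sort in the same order as the datetimes?"""
    ordered = sorted(samples)
    rendered = [d.strftime(format_string) for d in ordered]
    # If string order matches time order, sorting the strings changes nothing.
    return rendered == sorted(rendered)


samples = [datetime(2019, 9, 1), datetime(2020, 5, 1), datetime(2020, 11, 1)]
print(is_time_sortable("%Y-%m", samples))   # True
print(is_time_sortable("%m-%Y", samples))   # False: "09-2019" sorts after "05-2020"
```

Choosing samples that span a year boundary, as above, is what exposes formats that put the month or day before the year.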