PySpark DataFrame to Pandas DataFrame Conversion with Pandas 2.0
Issueโ
With Pandas 2.0, when converting Tecton Spark query results (such as
get_features_for_events or get_features_in_range) to Pandas, you might
encounter an error similar to the following:
TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
Causeโ
This error occurs due to a type conversion issue in PySpark when converting
TimestampType to datetime64, which is incompatible with Pandas 2.0. The bug
has been fixed in PySpark 3.5+, but for older versions, it remains an issue.
For more details, refer to this Spark fix PR.
Resolutionโ
Enable Arrow Conversionโ
You can enable Arrow conversion to bypass the problematic type conversion code.
Here an example to enable Arrow conversion in the Databricks or EMR SparkSession:
import datetime
# Enable Arrow conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()
Here an example to enable Arrow conversion in a new SparkSession:
import datetime
import pyspark
# Enable Arrow conversion
spark = pyspark.sql.SparkSession.builder.config("spark.sql.execution.arrow.pyspark.enabled", "true").getOrCreate()
spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()
Noteโ
- The
spark.sql.execution.arrow.pyspark.enabledflag is enabled by default on Databricks. - It is not enabled by default on EMR and open-source PySpark.
Limitationsโ
Arrow conversion cannot be enabled when the DataFrame contains
types ineligible for Arrow conversion,
such as MapType, ArrayType of TimestampType, and nested StructType.
Arrow conversion will be automatically disabled in such cases.
Among these types, MapType and ArrayType of TimestampType can be handled
without Arrow conversion enabled. However, nested StructType with
TimestampType cannot be handled without Arrow conversion.
Here is an example of a nested StructType with TimestampType:
from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType
spark.createDataFrame(
[
[datetime.datetime.now(), {"inner": {"i": 1}}],
],
StructType(
[
StructField("ts", TimestampType()),
StructField("s", StructType([StructField("inner", StructType([StructField("i", IntegerType())]))])),
]
),
).toPandas()
For nested StructType with TimestampType, the best solution is to upgrade to
Spark 3.5+ where this issue is resolved.