Forum Discussion
COPY INTO from parquets can't push timestamps before year 1970
Hi, I've been experimenting a little with the COPY INTO command, pushing parquets into my tables in the dedicated SQL Pool.
I'm encountering an issue where I can't load a timestamp value from a parquet if it's before 1970-01-01 00:00:00.
Here's a drive with all the parquets I'm using for my testing: https://drive.google.com/drive/folders/1bsyjbRW3dvkm9UXnrM69WTcraqsQXfk-?usp=sharing
These parquets were written using Spark 3.2.1:
1. Timestamps were written using Timestamp(MILLIS)
2. Timestamp parquets are named ts{year} with 3 years: 1852, 1969 & 1972
3. Timestamps before 1970 have 2 folders each, corresponding to the LEGACY and CORRECTED timestamp outputs
Let's say I have 2 tables:
TS_TESTING_1 -> ([year][bigint], [month][bigint], [day][bigint], [ts][datetime2])
TS_TESTING_2 -> ([year][bigint], [month][bigint], [day][bigint], [ts][bigint])
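Spelled out as DDL (roughly; I'm omitting distribution/index options and assuming the dbo schema), the two tables look like this:
CREATE TABLE [dbo].[TS_TESTING_1] ([year] bigint, [month] bigint, [day] bigint, [ts] datetime2);
CREATE TABLE [dbo].[TS_TESTING_2] ([year] bigint, [month] bigint, [day] bigint, [ts] bigint);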
I'll start testing with ts1852_legacy, which has a structure as follows:
year | month | day | ts |
1852 | 1 | 1 | 1852-01-01 00:00:00 |
Running a copy with TS_TESTING_1 as the target (the table with the datetime2 column):
COPY INTO [TS_TESTING_1] FROM 'PATH TO ts1852_legacy' WITH (FILE_TYPE = 'Parquet', CREDENTIAL = 'MY CREDS')
Gives me this error: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: Exception: Value out of range for [optional int64 ts (TIMESTAMP_MILLIS)]
Running a copy with TS_TESTING_2 as the target (the table with the bigint column):
COPY INTO [TS_TESTING_2] FROM 'PATH TO ts1852_legacy' WITH (FILE_TYPE = 'Parquet', CREDENTIAL = 'MY CREDS')
This works, and the [ts] column is set to -31543200000, which is the expected offset from the Unix epoch.
Oddly enough, I tried converting it using DATEADD:
SELECT dateadd(millisecond, [ts], '1970-01-01 00:00:00') AS 'CONVERTED' FROM [dbo].[TS_TESTING_2]
SELECT dateadd(SECOND, [ts] / 1000, '1970-01-01 00:00:00') AS 'CONVERTED' FROM [dbo].[TS_TESTING_2]
Both work, and the result is as expected:
CONVERTED |
1968-12-31T22:00:00.0000000 |
It's 2 hours early, but that's not my issue here; the files were probably written in my timezone (UTC+2), which would explain the 2-hour difference.
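To double-check that theory, adding the 2 hours back should land the converted value on midnight local time, e.g. something like this (the alias is just illustrative):
SELECT dateadd(hour, 2, dateadd(SECOND, [ts] / 1000, '1970-01-01 00:00:00')) AS 'LOCAL_GUESS' FROM [dbo].[TS_TESTING_2]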
So why would a copy like this one work, automatically converting the epoch offset to datetime2 without issue, yet still fail for negative offsets (before 1970)?
COPY INTO [TS_TESTING_1] FROM 'PATH TO ts1972' WITH (FILE_TYPE = 'Parquet', CREDENTIAL = 'MY CREDS')
What am I missing here? Also, pushing the 1852 parquet with AUTO_CREATE_TABLE = 'ON' works. Why is it unable to do so automatically without that option?
Any feedback is appreciated 😄
1 Reply
- Siya_Krishnan (Copper Contributor)
Question 1 - COPY INTO with pre-1970 year failing
=================================
COPY INTO uses the PolyBase feature for ingesting Parquet/ORC files by default (i.e., when AUTO_CREATE_TABLE = 'ON' is not specified). PolyBase uses an older version of parquet-mr (v1.6) in which timestamps destined for datetime2 were stored and read as INT96. This preserved the full 100-nanosecond precision of SQL datetime2(7). Newer versions of parquet-mr, used by Spark 3.x as in your case, have deprecated INT96 in favor of storing timestamps as INT64 instead. This loses the nanosecond-level precision but is more practical in most real-life cases. It also causes conversion issues for PolyBase when interpreting dates before 1970, which are represented as negative (signed) INT64 offsets from the epoch. PolyBase does support the full range of SQL datetime2 values, including pre-1970 dates, when the older INT96 representation is used.
For better compatibility with your newer Spark-generated parquet files, you may use COPY INTO with the AUTO_CREATE_TABLE = 'ON' option, which uses a more modern implementation that correctly interprets the new TIMESTAMP_MILLIS parquet type represented as signed INT64.
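For example, something along these lines (a sketch reusing your path/credential placeholders; [TS_TESTING_1_AUTO] is just a hypothetical name for the table the command would create):
COPY INTO [TS_TESTING_1_AUTO] FROM 'PATH TO ts1852_legacy' WITH (FILE_TYPE = 'Parquet', CREDENTIAL = 'MY CREDS', AUTO_CREATE_TABLE = 'ON')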
Regarding the 2-hour offset you are observing, the parquet files were generated with the timezone set to "Africa/Cairo", which is GMT+2, as you expected (org.apache.spark.timeZone = Africa/Cairo). I inspected your supplied files using a tool called "parquet-tools".
Question 2 - DATEADD() failures
====================
The reason DATEADD() fails for these years (1852 and 1972) is that the second parameter to this function is expected to be an integer data type, and the signed millisecond offset values for these years (e.g., -3723760800000 and 63076357000, respectively) exceed the bounds of an int (about ±2.1 billion, i.e., only around ±24.8 days' worth of milliseconds), resulting in an arithmetic overflow error. Here is the doc: https://docs.microsoft.com/en-us/sql/t-sql/functions/dateadd-transact-sql?view=sql-server-ver15#arguments.
A simple workaround is to convert the millisecond argument into minutes and then call DATEADD() with this smaller value:
Instead of: SELECT dateadd(millisecond, [ts], '1970-01-01 00:00:00') AS 'CONVERTED' FROM [dbo].[TS_TESTING_2]
Do: SELECT DATEADD(minute, [ts] / 60000, '1970-01-01 00:00:00') AS 'CONVERTED' FROM [dbo].[TS_TESTING_2]
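If you also want to keep the sub-minute part, one possible variant (a sketch against the same table; the remainder always stays well within int range, and the base date is cast to datetime2 to be explicit about millisecond precision) is to add the whole minutes first and then the leftover milliseconds:
SELECT DATEADD(millisecond, CAST([ts] % 60000 AS int), DATEADD(minute, [ts] / 60000, CAST('1970-01-01 00:00:00' AS datetime2))) AS 'CONVERTED' FROM [dbo].[TS_TESTING_2]
In T-SQL, integer division truncates toward zero and the remainder keeps the sign of [ts], so the two steps together add back exactly the original number of milliseconds.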