Forum Discussion
Serverless SQL Pool - Reading from Parquet - VARCHAR with UTF8 Collation - Data Getting Truncated
I currently have a Serverless SQL Pool within Synapse Analytics and am working on a view that will read data from Parquet. The collation of the database is set to Latin1_General_100_BIN2_UTF8 as per the MS recommendation: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format#explicitly-specify-schema (Important section).
MS also recommends using the varchar data type with a UTF8 collation when reading UTF-8-encoded data from Parquet and Delta: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool#use-appropriate-data-types
The SELECT statement looks as follows. Kindly note that, to ensure the UTF-8 collation is used, I am explicitly stating the collation when defining the OPENROWSET schema:
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'poc/ParquetFiles/SpecialCharactersTable_UTF8Collation.parquet',
    DATA_SOURCE = 'DataLakeDataSource',
    FORMAT = 'Parquet'
)
WITH (
    SpecialCharactersColumn_UTF8Collation VARCHAR(60) COLLATE Latin1_General_100_BIN2_UTF8
) AS SCD
One of the records within the SpecialCharactersColumn_UTF8Collation column in the Parquet file actually contains 60 characters. The same record also contains some special characters.
When I run the above SELECT statement, the data is getting truncated and the query fails with the error "String or binary data would be truncated ..."
I am not sure whether I am missing something here, or whether the MS recommendations do not apply when there is Unicode data, in which case I would have to use the nvarchar data type instead of varchar. The other alternative, I believe, would be to change the collation on the column.
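To illustrate what I suspect is happening: with a UTF-8 collation, VARCHAR(n) limits the value to n *bytes*, not n characters, and UTF-8 encodes non-ASCII characters in two to four bytes each. A minimal Python sketch of the arithmetic (the sample string is made up, not taken from my actual file):

```python
# Hypothetical 60-character value: 10 two-byte characters plus 50 ASCII ones.
value = "é" * 10 + "x" * 50

# 60 characters...
assert len(value) == 60

# ...but 70 bytes once UTF-8 encoded, which would overflow VARCHAR(60)
# under a UTF-8 collation (a byte limit), matching the truncation error.
utf8_bytes = value.encode("utf-8")
print(len(utf8_bytes))  # 70
```

If this reading is right, a larger varchar size (sized in bytes) would be a third alternative alongside nvarchar or a collation change, but I would appreciate confirmation.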
Kindly share if you have any thoughts around this.