Forum Discussion

ea25
Copper Contributor
Aug 13, 2022

Serverless SQL Pool - Reading from Parquet - VARCHAR with UTF8 Collation - Data Getting Truncated

I currently have a serverless SQL pool within Synapse Analytics and am working on a view that reads data from Parquet. The database collation is set to Latin1_General_100_BIN2_UTF8, as per the Microsoft recommendation (see the "Important" section): https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format#explicitly-specify-schema

Microsoft also recommends using the varchar data type with a UTF8 collation when reading UTF-8-encoded data from Parquet and Delta: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool#use-appropriate-data-types

The SELECT statement looks as follows. Note that, to ensure the UTF-8 collation is used, I explicitly state the collation when defining the OPENROWSET schema:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
     BULK 'poc/ParquetFiles/SpecialCharactersTable_UTF8Collation.parquet'
    ,DATA_SOURCE = 'DataLakeDataSource'
    ,FORMAT = 'Parquet'
) 
WITH(
    SpecialCharactersColumn_UTF8Collation VARCHAR(60) COLLATE Latin1_General_100_BIN2_UTF8 
) AS SCD

One of the records within the SpecialCharactersColumn_UTF8Collation column in the Parquet file actually contains 60 characters. The same record also contains some special characters.

When I run the above SELECT statement, the query fails with the error "String or binary data would be truncated ...".

I am not sure whether I am missing something here, or whether the MS recommendations do not apply when there is Unicode data, in which case I would have to use the nvarchar data type instead of varchar. The other alternative, I believe, would be to change the collation on the column.

Kindly share any thoughts you have around this.

1 Reply

  • ea25 could you try increasing VARCHAR(60) to VARCHAR(600) and see whether the data is still truncated? The number in VARCHAR is the size in bytes, not the number of characters, so some special characters might occupy more than one byte each.
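    The byte-versus-character point above can be checked outside SQL. Below is a minimal Python sketch; the 60-character sample value is hypothetical (not taken from the original Parquet file), but it shows how a string that fits a 60-character limit can still exceed 60 bytes once UTF-8 encoded, which is what VARCHAR(60) actually limits under a UTF-8 collation.

    ```python
    # Demonstrate that the UTF-8 byte length of a string can exceed its
    # character count, which is why a 60-character value containing special
    # characters can overflow VARCHAR(60) (VARCHAR(n) limits bytes, not chars).

    def utf8_lengths(s: str) -> tuple[int, int]:
        """Return (character count, UTF-8 byte count) for a string."""
        return len(s), len(s.encode("utf-8"))

    # Hypothetical 60-character value with five accented (2-byte) characters.
    value = "café résumé naïve jalapeño " + "x" * 33  # 60 characters total

    chars, nbytes = utf8_lengths(value)
    print(chars, nbytes)  # 60 characters but 65 UTF-8 bytes -> too big for VARCHAR(60)
    ```

    This matches the suggestion in the reply: widening the declared VARCHAR (or switching to nvarchar / a different collation) gives the extra bytes those multi-byte characters need.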
