Forum Discussion

ea25
Copper Contributor
Aug 13, 2022

Serverless SQL Pool - Reading from Parquet - VARCHAR with UTF8 Collation - Data Getting Truncated

I currently have a serverless SQL pool within Synapse Analytics and am working on a view that reads data from Parquet. The database collation is set to Latin1_General_100_BIN2_UTF8, as per the Microsoft recommendation (see the "Important" section): https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format#explicitly-specify-schema

Microsoft also recommends using the varchar data type with a UTF8 collation when reading UTF-8-encoded data from Parquet and Delta: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool#use-appropriate-data-types

The SELECT statement looks as follows. Note that, to ensure the UTF-8 collation is used, I explicitly state the collation when defining the OPENROWSET schema:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
     BULK 'poc/ParquetFiles/SpecialCharactersTable_UTF8Collation.parquet'
    ,DATA_SOURCE = 'DataLakeDataSource'
    ,FORMAT = 'Parquet'
) 
WITH(
    SpecialCharactersColumn_UTF8Collation VARCHAR(60) COLLATE Latin1_General_100_BIN2_UTF8 
) AS SCD

One of the records within the SpecialCharactersColumn_UTF8Collation column in the Parquet file actually contains 60 characters. The same record also contains some special characters.

When I run the above SELECT statement, the query fails with the error "String or binary data would be truncated ...".

I am not sure whether I am missing something here, or whether the MS recommendations do not apply when there is Unicode data, in which case I would have to use the nvarchar data type instead of varchar. The other alternative, I believe, would be to change the collation on the column.

Kindly share any thoughts you have around this.

1 Reply

  • ea25 could you try increasing VARCHAR(60) to VARCHAR(600) and see whether the data is still truncated? The number in VARCHAR is the size in bytes, not the number of characters, so some special characters might occupy more than one byte each.
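    The byte-versus-character point above can be checked outside SQL. Below is a minimal Python sketch; the 60-character sample value is hypothetical (not taken from the original Parquet file), but it shows how a string that fits a 60-character limit can still exceed 60 bytes once UTF-8 encoded, which is what VARCHAR(60) actually limits under a UTF-8 collation.

    ```python
    # Demonstrate that the UTF-8 byte length of a string can exceed its
    # character count, which is why a 60-character value containing special
    # characters can overflow VARCHAR(60) (VARCHAR(n) limits bytes, not chars).

    def utf8_lengths(s: str) -> tuple[int, int]:
        """Return (character count, UTF-8 byte count) for a string."""
        return len(s), len(s.encode("utf-8"))

    # Hypothetical 60-character value with five accented (2-byte) characters.
    value = "café résumé naïve jalapeño " + "x" * 33  # 60 characters total

    chars, nbytes = utf8_lengths(value)
    print(chars, nbytes)  # 60 characters but 65 UTF-8 bytes -> too big for VARCHAR(60)
    ```

    This matches the suggestion in the reply: widening the declared VARCHAR (or switching to nvarchar / a different collation) gives the extra bytes those multi-byte characters need.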
