Synapse serverless SQL pool is a query engine that enables you to query a variety of files and formats stored in Azure Data Lake and Azure Cosmos DB. One very common text encoding is UTF-8, where the most common characters used in Latin Western languages are encoded with a single byte. Less common Western, Cyrillic, Turkish, and other characters are encoded with 2 bytes, and special characters are encoded with more than 2 bytes. The UTF-8 encoding is popular because it is optimal for the majority of Western languages and matches the storage efficiency of UTF-16 for most character sets.
A serverless SQL pool in Azure Synapse Analytics enables you to read UTF-8 encoded text as VARCHAR columns, and this is the optimal way to represent UTF-8 data. But you need to be careful to avoid conversion errors that can be caused by wrong collations on VARCHAR columns.
At the time of writing this post, Synapse SQL forces conversion of UTF-8 characters to plain VARCHAR characters if a UTF-8 collation is not assigned to the VARCHAR type. This behavior might cause unexpected text conversion errors.
This conversion issue can happen if you use OPENROWSET without a WITH clause, or an OPENROWSET call or external table that returns VARCHAR columns without a UTF-8 collation.
This issue does not apply if you use NVARCHAR types to represent UTF-8 data. The NVARCHAR type does not depend on a collation because it always represents characters as 2- or 4-byte UTF-16 sequences. However, with the NVARCHAR type you pay a performance cost, because every UTF-8 character must be converted to the NVARCHAR representation.
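As a sketch of the NVARCHAR alternative, reusing the population file and data source from the example later in this article (the column names and header assumption are illustrative), you can declare NVARCHAR columns in the WITH clause and no COLLATE clause is needed:

```sql
-- NVARCHAR columns avoid the collation dependency, but every UTF-8
-- value is converted to UTF-16 when it is read.
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'csv/population/population.csv',
        DATA_SOURCE = 'sqlondemanddemo',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (
        country_code NVARCHAR(5),   -- no COLLATE needed; stored as UTF-16
        country_name NVARCHAR(100)  -- converted from UTF-8 on every read
    ) AS r;
```

This trades the collation problem for a per-value conversion cost, which is why VARCHAR with a UTF-8 collation remains the recommended representation.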
In this article you will learn when this unexpected conversion can happen, how to avoid it, and how to fix the issue if it occurs.
A collation is a property of string types in SQL Server, Azure SQL, and Synapse SQL that defines how to compare and sort strings. In addition, it describes the encoding of string data. If a collation name in Synapse SQL ends with UTF8, it represents strings encoded with the UTF-8 encoding scheme. Otherwise, the string data is stored in a non-UTF-8 encoding (the collation's code page for VARCHAR, or UTF-16 for NVARCHAR).
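You can check which collation a database currently uses with standard metadata functions and views, for example:

```sql
-- Default collation of the current database
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;

-- Default collations of all databases on the endpoint
SELECT name, collation_name
FROM sys.databases;
```

If the returned collation name does not end with UTF8, VARCHAR columns that rely on the database default will not be read as UTF-8 data.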
The following example shows how to specify a collation for the string columns in an external table definition:
CREATE EXTERNAL TABLE population (
    [country_code] VARCHAR(5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR(100) COLLATE Latin1_General_BIN2,
    [year] SMALLINT,
    [population] BIGINT
) WITH (
    LOCATION = 'csv/population/population.csv',
    DATA_SOURCE = sqlondemanddemo,
    FILE_FORMAT = QuotedCSVWithHeaderFormat
);
This table references a CSV file, and the string columns don't have a UTF-8 collation. Therefore, the CSV file should not be UTF-8 encoded if you want to read the data through this table. A mismatch between the encoding specified in the column collation and the encoding of the underlying files will probably cause a conversion error. In this case, if your population data contains UTF-8 characters, they will be incorrectly converted when you read the data. Therefore, you might need to use a UTF-8 collation instead of Latin1_General_BIN2 after the COLLATE clause.
The UTF-8 encoding represents most characters with a single byte, but some characters that are less common in Western languages need more. Examples are the characters ü and ö in the German words Düsseldorf and Schönwald.
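A quick way to see the storage difference, as a sketch assuming an endpoint where UTF-8 collations are available (SQL Server 2019+ or Synapse SQL):

```sql
-- DATALENGTH returns the size in bytes. 'Düsseldorf' has 10 characters:
-- in UTF-8, ü takes 2 bytes and the rest 1 byte each (11 bytes total);
-- in UTF-16 (NVARCHAR), every character takes 2 bytes (20 bytes total).
SELECT
    DATALENGTH(CAST(N'Düsseldorf' AS VARCHAR(20))
               COLLATE Latin1_General_100_BIN2_UTF8) AS utf8_bytes,   -- 11
    DATALENGTH(N'Düsseldorf') AS utf16_bytes;                         -- 20
```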
Let us imagine that we have a CSV file encoded with UTF-8 encoding scheme with the names of the towns containing these characters. If we preview the content of this file in Synapse Studio, we will get the following result:
Synapse Studio enables us to read the content of this file using T-SQL queries with the OPENROWSET function. Running a T-SQL query on a database with the default or any other non-UTF-8 collation might not return the expected results:
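A minimal example of such a query (the storage path is hypothetical); without a WITH clause, the column types are inferred and the VARCHAR columns get the database default collation:

```sql
-- No WITH clause: inferred VARCHAR columns inherit the database
-- default collation, which may not be a UTF-8 collation.
SELECT *
FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/mycontainer/towns.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS towns;
```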
You might see that the names of the towns Düsseldorf and Schönwald are not the same as in the preview in the previous picture.
In Synapse SQL, you must use a UTF-8 collation to return data from UTF-8 files. Otherwise, the less common characters are incorrectly converted.
At the time of writing this post, Synapse SQL forces the conversion of UTF-8 characters to non-UTF-8 characters, and returns only a warning message that this configuration might cause a conversion error.
The following functions and objects might be affected:
It might be hard to identify exactly in which cases the issue will happen. There are two ways to avoid the issue, or to resolve it if it happens in some of your queries:
If you are working with UTF-8 data, the best way to configure your database is to set a UTF-8 default collation at the database level. You can set the collation as part of the CREATE DATABASE T-SQL statement:
CREATE DATABASE mydb COLLATE Latin1_General_100_BIN2_UTF8;
If you have an existing database, you can easily change the default collation:
ALTER DATABASE mydb COLLATE Latin1_General_100_BIN2_UTF8;
From this point on, every OPENROWSET function call will return correctly converted data.
Note that you need to drop and re-create external tables on which you have not explicitly specified a collation: the new default database collation is applied only when a table is created.
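For example, for the population table defined earlier in this article, dropping and re-creating it without explicit COLLATE clauses lets the string columns inherit the new database default:

```sql
DROP EXTERNAL TABLE population;

CREATE EXTERNAL TABLE population (
    [country_code] VARCHAR(5),    -- inherits the UTF-8 database default collation
    [country_name] VARCHAR(100),  -- inherits the UTF-8 database default collation
    [year] SMALLINT,
    [population] BIGINT
) WITH (
    LOCATION = 'csv/population/population.csv',
    DATA_SOURCE = sqlondemanddemo,
    FILE_FORMAT = QuotedCSVWithHeaderFormat
);
```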
Instead of setting a default database collation, you can explicitly specify the collation when you declare a column type in a WITH clause.
The OPENROWSET function enables you to explicitly specify columns and their types in the WITH clause:
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'latest/ecdc_cases.parquet',
        DATA_SOURCE = 'covid',
        FORMAT = 'parquet'
    )
    WITH (
        date_rep DATE,
        cases INT,
        geo_id VARCHAR(6) COLLATE Latin1_General_100_BIN2_UTF8
    ) AS rows
If you are reading Parquet files or UTF-8 encoded text files, you need to add a UTF-8 collation in the type specification.
If you are defining tables, you can explicitly specify collation in column definition:
CREATE EXTERNAL TABLE population (
    [country_code] VARCHAR(5) COLLATE Latin1_General_100_BIN2_UTF8,
    [country_name] VARCHAR(100) COLLATE Latin1_General_100_BIN2_UTF8,
    [year] SMALLINT,
    [population] BIGINT
) WITH (
    LOCATION = 'csv/population/population.csv',
    DATA_SOURCE = sqlondemanddemo,
    FILE_FORMAT = QuotedCSVWithHeaderFormat
);
This way you can be sure that your table will return the text correctly.
NOTE: Existing tables that used the default database collation are not affected by a change of the default. You need to drop and re-create such external tables so they pick up the new default.
Matching the collation of string columns to the encoding of the underlying files is important to avoid unexpected conversion errors. Currently, a serverless SQL pool converts UTF-8 data to non-UTF-8 data even when it knows that the mismatch can cause a conversion error, and this conversion raises only a warning. Make sure that you are using the proper collations in the OPENROWSET function and in external tables to avoid these errors.
To mitigate this issue, alter the database to use a UTF-8 default collation, or specify a UTF-8 collation explicitly on every string column.