Always use UTF-8 collations to read UTF-8 text in serverless SQL pool
Published Nov 13 2020 12:52 AM
Microsoft

Synapse serverless SQL pool is a query engine that enables you to query a variety of files and formats stored in Azure Data Lake and Azure Cosmos DB. One very common text encoding is UTF-8, where the most common characters used in Western Latin languages are encoded with a single byte. Less common Western, Cyrillic, Turkish, and other characters are encoded with 2 bytes, and special characters are encoded with more than 2 bytes. The UTF-8 encoding is popular because it is optimal for the majority of Western languages and offers storage efficiency comparable to the UTF-16 encodings for most character sets.
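As a rough illustration of these byte lengths, the `DATALENGTH` function can show how many bytes a character occupies once it is stored with a UTF-8 collation. This is a sketch that assumes an engine with UTF-8 collation support (such as serverless SQL pool or SQL Server 2019+) and a database default code page that can represent the sample characters:

```sql
-- Byte lengths of characters under the UTF-8 encoding scheme.
-- In UTF-8, 'a' encodes as 1 byte, 'ü' as 2 bytes, and '€' as 3 bytes.
SELECT
    DATALENGTH('a' COLLATE Latin1_General_100_BIN2_UTF8) AS ascii_bytes,
    DATALENGTH('ü' COLLATE Latin1_General_100_BIN2_UTF8) AS latin_bytes,
    DATALENGTH('€' COLLATE Latin1_General_100_BIN2_UTF8) AS special_bytes;
```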

A serverless SQL pool in Azure Synapse Analytics enables you to read UTF-8 encoded text as VARCHAR columns, which is the optimal way to represent UTF-8 data. However, you need to be careful to avoid conversion errors that can be caused by wrong collations on VARCHAR columns.

 

At the time of writing this post, Synapse SQL forces conversion of UTF-8 characters to plain VARCHAR characters if a UTF-8 collation is not assigned to the VARCHAR type. This behavior might cause unexpected text conversion errors.

 

This conversion issue can happen if you use OPENROWSET without a WITH clause, or if an OPENROWSET call or external table returns VARCHAR columns without a UTF-8 collation.

 

This issue does not apply if you use NVARCHAR types to represent UTF-8 data. The NVARCHAR type does not depend on a collation because it always represents characters as 2-byte or 4-byte sequences. However, the NVARCHAR type comes with a performance cost, because every UTF-8 character must be converted to the NVARCHAR representation.
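For example, the following sketch reads a UTF-8 file into an NVARCHAR column; it works regardless of the database collation but pays the conversion cost described above. The file path and data source names are hypothetical:

```sql
-- Reading UTF-8 text as NVARCHAR side-steps the collation problem,
-- at the cost of converting every UTF-8 character to UTF-16.
-- 'csv/towns/towns.csv' and 'sqlondemanddemo' are hypothetical names.
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'csv/towns/towns.csv',
        DATA_SOURCE = 'sqlondemanddemo',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) WITH ( town NVARCHAR(100) ) AS rows;
```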

 

In this article you will learn when this unexpected conversion can happen, how to avoid it, and how to fix the issue if it occurs.

What is a collation?

A collation is a property of string types in SQL Server, Azure SQL, and Synapse SQL that defines how to compare and sort strings. In addition, it describes the encoding of string data. If a collation name in Synapse SQL ends with UTF8, it represents strings encoded with the UTF-8 encoding scheme. Otherwise, the strings use a non-UTF-8 encoding such as UTF-16.
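The comparison rules that a collation defines can be observed directly. In this sketch, a case-insensitive (CI) UTF-8 collation treats two spellings as equal, while a binary (BIN2) UTF-8 collation does not:

```sql
-- A collation defines how strings are compared:
-- CI (case-insensitive) collations ignore letter case, BIN2 collations compare bytes.
SELECT
    CASE WHEN 'Dusseldorf' COLLATE Latin1_General_100_CI_AS_SC_UTF8 = 'DUSSELDORF'
         THEN 'equal' ELSE 'not equal' END AS case_insensitive_compare,   -- equal
    CASE WHEN 'Dusseldorf' COLLATE Latin1_General_100_BIN2_UTF8 = 'DUSSELDORF'
         THEN 'equal' ELSE 'not equal' END AS binary_compare;             -- not equal
```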

The following example shows how to specify a collation for the string columns in an external table definition:

 

CREATE EXTERNAL TABLE population (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
    [year] smallint,
    [population] bigint
)
WITH (
    LOCATION = 'csv/population/population.csv',
    DATA_SOURCE = sqlondemanddemo,
    FILE_FORMAT = QuotedCSVWithHeaderFormat
);

This table references a CSV file, and the string columns don't have a UTF-8 collation. Therefore, the CSV file should not be UTF-8 encoded if you want to read the data through this table. A mismatch between the encoding specified in the column collation and the encoding of the underlying files will probably cause a conversion error. In this case, if your population data contains UTF-8 characters, they would be incorrectly converted when you read the data. To read UTF-8 files, you would need to use a UTF-8 collation instead of Latin1_General_BIN2 after the COLLATE clause.

What are the special UTF-8 encoded characters?

The UTF-8 encoding represents most characters using 1 byte, but some characters that are less common in Western languages need more. Examples are the characters ü and ö in the German town names Düsseldorf and Schönwald.

Let us imagine that we have a CSV file encoded with UTF-8 encoding scheme with the names of the towns containing these characters. If we preview the content of this file in Synapse Studio, we will get the following result:

[Screenshot: file preview in Synapse Studio showing the town names Düsseldorf and Schönwald rendered correctly]

Synapse Studio enables us to read the content of this file with T-SQL queries using the OPENROWSET function. However, running a T-SQL query on a database with the default or any non-UTF-8 collation might not return the expected results:

[Screenshot: query results where the special characters in Düsseldorf and Schönwald are garbled]

You can see that the town names Düsseldorf and Schönwald are not the same as in the preview in the previous picture.

In Synapse SQL, you must use a UTF-8 collation to return data from UTF-8 files. Otherwise, the less common characters will be silently converted to wrong characters.
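A sketch of such a query, reading the towns file with an explicit UTF-8 collation on the VARCHAR column. The path and data source names are hypothetical:

```sql
-- Reading a UTF-8 encoded CSV with an explicit UTF-8 collation on the column,
-- so names like Düsseldorf and Schönwald come back unchanged.
-- 'csv/towns/towns.csv' and 'sqlondemanddemo' are hypothetical names.
SELECT *
FROM OPENROWSET(
        BULK 'csv/towns/towns.csv',
        DATA_SOURCE = 'sqlondemanddemo',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) WITH ( town VARCHAR(100) COLLATE Latin1_General_100_BIN2_UTF8 ) AS rows;
```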

At the time of writing this post, Synapse SQL silently forces conversion of UTF-8 characters to non-UTF-8 characters, and returns a warning message that this configuration might cause a conversion error.

The following functions and objects might be affected:

  1. The OPENROWSET function without a WITH clause that returns VARCHAR columns.
  2. The OPENROWSET function with an explicit WITH clause that returns VARCHAR columns without a specified collation. This function is affected only if the database collation is a non-UTF-8 collation.
  3. The OPENROWSET function with a WITH clause that returns VARCHAR columns with explicitly set non-UTF-8 collations.
  4. An external table that contains VARCHAR columns without an explicit collation. These external tables are affected only if the database collation at the time the table was created was a non-UTF-8 collation.
  5. An external table that contains VARCHAR columns with explicitly specified non-UTF-8 collations.

It might be hard to identify exactly when the issue can happen. There are two ways to avoid this issue, or to resolve it if it happens in some of your queries:

  • Setting a UTF-8 collation as the database collation resolves the issue in scenarios 1 and 2. It also resolves scenario 4 if you re-create the table.
  • Setting a UTF-8 collation on every string column that the OPENROWSET function or external table returns resolves the issue in scenarios 3, 4, and 5.

Setting database collation

If you are working with UTF-8 data, the best way to configure your database is to set a UTF-8 collation as the default database collation. You can set the collation as part of the CREATE DATABASE T-SQL statement:

 

CREATE DATABASE mydb 
    COLLATE Latin1_General_100_BIN2_UTF8;

If you have an existing database, you can easily change the default collation:

 

ALTER DATABASE mydb 
    COLLATE Latin1_General_100_BIN2_UTF8;

From this point on, every OPENROWSET function call will return correctly converted data.
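To verify which collation is currently in effect, you can query the database property with the standard `DATABASEPROPERTYEX` function:

```sql
-- Check the default collation of the current database;
-- the name should end with UTF8 if the ALTER/CREATE statement took effect.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;
```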

Note that you need to drop and re-create external tables if you have not explicitly specified their collation. The new default database collation is applied only when a table is created.

Specifying explicit collations

Instead of defining a default database collation, you can explicitly specify a collation when you declare a column type in the WITH clause.

The OPENROWSET function enables you to explicitly specify columns and their types in the WITH clause:

 

SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'latest/ecdc_cases.parquet',
        DATA_SOURCE = 'covid',
        FORMAT = 'parquet'
    ) WITH ( date_rep DATE, cases INT,
             geo_id VARCHAR(6) COLLATE Latin1_General_100_BIN2_UTF8 ) as rows

If you are reading Parquet files or UTF-8 encoded text files, you need to add a UTF-8 collation in the type specification.

If you are defining tables, you can explicitly specify collation in column definition:

 

CREATE EXTERNAL TABLE population (
    [country_code] VARCHAR (5) COLLATE Latin1_General_100_BIN2_UTF8,
    [country_name] VARCHAR (100) COLLATE Latin1_General_100_BIN2_UTF8,
    [year] SMALLINT,
    [population] BIGINT
)
WITH (
    LOCATION = 'csv/population/population.csv',
    DATA_SOURCE = sqlondemanddemo,
    FILE_FORMAT = QuotedCSVWithHeaderFormat
);

This way you can be sure that your table will return correct text.

NOTE: If you have existing tables that used the default database collation, changing the default database collation is not applied to them. You need to drop and re-create the external tables so they can pick up the new default.
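Using the population table from earlier as an example, the re-creation could look like this (a sketch; the columns now rely on the new UTF-8 database default instead of an explicit COLLATE clause):

```sql
-- Drop and re-create the external table so that the VARCHAR columns
-- pick up the new UTF-8 default database collation.
DROP EXTERNAL TABLE population;

CREATE EXTERNAL TABLE population (
    [country_code] VARCHAR (5),    -- inherits the database default collation
    [country_name] VARCHAR (100),  -- inherits the database default collation
    [year] SMALLINT,
    [population] BIGINT
)
WITH (
    LOCATION = 'csv/population/population.csv',
    DATA_SOURCE = sqlondemanddemo,
    FILE_FORMAT = QuotedCSVWithHeaderFormat
);
```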

Conclusion

Matching the collation of string columns with the encoding of the underlying files is important to avoid unexpected conversion errors. Currently, a serverless SQL pool silently converts UTF-8 data to non-UTF-8 data even if it knows that the mismatch might cause a conversion error. This conversion only raises a warning, so make sure that you are using proper collations in the OPENROWSET function and in external tables to avoid errors.

To mitigate this issue, alter the database to use a UTF-8 default collation, or specify a UTF-8 collation explicitly on every string column.

6 Comments
Copper Contributor

Thanks for the advice.

 

When running the script on a database on the on-demand pool, the following error pops up

ALTER DATABASE XXXX COLLATE Latin1_General_100_CI_AS_SC_UTF8;
 

 

The database XXXX could not be exclusively locked to perform the operation.

 

Any workaround on this ? Standard resolution used on provisioned pool or SQL DB doesn't work

Copper Contributor

@JovanPop: Having the same issue as @gh_lausanne : The database XXXX could not be exclusively locked to perform the operation. Cannot use the standard ways to list open connections either. 

 

Solved it by listing active sessions and killing them manually. However there are some connections that seem to auto-reconnect, but by executing the statements in one batch instead of command by command it worked for me. 

 

SELECT [session_id]
FROM [sys].[dm_exec_connections];

-- Make KILL statements for each session_id
KILL 83;
KILL 82;
KILL 80;
KILL 79;
ALTER DATABASE xxxx
COLLATE Latin1_General_100_CI_AS_SC_UTF8;

 

Brass Contributor

Thanks @RandomRichard  & @gh_lausanne for your feedback! Did run into the same error and you posted a working solution to it!

Copper Contributor

@JovanPop how should you handle csv files in UCS-2 LE BOM format with a Synapse Serverless SQL Pool?
When querying it gives an error:

Error handling external file: 'Incorrect file encoding. DATAFILETYPE = 'char' selected for querying UTF16 file.'. File/External table name:...

 

Converting the file to UTF-8 in Notepad++ resolves the issue BUT is there a way to query without converting?

 

Example file format (from Notepad++):

[Screenshot: Notepad++ showing the file encoding as UCS-2 LE BOM]

Copper Contributor

Hi @Stuart Lock,

 

Not sure whether you were able to solve this issue, but to help you and others facing a similar situation, here's what I did.

 

Basically, UCS-2 LE BOM and UTF-16 are very similar, so the trick is to define your source format as being encoded in UTF-16, e.g.: 

 

 

    CREATE EXTERNAL FILE FORMAT QuotedCsvWithHdrFormat
    WITH (  
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS ( PARSER_VERSION = '2.0', FIELD_TERMINATOR = '|', STRING_DELIMITER = '"', FIRST_ROW = 2, Encoding = 'UTF16' )
    );

And then align the corresponding format definition in your CREATE EXTERNAL TABLE statement.

 

 

From there, your SELECT query should work. I did not have to specify any COLLATION.

Hope this helps!

Copper Contributor

Thanks for the useful article!

 


@JovanPop wrote:
NOTE: If you have existing tables that used default database collation, changing default database collation would not be applied on them. You would need to drop and re-create external tables so they can pickup new default.


Does this note also apply to views?

Version history
Last update: Dec 21 2020 01:41 AM