SQL Server 2019 introduces support for the widely used UTF-8 character encoding. This has been a long-requested feature and can be set as a database-level or column-level default encoding for Unicode string data.
Why did we need UTF-8 support?
This is an asset for companies extending their business to a global scale, where the requirement to provide multilingual database applications and services is critical to meeting customer demands and complying with specific market regulations.
The benefits of introducing UTF-8 support also extend to scenarios where legacy applications require internationalization and use inline queries: the changes and testing involved in converting an application and its underlying database to UTF-16 can be costly, requiring complex string-processing logic that affects application performance.
How is it implemented?
To limit the amount of changes required for the above scenarios, UTF-8 is enabled in the existing data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when creating or changing an object's collation to a collation with the "_UTF8" suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8. Refer to Set or Change the Database Collation and Set or Change the Column Collation for more details on how to perform those changes. Note that NCHAR and NVARCHAR remain unchanged and continue to use UCS-2/UTF-16 encoding.
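For example, a UTF-8 collation can be set as the database default at creation time, or applied to an individual column. A minimal sketch, assuming hypothetical database and table names:

-- Set a UTF-8 collation as the database default (hypothetical database name)
CREATE DATABASE MyUtf8Database COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Or opt in per column (hypothetical table)
CREATE TABLE dbo.Products
(
    ProductName VARCHAR(200) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);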
Like UTF-16, UTF-8 is only available to Windows collations that support Supplementary Characters, as introduced in SQL Server 2012. You can see all available UTF-8 collations by executing the following command in your SQL Server 2019 instance:
SELECT Name, Description FROM fn_helpcollations() WHERE Name LIKE '%UTF8';
Functional comparison between UTF-8 and UTF-16
UTF-8 and UTF-16 both handle the same Unicode characters, and both are variable length encodings that require up to 32 bits per character. However, there are important differences that drive the choice of whether to use UTF-8 or UTF-16 in your multilingual database or column:
- UTF-8 encodes the common ASCII characters, including English letters and numbers, using 8 bits. ASCII characters (0 to 127) use 1 byte, code points 128 to 2047 use 2 bytes, and code points 2048 to 65535 use 3 bytes. Code points 65536 to 1114111 use 4 bytes and represent the character range for Supplementary Characters.
- UTF-16, by contrast, uses at least 16 bits (2 bytes) for every character in code points 0 to 65535 (available in UCS-2 and UTF-16 alike), while code points 65536 to 1114111 use the same 4 bytes as UTF-8.
The table below outlines these storage boundaries:
| Code Range (hexadecimal) | Code Range (decimal) | Storage bytes with UTF-8 | Storage bytes with UTF-16 |
|---|---|---|---|
| 000000 – 00007F (ASCII) | 0 – 127 | 1 | 2 |
| 000080 – 00009F | 128 – 159 | 2 | 2 |
| 0000A0 – 0003FF | 160 – 1,023 | 2 | 2 |
| 000400 – 0007FF | 1,024 – 2,047 | 2 | 2 |
| 000800 – 003FFF | 2,048 – 16,383 | 3 | 2 |
| 004000 – 00FFFF | 16,384 – 65,535 | 3 | 2 |
| 010000 – 03FFFF | 65,536 – 262,143 | 4 | 4 |
| 040000 – 10FFFF | 262,144 – 1,114,111 | 4 | 4 |
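You can observe these boundaries directly with DATALENGTH, which returns the number of bytes in a value. A minimal sketch on a SQL Server 2019 instance (the table variable and sample strings below are illustrative only):

-- Compare byte counts of the same strings under UTF-8 (VARCHAR) and UTF-16 (NVARCHAR)
DECLARE @samples TABLE
(
    Utf8Value  VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    Utf16Value NVARCHAR(20)
);

INSERT INTO @samples VALUES (N'hello', N'hello'), (N'αβγ', N'αβγ'), (N'漢字', N'漢字');

SELECT Utf16Value,
       DATALENGTH(Utf8Value)  AS Utf8Bytes,   -- 'hello' = 5, 'αβγ' = 6, '漢字' = 6
       DATALENGTH(Utf16Value) AS Utf16Bytes   -- 'hello' = 10, 'αβγ' = 6, '漢字' = 4
FROM @samples;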
Performance differences between UTF-8 and UTF-16
If your dataset uses primarily ASCII characters (which represent the majority of Latin alphabets), significant storage savings can be achieved compared to UTF-16 data types.
For example, changing an existing column data type from NCHAR(10) to CHAR(10) with a UTF-8 enabled collation translates into a nearly 50% reduction in storage requirements: NCHAR(10) requires 22 bytes for storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
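To see this effect at the data level, a small sketch with a hypothetical table variable (DATALENGTH reports the bytes of the character data itself; the full storage figures quoted above are slightly larger than the raw character data):

-- NCHAR(10) always stores 2 bytes per declared character;
-- CHAR(10) with a UTF-8 collation reserves 10 bytes (enough for 10 ASCII characters)
DECLARE @fixed TABLE
(
    Utf16Col NCHAR(10),
    Utf8Col  CHAR(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

INSERT INTO @fixed VALUES (N'abc', N'abc');

SELECT DATALENGTH(Utf16Col) AS Utf16Bytes,  -- 20
       DATALENGTH(Utf8Col)  AS Utf8Bytes    -- 10
FROM @fixed;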
In the ASCII range, when doing intensive read/write I/O on UTF-8, we measured an average 35% performance improvement over UTF-16 using clustered tables with a non-clustered index on the string column, and an average 11% performance improvement over UTF-16 using a heap.
What if your dataset is not predominantly ASCII? Above the ASCII range, almost all Latin alphabets, as well as Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna and N'Ko, use 2 bytes per character in both UTF-8 and UTF-16 (code points 128 to 2047). Performance measurements were very similar between UTF-8 and UTF-16 in this range.
For compute-intensive operations such as sorts and MERGE joins, UTF-16 is generally better than UTF-8 for the same dataset, because a few internal conversions happen during these operations. HASH joins, LIKE, and inequality comparisons perform slightly better in UTF-8 for the same dataset.
Chinese, Japanese, and Korean characters, however, are represented starting in the range 2048 to 65535 and use 3 bytes in UTF-8, but only 2 bytes in UTF-16. If your dataset is mostly in this character range, then UTF-16 is preferred. In fact, we measured about a 25% performance degradation for intensive read I/O when a dataset mostly in this range uses UTF-8 instead of UTF-16.
In the Supplementary character range (65536 to 1114111) there is no measurable difference between UTF-8 and UTF-16 encoding, both from a storage and performance perspective.
To read more about Unicode support in SQL Server, including details on UTF-8 support, see the Collation and Unicode Support documentation.
How to convert to UTF-8?
Before you convert, avoid data loss by knowing the data type size you must convert to. For example, if a column was originally defined as nvarchar(100) and the projected byte size of its data in UTF-8 is 120, then changing the column type to varchar(100) would result in data loss; the column would have to be defined as at least varchar(120). The T-SQL script or the SQL Notebook in the Data Samples GitHub help you assess these requirements.
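As a rough sketch of that assessment (the table and column names below are hypothetical, and the script in the Data Samples GitHub is more thorough), you can re-collate each value to the target UTF-8 collation and measure its byte length:

-- Project the UTF-8 byte size of an existing NVARCHAR column to size the target VARCHAR column
SELECT MAX(DATALENGTH(CAST(MyString COLLATE Latin1_General_100_CI_AI_SC_UTF8 AS VARCHAR(MAX)))) AS MaxUtf8Bytes
FROM dbo.MyTable;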
Two popular methods to convert data are described next.
1) Convert column data. Imagine you have an existing table that is encoded in UCS-2/UTF-16 (depending on whether a supplementary-character-enabled collation is used) or non-Unicode, like the following examples:
-- Unicode (UCS-2/UTF-16) example
CREATE TABLE dbo.MyTable (MyString NVARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC);
-- Non-Unicode example
CREATE TABLE dbo.MyTable (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI);
To convert it in place, run an ALTER statement, like the following example:
ALTER TABLE dbo.MyTable ALTER COLUMN MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8;
It's very easy to implement; however, this is a potentially blocking operation, which may pose an issue for large tables and busy applications.
2) Copy and Replace. This method involves copying data to a new table where the target column(s) are already in UTF-8, and then replacing the old table. The following T-SQL example illustrates this using the second example table above:
CREATE TABLE dbo.MyTable2 (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT INTO dbo.MyTable2 SELECT * FROM dbo.MyTable;
DROP TABLE dbo.MyTable;
EXEC sp_rename 'dbo.MyTable2', 'MyTable';
This method is much faster, but handling complex schemas with many dependencies (foreign keys, primary keys, triggers, defaults) and synchronizing the tail of the table require much more preparation.
In conclusion, keep the topics discussed here in mind when choosing which encoding to use in your database or columns, especially the predominant language/character set you expect to store in a specific table column.
Pedro Lopes ( @SQLPedro ) – Principal Program Manager