Introducing UTF-8 support for SQL Server
Published Jul 02 2019 01:52 PM

SQL Server 2019 introduces support for the widely used UTF-8 character encoding. This has been a long-requested feature, and it can be set as a database-level or column-level default encoding for Unicode string data.

 

Why did we need UTF-8 support?

This is an asset for companies extending their business to a global scale, where providing global multilingual database applications and services is critical to meeting customer demands and specific market regulations.

The benefits of introducing UTF-8 support also extend to scenarios where legacy applications require internationalization and use inline queries: the amount of change and testing involved in converting an application and its underlying database to UTF-16 can be costly, requiring complex string processing logic that affects application performance.

 

How is it implemented?

To limit the amount of changes required for the above scenarios, UTF-8 is enabled in the existing data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when creating an object with, or changing an object’s collation to, a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8. Refer to Set or Change the Database Collation and Set or Change the Column Collation for more details on how to perform those changes. Note that NCHAR and NVARCHAR remain unchanged and continue to use UCS-2/UTF-16 encoding.
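
For example, a database-level or column-level UTF-8 default can be set simply by picking a “_UTF8” collation at creation time. The following is a minimal sketch; the database, table, and column names are illustrative:

-- Database-level default (illustrative name): CHAR/VARCHAR columns created
-- without an explicit COLLATE clause will store data as UTF-8.
CREATE DATABASE MyUtf8Db COLLATE Latin1_General_100_CI_AS_SC_UTF8;
 
-- Column-level default: only this column stores UTF-8, regardless of the database collation.
CREATE TABLE dbo.MyUtf8Table (Utf8String VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8);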

 

Like UTF-16, UTF-8 is only available for Windows collations that support Supplementary Characters, as introduced in SQL Server 2012. You can see all available UTF-8 collations by executing the following command on your SQL Server 2019 instance:

 

SELECT Name, Description FROM fn_helpcollations()
WHERE Name LIKE '%UTF8';

 

Functional comparison between UTF-8 and UTF-16

UTF-8 and UTF-16 both handle the same Unicode characters, and both are variable-length encodings that require up to 32 bits per character. However, there are important differences that drive the choice of whether to use UTF-8 or UTF-16 in your multilingual database or column:

 

  • UTF-8 encodes the common ASCII characters, including English letters and numbers, using 8 bits. ASCII characters (0-127) use 1 byte, code points 128 to 2,047 use 2 bytes, and code points 2,048 to 65,535 use 3 bytes. Code points 65,536 to 1,114,111 use 4 bytes and represent the character range for Supplementary Characters.
  • UTF-16, by contrast, uses at least 16 bits for every character in code points 0 to 65,535 (available in UCS-2 and UTF-16 alike), while code points 65,536 to 1,114,111 use the same 4 bytes as UTF-8.

The table below outlines these storage boundaries:

 

Code Range (hexadecimal) | Code Range (decimal) | Storage bytes with UTF-8 | Storage bytes with UTF-16
000000 – 00007F (ASCII)  | 0 – 127              | 1                        | 2
000080 – 00009F          | 128 – 159            | 2                        | 2
0000A0 – 0003FF          | 160 – 1,023          | 2                        | 2
000400 – 0007FF          | 1,024 – 2,047        | 2                        | 2
000800 – 003FFF          | 2,048 – 16,383       | 3                        | 2
004000 – 00FFFF          | 16,384 – 65,535      | 3                        | 2
010000 – 03FFFF          | 65,536 – 262,143     | 4                        | 4
040000 – 10FFFF          | 262,144 – 1,114,111  | 4                        | 4

 

Performance differences between UTF-8 and UTF-16

If your dataset uses primarily ASCII characters (which represent the majority of Latin alphabets), significant storage savings can be achieved compared to UTF-16 data types.

For example, changing an existing column data type from NCHAR(10) to CHAR(10) with a UTF-8 enabled collation translates into a nearly 50% reduction in storage requirements: NCHAR(10) requires 22 bytes of storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
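
A quick way to see this difference is to compare DATALENGTH for the same ASCII string in both encodings. The following is a minimal sketch (the table variable and values are illustrative; note that the 22 and 12 byte figures above include per-row overhead, whereas DATALENGTH reports only the in-row data size):

-- Same 10-character ASCII string stored as UTF-16 (NCHAR) and UTF-8 (CHAR).
DECLARE @t TABLE (
    Utf16Col NCHAR(10) COLLATE Latin1_General_100_CI_AI_SC,
    Utf8Col  CHAR(10)  COLLATE Latin1_General_100_CI_AI_SC_UTF8
);
INSERT INTO @t VALUES (N'0123456789', '0123456789');
 
SELECT DATALENGTH(Utf16Col) AS Utf16Bytes,  -- 20: 2 bytes per ASCII character
       DATALENGTH(Utf8Col)  AS Utf8Bytes    -- 10: 1 byte per ASCII character
FROM @t;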

 

In the ASCII range, when doing intensive read/write I/O on UTF-8, we measured an average 35% performance improvement over UTF-16 using clustered tables with a non-clustered index on the string column, and an average 11% performance improvement over UTF-16 using a heap.

 

What if your dataset is not predominantly ASCII?

Above the ASCII range, almost all Latin alphabets, but also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna, and N’Ko, use 2 bytes per character in both UTF-8 and UTF-16 (code points 128 to 2,047). Performance measurements were very similar between UTF-8 and UTF-16 in this range.

 

When using compute-intensive operations such as sorts and MERGE joins, UTF-16 is generally better than UTF-8 for the same dataset, because a few internal conversions happen during these operations. HASH joins, LIKE, and inequality comparisons will perform slightly better in UTF-8 for the same dataset.

 

However, Chinese, Japanese, and Korean characters are represented starting in the range 2,048 to 65,535 and use 3 bytes in UTF-8 but only 2 bytes in UTF-16. If your dataset is mostly in this character range, then using UTF-16 is preferred. In fact, we measured about a 25% performance degradation for intensive read I/O when a dataset mostly in this range uses UTF-8 instead of UTF-16.

 

In the Supplementary Character range (65,536 to 1,114,111) there is no measurable difference between UTF-8 and UTF-16 encoding, from both a storage and a performance perspective.

 

To read more about Unicode support in SQL Server, including details on UTF-8 support, see the Collation and Unicode Support topic in the SQL Server documentation.

 

How to convert to UTF-8?

Before you convert, avoid data loss by determining the data type size you must convert to. For example, if a column was originally defined as nvarchar(100) and the projected byte size of its data in UTF-8 is 120 bytes, then changing the column type to varchar(100) would result in data loss; the column would have to be defined as at least varchar(120). The T-SQL script or the SQL Notebook in the Data Samples GitHub helps you assess these requirements.
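
If you just need a rough estimate, you can also measure the projected UTF-8 byte length of the existing data directly. The following is an illustrative query (not the script from the Data Samples GitHub), assuming an NVARCHAR column such as the dbo.MyTable / MyString example used below:

-- Largest current (UTF-16) and projected (UTF-8) byte lengths for the column.
SELECT
    MAX(DATALENGTH(MyString)) AS MaxUtf16Bytes,
    MAX(DATALENGTH(CAST(MyString COLLATE Latin1_General_100_CI_AI_SC_UTF8 AS VARCHAR(MAX)))) AS MaxUtf8Bytes
FROM dbo.MyTable;

If MaxUtf8Bytes exceeds the length of the target varchar column, size the new column accordingly before converting.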

 

Two popular methods to convert data are described next. 

1) Convert column data. Imagine you have an existing table that is encoded in UCS-2/UTF-16 (depending on whether a supplementary character-enabled collation is used) or non-Unicode, like the following examples:

 

CREATE TABLE dbo.MyTable (MyString NVARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC);
 
CREATE TABLE dbo.MyTable (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI);


To convert it in place, run an ALTER statement like the following example:

 

ALTER TABLE dbo.MyTable 
ALTER COLUMN MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8

 

It's very easy to implement; however, this is a potentially blocking operation, which may pose an issue for large tables and busy applications.

 

2) Copy and Replace. This method involves copying data to a new table where the target column(s) are already in UTF-8, and then replacing the old table. The following T-SQL example illustrates this using the second example table above:

 

CREATE TABLE dbo.MyTable2 (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT INTO dbo.MyTable2 (MyString)
SELECT MyString FROM dbo.MyTable;
DROP TABLE dbo.MyTable;
EXEC sp_rename 'dbo.MyTable2', 'MyTable';

This method is much faster, but handling complex schemas with many dependencies (foreign keys, primary keys, triggers, default constraints) and synchronizing the tail of the table requires much more preparation.

 

In conclusion, keep the topics discussed here in mind when choosing which encoding to use in your database or columns, especially in terms of the predominant language/character set that you expect to store in a specific table column.

 

Pedro Lopes ( @SQLPedro ) – Principal Program Manager

8 Comments
Copper Contributor

A nice clear and helpful article. Thanks!

Brass Contributor

Hi Pedro,

Nice write up.

 

As of now, pre-2019, XML data type supports UTF-16 only.

Another current limitation is that XML prolog is stripped while inserting XML with it into a column or a variable of XML data type.

 

What about UTF-8 support for the XML data type in SQL Server 2019?

Please shed some light on the subject.

Microsoft

@ykhabins at this point, XML data type is not supported in the XML data type. And there are no current plans to extend that support. The XML data type is stored in binary format in SQL Server, so the perf gains would be limited if any. If you have specific requirements for UTF-8 on XML, please open a feedback item in https://aka.ms/sqlfeedback and please clarify your requirement there.

Copper Contributor

@ykhabins I think Pedro meant to say that "UTF-8 is not supported in the XML datatype". And by "XML datatype is stored in binary format", he means that the XML datatype is an optimized type that reduces overall size by doing several things, such as:

 

1. place strings into a dictionary so that they only exist once (and yes, here they will be stored as UTF-16)

2. remove insignificant whitespace

3. if using an XSD / XML Schema Collection, attributes and elements designated with specific datatypes will be stored in the native binary representation for that type (when no XML Schema Collection is used then all values are stored as UTF-16 strings)

 

Here is one example of this format as found in the overall protocol specification for the XML datatype:

https://docs.microsoft.com/en-us/openspecs/sql_server_protocols/ms-binxml/d5bd1f42-8643-435c-a0df-0b...

 

And here is a section of one of my posts where I show that for XML data containing repeated strings, the overall data size (in bytes) is smaller in the native XML datatype than even for VARCHAR data containing the same XML data. Of course, if the XML data contains mostly unique strings, then it's possible that the VARCHAR representation would be smaller, though that doesn't always help since VARCHAR can't contain all code points. Either way, take a look:

 

https://sqlquantumleap.com/2019/11/22/how-many-bytes-per-character-in-sql-server-a-completely-comple...

 

ALSO, if by UTF-8 support in XML you mean the encoding of it going in and/or coming out, then that might be a different story.  The XML datatype does currently (and always has) support converting most encodings into UTF-16. The reason that the <?xml ?> declaration is stripped off is that it isn't necessary as the only encoding it can be internally is UTF-16 (and it could be a conflict if the encoding specified in the declaration states otherwise). HOWEVER, if you pass in valid bytes of UTF-8 (or even Windows-1252, etc) and provide the <?xml ?> declaration which states the encoding used for the bytes you are passing in (and if you do NOT prefix the string literal with an upper-case "N" or use an NVARCHAR variable) then the XML datatype will convert from that encoding into UTF-16. This is supported starting in SQL Server 2005.

 

Here is a StackOverflow answer of mine in which I provide examples of doing this conversion from VARCHAR data encoded as UTF-8 or Windows-1252 into XML:

Converting accented characters in varchar() to XML causing “illegal XML character”

 

Of course, getting the data out as UTF-8 encoded bytes is not supported natively until SQL Server 2019 when you can use CONVERT(VARCHAR, ..) to achieve that. Prior to SQL Server 2019, you can create a T-SQL or SQLCLR scalar function (i.e. UDF) to return a VARBINARY(MAX) value containing the UTF-8 encoded bytes of whatever was passed in.

 

Take care,

Solomon....

 

Copper Contributor

I'm confused. Wasn't nvarchar always supported? We use it to store Unicode data for years...

Copper Contributor

@Alex Yumas UTF-8 makes use of VARCHAR (not NVARCHAR) to store strings that would ordinarily take up double the amount of space if stored as NVARCHAR. Where Unicode characters are needed, it uses the appropriate double-byte (or more) storage for those characters, but keeps everything else as VARCHAR. That's why UTF-8 support is more appropriate for Latin-based alphabets.

Copper Contributor

@RandolphWest ah, got it, thanks for the explanation.

Copper Contributor

Thanks for the article.

if we use UTF-8 for the column collation, when we insert data to the table column, do we also need to use prefix N' to insert into the varchar table like we did when we insert into Nvarchar table?

 
 
