Introducing UTF-8 support for Azure SQL Database
Published Jul 23 2019 07:00 AM
Microsoft

UTF-8 support is now in preview for Azure SQL Database. It gives more options to companies extending their businesses to a global scale, where providing global multilingual database applications and services is critical to meeting customer demands and specific market regulations.

 

If the above requirements resonate with you and your business, then you need a database that supports the full Unicode character set. While this was already possible by choosing a collation that allows "supplementary characters" and using the NCHAR or NVARCHAR data types, only UTF-16 was supported until now. Azure SQL Database and SQL Server 2019 introduce support for UTF-8 as well, under the CHAR and VARCHAR data types. Choosing whether to use UTF-16 or UTF-8 to support your requirements is the next step, and this blog, together with the documentation here, should provide more clarity on the topic. So please read on!

 

How is it implemented?

To limit the amount of changes required for the above scenarios, UTF-8 is enabled in the existing CHAR and VARCHAR data types. String data is automatically encoded to UTF-8 when creating or changing an object’s collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8. Refer to Set or Change the Database Collation and Set or Change the Column Collation for more details on how to perform those changes. Note that NCHAR and NVARCHAR remain unchanged and continue to use UCS-2/UTF-16 encoding.
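For example, a column can opt into UTF-8 simply by being declared with a "_UTF8" collation. The following is a minimal sketch (the table and column names are illustrative), using one of the UTF-8 collations mentioned above:

-- Sketch: a VARCHAR column that stores the full Unicode range, UTF-8 encoded
CREATE TABLE dbo.Products
(
    ProductId   INT IDENTITY PRIMARY KEY,
    ProductName VARCHAR(200) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

-- Unicode text can now be stored through the CHAR/VARCHAR types
INSERT INTO dbo.Products (ProductName) VALUES (N'café 水 😀');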

 

You can see all available UTF-8 collations by executing the following command in your SQL Database:

 

SELECT Name, Description FROM fn_helpcollations()
WHERE Name LIKE '%UTF8';

Functional comparison between UTF-8 and UTF-16

UTF-8 and UTF-16 both handle the same Unicode characters, and both are variable-length encodings that can require up to 32 bits (4 bytes) per character. However, there are important differences that drive the choice of whether to use UTF-8 or UTF-16 in your multilingual database or column:

 

  • UTF-8 encodes the common ASCII characters, including English letters and numbers, using 8 bits. ASCII characters (0-127) use 1 byte, code points 128 to 2047 use 2 bytes, and code points 2048 to 65535 use 3 bytes. Code points 65536 to 1114111 use 4 bytes and represent the Supplementary Character range.
  • UTF-16, by contrast, uses at least 16 bits (2 bytes) for every character in code points 0 to 65535 (available in UCS-2 and UTF-16 alike), while code points 65536 to 1114111 use the same 4 bytes as UTF-8.

The table below outlines these storage boundaries:

 

| Code Range (hexadecimal) | Code Range (decimal) | Storage bytes with UTF-8 | Storage bytes with UTF-16 |
|---|---|---|---|
| 000000 – 00007F (ASCII) | 0 – 127 | 1 | 2 |
| 000080 – 00009F | 128 – 159 | 2 | 2 |
| 0000A0 – 0003FF | 160 – 1,023 | 2 | 2 |
| 000400 – 0007FF | 1,024 – 2,047 | 2 | 2 |
| 000800 – 003FFF | 2,048 – 16,383 | 3 | 2 |
| 004000 – 00FFFF | 16,384 – 65,535 | 3 | 2 |
| 010000 – 03FFFF | 65,536 – 262,143 | 4 | 4 |
| 040000 – 10FFFF | 262,144 – 1,114,111 | 4 | 4 |

 

Considerations for choosing UTF-8 or UTF-16

If your dataset uses primarily ASCII characters (which cover the majority of Latin-alphabet text), significant storage savings can be achieved compared to the UTF-16 data types.

For example, changing an existing column data type from NCHAR(10) to CHAR(10) using a UTF-8 enabled collation translates into nearly a 50% reduction in storage requirements. This is because NCHAR(10) requires 22 bytes for storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
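A quick way to see this effect yourself is to compare DATALENGTH for the same ASCII string in both encodings. This is a minimal sketch; note that DATALENGTH reports the character bytes only, without per-row overhead:

-- Sketch: the same 10-character ASCII string stored as UTF-16 vs UTF-8
CREATE TABLE #StorageDemo
(
    Utf16Col NCHAR(10),                                         -- UCS-2/UTF-16
    Utf8Col  CHAR(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8  -- UTF-8
);

INSERT INTO #StorageDemo VALUES (N'0123456789', '0123456789');

SELECT DATALENGTH(Utf16Col) AS Utf16Bytes,   -- 20 (2 bytes per ASCII character)
       DATALENGTH(Utf8Col)  AS Utf8Bytes     -- 10 (1 byte per ASCII character)
FROM #StorageDemo;

DROP TABLE #StorageDemo;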

 

What if your dataset is not predominantly ASCII? Above the ASCII range, almost all other Latin-based alphabets, but also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna and N’Ko, use 2 bytes per character in both UTF-8 and UTF-16 (code points 128 to 2047).

 

But Chinese, Japanese, and Korean characters are represented starting in the range 2048 to 65535, and use 3 bytes in UTF-8 but only 2 bytes in UTF-16. If your dataset is mostly in this character range, then UTF-16 is preferred.

 

In the Supplementary character range (65536 to 1114111) there is no difference between UTF-8 and UTF-16 encoding: both use 4 bytes per character.
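To make these ranges concrete, the following sketch measures one character from each range (the specific characters are just examples, assuming a UTF-8 collation is available in your database):

-- Sketch: bytes per character in UTF-8 vs UTF-16 across the ranges above
CREATE TABLE #EncodingDemo
(
    Sample   NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC,      -- UTF-16
    SampleU8 VARCHAR(10)  COLLATE Latin1_General_100_CI_AS_SC_UTF8  -- UTF-8
);

INSERT INTO #EncodingDemo VALUES
    (N'A',  N'A'),    -- ASCII:         1 byte  in UTF-8, 2 bytes in UTF-16
    (N'ж',  N'ж'),    -- Cyrillic:      2 bytes in UTF-8, 2 bytes in UTF-16
    (N'水', N'水'),   -- CJK:           3 bytes in UTF-8, 2 bytes in UTF-16
    (N'😀', N'😀');   -- Supplementary: 4 bytes in UTF-8, 4 bytes in UTF-16

SELECT Sample,
       DATALENGTH(SampleU8) AS Utf8Bytes,
       DATALENGTH(Sample)   AS Utf16Bytes
FROM #EncodingDemo;

DROP TABLE #EncodingDemo;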

 

To read more about Unicode support in SQL Database, including details on UTF-8 support, see here.

 

How to convert to UTF-8?

Two methods to convert data are described next. 

1) Convert column data. Imagine you have an existing table that is encoded in UCS-2/UTF-16 (depending on whether a supplementary character enabled collation is used) or in a non-Unicode encoding, like the following examples:

 

-- UCS-2/UTF-16 encoded column (supplementary character enabled collation)
CREATE TABLE dbo.MyTable (MyString NVARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC);
-- Non-Unicode column
CREATE TABLE dbo.MyTable (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI);

To convert it in place, run an ALTER statement like the following example:

 

ALTER TABLE dbo.MyTable
ALTER COLUMN MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8

This is very easy to implement; however, it is a potentially blocking operation, which may pose an issue for large tables and busy applications.
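Where the table and service tier support it, the online option can reduce the blocking impact of the in-place change. This is a sketch; ONLINE = ON has its own restrictions, so verify it applies to your table first:

-- Sketch: in-place conversion with reduced blocking, where ONLINE = ON is supported
ALTER TABLE dbo.MyTable
ALTER COLUMN MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8
WITH (ONLINE = ON);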

 

2) Copy and replace. This method involves copying the data to a new table where the target column(s) already use a UTF-8 collation, and then replacing the old table. The following T-SQL example illustrates this using the second example table above:

 

CREATE TABLE dbo.MyTable2 (MyString VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT INTO dbo.MyTable2
SELECT * FROM dbo.MyTable;
DROP TABLE dbo.MyTable;
EXEC sp_rename 'dbo.MyTable2', 'MyTable';

This method is much faster, but handling complex schemas with many dependencies (foreign keys, primary keys, triggers, defaults) and synchronizing the tail of the table requires much more preparation.
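Whichever method you use, you can confirm the outcome afterwards by checking the column metadata (a small verification sketch):

-- Sketch: verify the column now uses a UTF-8 collation
SELECT c.name AS ColumnName,
       t.name AS DataType,
       c.max_length,
       c.collation_name
FROM sys.columns AS c
JOIN sys.types   AS t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID(N'dbo.MyTable');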

 

In conclusion, keep the topics discussed here in mind when choosing which encoding to use in your database or columns, especially the predominant language or character set that you expect to store in a specific table column.

 

Pedro Lopes ( @SQLPedro ) – Senior Program Manager

10 Comments
Brass Contributor

Pedro, this is very helpful in terms of picking UTF-8 vs. UTF-16, but doesn't mention why I might want UTF at all when all my applications that use NVARCHAR work fine using the SQL_Latin1_General_CP1_CI_AS collation, and I'm interested in learning about that. Thanks!

Microsoft

If you are working with NVARCHAR in that collation, then you're using UCS-2 encoding, which can represent the first 65535 Unicode code points. As I said at the start, if you have a requirement to provide global multilingual database applications and services, then you likely need full Unicode support. The same applies to meeting some specific market/industry regulations.

Once you've established that you do have a requirement to use Unicode, then choosing whether to use UTF-8 or UTF-16 comes down to the considerations explained in this blog, and also in the documentation at https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support. Hope this helps.

Awesome news!

Thanks for sharing Pedro=)

@m60freeman , there are plenty of reasons why people should (and will) use UTF-8. The size of the stored data is only one of the parameters, and not necessarily the most important one!

 

You can download the materials from my lecture at SQLSaturday here:

https://gallery.technet.microsoft.com/UTF-8-in-SQL-Server-2019-8d97cca2

 

Go over the presentation file: you will notice that I split the discussion into three levels: (1) developers, (2) DBAs, (3) internals.

 

From the DBA point of view we discuss size and performance, but from the developer's point of view there is a huge advantage related to "compatibility", which can be much more important than "size" in some cases.

 

I cannot give a 75-minute lecture in one forum message, but in VERY VERY VERY short (slides 21 and 25 in the presentation file summarize it all in two tables):

 

> Size: As Pedro explained, in some ranges of code points UTF-8 is smaller than UTF-16, but even if you use non-English languages, UTF-8 might still give you a big advantage.

For example, think about a Hebrew forum (or a Hebrew message in this forum), and let's assume I simply write my name in the message (this is my real name in Hebrew):

רונן אריאלי.

But... Hebrew is written RTL (in the above text the dot is not in the right place! In an RTL language the dot should be on the left side), and I need to move the text to the right side of the page as well. In addition, I like myself, so I make my name bold, and I am so happy that this feature is finally supported in Azure that I add some colors to celebrate. The result will look like the text below:

רונן אריאלי.

The end user only sees Hebrew letters. Therefore, he might say that UTF-8 will not give him a smaller size, since Hebrew is in the range 128-2047, and in this range UTF-8 has the same size as UTF-16. Moreover, we can do the same exercise with Chinese, Japanese or Korean; in that case the poor clients might say that since the text is in the range 2048-65535, the size will be 1.5 times bigger in UTF-8! Are these arguments correct?!?

The data in the database does not include only the plain text, but all the HTML/CSS code behind the scenes as well!

The fact is that for this simple short name the server actually stores the following code:

 

<p style="direction: rtl;">
    <strong>
        <span style="background-color: #ffff99; color: #339966;">ר</span>
        <span style="background-color: #00ff00; color: #ff0000;">ו</span>
        <span style="background-color: #0000ff; color: #ffffff;">נ</span>
        <span style="background-color: #808080; color: #ff99cc;">ן</span> 
        <span style="background-color: #000000; color: #ccffcc;">א</span>
        <span style="background-color: #ff99cc; color: #000080;">ר</span>
        <span style="background-color: #808080; color: #ff00ff;">י</span>
        <span style="background-color: #ffffff; color: #993300;">א</span>
        <span style="background-color: #003300; color: #ffcc99;">ל</span>
        <span style="background-color: #ccffcc; color: #993300;">י</span>
        <span style="background-color: #333300; color: #ffffff;">.</span>
    </strong>
</p>

This code includes 898 characters, and only 10 of them are Hebrew, while the rest are in the ASCII range!

 

This means that using NVARCHAR we use (898*2) = 1796 bytes, but using UTF-8 we will use only (888*1)+(10*2) = 908 bytes!
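You can check this kind of arithmetic yourself with a quick sketch like the one below (the fragment is shortened and the names are just illustrative):

-- Sketch: compare byte counts of the same markup-heavy fragment
DECLARE @demo TABLE
(
    AsUtf16 NVARCHAR(MAX),
    AsUtf8  VARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

INSERT INTO @demo VALUES
    (N'<p style="direction: rtl;"><strong>רונן אריאלי.</strong></p>',
     N'<p style="direction: rtl;"><strong>רונן אריאלי.</strong></p>');

SELECT DATALENGTH(AsUtf16) AS NVarcharBytes,  -- 2 bytes per character
       DATALENGTH(AsUtf8)  AS Utf8Bytes       -- 1 byte per ASCII char, 2 per Hebrew char
FROM @demo;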

Therefore, even for non-English languages we might want to use UTF-8.

 

> Performance: Size is directly related to memory and I/O.

 

> Compatibility: There are two aspects of compatibility! (1) Compatibility inside the server, between different values. (2) Compatibility with external applications, operating systems, and so on.

 

(1) Compatibility inside the server: the NVARCHAR data type is not fully compatible with VARCHAR, and when combining text in NVARCHAR with text in VARCHAR you might get a very strange and unexpected result (at least unexpected by most users). The entire idea that we have to use two different families of types (national character types like NVARCHAR and NCHAR, and non-national character types like CHAR and VARCHAR) is a source of multiple issues and of performance problems related to converting the types behind the scenes.
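A tiny sketch of the kind of surprise I mean, assuming a database whose default collation is neither Hebrew-aware nor UTF-8:

-- Sketch: implicit NVARCHAR-to-VARCHAR conversion can silently lose characters
DECLARE @unicodeText NVARCHAR(20) = N'רונן';   -- Hebrew text
DECLARE @legacyText  VARCHAR(20);

SET @legacyText = @unicodeText;                -- converted to the default code page

SELECT @unicodeText AS OriginalText,
       @legacyText  AS AfterConversion;        -- shows '????' if the code page has no Hebrew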

As I see it, national character types will probably be deprecated sometime in the future (not in a year or two... but it will happen... it must happen).

(2) Compatibility with external systems: most operating systems, like Linux and Unix, use UTF-8 as their default encoding. Most (if not all, except some poorly written code) applications use UTF-8 as their default encoding. Have you ever seen "gibberish text" in the browser?!? THIS IS DIRECTLY RELATED TO ENCODING ISSUES!

 

In short... you need to hear the full lecture :)

I updated the message. I mistakenly added a link to a different lecture. Now the download link is correct :)

Brass Contributor

@Ronen_Ariely Thank you for the very detailed information. I work for a global company with almost all internally developed applications using .Net or .Net core on Windows with Azure SQL Databases (compatibility levels range from 120 - 140). Some databases are accessed by users within one country, some by users within one region, and some by users worldwide. Most strings are stored as UNICODE (NVARCHAR). UTF8 (or possibly UTF16 for databases used primarily by users of East Asian languages) seems like it could reduce storage sizes for such strings, but we would need to change our application and TSQL code accordingly, and change the collation for our existing databases. All this seems like a massive undertaking, although I can see the advantages for new applications.

 

You mention in one of your slides that there are performance issues with UTF8 but don't explain what they are. Can you point me to an article that describes these issues?

Microsoft

@m60freeman you can read about perf differences between encoding types in this blog: https://techcommunity.microsoft.com/t5/SQL-Server/Introducing-UTF-8-support-for-SQL-Server/ba-p/7349...


 

Brass Contributor

@Pedro Lopes I guess there's no such thing as a free lunch. :)

@R A Thank you for the very detailed information.

You are most welcome @m60freeman.

Actually, it was just the "tip of the iceberg" and there is a lot more information :grinning_face:

 

Some databases are accessed by users within one country, some by users within one region, and some by users worldwide.

The location of the clients is an important parameter. It is not necessarily related to multiple languages or to encoding, but it has its own challenges.

I have multiple lectures/posts related to multi-language databases and multi-location databases. For example, I recommend the lecture about "SQL Server Non-deterministic Elements" if your clients are worldwide.

https://gallery.technet.microsoft.com/SQL-Server-Non-deterministi-f04efe29

 

UTF8 (or possibly UTF16 for databases used primarily by users of East Asian languages)

Not accurate. UTF-8 is used in all countries, including English-speaking countries like the US.

For example,

(1) UTF-8 is the standard encoding recommended for web content, for example by the W3C (World Wide Web Consortium).

(2) Moreover... do you like emojis?!?

Do you think that people in the US are not allowed to use emojis, since they speak English and must use only ASCII characters?

Emojis are nothing but simple characters in the range 65536-1114111 :face_with_rolling_eyes:.

If you download my lecture's materials, you can find scripts and information regarding the use of emojis, in the section where I speak about Supplementary Characters. I use emojis as an example of Supplementary Characters.

 

By the way, when I have a full day to speak/teach about the topic, I might use the title "Writing from 10000 BC to SQL Server 2019 and beyond". You might notice in the presentation file, on slide number 7, that I start with the history of writing and present the fact that it all started with pictographs, which are small images that describe what the person sees. I always joke about the fact that history repeats itself and people always move forward until they find themselves where they started... The definition of an emoji is exactly the same as that of a pictograph, and today our children have simply gone back 5,000 years to the abilities of the first people who started to write. Instead of using text, children can communicate for hours on their phones purely using pictographs (emojis) without a single letter :eyes:

 

seems like it could reduce storage sizes for such strings

You focus on the size, while I am trying to explain that the size of the stored data is only one of the parameters, and not necessarily the most important one!

 

but we would need to change our application and TSQL code accordingly, and change the collation for our existing databases. All this seems like a massive undertaking

Yes... Moving to UTF-8 does require some work, but mostly QC/QA and not real changes on the application side, assuming the application was developed in the right way and your architecture was smart from the start. In that case, this procedure should be (almost) entirely on the database side.

Moreover, you should remember that while this is a new feature in SQL Server, it has existed for many years in other databases.

The idea of using the same data type (VARCHAR in the case of SQL Server) for Unicode exists in other databases, and this was the right way to go from the start (IMO). In fact, in my opinion Microsoft should support UTF-16 and other Unicode encodings in the same way that UTF-8 is finally supported in SQL Server 2019: using the simple text type VARCHAR, as other databases do.

 

If you had a good architect when you designed your system, then there is a good chance that your system was designed to fit the "relational model" and not SQL Server specifically!

 

You mention in one of your slides that there are performance issues with UTF8 but don't explain what they are.

I did explain, and even mentioned it again at the end :)

 

You forget that you only looked at the presentation file and did not hear my lecture. It is not a tutorial, but only something that accompanies the speaker. A presentation does not have a lot of meaning by itself; it's only a tool that supports the speaker's lecture. If I had a recording of this lecture in English, I would add the link, but the only time this lecture was recorded was in Hebrew, during my session for the PASS Hebrew Virtual Group. You can watch the Hebrew lecture here if you want: http://youtu.be/Ae2ghRyd6ZM . You can jump directly to the point where I start to speak (about 4:50 minutes after the start of the meeting).

 

In short: we did a lot of tests and found that there are cases where we see latency when using UTF-8. For example, it seems that some of the string functions implicitly convert the data to the old data type, and we clearly noticed latency (I am talking about big databases with multiple TB of data and big tables); I am not sure that you will be able to feel this in your case. There are several scenarios where you might get some latency when you use UTF-8 together with national character types. There are also the performance issues related to size (which @Pedro Lopes mentioned). And more...

 

By the way, speaking about performance, I mentioned here only two advantages related to performance, "memory and I/O", but there is much more to discuss. Here is another example out of many, which I mentioned in the lecture: if you have non-English text of 4,001 characters, then you must use NVARCHAR(MAX), since the max length of NVARCHAR is 4,000. Using MAX can lead to VERY poor performance, since the data might be stored outside the row. Using UTF-8 we can use the VARCHAR data type, which can have a length of up to 8,000, which means we can use VARCHAR(4001) and keep the data in-row.

 

I hope this was useful :grinning_face:

Copper Contributor

Pedro, thanks for the information.

 

I'm developing a .NET MAUI application for Android, using C#, that gets a ChatCompletions response using chatCompletions.Choices[0].Message.Content from Azure.AI.OpenAI. The documentation says that the ChatCompletions response is encoded in UTF-8. After I get the response, I try to insert it as a string into an Azure SQL Database (I'm paying for the Basic service tier) in a VARCHAR(MAX) column. Everything works fine only if I insert a string with a maximum length of 1,353 characters; if I try to insert a string longer than 1,353 characters, I get an error message from the server.

 

This is the error message:

  • A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught)

 

And this is the inner exception:

  • Unable to read data from the transport connection: Connection reset by peer.

 

I tried to insert the string with more than 1,353 characters, using UTF-8 encoding, directly into the table using SQL Server Management Studio, and it worked with no problem at all. So the error only happens when I try to insert the string with more than 1,353 characters from the .NET MAUI application. I don't think there is a problem with my application, because the insert operation works fine when the string has fewer than 1,353 characters. Do you know if there is any kind of restriction in Azure SQL Database when inserting larger amounts of string data into a VARCHAR(MAX) column, or something similar?
