Introducing UTF-8 support for Azure SQL Database

MVP

Jul 24, 2019

m60freeman, there are plenty of reasons why people should (and will) use UTF-8. The size of the stored data is only one of the factors, and not necessarily the most important one!

You can download the materials from my SQLSaturday lecture here:

https://gallery.technet.microsoft.com/UTF-8-in-SQL-Server-2019-8d97cca2

Go over the presentation file. You will notice that I split the discussion into three levels: (1) developers, (2) DBAs, (3) internals.

From the DBA's point of view we discuss size and performance, but from the developer's point of view there is a huge advantage related to "compatibility", which can be much more important than "size" in some cases.

I cannot give a 75-minute lecture in one forum message, but in VERY VERY VERY short (slides 21+25 in the presentation file summarize it all in two tables):

> Size: as Pedro explained, in some ranges of code points UTF-8 is smaller than UTF-16. But even if you use non-English languages, UTF-8 might still give you a big advantage.
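
To make the ranges concrete, here is a minimal Python sketch (not part of the lecture materials) that prints how many bytes one character takes in each encoding. The sample characters are my own picks for illustration, and `utf-16-le` is used so that no byte order mark is counted.

```python
def byte_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# One illustrative character per code point range
samples = {
    "ASCII (0-127)": "A",
    "Hebrew (128-2047)": "א",
    "CJK (2048-65535)": "中",
}

for label, ch in samples.items():
    utf8, utf16 = byte_sizes(ch)
    print(f"{label}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```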

For example, think about a Hebrew forum (or a Hebrew message in this forum), and let's assume I simply write my name in the message (this is my real name in Hebrew):

רונן אריאלי.

but... Hebrew is written right-to-left (RTL). In the text above the dot is not in the right place (in an RTL language the dot should be on the left side), and I need to move the text to the right side of the page as well. In addition, I like myself, so I make my name bold, and I am so happy that this feature is finally supported in Azure that I add some colors to celebrate. The result will look like the text below:

רונן אריאלי.

The end user only sees Hebrew letters. Therefore, he might say that UTF-8 will not give him a smaller size, since Hebrew is in the range 128-2047, and in this range UTF-8 has the same size as UTF-16. Moreover, we can do the same example with Chinese, Japanese or Korean. In this case the poor clients might say that since the text is in Chinese, Japanese or Korean, it is in the range 2048-65535, which means the size will be 1.5 times bigger in UTF-8! Are these arguments correct?!?

The data in the database does not include only the plain text, but all the HTML/CSS code behind the scenes as well!

The fact is that for this simple short name the server actually stores the following code!

<p style="direction: rtl;">
    <strong>
        <span style="background-color: #ffff99; color: #339966;">ר</span>
        <span style="background-color: #00ff00; color: #ff0000;">ו</span>
        <span style="background-color: #0000ff; color: #ffffff;">נ</span>
        <span style="background-color: #808080; color: #ff99cc;">ן</span> 
        <span style="background-color: #000000; color: #ccffcc;">א</span>
        <span style="background-color: #ff99cc; color: #000080;">ר</span>
        <span style="background-color: #808080; color: #ff00ff;">י</span>
        <span style="background-color: #ffffff; color: #993300;">א</span>
        <span style="background-color: #003300; color: #ffcc99;">ל</span>
        <span style="background-color: #ccffcc; color: #993300;">י</span>
        <span style="background-color: #333300; color: #ffffff;">.</span>
    </strong>
</p>

This code includes 898 characters, and only 10 of these are in Hebrew, while the rest are in the ASCII range!

This means that using NVARCHAR we use (898*2) = 1796 bytes, but using UTF-8 we will use only (888*1)+(10*2) = 908 bytes, since each Hebrew letter takes 2 bytes in UTF-8!

Therefore, even for non-English languages we might want to use UTF-8.
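
The same effect can be checked on a much shorter piece of markup. This is a minimal Python sketch, not the forum's real storage code: the `markup` string is a hypothetical, simplified version of the snippet above, and `utf-16-le` stands in for the 2-bytes-per-character storage of NVARCHAR.

```python
def encoded_sizes(text: str) -> dict:
    """Return the character count and the byte length in UTF-8 and UTF-16 (no BOM)."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


# What the end user sees
plain = "רונן אריאלי."
# Hypothetical, simplified markup that might be stored for it
markup = '<p style="direction: rtl;"><strong>רונן אריאלי.</strong></p>'

for label, text in (("plain text", plain), ("with markup", markup)):
    print(label, encoded_sizes(text))
```

For the plain text the two encodings are almost the same size, but once the ASCII markup is included UTF-8 is clearly smaller, because only the 10 Hebrew letters cost 2 bytes each.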

> Performance: size is directly related to memory and IO.

> Compatibility: there are two aspects of compatibility! (1) Compatibility inside the server between different values. (2) Compatibility with external applications, operating systems and so on.

(1) Compatibility inside the server: the NVARCHAR data type is not fully compatible with VARCHAR, and when you combine text in NVARCHAR with text in VARCHAR you might get a very strange and unexpected result (at least unexpected by most users). The entire idea that we have to use two different kinds of data types (national character types like NVARCHAR, NCHAR, XML and non-national character types like CHAR, VARCHAR) is the source of multiple issues, including performance costs from converting the types behind the scenes.

As I see it, national character types will probably be deprecated sometime in the future (not in a year or two... but it will happen... it must happen).

(2) Compatibility with external systems: most operating systems, like Linux and Unix, use UTF-8 as their default encoding. Most applications (if not all, except some poor code) use UTF-8 as their default encoding. Have you ever seen "gibberish text" in the browser?!? THIS IS DIRECTLY RELATED TO ENCODING ISSUES!
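
A minimal Python sketch of how such "gibberish" appears: the text is written as UTF-8, but the reader assumes a different encoding. Latin-1 is only my choice for the demonstration (it can decode any byte); a real browser or application may guess some other code page.

```python
original = "רונן אריאלי"

# The writer stores the text as UTF-8 bytes
raw = original.encode("utf-8")

# The reader wrongly assumes Latin-1, so every byte becomes its own character
garbled = raw.decode("latin-1")
print(garbled)

# Reading the same bytes with the correct encoding restores the text
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == original)
```

The bytes never changed; only the assumption about the encoding did, which is why agreeing on UTF-8 end to end avoids this class of problem.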

In short... you need to hear the full lecture :-)
