@R A Thank you for the very detailed information.
You are most welcome m60freeman.
Actually, it was just the "tip of the iceberg"; there is a lot more information😀
Some databases are accessed by users within one country, some by users within one region, and some by users worldwide.
The location of the clients is an important parameter. It is not necessarily related to multiple languages or to encoding, but it has its own challenges.
I have multiple lectures/posts related to multi-language databases and multi-location databases. For example, I recommend the lecture about "SQL Server Non-deterministic Elements" if your clients are worldwide.
https://gallery.technet.microsoft.com/SQL-Server-Non-deterministi-f04efe29
UTF8 (or possibly UTF16 for databases used primarily by users of East Asian languages)
Not accurate. UTF-8 is used in all countries, including English-speaking ones like the US.
For example,
(1) UTF-8 is the recommended encoding for the web according to the W3C (World Wide Web Consortium), for example.
(2) Moreover... do you like emojis?!
Do you think that people in the US are not allowed to use emojis because they speak English and must use only ASCII characters?
Emojis are nothing but ordinary characters, mostly in the supplementary range 65536–1114111🙄.
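As a quick sanity check (a Python sketch, outside SQL Server, just to show the code points and byte sizes involved):

```python
# Emojis are ordinary Unicode characters; most live in the
# supplementary planes, i.e. code points above 65535 (0xFFFF).
grin = "😀"  # U+1F600 GRINNING FACE

print(ord(grin))                      # 128512, inside the 65536..1114111 range
print(len(grin.encode("utf-8")))      # 4 bytes in UTF-8
print(len(grin.encode("utf-16-le")))  # 4 bytes in UTF-16 (a surrogate pair)
```

So a supplementary character costs 4 bytes in both encodings; neither NVARCHAR (UTF-16) nor a UTF-8 VARCHAR has an advantage for emojis specifically.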
If you download my lecture's materials, you will find scripts and information about using emojis in the section where I discuss supplementary characters; I use emojis as an example of supplementary characters.
By the way, when I have a full day to speak/teach about the topic, I might use the title "Writing from 10000 BC to SQL Server 2019 and beyond". You might notice that in the presentation file, on slide number 7, I start with the history of writing and present the fact that it all started with pictographs, which are small images that describe what the person sees. I always joke about the fact that history repeats itself and people move forward until they find themselves where they started... The definition of emojis is exactly the same as that of pictographs, and today our children have simply gone back 5000 years to the abilities of the first people who started to write. Instead of using text, children can communicate for hours on their phones purely using pictographs (emojis) without a single letter 👀
seems like it could reduce storage sizes for such strings
You focus on the size, while I am trying to explain that the size of the stored data is only one of the parameters, and not necessarily the most important one!
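To make the size trade-off concrete, here is a small Python sketch (not SQL Server itself, but the encodings behave the same way) comparing UTF-8 and UTF-16 storage for a few scripts; the sample strings are just illustrative:

```python
# Compare how many bytes the same text needs in UTF-8 vs UTF-16.
samples = {
    "English": "Hello, world",
    "Hebrew":  "שלום עולם",
    "Chinese": "你好，世界",
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))
    print(f"{name}: {len(text)} chars, UTF-8={utf8} bytes, UTF-16={utf16} bytes")
```

For ASCII-heavy text UTF-8 halves the storage, for Hebrew the two are roughly equal, and for East Asian scripts UTF-8 is actually larger (3 bytes per character vs 2), which is exactly why size alone cannot decide the question.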
but we would need to change our application and TSQL code accordingly, and change the collation for our existing databases. All this seems like a massive undertaking
Yes... Moving to UTF-8 does require some work, but mostly QC/QA rather than real changes on the application side, assuming the application was developed in the right way and your architecture was smart from the start. In that case, the procedure should be (almost) entirely on the database side.
Moreover, you should remember that while this is a new feature in SQL Server, it has existed for many years in other databases.
The idea of using the same data type (VARCHAR, in the case of SQL Server) for Unicode exists in other databases, and this was the right way to go from the start (IMO). In fact, in my opinion Microsoft should support UTF-16 and other Unicode encodings in the same way that it finally supports UTF-8 in SQL Server 2019: using the simple text type VARCHAR, as other databases do.
If you had a good architect when you designed your system, then there is a good chance that it was designed to fit the "relational model" and not SQL Server specifically!
You mention in one of your slides that there are performance issues with UTF8 but don't explain what they are.
I did explain, and I even mentioned it again at the end🙂
You forget that you only watched the presentation file and did not hear my lecture. It is not a tutorial, only something that runs behind the speaker. A presentation does not have much meaning by itself; it is just a tool that accompanies the speaker's lecture. If I had a recording of this lecture in English, I would add the link, but the only time this lecture was recorded was in Hebrew, for the PASS Hebrew Virtual Group. You can watch the Hebrew lecture here if you want: http://youtu.be/Ae2ghRyd6ZM (you can jump directly to the point where I start to speak, about 4:50 minutes after the start of the meeting).
In short: we ran a lot of tests and found cases where UTF-8 introduces latency. For example, it seems that some of the string functions convert the data implicitly to the old encoding, and we clearly noticed latency (I am talking about big databases with multiple TB of data and big tables) - I am not sure you would be able to feel this in your case. There are several scenarios where you might get some latency when you mix UTF-8 and national character types together. There are also the performance issues related to size (which Pedro Lopes mentioned). And more...
By the way, speaking about performance: I mentioned here only two performance advantages, memory and I/O, but there are many more to discuss. Here is another example out of many, which I mentioned in the lecture: if you have non-English text of 4001 characters, then you must use NVARCHAR(MAX), since the maximum length of NVARCHAR is 4000. Using MAX can lead to VERY poor performance, since the data might be stored outside the row. With UTF-8 we can use the VARCHAR data type, whose declared length can go up to 8000; note that VARCHAR length counts bytes, and non-ASCII characters take 2-4 bytes each in UTF-8, so as long as the encoded text fits in 8000 bytes, the data is stored in-row.
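Since VARCHAR length with a UTF-8 collation is measured in bytes, whether a 4001-character string actually fits in-row depends on the script. A Python sketch of the arithmetic (the 8000-byte figure is SQL Server's in-row maximum for a declared VARCHAR(n); the sample characters are just illustrative):

```python
IN_ROW_LIMIT = 8000  # max bytes for a declared VARCHAR(n) in SQL Server

def utf8_bytes(ch: str, n: int) -> int:
    """Bytes needed to store n copies of character ch in UTF-8."""
    return len(ch.encode("utf-8")) * n

# 4001 Latin characters: 1 byte each -> 4001 bytes, fits in-row easily.
print(utf8_bytes("a", 4001), utf8_bytes("a", 4001) <= IN_ROW_LIMIT)
# 4001 Hebrew characters: 2 bytes each -> 8002 bytes, just over the limit.
print(utf8_bytes("ש", 4001), utf8_bytes("ש", 4001) <= IN_ROW_LIMIT)
# 4001 Chinese characters: 3 bytes each -> 12003 bytes, would need MAX anyway.
print(utf8_bytes("你", 4001), utf8_bytes("你", 4001) <= IN_ROW_LIMIT)
```

So the in-row win is real for mostly 1-byte (and some 2-byte) scripts, but for East Asian text of that length you would still end up at VARCHAR(MAX).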
I hope this was useful😀