SQL Unicode comparisons

Microsoft

Feb 18, 2025

Introduction

Comparing strings in SQL is generally intuitive, but there are scenarios where the results can be unexpected, leading to performance issues, especially when we get into Unicode.

Normally, SQL queries will use linguistic comparisons of strings, roughly following UTS #10: Unicode Collation Algorithm, with SQL documentation mostly skipping over this, with a mention in Collation and Unicode support - SQL Server | Microsoft Learn that “Unicode sorting rules use a word sort that ignores the hyphen”. The overall intent is that things that people think should match based on their local language and culture will match, even if the string of bytes doesn’t.

Part of the matching uses weights taken from the Unicode code page; if a specific character isn’t in the code page, then it has no weight, is rated as unsortable, and becomes effectively invisible in the comparison, which leads to an issue we saw recently.

The issue

The application allows people to log in using either a username or an email address, with a single field in the UI for these. This is compared against a user table, which looks like this; the application is used globally, so Unicode is used:

With a match like this

declare @inputlogin nchar(100) = 'Penelope' select * from user_login where loginname = @inputlogin or email = @inputlogin

The table definition takes care of uniqueness, and the front end application stops nulls coming through. However, there was a sudden burst of login failure storms, with huge numbers of login attempts failing, which was traced down to a single comparison returning a large number of matches in the user login table.

The login being attempted with an input that only had characters not found in the code pages, so the linguistic comparison matched the SQL empty string “” and returned all the user login records (about 10% of the table) with an empty loginname field. The front end was using IsNullOrWhiteSpace as the test, so wasn’t blocking this input.

A very simple example of this is

declare @inputlogin nchar(100) = char(0) select * from user_login where loginname = @inputlogin or email = @inputlogin

which returns Gordon and Virgil.

Making it better

Recommendation – use a binary collation for cases like this where an exact match is required

Test for emptiness in SQL

One simple change is to expand the comparison to check whether the input matches an empty string

declare @inputlogin nchar(100) = char(0) select * from user_login where @inputlogin <> '' and ( loginname = @inputlogin or email = @inputlogin)

Be aware that there are many characters that don’t have a weight, so explicitly testing for each of them is not feasible.

Use an ordinal search

As Best Practices for Comparing Strings in .NET - .NET | Microsoft Learn calls out, it is recommended to use an ordinal comparison for security related activities. In SQL, this is done using a binary collection (BIN or BIN2), to ensure that the login or email is an exact match for the one that’s stored in the table - for example

declare @inputlogin nchar(100) = char(0) select * from user_login where loginname = @inputlogin COLLATE LATIN1_GENERAL_BIN2 or email = @inputlogin COLLATE LATIN1_GENERAL_BIN2

Note the syntax; only putting a single COLLATE at the end of the query doesn’t give the expected result, since it only changes the behaviour of the comparison directly in front of it.

This also takes care of potentially false-positive matches when using a Unicode comparison, such as the handling of soft-hyphen, where

declare @email1 nchar(100) = N'Ben.Solo' + NCHAR(0xAD) + N'Ortega@example.com' declare @email2 nchar(100) = N'Ben.Solo' + N'Ortega@example.com' select 'These are equal', @email1, @email2 where @email1 = @email2 collate Latin1_General_100_CI_AS_SC

returns

These are equal

Ben.Solo-Ortega@example.com

Ben.SoloOrtega@example.com

Jovan’s post The difference between BIN2 and Case-Sensitive collations in SQL – SQLServerCentral has more detail.

Do note that a binary/ordinal match is exact; if the workload has some requirements for inexact matching, such as case-insensitiveness, that needs to be catered for, potentially with a non-unicode case-insensitive collation, or doing character substitution on the input (a -> A, etc)

NULL means NULL

In this example, a quirk in the application meant that for many of the login records, “no login name” was being put in the SQL table as an empty string, rather than NULL. If NULL has consistently been used to record “they don’t have a login name”, the number of matches seen during this issue would have been trivial.

Collations

So far, the examples assume that the environment is using a non-binary collation. If the server or database is set to use BIN or BIN2, then the above cases are moot, but generally applications will tend towards collations that give options like accent and case insensitive comparisons, so that when looking for a convention centre, “takina” will find Tākina. See Collation and Unicode support - SQL Server | Microsoft Learn for more.

In Windows

Similar options at the Windows level; examples using powershell, which uses Unicode (about_Comparison_Operators - PowerShell | Microsoft Learn);

$test_this = "`0" # check that the input isn't white space...both tests are false [string]::IsNullOrWhiteSpace($test_this) [string]::IsNullOrEmpty($test_this) # using the default, these are equal if ($test_this -eq "") { "equal - using defaults" } else {"diff" } # use an ordinal test with nothing discarded, these are different if ($test_this.Equals("",[System.StringComparison]::Ordinal)) { "equal" } else {"diff - ordinal test" } # testing with the current culture, these are equal if ($test_this.Equals("",[System.StringComparison]::CurrentCulture)) { "equal - current culture test" } else {"diff" }

As bedtime reading, the weighting tables used in the comparisons are available for download at https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings#the-details-of-string-comparison

Feedback and suggestions

If you have feedback or suggestions for improving this data migration asset, please contact the Databases SQL Ninja Engineering Team (datasqlninja@microsoft.com). Thanks for your support!

Note: For additional information about migrating various source databases to Azure, see the Azure Database Migration Guide.

Updated Feb 19, 2025

Version 2.0

David_Lyth

Microsoft

Joined March 22, 2021

View Profile

Modernization Best Practices and Reusable Assets Blog

Follow this blog board to get notified when there's new activity