Introduction
Comparing strings in SQL is generally intuitive, but there are scenarios where the results can be unexpected, leading to performance issues, especially when we get into Unicode.
Normally, SQL queries will use linguistic comparisons of strings, roughly following UTS #10: Unicode Collation Algorithm, with SQL documentation mostly skipping over this, with a mention in Collation and Unicode support - SQL Server | Microsoft Learn that “Unicode sorting rules use a word sort that ignores the hyphen”. The overall intent is that things that people think should match based on their local language and culture will match, even if the string of bytes doesn’t.
Part of the matching uses weights taken from the Unicode code page; if a specific character isn’t in the code page, then it has no weight, is rated as unsortable, and becomes effectively invisible in the comparison, which leads to an issue we saw recently.
The issue
The application allows people to log in using either a username or an email address, with a single field in the UI for these. This is compared against a user table, which looks like this; the application is used globally, so Unicode is used:
With a match like this
declare @inputlogin nchar(100) = 'Penelope' select * from user_login where loginname = @inputlogin or email = @inputlogin
The table definition takes care of uniqueness, and the front end application stops nulls coming through. However, there was a sudden burst of login failure storms, with huge numbers of login attempts failing, which was traced down to a single comparison returning a large number of matches in the user login table.
The login being attempted with an input that only had characters not found in the code pages, so the linguistic comparison matched the SQL empty string “” and returned all the user login records (about 10% of the table) with an empty loginname field. The front end was using IsNullOrWhiteSpace as the test, so wasn’t blocking this input.
A very simple example of this is
declare @inputlogin nchar(100) = char(0) select * from user_login where loginname = @inputlogin or email = @inputlogin
which returns Gordon and Virgil.
Making it better
Recommendation – use a binary collation for cases like this where an exact match is required
Test for emptiness in SQL
One simple change is to expand the comparison to check whether the input matches an empty string
declare @inputlogin nchar(100) = char(0) select * from user_login where @inputlogin <> '' and ( loginname = @inputlogin or email = @inputlogin)
Be aware that there are many characters that don’t have a weight, so explicitly testing for each of them is not feasible.
Use an ordinal search
As Best Practices for Comparing Strings in .NET - .NET | Microsoft Learn calls out, it is recommended to use an ordinal comparison for security related activities. In SQL, this is done using a binary collection (BIN or BIN2), to ensure that the login or email is an exact match for the one that’s stored in the table - for example
declare @inputlogin nchar(100) = char(0) select * from user_login where loginname = @inputlogin COLLATE LATIN1_GENERAL_BIN2 or email = @inputlogin COLLATE LATIN1_GENERAL_BIN2
Note the syntax; only putting a single COLLATE at the end of the query doesn’t give the expected result, since it only changes the behaviour of the comparison directly in front of it.
This also takes care of potentially false-positive matches when using a Unicode comparison, such as the handling of soft-hyphen, where
declare @email1 nchar(100) = N'Ben.Solo' + NCHAR(0xAD) + N'Ortega@example.com' declare @email2 nchar(100) = N'Ben.Solo' + N'Ortega@example.com' select 'These are equal', @email1, @email2 where @email1 = @email2 collate Latin1_General_100_CI_AS_SC
returns
These are equal | Ben.Solo-Ortega@example.com | Ben.SoloOrtega@example.com |
Jovan’s post The difference between BIN2 and Case-Sensitive collations in SQL – SQLServerCentral has more detail.
Do note that a binary/ordinal match is exact; if the workload has some requirements for inexact matching, such as case-insensitiveness, that needs to be catered for, potentially with a non-unicode case-insensitive collation, or doing character substitution on the input (a -> A, etc)
NULL means NULL
In this example, a quirk in the application meant that for many of the login records, “no login name” was being put in the SQL table as an empty string, rather than NULL. If NULL has consistently been used to record “they don’t have a login name”, the number of matches seen during this issue would have been trivial.
Collations
So far, the examples assume that the environment is using a non-binary collation. If the server or database is set to use BIN or BIN2, then the above cases are moot, but generally applications will tend towards collations that give options like accent and case insensitive comparisons, so that when looking for a convention centre, “takina” will find Tākina. See Collation and Unicode support - SQL Server | Microsoft Learn for more.
In Windows
Similar options at the Windows level; examples using powershell, which uses Unicode (about_Comparison_Operators - PowerShell | Microsoft Learn);
$test_this = "`0" # check that the input isn't white space...both tests are false [string]::IsNullOrWhiteSpace($test_this) [string]::IsNullOrEmpty($test_this) # using the default, these are equal if ($test_this -eq "") { "equal - using defaults" } else {"diff" } # use an ordinal test with nothing discarded, these are different if ($test_this.Equals("",[System.StringComparison]::Ordinal)) { "equal" } else {"diff - ordinal test" } # testing with the current culture, these are equal if ($test_this.Equals("",[System.StringComparison]::CurrentCulture)) { "equal - current culture test" } else {"diff" }
As bedtime reading, the weighting tables used in the comparisons are available for download at https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings#the-details-of-string-comparison
Feedback and suggestions
If you have feedback or suggestions for improving this data migration asset, please contact the Databases SQL Ninja Engineering Team (datasqlninja@microsoft.com). Thanks for your support!
Note: For additional information about migrating various source databases to Azure, see the Azure Database Migration Guide.