One of the most requested features for Azure Cognitive Search has been to allow for more flexible filtering, faceting, and sorting. With the addition of normalizers (preview), that’s now possible!
Filtering, faceting, and sorting in search engines can be rigid operations: by default, they're case sensitive and can be affected by extra white space or special characters. For example, if you added a filter such as $filter=City eq 'New York' to a query, it would narrow your search results to only the documents where the City field is 'New York'. The problem is that any documents in your search index with 'NEW YORK' in all capital letters as the City wouldn't be included in the results, because 'New York' and 'NEW YORK' differ in casing.
There are similar challenges with faceting and sorting. For example, without normalizers, sorting is case sensitive and uppercase letters come before lowercase letters. That means that if you sorted results alphabetically, 'Seattle' would come before 'dallas' even though `d` comes before `s` in the alphabet.
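To see that sorting quirk concretely, here is a minimal Python sketch (not Azure-specific) comparing default code-point ordering with a case-insensitive sort:

```python
# Plain lexicographic sorting compares character code points, so every
# uppercase letter (A-Z, 65-90) sorts before every lowercase letter (a-z, 97-122).
cities = ["Seattle", "dallas"]

print(sorted(cities))                 # ['Seattle', 'dallas']
print(sorted(cities, key=str.lower))  # ['dallas', 'Seattle']
```

The lowercase normalizer gives you the second, more natural ordering without changing the stored values.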
That’s where normalizers come in—they allow you to pre-process text in fields marked as filterable, facetable, or sortable so these operations aren’t affected by small differences in your search index.
Conceptually, normalizers are similar to analyzers, except that analyzers split text into multiple tokens and are used to make fields searchable, whereas normalizers apply to the sortable, filterable, and facetable attributes of a field and always emit a single token.
You can add a normalizer to a field in your index in the same way you add an analyzer today:
{
  "name": "City",
  "type": "Edm.String",
  "sortable": true,
  "searchable": true,
  "filterable": true,
  "facetable": true,
  "analyzer": "en.microsoft",
  "normalizer": "lowercase"
}
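As a sketch, here's how you might send that field definition to the preview REST API from Python. The service name, index name, and key are hypothetical placeholders; the request is built with the standard library but not actually sent:

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your own service name, index name, and admin key.
SERVICE = "my-service"
INDEX = "my-index"
API_KEY = "<admin-api-key>"

# Field definition from the article, including the lowercase normalizer.
field = {
    "name": "City",
    "type": "Edm.String",
    "sortable": True,
    "searchable": True,
    "filterable": True,
    "facetable": True,
    "analyzer": "en.microsoft",
    "normalizer": "lowercase",
}

# Build (but don't send) the PUT request that updates the index definition.
# Normalizers require the preview API version.
url = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
       "?api-version=2021-04-30-Preview")
index_definition = {"name": INDEX, "fields": [field]}
request = urllib.request.Request(
    url,
    data=json.dumps(index_definition).encode("utf-8"),
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    method="PUT",
)
```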
The best way to see how a normalizer processes text is with the Analyze Text API. If we wanted to see how the “lowercase” normalizer would process the text “New York” we could use the following API call:
POST https://[service name].search.windows.net/indexes/[index name]/analyze?api-version=2021-04-30-Preview
{
  "text": "New York",
  "normalizer": "lowercase"
}
The output will look like this:
{
  "tokens": [
    {
      "token": "new york",
      "startOffset": 0,
      "endOffset": 8,
      "position": 0
    }
  ]
}
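That response shape is easy to reproduce locally. This hedged sketch (not part of any Azure SDK) mimics what the lowercase normalizer does: the entire input becomes one lowercased token spanning the whole string.

```python
def lowercase_normalizer(text: str) -> dict:
    """Mimic the Analyze Text API response for the 'lowercase' normalizer:
    the whole input becomes a single lowercased token."""
    return {
        "tokens": [{
            "token": text.lower(),
            "startOffset": 0,
            "endOffset": len(text),
            "position": 0,
        }]
    }

print(lowercase_normalizer("New York"))
# {'tokens': [{'token': 'new york', 'startOffset': 0, 'endOffset': 8, 'position': 0}]}
```

Note there is exactly one token, unlike an analyzer, which would split "New York" into two.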
After adding a lowercase normalizer to the City field, if you were to apply that same filter, $filter=City eq 'New York', all results where the City is set to 'NEW YORK' (or any other casing) will also be returned by your query. Similarly, sorting will now return results in a more natural order, where 'dallas' comes before 'Seattle' as most users would expect.
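In effect, both sides of the comparison are normalized before matching. A rough Python analogy (the documents and field values are made up for illustration):

```python
docs = [{"City": "New York"}, {"City": "NEW YORK"}, {"City": "Dallas"}]

# Without a normalizer: case-sensitive equality misses 'NEW YORK'.
strict = [d for d in docs if d["City"] == "New York"]

# With the lowercase normalizer: both the field value and the filter
# value are lowercased before comparing, so every casing matches.
normalized = [d for d in docs if d["City"].lower() == "New York".lower()]

print(len(strict))      # 1
print(len(normalized))  # 2
```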
The lowercase normalizer is just one of the options available. There are several predefined normalizers, or you can create a custom normalizer if your use case requires it.
You can learn more about normalizers in Text normalization for case-insensitive filtering, faceting and sorting. You can also test out normalizers using the Analyze Text API.
If you’re not familiar with analyzers or normalizers, I’d recommend walking through this tutorial on creating a custom analyzer to get a feel for how they work—they’re a critical part of any full-text search engine and can help you build an awesome search experience.