This blog explains how to use a custom analyzer in an Azure Cognitive Search index and how to implement a tokenizer in a custom analyzer.
Before we start, you need to understand what an Azure Search index is.
You can follow the documents “Introduction to Azure Cognitive Search” and “Create an index - Azure Cognitive Search” for the details. We will not go further into the definitions here.
In a search index, if you need to analyze a string field, there are default analyzers such as the “Standard Lucene analyzer”. There are also custom analyzers, which offer many functions for analyzing a string field. The document “Add custom analyzers to string fields” explains the types of custom analyzer and how to write one.
A custom analyzer is written in JSON format inside the index definition JSON. The first part is about the analyzer:
As in the example below, one analyzer can include charFilters, a tokenizer, and tokenFilters.
"name":"name of analyzer",
"name":"name of analyzer",
The JSON includes character filters, a tokenizer, and token filters under “analyzers”.
Character filters process the text before it is tokenized, for example to handle characters such as spaces and dashes (-).
A tokenizer divides continuous text into a sequence of tokens.
Token filters are used to filter out or modify the tokens generated by a tokenizer. For example, you can specify a lowercase filter that converts all characters to lowercase.
Here we will focus on how to implement and test a tokenizer in a custom analyzer.
First you need to know which kind of token you want to pick out of your string field, for example values that follow a pattern like “a.123, A.345”. Normally, if you search for “A.123” it will search for “A” first, but if you treat “A.123” as a single token it will be read as one keyword.
You cannot change the analyzer of an existing field in an index to your new custom analyzer, so you need to create a new index or add a new field. Creating a new index or updating an existing index has to be done through the REST API, which I will explain later.
After deciding on the tokenizer, we add a field called “customfield” to the index. Copy the following into the “fields” section of the index definition.
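To make the three stages above more concrete, here is a small Python sketch that imitates them locally. It is only a conceptual illustration, not how Azure Cognitive Search implements analyzers; the function names and the choice of replacing dashes with spaces are assumptions made for this example.

```python
from typing import List


def char_filter(text: str) -> str:
    """Character filter stage: rewrite raw characters before tokenizing.

    Replacing dashes with spaces is an illustrative choice for this sketch.
    """
    return text.replace("-", " ")


def tokenize(text: str) -> List[str]:
    """Tokenizer stage: divide continuous text into a sequence of tokens."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter stage: modify the tokens produced by the tokenizer."""
    return [token.lower() for token in tokens]


def analyze(text: str) -> List[str]:
    """Run the stages in order: char filter, then tokenizer, then token filter."""
    return lowercase_filter(tokenize(char_filter(text)))


print(analyze("Azure-Search Custom ANALYZER"))
# prints ['azure', 'search', 'custom', 'analyzer']
```

The point of the sketch is the order of the stages: the tokenizer only ever sees text that the character filter has already rewritten, and the token filter only ever sees whole tokens.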
"name": "myanalzyer",
But this alone doesn’t satisfy our requirement to track “A.123”, so here we use “PatternTokenizer”.
The pattern is a regular expression that follows the rules of PatternTokenizer (Lucene 6.6.1 API) (apache.org). You can check the regular expression on another website first.
Below, the pattern is "[a-zA-Z]\.\d+|0\.\d*[1-9]\d*$"; write each “\” as a double “\” to escape it inside the JSON string. It is used to match all words made up of one letter, a “.”, and digits, such as “A.1234” and “c.1231”.
"name": "myparttern",
Now we need to create a new index, or update an existing one, with these analyzers. The REST API Create Index (Azure Cognitive Search REST API) is used to create or update an index. Both “POST” and “PUT” can create a new index; here we explain the “PUT” operation.
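As a rough sketch of what the request body for this operation could look like, the Python snippet below assembles the field, analyzer, and tokenizer described above into one index definition and serializes it to JSON, which also shows the doubled “\” escaping. The index name "my-index", the "id" key field, the field types, and the "group": 0 setting (intended to emit the matched text as tokens rather than splitting on it) are assumptions for illustration and are not taken from the original definition. The local regular expression check uses Python's re module, while the service uses Java regular expressions, so treat it as an approximation.

```python
import json
import re

# Pattern from this post, written as a raw string so each "\" stays literal.
PATTERN = r"[a-zA-Z]\.\d+|0\.\d*[1-9]\d*$"

# Local approximation only: Python's re module, not the service's Java regex.
print(re.findall(PATTERN, "this is test s.342 and t.879"))
# prints ['s.342', 't.879']

index_definition = {
    "name": "my-index",  # hypothetical index name
    "fields": [
        # Hypothetical key field; every index needs exactly one key.
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {
            "name": "customfield",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "myanalzyer",  # points at the custom analyzer below
        },
    ],
    "analyzers": [
        {
            "name": "myanalzyer",
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "tokenizer": "myparttern",
        }
    ],
    "tokenizers": [
        {
            "name": "myparttern",
            "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
            "pattern": PATTERN,
            "group": 0,  # assumption: emit whole matches as tokens
        }
    ],
}

# json.dumps doubles every backslash, which is the escaping mentioned above.
index_json = json.dumps(index_definition, indent=2)
print(index_json)
```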
PUT https://[servicename].search.windows.net/indexes/[index name]?api-version=[api-version]
If you need to update an existing index, add “&allowIndexDowntime=true” after api-version.
Content-Type : application/json
api-key: the key of the Search service in the Portal; we suggest using the secondary key or adding a new key.
Here is the index created in the step above. The “customfield” field has the custom analyzer.
After adding the tokenizer, you can validate it with the REST API Analyze Text (Azure Cognitive Search REST API).
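The PUT request described above could be prepared in Python roughly as follows. This is a minimal sketch using only the standard library; the function name and its parameters are made up for this example, and the API version is left as an argument because the post does not fix one.

```python
import json
import urllib.request


def build_put_index_request(service_name, index_name, api_version, api_key,
                            index_definition, allow_index_downtime=False):
    """Prepare (but do not send) the PUT request that creates or updates an index."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}?api-version={api_version}"
    )
    if allow_index_downtime:
        # Needed when updating an existing index with new analyzers.
        url += "&allowIndexDowntime=true"
    return urllib.request.Request(
        url,
        data=json.dumps(index_definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )
```

Passing the returned object to urllib.request.urlopen would send it; keeping the build step separate makes it easy to inspect the URL and headers before anything is changed on the service.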
POST https://[service name].search.windows.net/indexes/[index name]/analyze?api-version=[api-version]
You can use the same key and headers as with the REST API above.
In the request body, write a test sentence and use my tokenizer “myparttern”.
"text": "this is test s.342 and t.879",
And this is the result. You can see that it picks out the matching tokens in the sentence.
So that means my tokenizer recognizes words that match the specified pattern.
This is the last step of my test in this blog. We need to confirm whether this token can be highlighted in search results. The document “Hit highlighting” explains the highlight query and result; please read it for more details.
Below is my test data:
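The Analyze Text call above can also be scripted. The sketch below prepares the POST request with the test sentence and the “myparttern” tokenizer, and includes a small helper that pulls the token strings out of a response. The function names are hypothetical, and the helper assumes the response carries a "tokens" list whose items have a "token" property.

```python
import json
import urllib.request


def build_analyze_request(service_name, index_name, api_version, api_key,
                          text, tokenizer):
    """Prepare (but do not send) the Analyze Text POST request."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/analyze?api-version={api_version}"
    )
    body = {"text": text, "tokenizer": tokenizer}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def token_texts(analyze_response):
    """Return only the token strings from a parsed Analyze Text response."""
    return [item["token"] for item in analyze_response.get("tokens", [])]
```

A call such as build_analyze_request(service, index, version, key, "this is test s.342 and t.879", "myparttern") mirrors the request body shown earlier.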
“test 1234 test tr324df w
a.1234 test test
b.4523 test test
c.678 test test
When searching the index with highlighting, the “content” field recognizes “a.1234” as separate pieces, while in “customfield” the highlight hits the whole “a.1234”. We can see this in the result below, shown as <em>a.1234</em>.
The same applies to the search result for “b.4523”.
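For anyone who wants to script this check, here is a minimal sketch of a highlight query body and a helper that collects the highlighted fragments from a parsed search response. The function names are hypothetical; the sketch assumes the query is posted to the index's docs/search endpoint and that each hit carries an "@search.highlights" object keyed by field name, as described in the Hit highlighting document.

```python
def build_highlight_query(search_text, field):
    """Build a search request body that searches and highlights one field."""
    return {
        "search": search_text,
        "searchFields": field,  # restrict matching to this field
        "highlight": field,     # ask for highlighted fragments from it
    }


def collect_highlights(search_response, field):
    """Gather highlighted fragments for one field across all hits."""
    fragments = []
    for document in search_response.get("value", []):
        highlights = document.get("@search.highlights", {})
        fragments.extend(highlights.get(field, []))
    return fragments
```

Running the same query once against “content” and once against “customfield” makes the difference between the two analyzers easy to compare side by side.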
You can follow these steps to test in your environment.
In conclusion, by using a custom analyzer with a pattern tokenizer you can pick out the specific words you have marked in your search documents. These tokens can be used as tags or keywords to search your documents, which makes it easier to get the results you want efficiently!