Hacking the RegEx entity within LUIS

Published Jan 10 2020 07:53 PM
Microsoft

By Jaya Mathew and Mithun Prasad, PhD

 

In our previous blog, we gave a brief introduction to machine translation and explored topics such as language identification and translating or transliterating spoken or typed text with Microsoft’s Translator Text API. We also discussed how translated or transliterated text can be integrated into a LUIS app. In this blog, we highlight new language support coming to LUIS and provide tips on improving app performance when using languages that are in the preview phase.

 

When the previous blog was written, Hindi was not natively supported in the LUIS portal. The portal now supports additional languages in preview, including Hindi in its native script. A user can therefore create a new app with the culture set to ‘Hindi Indian (Preview)’, as shown in Figure-1, and type Hindi utterances directly into it.

 


 

Figure-1: Creating an app with Hindi language

 

In the preview phase, however, some prebuilt entities, such as URL, are not supported for text in the native script, so users may run into issues when trying to tag URLs, as shown in Figure-2:

 

 


Figure-2: URLs in Hindi native script

 

One way to work around this is to create a RegEx (Regular Expression) entity manually as shown in Figure-3:

 


 

Figure-3: RegEx workaround for prebuilt entities

 

Generic entities such as phone numbers and URLs follow standard patterns, so they can be extracted with regular expressions. Example patterns for URLs written in native scripts are as follows:

 

Hindi:

 

((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0900-\u097F]+\.(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm|in|ai|is|कॉम|इन|ऑर्ग|नेट|येअइ))(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)

 

Note that to cover the full range of Devanagari characters, the pattern uses the Unicode range \u0900-\u097F rather than native character literals.
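Before adding a pattern to a LUIS app, it can help to try it on a few sample utterances locally. The following is a minimal sketch that runs the Hindi pattern above through Python's `re` module. This is an assumption made for quick experimentation: the regular expression engine used by LUIS may behave differently, and the helper name `extract_urls` and the sample utterances are hypothetical.

```python
import re

# Hindi URL pattern copied from the article. Python's `re` is used here only
# for local experimentation; it may not behave identically to the LUIS engine.
HINDI_URL_PATTERN = re.compile(
    r"((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0900-\u097F]+\."
    r"(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil"
    r"|iq|io|ac|ly|sm|in|ai|is|कॉम|इन|ऑर्ग|नेट|येअइ))"
    r"(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)"
)


def extract_urls(utterance):
    """Return every substring of the utterance that the pattern matches."""
    return [match.group(1) for match in HINDI_URL_PATTERN.finditer(utterance)]


# Hypothetical utterances, for illustration only.
latin_domain = "कृपया https://example.com/help खोलें"
native_domain = "मेरी साइट उदाहरण.in देखें"

print(extract_urls(latin_domain))
print(extract_urls(native_domain))
```

The Devanagari words without a dot and a top-level domain are skipped, while both the Latin-script URL and the domain written in Devanagari are picked up.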

 

Tamil:

 

((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0B80-\u0BFF\.]+\.(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm|in|ai|is|காம்|ஒர்க்|இன்|எஐ|நிக்|ஸே))(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)

 

Telugu:

 

((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0C00-\u0C7F\.]+\.(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm|in|ai|is|కామ్|ఇన్|ఆర్గ్|నెట్|ఐఎన్|ఏఐ))(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)
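The Tamil and Telugu patterns above share the same structure and differ only in the Unicode block range and the native top-level domains. As a sketch of how that structure could be reused for another preview language, the hypothetical helper `build_url_pattern` below assembles the pattern from those two inputs. The helper, the dictionary names, and the sample text are assumptions for illustration, and the result is checked with Python's `re`, which may differ from the engine LUIS uses.

```python
import re

# Latin top-level domains shared by the patterns in the article.
LATIN_TLDS = (
    "com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es"
    "|mil|iq|io|ac|ly|sm|in|ai|is"
)

# Unicode block ranges and native top-level domains taken from the article.
SCRIPT_RANGES = {"tamil": r"\u0B80-\u0BFF", "telugu": r"\u0C00-\u0C7F"}
NATIVE_TLDS = {
    "tamil": ["காம்", "ஒர்க்", "இன்", "எஐ", "நிக்", "ஸே"],
    "telugu": ["కామ్", "ఇన్", "ఆర్గ్", "నెట్", "ఐఎన్", "ఏఐ"],
}


def build_url_pattern(script_range, native_tlds):
    """Assemble a URL pattern with the same structure as the article's."""
    tlds = "|".join([LATIN_TLDS, *native_tlds])
    return (
        r"((?:(?:https?):\/\/)?"
        rf"(?:[a-zA-Z0-9{script_range}\.]+\.(?:{tlds}))"
        r"(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)"
    )


tamil_pattern = build_url_pattern(SCRIPT_RANGES["tamil"], NATIVE_TLDS["tamil"])

# Hypothetical sample: a Latin-script URL surrounded by Tamil words.
sample = "பார்க்க https://www.example.org/a?b=1 இங்கே"
match = re.search(tamil_pattern, sample)
print(match.group(1) if match else None)
```

The string returned by `build_url_pattern` can be pasted into the RegEx entity in the same way as the hand-written patterns.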

 

 

Using RegEx entities in this way adds a lot of flexibility when building a LUIS model in languages that are in preview and do not yet support all of the prebuilt entities.

 

References:

https://techcommunity.microsoft.com/t5/AI-Customer-Engineering-Team/Adding-multi-language-support-fo... 

https://docs.microsoft.com/en-us/azure/cognitive-services/luis/ 

https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-language-support 

https://docs.microsoft.com/en-us/azure/cognitive-services/luis/reference-entity-regular-expression?t... 

https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-reference-prebuilt-url?tabs=V3

Last update: Jan 10 2020 07:56 PM