By Jaya Mathew and Mithun Prasad, PhD
In our previous blog, we gave a brief introduction to machine translation and explored topics such as identifying the language of a text and translating or transliterating spoken or typed text using Microsoft’s Translator Text API. We also discussed how translated or transliterated text can be integrated into a LUIS app. In this blog, we highlight new language support coming to LUIS and provide tips on improving app performance when using languages that are in the preview phase.
At the time of writing the previous blog, Hindi was not natively supported in the LUIS portal. The portal now supports additional languages in Preview, including Hindi in its native script. A user can therefore create a new app with the culture set to ‘Hindi Indian (Preview)’, as shown in Figure-1, and then type Hindi utterances directly in the new app.
Figure-1: Creating an app with Hindi language
In the preview phase, however, some pre-built entities, such as URL, are not supported for native-language text, so the user might run into issues when trying to tag URLs, as shown in Figure-2:
Figure-2: URLs in Hindi native script
One way to work around this is to create a RegEx (Regular Expression) entity manually as shown in Figure-3:
Figure-3: RegEx workaround for pre-built entities
Generic entities such as phone numbers and URLs follow standard patterns, so they can be extracted with regular expressions. Example patterns for URLs written in native scripts are as follows:
Hindi:
((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0900-\u097F]+\.(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm|in|ai|is|कॉम|इन|ऑर्ग|नेट|येअइ))(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)
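It is important to note that, in order to cover the full range of characters, we use the Unicode range \u0900-\u097F (the Devanagari block) rather than listing native characters individually. The range also includes the dependent vowel signs and other combining marks that are easy to miss when the characters are typed by hand. The Tamil and Telugu patterns below use the ranges of their own Unicode blocks in the same way.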
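Before adding the pattern to a RegEx entity, it can help to try it locally. The following is a minimal sketch that runs the Hindi pattern above through Python's `re` module. The helper name `find_urls` and the sample utterance are ours, for illustration only, and the regex dialect that LUIS applies may differ from Python's, so treat this as a sanity check rather than a guarantee of how the entity will behave in the portal.

```python
import re
from typing import List

# Hindi URL pattern copied from the article. Python's re module is used here
# only for illustration; LUIS may apply a different regex dialect.
HINDI_URL = re.compile(
    r"((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0900-\u097F]+\."
    r"(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil"
    r"|iq|io|ac|ly|sm|in|ai|is|कॉम|इन|ऑर्ग|नेट|येअइ))"
    r"(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)"
)


def find_urls(utterance: str) -> List[str]:
    """Return every substring of the utterance that the pattern matches."""
    return [match.group(1) for match in HINDI_URL.finditer(utterance)]


# Hypothetical utterance mixing Hindi text with a URL in Devanagari script.
sample = "कृपया https://उदाहरण.कॉम/path खोलें"
print(find_urls(sample))
```

The surrounding Hindi words are not returned because they are not followed by a dot and one of the listed top-level domains.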
Tamil:
((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0B80-\u0BFF\.]+\.(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm|in|ai|is|காம்|ஒர்க்|இன்|எஐ|நிக்|ஸே))(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)
Telugu:
((?:(?:https?):\/\/)?(?:[a-zA-Z0-9\u0C00-\u0C7F\.]+\.(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm|in|ai|is|కామ్|ఇన్|ఆర్గ్|నెట్|ఐఎన్|ఏఐ))(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?)
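The three patterns share the same structure and differ only in the Unicode range and the native top-level domains. As a sketch, they could be generated from those two inputs. The function `build_url_pattern` and the constant `LATIN_TLDS` are hypothetical names introduced here, not part of LUIS. The sketch follows the Tamil and Telugu form, which allows dots inside the host name so that prefixes such as "www." are captured. The Hindi pattern above does not include the dot in that character class. As before, Python's `re` module stands in for whatever regex dialect LUIS applies.

```python
import re
from typing import Pattern, Sequence

# Latin top-level domains shared by all three patterns in the article.
LATIN_TLDS = [
    "com", "org", "edu", "gov", "uk", "net", "ca", "de", "jp", "fr", "au",
    "us", "ru", "ch", "it", "nl", "se", "no", "es", "mil", "iq", "io", "ac",
    "ly", "sm", "in", "ai", "is",
]


def build_url_pattern(script_range: str, native_tlds: Sequence[str]) -> Pattern[str]:
    """Build a URL pattern for one script (hypothetical helper).

    script_range is a regex range such as r"\u0B80-\u0BFF".
    native_tlds are the top-level domains written in that script.
    """
    tlds = "|".join(list(LATIN_TLDS) + list(native_tlds))
    host = rf"[a-zA-Z0-9{script_range}\.]+"      # host name, dots allowed
    path = r"(?:\/[a-zA-Z0-9@:%_\+.~#?&//=]*)?"  # optional path and query
    return re.compile(rf"((?:(?:https?):\/\/)?(?:{host}\.(?:{tlds})){path})")


# Tamil pattern rebuilt from the range and native TLDs used in the article.
tamil_url = build_url_pattern(
    r"\u0B80-\u0BFF", ["காம்", "ஒர்க்", "இன்", "எஐ", "நிக்", "ஸே"]
)

match = tamil_url.search("https://எடுத்துக்காட்டு.இன்/page")
print(match.group(1) if match else None)
```

Printing `tamil_url.pattern` gives a string that can be pasted into a RegEx entity, and supporting another script only requires its Unicode range and its native top-level domains.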
This adds a lot of flexibility when building a LUIS model in languages that are in preview and do not yet support the various prebuilt entities.
https://docs.microsoft.com/en-us/azure/cognitive-services/luis/
https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-language-support
https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-reference-prebuilt-url?tabs=V3