We are pleased to announce that Speech to Text and Text to Speech containers from Azure Cognitive Services are now Generally Available (GA). Using these containers, customers can build a speech application architecture that is optimized for both robust cloud capabilities and edge locality.
With Speech to Text in containers, businesses across industries have unlocked new productivity gains and insights by enabling real-time and batch transcription of audio streams into text. With Text to Speech, customers can enable applications, tools, or devices to convert text into human-like synthesized speech.
Organizations in banking, telecom, aerospace, and defense use speech containers to address business needs including call center transcription and analytics, self-paced learning tools, and intelligent kiosks. Azure is the only cloud provider giving customers full flexibility to run artificial intelligence on their own terms, whether on-premises or at the edge.
The goal of this post is to show how our customers leverage containers to solve AI needs at the edge.
Airbus is an international leader in the aerospace sector. They design, manufacture and deliver industry-leading commercial aircraft, helicopters, military transports, satellites and launch vehicles, as well as providing data services, navigation, secure communications, urban mobility and other solutions for customers on a global scale.
With Azure Cognitive Services, Airbus is advancing its aerospace operations, specifically its pilot training chatbot. By integrating Azure AI speech recognition and synthesis capabilities, Airbus is able to engage and educate pilot staff with the most up-to-date details and safe practices. Watch this video for more details:
Airbus trains tens of thousands of commercial and military aircraft pilots annually. The pain point in pilot training is driven by the complexity of modern commercial and military aircraft. In recent years, aircraft complexity has increased at such a rate that the scope of knowledge required to operate aircraft reliably and safely has grown exponentially.
The volume of material that makes up a pilot training course is rapidly approaching levels that are difficult for trainees to retain with acceptable recall and accuracy.
The average conversion course for an experienced pilot moving to a new aircraft platform exceeds 7,000 pages of printed documentation. This content must be reviewed, committed to memory, and recalled with very high accuracy, not only during the 10 to 12 week conversion training course but throughout the entire operational life of the pilot. The Airbus pilot training chatbot has been developed on an enterprise chatbot platform and is being enhanced with Azure speech service capabilities.
The objective of the pilot training chatbot is to provide pilot trainees with an alternative method for review, revision and self-paced learning. The pilot training chatbot is not designed to replace human flight instructors but rather looks to extend the coverage and access to their existing instructor knowledge base and supplements already developed standardized training methods. The chatbot is used to test knowledge areas for recall and accuracy.
Technical Challenges - The technical challenge for this project was to avoid depending on any public cloud services because, although it initially focuses on civil aircraft, the project aims to support military and governmental aircraft types as well.
This requires a disconnected and, in some cases, air-gapped style of deployment. The heart of the chatbot is implemented in an on-premises enterprise chatbot platform. It has a powerful conversation engine but does not include speech technologies. These can, however, be integrated into the conversation using the APIs of speech technology services.
The challenge of integrating a voice interface was addressed using the Cognitive Services speech containers. The containers have been deployed on a Kubernetes cluster running in a secured environment, ensuring flexibility and ease of deployment.
The chatbot connects to the container's API using the Speech SDK. It forwards the user's speech input to the on-premises Speech to Text container for conversion to text, and responds to the user in either text or voice, depending on the user’s settings. If the user chooses full speech mode, the chatbot vocalizes answers by calling the Text to Speech container's API to receive an audio file, which is sent to the user’s device for playback. Since all APIs are in the same environment, latency is not an issue and the interaction feels natural and fast.
The following graph gives a short overview of the communication flow:
The UI for the pilot training chatbot is a JavaScript Web UI incorporating HTML chat window display and controls. Once the Speech to Text container was deployed within the Kubernetes environment, the next step was to integrate Speech to Text into the chatbot Web UI.
For this step, the sample JavaScript code from the Azure Cognitive Service Speech SDK proved to be a really useful resource. It provides code templates for Airbus to update and get started with connectivity to the STT Container, using the Microsoft Speech SDK JavaScript library.
Once connectivity to the STT Kubernetes service was established, configured and tested with the STT settings and Speech SDK, all the required functions transferred directly into the chatbot JavaScript code. Utilizing the Speech SDK examples for the initial testing and configuration before integration into the target UI really saved time to achieve the final goal of integration of Speech capabilities for the chatbot.
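As an illustration of the connectivity step, the following is a minimal sketch of how a recognizer might be pointed at a Speech to Text container rather than a cloud region. It is not the Airbus code: the function name createContainerRecognizer, the host address, and the language are assumptions made for this example, and the Speech SDK module is passed in as a parameter (in the browser this would typically be window.SpeechSDK).

```javascript
// Sketch only: build a recognizer that talks to a self-hosted
// Speech to Text container. `sdk` is the Speech SDK module;
// containerHost and language are illustrative assumptions.
function createContainerRecognizer(sdk, containerHost, language) {
  // fromHost targets a specific host instead of a cloud region.
  const speechConfig = sdk.SpeechConfig.fromHost(new URL(containerHost));
  speechConfig.speechRecognitionLanguage = language;

  // Capture audio from the user's default microphone.
  const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
  return new sdk.SpeechRecognizer(speechConfig, audioConfig);
}

// Example call (hypothetical in-cluster service address):
// const reco = createContainerRecognizer(SpeechSDK, "ws://stt.example.internal:5000", "en-GB");
```

Keeping the host in a single parameter makes it straightforward to switch between a local test container and the Kubernetes service once testing is complete.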
The recognizeOnceAsync function from the sample was used, as the chatbot requires transcription of a single utterance, after which recognition stops to await the reply from the pilot training chatbot.
The startContinuousRecognitionAsync function was not required, but it can be used when transcription needs to continue until explicitly stopped, e.g. in dictation use cases.
reco.recognizeOnceAsync(
    function (result) {
        // Log the raw result and show why recognition ended.
        window.console.log(result);
        statusDiv.innerHTML += "(continuation) Reason: " + SpeechSDK.ResultReason[result.reason];
        switch (result.reason) {
            case SpeechSDK.ResultReason.RecognizedSpeech:
                statusDiv.innerHTML += " Text: " + result.text;
                break;
            case SpeechSDK.ResultReason.NoMatch:
                var noMatchDetail = SpeechSDK.NoMatchDetails.fromResult(result);
                statusDiv.innerHTML += " NoMatchReason: " + SpeechSDK.NoMatchReason[noMatchDetail.reason];
                break;
            case SpeechSDK.ResultReason.Canceled:
                var cancelDetails = SpeechSDK.CancellationDetails.fromResult(result);
                statusDiv.innerHTML += " CancellationReason: " + SpeechSDK.CancellationReason[cancelDetails.reason];
                if (cancelDetails.reason === SpeechSDK.CancellationReason.Error) {
                    statusDiv.innerHTML += ": " + cancelDetails.errorDetails;
                }
                break;
        }
        statusDiv.innerHTML += "\r\n";
        // Show the transcribed text and reset the UI for the next utterance.
        phraseDiv.innerHTML = result.text + "\r\n";
        sdkStopRecognizeOnceAsyncBtn.click();
    },
    function (err) {
        // Error callback: report the failure and reset the UI.
        window.console.log(err);
        phraseDiv.innerHTML += "ERROR: " + err;
        sdkStopRecognizeOnceAsyncBtn.click();
    });
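For the dictation-style scenario mentioned above, continuous recognition could be wired up roughly as follows. This is a sketch under assumptions, not part of the Airbus chatbot: startDictation and the onText callback are hypothetical names, and the recognizer is assumed to be an already configured SpeechRecognizer from the Speech SDK.

```javascript
// Sketch only: transcribe continuously until the returned stop()
// function is called. `sdk` is the Speech SDK module, `recognizer`
// an already configured SpeechRecognizer, and `onText` a
// hypothetical callback that receives each finalized utterance.
function startDictation(sdk, recognizer, onText) {
  // `recognized` fires once per finalized utterance.
  recognizer.recognized = function (sender, event) {
    if (event.result.reason === sdk.ResultReason.RecognizedSpeech) {
      onText(event.result.text);
    }
  };
  recognizer.startContinuousRecognitionAsync();

  // Hand back a function the UI can call to end the session.
  return function stop() {
    recognizer.stopContinuousRecognitionAsync();
  };
}
```

Unlike recognizeOnceAsync, this keeps the microphone session open across utterances, so the UI needs an explicit stop control.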
Airbus used the Azure Cognitive Service Speech SDK to get started with Text to Speech as soon as the container was deployed on their Kubernetes infrastructure. The team selected Hazel UK as the voice for the initial speech synthesis tests due to its clarity of pronunciation, but neural speech synthesis is of huge interest to try in the future.
For speech synthesis within the chat UI, the application calls the Text to Speech container’s REST API directly without using the Speech SDK library, as the service call requires less configuration than Speech to Text.
For longer responses from the chatbot (more than 100 characters), the team sets a lower prosody rate of 0.9 (<prosody rate="0.9">). This ensures the response is not too fast for the pilot, giving them time to process the response from the chatbot. In addition, end-of-sentence full stops and commas are given longer pauses (<break time="600ms"/>). Again, this allows the pilot sufficient time to process the chatbot's reply.
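A direct REST call of this kind might look like the sketch below. The base URL is an assumption, and the path, content type, and output format header reflect the publicly documented Text to Speech container endpoint as best understood here, so they should be checked against the deployed image. The function names buildSynthesisRequest and synthesize are made up for this example.

```javascript
// Sketch only: build the HTTP request for a Text to Speech container.
// baseUrl is an assumption (e.g. an in-cluster service address).
function buildSynthesisRequest(baseUrl, ssml) {
  return {
    url: new URL("/speech/synthesize/cognitiveservices/v1", baseUrl).href,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/ssml+xml",
        // Request a WAV (RIFF) payload that browsers can play directly.
        "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm",
      },
      body: ssml,
    },
  };
}

// Send the request and return the audio as a Blob for playback.
async function synthesize(baseUrl, ssml) {
  const { url, options } = buildSynthesisRequest(baseUrl, ssml);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error("Synthesis failed with HTTP " + response.status);
  }
  return response.blob();
}
```

In a browser, the returned Blob can be passed to URL.createObjectURL and played through an Audio element.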
For example, the sentence: “MTOW is an abbreviation for Maximum Takeoff Weight, which defines the maximum weight at which a pilot is allowed to attempt to take off, due to structural or other limits.”
Would translate to:
“<say-as interpret-as="characters">MTOW</say-as> is an abbreviation for Maximum Takeoff Weight, <break time="600ms"/> which defines the maximum weight at which a pilot is allowed to attempt to take off, due to structural or other limits.”
The word “MTOW” is spoken as individual letters, and an emphasized pause is included after the first comma.
Measuring key results - As an initial launch platform, the pilot training chatbot is targeting the Airbus A330 MRTT, a military refueling tanker version of the standard Airbus A330 airliner.
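The text preparation rules described above can be sketched as a small helper. This is an illustration under stated assumptions rather than the Airbus implementation: toSsmlBody and the spellOut list are hypothetical, abbreviations are assumed to be plain uppercase letters or digits, and the sketch adds the pause after every comma or full stop that is followed by more text. The result still needs to be wrapped in the usual speak and voice elements before it is sent to the container.

```javascript
const LONG_RESPONSE_THRESHOLD = 100; // characters, per the rule described above
const PAUSE = '<break time="600ms"/>';

// Escape characters that are not allowed in XML text content.
function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// spellOut: abbreviations to read letter by letter (hypothetical list,
// assumed to contain only uppercase letters or digits).
function toSsmlBody(text, spellOut) {
  let body = escapeXml(text);

  // Wrap each listed abbreviation so it is spoken as individual letters.
  for (const abbreviation of spellOut) {
    const pattern = new RegExp("\\b" + abbreviation + "\\b", "g");
    body = body.replace(
      pattern,
      '<say-as interpret-as="characters">' + abbreviation + "</say-as>"
    );
  }

  // Add a longer pause after commas and full stops followed by more text.
  body = body.replace(/([.,])\s+/g, "$1 " + PAUSE + " ");

  // Slow down long responses.
  if (text.length > LONG_RESPONSE_THRESHOLD) {
    body = '<prosody rate="0.9">' + body + "</prosody>";
  }
  return body;
}
```

Calling toSsmlBody with the MTOW sentence and ["MTOW"] as the spell-out list yields the say-as wrapper and a pause after each comma, enclosed in the slower prosody element because the sentence is longer than 100 characters.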
Results and feedback have been extremely positive with both the instructor and trainee communities eagerly offering recommendations for content enhancements to supplement existing course material. Users feel the combination of AI and speech capability will bring about an enhanced learning experience, making for more efficient aircraft operation and safer skies.
Deploying your first container takes about as long as a 2-minute read: you create a resource in the Azure portal, download the image, and run the container with the required environment variables. Here's a document to help you get started on running containers.
Containers available from Azure Speech Service are:
Cognitive Services containers
Get started and take advantage of Azure Cognitive Services containers to build intelligent applications today.