Announcing new multi-modal capabilities with Azure AI Speech
Published May 21 2024 08:30 AM

Customers continue to innovate with Azure OpenAI and Azure AI Speech. They are bringing new efficiencies into their enterprise and building new multimodal experiences for their customers. We are seeing a variety of use cases including call analytics, medical transcription, captioning, chatbots and more. At Azure AI, we continue to work with customers and bring new innovations to the market.


Here are all the multimodal innovations, specifically including speech and text, that we are announcing at Microsoft Build this year.


Speech analytics 

Today, we are announcing Speech analytics in preview. Speech analytics is a new service in Azure AI Studio that combines Azure AI services and PromptFlow to automatically process and analyze audio data simply by uploading it to cloud storage. With Speech analytics, it is easy to gain insights into call center conversations or to extract a conversation summary: AI models from Azure OpenAI and Azure AI Language analyze the accurate transcriptions generated by Azure AI Speech. Gaining insights from call center conversations allows businesses to better understand customer needs, product feedback, and support trends, and to improve the customer experience. Using our post-call analytics template, customers can quickly set up common insights such as call summaries, customer sentiment, and key topics. Customers who want to go beyond these out-of-the-box insights can modify the default prompt to extract additional insights, or modify the full prompt flow to customize the analytics and extract a wide range of information, for example discussion highlights or predictions of possible conversation flows. With Speech analytics, it is also easy to customize support for multiple languages, accents, domains, and scenarios, and to scale to large production use.

Speech analytics is helping our customers gain insights into customer conversations and improve their customer experience, sales, and marketing strategies. It is also a stepping stone toward multimodal data analysis, which will enable richer and deeper insights from different types of data in the future.



Here is an example suite of technologies that Speech Processing Solutions (Philips Dictation) is building using Azure AI services, including Speech analytics:






Speech analytics will be available for developers to try out in June. To learn more, try it out in the Azure AI Studio.


Fast transcription

Today, we are also announcing the Fast Transcription API in preview. The API, part of the Azure AI Speech family, transcribes audio files of up to 200 MB in seconds through a simple synchronous REST call. Customers want to enable scenarios where obtaining the transcript quickly is paramount: they want the transcript as soon as an interview finishes or a phone call completes, for instance. This API is a game changer for transcription at large. It can transcribe up to 40x faster than real time without sacrificing accuracy, producing, for example, a transcript of a 10-minute audio file in 15 seconds. The API provides a simple but powerful way to transcribe audio and opens the door to a new set of scenarios, one of which is ‘agent note taking’ within call centers.
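To make the numbers above concrete, here is a minimal Python sketch of the arithmetic: it checks a file against the 200 MB limit and estimates the best-case transcription time from the "up to 40x faster than real time" figure. The function names are hypothetical, the 200 MB limit is assumed to mean decimal megabytes, and because 40x is an upper bound on speed, the estimate is a best case rather than a guarantee. It does not call the service.

```python
# Assumption: "200MB" is read as decimal megabytes (200 * 10^6 bytes).
SIZE_LIMIT_BYTES = 200 * 1000 * 1000

# "Up to 40x faster than real time" is the best case quoted in the announcement.
MAX_SPEEDUP = 40.0


def within_size_limit(size_bytes: int, limit_bytes: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True if a non-empty file fits under the stated size limit."""
    return 0 < size_bytes <= limit_bytes


def best_case_transcription_seconds(audio_seconds: float,
                                    speedup: float = MAX_SPEEDUP) -> float:
    """Estimate the shortest expected transcription time for a recording."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    ten_minutes = 10 * 60
    # 600 s of audio / 40 = 15 s, matching the example in the text.
    print(best_case_transcription_seconds(ten_minutes))
    print(within_size_limit(150_000_000))
```

A caller could run the size check before uploading and use the time estimate to set a request timeout with a generous margin.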


Efficient note taking

A typical agent working in a call center spends 3 to 5 minutes after each call creating notes. The Fast Transcription API, in combination with Azure OpenAI Service, can automate this task, giving thousands of hours of work back to the call center. Medical practitioners who record conversations with patients can analyze these recordings in seconds. Similarly, media and content creators can analyze and extract insights from podcasts or interviews as soon as they complete.
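As a rough illustration of how 3 to 5 minutes per call adds up to thousands of hours, the following Python sketch computes the yearly note-taking time for a call center. Only the 3 to 5 minute range comes from the text; the agent count, calls per day, and working days are hypothetical figures chosen for the example, and the result is an upper bound that assumes note taking is automated entirely.

```python
def note_taking_hours(agents: int,
                      calls_per_agent_per_day: int,
                      minutes_per_call: float,
                      working_days: int) -> float:
    """Total hours spent on after-call notes over the given period."""
    total_minutes = agents * calls_per_agent_per_day * minutes_per_call * working_days
    return total_minutes / 60


if __name__ == "__main__":
    # Hypothetical call center: 100 agents, 30 calls per agent per day,
    # 250 working days per year.
    low = note_taking_hours(100, 30, 3, 250)   # 3 minutes of notes per call
    high = note_taking_hours(100, 30, 5, 250)  # 5 minutes of notes per call
    print(f"Note-taking time per year: {low:,.0f} to {high:,.0f} hours")
```

With these assumed figures, the range works out to 37,500 to 62,500 hours per year.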


IntelePeer simplifies communications automation through advanced AI-powered solutions, helping businesses and contact centers reduce costs and enrich the customer experience.

"The performance of Microsoft’s FAST API for offline transcription far supersedes the competition. When comparing the same sample corpus, FAST API performed the best among the alternative services tested. It shined on low quality audio transcription, delivering results 70% better than other vendors." - Sergey Galchenko, CTO, IntelePeer.


Parloa, a software development company building a contact center AI platform for the next generation of customer service in enterprises, has been using the Fast Transcription API in private preview.

"FAST Transcription API provides the fastest, most accurate and most cost-effective option in the Transcription market" -- CTO, Parloa


OPPO, a global technology brand known for its innovative smartphones and smart devices, is using Azure AI speech-to-text, Fast Transcription, and Azure AI text-to-speech to pilot new customer experiences on its new AI phone. Read this blog to learn more.


The Fast Transcription API will be available to developers starting in June 2024. Stay tuned for more.


Video Translation

Today, we are announcing the preview availability of Video Translation, a groundbreaking service designed to transform the way businesses localize their video content. The new service offers developers an efficient and seamless way to meet the rising demand for translated video content and to overcome language barriers, allowing content owners to reach a broader audience. Whether it's for educational videos, marketing campaigns, or entertainment content, Video Translation ensures your message is heard in any of the supported languages.



The service enables developers to translate content in 10 language pairs, either with prebuilt neural voices and content editing features or with the personal voice capability, which is a limited access feature. Learn more about Video Translation in the studio and try it out with your own videos.


Vimeo is on a mission to simplify making, managing, and sharing video, all in a single, easy-to-use platform.

"Vimeo has been working closely with Microsoft video translation and is excited about the use cases it will unlock for customers worldwide." 

- Ashraf Alkarmi - Vimeo Chief Product Officer


Read this blog to learn more about video translation.


Multi-lingual speech translation

We are also announcing new speech translation enhancements in Azure AI Speech: multiple language detection, which detects language switches among the supported languages within the same audio stream; automatic language detection, which eliminates the need for developers to specify input languages; and integrated custom translation, which adapts the translation to your domain-specific vocabulary.

With these capabilities, developers no longer need to specify the input language, can handle language switches within the same session, and support live streaming translations into target languages.


This capability is especially helpful for captioning use cases. Captioning is the act of adding text to audio or video content to make it more accessible and comprehensible for people who have hearing difficulties or who speak a different language. Captioning is not only a legal obligation in many countries, but also a social duty and a good practice for inclusion. Content creators can now attract a broader and more diverse audience and improve user experience and engagement with little extra effort.


Check out how iTourTranslator has integrated multi-lingual speech translation in their AR glasses.


Read this blog to learn more about multi-lingual speech translation.



Announcing general availability of personal voice

Another aspect of our Speech service is the natural voices it offers. Customers use the platform to create realistic and natural-sounding voices for avatars, chatbots, and IVRs. With Azure AI Speech, you can either use an existing voice model, choosing from a wide variety of voices and styles, or create your own custom voice using your own data and recordings.


Today, we are also announcing the general availability of a new personal voice feature in Azure AI Speech. It is available as a limited access feature to ensure appropriate guardrails and avoid misuse. This feature allows users to create an AI voice in a few seconds by providing just a short speech sample as the audio prompt. It can be used for various use cases, such as personalizing the voice experience for a chatbot or translating video content into different languages with the actor’s native voice.

Read this blog to learn about customer examples and demos, and the responsible AI practices that are implemented, such as watermarking and usage policies.


In conclusion, our powerful and versatile platform helps customers combine speech input and output, as a modality, with other AI capabilities. This enables developers to create high-quality workloads for new scenarios. Whether you need insights into human conversations, live or recorded captions, or realistic and natural-sounding voices for your avatars, chatbots, or IVRs, Azure AI helps customers deliver fast, reliable, and customizable solutions.

Last update: May 22 2024 05:35 AM