Azure AI services Blog

Latest updates to the Azure AI Speech Service

HeikoRa
Nov 19, 2024

Today at Microsoft Ignite, we are excited to announce the latest updates to Azure AI Speech! This article provides a summary of all the new and recent releases.

Azure AI Content Understanding - Post Call Analytics, Public Preview

Azure AI Content Understanding offers powerful capabilities for businesses to transform diverse data formats into actionable insights. Notably, the service supports the ingestion and processing of audio data by combining highly accurate transcription through Azure AI Speech with generative AI, making it ideal for post-call analytics of call center recordings. By generating transcripts, summaries, and highlights from audio inputs, the service enhances the efficiency and quality of customer interactions and decision-making processes. Leveraging the latest generative AI and integration within Azure's ecosystem, it delivers accurate, contextually relevant insights such as call summaries, call reasons, and sentiment, while reducing costs and streamlining workflows across speech and generative AI. Learn more at aka.ms/content-understanding-launch-blog.
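To make the post-call analytics flow concrete, here is a minimal Python sketch of submitting a call recording for analysis and polling for the result. The endpoint shape, analyzer name, and API version below are assumptions for illustration, not the documented contract; see the Content Understanding documentation for the exact request schema.

```python
import time
import requests

# Illustrative values -- endpoint, key, analyzer ID, and API version are
# assumptions; consult the Content Understanding docs for the real contract.
ENDPOINT = "https://<your-ai-services-resource>.services.ai.azure.com"
KEY = "<your-ai-services-key>"
ANALYZER_ID = "prebuilt-callCenter"  # hypothetical prebuilt analyzer name
HEADERS = {"Ocp-Apim-Subscription-Key": KEY}

# Submit a call recording (referenced by URL) for analysis.
resp = requests.post(
    f"{ENDPOINT}/contentunderstanding/analyzers/{ANALYZER_ID}:analyze",
    params={"api-version": "2024-12-01-preview"},  # assumed preview version
    headers=HEADERS,
    json={"url": "https://<account>.blob.core.windows.net/calls/call1.mp3?<sas>"},
)
resp.raise_for_status()
operation_url = resp.headers["Operation-Location"]  # URL to poll for the async result

# Poll until the analysis completes; the result carries the transcript,
# summary, and extracted fields such as call reason and sentiment.
while True:
    result = requests.get(operation_url, headers=HEADERS).json()
    if result.get("status") in ("Succeeded", "Failed"):
        break
    time.sleep(5)
print(result)
```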

Fast Transcription API, General Availability

The Fast Transcription API is now generally available, providing fast audio-file-to-text conversion that can transcribe, for example, a 10-minute audio file in 15 seconds. This API is well suited for scenarios like transcription of call recordings, voicemail transcription, video captioning/subtitling, and more. Fast Transcription now features expanded locale coverage as well as enhancements like speaker diarization and language identification, and is available in additional Azure service regions. Learn more at aka.ms/fast-transcription.
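As a quick illustration, here is a minimal Python sketch of a fast-transcription request with speaker diarization enabled. The key, region, file name, and option values are placeholders, and the API version shown is assumed; check the Fast Transcription documentation for the current version and request schema.

```python
import json
import requests

# Placeholder key, region, and file name -- replace with your own values.
SPEECH_KEY = "<your-speech-resource-key>"
REGION = "eastus"

url = f"https://{REGION}.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe"

# Transcription options: locale(s) to consider and optional speaker diarization.
definition = {
    "locales": ["en-US"],
    "diarization": {"enabled": True, "maxSpeakers": 2},
}

with open("call-recording.wav", "rb") as audio_file:
    response = requests.post(
        url,
        params={"api-version": "2024-11-15"},  # assumed GA API version
        headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY},
        files={
            "audio": audio_file,
            "definition": (None, json.dumps(definition), "application/json"),
        },
    )

response.raise_for_status()
result = response.json()
print(result["combinedPhrases"][0]["text"])  # full transcript text
```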

Realtime Speech Translation, General Availability

Realtime speech translation is now generally available, enabling multilingual speech-to-speech translation for 76 input languages. It includes significant latency improvements, delivering translation results within 5 seconds of the initial utterance. Enhanced latency is supported for language pairs with English (en-US) as the output language and will be extended to other output languages by the end of 2024. Extended language support from 40 to 76 languages is available in three Azure service regions today (West Central US, East Asia, and North Europe) and will be expanded to cover all Azure AI Speech service regions by the end of 2024. Learn more at aka.ms/azure-speech-translation.
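For example, here is a minimal sketch using the Speech SDK for Python that translates Spanish microphone input into English; the key, region, and language choices are placeholders for illustration.

```python
import azure.cognitiveservices.speech as speechsdk

# Replace with your Speech resource key and a supported region.
speech_key, region = "<your-speech-resource-key>", "westcentralus"

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=speech_key, region=region
)
translation_config.speech_recognition_language = "es-ES"  # input language
translation_config.add_target_language("en")              # output language

audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config, audio_config=audio_config
)

print("Speak into your microphone...")
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Recognized: {result.text}")
    print(f"Translated: {result.translations['en']}")
```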

Video Translation API, Public Preview

We are excited to announce that we have extended the video translation capabilities already showcased in our video translation portal with a public preview of our new video translation API. The new API allows developers to incorporate powerful video content localization into their applications, enabling content creators to reach a global audience easily and efficiently.

Videos can be uploaded to Azure Blob Storage, and the API processes translation tasks in parallel in batch mode, offering faster processing speeds. Translated output videos, as well as subtitles in both the original and target languages, are available for asynchronous download. In addition to contextual refinement through GPT-4, the API also offers the ability to implement human-in-the-loop flows, giving users full control to modify machine-generated translation results.
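The workflow follows a familiar asynchronous pattern, sketched below in Python: create a translation job that references a video in Blob Storage, poll until the batch job completes, then download the outputs. The endpoint path, API version, and payload fields here are illustrative assumptions rather than the documented contract; consult the video translation API reference for the exact shapes.

```python
import time
import requests

# Illustrative values -- the endpoint path, API version, and payload schema
# below are assumptions; see the video translation API reference.
SPEECH_KEY = "<your-speech-resource-key>"
REGION = "eastus"
BASE = f"https://{REGION}.api.cognitive.microsoft.com/videotranslation"
HEADERS = {"Ocp-Apim-Subscription-Key": SPEECH_KEY}
API_VERSION = {"api-version": "2024-05-20-preview"}  # assumed preview version

# 1) Create a translation job referencing a video in Azure Blob Storage.
create = requests.put(
    f"{BASE}/translations/my-translation-1",
    headers=HEADERS,
    params=API_VERSION,
    json={
        "displayName": "Demo translation",
        "input": {
            "sourceLocale": "en-US",
            "targetLocale": "es-ES",
            "voiceKind": "PlatformVoice",
            "videoFileUrl": "https://<account>.blob.core.windows.net/videos/demo.mp4?<sas>",
        },
    },
)
create.raise_for_status()

# 2) Poll until the batch job finishes, then read the result URLs.
while True:
    status = requests.get(
        f"{BASE}/translations/my-translation-1", headers=HEADERS, params=API_VERSION
    ).json()
    if status.get("status") in ("Succeeded", "Failed"):
        break
    time.sleep(30)

print(status)  # includes download URLs for the translated video and subtitles
```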

Through the optional use of personal voice, the video translation will retain the speakers' timbre, emotion, intonation, and pitch fluctuations across different language pairs. To ensure responsible use of the technology, personal voice is a limited-access feature. The personal voice output option is available by registration only, and only for certain use cases. To access it, follow the limited access instructions to get approval.

Context-aware, highly expressive HD voices, Public Preview

We are pleased to announce the launch of Azure AI Speech's neural text-to-speech high definition (HD) voices. These advanced voices can detect emotions and adjust tone in real time, maintaining a consistent persona while providing enhanced features.

Azure AI Speech's HD voices represent a significant milestone in speech synthesis technology. Utilizing state-of-the-art neural networks, they generate lifelike and expressive speech adaptable to various contexts and applications. Whether you are developing interactive podcasts from documents, virtual assistants, or interactive educational tools, these HD voices offer a new level of realism and engagement. These voices allow developers to create inclusive and localized experiences for users worldwide.

The HD voices are now available in public preview in the East US, West Europe, and Southeast Asia regions. For a comprehensive list of available HD voices, please refer to the list here.
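Using an HD voice works just like regular neural text to speech in the Speech SDK; the minimal Python sketch below shows the idea. The voice name used here is an example, so confirm current HD voice names and supported regions in the documentation.

```python
import azure.cognitiveservices.speech as speechsdk

# Replace with your key and one of the HD-voice preview regions
# (East US, West Europe, or Southeast Asia).
speech_config = speechsdk.SpeechConfig(
    subscription="<your-speech-resource-key>", region="eastus"
)
# Example HD voice name -- check the voice list for current options.
speech_config.speech_synthesis_voice_name = "en-US-Ava:DragonHDLatestNeural"

# Synthesize to the default speaker; the HD voice adapts tone to the text.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "I can't believe we finally shipped this. What a day!"
).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Audio synthesized to the default speaker.")
```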

If you like, you can listen to an AI-generated podcast that covers today's announcements, produced with our new context-aware, highly expressive HD voices.

Custom Avatar updates

During the past few months, we have made a few updates to the text to speech avatar service.

  • More sample code: A JavaScript code sample has been added to GitHub for live chats with a real-time avatar.
  • Gestures added to live chats: Now avatars are more engaging with natural gestures added to conversations. Try it with the live chat avatar tool.
  • More regions: East US 2 has been added to the supported regions for text to speech avatar, bringing the total number of supported Azure service regions to seven: Southeast Asia, North Europe, West Europe, Sweden Central, South Central US, East US 2, and West US 2.
  • Lower price: We have further reduced the cost of avatar synthesis in live chat scenarios. For standard avatars, the real-time synthesis price drops from $1 per minute to $0.50 per minute (taking effect in December); for custom avatars, it drops from $1 per minute to $0.60 per minute. Check the details on the pricing page (choose one of the supported Azure service regions).

Text to speech avatar continues to power more customer success stories. For example, World2Meet (W2M) developed a speaking virtual assistant and avatar that can handle and answer questions from its customers in multiple languages. The International University of Applied Sciences (IU) created a study buddy with Azure text to speech avatar that helps students achieve their learning goals through interactive dialogue, tailored precisely to their individual requirements and preferences.

At Ignite, we are also glad to share that a self-service custom avatar portal will be released very soon. With this portal, customers will be able to upload their own video data and create custom avatars by themselves. This update will significantly reduce the time to market for customers' avatars, which are currently built with Microsoft's engineering support. The self-service portal will be released with guardrails for responsible use, as described in our transparency notes.

At Ignite, CDW shows how the leading technology solution provider harnessed the power of Azure TTS Avatar, Azure OpenAI, and an agentic architecture to deliver an unparalleled customer experience. Learn about their innovative solution, showcased at a live event, that enables customers to effortlessly order coffee through natural language interactions, in BRKFP383, Revolutionizing customer experience with Azure TTS Avatar and OpenAI, on Wednesday, Nov 20.

Speech Model Improvements for Accessibility

Azure AI Speech has been working on speech accessibility improvements to our English speech recognition in partnership with the University of Illinois’ Speech Accessibility Project. By gathering data from individuals with diverse speech disabilities and integrating non-standard speech data into our public model, we have achieved significant improvements in English speech recognition, with accuracy gains ranging from 18% to 60% depending on the disability type. This ongoing work reflects our dedication to building technology that is more inclusive and accessible for everyone.

Developer Experience Improvements

A new Azure AI Speech Toolkit extension is now available for Visual Studio Code developers. It contains a list of speech quick starts and scenario samples that can be built and run with just a few clicks. For more information, see Azure AI Speech Toolkit in the Visual Studio Code Marketplace.

Azure AI Speech capabilities are also now integrated into Azure AI Foundry, which brings everything AI into one place. The updated AI Foundry provides a visually refreshed UX with new Azure AI Speech playgrounds and fine-tuning experiences for creating custom speech models.

These playgrounds provide a hands-on environment where you can explore and learn how to integrate powerful features like speech-to-text and speech translation into your AI applications, in addition to other AI services available in Azure AI Foundry.

One standout feature we are introducing is the ability to customize speech models through our fine-tuning experience. This allows you to achieve better quality results tailored to your specific needs, ensuring more accurate and effective AI solutions.

In addition to these updates, users can now easily discover Speech and other AI services within the model catalog and begin building with them from the Models + Endpoints section. Learn more.

Get Started:

[Image: Azure AI Foundry portal showing a link to the speech playground and highlighting speech capabilities]