AI - Azure AI services Blog

Make your voice chatbots more engaging with new text to speech features

QinyingLiao
Microsoft
Jun 28, 2024

In our increasingly digital world, the importance of giving a voice and image to chatbots cannot be overstated. Transforming a chatbot from an impersonal, automated responder into a relatable and personable assistant significantly enhances user engagement.

 

Today we're thrilled to announce Azure AI Speech's latest updates, enhancing text to speech capabilities for a more engaging and lifelike chatbot experience. These updates include:

  • A wider range of multilingual voices for natural and authentic interactions;
  • More prebuilt avatar options, with the latest sample code for seamless GPT-4o integration; and
  • A new text stream API that significantly reduces latency for ChatGPT integration, ensuring smoother and faster responses.

 

Introducing new multilingual and IVR-styled voices

 

We're excited to introduce our newest collection of voices, equipped with advanced multilingual features. These voices are crafted from a variety of source languages and bring a diverse set of personas to your user experience. Their natural, authentic delivery is designed to make chatbot conversations more engaging.

 

Discover the diverse range of our new voices:

| Voice name | Main locale | Gender |
|---|---|---|
| en-GB-AdaMultilingualNeural | en-GB (English – United Kingdom) | Female |
| en-GB-OllieMultilingualNeural | en-GB (English – United Kingdom) | Male |
| pt-BR-ThalitaMultilingualNeural | pt-BR (Portuguese – Brazil) | Female |
| es-ES-IsidoraMultilingualNeural | es-ES (Spanish – Spain) | Female |
| es-ES-ArabellaMultilingualNeural | es-ES (Spanish – Spain) | Female |
| it-IT-IsabellaMultilingualNeural | it-IT (Italian – Italy) | Female |
| it-IT-MarcelloMultilingualNeural | it-IT (Italian – Italy) | Male |
| it-IT-AlessioMultilingualNeural | it-IT (Italian – Italy) | Male |

 

We're also delighted to present two new optimized en-US voices, specifically designed for call center scenarios - a prevalent application of text-to-speech technology.

 

They are: 

| Voice name | Main locale | Gender |
|---|---|---|
| en-US-LunaNeural | en-US (English – United States) | Female |
| en-US-KaiNeural | en-US (English – United States) | Male |

 

These voices are currently available in public preview in three regions: East US, West Europe, and Southeast Asia. Discover more in our Voice Gallery and delve deeper into the details via our developer documentation.
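To try one of these voices from code, one option is to send SSML to the Speech service. The sketch below only builds the SSML string and does not call the service. The helper name `build_ssml` and the sample sentence are illustrative assumptions, and it assumes the standard SSML `voice` and `lang` elements are used to pick a voice and switch its speaking language. Check the developer documentation for the attributes your scenario needs.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(voice: str, text: str, lang: Optional[str] = None,
               doc_lang: str = "en-US") -> str:
    """Build an SSML document for one voice (hypothetical helper).

    voice: a voice name such as "en-GB-AdaMultilingualNeural".
    lang:  optional locale; when given, the text is wrapped in a <lang>
           element so a multilingual voice speaks in that language.
    """
    body = escape(text)  # escape &, <, > so the text is valid XML
    if lang is not None:
        body = f"<lang xml:lang={quoteattr(lang)}>{body}</lang>"
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f"xml:lang={quoteattr(doc_lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    ssml = build_ssml("en-GB-AdaMultilingualNeural",
                      "Hola & bienvenido", lang="es-ES")
    ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed
    print(ssml)
```

Leaving `lang` unset produces plain single-language SSML, which is likely all the two call center voices need.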

 

Announcing advanced features for text to speech avatars

 

Text to speech avatar, previewed at Ignite 2023, lets users create realistic videos of speaking avatars from text input and build real-time interactive bots with more engaging visual elements. Since its preview, we have received great feedback and appreciation from customers in various industries. Today, we are glad to share what’s been added to the avatar portfolio.

 

More prebuilt avatar options and more regions available

 

Our prebuilt text-to-speech avatars offer ready-to-deploy solutions for our customers. We've recently enriched our portfolio's diversity by introducing five new avatars. They can be used for both batch synthesis and real-time conversational scenarios. We remain committed to expanding our avatar collections to encompass a broader range of cultures and visual identities.

 

[Image: Text to speech avatars]

 

 

These newly introduced avatars can be accessed in Speech Studio for video creation and live chats. Dive deeper into the process of synthesizing a text-to-speech avatar using Speech SDK for real-time synthesis in chatbot interactions, or batch synthesis for generating creative videos.

 

Beyond the previously available service regions - West US 2, West Europe, and Southeast Asia - we are excited to announce the expansion of our avatar service to three additional regions: Sweden Central, North Europe, and South Central US. Learn more here.

 

Enhanced text to speech avatar chat experience with Azure OpenAI capabilities

 

Text-to-speech avatars are increasingly used for live chatbots, with many of our customers using Azure OpenAI to develop customer service bots, virtual assistants, AI educators, and virtual tourist guides, among others. These avatars, with their lifelike appearance and natural-sounding neural TTS or custom voice, combined with the advanced natural language processing capabilities of the Azure OpenAI GPT model, provide an interaction experience that closely mirrors human conversation.

 

The Azure OpenAI GPT-4o model is now part of the live chat avatar application in Speech Studio, so users can see firsthand how the live chat avatar and Azure OpenAI GPT-4o work together. We also provide sample code to help integrate the text-to-speech avatar with the GPT-4o model. Learn more about how to create lifelike chatbots with real-time avatars and Azure OpenAI GPTs, or dive into the code samples here (JS code sample and Python code sample).

 

This update also includes sample code to help you customize Azure OpenAI GPT on your data. Azure OpenAI On Your Data is a new feature that enables users to tailor the chatbot's responses according to their unique data source. This proves especially beneficial for enterprise customers aiming to develop an avatar-based live chat application capable of addressing business-specific queries from clients. For guidance on creating a live chat app using Azure OpenAI On Your Data, please refer to this sample code (search "On Your Data").

 

More Responsible AI support for avatars

 

Ensuring responsibility in both the development and delivery of AI products is a core value for us. In line with this, we've introduced two features to bolster the responsible AI support for text-to-speech avatars, supplementing our existing transparency note, code of conduct, and disclosure guidelines.

 

  • We've integrated Azure AI Content Safety into the batch synthesis process of text to speech avatars for video creation scenarios. This added layer of text moderation detects offensive, risky, or undesirable text input and so prevents the avatar from producing harmful output. The text moderation feature spans multiple categories, including sexual, violent, hate, self-harm content, and more. It's available for batch synthesis of text-to-speech avatars both in Speech Studio and via the batch synthesis API.
  • To give audiences clearer insight into the source and history of video content created by text to speech avatars, we've adopted the Coalition for Content Provenance and Authenticity (C2PA) Standard. This standard offers transparent information about AI-generation of video content. For more details on the integration of C2PA with text to speech avatars, refer to Content Credentials in Azure Text to Speech Avatar.

 

Unlocking real-time speech synthesis with the new text stream API

 

Our latest release introduces a Text Stream API that accepts text as it is produced and starts generating speech before the full text is available. This makes it a good fit for dynamic text vocalization, such as reading out responses from AI models like GPT in real time.

 

The Text Stream API represents a significant leap forward from traditional non-text stream TTS technologies. By accepting input in chunks (as opposed to whole responses), it significantly reduces the latency that typically hinders seamless audio synthesis.

 

Comparison: Non-Text Stream vs. Text Stream

 

| | Non-Text Stream | Text Stream |
|---|---|---|
| Input Type | Whole GPT response | Each GPT output chunk |
| TTS First Byte Latency | High (total GPT response time + TTS time) | Low (time for the first few GPT chunks + TTS time) |
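To make the latency comparison concrete, here is a small worked example of the arithmetic. All numbers are hypothetical and chosen only for illustration: 40 GPT chunks arriving 50 ms apart, a 200 ms TTS first-byte time, and an assumption that synthesis can begin once three chunks have arrived. The function names are also illustrative and not part of any SDK.

```python
from typing import Sequence


def first_byte_latency_non_stream(chunk_times_ms: Sequence[int],
                                  tts_first_byte_ms: int) -> int:
    """Non-text stream: TTS starts only after the whole GPT response."""
    return sum(chunk_times_ms) + tts_first_byte_ms


def first_byte_latency_text_stream(chunk_times_ms: Sequence[int],
                                   tts_first_byte_ms: int,
                                   chunks_needed: int = 3) -> int:
    """Text stream: TTS starts after the first few GPT chunks.

    chunks_needed is an assumed value, not a documented service setting.
    """
    return sum(chunk_times_ms[:chunks_needed]) + tts_first_byte_ms


if __name__ == "__main__":
    chunk_times_ms = [50] * 40   # hypothetical: 40 chunks, 50 ms each
    tts_first_byte_ms = 200      # hypothetical TTS first-byte time

    print(first_byte_latency_non_stream(chunk_times_ms, tts_first_byte_ms))
    # 2000 ms of GPT time + 200 ms of TTS time = 2200
    print(first_byte_latency_text_stream(chunk_times_ms, tts_first_byte_ms))
    # 150 ms for three chunks + 200 ms of TTS time = 350
```

Under these assumed numbers, the user would hear the first audio after about 0.35 seconds instead of 2.2 seconds, and the gap widens as the GPT response gets longer.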

 

The Text Stream API not only minimizes latency but also enhances the fluidity and responsiveness of real-time speech outputs, making it an ideal choice for interactive applications, live events, and responsive AI-driven dialogues.

 

Utilizing the Text Stream API is straightforward. Simply follow the steps provided with the Speech SDK. For detailed implementation, see the sample code on GitHub.
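The general pattern is to forward each GPT output chunk to the synthesizer as soon as it arrives, then signal the end of input. The sketch below shows that pattern without the Speech SDK. `TextSink`, `RecordingSink`, and `forward_chunks` are hypothetical names, and the sink is a stand-in for whatever streaming input object the SDK provides. Refer to the sample code on GitHub for the actual SDK calls.

```python
from typing import Iterable, List, Protocol


class TextSink(Protocol):
    """Anything that accepts text chunks and an end-of-input signal."""

    def write(self, text: str) -> None: ...

    def close(self) -> None: ...


class RecordingSink:
    """Stand-in sink that records chunks instead of synthesizing speech."""

    def __init__(self) -> None:
        self.received: List[str] = []
        self.closed = False

    def write(self, text: str) -> None:
        if self.closed:
            raise ValueError("cannot write to a closed sink")
        self.received.append(text)

    def close(self) -> None:
        self.closed = True


def forward_chunks(chunks: Iterable[str], sink: TextSink) -> int:
    """Send each non-empty chunk to the sink as it arrives.

    Returns the number of chunks forwarded. The sink is always closed,
    even if the chunk source raises, so the consumer is not left waiting.
    """
    forwarded = 0
    try:
        for chunk in chunks:
            if chunk:  # streamed model output can contain empty deltas
                sink.write(chunk)
                forwarded += 1
    finally:
        sink.close()
    return forwarded


if __name__ == "__main__":
    sink = RecordingSink()
    count = forward_chunks(["Hel", "", "lo ", "world"], sink)
    print(count, "".join(sink.received))
```

Because `forward_chunks` accepts any iterable, the same loop could be fed by a streaming GPT response generator in place of the hard-coded list.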

 

Get started

 

Microsoft provides access to more than 500 neural voices spanning more than 140 languages and locales, complemented by avatar add-ons. These text-to-speech capabilities, part of Azure AI Speech service, allow you to swiftly give chatbots a natural voice and realistic image, enriching the conversational experience for users. Furthermore, the Custom Neural Voice and Custom Avatar features facilitate the creation of a distinctive brand voice and image for your chatbots. With a unique voice and image, a chatbot can seamlessly integrate into your brand's identity, contributing to a cohesive and memorable brand experience.

 

For more information

 

 Zheng Niu and Junwei Gan also contributed to this article. 

 

Updated Nov 27, 2024
Version 5.0