Azure AI Foundry Blog

Voice Conversion in Azure AI Speech

JieDing
Microsoft
Jun 26, 2025

We are delighted to announce that the Voice Conversion (VC) feature is now available in preview in the Azure AI Speech service.

What is Voice Conversion?

Voice Conversion (also known as voice changing or speech-to-speech conversion) transforms the voice characteristics of a given audio recording into those of a target speaker. The converted audio preserves the linguistic content and prosody of the source audio, while its timbre sounds like the target speaker.

[Figure: diagram of Voice Conversion]

The purpose of Voice Conversion

Users need Voice Conversion for three main reasons:

  1. Voice Conversion can replicate your content using a different voice identity while maintaining the original prosody and emotion. For instance, in education, teachers can record themselves reading stories, and Voice Conversion can deliver these stories using a pre-designed cartoon character's voice. This method preserves the expressiveness of the teacher's reading while incorporating the unique timbre of the cartoon character's voice.
  2. Another application is multilingual dubbing. When localized content is read by different voices, Voice Conversion can transform the recordings into a single uniform voice, ensuring a consistent experience across all languages while retaining most of the localized speaking characteristics.
  3. Voice Conversion gives finer control over the expressiveness of a voice. By converting source audio recorded in various speaking styles, such as a distinctive tone or exaggerated emotion, a voice gains greater versatility in expression and can be more dynamic across scenarios.

 

A Brief Introduction to Our Voice Conversion Technology

Voice Conversion is built on state-of-the-art generative models and delivers the following core capabilities:

| Key Capability | Description |
| --- | --- |
| High Speaker Similarity | Captures the timbre and vocal identity of the target speaker; generates audio that accurately matches the target voice |
| Prosody Preservation | Maintains rhythm, stress, and intonation of source audio; preserves expressive and emotional qualities |
| High Audio Fidelity | Generates realistic, natural-sounding audio; minimizes artifacts |
| Multilingual Support | Enables multilingual Voice Conversion; supports 91 locales (same as standard Text to speech locale support) |

 

Voice Conversion in Standard TTS voices

In this release, 28 standard TTS voices in en-US have been enabled with Voice Conversion capabilities. These voices are available in the East US, West Europe, and Southeast Asia service regions.

How to Use

You can enable Voice Conversion by adding the mstts:voiceconversion tag to your SSML. The structure is nearly identical to a standard TTS request, except that you specify a source audio URL and a target voice name.

Note: In voice conversion mode, the synthesized output follows the content and prosody of the provided source audio. Text input is therefore not required, and any text included in the SSML is ignored during rendering. In addition, SSML elements related to prosody and pronunciation have no effect, because prosody is derived directly from the source audio.

SSML example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice xml:lang="en-US" xml:gender="Female" name="Microsoft Server Speech Text to Speech Voice (en-US, AvaMultilingualNeural)">
    <mstts:voiceconversion url="https://your.blob.core.windows.net/sourceaudio.wav"></mstts:voiceconversion>
  </voice>
</speak>

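For readers who generate requests programmatically, here is a minimal Python sketch of how the SSML above might be assembled and wrapped in an HTTP request. The helper names `build_voice_conversion_ssml` and `build_tts_request` are hypothetical. The sketch omits the `xml:gender` attribute on the assumption that it is optional, and it assumes that the preview accepts Voice Conversion SSML through the standard Text to speech REST endpoint, which this post does not confirm. The request is only constructed, never sent.

```python
import urllib.request
from xml.sax.saxutils import quoteattr


def build_voice_conversion_ssml(source_audio_url, target_voice, locale="en-US"):
    """Build SSML that mirrors the example in this post (hypothetical helper)."""
    # quoteattr() escapes special characters and adds the surrounding quotes.
    # No text is placed inside <voice>: in voice conversion mode it is ignored.
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(locale)}>"
        f"<voice xml:lang={quoteattr(locale)} name={quoteattr(target_voice)}>"
        f"<mstts:voiceconversion url={quoteattr(source_audio_url.strip())}>"
        "</mstts:voiceconversion>"
        "</voice></speak>"
    )


def build_tts_request(ssml, region, subscription_key):
    """Wrap the SSML in a POST request (assumption: the standard REST
    Text to speech endpoint is used for Voice Conversion as well)."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "voice-conversion-sketch",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    ssml = build_voice_conversion_ssml(
        "https://your.blob.core.windows.net/sourceaudio.wav",
        "Microsoft Server Speech Text to Speech Voice (en-US, AvaMultilingualNeural)",
    )
    print(ssml)
```

To actually submit the request, you would pass the returned object to `urllib.request.urlopen` with a valid key and one of the supported regions, then write the response body to an audio file.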
Voice List

The following standard neural TTS voices support this feature:

AdamMultilingualNeural

AlloyTurboMultilingualNeural

AmandaMultilingualNeural

AndrewMultilingualNeural

AvaMultilingualNeural

BrandonMultilingualNeural

BrianMultilingualNeural

ChristopherMultilingualNeural

CoraMultilingualNeural

DavisMultilingualNeural

DerekMultilingualNeural

DustinMultilingualNeural

EchoTurboMultilingualNeural

EmmaMultilingualNeural

EvelynMultilingualNeural

FableTurboMultilingualNeural

JennyMultilingualNeural

LewisMultilingualNeural

LolaMultilingualNeural

NancyMultilingualNeural

NovaTurboMultilingualNeural

OnyxTurboMultilingualNeural

PhoebeMultilingualNeural

RyanMultilingualNeural

SamuelMultilingualNeural

SerenaMultilingualNeural

ShimmerTurboMultilingualNeural

SteffanMultilingualNeural
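Since only these voices and three service regions are enabled in this release, it can help to validate a request before sending it. The following is a small sketch, not part of any SDK: the function name `check_voice_conversion_target` is hypothetical, and the lowercase region identifiers (`eastus`, `westeurope`, `southeastasia`) are assumed to correspond to the region names given in this post.

```python
# Voices listed in this post as supporting Voice Conversion (en-US).
SUPPORTED_VOICES = frozenset("""
AdamMultilingualNeural AlloyTurboMultilingualNeural AmandaMultilingualNeural
AndrewMultilingualNeural AvaMultilingualNeural BrandonMultilingualNeural
BrianMultilingualNeural ChristopherMultilingualNeural CoraMultilingualNeural
DavisMultilingualNeural DerekMultilingualNeural DustinMultilingualNeural
EchoTurboMultilingualNeural EmmaMultilingualNeural EvelynMultilingualNeural
FableTurboMultilingualNeural JennyMultilingualNeural LewisMultilingualNeural
LolaMultilingualNeural NancyMultilingualNeural NovaTurboMultilingualNeural
OnyxTurboMultilingualNeural PhoebeMultilingualNeural RyanMultilingualNeural
SamuelMultilingualNeural SerenaMultilingualNeural ShimmerTurboMultilingualNeural
SteffanMultilingualNeural
""".split())

# Assumed identifiers for East US, West Europe, and Southeast Asia.
SUPPORTED_REGIONS = frozenset({"eastus", "westeurope", "southeastasia"})


def check_voice_conversion_target(voice_name, region):
    """Return a list of problems; an empty list means the pair looks usable."""
    problems = []
    # Accept both "AvaMultilingualNeural" and "en-US-AvaMultilingualNeural".
    prefix = "en-US-"
    short_name = voice_name[len(prefix):] if voice_name.startswith(prefix) else voice_name
    if short_name not in SUPPORTED_VOICES:
        problems.append(f"voice {voice_name!r} is not in the supported list")
    # Normalize "East US" style names to "eastus" style identifiers.
    normalized_region = region.lower().replace(" ", "")
    if normalized_region not in SUPPORTED_REGIONS:
        problems.append(f"region {region!r} is not enabled for Voice Conversion")
    return problems


if __name__ == "__main__":
    print(check_voice_conversion_target("en-US-AvaMultilingualNeural", "East US"))
    print(check_voice_conversion_target("en-US-GuyNeural", "westus"))
```

The list reflects this release only, so a check like this would need updating as more voices and regions are enabled.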

 

 

Voice Conversion in Custom Voice

Voice Conversion can also be applied to Custom Voice to enhance its expressiveness. This capability is currently in private preview. Because it requires only a small amount of target speaker data, it offers a quick path to dynamic voice customization. Customers who have built, or plan to build, a custom voice on Azure and have a suitable use case for Voice Conversion are invited to contact us at mstts@microsoft.com to preview this feature.

Benchmark Evaluation

Benchmarking plays a key role in evaluating the quality of Voice Conversion. We compared our solution against a leading Voice Conversion provider across a range of objective and subjective metrics.

Objective Evaluation

We evaluated our system and a leading Voice Conversion provider (Company A) on two language sets (English and Mandarin) using three widely accepted objective metrics:

  • SIM (Speaker Similarity): measures how closely the converted voice matches the target speaker’s vocal characteristics (higher is better).
  • WER (Word Error Rate): measures the intelligibility of the converted voice as the error rate of an automatic speech recognition (ASR) system transcribing it (lower is better).
  • Pitch Correlation: measures how well the pitch contour (intonation) of the converted voice aligns with the source (higher is better).
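To make the three metrics concrete, here is a minimal pure-Python sketch of how each is commonly computed. This post does not describe the evaluation pipeline, so everything here is an assumption: SIM is taken as cosine similarity between speaker embeddings (the embedding model is not specified), WER as word-level edit distance between a reference transcript and an ASR transcript, and pitch correlation as the Pearson correlation of F0 values over frames that are voiced in both signals. The inputs in the usage section are made-up illustrative values, not benchmark data.

```python
import math


def cosine_similarity(a, b):
    """SIM sketch: cosine similarity between two speaker embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def word_error_rate(reference, hypothesis):
    """WER sketch: word-level Levenshtein distance divided by reference length.
    Splitting on whitespace suits English; other languages may need a
    different tokenization."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)


def pitch_correlation(f0_source, f0_converted):
    """Pearson correlation of F0 over frames voiced in both signals.
    Assumes time-aligned contours where 0 marks an unvoiced frame."""
    pairs = [(s, c) for s, c in zip(f0_source, f0_converted) if s > 0 and c > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both contours")
    mean_s = sum(s for s, _ in pairs) / len(pairs)
    mean_c = sum(c for _, c in pairs) / len(pairs)
    cov = sum((s - mean_s) * (c - mean_c) for s, c in pairs)
    var_s = sum((s - mean_s) ** 2 for s, _ in pairs)
    var_c = sum((c - mean_c) ** 2 for _, c in pairs)
    if var_s == 0 or var_c == 0:
        raise ValueError("a contour with constant pitch has no defined correlation")
    return cov / math.sqrt(var_s * var_c)


if __name__ == "__main__":
    # Illustrative inputs only.
    print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(pitch_correlation([100, 110, 0, 120, 130], [200, 220, 0, 240, 260]))
```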

| Solution | Test Set | SIM ↑ | WER ↓ | Pitch Correlation ↑ |
| --- | --- | --- | --- | --- |
| Ours | En-US set | 0.70 | 1.9% | 0.61 |
| Company A | En-US set | 0.63 | 2.0% | 0.54 |
| Ours | Zh-CN set | 0.66 | 6.94% | 0.47 |
| Company A | Zh-CN set | 0.55 | 66.48% | 0.40 |

Our Voice Conversion outperforms Company A in speaker similarity and pitch preservation on both test sets, while achieving lower WER, particularly on Mandarin (6.94% vs. 66.48%).

Subjective Evaluation

CMOS (Comparison Mean Opinion Score) tests were conducted to assess perceptual quality. Listeners compared audio pairs and rated which sample sounded more natural. A positive score reflects a preference for one system over the other.
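As a rough illustration of how a CMOS value is aggregated, the sketch below averages per-pair listener ratings. The details are assumptions, since this post does not state them: ratings are taken on a -3 to +3 comparison scale, and a positive rating means the listener preferred the second system in the pair. The ratings in the usage section are invented for illustration and are not the evaluation data.

```python
from statistics import mean


def cmos(ratings, scale=3):
    """Average comparison ratings into a single CMOS value.
    Assumes each rating lies in [-scale, +scale]."""
    if not ratings:
        raise ValueError("no ratings provided")
    for rating in ratings:
        if not -scale <= rating <= scale:
            raise ValueError(f"rating {rating} is outside [-{scale}, {scale}]")
    return mean(ratings)


def preference_counts(ratings):
    """Count pairs that favor the second system, show no preference,
    or favor the first system (sign convention is an assumption)."""
    favor_second = sum(1 for r in ratings if r > 0)
    no_preference = sum(1 for r in ratings if r == 0)
    favor_first = sum(1 for r in ratings if r < 0)
    return favor_second, no_preference, favor_first


if __name__ == "__main__":
    illustrative_ratings = [1, 0, 2, -1]  # made-up listener ratings
    print(cmos(illustrative_ratings))
    print(preference_counts(illustrative_ratings))
```

A score near zero corresponds to the "on par" outcome, while a clearly positive or negative score indicates a preference for one system.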

| Test Set | CMOS (Company A vs Ours) |
| --- | --- |
| En-US set | On par |
| Zh-CN set | +0.75 in favor of ours |

These results show that our system achieves comparable perceptual quality in English and is clearly preferred in Mandarin.

Conclusion

In terms of objective evaluation, our Voice Conversion outperforms the leading Voice Conversion provider in speaker similarity (SIM), pitch correlation, and intelligibility (WER), with the largest margin on Mandarin.

In terms of subjective evaluation, our Voice Conversion is on par with the provider in English and achieves a clear advantage in Mandarin, which demonstrates its strength in multilingual conversion.

Overall, these results show that our current Voice Conversion delivers state-of-the-art quality.

Updated Jul 23, 2025
Version 2.0