Azure AI Speech
Announcing gpt-realtime on Azure AI Foundry
We are thrilled to announce the general availability of our latest advancement in speech-to-speech technology: gpt-realtime. This new model represents a significant leap forward in our commitment to providing advanced and reliable speech-to-speech solutions. gpt-realtime is a new S2S (speech-to-speech) model with improved instruction following, designed to merge all of our speech-to-speech improvements into a single, cohesive model. The model is available now in the Realtime API, offering enhanced voice naturalness, higher audio quality, and improved function calling capabilities.

Key Features
- New, natural, expressive voices: new voice options (Marin and Cedar) bring a new level of naturalness and clarity to speech synthesis.
- Improved instruction following: enhanced capabilities to follow instructions more accurately and reliably.
- Enhanced voice naturalness: more lifelike and expressive voice output.
- Higher audio quality: superior audio quality for a better user experience.
- Improved function calling: enhanced ability to call custom code defined by developers.
- Image input support: add images to the context and discuss them via voice; no video required.

Check out the model card here: gpt-realtime

Pricing
Pricing for gpt-realtime is 20% lower than the previous gpt-4o-realtime preview and is based on usage per 1 million tokens.

Getting Started
gpt-realtime is available on Azure AI Foundry via Azure Models direct from Azure today. We are excited to see how developers and users will leverage these new capabilities to create innovative and impactful solutions. Check out the model on Azure AI Foundry and see the detailed documentation in Microsoft Learn docs.
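For developers who want a feel for the programming model, the sketch below opens a Realtime API WebSocket session and requests a single response. It is illustrative only: the endpoint shape, api-version, and header handling are assumptions (they vary by resource type and SDK version), so check the gpt-realtime model card and the Microsoft Learn documentation referenced above for the exact values.

import asyncio, json, os
import websockets  # pip install websockets

# Assumed endpoint shape for an Azure OpenAI Realtime WebSocket session; confirm the exact
# host, api-version, and deployment name for your gpt-realtime deployment in the docs above.
URL = ("wss://<your-resource>.openai.azure.com/openai/realtime"
       "?api-version=<api-version>&deployment=gpt-realtime")

async def main():
    # Newer releases of the websockets package use additional_headers; older ones use extra_headers.
    async with websockets.connect(
        URL, additional_headers={"api-key": os.environ["AZURE_OPENAI_API_KEY"]}
    ) as ws:
        # Configure the session, add a user message, then ask the model for a response.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": "You are a concise voice assistant."},
        }))
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text", "text": "Say hello in one sentence."}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # e.g. session.updated, response.done
            if event["type"] == "response.done":
                break

asyncio.run(main())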
Building custom AI Speech models with Phi-3 and Synthetic data

Introduction
In today's landscape, speech recognition technologies play a critical role across various industries, improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:
- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition

Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet for certain highly specialized domains, such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature, off-the-shelf recognition models may fall short. To achieve the best possible performance, you'll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.

The Data Challenge: When training datasets lack sufficient diversity or volume, especially in niche domains or underrepresented speech patterns, model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.

Addressing Data Scarcity with Synthetic Data
A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft's Phi-3.5 model and Azure's pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale, with no professional recording studio or voice actors needed.

What is Synthetic Data?
Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It's especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:
- Privacy compliance: train models without handling personal or sensitive data.
- Filling data gaps: quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing datasets: add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario testing: simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.

By incorporating synthetic data, you can fine-tune custom STT (speech-to-text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.

Overview of the Process
This blog post provides a step-by-step guide, supported by code samples, to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
We will cover steps 1-4 of the high-level architecture.

(Figure: End-to-End Custom Speech-to-Text Model Fine-Tuning Process, Custom Speech with Synthetic Data)

Hands-on Labs: GitHub Repository

Step 0: Environment Setup
First, configure a .env file based on the provided sample.env template to suit your environment. You'll need to:
- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision Azure AI Speech and an Azure Storage account.

Below is a sample configuration focusing on creating a custom Italian model:

# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it

# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct

# Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural

# Azure Account Storage
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container

Key Settings Explained:
- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: the Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: the language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: configuration for the Azure Storage account where training and evaluation data will be stored.

Voice Gallery: https://speech.microsoft.com/portal?projecttype=voicegallery
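Once the .env file is in place, the notebooks need to read these values. A minimal sketch of loading them with python-dotenv (the hands-on repo may load its settings differently; this is just one common pattern):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file from the current working directory

AZURE_AI_SPEECH_REGION = os.getenv("AZURE_AI_SPEECH_REGION")
AZURE_AI_SPEECH_API_KEY = os.getenv("AZURE_AI_SPEECH_API_KEY")
CUSTOM_SPEECH_LANG = os.getenv("CUSTOM_SPEECH_LANG", "Italian")
CUSTOM_SPEECH_LOCALE = os.getenv("CUSTOM_SPEECH_LOCALE", "it-IT")
TTS_FOR_TRAIN = os.getenv("TTS_FOR_TRAIN", "")
TTS_FOR_EVAL = os.getenv("TTS_FOR_EVAL", "")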
Step 1: Generating Domain-Specific Text Utterances with Phi-3.5
Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""
question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english.
jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)
content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys

Sample Output (Contoso Electronics in Italian):

{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset
Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech's TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core Function:

def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="https://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
        <voice name='{default_tts_voice}'>
            {html.escape(text)}
        </voice>
    </speak>"""
    speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_synthesis_result)
    stream.save_to_wav_file(file_path)
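The function above assumes that the speechsdk and html modules have been imported and that a speech_synthesizer object already exists. A minimal setup sketch, reusing the .env values from Step 0 (the repo's actual initialization may differ slightly):

import html
import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

speech_config = speechsdk.SpeechConfig(
    subscription=AZURE_AI_SPEECH_API_KEY,
    region=AZURE_AI_SPEECH_REGION,
)
# audio_config=None: synthesized audio is saved to WAV files instead of being played on a speaker.
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)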
Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.

Note: If DELETE_OLD_DATA = True, the training_dataset folder resets each run. If you're mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.

Code snippet (illustrative):

import zipfile
import shutil

DELETE_OLD_DATA = True
train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)
if DELETE_OLD_DATA:
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)
print(f"Created zip file: {zip_filename}")

shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = os.path.join(train_dataset_dir, zip_filename)
%store train_dataset_path

You'll also similarly create evaluation data, using a different TTS voice than the one used for training to ensure a meaningful evaluation scenario.

Example Snippet to create the synthetic evaluation data:

import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)
if DELETE_OLD_DATA:
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')
for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)

Step 3: Creating and Training a Custom Speech Model
To fine-tune and evaluate your custom model, you'll interact with Azure's Speech-to-Text APIs:
- Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
- Register your dataset as a Custom Speech dataset.
- Create a Custom Speech model using that dataset.
- Create evaluations with that custom model, polling asynchronously until they complete.

You can also customize a speech model with fine-tuning through the UI in the Azure AI Foundry portal, but in this hands-on we use the Azure Speech-to-Text REST APIs to run through the entire process.

Key APIs & References:
- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts the API calls for convenience.
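For illustration, a dataset-registration call of the kind that common.py wraps might look roughly like the sketch below. The endpoint shape and field names follow the Speech to text REST API v3.2 Datasets operations, but treat the exact schema, api-version, and project linking as assumptions and confirm them against the API reference:

import requests

base_url = f"https://{AZURE_AI_SPEECH_REGION}.api.cognitive.microsoft.com/speechtotext/v3.2"
headers = {
    "Ocp-Apim-Subscription-Key": AZURE_AI_SPEECH_API_KEY,
    "Content-Type": "application/json",
}
body = {
    "kind": "Acoustic",
    "displayName": "acoustic dataset(zip) for training",
    "description": "Synthetic training data generated with Azure AI Speech TTS",
    "locale": CUSTOM_SPEECH_LOCALE,
    # contentUrl must be a URL the service can read, e.g. a blob URL with a SAS token.
    "contentUrl": "https://<storage-account>.blob.core.windows.net/stt-container/train_it-IT.zip?<sas-token>",
    # The repo additionally links each dataset to a Custom Speech project (omitted here).
}
resp = requests.post(f"{base_url}/datasets", headers=headers, json=body)
resp.raise_for_status()
print(resp.json()["self"])  # resource URL of the new dataset; the last segment is its ID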
Example Snippet to create training dataset:

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)

You can monitor training progress using the monitor_training_status function, which polls the model's status and updates you once training completes.

Core Function:

def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Training Completed")

Step 4: Evaluate the Trained Custom Speech Model
After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance, measured by Word Error Rate (WER), against the base model's WER.

Key Steps:
- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the evaluation results of the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry fine-tuning section, depending on the resource location you specified in the configuration file.

Example Snippet to create evaluation:

description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"
evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)

You can also produce a simple Word Error Rate (WER) summary with the code below, which is used in 4_evaluate_custom_model.ipynb.

Example Snippet to create the WER DataFrame:

# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)

About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training.
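If you want to sanity-check WER outside of the evaluation reports, the metric is straightforward to compute yourself. A small, dependency-free sketch of the formula above (the word-level Levenshtein distance counts the substitutions, insertions, and deletions):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("il mio laptop contoso non si accende",
          "il mio laptop contoso non accende"))  # one deleted word out of 7 -> about 0.14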
You'll also similarly create a WER result markdown file using the md_table_scoring_result method below.

Core Function:

# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)

Implementation Considerations
The provided code and instructions serve as a baseline for automating the creation of synthetic data and the fine-tuning of Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion
By combining Microsoft's Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:
- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure's AI and speech services, you'll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions, without the overhead of large-scale data collection efforts. 🙂

Reference
- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)

Personal Voice upgraded to v2.1 in Azure AI Speech, more expressive than ever before
At the Build conference on May 21, 2024, we announced the general availability of Personal Voice, a feature designed to empower customers to build applications where users can easily create and utilize their own AI voices (see the blog). Today we're thrilled to announce that the Azure AI Speech service has introduced an upgraded zero-shot TTS (text-to-speech) model, named "DragonV2.1Neural". This new model delivers more natural-sounding and expressive voices, offering improved pronunciation accuracy and greater controllability compared to the earlier zero-shot TTS model. In this blog, we present the new zero-shot TTS model's audio quality, new features, and benchmark results. We also share a guide for controlling pronunciation and accent using the Personal Voice API with the new zero-shot TTS model.

Personal Voice model upgrade
The Personal Voice feature in the Azure AI Speech service empowers users to craft highly personalized synthetic voices based on their own speech characteristics. By providing just a few seconds of speech as the audio prompt, users can rapidly generate an AI voice replica, which can then synthesize speech in any of the supported output languages. This capability unlocks a wide range of applications, from customizing chatbot voices to dubbing video content in an actor's original voice across multiple languages, enabling truly immersive and individualized audio experiences.

Our earlier Personal Voice Dragon TTS model can produce speech with exceptionally realistic prosody and high-fidelity audio quality, but it still encounters pronunciation challenges, especially with complex elements such as named entities. As a result, pronunciation control remains a crucial feature for delivering accurate and natural-sounding speech synthesis. In addition, for scenarios involving speech or video translation, it is crucial for a zero-shot TTS model to accurately produce not only different languages but also specific accents. The ability to precisely control accent ensures that speakers can deliver natural speech in any target accent.

Dragon V2.1 model card
- Architecture: Transformer model
- Highlights: multilingual; zero-shot voice cloning with 5-90 s prompts; emotion, accent, and environment adaptation
- Context length: 30 seconds of audio
- Supported languages: 100+ Azure TTS locales
- SSML support: yes
- Latency: < 300 ms
- RTF (real-time factor): < 0.05

Prosody and pronunciation improvement
Compared with our previous Dragon TTS model ("DragonV1"), the new "DragonV2.1" model improves the naturalness of speech, offering more realistic and stable prosody while maintaining better pronunciation accuracy.

(Audio samples in the original post compare the prompt audio, DragonV1, and DragonV2.1 for en-US and zh-CN, showing the prosody improvement; the prompt audio is the source speech from human speakers.)

The new "DragonV2.1" model also shows pronunciation improvements. We compared WER (word error rate), which measures the intelligibility of the synthesized speech with an automatic speech recognition (ASR) system. We evaluated WER (lower is better) on all supported locales, with each locale evaluated on more than 100 test cases. The new model achieves on average a 12.8% relative WER reduction compared to DragonV1.
Here are a few complicated cases showing the pronunciation improvement. Compared to DragonV1, the new DragonV2.1 model reads challenging cases correctly, such as Chinese polyphonic characters, and produces a better en-GB accent (audio samples for the prompt audio, DragonV1, and DragonV2.1 are in the original post):

- zh-CN: 唐朝高僧玄奘受皇帝之命，前往天竺取回真经，途中收服了四位徒弟：机智勇敢的孙悟空、好吃懒做的猪八戒、忠诚踏实的沙和尚以及白龙马。他们一路历经九九八十一难，战胜了无数妖魔鬼怪，克服重重困难。 (A passage about the Tang dynasty monk Xuanzang, who was sent by the emperor to retrieve the true scriptures from India and took on four disciples, enduring eighty-one trials along the way.)
- en-GB: [en-GB accent] Tomato, potato, and basil are in the salad.

Pronunciation control
The DragonV2.1 model supports pronunciation control with SSML phoneme tags: you can use the ipa phoneme tag and a custom lexicon to specify how the speech is pronounced. The examples below use "ipa" values for the attributes of the phoneme element described here. In the example below, the value ph="tə.ˈmeɪ.toʊ" (or ph="təmeɪˈtoʊ") is specified to stress the syllable meɪ.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            <phoneme alphabet="ipa" ph="tə.ˈmeɪ.toʊ"> tomato </phoneme>
        </mstts:ttsembedding>
    </voice>
</speak>

You can define how single entities (such as a company name, a medical term, or an emoji) are read in SSML by using the phoneme element. To define how multiple entities are read, create an XML-structured custom lexicon file. Then upload the custom lexicon XML file and reference it with the SSML lexicon element. After you publish your custom lexicon, you can reference it from your SSML. The following SSML example references a custom lexicon that was uploaded to https://www.example.com/customlexicon.xml.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <lexicon uri="https://www.example.com/customlexicon.xml"/>
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            BTW, we will be there probably at 8:00 tomorrow morning. Could you help leave a message to Robert Benigni for me?
        </mstts:ttsembedding>
    </voice>
</speak>
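For reference, a custom lexicon file such as customlexicon.xml follows the W3C Pronunciation Lexicon Specification (PLS). A minimal illustrative file is sketched below; the entries are examples chosen to match the sentence above, and the IPA string is illustrative rather than taken from the original post:

<?xml version="1.0" encoding="utf-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
    <!-- Expand an abbreviation to its spoken form -->
    <lexeme>
        <grapheme>BTW</grapheme>
        <alias>by the way</alias>
    </lexeme>
    <!-- Pin the pronunciation of a proper name (illustrative IPA) -->
    <lexeme>
        <grapheme>Benigni</grapheme>
        <phoneme>bɛˈniːnji</phoneme>
    </lexeme>
</lexicon>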
Language and accent control
You can use the <lang xml:lang> element to adjust the speaking language and accent for your voice, for example setting en-GB for a British English accent. For information about the supported languages, see the lang element documentation, which includes a table showing the <lang> syntax and attribute definitions. Using this element is recommended for better pronunciation accuracy.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            <lang xml:lang="en-GB"> Tomato, potato, and basil are in the salad. </lang>
        </mstts:ttsembedding>
    </voice>
</speak>

Benchmark evaluation
Benchmarking plays a key role in evaluating the performance of zero-shot TTS models. In this work, we compared our system with other top zero-shot text-to-speech providers, Company A and Company B, for English, and with Company A specifically for Mandarin. This assessment allowed us to measure performance across both languages. We used a widely accepted subjective metric: MOS (Mean Opinion Score) tests were conducted to assess perceptual quality, with listeners listening to the audio carefully and rating it.

In our evaluation, the opinion score is judged mainly on four aspects: overall impression, naturalness, conversational quality, and audio quality. Each judge gives a score of 1-5 on each aspect; we report the average score.

(The MOS score charts for the English set and the Chinese set are shown in the original post.)

These results show that our zero-shot TTS model is slightly better than Company A and Company B on English (a score gap greater than 0.05) and on par with Company A on Mandarin.

Quick trial with prebuilt voice profiles
To facilitate testing of the new DragonV2.1 model, several prebuilt voice profiles have been made available. By providing a brief prompt audio from each voice and using the new zero-shot model, these prebuilt profiles aim to provide more expressive prosody, high audio fidelity, and a natural tone while preserving the original voice persona. You can explore these profiles firsthand to experience the enhanced quality of the new model, without using your own custom profiles. The available profile names are: Andrew, Ava, Brian, Emma, Adam, and Jenny.

To use one of these prebuilt profiles, assign the profile name to the "speaker" attribute of the <mstts:ttsembedding> tag.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speaker="Andrew">
            I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
        </mstts:ttsembedding>
    </voice>
</speak>

(DragonV2.1 audio samples for each of these prebuilt profiles are available in the original post.)

Customer use case
This advanced, high-fidelity model can be used to enable dubbing scenarios, allowing video content to be voiced in the original actor's tone and style across multiple languages. The new Personal Voice model has been integrated into Azure AI video translation and aims to empower creators of short dramas to reach global markets effortlessly. TopShort and JOWO.ai, next-generation short drama creators and translation providers, have partnered with the Azure Video Translation Service to deliver one-click AI translation. Check out the demo from TopShort; more videos are available in this channel, owned by JOWO.ai.

Get started
The new zero-shot TTS model will be available in the middle of August and will be exposed in the BaseModels_List operation of the custom voice API. When you see the new model name "DragonV2.1Neural" in the base models list, follow these steps: register your use case and apply for access, create the speaker profile ID, and use the voice name "DragonV2.1Neural" to synthesize speech in any of the 100 supported languages.
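If you are calling the service from code rather than pasting SSML into a tool, SSML like the examples in this post can be sent with the Speech SDK. A minimal Python sketch, assuming the azure-cognitiveservices-speech package, placeholder credentials, and an existing speaker profile ID (replace all placeholders with your own values):

import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

ssml = """<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
    <voice name='DragonV2.1Neural'>
        <mstts:ttsembedding speakerprofileid='<your-speaker-profile-id>'>
            <lang xml:lang='en-US'>Hello, this is my personal voice.</lang>
        </mstts:ttsembedding>
    </voice>
</speak>"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    with open("personal_voice_sample.wav", "wb") as f:
        f.write(result.audio_data)
else:
    print("Synthesis failed:", result.reason)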
Below is an SSML example using DragonV2.1Neural to generate speech with your personal voice in different languages. More details are provided here.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            <lang xml:lang="en-US">
                I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
            </lang>
        </mstts:ttsembedding>
    </voice>
</speak>

Building personal voices responsibly
All customers must agree to our usage policies, which include requiring explicit consent from the original speaker, disclosing the synthetic nature of the content created, and prohibiting impersonation of any person or deceiving people using the personal voice service. The full code of conduct guides integrations of synthetic speech and personal voice to ensure consistency with our commitment to responsible AI.

Watermarks are automatically added to the speech output generated with personal voices. As the personal voice feature enters general availability, we have updated the watermark technology with enhanced robustness and stronger capabilities for detecting whether a watermark is present. To measure the robustness of the new watermark, we evaluated the accuracy of watermark detection on audio samples generated using personal voice. Our results showed an average accuracy rate higher than 99.7% for detecting the existence of watermarks across various audio editing scenarios. This improvement gives us stronger mitigations against potential misuse.

Try the personal voice feature in Speech Studio as a test, or apply for full access to the API for business use. In addition to creating a personal voice, eligible customers can create a brand voice for their business with Custom Voice's professional voice fine-tuning feature. Azure AI Speech also offers over 600 neural voices covering more than 150 languages and locales. With these prebuilt text-to-speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users.

Creating Intelligent Video Summaries and Avatar Videos with Azure AI Services
Unlock the true value of your organization's video content! In this post, I share how we built an end-to-end AI video analytics platform using Microsoft Azure. Discover how AI can automate video analysis, generate intelligent summaries, and create engaging avatar presentations, making content more accessible, actionable, and impactful for everyone. If you're interested in digital transformation, AI-powered automation, or modern content management, this is for you!

Voice Conversion in Azure AI Speech
We are delighted to announce the availability of the Voice Conversion (VC) feature in the Azure AI Speech service, currently in preview.

What is Voice Conversion
Voice Conversion (also called voice changer or speech-to-speech conversion) is the process of transforming the voice characteristics of a given audio to a target speaker: after Voice Conversion, the resulting audio preserves the source audio's linguistic content and prosody while the voice timbre sounds like the target speaker. (A diagram of Voice Conversion is shown in the original post.)

The purpose of Voice Conversion
There are three reasons users need Voice Conversion functionality:
- Voice Conversion can replicate your content using a different voice identity while maintaining the original prosody and emotion. For instance, in education, teachers can record themselves reading stories, and Voice Conversion can deliver these stories using a pre-designed cartoon character's voice. This preserves the expressiveness of the teacher's reading while incorporating the unique timbre of the cartoon character's voice.
- Another application is multilingual dubbing. When localized content is read by different voices, Voice Conversion can transform them into a uniform voice, ensuring a consistent experience across all languages while keeping the most localized voice characters.
- Voice Conversion enhances control over the expressiveness of a voice. By transforming various speaking styles, such as adopting a unique tone or conveying exaggerated emotions, a voice gains greater versatility in expression and can be more dynamic in different scenarios.

Brief introduction to our Voice Conversion technology
Voice Conversion is built on state-of-the-art generative models and offers high-quality conversion. It delivers the following core capabilities:
- High speaker similarity: captures the timbre and vocal identity of the target speaker and generates audio that accurately matches the target voice.
- Prosody preservation: maintains the rhythm, stress, and intonation of the source audio and preserves its expressive and emotional qualities.
- High audio fidelity: generates realistic, natural-sounding audio and minimizes artifacts.
- Multilingual support: enables multilingual Voice Conversion across 91 locales (the same locale support as standard text to speech).

Voice Conversion in standard TTS voices
In this release, 28 standard TTS voices in en-US have been enabled with Voice Conversion capabilities. These voices are available in the East US, West Europe, and Southeast Asia service regions. (An audio sample is available in the original post.)

How to Use
You can enable Voice Conversion by adding the mstts:voiceconversion tag to your SSML. The structure is nearly identical to a standard TTS request, with the addition of a source audio URL and a target voice name.

Note: In voice conversion mode, the synthesized output follows the content and prosody of the provided source audio. Text input is therefore not required, and any text included in the SSML is ignored during rendering. Additionally, all SSML elements related to prosody and pronunciation have no effect, because prosody is derived directly from the source audio.
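The SSML example in the next section points the mstts:voiceconversion element at a source audio URL. That URL must be reachable by the Speech service; one common approach is a time-limited SAS URL on Azure Blob Storage. A sketch assuming the azure-storage-blob package and placeholder account details:

import datetime
from azure.storage.blob import BlobServiceClient, BlobSasPermissions, generate_blob_sas

account_name = "<storage-account>"
account_key = "<storage-key>"
container, blob_name = "sourceaudio", "sample.wav"

# Upload the source audio to a blob container.
service = BlobServiceClient(f"https://{account_name}.blob.core.windows.net", credential=account_key)
with open("sample.wav", "rb") as f:
    service.get_blob_client(container, blob_name).upload_blob(f, overwrite=True)

# Generate a read-only SAS token valid for one hour.
sas = generate_blob_sas(
    account_name=account_name,
    container_name=container,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.datetime.utcnow() + datetime.timedelta(hours=1),
)
source_audio_url = f"https://{account_name}.blob.core.windows.net/{container}/{blob_name}?{sas}"
print(source_audio_url)  # use this as the url attribute of mstts:voiceconversion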
SSML example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" xml:gender="Female" name="Microsoft Server Speech Text to Speech Voice (en-US, AvaMultilingualNeural)">
        <mstts:voiceconversion url="https://your.blob.core.windows.net/sourceaudio.wav"></mstts:voiceconversion>
    </voice>
</speak>

Voice List
Here is the list of standard neural TTS voices supporting this feature: AdamMultilingualNeural, AlloyTurboMultilingualNeural, AmandaMultilingualNeural, AndrewMultilingualNeural, AvaMultilingualNeural, BrandonMultilingualNeural, BrianMultilingualNeural, ChristopherMultilingualNeural, CoraMultilingualNeural, DavisMultilingualNeural, DerekMultilingualNeural, DustinMultilingualNeural, EchoTurboMultilingualNeural, EmmaMultilingualNeural, EvelynMultilingualNeural, FableTurboMultilingualNeural, JennyMultilingualNeural, LewisMultilingualNeural, LolaMultilingualNeural, NancyMultilingualNeural, NovaTurboMultilingualNeural, OnyxTurboMultilingualNeural, PhoebeMultilingualNeural, RyanMultilingualNeural, SamuelMultilingualNeural, SerenaMultilingualNeural, ShimmerTurboMultilingualNeural, SteffanMultilingualNeural.

Voice Conversion in Custom Voice
Voice Conversion can also be applied to Custom Voice to enhance its expression. This capability is currently in private preview for Custom Voice. It enhances the Custom Voice experience and, since it only requires a small amount of target-speaker data, offers a quick path to dynamic voice customization. Customers who have built or plan to build a custom voice on Azure and have a suitable use case for Voice Conversion are invited to contact us at mstts@microsoft.com to preview this feature. (An audio sample is available in the original post.)

Benchmark Evaluation
Benchmarking plays a key role in evaluating the quality of Voice Conversion. In this work, we compared our solution against a leading Voice Conversion provider across a range of objective and subjective metrics, showcasing its advantages.

Objective Evaluation
We evaluated our system and a leading Voice Conversion provider (Company A) on two language sets (English and Mandarin) using three widely accepted objective metrics:
- SIM (speaker similarity): measures how closely the converted voice matches the target speaker's vocal characteristics (higher is better).
- WER (word error rate): measures the intelligibility of the converted voice with an automatic speech recognition (ASR) system (lower is better).
- Pitch correlation: measures how well the pitch contour (intonation) of the converted voice aligns with the source (higher is better).

Solution | Test set | SIM ↑ | WER ↓ | Pitch correlation ↑
Ours | en-US set | 0.70 | 1.9% | 0.61
Company A | en-US set | 0.63 | 2.0% | 0.54
Ours | zh-CN set | 0.66 | 6.94% | 0.47
Company A | zh-CN set | 0.55 | 66.48% | 0.40

Our Voice Conversion consistently outperforms Company A in speaker similarity and pitch preservation, while achieving a lower WER, particularly on Mandarin.

Subjective Evaluation
CMOS (Comparison Mean Opinion Score) tests were conducted to assess perceptual quality. Listeners compared audio pairs and rated which sample sounded more natural. A positive score reflects a preference for one system over the other.

Test set | CMOS (Company A vs. ours)
en-US set | On par
zh-CN set | +0.75 in favor of ours

These results show that our system achieves the same perceptual quality in English and performs significantly better in Mandarin.
Conclusion
In terms of objective evaluation, our Voice Conversion outperforms the leading Voice Conversion provider in speaker similarity (SIM), pitch correlation, and multilingual capability. In terms of subjective evaluation, our Voice Conversion is on par with the provider in English while achieving a significant advantage in Mandarin, which demonstrates its strength in multilingual conversion. Overall, these results show that our current Voice Conversion delivers state-of-the-art quality.

Introducing Azure AI Models: The Practical, Hands-On Course for Real Azure AI Skills
Hello everyone,

Today, I'm excited to share something close to my heart. After watching so many developers, myself included, get lost in a maze of scattered docs and endless tutorials, I knew there had to be a better way to learn Azure AI. So I decided to build a guide from scratch, with the goal of breaking things down step by step and making it easy for beginners to get started with Azure. My aim was to remove the guesswork and create a resource where anyone could jump in, follow along, and actually see results without feeling overwhelmed.

Introducing the Azure AI Models Guide. This is a brand new, solo-built, open-source repo aimed at making Azure AI accessible for everyone, whether you're just getting started or want to build real, production-ready apps using Microsoft's latest AI tools. The idea is simple: bring all the essentials into one place. You'll find clear lessons, hands-on projects, and sample code in Python, JavaScript, C#, and REST, all structured so you can learn step by step, at your own pace. I wanted this to be the resource I wish I'd had when I started: straightforward, practical, and friendly to beginners and pros alike.

It's early days for the project, but I'm excited to see it grow. If you're curious, check out the repo at https://github.com/DrHazemAli/Azure-AI-Models. Your feedback, and maybe even your contributions, will help shape where it goes next!

Azure AI Voice Live API: what's new and the pricing announcement
At the //Build conference in May 2025, we announced the public preview of the Azure AI Voice Live API (Breakout Session 144). Today we are excited to share some updates to this API and the latest pricing.

Recap: What is the Voice Live API and why does it matter?
Voice is the next-generation interface between humans and computers. In the era of voice-driven technologies, creating smooth and intuitive speech-based systems has become a priority for developers. The Voice Live API simplifies the process by combining essential voice processing components into a unified interface. Whether you're building conversational agents for customer support, automotive assistants, or educational tools, this API is designed to streamline workflows, reduce latency, and deliver high-quality, real-time voice interactions.

The Voice Live API integrates speech-to-text (STT), GenAI models, text-to-speech (TTS), avatars, and conversational enhancement features into a single interface. By eliminating the need to stitch together disparate components, the API offers an end-to-end solution for scalable voice-driven experiences.

The Voice Live API shines in scenarios where voice-driven interactions enhance user experiences. Here are some key applications:
- Contact centers: Develop dynamic voice bots for tasks such as customer support, product catalog navigation, and self-service solutions. These bots can improve operational efficiency and provide 24/7 support, reducing wait times for customers.
- Automotive assistants: Enable hands-free, in-car voice assistants for command execution, navigation assistance, and general inquiries. This ensures safer driving experiences while keeping users engaged.
- Education: Create voice-enabled learning companions and virtual tutors for interactive training sessions, personalized education experiences, language learning, and skill development. Voice-based systems can make learning more engaging and accessible for students of all ages.
- Public services: Develop voice agents to assist citizens with administrative queries, public service information, appointment scheduling, and more. These agents can improve accessibility for individuals with limited digital literacy.
- Human resources: Enhance HR processes using voice-enabled tools for employee support (e.g., FAQs about benefits or policies), career development (e.g., performance feedback or skill-building recommendations), training (e.g., interactive onboarding experiences), and more. Voice-driven HR tools can streamline operations, reduce the workload for HR teams, and provide employees with faster resolutions to their queries.

The Voice Live API is packed with features designed to support diverse use cases and deliver superior voice interactions. Here's a breakdown of its key capabilities:
- Broad locale coverage: Speech to text supports more than 50 locales, with an option to use Azure's multilingual model for 15 locales. Text to speech offers more than 600 out-of-the-box voices across 150+ locales, with access to 30+ highly natural conversational voices optimized with the neural HD models.
- Flexible GenAI model options: The API lets you choose from multiple AI models tailored to conversational needs, including GPT-4o, GPT-4o-mini, and Phi.
- Advanced conversational enhancement features: Ensure smooth and natural interactions with noise suppression that reduces environmental noise, making conversations clearer even in busy settings; echo cancellation that prevents the agent from picking up its own audio responses, avoiding feedback loops; robust interruption detection that accurately identifies interruptions during conversations; and advanced end-of-turn detection that allows natural pauses without prematurely concluding interactions.
- Avatar integration: Provides avatars synchronized with the audio output, offering a visual identity for voice agents.
- Customization: Design unique, brand-aligned voices for audio output and customized avatars to reinforce brand identity.
- Integration with Foundry Agents: Give the agents you build in Azure AI Foundry a voice interface.

To get started, try Voice Live in the Azure AI Foundry playground, or learn more about how to use the Voice Live API.

What's new in June
During the past few weeks, we have released a few new features for the Voice Live API to address customer requests.

Support for more GenAI models:
- GPT-4.1 model series: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano are now natively supported.
- Phi series: Phi-4 Mini and Phi-4 Multimodal models are now supported.

Support for more customization capabilities. Developers need customization to manage input and output for different use cases. In June, we added more features to support speech input and output customization:
- Phrase list: Use a phrase list for lightweight, just-in-time customization of audio input, for example defining "Neo QLED TV" or "Surface Pro 12" as one phrase.
- Speaking rate control: The speaking rate parameter lets developers easily adjust the speaking speed for any standard Azure text to speech voice and for custom voices.
- Custom lexicon: A custom lexicon enables developers to customize pronunciation for both standard Azure text to speech voices and custom voices.
Learn more about how to use these features in this document.

Azure semantic VAD extended to GPT-4o-Realtime and GPT-4o-Mini-Realtime. Azure semantic VAD (voice activity detection) detects the start and end of speech based on semantic meaning. It improves turn detection by removing filler words to reduce the false alarm rate. This feature now also supports the Azure OpenAI GPT-4o realtime models.

Create call center voice agents by combining the Voice Live API and Azure Communication Services. The blog post by the Azure Communication Services team and the corresponding sample in GitHub show how you can leverage Azure Communication Services to access audio from live calls and connect it to the Voice Live API to build call center voice agents that leverage Azure AI Speech's advanced audio and voice capabilities.

Availability in more regions. More regions are now supported: West US 2, Central India, and Southeast Asia. To check the features supported in each region, see this document.

Pricing note
The Voice Live API will start charging on July 1, 2025. The following pricing table shows the charges based on the configuration chosen for a voice agent application. All token prices are per 1 million tokens.
Category | Feature | Price (per 1M tokens)
Pro | Text | Input: $5.50 / Cached input: $2.75 / Output: $22
Pro | Audio with Azure AI Speech - Standard | Input: $17 / Cached input: $2.75 / Output: $38
Pro | Audio with Azure AI Speech - Custom | Output: $55
Pro | Native audio with GPT-4o-Realtime | Input: $44 / Cached input: $2.75 / Output: $88
Basic | Text | Input: $0.66 / Cached input: $0.33 / Output: $2.64
Basic | Audio with Azure AI Speech - Standard | Input: $15 / Cached input: $0.33 / Output: $33
Basic | Audio with Azure AI Speech - Custom | Output: $50
Basic | Native audio with GPT-4o-Mini-Realtime | Input: $11 / Cached input: $0.33 / Output: $22
Lite | Text | Input: $0.08 / Cached input: $0.04 / Output: $0.32
Lite | Audio with Azure AI Speech - Standard | Input: $13 / Cached input: $0.04 / Output: $33
Lite | Audio with Azure AI Speech - Custom | Output: $50
Lite | Native audio with Phi-MM | Input: $4 / Cached input: $0.04

With Voice Live Pro, developers can choose from LLMs such as the GPT-4o-Realtime, GPT-4o, and GPT-4.1 models. With Voice Live Basic, developers can choose from smaller LLMs such as the GPT-4o-Mini-Realtime, GPT-4o Mini, and GPT-4.1 Mini models. With Voice Live Lite, developers can choose from SLMs and equivalent models such as GPT-4.1 Nano and the Phi models.

If you choose to use a custom voice for your speech output, you will be charged separately for custom voice model training and hosting. Refer to the "Text to Speech - Custom Voice - Professional" pricing for details. Custom voice is a limited-access feature. Learn more about how to create custom voices. Avatars are charged separately at the interactive avatar pricing published here. For more details on how custom voice and avatar training are charged, refer to this pricing note.

Here are a few examples of different setups and their charges.

Scenario 1: a customer service agent built with standard Azure speech-to-text input, GPT-4.1, and custom Azure text-to-speech output, plus a custom avatar. This scenario aligns with the Voice Live Pro category, and the charges will include:

Feature | Price (per 1M tokens)
Text | Input: $5.50 / Cached input: $2.75 / Output: $22
Audio with Azure AI Speech - Standard | Input: $17 / Cached input: $2.75
Audio with Azure AI Speech - Custom | Output: $55

Separate charges for custom voice and custom avatar:

Feature | Price
Custom voice - professional | Voice model training: $52 per compute hour, up to $4,992 per training; endpoint hosting: $4.04 per model per hour
Custom avatar | Avatar model training: $15 per compute hour; interactive avatar (real-time): $0.60 per minute; endpoint hosting: $0.60 per model per hour

Scenario 2: a learning agent built with GPT-4o-Realtime native audio input and standard Azure Speech output. The charges will include (Voice Live Pro):

Feature | Price (per 1M tokens)
Text | Input: $5.50 / Cached input: $2.75 / Output: $22
Native audio with GPT-4o-Realtime | Input: $44 / Cached input: $2.75
Audio with Azure AI Speech - Standard | Output: $38

Scenario 3: a talent interview agent built with GPT-4o-Mini-Realtime native audio input, standard Azure Speech output, and a standard avatar. The charges will include (Voice Live Basic):

Feature | Price (per 1M tokens)
Text | Input: $0.66 / Cached input: $0.33 / Output: $2.64
Native audio with GPT-4o-Mini-Realtime | Input: $11 / Cached input: $0.33
Audio with Azure AI Speech - Standard | Output: $33

And an additional charge for the standard avatar:

Feature | Price
Text to speech avatar (standard) | Interactive avatar (real-time): $0.50 per minute

Scenario 4: an in-car assistant built with the Phi multimodal model and an Azure custom voice.
The charges will include (Voice Live Lite):

Feature | Price (per 1M tokens)
Text | Input: $0.08 / Cached input: $0.04 / Output: $0.32
Native audio with Phi-MM | Input: $4 / Cached input: $0.04
Audio with Azure AI Speech - Custom | Output: $50

Separate charges for custom voice:

Category | Price
Custom voice - professional | Voice model training: $52 per compute hour, up to $4,992 per training; endpoint hosting: $4.04 per model per hour

Get started
The Voice Live API is transforming how developers build speech-to-speech systems by providing an integrated, scalable, and efficient solution. By combining speech recognition, generative AI, and text-to-speech functionality into a unified interface, it addresses the challenges of traditional implementations, enabling faster development and superior user experiences. From streamlining customer service to enhancing education and public services, the opportunities are endless. The future of voice-first solutions is here; let's build it together!

- Voice Live API introduction
- Try Voice Live in Azure AI Foundry
- Voice Live API documents
- Voice Live Agent code sample in GitHub
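As a rough illustration of how the per-token rates above translate into a cost estimate, here is a small sketch. The rates are copied from the Voice Live Pro rows of the pricing table, the token counts are invented, and actual invoices depend on your exact configuration and region:

# USD per 1M tokens, taken from the Voice Live Pro rows above
RATES = {
    "text_input": 5.5,
    "text_cached_input": 2.75,
    "text_output": 22.0,
    "audio_standard_input": 17.0,
    "audio_standard_output": 38.0,
}

def estimate_cost(token_counts: dict) -> float:
    """token_counts uses the same keys as RATES, with values in tokens."""
    return sum(RATES[key] * count / 1_000_000 for key, count in token_counts.items())

# Hypothetical monthly usage for a Pro-tier agent with standard audio in and out
monthly_usage = {
    "text_input": 3_000_000,
    "text_output": 1_500_000,
    "audio_standard_input": 8_000_000,
    "audio_standard_output": 6_000_000,
}
print(f"Estimated monthly cost: ${estimate_cost(monthly_usage):,.2f}")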
Announcing GA of new Indian voices

TTS requirements for modern businesses have grown significantly. They now call for more natural, conversational, and diverse voices that can serve high-value scenarios such as call center automation, voice assistants, and chatbots. We are pleased to announce the general availability of a host of new Indian-locale voices that meet these requirements.

Configure Embedding Models on Azure AI Foundry with Open Web UI
Introduction
Let's take a closer look at an exciting development in the AI space. Embedding models are the key to transforming complex data into usable insights, driving innovations like smarter chatbots and tailored recommendations. With Azure AI Foundry, Microsoft's powerful platform, you've got the tools to build and scale these models effortlessly. Add in Open Web UI, an intuitive interface for engaging with AI systems, and you've got a winning combo that's hard to beat. In this article, we'll explore how embedding models on Azure AI Foundry, paired with Open Web UI, are paving the way for accessible and impactful AI solutions for developers and businesses. Let's dive in!

To configure an embedding model from Azure AI Foundry in Open Web UI, first set up the requirements below.

Requirements:
- Set up an Azure AI Foundry hub/project.
- Deploy Open Web UI - refer to my previous article on how you can deploy Open Web UI on an Azure VM.
- Optional: Deploy LiteLLM with Azure AI Foundry models to work with Open Web UI - refer to my previous article on how you can do this as well.

Deploying Embedding Models on Azure AI Foundry
Navigate to the Azure AI Foundry site and deploy an embedding model from the "Model + Endpoint" section. For the purpose of this demonstration, we will deploy the "text-embedding-3-large" model by OpenAI. You should receive a URL endpoint and an API key for the embedding model you just deployed. Take note of these credentials, because we will use them in Open Web UI.

Configuring the Embedding Model on Open Web UI
Now head to the Open Web UI admin settings page > Documents and select Azure OpenAI as the embedding model engine. Copy and paste the base URL, the API key, the embedding model deployed on Azure AI Foundry, and the API version (not the model version) into the fields, then click "Save" to apply the changes.

Expected Output
Now let's compare the behavior when the embedding model is configured on Open Web UI and when it is not. (Screenshots in the original post show the results without an embedding model configured and with the Azure OpenAI embedding model configured.)

Conclusion
And there you have it! Embedding models on Azure AI Foundry, combined with the seamless interaction offered by Open Web UI, are truly revolutionizing how we approach AI solutions. This powerful duo not only simplifies the process of building and deploying intelligent systems but also makes cutting-edge technology more accessible to developers and businesses of all sizes. As we move forward, it's clear that such integrations will continue to drive innovation, breaking down barriers and unlocking new possibilities in the AI landscape. So, whether you're a seasoned developer or just stepping into this exciting field, now's the time to explore what Azure AI Foundry and Open Web UI can do for you. Let's keep pushing the boundaries of what's possible!
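One last practical tip: before pasting the endpoint, key, and API version into Open Web UI, you can verify them with a short direct call to the deployment. A sketch assuming the openai Python package and placeholder values (Open Web UI's Azure OpenAI engine expects an Azure OpenAI-compatible endpoint, which is what this client targets):

from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-01",  # use the same API version you enter in Open Web UI
)

response = client.embeddings.create(
    model="text-embedding-3-large",  # your deployment name
    input="Quick connectivity check for Open Web UI.",
)
print(len(response.data[0].embedding))  # text-embedding-3-large returns 3072-dimensional vectors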