Azure AI Speech
Azure AI Search: Microsoft OneLake integration plus more features now generally available
From ingestion to retrieval, Azure AI Search releases enterprise-grade GA features: new connectors, enrichment skills, vector/semantic capabilities, and wizard improvements, enabling smarter agentic systems and scalable RAG experiences.

Upgrade your voice agent with Azure AI Voice Live API
Today, we are excited to announce the general availability of Voice Live API, which enables real-time speech-to-speech conversational experiences through a unified API powered by generative AI models. With Voice Live API, developers can easily voice-enable any agent built with the Azure AI Foundry Agent Service. Azure AI Foundry Agent Service enables the operation of agents that make decisions, invoke tools, and participate in workflows across development, deployment, and production. By eliminating the need to stitch together disparate components, Voice Live API offers a low-latency, end-to-end solution for voice-driven experiences.

As always, a diverse range of customers provided valuable feedback during the preview period. Along with announcing general availability, we are also taking this opportunity to address that feedback and improve the API. Following are some of the new features designed to help developers and enterprises build scalable, production-ready voice agents.

More natively integrated GenAI models, including GPT-Realtime

Voice Live API enables developers to select from a range of advanced AI models designed for conversational applications, such as GPT-Realtime, GPT-5, GPT-4.1, Phi, and others. These models are natively supported and fully managed, eliminating the need for developers to manage model deployment or plan for capacity. These natively supported models may each be at a distinct stage in their life cycle (e.g., public preview, generally available) and be subject to varying pricing structures. The table below lists the models supported in each pricing tier.

| Pricing Tier | Generally Available | In Public Preview |
| --- | --- | --- |
| Voice Live Pro | GPT-Realtime, GPT-4.1, GPT-4o | GPT-5 |
| Voice Live Standard | GPT-4o-mini, GPT-4.1-mini | GPT-4o-Mini-Realtime, GPT-5-mini |
| Voice Live Lite | NA | Phi-4-MM-Realtime, GPT-5-Nano, Phi-4-Mini |

Extended speech languages to 140+

Voice Live API now supports speech input in over 140 languages/locales. View all supported languages by configuration. Automatic multilingual configuration is enabled by default, using the multilingual model.

Integrated with Custom Speech

Developers need customization to better manage input and output for different use cases. Besides the support for Custom Voice released in May 2025, Voice Live now supports seamless integration with Custom Speech for improved speech recognition results. Developers can also improve speech input accuracy with phrase lists and refine speech synthesis pronunciation using custom lexicons, all without training a model. Learn how to customize speech and voice models for Voice Live API.

Natural HD voices upgraded

Neural HD voices in Azure AI Speech are contextually aware and engineered to provide a natural, expressive experience, making them ideal for voice agent applications. The latest V2 upgrade enhances lifelike qualities with features such as natural pauses, filler words, and seamless transitions between speaking styles, all available with Voice Live API. Check out the latest demo of Ava Neural HD V2.

Improved VAD features for interruption detection

Voice Live API now features semantic Voice Activity Detection (VAD), enabling it to intelligently recognize pauses and filler-word interruptions in conversations. In the latest en-US evaluation on multilingual filler-word data, Voice Live API achieved a ~20% relative improvement over previous VAD models.
This leap in performance is powered by integrating semantic VAD into the n-best pipeline, allowing the system to better distinguish meaningful speech from filler noise and enabling more accurate latency tracking and cleaner segmentation, especially in multilingual and noisy environments.

4K avatar support

Voice Live API enables efficient integration with streaming avatars. With the latest updates, avatar options offer support for high-fidelity 4K video models. Learn more about the avatar update.

Improved function calling and integration with Azure AI Foundry Agent Service

Voice Live API enables function calling to assist developers in building robust voice agents with their chosen generative AI models. This release improves asynchronous function calls and enhances integration with Azure AI Foundry Agent Service for agent creation and operation. Learn more about creating a voice live real-time voice agent with Azure AI Foundry Agent Service.

More developer resources and availability in more regions

Developer resources are available in C# and Python, with more to come. Get started with Voice Live API. Voice Live API is available in more regions now, including Australia East, East US, Japan East, and UK South, besides the previously supported regions such as Central India, East US 2, South East Asia, Sweden Central, and West US 2. Check the features supported in each region.

Customers adopting Voice Live

In healthcare, patient experience is always the top priority. With Voice Live, eClinicalWorks' healow Genie contact center solution is now taking healthcare modernization a step further. healow is piloting Voice Live API for Genie to inform patients about their upcoming appointments, answer common questions, and return voicemails. Reducing these routine calls saves healthcare staff hours each day and boosts patient satisfaction through timely interactions.

"We're looking forward to using Azure AI Foundry Voice Live API so that when a patient calls, Genie can detect the question and respond in a natural voice in near-real time," said Sidd Shah, Vice President of Strategy & Business Growth at healow. "The entire roundtrip is all happening in Voice Live API." If a patient asks about information in their medical chart, Genie can also fetch data from their electronic health record (EHR) and provide answers. Read the full story here.

"If we did multiple hops to go across different infrastructures, that would add up to a diminished patient experience. The Azure AI Foundry Voice Live API is integrated into one single, unified solution, delivering speech-to-text and text-to-speech in the same infrastructure." Bhawna Batra, VP of Engineering at eClinicalWorks

Capgemini, a global business and technology transformation partner, is reimagining its global service desk managed operations through its Capgemini Cloud Infrastructure Services (CIS) division. The first phase covers 500,000 users across 45 clients, which is only part of the overall deployment base. The goal is to modernize the service desk to meet changing expectations for speed, personalization, and scale. To drive this transformation, Capgemini launched the "AI-Powered Service Desk" platform powered by Microsoft technologies including Dynamics 365 Contact Center, Copilot Studio, and Azure AI Foundry. A key enhancement was the integration of Voice Live API for real-time voice interactions, enabling intelligent, conversational support across telephony channels.
The new platform delivers a more agile, truly conversational, AI-driven service experience, automating routine tasks and enhancing agent productivity. With scalable voice capabilities and deep integration across Microsoft's ecosystem, Capgemini is positioned to streamline support operations, reduce response times, and elevate customer satisfaction across its enterprise client base.

"Integrating Microsoft's Voice Live API into our platform has been transformative. We're seeing measurable improvements in user engagement and satisfaction thanks to the API's low-latency, high-quality voice interactions. As a result, we are able to deliver more natural and responsive experiences, which have been positively received by our customers." Stephen Hilton, EVP Chief Operating Officer at CIS, Capgemini

Astra Tech, a fast-growing UAE-based technology group that is part of G42, is bringing Voice Live API to botim, its flagship fintech-first and AI-native platform. Eight out of 10 smartphone users in the UAE already rely on the app. The company is now reshaping botim from a communications tool into a fintech-first service, adding features such as digital wallets, international remittances, and micro-loans. To achieve its broader vision, Astra Tech set out to make botim simpler, more intuitive, and more human.

"Voice removes a lot of complexity, and it's the most natural way to interact," says Frenando Ansari, Lead Product Manager at Astra Tech. "For users with low digital literacy or language barriers, tapping through traditional interfaces can be difficult. Voice personalizes the experience and makes it accessible in their preferred language."

"The Voice Live API acts as a connective tissue for AI-driven conversation across every layer of the app. It gives us a standardized framework so that different product teams can incorporate voice without needing to hire deep AI expertise." Frenando Ansari, Lead Product Manager at Astra Tech

"The most impressive thing about the Voice Live API is the voice activity detection and the noise control algorithm." Meng Wang, AI Head at Astra Tech

Get started

Voice Live API is transforming how developers build voice-enabled agent systems by providing an integrated, scalable, and efficient solution. By combining speech recognition, generative AI, and text-to-speech functionalities into a unified interface, it addresses the challenges of traditional implementations, enabling faster development and superior user experiences. From streamlining customer service to enhancing education and public services, the opportunities are endless. The future of voice-first solutions is here. Let's build it together!

- Voice Live API introduction (video)
- Try Voice Live in Azure AI Foundry
- Voice Live API documents
- Voice Live quickstart
- Voice Live Agent code sample in GitHub

Explore Azure AI Services: Curated list of prebuilt models and demos
Unlock the potential of AI with Azure's comprehensive suite of prebuilt models and demos. Whether you're looking to enhance speech recognition, analyze text, or process images and documents, Azure AI services offer ready-to-use solutions that make implementation effortless. Explore the diverse range of use cases and discover how these powerful tools can seamlessly integrate into your projects. Dive into the full catalogue of demos and start building smarter, AI-driven applications today.

Announcing Live Interpreter API - Now in Public Preview
Today, we're excited to introduce Live Interpreter, a breakthrough new capability in Azure Speech Translation that makes real-time, multilingual communication effortless. Live Interpreter continuously identifies the language being spoken without requiring you to set an input language, and delivers low-latency speech-to-speech translation in a natural voice that preserves the speaker's style and tone.
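Live Interpreter's dedicated API surface is still in preview and is not shown in this announcement, but the general shape of the scenario, speech-to-speech translation with automatic source-language detection, can already be tried with today's Speech SDK. The sketch below uses that existing path, not the Live Interpreter API itself; the key, region, candidate-language list, and voice are placeholders.

```python
# Illustrative only: speech translation with automatic source-language detection
# using the existing Speech SDK (azure-cognitiveservices-speech).
# The Live Interpreter preview API may differ from this.
import azure.cognitiveservices.speech as speechsdk

SPEECH_KEY = "<your-speech-key>"      # placeholder
SPEECH_REGION = "<your-region>"       # placeholder

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=SPEECH_KEY, region=SPEECH_REGION
)
translation_config.add_target_language("es")              # translate into Spanish
translation_config.voice_name = "es-ES-ElviraNeural"      # spoken output (delivered via the synthesizing event, not shown)

# Let the service detect which of these languages is being spoken.
auto_detect = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "fr-FR", "de-DE"]
)

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    auto_detect_source_language_config=auto_detect,
    audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
)

result = recognizer.recognize_once_async().get()
if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print("Recognized:", result.text)
    print("Translation:", result.translations["es"])
```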
Power Up Your Open WebUI with Azure AI Speech: Quick STT & TTS Integration

Introduction

Ever found yourself wishing your web interface could really talk and listen back to you? With a few clicks (and a bit of code), you can turn your plain Open WebUI into a full-on voice assistant. In this post, you'll see how to spin up an Azure Speech resource, hook it into your frontend, and watch as user speech transforms into text and your app's responses leap off the screen in a human-like voice. By the end of this guide, you'll have a voice-enabled web UI that actually converses with users, opening the door to hands-free controls, better accessibility, and a genuinely richer user experience. Ready to make your web app speak? Let's dive in.

Why Azure AI Speech?

We use the Azure AI Speech service in Open WebUI to enable voice interactions directly within web applications. This allows users to:

- Speak commands or input instead of typing, making the interface more accessible and user-friendly.
- Hear responses or information read aloud, which improves usability for people with visual impairments or those who prefer audio.
- Enjoy a more natural and hands-free experience, especially on devices like smartphones or tablets.

In short, integrating the Azure AI Speech service into Open WebUI helps make web apps smarter, more interactive, and easier to use by adding speech recognition and voice output features. If you haven't hosted Open WebUI already, follow my other step-by-step guide to host Ollama WebUI on Azure. Proceed to the next step if you have Open WebUI deployed already. Learn more about Open WebUI here.

Deploy the Azure AI Speech service in Azure

Navigate to the Azure portal and search for Azure AI Speech in the portal search bar. Create a new Speech service by filling in the fields on the resource creation page, then click "Create" to finalize the setup. After the resource has been deployed, click the "View resource" button and you should be redirected to the Azure AI Speech service page. The page displays the API keys and endpoints for the Azure AI Speech service, which you can use in Open WebUI.

Setting things up in Open WebUI

Speech to Text settings (STT)

Head to the Open WebUI Admin page > Settings > Audio. Paste the API key obtained from the Azure AI Speech service page into the API key field. Unless you use a different Azure region or want to change the default STT configuration, leave all other settings blank.

Text to Speech settings (TTS)

Now, configure the TTS settings in Open WebUI by toggling the TTS Engine to the Azure AI Speech option. Again, paste the API key obtained from the Azure AI Speech service page and leave all other settings blank. You can change the TTS voice from the dropdown selection in the TTS settings. Click Save to apply the change.

Expected Result

Now, let's test if everything works well. Open a new chat or temporary chat in Open WebUI and click the Call / Record button. The STT engine (Azure AI Speech) should recognize your voice, and the app should respond based on the voice input. To test the TTS feature, click Read Aloud (the speaker icon) under any response from Open WebUI. The TTS engine should now be Azure AI Speech.
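Under the hood, Open WebUI is calling the same Azure AI Speech endpoints that you can reach directly from code. As a quick sanity check of your key and region (the values below are placeholders), a minimal Python sketch with the Speech SDK looks roughly like this:

```python
# Minimal check of the same Azure AI Speech resource used by Open WebUI.
# pip install azure-cognitiveservices-speech; key and region are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")

# Speech to text: recognize a single utterance from the default microphone.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
stt_result = recognizer.recognize_once_async().get()
print("Recognized:", stt_result.text)

# Text to speech: read a reply aloud with a neural voice.
speech_config.speech_synthesis_voice_name = "en-US-AvaMultilingualNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async("Hello from Azure AI Speech!").get()
```

If both calls succeed, the same key and region will work in the Open WebUI audio settings described above.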
Conclusion

And that's a wrap! You've just given your Open WebUI the gift of capturing user speech, turning it into text, and then talking right back with Azure's neural voices. Along the way you saw how easy it is to spin up a Speech resource in the Azure portal, wire up real-time transcription in the browser, and pipe responses through the TTS engine. From here, it's all about experimentation. Try swapping in different neural voices or dialing in new languages. Tweak how you start and stop listening, play with silence detection, or add custom pronunciation tweaks for those tricky product names. Before you know it, your interface will feel less like a web page and more like a conversation partner.

Announcing gpt-realtime on Azure AI Foundry:
We are thrilled to announce the general availability of our latest advancement in speech-to-speech technology: gpt-realtime. This new model represents a significant leap forward in our commitment to providing advanced and reliable speech-to-speech solutions. gpt-realtime is a new S2S (speech-to-speech) model with improved instruction following, designed to merge all of our speech-to-speech improvements into a single, cohesive model. This model is now available in the Real-time API, offering enhanced voice naturalness, higher audio quality, and improved function calling capabilities.

Key Features

- New, natural, expressive voices: new voice options (Marin and Cedar) that bring a new level of naturalness and clarity to speech synthesis.
- Improved instruction following: enhanced capabilities to follow instructions more accurately and reliably.
- Enhanced voice naturalness: more lifelike and expressive voice output.
- Higher audio quality: superior audio quality for a better user experience.
- Improved function calling: enhanced ability to call custom code defined by developers.
- Image input support: add images to context and discuss them via voice, no video required.

Check out the model card here: gpt-realtime

Pricing

Pricing for gpt-realtime is 20% lower compared to the previous gpt-4o-realtime preview. Pricing is based on usage per 1 million tokens. (The detailed per-token breakdown table appears in the original post.)

Getting Started

gpt-realtime is available on Azure AI Foundry via Azure Models direct from Azure today. We are excited to see how developers and users will leverage these new capabilities to create innovative and impactful solutions. Check out the model on Azure AI Foundry and see detailed documentation in Microsoft Learn docs.

Building custom AI Speech models with Phi-3 and Synthetic data
Introduction

In today's landscape, speech recognition technologies play a critical role across various industries, improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:

- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition

Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains, such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature, off-the-shelf recognition models may fall short. To achieve the best possible performance, you'll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.

The Data Challenge: When training datasets lack sufficient diversity or volume, especially in niche domains or underrepresented speech patterns, model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.

Addressing Data Scarcity with Synthetic Data

A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft's Phi-3.5 model and Azure's pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale, with no professional recording studio or voice actors needed.

What is Synthetic Data?

Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It's especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:

- Privacy Compliance: Train models without handling personal or sensitive data.
- Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.

By incorporating synthetic data, you can fine-tune custom STT (Speech to Text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.

Overview of the Process

This blog post provides a step-by-step guide, supported by code samples, to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
We will cover steps 1–4 of the high-level architecture (End-to-End Custom Speech-to-Text Model Fine-Tuning Process).

Custom Speech with Synthetic data Hands-on Labs: GitHub Repository

Step 0: Environment Setup

First, configure a .env file based on the provided sample.env template to suit your environment. You'll need to:

- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision Azure AI Speech and an Azure Storage account.

Below is a sample configuration focusing on creating a custom Italian model:

```
# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it

# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct

# Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural

# Azure Account Storage
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container
```

Key Settings Explained:

- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: Specify the language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configurations for your Azure Storage account, where training/evaluation data will be stored.

Step 1: Generating Domain-Specific Text Utterances with Phi-3.5

Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

````python
topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""

question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english.
jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    # ... (additional request parameters elided in the original post)
)

content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys
````
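The snippet above assumes a `client` object created earlier in the notebook, which is not shown in this excerpt. A minimal construction with the azure-ai-inference SDK, reading the same .env keys, might look like the sketch below; the variable names are assumptions rather than the repository's exact code. With the client in place, the Step 1 snippet produces the JSONL shown next.

```python
# Hypothetical client setup for the Phi-3.5 serverless endpoint; the hands-on
# repo may wire this up differently. Reads the keys from the sample .env above.
import os
from dotenv import load_dotenv
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage  # used by the snippet above
from azure.core.credentials import AzureKeyCredential

load_dotenv()  # pulls the AZURE_PHI3.5_* variables into the process environment

client = ChatCompletionsClient(
    endpoint=os.getenv("AZURE_PHI3.5_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("AZURE_PHI3.5_API_KEY")),
)
# The deployment name (AZURE_PHI3.5_DEPLOYMENT_NAME) can be passed as the
# `model` argument on client.complete(...) when the endpoint hosts several models.
```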
Sample Output (Contoso Electronics in Italian):

```
{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}
```

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset

Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech's TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core Function:

```python
def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="https://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
        <voice name='{default_tts_voice}'>
            {html.escape(text)}
        </voice>
    </speak>"""
    speech_sythesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_sythesis_result)
    stream.save_to_wav_file(file_path)
```

Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.

Note: If DELETE_OLD_DATA = True, the training_dataset folder resets each run. If you're mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.
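The helper above relies on a module-level `speech_synthesizer` that the repository configures elsewhere. A plausible setup, using the Speech key and region from the .env file, is sketched below (an assumption, not the repo's exact code); passing `audio_config=None` keeps the synthesized audio in memory so `AudioDataStream` can write it to a file. The snippet that follows then zips the generated WAV files for upload.

```python
# Assumed synthesizer setup for the helper above; the hands-on repo may differ.
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.getenv("AZURE_AI_SPEECH_API_KEY"),
    region=os.getenv("AZURE_AI_SPEECH_REGION"),
)

# audio_config=None: do not route audio to a speaker; keep the result in memory
# so AudioDataStream.save_to_wav_file can persist it.
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
```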
Code snippet (illustrative):

```python
import zipfile
import shutil

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if(DELETE_OLD_DATA):
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)

print(f"Created zip file: {zip_filename}")
shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = {os.path.join(train_dataset_dir, zip_filename)}
%store train_dataset_path
```

You'll also similarly create evaluation data using a different TTS voice than the one used for training, to ensure a meaningful evaluation scenario.

Example Snippet to create the synthetic evaluation data:

```python
import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if(DELETE_OLD_DATA):
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')

for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
```

Step 3: Creating and Training a Custom Speech Model

To fine-tune and evaluate your custom model, you'll interact with Azure's Speech-to-Text APIs:

1. Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
2. Register your dataset as a Custom Speech dataset.
3. Create a Custom Speech model using that dataset.
4. Create evaluations using that custom model, polling with asynchronous calls until they complete.

You can also use UI-based approaches to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on we'll use the Azure Speech-to-Text REST APIs to iterate through the entire process.

Key APIs & References:

- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts API calls for convenience.
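For orientation, the calls that common.py wraps are plain REST requests against the Speech-to-Text v3.2 API. The sketch below approximates the dataset-registration step; the exact payload fields, endpoint path, and API version used by the repository may differ, so treat the details as assumptions.

```python
# Approximate shape of a "create dataset" request against the v3.2 REST API.
# Field names, path, and the SAS URL below are assumptions for illustration.
import os
import requests

region = os.getenv("AZURE_AI_SPEECH_REGION")
base_url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2"
headers = {
    "Ocp-Apim-Subscription-Key": os.getenv("AZURE_AI_SPEECH_API_KEY"),
    "Content-Type": "application/json",
}

body = {
    "kind": "Acoustic",
    "displayName": "acoustic dataset(zip) for training",
    "description": "[training] Dataset for fine-tuning the Italian base model",
    "locale": "it-IT",
    "contentUrl": "https://<storage-account>.blob.core.windows.net/stt-container/train_it-IT.zip?<sas-token>",
}

resp = requests.post(f"{base_url}/datasets", headers=headers, json=body)
resp.raise_for_status()
print(resp.json()["self"])  # URL of the newly registered dataset resource
```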
Example Snippet to create training dataset:

```python
uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)
```

You can monitor training progress using the monitor_training_status function, which polls the model's status and updates you once training completes.

Core Function:

```python
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while(pbar.n < 3):
            pbar.update(1)
        print("Training Completed")
```

Step 4: Evaluate Trained Custom Speech

After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model's WER.

Key Steps:

- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the evaluation results of the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry Fine-Tuning section, depending on the resource location you specified in the configuration file.

Example Snippet to create evaluation:

```python
description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"
evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)
```

You can also compute a simple Word Error Rate (WER) summary with the code below, which is used in 4_evaluate_custom_model.ipynb.

Example Snippet to create WER dataframe:

```python
# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)
```

About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training.
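To make the metric concrete, here is a tiny, self-contained illustration of the same formula using the jiwer package; this is not part of the hands-on repository, just a quick way to sanity-check the arithmetic.

```python
# WER = (insertions + deletions + substitutions) / total reference words
# pip install jiwer
import jiwer

reference  = "come posso risolvere un problema con il mio televisore contoso"
hypothesis = "come posso risolvere un problema con il mio televisore consoto"

print(jiwer.wer(reference, hypothesis))  # 0.1 -> one substitution out of ten reference words
```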
You'll also similarly create a WER result markdown file using the md_table_scoring_result method below.

Core Function:

```python
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)
```

Implementation Considerations

The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion

By combining Microsoft's Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:

- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure's AI and speech services, you'll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions, without the overhead of large-scale data collection efforts. 🙂

Reference

- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)

Personal Voice upgraded to v2.1 in Azure AI Speech, more expressive than ever before
At the Build conference on May 21, 2024, we announced the general availability of Personal Voice, a feature designed to empower customers to build applications where users can easily create and utilize their own AI voices (see the blog). Today we're thrilled to announce that the Azure AI Speech service has released an upgraded zero-shot TTS (text-to-speech) model, named "DragonV2.1Neural". This new model delivers more natural-sounding and expressive voices, offering improved pronunciation accuracy and greater controllability compared to the earlier zero-shot TTS model. In this blog, we'll present the new zero-shot TTS model's audio quality, new features, and benchmark results. We'll also share a guide for controlling pronunciation and accent using the Personal Voice API with the new zero-shot TTS model.

Personal Voice model upgrade

The Personal Voice feature in the Azure AI Speech service empowers users to craft highly personalized synthetic voices based on their own speech characteristics. By providing a speech sample of just a few seconds as the audio prompt, users can rapidly generate an AI voice replica, which can then synthesize speech in any of the supported output languages. This capability unlocks a wide range of applications, from customizing chatbot voices to dubbing video content in an actor's original voice across multiple languages, enabling truly immersive and individualized audio experiences.

Our earlier Personal Voice Dragon TTS model can produce speech with exceptionally realistic prosody and high-fidelity audio quality, but it still encounters pronunciation challenges, especially with complex elements such as named entities. As a result, pronunciation control remains a crucial feature for delivering accurate and natural-sounding speech synthesis. In addition, for scenarios involving speech or video translation, it is crucial for a zero-shot TTS model to accurately produce not only different languages but also specific accents. The ability to precisely control accent ensures that speakers can deliver natural speech in any target accent.

Dragon V2.1 model card

| Attribute | Details |
| --- | --- |
| Architecture | Transformer model |
| Highlights | Multilingual; zero-shot voice cloning with 5–90 s prompts; emotion, accent, and environment adaptation |
| Context Length | 30 seconds of audio |
| Supported Languages | 100+ Azure TTS locales |
| SSML Support | Yes |
| Latency | < 300 ms |
| RTF (Real-Time Factor) | < 0.05 |

Prosody and pronunciation improvement

Compared with our previous Dragon TTS model ("DragonV1"), the new "DragonV2.1" model improves the naturalness of speech, offering more realistic and stable prosody while maintaining better pronunciation accuracy. The original post includes voice samples (prompt audio, DragonV1, and DragonV2.1) for en-US and zh-CN that demonstrate the prosody improvement; the prompt audio is the source speech from humans.

The new "DragonV2.1" model also shows pronunciation improvements. We compared WER (Word Error Rate), which measures the intelligibility of the synthesized speech by an automatic speech recognition (ASR) system. We evaluated WER (lower is better) on all supported locales, with each locale evaluated on more than 100 test cases. The new model achieves on average a 12.8% relative WER reduction compared to DragonV1.
Here are a few complicated cases showing the pronunciation improvement. Compared to DragonV1, the new DragonV2.1 model reads challenging cases such as Chinese polyphony correctly and better reproduces the en-GB accent (the prompt audio and rendered samples for DragonV1 and DragonV2.1 are in the original post):

- zh-CN: 唐朝高僧玄奘受皇帝之命，前往天竺取回真经，途中收服了四位徒弟：机智勇敢的孙悟空、好吃懒做的猪八戒、忠诚踏实的沙和尚以及白龙马。他们一路历经九九八十一难，战胜了无数妖魔鬼怪，克服重重困难。
- en-GB (en-GB accent): Tomato, potato, and basil are in the salad.

Pronunciation control

The "DragonV2.1" model supports pronunciation control with SSML phoneme tags: you can use the ipa phoneme tag and a custom lexicon to specify how the speech is pronounced. The examples below use "ipa" values for attributes of the phoneme element described here. In the example below, the value ph="tə.ˈmeɪ.toʊ" (or ph="təmeɪˈtoʊ") is specified to stress the syllable meɪ.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            <phoneme alphabet="ipa" ph="tə.ˈmeɪ.toʊ"> tomato </phoneme>
        </mstts:ttsembedding>
    </voice>
</speak>
```

You can define how single entities (such as a company, a medical term, or an emoji) are read in SSML by using the phoneme element. To define how multiple entities are read, create an XML-structured custom lexicon file. Then upload the custom lexicon XML file and reference it with the SSML lexicon element. After you publish your custom lexicon, you can reference it from your SSML. The following SSML example references a custom lexicon that was uploaded to https://www.example.com/customlexicon.xml.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <lexicon uri="https://www.example.com/customlexicon.xml"/>
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            BTW, we will be there probably at 8:00 tomorrow morning. Could you help leave a message to Robert Benigni for me?
        </mstts:ttsembedding>
    </voice>
</speak>
```

Language and accent control

You can use the <lang xml:lang> element to adjust the speaking language and accent for your voice, for example to set the preferred accent to en-GB for British English. For information about the supported languages, see the lang element documentation for a table showing the <lang> syntax and attribute definitions. Using this element is recommended for better pronunciation accuracy. The following example shows the <lang xml:lang> element in use:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            <lang xml:lang="en-GB"> Tomato, potato, and basil are in the salad. </lang>
        </mstts:ttsembedding>
    </voice>
</speak>
```

Benchmark evaluation

Benchmarking plays a key role in evaluating the performance of zero-shot TTS models. In this work, we compared our system with other top zero-shot text-to-speech providers, Company A and Company B, for English, and with Company A specifically for Mandarin. This assessment allowed us to measure performance across both languages. We used a widely accepted subjective metric: MOS (Mean Opinion Score) tests were conducted to assess perceptual quality, with listeners listening carefully to the audio samples and rating them.
In our evaluation, the opinion score is judged mainly on four aspects: overall impression, naturalness, conversational quality, and audio quality. Each judge gives a 1-5 score on each aspect; we show the average score below. (The MOS charts for the English and Chinese sets appear in the original post.) These results show that our zero-shot TTS model is slightly better than Company A and Company B on English (> 0.05 score gap) and on par with Company A on Mandarin.

Quick trial with prebuilt voice profiles

To facilitate testing of the new DragonV2.1 model, several prebuilt voice profiles have been made available. By providing a brief prompt audio from each voice and using the new zero-shot model, these prebuilt profiles aim to provide more expressive prosody, high audio fidelity, and a natural tone while preserving the original voice persona. You can explore these profiles firsthand to experience the enhanced quality of our new model, without using your own custom profiles. The prebuilt profile names are: Andrew, Ava, Brian, Emma, Adam, and Jenny.

To utilize these prebuilt profiles for output, assign the appropriate profile name to the "speaker" attribute of the <mstts:ttsembedding> tag.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speaker="Andrew">
            I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
        </mstts:ttsembedding>
    </voice>
</speak>
```

DragonV2.1 audio samples of these prebuilt profiles (Ava, Andrew, Brian, Emma, Adam, Jenny) are available in the original post.

Customer use case

This advanced, high-fidelity model can be used to enable dubbing scenarios, allowing video content to be voiced in the original actor's tone and style across multiple languages. The new Personal Voice model has been integrated into Azure AI video translation, aiming to empower creators of short dramas to reach global markets effortlessly. TopShort and JOWO.ai, next-generation short drama creators and translation providers, partner with the Azure Video Translation Service to deliver one-click AI translation. Check out the demo from TopShort. More videos are available in this channel, owned by JOWO.ai.

Get started

The new zero-shot TTS model will be available in the middle of August and will be exposed in the BaseModels_List operation of the custom voice API. When you see the new model's name "DragonV2.1Neural" in the base models list, please follow these steps: register your use case and apply for access, create the speaker profile ID, and use the voice name "DragonV2.1Neural" to synthesize speech in any of the 100 supported languages. Below is an SSML example using DragonV2.1Neural to generate speech for your personal voice in different languages. More details are provided here.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="DragonV2.1Neural">
        <mstts:ttsembedding speakerprofileid="your speaker profile ID here">
            <lang xml:lang="en-US">
                I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
            </lang>
        </mstts:ttsembedding>
    </voice>
</speak>
```
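The SSML above goes through the regular speech synthesis path. A minimal sketch with the Python Speech SDK is shown below; the key, region, and speaker profile ID are placeholders, and this assumes the DragonV2.1Neural voice is enabled for your resource.

```python
# Sketch: synthesize personal-voice SSML with the Speech SDK.
# The key, region, and speaker profile ID are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

ssml = """<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='DragonV2.1Neural'>
    <mstts:ttsembedding speakerprofileid='your speaker profile ID here'>
      <lang xml:lang='en-US'>Hello from my personal voice.</lang>
    </mstts:ttsembedding>
  </voice>
</speak>"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    speechsdk.AudioDataStream(result).save_to_wav_file("personal_voice.wav")
```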
Building personal voices responsibly

All customers must agree to our usage policies, which include requiring explicit consent from the original speaker, disclosing the synthetic nature of the content created, and prohibiting impersonation of any person or deceiving people using the personal voice service. The full code of conduct guides integrations of synthetic speech and personal voice to ensure consistency with our commitment to responsible AI. Watermarks are automatically added to the speech output generated with personal voices. As the personal voice feature enters general availability, we have updated the watermark technology with enhanced robustness and stronger capabilities for identifying watermark existence. To measure the robustness of the new watermark, we evaluated the accuracy of watermark detection with audio samples generated using personal voice. Our results showed an average accuracy rate higher than 99.7% for detecting the existence of watermarks in various audio editing scenarios. This improvement provides stronger mitigations to prevent potential misuse.

Try the personal voice feature on Speech Studio as a test, or apply for full access to the API for business use. In addition to creating a personal voice, eligible customers can create a brand voice for their business with Custom Voice's professional voice fine-tuning feature. Azure AI Speech also offers over 600 neural voices covering more than 150 languages and locales. With these prebuilt text-to-speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users.

Creating Intelligent Video Summaries and Avatar Videos with Azure AI Services
Unlock the true value of your organization's video content! In this post, I share how we built an end-to-end AI video analytics platform using Microsoft Azure. Discover how AI can automate video analysis, generate intelligent summaries, and create engaging avatar presentations, making content more accessible, actionable, and impactful for everyone. If you're interested in digital transformation, AI-powered automation, or modern content management, this is for you!

Voice Conversion in Azure AI Speech
We are delighted to announce the availability of the Voice Conversion (VC) feature in the Azure AI Speech service, which is currently in preview.

What is Voice Conversion?

Voice Conversion (also called voice changer or speech-to-speech conversion) is the process of transforming the voice characteristics of a given audio clip to those of a target speaker. After Voice Conversion, the resulting audio preserves the source audio's linguistic content and prosody while the voice timbre sounds like the target speaker. (The original post includes a diagram of this process.)

The purpose of Voice Conversion

There are three reasons users need Voice Conversion functionality:

- Voice Conversion can replicate your content using a different voice identity while maintaining the original prosody and emotion. For instance, in education, teachers can record themselves reading stories, and Voice Conversion can deliver these stories using a pre-designed cartoon character's voice. This method preserves the expressiveness of the teacher's reading while incorporating the unique timbre of the cartoon character's voice.
- Another application is multilingual dubbing. When localized content is read by different voices, Voice Conversion can transform them into a uniform voice, ensuring a consistent experience across all languages while keeping the most localized voice characters.
- Voice Conversion enhances control over the expressiveness of a voice. By transforming various speaking styles, such as adopting a unique tone or conveying exaggerated emotions, a voice gains greater versatility in expression and can be more dynamic in different scenarios.

Brief introduction to our Voice Conversion technology

Voice Conversion is built on state-of-the-art generative models and offers high-quality voice conversion. It delivers the following core capabilities:

| Key Capability | Description |
| --- | --- |
| High Speaker Similarity | Captures the timbre and vocal identity of the target speaker; generates audio that accurately matches the target voice |
| Prosody Preservation | Maintains rhythm, stress, and intonation of the source audio; preserves expressive and emotional qualities |
| High Audio Fidelity | Generates realistic, natural-sounding audio; minimizes artifacts |
| Multilingual Support | Enables multilingual Voice Conversion; supports 91 locales (same as standard Text to Speech locale support) |

Voice Conversion in standard TTS voices

In this release, 28 standard TTS voices in en-US have been enabled with Voice Conversion capabilities. These voices are available in the East US, West Europe, and Southeast Asia service regions. (An audio sample is available in the original post.)

How to use

You can enable Voice Conversion by adding the mstts:voiceconversion tag to your SSML. The structure is nearly identical to a standard TTS request, with the addition of specifying a source audio URL and a target voice name.

Note: In voice conversion mode, the synthesized output follows the content and prosody of the provided source audio. Therefore, text input is not required, and any text included in the SSML will be ignored during rendering. Additionally, all SSML elements related to prosody and pronunciation (such as the prosody and phoneme elements) will have no effect, because prosody is derived directly from the source audio.
SSML example:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" xml:gender="Female" name="Microsoft Server Speech Text to Speech Voice (en-US, AvaMultilingualNeural)">
        <mstts:voiceconversion url="https://your.blob.core.windows.net/sourceaudio.wav"></mstts:voiceconversion>
    </voice>
</speak>
```

Voice List

Here is the list of standard neural TTS voices supporting this feature:

AdamMultilingualNeural, AlloyTurboMultilingualNeural, AmandaMultilingualNeural, AndrewMultilingualNeural, AvaMultilingualNeural, BrandonMultilingualNeural, BrianMultilingualNeural, ChristopherMultilingualNeural, CoraMultilingualNeural, DavisMultilingualNeural, DerekMultilingualNeural, DustinMultilingualNeural, EchoTurboMultilingualNeural, EmmaMultilingualNeural, EvelynMultilingualNeural, FableTurboMultilingualNeural, JennyMultilingualNeural, LewisMultilingualNeural, LolaMultilingualNeural, NancyMultilingualNeural, NovaTurboMultilingualNeural, OnyxTurboMultilingualNeural, PhoebeMultilingualNeural, RyanMultilingualNeural, SamuelMultilingualNeural, SerenaMultilingualNeural, ShimmerTurboMultilingualNeural, SteffanMultilingualNeural

Voice Conversion in Custom Voice

Voice Conversion can also be applied to Custom Voice to enhance its expression. This feature is currently available for Custom Voice in private preview. It enhances the Custom Voice experience, and since it only requires a small amount of target speaker data, it offers a quick solution for dynamic voice customization. Customers who have built or plan to build a custom voice on Azure and have a suitable use case for Voice Conversion are invited to contact us at mstts@microsoft.com to preview this feature. (An audio sample is available in the original post.)

Benchmark Evaluation

Benchmarking plays a key role in evaluating the quality of Voice Conversion. In this work, we have compared our solution against a leading Voice Conversion provider across a range of objective and subjective metrics, showcasing its advantages.

Objective Evaluation

We evaluated our system and a leading Voice Conversion provider (Company A) on two language sets (English and Mandarin) using three widely accepted objective metrics:

- SIM (Speaker Similarity): measures how closely the converted voice matches the target speaker's vocal characteristics (higher is better).
- WER (Word Error Rate): measures the intelligibility of the converted voice by an automatic speech recognition (ASR) system (lower is better).
- Pitch Correlation: measures how well the pitch contour (intonation) of the converted voice aligns with the source (higher is better).

| Solution | Test Set | SIM ↑ | WER ↓ | Pitch Correlation ↑ |
| --- | --- | --- | --- | --- |
| Ours | En-US set | 0.70 | 1.9% | 0.61 |
| Company A | En-US set | 0.63 | 2.0% | 0.54 |
| Ours | Zh-CN set | 0.66 | 6.94% | 0.47 |
| Company A | Zh-CN set | 0.55 | 66.48% | 0.40 |

Our Voice Conversion consistently outperforms Company A in speaker similarity and pitch preservation, while achieving lower WER, particularly on Mandarin.

Subjective Evaluation

CMOS (Comparison Mean Opinion Score) tests were conducted to assess perceptual quality. Listeners compared audio pairs and rated which sample sounded more natural. A positive score reflects a preference for one system over the other.

| Test Set | CMOS (Company A vs Ours) |
| --- | --- |
| En-US set | On par |
| Zh-CN set | +0.75 in favor of ours |

These results show that our system achieves the same perceptual quality in English and performs significantly better in Mandarin.
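One practical detail the post leaves implicit is that the mstts:voiceconversion element needs a source-audio URL the service can fetch. A common approach (an assumption here, not prescribed by the post) is to upload the clip to Azure Blob Storage and pass a read-only SAS URL:

```python
# Sketch: upload a source clip to Blob Storage and build a read-only SAS URL
# to place in <mstts:voiceconversion url="...">. Account and container names are placeholders.
from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient, generate_blob_sas, BlobSasPermissions

account_name, account_key = "<storage-account>", "<storage-key>"
container, blob_name = "sourceaudio", "sample.wav"

service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net", credential=account_key
)
with open("sample.wav", "rb") as f:
    service.get_blob_client(container, blob_name).upload_blob(f, overwrite=True)

sas = generate_blob_sas(
    account_name=account_name,
    container_name=container,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)
source_url = f"https://{account_name}.blob.core.windows.net/{container}/{blob_name}?{sas}"
print(source_url)  # use this as the url attribute of <mstts:voiceconversion>
```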
Conclusion

In terms of objective evaluation, our Voice Conversion outperforms the leading Voice Conversion provider in speaker similarity (SIM), pitch correlation, and multilingual capabilities. In terms of subjective evaluation, our Voice Conversion is on par with the provider in English while achieving a significant advantage in Mandarin, which demonstrates its strength in multilingual conversion. Overall, these results show that our current Voice Conversion delivers state-of-the-art quality.