Azure AI Services
Building custom AI Speech models with Phi-3 and Synthetic data
Introduction

In today’s landscape, speech recognition technologies play a critical role across various industries—improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:

- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition

Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains—such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature—off-the-shelf recognition models may fall short. To achieve the best possible performance, you’ll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.

The Data Challenge

When training datasets lack sufficient diversity or volume—especially in niche domains or underrepresented speech patterns—model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.

Addressing Data Scarcity with Synthetic Data

A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft’s Phi-3.5 model and Azure’s pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale—no professional recording studio or voice actors needed.

What is Synthetic Data?

Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It’s especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:

- Privacy Compliance: Train models without handling personal or sensitive data.
- Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.

By incorporating synthetic data, you can fine-tune custom STT (Speech to Text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.

Overview of the Process

This blog post provides a step-by-step guide—supported by code samples—to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
We will cover steps 1–4 of the high-level architecture:

End-to-End Custom Speech-to-Text Model Fine-Tuning Process

Custom Speech with Synthetic data Hands-on Labs: GitHub Repository

Step 0: Environment Setup

First, configure a .env file based on the provided sample.env template to suit your environment. You’ll need to:

- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision an Azure AI Speech resource and an Azure Storage account.

Below is a sample configuration focusing on creating a custom Italian model:

```
# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it

# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct

# Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural

# Azure Storage Account
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container
```

Key Settings Explained:

- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: The language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Settings for the Azure Storage account where training and evaluation data will be stored.

AI Speech Studio > Voice Gallery

Step 1: Generating Domain-Specific Text Utterances with Phi-3.5

Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

```python
topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""
question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english. jsonl format is required.
use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)

content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys
```
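The snippet above assumes a ready-made `client` plus the CUSTOM_SPEECH_* variables. A minimal setup sketch (not the repo's exact code, assuming the python-dotenv and azure-ai-inference packages) might look like this:

```python
# Hedged setup sketch: load the .env settings from Step 0 and build the
# Phi-3.5 chat client that the generation snippet above calls as `client`.
import os
from dotenv import load_dotenv
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage  # message classes used above
from azure.core.credentials import AzureKeyCredential

load_dotenv()  # reads the .env file created from sample.env

CUSTOM_SPEECH_LANG = os.getenv("CUSTOM_SPEECH_LANG", "Italian")
CUSTOM_SPEECH_LOCALE = os.getenv("CUSTOM_SPEECH_LOCALE", "it-IT")

# Serverless Phi-3.5 deployment on Azure AI Foundry
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_PHI3.5_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_PHI3.5_API_KEY"]),
)
```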
Sample Output (Contoso Electronics in Italian):

```
{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}
```

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset

Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech’s TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core Function:

```python
def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="http://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
        <voice name='{default_tts_voice}'>
            {html.escape(text)}
        </voice>
    </speak>"""
    speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_synthesis_result)
    stream.save_to_wav_file(file_path)
```
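The function above relies on a `speech_synthesizer` created elsewhere in the notebook. A minimal sketch of that setup, assuming the .env values from Step 0 and in-memory synthesis rather than speaker playback, could look like this:

```python
# Minimal sketch (not the repo's exact code): build the `speech_synthesizer`
# used by get_audio_file_by_speech_synthesis from the .env settings.
# audio_config=None keeps the audio in memory instead of playing it aloud.
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_AI_SPEECH_API_KEY"],
    region=os.environ["AZURE_AI_SPEECH_REGION"],
)
# 16 kHz, 16-bit mono PCM is one common choice for STT training audio.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm
)
speech_synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=None
)
```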
Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.

Note: If DELETE_OLD_DATA = True, the training dataset folder resets each run. If you’re mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.

Code snippet (illustrative):

```python
import datetime
import os
import shutil
import zipfile

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'

# `files` and `output_dir` come from the earlier WAV-generation step
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)
print(f"Created zip file: {zip_filename}")

shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = os.path.join(train_dataset_dir, zip_filename)
%store train_dataset_path
```

You’ll also create evaluation data in the same way, using a different TTS voice than the one used for training to ensure a meaningful evaluation scenario.

Example snippet to create the synthetic evaluation data:

```python
import datetime
import json
import os

print(TTS_FOR_EVAL)

languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')

for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
```

Step 3: Creating and Training a Custom Speech Model

To fine-tune and evaluate your custom model, you’ll interact with Azure’s Speech-to-Text APIs:

1. Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
2. Register your dataset as a Custom Speech dataset.
3. Create a Custom Speech model using that dataset.
4. Create evaluations using that custom model, polling asynchronously until each job completes.

You can also customize a speech model with fine-tuning through the UI in the Azure AI Foundry portal, but in this hands-on we use the Azure Speech-to-Text REST APIs to iterate through the entire process.

Key APIs & References:

- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts the API calls for convenience.
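For context on what those wrapped calls look like, here is an illustrative sketch of registering an uploaded zip as a Custom Speech dataset through the v3.2 REST API. The variable names mirror the .env settings, the SAS URL is a placeholder, and the exact payload fields should be checked against the API reference rather than taken as-is:

```python
# Illustrative sketch of the kind of REST call that common.py wraps:
# registering a training zip (already uploaded to Blob Storage) as a dataset.
import os
import requests

region = os.environ["AZURE_AI_SPEECH_REGION"]
speech_key = os.environ["AZURE_AI_SPEECH_API_KEY"]
locale = os.environ.get("CUSTOM_SPEECH_LOCALE", "it-IT")

base_url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2"
headers = {"Ocp-Apim-Subscription-Key": speech_key, "Content-Type": "application/json"}

payload = {
    "kind": "Acoustic",                                       # audio + transcript training data
    "displayName": "acoustic dataset(zip) for training",
    "description": "Synthetic training data generated with Azure AI Speech TTS",
    "locale": locale,
    "contentUrl": "<SAS URL of the uploaded training zip>",   # placeholder
    # A project link can also be supplied, e.g. "project": {"self": f"{base_url}/projects/<project_id>"}
}

resp = requests.post(f"{base_url}/datasets", headers=headers, json=payload)
resp.raise_for_status()
dataset = resp.json()
print("Created dataset:", dataset["self"])  # resource URL; the dataset ID is its last path segment
```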
Example snippet to create the training dataset:

```python
uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)
```

You can monitor training progress using the monitor_training_status function, which polls the model’s status and updates you once training completes.

Core Function:

```python
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Training Completed")
```

Step 4: Evaluate the Trained Custom Speech Model

After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model’s WER.

Key Steps:

- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the results of the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry fine-tuning section, depending on the resource location you specified in the configuration file.

Example snippet to create an evaluation:

```python
description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"

evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)
```

You can also get a simple Word Error Rate (WER) summary from the code below, which is used in 4_evaluate_custom_model.ipynb.

Example snippet to create the WER DataFrame:

```python
# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)
```

About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training.
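Since the evaluation job reports WER for you, you rarely need to compute it by hand, but the metric is easy to reproduce for spot checks. The following self-contained sketch (not part of the repo) implements the formula above with standard edit-distance dynamic programming:

```python
# Minimal WER illustration:
# WER = (substitutions + deletions + insertions) / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over six reference words -> WER ≈ 0.17
print(word_error_rate("il mio televisore contoso non funziona",
                      "il mio telefono contoso non funziona"))
```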
You’ll also create a WER results markdown file in a similar way, using the md_table_scoring_result method below.

Core Function:

```python
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)
```

Implementation Considerations

The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion

By combining Microsoft’s Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:

- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure’s AI and speech services, you’ll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions—without the overhead of large-scale data collection efforts. 🙂

Reference

- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)

Introducing Azure AI Agent Service
Introduced at Microsoft Ignite 2024: discover how Azure AI Agent Service is revolutionizing the development and deployment of AI agents. This service empowers developers to build, deploy, and scale high-quality AI agents tailored to business needs within hours. With features like rapid development, extensive data connections, flexible model selection, and enterprise-grade security, Azure AI Agent Service sets a new standard in AI automation.

Implementing Event Hub Logging for Azure OpenAI Streaming APIs
Azure OpenAI's streaming responses use Server-Sent Events (SSE), which support only one subscriber. This creates a challenge when using APIM's Event Hub Logger, as it would consume the stream and prevent the actual client from receiving the response. This solution introduces a lightweight Azure Function proxy that enables Event Hub logging while preserving the streaming response for clients. With token usage data available in both the streaming and non-streaming AOAI APIs, we can monitor consumption the right way!

Architecture

Client → APIM → Azure Function Proxy → Azure OpenAI
                        ↓
                   Event Hub

Technical Implementation

Streaming Response Handling

The core implementation uses FastAPI's StreamingResponse to handle Server-Sent Events (SSE) streams with three key components:

1. Content Aggregation

```python
async def process_openai_stream(response, messages, http_client, start_time):
    content_buffer = []

    async def generate():
        for chunk in response:
            if chunk.choices[0].delta.content:
                content_buffer.append(chunk.choices[0].delta.content)
            yield f"data: {json.dumps(chunk.model_dump())}\n\n"
```

This enables real-time streaming to clients while collecting the complete response for logging. The content buffer maintains minimal memory overhead by storing only text content.

2. Token Usage Collection

```python
if hasattr(chunk, 'usage') and chunk.usage:
    log_data = {
        "type": "stream_completion",
        "content": "".join(content_buffer),
        "usage": chunk.usage.model_dump(),
        "model": model_name,
        "region": headers.get("x-ms-region", "unknown")
    }
    log_to_eventhub(log_data)
```

Token usage metrics are captured from the final chunk, providing accurate consumption data for cost analysis and monitoring.

3. Performance Tracking

```python
@app.route(route="openai/deployments/{deployment_name}/chat/completions")
async def aoaifn(req: Request):
    start_time = time.time()
    response = await process_request()
    latency_ms = int((time.time() - start_time) * 1000)
    log_data["latency_ms"] = latency_ms
```

End-to-end latency measurement includes request processing, the OpenAI API call, and response handling, enabling performance monitoring and optimization.

Demo: Function start, API call, and Event Hub output (screenshots in the repository).

Setup

1. Deploy the Azure Function.
2. Configure environment variables:

```
AZURE_OPENAI_KEY=
AZURE_OPENAI_API_VERSION=2024-08-01-preview
AZURE_OPENAI_BASE_URL=https://.openai.azure.com/
AZURE_EVENTHUB_CONN_STR=
```

3. Update APIM routing to point to the Function App.

Extension scenarios:

- APIM Managed Identity auth token passthrough
- PII Filtering: Integration with Azure Presidio for real-time PII detection and masking in logs
- Cost Analysis: Token usage mapping to Azure billing metrics
- Latency-based routing: AOAI endpoint ranking could be built based on latency metrics
- Monitoring Dashboard: Real-time visualisation of token usage per model/deployment, response latencies, error rates, and regional distribution

Implementation available on GitHub.
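The snippets above call a log_to_eventhub helper whose implementation lives in the linked repository. As a rough sketch of what such a helper could look like with the azure-eventhub SDK (the hub name here is a placeholder, not the repo's actual value):

```python
# Hedged sketch of a log_to_eventhub helper using the azure-eventhub SDK.
# Reads the AZURE_EVENTHUB_CONN_STR setting configured during setup.
import json
import os
from azure.eventhub import EventHubProducerClient, EventData

def log_to_eventhub(log_data: dict) -> None:
    # The hub name can be embedded in the connection string via EntityPath,
    # or passed explicitly as shown here (hypothetical name).
    producer = EventHubProducerClient.from_connection_string(
        os.environ["AZURE_EVENTHUB_CONN_STR"],
        eventhub_name="aoai-logs",
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(log_data)))
        producer.send_batch(batch)
```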
Inuktitut: A Milestone in Indigenous Language Preservation and Revitalization via Technology

Project Overview

The Power of Indigenous Languages

Inuktitut, an official language of Nunavut and a cornerstone of Inuit identity, is now at the forefront of technological innovation. This project demonstrates the resilience and adaptability of Indigenous languages in the digital age. By integrating Inuktitut into modern technology, we affirm its relevance and vitality in contemporary Canadian society.

Collaboration with Inuit Communities

Central to this project is the partnership between the Government of Nunavut and Microsoft. This collaboration exemplifies the importance of Indigenous leadership in technological advancements. The Government of Nunavut, representing Inuit interests, has been instrumental in guiding this project to ensure it authentically serves the Inuit community.

Inuktitut by the Numbers

Inuktitut is the language of many Inuit communities and is foundational to their way of life. Approximately 24,000 Inuit speak Inuktitut, with 80% using it as their primary language. The 2016 Canadian census reported around 37,570 individuals identifying Inuktitut as their mother tongue, highlighting its significance in Canada's linguistic landscape.

New Features Honoring Inuktitut

We're excited to introduce two neural voices, "SiqiniqNeural" and "TaqqiqNeural," supporting both Roman and Syllabic orthography. These voices, developed with careful consideration of Inuktitut's unique sounds and rhythms, are now available across various Microsoft applications (Microsoft Translator app, Bing Translator, Clipchamp, Edge Read Aloud, and more to come). You can also integrate these voices into your own application through the Azure AI Speech service. You can listen to samples of these voices below:

- iu-Cans-CA-SiqiniqNeural / iu-Latn-CA-SiqiniqNeural: "ᑕᐃᒫᒃ ᐅᒥᐊᓪᓘᓐᓃᑦ ᑲᓅᓪᓘᓐᓃᑦ, ᐊᖁᐊᓂ ᑕᕝᕙᓂ ᐊᐅᓚᐅᑏᑦ ᐊᑕᖃᑦᑕᕐᖓᑕ, ᖃᐅᔨᒪᔭᐃᓐᓇᕆᒐᔅᓯᐅᒃ." (Taimaak umialluunniit kanuulluunniit, aquani tavvani aulautiit ataqattarngata, qaujimajainnarigassiuk.) English translation: "The boat or the canoes, the outboard motors, are attached to the motors." (audio sample)
- iu-Cans-CA-TaqqiqNeural / iu-Latn-CA-TaqqiqNeural: "ᑐᓴᐅᒪᔭᑐᖃᕆᓪᓗᒋᑦ ᓇᓄᐃᑦ ᐃᓄᑦᑎᑐᒡᒎᖅ ᐃᓱᒪᓖᑦ ᐅᑉᐱᓕᕆᐊᒃᑲᓐᓂᓚᐅᖅᓯᒪᕗᖓ ᑕᐃᔅᓱᒪᓂ." (Tusaumajatuqarillugit nanuit inuttitugguuq isumaliit uppiliriakkannilauqsimavunga taissumani.) English translation: "I have heard that the polar bears have Inuit ideas and I re-believed in them at that time." (audio sample)

Preserving Language Through Technology

The Government of Nunavut has generously shared an invaluable collection of linguistic data, forming the foundation of our text-to-speech models. This rich repository includes 11,300 audio files from multiple speakers, totaling approximately 13 hours of content. These recordings capture a diverse range of Inuktitut expression, from the Bible to traditional stories, and even some contemporary novels written by Inuktitut speakers.

Looking Forward

This project is more than a technological advancement; it's a step towards digital Reconciliation. By ensuring Inuktitut's presence in the digital realm, we're supporting the language's vitality and accessibility for future generations of Inuit.

Global Indigenous Language Revitalization

The groundbreaking work with Inuktitut has paved the way for a broader, global initiative to support Indigenous languages worldwide. This expansion reflects Microsoft's commitment to Reconciliation and positions us as a leader in combining traditional knowledge with cutting-edge technology.
While efforts began here in Canada with Inuktitut, Microsoft recognizes the global need for Indigenous language revitalization. We're now working with more Indigenous communities across the world, from Māori in New Zealand to Cherokee in North America, always guided by the principle of Indigenous-led collaboration that was fundamental to the success of the Inuktitut project.

Our aim is to co-create AI tools that not only translate languages but truly capture the essence of each Indigenous culture. This means working closely with elders, language keepers, and community leaders to ensure our technology respects and accurately reflects the unique linguistic features, cultural contexts, and traditional knowledge systems of each language.

These AI tools are designed to empower Indigenous communities in their own language revitalization efforts. From interactive language learning apps to advanced text-to-speech systems, we're providing technological support that complements grassroots language programs and traditional teaching methods.

Conclusion

We are particularly proud to celebrate this milestone in Indigenous language revitalization in partnership with the Government of Nunavut. This project stands as a testament to what can be achieved when Indigenous knowledge and modern technology come together in a spirit of true partnership and respect, fostering the continued growth and use of Indigenous languages.

Find more information about the project in the video below.

Press release from the Government of Nunavut: Language Preservation and Promotion Through Technology: MS Translator Project | Government of Nunavut

Get started

In our ongoing quest to enhance multilingual capabilities in text-to-speech (TTS) technology, our goal is to bring the best voices to our product. Our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication.

Microsoft offers over 500 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice. With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information:

- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
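To try one of the Inuktitut voices from your own code, a minimal sketch with the Azure AI Speech SDK might look like the following; the key and region are placeholders, and the sample sentence is the SiqiniqNeural text shown above.

```python
# Illustrative snippet: synthesize Inuktitut speech with one of the new voices.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<your-speech-key>", region="<your-region>"
)
# Choose the Syllabics ("iu-Cans-CA-...") or Roman ("iu-Latn-CA-...") variant.
speech_config.speech_synthesis_voice_name = "iu-Cans-CA-SiqiniqNeural"

# Default audio config plays the result on the local speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "ᑕᐃᒫᒃ ᐅᒥᐊᓪᓘᓐᓃᑦ ᑲᓅᓪᓘᓐᓃᑦ, ᐊᖁᐊᓂ ᑕᕝᕙᓂ ᐊᐅᓚᐅᑏᑦ ᐊᑕᖃᑦᑕᕐᖓᑕ, ᖃᐅᔨᒪᔭᐃᓐᓇᕆᒐᔅᓯᐅᒃ."
).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
```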
Document Field Extraction with Generative AI

Adoption of Generative AI technologies is accelerating, driven by the transformative potential they offer across various industry sectors. Azure AI enables organizations to create interactive and responsive AI solutions customized to their requirements, playing a significant part in helping businesses harness Generative AI effectively. With the new custom field extraction preview, you can leverage generative AI to efficiently extract fields from documents, ensuring standardized output and a repeatable process to support document automation workflows.

Unlock the Power of AI with Azure AI Foundry!
Imagine having a giant box of LEGO bricks. Each piece is a tool, ready to help you build something incredible—whether it's a robot or a skyscraper. Azure AI Foundry is like that LEGO box, but for creating powerful AI solutions!

All-in-One Place: Just like having all your LEGO pieces in one big box, Azure AI Foundry puts all the tools you need to build AI in one place. This makes it easier and faster to create awesome projects. Azure AI Foundry is an all-in-one platform that combines the capabilities of various AI tools, including the rebranded Azure AI Studio. It offers everything you need to build, deploy, and manage AI solutions—from machine learning to generative AI models. Get started with Azure AI Foundry today.

🔍 Azure AI: Which Tool Should You Use? A Quick Guide! 🚀

🤔 When Should You Use Foundry?
✅ Large-Scale Projects: Manage complex, collaborative AI projects involving multiple teams.
✅ Integration Needs: Combine various Azure services, models, and data sources.
✅ Enterprise Governance: Ensure compliance, security, and orchestration for AI initiatives.

🚀 When to Use OpenAI Studio (Instead of Foundry)?
✅ Rapid Prototyping: Quickly test and fine-tune OpenAI models like GPT or DALL-E.
✅ Specific Tasks: Ideal for focused applications like text generation or image classification.
✅ Streamlined Interface: Access an easy-to-use environment for working with OpenAI models directly.

Key Advantages of Using AI Foundry as a One-Stop Shop:

- Centralized Hub: Combines various AI tools and services, enabling seamless integration and project management, and supports collaboration across data science, engineering, and business teams.
- Model Diversity: Supports different types of models (custom machine learning, OpenAI models, and Azure Cognitive Services) and provides templates and pipelines for different AI scenarios (vision, language, structured data).
- End-to-End Lifecycle Management: Handles the entire AI lifecycle (data preparation, model training, deployment, and monitoring) and supports responsible AI practices (fairness, compliance, and transparency).
- Scalability and Governance: Designed for enterprise-scale projects with robust security, compliance, and access controls, and facilitates team collaboration and workflow orchestration across departments.

Examples of What You Can Do with Azure AI Foundry:

- Customer Service Chatbots: Create chatbots that can answer customer questions 24/7, helping businesses provide better service without needing a human to be available all the time.
- Image Recognition: Develop programs that can look at pictures and tell you what they see, like identifying objects in a photo or recognizing faces.
- Language Translation: Build tools that can translate speech or text from one language to another, making it easier for people from different countries to communicate.
- Predictive Maintenance: Create systems that can predict when machines need maintenance before they break down, helping companies save money and avoid downtime.
- Personalized Recommendations: Develop AI that can suggest products or content based on what a person likes, similar to how streaming services recommend movies or shows you might enjoy.

🔗 Bottom Line: Azure AI Foundry is your one-stop shop for enterprise-level AI projects, while OpenAI Studio offers simplicity and speed for targeted, smaller tasks. Choose based on your project size, complexity, and goals! 🎯

Extracting Handwritten Corrections with Azure AI Foundry's Latest Tools
In document processing, dealing with documents that contain a mix of handwritten and typed text presents a unique challenge. Often, these documents also feature handwritten corrections where certain sections are crossed out and replaced with corrected text. Ensuring that the final extracted content accurately reflects these corrections is crucial for maintaining data accuracy and usability. In our recent endeavors, we explored various tools to tackle this issue, with a particular focus on Document Intelligence Studio and Azure AI Foundry's new Field Extraction Preview feature.

The Challenge

Documents with mixed content types—handwritten and typed—can be particularly troublesome for traditional OCR (Optical Character Recognition) systems. These systems often struggle with recognizing handwritten text accurately, especially when it coexists with typed text. Additionally, when handwritten corrections are involved, distinguishing between crossed-out text and the corrected text adds another layer of complexity, as the model is confused about which value(s) to pick.

Our Approach

Initial Experiments with Pre-built Models

To address this challenge, we initially turned to Document Intelligence Studio's pre-built invoice model, which provided a solid starting point. However, it would often extract both the crossed-out value and the new handwritten value under the same field. In addition, it did not always match the correct key to the field value.

Custom Neural Model Training

Next, we attempted to train a custom neural model in Document Intelligence Studio, which leverages deep learning to predict key document elements and allows for further adjustments and refinements. It is recommended to use at least 100 to 1,000 sample files to achieve more accurate and consistent results. When training models, it is crucial to use text-based PDFs (PDFs with selectable text), as they provide better data for training. The model's accuracy improves with more varied training data, including different types of handwritten edits; without enough training data or variance, the model may overgeneralize. Therefore, we uploaded approximately 100 text-based PDFs to Azure AI Foundry and manually corrected the column containing handwritten text. After training on a subset of these files, we built and tested our custom neural model on the training data. The model performed impressively, achieving a 92% confidence score in identifying the correct values. The main drawbacks were the manual effort required for data labeling and the 30 minutes needed to build the model.

During our experiments, we noticed that when extracting fields from a table, labeling and extracting every column comprehensively, rather than just a few columns, resulted in higher accuracy. The model was better at predicting when it had a complete view of the table.

Breakthrough with Document Field Extraction (Preview)

Finally, the breakthrough came when we leveraged the new Document Field Extraction Preview feature from Azure AI Foundry. This feature demonstrated significant improvements in handling mixed content and provided a more seamless experience in extracting the necessary information.

Field Description Modification: One of the key steps in our process was modifying the field descriptions within the Field Extraction Preview feature. By providing detailed descriptions of the fields we wanted to extract, we helped the AI understand the context and nuances of our documents better.
Specifically, we wanted to make sure that the value extracted for FOB_COST was the handwritten correction, so we wrote in the Field Description: "Ignore strikethrough or 'x'-ed out text at all costs, for example: do not extract red / black pen or marks through text. Do not use stray marks. This field only has numbers."

Correction Handling: During the extraction process, the AI was able to distinguish between crossed-out text and the handwritten corrections. Whenever a correction was detected, the AI prioritized the corrected text over the crossed-out content, ensuring that the final extracted data was accurate and up to date.

Performance Evaluation: After configuring the settings and field descriptions, we ran several tests to evaluate the performance of the extraction process. The results were impressive, with the AI accurately extracting the corrected text and ignoring the crossed-out sections. This significantly reduced the need for manual post-processing and corrections.

Results

The new Field Extraction Preview feature in Azure AI Foundry exceeded our expectations. The modifications we made to the field descriptions, coupled with the AI's advanced capabilities, resulted in a highly efficient and accurate document extraction process. The AI's ability to handle mixed-content documents and prioritize handwritten corrections over crossed-out text has been a game-changer for our workflow.

Conclusion

For anyone dealing with documents that contain a mix of handwritten and typed text, where handwritten corrections are present, we highly recommend exploring Azure AI Foundry's Field Extraction Preview feature. The improvements in accuracy and efficiency can save significant time and effort, ensuring that your extracted data is both reliable and usable. As we continue to refine our processes, we look forward to even more advancements in document intelligence technologies.
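For readers who want to reproduce the starting point described above, here is a hedged sketch (not the team's actual pipeline) of analyzing a document with the prebuilt invoice model via the Document Intelligence SDK; the endpoint, key, and file name are placeholders.

```python
# Sketch: run a document with handwritten edits through the prebuilt invoice
# model to see which field values the general-purpose model surfaces.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("invoice_with_handwritten_edits.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

for invoice in result.documents:
    for name, field in invoice.fields.items():
        # Both the crossed-out and corrected values may surface here, which is
        # the ambiguity the custom field extraction approach addresses.
        print(name, field.value, field.confidence)
```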
Phi-3 Vision – Catalyzing Multimodal Innovation

Microsoft's Phi-3 Vision is a new AI model that combines text and image data to deliver smart and efficient solutions. With just 4.2 billion parameters, it offers high performance and can run on devices with limited computing power. From describing images to analyzing documents, Phi-3 Vision is designed to make advanced AI accessible and practical for everyday use. Explore how this model is set to change the way we interact with AI, offering powerful capabilities in a small and efficient package.

Introducing AI-generated voices for Azure neural text to speech service
In this blog, we introduce two new voices created using the latest controllable voice generation technology: a masculine voice named AIGenerate1 and a feminine voice named AIGenerate2. We also provide a deeper look at the technology behind them.