Accelerate End-to-End Custom Speech-to-Text Model Fine-Tuning with LLM and Synthetic Data
Introduction
In today’s landscape, speech recognition technologies play a critical role across various industries—improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:
- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition
Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains—such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature—off-the-shelf recognition models may fall short. To achieve the best possible performance, you’ll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.
The Data Challenge:
When training datasets lack sufficient diversity or volume—especially in niche domains or underrepresented speech patterns—model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.
Addressing Data Scarcity with Synthetic Data
A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft’s Phi-3.5 model and Azure’s pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale—no professional recording studio or voice actors needed.
What is Synthetic Data?
Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It’s especially beneficial when real data is limited, protected, or expensive to gather.
Use cases include:
- Privacy Compliance: Train models without handling personal or sensitive data.
- Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.
By incorporating synthetic data, you can fine-tune custom STT (Speech-to-Text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.
Overview of the Process
This blog post provides a step-by-step guide—supported by code samples—to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model. We will cover steps 1–4 of the high-level architecture:
Figure 1. End-to-End Custom Speech-to-Text Model Fine-Tuning Process
Custom Speech with Synthetic Data Hands-on Labs: GitHub Repository
Step 0: Environment Setup
First, configure a .env file based on the provided sample.env template to suit your environment. You’ll need to:
- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision an Azure AI Speech resource and an Azure Storage account.
Below is a sample configuration focusing on creating a custom Italian model:
# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it
# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct
#Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural
# Azure Storage Account
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container
Key Settings Explained:
- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: Specify the language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated Voice Names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configurations for your Azure Storage account, where training/evaluation data will be stored.
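To make these settings available in the notebooks, one common approach is to load the .env file with python-dotenv. A minimal sketch, assuming the variable names from the sample above and that python-dotenv is installed:
import os
from dotenv import load_dotenv

# Load the .env file from the repo root and read the values used throughout the labs.
load_dotenv()

CUSTOM_SPEECH_LANG = os.getenv("CUSTOM_SPEECH_LANG")        # e.g. "Italian"
CUSTOM_SPEECH_LOCALE = os.getenv("CUSTOM_SPEECH_LOCALE")    # e.g. "it-IT"
TTS_FOR_TRAIN = os.getenv("TTS_FOR_TRAIN")                  # comma-separated voice names
TTS_FOR_EVAL = os.getenv("TTS_FOR_EVAL")
AZURE_AI_SPEECH_REGION = os.getenv("AZURE_AI_SPEECH_REGION")
AZURE_AI_SPEECH_API_KEY = os.getenv("AZURE_AI_SPEECH_API_KEY")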
Step 1: Generating Domain-Specific Text Utterances with Phi-3.5
Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).
Code snippet (illustrative):
from azure.ai.inference.models import SystemMessage, UserMessage

# `client` is a ChatCompletionsClient bound to the Phi-3.5 serverless endpoint
# configured in .env (AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY).
topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""
question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english. jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances. The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)

content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with 'no', '{CUSTOM_SPEECH_LOCALE}', and 'en-US' keys
Sample Output (Contoso Electronics in Italian):
{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}
These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.
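Later steps read these utterances back from a JSONL file (referred to as synthetic_text_file in the code that follows), so persist the model output to disk first. A minimal sketch; the file name is an assumption for illustration:
# Write the generated lines to a JSONL file for the TTS and evaluation steps.
synthetic_text_file = "cc_support_expressions.jsonl"
with open(synthetic_text_file, "w", encoding="utf-8") as f:
    for line in content.splitlines():
        line = line.strip()
        if line:                      # keep one JSON object per line, skip blanks
            f.write(line + "\n")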
Step 2: Creating the Synthetic Audio Dataset
Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech’s TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.
Figure 3. Play WAV files in the notebook
Core Function:
def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="https://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
                   <voice name='{default_tts_voice}'>
                       {html.escape(text)}
                   </voice>
               </speak>"""
    speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_synthesis_result)
    stream.save_to_wav_file(file_path)
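The function above assumes a speech_synthesizer has already been created with the Speech SDK. A minimal sketch of that setup using azure-cognitiveservices-speech and an in-memory output stream (the output format choice is an assumption; the key and region come from the .env settings):
import html
import azure.cognitiveservices.speech as speechsdk

# Build a synthesizer that keeps audio in memory (audio_config=None),
# so each result can be saved to its own WAV file via AudioDataStream.
speech_config = speechsdk.SpeechConfig(
    subscription=AZURE_AI_SPEECH_API_KEY, region=AZURE_AI_SPEECH_REGION
)
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm
)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)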
Execution:
- For each generated text line, the code produces multiple WAV files (one per specified TTS voice).
- It also creates a manifest.txt for reference and a zip file containing all the training data.
Note:
- If DELETE_OLD_DATA = True, the training_dataset folder resets each run.
- If you’re mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.
Code snippet (illustrative):
import os
import datetime
import zipfile
import shutil

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)

print(f"Created zip file: {zip_filename}")
shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = os.path.join(train_dataset_dir, zip_filename)
%store train_dataset_path
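In later notebooks, the stored path can be retrieved with the standard IPython magic %store -r train_dataset_path.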
You’ll also similarly create evaluation data using a different TTS voice than used for training to ensure a meaningful evaluation scenario.
Example Snippet to create the synthetic evaluation data:
import os
import json
import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')

for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
Step 3: Creating and Training a Custom Speech Model
To fine-tune and evaluate your custom model, you’ll interact with Azure’s Speech-to-Text APIs:
- Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
- Register your dataset as a Custom Speech dataset.
- Create a Custom Speech model using that dataset.
- Create evaluations using that custom model, polling with asynchronous calls until the job completes.
You can also use UI-based approaches to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on we'll use the Azure Speech-to-Text REST APIs to iterate through the entire process.
Key APIs & References:
- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts API calls for convenience.
Example Snippet to create training dataset:
uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)
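The custom model itself is then created from the registered dataset(s). The repo's common.py wraps this call, but as an illustrative, non-authoritative sketch of the underlying v3.2 REST request (assuming base_url ends in /speechtotext/v3.2 and headers carries your Ocp-Apim-Subscription-Key):
import requests

def create_custom_model_sketch(base_url, headers, project_id, base_model_id,
                               dataset_ids, display_name, description, locale):
    # Illustrative only: start Custom Speech training from registered datasets.
    body = {
        "displayName": display_name,
        "description": description,
        "locale": locale,
        "baseModel": {"self": f"{base_url}/models/base/{base_model_id}"},
        "datasets": [{"self": f"{base_url}/datasets/{d}"} for d in dataset_ids],
        "project": {"self": f"{base_url}/projects/{project_id}"},
    }
    resp = requests.post(f"{base_url}/models", headers=headers, json=body)
    resp.raise_for_status()
    return resp.json()["self"].split("/")[-1]   # the new custom model ID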
You can monitor training progress using the monitor_training_status function, which polls the model's status and notifies you once training completes.
Core Function:
import time
from tqdm import tqdm

def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Training Completed")
Figure 5. Training status monitoring
Step 4: Evaluate Trained Custom Speech
After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model’s WER.
Figure 6. Evaluate and compare metrics flow
Key Steps:
- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.
After evaluation, you can view the results for both the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry fine-tuning section, depending on the resource location you specified in the configuration file.
Example Snippet to create evaluation:
description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"
evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)
Figure 7. Screenshot of evaluation results in Azure AI Foundry
You can also pull a simple Word Error Rate (WER) number with the code below, as used in 4_evaluate_custom_model.ipynb.
Example Snippet to create the WER DataFrame:
import pandas as pd

# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)
About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training.
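For intuition, here is a small, self-contained sketch of that formula as a word-level edit distance; it is purely illustrative and not the metric implementation the Speech service uses internally:
def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / number of reference words
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)       # match / substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in a six-word reference -> WER of about 0.17
print(word_error_rate("il mio televisore contoso non funziona",
                      "il mio televisore contoso non funzione"))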
You can also create a markdown file of the WER results using the md_table_scoring_result method below.
Core Function:
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)
Figure 8. Screenshot of WER results by md_table_scoring_result function
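As a rough stand-in for what such a helper produces, you could write the comparison table yourself with pandas (the actual md_table_scoring_result in common.py may format things differently; DataFrame.to_markdown requires the tabulate package):
# Illustrative only: dump the WER comparison to a markdown file.
with open("evaluation_results.md", "w", encoding="utf-8") as f:
    f.write(f"## {eval_title}\n\n")
    f.write(wer_df.to_markdown(index=False))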
Implementation Considerations
The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.
Conclusion
By combining Microsoft’s Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:
- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.
As you continue exploring Azure’s AI and speech services, you’ll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions—without the overhead of large-scale data collection efforts. 🙂
Reference
- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)