Enhancing Workplace Safety and Efficiency with Azure AI Foundry's Content Understanding
Discover how Azure AI Foundry’s Content Understanding service, featuring the Video Shot Analysis template, revolutionizes workplace safety and efficiency. By leveraging Generative AI to analyze video data, businesses can gain actionable insights into worker actions, posture, safety risks, and environmental conditions. Learn how this cutting-edge tool transforms operations across industries like manufacturing, logistics, and healthcare.

Introducing Azure AI Agent Service
Introducing Azure AI Agent Service at Microsoft Ignite 2024. Discover how Azure AI Agent Service is revolutionizing the development and deployment of AI agents. This service empowers developers to build, deploy, and scale high-quality AI agents tailored to business needs within hours. With features like rapid development, extensive data connections, flexible model selection, and enterprise-grade security, Azure AI Agent Service sets a new standard in AI automation.

Evaluating Fine-Tuned Models for Function-Calling: Beyond Input-Output Metrics
In the intricate world of machine learning and Artificial Intelligence, the fine-tuning of models for specific tasks is an art form that requires meticulous attention to detail. One such task that has garnered significant attention is function-calling, where models are trained to call specific functions with appropriate arguments based on given inputs. Evaluating these fine-tuned models is crucial to ensure their reliability and effectiveness. While in the previous blog post, we looked at how to run an end-to-end fine-tuning pipeline using the Azure Machine Learning Platform, this blog post will delve into the multifaceted evaluation process of these models, emphasizing the importance of not just input-response evaluation but also the correctness of function calls and arguments. Understanding Function-Calling: Models optimized for Function-calling are designed to interpret input data and subsequently call predefined functions with the correct arguments. These models find applications in various domains, including automated customer support, data processing, and even complex decision-making systems. The key to their success lies in their ability to understand the context and semantics of the input, translating it into precise function calls. The Challenge of Input-Response Evaluation: The most straightforward method of evaluating these models is through input-response evaluation. This involves providing the model with a set of inputs and comparing its responses to the expected outputs. Metrics such as accuracy, precision, recall, and F1-score are commonly used to measure performance. However, input-response evaluation alone presents several challenges: Superficial Assessment: This method primarily checks if the model's output matches the expected result. It doesn't delve into the model's internal decision-making process or the correctness of the function calls and arguments. Misleading Metrics: If the predicted response doesn't match the expected answer, input-response metrics alone won't explain why. The discrepancy could stem from incorrect function calls or arguments, not just from an incorrect final output. Limited Scope: Many tasks require a broader spectrum of capabilities beyond just function-calling. This includes general conversation, generating leading questions to gather necessary inputs for function-calling, and synthesizing responses from function execution. Input-response evaluation doesn't cover these nuances as it requires semantic understanding of the input and response instead of word-by-word assessment. Evaluating Function Calls: The Next Layer To bridge the gap left by input-response evaluation, we need to scrutinize the function calls themselves. This involves verifying that the model calls the appropriate functions for given inputs. Why This Matters Correct Function Semantics: Ensuring the right function is called guarantees that the model understands the semantics of the task. For instance, in a customer support system, calling a function to reset a password instead of updating an address could lead to significant user frustration. Maintainability and Debugging: Correct function calls make the system easier to maintain and debug. If the wrong function is called, it can lead to unexpected behaviors that are harder to trace and fix. Addressing Gaps When the predicted response doesn't match the expected answer, evaluating function names and arguments helps identify the root cause of the discrepancy. 
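As a concrete illustration, a minimal comparison of a predicted call against the expected one might look like the sketch below. It assumes both are dictionaries in the {'tool_uses': [...]} shape used by the preprocessing code later in this post; the helper name is ours.

def compare_tool_calls(predicted: dict, expected: dict) -> dict:
    # Compare function names and arguments separately so a mismatch can be
    # attributed to the right cause.
    pred_calls = predicted.get("tool_uses", [])
    exp_calls = expected.get("tool_uses", [])
    function_match = [p.get("recipient_name") for p in pred_calls] == [e.get("recipient_name") for e in exp_calls]
    argument_match = [p.get("parameters") for p in pred_calls] == [e.get("parameters") for e in exp_calls]
    return {"function_match": function_match, "argument_match": argument_match}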
This insight is crucial for taking necessary actions to improve the model's performance, whether it involves fine-tuning the training data or adjusting the model's architecture. Evaluating Function Arguments: The Final Layer The last layer of evaluation involves scrutinizing the arguments passed to the functions. Even if the correct function is called, incorrect or improperly formatted arguments can lead to failures or incorrect outputs. Importance of Correct Arguments Functional Integrity: The arguments to a function are as crucial as the function itself. Passing incorrect arguments can result in errors or unintended outcomes. For example, calling a function to process a payment with an incorrect amount or currency could have severe financial implications. User Experience: In applications like chatbots or virtual assistants, incorrect arguments can degrade the user experience. A model that correctly calls a weather-check function but passes the wrong location will not serve the user's needs effectively. A Holistic Evaluation Approach To ensure the robustness of fine-tuned models for function-calling, a holistic evaluation approach is necessary. This involves: Input-Response Evaluation: Checking the overall accuracy and effectiveness of the model's outputs. Function Call Verification: Ensuring the correct functions are called for given inputs. Argument Validation: Verifying that the arguments passed to functions are correct and appropriately formatted. Beyond Lexical Evaluation: Semantic Similarity Given the complexity of tasks, it's imperative to extend the scope of metrics to include semantic similarity evaluation. This approach assesses how well the model's output aligns with the intended meaning, rather than just matching words or phrases. Semantic Similarity Metrics: Use advanced metrics like BERTScore, BLEU, ROUGE, or METEOR to measure the semantic similarity between the model's output and the expected response. These metrics evaluate the meaning of the text, not just the lexical match. Contextual Understanding: Incorporate evaluation methods that assess the model's ability to understand context, generate leading questions, and synthesize responses. This ensures the model can handle a broader range of tasks effectively. Evaluate GenAI Models and Applications Using Azure AI Foundry The evaluation functionality in the Azure AI Foundry portal provides a comprehensive platform that offers tools and features for assessing the performance and safety of your generative AI model. In Azure AI Foundry portal, you're able to log, view, and analyze detailed evaluation metrics. With built-in and custom evaluators, the tool empowers developers and researchers to analyze models under diverse conditions and scenarios while enabling straightforward comparison of results across multiple models. Within Azure AI Foundry, a comprehensive approach to evaluation includes three key dimensions: Risk and Safety Evaluators: Evaluating potential risks associated with AI-generated content is essential for safeguarding against content risks with varying degrees of severity. This includes evaluating an AI system's predisposition towards generating harmful or inappropriate content. Performance and Quality Evaluators: This involves assessing the accuracy, groundedness, and relevance of generated content using robust AI-assisted and Natural Language Processing (NLP) metrics. 
Custom Evaluators: Tailored evaluation metrics can be designed to meet specific needs and goals, providing flexibility and precision in assessing unique aspects of AI-generated content. These custom evaluators allow for more detailed and specific analyses, addressing particular concerns or requirements that standard metrics might not cover. Running the Evaluation for Fine-Tuned Models Using the Azure Evaluation Framework Metrics Used for the Workflow Function-Call Invocation Function-Call Arguments BLEU Score: Measures how closely the generated text matches the reference text. ROUGE Score: Focuses on recall-oriented measures to assess how well the generated text covers the reference text. GLEU Score: Measures the similarity by shared n-grams between the generated text and ground truth, focusing on both precision and recall. METEOR Score: Considers synonyms, stemming, and paraphrasing for content alignment. Diff Eval: An AI-assisted custom metric that compares the actual response to ground truth and highlights the key differences between the two responses. We will use the same validation split from glaive-function-calling-v2 as used in the fine-tuning blog post, run it through the hosted endpoint for inference, get the response, and use the actual input and predicted response for evaluation. Preprocessing the Dataset First, we need to preprocess the dataset and convert it into a QnA format as the original dataset maintains an end-to-end conversation as one unified record. Parse_conversation and apply_chat_template: This function effectively transforms a raw conversation string into a list of dictionaries, each representing a message with a role and content. Get_multilevel_qna_pairs: This iteratively breaks down the conversation as questions and prompts every time it encounters the role as "assistant" within the formatted dictionary. 
def parse_conversation(input_string): ROLE_MAPPING = {"USER" : "user", "ASSISTANT" : "assistant", "SYSTEM" : "system", "FUNCTION RESPONSE" : "tool"} # Regular expression to split the conversation based on SYSTEM, USER, and ASSISTANT pattern = r"(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):" # Split the input string and keep the delimiters parts = re.split(pattern, input_string) # Initialize the list to store conversation entries conversation = [] # Iterate over the parts, skipping the first empty string for i in range(1, len(parts), 2): role = parts[i].strip() content = parts[i + 1].strip() content = content.replace("<|endoftext|>", "").strip() if content.startswith('<functioncall>'): # build structured data for function call # try to turn function call from raw text to structured data content = content.replace('<functioncall>', '').strip() # replace single quotes with double quotes for valid JSON clean_content = content.replace("'{", '{').replace("'}", '}') data_json = json.loads(clean_content) # Make it compatible with openAI prompt format func_call = {'recipient_name': f"functions.{data_json['name']}", 'parameters': data_json['arguments']} content = {'tool_uses': [func_call]} # Append a dictionary with the role and content to the conversation list conversation.append({"role": ROLE_MAPPING[role], "content": content}) return conversation def apply_chat_template(input_data): try: system_message = parse_conversation(input_data['system']) chat_message = parse_conversation(input_data['chat']) message = system_message + chat_message return message except Exception as e: print(str(e)) return None def get_multilevel_qna_pairs(message): prompts = [] answers = [] for i, item in enumerate(message): if item['role'] == 'assistant': prompts.append(message[:i]) answers.append(item["content"]) return prompts, answers Reference : inference.py Submitting a Request to the Hosted Endpoint : Next, we need to write the logic to send request to the hosted endpoint and run the inference. def run_inference(input_data): # Replace this with the URL for your deployed model url = 'https://llama-endpoint-ft.westus3.inference.ml.azure.com/score' # Replace this with the primary/secondary key, AMLToken, or Microsoft Entra ID token for the endpoint api_key = '' # Update it with the API key params = { "temperature": 0.1, "max_new_tokens": 512, "do_sample": True, "return_full_text": False } body = format_input(input_data, params) body = str.encode(json.dumps(body)) if not api_key: raise Exception("A key should be provided to invoke the endpoint") headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)} req = urllib.request.Request(url, body, headers) try: response = urllib.request.urlopen(req) result = json.loads(response.read().decode("utf-8"))["result"] except urllib.error.HTTPError as error: print("The request failed with status code: " + str(error.code)) # Print the headers - they include the requert ID and the timestamp, which are useful for debugging the failure print(error.info()) print(error.read().decode("utf8", 'ignore')) return result Evaluation Function: Next, we write the evaluation function that will run the inference and will evaluate the match for function calls and function arguments. def eval(query, answer): """ Evaluate the performance of a model in selecting the correct function based on given prompts. 
Args: input_data (List) : List of input prompts for evaluation and benchmarking expected_output (List) : List of expected response Returns: df : Pandas Dataframe with input prompts, actual response, expected response, Match/No Match and ROUGE Score """ # Initialize the ROUGE Scorer where llm response is not function-call scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True) expected_output = answer # For generic model response without function-call, set a threshold to classify it as a match match_threshold_g = 0.75 predicted_response = run_inference(query) is_func_call = False if predicted_response[1:12] == "'tool_uses'": is_func_call = True try: predicted_response = ast.literal_eval(predicted_response) except: predicted_response = predicted_response if isinstance(predicted_response, dict): predicted_functions = [func["recipient_name"] for func in predicted_response["tool_uses"]] predicted_function_args = [func["parameters"] for func in predicted_response["tool_uses"]] actual_functions = [func["recipient_name"] for func in expected_output["tool_uses"]] actual_function_args = [func["parameters"] for func in expected_output["tool_uses"]] fcall_match = predicted_functions == actual_functions fcall_args_match = predicted_function_args == actual_function_args match = "Yes" if fcall_match and fcall_args_match else "No" else: fmeasure_score = scorer.score(expected_output, predicted_response)['rougeL'].fmeasure match = "Yes" if fmeasure_score >= match_threshold_g else "No" result = { "response": predicted_response, "fcall_match": fcall_match if is_func_call else "NA", "fcall_args_match": fcall_args_match if is_func_call else "NA", "match": match } return result Create a AI-assisted custom metric for Difference evaluation Create a Prompty file: Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers. The primary goal is to accelerate the developer inner loop. Prompty standardizes prompts and their execution into a single asset. 2. Create a class to load the Prompty file and process the outputs with JSON format. class DifferenceEvaluator: def __init__(self, model_config: AzureOpenAIModelConfiguration): """ Initialize an evaluator configured for a specific Azure OpenAI model. :param model_config: Configuration for the Azure OpenAI model. :type model_config: AzureOpenAIModelConfiguration **Usage** .. code-block:: python eval_fn = CompletenessEvaluator(model_config) result = eval_fn( question="What is (3+1)-4?", answer="First, the result within the first bracket is 3+1 = 4; then the next step is 4-4=0. The answer is 0", truth="0") """ # TODO: Remove this block once the bug is fixed # https://msdata.visualstudio.com/Vienna/_workitems/edit/3151324 if model_config.api_version is None: model_config.api_version = "2024-05-01-preview" prompty_model_config = {"configuration": model_config} current_dir = os.path.dirname(__file__) prompty_path = os.path.join(current_dir, "difference.prompty") assert os.path.exists(prompty_path), f"Please specify a valid prompty file for completeness metric! The following path does not exist:\n{prompty_path}" self._flow = load_flow(source=prompty_path, model=prompty_model_config) def __call__(self, *, response: str, ground_truth: str, **kwargs): """Evaluate correctness of the answer in the context. :param answer: The answer to be evaluated. :type answer: str :param context: The context in which the answer is evaluated. :type context: str :return: The correctness score. 
:rtype: dict
        """
        # Validate input parameters
        response = str(response or "")
        ground_truth = str(ground_truth or "")
        if not (response.strip()) or not (ground_truth.strip()):
            raise ValueError("All inputs including 'answer' must be non-empty strings.")

        # Run the evaluation flow
        output = self._flow(response=response, ground_truth=ground_truth)
        print(output)
        return json.loads(output)

Reference: difference.py

Run the evaluation pipeline: Run the evaluation pipeline on the validation dataset using both built-in and custom metrics. To ensure that evaluate() can parse the data correctly, you must specify a column mapping from the dataset columns to the keywords accepted by the evaluators.

def run_evaluation(name=None, dataset_path=None):
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
        azure_deployment=os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"]
    )

    # Initializing Evaluators
    difference_eval = DifferenceEvaluator(model_config)
    bleu = BleuScoreEvaluator()
    glue = GleuScoreEvaluator()
    meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)
    rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)

    data_path = str(pathlib.Path.cwd() / dataset_path)
    csv_output_path = str(pathlib.Path.cwd() / "./eval_results/eval_results.csv")
    output_path = str(pathlib.Path.cwd() / "./eval_results/eval_results.jsonl")

    result = evaluate(
        # target=copilot_qna,
        evaluation_name=name,
        data=data_path,
        target=eval,
        evaluators={
            "bleu": bleu,
            "gleu": glue,
            "meteor": meteor,
            "rouge": rouge,
            "difference": difference_eval
        },
        evaluator_config={"default": {
            # only provide additional input fields that target and data do not have
            "ground_truth": "${data.answer}",
            "query": "${data.query}",
            "response": "${target.response}",
        }}
    )
    tabular_result = pd.DataFrame(result.get("rows"))
    tabular_result.to_csv(csv_output_path, index=False)
    tabular_result.to_json(output_path, orient="records", lines=True)
    return result, tabular_result

Reference: evaluate.py

Reviewing the Results: Let's review the results for the function-calling scenarios only. 85 out of 102 records had a 100% match, whereas the rest had discrepancies in the function arguments being passed. The difference evaluator output shows exactly where the responses diverge, which we can use to improve model performance by fixing the training dataset and adjusting the model hyperparameters in subsequent iterations. As can be inferred from these results, the model struggles when number conversion or date formatting is involved, and we can leverage these insights to further fine-tune the model.

Conclusion
Evaluating fine-tuned models for function-calling requires a comprehensive approach that goes beyond input-response metrics. By incorporating function call verification, argument validation, and semantic similarity evaluation, we can ensure these models perform reliably and effectively in real-world applications. This holistic evaluation strategy not only enhances the model's accuracy but also ensures its robustness, maintainability, and user satisfaction.

Fine-Tuning Small Language Models for Function-Calling: A Comprehensive Guide
In the rapidly evolving landscape of artificial intelligence, fine-tuning small language models (SLMs) for use case specific workloads has become increasingly essential. The motivation behind this lies in the need for lower latency, reduced memory footprint, and improved accuracy—all while maintaining cost-effectiveness. This blog delves into the reasons for fine-tuning SLMs for function-call, key considerations, and a practical guide to implementing fine-tuning on Azure Why Fine-Tune Small Language Models? 1. Lower Latency and Reduced Memory Footprint : Smaller models with fewer weights inherently offer faster processing times due to reduced matrix multiplication operations. This lower latency is crucial for real-time applications where speed is paramount. Additionally, these models reduce the memory footprint, making them ideal for deployment in resource-constrained environments. 2. Cost Efficiency : Fine-tuning smaller models is more cost-effective than training large models from scratch. It reduces the computational resources required, thereby lowering operational costs. This makes it a viable option for startups and enterprises looking to optimize their AI expenditure. 3. Improved Accuracy : By tailoring a model to a specific function-calling use case, you can achieve higher accuracy. Fine-tuning allows the model to learn the intricacies of function-calling, thereby providing more relevant and precise outputs. 4. Smaller Token Size : Smaller models and efficient token handling lead to a reduction in token size, which further optimizes processing speed and resource usage. Key Considerations for Fine-Tuning a. Selection of the Right Base Model : Choosing the appropriate base model is crucial. Evaluate industrial benchmarks and leaderboards, such as the [Berkeley Function Call Leaderboard] to guide your selection. Consider factors like model size, which affects GPU VRAM requirements, accuracy, and context length. For this blg post, we will use Llama-3.2-3b-instruct model as our base model for fine-tuning. b. Dataset Preparation : Proper dataset preparation is a cornerstone for successful fine-tuning of SLMs for function-calling tasks. The dataset must be representative of real-world scenarios and cover the full spectrum of use cases you anticipate. For this blog, we will utilize theglaiveai/glaive-function-calling-v2dataset from Hugging Face, renowned for its comprehensive coverage of simple, multiple, and multi-turn function-calling scenarios across diverse domains. - Key Steps in Dataset Preparation: Understanding the Complexity of the Use Case Before diving into the technicalities of dataset preparation, it's essential to understand the complexity of the use case at hand. Is the task limited to function-calling, or does it involve a broader, more generic conversation? If the latter is true, it becomes imperative to ensure that the existing knowledge and capabilities of the language model (SLM) are preserved. The dataset should seamlessly integrate both function-call and non-function-call scenarios to provide a holistic conversational experience. Differentiating Function-Calling Scenarios Let's explore the different scenarios that might arise in function-calling applications: Single Function-Calling: This scenario involves invoking a single function based on user input. For instance, in the travel industry, a user might ask, "What are the available flights from New York to London on December 10th?" 
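A corresponding training record might look like the following sketch, shown in an OpenAI-style chat format; the search_flights function and its fields are hypothetical and would come from your own tool catalog.

{
  "messages": [
    {"role": "system", "content": "You have access to the function search_flights(origin, destination, date)."},
    {"role": "user", "content": "What are the available flights from New York to London on December 10th?"},
    {"role": "assistant", "content": {"tool_uses": [{"recipient_name": "functions.search_flights", "parameters": {"origin": "New York", "destination": "London", "date": "December 10th"}}]}}
  ]
}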
The dataset should include examples that allow the model to extract relevant information and call the flight search function accurately. Multiple Function-Calling: Here, the language model must choose one function from a set of possible tools. For example, if a user asks, "Can you book me a hotel or a flight to Paris?" the dataset should provide instances where the model decides between booking a hotel or a flight based on user preferences or additional input. Multi-Turn Conversations: This scenario requires tools to be invoked in a sequence based on the conversation's state. Consider a user planning a vacation: "I want to visit Italy. What are my options?" followed by "Book me a flight," and then "Find a hotel in Rome." The dataset should capture the flow of conversation, enabling the model to handle each request in context. Parallel Function-Calling: In situations where multiple tools need to be invoked simultaneously, such as booking flights and hotels at the same time, the dataset should include examples that allow the model to manage these parallel tasks effectively. For instance, "Book a flight to Tokyo and reserve a hotel in Shinjuku for the same dates." Handling Missing Information: A robust dataset should also include scenarios where the language model needs to ask the user for missing information. For example, if a user simply says, "Book me a flight," the model should prompt, "Could you please specify the destination and dates?" c. Compute Selection Ensure your compute setup has adequate VRAM to accommodate model weights, gradients, and activations. The compute should be tailored to your model size and batch size requirements. d. Hyperparameter Selection : The selection of hyperparameters is a critical step that can significantly influence the performance of a model. Hyperparameters, unlike the model’s parameters, are not learned from the data but are set before the training process begins. Choosing the right hyperparameters can lead to faster convergence and higher accuracy, making this an area that demands careful attention. Hyperparameters can be thought of as the settings or knobs that you, as the model trainer, can adjust to tailor the training process. These include learning rate, batch size, the architecture of layers, and more. One of the leading methodologies for fine-tuning models is LORA (Low-Rank Adaptation), which has gained popularity due to its efficiency and effectiveness. LORA is a technique that allows for the efficient adaptation of large language models by introducing low-rank matrices during the training process. This approach reduces the number of trainable parameters, leading to faster convergence and reduced computational costs. When using LORA, two primary hyperparameters to consider are: Rank: This represents the dimensionality of the low-rank matrices. It is a critical factor influencing the model’s capacity to learn nuanced patterns. Alpha: This is a scaling factor applied to the low-rank updates, typically set to be 2-4 times the rank value. A good starting point for these parameters might be a rank of 8 and an alpha of 16, but these values should be tailored based on the model's complexity and the specific task at hand. e. Optimize context length : Another significant aspect of model fine-tuning, especially in function-calling scenarios, is the management of context length. In these prompts, we often provide detailed information such as function names, descriptions, and argument types, which consume a substantial number of tokens. 
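This overhead is easy to measure: tokenize a tool definition with and without its descriptions and compare the counts. Below is a rough sketch that assumes the Hugging Face tokenizer for the base model used in this post and a hypothetical search_flights schema; any tokenizer gives a comparable relative picture.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# The same tool schema with and without natural-language descriptions
full_schema = (
    '{"name": "search_flights", "description": "Search available flights", '
    '"parameters": {"origin": {"type": "string", "description": "Departure city"}, '
    '"destination": {"type": "string", "description": "Arrival city"}}}'
)
trimmed_schema = '{"name": "search_flights", "parameters": {"origin": "string", "destination": "string"}}'

print(len(tokenizer.encode(full_schema)), "tokens with descriptions")
print(len(tokenizer.encode(trimmed_schema)), "tokens without descriptions")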
Efficiently managing this context can lead to performance gains without sacrificing accuracy.

Iterative Experimentation with Context Details: To optimize context length, an iterative experimentation approach is recommended:
Baseline Experiment: Start by including all possible details—function descriptions, argument types, and more. This serves as your baseline for comparison.
Simplified Contexts: Gradually remove elements from the context:
First Iteration: Retain only the function names and arguments, omitting descriptions.
Second Iteration: Remove the arguments, keeping just the function names.
Final Iteration: Test the model's performance without any function names or arguments.
By incrementally simplifying the context, you can identify the minimal context the model actually needs. While conducting these experiments, it is advantageous to utilize previous checkpoints. Instead of starting from the base model for each iteration, use the trained model from the previous step as a starting point. This approach can save time and computational resources, allowing for more efficient experimentation.

Fine-Tuning on Azure: Step-by-Step
Now let's run the fine-tuning job while adhering to the guidelines and instructions shared above:
1. Create an Azure Machine Learning Workspace: An Azure Machine Learning workspace is your control center for managing all the resources you need to train, deploy, automate, and manage machine learning models. It serves as a central repository for your datasets, compute resources, and models. To get started, you can create a workspace through the Azure portal by navigating to the Azure Machine Learning service and selecting "Create new workspace." Ensure you configure the resource group, workspace name, region, and other necessary settings.
2. Create a Compute Instance: To run your Python notebook and execute scripts, you need a compute instance. This virtual machine in Azure Machine Learning allows you to perform data preparation, training, and experimentation. Go to the "Compute" section in your workspace, select "Create," and choose a compute instance that fits your needs, ensuring it has the necessary specifications for your workload.
3. Dataset Preparation: For this blog, we'll use the glaiveai/glaive-function-calling-v2 dataset from Hugging Face, which includes simple, multi-turn function-calling and generic conversations across various domains. The dataset needs to be formatted to be compatible with the OpenAI format:
Convert each conversation into a chat_template format.
Assign roles as 'system', 'user', or 'assistant'.
Remove "<|endoftext|>” string and if the response is a function-call, replace the “<functioncall>” string and add role as tool so that LLM knows when to stop responding and wait for function execution results def parse_conversation(input_string): ROLE_MAPPING = {"USER" : "user", "ASSISTANT" : "assistant", "SYSTEM" : "system", "FUNCTION RESPONSE" : "tool"} # Regular expression to split the conversation based on SYSTEM, USER, and ASSISTANT pattern = r"(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):" # Split the input string and keep the delimiters parts = re.split(pattern, input_string) # Initialize the list to store conversation entries conversation = [] # Iterate over the parts, skipping the first empty string for i in range(1, len(parts), 2): role = parts[i].strip() content = parts[i + 1].strip() content = content.replace("<|endoftext|>", "").strip() if content.startswith('<functioncall>'): # build structured data for function call # try to turn function call from raw text to structured data content = content.replace('<functioncall>', '').strip() # replace single quotes with double quotes for valid JSON clean_content = content.replace("'{", '{').replace("'}", '}') data_json = json.loads(clean_content) # Make it compatible with openAI prompt format func_call = {'recipient_name': f"functions.{data_json['name']}", 'parameters': data_json['arguments']} content = {'tool_uses': [func_call]} # Append a dictionary with the role and content to the conversation list conversation.append({"role": ROLE_MAPPING[role], "content": content}) return conversation def prepare_dataset(tokenizer, args): # Create the cache_dir cache_dir = "./outputs/dataset" os.makedirs(cache_dir, exist_ok = True) # Load the dataset from disk train_dataset = load_from_disk(args.train_dir) eval_dataset = load_from_disk(args.val_dir) column_names = list(train_dataset.features) def apply_chat_template(examples): conversations = [] for system, chat in zip(examples["system"], examples["chat"]): try: system_message = parse_conversation(system) chat_message = parse_conversation(chat) message = system_message + chat_message conversations.append(message) except Exception as e: print(e) text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in conversations] return {"text": text} # process the dataseta and drop unused columns processed_train_dataset = train_dataset.map(apply_chat_template, cache_file_name = f"{cache_dir}/cache.arrow", batched = True, remove_columns=column_names) processed_eval_dataset = eval_dataset.map(apply_chat_template, cache_file_name = f"{cache_dir}/cache.arrow", batched = True, remove_columns=column_names) return processed_train_dataset, processed_eval_dataset 4: Create a Data Asset:Azure Machine Learning allows you to register datasets as data assets, making them easily manageable and reusable: def get_or_create_data_asset(ml_client, data_name, data_local_dir, update=False): try: latest_data_version = max([int(d.version) for d in ml_client.data.list(name=data_name)]) if update: raise ResourceExistsError('Found Data asset, but will update the Data.') else: data_asset = ml_client.data.get(name=data_name, version=latest_data_version) logger.info(f"Found Data asset: {data_name}. 
Will not create again") except (ResourceNotFoundError, ResourceExistsError) as e: data = Data( path=data_local_dir, type=AssetTypes.URI_FOLDER, description=f"{data_name} for fine tuning", tags={"FineTuningType": "Instruction", "Language": "En"}, name=data_name ) data_asset = ml_client.data.create_or_update(data) logger.info(f"Created/Updated Data asset: {data_name}") return data_asset train_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_train", data_local_dir=f"{DATA_DIR}/train", update=True) val_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_val", data_local_dir=f"{DATA_DIR}/val", update=True) test_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_test", data_local_dir=f"{DATA_DIR}/test", update=True) 5: Create an Environment:While Azure provides built-in environments for common use cases, creating a custom environment tailored to your specific needs can be beneficial. An environment in Azure ML is essentially a containerized setup that defines the software, libraries, and other dependencies required to run your machine learning workload. Why Use Environments? Reproducibility: By defining an environment, you ensure that your training and inference processes are reproducible, with the same configuration used every time. Consistency: Environments help maintain consistency across different runs and teams, reducing "it works on my machine" problems. Portability: They encapsulate your dependencies, making it easier to move and share your ML projects across different Azure services or even with external collaborators. %%writefile {CLOUD_DIR}/train/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2 USER root # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update && apt-get -y upgrade RUN pip install --upgrade pip RUN apt-get install -y openssh-server openssh-client # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation def get_or_create_docker_environment_asset(ml_client, env_name, docker_dir, update=False): try: latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)]) if update: raise ResourceExistsError('Found Environment asset, but will update the Environment.') else: env_asset = ml_client.environments.get(name=env_name, version=latest_env_version) print(f"Found Environment asset: {env_name}. Will not create again") except (ResourceNotFoundError, ResourceExistsError) as e: print(f"Exception: {e}") env_docker_image = Environment( build=BuildContext(path=docker_dir), name=env_name, description="Environment created from a Docker context.", ) env_asset = ml_client.environments.create_or_update(env_docker_image) print(f"Created Environment asset: {env_name}") return env_asset env = get_or_create_docker_environment_asset(ml_client, azure_env_name, docker_dir=f"{CLOUD_DIR}/train", update=False) Reference : training.ipynb 6: Create a Training Script:Your training script will handle the fine-tuning process and log metrics using MLflow, which is tightly integrated with Azure Machine Learning. This involves - Loading the dataset, defining the model architecture, writing functions to track and log metrics such as training and evaluation loss. 
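Inside the training script, explicit metric logging reduces to calls like the sketch below (the metric names and values are illustrative). In the full script that follows, most of this is handled automatically by the trainer's report_to setting, which forwards metrics to Azure ML.

import mlflow

# Inside an Azure ML job the MLflow tracking URI and run ID are pre-configured,
# so plain logging calls attach metrics to the submitted job.
def log_step_metrics(step, train_loss, eval_loss):
    mlflow.log_metric("train_loss", train_loss, step=step)
    mlflow.log_metric("eval_loss", eval_loss, step=step)

log_step_metrics(step=10, train_loss=1.42, eval_loss=1.38)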
def main(args): ################### # Hyper-parameters ################### # Only overwrite environ if wandb param passed if len(args.wandb_project) > 0: os.environ['WANDB_API_KEY'] = args.wandb_api_key os.environ["WANDB_PROJECT"] = args.wandb_project if len(args.wandb_watch) > 0: os.environ["WANDB_WATCH"] = args.wandb_watch if len(args.wandb_log_model) > 0: os.environ["WANDB_LOG_MODEL"] = args.wandb_log_model use_wandb = len(args.wandb_project) > 0 or ("WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0) training_config = {"per_device_train_batch_size" : args.train_batch_size, # Controls the batch size per device "per_device_eval_batch_size" : args.eval_batch_size, # Controls the batch size for evaluation "gradient_accumulation_steps" : args.grad_accum_steps, "warmup_ratio" : args.warmup_ratio, # Controls the ratio of warmup steps "learning_rate" : args.learning_rate, "fp16" : not torch.cuda.is_bf16_supported(), "bf16" : torch.cuda.is_bf16_supported(), "optim" : "adamw_8bit", "lr_scheduler_type" : args.lr_scheduler_type, "output_dir" : args.output_dir, "logging_steps": args.logging_steps, "logging_strategy": "epoch", "save_steps": args.save_steps, "eval_strategy": "epoch", "num_train_epochs": args.epochs, # "load_best_model_at_end": True, "save_only_model": False, "seed" : 0 } peft_config = { "r": args.lora_r, "lora_alpha": args.lora_alpha, "lora_dropout": args.lora_dropout, "bias": "none", #"target_modules": "all-linear", "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], "modules_to_save": None, "use_gradient_checkpointing": "unsloth", "use_rslora": False, "loftq_config": None, } checkpoint_dir = os.path.join(args.output_dir, "checkpoints") train_conf = TrainingArguments( **training_config, report_to="wandb" if use_wandb else "azure_ml", run_name=args.wandb_run_name if use_wandb else None, ) model, tokenizer = load_model(args) model = FastLanguageModel.get_peft_model(model, **peft_config) ############### # Setup logging ############### logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) log_level = train_conf.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) transformers.utils.logging.set_verbosity(log_level) transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() # Log on each process a small summary logger.warning( f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}" + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}" ) logger.info(f"Training/evaluation parameters {train_conf}") logger.info(f"PEFT parameters {peft_config}") # Load the dataset train_dataset, eval_dataset = prepare_dataset(tokenizer, args) ########### # Training ########### trainer = SFTTrainer( model=model, args=train_conf, tokenizer = tokenizer, train_dataset=train_dataset, eval_dataset=eval_dataset, dataset_text_field="text", packing = False # Can make training 5x faster for shorter responses ) # Show current memory stats gpu_stats = torch.cuda.get_device_properties(0) start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3) max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3) logger.info(f"GPU = {gpu_stats.name}. 
Max memory = {max_memory} GB.") logger.info(f"{start_gpu_memory} GB of memory reserved.") last_checkpoint = None if os.path.isdir(checkpoint_dir): checkpoints = [os.path.join(checkpoint_dir, d) for d in os.listdir(checkpoint_dir)] if len(checkpoints) > 0: checkpoints.sort(key=os.path.getmtime, reverse=True) last_checkpoint = checkpoints[0] trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint) ############# # Evaluation ############# tokenizer.padding_side = "left" metrics = trainer.evaluate() metrics["eval_samples"] = len(eval_dataset) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # ############ # # Save model # ############ os.makedirs(args.model_dir, exist_ok=True) if args.save_merged_model: print("Save PEFT model with merged 16-bit weights") model.save_pretrained_merged("outputs", tokenizer, save_method="merged_16bit") else: print(f"Save PEFT model: {args.model_dir}/model") model.save_pretrained(f"{args.model_dir}/model") tokenizer.save_pretrained(args.model_dir) Reference : train.py 7: Create the Compute Cluster:. For this experiment, we are using Standard_NC24ads_A100_v4 which has 1 GPU and 80 GB of VRAM. Select the compute based on the model size and batch size. from azure.ai.ml.entities import AmlCompute ### Create the compute cluster try: compute = ml_client.compute.get(azure_compute_cluster_name) print("The compute cluster already exists! Reusing it for the current run") except Exception as ex: print( f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {azure_compute_cluster_size}!" ) try: print("Attempt #1 - Trying to create a dedicated compute") tier = 'LowPriority' if USE_LOWPRIORITY_VM else 'Dedicated' compute = AmlCompute( name=azure_compute_cluster_name, size=azure_compute_cluster_size, tier=tier, max_instances=1, # For multi node training set this to an integer value more than 1 ) ml_client.compute.begin_create_or_update(compute).wait() except Exception as e: print("Error") 8: Submit the Fine-Tuning Job With everything set up, you can now submit your fine-tuning job: from azure.ai.ml import command from azure.ai.ml import Input from azure.ai.ml.entities import ResourceConfiguration job = command( inputs=dict( #train_dir=Input(type="uri_folder", path=DATA_DIR), # Get data from local path train_dir=Input(path=f"{AZURE_DATA_NAME}_train@latest"), # Get data from Data asset val_dir = Input(path=f"{AZURE_DATA_NAME}_val@latest"), epoch=d['train']['epoch'], train_batch_size=d['train']['train_batch_size'], eval_batch_size=d['train']['eval_batch_size'], ), code=f"{CLOUD_DIR}/train", # local path where the code is stored compute=azure_compute_cluster_name, command="python train_v3.py --train_dir ${{inputs.train_dir}} --val_dir ${{inputs.val_dir}} --train_batch_size ${{inputs.train_batch_size}} --eval_batch_size ${{inputs.eval_batch_size}}", #environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/77", # Use built-in Environment asset environment=f"{azure_env_name}@latest", distribution={ "type": "PyTorch", "process_count_per_instance": 1, # For multi-gpu training set this to an integer value more than 1 }, ) returned_job = ml_client.jobs.create_or_update(job) ml_client.jobs.stream(returned_job.name) 9: Monitor Training Metrics: After initiating the job, keep an eye on the output for key metrics like training loss and evaluation loss. 
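Besides watching the streamed output, you can read the logged metrics back programmatically once the job finishes. The sketch below assumes the azureml-mlflow package is installed and reuses the job object from the previous step; the metric key "eval_loss" is illustrative and depends on what your trainer logged.

import mlflow
from mlflow.tracking import MlflowClient

# Point MLflow at the workspace tracking store, then pull the eval-loss history.
mlflow.set_tracking_uri(ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri)

client = MlflowClient()
run_id = returned_job.name  # Azure ML surfaces the job as an MLflow run of the same name
for metric in client.get_metric_history(run_id, "eval_loss"):
    print(metric.step, metric.value)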
Since we've logged the results to MLflow, which is seamlessly integrated with Azure Machine Learning, we can easily review the loss function by navigating to the metrics tab within the jobs section. Key Takeways: Both the training and evaluation loss decrease significantly in the initial steps, suggesting effective learning. The gradual reduction in loss in subsequent steps indicates that the model continues to refine its parameters, but at a slower rate. The consistency in the downward trend for both training and evaluation loss implies that the model is not overfitting and is generalizing well to new data. However, the slight uptick towards the end in the evaluation loss might need monitoring to ensure it doesn't indicate overfitting at later stages. Overall, it looks promising, so lets go ahead and register the model. 10: Register the Model:After fine-tuning, register the model to make it available for deployment: from azureml.core import Workspace, Run import os # Connect to your workspace ws = Workspace.from_config() experiment_name = 'experiment_name' run_id = 'job_name' run = Run(ws.experiments[experiment_name], run_id) # Register the model model = run.register_model( model_name=d["serve"]["azure_model_name"], # this is the name the model will be registered under model_path="outputs" # this is the path to the model file in the run's outputs ) # Create a local directory to save the outputs local_folder = './model_v2' os.makedirs(local_folder, exist_ok=True) # Download the entire outputs folder run.download_files(prefix='outputs', output_directory=local_folder) Step 11: Deploy the Model to a Managed Online Endpoint: Managed online endpoints provide a seamless way to deploy models without managing underlying infrastructure. They offer scalability, versioning, and easy rollback compared to deploying on an Azure Kubernetes Service (AKS) cluster. 11 a. Build the enviornment: For deploying the model to managed online endpoint, first create the environment with required dependencies and webserver for inference. %%writefile {CLOUD_DIR}/serve/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2 # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir # Inference requirements COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/ RUN /var/requirements/install_system_requirements.sh && \ cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \ cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \ ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \ rm -f /etc/nginx/sites-enabled/default ENV SVDIR=/var/runit ENV WORKER_TIMEOUT=400 EXPOSE 5001 8883 8888 # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update RUN apt-get install -y openssh-server openssh-client RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation Reference : serving.ipynb 11b. Create a serving script: Creating a serve script for inference is a crucial step in deploying your machine learning model to a production environment. This script handles incoming requests, processes input data, runs the model inference, and returns the results. In Azure Machine Learning, the serve script is part of the deployment package for your model, typically used in conjunction with a managed endpoint or a Kubernetes service. 
A serve script in Azure ML typically consists of two main functions: init():This function initializes the model and any other necessary resources. It is called once when the deployment is first loaded. run(data): This function is called every time a request is made to the deployed model. It processes the incoming data, performs inference using the model, and returns the results. import os import re import json import torch import base64 import logging from io import BytesIO from transformers import AutoTokenizer, AutoProcessor, pipeline from transformers import AutoModelForCausalLM, AutoProcessor device = torch.device("cuda" if torch.cuda.is_available() else "cpu") def init(): """ This function is called when the container is initialized/started, typically after create/update of the deployment. You can write the logic here to perform init operations like caching the model in memory """ global model global tokenizer # AZUREML_MODEL_DIR is an environment variable created during deployment. # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION) # Please provide your model's folder name if there is one model_name_or_path = os.path.join( os.getenv("AZUREML_MODEL_DIR"), "outputs" ) model_kwargs = dict( trust_remote_code=True, device_map={"":0}, torch_dtype="auto" ) model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map ={"" : 0}, **model_kwargs) tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) logging.info("Loaded model.") def run(json_data: str): logging.info("Request received") data = json.loads(json_data) input_data = data["input_data"] params = data['params'] pipe = pipeline("text-generation", model = model, tokenizer = tokenizer) output = pipe(input_data, **params) result = output[0]["generated_text"] logging.info(f"Generated text : {result}") json_result = {"result" : str(result)} return json_result Reference : score.py 11c. Create a managed online endpoint and deploy the model to endpoint: Creating an endpoint and deploying your model on Azure Machine Learning is the final step to make your model accessible for real-time inference. This process involves setting up a service that can handle incoming requests, execute the model, and return the results. Why Create an Endpoint? An endpointis a network-accessible interface that allows external applications or users to interact with your deployed machine learning model. Creating an endpoint is crucial for the following reasons: Accessibility: Endpoints make your model accessible over the internet or within a secured network, enabling other applications, services, or users to send requests and receive responses. API Integration: By exposing your model as a RESTful API, endpoints facilitate integration with various applications, allowing seamless communication and data exchange. Load Management: An endpoint can manage requests from multiple clients, handling concurrent requests and distributing the load appropriately. Security: Endpoints provide mechanisms for authentication and authorization, ensuring that only authorized users can access the model. Scalability: Azure-managed endpoints can automatically scale based on demand, ensuring that your model can handle varying workloads without manual intervention. 
from azure.ai.ml.entities import ( ManagedOnlineEndpoint, IdentityConfiguration, ManagedIdentityConfiguration, ) azure_endpoint_name = d['serve']['azure_endpoint_name'] # Check if the endpoint already exists in the workspace try: endpoint = ml_client.online_endpoints.get(azure_endpoint_name) print("---Endpoint already exists---") except: # Create an online endpoint if it doesn't exist # Define the endpoint endpoint = ManagedOnlineEndpoint( name=azure_endpoint_name, description=f"Test endpoint for {model.name}", ) # Trigger the endpoint creation try: ml_client.begin_create_or_update(endpoint).wait() print("\n---Endpoint created successfully---\n") except Exception as err: raise RuntimeError( f"Endpoint creation failed. Detailed Response:\n{err}" ) from err Why Deploy a Model? Deploymentis the process of transferring your trained machine learning model from a development environment to a production environment where it can serve real-time predictions. Deployment is critical because: Operationalization: Deployment operationalizes your model, moving it from an experimental or development phase to a live environment where it can deliver value to end-users or systems. Resource Allocation: Deploying a model involves configuring the necessary compute resources (such as CPU, memory, and GPUs) to ensure optimal performance during inference. Environment Consistency: During deployment, the model is packaged with its dependencies in a consistent environment, ensuring reproducibility and minimizing discrepancies between development and production. Monitoring and Maintenance: Deployment sets up the infrastructure to monitor the model's performance, usage, and health, allowing for ongoing maintenance and updates. Version Control: Deployment allows you to manage and update different versions of your model, providing flexibility to roll back or switch to newer versions as needed. from azure.ai.ml.entities import ( OnlineRequestSettings, CodeConfiguration, ManagedOnlineDeployment, ProbeSettings, Environment ) azure_deployment_name = f"{d['serve']['azure_deployment_name']}-v1" deployment = ManagedOnlineDeployment( name=azure_deployment_name, endpoint_name=azure_endpoint_name, model=model, instance_type=azure_compute_cluster_size, instance_count=1, #code_configuration=code_configuration, environment = env, scoring_script="score.py", code_path=f"./{CLOUD_DIR}/inference", #environment_variables=deployment_env_vars, request_settings=OnlineRequestSettings(max_concurrent_requests_per_instance=20, request_timeout_ms=90000, max_queue_wait_ms=60000), liveness_probe=ProbeSettings( failure_threshold=30, success_threshold=1, period=100, initial_delay=500, ), readiness_probe=ProbeSettings( failure_threshold=30, success_threshold=1, period=100, initial_delay=500, ), ) # Trigger the deployment creation try: ml_client.begin_create_or_update(deployment).wait() print("\n---Deployment created successfully---\n") except Exception as err: raise RuntimeError( f"Deployment creation failed. 
Detailed Response:\n{err}" ) from err endpoint.traffic = {azure_deployment_name: 100} endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint) Step 12: Run Inference on Sample Data: Test the deployed model using sample data that expects function calls: import json import os sample = { "input_data": [ {'role': 'system', 'content': 'You are an helpful assistant who has access to the following functions to help the user, you can use the functions if needed- { "name": "calculate_shipping_cost", "description": "Calculate the cost of shipping a package", "parameters": { "type": "object", "properties": { "weight": { "type": "number", "description": "The weight of the package in pounds" }, "destination": { "type": "string", "description": "The destination of the package" } }, "required": [ "weight", "destination" ] }}}"'}, {'role': 'user', 'content': 'Can you help me with shipping cost for a package?'}, {'role': 'assistant', 'content': 'Sure! I can help you with that. Please provide me with the weight and destination of the package.'}, {'role': 'user', 'content': 'The weight of the package is 10 pounds and the destination is New York.'} ], "params": { "temperature": 0.1, "max_new_tokens": 512, "do_sample": True, "return_full_text": False } } # Dump the sample data into a json file with open(request_file, "w") as f: json.dump(sample, f) result = ml_client.online_endpoints.invoke( endpoint_name=azure_endpoint_name, deployment_name=azure_deployment_name, request_file=request_file ) result_json = json.loads(result) result = result_json['result'] print(result) Step 13: Compare with Base Model: Now, lets run the same sample through the base model to observe the difference in performance. As we can see, while the fine-tuned model did a perfect job of generating response with the right function and arguments, the base model struggles to generate the desired output Step 14: Rerun the fine-tuning job by removing function descriptions from the system message: Now, lets rerun the experiment, but this time we will drop the function description from the dataset for context length optimization def remove_desc_from_prompts(data): system_message = data['system'] pattern = r'"description":\s*"[^"]*",?\n?' # Remove the "description" fields cleaned_string = re.sub(pattern, '"description":"",', system_message) return cleaned_string ## Update the system message by removing function descriptions and argument description train_dataset = train_dataset.map(lambda x : {"updated_system" : remove_desc_from_prompts(x)}, remove_columns = ["system"]) test_dataset = test_dataset.map(lambda x : {"updated_system" : remove_desc_from_prompts(x)}, remove_columns = ["system"]) val_dataset = val_dataset.map(lambda x : {"updated_system" : remove_desc_from_prompts(x)}, remove_columns = ["system"]) train_dataset.save_to_disk(f"{DATA_DIR}/train") test_dataset.save_to_disk(f"{DATA_DIR}/test") val_dataset.save_to_disk(f"{DATA_DIR}/val") Reference : preprocess.py As can be seen from the results, removing the function description doesn't degrade the model performance but instead this fine-tuned model version requires lesser input tokens resulting in a significant reduction in token consumption with improved latency. Step 15: Further Exploration: Consider removing arguments or even the function itself in subsequent experiments to evaluate performance. Conclusion This blog post has walked through the process of fine-tuning an SLM for function-calling on Azure Machine Learning. 
By following these steps, you can effectively tailor a model to meet specific functional requirements. You can access the full code here. For a deeper dive into evaluating fine-tuned models, including metrics and code samples, check out the next blog post. By leveraging Azure's powerful tools, you can streamline the development and deployment of machine learning models, making them more efficient and effective for your specific tasks.

Reference:
Fine tuning for function calling | OpenAI Cookbook
Fine-tuning function calls with Azure OpenAI Service - Azure AI services | Microsoft Learn
michaelnny/Llama3-FunctionCalling: Fine-tune Llama3 model to support function calling
Fine Tuning LLMs for Function Calling w/Pawel Garbacki - YouTube
slm-innovator-lab/2_slm-fine-tuning-mlstudio at main · Azure/slm-innovator-lab

Building custom AI Speech models with Phi-3 and Synthetic data
Introduction

In today's landscape, speech recognition technologies play a critical role across various industries—improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:

- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition

Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains—such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature—off-the-shelf recognition models may fall short. To achieve the best possible performance, you'll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.

The Data Challenge: When training datasets lack sufficient diversity or volume—especially in niche domains or underrepresented speech patterns—model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.

Addressing Data Scarcity with Synthetic Data

A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft's Phi-3.5 model and Azure's pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale—no professional recording studio or voice actors needed.

What is Synthetic Data?

Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It's especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:

- Privacy Compliance: Train models without handling personal or sensitive data.
- Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.

By incorporating synthetic data, you can fine-tune custom STT (Speech to Text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.

Overview of the Process

This blog post provides a step-by-step guide—supported by code samples—to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
We will cover steps 1–4 of the high-level architecture: End-to-End Custom Speech-to-Text Model Fine-Tuning Process.

Custom Speech with Synthetic data Hands-on Labs: GitHub Repository

Step 0: Environment Setup

First, configure a .env file based on the provided sample.env template to suit your environment. You'll need to:

- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision Azure AI Speech and an Azure Storage account.

Below is a sample configuration focusing on creating a custom Italian model:

```
# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it

# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct

# Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural

# Azure Storage Account
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container
```

Key Settings Explained:

- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: Specify the language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configuration for your Azure Storage account, where training and evaluation data will be stored.

Azure AI Speech Studio > Voice Gallery
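Before moving on to Step 1, the notebooks load these settings and construct a chat client for the serverless Phi-3.5 endpoint. The snippet below is only a sketch of that wiring: it assumes the python-dotenv and azure-ai-inference packages, and that python-dotenv accepts the dotted key names as written; the repo's own setup cells remain the authoritative version.

```python
from dotenv import dotenv_values
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# dotenv_values returns a plain dict, which sidesteps the fact that keys such as
# "AZURE_PHI3.5_ENDPOINT" contain a dot and are awkward as real environment variables.
config = dotenv_values(".env")

CUSTOM_SPEECH_LANG = config["CUSTOM_SPEECH_LANG"]
CUSTOM_SPEECH_LOCALE = config["CUSTOM_SPEECH_LOCALE"]
TTS_FOR_TRAIN = config["TTS_FOR_TRAIN"]
TTS_FOR_EVAL = config["TTS_FOR_EVAL"]

# Chat-completions client for the serverless Phi-3.5 endpoint used in Step 1
client = ChatCompletionsClient(
    endpoint=config["AZURE_PHI3.5_ENDPOINT"],
    credential=AzureKeyCredential(config["AZURE_PHI3.5_API_KEY"]),
)
```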
Step 1: Generating Domain-Specific Text Utterances with Phi-3.5

Use the Phi-3.5 model to generate custom textual utterances in your target language and in English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

````python
topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""

question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english.
jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)

content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys
````

Sample Output (Contoso Electronics in Italian):

```
{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}
```

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset

Using the generated utterances from Step 1, you can now produce synthetic speech WAV files with Azure AI Speech's TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core Function:

```python
def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="http://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
    <voice name='{default_tts_voice}'>
    {html.escape(text)}
    </voice>
    </speak>"""
    speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_synthesis_result)
    stream.save_to_wav_file(file_path)
```

Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.

Note: If DELETE_OLD_DATA = True, the training_dataset folder resets each run. If you're mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.
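The per-utterance synthesis loop over the training voices mirrors the evaluation-data loop shown later. The sketch below is illustrative only: names such as synthetic_text_file and output_dir are assumptions that may differ from the repo, and the TTS_FOR_TRAIN voices come from the .env configured in Step 0. The zipping snippet that follows then packages these generated files.

```python
import datetime
import json
import os

languages = [CUSTOM_SPEECH_LOCALE]
output_dir = "synthetic_train_data"   # assumed folder for the generated WAVs
os.makedirs(output_dir, exist_ok=True)

train_tts_voices = TTS_FOR_TRAIN.split(',')

for tts_voice in train_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            expression = json.loads(line)
            no = expression['no']
            for lang in languages:
                text = expression[lang]
                timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                file_name = f"{no}_{lang}_{timestamp}.wav"
                # One WAV per (utterance, voice) pair
                get_audio_file_by_speech_synthesis(text, os.path.join(output_dir, file_name), lang, tts_voice)
                # Manifest maps each WAV to its transcript (tab-separated)
                with open(f'{output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                    manifest_file.write(f"{file_name}\t{text}\n")

# Collect the generated files for the zipping snippet below
files = os.listdir(output_dir)
```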
Code snippet (illustrative):

```python
import zipfile
import shutil

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'

with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)

print(f"Created zip file: {zip_filename}")
shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

# Store the zip path for later notebooks
train_dataset_path = os.path.join(train_dataset_dir, zip_filename)
%store train_dataset_path
```

You'll also similarly create evaluation data, using a different TTS voice from the training voices to ensure a meaningful evaluation scenario.

Example snippet to create the synthetic evaluation data:

```python
import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')

for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
```

Step 3: Creating and Training a Custom Speech Model

To fine-tune and evaluate your custom model, you'll interact with Azure's Speech-to-Text APIs:

1. Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
2. Register your dataset as a Custom Speech dataset.
3. Create a Custom Speech model using that dataset.
4. Create evaluations with that custom model, polling asynchronously until they complete.

You can also use a UI-based approach to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on we use the Azure Speech-to-Text REST APIs to run the entire process.

Key APIs & References:
- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts the API calls for convenience.
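To make the dataset-registration step concrete, here is a rough sketch of what a create_dataset-style helper could look like against the Speech-to-Text v3.2 REST surface. It is illustrative only — the exact paths and payload fields should be checked against the v3.2 reference, and the repo's common.py remains the authoritative implementation.

```python
import requests

def create_dataset(base_url, headers, project_id, content_url, kind,
                   display_name, description, locale):
    """Register a dataset (e.g., the uploaded zip) with the Custom Speech service.

    base_url is assumed to be of the form
    https://{AZURE_AI_SPEECH_REGION}.api.cognitive.microsoft.com/speechtotext/v3.2
    and headers to carry the Ocp-Apim-Subscription-Key.
    """
    body = {
        "kind": kind,                      # e.g. "Acoustic"
        "displayName": display_name,
        "description": description,
        "locale": locale,                  # e.g. "it-IT"
        "contentUrl": content_url,         # SAS URL of the zip in Blob Storage
        "project": {"self": f"{base_url}/projects/{project_id}"},
    }
    response = requests.post(f"{base_url}/datasets", headers=headers, json=body)
    response.raise_for_status()
    dataset = response.json()
    # The dataset ID is the last segment of the entity's self link
    return dataset["self"].split("/")[-1]
```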
Example snippet to create the training dataset:

```python
uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)
```

You can monitor training progress using the monitor_training_status function, which polls the model's status and updates you once training completes.

Core Function:

```python
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Training Completed")
```

Step 4: Evaluate the Trained Custom Speech Model

After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model's WER.

Key Steps:
- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the results of the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, either in Speech Studio or in the AI Foundry fine-tuning section, depending on the resource location specified in the configuration file.

Example snippet to create an evaluation:

```python
description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"

evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)
```

You can also pull a simple Word Error Rate (WER) comparison with the code below, which is used in 4_evaluate_custom_model.ipynb.

Example snippet to create the WER dataframe:

```python
# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)
```

About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training.
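As a quick sanity check on the formula above, WER can be computed by hand for a toy reference/hypothesis pair. This standalone snippet is purely illustrative and independent of the lab's evaluation API:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("televisore" -> "telefono") out of 8 reference words -> WER = 0.125
print(word_error_rate(
    "come posso risolvere un problema con il televisore",
    "come posso risolvere un problema con il telefono",
))
```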
You'll also similarly produce a WER results markdown file using the md_table_scoring_result method below.

Core Function:

```python
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)
```

Implementation Considerations

The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion

By combining Microsoft's Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:

- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure's AI and speech services, you'll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions—without the overhead of large-scale data collection efforts. 🙂

Reference:
- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)

533Views3likes6Comments
Learn how to create an allowed model listfor the Azure AI model catalog, plus a new way to accesson-premises and custom VNET resources from your managed VNETfor your training, fine-tuning, and inferencing scenarios.2.4KViews3likes1CommentAnnouncing Healthcare AI Models in Azure AI Model Catalog
Modern medicine encompasses various data modalities, including medical imaging, genomics, clinical records, and other structured and unstructured data sources. Understanding the intricacies of this multimodal environment, Azure AI onboards specialized healthcare AI models that go beyond traditional text-based applications, providing robust solutions tailored to healthcare's unique challenges.8.3KViews5likes1Comment