mlops
Unlocking the Power of Synthetic Data for Fine-Tuning and Evaluation
In the rapidly evolving field of large language models (LLMs) and small language models (SLMs), fine-tuning and evaluation often present unique challenges. Whether the objective is to optimize models for function-calling use cases or to validate multi-agent workflows, one thing remains constant: the need for high-quality, diverse, and contextually relevant data. But what happens when real-world data is unavailable, incomplete, or too sensitive to use? Enter synthetic data: a powerful tool for accelerating the journey from experimentation to deployment. In this blog, we'll explore how synthetic data can address critical challenges, why it's indispensable for certain scenarios, and how Azure AI's Evaluator Simulator package enables seamless generation of synthetic interaction data to simulate user personas and scenarios.

The Growing Need for Synthetic Data in LLM Development

Fine-tuning or evaluating an LLM/SLM for specific use cases often requires vast amounts of labeled data tailored to the task at hand. However, sourcing such data comes with hurdles:

- Data Scarcity: Real-world interaction data for niche use cases may not exist in sufficient quantity.
- Privacy Concerns: User interactions may contain sensitive information, making direct use of this data problematic.
- Scenario Testing: Real-world data rarely accounts for edge cases or extreme scenarios that models must handle gracefully.

Synthetic data solves these problems by creating controlled, customizable datasets that reflect real-world conditions, without the privacy risks or availability constraints.

Synthetic Data for Function-Calling Use Cases

Function-calling in LLMs involves executing API calls based on natural language inputs. For example, users might ask a travel app to "find flights to Paris under $500." Fine-tuning models for such use cases requires training them on structured, intent-rich inputs paired with corresponding API call structures (a small illustrative record is sketched at the end of this section). Synthetic data can:

- Simulate diverse intents: Generate variations of user queries across languages, styles, and preferences.
- Provide structured outputs: Automatically align these queries with the required API call schema for training or evaluation.
- Include edge cases: Test how models respond to ambiguous or incomplete queries.

Model evaluation after fine-tuning presents another set of challenges, because trusted data is needed to measure performance. Synthetic data generated by a superior model, followed by human screening to filter out noise, provides a rich and diverse dataset for comparing the fine-tuned model against the base model.

Synthetic Data in Multi-Agent Workflow Evaluation

Multi-agent workflows involve multiple models (or agents) collaborating to achieve a shared goal. A restaurant recommendation system, for example, may feature one agent parsing user preferences, another querying a knowledge graph, and a third crafting human-like responses. Synthetic data can:

- Simulate complex user personas: From foodies to budget-conscious travelers, generate interactions that test the robustness of multi-agent collaboration.
- Recreate realistic workflows: Model intricate agent-to-agent interactions, complete with asynchronous communication and fallback mechanisms.
- Stress-test failure scenarios: Ensure agents recover gracefully from errors, misunderstandings, or timeouts.

Multi-agent workflows often rely on hybrid architectures that combine SLMs, LLMs, domain-specific models, and fine-tuned systems to balance cost, latency, and accuracy. Synthetic data generated by a superior model can serve as a baseline for evaluating nuances like agent orchestration and error recovery.
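To make the function-calling scenario above concrete, a synthetic training record typically pairs a natural-language query with the structured call the model is expected to emit. The sketch below is purely illustrative; the function name, parameters, and JSON layout are assumptions rather than a prescribed schema.

```python
# Illustrative synthetic record for a function-calling fine-tuning dataset.
# The function name and argument schema are hypothetical examples.
synthetic_record = {
    "messages": [
        {"role": "user", "content": "Find me flights to Paris under $500 for next weekend."},
        {
            "role": "assistant",
            "function_call": {
                "name": "search_flights",  # hypothetical function name
                "arguments": '{"destination": "Paris", "max_price": 500, "date_range": "next_weekend"}',
            },
        },
    ]
}

# Variations of the same intent ("cheap flights to Paris", "Paris flights below 500 dollars", ...)
# can be generated synthetically to broaden coverage while keeping the target call identical.
```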
Azure AI Evaluator Simulator: A Game-Changer

Azure AI's Evaluator Simulator package offers a robust framework for generating synthetic interaction data tailored to your application needs. By simulating diverse user personas and scenarios, it provides:

- Realistic Simulations: Emulate a wide range of user behaviors, preferences, and intents, making it ideal for creating datasets for function-calling and multi-agent workflows.
- Customizability: Tailor simulations to reflect domain-specific nuances, ensuring data relevance.
- Efficiency: Automate data generation at scale, saving time and resources compared to manual annotation.

How It Works

The Azure AI Evaluation SDK's Simulator class is designed to generate synthetic conversations and simulate task-based interactions. The module allows you to configure different personas, such as tech-savvy users, college grads, enterprise professionals, customers, supply chain managers, procurement managers, and finance admins, each interacting with your application in unique ways. You can also define the tasks each of these users is trying to accomplish, such as shopping for a family event, managing inventory, or preparing financial reports. Here's how it operates:

- Model Configuration: Initialize the simulator with your model's parameters (e.g., temperature, top_p, presence_penalty).
- Input Preparation: Provide input data (e.g., text blobs) for context, such as text extracted from a Wikipedia page.
- Prompt Optimization: Use the query_response_generating_prompty_override to customize how query-response pairs are generated.
- User Prompt Specification: Define user behavior using the user_simulating_prompty_override to align simulations with specific personas.
- Target Callback Specification: Implement a callback function that connects the simulator with your application.
- Simulation Execution: Run the simulator to generate synthetic conversations based on your configurations.

By following these steps, developers can create robust test datasets, enabling thorough evaluation and fine-tuning of their AI applications.

Example: Synthetic Data for an E-Commerce Assistant Bot

Let's walk through an example of generating synthetic data for an e-commerce assistant bot. This bot can perform tasks such as acting as a shopping assistant, managing inventory, and creating promo codes. Before we get started, make sure to install the azure-ai-evaluation package to follow along.

Step 1: Define Functions and APIs

Start by defining the core functions the bot can invoke, such as search_products, fetch_product_details, and add_to_cart. These functions simulate real-world operations. Refer to functions and function_list for the complete list of functions and function definitions (a minimal sketch of one function and its schema entry follows Step 2).

Step 2: Configure the Simulator

```python
model_config = {
    "azure_endpoint": azure_endpoint,
    "azure_api_key": azure_api_key,
    "azure_deployment": azure_deployment,
}

from azure.ai.evaluation.simulator import Simulator

simulator = Simulator(model_config=model_config)
```
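As referenced in Step 1, each callable function is backed by a Python implementation plus a tool-schema entry that the chat completion call can consume. The snippet below is a hypothetical sketch of what one such pair might look like; the real implementations live in the sample's functions and function_list modules.

```python
from typing import Optional

# Hypothetical sketch of one bot function and its tool-schema entry.
def search_products(category: str, max_price: Optional[float] = None) -> list:
    """Return a (mocked) list of products for a category, optionally filtered by price."""
    catalog = [
        {"name": "Espresso Maker", "category": "kitchen", "price": 129.0},
        {"name": "Milk Frother", "category": "kitchen", "price": 39.0},
    ]
    return [p for p in catalog
            if p["category"] == category and (max_price is None or p["price"] <= max_price)]

# Entry in function_list, following the OpenAI tools schema expected by tools=function_list.
search_products_schema = {
    "type": "function",
    "function": {
        "name": "search_products",
        "description": "Search the product catalog by category, optionally capped at a maximum price.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "description": "Product category to search."},
                "max_price": {"type": "number", "description": "Optional maximum price filter."},
            },
            "required": ["category"],
        },
    },
}
```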
Next, connect the simulator to the application. For this, establish the client and implement a callback function that invokes the application and facilitates interaction between the simulator and the app.

```python
import json
from typing import List, Dict, Any, Optional

from functions import *
from function_list import function_list
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Establish the client used to call the deployed model
# (endpoint, key, and API version values are assumed to be defined for your environment).
client = AzureOpenAI(
    azure_endpoint=azure_endpoint,
    api_key=azure_api_key,
    api_version="2024-02-15-preview",
)


def call_to_ai_application(query: str) -> str:
    # logic to call your application
    # use a try except block to catch any errors
    system_message = (
        "Assume the role of e-commerce assistant designed for multiple roles. "
        "You can help with creating promo codes, tracking their usage, checking stock levels, "
        "helping customers make shopping decisions and more. You have access to a bunch of tools "
        "that you can use to help you with your tasks. You can also ask the user for more information if needed."
    )
    completion = client.chat.completions.create(
        model=azure_deployment,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": query},
        ],
        max_tokens=800,
        temperature=0.1,
        top_p=0.2,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
        tools=function_list,
        tool_choice="auto",
    )
    message = completion.choices[0].message
    # print("Message : ", message)
    # change this to return the response from your application
    return message


async def callback(
    messages: List[Dict],
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = None
    # call your endpoint or ai application here
    response = call_to_ai_application(query)
    # we are formatting the response to follow the openAI chat protocol format
    if response.tool_calls:
        prev_messages = messages["messages"]
        func_call_messages = []
        tool_calls = response.tool_calls
        ## Add the tool calls to the messages
        for tool_call in tool_calls:
            formatted_response = {"role": "assistant", "function_call": tool_call.function.to_dict()}
            func_call_messages.append(formatted_response)
        ## Execute the APIs and add the responses to the messages
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = tool_call.function.arguments
            func = globals().get(function_name)
            if callable(func):
                result = json.dumps(func(**json.loads(function_args)))
                # formatted_response = {"content" : result, "role" : "tool", "name" : function_name}
                formatted_response = {"role": "function", "content": result, "name": function_name}
                func_call_messages.append(formatted_response)
            else:
                print("Function {} not found".format(function_name))
        # Second API call: Get the final response from the model
        final_response = client.chat.completions.create(
            model=azure_deployment,
            messages=prev_messages + func_call_messages,
        )
        final_response = {"content": final_response.choices[0].message.content, "role": "assistant"}
        func_call_messages.append(final_response)
        # Stringify func_call messages to store in session state
        func_call_messages = create_content_from_func_calls(func_call_messages)
        func_call_messages = {"role": "assistant", "content": func_call_messages}
        messages["messages"].append(func_call_messages)
        # messages["messages"].append(final_response)
        return {"messages": messages["messages"], "stream": stream, "session_state": session_state}
    else:
        formatted_response = {
            "content": response.content,
            "role": "assistant",
        }
        messages["messages"].append(formatted_response)
        return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
```

We have used two helper functions here:

- create_content_from_func_calls: Creates a single string from a list of function-call dictionaries, merging all the internal messages that invoke function calls. This is needed because the simulator module ignores all internal context and only retains the latest response.
- split_content: Splits a string back into a list of dictionaries based on specified separators. This is required as a post-processing step, to split the string comprising the function call and function response into separate messages, each with its own role and content.

Step 3: Define the Tasks

Use the Azure AI Evaluation SDK to configure the simulator with user personas and tasks, such as:

- A marketing manager creating a promo code and tracking its usage.
- A customer making a purchase using the promo code.
- An inventory manager checking stock levels.

Step 4: Customize the User Persona

Internally, the SDK has a prompty file that defines how the LLM simulating the user should behave. The SDK also lets you override this file with your own prompty file. Let's override it to build a user persona who engages in an interactive conversation with the bot and asks follow-up questions, responding to the bot's replies based on their persona and requirements.

```
system:
You must behave as a user who wants to accomplish this task: {{ task }} and you continue to interact with a system that responds to your queries. If there is a message in the conversation history from the assistant, make sure you read the content of the message and include it in your first response.
Your mood is {{ mood }}
Make sure your conversation is engaging and interactive.
Output must be in JSON format. Here's a sample output:
{
  "content": "Here is my follow-up question.",
  "role": "user"
}
```

Step 5: Generate and Store the Outputs

Run the simulator to generate synthetic data. You can set max_conversation_turns to cap the number of conversation turns to simulate.

```python
outputs = await simulator(
    target=callback,
    text="Assume the role of e-commerce assistant designed for multiple roles. You can help with creating promo codes, tracking their usage, checking stock levels, helping customers make shopping decisions and more. You have access to a bunch of tools that you can use to help you with your tasks. You can also ask the user for more information if needed.",
    num_queries=3,
    max_conversation_turns=5,
    tasks=tasks,
    user_simulator_prompty=user_override_prompty,
    user_simulator_prompty_kwargs=user_prompty_kwargs,
)
```

Step 6: Review and Save the Outputs

Let's look at the output for one of the tasks. The simulator engages in an interactive conversation with the application to accomplish the desired task, and all of the interaction between the app and the simulator is captured in the final output. Let's store the output in a file:

```python
with open("output.json", "w") as f:
    json.dump(final_outputs, f)
```
Conclusion

Synthetic data transcends being a mere substitute for real-world data: it is a strategic asset for fine-tuning and evaluating LLMs. By enabling precise control over data generation, synthetic datasets empower developers to simulate user behaviors, test edge cases, and optimize models for specific workflows. With tools like Azure AI's Evaluator Simulator, generating this data has never been more accessible or impactful. Whether you're building models for function-calling, orchestrating multi-agent systems, or tackling niche use cases, synthetic data ensures you're equipped to deliver reliable, high-performing solutions, regardless of complexity. Start leveraging synthetic data today and unlock the full potential of your LLM projects! You can access the full code here.

References

- azureai-samples/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text at main · Azure-Samples/azureai-samples
- How to generate synthetic and simulated data for evaluation - Azure AI Foundry | Microsoft Learn
- Generate Synthetic QnAs from Real-world Data on Azure | Microsoft Community Hub
- How to use function calling with Azure OpenAI Service - Azure OpenAI Service | Microsoft Learn
- Fine-tuning function calls with Azure OpenAI Service - Azure AI services | Microsoft Learn

Evaluating Fine-Tuned Models for Function-Calling: Beyond Input-Output Metrics
In the intricate world of machine learning and artificial intelligence, fine-tuning models for specific tasks is an art that requires meticulous attention to detail. One such task that has garnered significant attention is function-calling, where models are trained to call specific functions with appropriate arguments based on given inputs. Evaluating these fine-tuned models is crucial to ensure their reliability and effectiveness. While the previous blog post looked at how to run an end-to-end fine-tuning pipeline using the Azure Machine Learning platform, this post delves into the multifaceted evaluation process for these models, emphasizing the importance of not just input-response evaluation but also the correctness of function calls and arguments.

Understanding Function-Calling

Models optimized for function-calling are designed to interpret input data and call predefined functions with the correct arguments. These models find applications in various domains, including automated customer support, data processing, and even complex decision-making systems. The key to their success lies in their ability to understand the context and semantics of the input and translate it into precise function calls.

The Challenge of Input-Response Evaluation

The most straightforward method of evaluating these models is input-response evaluation: provide the model with a set of inputs and compare its responses to the expected outputs. Metrics such as accuracy, precision, recall, and F1-score are commonly used to measure performance. However, input-response evaluation alone presents several challenges:

- Superficial Assessment: This method primarily checks whether the model's output matches the expected result. It doesn't examine the model's internal decision-making process or the correctness of the function calls and arguments.
- Misleading Metrics: If the predicted response doesn't match the expected answer, input-response metrics alone won't explain why. The discrepancy could stem from incorrect function calls or arguments, not just from an incorrect final output.
- Limited Scope: Many tasks require a broader spectrum of capabilities beyond function-calling, including general conversation, generating leading questions to gather the inputs needed for function-calling, and synthesizing responses from function execution. Input-response evaluation doesn't cover these nuances, because they require semantic understanding of the input and response rather than word-by-word comparison.

Evaluating Function Calls: The Next Layer

To bridge the gap left by input-response evaluation, we need to scrutinize the function calls themselves. This involves verifying that the model calls the appropriate functions for given inputs.

Why This Matters

- Correct Function Semantics: Ensuring the right function is called guarantees that the model understands the semantics of the task. For instance, in a customer support system, calling a function to reset a password instead of updating an address could lead to significant user frustration.
- Maintainability and Debugging: Correct function calls make the system easier to maintain and debug. If the wrong function is called, it can lead to unexpected behaviors that are harder to trace and fix.

Addressing Gaps

When the predicted response doesn't match the expected answer, evaluating function names and arguments helps identify the root cause of the discrepancy. This insight is crucial for taking the necessary actions to improve the model's performance, whether that involves fine-tuning the training data or adjusting the model's architecture.
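As a minimal illustration of this layer, the check can be as simple as comparing the predicted function name and parsed arguments against the expected call. The sketch below assumes both sides are available as Python dictionaries; the fuller evaluation code later in this post performs the same comparison as part of a larger pipeline.

```python
import json

def check_function_call(predicted: dict, expected: dict) -> dict:
    """Compare a predicted function call against the expected one.

    Both inputs are assumed to look like:
    {"name": "reset_password", "arguments": '{"user_id": "42"}'}
    """
    name_match = predicted["name"] == expected["name"]
    # Parse argument strings so key order and whitespace don't cause false mismatches.
    args_match = json.loads(predicted["arguments"]) == json.loads(expected["arguments"])
    return {"function_match": name_match, "arguments_match": args_match}

# Example: correct function, wrong argument value.
print(check_function_call(
    {"name": "reset_password", "arguments": '{"user_id": "42"}'},
    {"name": "reset_password", "arguments": '{"user_id": "7"}'},
))  # {'function_match': True, 'arguments_match': False}
```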
Evaluating Function Arguments: The Final Layer

The last layer of evaluation involves scrutinizing the arguments passed to the functions. Even if the correct function is called, incorrect or improperly formatted arguments can lead to failures or incorrect outputs.

Importance of Correct Arguments

- Functional Integrity: The arguments to a function are as crucial as the function itself. Passing incorrect arguments can result in errors or unintended outcomes. For example, calling a function to process a payment with an incorrect amount or currency could have severe financial implications.
- User Experience: In applications like chatbots or virtual assistants, incorrect arguments can degrade the user experience. A model that correctly calls a weather-check function but passes the wrong location will not serve the user's needs effectively.

A Holistic Evaluation Approach

To ensure the robustness of fine-tuned models for function-calling, a holistic evaluation approach is necessary. This involves:

- Input-Response Evaluation: Checking the overall accuracy and effectiveness of the model's outputs.
- Function Call Verification: Ensuring the correct functions are called for given inputs.
- Argument Validation: Verifying that the arguments passed to functions are correct and appropriately formatted.

Beyond Lexical Evaluation: Semantic Similarity

Given the complexity of these tasks, it's imperative to extend the scope of metrics to include semantic similarity evaluation. This approach assesses how well the model's output aligns with the intended meaning, rather than just matching words or phrases.

- Semantic Similarity Metrics: Use metrics like BERTScore, BLEU, ROUGE, or METEOR to measure the similarity between the model's output and the expected response, evaluating the meaning of the text rather than only the lexical match.
- Contextual Understanding: Incorporate evaluation methods that assess the model's ability to understand context, generate leading questions, and synthesize responses. This ensures the model can handle a broader range of tasks effectively.

Evaluate GenAI Models and Applications Using Azure AI Foundry

The evaluation functionality in the Azure AI Foundry portal provides a comprehensive platform with tools and features for assessing the performance and safety of your generative AI models. In the Azure AI Foundry portal, you can log, view, and analyze detailed evaluation metrics. With built-in and custom evaluators, the tool empowers developers and researchers to analyze models under diverse conditions and scenarios while enabling straightforward comparison of results across multiple models. Within Azure AI Foundry, a comprehensive approach to evaluation includes three key dimensions (a short example using the built-in quality evaluators follows this list):

- Risk and Safety Evaluators: Evaluating potential risks associated with AI-generated content is essential for safeguarding against content risks of varying severity. This includes evaluating an AI system's predisposition towards generating harmful or inappropriate content.
- Performance and Quality Evaluators: Assessing the accuracy, groundedness, and relevance of generated content using robust AI-assisted and natural language processing (NLP) metrics.
- Custom Evaluators: Tailored evaluation metrics can be designed to meet specific needs and goals, providing flexibility and precision in assessing unique aspects of AI-generated content. Custom evaluators allow for more detailed and specific analyses, addressing particular concerns or requirements that standard metrics might not cover.
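To make the quality dimension concrete, here is a minimal sketch using built-in evaluators from the azure-ai-evaluation package that also appear later in this post's pipeline; the response and ground-truth strings are invented for illustration, and the exact score keys returned may vary by package version.

```python
from azure.ai.evaluation import BleuScoreEvaluator, MeteorScoreEvaluator, RougeScoreEvaluator, RougeType

response = "The promo code SPRING10 gives 10% off orders over $50."
ground_truth = "Customers get 10% off purchases above $50 with code SPRING10."

bleu = BleuScoreEvaluator()
meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)
rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)

# Each evaluator returns a dict of scores for the response/ground-truth pair.
for name, evaluator in [("bleu", bleu), ("meteor", meteor), ("rouge", rouge)]:
    print(name, evaluator(response=response, ground_truth=ground_truth))
```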
Running the Evaluation for Fine-Tuned Models Using the Azure Evaluation Framework

Metrics used for the workflow:

- Function-Call Invocation: Whether the predicted function calls match the expected ones.
- Function-Call Arguments: Whether the predicted arguments match the expected ones.
- BLEU Score: Measures how closely the generated text matches the reference text.
- ROUGE Score: Focuses on recall-oriented measures to assess how well the generated text covers the reference text.
- GLEU Score: Measures similarity via shared n-grams between the generated text and the ground truth, considering both precision and recall.
- METEOR Score: Considers synonyms, stemming, and paraphrasing for content alignment.
- Diff Eval: An AI-assisted custom metric that compares the actual response to the ground truth and highlights the key differences between the two responses.

We will use the same validation split from glaive-function-calling-v2 as used in the fine-tuning blog post, run it through the hosted endpoint for inference, get the response, and use the actual input and predicted response for evaluation.

Preprocessing the Dataset

First, we need to preprocess the dataset and convert it into a QnA format, since the original dataset stores each end-to-end conversation as a single unified record.

- parse_conversation and apply_chat_template: Transform a raw conversation string into a list of dictionaries, each representing a message with a role and content.
- get_multilevel_qna_pairs: Iteratively breaks the conversation down into question/answer pairs, emitting a prompt every time it encounters the "assistant" role in the formatted dictionary.
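For orientation, each preprocessed conversation yields prompt/answer pairs shaped roughly like the sketch below (values are illustrative only); the answer is either plain assistant text or a structured tool_uses entry, matching what parse_conversation produces. The implementations follow.

```python
# Illustrative shape of one preprocessed example (not actual dataset content).
prompt = [
    {"role": "system", "content": "You are a helpful assistant with access to the following functions..."},
    {"role": "user", "content": "Can you find me a flight to Paris under $500?"},
]

# The expected answer is either free text...
answer_text = "I'm sorry, I can't book flights, but I can help you search for options."

# ...or a structured function call, as emitted by parse_conversation for <functioncall> turns.
answer_call = {
    "tool_uses": [
        {
            "recipient_name": "functions.search_flights",  # hypothetical function name
            "parameters": {"destination": "Paris", "max_price": 500},
        }
    ]
}
```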
```python
import json
import re

def parse_conversation(input_string):
    ROLE_MAPPING = {"USER": "user", "ASSISTANT": "assistant", "SYSTEM": "system", "FUNCTION RESPONSE": "tool"}
    # Regular expression to split the conversation based on SYSTEM, USER, and ASSISTANT
    pattern = r"(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):"
    # Split the input string and keep the delimiters
    parts = re.split(pattern, input_string)
    # Initialize the list to store conversation entries
    conversation = []
    # Iterate over the parts, skipping the first empty string
    for i in range(1, len(parts), 2):
        role = parts[i].strip()
        content = parts[i + 1].strip()
        content = content.replace("<|endoftext|>", "").strip()
        if content.startswith('<functioncall>'):
            # build structured data for function call
            # try to turn function call from raw text to structured data
            content = content.replace('<functioncall>', '').strip()
            # replace single quotes with double quotes for valid JSON
            clean_content = content.replace("'{", '{').replace("'}", '}')
            data_json = json.loads(clean_content)
            # Make it compatible with the OpenAI prompt format
            func_call = {'recipient_name': f"functions.{data_json['name']}", 'parameters': data_json['arguments']}
            content = {'tool_uses': [func_call]}
        # Append a dictionary with the role and content to the conversation list
        conversation.append({"role": ROLE_MAPPING[role], "content": content})
    return conversation


def apply_chat_template(input_data):
    try:
        system_message = parse_conversation(input_data['system'])
        chat_message = parse_conversation(input_data['chat'])
        message = system_message + chat_message
        return message
    except Exception as e:
        print(str(e))
        return None


def get_multilevel_qna_pairs(message):
    prompts = []
    answers = []
    for i, item in enumerate(message):
        if item['role'] == 'assistant':
            prompts.append(message[:i])
            answers.append(item["content"])
    return prompts, answers
```

Reference: inference.py

Submitting a Request to the Hosted Endpoint

Next, we write the logic to send a request to the hosted endpoint and run inference.

```python
import urllib.request
import urllib.error

def run_inference(input_data):
    # Replace this with the URL for your deployed model
    url = 'https://llama-endpoint-ft.westus3.inference.ml.azure.com/score'
    # Replace this with the primary/secondary key, AMLToken, or Microsoft Entra ID token for the endpoint
    api_key = ''  # Update it with the API key

    params = {
        "temperature": 0.1,
        "max_new_tokens": 512,
        "do_sample": True,
        "return_full_text": False
    }
    body = format_input(input_data, params)
    body = str.encode(json.dumps(body))
    if not api_key:
        raise Exception("A key should be provided to invoke the endpoint")

    headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}
    req = urllib.request.Request(url, body, headers)

    try:
        response = urllib.request.urlopen(req)
        result = json.loads(response.read().decode("utf-8"))["result"]
    except urllib.error.HTTPError as error:
        print("The request failed with status code: " + str(error.code))
        # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
        print(error.info())
        print(error.read().decode("utf8", 'ignore'))
    return result
```
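Putting the preprocessing and inference pieces together for a single record might look like the following sketch; the record content is made up to match the dataset's format, and format_input is the author's helper from the referenced inference.py, so the exact payload shape depends on that file.

```python
# Illustrative wiring of the steps above on a tiny, made-up record in the dataset's format.
record = {
    "system": "SYSTEM: You are a helpful assistant with access to functions.",
    "chat": "USER: Hi, can you check my order status? <|endoftext|> ASSISTANT: Sure, could you share your order ID? <|endoftext|>",
}

messages = apply_chat_template(record)                  # conversation as role/content dicts
prompts, answers = get_multilevel_qna_pairs(messages)   # one pair per assistant turn

# run_inference calls the hosted endpoint, so it needs a valid URL and key configured above.
prediction = run_inference(prompts[0])
print(prediction, "vs expected:", answers[0])
```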
Evaluation Function

Next, we write the evaluation function that runs inference and evaluates the match for function calls and function arguments.

```python
import ast

from rouge_score import rouge_scorer

def eval(query, answer):
    """
    Evaluate the performance of a model in selecting the correct function based on given prompts.

    Args:
        query: The input prompt (conversation messages) sent to the hosted endpoint.
        answer: The expected response for that prompt.

    Returns:
        dict: The predicted response plus match indicators for the function call,
              the function arguments, and the overall response.
    """
    # Initialize the ROUGE scorer for LLM responses that are not function calls
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    expected_output = answer
    # For generic model responses without a function call, set a threshold to classify a match
    match_threshold_g = 0.75
    predicted_response = run_inference(query)
    is_func_call = False

    if predicted_response[1:12] == "'tool_uses'":
        is_func_call = True
        try:
            predicted_response = ast.literal_eval(predicted_response)
        except:
            predicted_response = predicted_response

    if isinstance(predicted_response, dict):
        predicted_functions = [func["recipient_name"] for func in predicted_response["tool_uses"]]
        predicted_function_args = [func["parameters"] for func in predicted_response["tool_uses"]]
        actual_functions = [func["recipient_name"] for func in expected_output["tool_uses"]]
        actual_function_args = [func["parameters"] for func in expected_output["tool_uses"]]
        fcall_match = predicted_functions == actual_functions
        fcall_args_match = predicted_function_args == actual_function_args
        match = "Yes" if fcall_match and fcall_args_match else "No"
    else:
        fmeasure_score = scorer.score(expected_output, predicted_response)['rougeL'].fmeasure
        match = "Yes" if fmeasure_score >= match_threshold_g else "No"

    result = {
        "response": predicted_response,
        "fcall_match": fcall_match if is_func_call else "NA",
        "fcall_args_match": fcall_args_match if is_func_call else "NA",
        "match": match
    }
    return result
```

Create an AI-Assisted Custom Metric for Difference Evaluation

1. Create a Prompty file. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers. The primary goal is to accelerate the developer inner loop. Prompty standardizes prompts and their execution into a single asset.

2. Create a class to load the Prompty file and process the outputs in JSON format.

```python
import json
import os

# Import paths may vary slightly by package version.
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from promptflow.client import load_flow


class DifferenceEvaluator:
    def __init__(self, model_config: AzureOpenAIModelConfiguration):
        """
        Initialize an evaluator configured for a specific Azure OpenAI model.

        :param model_config: Configuration for the Azure OpenAI model.
        :type model_config: AzureOpenAIModelConfiguration

        **Usage**

        .. code-block:: python

            eval_fn = DifferenceEvaluator(model_config)
            result = eval_fn(
                response="The order ships on 03/05/2024 for $20.",
                ground_truth="The order ships on 5 March 2024 for 20 USD.")
        """
        # TODO: Remove this block once the bug is fixed
        # https://msdata.visualstudio.com/Vienna/_workitems/edit/3151324
        if model_config.api_version is None:
            model_config.api_version = "2024-05-01-preview"

        prompty_model_config = {"configuration": model_config}
        current_dir = os.path.dirname(__file__)
        prompty_path = os.path.join(current_dir, "difference.prompty")
        assert os.path.exists(prompty_path), \
            f"Please specify a valid prompty file for the difference metric! The following path does not exist:\n{prompty_path}"
        self._flow = load_flow(source=prompty_path, model=prompty_model_config)

    def __call__(self, *, response: str, ground_truth: str, **kwargs):
        """Evaluate the difference between the response and the ground truth.

        :param response: The response to be evaluated.
        :type response: str
        :param ground_truth: The ground truth the response is compared against.
        :type ground_truth: str
        :return: The difference evaluation.
        :rtype: dict
        """
        # Validate input parameters
        response = str(response or "")
        ground_truth = str(ground_truth or "")
        if not (response.strip()) or not (ground_truth.strip()):
            raise ValueError("All inputs including 'response' must be non-empty strings.")

        # Run the evaluation flow
        output = self._flow(response=response, ground_truth=ground_truth)
        print(output)
        return json.loads(output)
```
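Because __call__ ends with json.loads(output), the difference.prompty template must instruct the model to return a JSON object. What that object contains is defined by difference.prompty in the referenced sample repo; a purely hypothetical shape might look like this:

```python
# Hypothetical shape of the parsed output from the difference evaluator
# (the actual fields are defined by difference.prompty in the sample repo).
example_output = {
    "difference_summary": "The response converts the amount to EUR, while the ground truth keeps USD.",
    "key_differences": [
        "Currency differs (EUR vs USD)",
        "Date format differs (DD-MM-YYYY vs MM/DD/YYYY)",
    ],
}
```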
Reference: difference.py

Run the Evaluation Pipeline

Run the evaluation pipeline on the validation dataset using both built-in and custom metrics. To ensure that evaluate() can correctly parse the data, you must specify a column mapping that maps columns from the dataset to the keyword arguments accepted by the evaluators.

```python
import os
import pathlib

import pandas as pd
# Imports as used in evaluate.py; module paths may vary slightly by package version.
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    evaluate,
)


def run_evaluation(name=None, dataset_path=None):
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
        azure_deployment=os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"]
    )

    # Initializing Evaluators
    difference_eval = DifferenceEvaluator(model_config)
    bleu = BleuScoreEvaluator()
    glue = GleuScoreEvaluator()
    meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)
    rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)

    data_path = str(pathlib.Path.cwd() / dataset_path)
    csv_output_path = str(pathlib.Path.cwd() / "./eval_results/eval_results.csv")
    output_path = str(pathlib.Path.cwd() / "./eval_results/eval_results.jsonl")

    result = evaluate(
        # target=copilot_qna,
        evaluation_name=name,
        data=data_path,
        target=eval,
        evaluators={
            "bleu": bleu,
            "gleu": glue,
            "meteor": meteor,
            "rouge": rouge,
            "difference": difference_eval
        },
        evaluator_config={
            "default": {
                # only provide additional input fields that target and data do not have
                "ground_truth": "${data.answer}",
                "query": "${data.query}",
                "response": "${target.response}",
            }
        }
    )
    tabular_result = pd.DataFrame(result.get("rows"))
    tabular_result.to_csv(csv_output_path, index=False)
    tabular_result.to_json(output_path, orient="records", lines=True)
    return result, tabular_result
```

Reference: evaluate.py

Reviewing the Results

Let's review the results for the function-calling scenarios only. 85 out of 102 records had a 100% match, whereas the rest had discrepancies in the function arguments being passed. The difference evaluator's output gives insight into what the exact differences are, which we can use to improve model performance by fixing the training dataset and model hyperparameters in subsequent iterations. As the results show, the model struggles when number conversion or date formatting is involved, and we can leverage these insights to further fine-tune the model.

Conclusion

Evaluating fine-tuned models for function-calling requires a comprehensive approach that goes beyond input-response metrics. By incorporating function-call verification, argument validation, and semantic similarity evaluation, we can ensure these models perform reliably and effectively in real-world applications. This holistic evaluation strategy not only enhances the model's accuracy but also ensures its robustness, maintainability, and user satisfaction.

Get Rewarded for Sharing Your Experience with Azure Machine Learning
We invite our customers to get rewarded for sharing their first-hand experience with Azure Machine Learning by writing a review on Gartner Peer Insights. Your review will help other organizations make informed decisions and find solutions that meet their unique needs.

The Evolution of AI Frameworks: Understanding Microsoft's Latest Multi-Agent Systems
The landscape of artificial intelligence is undergoing a fundamental transformation in late 2024. Microsoft has unveiled three groundbreaking frameworks (AutoGen 0.4, Magentic-One, and TinyTroupe) that are revolutionizing how we approach AI development. Moving beyond single-model systems, these frameworks represent a shift toward collaborative AI, where multiple specialized agents work together to solve complex problems.

Think of these frameworks as different but complementary systems, much like how a city needs infrastructure, service providers, and community organizations to function effectively. AutoGen 0.4 provides the robust foundation, Magentic-One orchestrates complex tasks through specialized agents, and TinyTroupe simulates human behavior for business insights. Together, they form a comprehensive ecosystem for building the next generation of intelligent systems. As we explore each framework in detail, we'll see how this coordinated approach is opening new possibilities in AI development, from enterprise-scale applications to sophisticated business simulations.

Framework Comparison: A Deep Dive

Before we explore each framework in detail, let's understand how they compare across key dimensions. These comparisons will help us understand where each framework excels and how they complement each other.

Core Capabilities and Design Focus

| Aspect | AutoGen 0.4 | Magentic-One | TinyTroupe |
|---|---|---|---|
| Primary Architecture | Layered & Event-driven | Orchestrator-based | Persona-based |
| Core Strength | Infrastructure & Scalability | Task Orchestration | Human Simulation |
| Development Stage | Beta | Preview | Early Release |
| Target Users | Enterprise Developers | Automation Teams | Business Analysts |
| Key Innovation | Cross-language Support | Dual-loop Orchestration | Persona Modeling |
| Deployment Model | Cloud/On-premise | Container-based | Local |
| Main Use Case | Enterprise Systems | Task Automation | Business Insights |

AutoGen 0.4: The Digital Infrastructure Builder

Imagine building a modern city. Before any services can operate, you need robust infrastructure: roads, power grids, water systems, and communication networks. AutoGen 0.4 serves a similar foundational role in the AI ecosystem. It provides the essential infrastructure that allows agentic systems to operate at enterprise scale. The framework's brilliance lies in its three-layer architecture:

- The Core Layer acts as the fundamental infrastructure, handling basic communication and resource management, much like a city's utility systems.
- The AgentChat Layer provides high-level interaction capabilities, similar to how city services interface with residents.
- The Extensions Layer enables specialized functionalities, comparable to how cities can add new services based on specific needs.

What truly sets AutoGen 0.4 apart is its understanding of real-world enterprise needs. Modern organizations rarely operate with a single technology stack; they might use Python for data science, .NET for backend services, and other languages for specific needs. AutoGen 0.4 embraces this reality through its multi-language support, ensuring different components can communicate effectively while maintaining strict type safety to prevent errors.
```python
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.task import Console
from autogen_ext.models import OpenAIChatCompletionClient

async def enterprise_example():
    # Create an enterprise agent with specific configuration
    agent = AssistantAgent(
        name="enterprise_system",
        model_client=OpenAIChatCompletionClient(
            model="gpt-4o-2024-08-06",
            api_key="YOUR_API_KEY"
        )
    )

    # Define a complex enterprise task
    task = {
        "objective": "Analyze sales data and generate insights",
        "data_source": "sales_database",
        "output_format": "report"
    }

    # Execute task with streaming output
    stream = agent.run_stream(task=task)
    await Console(stream)

# Example usage:
# asyncio.run(enterprise_example())
```

Magentic-One: The Master Orchestra Conductor

If AutoGen 0.4 builds the city's infrastructure, Magentic-One acts as its management system. Think of it as a highly skilled orchestra conductor, coordinating various musicians (specialized agents) to create a harmonious performance (completed tasks). The framework's innovative dual-loop architecture demonstrates this orchestration:

- The Task Ledger works like a conductor's score, planning out what needs to be done.
- The Progress Ledger functions as the conductor's real-time monitoring, ensuring each section performs its part correctly.

Magentic-One's specialized agents exemplify this orchestra metaphor:

- WebSurfer: Like the string section, handling intricate web interactions
- FileSurfer: Similar to the percussion section, managing rhythmic file operations
- Coder: Comparable to the brass section, producing powerful code outputs
- ComputerTerminal: Like the woodwinds, executing precise commands

This specialization has proven its worth through impressive benchmark performances across GAIA, AssistantBench, and WebArena, showing that specialized expertise, when properly coordinated, produces superior results.

```python
from magentic_one import (
    Orchestrator,
    WebSurfer,
    FileSurfer,
    Coder,
    ComputerTerminal
)

def automation_example():
    # Initialize specialized agents
    agents = {
        'web': WebSurfer(),
        'file': FileSurfer(),
        'code': Coder(),
        'terminal': ComputerTerminal()
    }

    # Create orchestrator with task and progress ledgers
    orchestrator = Orchestrator(agents)

    # Define complex automation task
    task = {
        "type": "web_automation",
        "steps": [
            {"action": "browse", "url": "example.com"},
            {"action": "extract", "data": "pricing_info"},
            {"action": "save", "format": "csv"}
        ]
    }

    # Execute orchestrated task
    result = orchestrator.execute_task(task)
    return result

# Example usage:
# result = automation_example()
```

TinyTroupe: The Social Behavior Laboratory

TinyTroupe takes a fundamentally different approach, more akin to a sophisticated social simulation laboratory than a traditional AI framework. Instead of focusing on task completion, it seeks to understand and replicate human behavior, much like how social scientists study human interactions and decision-making.

The framework creates detailed artificial personas (TinyPersons) with rich backgrounds, personalities, and behaviors. Think of it as creating a miniature society where researchers can observe how different personality types interact with products, services, or each other. These personas exist within controlled environments (TinyWorlds), allowing for systematic observation and analysis.

Consider a real-world parallel: when automotive companies design new vehicles, they often create detailed driver personas to understand different user needs.
TinyTroupe automates and scales this approach, allowing businesses to simulate thousands of interactions with different personality types, providing insights that would be impractical or impossible to gather through traditional focus groups. The beauty of TinyTroupe lies in its ability to capture the nuances of human behavior. Just as no two people are exactly alike, each TinyPerson brings its unique perspective, shaped by its programmed background, experiences, and preferences. This diversity enables more realistic and valuable insights for business decision-making.

```python
from tinytroupe import TinyPerson, TinyWorld, TinyPersonFactory
from tinytroupe.utils import ResultsExtractor

def simulation_example():
    # Create simulation environment
    world = TinyWorld("E-commerce Platform")

    # Generate diverse personas
    factory = TinyPersonFactory()
    personas = [
        factory.generate_person(
            "Create a tech-savvy professional who values efficiency"
        ),
        factory.generate_person(
            "Create a budget-conscious parent who prioritizes safety"
        ),
        factory.generate_person(
            "Create a senior citizen who prefers simplicity"
        )
    ]

    # Add personas to simulation world
    for persona in personas:
        world.add_person(persona)

    # Define simulation scenario
    scenario = {
        "type": "product_evaluation",
        "product": "Smart Home Device",
        "interaction_points": ["discovery", "purchase", "setup"]
    }

    # Run simulation and extract insights
    results = world.run_simulation(scenario)
    insights = ResultsExtractor().analyze(results)
    return insights

# Example usage:
# insights = simulation_example()
```

Framework Selection Guide

To help you make an informed decision, here's a selection matrix based on specific needs:

| Need | Best Choice | Reason | Alternative |
|---|---|---|---|
| Enterprise Scale | AutoGen 0.4 | Built for distributed systems | Magentic-One |
| Task Automation | Magentic-One | Specialized agents | AutoGen 0.4 |
| User Research | TinyTroupe | Persona simulation | None |
| High Performance | AutoGen 0.4 | Optimized architecture | Magentic-One |
| Quick Deployment | TinyTroupe | Minimal setup | Magentic-One |
| Complex Workflows | Magentic-One | Strong orchestration | AutoGen 0.4 |

Practical Implications

For organizations looking to implement these frameworks, consider the following guidance:

- For Enterprise Applications: Use AutoGen 0.4 as your foundation. Its robust infrastructure and cross-language support make it ideal for building scalable, production-ready systems.
- For Complex Automation: Implement Magentic-One for tasks requiring sophisticated orchestration. Its specialized agents and safety features make it perfect for automated workflows.
- For Business Intelligence: Deploy TinyTroupe for market research and user behavior analysis. Its unique simulation capabilities provide valuable insights for business decision-making.

Conclusion

Microsoft's three-pronged approach to multi-agent AI systems represents a significant leap forward in artificial intelligence. By addressing different aspects of the AI development landscape (infrastructure with AutoGen 0.4, task execution with Magentic-One, and human simulation with TinyTroupe) these frameworks provide a comprehensive toolkit for building the next generation of AI applications. As these frameworks continue to evolve, we can expect to see even more sophisticated capabilities and tighter integration between them. Organizations that understand and leverage the strengths of each framework will be well-positioned to build powerful, scalable, and intelligent systems that drive real business value.
Appendix

Technical Implementation Details

| Feature | AutoGen 0.4 | Magentic-One | TinyTroupe |
|---|---|---|---|
| Language Support | Python, .NET | Python | Python |
| State Management | Distributed | Centralized | Environment-based |
| Message Passing | Async Event-driven | Task-based | Simulation-based |
| Error Handling | Comprehensive | Task-specific | Simulation-bound |
| Monitoring | Enterprise-grade | Task-focused | Analysis-oriented |
| Extensibility | High | Medium | Framework-bound |

Performance and Scalability Metrics

| Metric | AutoGen 0.4 | Magentic-One | TinyTroupe |
|---|---|---|---|
| Response Time | Milliseconds | Seconds | Variable |
| Concurrent Users | Thousands | Hundreds | Dozens |
| Resource Usage | Optimized | Task-dependent | Simulation-dependent |
| Horizontal Scaling | Yes | Limited | No |
| State Persistence | Distributed Cache | Container Storage | Local Files |
| Recovery Capabilities | Advanced | Basic | Manual |

Security and Safety Features

| Security Aspect | AutoGen 0.4 | Magentic-One | TinyTroupe |
|---|---|---|---|
| Access Control | Role-based | Container-based | Environment-based |
| Content Filtering | Enterprise-grade | Active Monitoring | Simulation Bounds |
| Audit Logging | Comprehensive | Action-based | Simulation Logs |
| Isolation Level | Service | Container | Process |
| Risk Assessment | Dynamic | Pre-execution | Scenario-based |
| Recovery Options | Automated | Semi-automated | Manual |

Integration and Ecosystem Support

| Integration Type | AutoGen 0.4 | Magentic-One | TinyTroupe |
|---|---|---|---|
| API Support | REST, gRPC | REST | Python API |
| External Services | Extensive | Web-focused | Limited |
| Database Support | Multiple | Basic | Simulation Only |
| Cloud Services | Full Support | Container Services | Local Only |
| Custom Extensions | Yes | Limited | Framework-bound |
| Third-party Tools | Wide Support | Moderate | Minimal |

Unlocking the Power of Large-Scale Training in AI
Why Large-Scale Training?

So, why are we so obsessed with large-scale AI models anyway? Larger models have more parameters: think of these as tiny levers and switches that adjust to learn from data. The more parameters, the more complex the tasks a model can handle. In natural language processing (NLP), for instance, GPT-3 boasts 175 billion parameters, making it capable of understanding nuanced language and generating impressive responses.

These larger models don't stop at text. They're pushing boundaries in healthcare, finance, and beyond, handling tasks like medical image analysis, fraud detection, and even predicting patient outcomes. But here is the catch: as these models grow in parameters, so does the need for immense computational power. Training a model as big as GPT-3 on a single machine is a non-starter; it would take forever. And that's where distributed training comes in.

The Perks (and Pitfalls) of Large-Scale Training

Building large AI models unlocks incredible possibilities, but it's not all sunshine and rainbows. Here's a peek into the main challenges that come with training these behemoths:

- Memory Limitations: Picture this: you have a huge model with billions of parameters, but each GPU has limited memory. Trying to squeeze the whole model into a single GPU? Forget it. It's like trying to stuff an elephant into a suitcase.
- Computation Bottlenecks: Even if you could load the model, running it would take weeks, maybe even months. With every training step, the compute requirements grow, and training on a single machine becomes both a time and cost nightmare.
- Data Synchronization and Management: Now imagine you've got multiple GPUs or nodes working together. That sounds good in theory, but all these devices need to stay in sync. Model parameters and gradients (how the model learns) need to be shared constantly across all GPUs. If not managed carefully, this can slow training down to a crawl.

These challenges make it clear why simply "scaling up" on one machine isn't enough. We need something better, and that's where distributed training steps in.

Distributed Training: The Secret Sauce for Large AI Models

Distributed training is like assembling an elite team of GPUs and servers to tackle different parts of the problem simultaneously. This process breaks up the heavy lifting, spreading the workload across multiple machines to make things run faster and more efficiently.

Why Go Distributed?

- Faster Training Times: By splitting up the work, distributed training slashes training time. A job that might have taken weeks on one machine can often be completed in days, or even hours, by spreading it across multiple devices.
- Big Data? No Problem: Distributed training is also a lifesaver when dealing with massive datasets. It can process these large datasets in parallel, helping the model learn faster by exposing it to more data in less time. Imagine trying to watch a series by watching one episode on your laptop, another on your phone, and another on your tablet, all at once. That's the efficiency we're talking about here.
- Scalability: Need more power? Distributed training allows you to scale up with additional GPUs or nodes. Think of it as being able to add more horsepower to your AI engine anytime you need it.

For a deeper dive into distributed training principles, check out this guide on distributed training with Azure.

The Different Flavors of Distributed Training

Distributed training isn't one-size-fits-all. It comes in several flavors, each suited to different needs:

- Data Parallelism: The dataset is split across multiple GPUs; each GPU trains on its chunk of the data, and the GPUs then synchronize to keep the model consistent. It's great when the model fits on a single GPU but the dataset is too large (a minimal sketch follows this list).
- Model Parallelism: For models that are too huge to fit on one GPU, model parallelism divides the model itself across GPUs. Each part of the model is trained on a different GPU, which is ideal for extremely large architectures like some NLP and vision models.
- Hybrid Approaches: The best of both worlds. By combining data and model parallelism, we can train large datasets on large models efficiently. Techniques like Microsoft's Zero Redundancy Optimizer (ZeRO) take this a step further by distributing the memory load, making it possible to train super-large models even on limited hardware.
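As referenced in the data parallelism item above, here is a minimal sketch of data-parallel training with PyTorch's DistributedDataParallel. The model and dataset are placeholders, and a launcher such as torchrun is assumed to supply the rank and world-size environment variables.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic dataset; replace with your own.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # each rank sees a different shard of the data
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # gradients are averaged across ranks automatically
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()  # launch with: torchrun --nproc_per_node=<gpus_per_node> train.py
```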
Azure AI: A Distributed Training Powerhouse

So, how does Azure AI fit into all this? Azure is like the ultimate toolkit for distributed training. It offers powerful infrastructure that not only handles the scale of large AI models but also makes the whole process a lot easier.

What Makes Azure Stand Out?

- Optimized Infrastructure: Azure's infrastructure is built for high-performance computing (HPC). With ultra-fast InfiniBand networking, Azure's virtual machines (VMs) allow for seamless data transfer between GPUs and nodes. This is critical when training large models that require low-latency communication between devices.
- Top-Notch GPU Offerings: Azure provides access to some of the latest and greatest GPUs, like NVIDIA's A100 and H100 models. These GPUs are purpose-built for deep learning, featuring tensor cores that accelerate matrix computations, the backbone of deep learning. They're also interconnected with NVLink and NVSwitch technology, which significantly reduces data transfer delays. This makes Azure the perfect playground for massive model training.
- Scalable Architecture: Azure Machine Learning provides a versatile range of compute options that adapt to the demands of large-scale model training, from experimentation to full-scale distributed training. At the core are compute clusters, which let you set up managed clusters of virtual machines that automatically scale up or down based on workload needs. These clusters support various VM types, including GPU-optimized options like the ND A100 v4 series, powered by NVIDIA A100 GPUs and ideal for high-performance distributed training. For smaller-scale development, compute instances offer on-demand, single-node machines for interactive sessions, making them perfect for prototyping and debugging. For budget-conscious projects, Azure Machine Learning also supports spot VMs in compute clusters, which utilize unused Azure capacity at a lower cost. This option is ideal for non-critical jobs like hyperparameter tuning, where interruptions are manageable. Together, these compute offerings ensure you can scale flexibly and efficiently, using the right resources for each stage of model development.

Explore more about Azure Machine Learning compute options, GPU-optimized virtual machines, and how to leverage spot VMs for cost savings on the Azure platform. Curious to see what distributed training looks like in practice? Here's a tutorial that walks you through setting up distributed training on Azure, and a minimal job-submission sketch follows below.
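To complement the linked tutorial, here is a minimal sketch of submitting a distributed PyTorch job with the Azure Machine Learning Python SDK v2; the workspace details, compute cluster name, environment, and training script are all placeholders you would replace with your own.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the workspace (subscription, resource group, and workspace names are placeholders).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Define a distributed PyTorch job: 2 nodes x 4 processes (one per GPU) running train.py.
job = command(
    code="./src",                              # folder containing train.py (e.g., the DDP script above)
    command="python train.py --epochs 10",
    environment="<curated-or-custom-pytorch-gpu-environment>",
    compute="<gpu-cluster-name>",
    instance_count=2,
    distribution={"type": "PyTorch", "process_count_per_instance": 4},
    display_name="distributed-training-sketch",
)

returned_job = ml_client.jobs.create_or_update(job)
print("Submitted:", returned_job.name)
```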
How Azure Enables Distributed Learning

Azure AI doesn't just provide raw power; it gives you the tools to manage, optimize, and streamline the distributed training process. Azure offers a suite of tools and frameworks specifically designed to make distributed training accessible, flexible, and efficient.

- Azure Machine Learning SDK and CLI: Azure's Machine Learning SDK and CLI make it simple to set up, run, and manage distributed training jobs. With the SDK, you can define custom environments, set up compute clusters, and even submit jobs with YAML configurations, making it easy to replicate setups and automate workflows.
- Support for Popular Frameworks: Azure ML is compatible with popular machine learning frameworks like PyTorch and TensorFlow, so you don't have to change your entire workflow. Azure ML has built-in support for distributed training within these frameworks, using strategies like Distributed Data Parallel (DDP) and Horovod, a framework designed for distributed deep learning.
- Advanced Optimization with DeepSpeed: Microsoft's DeepSpeed library is integrated with Azure, providing state-of-the-art optimizations for large model training. DeepSpeed's memory and computation optimizations, like the ZeRO optimizer, let you train larger models more efficiently, reducing memory requirements and improving training speed.
- Hyperparameter Tuning with HyperDrive: Azure ML's HyperDrive tool makes hyperparameter tuning straightforward. Define search spaces and optimization strategies, and HyperDrive will run parallel trials to find the best configurations, even stopping underperforming trials early to save resources. It's hyperparameter tuning on autopilot.
- Monitoring and Diagnostics: Azure provides real-time monitoring with Azure ML Studio dashboards, showing metrics like GPU utilization, loss curves, and throughput. For deeper insights, tools like Azure Monitor and NVIDIA Nsight Systems provide detailed diagnostics, helping you identify bottlenecks and optimize your training jobs.

This robust toolkit ensures that Azure can handle not only the scale but also the complexity of distributed training, providing the infrastructure and tools you need to train the most advanced AI models efficiently.

Real-World Success: What Makes Azure Stand Out for Distributed Learning and AI

Azure AI Foundry is more than just a platform; it's a powerhouse for enabling organizations to achieve groundbreaking results in AI. What makes Azure stand out in distributed learning is its unique combination of high-performance infrastructure, scalability, and a suite of tools designed to make distributed training as efficient and accessible as possible. Here are a few key reasons why Azure is the go-to choice for distributed AI training:

- High-Performance Infrastructure: Azure offers high-performance computing (HPC) resources that are essential for large-scale training. Features like InfiniBand networking provide ultra-low latency and high throughput, making it ideal for workloads that require constant communication across GPUs and nodes. This enables faster synchronization and helps avoid bottlenecks in distributed setups.
- Advanced GPU Options: With NVIDIA's latest GPUs, such as the A100 and H100, Azure delivers the computational muscle required for deep learning tasks. These GPUs, designed with AI in mind, feature tensor cores that accelerate complex calculations, making them perfect for training large models. Azure's NVLink and NVSwitch technology connects these GPUs for fast data transfer, further boosting performance.
- Scalability with VM Scale Sets: One of Azure's key differentiators is its VM Scale Sets, which allow for elastic scaling based on workload demands. This means you can start small and scale up as your models and datasets grow. Azure's auto-scaling capabilities ensure that resources are used efficiently, lowering costs while meeting the needs of even the largest models.
- All-in-One Machine Learning Platform: With Azure Machine Learning (Azure ML), you get an end-to-end platform that handles everything from compute cluster management to environment setup and job orchestration. Azure ML takes care of the heavy lifting, enabling you to focus on developing and optimizing your models.
- Integration with Open-Source and Proprietary Tools: Azure supports all major machine learning frameworks and has its own optimization tools like DeepSpeed and HyperDrive. This flexibility lets you pick the best tools for your specific needs while benefiting from Azure's optimized infrastructure.

Azure's distributed training capabilities make it possible for organizations to push the boundaries of what's possible with AI. From improving training speed to enabling real-time insights, Azure is setting the standard for large-scale AI success.

Wrapping Up: The Future of Large-Scale AI Training

As AI models grow in complexity and capability, the need for efficient, large-scale training will only become more pressing. Distributed training, powered by platforms like Azure AI, is paving the way for the next generation of AI. It offers a robust solution to the limitations of single-device training, enabling faster development, greater scalability, and better performance. Whether you're working in NLP, computer vision, healthcare, or finance, the ability to train large models efficiently is a game-changer. Ready to scale up your AI? Explore distributed training best practices and discover the power of large-scale AI development.

Announcing the General Availability of the VS Code extension for Azure Machine Learning
The VS Code extension for Azure Machine Learning has been in preview for a while, and we are excited to announce its general availability. You can use your favorite VS Code setup, either desktop or web, to build, train, deploy, debug, and manage machine learning models with Azure Machine Learning from within VS Code.