
Evaluating Fine-Tuned Models for Function-Calling: Beyond Input-Output Metrics

Priya_Kedia
Jan 08, 2025

In the intricate world of machine learning and artificial intelligence, fine-tuning models for specific tasks is an art form that requires meticulous attention to detail. One such task that has garnered significant attention is function-calling, where models are trained to call specific functions with appropriate arguments based on given inputs. Evaluating these fine-tuned models is crucial to ensure their reliability and effectiveness. While the previous blog post looked at how to run an end-to-end fine-tuning pipeline on the Azure Machine Learning platform, this post delves into the multifaceted evaluation process for these models, emphasizing not just input-response evaluation but also the correctness of function calls and arguments.

Understanding Function-Calling: Models optimized for function-calling are designed to interpret input data and call predefined functions with the correct arguments. These models find applications in various domains, including automated customer support, data processing, and even complex decision-making systems. The key to their success lies in their ability to understand the context and semantics of the input, translating it into precise function calls.
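
To make this concrete, here is a purely illustrative example of the kind of structured output such a model produces: a payload naming the function to invoke and the arguments to pass, rather than free-form text. The function name and argument schema below are made up, though the structure mirrors the tool_uses format used later in this post.

# Hypothetical user input and the structured function call a fine-tuned model might emit.
user_input = "What's the weather like in Seattle tomorrow?"

# The model's output is not prose but a call specification: which function to invoke
# and with which arguments (function name and argument schema are illustrative).
model_output = {
    "tool_uses": [
        {
            "recipient_name": "functions.get_weather_forecast",
            "parameters": {"location": "Seattle, WA", "date": "tomorrow"}
        }
    ]
}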

The Challenge of Input-Response Evaluation: The most straightforward method of evaluating these models is through input-response evaluation. This involves providing the model with a set of inputs and comparing its responses to the expected outputs. Metrics such as accuracy, precision, recall, and F1-score are commonly used to measure performance. However, input-response evaluation alone presents several challenges:

  1. Superficial Assessment: This method primarily checks if the model's output matches the expected result. It doesn't delve into the model's internal decision-making process or the correctness of the function calls and arguments.
  2. Misleading Metrics: If the predicted response doesn't match the expected answer, input-response metrics alone won't explain why. The discrepancy could stem from incorrect function calls or arguments, not just from an incorrect final output.
  3. Limited Scope: Many tasks require a broader spectrum of capabilities beyond just function-calling. This includes general conversation, generating leading questions to gather the inputs needed for a function call, and synthesizing responses from function execution. Input-response evaluation doesn't cover these nuances, since assessing them requires semantic understanding of the input and response rather than word-by-word comparison.

Evaluating Function Calls: The Next Layer

To bridge the gap left by input-response evaluation, we need to scrutinize the function calls themselves. This involves verifying that the model calls the appropriate functions for given inputs.

Why This Matters

  1. Correct Function Semantics: Ensuring the right function is called guarantees that the model understands the semantics of the task. For instance, in a customer support system, calling a function to reset a password instead of updating an address could lead to significant user frustration.
  2. Maintainability and Debugging: Correct function calls make the system easier to maintain and debug. If the wrong function is called, it can lead to unexpected behaviors that are harder to trace and fix.

Addressing Gaps

When the predicted response doesn't match the expected answer, evaluating function names and arguments helps identify the root cause of the discrepancy. This insight is crucial for taking necessary actions to improve the model's performance, whether it involves fine-tuning the training data or adjusting the model's architecture.

Evaluating Function Arguments: The Final Layer

The last layer of evaluation involves scrutinizing the arguments passed to the functions. Even if the correct function is called, incorrect or improperly formatted arguments can lead to failures or incorrect outputs.

Importance of Correct Arguments

  1. Functional Integrity: The arguments to a function are as crucial as the function itself. Passing incorrect arguments can result in errors or unintended outcomes. For example, calling a function to process a payment with an incorrect amount or currency could have severe financial implications.
  2. User Experience: In applications like chatbots or virtual assistants, incorrect arguments can degrade the user experience. A model that correctly calls a weather-check function but passes the wrong location will not serve the user's needs effectively.

A Holistic Evaluation Approach

To ensure the robustness of fine-tuned models for function-calling, a holistic evaluation approach is necessary. This involves:

  1. Input-Response Evaluation: Checking the overall accuracy and effectiveness of the model's outputs.
  2. Function Call Verification: Ensuring the correct functions are called for given inputs.
  3. Argument Validation: Verifying that the arguments passed to functions are correct and appropriately formatted.
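
To make these three layers concrete, the sketch below checks each of them for a single prediction, reusing the illustrative tool_uses structure shown earlier. It is a minimal example (the helper name and the 0.75 threshold are arbitrary choices), not the evaluation pipeline used later in this post.

from rouge_score import rouge_scorer  # pip install rouge-score


def holistic_check(predicted, expected, text_threshold=0.75):
    """Toy three-layer check: response text, function names, and function arguments."""
    results = {}
    if isinstance(predicted, dict) and isinstance(expected, dict):
        pred_calls = predicted.get("tool_uses", [])
        exp_calls = expected.get("tool_uses", [])
        # Layer 2: were the right functions called?
        results["function_match"] = (
            [c["recipient_name"] for c in pred_calls] == [c["recipient_name"] for c in exp_calls]
        )
        # Layer 3: were the arguments correct and complete?
        results["argument_match"] = (
            [c["parameters"] for c in pred_calls] == [c["parameters"] for c in exp_calls]
        )
    else:
        # Layer 1: plain-text response, scored with ROUGE-L rather than exact string match
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        f1 = scorer.score(str(expected), str(predicted))["rougeL"].fmeasure
        results["response_match"] = f1 >= text_threshold
    return results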

Beyond Lexical Evaluation: Semantic Similarity

Given the complexity of tasks, it's imperative to extend the scope of metrics to include semantic similarity evaluation. This approach assesses how well the model's output aligns with the intended meaning, rather than just matching words or phrases.

  1. Semantic Similarity Metrics: Complement lexical metrics such as BLEU, ROUGE, and METEOR with embedding-based metrics like BERTScore, which compare the meaning of the model's output and the expected response rather than just their word overlap.
  2. Contextual Understanding: Incorporate evaluation methods that assess the model's ability to understand context, generate leading questions, and synthesize responses. This ensures the model can handle a broader range of tasks effectively.
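
As a quick illustration of the difference between lexical and semantic scoring, the sketch below compares ROUGE-L with BERTScore on an invented paraphrase; it assumes the rouge-score and bert-score packages are installed.

from rouge_score import rouge_scorer        # pip install rouge-score
from bert_score import score as bert_score  # pip install bert-score

reference = "Your package will arrive on Tuesday."
candidate = "The delivery is expected to reach you on Tuesday."

# Lexical overlap: ROUGE-L penalizes the paraphrase because few words match exactly.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: BERTScore compares contextual embeddings, so a faithful paraphrase scores high.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-L F1: {rouge_l:.2f}, BERTScore F1: {f1.item():.2f}")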

Evaluate GenAI Models and Applications Using Azure AI Foundry

The evaluation functionality in the Azure AI Foundry portal provides a comprehensive set of tools and features for assessing the performance and safety of your generative AI models. In the Azure AI Foundry portal, you can log, view, and analyze detailed evaluation metrics. With built-in and custom evaluators, it empowers developers and researchers to analyze models under diverse conditions and scenarios while enabling straightforward comparison of results across multiple models.
Within Azure AI Foundry, a comprehensive approach to evaluation includes three key dimensions:

  • Risk and Safety Evaluators: Evaluating potential risks associated with AI-generated content is essential for safeguarding against content risks with varying degrees of severity. This includes evaluating an AI system's predisposition towards generating harmful or inappropriate content.
  • Performance and Quality Evaluators: This involves assessing the accuracy, groundedness, and relevance of generated content using robust AI-assisted and Natural Language Processing (NLP) metrics.
  • Custom Evaluators: Tailored evaluation metrics can be designed to meet specific needs and goals, providing flexibility and precision in assessing unique aspects of AI-generated content. These custom evaluators allow for more detailed and specific analyses, addressing particular concerns or requirements that standard metrics might not cover.
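
For example, an AI-assisted quality evaluator can be instantiated with just a judge-model configuration. The sketch below assumes the azure-ai-evaluation SDK and uses placeholder endpoint, key, and deployment names; evaluator parameters and output keys can vary slightly between SDK versions.

import os

from azure.ai.evaluation import AzureOpenAIModelConfiguration, RelevanceEvaluator

# Judge model used by the AI-assisted evaluator (endpoint, key, and deployment are placeholders).
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment="gpt-4o",
)

relevance = RelevanceEvaluator(model_config=model_config)
result = relevance(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(result)  # e.g. {'relevance': 5.0, ...} -- exact keys depend on the SDK version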

Running the Evaluation for Fine-Tuned Models Using the Azure Evaluation Framework

Metrics Used for the Workflow

  1. Function-Call Invocation: Checks whether the predicted response calls the same function(s) as the ground truth.
  2. Function-Call Arguments: Checks whether the arguments passed to those functions match the ground truth.
  3. BLEU Score: Measures how closely the generated text matches the reference text.
  4. ROUGE Score: Focuses on recall-oriented measures to assess how well the generated text covers the reference text.
  5. GLEU Score: Measures the similarity by shared n-grams between the generated text and ground truth, focusing on both precision and recall.
  6. METEOR Score: Considers synonyms, stemming, and paraphrasing for content alignment.
  7. Diff Eval: An AI-assisted custom metric that compares the actual response to ground truth and highlights the key differences between the two responses.

We will use the same validation split from glaive-function-calling-v2 as used in the fine-tuning blog post, run it through the hosted endpoint for inference, get the response, and use the actual input and predicted response for evaluation.

Preprocessing the Dataset

First, we need to preprocess the dataset and convert it into a QnA format as the original dataset maintains an end-to-end conversation as one unified record.

  1. parse_conversation and apply_chat_template: These functions transform a raw conversation string into a list of dictionaries, each representing a message with a role and content.
  2. get_multilevel_qna_pairs: This iterates over the formatted messages and, every time it encounters the "assistant" role, emits the preceding turns as the prompt and the assistant message as the expected answer.
import json
import re

def parse_conversation(input_string):

    ROLE_MAPPING = {"USER": "user", "ASSISTANT": "assistant", "SYSTEM": "system", "FUNCTION RESPONSE": "tool"}

    # Regular expression to split the conversation on the SYSTEM, USER, ASSISTANT and FUNCTION RESPONSE markers
    pattern = r"(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):"

    # Split the input string and keep the delimiters
    parts = re.split(pattern, input_string)

    # Initialize the list to store conversation entries
    conversation = []

    # Iterate over the parts, skipping the first empty string
    for i in range(1, len(parts), 2):
        role = parts[i].strip()
        content = parts[i + 1].strip()
        content = content.replace("<|endoftext|>", "").strip()

        if content.startswith('<functioncall>'):
            # Turn the raw function-call text into structured data
            content = content.replace('<functioncall>', '').strip()
            # Strip the single quotes wrapping the arguments so the whole string is valid JSON
            clean_content = content.replace("'{", '{').replace("'}", '}')
            data_json = json.loads(clean_content)
            # Make it compatible with the OpenAI tool-call prompt format
            func_call = {'recipient_name': f"functions.{data_json['name']}", 'parameters': data_json['arguments']}
            content = {'tool_uses': [func_call]}

        # Append a dictionary with the role and content to the conversation list
        conversation.append({"role": ROLE_MAPPING[role], "content": content})

    return conversation

def apply_chat_template(input_data):
    try:
        system_message = parse_conversation(input_data['system'])
        chat_message = parse_conversation(input_data['chat'])
        message = system_message + chat_message
        return message
    except Exception as e:
        print(str(e))
        return None
        
def get_multilevel_qna_pairs(message):
    prompts = []
    answers = []
    for i, item in enumerate(message):
        if item['role'] == 'assistant':
            prompts.append(message[:i])
            answers.append(item["content"])

    return prompts, answers  

Reference: inference.py
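
To see what these helpers produce, here is a small illustrative record shaped like the glaive-function-calling-v2 data (the conversation text itself is made up); each assistant turn yields one prompt/answer pair.

# Toy record in the style of glaive-function-calling-v2 (content is illustrative).
sample = {
    "system": "SYSTEM: You are a helpful assistant with access to functions.",
    "chat": (
        "USER: What's the weather in Paris? "
        "ASSISTANT: <functioncall> {\"name\": \"get_weather\", \"arguments\": '{\"location\": \"Paris\"}'} <|endoftext|>"
    ),
}

messages = apply_chat_template(sample)            # list of {"role": ..., "content": ...} dicts
prompts, answers = get_multilevel_qna_pairs(messages)

print(prompts[0])  # system + user turns leading up to the assistant response
print(answers[0])  # {'tool_uses': [{'recipient_name': 'functions.get_weather', 'parameters': {'location': 'Paris'}}]}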

Submitting a Request to the Hosted Endpoint:

Next, we write the logic to send a request to the hosted endpoint and run inference.

import json
import urllib.error
import urllib.request

def run_inference(input_data):
    # Replace this with the URL for your deployed model
    url = 'https://llama-endpoint-ft.westus3.inference.ml.azure.com/score'
    # Replace this with the primary/secondary key, AMLToken, or Microsoft Entra ID token for the endpoint
    api_key = ''  # Update it with the API key

    params = {
        "temperature": 0.1,
        "max_new_tokens": 512,
        "do_sample": True,
        "return_full_text": False
    }

    body = format_input(input_data, params)
    body = str.encode(json.dumps(body))

    if not api_key:
        raise Exception("A key should be provided to invoke the endpoint")

    headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}

    req = urllib.request.Request(url, body, headers)

    result = None
    try:
        response = urllib.request.urlopen(req)
        result = json.loads(response.read().decode("utf-8"))["result"]
    except urllib.error.HTTPError as error:
        print("The request failed with status code: " + str(error.code))
        # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
        print(error.info())
        print(error.read().decode("utf8", 'ignore'))

    return result

Evaluation Function:

Next, we write the evaluation function that runs inference and evaluates whether the predicted function calls and their arguments match the ground truth.

import ast

from rouge_score import rouge_scorer

def eval(query, answer):
    """
    Evaluate how well the model selects the correct function and arguments for a given prompt.

    Args:
        query (List): List of messages (the conversation so far) used as the inference prompt
        answer (str | dict): Expected response, either plain text or a function-call dictionary

    Returns:
        result (dict): Predicted response, function-call match, argument match, and overall Match/No Match
    """
    # Initialize the ROUGE scorer, used when the llm response is not a function-call
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

    expected_output = answer
    # For a generic model response without a function-call, set a threshold to classify it as a match
    match_threshold_g = 0.75

    predicted_response = run_inference(query)

    is_func_call = False
    fcall_match = False
    fcall_args_match = False
    match = "No"

    if predicted_response[1:12] == "'tool_uses'":
        is_func_call = True
        try:
            # The endpoint returns the function call as a string; convert it back to a dictionary
            predicted_response = ast.literal_eval(predicted_response)
        except (ValueError, SyntaxError):
            pass
        if isinstance(predicted_response, dict):
            predicted_functions = [func["recipient_name"] for func in predicted_response["tool_uses"]]
            predicted_function_args = [func["parameters"] for func in predicted_response["tool_uses"]]

            actual_functions = [func["recipient_name"] for func in expected_output["tool_uses"]]
            actual_function_args = [func["parameters"] for func in expected_output["tool_uses"]]

            fcall_match = predicted_functions == actual_functions
            fcall_args_match = predicted_function_args == actual_function_args
            match = "Yes" if fcall_match and fcall_args_match else "No"
    else:
        fmeasure_score = scorer.score(expected_output, predicted_response)['rougeL'].fmeasure
        match = "Yes" if fmeasure_score >= match_threshold_g else "No"

    result = {
        "response": predicted_response,
        "fcall_match": fcall_match if is_func_call else "NA",
        "fcall_args_match": fcall_args_match if is_func_call else "NA",
        "match": match
    }

    return result
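
Before wiring everything into the evaluation pipeline, it helps to sanity-check eval on a single QnA pair and to materialize the validation split as a JSONL file with query and answer columns, which is what evaluate() consumes in the next step. In the sketch below, raw_dataset is a placeholder for the validation records and the output file name is illustrative.

import pandas as pd

rows = []
for record in raw_dataset:                        # placeholder: records with 'system' and 'chat' fields
    messages = apply_chat_template(record)
    if messages is None:
        continue
    prompts, answers = get_multilevel_qna_pairs(messages)
    rows.extend({"query": q, "answer": a} for q, a in zip(prompts, answers))

pd.DataFrame(rows).to_json("eval_input.jsonl", orient="records", lines=True)

# Quick sanity check of the evaluation function on the first pair
print(eval(rows[0]["query"], rows[0]["answer"]))
# e.g. {'response': ..., 'fcall_match': True, 'fcall_args_match': False, 'match': 'No'}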

Create an AI-assisted custom metric for difference evaluation

  1. Create a Prompty file: Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers. The primary goal is to accelerate the developer inner loop. Prompty standardizes prompts and their execution into a single asset; an illustrative sketch of a difference.prompty file appears after the evaluator class below.

  2. Create a class to load the Prompty file and parse its output as JSON.

import json
import os

# load_flow runs the Prompty-based flow; adjust import paths to your promptflow/azure-ai-evaluation versions
from promptflow.client import load_flow
from promptflow.core import AzureOpenAIModelConfiguration

class DifferenceEvaluator:
    def __init__(self, model_config: AzureOpenAIModelConfiguration):
        """
        Initialize an evaluator configured for a specific Azure OpenAI model.

        :param model_config: Configuration for the Azure OpenAI model.
        :type model_config: AzureOpenAIModelConfiguration

        **Usage**

        .. code-block:: python

            eval_fn = DifferenceEvaluator(model_config)
            result = eval_fn(
                response="First, the result within the first bracket is 3+1 = 4; then the next step is 4-4=0. The answer is 0",
                ground_truth="0")
        """
        # TODO: Remove this block once the bug is fixed
        # https://msdata.visualstudio.com/Vienna/_workitems/edit/3151324
        if model_config.api_version is None:
            model_config.api_version = "2024-05-01-preview"

        prompty_model_config = {"configuration": model_config}
        current_dir = os.path.dirname(__file__)
        prompty_path = os.path.join(current_dir, "difference.prompty")
        assert os.path.exists(prompty_path), f"Please specify a valid prompty file for the difference metric! The following path does not exist:\n{prompty_path}"
        self._flow = load_flow(source=prompty_path, model=prompty_model_config)

    def __call__(self, *, response: str, ground_truth: str, **kwargs):
        """Evaluate the difference between the response and the ground truth.

        :param response: The model response to be evaluated.
        :type response: str
        :param ground_truth: The expected response to compare against.
        :type ground_truth: str
        :return: The difference evaluation.
        :rtype: dict
        """
        # Validate input parameters
        response = str(response or "")
        ground_truth = str(ground_truth or "")

        if not (response.strip()) or not (ground_truth.strip()):
            raise ValueError("Both 'response' and 'ground_truth' must be non-empty strings.")

        # Run the evaluation flow
        output = self._flow(response=response, ground_truth=ground_truth)
        print(output)
        return json.loads(output)

Reference: difference.py
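
For reference, a minimal difference.prompty might look like the sketch below. The frontmatter fields and prompt wording are illustrative; the only hard requirements implied by the class above are that the inputs are named response and ground_truth and that the model returns a JSON object.

---
name: Difference Evaluator
description: AI-assisted metric that explains how a response differs from the ground truth.
model:
  api: chat
  configuration:
    type: azure_openai
    azure_deployment: gpt-4o
  parameters:
    temperature: 0
    response_format: { "type": "json_object" }
inputs:
  response:
    type: string
  ground_truth:
    type: string
---
system:
You compare a model response against the ground truth. Return a JSON object with a
"score" from 1 to 5 (5 means semantically equivalent) and a "difference" field that
briefly lists the key differences between the two responses.

user:
Ground truth: {{ground_truth}}
Response: {{response}}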

Run the evaluation pipeline:

Run the evaluation pipeline on the validation dataset using both built-in metrics and custom metrics. To ensure evaluate() can correctly parse the data, you must specify a column mapping that maps columns from the dataset to the keyword arguments accepted by the evaluators.

import os
import pathlib

import pandas as pd
# Built-in evaluators and evaluate() from the azure-ai-evaluation SDK (adjust imports to your SDK version)
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    evaluate,
)

def run_evaluation(name=None, dataset_path=None):

    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
        azure_deployment=os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"]
    )

    # Initializing Evaluators
    difference_eval = DifferenceEvaluator(model_config)

    bleu = BleuScoreEvaluator()
    gleu = GleuScoreEvaluator()
    meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)
    rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)

    data_path = str(pathlib.Path.cwd() / dataset_path)
    csv_output_path = str(pathlib.Path.cwd() / "./eval_results/eval_results.csv")
    output_path = str(pathlib.Path.cwd() / "./eval_results/eval_results.jsonl")

    result = evaluate(
        evaluation_name=name,
        data=data_path,
        target=eval,
        evaluators={
            "bleu": bleu,
            "gleu": gleu,
            "meteor": meteor,
            "rouge": rouge,
            "difference": difference_eval
        },
        evaluator_config={
            "default": {
                # only provide additional input fields that target and data do not have
                "ground_truth": "${data.answer}",
                "query": "${data.query}",
                "response": "${target.response}",
            }
        }
    )

    tabular_result = pd.DataFrame(result.get("rows"))
    tabular_result.to_csv(csv_output_path, index=False)
    tabular_result.to_json(output_path, orient="records", lines=True)

    return result, tabular_result

Reference: evaluate.py
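
A minimal driver for the pipeline might look like the following; the evaluation name and dataset path are placeholders, and the JSONL file is the one produced from the preprocessing step with query and answer columns.

if __name__ == "__main__":
    result, tabular_result = run_evaluation(
        name="llama-ft-function-calling-eval",   # display name for the evaluation run (placeholder)
        dataset_path="./data/eval_input.jsonl",  # JSONL with 'query' and 'answer' columns (placeholder path)
    )
    print(result["metrics"])       # aggregate BLEU, GLEU, METEOR, ROUGE-L and difference scores
    print(tabular_result.head())   # per-row results, also written to ./eval_results/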

Reviewing the Results:

Let's review the results for the function-calling scenarios only. 85 out of 102 records were an exact match, while the rest had discrepancies in the function arguments being passed. The difference evaluator's output gives insight into what the exact differences are, which we can use to improve model performance by fixing the training dataset and adjusting hyperparameters in subsequent iterations.

As can be inferred from the above results, the model struggles with cases that involve number conversion and date formatting, and we can leverage these insights to further improve its performance in the next fine-tuning iteration.

Conclusion
Evaluating fine-tuned models for function-calling requires a comprehensive approach that goes beyond input-response metrics. By incorporating function call verification, argument validation, and semantic similarity evaluation, we can ensure these models perform reliably and effectively in real-world applications. This holistic evaluation strategy not only enhances the model's accuracy but also ensures its robustness, maintainability, and user satisfaction.

Updated Jan 08, 2025
Version 2.0