Using Azure OpenAI Services to automate programming test scoring
Published Dec 14 2023

Problem Statement

GitHub Classroom allows educators to create unit tests that automatically score students’ programming tasks. However, the precondition for running unit tests is that the project code must compile and run without errors. If students cannot keep the project fully runnable, they receive a zero mark. This is undesirable, especially in a programming practical test: students who submit partially correct code statements should still earn some credit. As a result, the educator needs to review every submission’s source code one by one, a task that is exhausting, time-consuming, and hard to grade in a fair and consistent manner.


Solution


At IVE we have created GitHubClassroomAIGrader, a solution built from a collection of open-source Jupyter notebooks that leverages multiple Large Language Models, including Azure OpenAI Services, to automatically score programming tests and provide feedback on the results. Educators then have the opportunity to review the results and override any score that looks wrong.

cyruswong_0-1702461327896.png


Preparation

  1. Fork GitHubClassroomAIGrader.
  2. Create a Codespace.
  3. Install the GitHub Classroom extension for the GitHub CLI.
  4. Download all student repositories with “gh classroom clone student-repos” and move them into the “data” folder.
  5. Set up Azure OpenAI and get your API key.
  6. Update .env_template with your settings and rename it to .env (see the sketch below).
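
For reference, here is a minimal sketch of how the notebooks can pick up the .env configuration with python-dotenv. The variable names below are assumptions for illustration; use the names defined in .env_template.

# A minimal sketch, assuming python-dotenv and illustrative variable names;
# check .env_template for the names the notebooks actually expect.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the repository root

deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")  # assumed name
model_name = os.getenv("AZURE_OPENAI_MODEL_NAME")            # assumed name
# The legacy langchain AzureChatOpenAI client typically also reads the standard
# OPENAI_API_TYPE, OPENAI_API_BASE, OPENAI_API_VERSION, and OPENAI_API_KEY
# environment variables when talking to Azure OpenAI.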

cyruswong_1-1702461327899.png


Here is the order in which you need to run the notebooks:

  1. extract_answer.ipynb: This notebook extracts the program source code for each student.
  2. *_grader.ipynb: These notebooks use an LLM to score each student's answers, generate comments, and explain the score.
  3. human_review.ipynb: This notebook combines one or more LLM grader reports and generates an Excel report for the educator to review and override scores.
  4. generate_score_report.ipynb: This notebook generates the final score report and extracts samples.

The notebooks are designed to be self-explanatory. You can get started simply by changing the assignment name in the first cell to match the corresponding assignment, so I will not walk through every notebook in this post.


I will focus on azure_openai_grader.ipynb.


How do we score programming code?


For a programming test, we provide starter code to students. They are required to read the instructions and write additional code to meet the requirements. We already have a standard answer, so we store the question name, instructions, starter code, standard answer, and mark in an Excel sheet. This sheet is used to prompt the LLM and score student answers.

cyruswong_2-1702461327900.png
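
For illustration, here is a rough sketch of how such a sheet can be loaded into the standard_answer_dict used by the grader notebooks. The file name and the "Question" column are assumptions; the remaining keys match the fields referenced later in the code.

import pandas as pd

# Assumed file name; each row holds one question's name, instructions,
# starter code, standard answer, and mark.
standard_answer_df = pd.read_excel("standard_answer.xlsx")

# Build a dict keyed by question name, matching how standard_answer_dict
# is used in the grader notebooks.
standard_answer_dict = {
    row["Question"]: {
        "Instruction": row["Instruction"],
        "Starter": row["Starter"],
        "Answer": row["Answer"],
        "Mark": row["Mark"],
    }
    for _, row in standard_answer_df.iterrows()
}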

Excel sheet with student ID and GitHub user name.

cyruswong_3-1702461327902.png


To manage the LLMs and prompts more efficiently, we use LangChain. The code is straightforward and involves the following steps:

  1. We create a chat prompt that combines elements such as "instruction", "starter", "answer", "mark", "student_answer", and "student_commit". We also use "Run on Save" to help students commit code whenever they save; the number of commits is a good indicator that students are working honestly.
  2. We create an Azure OpenAI Services ChatGPT LLM and keep the temperature low, since scoring does not require creativity.
  3. We use PydanticOutputParser to produce the output format instructions and to extract the reply into a Python object.
  4. Finally, we pipe everything together into a runnable chain.

 

 

from langchain.chat_models import AzureChatOpenAI
from langchain.prompts.chat import ChatPromptTemplate
import langchain
langchain.debug = False
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field
from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

# Define your desired data structure.
class ScoreResult(BaseModel):
    score: int = Field(description="Score")
    comments: str = Field(description="Comments")
    calculation: str = Field(description="Calculation")

parser = PydanticOutputParser(pydantic_object=ScoreResult)

# deployment_name and model_name are set earlier in the notebook from the .env configuration.
llm = AzureChatOpenAI(
    deployment_name=deployment_name,
    model_name=model_name,
    temperature=0.2
)

def score_answer(instruction, starter, answer, mark, student_answer, student_commit, llm=llm, prompt_file="grader_prompt.txt"):
    # Load the grading prompt template from file
    with open(prompt_file) as f:
        grader_prompt = f.read()

    # System prompt: role description plus the JSON format instructions from the parser
    prompt = PromptTemplate(
        template="You are a Python programming instructor who grades student Python exercises.\n{format_instructions}\n",
        input_variables=[],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )
    system_message_prompt = SystemMessagePromptTemplate(prompt=prompt)

    # Human prompt: the grading template filled with the question and the student's answer
    human_message_prompt = HumanMessagePromptTemplate(prompt=PromptTemplate(
                                    template=grader_prompt,
                                    input_variables=["instruction", "starter", "answer", "mark", "student_answer", "student_commit"],
                                )
                            )

    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

    # Pipe prompt, model, and parser together into a runnable chain
    runnable = chat_prompt | llm | parser

    # Invoke the chain and return the parsed ScoreResult
    data = {"instruction": instruction,
            "starter": starter,
            "answer": answer,
            "mark": mark,
            "student_answer": student_answer,
            "student_commit": student_commit}
    output = runnable.invoke(data)
    return output

 

 

The output of parser.get_format_instructions() in the system prompt:

 

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"score": {"title": "Score", "description": "Score", "type": "integer"}, "comments": {"title": "Comments", "description": "Comments", "type": "string"}, "calculation": {"title": "Calculation", "description": "Calculation", "type": "string"}}, "required": ["score", "comments", "calculation"]}
```
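
To illustrate what the parser does with the model's reply, here is a small example that turns a made-up JSON string into a ScoreResult object:

# Illustrative only: the reply text below is made up.
sample_reply = '{"score": 7, "comments": "Input handling is correct; validation is missing.", "calculation": "7 of 10 answer lines match"}'

result = parser.parse(sample_reply)
print(result.score)        # 7
print(result.comments)     # Input handling is correct; validation is missing.
print(result.calculation)  # 7 of 10 answer lines match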

 


grader_prompt.txt

 

 

Programming question
<question>
{instruction}
</question>

<Starter> 

{starter}

</Starter> 

<StandardAnswer>


{answer}

</StandardAnswer>

<StudentAnswer>

{student_answer}

</StudentAnswer>

Number of times code commit to GitHub: {student_commit}

Student add the code statement from Starter.
Student follows the question to add more code statements.

Rubric:
- If the content of StudentAnswer is nearly the same as the content of Starter, score is 0 and comment “Not attempted”. Skip all other rules.
- The maximum score of this question is {mark}.
- Compare the StudentAnswer and StandardAnswer line by line and Programming logic. Give 1 score for each line of correct code.
- Don't give score to Code statements provided by the Starter.
- Evaluate both StandardAnswer and StudentAnswer for input, print, and main function line by line.
- Explain your score calculation.
- If you are unsure, don’t give a score!
- Give comments to the student.

The output must be in the following JSON format:
"""
{{
    "score" : "...",   
    "comments" : "...",
    "calculation" : "..."
}}
"""

 

 

Occasionally the LLM returns a reply that the parser cannot turn into a valid ScoreResult (for example, malformed JSON), and the chain fails for that student. In such instances of failure, we must intervene manually by switching to a stronger model, tweaking the parameters, or making minor updates to the prompt.
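
For context, here is a minimal sketch (not the notebook's exact code) of how the batch scoring loop can record such failures in a failed_cases list so they can be retried later with a stronger model:

failed_cases = []

for _, student in student_answer_df.iterrows():
    for question, spec in standard_answer_dict.items():
        try:
            result = score_answer(
                spec["Instruction"], spec["Starter"], spec["Answer"], spec["Mark"],
                student[question + " Content"], student[question + " Commit"],
            )
            student_answer_df.loc[student.name, question + " Score"] = result.score
            student_answer_df.loc[student.name, question + " Comments"] = result.comments
            student_answer_df.loc[student.name, question + " Calculation"] = result.calculation
        except Exception:
            # Parsing or API errors end up here for a manual rerun later.
            failed_cases.append({"directory": student["Directory"], "question": question})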


We begin by creating a backup of the batch job output.

 

 

backup_student_answer_df = student_answer_df.copy()

 

Manually rerun the unsuccessful cases by adjusting the following code and executing it again.

 

import time

print(f"Total failed cases: {len(failed_cases)}")
# Use a more powerful model to score the failed cases
original_deployment_name = deployment_name
original_model_name = model_name
deployment_name = "gpt-4"
model_name = "gpt-4"

tuned_llm = AzureChatOpenAI(
    deployment_name=deployment_name,
    model_name=model_name,
    temperature=0.1
)

if len(failed_cases) > 0:
    print("Failed cases:")
    # Iterate over a copy so that removing items does not skip cases
    for failed_case in list(failed_cases):
        # Get the student's row from student_answer_df by Directory
        mask = student_answer_df['Directory'] == failed_case["directory"]
        row = student_answer_df.loc[mask].iloc[0]
        question = failed_case['question']
        instruction = standard_answer_dict[question]["Instruction"]
        starter = standard_answer_dict[question]["Starter"]
        answer = standard_answer_dict[question]["Answer"]
        mark = standard_answer_dict[question]["Mark"]
        student_answer = row[question + " Content"]
        print(student_answer)
        student_commit = row[question + " Commit"]
        result = score_answer(instruction, starter, answer, mark, student_answer, student_commit, llm=tuned_llm)
        time.sleep(10)  # simple pause to avoid rate limits
        # Write the new result back into student_answer_df
        student_answer_df.loc[mask, question + " Score"] = result.score
        student_answer_df.loc[mask, question + " Comments"] = result.comments
        student_answer_df.loc[mask, question + " Calculation"] = result.calculation
        # Remove the resolved case from failed_cases
        failed_cases.remove(failed_case)

deployment_name = original_deployment_name
model_name = original_model_name

 


In our experience, most failed cases can be resolved by switching to GPT-4.

The output of human_review.ipynb

cyruswong_4-1702461546643.png


In Excel, the educator needs to select all cells and click “Wrap Text”.

 

cyruswong_7-1702461590021.png

 

We can then see the comments and the score calculation details.

 

cyruswong_8-1702461598816.png

 

The "Average" refers to the mean score given by all LLM graders. As a result, the teacher can simply scan through the student's code and override the "Score" field.


Conclusion


The default setting in GitHub Classroom for unit tests requires project code to compile and run without errors, leading to zero marks for non-runnable projects. This is not ideal for programming practical tests, as students should receive some credit for partially correct code statements. To address this issue, we have developed a solution that utilizes open-source Jupyter notebooks and multiple Large Language Models, including Azure OpenAI Services ChatGPT, to automatically score programming tests and provide feedback. Educators can review and override scores if necessary. The solution involves preparing the environment, running the notebooks in a specific order, and manually intervening in cases of failure. This approach does not replace all manual work, but it provides an objective reference that helps prevent errors and reduces the time required to score programming submissions from GitHub Classroom.

Project collaborators include Markus Kwok, Hau Ling, Lau Hing Pui, and Xu Yuan from the IT114115 Higher Diploma in Cloud and Data Centre Administration and Microsoft Learn Student Ambassador candidates.


About the Author


Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology, where he focuses on teaching public cloud technologies. He is a Microsoft Learn for Educators Ambassador and a Microsoft Azure MVP from Hong Kong.

cyruswong_9-1702461803087.jpeg
