
Azure High Performance Computing (HPC) Blog

Automating HPC Workflows with Copilot Agents

xpillons (Microsoft)
Dec 03, 2025

Let AI Do the Heavy Lifting

Introduction

High Performance Computing (HPC) workloads are complex, requiring precise job submission scripts and careful resource management. Manually writing submission scripts for applications like OpenFOAM is time-consuming, error-prone, and often frustrating. At SC25, we showcased how AI-powered Copilot Agents are transforming HPC workflows by automating Slurm submission scripts, making scientific computing more efficient and accessible. A full demonstration is available in the video at the end of this article.


Why Automate HPC Workflows?

High-performance computing workloads are often elaborate, requiring carefully structured job submission scripts to efficiently manage system resources. In applications like OpenFOAM, where precise setup of nodes, tasks, and memory is essential, composing these scripts by hand can be both labor-intensive and susceptible to errors.

Manually creating Slurm scripts not only consumes valuable time but also raises the likelihood of mistakes, resulting in failed jobs and costly delays that slow research and innovation. For OpenFOAM users, this translates into spending less time on actual simulations and more time resolving script-related problems.

Automating the creation of these scripts eases the burden on researchers and engineers by accelerating research processes, minimizing errors, and enabling users to dedicate more attention to simulation and analysis instead of debugging submission issues.


AI-powered Workflow Automation

Copilot Agents use artificial intelligence to simplify the process of creating job submission scripts, helping HPC workflows run smoothly and efficiently. With this system, users can focus less on manual scripting and more on research and analysis.

Copilot Agent recognizes your workload's context and applies best practices to create precise and optimized Slurm scripts. It interprets specific needs so that each script matches the requirements of individual jobs, which helps with resource allocation and scheduling.

Key benefits include quicker script creation, fewer mistakes, and greater consistency across HPC tasks. Automating this process speeds up the workflow and maintains standards, resulting in more dependable and repeatable job submissions.


Typical Workflow with Copilot Agents

Defining the Context: Begin by outlining your workload requirements clearly and thoroughly. Indicate how to load and run the application, specify the number of tasks per node, and detail any special logging or configuration instructions. The more accurate you are with these details, the more effectively the agent can create a reliable script.

Script Generation by AI: Copilot processes your input and automatically creates a full Slurm submission script. Using AI models, this stage incorporates best practices to save time and prevent errors.

Validation and Submission: After the script is built, it’s checked for accuracy and submitted to the scheduler. You should always examine the output and error logs and adjust as needed. This ongoing review helps ensure that jobs run smoothly and improves your workflow over time.
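Once the generated script passes review, submission and monitoring follow standard Slurm usage. The commands below are a sketch of that loop; the script and log file names are placeholders, and running them requires access to a Slurm cluster:

```shell
# Submit the generated script; sbatch prints "Submitted batch job <id>".
sbatch run_openfoam.sh

# Confirm the job is queued or running under your user.
squeue -u $USER

# Follow the output log while the job runs; check the .err file as well.
tail -f run_openfoam.out
```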


Best Practices for Defining Context

Consider context as your guideline: providing more specific and thorough details helps the agent produce a more accurate Slurm script. Always make your instructions straightforward and precise. Add links to relevant documentation when possible, and share example cases that show exactly what you need. Be clear about requirements like how to load applications, set the number of tasks per node, or any special configuration and logging needs. Clear and complete context not only lowers the chance of mistakes but also results in higher-quality scripts, ultimately saving you time and effort.
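As an illustration, a context prompt for an OpenFOAM run might look like the following. The module name, sizing, and solver here are hypothetical; substitute the values your site actually uses:

```text
Create a Slurm submission script for an OpenFOAM motorBike case:
- Load OpenFOAM with `module load openfoam/2312` (adjust to your site).
- Use 2 nodes with 60 tasks per node; derive the total task count from
  Slurm environment variables rather than hardcoding it.
- Write stdout and stderr to separate log files named after the job.
- Run simpleFoam in parallel via srun.
```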

Script Generation: Iterative Improvement

Model Selection: Advanced models such as GPT-5 are capable of producing highly detailed and comprehensive scripts. Although the initial draft may require additional time to generate, these models typically integrate best practices and sophisticated configuration options, which can be further refined through iterative development.

Iterative Improvement: The initial script produced by AI generally serves as a starting point for further enhancement. Systematic revisions informed by output logs, error reports, and user feedback contribute to improving the accuracy, efficiency, and customization of the final submission script according to the specific needs of your HPC workload.

Practical Example: As demonstrated in the video below, a chat-based Copilot Agent facilitates script creation by prompting for the script name and subsequently generating a Bash script that incorporates all requested features. These include leveraging Slurm environment variables, automating task distribution, loading requisite modules, and enabling comprehensive logging. The resulting script is prepared for submission via the sbatch command.
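To make that concrete, a generated script along those lines might look like the sketch below. The module name, node counts, paths, and solver are assumptions for illustration, not the exact values from the demo, and it only runs on a cluster with Slurm and OpenFOAM installed:

```shell
#!/bin/bash
# Sketch of an AI-generated Slurm submission script for an OpenFOAM case.
# Module name, sizing, and solver are illustrative assumptions.
#SBATCH --job-name=motorBike
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=60
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

set -euo pipefail

# Load the solver environment (module name is site-specific).
module load openfoam/2312

# Comprehensive logging: record the allocation Slurm actually granted,
# using Slurm environment variables instead of hardcoded values.
echo "Job $SLURM_JOB_ID on $SLURM_JOB_NUM_NODES node(s), $SLURM_NTASKS task(s)"

# srun distributes tasks across the allocation automatically.
srun simpleFoam -parallel
```

Submit it with `sbatch` as described above, and the `%x_%j` patterns keep each run's logs separate by job name and job ID.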

Validation and Continuous Improvement

Once you have generated your Slurm submission script using Copilot Agent, it is essential to conduct a careful review of the output prior to executing your job. This preliminary assessment is critical for identifying potential issues early and ensuring that the script aligns with your specific workload requirements.

Submit the job to the scheduler for validation, and diligently monitor both the output and error log files, as these will inform your subsequent actions.

Should errors arise—such as missing file paths or incorrect module loads—utilize the feedback from the logs to amend your script accordingly. This iterative refinement process is fundamental to optimizing your workflow and achieving reliable job execution.

The accompanying example illustrates how Copilot Agent can assist in locating and correcting errors, such as updating an OpenFOAM tutorial path. By leveraging AI-enabled feedback, users are able to efficiently address issues and confidently resubmit jobs.
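A quick way to pull that feedback out of a long error log is to filter for common failure keywords and paste only those lines back to the agent. This sketch builds a sample log inline so the commands are self-contained; the log contents are made up for illustration:

```shell
# Triage a Slurm error log by filtering for common failure keywords.
# The log contents below are a fabricated sample for illustration.
LOG=slurm_demo.err
cat > "$LOG" <<'EOF'
Loading openmpi/4.1.5
ERROR: cannot open case directory /data/cases/motorBike
slurmstepd: error: execve(): simpleFoam: No such file or directory
EOF

# Keep only the lines that usually explain why a job failed; these are
# the lines worth feeding back to the agent for a fix.
ISSUES=$(grep -iE 'error|not found|no such file' "$LOG")
printf '%s\n' "$ISSUES"
```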

Continuous validation and revision are paramount to advancing high-performance computing automation. Consistently refer to output and error logs to guide subsequent iterations, thereby enhancing the robustness and dependability of your scripts over time.


Key Benefits

Time Efficiency: Copilot Agents significantly decrease the time needed to generate job submission scripts. Tasks that previously required hours of manual scripting can now be completed within minutes, enabling researchers and engineers to give more attention to simulation and analysis rather than script troubleshooting.

Error Reduction: Automation substantially lowers the risk of human error commonly associated with manual script development. By enforcing best practices and standardizing the script generation process, Copilot Agents improve reliability and minimize job failures.

Enhanced Scalability: Automated workflows facilitate more efficient scaling across high-performance computing (HPC) environments. As workloads increase in complexity and scale, Copilot Agents support consistency and optimal resource utilization, simplifying the management of expansive simulations.

User-Friendly Automation: Copilot Agents make HPC scripting more approachable for new users by offering intuitive automation and guidance. This approach ensures adherence to best practices and broadens accessibility, even for individuals with limited prior experience.

Updated Dec 03, 2025
Version 1.0