imperial college london
3 TopicsMind the Specs: Grading formal specifications and KPIs as artefacts for LLM-driven code generation
Large language models now write code straight from a prompt, but the specification in between is never checked, and a model asked to judge its own work brings the same blind spots to the review. We built a pipeline that lifts a plain-language requirements bundle into two graded specifications (a formal Alloy model and a set of numerical KPI targets), scores both before a single line of code is written, and hands the graded result to the code generator. It starts from GitHub Spec Kit and the Azure Well-Architected Framework. Here is what we built, and what we learned from running it at scale. The problem Writing software used to be four separate activities: gathering requirements, writing a specification, verifying it, and implementing it. A language model collapses all four into a single step. Two of those activities used to give us a quality signal before any code existed: a formal specification you could inspect, and measurable targets an implementation had to hit. The prompt-to-code loop inherits neither. There is no externally observable signal, before a line of code is written, that the requirements a model received are even well-formed enough to drive a correct implementation. You might think the model could just check its own work. It cannot do so reliably. Ask a language model to check the logic it just wrote: not only will it bring the same blind spot to the review, but its stochastic nature will make it produce different answers on each run. A SAT solver does not behave this way. Its verdict is deterministic: the same specification produces the same verdict every time. The thing that historically kept formal specification out of everyday development was never its rigour, it was the cost of writing the specification by hand. And that is exactly the step a language model can now do. What we built We built an agentic pipeline that sits between the requirements and the generated code. In plain terms it takes the requirements once, turns them into two things that can be checked by a machine: a precise description of rules that the system must obey, and a set of measurable targets that the system must hit. These artefacts are both graded, and are handed to the code generator. We split the work in two and gave each half to the tool that is good at it. The language model does the creative part, turning messy prose into formal structure. Deterministic checks, not the model's own opinion, grade what it produces. From a single Spec Kit artefacts bundle the pipeline builds two graded specifications before any code exists, and then carries both into code generation. Since these grades are computed deterministically rather than just generated, you can actually trust them. The input is a GitHub Spec Kit bundle. Spec Kit is an open-source, specification-first toolkit: instead of prompting for code directly, you describe what you want to build, and it produces a set of structured artefacts, a feature specification, a data model, and a set of API contracts. Our pipeline reads that bundle and turns it into the two graded specifications in parallel. overview. Spec Kit artefacts on the left. The Alloy lifter (with SAT solver and the attack step) and the KPI agent run in parallel. Their graded outputs are merged into a verification report that feeds the guided code generator. A dashed baseline path feeds the goal alone to the generator for comparison. Lift the requirements into a formal model The first half is structural. An Alloy lifter translates the requirements into a formal model written in Alloy, a specification language whose rules a SAT solver can check exhaustively, and whose verdict is deterministic, so the grade never depends on asking an LLM what it thinks. A banking requirement like "zero balance discrepancies" becomes a precise, checkable rule: the money leaving one account and the money arriving in another must always add up to the balances you started with, so a transfer can never quietly create or destroy money. The solver searches for any scenario that would break the rule. We modified Spec Kit's templates to force the model to output functional requirements and their corresponding Alloy code blocks in a structured format. Against the stock templates, that change alone nearly doubled the Alloy code compilation rate, jumping from 40 to 74 percent. A machine-written specification cannot be trusted, though, so the lifter does more than write it: it attacks it. Each load-bearing rule is deliberately broken by clearing its body and injecting a clause that forces a violation and the solver is re-run on the broken model. If the solver fails after this mutation, the original rule genuinely caught the violation it was meant to catch. If it still passes, the rule never really constrained anything on its own. Mutation testing usually grades a test suite against a specification that is assumed correct; here the roles are reversed, and the specification itself is on trial. Turn the requirements into measurable targets The second half is measurable. A KPI agent takes the same Spec Kit bundle, retrieves the most relevant principles from the Azure Well-Architected Framework, and derives numerical targets in the Goal-Question-Metric style. Each target carries an explicit threshold, a direction, and a measurement method, the kind of target a monitoring tool could actually track. Where earlier automated approaches stopped at describing quality in words, this half emits the actual numbers an implementation has to satisfy. And the knowledge base is a setting, not a fixture: swapping the Well-Architected Framework for ISO 25010, the NIST Cybersecurity Framework, or Google's SRE workbook requires zero changes to the underlying code. Review the report before any code Both graded halves merge into one human-readable verification report: the patterns the model applied, which rules passed, the counterexamples the solver found, the attack results, and the KPI threshold table. A developer reads it first and can see exactly where the specification is weak: a rule that passed for the wrong reason, or a requirement that nothing covers. After revising the specification, they re-run the lifting phase. Because the process is cached, re-runs are cheap, allowing the developer to loop until the report looks perfect, all before any code exists. The work shifts from reviewing generated code after the fact to curating a specification and reading a report before anything is built. Carry the graded context into code generation Only then does the report do its real job. In the guided pipeline, the merged report becomes the context handed to a code generator, which is asked to implement each rule, requirement, and KPI threshold and to leave markers tracing the code back to them. A baseline generator gets only the plain-language goal. Same generator, same settings; the only difference is whether it can see the graded specification. Feeding graded artefacts, rather than raw prose, into code generation is the piece that ties the whole pipeline together. So three choices separate this from simply asking a model for a spec: the specification is attacked rather than trusted, the targets are numbers rather than prose, and what reaches the code generator is graded evidence rather than raw text. How we tested it We ran the pipeline at scale: 270 Alloy lifts and 1,930 KPI records, across three application domains chosen to differ sharply (banking, software-as-a-service, and healthcare), three levels of requirement detail, four knowledge bases, and three model tiers, with ten runs of each combination so a real effect could be told apart from noise. For the code-generation half, we generated two codes for each case, once with the graded report as context and once from the plain-language goal alone, and compared the two. What we found First, the foundation: the specifications proved gradeable. The rubric cleanly separated sound specifications from degenerate ones. Because it returned the same verdict run after run, the grades are reliable enough to act on. The three key observations are as follows: The model matters more than the prompt Of the two knobs a practitioner controls, the model you choose and the amount of detail you write, the model dominated by roughly nine to one. A weak model could not be rescued by richer requirements. But you do not need the most expensive one: a mid-tier model delivered about 98 percent of the best model's quality at under a third of the cost and about half the time. The cheapest tier was a false economy, producing a model the analyser could even load only 23 percent of the time. More detail can backfire More requirements are not always better. Sparse and standard requirements scored the same, but over-specified requirements collapsed: KPI quality fell from about 0.89 to about 0.73, and the effect held across all four knowledge bases. Pile in too much numerical detail and the pipeline starts echoing the numbers it was handed instead of deriving sound ones, which is the opposite of what more detail is supposed to buy. Graded context produces far better code This is the payoff, and it is the point of the whole pipeline. Across all nine combinations of domain and detail, code generated with the graded verification context scored about 8 out of 10, against about 1 out of 10 for the same generator given only the plain-language goal. The guided code carried the traceability back to each requirement, the named rules, and the structural patterns that a bare prompt gives us no way to know about. This part of the study is a single run per combination, so we report the size and the consistency of the gap rather than a precise average, but the gap was large and it held in every case. What this means for you Four things to take from our study into your own work: Write requirements at a standard, middle level of detail. Not sparse, and not exhaustively numerical. The middle is the sweet spot on both halves of the specification. Reach for a capable mid-tier model before you invest in heavy prompt engineering. Model choice moves quality more than requirement detail does, and the mid tier is the value leader. Give the code generator externally graded context instead of letting it specify for itself. That is where most of the quality gain came from. Treat the knowledge base as a setting worth tuning, not a fixed ingredient. Each is a recommendation that data supports under the conditions we tested, not a universal law. The limit Every grade measures structure, not meaning. A high score says the specification is well-formed, discriminating, and stable. It does not say whether the invariants are the right ones, or the thresholds are the right ones for your deployment. A specification can be perfectly well-formed and still describe the wrong system. That judgement stays with a human, which is where we think it belongs. The pipeline is built to make that judgement efficient by moving it earlier, to curating the specification and reading the report, rather than to remove it. Generated code should not be shipped end to end without human validation. Try it The full pipeline, every input, and the artefacts behind every figure are in the project repository. If you want the Microsoft tools it builds on, start here: Project repository: https://github.com/RadaanMadhan/Specification-Led-Development GitHub Spec Kit: https://github.com/github/spec-kit Azure Well-Architected Framework: https://learn.microsoft.com/en-us/azure/well-architected/ If you'd like to explore the work in more detail, we've included the full technical report in the project repository, covering the related work, methodology, pipeline design, experimental setup, and extended results. About the team This project was carried out by six students at Imperial College London: Leon Hausmann, Charlotte Maxwell, Radaan Madhan, Keshav Das, Anson Huang, and Ander Cobo, in collaboration with Microsoft and supervised by Lee Stott (Microsoft) and Max Cattafi (Imperial College London)80Views1like0CommentsMicrosoft’s A-Grade Azure AI Stack: From Dissertation Prototype to Smart Campus Pilot
This post isn't just about the Student Support Agent (SSA) I built, which earned me a Distinction. It's about how Microsoft's tools made it possible to go from a rough concept to a robust pilot, proving their developer stack is one of the most convenient and powerful options for building intelligent, ethical, and scalable educational systems. The Vision: Cutting Through Campus Complexity University life is full of fragmented systems. Students constantly juggle multiple logins, websites, and interfaces just to check a timetable, book a room, or find a policy. My goal was simple: reduce that cognitive load by creating a unified assistant that could manage all these tasks through a single, intelligent conversation. The Stack That Made It Possible The core of the system relied on a few key, interconnected technologies: Technology Core Function Impact Azure AI Search Hybrid Data Retrieval Anchored responses in official documents. Azure OpenAI Natural Language Generation Created human-like, accurate answers. Semantic Kernel (SK) Multi-Agent Orchestration Managed complex workflows and memory. Azure Speech SDK Multimodal Interface Enabled accessible voice input and output. The foundation was built using Streamlit and FastAPI for rapid prototyping. Building a system that's context-aware, accessible, and extensible is a huge challenge, but it's exactly where the Microsoft AI stack shined. From Simple Chatbot to Multi-Agent Powerhouse Early campus chatbots are often single-agent models, great for basic FAQs, but they quickly fail when tasks span multiple services. I used Semantic Kernel (SK) Microsoft's powerful, open-source framework to build a modular, hub-and-spoke multi-agent system. A central orchestrator routes a request (like "book a study room") to a specialist agent (the Booking Agent), which knows exactly how to handle that task. This modularity was a game-changer: I could add new features (like an Events Agent) without breaking the core system, ensuring the architecture stayed clean and ready for expansion. Agentic Retrieval-Augmented Generation (Agentic RAG): Trust and Transparency To ensure the assistant was trustworthy, I used Agentic RAG to ground responses in real campus (Imperial College London) documentation. This included everything from admission fee payments to campus shuttle time. Azure AI Search indexed all handbooks and policies, allowing the assistant to pull relevant chunks of data and then cite the sources directly in its response. Result: The system avoids common hallucinations by refusing to answer when confidence is low. Students can verify every piece of advice, dramatically improving trust and transparency. Results: A Foundation for Scalable Support A pilot study with 15 students was highly successful: 100% positive feedback on the ease of use and perceived benefit. 93% satisfaction with the voice features. High trust was established due to transparent citations. The SSA proved it could save students time by centralising tasks like booking rooms, checking policies and offering study tips! Final Thoughts Microsoft’s AI ecosystem didn’t just support my dissertation; it shaped it. The tools were reliable, well-documented, and flexible enough to handle real-world complexity. More importantly, they allowed me to focus on student experience, ethics, and pedagogy, rather than wrestling with infrastructure. If you’re a student, educator, or developer looking to build intelligent systems that are transparent, inclusive, and scalable, Microsoft’s AI stack is a great place to start! 🙋🏽♀️ About Me I’m Tyana Tshiota, a postgraduate student in Applied Computational Science and Engineering at Imperial College London. Leveraging Microsoft’s AI stack and the extensive documentation on Microsoft Learn played a key role in achieving a Distinction in my dissertation. Moving forward, I’m excited to deepen my expertise by pursuing Azure certifications. I’d like to extend my sincere gratitude to my supervisor, Lee_Stott , for his invaluable mentorship and support throughout this project. If you haven’t already, check out his insightful posts on the Educator Developer Blog, or try building your own agent with the AI Agents for Beginners curriculum developed by Lee and his team! You can reach out via my LinkedIn if you’re interested in smart campus systems, AI in education, collaborative development, or would like to discuss opportunities.365Views0likes0Comments