Introduction
Charlotte Yeo, UCL MEng Computer Science https://www.linkedin.com/in/charlotte-yeo-627476294/
Supervisors: Janaina Mourao-Miranda (UCL) and Lee Stott (Microsoft).
For my final-year MEng project at UCL, I investigated how to get the best results out of SpecKit, a spec-driven AI development framework, by systematically testing different prompt strategies.
Here's what I found.
Project Overview
LLMs are powerful coding assistants, but they struggle to maintain context over long development sessions, leading to hallucinations and inconsistent outputs. SpecKit addresses this by using persistent, structured specification documents as memory throughout the development process. The developer writes a natural language spec; SpecKit builds the software from it.
The problem is that no one has established best practices for writing those specs. This project aimed to fill that gap.
Experiments
I ran 10 experiments, each using SpecKit to build the same target system, a multi-agent AI code verification tool, from a different prompt formulation. The variables I tested included prompt authority, prompt format, level of detail, and output format. Keeping the target software constant isolates the effect of each prompt change on SpecKit's performance.
The target system itself used Microsoft Agent Framework, Azure Cosmos DB for RAG, and Microsoft Foundry to access GPT-5.2, all orchestrated via a Python codebase. This covered a wide range of real-world engineering challenges: multi-agent coordination, cloud service integration, and working with a library new enough that the model hadn't been trained on it.
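The core pattern of such a verification tool is fanning the same code out to several specialised agents and merging their findings. The sketch below illustrates that pattern only; it is not the project's actual code, and the trivial rule-based agents are stand-ins for Microsoft Agent Framework agents backed by GPT-5.2 with Cosmos DB retrieval:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """Stand-in for an LLM-backed verification agent."""
    name: str
    review: Callable[[str], list[str]]  # code in, list of findings out

def verify(code: str, agents: list[Agent]) -> dict[str, list[str]]:
    """Fan the same code out to every agent and collect findings by name."""
    return {agent.name: agent.review(code) for agent in agents}

# Toy agents: each checks one property of the submitted code.
style_agent = Agent("style", lambda c: [] if '"""' in c else ["missing docstring"])
safety_agent = Agent("safety", lambda c: ["eval() is unsafe"] if "eval(" in c else [])

report = verify("def add(a, b): return eval('a+b')", [style_agent, safety_agent])
```

In the real system each `review` call would be an LLM invocation rather than a lambda, but the orchestration shape, parallel specialist agents over a shared artefact, is the same.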
Technical Details
SpecKit runs as a series of commands inside GitHub Copilot in VS Code, powered here by Claude Sonnet 4.5. The workflow moves through seven stages: /constitution → /specify → /clarify → /plan → /tasks → /analyze → /implement. At each stage, SpecKit writes and updates Markdown files that serve as persistent memory, so the session can be paused and resumed without losing context.
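The persistent-memory idea above can be sketched in a few lines of Python. This is not SpecKit's implementation; the stage-to-file mapping and file names are assumptions for illustration only:

```python
import tempfile
from pathlib import Path

# Illustrative stage -> artifact mapping (file names are assumed,
# not SpecKit's actual layout).
STAGE_FILES = {
    "constitution": "constitution.md",
    "specify": "spec.md",
    "clarify": "spec.md",  # /clarify updates the spec in place
    "plan": "plan.md",
    "tasks": "tasks.md",
}

def run_stage(workdir: Path, stage: str, content: str) -> Path:
    """Append a stage's output to its Markdown artifact, so a paused
    session can resume by re-reading the files on disk."""
    artifact = workdir / STAGE_FILES[stage]
    previous = artifact.read_text() if artifact.exists() else ""
    artifact.write_text(previous + content)
    return artifact

workdir = Path(tempfile.mkdtemp())
run_stage(workdir, "specify", "# Spec\nBuild a multi-agent code verifier.\n")
run_stage(workdir, "clarify", "\n## Clarifications\nTarget Python 3.12.\n")
spec = (workdir / "spec.md").read_text()
```

Because each stage's state lives on disk rather than in the chat context, re-reading the Markdown files is all a resumed session needs in order to continue where it left off.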
Key tools used:
- Microsoft Agent Framework — agent orchestration
- Microsoft Foundry — access to LLMs (GPT-5.2, Text Embedding 3)
- Azure Cosmos DB — code example database for RAG
- Claude Sonnet 4.5 — model powering SpecKit via GitHub Copilot
Results
These were the key findings:
- Natural language outperforms machine-readable formats. The JSON prompt (Case 1) took 40% longer and generated significantly more issues than the natural language control.
- Authority is necessary. Removing the authoritative framing from the prompt (Case 3) caused SpecKit to treat specifications as optional, resulting in the multi-agent system not being built at all until manually corrected. Total time: 4h 53m vs. 2h 24m for the control.
- Omit what the model already knows. Removing the scoring rubrics (Case 8) saved 34 minutes with no loss in output quality, as the model inferred the rubrics from context. However, omitting the Cosmos DB schema or agent architecture descriptions caused major implementation errors.
- The model must be able to read its own outputs. Changing the output to PDF (Case 9), a format Claude Sonnet 4.5 cannot read in Copilot, stretched the implementation stage to 7h 38m with 33 required interventions, because the model couldn't verify whether its code was working.
Best Practices Found
The biggest insight is that prompt design has as much impact on SpecKit's performance as prompt content. A complete specification written non-authoritatively or in JSON will produce worse results than a slightly shorter specification written in clear, authoritative natural language.
There is also a trade-off between token count and manual intervention. Shorter prompts are faster, but only when the omitted information is something the model can reliably infer. Leaving out details about unfamiliar libraries or architectures leads to longer debugging later.
Future Development
These are directions for future work in this area:
- Running each experiment multiple times to account for model non-determinism
- Repeating experiments with newer or different LLMs to test generalisability
- Testing with different target systems beyond code verification
- Supplying SpecKit with tools (e.g. Playwright MCP) to read outputs it currently cannot access, like live webpages or PDFs
Conclusion
Spec-driven development with SpecKit is a useful approach for building complex software with LLMs, but the quality of your prompt determines the quality of your outcome. For the most effective results, write in natural language, keep the whole prompt authoritative, include detail on novel or library-specific components, design your system's outputs so the model building it can read them, and leave out only what the model can confidently infer.
If you want to explore the tools used in this project, here are some useful starting points: