Today we’re introducing Office Agent, a multi-agent system that builds upon an open-source stack, Anthropic's Claude model, and a new taste-driven development (TDD) paradigm to deliver polished PowerPoint presentations, ready-to-use Word documents, and soon Excel spreadsheets.
The system orchestrates specialized agents that can plan, draft, and refine Office artifacts end-to-end. Built on a general-purpose agent architecture and validated against leading benchmarks, Office Agent consistently demonstrates state-of-the-art performance, proving its ability to handle complex workflows with both reliability and polish.
GAIA Reported Results
[Chart: official GAIA metrics as reported by the AI vendors: Manus (Mar 10, 2025), Genspark (Apr 25, 2025), OpenAI Deep Research (Feb 2, 2025). Levels L1–L3 denote test queries of increasing difficulty, from easiest to hardest.]
Design – Multi-Agent System Orchestration with Open Source
At its core, the Agent is powered by a multi-agent orchestration engine:
- A central planner agent coordinates tasks and synthesizes results
- Specialized agents (for code, finance, search, and more) work in parallel
- A secure tool layer integrates utilities and sandboxed environments
Built on the community innovation of open-source frameworks, Office Agent delivers coordinated agent workflows with the performance and reliability needed for daily tasks.
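The sketch below illustrates this planner/worker pattern in Python. The worker interfaces and the trivial plan step are assumptions for illustration, not Office Agent's actual implementation.

```python
# Minimal planner/worker sketch: a central planner fans subtasks out to
# specialized agents in parallel, then synthesizes the results.
# The worker callables and plan() heuristic are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor

class PlannerAgent:
    def __init__(self, workers: dict):
        self.workers = workers  # e.g. {"search": search_agent, "code": code_agent}

    def plan(self, task: str) -> list[tuple[str, str]]:
        # The real system would use an LLM to decompose the task;
        # here every worker simply receives the full task.
        return [(name, task) for name in self.workers]

    def synthesize(self, task: str, results: list[str]) -> str:
        # Stand-in for LLM-based synthesis of worker outputs.
        return "\n\n".join(results)

    def run(self, task: str) -> str:
        subtasks = self.plan(task)
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda wt: self.workers[wt[0]](wt[1]), subtasks))
        return self.synthesize(task, results)
```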
Taste-Driven Development (TDD)
Creating polished and professional artifacts is where the real value is. Most AI agents create presentations by simply generating raw code that results in uneven layout and cluttered visuals, leaving users with rounds of manual fixes.
Office Agent introduces a new approach to creation via Taste-Driven Development (TDD):
- Reusable “taste blueprints” distilled from high-quality, in-house accumulated content
- Consistent design language across slides and documents
- Outputs that are both ready-to-use and aesthetically refined
With TDD, the Agent hits a high bar for AI-generated content, particularly in aesthetic layout. Take PPT generation as one example. In our TDD framework, the generation of a tasteful PPT begins with taste distillation: we analyze a large collection of high-quality presentation samples and extract the underlying taste blueprints. This distilled prior knowledge is then injected into the agent’s planning and execution process, directly influencing layout, style, and content generation.
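To make the idea concrete, here is a minimal sketch of what a taste blueprint might look like as a data structure; the field names and values are assumptions, since the internal schema is not public.

```python
# Hypothetical shape of a distilled "taste blueprint" (illustrative only).
from dataclasses import dataclass, field

@dataclass
class TasteBlueprint:
    name: str                    # e.g. "editorial-serif"
    palette: list[str]           # curated hex colors, e.g. ["#1B1B1B", "#F4EFE8"]
    type_scale: dict[str, int]   # text role -> point size, e.g. {"h1": 40, "body": 18}
    layout_rules: list[str] = field(default_factory=lambda: [
        "max three content blocks per slide",
        "align imagery to a consistent grid",
    ])

# At generation time the blueprint is injected into the agent's planning and
# execution prompts, so layout, style, and content draw on the same priors.
```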
The workflow operates in an iterative loop. Each generated artifact is first reviewed through a content self-verification module, which evaluates both quality and taste. Feedback from this review is passed back to the agent, enabling self-iteration and refinement.
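A minimal sketch of that loop, assuming model-backed generate, review, and refine callables (the real module interfaces are not public):

```python
# Generate-review-refine loop: each artifact passes through content
# self-verification, and the critique drives the next iteration.
def produce_artifact(task, blueprint, generate, review, refine, max_rounds=3):
    artifact = generate(task, blueprint)
    for _ in range(max_rounds):
        passed, feedback = review(artifact)      # evaluates both quality and taste
        if passed:
            break
        artifact = refine(artifact, feedback)    # self-iteration from the feedback
    return artifact
```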
The final output is a set of HTML5-based slides that balance expressive design with structural rigor. To maximize usability, we also provide a conversion tool that automatically translates these HTML5 slides into PowerPoint format for further editing in Microsoft PowerPoint.
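The conversion step could look something like the sketch below, which maps each HTML section to a PowerPoint slide using BeautifulSoup and python-pptx. The real converter and its slide markup are not public, so the structure assumed here (one top-level section per slide, with an h1 title and p body paragraphs) is illustrative.

```python
# Illustrative HTML5-slides -> .pptx conversion (assumed markup, see above).
from bs4 import BeautifulSoup
from pptx import Presentation
from pptx.util import Inches, Pt

def html_to_pptx(html: str, out_path: str) -> None:
    soup = BeautifulSoup(html, "html.parser")
    deck = Presentation()
    blank = deck.slide_layouts[6]  # blank layout; we place text boxes manually
    for section in soup.find_all("section"):
        slide = deck.slides.add_slide(blank)
        title = section.find("h1")
        title_box = slide.shapes.add_textbox(Inches(0.5), Inches(0.4), Inches(9), Inches(1.2))
        title_box.text_frame.text = title.get_text(strip=True) if title else ""
        title_box.text_frame.paragraphs[0].font.size = Pt(32)
        body_box = slide.shapes.add_textbox(Inches(0.5), Inches(1.8), Inches(9), Inches(4.8))
        tf = body_box.text_frame
        for i, p in enumerate(section.find_all("p")):
            para = tf.paragraphs[0] if i == 0 else tf.add_paragraph()
            para.text = p.get_text(strip=True)
            para.font.size = Pt(18)
    deck.save(out_path)
```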
Examples of Office Agent’s Process and Output:
Neural Networks Lecture [Full Replay] Prompt: I’m giving a lecture on neural networks — can you help me with the teaching slides?
Future of Work Trends [Full Replay] Prompt: Create a presentation summarizing the top 5 global trends shaping the future of work (e.g., AI adoption, remote work, skills-based hiring). Include Microsoft’s WorkLab as a data source.
Evolution of Coffee Culture [Full Replay] Prompt: Create slides for the evolution of coffee culture.
Supply Chain Resilience Shift [Full Replay] Prompt: Show the global shift from efficiency to resilience in enterprise supply chain strategy. Use elegant world maps, timeline graphics, and refined serif headings on muted backdrops.
Auto Theming for High-Quality Output:
Preset themes have long been the default answer for anyone creating presentations. They offer diversity, but often at the cost of precision. The assumption is that more options are better: if users can browse through enough templates, they’ll eventually find the right fit. In practice, it rarely works that way. Users don’t want to scroll through endless designs; they want something that reflects their content with taste.
That’s why we designed auto theming. Instead of asking users to choose from a list of predefined templates, auto theming reads the content itself and generates a design that fits naturally. The result is not just another theme, but the right theme for the job.
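One way to picture auto theming, as a hedged sketch: ask the model for theme attributes grounded in the deck's content rather than choosing from presets. The llm() call and the attribute schema below are assumptions.

```python
# Hypothetical content-driven theme generation (illustrative only).
import json

def auto_theme(outline: str, llm) -> dict:
    prompt = (
        "Given this presentation outline, propose a visual theme as JSON with "
        "keys: palette (hex colors), heading_font, body_font, mood.\n" + outline
    )
    return json.loads(llm(prompt))

# A Bauhaus Movement deck might yield primary colors and geometric sans-serifs,
# while a first-grade butterflies deck yields soft pastels and rounded type.
```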
See examples: Pokémon Cultural, Butterflies for 1st Graders, Black Myth, Bauhaus Movement
Expert-Guided Taste Refinement
While TDD raises the floor of quality, the influence of human judgment is embedded in the system’s foundations. During development, designers shaped the system’s taste by reviewing and refining example cases and curating the strongest patterns. These design insights were distilled into style rules that the Agent applies at runtime, ensuring outputs align with high-level prompts and deliver polished results at scale.
TDDEval
To evaluate the usefulness of generated artifacts across PowerPoint, Excel, and Word, we developed a benchmark purpose-built for taste-driven generation: TDDEval. Unlike general-purpose benchmarks, TDDEval captures the breadth of knowledge work by spanning a wide spectrum of test tasks. The benchmark includes high-value, representative scenarios such as “Create a business plan PPT,” “Generate a budget forecast in Excel,” and “Write a formal report in Word.” It also incorporates edge cases that stress-test system robustness, from broad open-ended prompts to tightly specified analytical requests.
Quality is measured through a dual-lens framework:
- Content Quality – Evaluating the factual and structural integrity of outputs, including (a) grounding in source material, (b) topical relevance, (c) completeness of coverage, (d) logical structure, and (e) practical usability.
- Taste Score – Capturing the aesthetic and experiential dimension, including (a) visual appeal, (b) layout and organization, (c) typographic quality, (d) design consistency, and (e) the curation of visual assets.
Together, these two axes provide a holistic view: does the output say the right thing, and does it look and feel professional enough to use immediately? By quantifying both substance and style, TDDEval sets a higher bar for what “quality” means in AI-generated productivity content.
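As a rough sketch of how such a dual-lens score could be aggregated (the criteria keys mirror the lists above; the judge() rater and equal weighting are assumptions):

```python
# Illustrative TDDEval-style aggregation; judge() is a hypothetical rater
# (LLM- or human-backed) returning a score in [0, 1] per criterion.
CONTENT_CRITERIA = ["grounding", "relevance", "completeness", "structure", "usability"]
TASTE_CRITERIA = ["visual_appeal", "layout", "typography", "consistency", "asset_curation"]

def tddeval_score(artifact, judge) -> dict:
    content = [judge(artifact, c) for c in CONTENT_CRITERIA]
    taste = [judge(artifact, c) for c in TASTE_CRITERIA]
    return {
        "content_quality": sum(content) / len(content),
        "taste_score": sum(taste) / len(taste),
    }
```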
Learnings
Through building and testing Office Agent, we uncovered a series of learnings that shaped its design and performance. They reflect what it actually takes to make agentic systems reliable, accurate, and useful in real-world productivity scenarios.
Learning 1 – General-purpose code execution is preferred over task-specific tools
While task-specific tools work well for predictable, repeatable scenarios, they can constrain flexibility and limit an agent’s ability to generalize across diverse tool calls. To build a high-quality general-purpose agent, Office Agent therefore adopts a code-first approach, allowing the model to write and execute code (e.g., for MP3 transcription or PDF text extraction) rather than relying on task-specific tools.
This keeps the agent general-purpose, like a full-stack developer, rather than a narrowly trained single-task solver.
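As a sketch of the code-first pattern (the model interface and the bare subprocess sandbox are assumptions; a production sandbox would add real isolation):

```python
# Code-first execution: the model writes a script for the task, and the agent
# runs it in a separate process instead of calling a task-specific tool.
import subprocess, sys, tempfile

def run_generated_code(task: str, model) -> str:
    prompt = ("Write a self-contained Python script that accomplishes the task "
              f"and prints the result:\n{task}")
    code = model.complete(prompt)  # hypothetical model call
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True, text=True, timeout=120,  # bound runaway scripts
    )
    return result.stdout if result.returncode == 0 else result.stderr
```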
Learning 2 – Self-validation drives accuracy
For complex or multi-step tasks, having the agent regularly verify its progress and self-assess correctness drives up accuracy (a minimal checkpoint sketch follows this list):
- We encourage the model to restate the original question and compare it against its current output to ensure alignment.
- Inserting intermediate checkpoints improves reliability, particularly for tasks requiring precision, filtering, or multi-source synthesis.
- Human-in-the-loop: From a user experience perspective, users can ask the Office Agent to review execution results or generated artifacts to verify whether the content meets their expectations, and can request further adjustments as needed.
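A minimal sketch of the restate-and-compare checkpoint, assuming a hypothetical llm() completion call:

```python
# Self-validation checkpoint: restate the question, compare it to the draft,
# and return whether the agent should proceed or refine.
def checkpoint(question: str, draft: str, llm) -> tuple[bool, str]:
    critique = llm(
        "Restate the original question, then judge whether the draft fully "
        f"answers it.\nQuestion: {question}\nDraft: {draft}\n"
        "Reply with 'PASS' if aligned; otherwise list the gaps."
    )
    return critique.strip().startswith("PASS"), critique
```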
Learning 3 – Human-like browsing, rather than just content fetching
Browser tools should enable human-like web navigation, not just page scraping; a browsing-loop sketch follows this list.
- Go beyond extracting raw page content: empower the agent to browse like a human.
- Encourage the model to click links, paginate, and scroll through long pages, treating each browsing action as part of a continuous information-gathering journey.
- Incorporate all intermediate observations into the evolving context for better reasoning.
- Use LLM-based summarization to condense lengthy content efficiently, preserving key details while optimizing memory.
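Here is a hedged sketch of such a browsing loop using Playwright; the decide() and summarize() steps stand in for the model's action selection and LLM-based summarization, and are assumptions:

```python
# Human-like browsing loop: observe the page, summarize it into context,
# then click, scroll, or stop based on the model's decision.
from playwright.sync_api import sync_playwright

def browse(url: str, decide, summarize, max_steps: int = 10) -> list[str]:
    notes = []
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for _ in range(max_steps):
            notes.append(summarize(page.inner_text("body")))  # condense, keep key details
            action = decide(notes)  # e.g. {"click": "a.next"}, {"scroll": True}, {"stop": True}
            if action.get("stop"):
                break
            if "click" in action:
                page.click(action["click"])   # follow links / paginate
            elif "scroll" in action:
                page.mouse.wheel(0, 2000)     # scroll through long pages
    return notes
```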
Learning 4 – Injecting preference-grounded knowledge leads to better task execution
While LLMs possess broad world knowledge, they often lack task-specific preferences unless explicitly guided. Injecting prior knowledge or preferred choices (e.g., “use python-docx for .docx document processing”) helps the agent select the optimal execution path faster, leading to more consistent behavior and better tool selection. This guidance also reduces hallucination by steering decisions toward reliable, proven patterns.
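For instance, such preferences might be injected by prepending a curated list to the system prompt, as in this sketch (the preference list and prompt shape are assumptions):

```python
# Preference-grounded knowledge injection: curated tool choices are prepended
# to the system prompt so the agent picks proven execution paths first.
TOOL_PREFERENCES = """\
- Use python-docx for .docx document processing.
- Use python-pptx when output must be a .pptx deck.
- Prefer simple HTTP fetching for static pages; use the browser tool only
  when the page requires interaction.
"""

def build_system_prompt(base_prompt: str) -> str:
    return (f"{base_prompt}\n\nWhen selecting tools, follow these "
            f"preferences:\n{TOOL_PREFERENCES}")
```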
The Road Ahead
Today, Office Agent is available to Microsoft 365 Personal & Family subscribers via the Frontier program, with Commercial support on the horizon. Office Agent represents a zero-to-one creation tool - generating high-quality, research-backed artifacts from scratch - while Copilot in PowerPoint, Word, and Excel remains the in-app expert, helping users refine, edit, and evolve within each application. Together, they meet users where they are in their workflows. Learn more about it in our announcement blog.
This is just the beginning. We are advancing agent orchestration, enriching our taste libraries, and extending integration across the Microsoft ecosystem. Agentic systems don’t just assist with tasks; they reshape how knowledge work is created, polished, and completed at scale. Get started here.