
Microsoft 365 Copilot Blog
4 MIN READ

Introducing multi-model intelligence in Researcher

gauravanand
Microsoft
Mar 30, 2026

Today, Researcher—Microsoft 365 Copilot's deep research agent for work—takes a significant step forward. Designed to tackle complex research in the flow of work, Researcher now goes further with two new multi-model capabilities that raise the bar for accuracy, depth, and confidence: Critique and Council.

Critique is a new multi-model deep research system designed for complex research tasks. It separates generation from evaluation and combines models from frontier labs, including Anthropic and OpenAI. One model leads the generation phase: it plans the task, iterates through retrieval, and produces an initial draft. A second model focuses on review and refinement, acting as an expert reviewer before the final report is produced. Our evaluations show that this architecture outperforms traditional single-model approaches and delivers best-in-class deep research quality. The design also provides clear optionality across the generator and reviewer roles, with the ability to support and expand these roles over time as the system evolves.

DRACO (Deep Research Accuracy, Completeness, and Objectivity) benchmark scores across 100 complex research tasks spanning 10 domains. All results are sourced from the original paper [Zhong et al., arXiv:2602.11685 (Feb. 2026)], except for Researcher with Critique. Researcher with Critique achieves a substantial improvement of +7.0 points (SEM ±1.90) in the aggregated score, a +13.88% gain over Perplexity Deep Research (Claude Opus 4.6), the best system reported in the paper.

Council brings multiple model responses side by side in the Researcher experience. A cover letter then provides valuable insights on where the models agree, where they diverge, and the unique insights each brings to the topic.

Critique & how it works

Many AI research workflows rely on a single model to handle planning, sourcing, synthesis, and writing. Critique takes a different approach, dividing responsibilities between two AI partners: one optimized for deep exploration and structured synthesis, and a second focused on validating claims, improving presentation, and strengthening structure. By giving evaluation as much emphasis as generation, this architecture creates a powerful feedback loop that delivers higher-quality results across factual accuracy, analytical breadth, and presentation. Critique will be the default experience in Researcher, available when Auto is selected in the model picker.

Critique follows a review process similar to those conducted in academic and professional research settings. It’s built around rubric‑based evaluation—a structured review that focuses on strengthening the report without turning the reviewer into a second author. The reviewer examines the report from several angles and then generates an enhanced report, concentrating on the following dimensions:

- Source Reliability Assessment. The reviewer emphasizes the use of reputable, authoritative, and domain‑appropriate sources, prioritizing evidence that is verifiable and suitable for your research context.

- Report Completeness. The reviewer evaluates whether the final report comprehensively satisfies the intent of your request with relevant and unique insights.

- Strict Evidence Grounding Enforcement. The reviewer enforces a conservative grounding standard, requiring every key claim to be anchored to reliable sources with precise citations—strengthening factual accuracy, reliability, and trust in the final report.
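The separation of generation from evaluation described above can be sketched as a simple two-model loop. This is a purely illustrative mock, not Microsoft's implementation: the generator and reviewer here are stand-in functions where the real system would call frontier models, and the rubric dimensions are taken from the list above.

```python
# Illustrative sketch of Critique's generate-then-review architecture.
# The "models" are stand-in functions; in the real system these would be
# calls to different frontier models (e.g. one generator, one reviewer).

RUBRIC = ["source reliability", "report completeness", "evidence grounding"]

def generator(task, feedback=None):
    """Stand-in for the generation model: plans, retrieves, and drafts.
    When reviewer feedback is supplied, it produces a revised report."""
    draft = f"Draft report on: {task}"
    if feedback:
        draft += f" [revised to address: {'; '.join(feedback)}]"
    return draft

def reviewer(task, draft):
    """Stand-in for the reviewer model: rubric-based critique only.
    It returns feedback rather than rewriting the report itself."""
    return [f"strengthen {dim}" for dim in RUBRIC]

def critique_pipeline(task):
    draft = generator(task)                 # phase 1: generation
    feedback = reviewer(task, draft)        # phase 2: rubric-based review
    return generator(task, feedback=feedback)  # phase 3: refined report

report = critique_pipeline("EU AI Act compliance timeline")
```

The key design point the sketch captures is that the reviewer emits structured feedback instead of prose, which keeps it an evaluator rather than a second author.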

Performance Validation on the DRACO Benchmark

We evaluated Critique on the DRACO benchmark (Deep Research Accuracy, Completeness, and Objectivity)—100 complex deep research tasks across 10 domains, introduced by researchers from Perplexity and academia in February 2026 [Zhong et al., arXiv:2602.11685]. These research tasks originate from anonymized real-world usage patterns executed in a large-scale research system. Responses are graded against task-specific rubrics along four dimensions: factual accuracy, breadth and depth of analysis, presentation quality, and citation quality.

The DRACO results were evaluated using OpenAI’s GPT-5.2 as the LLM judge, the strictest of the three judge models reported in the paper. We applied the same evaluation protocol and configuration published in the benchmark paper, helping ensure an apples-to-apples comparison. Across all measurements, results were computed by averaging scores over the full DRACO dataset, with each question evaluated across five independent runs.
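The aggregation protocol above (per-task means over five runs, then an overall mean with its standard error) can be made concrete with a short sketch. The scores below are synthetic random numbers; only the shape of the computation reflects the described protocol.

```python
# Sketch of the score aggregation: each of the 100 DRACO tasks is scored
# over 5 independent runs; per-task means are averaged into an aggregate
# score, and SEM is computed over the per-task means. Scores are synthetic.
import random
import statistics

random.seed(0)
NUM_TASKS, NUM_RUNS = 100, 5

# Synthetic stand-in for judge scores (one list of run scores per task).
scores = [[random.uniform(60, 90) for _ in range(NUM_RUNS)]
          for _ in range(NUM_TASKS)]

task_means = [statistics.mean(runs) for runs in scores]
overall = statistics.mean(task_means)
sem = statistics.stdev(task_means) / (NUM_TASKS ** 0.5)
print(f"aggregate score: {overall:.2f} (SEM ±{sem:.2f})")
```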

To better understand the advantages of Critique, we compared the new architecture with the single-model Researcher (using the same GPT-5.2 judge) across the four evaluation axes that DRACO defines.

We see the largest improvement in Breadth and Depth of Analysis (+3.33), followed by Presentation Quality (+3.04) and Factual Accuracy (+2.58). All dimensions show statistically significant improvements (paired t-test, p < 0.0001).
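The paired t-test behind these significance claims operates on per-task score deltas: the same task is scored under both systems, and the test asks whether the mean delta is reliably nonzero. A minimal illustration with synthetic numbers (the effect size here is invented for demonstration):

```python
# Sketch of a paired t-test on per-task score deltas between two systems.
# Scores are synthetic; with 100 paired samples, t = mean(d) / (sd(d)/sqrt(n)).
import random
import statistics

random.seed(1)
N = 100
baseline = [random.gauss(70, 5) for _ in range(N)]            # single-model scores
critique = [b + random.gauss(2.5, 3) for b in baseline]        # synthetic ~+2.5 shift

deltas = [c - b for c, b in zip(critique, baseline)]
t_stat = statistics.mean(deltas) / (statistics.stdev(deltas) / N ** 0.5)
```

Pairing by task is what gives the test its power here: between-task difficulty variance cancels out, leaving only the per-task improvement.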

Critique pushes Researcher to identify missing analytical angles, close coverage gaps, sharpen formulations, and produce responses with stronger organization and clearer narrative flow. This accounts for the substantial improvements in the breadth-and-depth and presentation-quality scores. The factual accuracy gain shows that Critique is challenging weak claims and enforcing higher precision. Citation quality improvements primarily reflect better use of existing sources, as the new layer emphasizes evidence selection and citation rather than increasing source coverage.

The DRACO query set spans 10 domains across medicine, technology, law and more. Researcher with Critique achieves higher scores than the single-model approach across domains, reinforcing its value as a horizontal quality layer for Researcher. At the domain level, statistically significant improvements are observed in 8 of the 10 domains (paired t-test, p < 0.05). The exceptions are Academic (p=0.27) and Needle-in-a-Haystack (p=0.16), both of which exhibit high variance.

Council & how it works

Council is an alternative approach, designed for side-by-side comparison across multiple models. Available when Model Council is selected in the model picker in Researcher, Council runs an Anthropic model and an OpenAI model simultaneously, with each producing a complete, standalone report—surfacing facts, citations, and analytical framings that the other may overlook or weigh differently. Once both reports are generated, a dedicated judge model evaluates them to create a distilled summary of key findings, highlights where the models meaningfully agree or diverge, including differences in magnitude, framing, or interpretation, and calls out unique contributions from each model.
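The Council flow described above is essentially fan-out to two models plus a judge pass. The following sketch shows that structure with stand-in functions; the function names (`ask_model_a`, `ask_model_b`, `judge`) are hypothetical and not part of any real Copilot API.

```python
# Illustrative sketch of the Council flow: two models answer the same task
# in parallel, then a judge model produces a comparison "cover letter".
# All model calls are stand-in functions with hypothetical names.
from concurrent.futures import ThreadPoolExecutor

def ask_model_a(task):
    """Stand-in for the Anthropic model's standalone report."""
    return f"[A] report on {task}"

def ask_model_b(task):
    """Stand-in for the OpenAI model's standalone report."""
    return f"[B] report on {task}"

def judge(task, report_a, report_b):
    """Stand-in judge: would summarize agreement, divergence, and
    unique contributions; here it just proves it saw both reports."""
    return f"Cover letter for '{task}': compared two standalone reports"

def council(task):
    # Fan out: both models run simultaneously on the same task.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(ask_model_a, task)
        future_b = pool.submit(ask_model_b, task)
        report_a, report_b = future_a.result(), future_b.result()
    # Fan in: the judge reads both complete reports.
    return report_a, report_b, judge(task, report_a, report_b)

report_a, report_b, letter = council("semiconductor supply outlook")
```

Because each model writes a complete report before the judge runs, neither response is anchored by the other, which is what makes the agree/diverge comparison meaningful.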

Get started with Critique and Council in Researcher

Today, Critique and Council are broadly available in the Frontier program. Learn how to power your Frontier Transformation with Copilot and agents, and experience multi-model intelligence firsthand.

Updated Mar 30, 2026
Version 1.0

4 Comments

  • Alezandru
    Occasional Reader

    This is a really important step forward; multi-model intelligence clearly feels like the right direction for complex tasks.


    What I find particularly interesting is how this evolves from simple model usage into coordinated systems, where different models contribute distinct capabilities (research, reasoning, synthesis). That already improves quality significantly compared to single-model workflows.


    That said, I think the next layer (and likely the harder problem) is not just multi-model execution, but how those models interact and evolve over time.


    From what I’ve seen in practice, two gaps tend to appear quickly:

    1. Interaction model (beyond orchestration)

    Most systems still follow a pipeline or aggregation pattern. What seems to unlock better outcomes is structured deliberation:

    - all models receive the same task

    - each generates an independent proposal

    - they see each other’s outputs

    - critique (when needed), revise, then converge (vote / rank)
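    The deliberation pattern in those four steps might be sketched as follows. All agents here are stand-in functions, purely illustrative of the propose/critique/revise/converge loop, not any particular framework's API.

```python
# Sketch of structured deliberation: independent proposals, cross-critique
# and revision after seeing peers' outputs, then convergence via ranking.
# Agents are stand-in functions; a real system would call distinct models.

def deliberate(task, agents, rank):
    # Step 1-2: every agent gets the same task and proposes independently.
    proposals = {name: agent(task) for name, agent in agents.items()}
    # Step 3-4: each agent sees the others' proposals and revises.
    revised = {}
    for name, agent in agents.items():
        peers = [p for n, p in proposals.items() if n != name]
        revised[name] = agent(task, peer_proposals=peers)
    # Convergence: a vote/rank function picks the final answer.
    return rank(revised)

def agent_a(task, peer_proposals=None):
    suffix = f" (revised after {len(peer_proposals)} peers)" if peer_proposals else ""
    return f"A's answer to {task}{suffix}"

def agent_b(task, peer_proposals=None):
    suffix = f" (revised after {len(peer_proposals)} peers)" if peer_proposals else ""
    return f"B's answer to {task}{suffix}"

winner = deliberate("market sizing question",
                    {"a": agent_a, "b": agent_b},
                    rank=lambda r: max(r.values()))  # toy ranking rule
```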


    This shifts the system from “combining outputs” to models actively improving each other’s reasoning before a final result is produced.


    2. Continuity and memory

    Even strong multi-model systems tend to be session-based. Once the task ends, the system resets.

    In longer-running environments, this becomes a limitation. What changes the behavior significantly is adding persistent, shared memory:

    - storing errors and their resolutions (so recurring issues can be solved instantly)

    - extracting decisions, patterns, validated solutions

    - building an experience graph (problem -> solution -> outcome)

    - enabling cross-session continuity, so the system doesn’t restart from zero

    At that point, you move from a system that executes tasks to one that learns from its own history.
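    A toy version of that shared, persistent memory makes the idea concrete: an experience store keyed by problem, consulted before re-solving, and reloaded across "sessions". The JSON file format and class name are illustrative only.

```python
# Toy persistent experience memory: records problem -> solution -> outcome
# and survives across sessions by persisting to disk. Illustrative only.
import json
import os
import tempfile

class ExperienceMemory:
    def __init__(self, path):
        self.path = path
        self.store = {}
        if os.path.exists(path):          # reload prior sessions' history
            with open(path) as f:
                self.store = json.load(f)

    def record(self, problem, solution, outcome):
        """Store an error/decision and its resolution, then persist."""
        self.store[problem] = {"solution": solution, "outcome": outcome}
        with open(self.path, "w") as f:
            json.dump(self.store, f)

    def recall(self, problem):
        """Return the stored resolution for a recurring problem, if any."""
        return self.store.get(problem)

path = os.path.join(tempfile.mkdtemp(), "memory.json")
session1 = ExperienceMemory(path)
session1.record("timeout on fetch", "increase retry backoff", "resolved")

# A later session starts from the same history instead of from zero.
session2 = ExperienceMemory(path)
prior = session2.recall("timeout on fetch")
```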


    This is something I’ve been exploring with EAN AgentOS, especially around combining multi-agent “session”-style deliberation with persistent memory across runs. The difference in consistency over time is quite noticeable.


    Really curious to see how this evolves, especially whether multi-model systems will remain primarily execution-focused, or start incorporating deeper collaboration + memory layers as standard.


  • mikegil4
    Brass Contributor

    Any early insights to share on how this might affect token counts (asking more for environmental impact purposes than cost since Researcher is paid by user/month)?