Today, Researcher—Microsoft 365 Copilot's deep research agent for work—takes a significant step forward. Designed to tackle complex research in the flow of work, Researcher now goes further with two new multi-model capabilities that raise the bar for accuracy, depth, and confidence: Critique and Council.
Critique is a new multi-model deep research system designed for complex research tasks. It separates generation from evaluation and uses a combination of models from frontier labs, including Anthropic and OpenAI. One model leads the generation phase, planning the task, iterating through retrieval, and producing an initial draft, while a second model focuses on review and refinement, acting as an expert reviewer before the final report is produced. Our evaluations show that this architecture exceeds traditional single-model approaches and delivers best-in-class deep research quality. The design also provides clear optionality across the generator and reviewer roles, with the ability to support and expand these roles over time as the system evolves.
DRACO (Deep Research Accuracy, Completeness, and Objectivity) benchmark scores across 100 complex research tasks spanning 10 domains. All results are sourced from the original paper [Zhong et al., arXiv:2602.11685 (Feb. 2026)], except for Researcher with Critique. Researcher with Critique achieves a substantial improvement of +7.0 points (SEM ±1.90) in the aggregated score, a 13.88% gain over Perplexity Deep Research (Claude Opus 4.6 model), the best system reported in the paper.
Council brings multiple model responses side-by-side in the Researcher experience. Additionally, a cover letter provides valuable insights on where the models agree, where they diverge, and the unique insights each brings on the topic.
Critique & how it works
Many AI research workflows rely on a single model to handle planning, sourcing, synthesis, and writing. Critique takes a different approach by dividing responsibilities between two AI partners: one optimized for deep exploration and structured synthesis, and a second focused on validating claims, improving presentation, and strengthening structure. By giving evaluation as much emphasis as generation, this architecture creates a powerful feedback loop that delivers higher-quality results across factual accuracy, analytical breadth, and presentation. Critique will be the default experience in Researcher, available when Auto is selected in the model picker.
Critique follows a review process similar to those conducted in academic and professional research settings. It’s built around rubric‑based evaluation—a structured review that focuses on strengthening the report without turning the reviewer into a second author. The reviewer examines the report from several angles and then generates an enhanced report, concentrating on the following dimensions:
- Source Reliability Assessment. The reviewer emphasizes the use of reputable, authoritative, and domain‑appropriate sources, prioritizing evidence that is verifiable and suitable for your research context.
- Report Completeness. The reviewer evaluates whether the final report comprehensively satisfies the intent of your request with relevant and unique insights.
- Strict Evidence Grounding Enforcement. The reviewer enforces a conservative grounding standard, requiring every key claim to be anchored to reliable sources with precise citations—strengthening factual accuracy, reliability, and trust in the final report.
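The generate-then-review flow described above can be sketched in code. This is a hypothetical illustration only: the actual models, prompts, and rubric wording are not public, so every function name and rubric key below is an assumption.

```python
# Illustrative sketch of a Critique-style pipeline: one pass generates a
# draft, a second pass reviews it against a fixed rubric and emits an
# enhanced report. Both "models" are stubs standing in for LLM calls.

RUBRIC = [
    "source_reliability",   # reputable, domain-appropriate sources
    "report_completeness",  # does the report satisfy the request's intent?
    "evidence_grounding",   # every key claim anchored to a precise citation
]

def generate_draft(task: str) -> str:
    # Stand-in for the generator model: plan, retrieve, synthesize.
    return f"DRAFT for: {task}"

def review_and_refine(draft: str, rubric: list[str]) -> str:
    # Stand-in for the reviewer model: evaluate the draft along each
    # rubric dimension and strengthen it, without becoming a co-author.
    notes = "; ".join(f"checked {dim}" for dim in rubric)
    return f"{draft} [reviewed: {notes}]"

def critique_pipeline(task: str) -> str:
    draft = generate_draft(task)             # phase 1: generation
    return review_and_refine(draft, RUBRIC)  # phase 2: rubric-based review

report = critique_pipeline("Compare battery chemistries for grid storage")
print(report)
```

The key design point is the separation of concerns: the reviewer never regenerates the report from scratch, it only evaluates and refines the generator's output along the rubric dimensions.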
Performance Validation on the DRACO Benchmark
We evaluated Critique on the DRACO benchmark (Deep Research Accuracy, Completeness, and Objectivity): 100 complex deep research tasks across 10 domains, introduced by researchers from Perplexity and academia in February 2026 [Zhong et al., arXiv:2602.11685]. These research tasks originate from anonymized real-world usage patterns executed in a large-scale research system. System responses are graded against task-specific rubrics along four dimensions: factual accuracy, breadth and depth of analysis, presentation quality, and citation quality.
The DRACO results were evaluated using OpenAI's GPT-5.2 as the LLM judge, the strictest of the three judge models reported in the paper. We applied the same evaluation protocol and configuration published in the benchmark paper, helping ensure an apples-to-apples comparison. Across all measurements, results were computed by averaging scores over the full DRACO dataset, with each question evaluated across five independent runs.
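The aggregation protocol just described can be made concrete with a short sketch. The scores below are synthetic placeholders, not the actual DRACO data; only the shape of the computation (per-question averaging over five runs, then a dataset-level mean with standard error) follows the stated protocol.

```python
# Illustrative aggregation: each of 100 questions is judged over 5
# independent runs; runs are averaged per question, then the dataset
# mean and SEM are computed over the per-question means.
import math
import random
import statistics

random.seed(0)
NUM_QUESTIONS, NUM_RUNS = 100, 5

# per_question[i] = mean judge score for question i across its 5 runs
per_question = [
    statistics.mean(random.uniform(60, 90) for _ in range(NUM_RUNS))
    for _ in range(NUM_QUESTIONS)
]

mean_score = statistics.mean(per_question)
sem = statistics.stdev(per_question) / math.sqrt(NUM_QUESTIONS)
print(f"aggregate score: {mean_score:.2f} (SEM +/- {sem:.2f})")
```

Averaging runs before computing the SEM treats each question as one observation, which is what makes the ±1.90 standard error on the aggregate score meaningful.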
To better understand the advantages of Critique, we compared the new architecture with the single-model Researcher (using the same GPT-5.2 judge) across the four evaluation axes that DRACO defines.
We see the largest improvement in Breadth and Depth of Analysis (+3.33), followed by Presentation Quality (+3.04) and Factual Accuracy (+2.58). All dimensions show statistically significant improvements (paired t-test, p < 0.0001).
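The significance test behind these per-dimension claims is a standard paired t-test over per-question score pairs. The sketch below uses synthetic scores, not the actual evaluation data, and computes only the t statistic; for n = 100 paired observations, |t| above roughly 3.4 already implies p < 0.001 two-sided, and the reported p < 0.0001 threshold is stricter still.

```python
# Paired t-test sketch: compare per-question scores from a baseline
# system and a treatment system on the same questions. Scores here are
# synthetic stand-ins, not real DRACO results.
import math
import statistics

def paired_t_statistic(baseline: list[float], treatment: list[float]) -> float:
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)           # sample std of the differences
    return mean_d / (sd_d / math.sqrt(n))    # t with n - 1 degrees of freedom

# Synthetic example: the treatment shifts every score up by about 3 points.
base = [70 + (i % 7) for i in range(100)]
crit = [b + 3 + 0.5 * ((i % 3) - 1) for i, b in enumerate(base)]
t = paired_t_statistic(base, crit)
print(f"t = {t:.2f}")
```

Pairing by question matters: it removes per-question difficulty variance from the comparison, which is why a consistent ~3-point shift yields a very large t statistic even with modest per-question noise.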
Critique pushes Researcher to identify missing analytical angles, close coverage gaps, sharpen formulations, and produce responses with stronger organization and clearer narrative flow. This accounts for the substantial improvements to the breadth and depth and presentation quality scores. The factual accuracy gain shows that Critique is challenging weak claims and enforcing higher precision. Citation quality improvements primarily reflect better use of existing sources, as the new layer emphasizes evidence selection and citation rather than increasing source coverage.
The DRACO query set spans 10 domains across medicine, technology, law and more. Researcher with Critique achieves higher scores than the single-model approach across domains, reinforcing its value as a horizontal quality layer for Researcher. At the domain level, statistically significant improvements are observed in 8 of the 10 domains (paired t-test, p < 0.05). The exceptions are Academic (p=0.27) and Needle-in-a-Haystack (p=0.16), both of which exhibit high variance.
Council & how it works
Council is an alternative approach, designed for side-by-side comparison across multiple models. Available when Model Council is selected in the model picker in Researcher, Council runs an Anthropic and an OpenAI model simultaneously, with each model producing a complete, standalone report, surfacing facts, citations, and analytical framings that the other may overlook or weigh differently. Once both reports are generated, a dedicated judge model evaluates them to create a distilled summary of key findings, highlights where the models meaningfully agree or diverge (including differences in magnitude, framing, or interpretation), and calls out unique contributions from each model.
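Council's flow, two models running simultaneously followed by a judge that distills agreement and divergence, can be sketched as below. The model calls are stubs and every name is illustrative; the actual models, orchestration, and judge prompt are not public.

```python
# Hypothetical Council-style flow: two models answer the same task in
# parallel, then a judge model produces a "cover letter" comparing the
# two standalone reports.
from concurrent.futures import ThreadPoolExecutor

def model_a(task: str) -> str:
    # Stand-in for the first model's complete, standalone report.
    return f"[Model A report on {task}]"

def model_b(task: str) -> str:
    # Stand-in for the second model's complete, standalone report.
    return f"[Model B report on {task}]"

def judge(report_a: str, report_b: str) -> str:
    # Stand-in for the judge model: summarize agreements, divergences,
    # and each report's unique contributions.
    return f"Cover letter comparing:\n- {report_a}\n- {report_b}"

def council(task: str) -> tuple[str, str, str]:
    # Run both models concurrently, then judge the finished reports.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(model_a, task)
        fut_b = pool.submit(model_b, task)
        report_a, report_b = fut_a.result(), fut_b.result()
    return report_a, report_b, judge(report_a, report_b)

rep_a, rep_b, letter = council("EU AI Act compliance timeline")
print(letter)
```

Note that the judge only runs after both reports are complete, so each report remains a fully independent answer rather than a blend of the two models.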
Get started with Critique and Council in Researcher
Today, Critique and Council are broadly available in the Frontier program. Learn how to power your Frontier Transformation with Copilot and agents, and experience the power of multi-model intelligence.