We started with 100+ tools and 50+ specialized agents. We ended with 5 core tools and a handful of generalists. The agent got more reliable, not less.
We spent a long time chasing model upgrades, polishing prompts, and debating orchestration strategies. The gains were visible in offline evals, but they didn’t translate into the reliability and outc...
Updated Dec 27, 2025
Version 5.0