Azure AI Foundry Blog

The Future of AI: The Model is Key, but the App is the Doorway

Marco Casalaina
Nov 08, 2025

The Future of AI blog series is an evolving collection of posts from the AI Futures team in collaboration with subject matter experts across Microsoft. In this series, we explore tools and technologies that will drive the next generation of AI. Explore more at: Collections | Microsoft Learn

From GPT-5 Launch to Real-World Reactions

August 2025 brought the exciting release of the GPT-5 family of models. OpenAI announced GPT-5 as their most advanced model yet for coding and complex tasks, with state-of-the-art performance on coding benchmarks. The Azure AI team rolled out GPT-5 concurrently, immediately integrating it into Foundry, GitHub Copilot, and Microsoft 365 Copilot. This synchronized launch resulted in one of our smoothest major model releases to date. Customers flocked to GPT-5, and our internal benchmark results quickly hit public leaderboards. Microsoft Cloud Advocate Pamela Fox even published an analysis of GPT-5’s Retrieval-Augmented Generation (RAG) performance just days after launch, highlighting the model’s emphasis on accurate tool use and reduced hallucinations. By all traditional metrics, GPT-5 represents an impressive leap forward.

Not long ago, a model jump of this magnitude would dominate tech headlines for weeks. But today, raw benchmark improvements are only part of the story. Many developers have already begun experimenting with GPT-5 inside their favorite applications and workflows. The early hands-on feedback has been mixed, revealing that a model’s true impact is measured not just by leaderboards but by how it behaves in context - that is, within the apps and environments we use to interact with it. In fact, our perception of the quality of a large language model (LLM) is now largely defined by its performance in real-world applications rather than in isolated tests. This means that even if a new model is objectively more powerful, its user experience will depend on how well apps adapt to its capabilities - and quirks.

Integrating GPT-5 in Applications

After the initial excitement, developers started noticing some odd behaviors from GPT-5 when used in coding assistants and other tools. Common complaints include the model inserting nonstandard characters like curly “smart” quotes into code, rewriting entire files when only a one-line fix was requested, wiping out helpful code comments, and other such mischief. For example, one Reddit user described how a simple request to change variable names to snake case caused OpenAI’s ChatGPT to completely rewrite the code - altering unrelated parts, changing lines unnecessarily, and even sometimes removing all the comments. This kind of overzealous editing can be frustrating, especially in a coding session where quick, surgical changes were expected.

Crucially, many of these issues vary by application. Developers noticed that some coding tools handled GPT-5 better than others. The difference came down to how each app leveraged the model - the instructions it provided and the context it added. The output of GPT-5 - or any model - can be drastically shaped by how the application frames the task.

So how many of these early bumps are the fault of the new model itself, and how many stem from the surrounding app context? Could GPT-5 actually be a “better” model that simply needs different handling? It’s a nuanced mix of both. GPT-5 does introduce new behaviors – for instance, it’s notably more cautious with factuality. Pamela Fox’s analysis found GPT-5 was far more willing to say “I don’t know” or decline to answer when information was missing - whereas previous models might improvise an answer. That’s a positive improvement for truthfulness, but if an application isn't designed to anticipate refusals, it might interpret this output as unhelpful. Similarly, GPT-5’s underlying architecture adds reasoning modes and tool usage that earlier models lacked, which means an app needs to decide how to invoke those features (or not). In short, GPT-5 changes the game, but to fully benefit, apps must update their playbook too.
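Applications can plan for this more cautious behavior explicitly rather than surfacing a refusal as a dead end. A minimal sketch in Python; the refusal phrases and the `answer_or_fallback` helper below are illustrative assumptions, not part of any SDK:

```python
# Sketch: detect "I don't know"-style refusals so the app can respond
# gracefully (e.g., trigger a search or ask for more context) instead of
# showing the user an unhelpful-looking answer.
# The marker list and helpers are illustrative assumptions.

REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "not enough information",
    "cannot find that in the provided",
)

def looks_like_refusal(answer: str) -> bool:
    """Heuristically flag answers where the model declined to answer."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def answer_or_fallback(answer: str, fallback: str) -> str:
    """Route refusals to an app-defined fallback flow."""
    return fallback if looks_like_refusal(answer) else answer
```

In practice you would tune the marker list against your own traffic, or ask the model itself to signal missing information with a structured flag instead of free text.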

Adapting Apps to a New Model Family

With large language models now powering everything from IDE extensions to Microsoft Office apps, ensuring a smooth upgrade to a new model is critical. Here are some best practices for developers to adapt their applications and workflows to a new model family like GPT-5:

  • Re-evaluate and Refine Your Prompts: The instructions and prompts that worked for GPT-4 may not yield optimal results with GPT-5. Be as explicit and precise as possible about the desired outcome. For example, instead of telling an AI coding assistant “fix this file,” specify the exact change: “Update the getUser method to include email validation.” This kind of targeted prompt focuses GPT-5 on the task and reduces unwanted side effects. Likewise, avoid overly broad directives like “refactor this code” unless a full refactor is truly desired. Developers have found that words like “rewrite” or “refactor” in prompts can cue the model to make aggressive changes - often far beyond what was intended. The clearer your request, the less GPT-5 will roam beyond it. And be sure to use bulk evaluation tools, such as those shown here, to verify that your refined prompts work well and do not introduce regressions.
  • Leverage New Model Controls: GPT-5 introduces tunable settings that can help apps dial in the right behavior. The GPT-5 API supports a verbosity parameter to control how detailed or concise the output should be, along with a reasoning effort setting that adjusts the depth of step-by-step reasoning the model does. Take advantage of these if you can. For instance, in a coding scenario where you want only a quick fix, you might set the reasoning effort to “minimal” to prevent the model from over-thinking (and over-writing) the solution. Early tests showed that with minimal reasoning, GPT-5 can often solve straightforward tasks just as well but with significantly less latency. Similarly, if GPT-5 tends to be more verbose than your application UI can handle (perhaps generating a novel when you expected a paragraph), use the verbosity controls or add instructions like “answer briefly” to rein it in. These knobs are new, and using them can align the model’s output with what your app needs.
  • Use Context Wisely: A major factor in any model's behavior is the context you provide. To avoid the “whole file rewrite” problem, for example, provide the model with only the relevant excerpt of code or text whenever possible. In an IDE like VS Code with GitHub Copilot, this might mean selecting the specific lines you want changed or using GitHub Copilot’s tools to focus on a function. Whether or not you're using the model to generate code, smaller, incremental prompts are easier for the model to handle correctly. By scoping the context, you reduce the model's chance to wander off and “optimize” other parts of its response that you didn’t ask it to touch.
  • Preserve Formatting and Conventions: If you notice the new model introducing unwanted formatting (like the curly quotes scenario), adjust the context or post-processing to enforce consistency. One approach is to add guidelines in the prompt/system message, such as: “Use straight quotes (\" and \') in all code output. Preserve all existing comments and formatting unless specifically instructed to change them.” This sets expectations for the model. In cases where the model still outputs something undesirable (maybe due to a fundamental limitation), consider cleaning it up in a post-processing step in your application. For example, an app grappling with Unicode quotes can include small scripts to replace or revert those characters automatically. Essentially, treat the model like any other contributor: define and enforce the rules of its operation.
  • Monitor and Iterate with User Feedback: When rolling out a new model into an enterprise app or customer-facing product, it’s wise to do so gradually. Use preview deployments or A/B tests to observe how the new model performs on real queries. Solicit feedback from users: for example, are the answers or code completions better or worse in specific ways? Pay special attention to any regressions they report, such as lost functionality or strange new output. For example, if developers say, “the coding agent removed all my comments when suggesting a change,” you can respond by tweaking its prompt to emphasize preserving comments or adjusting the suggestion acceptance flow to show diffs (so the user can re-add anything important). Quick iteration is key. The Microsoft ecosystem offers tools like the Azure AI Foundry Evaluation SDK to log and evaluate model responses systematically. Automated evaluations can flag if the new model is, say, citing fewer sources or taking longer to respond, helping you pinpoint areas to refine. Because model upgrade cycles are faster than ever, a continuous evaluation pipeline will help your team adapt to new models without extensive manual rework.
  • Stay Up-to-date on Integration Features: Azure OpenAI, GitHub Copilot, and Microsoft 365 are continuously evolving with new configuration options and fine-tuning capabilities that let you tailor AI to your business needs. A practical tip is to keep an eye on release notes and configuration options. For instance, Microsoft 365 Copilot now offers Copilot Tuning, which enables organizations to customize AI responses by adapting the model to their specific terminology, communication style, and business needs, resulting in more accurate and relevant outcomes. By exploring these settings, you can fine-tune the model’s behavior at the app level without reinventing the wheel.
  • Pick the Right-Size Model for the Job: Before defaulting to the largest model like GPT-5, consider whether your scenario would be better served by a smaller model like gpt-5-mini, or by a fine-tuned small model. Azure AI Foundry provides multiple model types and sizes, and enables fine-tuning for custom needs. Finally, consider using Azure AI Foundry’s new Model Router, which autoselects between different model sizes depending on the complexity of the prompt.
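The verbosity and reasoning-effort controls described above translate into a request roughly like the following Python sketch. The exact parameter names and request shape may differ by API version and SDK, so treat this as an assumption to verify against the current API reference:

```python
# Sketch: build a request body that dials down reasoning effort and
# verbosity for quick, surgical code edits. The parameter names follow
# the GPT-5 settings described in the text, but may vary by API version;
# verify against current documentation before relying on them.

def build_quick_fix_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "input": prompt,
        # Keep the model from over-thinking (and over-writing) small fixes.
        "reasoning": {"effort": "minimal"},
        # Keep output concise so the app UI isn't flooded.
        "text": {"verbosity": "low"},
    }

request = build_quick_fix_request(
    "gpt-5",
    "Update the getUser method to include email validation. "
    "Change nothing else; preserve all comments and formatting.",
)
```

Note how the prompt itself also encodes the “surgical change” expectation - the settings and the instructions work together.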
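The context-scoping advice above can be automated. A sketch using Python’s standard ast module to pull out just one function’s source before sending it to the model; the file contents and function name here are illustrative:

```python
# Sketch: send the model only the function it needs to edit, not the
# whole file, to reduce the chance of whole-file rewrites.
import ast

def extract_function_source(source: str, name: str) -> str:
    """Return the source of one top-level function, or '' if not found."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return ast.get_source_segment(source, node) or ""
    return ""

# Illustrative file contents - stand-in for a real module on disk.
file_text = '''\
def helper():
    return 1

def get_user(user_id):
    # TODO: add email validation
    return {"id": user_id}
'''

excerpt = extract_function_source(file_text, "get_user")
```

Only `excerpt` goes into the prompt; unrelated code like `helper` never enters the context, so the model cannot “optimize” it uninvited.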
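For the curly-quote cleanup mentioned above, a small post-processing step might look like the following Python sketch; the character map is a reasonable starting set, not an exhaustive one:

```python
# Sketch: post-process model output to revert Unicode "smart" punctuation
# to the straight ASCII characters that code and linters expect.

SMART_PUNCTUATION = {
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
}

def normalize_quotes(text: str) -> str:
    """Replace smart punctuation so generated code compiles and lints cleanly."""
    return text.translate(str.maketrans(SMART_PUNCTUATION))
```

Running this on every model response, before it reaches the editor or a file, makes the guardrail independent of whether the prompt instruction was followed.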
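A continuous evaluation pipeline of the kind recommended above can start very small. A substring-based regression check in Python; `run_model` is a stand-in for your real model call, stubbed here for illustration:

```python
# Sketch: a tiny pre-upgrade regression check. Run the same eval cases
# against the old and new model deployments and compare pass rates.
# run_model is a stub - in a real pipeline it calls the deployed model.

def run_model(prompt: str) -> str:
    return "The getUser method now includes email validation."

EVAL_CASES = [
    # (prompt, substring the answer must contain to pass)
    ("Update the getUser method to include email validation.",
     "email validation"),
]

def pass_rate(cases) -> float:
    """Fraction of cases whose answer contains the expected substring."""
    passed = sum(expected in run_model(prompt) for prompt, expected in cases)
    return passed / len(cases)
```

Substring checks are crude; services like the Azure AI Foundry Evaluation SDK mentioned above provide richer graders, but even this level of automation will catch a model swap that silently breaks a core scenario.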

Embracing the New Model, Holistically

GPT-5 represents a significant step forward in AI capabilities, but realizing the full value of any new model requires a holistic approach. The model is the powerful engine under the hood, but the app is the doorway through which users experience that power. A superior model can still feel subpar if the interface and integration aren’t aligned with its strengths (or if they fail to mitigate its rough edges). Conversely, thoughtful app design and best practices can elevate a model, letting its improvements truly shine for end users.

For general tech practitioners and enterprise developers, especially those invested in the Microsoft ecosystem, the key takeaway is to be proactive when a new model family arrives. Celebrate the boost in benchmarks, yes, but then roll up your sleeves and adapt your systems: tune your prompts, adjust settings, update your testing, and educate your users. Microsoft’s own tools set a precedent by quickly incorporating GPT-5 (from Azure OpenAI models to copilots in Office and GitHub). By implementing these best practices, we can help ensure that upgrading to a new model is an upgrade to the user’s whole experience.

Now it’s your turn to build with Azure AI Foundry 
