Azure AI
The Future of AI: Structured Vibe Coding - An Improved Approach to AI Software Development
In this post from The Future of AI series, the author introduces structured vibe coding, a method for managing AI agents like a software team using specs, GitHub issues, and pull requests. By applying this approach with GitHub Copilot, they automated a repetitive task (answering Microsoft Excel-based questionnaires) while demonstrating how AI can enhance developer workflows without replacing human oversight. The result is a scalable, collaborative model for AI-assisted software development.

How Microsoft Evaluates LLMs in Azure AI Foundry: A Practical, End-to-End Playbook
Deploying large language models (LLMs) without rigorous evaluation is risky: quality regressions, safety issues, and expensive rework often surface in production, where they are hardest to fix. This guide translates Microsoft's approach in Azure AI Foundry into a practical playbook: define metrics that matter (quality, safety, and business impact), choose the right evaluation mode (offline, online, human-in-the-loop, automated), and operationalize continuous evaluation with the Azure AI Evaluation SDK and monitoring.

Quick-Start Checklist
- Identify your use case: match the model type (SLM, LLM, task-specific) to business needs.
- Benchmark models: use Azure AI Foundry leaderboards for quality, safety, and performance, plus private datasets.
- Evaluate with key metrics: focus on relevance, coherence, factuality, completeness, safety, and business impact.
- Combine offline and online evaluation: test with curated datasets and monitor real-world performance.
- Leverage manual and automated methods: use human-in-the-loop review for nuance, automated tools for scale.
- Use private benchmarks: evaluate with organization-specific data for the most representative results.
- Implement continuous monitoring: set up alerts for drift, safety, and performance issues.

Terminology Quick Reference
- SLM: Small Language Model; compact, efficient models for latency- and cost-sensitive tasks.
- LLM: Large Language Model; broad capabilities, higher resource requirements.
- MMLU: Massive Multitask Language Understanding; an academic benchmark for general knowledge.
- HumanEval: a benchmark for code-generation correctness.
- BBH: BIG-Bench Hard; a reasoning-heavy subset of BIG-Bench.
- LLM-as-a-Judge: using a language model to grade outputs against a rubric.

The Generative AI Model Selection Challenge
Deploying an advanced AI solution without thorough evaluation can lead to costly errors, loss of trust, and regulatory risk. LLMs now power critical business functions, but their unpredictable behavior makes robust evaluation essential.

The issue: traditional evaluation methods fall short for LLMs, which are sensitive to prompt changes and can exhibit unexpected behaviors. Without a strong evaluation strategy, organizations risk unreliable or unsafe AI deployments.

The solution: Microsoft Azure AI Foundry provides a systematic approach to LLM evaluation, helping organizations reduce risk and realize business value. This guide shares proven techniques and best practices so you can confidently deploy AI and turn evaluation into a competitive advantage.

LLMs and Use-Case Alignment
When choosing an AI model, it is important to match it to the specific job you need done. Some models are better at problems that require logical reasoning or math, which makes them well suited to tasks that need careful analysis. Others are designed to write code, making them ideal for building software tools or assisting programmers. Still others excel at natural conversation, which is especially useful for customer service or support roles. Azure AI Foundry helps here by showing how different models perform across these categories, making it easier to pick the right one for your needs.

Key Metrics: Quality, Safety, and Business Impact
When evaluating an AI model, it is important to look beyond raw performance. To understand whether a model is ready for real-world use, measure its quality, ensure it is safe, and quantify its impact on the business.
- Quality metrics show whether the model gives accurate and useful answers.
- Safety metrics help catch harmful or biased content before it reaches users.
- Business impact metrics connect the model's performance to what matters most: customer satisfaction, efficiency, and compliance with applicable rules and standards.

By tracking these key areas, organizations can build AI systems that are reliable, responsible, and valuable.

Dimension       | What it Measures                                     | Typical Evaluators
Quality         | Relevance, coherence, factuality, completeness       | LLM-as-a-judge, groundedness, code eval
Safety          | Harmful content, bias, jailbreak resistance, privacy | Content safety checks, bias probes
Business Impact | User experience, value delivery, compliance          | Task completion rate, CSAT, cost/latency

Organizations that align model selection with use-case-specific benchmarks deploy faster and achieve higher user satisfaction than teams relying only on generic metrics. The key is matching evaluation criteria to business objectives from the earliest stages of model selection.

Now that we know which metrics and parameters to evaluate LLMs on, when and how do we run these evaluations? Let's get right into it.

Evaluation Modalities

Offline vs. Online Evaluation
- Offline evaluation: pre-deployment assessment using curated datasets and controlled environments. It enables reproducible testing, comprehensive coverage, and rapid iteration, but it may miss real-world complexity.
- Online evaluation: assesses model performance on live production data. It enables real-world monitoring, drift detection, and user-feedback integration.
- Best practice: use offline evaluation for development and release gating, then online evaluation for continuous monitoring.

Manual vs. Automated Evaluation
- Manual evaluation: human insight is irreplaceable for subjective qualities such as creativity and cultural sensitivity. Azure AI Foundry supports human-in-the-loop evaluation via annotation queues and feedback systems. However, manual evaluation faces scalability and consistency challenges.
- Automated evaluation: Azure AI Foundry's built-in evaluators provide scalable, rigorous assessment of relevance, coherence, safety, and performance.
- Best practice: combine automated evaluation for broad coverage with targeted manual evaluation for nuanced assessment. Leading organizations implement a human-in-the-loop methodology in which automated systems flag potential issues for human review.

Public vs. Private Benchmarks
- Public benchmarks (MMLU, HumanEval, BBH): useful for standardized comparison, but they may not reflect your domain or business objectives, and they carry a risk of contamination and over-optimization.
- Private benchmarks: organization-specific data and metrics provide evaluation that directly reflects deployment scenarios.
- Best practice: use public benchmarks to narrow the candidate list, then rely on private benchmarks for final decisions.

LLM-as-a-Judge and Custom Evaluators
LLM-as-a-Judge uses language models themselves to assess the quality of generated content. Azure AI Foundry's implementation enables scalable, nuanced, and explainable evaluation, but it requires careful validation. Common challenges and mitigations:
- Position bias: scores can skew toward the first-listed answer. Mitigate by randomizing order, evaluating both (A,B) and (B,A), and using majority voting across permutations.
- Verbosity bias: longer answers may be over-scored. Mitigate by enforcing concise-answer rubrics and normalizing by token count.
- Inconsistency: repeated runs can vary. Mitigate by aggregating over multiple runs and reporting confidence intervals.
The sketch below shows one way to combine these mitigations in practice.
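To make the mitigations above concrete, here is a minimal, hedged sketch of pairwise LLM-as-a-judge scoring that randads the comparison in both (A,B) and (B,A) order and aggregates repeated runs by majority vote. The `call_judge` callable is a placeholder you supply for whatever judge-model endpoint you use; it is an assumption, not part of any Azure SDK.

```python
from collections import Counter
from typing import Callable

def pairwise_judge(
    question: str,
    answer_a: str,
    answer_b: str,
    call_judge: Callable[[str], str],  # hypothetical: sends a prompt, returns "A" or "B"
    runs: int = 5,
) -> str:
    """Return the preferred answer ('A' or 'B') using order-swapped majority voting."""
    votes = Counter()
    for _ in range(runs):
        # Evaluate both orderings so position bias cancels out on average.
        for first, second, label_map in [
            (answer_a, answer_b, {"A": "A", "B": "B"}),
            (answer_b, answer_a, {"A": "B", "B": "A"}),  # swapped: the judge's "A" is really B
        ]:
            prompt = (
                "You are grading two answers to the same question. "
                "Prefer concise, factually grounded answers; do not reward length.\n"
                f"Question: {question}\nAnswer A: {first}\nAnswer B: {second}\n"
                "Reply with exactly one letter: A or B."
            )
            verdict = call_judge(prompt).strip().upper()[:1]
            if verdict in label_map:
                votes[label_map[verdict]] += 1
    # Majority vote across runs and orderings; ties fall back to 'A'.
    return "A" if votes["A"] >= votes["B"] else "B"
```

Reporting the vote split (rather than only the winner) also gives you a rough confidence interval for each comparison.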
Custom evaluators allow organizations to implement domain-specific logic and business rules, either as Python functions or as prompt-based rubrics. This ensures evaluation aligns with your unique business outcomes.

Evaluation SDK: Comprehensive Assessment Tools
The Azure AI Evaluation SDK (azure-ai-evaluation) provides the technical foundation for systematic LLM assessment. The SDK's architecture enables both local development testing and cloud-scale evaluation:
- Cloud evaluation for scale: the SDK transitions seamlessly from local development to cloud-based evaluation for large-scale assessment. Cloud evaluation enables processing of massive datasets while integrating results into the Azure AI Foundry monitoring dashboard.
- Built-in evaluator library: the platform provides extensive pre-built evaluators covering quality metrics (coherence, fluency, relevance), safety metrics (toxicity, bias, fairness), and task-specific metrics (groundedness for RAG, code correctness for programming). Each evaluator has been validated against human judgment and is continuously improved based on real-world usage.
The sketch below shows what combining a built-in evaluator with a custom one can look like.
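Here is a minimal, hedged sketch of using the azure-ai-evaluation package together with a custom Python evaluator. Class names, constructor parameters, and the evaluate() signature reflect one version of the SDK and may differ in yours; the endpoint, deployment, data file, and the citation rule are placeholders. For authoritative details, see the Azure AI Foundry Evaluation SDK link in the Next Steps section later in this post.

```python
# pip install azure-ai-evaluation   (a sketch; check the SDK docs for current names)
from azure.ai.evaluation import evaluate, RelevanceEvaluator, GroundednessEvaluator

# Placeholder judge-model configuration; values are assumptions, not real endpoints.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-gpt-deployment>",
    "api_key": "<your-api-key>",
}

# A custom evaluator is simply a callable that returns a dict of metrics.
# This hypothetical example enforces a domain rule: answers must cite a source.
def citation_evaluator(response: str, **kwargs):
    has_citation = "http" in response or "[source]" in response.lower()
    return {"has_citation": 1.0 if has_citation else 0.0}

result = evaluate(
    data="billing_qa_eval.jsonl",  # assumed JSONL with "query", "response", "context" columns
    evaluators={
        "relevance": RelevanceEvaluator(model_config),
        "groundedness": GroundednessEvaluator(model_config),
        "citation": citation_evaluator,
    },
)
print(result["metrics"])  # aggregate scores across the dataset
```

The same evaluator dictionary can be reused locally during development and in a cloud evaluation run, which keeps offline gating and online monitoring consistent.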
Real-World Workflow: From Model Selection to Continuous Monitoring
Azure AI Foundry's integrated workflow guides organizations through the complete evaluation lifecycle:

Stage 1: Model Selection and Benchmarking
- Compare models using integrated leaderboards across quality, safety, cost, and performance dimensions.
- Evaluate top candidates using private datasets that reflect actual use cases.
- Generate comprehensive model cards documenting capabilities, limitations, and recommended use cases.

Stage 2: Pre-Deployment Evaluation
- Systematic testing using the Azure AI Evaluation SDK with built-in and custom evaluators.
- Safety assessment using the AI Red Teaming Agent to identify vulnerabilities.
- Human-in-the-loop validation for business-critical applications.

Stage 3: Production Monitoring and Continuous Evaluation
- Real-time monitoring through Azure Monitor Application Insights integration.
- Continuous evaluation at configurable sampling rates (for example, 10 evaluations per hour).
- Automated alerting for performance degradation, safety issues, or drift.

This workflow ensures that evaluation is not a one-time gate but an ongoing practice that maintains AI system quality and safety throughout the deployment lifecycle.

Next Steps and Further Reading
- Explore the Azure AI Foundry documentation for hands-on guides.
- Find the Best Model - https://aka.ms/BestModelGenAISolution
- Azure AI Foundry Evaluation SDK

Summary
Robust evaluation of large language models (LLMs) using systematic benchmarking and Azure AI Foundry tools is essential for building trustworthy, efficient, and business-aligned AI solutions.

Tags: #LLMEvaluation #AzureAIFoundry #AIModelSelection #Benchmarking #SkilledByMTT #MicrosoftLearn #MTTBloggingGroup

Want Safer, Smarter AI? Start with Observability in Azure AI Foundry

Observability in Azure AI: From Black Box to Transparent Intelligence
If you are an AI developer or engineer, Azure AI observability gives you deep visibility into agent behavior, enabling you to trace decisions, evaluate response quality, and integrate automated testing into your workflows. This empowers you to build safer, more reliable GenAI applications. Responsible AI and compliance teams use observability tools to ensure transparency and accountability, leveraging audit logs, policy mapping, and risk scoring. These capabilities help organizations align AI development with ethical standards and regulatory requirements.

Understanding Observability
Imagine you're building a customer support chatbot using Azure AI. It's designed to answer billing questions, troubleshoot issues, and escalate complex cases to human agents. Everything works well in testing, but once deployed, users start reporting confusing answers and slow response times. Without observability, you're flying blind. You don't know:
- Which queries are failing.
- Why the chatbot is choosing certain responses.
- Whether it's escalating too often or not enough.
- How latency and cost are trending over time.

Enter observability. With Azure AI Foundry and Azure Monitor, you can:
- Trace every interaction: see the full reasoning path the chatbot takes, from user input to model invocation to tool calls.
- Evaluate response quality: automatically assess whether answers are grounded, fluent, and relevant.
- Monitor performance: track latency, throughput, and cost per interaction.
- Detect anomalies: use Azure Monitor's ML-powered diagnostics to spot unusual patterns.
- Improve continuously: feed evaluation results back into your CI/CD pipeline to refine the chatbot with every release.

This is observability in action: turning opaque AI behavior into transparent, actionable insights. It's not just about fixing bugs; it's about building AI you can trust. Next, let's look at observability in more detail.

What Is Observability in Azure AI?
Observability in Azure AI refers to the ability to monitor, evaluate, and govern AI agents and applications across their lifecycle, from model selection to production deployment. It's not just about uptime or logs anymore; it's about trust, safety, performance, cost, and compliance.

Figure: Observability aligned with the end-to-end AI application development workflow. Image source: Microsoft Learn

Key Components and Capabilities

Azure AI Foundry Observability
- Built-in observability for agentic workflows.
- Tracks metrics such as performance, quality, cost, safety, relevance, and groundedness in real time.
- Enables tracing of agent interactions and data lineage.
- Supports alerts for risky or off-policy responses and integrates with partner governance platforms.
- Find details here: Observability in Generative AI with Azure AI Foundry - Azure AI Foundry | Microsoft Learn

AI Red Teaming (PyRIT Integration)
- Scans agents for safety vulnerabilities.
- Evaluates attack success rates across categories such as hate, violence, sexual content, and more.
- Generates scorecards and logs results in the Foundry portal.
- Find details here: AI Red Teaming Agent - Azure AI Foundry | Microsoft Learn
Image source: Microsoft Learn

CI/CD Integration
- GitHub Actions and Azure DevOps workflows automate evaluations.
- Continuous monitoring and regression detection during development.

Azure Monitor + Azure BRAIN
- Uses ML and LLMs for anomaly detection, forecasting, and root cause analysis.
- Offers multi-tier log storage (Gold, Silver, Bronze) with a unified KQL query experience.
- Integrates with Azure Copilot for diagnostics and optimization.

OpenTelemetry Extensions
- Azure is extending OTel with agent-specific entities such as AgentRun, ToolCall, Eval, and ModelInvocation.
- Enables fleet-scale dashboards and semantic tracing for GenAI workloads.
A minimal sketch of what such agent-specific tracing can look like follows below.
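To make the idea of agent-specific spans concrete, here is a minimal, hedged sketch using the standard OpenTelemetry Python API. The span names AgentRun, ToolCall, and ModelInvocation mirror the entities mentioned above, but the attribute names and the exact semantic conventions Azure uses are assumptions for illustration; the billing_lookup tool and deployment name are placeholders.

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; in production you would export to
# Azure Monitor / Application Insights instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("support-chatbot")

def answer_billing_question(question: str) -> str:
    # One top-level span per agent execution.
    with tracer.start_as_current_span("AgentRun") as run_span:
        run_span.set_attribute("agent.input", question)

        # A tool call the agent makes while reasoning.
        with tracer.start_as_current_span("ToolCall") as tool_span:
            tool_span.set_attribute("tool.name", "billing_lookup")  # hypothetical tool
            balance = "$42.10"  # stand-in for a real API result

        # The model invocation that produces the final answer.
        with tracer.start_as_current_span("ModelInvocation") as model_span:
            model_span.set_attribute("model.deployment", "<your-deployment>")
            answer = f"Your current balance is {balance}."

        run_span.set_attribute("agent.output", answer)
        return answer

print(answer_billing_question("What is my balance?"))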
Observability as a First-Class Citizen in Azure AI Foundry
In Azure AI Foundry, observability isn't bolted on; it's built in. The platform treats observability as a first-class capability, essential for building trustworthy, scalable, and responsible AI systems.

Image source: Microsoft Learn

What Does This Mean in Practice?

Semantic Tracing for Agents
Azure AI Foundry enables intelligent agents to perform tasks using AgentRun, ToolCall, and ModelInvocation. AgentRun manages the entire lifecycle of an agent's execution, from input processing to output generation. ToolCall allows agents to invoke external tools or APIs for specific tasks, such as fetching data or performing calculations. ModelInvocation lets agents use AI models directly for advanced tasks, such as sentiment analysis or image recognition. Together, these components create adaptable agents capable of handling complex workflows efficiently.

Integrated Evaluation Framework
Developers can continuously assess agent responses for quality, safety, and relevance using built-in evaluators. These can be run manually or automatically via CI/CD pipelines, enabling fast iteration and regression detection; a minimal sketch of such an automated gate appears after this list.

Governance and Risk Management
Observability data feeds directly into governance workflows. Azure AI Foundry supports policy mapping, risk scoring, and audit logging, helping teams meet compliance requirements while maintaining agility.

Feedback Loop for Continuous Improvement
Observability isn't just about watching; it's about learning. Azure AI Foundry enables teams to use telemetry and evaluation data to refine agents, improve performance, and reduce risk over time.
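Here is a minimal, hedged sketch of what a CI/CD evaluation gate can look like: run evaluations over a fixed regression set and fail the pipeline if aggregate scores drop below a threshold. The score_response function, the file name, and the thresholds are placeholders; in practice you would call the Azure AI Evaluation SDK or your own evaluators.

```python
import json
import sys

# Thresholds are illustrative; tune them to your own quality bar.
THRESHOLDS = {"groundedness": 0.8, "relevance": 0.85}

def score_response(query: str, response: str) -> dict:
    """Placeholder scorer. Swap in azure-ai-evaluation evaluators or your own."""
    return {"groundedness": 0.9, "relevance": 0.9}  # stand-in values

def run_gate(eval_file: str = "regression_set.jsonl") -> int:
    totals, count = {k: 0.0 for k in THRESHOLDS}, 0
    with open(eval_file, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)  # expects {"query": ..., "response": ...}
            scores = score_response(row["query"], row["response"])
            for metric in THRESHOLDS:
                totals[metric] += scores[metric]
            count += 1
    averages = {m: totals[m] / max(count, 1) for m in THRESHOLDS}
    failed = [m for m, avg in averages.items() if avg < THRESHOLDS[m]]
    print("averages:", averages)
    return 1 if failed else 0  # a non-zero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(run_gate())
```

Running a script like this from GitHub Actions or Azure DevOps turns evaluation into a regression gate rather than a one-time check.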
Now, Build AI You Can Trust
Observability isn't just a technical feature; it's the foundation of responsible AI. Whether you're building copilots, deploying GenAI agents, or modernizing enterprise workflows, Azure AI Foundry and Azure Monitor give you the tools to trace, evaluate, and improve every decision your AI makes. Now is the time to move beyond black-box models and embrace transparency, safety, and performance at scale. Start integrating observability into your AI workflows and unlock the full potential of your agents, with confidence.

Read more here:
- Plans | Microsoft Learn
- Observability and Continuous Improvement - Training | Microsoft Learn
- Observability in Generative AI with Azure AI Foundry - Azure AI Foundry | Microsoft Learn

About the Author
Priyanka is a Technical Trainer at Microsoft USA with over 15 years of experience as a Microsoft Certified Trainer. She has a profound passion for learning and sharing knowledge across various domains. Priyanka excels in delivering training sessions, proctoring exams, and upskilling Microsoft Partners and Customers. She has contributed significantly to AI- and Data-related courseware, exams, and high-profile events such as Microsoft Ignite, Microsoft Learn Live Shows, MCT Community AI Readiness, and Women in Cloud Skills Ready. She also supports initiatives such as "Code Without Barriers" and "Women in Azure AI," contributing to AI skills enhancement. Her primary areas of expertise include courses on Development, Data, and AI. In addition to maintaining and acquiring new certifications in Data and AI, she has guided learners and enthusiasts on their educational journeys. Priyanka is an active member of the Microsoft Tech Community, where she reviews and writes blogs focusing on Data and AI.

#SkilledByMTT #MSLearn #MTTBloggingGroup

The Future of AI: Horses for Courses - Task-Specific Models and Content Understanding
Task-specific models are designed to excel at specific use cases, offering highly specialized solutions that can be more efficient and cost-effective than general-purpose models. These models are optimized for particular tasks, resulting in faster performance and lower latency, and they often do not require prompt engineering or fine-tuning.

🎉 Join the Microsoft Ignite 2025 NYC Community Summit in Times Square!
Get ready, New York! The Microsoft Ignite 2025 NYC Community Summit is coming to the heart of Times Square, and you're invited to be part of the energy, insights, and innovation. Whether you're a seasoned tech leader, a cloud enthusiast, or just Ignite-curious, this two-day experience is your chance to connect with the local Microsoft customer community and attend live sessions by MVPs and local experts. Watch the live-streamed Ignite keynote while engaging in real-time conversations with peers and experts. To attend, please register here.

🎤 What to Expect
- Live keynote viewing: watch Microsoft leaders unveil the latest in AI, cloud, and security.
- Community conversations: join breakout discussions with local customers and Microsoft experts.
- Exclusive panels and lightning talks: hear from industry voices and community MVPs.
- Food and snacks included: because no community event is complete without them.

🌟 Featured Speakers & Sessions
Explore a variety of exciting topics, including:
- Generating Pages in Power Apps
- Lights, Camera, Akka! The Actor Model & Agentic AI Orchestra
- How to create Moonshot solutions with AI
- Transforming Facility, Network and Organization Management with Visio and Power BI
- Building Agents in AI Foundry!
- What's new with Azure Load Balancer, NAT Gateway, and Public IP Addresses
- .NET Apps Everywhere!
- Accelerating Web Application Development with AI-Powered Tools: From Design to Deployment
- How (and why) Microsoft's upstream teams engage with multi-stakeholder open-source projects
- Leveling Up Agents: Copilot Studio for Enterprise Studios
- RAG Hero: Fast-Track Vector Search in .NET
- Building Resilient Systems
- Agentic Orchestration: Building Scalable, Open-Source Automation with A2A, MCP and RAG Patterns

🤝 Sponsors & Partners
We're proud to be supported by a fantastic group of sponsors who help make this event possible.

🔗 RSVP & Stay Connected
Spots are limited, and you must register by November 3, 2025, so don't miss out! 👉 To attend, please register here. The exact location is provided upon registration acceptance.

Unveiling the Next Generation of Table Structure Recognition
In an era where data is abundant, the ability to accurately and efficiently extract structured information such as tables from diverse document types is critical. Consider the complexities of a balance sheet with multiple types of assets, or an invoice with various charges, both presented in a table format that can be challenging even for humans to interpret. Traditional parsing methods often struggle with the complexity and variability of real-world tables, leading to manual intervention and inefficient workflows. These methods typically rely on rigid rules or predefined templates that fail when they encounter variations in layout, formatting, or content, all of which are common in real-world documents.

While the promise of Generative AI and Large Language Models (LLMs) in document understanding is vast, our research in table parsing has revealed a critical insight: for tasks requiring precision in data alignment, such as correctly associating data cells with their respective row and column headers, classical computer vision techniques currently offer superior performance. Generative AI models, despite their powerful contextual understanding, can exhibit inconsistencies and misalignments in tabular structures, compromising data integrity (Figure 1). Therefore, Azure Document Intelligence (DI) and Content Understanding (CU) leverage even more robust and proven computer vision algorithms to ensure the foundational accuracy and consistency that enterprises demand.

Figure 1: Vision LLMs struggle to accurately recognize table structure, even in simple tables.

Our current table recognizer excels at accurately identifying table structures, even those with complex layouts, rotations, or curved shapes. However, it does have limitations. For example, it occasionally fails to properly delineate a table whose logical boundaries are not visible and must be inferred from the larger document context, leading to suboptimal inferences. Furthermore, its architectural design makes it challenging to accelerate on modern GPU platforms, which impacts runtime efficiency. Taking these limitations into consideration and building upon our existing foundation, we are introducing the latest advancement in our table structure recognizer. This new version significantly enhances both performance and accuracy, addressing key challenges in document processing.

Precise Separation Line Placement
We've made significant strides in the precision of separation line placement. While predicting these separation lines might seem deceptively simple, it comes with subtle yet significant challenges. In many real-world documents, these are logical separation lines, meaning they are not always visibly drawn on the page. Instead, their positions are often implied by an array of nuanced visual cues such as table headers and footers, dot-filler text, background color changes, and even the spacing and alignment of content within the cells.

Figure 2: Visual comparison of separation-line predictions from the current and the new model.

We've developed a novel model architecture that can be trained end to end to tackle these challenges directly. Recognizing how difficult it is for humans to label table separation lines consistently, we've devised a training objective that combines Hungarian matching with an adaptive matching weight to correctly align predictions with ground truth even when the latter is noisy. Additionally, we've incorporated a loss function inspired by speech recognition to encourage the model to predict the correct number of separation lines, further enhancing its performance. (A minimal sketch of the matching step follows below.)
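To illustrate the matching idea (this is not Microsoft's actual implementation), here is a minimal, hedged sketch of Hungarian matching between predicted and ground-truth separation-line positions, with a simple confidence-based term standing in for the adaptive matching weight described above; the tolerance values are arbitrary.

```python
# pip install numpy scipy
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_separation_lines(pred_pos, pred_conf, gt_pos, noise_tolerance=5.0):
    """Match predicted separation lines (pixel offsets) to noisy ground-truth lines.

    pred_pos: (N,) predicted positions; pred_conf: (N,) confidences in [0, 1];
    gt_pos: (M,) ground-truth positions. Returns a list of (pred_idx, gt_idx) pairs.
    """
    pred_pos, gt_pos = np.asarray(pred_pos, float), np.asarray(gt_pos, float)
    # Base cost: distance between every predicted and ground-truth line.
    dist = np.abs(pred_pos[:, None] - gt_pos[None, :])
    # Illustrative adaptive weighting: trust confident predictions more, and soften
    # the penalty for small offsets within the labeling-noise tolerance.
    cost = np.maximum(dist - noise_tolerance, 0.0) + (1.0 - np.asarray(pred_conf))[:, None]
    # The Hungarian algorithm finds the assignment with minimal total cost.
    rows, cols = linear_sum_assignment(cost)
    # Discard pairs that are still too far apart to be plausible matches.
    return [(r, c) for r, c in zip(rows, cols) if dist[r, c] < 4 * noise_tolerance]

# Example: three predicted lines matched against two noisy ground-truth lines.
print(match_separation_lines([98.0, 205.0, 410.0], [0.9, 0.8, 0.3], [100.0, 200.0]))
```

In a training loop, the matched pairs would feed a regression loss while unmatched predictions are penalized, which is the general pattern behind set-prediction objectives of this kind.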
Our improved algorithms now respect visual cues more effectively, ensuring that separation lines are placed precisely where they belong. This leads to cleaner, more accurate table structures and, ultimately, more reliable data extraction. Figure 2 compares the current model with the new model on a few examples; quantitative results appear in Table 1.

Segment  | TSR (current, %): Precision / Recall / F1 | TSR-v2 (next-gen, %): Precision / Recall / F1
Latin    | 90.2 / 90.7 / 90.4                        | 94.0 / 95.7 / 94.8
Chinese  | 96.1 / 95.3 / 95.7                        | 97.3 / 96.8 / 97.0
Japanese | 93.5 / 93.8 / 93.7                        | 95.1 / 97.1 / 96.1
Korean   | 95.3 / 95.9 / 95.6                        | 97.5 / 97.8 / 97.7

Table 1: Table structure accuracy measured by cell-prediction precision and recall at an IoU (intersection over union) threshold of 0.5, tested on in-house datasets covering four different scripts.

A Data-Driven, GPU-Accelerated Design
Another innovation in this release is its data-driven, fully GPU-accelerated design. This architectural shift delivers higher quality and significantly faster inference, which is critical for processing large volumes of documents. The design carefully balances model capability against latency requirements, prioritizing an architecture that exploits the inherent parallelism of GPUs: highly parallelizable components are favored over serial approaches to maximize GPU utilization, and post-processing logic has been minimized so it does not become a bottleneck. This comprehensive approach reduces processing latency from 250 ms per image to less than 10 ms.

Fueling Robustness with Synthetic Data
Achieving the level of accuracy and robustness required for enterprise-grade table recognition demands vast quantities of high-quality training data. To meet this need efficiently, we've strategically incorporated synthetic data into our development pipeline. A few examples can be found in Figure 3.

Figure 3: Synthesized tables.

Synthetic data offers significant advantages: it is cost-effective to generate and provides unparalleled control over the dataset. This allows us to rapidly synthesize diverse and specific table styles, including rare or challenging layouts that would be difficult and expensive to collect from real-world documents. Crucially, synthetic data comes with perfectly consistent labels. Unlike human annotation, which can introduce variability, synthetic data ensures that our models learn from a flawlessly labeled ground truth, leading to more reliable and precise training outcomes. (A small sketch of this idea appears below.)
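As an illustration of the synthetic-data idea (not the generator Microsoft uses), here is a minimal sketch that renders a random table to HTML while emitting perfectly consistent structure labels, i.e., each cell's row and column indices come for free by construction.

```python
import random

def synthesize_table(max_rows=6, max_cols=5, seed=None):
    """Generate a random HTML table plus exact structure labels for every cell."""
    rng = random.Random(seed)
    n_rows, n_cols = rng.randint(2, max_rows), rng.randint(2, max_cols)
    headers = [f"Col {c}" for c in range(n_cols)]
    cells, labels = [], []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            text = headers[c] if r == 0 else str(rng.randint(0, 999))
            row.append(text)
            # The label is known by construction: no human annotation needed.
            labels.append({"row": r, "col": c, "is_header": r == 0, "text": text})
        cells.append(row)
    html_rows = []
    for r, row in enumerate(cells):
        tag = "th" if r == 0 else "td"
        html_rows.append("<tr>" + "".join(f"<{tag}>{t}</{tag}>" for t in row) + "</tr>")
    return "<table>" + "".join(html_rows) + "</table>", labels

html, labels = synthesize_table(seed=7)
print(html)
print(labels[:3])  # ground-truth row/column indices for the first three cells
```

A production generator would additionally randomize merged cells, fonts, borders, and rendering, but the key property is the same: the labels never disagree with the rendered table.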
Summary
This latest version of our table structure recognizer enhances critical document understanding capabilities. We've refined separation-line placement to better respect visual cues and implied structure, supported by our synthetic-data approach for consistent training. This, in turn, allows users to preserve the table structure as intended, reducing the need for manual post-processing to clean up the structured output. Additionally, the GPU-accelerated, data-driven design delivers both improved quality and faster performance, which is crucial for processing large document volumes.

The Future of AI: Fine-Tuning Llama 3.1 8B on Azure AI Serverless, why it's so easy & cost efficient
In this article, you will learn how to fine-tune the Llama 3.1 8B model using RAFT and LoRA with Azure AI Serverless Fine-Tuning for efficient, cost-effective model customization.

The Future of AI: The paradigm shifts in Generative AI Operations
Dive into the transformative world of Generative AI Operations (GenAIOps) with Microsoft Azure. Discover how businesses are overcoming the challenges of deploying and scaling generative AI applications. Learn about the innovative tools and services Azure AI offers, and how they empower developers to create high-quality, scalable AI solutions. Explore the paradigm shift from MLOps to GenAIOps and see how continuous improvement practices ensure your AI applications remain cutting-edge. Join us on this journey to harness the full potential of generative AI and drive operational excellence.