Simplifying Image Classification with Azure AutoML for Images: A Practical Guide
1. The Challenge of Traditional Image Classification

Anyone who has worked with computer vision knows the drill: you need to classify images, so you dive into TensorFlow or PyTorch, spend days architecting a convolutional neural network, experiment with dozens of hyperparameters, and hope your model generalizes well. It’s time-consuming, requires deep expertise, and often feels like searching for a needle in a haystack. What if there was a better way?

2. Enter Azure AutoML for Images

Azure AutoML for Images is a game-changer in the computer vision space. It’s a feature within Azure Machine Learning that automatically builds high-quality vision models from your image data with minimal code. Think of it as having an experienced ML engineer working alongside you, handling all the heavy lifting while you focus on your business problem.

What Makes AutoML for Images Special?

1. Automatic Model Selection: Instead of manually choosing between ResNet, EfficientNet, or dozens of other architectures, AutoML for Images (Azure ML) evaluates multiple state-of-the-art deep learning models and selects the best one for your specific dataset. It’s like having access to an entire model zoo with an intelligent curator.

2. Intelligent Hyperparameter Tuning: The system doesn’t just pick a model — it optimizes it. Learning rates, batch sizes, augmentation strategies, and more are automatically tuned to squeeze out the best possible performance. What would take weeks of manual experimentation happens in hours.

3. Built-in Best Practices: Data preprocessing, augmentation techniques, and training strategies that would require extensive domain knowledge are pre-configured and applied automatically. You get enterprise-grade ML without needing to be an ML expert.
Key Capabilities

The repository demonstrates several powerful features:

- Multi-class and Multi-label Classification: Whether you need to classify an image into a single category or tag it with multiple labels, AutoML manages both scenarios seamlessly.
- Format Flexibility: Works with standard image formats including JPEG and PNG, making it easy to integrate with existing datasets.
- Full Transparency: Unlike black-box solutions, you maintain complete visibility and control over the training process. You can monitor metrics, understand model decisions, and fine-tune as needed.
- Production-Ready Deployment: Once trained, models can be easily deployed to Azure endpoints, ready to serve predictions at scale.

Real-World Applications

The practical applications are vast:

- E-commerce: Automatically categorize product images for better search and recommendations.
- Healthcare: Classify medical images for diagnostic support.
- Manufacturing: Detect defects in production line images.
- Agriculture: Identify crop diseases or estimate yield from aerial imagery.
- Content Moderation: Automatically flag inappropriate visual content.

3. A Practical Example: Metal Defect Detection

The repository includes a complete end-to-end example of detecting defects in metal surfaces — a critical quality control task in manufacturing. The notebooks demonstrate how to:

- Download and organize image data from sources like Kaggle
- Create training and validation splits with proper directory structure
- Upload data to Azure ML as versioned datasets
- Configure GPU compute that scales based on demand
- Train multiple models with automated hyperparameter tuning
- Evaluate results with comprehensive metrics and visualizations
- Deploy the best model as a production-ready REST API
- Export to ONNX for edge deployment scenarios

The metal defect use case is particularly instructive because it mirrors real industrial applications where quality control is critical but expertise is scarce.
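The train/validation split step above is plain file shuffling and can be sketched in standard-library Python. This is an illustrative sketch, not the repository's actual code: the function name, the one-subfolder-per-class layout, and the 80/20 ratio are assumptions.

```python
import os
import random
import shutil

def split_dataset(source_dir, train_dir, val_dir, val_fraction=0.2, seed=42):
    """Split images into train/val folders, keeping one subfolder per class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    for class_name in sorted(os.listdir(source_dir)):
        class_path = os.path.join(source_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        images = sorted(os.listdir(class_path))
        rng.shuffle(images)
        n_val = int(len(images) * val_fraction)
        for i, image in enumerate(images):
            target = val_dir if i < n_val else train_dir
            dest = os.path.join(target, class_name)
            os.makedirs(dest, exist_ok=True)
            shutil.copy(os.path.join(class_path, image), os.path.join(dest, image))
```

The resulting `train/<class>/...` and `val/<class>/...` layout is the directory structure image classification datasets typically expect before upload.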
The notebooks show how a small team can build production-grade computer vision systems without a dedicated ML research team.

Getting Started: What You Need

The prerequisites are straightforward:

- An Azure subscription (free tier available for experimentation)
- An Azure Machine Learning workspace
- Python 3.7 or later

That’s it. No local GPU clusters to configure, no complex deep learning frameworks to master.

Repository Structure

The repository is thoughtfully organized into three progressive notebooks:

1. Downloading images.ipynb
- Shows how to acquire and prepare image datasets
- Demonstrates proper directory structure for classification tasks
- Includes data exploration and visualization techniques
(image-classification-azure-automl-for-images/1. Downloading images.ipynb at main · retkowsky/image-classification-azure-automl-for-images)

2. Azure ML AutoML for Images.ipynb
- The core workflow: connect to Azure ML, upload data, configure training
- Covers both simple model training and advanced hyperparameter tuning
- Shows how to evaluate models and select the best performing one
- Demonstrates deployment to managed online endpoints
(image-classification-azure-automl-for-images/2. Azure ML AutoML for Images.ipynb at main · retkowsky/image-classification-azure-automl-for-images)

3. Edge with ONNX local model.ipynb
- Exports trained models to ONNX format
- Shows how to run inference locally without cloud connectivity
- Perfect for edge computing and IoT scenarios
(image-classification-azure-automl-for-images/3. Edge with ONNX local model.ipynb at main · retkowsky/image-classification-azure-automl-for-images)

Each Python notebook is self-contained with clear explanations, making it easy to understand each step of the process. You can run them sequentially to build a complete solution, or jump to specific sections relevant to your use case.

The Developer Experience

What sets this approach apart is the developer experience. The repository provides Python notebooks that guide you through the entire workflow.
You’re not just reading documentation — you’re working with practical, runnable examples that demonstrate real scenarios. Let’s walk through the code to see how straightforward this actually is.

Use-case description

This image classification model is designed to identify and classify defects on metal surfaces in a manufacturing context. We want to classify defective images into Crazing, Inclusion, Patches, Pitted, Rolled & Scratches.

All code and images are available here: retkowsky/image-classification-azure-automl-for-images: Azure AutoML for images — Image classification

Step 1: Connect to Azure ML Workspace

First, establish a connection to your Azure ML workspace using Azure credentials:

```python
print("Connection to the Azure ML workspace…")

credential = DefaultAzureCredential()
ml_client = MLClient(
    credential,
    os.getenv("subscription_id"),
    os.getenv("resource_group"),
    os.getenv("workspace"),
)
print("✅ Done")
```

That’s it.

Step 2: Upload Your Dataset

Upload your image dataset to Azure ML. The code handles this elegantly:

```python
my_images = Data(
    path=TRAIN_DIR,
    type=AssetTypes.URI_FOLDER,
    description="Metal defects images for images classification",
    name="metaldefectimagesds",
)
uri_folder_data_asset = ml_client.data.create_or_update(my_images)

print("🖼️ Information:")
print(uri_folder_data_asset)
print("\n🖼️ Path to folder in Blob Storage:")
print(uri_folder_data_asset.path)
```

Your local images are now versioned data assets in Azure, ready for training.

Step 3: Create GPU Compute Cluster

AutoML needs compute power.
Here’s how you create a GPU cluster that auto-scales:

```python
compute_name = "gpucluster"

try:
    _ = ml_client.compute.get(compute_name)
    print("✅ Found existing Azure ML compute target.")
except ResourceNotFoundError:
    print(f"🛠️ Creating a new Azure ML compute cluster '{compute_name}'…")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="Standard_NC16as_T4_v3",  # GPU VM
        idle_time_before_scale_down=1200,
        min_instances=0,  # Scale to zero when idle
        max_instances=4,
    )
    ml_client.begin_create_or_update(compute_config).result()
print("✅ Done")
```

The cluster scales from 0 to 4 instances based on workload, so you only pay for what you use.

Step 4: Configure AutoML Training

Now comes the magic. Here’s the entire configuration for an AutoML image classification job using a specific model (here a resnet34). You can also browse all the available models in the image classification AutoML library: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models?view=azureml-api-2&tabs=python#supported-model-architectures

```python
image_classification_job = automl.image_classification(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    target_column_name="label",
)

# Set training parameters
image_classification_job.set_limits(timeout_minutes=60)
image_classification_job.set_training_parameters(model_name="resnet34")
```

That’s approximately 10 lines of code to configure what would traditionally require hundreds of lines and deep expertise.

Step 5: Hyperparameter Tuning (Optional)

Want to explore multiple models and configurations?
```python
image_classification_job = automl.image_classification(
    compute=compute_name,                      # Compute cluster
    experiment_name=exp_name,                  # Azure ML job
    training_data=my_training_data_input,      # Training
    validation_data=my_validation_data_input,  # Validation
    target_column_name="label",                # Target
    primary_metric=ClassificationPrimaryMetrics.ACCURACY,  # Metric
    tags={
        "usecase": "metal defect",
        "type": "computer vision",
        "product": "azure ML",
        "ai": "image classification",
        "hyper": "YES",
    },
)

image_classification_job.set_limits(
    timeout_minutes=60,       # Timeout in min
    max_trials=5,             # Max model number
    max_concurrent_trials=2,  # Concurrent training
)

image_classification_job.extend_search_space([
    SearchSpace(
        model_name=Choice(["vitb16r224", "vits16r224"]),
        learning_rate=Uniform(0.001, 0.01),  # LR
        number_of_epochs=Choice([15, 30]),   # Epochs
    ),
    SearchSpace(
        model_name=Choice(["resnet50"]),
        learning_rate=Uniform(0.001, 0.01),  # LR
        layers_to_freeze=Choice([0, 2]),     # Layers to freeze
    ),
])

image_classification_job.set_sweep(
    sampling_algorithm="Random",  # Random sampling to select combinations of hyperparameters
    early_termination=BanditPolicy(
        evaluation_interval=2,  # The model is evaluated every 2 iterations
        slack_factor=0.2,       # A run performing 20% worse than the best run so far may be terminated
        delay_evaluation=6,     # Wait until 6 iterations have completed before evaluating and terminating runs
    ),
)
```

AutoML will now automatically try different model architectures, learning rates, and augmentation strategies to find the best configuration.

Step 6: Launch Training

Submit the job and monitor progress:

```python
# Submit the job
returned_job = ml_client.jobs.create_or_update(image_classification_job)
print(f"✅ Created job: {returned_job}")

# Stream the logs in real-time
ml_client.jobs.stream(returned_job.name)
```

While training runs, you can monitor metrics, view logs, and track progress through the Azure ML Studio UI or programmatically.
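The BanditPolicy settings become easier to reason about with a toy version of the early-termination rule. This is a simplified, hypothetical re-implementation of the check Azure ML performs server-side for a metric being maximized, not the actual BanditPolicy code:

```python
def should_terminate(run_metric, best_metric, iteration,
                     evaluation_interval=2, slack_factor=0.2, delay_evaluation=6):
    """Simplified bandit early-termination check for a maximized metric."""
    if iteration < delay_evaluation:
        return False  # grace period: never terminate during the first iterations
    if iteration % evaluation_interval != 0:
        return False  # the run is only evaluated every N iterations
    # terminate when the run is more than slack_factor worse than the best so far
    return run_metric < best_metric / (1 + slack_factor)
```

With `slack_factor=0.2`, a run at 0.70 accuracy is cut when the best run sits at 0.90 (0.90 / 1.2 = 0.75), while a run at 0.80 survives; this is how the sweep spends GPU time only on promising trials.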
Step 7: Results

Step 8: Deploy to Production

Once training completes, deploy the best model as a REST endpoint:

```python
# Create endpoint configuration
online_endpoint_name = "metal-defects-classification"

endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Metal defects image classification",
    auth_mode="key",
    tags={"usecase": "metal defect", "type": "computer vision"},
)

# Deploy the endpoint
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```

Your model is now a production API endpoint, ready to classify images at scale.

Beyond the Cloud: Edge Deployment with ONNX

One of the most powerful aspects of this approach is flexibility in deployment. The repository includes a third notebook demonstrating how to export your trained model to ONNX (Open Neural Network Exchange) format for edge deployment. This means you can:

- Deploy models on IoT devices for real-time inference without cloud connectivity
- Reduce latency by processing images locally on edge hardware
- Lower costs by eliminating constant cloud API calls
- Ensure privacy by keeping sensitive images on-premises

The ONNX export process is straightforward and integrates seamlessly with the AutoML workflow. Your cloud-trained model can run anywhere ONNX Runtime is supported — from Raspberry Pi devices to industrial controllers.

```python
import onnxruntime

# Load the ONNX model
session = onnxruntime.InferenceSession("model.onnx")

# The input name is read from the model itself
input_name = session.get_inputs()[0].name

# Run inference locally (image_data is your preprocessed image tensor)
results = session.run(None, {input_name: image_data})
```

This cloud-to-edge workflow is particularly valuable for manufacturing, retail, and remote monitoring scenarios where edge processing is essential.

Interactive webapp for image classification

Interpreting model predictions

The deployed endpoint returns a base64-encoded image string if both model_explainability and visualizations are set to True.

Why This Matters
In the AI era, the competitive advantage isn’t about who can build the most complex models — it’s about who can deploy effective solutions fastest. Azure AutoML for Images democratizes computer vision by making sophisticated ML accessible to a broader audience. Small teams can now accomplish what previously required dedicated ML specialists. Prototypes that took months can be built in days. And the quality? Often on par with or better than manually crafted solutions, thanks to AutoML’s systematic approach and access to cutting-edge techniques.

What the Code Reveals

Looking at the actual implementation reveals several important insights:

- Minimal Boilerplate: The entire training pipeline — from data upload to model deployment — requires less than 50 lines of meaningful code. Compare this to traditional PyTorch or TensorFlow implementations that often exceed several hundred lines.
- Built-in Best Practices: Notice how the code automatically manages concerns like data versioning, experiment tracking, and compute auto-scaling. These aren’t afterthoughts — they’re integral to the platform.
- Production-Ready from Day One: The deployed endpoint isn’t a prototype. It includes authentication, scaling, monitoring, and all the infrastructure needed for production workloads. You’re building production systems, not demos.
- Flexibility Without Complexity: The simple API hides complexity without sacrificing control. Need to specify a particular model architecture? One parameter. Want hyperparameter tuning? Add a few lines. The abstraction level is perfectly calibrated.
- Observable and Debuggable: The `.stream()` method and comprehensive logging mean you’re never in the dark about what’s happening. You can monitor training progress, inspect metrics, and debug issues — all critical for real projects.

The Cost of Complexity

Traditional ML projects fail not because of technology limitations but because of complexity.
The learning curve is steep, the iteration cycles are long, and the resource requirements are high. By abstracting away this complexity, AutoML for Images changes the economics of computer vision projects. You can now:

- Validate ideas quickly: Test whether image classification solves your problem before committing significant resources
- Iterate faster: Experiment with different approaches in hours rather than weeks
- Scale expertise: Enable more team members to work with computer vision, not just ML specialists

Conclusion

Image classification is a fundamental building block for countless AI applications. Azure AutoML for Images makes it accessible, practical, and production-ready. Whether you’re a seasoned data scientist looking to accelerate your workflow or a developer taking your first steps into computer vision, this approach offers a compelling path forward. The future of ML isn’t about writing more complex code — it’s about writing smarter code that leverages powerful platforms to deliver business value faster. This repository shows you exactly how to do that.

Practical Tips from the Code

After reviewing the notebooks, here are some key takeaways for your own projects:

- Start with a Single Model: The basic configuration with `model_name="resnet34"` is perfect for initial experiments. Only move to hyperparameter sweeps once you’ve validated your data and use case.
- Use Tags Strategically: The code demonstrates adding tags to jobs and endpoints (e.g., `"usecase": "metal defect"`). This becomes invaluable when managing multiple experiments and models in production.
- Leverage Auto-Scaling: The compute configuration with `min_instances=0` means you’re not paying for idle resources. The cluster scales up when needed and scales down to zero when idle.
- Monitor Training Live: The `ml_client.jobs.stream()` method is your best friend during development. You see exactly what’s happening and can catch issues early.
- Version Your Data: Creating named data assets (`name="metaldefectimagesds"`) means your experiments are reproducible. You can always trace back which data version produced which model.
- Think Cloud-to-Edge: Even if you’re deploying to the cloud initially, the ONNX export capability gives you flexibility for future edge scenarios without retraining.

Resources

- Azure ML: https://azure.microsoft.com/en-us/products/machine-learning
- Demo notebooks: https://github.com/retkowsky/image-classification-azure-automl-for-images
- AutoML for Images documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models
- Available models: Set up AutoML for computer vision — Azure Machine Learning | Microsoft Learn
- Connect with the author: https://www.linkedin.com/in/serger/

Now in Foundry: NVIDIA Nemotron-3-Super-120B-A12B, IBM Granite-4.0-1b-Speech, and Sarvam-105B
This week's Model Mondays edition highlights three models now available in the Hugging Face collection on Microsoft Foundry: NVIDIA's Nemotron-3-Super-120B-A12B, a hybrid Latent Mixture-of-Experts (MoE) model with 12B active parameters and context handling up to 1 million tokens; IBM Granite's Granite-4.0-1b-Speech, a compact Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) model that achieves a 5.52% average Word Error Rate (WER) at 280× real-time speed with runtime keyword biasing for domain adaptation; and Sarvam's Sarvam-105B, a 105B Mixture-of-Experts (MoE) model with 10.3B active parameters optimized for complex reasoning and 22 Indian languages, with agentic performance comparable to larger proprietary models on web search and task-planning benchmarks.

Models of the week

NVIDIA Nemotron-3-Super-120B-A12B

Model Specs

- Parameters / size: 120B total with 12B active
- Context length: Up to 1M tokens
- Primary task: Text generation (reasoning, agentic workflows, long-context tasks, tool use, RAG)

Why it's interesting (Spotlight)

- Hybrid Latent MoE architecture with selective attention: Nemotron-3-Super combines interleaved Mamba-2 state-space layers and sparse MoE layers with a select number of full attention layers — a design called Latent MoE. Tokens are routed into a smaller latent space for computation, which improves accuracy per parameter while keeping only 12B parameters active at inference time.
- Multi-Token Prediction (MTP) heads, where the model simultaneously predicts multiple upcoming tokens during training, enable native speculative decoding, reducing time-to-first-token on long outputs without a separate draft model.
- Configurable reasoning mode: The model supports toggling extended chain-of-thought reasoning on or off via the chat template flag enable_thinking.
This lets developers suppress the reasoning trace for latency-sensitive tasks while keeping it available for high-stakes or multi-step agentic use cases without loading a separate model.

- Sustained 1M-token context reliability: On RULER, the standard long-context evaluation suite, Nemotron-3-Super achieves 91.75% at 1M tokens. This makes it practical for full-document retrieval-augmented generation (RAG), long-form code analysis, and extended agentic sessions without chunking or windowing strategies.

Try it

Use cases and best practices:

- Ultra-long document ingestion & consolidation (e.g., end-to-end review of massive specs, logs, or multi-volume manuals without chunking)
  - Use the native 1M-token context to avoid windowing strategies; feed full corpora in one pass to reduce stitching errors.
  - Prefer default decoding for general analysis (NVIDIA recommends temperature≈1.0, top_p≈0.95) before tuning; this aligns with the model's training and MTP-optimized generation path.
  - Leverage MTP for throughput (multi-token prediction improves output speed on long outputs), making single-pass synthesis practical at scale.
- Latency-sensitive chat & tool-calling at scale (e.g., high-volume enterprise assistants where response time matters)
  - Toggle reasoning traces intentionally via the chat template (enable_thinking on/off): turn off for low-latency interactions; on for harder prompts where accuracy benefits from explicit reasoning.
  - Use model-recommended sampling for tool calls (many guides tighten temperature for tool use) to improve determinism while keeping top_p near 0.95.
  - Rely on the Latent MoE + MTP design to sustain high tokens/sec under load instead of adding a draft model for speculative decoding.
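The "only 12B of 120B parameters active" property comes from top-k expert routing. Here is a deliberately simplified toy sketch of a router's gate: real MoE routers are learned layers, and the latent-space projection that defines Latent MoE is omitted entirely.

```python
import math

def route_topk(gate_logits, k=2):
    """Pick the top-k experts and renormalize their gate weights with a softmax."""
    # indices of the k largest logits
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    # only these k experts run; the rest of the network stays idle for this token
    return {expert: weight / total for expert, weight in zip(top, exps)}
```

Because each token only activates the chosen experts, compute per token scales with the active parameter count (12B here), not the total (120B), which is why such models can be served far more cheaply than dense models of the same size.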
IBM Granite-4.0-1b-Speech

Model Specs

- Parameters / size: ~1B
- Context length: 128K tokens (LLM backbone; audio processed per utterance through the speech encoder)
- Primary task: Multilingual Automatic Speech Recognition (ASR) and bidirectional Automatic Speech Translation (AST)

Why it's interesting (Spotlight)

- Compact ASR with speculative decoding at near-real-time speed: At roughly 1B parameters, Granite-4.0-1b-Speech achieves a 5.52% average WER across eight English benchmarks at 280× real-time speed (RTFx — the ratio of audio duration processed to wall-clock time) on the Open ASR Leaderboard.
- Runtime keyword biasing for domain adaptation without fine-tuning: Granite-4.0-1b-Speech accepts a runtime keyword list — proper nouns, brand names, technical terms, acronyms — that adjusts decoding probabilities toward those terms. This allows domain-specific vocabulary to be injected at inference time rather than requiring a fine-tuning run, practical for legal transcription, medical dictation, or financial meeting notes where terminology changes across clients.
- Bidirectional speech translation across 6 languages in one model: Beyond ASR, the model supports translation both to and from English for French, German, Spanish, Portuguese, and Japanese, plus English-to-Italian and English-to-Mandarin. A single deployed endpoint handles ASR and AST tasks without routing audio to separate models, reducing infrastructure surface area.
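Word Error Rate, the metric behind the 5.52% figure, is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal implementation you can use to sanity-check transcripts locally:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance, computed over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

A 5.52% average WER means roughly one word in eighteen differs from the human reference transcript across the benchmark suite.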
Try it

Test the model in the Hugging Face space before deploying in Foundry here:

Sarvam's Sarvam-105B

Model Specs

- Parameters / size: 105B total with 10.3B active (Mixture of Experts, BF16)
- Context length: 128K tokens (with YaRN-based long-context extrapolation, scale factor 40)
- Primary task: Text generation (reasoning, coding, agentic tasks, Indian language understanding)

Why it's interesting (Spotlight)

- Broad Indian language coverage at scale: Sarvam-105B supports English and 22 Indian languages — Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Urdu, Sanskrit, Maithili, Dogri, Manipuri, Santali, Kashmiri, Nepali, Sindhi, Konkani, and Tibetan — the broadest open-model coverage for this language set at this parameter range. Training explicitly prioritized the Indian context, resulting in reported state-of-the-art performance across these languages for models of comparable size.
- Strong agentic and web-search performance: Sarvam-105B scores 49.5% on BrowseComp (web research benchmark with search tool access) — substantially above GLM-4.5-Air (21.3%) and Qwen3-Next-80B-A3B-Thinking (38.0%). It also achieves 68.3% average on τ² Bench (multi-domain task-planning benchmark), above GPT-OSS-120B (65.8%) and GLM-4.5-Air (53.2%). This reflects training emphasis on multi-step agentic workflows in addition to standard reasoning.

Try it

Use cases and best practices:

- Agentic web research & technical troubleshooting (multi-step reasoning, planning, troubleshooting)
  - Use longer context when needed: the model is designed for long-context workflows (up to 128K context with YaRN-based extrapolation noted).
  - Start from the model's baseline decoding settings (as shown in the model's sample usage) and adjust for your task: temperature ~0.8, top_p ~0.95, repetition_penalty ~1.0, and set an explicit max_new_tokens (sample shows 2048).
  - Suggestion (general, not stated verbatim in the sources): For agentic tasks, keep the prompt structured (goal → constraints → tools available → required output format), and ask for a short plan + final answer to reduce wandering.
- Multilingual (Indic) customer support & content generation (English + 22 Indian languages; native-script / romanized / code-mixed inputs)
  - Be explicit about the language/script you want back (e.g., Hindi in Devanagari vs romanized Hinglish), since training emphasized Indian languages and code-mixed/romanized inputs.
  - Provide in-language examples (a short "good response" example in the target language/script) to anchor tone and terminology. (Suggestion — general best practice; not stated verbatim in sources.)
  - Use the model's baseline generation settings first (sample decoding params) and then tighten creativity for support use cases (e.g., lower temperature) if you see variability.

Getting started

You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. Or start from the Hugging Face Hub and choose the "Deploy on Microsoft Foundry" option, which brings you straight into Foundry.

Learn how to discover models and deploy them using Microsoft Foundry:

- Follow along the Model Mondays series and access the GitHub repo to stay up to date on the latest
- Read the Hugging Face on Azure docs
- Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry
- Explore models in Microsoft Foundry

Beyond the Model: Empower your AI with Data Grounding and Model Training
Discover how Microsoft Foundry goes beyond foundational models to deliver enterprise-grade AI solutions. Learn how data grounding, model tuning, and agentic orchestration unlock faster time-to-value, improved accuracy, and scalable workflows across industries.

Evaluating Generative AI Models Using Microsoft Foundry's Continuous Evaluation Framework
In this article, we'll explore how to design, configure, and operationalize model evaluation using Microsoft Foundry's built-in capabilities and best practices.

Why Continuous Evaluation Matters

Unlike traditional static applications, Generative AI systems evolve due to:

- New prompts
- Updated datasets
- Versioned or fine-tuned models
- Reinforcement loops

Without ongoing evaluation, teams risk quality degradation, hallucinations, and unintended bias moving into production.

How evaluation differs - Traditional Apps vs Generative AI Models

- Functionality: Unit tests vs. content quality and factual accuracy
- Performance: Latency and throughput vs. relevance and token efficiency
- Safety: Vulnerability scanning vs. harmful or policy-violating outputs
- Reliability: CI/CD testing vs. continuous runtime evaluation

Continuous evaluation bridges these gaps — ensuring that AI systems remain accurate, safe, and cost-efficient throughout their lifecycle.

Step 1 — Set Up Your Evaluation Project in Microsoft Foundry

1. Open Microsoft Foundry Portal → navigate to your workspace.
2. Click "Evaluation" from the left navigation pane.
3. Create a new Evaluation Pipeline and link your Foundry-hosted model endpoint, including Foundry-managed Azure OpenAI models or custom fine-tuned deployments.
4. Choose or upload your test dataset — e.g., sample prompts and expected outputs (ground truth).

Example CSV:

| prompt | expected response |
| --- | --- |
| Summarize this article about sustainability. | A concise, factual summary without personal opinions. |
| Generate a polite support response for a delayed shipment. | Apologetic, empathetic tone acknowledging the delay. |

Step 2 — Define Evaluation Metrics

Microsoft Foundry supports both built-in metrics and custom evaluators that measure the quality and responsibility of model responses.
| Category | Example Metric | Purpose |
| --- | --- | --- |
| Quality | Relevance, Fluency, Coherence | Assess linguistic and contextual quality |
| Factual Accuracy | Groundedness (how well responses align with verified source data), Correctness | Ensure information aligns with source content |
| Safety | Harmfulness, Policy Violation | Detect unsafe or biased responses |
| Efficiency | Latency, Token Count | Measure operational performance |
| User Experience | Helpfulness, Tone, Completeness | Evaluate from a human interaction perspective |

Step 3 — Run Evaluation Pipelines

Once configured, click "Run Evaluation" to start the process. Microsoft Foundry automatically sends your prompts to the model, compares responses with the expected outcomes, and computes all selected metrics.

Sample Python SDK snippet:

```python
from azure.ai.evaluation import evaluate_model

evaluate_model(
    model="gpt-4o",
    dataset="customer_support_evalset",
    metrics=["relevance", "fluency", "safety", "latency"],
    output_path="evaluation_results.json",
)
```

This generates structured evaluation data that can be visualized in the Evaluation Dashboard or queried using KQL (Kusto Query Language, the query language used across Azure Monitor and Application Insights) in Application Insights.

Step 4 — Analyze Evaluation Results

After the run completes, navigate to the Evaluation Dashboard. You'll find detailed insights such as:

- Overall model quality score (e.g., 0.91 composite score)
- Token efficiency per request
- Safety violation rate (e.g., 0.8% unsafe responses)
- Metric trends across model versions

Example summary table:

| Metric | Target | Current | Trend |
| --- | --- | --- | --- |
| Relevance | >0.9 | 0.94 | ✅ Stable |
| Fluency | >0.9 | 0.91 | ✅ Improving |
| Safety | <1% | 0.6% | ✅ On track |
| Latency | <2s | 1.8s | ✅ Efficient |

Step 5 — Automate and Integrate with MLOps

Continuous evaluation works best when it's part of your DevOps or MLOps pipeline:

- Integrate with Azure DevOps or GitHub Actions using the Foundry SDK.
- Run evaluation automatically on every model update or deployment.
- Set alerts in Azure Monitor to notify you when quality or safety drops below a threshold.

Example workflow: 🧩 Prompt Update → Evaluation Run → Results Logged → Metrics Alert → Model Retraining Triggered

Step 6 — Apply Responsible AI & Human Review

Microsoft Foundry integrates Responsible AI and safety evaluation directly through Foundry safety evaluators and Azure AI services. These evaluators help detect harmful, biased, or policy-violating outputs during continuous evaluation runs.

Example:

| Test Prompt | Before Evaluation | After Evaluation |
| --- | --- | --- |
| "What is the refund policy?" | Vague, hallucinated details | Precise, aligned to source content, compliant tone |

Quick Checklist for Implementing Continuous Evaluation

- Define expected outputs or ground-truth datasets
- Select quality + safety + efficiency metrics
- Automate evaluations in CI/CD or MLOps pipelines
- Set alerts for drift, hallucination, or cost spikes
- Review metrics regularly and retrain/update models

When to trigger re-evaluation

Re-evaluation should occur not only during deployment, but also when prompts evolve, new datasets are ingested, models are fine-tuned, or usage patterns shift.

Key Takeaways

- Continuous evaluation is essential for maintaining AI quality and safety at scale.
- Microsoft Foundry offers an integrated evaluation framework — from datasets to dashboards — within your existing Azure ecosystem.
- You can combine automated metrics, human feedback, and responsible AI checks for holistic model evaluation.
- Embedding evaluation into your CI/CD workflows ensures ongoing trust and transparency in every release.
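The alerting step in the checklist above reduces to comparing current metrics against targets. A minimal sketch (the threshold values mirror the earlier summary table; the `"min"`/`"max"` comparison encoding is an assumption for illustration, not a Foundry API):

```python
def check_thresholds(current, targets):
    """Return the list of metrics breaching their targets.

    `targets` maps metric name to (comparison, threshold), where comparison is
    "min" (value must stay above) or "max" (value must stay below).
    """
    breaches = []
    for name, (comparison, threshold) in targets.items():
        value = current[name]
        if comparison == "min" and value < threshold:
            breaches.append(name)
        elif comparison == "max" and value > threshold:
            breaches.append(name)
    return breaches
```

A non-empty return value is what would fire the "Metrics Alert" step of the workflow and trigger re-evaluation or retraining.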
Useful Resources

- Microsoft Foundry Documentation - Microsoft Foundry documentation | Microsoft Learn
- Microsoft Foundry-managed Azure AI Evaluation SDK - Local Evaluation with the Azure AI Evaluation SDK - Microsoft Foundry | Microsoft Learn
- Responsible AI Practices - What is Responsible AI - Azure Machine Learning | Microsoft Learn
- GitHub: Microsoft Foundry Samples - azure-ai-foundry/foundry-samples: Embedded samples in Azure AI Foundry docs

DP-100 certificate
Hey community! I'm currently preparing for the DP-100 certification with a couple of coworkers. Two of them have already attempted it and didn't pass, even though they completed the learning path and consistently scored 90+ on the practice exam. They told me the real exam's questions are a lot harder and, especially, that many aren't really answerable from the learning path materials: they saw many deep questions about DevOps, and questions about Azure resources that aren't really part of Azure Machine Learning or Foundry. Has anyone had a similar experience? I thought the exam was centered on Azure Machine Learning (and AI Foundry). Can the exam contain questions unrelated to Azure Machine Learning or Foundry? Would really appreciate help here! Thanks everyone!

The Evolution of GenAI Application Deployment Strategy: Building Custom Copilot (PoC)
The article discusses the use of Azure OpenAI in developing a custom Copilot, a tool that can assist with a wide range of activities. It presents four different approaches to building a GenAI application proof of concept (PoC).

Get to know the core Foundry solutions
Foundry includes specialized services for vision, language, documents, and search, plus Microsoft Foundry for orchestration and governance. Here’s what each does and why it matters:

Azure Vision
With Azure Vision, you can detect common objects in images, generate captions, descriptions, and tags based on image contents, and read text in images. Example: Automate visual inspections or extract text from scanned documents.

Azure Language
Azure Language helps organizations understand and work with text at scale. It can identify key information, gauge sentiment, and create summaries from large volumes of content. It also supports building conversational experiences and question-answering tools, making it easier to deliver fast, accurate responses to customers and employees. Example: Understand customer feedback or translate text into multiple languages.

Azure Document Intelligence
With Azure Document Intelligence, you can use pre-built or custom models to extract fields from complex documents such as invoices, receipts, and forms. Example: Automate invoice processing or contract review.

Azure Search
Azure Search helps you find the right information quickly by turning your content into a searchable index. It uses AI to understand and organize data, making it easier to retrieve relevant insights. This capability is often used to connect enterprise data with generative AI, ensuring responses are accurate and grounded in trusted information. Example: Help employees retrieve policies or product details without digging through files.

Microsoft Foundry
Acts as the orchestration and governance layer for generative AI and AI agents. It provides tools for model selection, safety, observability, and lifecycle management. Example: Coordinate workflows that combine multiple AI capabilities with compliance and monitoring.

Business leaders often ask: Which Foundry tool should I use? The answer depends on your workflow.
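One way to make the service descriptions above concrete is a simple routing table that maps a stated business need to a candidate service. This is an illustrative sketch only: the keyword lists are assumptions for demonstration, not an official taxonomy or API, and a real assessment would weigh more than keywords.

```python
# Illustrative mapping from a stated workflow need to a Foundry service.
# Keyword lists are assumptions for demonstration, not an official taxonomy.

SERVICE_KEYWORDS = {
    "Azure Document Intelligence": ["invoice", "contract", "receipt", "form"],
    "Azure Language": ["sentiment", "translate", "multilingual", "summarize"],
    "Azure Vision": ["image", "photo", "inspection", "caption"],
    "Azure Search": ["search", "retrieve", "index", "policy"],
    "Microsoft Foundry": ["orchestrate", "agent", "governance", "workflow"],
}

def suggest_service(need: str) -> str:
    """Return the first service whose keywords appear in the stated need."""
    need_lower = need.lower()
    for service, keywords in SERVICE_KEYWORDS.items():
        if any(kw in need_lower for kw in keywords):
            return service
    # Default: start with the orchestration layer and narrow down from there.
    return "Microsoft Foundry"

print(suggest_service("Automate invoice handling and contract review"))
# Azure Document Intelligence
```

The point of the sketch is the mental model, not the code: each question a business leader asks about a workflow tends to point at one service, with Microsoft Foundry as the layer that coordinates them.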
For example: Are you trying to automate document-heavy processes like invoice handling or contract review? Do you need to improve customer engagement with multilingual support or sentiment analysis? Or are you looking to orchestrate generative AI across multiple processes for marketing or operations? Connecting these needs to the right Foundry solution ensures you invest in technology that delivers measurable results.

The Future of AI: Building Weird, Warm, and Wildly Effective AI Agents
Discover how humor and heart can transform AI experiences. From the playful Emotional Support Goose to the productivity-driven Penultimate Penguin, this post explores why designing with personality matters, and how Azure AI Foundry empowers creators to build tools that are not just efficient, but engaging.

The Future of AI: The paradigm shifts in Generative AI Operations
Dive into the transformative world of Generative AI Operations (GenAIOps) with Microsoft Azure. Discover how businesses are overcoming the challenges of deploying and scaling generative AI applications. Learn about the innovative tools and services Azure AI offers, and how they empower developers to create high-quality, scalable AI solutions. Explore the paradigm shift from MLOps to GenAIOps and see how continuous improvement practices ensure your AI applications remain cutting-edge. Join us on this journey to harness the full potential of generative AI and drive operational excellence.