Azure AI Services
Data Storage in Azure OpenAI Service
Data Stored at Rest by Default

Azure OpenAI does store certain data at rest by default when you use specific features (see the details below). In general, the base models are stateless and do not retain your prompts or completions from standard API calls (they aren't used to train or improve the base models). However, some optional service features will persist data in your Azure OpenAI resource. For example, if you upload files for fine-tuning, use the vector store, or enable stateful features like Assistants API Threads or Stored Completions, that data will be stored at rest by the service. This means content such as training datasets, embeddings, conversation history, or output logs from those features is saved within your Azure environment. Importantly, this storage is within your own Azure tenant (in the Azure OpenAI resource you created) and remains in the same geographic region as your resource. In summary, yes – data can be stored at rest by default when using these features, and it stays isolated to your Azure resource in your tenant. If you only use basic completions without these features, your prompts and outputs are not persisted in the resource by default (aside from transient processing).

Location and Deletion of Stored Data

Location: All data stored by Azure OpenAI features resides in your Azure OpenAI resource's storage, within your Azure subscription/tenant and in the same region (geography) where your resource is deployed. Microsoft ensures this data is secured – it is automatically encrypted at rest using AES-256 encryption, and you have the option to add a customer-managed key for double encryption (except in certain preview features that may not support CMK). No other Azure OpenAI customers or OpenAI (the company) can access this data; it remains isolated to your environment.

Deletion: You retain full control over any data stored by these features. The official documentation states that stored data can be deleted by the customer at any time. For instance, if you fine-tune a model, the resulting custom model and any training files you uploaded are exclusively available to you, and you can delete them whenever you wish. Similarly, any stored conversation threads or batch processing data can be removed by you through the Azure portal or API. In short, data persisted for Azure OpenAI features is user-managed: it lives in your tenant and you can delete it on demand once it's no longer needed.

Comparison to Abuse Monitoring and Content Filtering

It's important to distinguish the above data storage from Azure OpenAI's content safety system (content filtering and abuse monitoring), which operates differently:

Content Filtering: Azure OpenAI automatically checks prompts and generations for policy violations. These filters run in real time and do not store your prompts or outputs in the filter models, nor are your prompts/outputs used to improve the filters without consent. In other words, the content filtering process itself is ephemeral – it analyzes the content on the fly and does not permanently retain that data.

Abuse Monitoring: By default (if enabled), Azure OpenAI has an abuse detection system that might log certain data when misuse is detected. If the system's algorithms flag potential violations, a sample of your prompts and completions may be captured for review. Any such data selected for human review is stored in a secure, isolated data store tied to your resource and region (within the Azure OpenAI service boundaries in your geography).
This is used strictly for moderation purposes – e.g. a Microsoft reviewer could examine a flagged request to ensure compliance with the Azure OpenAI Code of Conduct.

When Abuse Monitoring is Disabled: Some customers have content logging/abuse monitoring turned off via an approved Microsoft process. According to Microsoft's documentation, when a customer has this modified abuse monitoring in place, Microsoft does not store any prompts or completions for that subscription's Azure OpenAI usage. The human review process is completely bypassed (because there is no stored data to review). Only the AI-based checks might still occur, but they happen in-memory at request time and do not persist your data at rest. Essentially, with abuse monitoring turned off, no usage data is saved for moderation purposes; the system checks content policy compliance on the fly and then immediately discards those prompts/outputs without logging them.

Data Storage and Deletion in Azure OpenAI "Chat on Your Data"

Azure OpenAI's "Chat on your data" (also called Azure OpenAI on your data, part of the Assistants preview) lets you ground the model's answers on your own documents. It stores some of your data to enable this functionality. Below, we explain where and how your data is stored, how to delete it, and important considerations (based on official Microsoft documentation).

How Azure OpenAI on your data stores your data

Data Ingestion and Storage: When you add your own data (for example by uploading files or providing a URL) through Azure OpenAI's "Add your data" feature, the service ingests that content into an Azure Cognitive Search index (Azure AI Search). The data is first stored in Azure Blob Storage (for processing) and then indexed for retrieval:

Files Upload (Preview): Files you upload are stored in an Azure Blob Storage account and then ingested (indexed) into an Azure AI Search index. This means the text from your documents is chunked and saved in a search index so the model can retrieve it during chat.

Web URLs (Preview): If you add a website URL as a data source, the page content is fetched and saved to a Blob Storage container (webpage-<index name>), then indexed into Azure Cognitive Search. Each URL you add creates a separate container in Blob Storage with the page content, which is then added to the search index.

Existing Azure Data Stores: You also have the option to connect an existing Azure Cognitive Search index or other vector databases (like Cosmos DB or Elasticsearch) instead of uploading new files. In those cases, the data remains in that source (for example, your existing search index or database), and Azure OpenAI will use it for retrieval rather than copying it elsewhere.

Chat Sessions and Threads: Azure OpenAI's Assistants feature (which underpins "Chat on your data") is stateful. This means it retains conversation history and any file attachments you use during the chat. Specifically, it stores: (1) threads, messages, and runs from your chat sessions, and (2) any files you uploaded as part of an Assistant's setup or messages. All this data is stored in a secure, Microsoft-managed storage account, isolated for your Azure OpenAI resource. In other words, Azure manages the storage for conversation history and uploaded content, and keeps it logically separated per customer/resource.

Location and Retention: The stored data (index content, files, chat threads) resides within the same Azure region/tenant as your Azure OpenAI resource.
It will persist indefinitely – Azure OpenAI will not automatically purge or delete your data – until you take action to remove it. Even if you close your browser or end a session, the ingested data (search index, stored files, thread history) remains saved on the Azure side. For example, if you created a Cognitive Search index or attached a storage account for "Chat on your data," that index and the files stay in place; the system does not delete them in the background.

How to Delete Stored Data

Removing data that was stored by the "Chat on your data" feature involves a manual deletion step. You have a few options depending on what data you want to delete:

Delete Chat Threads (Assistants API): If you used the Assistants feature and have saved conversation threads that you want to remove (including their history and any associated uploaded files), you can call the Assistants API to delete those threads. Azure OpenAI provides a DELETE endpoint for threads. Using the thread's ID, you can issue a delete request to wipe that thread's messages and any data tied to it. In practice, this means using the Azure OpenAI REST API or SDK with the thread ID. For example: DELETE https://<your-resource-name>.openai.azure.com/openai/threads/{thread_id}?api-version=2024-08-01-preview. This "delete thread" operation will remove the conversation and its stored content from the Azure OpenAI Assistants storage. (Simply clearing or resetting the chat in the Studio UI does not delete the underlying thread data – you must call the delete operation explicitly.)

Delete Your Search Index or Data Source: If you connected an Azure Cognitive Search index, or the system created one for you during data ingestion, you should delete the index (or wipe its documents) to remove your content. You can do this via the Azure portal or the Azure Cognitive Search APIs: go to your Azure Cognitive Search resource, find the index that was created to store your data, and delete that index. Deleting the index ensures all chunks of your documents are removed from search. Similarly, if you had set up an external vector database (Cosmos DB, Elasticsearch, etc.) as the data source, you should delete any entries or indexes there to purge the data. Tip: the index name you created is shown in Azure AI Studio and can be found in your search resource's overview. Removing that index or the entire search resource will delete the ingested data.

Delete Stored Files in Blob Storage: If your usage involved uploading files or crawling URLs (thereby storing files in a Blob Storage container), you'll want to delete those blobs as well. Navigate to the Azure Blob Storage account/container that was used for "Chat on your data" and delete the uploaded files or containers containing your data. For example, if you used the "Upload files (preview)" option, the files were stored in a container in the Azure Storage account you provided – you can delete those directly from the storage account. Likewise, for any web pages saved under webpage-<index name> containers, delete those containers or blobs via the storage account in the Azure portal or using Azure Storage Explorer.

Full Resource Deletion (optional): As an alternative cleanup method, you can delete the Azure resources or resource group that contain the data. For instance, if you created a dedicated Azure Cognitive Search service or storage account just for this feature, deleting those resources (or the whole resource group they reside in) will remove all stored data and associated indices in one go.
Note: only use this approach if you're sure those resources aren't needed for anything else, as it is a broad action. Otherwise, stick to deleting the specific index or files as described above.

Verification: Once you have deleted the above, the model will no longer have access to your data. The next time you use "Chat on your data," it will not find any of the deleted content in the index and thus cannot include it in answers. (Each query fetches data fresh from the connected index or vector store, so if the data is gone, nothing will be retrieved from it.)

Considerations and Limitations

No Automatic Deletion: Remember that Azure OpenAI will not auto-delete any data you've ingested. All data persists until you remove it. For example, if you remove a data source from the Studio UI or end your session, the configuration UI might forget it, but the actual index and files remain stored in your Azure resources. Always explicitly delete indexes, files, or threads to truly remove the data.

Preview Feature Caveats: "Chat on your data" (Azure OpenAI on your data) is currently a preview feature, and some management capabilities are still evolving. A known limitation was that the Azure AI Studio UI did not persist the data source connection between sessions – you'd have to reattach your index each time, even though the index itself continued to exist. This is being worked on, but it underscores that the UI might not show you all lingering data. Deleting via the API or portal is the reliable way to ensure data is removed. Also, preview features might not support certain options like customer-managed keys for encryption of the stored data (the data is still encrypted at rest by Microsoft, but you may not be able to bring your own key in preview).

Data Location & Isolation: All data stored by this feature stays within your Azure OpenAI resource's region/geo and is isolated to your tenant. It is not shared with other customers or OpenAI – it remains private to your resource. So, deleting it is solely your responsibility and under your control. Microsoft confirms that the Assistants data storage adheres to compliance requirements like GDPR and CCPA, meaning you have the ability to delete personal data to meet compliance obligations.

Costs: There is no extra charge specifically for the Assistant "on your data" storage itself. The data being stored in a cognitive search index or blob storage will simply incur the normal Azure charges for those services (for example, Azure Cognitive Search indexing and queries, or storage capacity usage). Deleting unused resources when you're done is wise to avoid ongoing charges. If you only delete the data (index/documents) but keep the search service running, you may still incur minimal costs for the service being available – consider deleting the whole search resource if you no longer need it.

Residual References: After deletion, any chat sessions or assistants that were using that data source will no longer find it. If you had an Assistant configured with a now-deleted vector store or index, you might need to update or recreate the assistant if you plan to use it again, as the old data source won't resolve. Clearing out the data ensures it's gone from future responses. (Each new question to the model will only retrieve from whatever data sources currently exist and are connected.)
In summary, the data you intentionally provide for Azure OpenAI's features (fine-tuning files, vector data, chat histories, etc.) is stored at rest by design in your Azure OpenAI resource (within your tenant and region), and you can delete it at any time. This is separate from the content safety mechanisms. Content filtering doesn't retain data, and abuse monitoring would ordinarily store some flagged data for review – but if you have that disabled, no prompt or completion data is stored for abuse monitoring. All of these details are based on Microsoft's official documentation, ensuring your understanding is aligned with Azure OpenAI's data privacy guarantees and settings.

Azure OpenAI "Chat on your data" stores your content in Azure Search indexes and blob storage (within your own Azure environment or a managed store tied to your resource). This data remains until you take action to delete it. To remove your data, delete the chat threads (via the API) and remove any associated indexes or files in Azure. There are no hidden copies once you do this – the system will not retain context from deleted data on the next chat run. Always double-check the relevant Azure resources (search and storage) to ensure all parts of your data are cleaned up. Following these steps, you can confidently use the feature while maintaining control over your data lifecycle.
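To make the cleanup steps above concrete, here is a minimal command-line sketch of the three deletion paths. The resource names, index name, container name, and keys are placeholders, and the API versions are examples – check the versions currently documented for your services before running anything like this.

```bash
# 1) Delete a stored Assistants thread (endpoint shape taken from the article above)
curl -X DELETE \
  "https://<your-resource-name>.openai.azure.com/openai/threads/<thread_id>?api-version=2024-08-01-preview" \
  -H "api-key: <your-azure-openai-key>"

# 2) Delete the Azure AI Search index that was created for "Chat on your data"
curl -X DELETE \
  "https://<your-search-service>.search.windows.net/indexes/<your-index-name>?api-version=2023-11-01" \
  -H "api-key: <your-search-admin-key>"

# 3) Delete the blob container that holds uploaded files or crawled web pages
az storage container delete \
  --account-name <your-storage-account> \
  --name "webpage-<index name>" \
  --auth-mode login
```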
The AI Study Guide: Azure's top free resources for learning generative AI in 2024

Welcome to the January edition of the Azure AI Study Guide! Every month I'll bring you the best and newest tools when it comes to skilling up on AI. This month, we're all about generative AI. Whether you are already building and training models or trying out a few AI tools for the first time, these free resources are for you.

Unveiling the Next Generation of Table Structure Recognition
In an era where data is abundant, the ability to accurately and efficiently extract structured information like tables from diverse document types is critical. For instance, consider the complexities of a balance sheet with multiple types of assets or an invoice with various charges, both presented in a table format that can be challenging even for humans to interpret. Traditional parsing methods often struggle with the complexity and variability of real-world tables, leading to manual intervention and inefficient workflows. This is because these methods typically rely on rigid rules or predefined templates that fail when encountering variations in layout, formatting, or content, which are common in real-world documents.

While the promise of Generative AI and Large Language Models (LLMs) in document understanding is vast, our research in table parsing has revealed a critical insight: for tasks requiring precision in data alignment, such as correctly associating data cells with their respective row and column headers, classical computer vision techniques currently offer superior performance. Generative AI models, despite their powerful contextual understanding, can sometimes exhibit inconsistencies and misalignments in tabular structures, leading to compromised data integrity (Figure 1). Therefore, Azure Document Intelligence (DI) and Content Understanding (CU) leverage even more robust and proven computer vision algorithms to ensure the foundational accuracy and consistency that enterprises demand.

Figure 1: Vision LLMs struggle to accurately recognize table structure, even in simple tables.

Our current table recognizer excels at accurately identifying table structures, even those with complex layouts, rotations, or curved shapes. However, it does have its limitations. For example, it occasionally fails to properly delineate a table where the logical boundaries are not visible but must be inferred from the larger document context, making suboptimal inferences. Furthermore, its architectural design makes it challenging to accelerate on modern GPU platforms, impacting its runtime efficiency. Taking these limitations into consideration and building upon our existing foundation, we are introducing the latest advancement in our table structure recognizer. This new version significantly enhances both performance and accuracy, addressing key challenges in document processing.

Precise Separation Line Placement

We've made significant strides in the precision of separation line placement. While predicting these separation lines might seem deceptively simple, it comes with subtle yet significant challenges. In many real-world documents, these are logical separation lines, meaning they are not always visibly drawn on the page. Instead, their positions are often implied by an array of nuanced visual cues such as table headers/footers, dot filler text, background color changes, and even the spacing and alignment of content within the cells.

Figure 2: Visual comparison of separation line prediction between the current and the new version.

We've developed a novel model architecture that can be trained end-to-end to directly tackle the above challenges. Recognizing the difficulty for humans to consistently label table separation lines, we've devised a training objective that combines Hungarian matching with an adaptive matching weight to correctly align predictions with ground truth even when the latter is noisy.
Additionally, we've incorporated a loss function inspired by speech recognition to encourage the model to accurately predict the correct number of separation lines, further enhancing its performance. Our improved algorithms now respect visual cues more effectively, ensuring that separation lines are placed precisely where they belong. This leads to cleaner, more accurate table structures and, ultimately, more reliable data extraction. Figure 2 shows the comparison between the current model and the new model on a few examples. Some quantitative results can be found in Table 1.

Segment   | TSR (current, in %) Precision / Recall / F1 | TSR-v2 (next-gen, in %) Precision / Recall / F1
Latin     | 90.2 / 90.7 / 90.4                          | 94.0 / 95.7 / 94.8
Chinese   | 96.1 / 95.3 / 95.7                          | 97.3 / 96.8 / 97.0
Japanese  | 93.5 / 93.8 / 93.7                          | 95.1 / 97.1 / 96.1
Korean    | 95.3 / 95.9 / 95.6                          | 97.5 / 97.8 / 97.7

Table 1: Table structure accuracy measured by cell prediction precision and recall rates at an IoU (intersection over union) threshold of 0.5. Tested on in-house test datasets covering four different scripts.

A Data-Driven, GPU-Accelerated Design

Another innovation in this release is its data-driven, fully GPU-accelerated design. This architectural shift delivers enhanced quality and significantly faster inference speeds, which is critical for processing large volumes of documents. The design carefully considers the trade-off between model capability and latency requirements, prioritizing an architecture that leverages the inherent parallelism of GPUs. This involves favoring highly parallelizable models over serial approaches to maximize GPU utilization. Furthermore, post-processing logic has been minimized to prevent it from becoming a bottleneck. This comprehensive approach has resulted in a drastic reduction in processing latency, from 250 ms per image to less than 10 ms.

Fueling Robustness with Synthetic Data

Achieving the high level of accuracy and robustness required for enterprise-grade table recognition demands vast quantities of high-quality training data. To meet this need efficiently, we've strategically incorporated synthetic data into our development pipeline. A few examples can be found in Figure 3.

Figure 3: Synthesized tables.

Synthetic data offers significant advantages: it's cost-effective to generate and provides unparalleled control over the dataset. This allows us to rapidly synthesize diverse and specific table styles, including rare or challenging layouts, which would be difficult and expensive to collect from real-world documents. Crucially, synthetic data comes with perfectly consistent labels. Unlike human annotation, which can introduce variability, synthetic data ensures that our models learn from a flawlessly labeled ground truth, leading to more reliable and precise training outcomes.

Summary

This latest version of our table structure recognizer enhances critical document understanding capabilities. We've refined separation line placement to better respect visual cues and implied structures, supported by our synthetic data approach for consistent training. This enhancement, in turn, allows users to maintain the table structure as intended, reducing the need for manual post-processing to clean up the structured output. Additionally, a GPU-accelerated, data-driven design delivers both improved quality and faster performance, crucial for processing large document volumes.
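If you want to inspect the recognizer's output yourself, detected tables are returned by the Document Intelligence layout model. The sketch below is not taken from this article; it is a rough example based on the public prebuilt-layout REST API, with a placeholder endpoint, key, and document URL, and an API version you should swap for the one currently documented for your resource.

```bash
# Submit a document to the prebuilt-layout model (returns 202 with an Operation-Location header)
curl -i -X POST \
  "https://<your-resource>.cognitiveservices.azure.com/formrecognizer/documentModels/prebuilt-layout:analyze?api-version=2023-07-31" \
  -H "Ocp-Apim-Subscription-Key: <your-key>" \
  -H "Content-Type: application/json" \
  -d '{"urlSource": "https://example.com/sample-invoice.pdf"}'

# Poll the Operation-Location URL from the response above; the JSON result
# exposes the detected tables (with row/column indices per cell) under analyzeResult.tables.
curl "<operation-location-url>" -H "Ocp-Apim-Subscription-Key: <your-key>"
```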
"Appointment Booking Assistant" – an AI-powered voice agent

In this blog post, we'll explore how to create an AI-driven voice assistant capable of handling medical appointment bookings through natural conversation. Leveraging Microsoft Semantic Kernel, Azure Communication Services (ACS), and Microsoft Graph API, this assistant seamlessly integrates voice interactions with backend scheduling.

Deploying Azure ND H100 v5 Instances in AKS with NVIDIA MIG GPU Slicing
In this article we will cover:

AKS Cluster Deployment (Latest Version) – creating an AKS cluster using the latest Kubernetes version.
GPU Node Pool Provisioning – adding an ND H100 v5 node pool on Ubuntu, with --skip-gpu-driver-install to disable automatic driver installation.
NVIDIA H100 MIG Slicing Configurations – available MIG partition profiles on the H100 GPU and how to enable them.
Workload Recommendations for MIG Profiles – choosing optimal MIG slice sizes for different AI/ML and HPC scenarios.
Best Practices for MIG Management and Scheduling – managing MIG in AKS, scheduling pods, and operational tips.

AKS Cluster Deployment (Using the Latest Version)

Install/Update Azure CLI: Ensure you have Azure CLI 2.0.64+ (or Azure CLI 1.0.0b2 for preview features). This is required for using the --skip-gpu-driver-install option and other latest features. Install the AKS preview extension if needed:

az extension add --name aks-preview
az extension update --name aks-preview

(Preview features are opt-in; using the preview extension gives access to the latest AKS capabilities.)

Create a Resource Group: If not already done, create an Azure resource group for the cluster:

az group create -n MyResourceGroup -l eastus

Create the AKS Cluster: Run az aks create to create the AKS control plane. You can start with a default system node pool (e.g. a small VM for system pods) and no GPU nodes yet. For example:

az aks create -g MyResourceGroup -n MyAKSCluster \
  --node-vm-size Standard_D4s_v5 \
  --node-count 1 \
  --kubernetes-version <latest-stable-version> \
  --enable-addons monitoring

This creates a cluster named MyAKSCluster with one standard node. Use the --kubernetes-version flag to specify the latest AKS-supported Kubernetes version (or omit it to get the default latest). As of early 2025, AKS supports Kubernetes 1.27+; using the newest version ensures support for features like MIG and the ND H100 v5 SKU.

Retrieve Cluster Credentials: Once created, get your Kubernetes credentials:

az aks get-credentials -g MyResourceGroup -n MyAKSCluster

Verification: After creation, you should have a running AKS cluster. You can verify the control plane is up with:

kubectl get nodes

Adding an ND H100 v5 GPU Node Pool (Ubuntu + Skip Driver Install)

Next, add a GPU node pool using the ND H100 v5 VM size. The ND H100 v5 series VMs each come with 8× NVIDIA H100 80GB GPUs (640 GB total GPU memory), high-bandwidth interconnects, and 96 vCPUs – ideal for large-scale AI and HPC workloads. We will configure this node pool to run Ubuntu and skip the automatic NVIDIA driver installation, since we plan to manage drivers (and MIG settings) manually or via the NVIDIA operator.

Steps to add the GPU node pool:

Use Ubuntu Node Image: AKS supports Ubuntu 20.04/22.04 for ND H100 v5 nodes. The default AKS Linux OS (Ubuntu) is suitable. We also set --os-sku Ubuntu to ensure we use Ubuntu (if your cluster's default is Azure Linux, note that Azure Linux is not currently supported for MIG node pools).

Add the GPU Node Pool with Azure CLI: Run:

az aks nodepool add \
  --cluster-name MyAKSCluster \
  --resource-group MyResourceGroup \
  --name h100np \
  --node-vm-size Standard_ND96isr_H100_v5 \
  --node-count 1 \
  --os-type Linux \
  --os-sku Ubuntu \
  --gpu-driver none \
  --node-taints nvidia.com/gpu=true:NoSchedule

Let's break down these parameters:

--node-vm-size Standard_ND96isr_H100_v5 selects the ND H100 v5 VM size (96 vCPUs, 8× H100 GPUs). Ensure your subscription has quota for this SKU and region.
--node-count 1 starts with one GPU VM (scale as needed).

--gpu-driver none tells AKS not to pre-install NVIDIA drivers on the node. This prevents the default driver installation, because we plan to handle drivers ourselves (using NVIDIA's GPU Operator for better control). When using this flag, new GPU nodes come up without NVIDIA drivers until you install them manually or via an operator.

--node-taints nvidia.com/gpu=true:NoSchedule taints the GPU nodes so that regular pods won't be scheduled on them accidentally. Only pods with a matching toleration (e.g. labeled for GPU use) can run on these nodes. This is a best practice to reserve expensive GPU nodes for GPU workloads.

(Optional) You can also add labels if needed. For example, to prepare for MIG configuration with the NVIDIA operator, you might add a label like nvidia.com/mig.config=all-1g.10gb to indicate the desired MIG slicing (explained later). We will address MIG config shortly, so adding such a label now is optional.

Wait for Node Pool to be Ready: Monitor the Azure CLI output or use kubectl get nodes until the new node appears. It should register in Kubernetes (in NotReady state initially while it's configuring). Since we skipped the driver install, the node will not have GPU scheduling resources yet (no nvidia.com/gpu resource visible) until we complete the next step.

Installing the NVIDIA Driver Manually (or via GPU Operator)

Because we skipped the driver installation (--gpu-driver none), the node will not have the necessary NVIDIA driver or CUDA runtime out of the box. You have two main approaches to install the driver:

Use the NVIDIA GPU Operator (Helm-based) to handle driver installation.
Install drivers manually (e.g., run a DaemonSet that downloads and installs the .run package or Debian packages).

The NVIDIA GPU Operator manages drivers, the Kubernetes device plugin, and GPU monitoring components. Standard AKS GPU node pools come with the NVIDIA drivers and container runtime pre-installed; in that case you would deploy the GPU Operator with its driver installation disabled to avoid conflicts. Because we skipped that installation, we deploy the NVIDIA GPU Operator to handle GPU workloads and monitoring. The GPU Operator will deploy the necessary components like the Kubernetes device plugin and the DCGM exporter for monitoring.

2.1 Installing via NVIDIA GPU Operator

Step 1: Add the NVIDIA Helm repository. NVIDIA provides a Helm chart for the GPU Operator. Add the official NVIDIA Helm repo and update it:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

This repository contains the gpu-operator chart and other NVIDIA Helm charts.

Step 2: Install the GPU Operator via Helm. Use Helm to install the GPU Operator into a dedicated namespace (e.g., gpu-operator). In AKS, disable the GPU Operator's driver and toolkit deployment if the node image already provides them, and specify the correct container runtime class for NVIDIA. For example:

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set operator.runtimeClass=nvidia-container-runtime

In the above command, operator.runtimeClass=nvidia-container-runtime aligns with the runtime class name configured on AKS for GPU support.

After a few minutes, Helm should report a successful deployment. For example:

NAME: gpu-operator
LAST DEPLOYED: Fri May 5 15:30:05 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

You can verify that the GPU Operator's pods are running in the cluster.
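For example, a quick way to list everything the operator deployed (using the gpu-operator namespace from the Helm command above):

```bash
# All GPU Operator components should reach Running or Completed state
kubectl get pods -n gpu-operator -o wide
```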
The Operator will deploy several DaemonSets including the NVIDIA device plugin, DCGM exporter, and others. For example, after installation you should see pods like the following in the gpu-operator namespace:

nvidia-dcgm-exporter-xxxxx                1/1   Running   0   60s
nvidia-device-plugin-daemonset-xxxxx      1/1   Running   0   60s
nvidia-mig-manager-xxxxx                  1/1   Running   0   4m
nvidia-driver-daemonset-xxxxx             1/1   Running   0   4m
gpu-operator-node-feature-discovery-...   1/1   Running   0   5m
... (other GPU operator pods) ...

Here we see the NVIDIA device plugin and NVIDIA DCGM exporter pods running on each GPU node, as well as other components. (Note: in our AKS setup, the nvidia-driver-daemonset may be present but left idle if the operator's driver management is disabled.)

Step 3: Confirm the operator's GPU validation. The GPU Operator will run a CUDA validation job to verify everything is working. Check that the CUDA validation pod has completed successfully:

kubectl get pods -n gpu-operator -l app=nvidia-cuda-validator

Expected output:

NAME                          READY   STATUS      RESTARTS   AGE
nvidia-cuda-validator-bpvkt   0/1     Completed   0          3m56s

A Completed CUDA validator indicates the GPUs are accessible and the NVIDIA stack is functioning. At this point, you have the NVIDIA GPU Operator (with device plugin and DCGM exporter) installed via Helm on AKS.

Verifying MIG on H100 with Node Pool Provisioning

Once the driver is installed and the NVIDIA device plugin is running, you can verify MIG. The process is similar to verifying MIG on an A100, but the resource naming and GPU partitioning reflect H100 capabilities.

Check Node Resources:

kubectl describe node <h100-node-name>

If you chose the single MIG strategy, you might see:

Allocatable:
  nvidia.com/gpu: 56

for a node with 8 H100s × 7 MIG slices each = 56 (or nvidia.com/gpu: 24 if you used MIG2g, which yields 3 slices per GPU). If you chose the mixed MIG strategy (mig.strategy=mixed), you'll see something like:

Allocatable:
  nvidia.com/mig-1g.10gb: 56

or whichever MIG slice name is appropriate (e.g., mig-3g.40gb for MIG3g).

Confirm MIG in nvidia-smi: On the node, list the GPUs and their MIG devices:

nvidia-smi -L

Run a GPU Workload: For instance, run a quick CUDA container:

kubectl run mig-test --rm -ti \
  --image=nvidia/cuda:12.1.1-runtime-ubuntu22.04 \
  --limits="nvidia.com/gpu=1" \
  -- bash

Inside the container, nvidia-smi should confirm you have a MIG device. Then any CUDA commands (e.g., deviceQuery) should pass, indicating MIG is active and the driver is working.

MIG Management on H100

The H100 supports several MIG profiles – predefined ways to slice the GPU. Each profile is denoted by <N>g.<M>gb, meaning it uses N GPU compute slices (out of 7) and M GB of memory. Key H100 80GB MIG profiles include:

MIG 1g.10gb: Each instance has 1/7 of the SMs and 10 GB memory (1/8 of VRAM). This yields 7 instances per GPU (7 × 10 GB = 70 GB out of 80; a small portion is reserved). This is the smallest slice size and maximizes the number of instances (useful for many lightweight tasks).

MIG 1g.20gb: Each instance has 1/7 of the SMs but 20 GB memory (1/4 of VRAM), allowing up to 4 instances per GPU. This profile gives each instance more memory while still only a single compute slice – useful for memory-intensive workloads that don't need much compute.

MIG 2g.20gb: Each instance gets 2/7 of the SMs and 20 GB memory (2/8 of VRAM). 3 instances can run on one GPU. This offers a balance: more compute per instance than 1g, with a moderate 20 GB memory each.

MIG 3g.40gb: Each instance has 3/7 of the SMs and 40 GB memory (half the VRAM). Two instances fit on one H100.
This effectively splits the GPU in half.

MIG 4g.40gb: Each instance uses 4/7 of the SMs and 40 GB memory. Only one such instance can exist per GPU (because it uses half the memory and more than half of the SMs). In practice, a 4g.40gb profile might be combined with a smaller profile on the same GPU (e.g., a 4g.40gb + a 3g.40gb could occupy one GPU, totaling 7/7 SMs and 80 GB). However, AKS node pools use a single uniform profile per GPU, so you typically wouldn't mix profiles on the same GPU in AKS.

MIG 7g.80gb: This profile uses the entire GPU (all 7/7 SMs and 80 GB memory). Essentially, MIG 7g.80gb is the full GPU as one instance (no slicing). It's equivalent to not using MIG at all for that GPU.

These profiles illustrate the flexibility: you can trade off the number of instances against the power of each instance. For example, MIG 1g.10gb gives you seven small GPUs, whereas MIG 3g.40gb gives you two much larger slices (each roughly half of an H100). All MIG instances are hardware-isolated, meaning each instance's performance is independent (one instance can't starve others of GPU resources).

Enabling MIG in AKS: There are two main ways to configure MIG on the AKS node pool:

At Node Pool Creation (Static MIG Profile): Azure allows specifying a GPU instance profile when creating the node pool. For example, adding --gpu-instance-profile MIG1g to the az aks nodepool add command would provision each H100 GPU in 1g mode (e.g., 7× 10 GB instances per GPU). Supported profile names for H100 include MIG1g, MIG2g, MIG3g, MIG4g, and MIG7g (the same profile names used for A100, but on H100 they correspond to the sizes above). Important: once set, the MIG profile on a node pool cannot be changed without recreating the node pool. If you chose MIG1g, all GPUs in that node pool will be partitioned into 7 slices each, and you can't later switch those nodes to a different profile on the fly.

Dynamically via NVIDIA GPU Operator: If you skipped the driver install (as we did) and are using the GPU Operator, you can let the operator manage MIG. This involves labeling the node with a desired MIG layout. For example, nvidia.com/mig.config=all-1g.10gb means "partition all GPUs into 1g.10gb slices." The operator's MIG Manager will then enable MIG mode on the GPUs, create the specified MIG instances, and mark the node ready when done. This approach offers flexibility – you could theoretically adjust the MIG profile by changing the label and letting the operator reconfigure (though it will drain and reboot the node to apply changes). The operator adds a taint like mig-nvidia.io/device-config=pending (or similar) during reconfiguration to prevent scheduling pods too early.

For our deployment, we opted to skip Azure's automatic MIG config and use the NVIDIA operator. If you followed the steps in section 2 and set the nvidia.com/mig.config label before node creation, the node on first boot will come up, install drivers, then partition into the specified MIG profile. If not, you can label the node now and the operator will configure MIG accordingly. For example:

kubectl label node <node-name> nvidia.com/mig.config=all-3g.40gb --overwrite

splits each GPU into two 3g.40gb instances. The operator will detect this and partition the GPUs (the node may briefly go NotReady while MIG is being set up). After MIG is configured, verify the node's GPU resources again.
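A couple of commands can help you follow the reconfiguration. The label selector and state label below are the ones the GPU Operator's MIG Manager typically uses, so treat them as assumptions and adjust them to what your operator version actually reports.

```bash
# Watch the MIG Manager logs while it applies the new layout (label selector assumed)
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -f

# The MIG Manager records progress in a node label (expect "pending", then "success")
kubectl get node <node-name> \
  -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```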
Depending on the MIG strategy (see the next section), you will either see a larger number of generic nvidia.com/gpu resources or specifically named resources like nvidia.com/mig-3g.40gb. We will discuss how to schedule workloads onto these MIG instances next.

Important Considerations:

Workload Interruption: Applying a new MIG configuration can disrupt running GPU workloads. It's advisable to drain the node or ensure that no critical workloads are running during the reconfiguration process.

Node Reboot: Depending on the environment and GPU model, enabling or modifying MIG configurations might require a node reboot. Ensure that your system is prepared for potential reboots to prevent unexpected downtime.

Workload Recommendations for MIG Profiles (AI/ML vs. HPC)

Different MIG slicing configurations are suited to different types of workloads. Here are recommendations for AI/ML and HPC scenarios:

Full GPU (MIG 7g.80gb or MIG disabled) – Best for the largest and most intensive tasks. If you are training large deep learning models (e.g. GPT-style models, complex computer vision training) or running HPC simulations that fully utilize a GPU, you should use the entire H100 GPU. The ND H100 v5 is designed to excel at these demanding workloads. In Kubernetes, you would simply schedule pods that request a whole GPU. (If MIG mode is enabled with the 7g.80gb profile, each GPU is one resource unit.) This ensures maximum performance for jobs that can utilize 80 GB of GPU memory and all compute units. HPC workloads like physics simulations, CFD, and weather modeling typically fall here – they are optimized to use full GPUs or even multiple GPUs in parallel, so slicing a GPU could impede their performance unless you explicitly want to run multiple smaller HPC jobs on one card.

Large MIG Partitions (3g.40gb or 4g.40gb) – Good for moderately large models or jobs that don't quite need a full H100. For instance, you can split an H100 into 2× 3g.40gb instances, each with 40 GB VRAM and ~43% of the H100's compute. This configuration is popular for AI model serving and inference where a full H100 might be underutilized. In fact, two MIG 3g.40gb instances on an H100 can sometimes serve models with performance equal to or better than two full A100 GPUs, at a lower cost. Each 3g.40gb slice is roughly equivalent to an A100 40GB in capability, and it also unlocks H100-specific features (like FP8 precision for inference). Use cases:

Serving two large ML models concurrently (each model up to 40 GB in size, such as certain GPT-XXL or vision models). Each model gets a dedicated MIG slice.

Running two medium-sized training jobs on one physical GPU. For example, two separate experiments that each need ~40 GB of GPU memory can run in parallel, each on a MIG 3g.40gb. This can increase throughput for hyperparameter tuning or multi-user environments.

HPC batch jobs: if you have HPC tasks that can fit in half a GPU (perhaps memory-bound tasks, or jobs that only need ~50% of the GPU's FLOPs), using two 3g.40gb instances allows two jobs to run on one GPU server concurrently with minimal interference.

MIG 4g.40gb (one 40 GB instance using ~57% of compute) is a less common choice by itself – since only one 4g instance can exist per GPU, it leaves some GPU capacity unused (the remaining 3/7 SMs would be idle). It might be used in a mixed-profile scenario (one 4g + one 3g on the same GPU) if manually configured.
In AKS (which uses uniform profiles per node pool), you'd typically prefer 3g.40gb if you want two equal halves, or just use full GPUs. So in practice, stick with 3g.40gb for a clean two-way split on H100.

Medium MIG Partitions (2g.20gb) – Good for multiple medium workloads. This profile yields 3 instances per GPU, each with 20 GB memory and about 28.6% of the compute. This is useful when you have several smaller training jobs or medium-sized inference tasks that run concurrently. Examples:

Serving three different ML models (each ~15–20 GB in size) from one H100 node, each model on its own MIG 2g.20gb instance.

Running 3 parallel training jobs for smaller models or prototyping (each job can use 20 GB of GPU memory). For instance, three data scientists can share one H100 GPU server, each getting what is effectively a "20 GB GPU." Each 2g.20gb MIG slice should outperform a V100 (16 GB) in both memory and compute, so this is still a hefty slice for many models.

In an HPC context, if you had many lighter GPU-accelerated tasks (for example, three independent tasks that each use ~1/3 of a GPU), this profile could allow them to share a node efficiently.

Small MIG Partitions (1g.10gb) – Ideal for high-density inference and lightweight workloads. This profile creates 7 instances per GPU, each with 10 GB VRAM and 1/7 of the compute. It's perfect for AI inference microservices, model ensembles, or multi-tenant GPU environments:

Deploying many small models or instances of a model. For example, you could host seven different AI services (each requiring <10 GB of GPU memory) on one physical H100, each in its own isolated MIG slice. Most cloud providers use this to offer "fractional GPUs" to customers – e.g., a user could rent a 1g.10gb slice instead of the whole GPU.

Running interactive workloads like Jupyter notebooks or development environments for multiple users on one GPU server. Each user can be assigned a MIG 1g.10gb slice for testing small-scale models or doing data science workloads, without affecting others.

Inference tasks that are memory-light but require GPU acceleration – e.g., running many inference requests in parallel across MIG slices (each slice still has ample compute for model scoring tasks, and 10 GB is enough for many models like smaller CNNs or transformers).

Keep in mind that 1g.10gb slices have the lowest compute per instance, so they are suited for workloads that individually don't need the full throughput of an H100. They shine when throughput is achieved by running many in parallel.

1g.20gb profile – This one is a bit niche: 4 slices per GPU, each with 20 GB but only 1/7 of the SMs. You might use this if each task needs a large model (20 GB) but isn't compute-intensive. An example could be running four instances of a large language model in inference mode, where each instance is constrained by memory (loading a 15–18 GB model) but you deliberately limit its compute share to run more concurrently. In practice, the 2g.20gb profile (which gives the same memory per instance and more compute) might be preferable if you can utilize the extra SMs. So 1g.20gb only makes sense if you truly have compute-light, memory-heavy workloads or if you need exactly four isolated instances on one GPU.

HPC Workloads Consideration: Traditional HPC jobs (MPI applications, scientific computing) typically either use an entire GPU or none. MIG can be useful in HPC for capacity planning – e.g., running multiple smaller GPU-accelerated jobs simultaneously if they don't all require a full H100.
But it introduces complexity, as the HPC scheduler must be aware of fractional GPUs. Many HPC scenarios might instead use whole GPUs per job for simplicity. That said, for HPC inference or analytics (like running multiple inference tasks on simulation output), MIG slicing can improve utilization. If jobs are latency-sensitive, MIG's isolation ensures one job doesn't impact another, which is beneficial for multi-tenant HPC clusters (for example, different teams sharing a GPU node).

In summary, choose the smallest MIG slice that still meets your workload's requirements. This maximizes overall GPU utilization and cost-efficiency by packing more tasks onto the hardware. Use larger slices or full GPUs only when a job truly needs the extra memory and compute. It's often a good strategy to create multiple GPU node pools with different MIG profiles tailored to different workload types (e.g., one pool of full GPUs for training and one pool of 1g or 2g MIG GPUs for inference).

Appendix A: MIG Management via AKS Node Pool Provisioning (without GPU Operator MIG profiles)

Multi-Instance GPU (MIG) allows partitioning an NVIDIA A100 (and newer) GPU into multiple instances. AKS supports MIG for compatible GPU VM sizes (such as the ND A100 v4 series), but MIG must be configured when provisioning the node pool – it cannot be changed on the fly in AKS. In this section, we show how to create a MIG-enabled node pool and integrate it with Kubernetes scheduling. We will not use the GPU Operator's dynamic MIG reconfiguration; instead, we set MIG at node pool creation time (which is the only option on AKS).

Step 1: Provision an AKS node pool with a MIG profile. Choose a MIG-capable VM size (for example, the Standard_ND96isr_H100_v5 size used below) and use the Azure CLI to create a new node pool, specifying the --gpu-instance-profile:

az aks nodepool add \
  --resource-group <myResourceGroup> \
  --cluster-name <myAKSCluster> \
  --name migpool \
  --node-vm-size Standard_ND96isr_H100_v5 \
  --node-count 1 \
  --gpu-instance-profile MIG1g

In this example, we create a node pool named "migpool" with MIG profile MIG1g (each physical H100 GPU is split into 7 × 1g.10gb instances). Important: you cannot change the MIG profile after the node pool is created. If you need a different MIG configuration (e.g., a 2g or 3g profile), you must create a new node pool with the desired profile. Note: MIG is only supported on Ubuntu-based AKS node pools (not on Azure Linux nodes), and currently the AKS cluster autoscaler does not support scaling MIG-enabled node pools. Plan capacity accordingly, since MIG node pools can't auto-scale.

Appendix B: Key Points and Best Practices

No On-the-Fly Profile Changes: With AKS, once a node pool is created with --gpu-instance-profile MIGxg, you cannot switch to a different MIG layout on that same node pool. If you need a new MIG profile, create a new node pool.

--skip-gpu-driver-install: This is typically used if you need a specific driver version, or if you want the GPU Operator to manage drivers (instead of the in-box AKS driver). Make sure your driver is installed before you schedule GPU workloads. If the driver is missing, pods that request GPU resources will fail to initialize.

Driver Versions for H100: H100 requires driver branch R525 or newer (and CUDA 12+). Verify that the GPU Operator or your manual install uses a driver that supports H100, and MIG on H100 specifically.
Single vs. Mixed Strategy: The single strategy lumps all MIG slices together as nvidia.com/gpu. This is simpler for uniform MIG node pools. The mixed strategy exposes resources like nvidia.com/mig-1g.10gb; use it if you need explicit scheduling by MIG slice type. Configure this in the GPU Operator's Helm values (e.g., --set mig.strategy=single or mixed). If the Operator's MIG Manager is disabled, it won't attempt to reconfigure MIG, but it will still let the device plugin report the slices in single or mixed mode.

Resource Requests and Scheduling: If using the single strategy, a pod that requests nvidia.com/gpu: 1 will be allocated a single 1g.10gb MIG slice on H100. If using mixed, that same request must specifically match the MIG resource name (e.g., nvidia.com/mig-1g.10gb: 1). If your pod requests nvidia.com/gpu: 1 but the node only advertises nvidia.com/mig-1g.10gb, scheduling won't match – so be consistent in your pod specs (see the example pod specs after this list).

Cluster Autoscaler: Currently, MIG-enabled node pools have limited or no autoscaler support on AKS (the cluster autoscaler does not fully account for MIG resources). Scale these node pools manually or via custom logic. If you rely heavily on auto-scaling, consider using a standard GPU node pool (no MIG) or carefully plan capacity to avoid needing dynamic scaling for MIG pools.

Monitoring: The GPU Operator deploys the DCGM exporter by default, which can collect MIG-specific metrics. Integrate with Prometheus + Grafana for GPU usage dashboards. MIG slices are typically identified by unique device IDs in DCGM, so you can see which MIG slices are busier than others, memory usage, and so on.

Node Image Upgrades: Because you're skipping the driver install from AKS, ensure you keep your GPU driver DaemonSet or Operator up to date. If you do a node image upgrade (AKS version upgrade), the OS might change, requiring a recompile or a matching driver version. The GPU Operator normally handles this seamlessly by re-installing the driver on the new node image. Test your upgrades in a staging cluster if possible, especially with new AKS releases or driver versions.

Handling Multiple Node Pools: Many users create one node pool with full GPUs (no MIG) for large jobs, and another MIG-enabled node pool for smaller parallel workloads. You can do so easily by repeating the steps above for each node pool, specifying different MIG profiles.
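As referenced above, here is a minimal sketch of what the two request styles can look like in a pod spec. The image, pod names, and resource names are illustrative; match the resource name to what your node actually advertises (kubectl describe node) and keep the toleration in sync with the taint applied when the node pool was created.

```bash
kubectl apply -f - <<'EOF'
# Single strategy: MIG slices are exposed as generic nvidia.com/gpu resources
apiVersion: v1
kind: Pod
metadata:
  name: mig-single-demo
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu        # matches the NoSchedule taint set on the GPU node pool
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.1.1-runtime-ubuntu22.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1
---
# Mixed strategy: request the named MIG slice instead
apiVersion: v1
kind: Pod
metadata:
  name: mig-mixed-demo
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.1.1-runtime-ubuntu22.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
EOF
```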
References

MIG User Guide
NVIDIA GPU Operator with Azure Kubernetes Service
ND-H100-v5 sizes series
Create a multi-instance GPU node pool in Azure Kubernetes Service (AKS)

Beyond Prompts: How Agentic AI is Redefining Human-AI Collaboration

The Shift from Reactive to Proactive AI

As a passionate innovator in AI education, I'm on a mission to reimagine how we learn and build with AI – looking to craft intelligent agents that move beyond simple prompts to think, plan, and collaborate dynamically. Traditional AI systems rely heavily on prompt-based interactions: you ask a question, and the model responds. These systems are reactive, limited to single-turn tasks, and lack the ability to plan or adapt. This becomes a bottleneck in dynamic environments where tasks require multi-step reasoning, memory, and autonomy.

Agentic AI changes the game. An agent is a structured system that uses a looped process to:

Think – analyze inputs, reason about tasks, and plan actions.
Act – choose and execute tools to complete tasks.
Learn – optionally adapt based on feedback or outcomes.

Unlike static workflows, agentic systems can:

Make autonomous decisions
Adapt to changing environments
Collaborate with humans or other agents

This shift enables AI to move from being a passive assistant to an active collaborator, capable of solving complex problems with minimal human intervention.

What Is Agentic AI?

Agentic AI refers to AI systems that go beyond static responses: they can reason, plan, act, and adapt autonomously. These agents operate in dynamic environments, making decisions and invoking tools to achieve goals with minimal human intervention. Frameworks that can be used for agentic AI include LangChain, Semantic Kernel, AutoGen, Crew AI, MetaGPT, and others, and they can work with models such as Azure OpenAI, Anthropic Claude, Google Gemini, Mistral AI, and Hugging Face Transformers.

Key Traits of Agentic AI

Autonomy: Agents can independently decide what actions to take based on context and goals. Unlike assistants, which support users, agents complete tasks and drive outcomes.

Memory: Agents can retain both long-term and short-term context. This enables personalized and context-aware interactions across sessions.

Planning: Semantic Kernel agents use function calling to plan multi-step tasks. The AI can iteratively invoke functions, analyze results, and adjust its strategy, automating complex workflows.

Adaptability: Agents dynamically adjust their behavior based on user input, environmental changes, or feedback. This makes them suitable for real-world applications like task management, learning assistants, or research copilots.

Frameworks That Enable Agentic AI

Semantic Kernel: A flexible framework for building agents with skills, memory, and orchestration. Supports plugins, planning, and multi-agent collaboration. More information here: Semantic Kernel Agent Architecture.

Azure AI Foundry: A managed platform for deploying secure, scalable agents with built-in governance and tool integration. More information here: Exploring the Semantic Kernel Azure AI Agent.

LangGraph: A JavaScript-compatible SDK for building agentic apps with memory and tool-calling capabilities, ideal for web-based applications. More information here: Agentic app with LangGraph or Azure AI Foundry (Node.js) - Azure App Service.

Copilot Studio: A low-code platform to build custom copilots and agentic workflows using generative AI, plugins, and orchestration. Ideal for enterprise-grade conversational agents. More information here: Building your own copilot with Copilot Studio.

Microsoft 365 Copilot: Embeds agentic capabilities directly into productivity apps like Word, Excel, and Teams, enabling contextual, multi-step assistance across workflows.
More information here: What is Microsoft 365 Copilot?

Why It Matters: Real-World Impact

Traditional generative AI is like a calculator: you input a question, and it gives you an answer. It's reactive, single-turn, and lacks context. While useful for quick tasks, it struggles with complexity, personalization, and continuity. Agentic AI, on the other hand, is like a smart teammate. It can:

Understand goals
Plan multi-step actions
Remember past interactions
Adapt to changing needs

Generative AI vs. Agentic Systems

Feature           | Generative AI      | Agentic AI
Interaction Style | One-shot responses | Multi-turn, goal-driven
Context Awareness | Limited            | Persistent memory
Task Execution    | Static             | Dynamic and autonomous
Adaptability      | Low                | High (based on feedback/input)

How Agentic AI Works – Agentic AI for Students Example

Imagine a student named Alice preparing for her final exams. She uses a Smart Study Assistant powered by agentic AI. Here's how the agent works behind the scenes:

Skills / Functions: These are the callable units of logic – the actions the agent can invoke to perform work. The assistant has functions like:

Summarize lecture notes
Generate quiz questions
Search academic papers
Schedule study sessions

Think of these as plug-and-play capabilities the agent can call when needed.

Memory: The agent remembers Alice's:

Past quiz scores
Topics she struggled with
Preferred study times

This helps the assistant personalize recommendations and avoid repeating content she already knows.

Planner: Instead of doing everything at once, the agent:

Breaks down Alice's goal ("prepare for exams") into steps
Plans a week-by-week study schedule
Decides which skills/functions to use at each stage

It's like having a tutor who builds a custom roadmap.

Orchestrator: This is the brain that coordinates everything. It decides when to use memory, which function to call, and how to adjust the plan if Alice misses a study session or scores low on a quiz. It ensures the agent behaves intelligently and adapts in real time.

Conclusion

Agentic AI marks a pivotal shift in how we interact with intelligent systems: from passive assistants to proactive collaborators. As we move beyond prompts, we unlock new possibilities for autonomy, adaptability, and human-AI synergy. Whether you're a developer, educator, or strategist, understanding agentic frameworks is no longer optional – it's foundational. Here are the high-level steps to get started with agentic AI using only official Microsoft resources, each with a direct link to the relevant documentation.

Get Started with Agentic AI

Understand Agentic AI Concepts – Begin by learning the fundamentals of AI agents, their architecture, and use cases. See: Explore the basics in this Microsoft Learn module.

Set Up Your Azure Environment – Create an Azure account and ensure you have the necessary roles (e.g., Azure AI Account Owner or Contributor). See: Quickstart guide for Azure AI Foundry Agent Service.

Create Your First Agent in Azure AI Foundry – Use the Foundry portal to create a project and deploy a default agent. Customize it with instructions and test it in the playground. See: Step-by-step agent creation in Azure AI Foundry.

Build an Agentic Web App with Semantic Kernel or Foundry – Follow a hands-on tutorial to integrate agentic capabilities into a .NET web app using Semantic Kernel or Azure AI Foundry. See: Tutorial: Build an agentic app with Semantic Kernel or Foundry.

Deploy and Test Your Agent – Use GitHub Codespaces or Azure Developer CLI to deploy your app and connect it to your agent.
Conclusion
Agentic AI marks a pivotal shift in how we interact with intelligent systems—from passive assistants to proactive collaborators. As we move beyond prompts, we unlock new possibilities for autonomy, adaptability, and human-AI synergy. Whether you're a developer, educator, or strategist, understanding agentic frameworks is no longer optional - it’s foundational.
Get Started with Agentic AI
Here are the high-level steps to get started with Agentic AI using only official Microsoft resources, each with a direct link to the relevant documentation:
Understand Agentic AI Concepts - Begin by learning the fundamentals of AI agents, their architecture, and use cases. See: Explore the basics in this Microsoft Learn module
Set Up Your Azure Environment - Create an Azure account and ensure you have the necessary roles (e.g., Azure AI Account Owner or Contributor). See: Quickstart guide for Azure AI Foundry Agent Service
Create Your First Agent in Azure AI Foundry - Use the Foundry portal to create a project and deploy a default agent. Customize it with instructions and test it in the playground. See: Step-by-step agent creation in Azure AI Foundry
Build an Agentic Web App with Semantic Kernel or Foundry - Follow a hands-on tutorial to integrate agentic capabilities into a .NET web app using Semantic Kernel or Azure AI Foundry. See: Tutorial: Build an agentic app with Semantic Kernel or Foundry
Deploy and Test Your Agent - Use GitHub Codespaces or Azure Developer CLI to deploy your app and connect it to your agent. Validate functionality using OpenAPI tools and the agent playground. See: Deploy and test your agentic app
For Further Learning:
Develop generative AI apps with Azure OpenAI and Semantic Kernel
Agentic app with Semantic Kernel or Azure AI Foundry (.NET) - Azure App Service
AI Agent Orchestration Patterns - Azure Architecture Center
Configuring Agents with Semantic Kernel Plugins
Workflows with AI Agents and Models - Azure Logic Apps
About the author: I'm Juliet Rajan, a Lead Technical Trainer and passionate innovator in AI education. I specialize in crafting gamified, visionary learning experiences and building intelligent agents that go beyond traditional prompt-based systems. My recent work explores agentic AI, autonomous copilots, and dynamic human-AI collaboration using platforms like Azure AI Foundry and Semantic Kernel.
Pantry Log–Microsoft Cognitive, IOT and Mobile App for Managing your Fridge Food Stock
First published on MSDN on Mar 06, 2018.
We are Ami Zou (CS & Math), Silvia Sapora (CS), and Elena Liu (Engineering), three undergraduate students from UCL, Imperial College London, and Cambridge University respectively.
Model Mondays S2E9: Models for AI Agents
1. Weekly Highlights This episode kicked off with the top news and updates in the Azure AI ecosystem: GPT-5 and GPT-OSS Models Now in Azure AI Foundry: Azure AI Foundry now supports OpenAI’s GPT-5 lineup (including GPT-5, GPT-5 Mini, and GPT-5 Nano) and the new open-weight GPT-OSS models (120B, 20B). These models offer powerful reasoning, real-time agent tasks, and ultra-low latency Q&A, all with massive context windows and flexible deployment via the Model Router. Flux 1 Context Pro & Flux 1.1 Pro from Black Forest Labs: These new vision models enable in-context image generation, editing, and style transfer, now available in the Image Playground in Azure AI Foundry. Browser Automation Tool (Preview): Agents can now perform real web tasks—search, navigation, form filling, and more—via natural language, accessible through API and SDK. GitHub Copilot Agent Mode + Playwright MCP Server: Debug UIs with AI: Copilot’s agent mode now pairs with Playwright MCP Server to analyze, identify, and fix UI bugs automatically. Discord Community: Join the conversation, share your feedback, and connect with the product team and other developers. 2. Spotlight On: Azure AI Agent Service & Agent Catalog This week’s spotlight was on building and orchestrating multi-agent workflows using the Azure AI Agent Service and the new Agent Catalog. What is the Azure AI Agent Service? A managed platform for building, deploying, and scaling agentic AI solutions. It supports modular, multi-agent workflows, secure authentication, and seamless integration with Azure Logic Apps, OpenAPI tools, and more. Agent Catalog: A collection of open-source, ready-to-use agent templates and workflow samples. These include orchestrator agents, connected agents, and specialized agents for tasks like customer support, research, and more. Demo Highlights: Connected Agents: Orchestrate workflows by delegating tasks to specialized sub-agents (e.g., mortgage application, market insights). Multi-Agent Workflows: Design complex, hierarchical agent graphs with triggers, events, and handoffs (e.g., customer support with escalation to human agents). Workflow Designer: Visualize and edit agent flows, transitions, and variables in a modular, no-code interface. Integration with Azure Logic Apps: Trigger workflows from 1400+ external services and apps. 3. Customer Story: Atomic Work Atomic Work showcased how agentic AI can revolutionize enterprise service management, making employees more productive and ops teams more efficient. Problem: Traditional IT service management is slow, manual, and frustrating for both employees and ops teams. Solution: Atomic Work’s “Atom” is a universal, multimodal agent that works across channels (Teams, browser, etc.), answers L1/L2 questions, automates requests, and proactively assists users. Technical Highlights: Multimodal & Cross-Channel: Atom can guide users through web interfaces, answer questions, and automate tasks without switching tools. Data Ingestion & Context: Regularly ingests up-to-date documentation and context, ensuring accurate, current answers. Security & Integration: Built on Azure for enterprise-grade security and seamless integration with existing systems. Demo: Resetting passwords, troubleshooting VPN, requesting GitHub repo access—all handled by Atom, with proactive suggestions and context-aware actions. Atom can even walk users through complex UI tasks (like generating GitHub tokens) by “seeing” the user’s screen and providing step-by-step guidance. 4. 
Key Takeaways Here are the key learnings from this episode: Agentic AI is Production-Ready: Azure AI Agent Service and the Agent Catalog make it easy to build, deploy, and scale multi-agent workflows for real-world business needs. Modular, No-Code Workflow Design: The workflow designer lets you visually create and edit agent graphs, triggers, and handoffs—no code required. Open-Source & Extensible: The Agent Catalog provides open-source templates and welcomes community contributions. Real-World Impact: Solutions like Atomic Work show how agentic AI can transform IT, HR, and customer support, making organizations more efficient and employees more empowered. Community & Support: Join the Discord and Forum to connect, ask questions, and share your own agentic AI projects. Sharda's Tips: How I Wrote This Blog Writing this blog is like sharing my own learning journey with friends. I start by thinking about why the topic matters and how it can help someone new to Azure or agentic AI. I use simple language, real examples from the episode, and organize my thoughts with GitHub Copilot to make sure I cover all the important points. Here’s the prompt I gave Copilot to help me draft this blog: Generate a technical blog post for Model Mondays S2E9 based on the transcript and episode details. Focus on Azure AI Agent Service, Agent Catalog, and real-world demos. Explain the concepts for students, add a section on practical applications, and share tips for writing technical blogs. Make it clear, engaging, and useful for developers and students. After watching the video, I felt inspired to try out these tools myself. The way the speakers explained and demonstrated everything made me believe that anyone can get started, no matter their background. My goal with this blog is to help you feel the same way—curious, confident, and ready to explore what AI and Azure can do for you. If you have questions or want to share your own experience, I’d love to hear from you. Coming Up Next Week Next week: Document Processing with AI! Join us as we explore how to automate document workflows using Azure AI Foundry, with live demos and expert guests. 1️⃣ | Register For The Livestream – Aug 18, 2025 2️⃣ | Register For The AMA – Aug 22, 2025 3️⃣ | Ask Questions & View Recaps – Discussion Forum About Model Mondays Model Mondays is a weekly series designed to help you build your Azure AI Foundry Model IQ with three elements: 5-Minute Highlights – Quick news and updates about Azure AI models and tools on Monday 15-Minute Spotlight – Deep dive into a key model, protocol, or feature on Monday 30-Minute AMA on Friday – Live Q&A with subject matter experts from Monday livestream Want to get started? Register For Livestreams – every Monday at 1:30pm ET Watch Past Replays to revisit other spotlight topics Register For AMA – to join the next AMA on the schedule Recap Past AMAs – check the AMA schedule for episode specific links Join The Community Great devs don't build alone! In a fast-paced developer ecosystem, there's no time to hunt for help. That's why we have the Azure AI Developer Community. Join us today and let's journey together! Join the Discord – for real-time chats, events & learning Explore the Forum – for AMA recaps, Q&A, and Discussion! About Me I'm Sharda, a Gold Microsoft Learn Student Ambassador interested in cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. 
In this blog series, I summarize my takeaways from each week's Model Mondays livestream.
Building custom AI Speech models with Phi-3 and Synthetic data
Introduction
In today’s landscape, speech recognition technologies play a critical role across various industries—improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:
Speech to Text (STT)
Text to Speech (TTS)
Speech Translation
Custom Neural Voice
Speaker Recognition
Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains—such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature—off-the-shelf recognition models may fall short. To achieve the best possible performance, you’ll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.
The Data Challenge: When training datasets lack sufficient diversity or volume—especially in niche domains or underrepresented speech patterns—model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.
Addressing Data Scarcity with Synthetic Data
A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft’s Phi-3.5 model and Azure’s pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale—no professional recording studio or voice actors needed.
What is Synthetic Data?
Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It’s especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:
Privacy Compliance: Train models without handling personal or sensitive data.
Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.
By incorporating synthetic data, you can fine-tune custom STT (speech-to-text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.
Overview of the Process
This blog post provides a step-by-step guide—supported by code samples—to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
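Before walking through the full pipeline, it may help to see the smallest version of the core building block: turning one line of domain text into a WAV file with the Azure Speech SDK. This is a minimal sketch, assuming the azure-cognitiveservices-speech package is installed and that SPEECH_KEY and SPEECH_REGION environment variables point at your Speech resource; the voice name is just an example.

# Minimal sketch: synthesize one domain-specific utterance to a WAV file.
# Assumes SPEECH_KEY and SPEECH_REGION are set; the voice name is illustrative.
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
speech_config.speech_synthesis_voice_name = "it-IT-ElsaNeural"  # example voice

audio_config = speechsdk.audio.AudioOutputConfig(filename="sample_utterance.wav")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

text = "Come posso risolvere un problema con il mio televisore Contoso?"
result = synthesizer.speak_text_async(text).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Saved sample_utterance.wav")
else:
    print(f"Synthesis did not complete: {result.reason}")

The hands-on repo wraps the same idea in SSML with multiple voices (Step 2 below), which is what makes it easy to scale from one utterance to thousands.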
We will cover steps 1–4 of the high-level architecture: End-to-End Custom Speech-to-Text Model Fine-Tuning Process Custom Speech with Synthetic data Hands-on Labs: GitHub Repository Step 0: Environment Setup First, configure a .env file based on the provided sample.env template to suit your environment. You’ll need to: Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry. Provision Azure AI Speech and Azure Storage account. Below is a sample configuration focusing on creating a custom Italian model: # this is a sample for keys used in this code repo. # Please rename it to .env before you can use it # Azure Phi3.5 AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct #Azure AI Speech AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx # https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt CUSTOM_SPEECH_LANG=Italian CUSTOM_SPEECH_LOCALE=it-IT # https://speech.microsoft.com/portal?projecttype=voicegallery TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural #Azure Account Storage AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx AZURE_STORAGE_CONTAINER_NAME=stt-container Key Settings Explained: AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model. AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources. CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: Specify the language and locale for the custom model. TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated Voice Names (from the Voice Gallery) for generating synthetic speech for training and evaluation. AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configurations for your Azure Storage account, where training/evaluation data will be stored. > Voice Gallery Step 1: Generating Domain-Specific Text Utterances with Phi-3.5 Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand). Code snippet (illustrative): topic = f""" Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages. """ question = f""" create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english. jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages. only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result. """ response = client.complete( messages=[ SystemMessage(content=""" Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases. Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized. Use text data that's close to the expected spoken utterances. The nummber of utterances per line should be 1. """), UserMessage(content=f""" #topic#: {topic} Question: {question} """), ], ... 
) content = response.choices[0].message.content print(content) # Prints the generated JSONL with no, locale, and content keys Sample Output (Contoso Electronics in Italian): {"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"} {"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"} {"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"} {"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"} {"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"} {"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"} {"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"} {"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"} {"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"} {"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"} These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio. Step 2: Creating the Synthetic Audio Dataset Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech’s TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples. Core Function: def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice): ssml = f"""<speak version='1.0' xmlns="https://www.w3.org/2001/10/synthesis" xml:lang='{lang}'> <voice name='{default_tts_voice}'> {html.escape(text)} </voice> </speak>""" speech_sythesis_result = speech_synthesizer.speak_ssml_async(ssml).get() stream = speechsdk.AudioDataStream(speech_sythesis_result) stream.save_to_wav_file(file_path) Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data. Note: If DELETE_OLD_DATA = True, the training_dataset folder resets each run. If you’re mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples. 
Code snippet (illustrative): import zipfile import shutil DELETE_OLD_DATA = True train_dataset_dir = "train_dataset" if not os.path.exists(train_dataset_dir): os.makedirs(train_dataset_dir) if(DELETE_OLD_DATA): for file in os.listdir(train_dataset_dir): os.remove(os.path.join(train_dataset_dir, file)) timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S") zip_filename = f'train_{lang}_{timestamp}.zip' with zipfile.ZipFile(zip_filename, 'w') as zipf: for file in files: zipf.write(os.path.join(output_dir, file), file) print(f"Created zip file: {zip_filename}") shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename)) print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}") train_dataset_path = {os.path.join(train_dataset_dir, zip_filename)} %store train_dataset_path You’ll also similarly create evaluation data using a different TTS voice than used for training to ensure a meaningful evaluation scenario. Example Snippet to create the synthetic evaluation data: import datetime print(TTS_FOR_EVAL) languages = [CUSTOM_SPEECH_LOCALE] eval_output_dir = "synthetic_eval_data" DELETE_OLD_DATA = True if not os.path.exists(eval_output_dir): os.makedirs(eval_output_dir) if(DELETE_OLD_DATA): for file in os.listdir(eval_output_dir): os.remove(os.path.join(eval_output_dir, file)) eval_tts_voices = TTS_FOR_EVAL.split(',') for tts_voice in eval_tts_voices: with open(synthetic_text_file, 'r', encoding='utf-8') as f: for line in f: try: expression = json.loads(line) no = expression['no'] for lang in languages: text = expression[lang] timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S") file_name = f"{no}_{lang}_{timestamp}.wav" get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir,file_name), lang, tts_voice) with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file: manifest_file.write(f"{file_name}\t{text}\n") except json.JSONDecodeError as e: print(f"Error decoding JSON on line: {line}") print(e) Step 3: Creating and Training a Custom Speech Model To fine-tune and evaluate your custom model, you’ll interact with Azure’s Speech-to-Text APIs: Upload your dataset (the zip file created in Step 2) to your Azure Storage container. Register your dataset as a Custom Speech dataset. Create a Custom Speech model using that dataset. Create evaluations using that custom model with asynchronous calls until it’s completed. You can also use UI-based approaches to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on, we'll use the Azure Speech-to-Text REST APIs to iterate entire processes. Key APIs & References: Azure Speech-to-Text REST APIs (v3.2) The provided common.py in the hands-on repo abstracts API calls for convenience. 
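The first of those steps, getting the training zip into your Storage container, is handled by the repo's upload_dataset_to_storage helper (used in the next snippet). As a rough illustration of what that step involves, here is a minimal sketch using the azure-storage-blob package; the container and account settings mirror the .env values above, the file name is an example, and the helper's real implementation in common.py may differ.

# Minimal sketch: upload the training zip to Azure Blob Storage and print its URL.
# The real upload_dataset_to_storage helper in common.py may differ in details.
import os
from azure.storage.blob import BlobServiceClient

account_name = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]
account_key = os.environ["AZURE_STORAGE_ACCOUNT_KEY"]
container_name = os.environ["AZURE_STORAGE_CONTAINER_NAME"]

service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=account_key,
)
container = service.get_container_client(container_name)

zip_path = "train_dataset/train_it-IT_20240101000000.zip"  # example file name
blob_name = os.path.basename(zip_path)

with open(zip_path, "rb") as data:
    container.upload_blob(name=blob_name, data=data, overwrite=True)

blob_url = f"{container.url}/{blob_name}"
print(f"Uploaded {blob_name} to {blob_url}")

In practice the Custom Speech dataset API needs a URL it can read, so the helper also has to make the blob accessible, for example by appending a SAS token; that detail is omitted here.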
Example Snippet to create training dataset: uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key) kind="Acoustic" display_name = "acoustic dataset(zip) for training" description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model" zip_dataset_dict = {} for display_name in uploaded_files: zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE) You can monitor training progress using monitor_training_status function which polls the model’s status and updates you once training completes Core Function: def monitor_training_status(custom_model_id): with tqdm(total=3, desc="Running Status", unit="step") as pbar: status = get_custom_model_status(base_url, headers, custom_model_id) if status == "NotStarted": pbar.update(1) while status != "Succeeded" and status != "Failed": if status == "Running" and pbar.n < 2: pbar.update(1) print(f"Current Status: {status}") time.sleep(10) status = get_custom_model_status(base_url, headers, custom_model_id) while(pbar.n < 3): pbar.update(1) print("Training Completed") Step 4: Evaluate Trained Custom Speech After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model’s WER. Key Steps: Use create_evaluation function to evaluate the custom model against your test set. Compare evaluation metrics between base and custom models. Check WER to quantify accuracy improvements. After evaluation, you can view the evaluation results of the base model and the fine-tuning model based on the evaluation dataset created in the 1_text_data_generation.ipynb notebook in either Speech Studio or the AI Foundry Fine-Tuning section, depending on the resource location you specified in the configuration file. Example Snippet to create evaluation: description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model" evaluation_ids={} for display_name in uploaded_files: evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE) Also, you can see a simple Word Error Rate (WER) number in the code below, which you can utilize in 4_evaluate_custom_model.ipynb. Example Snippet to create WER dateframe: # Collect WER results for each dataset wer_results = [] eval_title = "Evaluation Results for base model and custom model: " for display_name in uploaded_files: eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name]) eval_title = eval_title + display_name + " " wer_results.append({ 'Dataset': display_name, 'WER_base_model': eval_info['properties']['wordErrorRate1'], 'WER_custom_model': eval_info['properties']['wordErrorRate2'], }) # Create a DataFrame to display the results print(eval_info) wer_df = pd.DataFrame(wer_results) print(eval_title) print(wer_df) About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training. You'll also similarly create a WER result markdown file using the md_table_scoring_result method below. 
Core Function:
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)
Implementation Considerations
The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.
Conclusion
By combining Microsoft’s Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:
Rapidly produce large volumes of specialized training and evaluation data.
Substantially reduce the time and cost associated with recording real audio.
Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.
As you continue exploring Azure’s AI and speech services, you’ll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions—without the overhead of large-scale data collection efforts. 🙂
Reference
Azure AI Speech Overview
Microsoft Phi-3 Cookbook
Text to Speech Overview
Speech to Text Overview
Custom Speech Overview
Customize a speech model with fine-tuning in the Azure AI Foundry
Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)