Azure Machine Learning
Fine-tuning gpt-oss-20b Now Available on Managed Compute
Earlier this month, we made available OpenAI’s open‑source model gpt‑oss on Azure AI Foundry and Windows AI Foundry. Today, you can fine-tune gpt‑oss‑20b using Managed Compute on Azure — available in preview and accessible via notebook.
Azure Machine Learning now supports Large-Scale AI Training and Inference with ND H200 v5 VMs

TL;DR: Azure Machine Learning now offers ND H200 v5 VMs accelerated by NVIDIA H200 Tensor Core GPUs, purpose-built to train and serve modern generative AI more efficiently at cloud scale. With massive on-GPU memory and high intra-node bandwidth, you can fit larger models and batches, keep tensors local, and cut cross-GPU transfers, doing more with fewer nodes. Start with a single VM or scale out to hundreds in a managed cluster to capture cloud economics, while Azure's AI-optimized infrastructure delivers consistent performance across training and inference.

Why this matters

The AI stack is evolving with bigger parameter counts, longer context windows, multimodal pipelines, and production-scale inference. ND H200 v5 on Azure ML is designed to address these needs with a memory-first, network-optimized, and workflow-friendly approach, enabling data science and MLOps teams to move from experiment to production efficiently.

Memory, the real superpower

At the heart of each ND H200 v5 VM are eight NVIDIA H200 GPUs, each packing 141 GB of HBM3e memory, a 76% increase in HBM capacity over the H100. That means you can process more per GPU: larger models, more tokens, and better performance. Aggregated across all eight GPUs, that is a massive 1,128 GB of GPU memory per VM.

- HBM3e throughput: 4.8 TB/s per GPU ensures continuous data flow, preventing compute starvation.
- Larger models with fewer compromises: Accommodate wider context windows, larger batch sizes, deeper expert mixtures, or higher-resolution vision tokens without needing aggressive sharding or offloading techniques.
- Improved scaling: Increased on-GPU memory reduces cross-device communication and enhances step-time stability.

Built to scale, within a VM and across the cluster

When training across multiple GPUs, communication speed is crucial.

- Inside the VM: Eight NVIDIA H200 GPUs are linked via NVIDIA NVLink, delivering 900 GB/s of bidirectional bandwidth per GPU for ultra-fast all-reduce and model-parallel operations with minimal synchronization overhead.
- Across VMs: Each instance comes with eight 400 Gb/s NVIDIA ConnectX-7 InfiniBand adapters connecting to NVIDIA Quantum-2 InfiniBand switches, totaling 3.2 Tb/s interconnect per VM.
- GPUDirect RDMA: Enables data to move GPU-to-GPU across nodes with lower latency and lower CPU overhead, which is essential for distributed data/model/sequence parallelism.

The result is near-linear scaling characteristics for many large-model training and fine-tuning workloads.

Built into Azure ML workflows (no friction)

Azure Machine Learning integrates ND H200 v5 with the tools your teams already use:

- Frameworks: PyTorch, TensorFlow, JAX, and more
- Containers: Optimized Docker images available via Azure Container Registry
- Distributed training: NVIDIA NCCL fully supported to maximize performance of NVLink and InfiniBand

Bring your existing training scripts, launch distributed runs, and integrate into pipelines, registries, managed endpoints, and MLOps with minimal change.

Real-world gains

Early benchmarks show up to 35% throughput improvements for large language model inference compared to the previous generation, particularly on models like Llama 3.1 405B. The increased HBM capacity allows for larger inference batches, improving utilization and cost efficiency. For training, the combination of additional memory and higher bandwidth supports larger models or more data per step, often reducing overall training time.
Your mileage will vary by model architecture, precision, parallelism strategy, and data loader efficiency, but the headroom is real.

Quick spec snapshot

- GPUs: 8× NVIDIA H200 Tensor Core GPUs
- HBM3e: 141 GB per GPU (1,128 GB per VM)
- HBM bandwidth: 4.8 TB/s per GPU
- Inter-GPU: NVIDIA NVLink 900 GB/s (intra-VM)
- Host: 96 vCPUs (Intel Xeon Sapphire Rapids), 1,850 GiB RAM
- Local storage: 28 TB NVMe SSD
- Networking: 8× 400 Gb/s NVIDIA ConnectX-7 InfiniBand adapters (3.2 Tb/s total) with GPUDirect RDMA

Getting started (it's just a CLI away)

Create an auto-scaling compute cluster in Azure ML:

az ml compute create \
  --name h200-training-cluster \
  --size Standard_ND96isr_H200_v5 \
  --min-instances 0 \
  --max-instances 8 \
  --type amlcompute

Auto-scaling means you only pay for what you use, which is perfect for research bursts, scheduled training, and production inference with variable demand.

What you can do now

- Train foundation models with larger batch sizes and longer sequences
- Fine-tune LLMs with fewer memory workarounds, reducing the need for offloading and resharding
- Deploy high-throughput inference for chat, RAG, MoE, and multimodal use cases
- Accelerate scientific and simulation workloads that require high bandwidth and memory

Pro tips to unlock performance

- Optimize HBM usage: Increase batch size/sequence length until you approach the HBM bandwidth limit of approximately 4.8 TB/s per GPU.
- Utilize parallelism effectively: Combine tensor/model parallelism (NVLink-aware) with data parallelism across nodes (InfiniBand + GPUDirect RDMA).
- Optimize your input pipeline: Parallelize tokenization/augmentation, and store frequently accessed data on local NVMe to prevent GPU stalls.
- Leverage NCCL: Configure your communication backend to take advantage of the topology, using NVLink intra-node and InfiniBand inter-node.

The bottom line

This is more than a hardware bump; it's a platform designed for the next wave of AI. With ND H200 v5 on Azure ML, you gain the memory capacity, network throughput, and operational simplicity needed to transform ambitious models into production-grade systems. For comprehensive technical specifications and deployment guidance, visit the official ND H200 v5 documentation and explore our detailed announcement blog for additional insights and use cases.
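If you prefer the Python SDK over the CLI, the same cluster can be targeted from a distributed training job. The following is a minimal sketch using the Azure ML Python SDK (azure-ai-ml); the workspace details, script folder, and curated environment name are placeholders and assumptions, while the cluster name matches the CLI example above.

# Minimal sketch: submit a distributed PyTorch job to the H200 cluster (assumed names/paths).
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Placeholders: substitute your own subscription, resource group, and workspace.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                                   # folder containing train.py (hypothetical)
    command="python train.py --epochs 1",
    environment="azureml:acpt-pytorch-2.2-cuda12.1@latest",  # assumed curated environment name
    compute="h200-training-cluster",                # cluster created by the CLI snippet above
    instance_count=2,                               # two ND H200 v5 nodes
    distribution={"type": "PyTorch", "process_count_per_instance": 8},  # 8 GPUs per node
    display_name="h200-distributed-training",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # monitor the run in Azure ML studio

The distribution settings mirror the hardware layout: one process per GPU inside a node, data parallelism across nodes over InfiniBand.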
Connecting Azure Kubernetes Service Cluster to Azure Machine Learning for Multi-Node GPU Training

TLDR

Create an Azure Kubernetes Service cluster with GPU nodes and connect it to Azure Machine Learning to run distributed ML training workloads. This integration provides a managed data science platform while maintaining Kubernetes flexibility under the hood, enables multi-node training that spans multiple GPUs, and bridges the gap between infrastructure and ML teams. The solution works for both new and existing clusters, supporting specialized GPU hardware and hybrid scenarios.

Why Should You Care?

Integrating Azure Kubernetes Service (AKS) clusters with GPUs into Azure Machine Learning (AML) offers several key benefits:

- Utilize existing infrastructure: Leverage your existing AKS clusters with GPUs via a managed data science platform like AML
- Flexible resource sharing: Allow both AKS workloads and AML jobs to access the same GPU resources
- Organizational alignment: Bridge the gap between infrastructure teams (who prefer AKS) and ML teams (who prefer AML)
- Hybrid scenarios: Connect on-premises GPUs to AML using Azure Arc in a similar way to this tutorial

We focus on multi-node training because most larger training jobs need it. If you only need a single GPU or a single VM, the same steps apply with fewer nodes.

Prerequisites

Before you begin, ensure you have:

- An Azure subscription with privileges to create and manage AKS clusters and add compute targets in AML. We recommend placing the AKS and AML resources in the same region.
- Sufficient quota for GPU compute resources. Check this article on how to request quota: How to Increase Quota for Specific Types of Azure Virtual Machines. We are using two Standard_NC8as_T4_v3 nodes; you can also opt for other GPU-enabled compute.
- Azure CLI version 2.24.0 or higher (az upgrade)
- Azure CLI k8s-extension version 1.2.3 or higher (az extension update --name k8s-extension)
- kubectl installed and updated

Step 1: Create an AKS Cluster with GPU Nodes

For Windows users, it's recommended to use WSL (Ubuntu 22.04 or similar).

# Login to Azure
az login

# Create resource group
az group create -n ResourceGroup -l francecentral

# Create AKS cluster with a system node pool
az aks create -g ResourceGroup -n MyCluster \
  --node-vm-size Standard_D16s_v5 \
  --node-count 2 \
  --enable-addons monitoring

# Get cluster credentials
az aks get-credentials -g ResourceGroup -n MyCluster

# Add GPU node pool (Spot instances are not recommended)
az aks nodepool add \
  --resource-group ResourceGroup \
  --cluster-name MyCluster \
  --name gpupool \
  --node-count 2 \
  --vm-size standard_nc8as_t4_v3

# Verify cluster configuration
kubectl get namespaces
kubectl get nodes

Step 2: Install NVIDIA Device Plugin

Next, we need to make sure that our GPUs work as expected. The NVIDIA Device Plugin is a Kubernetes plugin that enables the use of NVIDIA GPUs in containers running on Kubernetes clusters. It acts as a bridge between Kubernetes and the physical GPU hardware.

Apply the NVIDIA device plugin to enable GPU access within AKS:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

To confirm that the GPUs are working as expected, follow the steps and run a test workload as described in "Use GPUs on Azure Kubernetes Service (AKS) - Azure Kubernetes Service | Microsoft Learn".

Step 3: Register the KubernetesConfiguration Provider

The KubernetesConfiguration provider enables Azure to deploy and manage extensions on Kubernetes clusters, including the Azure Machine Learning extension.
Before installing extensions, ensure the required resource provider is registered:

# Install the k8s-extension Azure CLI extension
az extension add --name k8s-extension

# Check if the provider is already registered
az provider list --query "[?contains(namespace,'Microsoft.KubernetesConfiguration')]" -o table

# If not registered, register it
az provider register --namespace Microsoft.KubernetesConfiguration

az account set --subscription <YOUR-AZURE-SUBSCRIPTION-ID>
az feature registration create --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes

# Check the status after a few minutes and wait until it shows Registered
az feature show --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes

# Install the Dapr extension
az k8s-extension create --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup \
  --name dapr \
  --extension-type Microsoft.Dapr \
  --auto-upgrade-minor-version false

You can also check out the "Before you begin" section in "Install the Dapr extension for Azure Kubernetes Service (AKS) and Arc-enabled Kubernetes - Azure Kubernetes Service | Microsoft Learn".

Step 4: Deploy the Azure Machine Learning Extension

Install the AML extension on your AKS cluster for training:

az k8s-extension create \
  --name azureml-extension \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True enableInference=False \
  --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup \
  --scope cluster

Several options are available for the extension installation; they are listed in "Deploy Azure Machine Learning extension on Kubernetes cluster - Azure Machine Learning | Microsoft Learn".

Verify Extension Deployment

az k8s-extension show \
  --name azureml-extension \
  --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup

kubectl get pods -n azureml

The extension is successfully deployed when the provisioning state shows "Succeeded" and all pods in the "azureml" namespace are in the "Running" state.

Step 5: Create a GPU-Enabled Instance Type

By default, AML only has access to an instance type that doesn't include GPU resources. Create a custom instance type to utilize your GPUs:

# Create a custom instance type definition
cat > t4-full-node.yaml << EOF
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: t4-full-node
spec:
  nodeSelector:
    agentpool: gpupool
    kubernetes.azure.com/accelerator: nvidia
  resources:
    limits:
      cpu: "6"
      nvidia.com/gpu: 2  # Integer value equal to the number of GPUs
      memory: "55Gi"
    requests:
      cpu: "6"
      memory: "55Gi"
EOF

# Apply the instance type
kubectl apply -f t4-full-node.yaml

This configuration creates an instance type that allocates two T4 GPUs per workload, making it well suited for ML training jobs.

Step 6: Attach the Cluster to Azure Machine Learning

Once your instance type is created, you can attach the AKS cluster to your AML workspace:

1. In the Azure Machine Learning Studio, navigate to Compute > Kubernetes clusters
2. Click New and select your AKS cluster
3. Specify your custom instance type ("t4-full-node") when configuring the compute target
4. Complete the attachment process following the UI workflow

Alternatively, you can use the Azure CLI or Python SDK to attach the cluster programmatically; see "Attach a Kubernetes cluster to Azure Machine Learning workspace - Azure Machine Learning | Microsoft Learn".
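For reference, here is a minimal sketch of the programmatic attach path mentioned above, using the Azure ML Python SDK (azure-ai-ml). The AKS resource ID, workspace details, and compute name are placeholders and should match the cluster and extension you deployed.

# Minimal sketch: attach the AKS cluster as a Kubernetes compute target (assumed names).
from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<aml-workspace>",
)

# Full Azure resource ID of the AKS cluster created in Step 1 (placeholder).
aks_resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/ResourceGroup"
    "/providers/Microsoft.ContainerService/managedClusters/MyCluster"
)

k8s_compute = KubernetesCompute(
    name="aks-gpu-compute",       # compute target name to reference in AML jobs
    resource_id=aks_resource_id,
    namespace="azureml",          # namespace used by the AML extension
)

ml_client.begin_create_or_update(k8s_compute).result()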
Step 7: Test Distributed Training

With your GPU-enabled AKS cluster now attached to AML, you can:

- Create an AML experiment that uses distributed training
- Specify your custom instance type in the training configuration
- Submit the job to take advantage of multi-node GPU capabilities

You can now run advanced ML workloads like distributed deep learning, which requires multiple GPUs across nodes, all managed through the AML platform. To submit such a job, you simply need to list the compute name, the registered instance_type, and the number of instances.

As an example, clone yuvmaz/aml_labs (Labs to showcase the capabilities of Azure ML) and switch to Lab 4 - Foundations of Distributed Deep Learning. Lab 4 introduces how distributed training works in general and in AML. In the Jupyter Notebook that guides you through that tutorial, you will find that the first job definition is in simple_environment.yaml. Open this file and make the following adjustments to use the AKS compute target:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: env | sort | grep -e 'WORLD' -e 'RANK' -e 'MASTER' -e 'NODE'
environment:
  image: library/python:latest
distribution:
  type: pytorch
  process_count_per_instance: 2  # 2 GPUs per node (intra-node)
compute: azureml:<Kubernetes-compute-target-name>
resources:
  instance_count: 2  # 2 VMs/instances in total (cross-node)
  instance_type: <instance-type-name>  # e.g. t4-full-node
display_name: simple-env-vars-display
experiment_name: distributed-training-foundations

You can proceed in the same way for all other distributed training jobs.

Conclusion

By integrating AKS clusters with GPUs into Azure Machine Learning, you get the best of both worlds: the container orchestration and infrastructure capabilities of Kubernetes combined with the ML workflow management features of AML. This setup is particularly valuable for organizations that want to:

- Maximize GPU utilization across both operational and ML workloads
- Provide data scientists with self-service access to GPU resources
- Establish a consistent ML platform that spans both cloud and on-premises resources

For production deployments, consider implementing additional security measures, networking configurations, and monitoring solutions appropriate for your organization's requirements.

Thanks a lot to Yuval Mazor and Alan Weaver for their collaboration on this blog post.
Distributed Databases: Adaptive Optimization with Graph Neural Networks and Causal Inference

This blog post introduces a new adaptive framework for distributed databases that leverages Graph Neural Networks (GNNs) and causal inference to overcome the classic limitations imposed by the CAP theorem. Traditional distributed systems often rely on static policies for consistency, availability, and partitioning, which struggle to keep up with rapidly changing workloads and data relationships. The proposed GNN-based approach models the complex, interconnected nature of distributed databases, enabling predictive consistency management, intelligent load balancing for availability, and dynamic, graph-aware partitioning. By integrating temporal modeling and reinforcement learning, the framework adapts in real time, delivering significant improvements in latency, load balancing, and partition efficiency across real-world and synthetic benchmarks. This marks a major step toward intelligent, self-optimizing database systems that can meet the demands of modern applications.
RAFT: A new way to teach LLMs to be better at RAG

In this article, we will look at the limitations of RAG and domain-specific fine-tuning to adapt LLMs to existing knowledge, and how a team of UC Berkeley researchers, Tianjun Zhang and Shishir G. Patil, may have just discovered a better approach.
Transforming Customer Support with Azure OpenAI, Azure AI Services, and Voice AI Agents

Customer support today is under immense pressure to meet the rising expectations of speed, personalization, and always-on availability. Yet, businesses still struggle with:

1. Long wait times and call center queues
2. Disconnected support channels
3. Limited availability of agents outside business hours
4. Repetitive issues consuming valuable human time
5. Frustrated users due to lack of immediate and contextual answers

These inefficiencies are costing businesses over $3.7 trillion annually in poor service delivery, while over 70% of agents (based on the research) spend excessive time searching for the right answers instead of resolving problems directly.

How Voice AI Agents Are Transforming the Support Experience

Enter the era of voice-enabled AI agents, powered by Azure OpenAI, Azure AI Services, and ServiceNow, designed to completely transform the way customers engage with support systems. These agents can now:

- Handle complex user queries in natural language
- Access enterprise systems (like CRM, ITSM, HR) in real time
- Automate repetitive tasks such as password resets, ticket status updates, or return tracking
- Escalate only when human assistance is truly needed
- Create connected, seamless, and intelligent support experiences across departments

Let's take a closer look at four architecture patterns that showcase how enterprises can deploy these agents effectively.

🔷 Architecture Pattern 1: Unified Voice Agent with Azure AI + ServiceNow + CRM Integration

In this architecture, the customer support journey begins when a user initiates a voice-based conversation through a front-end interface such as a web application, mobile app, or smart device. The captured audio is streamed directly to Azure OpenAI GPT-4o's real-time API, which performs immediate speech-to-text transcription, interprets the intent behind the request, and prepares the initial system response, all in a single seamless stream.

Once the user's intent is understood (e.g., "create a ticket", "check incident status", or "list recent issues"), GPT-4o passes control to Semantic Kernel, which orchestrates the next steps through function calling. Semantic Kernel hosts pre-defined tools (functions) that map to ServiceNow API actions, such as createIncident, getIncidentStatus, listIncidents, or searchKnowledgeBase. These function calls are then securely routed to ServiceNow via REST APIs.

ServiceNow executes the appropriate actions, whether that is creating a new support ticket, retrieving the status of an open incident, or searching its Knowledge Base. CRM data is also seamlessly accessed, if needed, to enrich responses with personalized context such as customer history or case metadata. The result from ServiceNow (e.g., an incident ID or KB article summary) is then sent back to Azure GPT-4o, which converts the structured data into a natural spoken response. This final audio output is delivered to the user in real time, completing the end-to-end conversational loop.

Additionally, tools like Azure Monitor or Application Insights can be integrated to log telemetry, track usage trends, monitor latency, and analyze user satisfaction over time. This architecture enables organizations to streamline customer support operations, reduce wait times, and deliver natural, intelligent assistance across any channel, voice-first.
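To make the orchestration step more concrete, here is a minimal, hypothetical sketch of a ServiceNow plugin exposed to Semantic Kernel via function calling. The instance URL, credentials, and field names are placeholders; a real integration would use your instance's REST API with proper authentication and error handling.

# Minimal sketch: a ServiceNow "createIncident" tool for Semantic Kernel (placeholder endpoint/auth).
import requests
from semantic_kernel import Kernel
from semantic_kernel.functions import kernel_function

class ServiceNowPlugin:
    def __init__(self, instance_url: str, auth: tuple[str, str]):
        self.instance_url = instance_url  # e.g. "https://<your-instance>.service-now.com" (placeholder)
        self.auth = auth

    @kernel_function(name="createIncident", description="Create a ServiceNow incident and return its number.")
    def create_incident(self, short_description: str, urgency: str = "3") -> str:
        # ServiceNow Table API call; fields and error handling simplified for illustration.
        resp = requests.post(
            f"{self.instance_url}/api/now/table/incident",
            auth=self.auth,
            json={"short_description": short_description, "urgency": urgency},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["result"]["number"]

# Registration: the GPT-4o-backed kernel can now invoke createIncident via function calling.
kernel = Kernel()
kernel.add_plugin(
    ServiceNowPlugin("https://<your-instance>.service-now.com", ("<user>", "<password>")),
    plugin_name="servicenow",
)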
🔷 Architecture Pattern 2: Scalable Customer Support with Multi-Agent Voice Architecture

This architecture introduces a modular and distributed agent-based design to deliver intelligent, scalable customer support through a voice interface. The process starts with the User Proxy Agent, which acts as the entry point for all user conversations. It captures voice input and forwards the request to the Master Agent, which serves as the brain of the architecture. The Master Agent, empowered with a large language model (LLM) and memory, interprets the intent behind the user's input and dynamically routes the request to the most appropriate domain-specific agent. These include specialized agents such as the Activation Agent, Root Agent, Sales Agent, or Technical Agent, each designed to handle specific workflows or business tasks.

- The Activation Agent connects to web services and handles provisioning or onboarding scenarios.
- The Root Agent taps into document search systems (like Azure Cognitive Search) to answer questions grounded in internal documentation.
- The Sales Agent is equipped with structured logic models (SLMs) and CRM access to retrieve sales-related data from backend databases.
- The Technical Agent is containerized via Docker and built to manage backend diagnostics, code-level issues, or infrastructure status, often connecting to systems like ServiceNow for real-time ITSM execution.

Once the task is executed by the respective agent, results are passed back through the Master Agent and ultimately to the User Proxy Agent, which synthesizes the output into a voice response and delivers it to the user. The presence of shared memory between agents allows for maintaining context across multi-turn conversations, enabling complex, multi-step interactions (e.g., "Create a ticket, check the latest order status, and escalate it if unresolved.") without breaking continuity.

This architecture is ideal for enterprises looking to scale customer support horizontally, adding new agents without disrupting existing workflows. It enables parallelism, specialization, and real-time orchestration, providing faster resolutions while reducing the burden on human agents. It is best suited for distributed support operations across IT, HR, sales, and field support, where task-specific intelligence and modular scale are critical.

🔷 Architecture Pattern 3: Customer Support Reinvented with Voice RAG + Azure AI + ServiceNow

This architecture brings a cutting-edge twist to Retrieval-Augmented Generation (RAG) by enabling it through a voice AI agent, creating a truly conversational experience grounded in enterprise knowledge. By combining Azure OpenAI models with the ServiceNow Knowledge Base, this pattern ensures accurate, voice-driven support for employees or customers in real time.

The process begins when a user interacts with a voice-enabled interface, via phone, web, or embedded assistant. The voice AI agent streams the audio to Azure OpenAI GPT-4o, which transcribes the voice input, understands the intent, and then triggers a RAG pipeline. Instead of relying solely on the model's internal memory, the system performs a real-time query against the ServiceNow Product Knowledge Base, retrieving relevant knowledge articles, troubleshooting guides, or support workflows. These results are embedded directly into the prompt, creating an enriched context that is passed to the language model via Azure AI Foundry.
The model then generates a natural, contextually accurate spoken response, which is converted back into audio and voiced to the user, creating a seamless end-to-end Voice RAG experience. This approach ensures that responses are not only conversational but also deeply grounded in trusted enterprise knowledge. It is ideal for helpdesk automation, HR support, and IT troubleshooting, where users prefer speaking naturally and need verified, document-backed responses in real time.

🔷 Architecture Pattern 4: Conversational Customer Support with AI Avatars and Azure AI

This architecture delivers rich, conversational experiences by integrating AI avatars, Azure AI, and ServiceNow to offer human-like, intelligent customer support across channels. It merges natural speech, facial expression, and enterprise data to create a highly engaging support assistant.

The interaction begins when a user speaks with an AI avatar application, whether embedded in a web portal, mobile device, or kiosk. The voice is captured and processed through a speech-to-text pipeline, which feeds the Avatar Module and Live Discussions Engine to manage lip-sync, emotional tone, and turn-taking. Behind the scenes, the avatar is connected to Azure AI services, including Custom Neural Voice (CNV) and Azure OpenAI, which enable the avatar to understand intent and generate responses in natural, conversational language.

Most critically, the system integrates directly with the ServiceNow platform. Through secure APIs, the avatar queries ServiceNow to:

- Retrieve case status updates
- Provide summaries of incident history
- Look up Knowledge Base articles
- Trigger incident creation if needed

These ServiceNow results are then passed through the text-to-speech module, with support for multilingual voice synthesis, and rendered by the avatar using expressive animation. Responses are visually delivered as live or pre-rendered avatar videos, creating a truly interactive and personalized experience. This pattern not only answers basic questions but also surfaces dynamic enterprise data, turning the AI avatar into a frontline voice agent capable of real-time, connected support across IT, HR, or customer service domains. It is best for branded digital experiences, frontline support stations, or HR/IT helpdesk automation where facial presence, empathy, and backend integration are essential.

✨ Closing Thoughts: The Future of Customer Support Is Here

Customer expectations have evolved, and so must the way we deliver support. By combining the power of Azure OpenAI, Azure AI Services, and ServiceNow, we're not just automating tasks, we're reinventing how organizations connect with their users. Whether it's:

- A unified voice agent handling IT tickets and CRM queries,
- A multi-agent architecture scaling across departments,
- A voice-enabled RAG system delivering knowledge-grounded answers in real time, or
- A human-like AI avatar offering face-to-face support,

these architectures are driving a new era of intelligent, conversational, and scalable customer service.

👉 Join us at the Microsoft Booth during ServiceNow Knowledge 2025 (starting May 6th) to experience these solutions live, explore the tech behind them, and imagine how they can transform your business. Let's build the future of support, together.
Hubs and Workspaces on Azure Machine Learning – General Availability

We are pleased to announce that hubs and workspaces are now generally available on Azure Machine Learning, allowing teams to use a hub as a shared collaboration environment for machine learning applications. Azure hubs and workspaces provide a centralized platform capability for Azure Machine Learning. This feature enables developers to innovate faster by creating project workspaces and accessing shared company resources without needing repeated assistance from IT administrators.

Quick Model Building and Experimentation Without IT Bottlenecks

Hubs and workspaces in Azure Machine Learning provide a centralized solution for managing machine learning resources. Hubs act as a central resource management construct that oversees security, connectivity, computing resources, and team quotas. Once created, they allow developers to create individual workspaces to manage their tasks while adhering to IT setup guidelines.

Key Benefits

- Centralized Management: Hubs allow for centralized settings such as connectivity, compute resources, and security, making it easier for IT admins to manage resources and monitor costs.
- Cost Efficiency: Utilizing a hub workspace for sharing and reusing configurations enhances cost efficiency when deploying Azure Machine Learning at scale. There is a cost associated with setting up a separate firewall per workspace, which grows as the number of workspaces increases. With hubs, only one firewall is needed; it extends across workspaces and saves cost.
- Resource Management: Hubs provide a single pool of compute across workspaces at the user level, eliminating repetitive compute setup and duplicate management steps. This ensures higher utilization of available capacity and a fair share of compute resources.
- Improved Security and Compliance: Hubs act as security boundaries, ensuring that different teams can work in isolated environments without compromising security.
- Simplified Workspace Creation: Hubs allow for the creation of lightweight workspaces in a single step by an ML professional.
- Enhanced Collaboration: Hubs enable better collaboration among data scientists by providing a centralized platform for managing projects and resources.

How to Get Started with Hubs and Projects

There are different ways to create hubs. You can create hubs via the Azure portal, with Azure Resource Manager templates, or via the Azure Machine Learning SDK/CLI. Hub properties like networking, monitoring, encryption, and identity can be customized when creating a hub and set according to your organization's requirements. Workspaces associated with a hub share the hub's security, connectivity, and compute resources. While creating hubs via ML Studio is not currently supported, once a hub is created, users can create workspaces that get shared access to the company resources made available by the administrator, including compute, security, and connections. Besides ML Studio, workspaces can be created using the Azure SDK, automation templates, or the Azure CLI.

Secure Access for Azure Resources

For accessing data sources outside hubs, connections help make data available to Azure Machine Learning. External sources like Snowflake DB, Amazon S3, and Azure SQL DB can be connected to AML resources. Users can also set access permissions to Azure resources with role-based access controls. Besides the default built-in roles, users can create custom roles for more granular access.
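As a rough illustration of the SDK path mentioned above, the sketch below creates a hub and an associated project workspace with the Azure ML Python SDK (azure-ai-ml). The Hub and Project entity names reflect recent SDK versions and should be treated as an assumption; all resource names are placeholders.

# Minimal sketch: create a hub and a project workspace via the Python SDK (entity names assumed).
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Hub, Project
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
)

# The hub centralizes security, connectivity, and compute settings for its workspaces.
hub = Hub(name="team-hub", location="eastus", display_name="Team Hub")
created_hub = ml_client.workspaces.begin_create(hub).result()

# A project workspace inherits the hub's shared configuration and resources.
project = Project(name="fraud-detection", hub_id=created_hub.id, display_name="Fraud Detection")
ml_client.workspaces.begin_create(project).result()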
To conclude, the general availability of Azure Machine Learning hubs and workspaces marks a significant milestone in our commitment to providing scalable, secure, and efficient machine learning solutions. We look forward to seeing how our customers leverage this new feature to drive innovation and achieve their business goals. For more information on hubs and workspaces in Azure Machine Learning, please refer to the following links:

- What are Azure hubs and workspaces - AML
- Manage AML hub workspaces in the portal
- Create a hub using AML SDK and CLI
Unlock Multi-Modal Embed 4 and Multilingual Agentic RAG with Command A on Azure

Developers and enterprises now have immediate access to state-of-the-art generative and semantic models purpose-built for RAG (Retrieval-Augmented Generation) and agentic AI workflows on Azure AI Foundry to:

- Deploy high-performance LLMs and semantic search engines directly into production
- Build faster, more scalable, and multilingual RAG pipelines
- Leverage models that are optimized for enterprise workloads in finance, healthcare, government, and manufacturing

Cohere Embed 4: High-Performance Embeddings for Search & RAG

Accompanying Command A is Cohere's Embed 4, a cutting-edge embedding model ideal for retrieval-augmented generation pipelines and semantic search. Embed 4 (the latest evolution of Cohere's Embed series) converts text, and even images, into high-dimensional vector representations that capture semantic meaning. It's a multi-modal, multilingual embedding model designed to provide recall and relevance in vector search, text classification, and clustering tasks.

What makes Embed 4 stand out?

- 100+ Language Support: This model is truly global; it supports well over 100 languages for text embeddings. You can encode queries and documents in many languages (Arabic, Chinese, French, Hindi, etc.) into the same vector space, enabling cross-lingual search out of the box. For example, a question in Spanish can retrieve a relevant document originally in English if their ideas align semantically.
- Multi-Modal Embeddings: Embed 4 is capable of embedding not only text but also images. This means you can use it for multimodal search scenarios, e.g. indexing both textual content and images and allowing queries across them. Under the hood, the model has an image encoder; the Azure AI Foundry SDK provides an ImageEmbeddingsClient to generate embeddings from images. With this, you could embed a diagram or a screenshot and find text documents that are semantically related to that image's content.
- Matryoshka Embeddings (Scalable Dimensions): A novel feature in Cohere's Embed 4 is Matryoshka Representation Learning, which produces embeddings that can be truncated to smaller sizes with minimal loss in fidelity. In practice, the model can output a high-dimensional vector (e.g. 768 or 1024 dims), but you have the flexibility to use just the first 64, 128, 256, etc. dimensions if needed. These "nested" embeddings mean you can choose a vector size that balances accuracy vs. storage/query speed; smaller vectors save memory and compute while still preserving most of the semantic signal. This is great for enterprise deployments where vector database size and latency are critical.
- Enterprise Optimizations: Cohere has optimized Embed 4 for production use. It supports int8 quantization and binary embedding output natively, which can drastically reduce storage footprint and speed up similarity search with only minor impact on accuracy (useful for very large indexes). The model is also trained on massive datasets (including domain-specific data) to ensure robust performance on noisy enterprise text. It achieves state-of-the-art results on benchmark evaluations like MTEB, meaning you get retrieval quality on par with or better than other leading embedding models (OpenAI, Google, etc.). For instance, Cohere's previous embed model was top-tier on cross-language retrieval tasks, and Embed 4 further improves on that foundation.
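As a quick illustration of what calling a deployed embedding endpoint might look like, here is a minimal sketch using the azure-ai-inference EmbeddingsClient against a serverless endpoint. The endpoint URL and key are placeholders, and the Matryoshka-style truncation at the end is a client-side simplification.

# Minimal sketch: generate text embeddings from an Embed 4 serverless endpoint (placeholder endpoint/key).
from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential

client = EmbeddingsClient(
    endpoint="https://<your-embed-4-endpoint>",   # placeholder serverless endpoint URL
    credential=AzureKeyCredential("<your-api-key>"),
)

documents = [
    "Quarterly revenue grew 12% driven by cloud services.",
    "Los ingresos trimestrales crecieron un 12% impulsados por los servicios en la nube.",  # cross-lingual pair
]

response = client.embed(input=documents)
vectors = [item.embedding for item in response.data]

# Matryoshka-style usage (simplified): keep only the first 256 dimensions to shrink the index.
truncated = [v[:256] for v in vectors]
print(len(vectors[0]), len(truncated[0]))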
Cohere Command A: Generative Model for Enterprise AI

Command A is Cohere's latest flagship large language model, designed for high-performance text generation in demanding enterprise scenarios. It's an instruction-tuned, conversational LLM that excels at complex tasks like multi-step reasoning, tool use (function calling), and retrieval-augmented generation. Command A features a massive 111B parameter Transformer architecture with a 256K token context length, enabling it to handle extremely large inputs (hundreds of pages of text) in a single prompt without losing coherence.

Source for the benchmarks above: Introducing Command A: Max performance, minimal compute

Some key capabilities of Command A include:

- Long Context (256K tokens): Using an innovative attention architecture (sliding window + global attention), Command A can ingest up to 256,000 tokens of text in one go. This enables use cases like analyzing lengthy financial reports or entire knowledge bases in a single prompt.
- Enterprise-Tuned Generation: Command A is optimized for business applications. It is excellent at instruction following, summarization, and especially RAG workflows, where it integrates retrieved context and even cites sources to mitigate hallucinations. It supports tool calling (function calling) out of the box, so it can interact with external APIs or data sources as part of an Azure AI Agent.
- Multilingual Proficiency: Command A handles multilingual use cases well, covering all major business languages, with near-leading performance in Japanese, Korean, and German.
- Efficient Deployment: Despite its size, Command A is engineered for efficiency. It delivers 150% higher throughput than its predecessor (Command R+ 08-2024) and requires only 2× A100/H100 GPUs to run. In practice this means lower latency. It also supports streaming token output, so applications can start receiving the response as it's generated, keeping interaction latency low.

Real-World Use Cases for Command A + Embed 4

With both a powerful generative model and a state-of-the-art embedding model at your fingertips, developers can build advanced AI solutions. Here are some real-world use cases unlocked by Command A and Embed 4 on Azure:

Financial Report Summarization (RAG): Imagine ingesting thousands of pages of financial filings, earnings call transcripts, and market research into a vector store. Using Embed 4, you can embed and index all this text. When an analyst asks "What were the key revenue drivers mentioned in ACME Corp's Q1 2025 report?", you use the query embedding to retrieve the most relevant passages. Command A (with its 256K context) can then take those passages and generate a concise summary or answer with cited evidence. The model's long context window means it can consider all retrieved chunks at once, and its enterprise tuning ensures factual, business-appropriate summaries.

Legal Research Agent (Tool Use + Multilingual): Consider a multinational law firm handling cross-border mergers and acquisitions. It has a vast repository of legal documents in multiple languages. Using Embed 4, the firm indexes these documents, creating multilingual embeddings. When a lawyer researches a specific legal precedent related to a merger in Germany, they can query in English. Embed 4 retrieves relevant German documents, and Command A summarizes key points, translates excerpts, and compares legal arguments across jurisdictions.
Furthermore, Command A leverages tool calling (utilizing agentic capabilities) to retrieve additional information from external databases, such as company registration details and regulatory filings, integrating this data into its analysis to provide a comprehensive report.

Technician Knowledge Assistant (RAG + Multilingual): Think of a utilities company committed to operational excellence, managing a vast network of critical equipment, including power generators, transformers, and distribution lines. It can leverage Command A, integrated with Embed 4, to index a comprehensive repository of equipment manuals, maintenance records, and sensor data in multiple languages. This enables technicians and engineers to access critical knowledge instantly. Technicians can ask questions in their native language about specific equipment issues, and Command A retrieves relevant manuals, troubleshooting guides, and past repair reports. It also guides technicians through complex maintenance procedures step by step, ensuring consistency and adherence to best practices. This empowers the company to optimize maintenance processes, improve overall equipment reliability, and enhance communication, ultimately achieving operational excellence.

Multimodal Search & Indexing: With Embed 4's image embedding capability, you can build search systems that go beyond text. For instance, a media company could index its image library by generating embeddings for each image (using Azure's Image Embeddings client) and also index captions and descriptions. A user could then supply a query image (or a textual description) and retrieve both images and articles that are semantically similar to the query. This is useful for scenarios like finding slides similar to a given diagram, searching scanned invoices by content, or matching user-uploaded photos to reference documents.

Getting Started: Deploying via Azure AI Foundry

In Azure AI Foundry, Embed 4 can be used via the Embeddings API to encode text or images into vectors. Each text input is turned into a numeric vector (e.g. a 1024-dimension float array) that you can store in a vector database or use for similarity comparisons. The embeddings are normalized for cosine similarity by default. You can also take advantage of Azure's vector index or Azure Cognitive Search to build vector search directly on top of these model outputs.

Image source: Introducing Embed 4: Multimodal search for business

One of the biggest benefits of using Azure AI Foundry is the ease of deployment for these models. Cohere's Command A and Embed 4 are available in the model catalog: you can find their model cards and deploy them in just a few clicks. Azure AI Foundry supports serverless API endpoints for these models, meaning Microsoft hosts the inference infrastructure and scales it for you (with pay-as-you-go billing).

Integration with Azure AI Agent Service: If you're building an AI agent (a system that can orchestrate models and tools to perform tasks), Azure AI Agent Service makes it easy to incorporate these models. In the Agent Service, you can simply reference the deployed model by name as the agent's reasoning LLM. For example, you could specify an agent that uses CohereCommandA as its model, and add tools like Azure Cognitive Search. The agent can then handle user requests by, say, using a Search tool (powered by an Embed 4 vector index) and then passing the results to Command A for answer formulation, all managed by the Azure Agent framework.
This lets you build production-grade agentic AI workflows that leverage Cohere's models with minimal plumbing. In short, Azure provides the glue to connect Command A + Embed 4 + tools into a coherent solution.

Try Command A and Embed 4 today on Azure AI Foundry

The availability of Cohere's Command A and Embed 4 on Azure AI Foundry empowers developers to build the next generation of intelligent apps on a fully managed platform. You can now easily deploy a 256K-context LLM that rivals the best in the industry, alongside a high-performance embedding model that plugs into your search and retrieval pipelines. Whether it's summarizing lengthy documents with cited facts, powering a multilingual enterprise assistant, enabling multimodal search experiences, or orchestrating complex tool-using agents, these models open up a world of possibilities. Azure AI Foundry makes it simple to integrate these capabilities into your solutions, with the security, compliance, and scalability of Azure's cloud.

We encourage you to try out Command A and Embed 4 in your own projects. Spin them up from the Azure model catalog, use the provided SDK examples to get started, and explore how they can elevate your applications' intelligence. With Cohere's models on Azure, you have cutting-edge AI at your fingertips, ready to deploy in production. We're excited to see what you build with them!
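To tie the two models together, here is a minimal, hypothetical RAG round trip using the azure-ai-inference SDK: the query is embedded with Embed 4, the nearest passages are fetched from your own vector index (stubbed out here), and Command A answers from the retrieved context. Endpoints, keys, and the retrieval helper are placeholders.

# Minimal sketch: Embed 4 for retrieval + Command A for grounded answering (placeholder endpoints/keys).
from azure.ai.inference import EmbeddingsClient, ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

embedder = EmbeddingsClient(
    endpoint="https://<embed-4-endpoint>",          # placeholder
    credential=AzureKeyCredential("<embed-key>"),
)
generator = ChatCompletionsClient(
    endpoint="https://<command-a-endpoint>",        # placeholder
    credential=AzureKeyCredential("<command-a-key>"),
)

question = "What were the key revenue drivers in ACME Corp's Q1 2025 report?"
query_vector = embedder.embed(input=[question]).data[0].embedding

# Placeholder: nearest-neighbor lookup in your vector store (Azure AI Search, etc.).
retrieved_chunks = my_vector_store_search(query_vector, top_k=5)  # hypothetical helper
context = "\n\n".join(retrieved_chunks)

response = generator.complete(
    messages=[
        SystemMessage(content="Answer using only the provided context and cite the passages you used."),
        UserMessage(content=f"Context:\n{context}\n\nQuestion: {question}"),
    ],
)
print(response.choices[0].message.content)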
Scalable and Efficient Fine-Tuning of LLM on Azure ML

https://github.com/james-tn/llm-fine-tuning/tree/main/opensource_llm/single_step
Co-Author: Mohamad AL jazaery

Why Scalable and Efficient Fine-Tuning Matters

- Faster iterations, shorter time-to-value: In today's competitive AI landscape, time is of the essence. The faster you can fine-tune a model, the quicker you can validate ideas, test hypotheses, and bring solutions to market.
- High-performance GPU machines are costly: High-performance GPUs and compute clusters don't come cheap, and their availability is often limited. Efficient fine-tuning techniques, such as model sharding and distributed training, maximize the utilization of these precious resources, ensuring that you get the most out of your infrastructure investment.

Choosing the Right Azure ML GPU Compute for the Job: NC or ND?

Not all GPU computes are created equal, and choosing the right SKU can make or break your training efficiency.

- ND series: Ideal for distributed training across multiple nodes, thanks to its InfiniBand (IB) connectivity that ensures high-speed communication between nodes, for example pretraining an LLM or fine-tuning a very large model (~70B parameters).
- NC series: Suited to small and medium workloads where no heavy interaction between nodes is needed, like LLM inferencing or mid-size LLM fine-tuning.

Azure GPU Machine Options by Scenario:

| Scenario | Common model size | Training approach | Recommended Azure compute |
| --- | --- | --- | --- |
| Small-scale fine-tuning | < 3B parameters | Parameter-efficient tuning | NCas_T4_v3 (Tesla T4, 16 GB) |
| Medium-scale fine-tuning | 1–5B parameters | Full or parameter-efficient | NCs_v3 (Tesla V100, 16 GB) |
| Distributed training for medium models | 5–10B parameters | Full fine-tuning | ND_v2 (Tesla V100 NVLINK, 32 GB, InfiniBand) |
| Large-scale fine-tuning (single machine) | 10–30B parameters | Full or parameter-efficient | NC_A100_v4 (A100, 40 GB) |
| Distributed training for very large models | 20–70B parameters | Full fine-tuning | NDasrA100_v4 (A100, 80 GB, HDR InfiniBand) |
| Very large model training (single machine) | Up to 70B parameters | Full or parameter-efficient | NCads_H100_v5 (H100 NVL, 94 GB) |
| Massive-scale distributed training | > 70B parameters | Full fine-tuning | ND-H100-v5 (H100, 80 GB, scale-out InfiniBand) |

Distributed Efficient Training: A Quick Guide

When scaling fine-tuning tasks, choosing the right distributed training method is key:

- DDP (Data Parallelism): Works well when the entire model fits on a single GPU. It replicates the model across multiple GPUs and splits the data for parallel processing. See experiment 1 in the following section.
- Model Parallelism: A game-changer for massive models that don't fit on a single GPU. It shards not only the data but also the model parameters and optimizer states across multiple GPUs, enabling efficient training of models like LLaMA-70B on GPUs with limited memory. Both FSDP and DeepSpeed excel at implementing advanced forms of model parallelism and memory optimization.

Memory Optimization Techniques

- Gradient checkpointing: Reduces memory by recomputing activations during the backward pass, trading memory for additional computation.
- Mixed precision training: Reduces memory usage by using FP16 or BF16 instead of FP32, accelerating training while maintaining numerical stability. Supported by both frameworks.
- Quantization (DeepSpeed exclusive): Uses INT8 precision for weights and activations, dramatically reducing memory and compute requirements.
- Offloading (DeepSpeed exclusive): Offloads optimizer states and model parameters to CPU or NVMe, freeing up GPU memory for computation.
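As a small illustration of two of the techniques above, the sketch below enables gradient checkpointing and bf16 mixed precision when fine-tuning a Hugging Face causal LM with the Trainer API. The model name, dataset, and hyperparameters are placeholders chosen for illustration, not the repository's actual configuration.

# Minimal sketch: gradient checkpointing + bf16 mixed precision with the Hugging Face Trainer (illustrative values).
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Trade compute for memory: recompute activations during the backward pass.
model.gradient_checkpointing_enable()

args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # larger effective batch size without extra memory
    bf16=True,                       # mixed precision on A100/H100-class GPUs
    num_train_epochs=1,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=my_tokenized_dataset)  # hypothetical dataset
trainer.train()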
Our Experiments: Pushing the Limits of Scalability

Experiment 1: Distributed Training on Multiple Nodes using DDP

We conducted an experiment to fine-tune the Llama-3.1-8B model using LoRA (Low-Rank Adaptation) on Azure ML NDv2-V100 nodes. The goal was to evaluate the efficiency of fine-tuning across different numbers of nodes (1, 2, and 3) and observe the impact on training time and throughput.

Azure ML job YAML definition:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml/models/mistralai-Mistral-7B-v01/versions/19
command: >
  accelerate launch
  --num_processes 16  # GPUs per machine * number of machines
  --num_machines 2
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 2  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node

Results: As we increased the number of nodes from one to three, the throughput increased proportionally. This indicates that the system scaled efficiently with the addition of more nodes, maintaining a close-to-linear improvement in throughput.

Experiment 2: Model Parallelism using FSDP

Fine-tuning a 70B-parameter model on GPUs with only 16 GB of memory might sound impossible, but we made it happen using FSDP (Fully Sharded Data Parallelism) on Azure ML with a cluster of multiple NDv2-V100 nodes. By distributing not only the data but also the model parameters and optimizer states across multiple nodes, we unlocked the power of full sharding.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml-meta/models/Llama-3.3-70B-Instruct/versions/4
command: >
  accelerate launch
  --config_file "configs/fsdp_config.yaml"
  --num_processes 32
  --num_machines 4
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 4  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node

Key Takeaways:

- Memory Efficiency: Full sharding enabled us to fine-tune the LLaMA-70B model on V100 GPUs despite their limited memory.
- Connectivity Matters: The InfiniBand (IB) connectivity of ND nodes played a critical role in ensuring smooth communication across GPUs, making this feat possible.

Conclusion

Scalable and efficient fine-tuning is the key to unlocking the true potential of Large Language Models. By leveraging distributed training techniques, such as FSDP and DDP, and optimizing compute resources on Azure ML, researchers and practitioners can overcome the challenges of training massive models, reducing costs, accelerating time-to-value, and driving AI innovation. Access the code and start experimenting here!

Future work: The second part will focus on real-world pipeline setups, including end-to-end model training, hyperparameter optimization, and testing. The third part will dive into deploying trained models for practical use. Future posts may explore best practices for specific fine-tuning scenarios and techniques.
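For readers who want to see what "full sharding" looks like in code, below is a minimal, generic PyTorch FSDP sketch (not the repository's train.py): it wraps a model so that parameters, gradients, and optimizer state are sharded across ranks. Launch it with accelerate or torchrun as in the job definitions above; the tiny stand-in model replaces a real LLM for brevity.

# Minimal sketch: wrapping a model with PyTorch FSDP (stand-in model, launch with torchrun/accelerate).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE/MASTER_ADDR come from the launcher
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Stand-in for a large transformer; in practice this is the LLM loaded from model_dir.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across all ranks.
    sharded_model = FSDP(model)
    optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-5)

    # One illustrative training step on synthetic data.
    batch = torch.randn(8, 4096, device="cuda")
    loss = sharded_model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()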