Project Maria: Bringing Speech and Avatars Together for Next-Generation Customer Experiences
In an age where digital transformation influences nearly every aspect of business, companies are actively seeking innovative ways to differentiate their customer interactions. Traditional text-based chatbots, while helpful, often leave users wanting a more natural, personalized, and efficient experience. Imagine hosting a virtual brand ambassador—a digital twin of yourself or your organization’s spokesperson—capable of answering customer queries in real time with a lifelike voice and an expressive 2D or 3D face. This is where Project Maria comes in.

Project Maria is an internal Microsoft initiative that integrates cutting-edge speech-to-text (STT), text-to-speech (TTS), large language model (LLM), and avatar technologies. Using Azure AI Speech and custom neural voice models, it seeks to create immersive, personalized interactions for customers—reducing friction, increasing brand loyalty, and opening new business opportunities in areas such as customer support, product briefings, digital twins, live marketing events, safety briefings, and beyond.

In this blog post, we will dive into:

- The Problem and Rationale for evolving beyond basic text-based solutions.
- Speech-to-Text (STT) and Text-to-Speech (TTS) Pipelines, plus the Azure OpenAI GPT-4o Real-Time API that powers natural conversations.
- Avatar Models in Azure, including off-the-shelf 2D avatars and fully customized avatars.
- Neural Voice Model Creation, from data gathering to training and deployment on Azure.
- Security and Compliance considerations for handling sensitive voice assets and data.
- Use Cases from customer support to digital brand ambassadors and safety briefings.
- Real-World Debut of Project Maria, showcased at the AI Leaders Summit in Seattle.
- Future Outlook on how custom avatars will reshape business interactions, scale presence, and streamline time-consuming tasks.

If you’re developing or considering a custom neural voice and avatar model for your product or enterprise, this post will guide you through both conceptual and technical details to help you get started—and highlight where the field is heading next.

1. The Problem: Limitations of Text-Based Chatbots

1.1 Boredom and Fatigue in Text Interactions

Text-based chatbots have come a long way, especially with the advent of powerful Large Language Models (LLMs) and Small Language Models (SLMs). Despite these innovations, interactions can still become tedious—often requiring users to spend significant personal time crafting the right questions. Many of us have experienced chatbots that respond with excessively verbose or repetitive messages, leading to boredom or even frustration. In industries that demand immediacy—like healthcare, finance, or real-time consumer support—purely text-based exchanges can feel slow and cumbersome. Moreover, text chat requires a user’s full attention to read and type, whether in a busy contact center environment or an internal knowledge base where employees juggle multiple tasks.

1.2 Desire for More Engaging and Efficient Modalities

Today’s users expect something closer to human conversation. Devices ranging from smartphones to smart speakers and in-car infotainment systems have normalized voice-based interfaces. Adding an avatar—whether a 2D or 3D representation—deepens engagement by combining speech with a friendly visual persona. This can elevate brand identity: an avatar that looks, talks, and gestures like your company’s brand ambassador or a well-known subject-matter expert.
1.3 The Need for Scalability

In a busy customer support environment, human representatives simply can’t handle an infinite volume of conversations or offer 24/7 coverage across multiple channels. Automation is essential, yet providing high-quality automated interactions remains challenging. While a text-based chatbot might handle routine queries, a voice-based, avatar-enabled agent can manage more complex requests with greater dynamism and personality. By giving your digital support assistant both a “face” and a voice aligned with your brand, you can foster deeper emotional connections and provide a more genuine, empathetic experience. This blend of automation and personalization scales your support operations, ensuring higher customer satisfaction while freeing human agents to focus on critical or specialized tasks.

2. The Vision: Project Maria’s Approach

Project Maria addresses these challenges by creating a unified pipeline that supports:

- Speech-to-Text (STT) for recognizing user queries quickly and accurately.
- Natural Language Understanding (NLU) layers (potentially leveraging Azure OpenAI or other large language models) for comprehensive query interpretation.
- Text-to-Speech (TTS) that returns highly natural-sounding responses, possibly in multiple languages, with customized prosody and style.
- Avatar Rendering, which can be a 2D animated avatar or a more advanced 3D digital twin, bringing personality and facial expressions to the conversation.

By using Azure AI Services—particularly the Speech and Custom Neural Voice offerings—Project Maria can deliver brand-specific voices. This ensures that each brand or individual user’s avatar can match (or approximate) a signature voice, turning a run-of-the-mill voice assistant into a truly personal digital replica.

3. Technical Foundations

3.1 Speech-to-Text (STT)

At the heart of the system is Azure AI Services for Speech, which provides:

- Real-time transcription capabilities in a variety of languages and dialects.
- Noise suppression, ensuring robust performance in busy environments.
- Streaming APIs, critical for real-time or near-real-time interactions.

When a user speaks, audio data is captured (for example, via a web microphone feed or a phone line) and streamed to the Azure service. The recognized text is returned in segments, which the NLU or conversation manager can interpret.

3.1.1 Audio Pipeline

1. Capture: The user’s microphone audio is captured by a front end (e.g., a web app, mobile app, or IoT device).
2. Pre-processing: Noise reduction or volume normalization might be applied locally or in the cloud, ensuring consistent input.
3. Azure STT Ingestion: Data is sent to the Speech service endpoint, authenticated via subscription keys or tokens (more on security later).
4. Result Handling: The recognized text arrives in partial hypotheses (partial transcripts) and final recognized segments. Project Maria (the custom avatar) processes these results to understand user intent.

3.2 Text-to-Speech (TTS)

Once an intent is identified and a response is formulated, the system needs to deliver speech output.

- Standard Neural Voices: Microsoft provides a wide range of prebuilt voices in multiple languages.
- Custom Neural Voice: For an even more personalized experience, you can train a voice model that matches a brand spokesperson or a distinct voice identity. This is done using your custom datasets, ensuring the final system speaks exactly like the recorded persona.
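To ground sections 3.1 and 3.2, here is a minimal sketch of the recognize-then-speak loop using the Azure Speech SDK for Python. It is illustrative only: the key, region, and voice name are placeholders, and a production deployment would stream audio continuously rather than recognize a single utterance.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholders; in practice, load these from Azure Key Vault (see section 5).
speech_config = speechsdk.SpeechConfig(subscription="<SPEECH_KEY>", region="<REGION>")

# --- Speech-to-text: recognize one utterance from the default microphone ---
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
result = recognizer.recognize_once_async().get()
print("Recognized:", result.text)

# --- Text-to-speech: speak a response with a prebuilt neural voice ---
# For a Custom Neural Voice, you would additionally set speech_config.endpoint_id
# to your deployment ID and use your custom voice's name here instead.
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async("Hello! How can I help you today?").get()
```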
3.2.1 Voice Font Selection and Configuration

In a typical architecture:

- The conversation manager (which could be an orchestrator or a custom microservice) provides the text output to the TTS service.
- The TTS service uses a configured voice font—like en-US-JennyNeural—or a custom neural voice ID (like a Maria neural voice) if you have a specialized voice model.
- The synthesized audio is returned as an audio stream (e.g., PCM or MP3). You can play this in a webpage directly or in a native app environment.

The Azure OpenAI GPT-4o Real-Time API integrates with Azure’s Speech services to enable seamless interactions. First, your speech is transcribed in near real time. GPT-4o then processes this text to generate context-aware responses, which are converted to natural-sounding audio via Azure TTS. This audio is synchronized with avatar models to create a lifelike, engaging interface.

3.3 Real-Time Conversational Loop

Maria is designed for real-time, voice-driven conversations:

1. The user’s speech is continuously streamed to Azure STT.
2. The recognized text triggers a real-time inference step for the next best action or response.
3. The response is generated by an Azure OpenAI model (like GPT-4o) or another LLM/SLM.
4. The text is then synthesized to speech, which the user hears with minimal latency.

3.4 Avatars: 2D and Beyond

3.4.1 Prebuilt Azure 2D Avatars

Azure AI Speech includes an avatar capability that can be activated to display a talking head or a 2D animated character. Developers can:

- Choose from prebuilt characters or import basic custom animations.
- Synchronize lip movements to the TTS output.
- Overlay brand-specific backgrounds or adopt transparency for embedding in various UIs.

3.4.2 Fully Custom Avatars (a Customer Support Agent Like Maria)

For organizations wanting a customer support agent, subject-matter expert, or brand ambassador:

1. Capture: Record high-fidelity audio and video of the person you want to replicate. The more data, the better the outcome (though privacy and licensing must be considered).
2. Modeling: Use advanced 3D or specialized 2D animation software (or partner with Microsoft’s custom avatar creation solutions) to generate a rigged model that matches the real person’s facial geometry and expressions.
3. Integration: Once the model is rigged, it can be integrated with the TTS engine. As text is converted to speech, the avatar automatically animates lip shapes and facial expressions in near real time.

3.5 Latency and Bandwidth Considerations

When building an interactive system, keep an eye on:

- Network latency: Real-time STT and TTS require stable, fast connections.
- Compute resources: If hosting advanced ML or high concurrency, scaling containers (e.g., via Docker and Kubernetes) is critical.
- Avatars: Real-time animation might require sending frames or instructions to a client’s browser or device.

4. Building the Model: Neural Voice Model Creation

4.1 Data Gathering

To train a custom neural voice, you typically need:

- High-quality audio clips: Ideally recorded in a professional studio to minimize background noise, with the same microphone setup throughout.
- Matching transcripts for each clip.
- Minimum data duration: Microsoft recommends a certain threshold (e.g., 300+ utterances, typically around 30 minutes to a few hours of recorded speech, depending on the complexity of the final voice needed).

4.2 Training Process

1. Data Upload: Use the Azure Speech portal or APIs to upload your curated dataset.
2. Model Training: Azure runs training jobs that often require a few hours (or more).
This step includes:
   - Acoustic feature extraction (spectrogram analysis).
   - Language or phoneme modeling for the relevant language and accent.
   - Prosody tuning, ensuring the voice can handle various styles (cheerful, empathetic, urgent, etc.).
3. Quality Checks: After training, you receive an initial voice model. You can generate test phrases to assess clarity, intonation, and overall quality.
4. Iteration: If the voice quality is not satisfactory, gather more data or refine the existing data (removing noisy segments or inaccurate transcripts).

4.3 Deployment

Once satisfied with the custom neural voice:

- Deploy the model to an Azure endpoint within your subscription.
- Configure your TTS engine to use the custom endpoint ID instead of a standard voice.

5. Securing Avatar and Voice Models

Security is paramount when personal data, brand identity, or intellectual property is on the line.

5.1 API Keys and Endpoints

Azure AI Services requires an API key or an OAuth token to access STT/TTS features. Store keys in Azure Key Vault or as secure environment variables. Avoid hard-coding them in the front end or in source control.

5.2 Access Control

- Role-Based Access Control (RBAC) at both the Azure subscription level and the container (e.g., Docker or Kubernetes) level ensures only authorized personnel can deploy or manage the containers running these services.
- Network Security: Use private endpoints if you want to limit exposure to the public internet.

5.3 Intellectual Property Concerns

- Avatar and Voice Imitation: An avatar model and custom neural voice that mimics a specific individual must be authorized by that individual. Azure has a verification process in place to ensure consent.
- Data Storage: The training audio data and transcripts must be securely stored, often with encryption at rest and in transit.

6. Use Cases: Bringing It All Together

6.1 Customer Support

A digital avatar that greets users on a website or mobile app can handle first-level queries:

- “Where can I find my billing information?”
- “What is your return policy?”

By speaking these answers aloud with a friendly face and voice, the experience is more memorable and can reduce queue times for human agents. If a question is too complex, the avatar can seamlessly hand off to a live agent. Meanwhile, transcripts of the entire conversation are stored (e.g., in Azure Cosmos DB), enabling data analytics and further improvements to the system.

6.2 Safety Briefings and Public Announcements

Industries like manufacturing, aviation, or construction must repeatedly deliver consistent safety messages. A personal avatar can recite crucial safety protocols in multiple languages, ensuring nothing is lost in translation. Because the TTS voice is consistent, workers become accustomed to the avatar’s instructions. Over time, you could even create a brand- or site-specific “Safety Officer” avatar that fosters familiarity.

6.3 Digital Twins at Live Events

Suppose you want your company’s spokesperson to simultaneously appear at multiple events across the globe. With a digital twin:

- The spokesperson’s avatar and voice “present” in real time, responding to local audience questions.
- This can be done in multiple languages, bridging communication barriers instantaneously.
- Attendees get a sense of personal interaction, while the real spokesperson can focus on core tasks, or appear physically at another event entirely.
6.4 AI Training and Education

In e-learning platforms, a digital tutor can guide students through lessons, answer questions in real time, and adapt its tone of voice based on the difficulty of the topic or the student’s performance. By offering a face and voice, the tutor becomes more engaging than a text-only system.

7. Debut: Maria at the AI Leaders Summit in Seattle

Project Maria had its first major showcase at the AI Leaders Summit in Seattle last week. We set up a live demonstration:

- Live Conversations: Attendees approached a large screen that displayed Maria’s 2D avatar.
- On-the-Fly: Maria recognized queries with STT, generated text responses from an internal knowledge base (powered by GPT-4o or domain-specific models), then spoke them back with a custom Azure neural voice.
- Interactive: The avatar lip-synced to the output speech, included animated gestures for emphasis, and even displayed text-based subtitles for clarity.

The response was overwhelmingly positive. Customers praised the fluid voice quality and the lifelike nature of Maria’s avatar. Many commented that they felt they were interacting with a real brand ambassador, especially because the chosen custom neural voice had just the right inflections and emotional range.

8. Technical Implementation Details

Below is a high-level architecture of how Project Maria might be deployed using containers and Azure resources.

Front-End Web App:
- Built with a modern JavaScript framework (React, Vue, Angular, etc.).
- Captures user audio through the browser’s WebRTC or MediaStream APIs.
- Connects via WebSockets or RESTful endpoints for STT requests.
- Renders the avatar in a <canvas> element or using a specialized avatar library.

Backend:
- Containerized with Docker.
- Exposes endpoints for STT streaming (optionally passing data directly to Azure for transcription).
- Integrates with the TTS service, retrieving synthesized audio buffers.
- Returns the audio back to the front end in a continuous stream for immediate playback.

Avatar Integration:
- The back end or a specialized service handles lip-sync generation (e.g., via phoneme mapping from the TTS output).
- The front end renders the 2D or 3D avatar in sync with the audio playback. This can be done by streaming timing markers that indicate which phoneme is currently active.

Data and Conversation Storage:
- Use Azure Cosmos DB or a similar NoSQL solution to store transcripts, user IDs, timestamps, and optional metadata (e.g., conversation sentiment).
- This data can later be used to improve the conversation model, evaluate performance, or train advanced analytics solutions.

Security:
- All sensitive environment variables (like Azure API keys) are loaded securely, either through Azure Key Vault or container orchestration secrets.
- The system enforces user authentication if needed. For instance, an internal HR system might restrict the avatar-based service to employees only.

Scaling:
- Deploy containers in Azure Kubernetes Service (AKS), setting up auto-scaling to handle peak loads.
- Monitor CPU/memory usage, as well as TTS quota usage. For STT, ensure the service tier can handle simultaneous requests from multiple users.

9. Securing Avatar Models and Voice Data

9.1 Identity Management

Each avatar or custom neural voice is tied to a specific subscription. Using Azure Active Directory (Azure AD), you can grant fine-grained permissions so that only authorized DevOps or AI specialists can alter or redeploy the voice.
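As a concrete illustration of the secrets-handling pattern referenced in sections 5.1 and 8, the following minimal sketch retrieves a Speech API key from Azure Key Vault at runtime using the Azure SDK for Python (the vault URL and secret name are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential works with a managed identity inside AKS,
# and falls back to environment or CLI credentials during local development.
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://<your-vault>.vault.azure.net", credential=credential)

# Fetch the key at runtime instead of hard-coding it or committing it to source control.
speech_key = client.get_secret("speech-api-key").value
```

Pairing this with RBAC on the vault keeps the key out of container images and source control entirely.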
9.2 API Gateways and Firewalls

For enterprise contexts, you might place an API gateway in front of your containerized services. This central gateway can:

- Inspect requests for anomalies,
- Enforce rate limits,
- Log traffic to meet compliance or auditing requirements.

9.3 Key Rotation and Secrets Management

Frequently rotate keys to minimize the risk of compromised credentials. Tools like Azure Key Vault or GitHub’s secret storage features can automate the rotation process, ensuring minimal downtime.

10. The Path Forward: Scaling Custom Avatars

10.1 Extended Personalization

While Project Maria currently focuses on voice and basic facial expressions, future expansions include:

- Emotion Synthesis: Beyond standard TTS expressions (friendly, sad, excited), we can integrate emotional AI to dynamically adjust the avatar’s tone based on user sentiment.
- Gesture Libraries: 2D or 3D avatars can incorporate hand gestures, posture changes, or background movements to mimic a real person in conversation. This reduces the “uncanny valley” effect.

10.2 Multilingual, Multimodal

As businesses operate globally, multilingual interactions become paramount. We have seen many use cases that:

- Auto-detect language from a user’s speech and respond in kind.
- Offer real-time translation, bridging non-English speakers to brand content.

10.3 Agent Autonomy

Systems like Maria won’t just respond to direct questions; they can act proactively:

- Send voice-based notifications or warnings when critical events happen.
- Manage long-running tasks such as scheduling or triaging user requests, akin to an “executive assistant” for multiple users simultaneously.

10.4 Ethical and Social Considerations

With near-perfect replicas of voices, there is growing concern about identity theft, misinformation, and deepfakes. Companies implementing digital twins must:

- Secure explicit consent from individuals.
- Implement watermarking or authentication for voice data.
- Educate customers and employees on usage boundaries and disclaimers.

11. Conclusion

Project Maria represents a significant leap in how businesses and organizations can scale their presence, offering a humanized, voice-enabled digital experience. By merging speech-to-text, text-to-speech, and avatar technologies, you can:

- Boost Engagement: A friendly face and familiar voice can reduce user fatigue and build emotional resonance.
- Extend Brand Reach: Appear in many locations at once via digital twins, creating personalized interactions at scale.
- Streamline Operations: Automate repetitive queries while maintaining a human touch, freeing up valuable employee time.
- Ensure Security and Compliance: Use Azure’s robust ecosystem of services and best practices for voice data.

As demonstrated at the AI Leaders Summit in Seattle, Maria is already reshaping how businesses think about communication. The synergy of avatars, neural voices, and secure, cloud-based AI is paving the way for the next frontier in customer interaction. Looking ahead, we anticipate that digital twins—like Maria—will become ubiquitous, automating not just chat responses but a wide range of tasks that once demanded human presence. From personalized marketing to advanced training scenarios, the possibilities are vast. In short, the fusion of STT, TTS, and avatar technologies is more than a novel gimmick; it is an evolution in human-computer interaction. By investing in robust pipelines, custom neural voice training, and carefully orchestrated containerized deployments, businesses can unlock extraordinary potential.
Project Maria is our blueprint for how to do it right—secure, customizable, and scalable—helping organizations around the world transform user experiences in ways that are both convenient and captivating. If you’re looking to scale your brand, innovate in human-machine dialogues, or harness the power of digital twins, we encourage you to explore Azure AI Services’ STT, TTS, and avatar solutions. Together, these advancements promise a future where your digital self (or brand persona) can meaningfully interact with users anytime, anywhere.

Detailed Technical Implementation: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/what-is-custom-text-to-speech-avatar
Text to Speech with Multi-Agent Orchestration Framework: https://github.com/ganachan/Project_Maria_Accelerator_tts
[Attached demo video: Contoso_Maria_Greetings.mp4]

Modernizing Legacy Applications in your Nonprofit
In this blog, we’ll explore how nonprofits can modernize their existing applications to enhance security without starting from scratch. By leveraging Microsoft Azure’s powerful tools, organizations can strengthen their defenses, improve performance, and ensure their applications remain secure and scalable for the future.

Securing Legacy Applications Without Rebuilding from Scratch

For many nonprofits, starting over isn’t an option—they need to secure and modernize the applications they already have. Fortunately, Microsoft Azure provides solutions that help organizations enhance security without requiring a complete rebuild:

✅ Containerization with Azure Kubernetes Service (AKS) – Nonprofits can containerize legacy applications and host them in a secure, scalable environment, reducing vulnerabilities without rewriting the entire application. This approach helps keep security updates and compliance requirements in check while maintaining the existing software functionality.

✅ Incremental Modernization with Cloud-Native Services – Instead of a full-scale rebuild, nonprofits can gradually modernize their applications by integrating cloud-native services. This could involve migrating databases to Azure SQL, implementing API-driven architectures, or introducing automation through Azure Logic Apps. This phased approach enhances security, improves performance, and allows for future scalability without disrupting core operations.

✅ Azure SQL Database – Helps nonprofits move from outdated, on-premises databases to a fully managed cloud database, reducing maintenance efforts while improving security, performance, and compliance.

✅ Azure API Management – Allows organizations to connect legacy systems with modern cloud-based services by securely exposing APIs, enabling seamless integration and extended functionality.

Understanding Your Options

When considering the modernization of legacy applications, there are several strategies that organizations can adopt, each with its own benefits and considerations:

Rehost (Lift-and-Shift)
This strategy is all about speed and simplicity. It involves moving applications from their current environment to a new one with minimal or no changes to the code. This allows organizations to quickly transition to the cloud without altering the core functionality of their applications.

Replatform
Replatforming sits between rehosting and refactoring. It requires making some code changes so that applications can take advantage of cloud technologies. This approach allows organizations to benefit from cloud capabilities without needing a complete overhaul of their applications.

Refactor (or Repackage)
Refactoring focuses on enhancing productivity and speed by making minimal code changes. This strategy ensures that applications can connect easily to a cloud-first environment, optimizing their performance and scalability.

Rearchitect
For organizations that need enhanced cloud scalability, rearchitecting is the way to go. This approach involves modifying and extending the application's functionality and code to better utilize cloud resources, ensuring improved performance and scalability.

Rebuild (or Rewrite)
When existing applications have limited functionality or lifespan, rebuilding them using cloud solutions might be necessary. Although this approach requires significant effort, it provides a fresh start with modern capabilities and extended lifespans.
Replace
If an application no longer meets current or future business needs, even after rebuilding, replacing it with a ready-made solution may be the best option. This approach can be quicker than rebuilding and allows organizations to focus on other priorities. However, it may also pose challenges such as business process interruptions and limitations on future modernization efforts.

Nonprofit Considerations

Wrapping up: nonprofits rely on technology to drive their missions, but outdated applications can pose serious security risks. We've covered how organizations don’t have to start from scratch to modernize and secure their systems. By leveraging Microsoft Azure’s powerful tools—like containerization, cloud-native services, and secure database management—nonprofits can enhance security, improve performance, and ensure long-term scalability.

Here is one thing to consider: nonprofits may not have the technical team to assist with these processes, but understanding these strategies is still crucial. This knowledge can empower them in conversations with development partners, ensuring they are fully aware and engaged throughout the modernization journey. By being informed, nonprofits can make better decisions, ask the right questions, and collaborate effectively with their partners to achieve their modernization goals.

Modernization isn’t just about keeping up with technology; it’s about protecting the trust nonprofits have built with their donors, volunteers, and communities. Whether it’s securing legacy applications or embedding security into new software development through the Secure Software Development Lifecycle (SSDLC), taking proactive steps today ensures a more resilient and secure future. Conversely, for nonprofits that do want to start from scratch with building new applications, integrating security from the start is essential. Learn more about how the SSDLC can strengthen your organization’s software security here: Building Secure Software from the Ground Up: Why It Matters for Nonprofits | Microsoft Community Hub

Want to explore nonprofit application modernization further? Check out this guide: What is Application Modernization? | Microsoft Azure.

The Future of AI: Harnessing AI for E-commerce - personalized shopping agents
Explore the development of personalized shopping agents that enhance the user experience by providing tailored product recommendations based on uploaded images. Leveraging Azure AI Foundry, these agents analyze images for apparel recognition and generate intelligent product recommendations, creating a seamless and intuitive shopping experience for retail customers.

A Framework for Calculating ROI for Agentic AI Apps
Contributors and Reviewers: Anurag Karuparti (C), Aishwarya Umachandran (C), Tara Webb (R), Bart Czernicki (R), Simon Lacasse (R), Vishnu Pamula (R)

ROI serves as a critical metric for assessing the financial benefits of any investment, including AI projects. It helps determine whether the investment generates more value than it costs. The fundamental formula for calculating ROI is:

ROI = (Net Return from Investment - Cost of Investment) / Cost of Investment * 100

Studies indicate that companies investing in AI are realizing significant returns, with an average ROI of $3.70 for every $1 invested. Notably, 5% of organizations worldwide are achieving an even higher average ROI of $10 for every $1 invested (IDC Study 2024).

1. Key Metrics for Measuring ROI in Agentic AI Apps

Measuring the ROI of agentic AI apps necessitates a comprehensive approach that considers both tangible and intangible benefits. Intangible benefits may be difficult to quantify but significantly contribute to ROI. Here are some key metrics to consider:

a. Tangible Benefits

- Cost Savings: Agentic apps can automate tasks, leading to significant cost reductions in areas like customer service, data entry, and many other business operations. By handling complex workflows autonomously, agentic AI minimizes the need for human intervention, resulting in lower labor costs and increased efficiency.
- Revenue Increase: Agentic apps can help businesses identify new revenue streams, optimize pricing strategies, and improve sales and marketing effectiveness, ultimately driving revenue growth.
- Productivity Gains: By automating tasks and providing employees with enhanced tools and information, agentic apps can boost productivity and efficiency.
- Data Quality Improvements: Agentic apps can minimize errors in tasks such as data entry and analysis, leading to improved accuracy and reduced costs associated with correcting mistakes.
- Improved Customer Satisfaction: Agentic apps can enhance customer satisfaction by providing personalized experiences, faster service, and proactive problem-solving.
- Faster Time-to-Market: Agentic AI can accelerate product development and deployment, enabling businesses to bring new products and services to market faster.

b. Intangible Benefits

- Improved Decision-Making: Agentic AI can analyze vast amounts of data and provide valuable insights that help businesses make more informed decisions.
- Enhanced Brand Reputation: By providing innovative and efficient services, agentic AI can enhance a company's brand reputation and foster customer loyalty.
- Increased Employee Satisfaction: By automating mundane tasks and empowering employees with better tools, agentic AI can improve employee satisfaction and retention.
- Improved Compliance: Agentic AI can help businesses comply with regulations and reduce the risk of penalties.
- Increased Innovation: By freeing employees from routine tasks, agentic AI can foster a culture of innovation and creativity.

2. Cost Components of Developing and Deploying Agentic Apps

Developing and deploying agentic AI apps involves various cost components, which can be categorized as follows:

Development Costs
Description: This includes the cost of software and development tools; salaries of developers, data scientists, and machine learning engineers; and cloud computing resources.
Example: Salaries for a team comprising a data scientist ($120,000 - $180,000 per year), a machine learning engineer ($130,000 - $200,000 per year), and an AI software developer ($110,000 - $170,000 per year), plus development costs on cloud platforms like Azure. (These salaries are estimates based on public information and can vary.)

Data Acquisition and Preparation
Description: Agentic AI apps may require large amounts of data for training and operation. This includes the cost of acquiring data, cleaning it, and preparing it for use in AI models.
Example: Purchasing datasets from third-party providers or investing in data annotation services.

Testing and Deployment
Description: This includes the cost of testing the AI app, deploying it to the cloud or on-premises, and integrating it with existing systems.
Example: Cloud computing costs for deploying the app on platforms such as Azure, AWS, and Google Cloud.

Maintenance and Updates
Description: Agentic AI apps require ongoing maintenance and updates to ensure they remain effective and secure. This includes the cost of monitoring the app, fixing bugs, and adding new features.
Example: Costs associated with software updates, security patches, and ongoing monitoring of the app's performance.

3. New Revenue Streams from Agentic Apps

Agentic AI apps can generate revenue through various business models by enhancing business operations in several ways.

Subscription Fees
Description: Businesses can charge users a recurring fee for access to the agentic AI app.
Example: Offering different subscription tiers with varying levels of access and features.

Usage-Based Pricing
Description: Businesses can charge users based on their usage of the app, such as the number of tasks performed or the amount of data processed.
Example: Charging users per API call or per transaction processed by the agentic AI app.

Licensing Fees
Description: Businesses can license their agentic AI technology to other companies.
Example: Granting other businesses the right to use the agentic AI technology in their own products or services.

It's important to note that agentic AI is poised to disrupt traditional SaaS business models, particularly the prevalent per-seat pricing model. As agentic AI becomes more sophisticated, businesses may shift towards alternative pricing models, such as usage-based pricing or outcome-based pricing, where the cost is directly tied to the AI's contribution to measurable business goals.

4. Framework for Calculating ROI for Agentic Apps

Based on the analysis presented above, the following framework can be used to calculate the ROI of agentic AI apps:

1. Define Objectives and KPIs: Clearly define the objectives of implementing the agentic AI app and the key performance indicators (KPIs) that will be used to measure its success. This could include metrics such as cost savings, revenue increase, productivity gains, customer satisfaction, and error reduction.
2. Establish a Baseline: Establish a baseline for the KPIs before implementing the agentic AI app. This will help measure the impact of the app on the business.
3. Estimate Revenue Gains and Cost Savings: Estimate the potential revenue gains and cost savings that can be achieved by implementing the agentic AI app. This may involve analyzing historical data, conducting surveys, and consulting with industry experts.
4. Identify and Assess Costs: Identify all costs associated with developing, deploying, and maintaining the agentic AI app. This includes development costs, data acquisition costs, infrastructure costs, and ongoing maintenance costs.
5. Determine Intangible Benefits: Identify and assess the intangible benefits of the agentic AI app, such as improved decision-making, enhanced brand reputation, and increased employee satisfaction. While these benefits may be difficult to quantify, they can significantly contribute to the overall ROI.
6. Set a Realistic Timeframe: Establish a realistic timeframe for measuring the ROI of the agentic AI app. This should consider the time it takes to develop, deploy, and fully integrate the app into the business.
7. Develop a Current-State Scenario: Develop a scenario that represents the current state of the business without the agentic AI app. This will help compare the performance of the business with and without the app.
8. Calculate the ROI: Using the data gathered in the previous steps, calculate the ROI of the agentic AI app using the ROI formula.
9. Monitor and Adjust: Continuously monitor the performance of the agentic AI app and track the KPIs. Adjust the app and its implementation as needed to optimize its effectiveness and maximize ROI.

When calculating the ROI of AI initiatives, it's crucial to avoid common pitfalls such as:

- Uncertainty of Benefits: Accurately estimating the benefits of AI can be challenging due to the evolving nature of the technology and the potential for unforeseen outcomes.
- Computing ROI Based on a Single Point in Time: AI projects often have long-term benefits that may not be fully realized in the short term. Per a recent IDC study from November 2024, organizations realize value in 14 months on average.
- Treating Each AI Project Individually: AI projects can have synergistic effects, and evaluating them in isolation may underestimate their overall impact on the business.

5. Example Scenarios

Option 1

A financial services call center handles 100,000 customer inquiries per year, each currently taking an average of 5 minutes. Of these calls, 10% (10,000 calls) are simple, routine requests (e.g., checking balances) and can be easily automated. Additionally, misrouting and inefficient handling cause each call to run 1 extra minute on average.

Current Situation (Before Multi-Agent AI):
- Total calls: 100,000
- Simple, routine calls: 10,000
- Agent cost per minute: $0.50

Routine Calls Cost (Before AI):
- Routine calls each take 3 minutes.
- Total routine call time: 10,000 calls × 3 min = 30,000 min
- Cost: 30,000 min × $0.50 = $15,000 per year

Misrouting Cost (Before AI):
- Extra 1 minute per call due to misrouting.
- Total extra time: 100,000 calls × 1 min = 100,000 min
- Cost: 100,000 min × $0.50 = $50,000 per year

Total Extra Costs (Before AI):
- Routine tasks: $15,000
- Misrouting: $50,000
- Combined inefficiencies: $65,000 per year

After Implementing Multi-Agent Collaboration AI:

The AI system handles routine inquiries automatically and optimizes call routing:
- Routine Calls Automated: 10,000 routine calls no longer require agent time, saving $15,000 per year on routine tasks.
- Correct Routing: Removes the extra 1 minute per call, saving $50,000 per year in misrouting costs.
- Efficiency Gains: With misrouting fixed and agents freed from routine tasks, staff can handle a slight increase in call volume and reduce overtime. Staff can handle an additional 4,000 calls annually, each averaging 5 minutes
(4,000 calls × 5 min × $0.50 = $10,000).

Total Annual Savings After AI (Tangible Benefit):
- Routine tasks saved: $15,000
- Misrouting eliminated: $50,000
- Efficiency gains: $10,000
- Total: $75,000

System Costs:
- Implementation and integration: $40,000
- Annual maintenance: $5,000
- Total Annual Cost: $45,000

ROI Calculation:
- Net Benefit: $75,000 (savings) − $45,000 (cost) = $30,000
- ROI = (Net Benefit / Cost) × 100% = (30,000 / 45,000) × 100% ≈ 67%

A 67% ROI means that for every dollar invested in the multi-agent collaboration AI system, the call center gains an additional 67 cents in profit each year.

Option 2

Scenario: A company wants to semi-automate customer support for its e-commerce platform using an AI-powered chatbot on Azure. The chatbot provides support for very frequently asked questions. It automates responses, provides real-time order tracking, and offers personalized product recommendations while proactively engaging customers with tailored offers and anticipating their needs. It autonomously handles tasks like follow-ups and issue resolution, integrates seamlessly with existing systems, supports multiple languages, and operates 24/7 to enhance customer satisfaction and drive sales. Additionally, it escalates complex issues to human agents and continuously improves through self-feedback.

Cost Estimation:
- Development and Deployment: $25,000 (including Azure App Service, Azure Agent Service, and other development costs)
- Maintenance and Support: $5,000 per year

Benefit Estimation:
- Reduced Customer Service Costs: The chatbot handles 2,000 customer inquiries per month, which previously required 3 full-time employees with an average salary of $40,000 per year.
- Increased Sales: The chatbot's personalized recommendations and efficient support lead to a 5% increase in monthly sales.

Calculating ROI:

Annual Labor Savings:
- 3 employees × $40,000 = $120,000

Annual Chatbot Cost:
- $25,000 (development) + $5,000 (maintenance) = $30,000

Annual Revenue Increase:
- Monthly sales: $500,000
- Increase: 5% of $500,000 = $25,000 per month
- Yearly increase: $25,000 × 12 = $300,000

Total Annual Benefits:
- $120,000 (labor savings) + $300,000 (revenue increase) = $420,000

ROI:
ROI = (Total Benefits − Annual Cost) / Annual Cost × 100% = ((420,000 − 30,000) / 30,000) × 100% = 1300%

(Note that the chatbot's own cost is subtracted only once, in the ROI formula, rather than also being netted out of the savings figure, which would double-count it.)

This example demonstrates a significant ROI for the customer service chatbot. However, it's important to remember that this is a simplified calculation; actual ROI may vary depending on various factors specific to the business and its implementation.
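For quick what-if analysis, the arithmetic from both scenarios can be captured in a few lines of Python (a simple sketch; the figures are those computed above):

```python
def roi_percent(annual_benefits: float, annual_cost: float) -> float:
    """ROI = (net return - cost) / cost * 100, per the formula at the top of this post."""
    return (annual_benefits - annual_cost) / annual_cost * 100

# Option 1: call center ($75,000 in annual savings vs. $45,000 in system costs)
print(f"Call center ROI: {roi_percent(75_000, 45_000):.0f}%")  # ~67%

# Option 2: e-commerce chatbot ($420,000 in annual benefits vs. $30,000 in costs)
print(f"Chatbot ROI: {roi_percent(420_000, 30_000):.0f}%")     # 1300%
```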
Note: Calculating Azure Costs

Azure costs vary by use case and depend on the architecture components involved. We'll discuss example scenarios for calculating these costs in a future blog.

6. Risks and Considerations

Since the core of these agents relies on LLMs, there is a potential for hallucination. Rigorous testing and evaluation are therefore critical before deploying them to production. Additionally, in the initial stages, agents may exhibit inefficiencies due to the complexity of orchestration, potentially introducing a 10–20% overhead. It is wise to set an ROI range that accounts for differences in response confidence. Over time, however, these agents are expected to improve and optimize through iterative learning and feedback.

7. ROI Will Differ from Use Case to Use Case

For example, in one call center, routine inquiries might be the primary source of inefficiency, while in another, the biggest gains might come from reducing customer wait times. Similarly, different industries may have different labor costs, different complexity levels for tasks, or varying levels of baseline performance. Cloud workload costs on Azure may also change based on usage patterns, the AI services you choose, data storage needs, and the extent of system integration required. In short, while the overall method for calculating ROI remains the same (measure gains, subtract costs, then divide by costs), the types of gains (e.g., labor reduction, error reduction, increased throughput, improved customer satisfaction) and the kinds of costs (e.g., Azure compute, integration services, licensing fees, training expenses) will differ for each scenario. As a result, you need to carefully identify the relevant metrics and expenses for every individual use case.

Conclusion

Agentic AI apps hold immense potential for businesses seeking to automate tasks, enhance efficiency, and improve decision-making. By implementing a comprehensive framework for calculating ROI, businesses can effectively justify their investment in agentic AI and ensure that these apps deliver both tangible and intangible benefits. This framework should encompass both quantitative and qualitative metrics, including cost savings, revenue increases, productivity gains, customer satisfaction, and intangible benefits such as improved decision-making and enhanced brand reputation.

While the framework presented in this report provides a structured approach to evaluating the ROI of agentic AI apps, it's important to acknowledge the potential challenges and limitations. Quantifying some intangible benefits, such as enhanced brand reputation or increased employee satisfaction, can be subjective and may require alternative measurement approaches. Furthermore, the rapidly evolving nature of agentic AI technology may necessitate ongoing adjustments to the ROI framework to accurately capture its impact on businesses. Despite these challenges, a well-defined ROI framework remains crucial for making informed decisions about agentic AI investments and maximizing their potential. By carefully evaluating the ROI of agentic AI apps, businesses can strategically leverage this transformative technology to achieve their objectives and gain a competitive edge in the evolving digital landscape.

References: IDC’s 2024 AI opportunity study: Top five AI trends to watch – The Official Microsoft Blog

Fine-Tuning Small Language Models for Function-Calling: A Comprehensive Guide
In the rapidly evolving landscape of artificial intelligence, fine-tuning small language models (SLMs) for use-case-specific workloads has become increasingly essential. The motivation behind this lies in the need for lower latency, reduced memory footprint, and improved accuracy—all while maintaining cost-effectiveness. This blog delves into the reasons for fine-tuning SLMs for function-calling, key considerations, and a practical guide to implementing fine-tuning on Azure.

Why Fine-Tune Small Language Models?

1. Lower Latency and Reduced Memory Footprint: Smaller models with fewer weights inherently offer faster processing times due to reduced matrix multiplication operations. This lower latency is crucial for real-time applications where speed is paramount. Additionally, these models reduce the memory footprint, making them ideal for deployment in resource-constrained environments.

2. Cost Efficiency: Fine-tuning smaller models is more cost-effective than training large models from scratch. It reduces the computational resources required, thereby lowering operational costs. This makes it a viable option for startups and enterprises looking to optimize their AI expenditure.

3. Improved Accuracy: By tailoring a model to a specific function-calling use case, you can achieve higher accuracy. Fine-tuning allows the model to learn the intricacies of function-calling, thereby providing more relevant and precise outputs.

4. Smaller Token Size: Smaller models and efficient token handling lead to a reduction in token size, which further optimizes processing speed and resource usage.

Key Considerations for Fine-Tuning

a. Selection of the Right Base Model

Choosing the appropriate base model is crucial. Evaluate industrial benchmarks and leaderboards, such as the Berkeley Function Call Leaderboard, to guide your selection. Consider factors like model size (which affects GPU VRAM requirements), accuracy, and context length. For this blog post, we will use the Llama-3.2-3b-instruct model as our base model for fine-tuning.

b. Dataset Preparation

Proper dataset preparation is a cornerstone for successful fine-tuning of SLMs for function-calling tasks. The dataset must be representative of real-world scenarios and cover the full spectrum of use cases you anticipate. For this blog, we will utilize the glaiveai/glaive-function-calling-v2 dataset from Hugging Face, renowned for its comprehensive coverage of simple, multiple, and multi-turn function-calling scenarios across diverse domains.

Key Steps in Dataset Preparation:

Understanding the Complexity of the Use Case
Before diving into the technicalities of dataset preparation, it's essential to understand the complexity of the use case at hand. Is the task limited to function-calling, or does it involve a broader, more generic conversation? If the latter is true, it becomes imperative to ensure that the existing knowledge and capabilities of the SLM are preserved. The dataset should seamlessly integrate both function-call and non-function-call scenarios to provide a holistic conversational experience.

Differentiating Function-Calling Scenarios
Let's explore the different scenarios that might arise in function-calling applications:

- Single Function-Calling: This scenario involves invoking a single function based on user input. For instance, in the travel industry, a user might ask, "What are the available flights from New York to London on December 10th?"
  The dataset should include examples that allow the model to extract relevant information and call the flight-search function accurately.

- Multiple Function-Calling: Here, the language model must choose one function from a set of possible tools. For example, if a user asks, "Can you book me a hotel or a flight to Paris?", the dataset should provide instances where the model decides between booking a hotel or a flight based on user preferences or additional input.

- Multi-Turn Conversations: This scenario requires tools to be invoked in a sequence based on the conversation's state. Consider a user planning a vacation: "I want to visit Italy. What are my options?" followed by "Book me a flight," and then "Find a hotel in Rome." The dataset should capture the flow of conversation, enabling the model to handle each request in context.

- Parallel Function-Calling: In situations where multiple tools need to be invoked simultaneously, such as booking flights and hotels at the same time, the dataset should include examples that allow the model to manage these parallel tasks effectively. For instance, "Book a flight to Tokyo and reserve a hotel in Shinjuku for the same dates."

- Handling Missing Information: A robust dataset should also include scenarios where the language model needs to ask the user for missing information. For example, if a user simply says, "Book me a flight," the model should prompt, "Could you please specify the destination and dates?"

c. Compute Selection

Ensure your compute setup has adequate VRAM to accommodate model weights, gradients, and activations. The compute should be tailored to your model size and batch-size requirements.

d. Hyperparameter Selection

The selection of hyperparameters is a critical step that can significantly influence the performance of a model. Hyperparameters, unlike the model’s parameters, are not learned from the data but are set before the training process begins. Choosing the right hyperparameters can lead to faster convergence and higher accuracy, making this an area that demands careful attention. Hyperparameters can be thought of as the settings or knobs that you, as the model trainer, can adjust to tailor the training process. These include the learning rate, batch size, the architecture of layers, and more.

One of the leading methodologies for fine-tuning models is LoRA (Low-Rank Adaptation), which has gained popularity due to its efficiency and effectiveness. LoRA is a technique that allows for the efficient adaptation of large language models by introducing low-rank matrices during the training process. This approach reduces the number of trainable parameters, leading to faster convergence and reduced computational costs. When using LoRA, two primary hyperparameters to consider are:

- Rank: This represents the dimensionality of the low-rank matrices. It is a critical factor influencing the model’s capacity to learn nuanced patterns.
- Alpha: This is a scaling factor applied to the low-rank updates, typically set to 2-4 times the rank value.

A good starting point for these parameters might be a rank of 8 and an alpha of 16, but these values should be tailored based on the model's complexity and the specific task at hand.
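To make this concrete, here is what such a starting configuration might look like with the Hugging Face peft library (a sketch; the exact target modules depend on the base model's architecture):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                # rank of the low-rank update matrices
    lora_alpha=16,      # scaling factor, typically 2-4x the rank
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
```

(The training script later in this post passes an equivalent configuration through Unsloth's FastLanguageModel.get_peft_model instead.)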
e. Optimize Context Length

Another significant aspect of model fine-tuning, especially in function-calling scenarios, is the management of context length. In these prompts, we often provide detailed information such as function names, descriptions, and argument types, which consume a substantial number of tokens. Efficiently managing this context can lead to performance gains without sacrificing accuracy.

Iterative Experimentation with Context Details
To optimize context length, an iterative experimentation approach is recommended:

1. Baseline Experiment: Start by including all possible details—function descriptions, argument types, and more. This serves as your baseline for comparison.
2. Simplified Contexts: Gradually remove elements from the context:
   - First Iteration: Retain only the function names and arguments, omitting descriptions.
   - Second Iteration: Remove the arguments, keeping just the function names.
   - Final Iteration: Test the model's performance without any function names or arguments.

By incrementally simplifying the context, you can identify the minimal necessary detail the model needs to perform well. While conducting these experiments, it is advantageous to utilize previous checkpoints: instead of starting from the base model for each iteration, use the trained model from the previous step as a starting point. This approach can save time and computational resources, allowing for more efficient experimentation.

Fine-Tuning on Azure: Step-by-Step

Now let's run the fine-tuning job while adhering to all the guidelines and instructions shared above.

1. Create an Azure Machine Learning Workspace

An Azure Machine Learning workspace is your control center for managing all the resources you need to train, deploy, automate, and manage machine learning models. It serves as a central repository for your datasets, compute resources, and models. To get started, you can create a workspace through the Azure portal by navigating to the Azure Machine Learning service and selecting "Create new workspace." Ensure you configure the resource group, workspace name, region, and other necessary settings.

2. Create a Compute Instance

To run your Python notebook and execute scripts, you need a compute instance. This virtual machine in Azure Machine Learning allows you to perform data preparation, training, and experimentation. Go to the "Compute" section in your workspace, select "Create," and choose a compute instance that fits your needs, ensuring it has the necessary specifications for your workload.

3. Dataset Preparation

For this blog, we'll use the glaiveai/glaive-function-calling-v2 dataset from Hugging Face, which includes simple and multi-turn function-calling examples as well as generic conversations across various domains. The dataset needs to be reshaped into the OpenAI-compatible format:

- Convert each conversation into a chat_template format.
- Assign roles as 'system', 'user', or 'assistant'.
- Remove the "<|endoftext|>" string; if a response is a function call, replace the "<functioncall>" marker with structured data and assign the role 'tool' to the function's output, so the LLM knows when to stop responding and wait for the function execution results.
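For illustration, a single (simplified) raw record and its parsed form look roughly like this—the exact fields and phrasing in the real dataset vary:

```python
# Raw glaive-style chat string (simplified):
chat = ('USER: Book me a flight to Paris '
        'ASSISTANT: <functioncall> {"name": "book_flight", "arguments": \'{"destination": "Paris"}\'} <|endoftext|> '
        'FUNCTION RESPONSE: {"status": "booked"} '
        'ASSISTANT: Your flight to Paris is booked. <|endoftext|>')

# After parsing, roughly:
# [{"role": "user", "content": "Book me a flight to Paris"},
#  {"role": "assistant", "content": {"tool_uses": [{"recipient_name": "functions.book_flight",
#                                                   "parameters": {"destination": "Paris"}}]}},
#  {"role": "tool", "content": '{"status": "booked"}'},
#  {"role": "assistant", "content": "Your flight to Paris is booked."}]
```

The helper functions below implement this conversion and apply the tokenizer's chat template to produce the final training text: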
Remove "<|endoftext|>” string and if the response is a function-call, replace the “<functioncall>” string and add role as tool so that LLM knows when to stop responding and wait for function execution results def parse_conversation(input_string): ROLE_MAPPING = {"USER" : "user", "ASSISTANT" : "assistant", "SYSTEM" : "system", "FUNCTION RESPONSE" : "tool"} # Regular expression to split the conversation based on SYSTEM, USER, and ASSISTANT pattern = r"(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):" # Split the input string and keep the delimiters parts = re.split(pattern, input_string) # Initialize the list to store conversation entries conversation = [] # Iterate over the parts, skipping the first empty string for i in range(1, len(parts), 2): role = parts[i].strip() content = parts[i + 1].strip() content = content.replace("<|endoftext|>", "").strip() if content.startswith('<functioncall>'): # build structured data for function call # try to turn function call from raw text to structured data content = content.replace('<functioncall>', '').strip() # replace single quotes with double quotes for valid JSON clean_content = content.replace("'{", '{').replace("'}", '}') data_json = json.loads(clean_content) # Make it compatible with openAI prompt format func_call = {'recipient_name': f"functions.{data_json['name']}", 'parameters': data_json['arguments']} content = {'tool_uses': [func_call]} # Append a dictionary with the role and content to the conversation list conversation.append({"role": ROLE_MAPPING[role], "content": content}) return conversation def prepare_dataset(tokenizer, args): # Create the cache_dir cache_dir = "./outputs/dataset" os.makedirs(cache_dir, exist_ok = True) # Load the dataset from disk train_dataset = load_from_disk(args.train_dir) eval_dataset = load_from_disk(args.val_dir) column_names = list(train_dataset.features) def apply_chat_template(examples): conversations = [] for system, chat in zip(examples["system"], examples["chat"]): try: system_message = parse_conversation(system) chat_message = parse_conversation(chat) message = system_message + chat_message conversations.append(message) except Exception as e: print(e) text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in conversations] return {"text": text} # process the dataseta and drop unused columns processed_train_dataset = train_dataset.map(apply_chat_template, cache_file_name = f"{cache_dir}/cache.arrow", batched = True, remove_columns=column_names) processed_eval_dataset = eval_dataset.map(apply_chat_template, cache_file_name = f"{cache_dir}/cache.arrow", batched = True, remove_columns=column_names) return processed_train_dataset, processed_eval_dataset 4: Create a Data Asset: Azure Machine Learning allows you to register datasets as data assets, making them easily manageable and reusable: def get_or_create_data_asset(ml_client, data_name, data_local_dir, update=False): try: latest_data_version = max([int(d.version) for d in ml_client.data.list(name=data_name)]) if update: raise ResourceExistsError('Found Data asset, but will update the Data.') else: data_asset = ml_client.data.get(name=data_name, version=latest_data_version) logger.info(f"Found Data asset: {data_name}. 
Will not create again") except (ResourceNotFoundError, ResourceExistsError) as e: data = Data( path=data_local_dir, type=AssetTypes.URI_FOLDER, description=f"{data_name} for fine tuning", tags={"FineTuningType": "Instruction", "Language": "En"}, name=data_name ) data_asset = ml_client.data.create_or_update(data) logger.info(f"Created/Updated Data asset: {data_name}") return data_asset train_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_train", data_local_dir=f"{DATA_DIR}/train", update=True) val_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_val", data_local_dir=f"{DATA_DIR}/val", update=True) test_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_test", data_local_dir=f"{DATA_DIR}/test", update=True) 5: Create an Environment: While Azure provides built-in environments for common use cases, creating a custom environment tailored to your specific needs can be beneficial. An environment in Azure ML is essentially a containerized setup that defines the software, libraries, and other dependencies required to run your machine learning workload. Why Use Environments? Reproducibility: By defining an environment, you ensure that your training and inference processes are reproducible, with the same configuration used every time. Consistency: Environments help maintain consistency across different runs and teams, reducing "it works on my machine" problems. Portability: They encapsulate your dependencies, making it easier to move and share your ML projects across different Azure services or even with external collaborators. %%writefile {CLOUD_DIR}/train/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2 USER root # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update && apt-get -y upgrade RUN pip install --upgrade pip RUN apt-get install -y openssh-server openssh-client # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation def get_or_create_docker_environment_asset(ml_client, env_name, docker_dir, update=False): try: latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)]) if update: raise ResourceExistsError('Found Environment asset, but will update the Environment.') else: env_asset = ml_client.environments.get(name=env_name, version=latest_env_version) print(f"Found Environment asset: {env_name}. Will not create again") except (ResourceNotFoundError, ResourceExistsError) as e: print(f"Exception: {e}") env_docker_image = Environment( build=BuildContext(path=docker_dir), name=env_name, description="Environment created from a Docker context.", ) env_asset = ml_client.environments.create_or_update(env_docker_image) print(f"Created Environment asset: {env_name}") return env_asset env = get_or_create_docker_environment_asset(ml_client, azure_env_name, docker_dir=f"{CLOUD_DIR}/train", update=False) Reference : training.ipynb 6: Create a Training Script: Your training script will handle the fine-tuning process and log metrics using MLflow, which is tightly integrated with Azure Machine Learning. This involves - Loading the dataset, defining the model architecture, writing functions to track and log metrics such as training and evaluation loss. 
Step 6: Create a Training Script: Your training script will handle the fine-tuning process and log metrics using MLflow, which is tightly integrated with Azure Machine Learning. This involves loading the dataset, defining the model architecture, and writing functions to track and log metrics such as training and evaluation loss.

```python
def main(args):
    ###################
    # Hyper-parameters
    ###################
    # Only overwrite environ if a wandb param was passed
    if len(args.wandb_project) > 0:
        os.environ['WANDB_API_KEY'] = args.wandb_api_key
        os.environ["WANDB_PROJECT"] = args.wandb_project
    if len(args.wandb_watch) > 0:
        os.environ["WANDB_WATCH"] = args.wandb_watch
    if len(args.wandb_log_model) > 0:
        os.environ["WANDB_LOG_MODEL"] = args.wandb_log_model

    use_wandb = len(args.wandb_project) > 0 or ("WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0)

    training_config = {
        "per_device_train_batch_size": args.train_batch_size,  # Batch size per device
        "per_device_eval_batch_size": args.eval_batch_size,    # Batch size for evaluation
        "gradient_accumulation_steps": args.grad_accum_steps,
        "warmup_ratio": args.warmup_ratio,                     # Ratio of warmup steps
        "learning_rate": args.learning_rate,
        "fp16": not torch.cuda.is_bf16_supported(),
        "bf16": torch.cuda.is_bf16_supported(),
        "optim": "adamw_8bit",
        "lr_scheduler_type": args.lr_scheduler_type,
        "output_dir": args.output_dir,
        "logging_steps": args.logging_steps,
        "logging_strategy": "epoch",
        "save_steps": args.save_steps,
        "eval_strategy": "epoch",
        "num_train_epochs": args.epochs,
        # "load_best_model_at_end": True,
        "save_only_model": False,
        "seed": 0
    }

    peft_config = {
        "r": args.lora_r,
        "lora_alpha": args.lora_alpha,
        "lora_dropout": args.lora_dropout,
        "bias": "none",
        # "target_modules": "all-linear",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        "modules_to_save": None,
        "use_gradient_checkpointing": "unsloth",
        "use_rslora": False,
        "loftq_config": None,
    }

    checkpoint_dir = os.path.join(args.output_dir, "checkpoints")
    train_conf = TrainingArguments(
        **training_config,
        report_to="wandb" if use_wandb else "azure_ml",
        run_name=args.wandb_run_name if use_wandb else None,
    )

    model, tokenizer = load_model(args)
    model = FastLanguageModel.get_peft_model(model, **peft_config)

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = train_conf.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log a small summary on each process
    logger.warning(
        f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
        + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
    )
    logger.info(f"Training/evaluation parameters {train_conf}")
    logger.info(f"PEFT parameters {peft_config}")

    # Load the dataset
    train_dataset, eval_dataset = prepare_dataset(tokenizer, args)

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        packing=False  # Packing can make training 5x faster for short responses
    )

    # Show current memory stats
    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    logger.info(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    logger.info(f"{start_gpu_memory} GB of memory reserved.")

    # Resume from the most recent checkpoint if one exists
    last_checkpoint = None
    if os.path.isdir(checkpoint_dir):
        checkpoints = [os.path.join(checkpoint_dir, d) for d in os.listdir(checkpoint_dir)]
        if len(checkpoints) > 0:
            checkpoints.sort(key=os.path.getmtime, reverse=True)
            last_checkpoint = checkpoints[0]

    trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint)

    #############
    # Evaluation
    #############
    tokenizer.padding_side = "left"
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(eval_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

    ############
    # Save model
    ############
    os.makedirs(args.model_dir, exist_ok=True)
    if args.save_merged_model:
        print("Save PEFT model with merged 16-bit weights")
        model.save_pretrained_merged("outputs", tokenizer, save_method="merged_16bit")
    else:
        print(f"Save PEFT model: {args.model_dir}/model")
        model.save_pretrained(f"{args.model_dir}/model")
    tokenizer.save_pretrained(args.model_dir)
```

Reference: train.py

Step 7: Create the Compute Cluster: For this experiment, we are using Standard_NC24ads_A100_v4, which has one GPU with 80 GB of VRAM. Select the compute based on the model size and batch size.

```python
from azure.ai.ml.entities import AmlCompute

### Create the compute cluster
try:
    compute = ml_client.compute.get(azure_compute_cluster_name)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {azure_compute_cluster_size}!")
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        tier = 'LowPriority' if USE_LOWPRIORITY_VM else 'Dedicated'
        compute = AmlCompute(
            name=azure_compute_cluster_name,
            size=azure_compute_cluster_size,
            tier=tier,
            max_instances=1,  # For multi-node training set this to an integer value greater than 1
        )
        ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print(f"Error creating compute cluster: {e}")
```

Step 8: Submit the Fine-Tuning Job: With everything set up, you can now submit your fine-tuning job:

```python
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    inputs=dict(
        # train_dir=Input(type="uri_folder", path=DATA_DIR),  # Get data from a local path
        train_dir=Input(path=f"{AZURE_DATA_NAME}_train@latest"),  # Get data from a Data asset
        val_dir=Input(path=f"{AZURE_DATA_NAME}_val@latest"),
        epoch=d['train']['epoch'],
        train_batch_size=d['train']['train_batch_size'],
        eval_batch_size=d['train']['eval_batch_size'],
    ),
    code=f"{CLOUD_DIR}/train",  # local path where the code is stored
    compute=azure_compute_cluster_name,
    command="python train_v3.py --train_dir ${{inputs.train_dir}} --val_dir ${{inputs.val_dir}} --train_batch_size ${{inputs.train_batch_size}} --eval_batch_size ${{inputs.eval_batch_size}}",
    # environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/77",  # Use a built-in Environment asset
    environment=f"{azure_env_name}@latest",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,  # For multi-GPU training set this to an integer value greater than 1
    },
)
returned_job = ml_client.jobs.create_or_update(job)
ml_client.jobs.stream(returned_job.name)
```
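Besides watching the streamed output and the studio UI (covered next), you can pull the logged metrics programmatically once the run completes. This is a hedged sketch: it assumes the azureml-mlflow package is installed and that the job name doubles as the MLflow run id, which is the usual Azure ML convention.

```python
import mlflow

# Point MLflow at the workspace tracking store
ws = ml_client.workspaces.get(ml_client.workspace_name)
mlflow.set_tracking_uri(ws.mlflow_tracking_uri)

# Fetch the metrics logged by the training job
run = mlflow.get_run(returned_job.name)
print(run.data.metrics)  # e.g. {'train_loss': ..., 'eval_loss': ...}
```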
Step 9: Monitor Training Metrics: After initiating the job, keep an eye on the output for key metrics like training loss and evaluation loss. Since the results are logged to MLflow, which is seamlessly integrated with Azure Machine Learning, you can easily review the loss curves by navigating to the Metrics tab within the job's page.

Key takeaways:

- Both the training and evaluation loss decrease significantly in the initial steps, suggesting effective learning.
- The gradual reduction in loss in subsequent steps indicates that the model continues to refine its parameters, but at a slower rate.
- The consistent downward trend for both training and evaluation loss implies that the model is not overfitting and is generalizing well to new data. However, the slight uptick in evaluation loss towards the end should be monitored to ensure it doesn't indicate overfitting at later stages.

Overall, the run looks promising, so let's go ahead and register the model.

Step 10: Register the Model: After fine-tuning, register the model to make it available for deployment:

```python
from azureml.core import Workspace, Run
import os

# Connect to your workspace
ws = Workspace.from_config()
experiment_name = 'experiment_name'
run_id = 'job_name'
run = Run(ws.experiments[experiment_name], run_id)

# Register the model
model = run.register_model(
    model_name=d["serve"]["azure_model_name"],  # the name the model will be registered under
    model_path="outputs"  # the path to the model file in the run's outputs
)

# Create a local directory to save the outputs
local_folder = './model_v2'
os.makedirs(local_folder, exist_ok=True)

# Download the entire outputs folder
run.download_files(prefix='outputs', output_directory=local_folder)
```

Step 11: Deploy the Model to a Managed Online Endpoint: Managed online endpoints provide a seamless way to deploy models without managing the underlying infrastructure. They offer scalability, versioning, and easy rollback compared to deploying on an Azure Kubernetes Service (AKS) cluster.

Step 11a: Build the environment: To deploy the model to a managed online endpoint, first create an environment with the required dependencies and a web server for inference.

```dockerfile
%%writefile {CLOUD_DIR}/serve/Dockerfile
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2

# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

# Inference requirements
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default

ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888

# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update
RUN apt-get install -y openssh-server openssh-client
RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation
```

Reference: serving.ipynb

Step 11b: Create a serving script: Creating a serving script for inference is a crucial step in deploying your machine learning model to a production environment. This script handles incoming requests, processes input data, runs the model inference, and returns the results. In Azure Machine Learning, the serving script is part of the deployment package for your model, typically used in conjunction with a managed endpoint or a Kubernetes service.
A serving script in Azure ML typically consists of two main functions:

- init(): Initializes the model and any other necessary resources. It is called once, when the deployment is first loaded.
- run(data): Called every time a request is made to the deployed model. It processes the incoming data, performs inference using the model, and returns the results.

```python
import os
import re
import json
import torch
import base64
import logging

from io import BytesIO
from transformers import AutoTokenizer, AutoProcessor, pipeline
from transformers import AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def init():
    """
    This function is called when the container is initialized/started, typically after
    create/update of the deployment. Put init operations here, such as caching the model in memory.
    """
    global model
    global tokenizer

    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION).
    # Provide your model's folder name if there is one.
    model_name_or_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "outputs"
    )

    model_kwargs = dict(
        trust_remote_code=True,
        device_map={"": 0},
        torch_dtype="auto"
    )

    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, **model_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    logging.info("Loaded model.")


def run(json_data: str):
    logging.info("Request received")
    data = json.loads(json_data)
    input_data = data["input_data"]
    params = data['params']

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    output = pipe(input_data, **params)
    result = output[0]["generated_text"]

    logging.info(f"Generated text : {result}")
    json_result = {"result": str(result)}
    return json_result
```

Reference: score.py

Step 11c: Create a managed online endpoint and deploy the model to it: Creating an endpoint and deploying your model on Azure Machine Learning is the final step to make your model accessible for real-time inference. This process involves setting up a service that can handle incoming requests, execute the model, and return the results.

Why create an endpoint? An endpoint is a network-accessible interface that allows external applications or users to interact with your deployed machine learning model. Creating an endpoint is crucial for the following reasons:

- Accessibility: Endpoints make your model accessible over the internet or within a secured network, enabling other applications, services, or users to send requests and receive responses.
- API Integration: By exposing your model as a RESTful API, endpoints facilitate integration with various applications, allowing seamless communication and data exchange.
- Load Management: An endpoint can manage requests from multiple clients, handling concurrent requests and distributing the load appropriately.
- Security: Endpoints provide mechanisms for authentication and authorization, ensuring that only authorized users can access the model.
- Scalability: Azure-managed endpoints can automatically scale based on demand, ensuring that your model can handle varying workloads without manual intervention.
```python
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

azure_endpoint_name = d['serve']['azure_endpoint_name']

# Check if the endpoint already exists in the workspace
try:
    endpoint = ml_client.online_endpoints.get(azure_endpoint_name)
    print("---Endpoint already exists---")
except:
    # Define an online endpoint if it doesn't exist
    endpoint = ManagedOnlineEndpoint(
        name=azure_endpoint_name,
        description=f"Test endpoint for {model.name}",
    )

# Trigger the endpoint creation
try:
    ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(f"Endpoint creation failed. Detailed Response:\n{err}") from err
```

Why deploy a model? Deployment is the process of transferring your trained machine learning model from a development environment to a production environment where it can serve real-time predictions. Deployment is critical because:

- Operationalization: Deployment operationalizes your model, moving it from an experimental or development phase to a live environment where it can deliver value to end-users or systems.
- Resource Allocation: Deploying a model involves configuring the necessary compute resources (such as CPU, memory, and GPUs) to ensure optimal performance during inference.
- Environment Consistency: During deployment, the model is packaged with its dependencies in a consistent environment, ensuring reproducibility and minimizing discrepancies between development and production.
- Monitoring and Maintenance: Deployment sets up the infrastructure to monitor the model's performance, usage, and health, allowing for ongoing maintenance and updates.
- Version Control: Deployment allows you to manage and update different versions of your model, providing flexibility to roll back or switch to newer versions as needed.

```python
from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

azure_deployment_name = f"{d['serve']['azure_deployment_name']}-v1"

deployment = ManagedOnlineDeployment(
    name=azure_deployment_name,
    endpoint_name=azure_endpoint_name,
    model=model,
    instance_type=azure_compute_cluster_size,
    instance_count=1,
    # code_configuration=code_configuration,
    environment=env,
    scoring_script="score.py",
    code_path=f"./{CLOUD_DIR}/inference",
    # environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(
        max_concurrent_requests_per_instance=20,
        request_timeout_ms=90000,
        max_queue_wait_ms=60000
    ),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(f"Deployment creation failed. Detailed Response:\n{err}") from err

endpoint.traffic = {azure_deployment_name: 100}
endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)
```

Step 12: Run Inference on Sample Data: Test the deployed model using sample data that expects function calls:

```python
import json
import os

request_file = "./request.json"  # local file to hold the request payload

sample = {
    "input_data": [
        {'role': 'system', 'content': 'You are a helpful assistant who has access to the following functions to help the user; you can use the functions if needed: { "name": "calculate_shipping_cost", "description": "Calculate the cost of shipping a package", "parameters": { "type": "object", "properties": { "weight": { "type": "number", "description": "The weight of the package in pounds" }, "destination": { "type": "string", "description": "The destination of the package" } }, "required": [ "weight", "destination" ] } }'},
        {'role': 'user', 'content': 'Can you help me with shipping cost for a package?'},
        {'role': 'assistant', 'content': 'Sure! I can help you with that. Please provide me with the weight and destination of the package.'},
        {'role': 'user', 'content': 'The weight of the package is 10 pounds and the destination is New York.'}
    ],
    "params": {
        "temperature": 0.1,
        "max_new_tokens": 512,
        "do_sample": True,
        "return_full_text": False
    }
}

# Dump the sample data into a json file
with open(request_file, "w") as f:
    json.dump(sample, f)

result = ml_client.online_endpoints.invoke(
    endpoint_name=azure_endpoint_name,
    deployment_name=azure_deployment_name,
    request_file=request_file
)

result_json = json.loads(result)
result = result_json['result']
print(result)
```

Step 13: Compare with the Base Model: Now let's run the same sample through the base model to observe the difference in performance. While the fine-tuned model does a perfect job of generating a response with the right function and arguments, the base model struggles to produce the desired output.

Step 14: Rerun the Fine-Tuning Job Without Function Descriptions: Now let's rerun the experiment, but this time drop the function descriptions from the dataset for context-length optimization:

```python
def remove_desc_from_prompts(data):
    system_message = data['system']
    pattern = r'"description":\s*"[^"]*",?\n?'

    # Remove the "description" fields
    cleaned_string = re.sub(pattern, '"description":"",', system_message)
    return cleaned_string


## Update the system message by removing function descriptions and argument descriptions
train_dataset = train_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])
test_dataset = test_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])
val_dataset = val_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])

train_dataset.save_to_disk(f"{DATA_DIR}/train")
test_dataset.save_to_disk(f"{DATA_DIR}/test")
val_dataset.save_to_disk(f"{DATA_DIR}/val")
```

Reference: preprocess.py

As the results show, removing the function descriptions doesn't degrade model performance; instead, this fine-tuned variant requires fewer input tokens, yielding a significant reduction in token consumption along with improved latency.

Step 15: Further Exploration: Consider removing the arguments, or even the function definitions themselves, in subsequent experiments to evaluate performance; a sketch of the first of these follows below.
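For instance, stripping everything but the function names could look something like the following. This is a hypothetical helper in the spirit of remove_desc_from_prompts, not code from the original experiments; it assumes the system message consists of instruction text followed by JSON-style function definitions.

```python
import re

def keep_function_names_only(data):
    """Hypothetical: rebuild the system message so it lists only the function
    names, dropping descriptions and argument schemas entirely."""
    system_message = data['system']
    # Collect every "name" field from the embedded function definitions
    names = re.findall(r'"name"\s*:\s*"([^"]+)"', system_message)
    # Keep the instruction text that precedes the first JSON block
    header = system_message.split('{', 1)[0].strip()
    return header + " Available functions: " + ", ".join(names)

train_dataset = train_dataset.map(lambda x: {"updated_system": keep_function_names_only(x)}, remove_columns=["system"])
```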
Conclusion

This blog post has walked through the process of fine-tuning an SLM for function calling on Azure Machine Learning. By following these steps, you can effectively tailor a model to meet specific functional requirements. You can access the full code here. For a deeper dive into evaluating fine-tuned models, including metrics and code samples, check out the next blog post. By leveraging Azure's powerful tools, you can streamline the development and deployment of machine learning models, making them more efficient and effective for your specific tasks.

References:

- Fine tuning for function calling | OpenAI Cookbook
- Fine-tuning function calls with Azure OpenAI Service - Azure AI services | Microsoft Learn
- michaelnny/Llama3-FunctionCalling: Fine-tune Llama3 model to support function calling
- Fine Tuning LLMs for Function Calling w/ Pawel Garbacki - YouTube
- slm-innovator-lab/2_slm-fine-tuning-mlstudio at main · Azure/slm-innovator-lab

Fine-Tuning DeepSeek-R1-Distill-Llama-8B with PyTorch FSDP, QLoRA on Azure Machine Learning
Large Language Models (LLMs) have demonstrated remarkable capabilities across various industries, revolutionizing how we approach tasks like legal document summarization, creative content generation, and customer sentiment analysis. However, adapting these general-purpose models to excel in specific domains often requires fine-tuning, which allows us to tailor LLMs to unique requirements and improve their performance on targeted tasks. In this blog post, we'll explore the process of fine-tuning the DeepSeek-R1-Distill-Llama-8B model, highlighting the advantages of using PyTorch Fully Sharded Data Parallel (FSDP) and Quantization-Aware Low-Rank Adaptation (QLoRA) techniques in conjunction with the Azure Machine Learning platform.

Why Fine-Tuning Matters

In some cases, LLMs may not perform well on specific domains, tasks, or datasets, or may produce inaccurate or misleading outputs. In such cases, fine-tuning the model can be a useful technique to adapt it to the desired goal and improve its quality and reliability.

- Hallucinations: Hallucinations are untrue statements output by the model. They can harm the credibility and trustworthiness of your application. One possible mitigation is fine-tuning the model with data that contains accurate and consistent information.
- Accuracy and quality problems: Pre-trained models may not achieve the desired level of accuracy or quality for a specific task or domain. This shortfall can be due to a mismatch between the pre-training data and the target data, the diversity and complexity of the target data, and/or incorrect evaluation metrics and criteria.

DeepSeek-R1 is an open-source language model excelling in text-based tasks, including creative writing, question answering, editing, and summarization. It is particularly strong in reasoning-intensive tasks like coding, math, and explaining scientific concepts. DeepSeek-R1 stands out due to its mixture-of-experts (MoE) architecture and use of reinforcement learning, achieving high performance with greater efficiency and lower costs compared to other models. It has 671 billion parameters across multiple expert networks, but only 37 billion are required for a single forward pass. DeepSeek-R1 uses reinforcement learning (RL) to generate a chain of thought (CoT) before delivering its final answer. To make these capabilities more accessible, DeepSeek has distilled its R1 outputs into several smaller models based on the Qwen and Llama architectures:

- Qwen-based distilled models: 1.5B, 7B, 14B, and 32B
- Llama-based distilled models: 8B and 70B

DeepSeek-R1-Distill-Llama-8B is a distilled large language model (LLM) based on the Llama architecture, created using outputs from the larger DeepSeek-R1 model. Through knowledge distillation, the reasoning patterns of the 671-billion-parameter DeepSeek-R1 model are transferred into a smaller, more efficient model. With only 8 billion parameters, DeepSeek-R1-Distill-Llama-8B is computationally efficient while retaining a significant portion of the original model's performance. It is fine-tuned from models like Llama-3.1-8B-Instruct, achieving high performance across multiple benchmarks. This distilled model offers a balance of performance and resource requirements, improving inference speed and reducing computational costs, making it cost-effective for production deployments.
PyTorch FSDP: Scaling Fine-Tuning with Data Parallelism

PyTorch Fully Sharded Data Parallel (FSDP) is a distributed training framework that addresses the challenges of fine-tuning large models by sharding model parameters, optimizer states, and gradients across multiple GPUs. This technique enables you to train models with billions of parameters on systems with limited GPU memory.

QLoRA: Efficient Fine-Tuning with Quantization and Low-Rank Adaptation

Quantization-Aware Low-Rank Adaptation (QLoRA) is a parameter-efficient fine-tuning technique that reduces memory usage and accelerates training by quantizing the model weights and fine-tuning only a small subset of parameters via Low-Rank Adaptation (LoRA), making training faster and more memory efficient.

Azure Machine Learning: Your Platform for Scalable Fine-Tuning

Azure Machine Learning provides a robust platform for fine-tuning LLMs, offering a comprehensive suite of tools and services to streamline the process.

- Scalable compute: Azure Machine Learning Compute provides virtual machines (VMs) that run parts of the distributed deep learning job, auto-scaling as necessary. Compute clusters can schedule tasks, collect results, adjust resources to actual loads, and manage errors. VMs that participate in the cluster can be GPU-enabled to accelerate deep learning calculations.
- Data storage: Azure offers standard and premium blob storage options for storing training data and execution logs. Premium blob storage enables the high-performance access during model training that distributed training requires.
- Experiment tracking: Azure Machine Learning provides tools for tracking and managing your fine-tuning experiments, allowing you to monitor performance metrics and reproduce your results.

Hands-on Lab

Now let's fine-tune the model and deploy it on Azure ML. First, set up an Azure Machine Learning (ML) client using DefaultAzureCredential for authentication:

```python
# import required libraries
"""
This script sets up an Azure Machine Learning (ML) client using the DefaultAzureCredential for authentication.
It imports the necessary libraries and handles exceptions during the ML client initialization.

Modules imported:
- time: Provides various time-related functions.
- azure.identity: Provides authentication capabilities with DefaultAzureCredential and InteractiveBrowserCredential.
- azure.ai.ml: Contains classes and functions for interacting with Azure ML services, including MLClient, Input,
  pipeline, load_component, command, Data, Environment, BuildContext, Model, Output, and AssetTypes.
- azure.core.exceptions: Contains exceptions for handling resource-related errors.
- os: Provides a way to interact with the operating system.

Variables:
- credential: An instance of DefaultAzureCredential used for authenticating with Azure services.
- ml_client: An instance of MLClient initialized using the provided credentials.
  If the initialization fails, an exception is caught and printed.
"""
import time
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data, Environment, BuildContext, Model
from azure.ai.ml.constants import AssetTypes
from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError
import os

credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
```

Now install the libraries required to download the dataset and run the OpenAI client:

```python
%conda run -n azureml_py310_sdkv2 pip install datasets==3.2.0 openai
```

Create a folder for our training environment:

```python
os.makedirs("environment_train", exist_ok=True)
```

Build the Docker environment:

```dockerfile
%%writefile environment_train/Dockerfile
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202501.3

USER root

# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update && apt-get -y upgrade
RUN pip install --upgrade pip
RUN apt-get install -y openssh-server openssh-client

# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation
```

Let's also specify our requirements.txt:

```text
%%writefile environment_train/requirements.txt
transformers==4.48.2
peft==0.14.0
accelerate==1.3.0
bitsandbytes==0.45.1
datasets==3.2.0
evaluate==0.4.3
huggingface_hub[hf_transfer]
safetensors>=0.5.2
sentencepiece==0.2.0
scikit-learn==1.6.1
tokenizers>=0.21.0
py7zr
```

With both files in place, create the Azure ML custom training environment:

```python
env_name = "deepseek-training"

env_docker_image = Environment(
    build=BuildContext(path="environment_train", dockerfile_path="Dockerfile"),
    name=env_name,
    description="Environment created for llm fine-tuning.",
    version="1"
)
env_asset_train = ml_client.environments.create_or_update(env_docker_image)
```

While the training environment builds, let's start with dataset preparation:

```python
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
df = pd.DataFrame(dataset['train'])
df = df.iloc[0:2000]
df.head()
```

The df.head() call gives a quick snapshot of what the dataset looks like. Now let's split the dataset into train and test sets for validation:

```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1, random_state=42)
print("Number of train elements: ", len(train))
print("Number of test elements: ", len(test))
```

Next, create the prompt template for the fine-tuning process. In this case we use a chain-of-thought (CoT) prompt template:

```python
# custom instruct prompt start
prompt_template = f"""
<|begin▁of▁sentence|>
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

<|User|>
{{question}}

<|Assistant|>
<think>
{{complex_cot}}
</think>

{{answer}}
<|end▁of▁sentence|>
"""

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(question=sample["Question"],
                                            complex_cot=sample["Complex_CoT"],
                                            answer=sample["Response"])
    return sample
```

Let's map this prompt over the whole dataset and create the train and test JSONL files:

```python
from datasets import Dataset, DatasetDict
from random import randint

train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)
dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))
print(train_dataset[randint(0, len(dataset))]["text"])

test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features))

train_dataset.to_json(f"data/train.jsonl")
test_dataset.to_json(f"data/eval.jsonl")
```

Now let's create our training script folder:

```python
os.makedirs("src_train", exist_ok=True)
```

Write train.py, which uses both QLoRA and PyTorch FSDP:

```python
%%writefile src_train/train.py
import os
import argparse
import sys
import logging
import datetime
import traceback

import torch
import transformers
from accelerate import Accelerator
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
from huggingface_hub import snapshot_download
from datasets import load_dataset


def download_model(model_name):
    print("Downloading model ", model_name)
    os.makedirs("/tmp/tmp_folder", exist_ok=True)
    snapshot_download(repo_id=model_name, local_dir="/tmp/tmp_folder")
    print(f"Model {model_name} downloaded under /tmp/tmp_folder")


def init_distributed():
    # Initialize the process group
    torch.distributed.init_process_group(
        backend="nccl",  # Use "gloo" backend for CPU
        timeout=datetime.timedelta(seconds=5400)
    )
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


def main(args):
    model_name = args.model_name_or_path
    train_ds = load_dataset('json', data_files=args.train_file, split='train')
    test_ds = load_dataset('json', data_files=args.eval_file, split='train')

    per_device_train_batch_size = args.train_batch_size
    per_device_eval_batch_size = args.eval_batch_size
    gradient_accumulation_steps = args.grad_accum_steps
    learning_rate = args.learning_rate
    num_train_epochs = args.epochs

    lora_r = 8
    lora_alpha = 16
    lora_dropout = 0.1

    fsdp = "full_shard auto_wrap offload"
    fsdp_config = {
        'backward_prefetch': 'backward_pre',
        'cpu_ram_efficient_loading': True,
        'offload_params': True,
        'forward_prefetch': False,
        'use_orig_params': False
    }

    gradient_checkpointing = False
    merge_weights = True
    seed = 42
    token = None
    model_dir = args.model_dir

    if torch.cuda.is_available() and (torch.cuda.device_count() > 1 or int(os.environ.get("SM_HOST_COUNT", 1)) > 1):
        # Call this function at the beginning of your script
        local_rank = init_distributed()
        # Now you can use distributed functionalities
        torch.distributed.barrier(device_ids=[local_rank])

    os.environ.update({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    set_seed(seed)

    accelerator = Accelerator()

    if token is not None:
        os.environ.update({"HF_TOKEN": token})
    accelerator.wait_for_everyone()

    # Download the base model once per host
    if int(os.environ.get("SM_HOST_COUNT", 1)) == 1:
        if accelerator.is_main_process:
            download_model(model_name)
    else:
        download_model(model_name)

    accelerator.wait_for_everyone()
    model_name = "/tmp/tmp_folder"

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set the tokenizer's pad token
    tokenizer.pad_token = tokenizer.eos_token

    with accelerator.main_process_first():
        # tokenize and chunk the dataset
        lm_train_dataset = train_ds.map(
            lambda sample: tokenizer(sample["text"]),
            remove_columns=list(train_ds.features)
        )
        print(f"Total number of train samples: {len(lm_train_dataset)}")

        if test_ds is not None:
            lm_test_dataset = test_ds.map(
                lambda sample: tokenizer(sample["text"]),
                remove_columns=list(test_ds.features)
            )
            print(f"Total number of test samples: {len(lm_test_dataset)}")
        else:
            lm_test_dataset = None

    torch_dtype = torch.bfloat16

    # Define additional configs for FSDP
    if fsdp != "" and fsdp_config is not None:
        bnb_config_params = {"bnb_4bit_quant_storage": torch_dtype}
        model_configs = {"torch_dtype": torch_dtype}
        fsdp_configurations = {
            "fsdp": fsdp,
            "fsdp_config": fsdp_config,
            "gradient_checkpointing_kwargs": {"use_reentrant": False},
            "tf32": True
        }
    else:
        bnb_config_params = dict()
        model_configs = dict()
        fsdp_configurations = dict()

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        **bnb_config_params
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        attn_implementation="flash_attention_2",
        use_cache=not gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs
    )

    if fsdp == "" and fsdp_config is None:
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset if lm_test_dataset is not None else None,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            logging_strategy="steps",
            logging_steps=1,
            log_on_each_node=False,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=True,
            ddp_find_unused_parameters=False,
            save_strategy="no",
            output_dir="outputs",
            **fsdp_configurations
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    if trainer.accelerator.is_main_process:
        trainer.model.print_trainable_parameters()

    trainer.train()

    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

    if merge_weights:
        output_dir = "/tmp/model"
        # save the int4 adapter model before merging
        trainer.model.save_pretrained(output_dir, safe_serialization=False)

        if accelerator.is_main_process:
            # clear memory
            del model
            del trainer
            torch.cuda.empty_cache()

            # load the PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
            )

            # Merge LoRA with the base model and save
            model = model.merge_and_unload()
            model.save_pretrained(model_dir, safe_serialization=True, max_shard_size="2GB")
    else:
        trainer.model.save_pretrained(model_dir, safe_serialization=True)

    if accelerator.is_main_process:
        tokenizer.save_pretrained(model_dir)
    accelerator.wait_for_everyone()


def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()

    # hyperparameters
    parser.add_argument("--model_name_or_path", default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", type=str, help="Model to fine-tune")
    parser.add_argument("--train_file", type=str, help="Input data for training")
    parser.add_argument("--eval_file", type=str, help="Input data for eval")
    parser.add_argument("--epochs", default=1, type=int, help="number of epochs")
    parser.add_argument("--train_batch_size", default=2, type=int, help="training - mini batch size for each gpu/process")
    parser.add_argument("--eval_batch_size", default=4, type=int, help="evaluation - mini batch size for each gpu/process")
    parser.add_argument("--grad_accum_steps", default=4, type=int, help="gradient accumulation steps")
    parser.add_argument("--learning_rate", default=2e-4, type=float, help="learning rate")
    parser.add_argument("--save_merged_model", type=bool, default=False)
    parser.add_argument("--model_dir", type=str, default="./", help="output directory for the model")

    # parse args
    args = parser.parse_args()
    return args


if __name__ == "__main__":
    args = parse_args()
    main(args)
```

The next step is to create the compute cluster on which the training will run:

```python
azure_compute_cluster_name = "a100-compute"
azure_compute_cluster_size = "Standard_NC24ads_A100_v4"
USE_LOWPRIORITY_VM = True

from azure.ai.ml.entities import AmlCompute

### Create the compute cluster
try:
    compute = ml_client.compute.get(azure_compute_cluster_name)
except Exception as ex:
    try:
        tier = "LowPriority" if USE_LOWPRIORITY_VM else "Dedicated"
        compute = AmlCompute(
            name=azure_compute_cluster_name,
            size=azure_compute_cluster_size,
            tier=tier,
            max_instances=1,  # For multi-node training set this to an integer value greater than 1
        )
        ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print(e)
```

Once the compute is ready, run the training job:

```python
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

str_command = ""
str_command += "python train.py --train_file ${{inputs.train_file}} --eval_file ${{inputs.eval_file}} \
    --epochs ${{inputs.epoch}} --train_batch_size ${{inputs.train_batch_size}} \
    --eval_batch_size ${{inputs.eval_batch_size}} --model_name_or_path ${{inputs.model_name_or_path}} \
    --model_dir ${{inputs.model_dir}} --save_merged_model ${{inputs.save_merged_model}}"

job = command(
    inputs=dict(
        train_file=Input(type="uri_file", path="data/train.jsonl"),
        eval_file=Input(type="uri_file", path="data/eval.jsonl"),
        epoch=1,
        train_batch_size=2,
        eval_batch_size=1,
        model_name_or_path="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        model_dir="./outputs",
        save_merged_model=True
    ),
    code="./src_train",  # local path where the code is stored
    compute=azure_compute_cluster_name,
    command=str_command,
    environment=env_asset_train,
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,  # For multi-GPU training set this to an integer value greater than 1
    },
)
returned_job = ml_client.jobs.create_or_update(job)
ml_client.jobs.stream(returned_job.name)
```
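When the job finishes, you may want to sanity-check the merged weights locally before registering them. This is a hedged sketch and not part of the original walkthrough: the download directory is hypothetical (you could fetch the job's outputs with ml_client.jobs.download), and it assumes a machine with enough memory to load the 8B model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

local_model_dir = "./downloaded_outputs"  # hypothetical path to the downloaded job outputs

tokenizer = AutoTokenizer.from_pretrained(local_model_dir)
model = AutoModelForCausalLM.from_pretrained(local_model_dir, torch_dtype="auto", device_map="auto")

# Quick generation smoke test
inputs = tokenizer("A brief clinical reasoning test prompt:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```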
Once the training is completed, let's register the model as a custom model type:

```python
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/outputs/",
    name="deepseekr1-dist-llama8bft",
    description="Model created from run.",
    type=AssetTypes.CUSTOM_MODEL,
)
model = ml_client.models.create_or_update(run_model)
```

Once the model is registered, the next step is to deploy it as a managed online endpoint:

```python
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

endpoint_name = "deepseekr1-dist-llama8bft-ep"

# Check if the endpoint already exists in the workspace
try:
    endpoint = ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except:
    # Define an online endpoint if it doesn't exist
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}"
    )

# Trigger the endpoint creation
try:
    ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(f"Endpoint creation failed. Detailed Response:\n{err}") from err
```

Let's define the deployment name, the VM SKU, and the request-timeout parameter:

```python
# Initialize deployment parameters
deployment_name = "deepseekr1-dist-llama8bftd-eploy"
sku_name = "Standard_NC24ads_A100_v4"
REQUEST_TIMEOUT_MS = 90000

os.makedirs("environment_inf", exist_ok=True)
```

Create the environment for inference:

```dockerfile
%%writefile environment_inf/Dockerfile
FROM vllm/vllm-openai:latest
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS
```

Build the environment with the Dockerfile created above:

```python
from azure.ai.ml.entities import Environment, BuildContext

env_docker_image = Environment(
    build=BuildContext(path="environment_inf", dockerfile_path="Dockerfile"),
    name="vllm-custom",
    description="Environment created from a Docker context.",
    inference_config={
        "liveness_route": {"port": 8000, "path": "/health"},
        "readiness_route": {"port": 8000, "path": "/health"},
        "scoring_route": {"port": 8000, "path": "/"},
    },
)
env_asset_inf = ml_client.environments.create_or_update(env_docker_image)
```

Once the inference-server environment is ready, let's do the deployment. First, define some environment variables:

```python
model_path = f"/var/azureml-app/azureml-models/{model.name}/{model.version}/outputs"

env_vars = {
    "MODEL_NAME": model_path,
    "VLLM_ARGS": "--max-model-len 16000 --enforce-eager",
}
deployment_env_vars = {**env_vars}
```

Now run the deployment:

```python
import time
from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

t0 = time.time()
deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    instance_type=sku_name,
    instance_count=1,
    environment_variables=deployment_env_vars,
    environment=env_asset_inf,
    request_settings=OnlineRequestSettings(
        max_concurrent_requests_per_instance=2,
        request_timeout_ms=50000,
        max_queue_wait_ms=60000
    ),
    liveness_probe=ProbeSettings(
        failure_threshold=5,
        success_threshold=1,
        timeout=10,
        period=30,
        initial_delay=120
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        timeout=2,
        period=10,
        initial_delay=120,
    ),
)

# Trigger the deployment creation
try:
    ml_client.begin_create_or_update(deployment).wait()
except Exception as err:
    raise RuntimeError(f"Deployment creation failed. Detailed Response:\n{err}") from err

endpoint.traffic = {deployment_name: 100}
endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)
```

Our endpoint is now deployed; let's start testing it. First, retrieve the scoring URI and API key:

```python
endpoint_results = endpoint_poller.result()
endpoint_name = endpoint_results.name if endpoint_name is None else endpoint_name

keys = ml_client.online_endpoints.get_keys(name=endpoint_name)
primary_key = keys.primary_key
url = os.path.join(endpoint_results.scoring_uri, "v1")
```

Once we have the API key, we can use the OpenAI client to stream tokens:

```python
from openai import OpenAI

vllm_client = OpenAI(base_url=url, api_key=primary_key)

# Create your prompt
system_message = """You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response."""

user_message = """A 3-week-old child has been diagnosed with late onset perinatal meningitis, and the CSF culture shows gram-positive bacilli. What characteristic of this bacterium can specifically differentiate it from other bacterial agents?"""

response = vllm_client.chat.completions.create(
    model=model_path,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0.7,
    max_tokens=4000,
    stream=True,  # Stream the response
)

print("Streaming response:")
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "content"):
        print(delta.content, end="", flush=True)
```

Conclusion

Fine-tuning the DeepSeek-R1-Distill-Llama-8B model with PyTorch FSDP and QLoRA on Azure Machine Learning offers a powerful approach to customising LLMs for specific tasks. By leveraging the scalability and efficiency of these techniques, you can unlock the full potential of LLMs and drive innovation in your domain.

Hope you liked the blog. Do like the blog and follow me for more such content.

Thanks,
Manoranjan Rajguru, AI Global Black Belt

Scalable and Efficient Fine-Tuning of LLM on Azure ML
https://github.com/james-tn/llm-fine-tuning/tree/main/opensource_llm/single_step

Co-author: Mohamad AL jazaery

Why Scalable and Efficient Fine-Tuning Matters

- Faster iterations, shorter time-to-value: In today's competitive AI landscape, time is of the essence. The faster you can fine-tune a model, the quicker you can validate ideas, test hypotheses, and bring solutions to market.
- High-performance GPU machines are costly: High-performance GPUs and compute clusters don't come cheap, and their availability is often limited. Efficient fine-tuning techniques, such as model sharding and distributed training, maximize the utilization of these precious resources, ensuring that you get the most out of your infrastructure investment.

Choosing the Right Azure ML GPU Compute for the Job: NC or ND?

Not all GPU computes are created equal, and choosing the right SKU can make or break your training efficiency.

- ND series: Ideal for distributed training across multiple nodes, thanks to InfiniBand (IB) connectivity that ensures high-speed communication between nodes; for example, pretraining an LLM or fine-tuning a very large model (~70B parameters).
- NC series: Suited to small and medium workloads with no heavy inter-node communication, such as LLM inference or fine-tuning a mid-size LLM.

Azure GPU machine options by scenario:

| Scenario | Common model size | Training approach | Recommended Azure compute |
|---|---|---|---|
| Small-scale fine-tuning | < 3B parameters | Parameter-efficient tuning | NCas_T4_v3 (Tesla T4, 16 GB) |
| Medium-scale fine-tuning | 1–5B parameters | Full or parameter-efficient | NCs_v3 (Tesla V100, 16 GB) |
| Distributed training for medium models | 5–10B parameters | Full fine-tuning | ND_v2 (Tesla V100 NVLINK, 32 GB, InfiniBand) |
| Large-scale fine-tuning (single machine) | 10–30B parameters | Full or parameter-efficient | NC_A100_v4 (A100, 40 GB) |
| Distributed training for very large models | 20–70B parameters | Full fine-tuning | NDasrA100_v4 (A100, 80 GB, HDR InfiniBand) |
| Very large models training (single machine) | up to 70B parameters | Full or parameter-efficient | NCads_H100_v5 (H100 NVL, 94 GB) |
| Massive-scale distributed training | > 70B parameters | Full fine-tuning | ND-H100-v5 (H100, 80 GB, scale-out InfiniBand) |

Distributed Efficient Training: A Quick Guide

When scaling fine-tuning tasks, choosing the right distributed training method is key:

- DDP (data parallelism): Works well when the entire model fits on a single GPU. It replicates the model across multiple GPUs and splits the data for parallel processing. See experiment 1 in the following section.
- Model parallelism: A game-changer for massive models that don't fit on a single GPU. It shards not only the data but also the model parameters and optimizer states across multiple GPUs, enabling efficient training of models like Llama-70B on low-memory GPUs. Both FSDP and DeepSpeed excel at implementing advanced forms of model parallelism and memory optimization.

Memory Optimization Techniques

- Gradient checkpointing: Reduces memory by recomputing activations during the backward pass, trading memory for additional computation.
- Mixed precision training: Reduces memory usage by using FP16 or BF16 instead of FP32, accelerating training while maintaining numerical stability. Supported by both frameworks.
- Quantization (DeepSpeed exclusive): Uses INT8 precision for weights and activations, dramatically reducing memory and compute requirements.
- Offloading (DeepSpeed exclusive): Offloads optimizer states and model parameters to CPU or NVMe, freeing up GPU memory for computation.
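To make the first two techniques concrete, here is a minimal sketch (not from the original post) showing how they are typically enabled as flags on Hugging Face TrainingArguments; the values are purely illustrative:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # accumulate gradients to emulate a larger effective batch
    gradient_checkpointing=True,    # recompute activations in the backward pass to save memory
    bf16=True,                      # mixed precision (use fp16=True on GPUs without bfloat16 support)
)
```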
Our Experiments: Pushing the Limits of Scalability

Experiment 1: Distributed Training on Multiple Nodes Using DDP

We fine-tuned the Llama-3.1-8B model using LoRA (Low-Rank Adaptation) on Azure ML NDv2-V100 nodes. The goal was to evaluate the efficiency of fine-tuning across different numbers of nodes (1, 2, and 3) and observe the impact on training time and throughput.

Azure ML job YAML definition:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml/models/mistralai-Mistral-7B-v01/versions/19
command: >
  accelerate launch
  --num_processes 16
  --num_machines 2
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 2  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node
```

Here num_processes equals GPUs per machine multiplied by the number of machines.

Results: As we increased the number of nodes from one to three, throughput increased proportionally. This indicates that the system scaled efficiently with each added node, maintaining a close-to-linear improvement in throughput.

Experiment 2: Model Parallelism Using FSDP

Fine-tuning a 70B-parameter model on GPUs with only 16 GB of memory might sound impossible, but we made it happen using FSDP (Fully Sharded Data Parallel) on Azure ML with a cluster of multiple NDv2-V100 nodes. By distributing not only the data but also the model parameters and optimizer states across multiple nodes, we unlocked the power of full sharding.

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml-meta/models/Llama-3.3-70B-Instruct/versions/4
command: >
  accelerate launch
  --config_file "configs/fsdp_config.yaml"
  --num_processes 32
  --num_machines 4
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 4  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node
```

Key takeaways:

- Memory efficiency: Full sharding enabled us to fine-tune the Llama-70B model on V100 GPUs despite their limited memory.
- Connectivity matters: The InfiniBand connectivity of ND nodes played a critical role in ensuring smooth communication across GPUs, making this feat possible.

Conclusion

Scalable and efficient fine-tuning is the key to unlocking the true potential of Large Language Models. By leveraging distributed training techniques such as FSDP and DDP, and optimizing compute resources on Azure ML, researchers and practitioners can overcome the challenges of training massive models: reducing costs, accelerating time-to-value, and driving AI innovation. Access the code and start experimenting here!

Future work: The second part will focus on real-world pipeline setups, including end-to-end model training, hyperparameter optimization, and testing. The third part will dive into deploying trained models for practical use. Future posts may explore best practices for specific fine-tuning scenarios and techniques.

Fine Tune Mistral Models on Azure AI Foundry
Fine Tune Mistral Models on Azure AI Foundry

We're excited to announce that fine-tuning for Mistral models on Azure is now generally available! Starting today, Mistral Large 2411, Mistral Nemo, and Ministral 3B fine-tuning are available to all our Azure AI Foundry customers, providing unmatched customization and performance. This also makes Azure AI Foundry the second platform, after Mistral's own, where fine-tuning of Mistral models is currently available.

Azure AI Foundry lets you tailor large language models to your own datasets through a process known as fine-tuning. Fine-tuning provides significant value by enabling customization and optimization for specific tasks and applications, leading to improved performance, cost efficiency, reduced latency, and tailored outputs.

Fine-Tuning-Enabled Mistral Models

Mistral Large 2411

Mistral Large 24.11 is an advanced large language model (LLM) with state-of-the-art reasoning, knowledge, and coding capabilities, designed to support multiple languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish. It is highly proficient in coding, trained on more than 80 programming languages such as Python, Java, C, C++, JavaScript, and Bash, as well as specialized languages like Swift and Fortran. Mistral Large emphasizes agent-centric capabilities, providing top-tier agent functionality with native function calling and JSON output, and is equipped with advanced reasoning skills, featuring state-of-the-art mathematical and logical capabilities.

Mistral Nemo 2407

Mistral Nemo is an advanced language model that excels in reasoning, world knowledge, and coding within its size category. Developed in collaboration with Nvidia, this powerful 12B model pushes the boundaries of language understanding and generation. Mistral Nemo features multilingual proficiency with a new tokenizer, Tekken, designed for multilingual applications. It supports over 100 languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, Polish, and many more. Tekken is more efficient than the Llama 3 tokenizer, compressing text more effectively for approximately 85% of all languages, with significant improvements in Malayalam, Hindi, Arabic, and prevalent European languages. Mistral Nemo also offers top-tier agentic capabilities, including native function calling and JSON output, and demonstrates state-of-the-art mathematical and reasoning capabilities within its size category.

Ministral 3B

Ministral 3B is a cutting-edge small language model (SLM) designed for edge computing and on-device applications. Its low-latency, compute-efficient inference makes it ideal for standard GenAI applications that require real-time processing and handle high volumes. With 3.6 billion parameters, Ministral 3B sets a new benchmark in knowledge, commonsense reasoning, function calling, and efficiency within the sub-10B category. It can be used or fine-tuned for a variety of purposes, from orchestrating agentic workflows to creating specialized task workers.

Serverless Fine-Tuning of Mistral Models

Fine-tuning is a powerful technique for customizing and optimizing the performance of large language models (LLMs) for specific use cases. By further training a pre-trained LLM on a labeled dataset related to a particular task, fine-tuning can improve the model's performance. This can be done with a large model for complex or dissimilar tasks, or with a smaller model to match the performance of a larger model, potentially yielding latency and cost benefits. The size of the performance increase varies with the use case.
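Before walking through the wizard, it helps to see what the training data looks like. As a hedged illustration, chat models are commonly fine-tuned on JSON Lines files where each line is one complete conversation in the chat-completions format; confirm the exact schema for your chosen model in the data preparation guidance linked from the wizard. The contents below are invented for illustration:

```json
{"messages": [{"role": "system", "content": "You are a concise support assistant."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Open Settings > Security, choose Reset password, and follow the prompts."}]}
```

Each line of the file is one training example; the assistant turn is the target the model learns to produce.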
To fine-tune a Mistral model:

1. Sign in to Azure AI Foundry.
2. Choose the model you want to fine-tune from the Azure AI Foundry portal model catalog.
3. On the model's Details page, select Fine-tune.
4. Select the project in which you want to fine-tune your model. To use the pay-as-you-go fine-tuning offering, your workspace must belong to the East US 2 region.
5. In the fine-tune wizard, select the link to Azure Marketplace Terms to learn more about the terms of use. You can also select the Marketplace offer details tab to learn about pricing for the selected model.
6. If this is your first time fine-tuning the model in the project, you have to subscribe your project to the particular offering (for example, Ministral 3B) from Azure Marketplace. This step requires that your account has the Azure subscription permissions and resource group permissions listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering, which lets you control and monitor spending. Select Subscribe and fine-tune.

   Note: Subscribing a project to a particular Azure Marketplace offering (in this case, Ministral 3B) requires that your account has Contributor or Owner access at the subscription level where the project is created. Alternatively, your user account can be assigned a custom role that has the Azure subscription permissions and resource group permissions listed in the prerequisites.

   Once you sign up the project for a particular Azure Marketplace offering, subsequent fine-tuning jobs for the same offering in the same project don't require subscribing again, so you don't need subscription-level permissions for those jobs. If this scenario applies to you, select Continue to fine-tune.
7. Enter a name for your fine-tuned model, plus optional tags and a description.
8. Select the training data to fine-tune your model. See data preparation for more information.

   Note: If your training/validation files live in a credential-less datastore, you need to grant the workspace managed identity access to that datastore before proceeding with MaaS fine-tuning. On the Datastore page, select Update authentication and choose the option that grants the workspace managed identity access to the datastore.

   Make sure all your training examples follow the expected format for inference. To fine-tune models effectively, maintain a balanced and diverse dataset: keep the data balanced, cover varied scenarios, and periodically refine the training data to align with real-world expectations. This ultimately leads to more accurate and balanced model responses.

Next, set the training hyperparameters:

- Batch size: The batch size to use for training. When set to -1, the batch size is computed as 0.2% of the number of examples in the training set, with a maximum of 256 (see the sketch below).
- Learning rate multiplier: The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this value. We recommend experimenting with values between 0.5 and 2; empirically, larger learning rates often perform better with larger batch sizes. Must be between 0.0 and 5.0.
- Number of epochs: The number of training epochs. An epoch is one full cycle through the dataset.

Task parameters are an optional, advanced step. Tuning hyperparameters is essential for optimizing large language models (LLMs) in real-world applications: it allows for improved performance and efficient resource usage. You can keep the default settings, or, as an advanced user, customize parameters such as the number of epochs or the learning rate.
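To make the batch-size default concrete, here is a small Python sketch of the rule as described above; the function name and the rounding choice are our illustration, not the service's exact implementation:

```python
# Hedged sketch of the documented default: with batch size set to -1, the
# service uses 0.2% of the training-set size, capped at 256.
def default_batch_size(num_training_examples: int) -> int:
    return min(256, max(1, round(0.002 * num_training_examples)))

print(default_batch_size(10_000))   # -> 20
print(default_batch_size(500_000))  # -> 256 (cap applied)
```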
Review your selections and proceed to train your model. Once your model is fine-tuned, you can deploy it and use it in your own application, in the playground, or in prompt flow.

Get started today!

Whether you're a newcomer to fine-tuning or an experienced developer, getting started with Azure AI Foundry is now more accessible than ever. Fine-tuning is available through both Azure AI Foundry and Azure ML Studio, which offer a user-friendly graphical interface for those who prefer a GUI, along with SDKs and a CLI for advanced users.

Learn more!

- Try it out with Azure AI Foundry
- Explore documentation for the model catalog in Azure AI Foundry
- Begin using the fine-tuning SDK in the notebook
- Learn more about Azure AI Content Safety - Azure AI Content Safety – AI Content Moderation | Microsoft Azure