Best Practices for Mitigating Hallucinations in Large Language Models (LLMs)

ellienosrat
Microsoft
Apr 10, 2025

Real-world AI Solutions: Lessons from the Field

Overview 

This document provides practical guidance for minimizing hallucinations—instances where models produce inaccurate or fabricated content—when building applications with Azure AI services. It targets developers, architects, and MLOps teams working with LLMs in enterprise settings.

 

Key Outcomes

✅ Reduce hallucinations through retrieval-augmented strategies and prompt engineering
✅ Improve model output reliability, grounding, and explainability
✅ Enable robust enterprise deployment through layered safety, monitoring, and security

 

Understanding Hallucinations

Hallucinations come in different forms. Here are some realistic examples for each category to help clarify them:

| Type | Description | Example |
| --- | --- | --- |
| Factual | Outputs are incorrect or made up | "Albert Einstein won the Nobel Prize in Physics in 1950." (It was 1921.) |
| Temporal | Stale or outdated knowledge presented as current | "The latest iPhone model is the iPhone 12." (When the iPhone 15 is current.) |
| Contextual | Adds concepts that weren’t mentioned or implied | Summarizing a document and adding "AI is dangerous" when the document never said it |
| Linguistic | Grammatically correct but incoherent sentences | "The quantum sandwich negates bicycle logic through elegant syntax." |
| Extrinsic | Claims unsupported by source documents | Citing nonexistent facts in a RAG-backed chatbot |
| Intrinsic | Contradictory or self-conflicting answers | Saying both "Azure OpenAI supports fine-tuning" and "Azure OpenAI does not." |

 

Mitigation Strategies

1- Retrieval-Augmented Generation (RAG)

Ground model outputs in enterprise knowledge sources such as PDFs, SharePoint docs, or images.

Key Practices:

  • Data Preparation and Organization
    • Clean and curate your data.
    • Organize data into topics to improve search accuracy and prevent noise.
    • Regularly audit and update grounding data to avoid outdated or biased content.
  • Search and Retrieval Techniques
    • Explore different methods (keyword, vector, hybrid, semantic search) to find the best fit for your use case (see the retrieval sketch after this list).
    • Use metadata filtering (e.g., tagging by recency or source reliability) to prioritize high-quality information.
    • Apply data chunking to improve retrieval efficiency and clarity.
  • Query Engineering and Post-Processing
    • Use prompt engineering to specify which data source or section to pull from.
    • Apply query transformation methods (e.g., sub-queries) for complex queries.
    • Employ reranking methods to boost output quality.
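
To make these retrieval practices concrete, here is a minimal hybrid-search sketch using the azure-search-documents Python SDK. The index name, field names, filter expression, and the embed() helper are illustrative assumptions, not values from this post; adapt them to your own index schema.

```python
# Minimal sketch: hybrid (keyword + vector) retrieval with a metadata filter.
# Index name, field names, and the embed() helper are illustrative placeholders.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="enterprise-docs",          # hypothetical index
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

query = "What is our parental leave policy?"

results = search_client.search(
    search_text=query,                     # keyword side of the hybrid query
    vector_queries=[
        VectorizedQuery(
            vector=embed(query),           # your embedding function (assumed)
            k_nearest_neighbors=5,
            fields="content_vector",       # hypothetical vector field
        )
    ],
    # Metadata filter: prefer recent, reliable sources (hypothetical fields).
    filter="reliability eq 'high' and last_updated ge 2024-01-01T00:00:00Z",
    top=5,
)

for doc in results:
    print(doc["title"], doc["@search.score"])
```

Passing both search_text and vector_queries yields hybrid ranking, and the filter runs server-side, so low-quality chunks never reach the prompt.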

 

2- Prompt Engineering

High-quality prompts guide LLMs to produce factual and relevant responses. Use the ICE method:

  • Instructions: Start with direct, specific asks.
  • Constraints: Add boundaries like "only from retrieved docs".
  • Escalation: Include fallback behaviors (e.g., “Say ‘I don’t know’ if unsure”).

Example Prompt Improvement:

❌: Summarize this document. 

✅: Using only the retrieved documentation, summarize this paper in 3–5 bullet points. If any information is missing, reply with 'Insufficient data.'
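
Wired into code, the improved prompt might look like the following minimal sketch against an Azure OpenAI chat deployment (the environment variables, API version, deployment name, and the retrieved_docs placeholder are assumptions):

```python
# Minimal sketch: ICE-style prompt (Instructions, Constraints, Escalation)
# sent to an Azure OpenAI chat deployment. Endpoint, API version, deployment
# name, and retrieved_docs are illustrative placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

retrieved_docs = "..."  # text returned by your retrieval step

messages = [
    {
        "role": "system",
        "content": (
            "You are a documentation assistant. "                # Instructions
            "Answer using only the retrieved documents below. "  # Constraints
            "If the documents do not contain the answer, "       # Escalation
            "reply with 'Insufficient data.'\n\n"
            f"Retrieved documents:\n{retrieved_docs}"
        ),
    },
    {
        "role": "user",
        "content": "Using only the retrieved documentation, summarize this paper in 3-5 bullet points.",
    },
]

response = client.chat.completions.create(
    model="gpt-4o",   # your Azure OpenAI deployment name
    messages=messages,
    temperature=0.2,  # low temperature for focused output
)
print(response.choices[0].message.content)
```

Keeping the Constraints and Escalation rules in the system message means they apply to every turn of the conversation, not just the first request.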

Prompt Patterns That Work:

  • Clarity and Specificity
    • Write clear, unambiguous instructions to minimize misinterpretation.
    • Use detailed prompts, e.g., "Provide only factual, verified information. If unsure, respond with 'I don't know.'"
  • Structure
    • Break down complex tasks into smaller logical subtasks for accuracy. Example: Research Paper Analysis

❌ Bad Prompt (Too Broad, Prone to Hallucination): "Summarize this research paper and explain its implications."

✅ Better Prompt (Broken into Subtasks):

        • Extract Core Information: "Summarize the key findings of the research paper in 3-5 bullet points."
        • Assess Reliability: "Identify the sources of data used and assess their credibility."
        • Determine Implications: "Based on the findings, explain potential real-world applications."
        • Limit Speculation: "If any conclusions are uncertain, indicate that explicitly rather than making assumptions."
  • Repetition
    • Repeating key instructions in a prompt can help reduce hallucinations, and the way you structure the repetition matters. Two placements work best:
    • Beginning (highly recommended): The start of the prompt has the most impact on how the LLM interprets the task. Place essential guidelines here, such as: "Provide only factual, verified information."
    • End (for final confirmation or safety checks): Use the end to reinforce key rules. Instead of repeating the initial instruction verbatim, word it differently and keep it concise. For example: "If unsure, clearly state 'I don't know.'"
  • Temperature Control
    • Use low temperature settings (0.1–0.4) for more deterministic, focused responses.
  • Chain-of-Thought
    • Incorporate "Chain-of-Thought" instructions to encourage logical, stepwise responses (see the sketch after this list). For example, to solve a math problem: "Solve this problem step-by-step. First, break it into smaller parts. Explain each step before moving to the next."
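
Here is a minimal sketch of that chain-of-thought pattern at a low temperature, reusing the client from the earlier snippet; the sample problem and deployment name are illustrative.

```python
# Chain-of-thought prompt at low temperature. `client` is the AzureOpenAI
# client constructed in the earlier sketch; the problem text and deployment
# name are placeholders.
cot_messages = [
    {
        "role": "system",
        "content": "Provide only factual, verified information. "
                   "If unsure, clearly state 'I don't know.'",
    },
    {
        "role": "user",
        "content": (
            "Solve this problem step-by-step. First, break it into smaller "
            "parts. Explain each step before moving to the next.\n\n"
            "Problem: A train travels 120 km in 1.5 hours. What is its "
            "average speed in km/h?"
        ),
    },
]

response = client.chat.completions.create(
    model="gpt-4o",       # your deployment name
    messages=cot_messages,
    temperature=0.2,      # within the 0.1-0.4 range recommended above
)
print(response.choices[0].message.content)
```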

 Tip: Use Azure AI Prompt Flow’s playground to test prompt variations with parameter sweeps.

 

3- System-Level Defenses

Mitigation isn't just prompt-side—it requires end-to-end design.

Key Recommendations:

  • Content Filtering: Use Azure AI Content Safety to detect content in the sexual, hate, violence, and self-harm categories (see the sketch after this list).
  • Metaprompts: Define system boundaries ("You can only answer from retrieved documents").
  • RBAC & Networking: Use Azure Private Link, VNETs, and Microsoft Entra ID for secure access.
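
As a sketch of the content-filtering recommendation, the azure-ai-contentsafety SDK can screen a model response before it is returned to the user; the severity threshold and environment variable names here are illustrative assumptions.

```python
# Minimal sketch: screen LLM output with Azure AI Content Safety before
# returning it. Threshold and environment variable names are placeholders.
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

safety_client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

def is_safe(text: str, max_severity: int = 2) -> bool:
    """Return False if any harm category (hate, sexual, violence,
    self-harm) exceeds the severity threshold."""
    result = safety_client.analyze_text(AnalyzeTextOptions(text=text))
    return all((c.severity or 0) <= max_severity
               for c in result.categories_analysis)

model_output = "..."  # text produced by the LLM
if not is_safe(model_output):
    model_output = "This response was withheld by the content filter."
```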

 

4- Evaluation & Feedback Loops

Continuously evaluate outputs using both automated and human-in-the-loop feedback.

Real-World Setup:

  • Labeling Teams: Review hallucination-prone cases through human-in-the-loop integrations.
  • Automated Test Generation
    • Use LLMs to generate diverse test cases covering multiple inputs and difficulty levels.
    • Simulate real-world queries to evaluate model accuracy.
  • Evaluations Using Multiple LLMs
    • Cross-evaluate outputs from multiple LLMs (a judge sketch follows this list).
    • Use ranking and comparison to refine model performance.
    • Be cautious—automated evaluations may miss subtle errors requiring human oversight.
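
A minimal LLM-as-judge sketch for the cross-evaluation idea above, reusing the AzureOpenAI client from the prompt-engineering snippet. The rubric, 1–5 scale, and JSON shape are illustrative choices, and as noted, automated grades should still be spot-checked by humans.

```python
# Minimal LLM-as-judge sketch: grade an answer's groundedness against the
# retrieved context. The rubric, 1-5 scale, and JSON shape are illustrative.
# `client` is the AzureOpenAI client constructed in the earlier sketch.
import json

def judge_groundedness(question: str, context: str, answer: str) -> dict:
    grading_prompt = (
        "You are an impartial evaluator. Given a question, retrieved "
        "context, and an answer, rate how well the answer is supported by "
        "the context on a 1-5 scale (5 = fully supported). Reply as JSON: "
        '{"score": <int>, "unsupported_claims": [<strings>]}\n\n'
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # judge deployment; ideally a different model family
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,   # deterministic grading
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Using a different model (or model family) as the judge reduces the chance that the generator and the grader share the same blind spots.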

Tip: Common Evaluation Metrics

| Metric | What It Measures | How to Use It |
| --- | --- | --- |
| Relevance Score | How closely the model's response aligns with the user query and intent (0–1 scale). | Use automated LLM-based grading or semantic similarity to flag off-topic or loosely related answers. |
| Groundedness Score | Whether the output is supported by retrieved documents or source context. | Use manual review or Azure AI Evaluation tools (like RAG evaluation) to identify unsupported claims. |
| User Trust Score | Real-time feedback from users, typically collected via thumbs up/down or star ratings. | Track trends to identify low-confidence flows and prioritize them for prompt tuning or data curation. |

Tip: Use evaluation scores in combination. For example, high relevance but low groundedness often signals hallucination risks—especially in chat apps with fallback answers.

Tip: Flag any outputs where "source_confidence" < threshold and route them to a human review queue.
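
A minimal sketch of that routing rule follows; the "source_confidence" field comes from the tip above, while the threshold value and the review-queue hook are hypothetical stand-ins for your own pipeline.

```python
# Minimal sketch: gate low-confidence outputs into a human review queue.
# The "source_confidence" field comes from the tip above; the threshold and
# send_to_review_queue() are hypothetical stand-ins for your own pipeline.
CONFIDENCE_THRESHOLD = 0.7  # tune against your evaluation data

def send_to_review_queue(output: dict) -> None:
    """Stub: replace with your labeling/ticketing integration."""
    print(f"Flagged for review: {output.get('id', '<no id>')}")

def route_output(output: dict) -> str:
    """Return the answer, or divert it when source confidence is too low."""
    if output.get("source_confidence", 0.0) < CONFIDENCE_THRESHOLD:
        send_to_review_queue(output)
        return "This answer is pending human review."
    return output["answer"]
```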

Tip: Include “accuracy audits” as part of your CI/CD pipeline using Prompt Flow or other evaluation tools to test components (a sketch follows).
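
For the CI/CD tip, an accuracy audit can be as simple as a pytest gate over a small golden set; the golden-set file, the ask() chat-flow helper, and the judge_groundedness() function from the earlier sketch are all assumptions here.

```python
# Minimal pytest sketch: run a small golden set through the chat flow on each
# build and fail the pipeline when groundedness drops. The golden-set file and
# the ask() / judge_groundedness() helpers are illustrative assumptions.
import json
import pytest

with open("tests/golden_set.json") as f:
    GOLDEN_CASES = json.load(f)  # e.g., [{"question": ..., "context": ...}, ...]

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_answers_stay_grounded(case):
    answer = ask(case["question"])  # your chat flow entry point (assumed)
    verdict = judge_groundedness(case["question"], case["context"], answer)
    assert verdict["score"] >= 4, f"Low groundedness: {verdict}"
```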

 

Summary & Deployment Checklist

| Task | Tools/Methods |
| --- | --- |
| Curate and chunk enterprise data | Azure AI Search, data chunkers |
| Use clear, scoped, role-based prompts | Prompt engineering, prompt templates |
| Ground all outputs using RAG | Azure AI Search + Azure OpenAI |
| Automate evaluation flows | Prompt Flow + custom evaluators |
| Add safety filters and monitoring | Azure AI Content Safety, Azure Monitor, Application Insights |
| Secure deployments with RBAC/VNET | Azure Key Vault, Entra ID, Private Link |

 

Additional AI Best Practices blog posts:

Best Practices for Requesting Quota Increase for Azure OpenAI Models

Best Practices for Leveraging Azure OpenAI in Constrained Optimization Scenarios

Best Practices for Structured Extraction from Documents Using Azure OpenAI

Best Practices for Using Generative AI in Automated Response Generation for Complex Decision Making

Best Practices for Leveraging Azure OpenAI in Code Conversion Scenarios

Kickstarting AI Agent Development with Synthetic Data: A GenAI Approach on Azure

 
