Real-world AI Solutions: Lessons from the Field
Overview
This document provides practical guidance for minimizing hallucinations—instances where models produce inaccurate or fabricated content—when building applications with Azure AI services. It targets developers, architects, and MLOps teams working with LLMs in enterprise settings.
Key Outcomes
✅ Reduce hallucinations through retrieval-augmented strategies and prompt engineering
✅ Improve model output reliability, grounding, and explainability
✅ Enable robust enterprise deployment through layered safety, monitoring, and security
Understanding Hallucinations
Hallucinations come in different forms. Here are some realistic examples for each category to help clarify them:
| Type | Description | Example |
| --- | --- | --- |
| Factual | Outputs are incorrect or made up | "Albert Einstein won the Nobel Prize in Physics in 1950." (It was 1921) |
| Temporal | Stale or outdated knowledge shown as current | "The latest iPhone model is the iPhone 12." (When iPhone 15 is current) |
| Contextual | Adds concepts that weren't mentioned or implied | Summarizing a doc and adding "AI is dangerous" when the doc never said it |
| Linguistic | Grammatically correct but incoherent sentences | "The quantum sandwich negates bicycle logic through elegant syntax." |
| Extrinsic | Unsupported by source documents | Citing nonexistent facts in a RAG-backed chatbot |
| Intrinsic | Contradictory or self-conflicting answers | Saying both "Azure OpenAI supports fine-tuning" and "Azure OpenAI does not." |
Mitigation Strategies
1- Retrieval-Augmented Generation (RAG)
Ground model outputs in enterprise knowledge sources such as PDFs, SharePoint documents, or images. A minimal retrieval-plus-generation sketch follows the key practices below.
Key Practices:
- Data Preparation and Organization
- Clean and curate your data.
- Organize data into topics to improve search accuracy and prevent noise.
- Regularly audit and update grounding data to avoid outdated or biased content.
- Search and Retrieval Techniques
- Explore different methods (keyword, vector, hybrid, semantic search) to find the best fit for your use case.
- Use metadata filtering (e.g., tagging by recency or source reliability) to prioritize high-quality information.
- Apply data chunking to improve retrieval efficiency and clarity.
- Query Engineering and Post-Processing
- Use prompt engineering to specify which data source or section to pull from.
- Apply query transformation methods (e.g., sub-queries) for complex queries.
- Employ reranking methods to boost output quality.
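To make the retrieval-and-grounding flow concrete, here is a minimal sketch assuming an Azure AI Search index of pre-chunked documents and an Azure OpenAI chat deployment. All endpoint, key, index, field, and deployment names are placeholders, and the `content` field is an assumption about your index schema, not something defined in this article.

```python
# Minimal RAG sketch: retrieve top chunks from Azure AI Search, then answer
# only from that context with Azure OpenAI. Placeholder names throughout.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="enterprise-docs",            # placeholder index name
    credential=AzureKeyCredential("<search-key>"),
)
aoai = AzureOpenAI(
    azure_endpoint="https://<your-aoai>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)

def grounded_answer(question: str) -> str:
    # Retrieve the top chunks; add metadata filters (recency, source) as needed.
    hits = search_client.search(search_text=question, top=5)
    context = "\n\n".join(doc["content"] for doc in hits)  # assumes a "content" field

    # Constrain the model to the retrieved context and give it an explicit fallback.
    messages = [
        {"role": "system", "content": (
            "Answer only from the provided context. "
            "If the context is insufficient, reply 'Insufficient data.'")},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    response = aoai.chat.completions.create(
        model="<chat-deployment>", messages=messages, temperature=0.2)
    return response.choices[0].message.content
```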
2- Prompt Engineering
High-quality prompts guide LLMs to produce factual and relevant responses. Use the ICE method:
- Instructions: Start with direct, specific asks.
- Constraints: Add boundaries like "only from retrieved docs".
- Escalation: Include fallback behaviors (e.g., “Say ‘I don’t know’ if unsure”).
Example Prompt Improvement:
❌: Summarize this document.
✅: Using only the retrieved documentation, summarize this paper in 3–5 bullet points. If any information is missing, reply with 'Insufficient data.'
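The ICE pieces can also live in a reusable system prompt rather than being retyped per request. The sketch below is one way to map Instructions, Constraints, and Escalation onto chat messages; the wording and function names are illustrative, not a prescribed template.

```python
# Sketch of the ICE pattern as a reusable prompt template (illustrative wording).
ICE_SYSTEM_PROMPT = (
    # Instructions: direct, specific ask
    "You summarize technical documents for enterprise users.\n"
    # Constraints: answer only from supplied material
    "Use only the retrieved documentation provided in the user message.\n"
    # Escalation: explicit fallback behavior
    "If the documentation does not contain the answer, reply 'Insufficient data.'"
)

def build_messages(retrieved_docs: str, task: str) -> list[dict]:
    """Assemble chat messages that keep the model inside the retrieved material."""
    return [
        {"role": "system", "content": ICE_SYSTEM_PROMPT},
        {"role": "user", "content": f"Documentation:\n{retrieved_docs}\n\nTask: {task}"},
    ]
```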
Prompt Patterns That Work:
- Clarity and Specificity
- Write clear, unambiguous instructions to minimize misinterpretation.
- Use detailed prompts, e.g., "Provide only factual, verified information. If unsure, respond with 'I don't know.'"
- Structure
- Break down complex tasks into smaller, logical subtasks for accuracy. Example: Research Paper Analysis (a code sketch of this decomposition follows this list)
❌ Bad Prompt (Too Broad, Prone to Hallucination): "Summarize this research paper and explain its implications."
✅ Better Prompt (Broken into Subtasks):
- Extract Core Information: "Summarize the key findings of the research paper in 3-5 bullet points."
- Assess Reliability: "Identify the sources of data used and assess their credibility."
- Determine Implications: "Based on the findings, explain potential real-world applications."
- Limit Speculation: "If any conclusions are uncertain, indicate that explicitly rather than making assumptions."
- Repetition
- Repeating key instructions in a prompt can help reduce hallucinations. The way you structure the repetition matters. Here are some best practices:
- Beginning (Highly Recommended)
- The start of the prompt has the most impact on how the LLM interprets the task. Place essential guidelines here, such as: "Provide only factual, verified information."
- End (For Final Confirmation or Safety Checks)
- Use the end to reinforce key rules. Instead of repeating the initial instruction verbatim, word it differently to reinforce it, and keep it concise. For example: "If unsure, clearly state 'I don't know.'"
- Temperature Control
- Use low temperature settings (0.1–0.4) for more focused, consistent responses.
- Chain-of-Thought
- Incorporate "Chain-of-Thought" instructions to encourage logical, stepwise responses. For example, to solve a math problem: "Solve this problem step-by-step. First, break it into smaller parts. Explain each step before moving to the next."
Tip: Use Azure AI Prompt Flow’s playground to test prompt variations with parameter sweeps.
3- System-Level Defenses
Mitigation isn't just prompt-side—it requires end-to-end design.
Key Recommendations:
- Content Filtering: Use Azure AI Content Safety to detect sexual, hate, violence, or self-harm content (see the sketch after this list).
- Metaprompts: Define system boundaries ("You can only answer from documents retrieved").
- RBAC & Networking: Use Azure Private Link, VNETs, and Microsoft Entra ID for secure access.
4- Evaluation & Feedback Loops
Continuously evaluate outputs using both automated and human-in-the-loop feedback.
Real-World Setup:
- Labeling Teams: Review hallucination-prone cases with human-in-the-loop integrations.
- Automated Test Generation
- Use LLMs to generate diverse test cases covering multiple inputs and difficulty levels.
- Simulate real-world queries to evaluate model accuracy.
- Evaluations Using Multiple LLMs
- Cross-evaluate outputs from multiple LLMs (see the grading sketch after this list).
- Use ranking and comparison to refine model performance.
- Be cautious—automated evaluations may miss subtle errors requiring human oversight.
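A simple form of cross-evaluation is to have a second model deployment grade the first model's answer against the retrieved context. In the sketch below the grader deployment name, the 1–5 rubric, and the bare `int()` parsing are all illustrative simplifications.

```python
# Sketch of LLM-as-judge cross-evaluation: a second model grades groundedness.
GRADER_PROMPT = (
    "You are grading another model's answer. Given the question, the retrieved "
    "context, and the answer, rate groundedness from 1 (unsupported) to 5 "
    "(fully supported by the context). Reply with the number only."
)

def grade_answer(aoai, question: str, context: str, answer: str,
                 grader_deployment: str = "<grader-deployment>") -> int:
    response = aoai.chat.completions.create(
        model=grader_deployment,
        temperature=0,  # grading should be as repeatable as possible
        messages=[
            {"role": "system", "content": GRADER_PROMPT},
            {"role": "user", "content":
                f"Question: {question}\n\nContext:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    # Simplistic parsing for illustration; production code should validate the reply.
    return int(response.choices[0].message.content.strip())
```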
Tip: Common Evaluation Metrics
| Metric | What It Measures | How to Use It |
| --- | --- | --- |
| Relevance Score | How closely the model's response aligns with the user query and intent (0–1 scale). | Use automated LLM-based grading or semantic similarity to flag off-topic or loosely related answers. |
| Groundedness Score | Whether the output is supported by retrieved documents or source context. | Use manual review or Azure AI Evaluation tools (like RAG evaluation) to identify unsupported claims. |
| User Trust Score | Real-time feedback from users, typically collected via thumbs up/down or star ratings. | Track trends to identify low-confidence flows and prioritize them for prompt tuning or data curation. |
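For the semantic-similarity approach to relevance, one lightweight option is cosine similarity between query and answer embeddings. The sketch below assumes an Azure OpenAI embedding deployment; the deployment name and any cutoff you apply to the score are illustrative choices.

```python
# Sketch of a relevance score via embedding cosine similarity.
import math

def relevance_score(aoai, query: str, answer: str,
                    embedding_deployment: str = "<embedding-deployment>") -> float:
    """Return cosine similarity between the query and the answer embeddings."""
    emb = aoai.embeddings.create(model=embedding_deployment, input=[query, answer])
    q, a = emb.data[0].embedding, emb.data[1].embedding
    dot = sum(x * y for x, y in zip(q, a))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in a))
    return dot / norm  # higher means closer alignment; flag low scores for review
```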
Tip: Use evaluation scores in combination. For example, high relevance but low groundedness often signals hallucination risks—especially in chat apps with fallback answers.
Tip: Flag any outputs where "source_confidence" < threshold and route them to a human review queue (see the sketch below).
Tip: Include “accuracy audits” as part of your CI/CD pipeline using Prompt Flow or other evaluation tools to test components.
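The routing tip above amounts to a simple threshold check in application code. In this sketch the `source_confidence` field, the 0.7 threshold, and the in-memory queue are all hypothetical placeholders for whatever confidence signal and review tooling your pipeline actually produces.

```python
# Sketch of threshold-based routing of low-confidence outputs to human review.
REVIEW_THRESHOLD = 0.7          # illustrative cutoff, tune per application
review_queue: list[dict] = []   # stand-in for a real review queue or ticket system

def route_output(output: dict) -> dict:
    """Hold low-confidence responses for human review instead of returning them."""
    if output.get("source_confidence", 0.0) < REVIEW_THRESHOLD:
        review_queue.append(output)
        return {**output, "status": "pending_human_review"}
    return {**output, "status": "released"}
```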
Summary & Deployment Checklist
| Task | Tools/Methods |
| --- | --- |
| Curate and chunk enterprise data | Azure AI Search, data chunkers |
| Use clear, scoped, role-based prompts | Prompt engineering, prompt templates |
| Ground all outputs using RAG | Azure AI Search + Azure OpenAI |
| Automate evaluation flows | Prompt Flow + custom evaluators |
| Add safety filters and monitoring | Azure AI Content Safety, Azure Monitor, Application Insights |
| Secure deployments with RBAC/VNET | Azure Key Vault, Entra ID, Private Link |
Additional AI Best Practices blog posts:
Best Practices for Requesting Quota Increase for Azure OpenAI Models
Best Practices for Leveraging Azure OpenAI in Constrained Optimization Scenarios
Best Practices for Structured Extraction from Documents Using Azure OpenAI
Best Practices for Using Generative AI in Automated Response Generation for Complex Decision Making
Best Practices for Leveraging Azure OpenAI in Code Conversion Scenarios