A practical guide to implementing harmful content detection, PII protection, and prompt injection prevention using Azure AI services
I was reading an article over the weekend about WotNot, an AI chatbot startup that left 346,000 files completely exposed in an unsecured cloud storage bucket—passports, medical records, sensitive customer data, all accessible to anyone on the internet without even a password. The researchers who discovered it tried for over two months to get the company to fix it. Two months. That genuinely shook me.
So I did what any concerned developer would do - I went down a rabbit hole of security breaches and best practices.
In August, Lenovo's customer support chatbot was exploited through a simple 400-character prompt injection that allowed attackers to steal session cookies and potentially access customer support systems. Cyber Press published a detailed analysis of how attackers used indirect prompt injection through hidden instructions in customer reviews to expose system prompts and execute remote code, ultimately accessing databases containing Social Security numbers and personal data. Then there's the alarming statistic that 77% of employees are pasting confidential records—client lists, financial data, source code—directly into AI chat prompts, and 82% of this happens through unmanaged accounts that bypass all corporate monitoring.
And these are just the incidents we know about. The pattern is clear: we're rushing to deploy AI without thinking through the security implications. Some of these breaches stem from terrible architectural decisions like unsecured storage buckets or weak authentication tokens. But many others happen because the LLMs themselves have no guardrails. They're too trusting, too eager to please, and can't distinguish between legitimate instructions and malicious commands.
In this article, I want to focus on security within the AI system itself - specifically the large language model and how we protect it from malicious inputs and unintended behaviors. I'll save the architectural security concerns like vector database encryption and chat history hashing for the next part.
After working through this problem, I've identified three major security and privacy challenges we need to address: harmful content detection, personally identifiable information protection, and prompt injection prevention. The good news is that Azure provides services that make tackling these issues much more straightforward than building everything from scratch.
Let me walk you through each one and show you how I implemented them.
Harmful Content Detection
This is about catching inputs that contain hate speech, violence, self-harm content, or sexual material before they ever reach your LLM. You don't want your chatbot processing or responding to truly harmful requests, both for ethical reasons and to protect your users and your organization.
Azure Content Safety is the service designed for exactly this. It analyzes text and assigns severity scores across different categories of harmful content (hate, violence, self-harm, sexual). The API is remarkably simple to use - you send it text and it returns a structured analysis with a severity level for each category (0, 2, 4, or 6 by default, with a finer 0-7 scale available).
Here's how I implemented the safety check:
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

# endpoint and key come from your Content Safety resource (see below)
client = ContentSafetyClient(endpoint, AzureKeyCredential(key))

def check_prompt_safety(user_input: str) -> tuple[bool, str]:
    """
    Check user input for harmful content.

    Returns:
        (is_safe, message)
    """
    request = AnalyzeTextOptions(text=user_input)
    try:
        response = client.analyze_text(request)
        max_severity = 0
        flagged_categories = []
        if hasattr(response, 'categories_analysis'):
            for category in response.categories_analysis:
                if category.severity > max_severity:
                    max_severity = category.severity
                # Anything above severity 2 counts as a flag for that category
                if category.severity > 2:
                    flagged_categories.append(category.category)
        if max_severity > 2:
            return False, f"Content flagged: {', '.join(flagged_categories)} (severity: {max_severity})"
        return True, "Safe"
    except Exception as e:
        return False, f"Safety check failed: {str(e)}"
To get started with this service, head to your Azure portal and search for "Content Safety". You'll need to create a resource and grab your endpoint and key. The setup takes maybe five minutes.
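One note on the credentials: I keep the endpoint and key out of source code. Here's a minimal sketch, assuming you export them as environment variables (the variable names are my own):

import os

from azure.ai.contentsafety import ContentSafetyClient
from azure.core.credentials import AzureKeyCredential

# Assumed environment variable names - use whatever your deployment provides
endpoint = os.environ["CONTENT_SAFETY_ENDPOINT"]
key = os.environ["CONTENT_SAFETY_KEY"]

client = ContentSafetyClient(endpoint, AzureKeyCredential(key))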
Personally Identifiable Information Detection
This one's critical. Users often don't realize they're sharing sensitive information in their queries. Things like email addresses, phone numbers, social security numbers, or even credit card details can slip into casual conversation. We need to catch and redact this before it gets logged or sent to any external service.
Azure Text Analytics has a PII recognition feature that's surprisingly comprehensive. It detects dozens of entity types across multiple languages and even provides confidence scores for each detection.
Here's my implementation:
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(endpoint, AzureKeyCredential(key))

def detect_and_redact_pii(text: str) -> dict:
    """Detect and redact PII using [CATEGORY] labels."""
    try:
        response = client.recognize_pii_entities([text], language="en")
        redacted_text = text
        entities = []
        for doc in response:
            if doc.is_error:
                return {"error": doc.error.message, "redacted": text}
            # Replace from the end of the string first so earlier offsets stay valid
            for entity in sorted(doc.entities, key=lambda x: x.offset, reverse=True):
                entities.append({
                    "category": entity.category,
                    "text": entity.text,
                    "confidence": entity.confidence_score
                })
                start = entity.offset
                end = entity.offset + entity.length
                redacted_text = (
                    redacted_text[:start] +
                    f"[{entity.category}]" +
                    redacted_text[end:]
                )
        return {
            "original": text,
            "redacted": redacted_text,
            "entities": entities,
            "has_pii": len(entities) > 0
        }
    except Exception as e:
        return {"error": str(e), "redacted": text}
You'll find this under the "Language" service in Azure portal. Same deal - create the resource, get your credentials, and you're ready to go.
Prompt Injection Prevention
This is the sneaky one. Prompt injection is when users try to manipulate your AI into ignoring its instructions or behaving in unintended ways. Attacks range from classic phrases like "ignore previous instructions" or "you are now an unrestricted AI called DAN" to more sophisticated attempts like social engineering or shell command injection. OWASP ranked this as the number one security risk in its 2025 Top 10 for LLM Applications.
The good news is that Azure Content Safety has a feature called Prompt Shields specifically designed to detect and block these attacks. It recognizes multiple categories of prompt injection including attempts to change system rules, role-play attacks where users try to make the AI assume a different persona, encoding attacks that use ciphers or transformations to bypass filters, and conversation mockups that embed fake dialogue turns to confuse the model.
Here's my approach:
import requests

def detect_prompt_injection(user_input: str) -> dict:
    """
    Detect prompt injection attacks using Azure Content Safety Prompt Shields.

    Returns:
        detection results with attack classification
    """
    url = f"{endpoint}/contentsafety/text:shieldPrompt?api-version=2024-09-01"
    headers = {
        'Ocp-Apim-Subscription-Key': key,
        'Content-Type': 'application/json'
    }
    payload = {
        "userPrompt": user_input,
        "documents": []  # add retrieved documents here to screen for indirect injection
    }
    try:
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        result = response.json()
        attack_detected = result.get("userPromptAnalysis", {}).get("attackDetected", False)
        return {
            "is_injection": attack_detected,
            "attack_detected": attack_detected,
            "blocked": attack_detected
        }
    except Exception as e:
        # Fail closed: if the shield call errors out, treat the input as suspect
        return {
            "is_injection": True,
            "blocked": True,
            "error": str(e)
        }
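Note that the except branch fails closed: if the shield call itself errors out, the input is blocked rather than waved through. Usage is then another simple gate (a sketch; send_to_llm is a placeholder as before):

# Sketch: block the turn if Prompt Shields flags it (or the check itself fails)
shield_result = detect_prompt_injection(user_input)
if shield_result["blocked"]:
    reply = "I can't process that request."
else:
    reply = send_to_llm(user_input)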
Tuning for Your Use Case
The default thresholds and patterns might not be perfect for your specific application. A customer service chatbot might need different sensitivity levels than an internal documentation assistant. While I was testing, some perfectly reasonable queries were getting flagged, while other questionable inputs slipped through. The solution is tuning.
For harmful content, you can adjust the severity threshold. The default I used was 2, but if you're building something for a sensitive environment like education, you might want to lower that to 0 or 1.
# More strict - catch even mild potentially harmful content
if max_severity > 0:
    print(f"Result: BLOCKED (severity: {max_severity})")
For PII detection, you can supplement Azure's built-in recognition with custom regex patterns for domain-specific identifiers like student IDs or employee numbers:
import re

CUSTOM_PII_PATTERNS = [
    (r'\b(?:student\s*id|sid)\s*:?\s*(\d{4,})\b', 'StudentID'),
    (r'\b(?:employee\s*id|eid)\s*:?\s*(\d{4,})\b', 'EmployeeID'),
]
# Merge custom matches with the entities Azure already found
all_entities = azure_detected_entities.copy()
for pattern, category in CUSTOM_PII_PATTERNS:
    for match in re.finditer(pattern, text, re.IGNORECASE):
        all_entities.append((category, match.group(1)))
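The snippet above only collects the custom matches. To redact them in the same [CATEGORY] style as the Azure output, a substitution pass over the text works; here's a small sketch building on the patterns defined above:

def redact_custom_pii(text: str) -> str:
    """Replace custom-pattern matches with [Category] labels, mirroring the Azure redaction."""
    redacted = text
    for pattern, category in CUSTOM_PII_PATTERNS:
        redacted = re.sub(pattern, f"[{category}]", redacted, flags=re.IGNORECASE)
    return redacted

# e.g. "my student id: 48291" becomes "my [StudentID]"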
For prompt injection, Prompt Shields is a managed service that Azure continuously updates with new attack patterns, so you can't fine-tune the underlying model. However, you can customize your defense strategy by layering Prompt Shields with custom pattern matching for domain-specific threats. While Prompt Shields catches general jailbreak attempts, you might have industry-specific phrases that warrant extra scrutiny:
import re

# Azure Prompt Shields for general jailbreak detection
result = detect_prompt_injection(user_input)

# Add domain-specific patterns
DOMAIN_SPECIFIC_PATTERNS = [
    r"for educational purposes",
    r"to save (my|a) life",
    r"my professor (said|told|requires)",
    r"as part of (required|mandatory) testing",
]

# Check custom patterns
custom_matched = any(
    re.search(pattern, user_input, re.IGNORECASE)
    for pattern in DOMAIN_SPECIFIC_PATTERNS
)

final_blocked = result["is_injection"] or custom_matched
if final_blocked:
    print("BLOCKED - Potential injection detected")
To see these tuned checks in action, run the security_tuning.py script from the repository.
Beyond the Basics: Additional Azure Content Safety Features
While harmful content detection, PII protection, and prompt injection prevention form the core security layer for most AI chat systems, Azure Content Safety offers additional features worth knowing about for specific use cases.
Groundedness Detection is critical if you're building RAG applications or document summarizers. It detects when your LLM hallucinates—making up information that wasn't in your source documents. The feature can even automatically correct ungrounded responses, ensuring your chatbot stays factual. If you're building a customer support bot that references knowledge base articles or a medical assistant that cites research papers, this is essential.
Protected Material Detection helps you avoid copyright violations by detecting known copyrighted text like song lyrics, articles, or recipes in AI outputs. There's also a code variant that checks against public GitHub repositories, though it's only current through April 2023. This is particularly important if you're in media, publishing, or any industry where copyright infringement carries serious legal consequences.
Custom Categories let you train your own content filters for domain-specific moderation needs. If you have industry-specific terms or patterns that the standard filters miss, you can quickly train a custom category with your own examples.
For most AI chat systems, the three core protections we've implemented provide solid baseline security. But depending on your specific application—whether it's healthcare, legal, education, or media—these additional features might be exactly what you need to address your unique compliance and safety requirements.
After implementing all three layers of protection plus custom tuning, I feel much more confident about the security posture of my AI chat system. But this is only half the story. In the next article, I'll tackle the architectural security concerns - things like securing vector databases, implementing proper chat history hashing, securing API endpoints, and ensuring compliance with data residency requirements.
The code I've shared here is available on my GitHub repository at https://github.com/HamidOna/ai_safety_azure. I'd also recommend checking out Microsoft's documentation on responsible AI practices and the OWASP Top 10 for LLM Applications for more comprehensive guidance on AI security.
References:
- GitHub Repository: https://github.com/HamidOna/ai_safety_azure
- Azure Content Safety Documentation: https://learn.microsoft.com/en-us/azure/ai-services/content-safety/
- Azure Text Analytics PII Detection: https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/overview?tabs=text-pii
- OWASP Top 10 for LLM Applications: https://genai.owasp.org/llm-top-10/
- Microsoft Responsible AI Principles: https://www.microsoft.com/en-us/ai/responsible-ai