Real-world AI Solutions: Lessons from the Field
Introduction
When building AI agents—especially for internal enterprise use cases—one of the biggest challenges is data access. Real organizational data may be:
- Disorganized or poorly labeled
- Stored across legacy systems
- Gatekept due to security or compliance
- Unavailable during early proof-of-concept (PoC) stages
Instead of waiting months for a data-wrangling project before testing AI agent feasibility, you can bootstrap your efforts with synthetic data generated by Azure OpenAI. This lets you validate functionality, test LLM capabilities, and build a working prototype before touching production data.
In this post, we’ll walk through how to use Azure OpenAI to generate realistic, structured synthetic data to power early-stage AI agents for internal tools such as CRM bots, HR assistants, and more.
Use Case: An Internal HR Assistant Agent
Let’s consider an AI agent that answers HR-related questions for your organization. For example:
- “How many PTO days do I have left?”
- “What’s the parental leave policy?”
- “Who is John Smith’s manager?”
The real HR database is undocumented, partially incomplete, and requires security review, so you need realistic mock data to simulate the HR domain for the PoC.
Step 1: Define Data Schema and Agent Capabilities
Start by outlining what data your AI agent needs to function:
Synthetic HR Dataset Fields:
Structured data:
{
  "employee_id": "string",
  "full_name": "string",
  "job_title": "string",
  "department": "string",
  "manager": "string",
  "hire_date": "date",
  "pto_balance": "integer",
  "email": "string",
  "location": "string"
}
Unstructured data:
{
  "policy_docs": ["leave_policy", "expense_policy", "code_of_conduct"]
}
Agent Features to Simulate:
- Answering employee info queries
- Summarizing HR policies
- Extracting PTO balances
- Referring to documents
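As a minimal sketch, the structured schema above could be mirrored in Python as a dataclass so that malformed records fail early. The class name `EmployeeRecord` and the `from_dict` helper are illustrative assumptions, and the sketch assumes `hire_date` arrives as an ISO-8601 string.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EmployeeRecord:
    """One row of the synthetic HR dataset (field names from the schema above)."""

    employee_id: str
    full_name: str
    job_title: str
    department: str
    manager: str
    hire_date: date
    pto_balance: int
    email: str
    location: str

    @classmethod
    def from_dict(cls, raw: dict) -> "EmployeeRecord":
        # Assumption: hire_date is an ISO-8601 string such as "2019-06-14".
        fields = dict(raw)
        fields["hire_date"] = date.fromisoformat(raw["hire_date"])
        fields["pto_balance"] = int(raw["pto_balance"])
        # Unknown or missing keys raise TypeError here, which surfaces schema drift.
        return cls(**fields)
```

Because unexpected or missing keys raise an error at construction time, schema drift in generated data shows up immediately rather than inside the agent.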
Step 2: Generate Synthetic Data with Azure OpenAI
Use few-shot prompting to generate structured JSON data for fake employees. Below is a sample system message that defines the required fields and provides an example for the LLM, followed by a user prompt.
System Message:
You are a data generation engine that produces high-quality, 
realistic, and schema-compliant synthetic data for enterprise 
use cases.
You generate structured employee profile records in JSON 
format for internal HR systems. Each record should be 
plausible and follow enterprise conventions.
Each employee object must include the following fields:
- employee_id: A unique ID in the format "E" followed by 5 digits (e.g., E10492)
- full_name: A realistic first and last name
- email: Company-style email using the format firstname.lastname@contoso.com
- job_title: A valid job title (e.g., Software Engineer, HR Manager)
- department: One of: Engineering, Marketing, HR, Sales, IT, Finance, Legal
- manager: The full name of a realistic employee (can be shared among records)
- hire_date: A valid ISO-8601 date string between 2015 and 2024
- pto_balance: Integer between 0 and 30
- location: Realistic city and state or city and country (e.g., "Austin, TX", "San Francisco, CA")
To guide your output, use this example as a reference for structure and tone:
```json
[
  {
    "employee_id": "E10231",
    "full_name": "Priya Mehta",
    "email": "priya.mehta@contoso.com",
    "job_title": "HR Business Partner",
    "department": "HR",
    "manager": "Daniel Singh",
    "hire_date": "2019-06-14",
    "pto_balance": 22,
    "location": "Chicago, IL"
  },
…
]
```
Prompt:
Generate 50 synthetic employee records in JSON format for testing an internal HR assistant AI agent. Use the system message guidelines.
Call via Azure OpenAI API:
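import os

from openai import AzureOpenAI

# Paste the full system message and user prompt shown above.
system_message = "..."
prompt = (
    "Generate 50 synthetic employee records in JSON format for testing "
    "an internal HR assistant AI agent. Use the system message guidelines."
)

# The key, API version, and endpoint are read from environment variables.
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

response = client.chat.completions.create(
    model="gpt-4o",  # your Azure deployment name; update as needed
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt},
    ],
)

# The generated records come back as a JSON string in the message content.
synthetic_data = response.choices[0].message.content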
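Model output is not guaranteed to follow the system message, so it is worth checking before use. The following is a minimal sketch that parses the reply and checks each record against the rules listed in the system message. The function names `parse_records` and `validate_record` are illustrative assumptions, as is the guess that the model may wrap its JSON in a Markdown code fence. The email check is deliberately strict and only compares the first and last name tokens.

```python
import json
import re
from datetime import date

FENCE = "`" * 3  # Markdown code-fence marker
DEPARTMENTS = {"Engineering", "Marketing", "HR", "Sales", "IT", "Finance", "Legal"}
REQUIRED_FIELDS = {
    "employee_id", "full_name", "email", "job_title", "department",
    "manager", "hire_date", "pto_balance", "location",
}


def parse_records(raw_text):
    """Parse the model reply into a list of dicts, tolerating a Markdown fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def validate_record(rec):
    """Return a list of rule violations for one record (empty list means valid)."""
    missing = REQUIRED_FIELDS - set(rec)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    errors = []
    if not re.fullmatch(r"E\d{5}", str(rec["employee_id"])):
        errors.append("employee_id is not 'E' plus 5 digits")
    # Strict check: email must be built from the first and last name tokens.
    parts = str(rec["full_name"]).lower().split()
    if len(parts) < 2 or rec["email"] != f"{parts[0]}.{parts[-1]}@contoso.com":
        errors.append("email does not match firstname.lastname@contoso.com")
    if rec["department"] not in DEPARTMENTS:
        errors.append("department is not in the allowed list")
    try:
        if not 2015 <= date.fromisoformat(rec["hire_date"]).year <= 2024:
            errors.append("hire_date is outside 2015-2024")
    except (TypeError, ValueError):
        errors.append("hire_date is not an ISO-8601 date")
    pto = rec["pto_balance"]
    if not (isinstance(pto, int) and 0 <= pto <= 30):
        errors.append("pto_balance is not an integer between 0 and 30")
    return errors
```

A typical use would be `records = parse_records(synthetic_data)` followed by collecting `validate_record(r)` for each record, then regenerating or discarding any record that reports violations.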
Step 3: Add Generated Policy Documents
To enable your agent to retrieve and summarize policies, use GPT-4o or GPT-4.5 to create synthetic HR documents.
Prompt:
Write an HR policy document titled “Leave Policy” in ~500 words. 
It should include sections on sick leave, parental leave, 
vacation, and approval workflow.
Save the result as a .txt file in Azure Blob Storage or inject it into a RAG pipeline.
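If the document goes into a RAG pipeline, it usually needs to be split into chunks first. The sketch below shows one simple approach, fixed-size word windows with overlap. The function name `chunk_words` and the default sizes are assumptions chosen for illustration, not values from a particular pipeline.

```python
def chunk_words(text, chunk_size=120, overlap=20):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text in the index.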
Step 4: Build the AI Agent on Synthetic Data
Now that you have synthetic employee data and documents, build a simple agent that can:
- Answer questions by querying the synthetic dataset
- Retrieve chunks from the documents
- Simulate chat-based interaction using Azure OpenAI
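The first capability, querying the synthetic dataset, can be sketched as a pair of lookup functions that an agent could expose as tools. The names `build_index`, `get_pto_balance`, and `get_manager` are hypothetical, and the sketch assumes records are dicts keyed as in the schema from Step 1.

```python
from typing import Optional


def build_index(records: list) -> dict:
    """Index records by lower-cased full name for case-insensitive lookup."""
    return {rec["full_name"].strip().lower(): rec for rec in records}


def get_pto_balance(index: dict, full_name: str) -> Optional[int]:
    """Return the employee's PTO balance, or None if the employee is unknown."""
    rec = index.get(full_name.strip().lower())
    return rec["pto_balance"] if rec else None


def get_manager(index: dict, full_name: str) -> Optional[str]:
    """Return the employee's manager, or None if the employee is unknown."""
    rec = index.get(full_name.strip().lower())
    return rec["manager"] if rec else None
```

Returning `None` for an unknown employee, rather than raising, gives the agent an explicit "not found" signal it can turn into a polite answer.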
Create a library of test queries:
- “What’s my PTO balance?”
- “Who’s my manager?”
- “What’s the parental leave policy?”
- “Send me the expense policy document.”
Evaluate how the agent responds, where it struggles, and which functions will need real data later.
The generated synthetic data lets you test routing logic, tool calls, and handoffs. It also lets you validate system behavior, edge cases, and failure handling (e.g., missing employees).
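One way to exercise routing logic is a small regression suite built from the test queries. In the sketch below, a keyword router stands in for whatever routing the agent actually uses (a real agent would more likely rely on LLM tool calling). The tool names, the `route` function, and the out-of-scope weather query are all assumptions added for illustration.

```python
# Hypothetical tool names mapped to the keywords that trigger them.
ROUTES = {
    "get_pto_balance": ("pto",),
    "get_manager": ("manager",),
    "get_policy": ("policy",),
}

# (query, expected tool) pairs; the last query is an out-of-scope edge case.
TEST_CASES = [
    ("What's my PTO balance?", "get_pto_balance"),
    ("Who's my manager?", "get_manager"),
    ("What's the parental leave policy?", "get_policy"),
    ("Send me the expense policy document.", "get_policy"),
    ("What's the weather today?", "fallback"),
]


def route(query):
    """Pick the first tool whose keyword appears in the query, else 'fallback'."""
    lowered = query.lower()
    for tool, keywords in ROUTES.items():
        if any(keyword in lowered for keyword in keywords):
            return tool
    return "fallback"


def run_suite(router, cases):
    """Return (query, expected, actual) for every case the router gets wrong."""
    failures = []
    for query, expected in cases:
        actual = router(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures
```

Because `run_suite` accepts any callable, the same cases can later be pointed at the real routing component without changing the test library.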
Step 5: Transition to Real Data Gradually
Once the concept works, and as your data preparation efforts make progress, follow a gradual transition plan to integrate organizational data:
- Document mapping: Align synthetic fields to real data sources
- Define integration boundaries: Secure APIs to retrieve real records
- Replace modules incrementally: Swap synthetic data for real where permitted
- Use synthetic data for staging: Retain as a safe test environment
- Model evaluation: Run agents on real data and compare performance
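The document-mapping step can be sketched as a small translation layer, so the agent keeps seeing the synthetic schema while the source changes underneath. The source column names in `FIELD_MAP` below are purely hypothetical placeholders, since the real HR system's fields are not known here.

```python
# Synthetic field name -> hypothetical column name in the real HR system.
# Replace the right-hand values with the actual source fields.
FIELD_MAP = {
    "employee_id": "EMP_NO",
    "full_name": "DISPLAY_NAME",
    "pto_balance": "PTO_DAYS_REMAINING",
}


def to_agent_schema(real_row, field_map=FIELD_MAP):
    """Translate one real record into the schema the agent was built against."""
    missing = [source for source in field_map.values() if source not in real_row]
    if missing:
        raise KeyError(f"real record is missing mapped fields: {missing}")
    # Fields that are not in the mapping are dropped on purpose, which keeps
    # unreviewed data out of the agent.
    return {target: real_row[source] for target, source in field_map.items()}
```

Keeping the mapping in one place also documents which real fields have been cleared for use, which supports swapping modules incrementally.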
Best Practices
- Keep Synthetic Data Realistic: Use realistic organization names, job titles, email formats, and document structures.
- Use few-shot prompting & JSON format: Provide examples in the system message, and leverage JSON format for structured data generation.
- Separate Generative Phases: Split synthetic data generation into tabular data and unstructured documents, keeping datasets clean and usable independently.
- Tag and Track Synthetic Data: Keep track of synthetic data with tags. For example, mark generated records with the following metadata:
"source": "synthetic"
"generation_id": "2025-04-HR-Gen1"
- Validate Diversity & Realism: Check your data for representation across fields such as departments, locations, and managers. Ensure variability in date ranges and PTO balances.
- Monitor Agent Behavior: Log outputs to evaluate hallucinations, error modes, and retrieval issues. Apply guardrails, even during PoC, laying the framework for production roll-out.
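Two of these practices, tagging and diversity checks, are easy to automate. The sketch below is one possible approach. The helper names `tag_records` and `field_distribution` are assumptions, while the metadata keys match the tagging example above.

```python
from collections import Counter


def tag_records(records, generation_id):
    """Return copies of the records with synthetic-data metadata attached."""
    return [
        {**rec, "source": "synthetic", "generation_id": generation_id}
        for rec in records
    ]


def field_distribution(records, field):
    """Count how often each value of a field occurs, to spot skewed data."""
    return Counter(rec.get(field) for rec in records)
```

For example, `field_distribution(records, "department")` quickly shows whether the generator placed most employees in a single department.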
Conclusion
With GenAI, organizations don’t need perfect data to start building useful AI agents. Synthetic data generated by Azure OpenAI enables functional PoCs that prove agent value. By simulating the expected structure and interactions, you can enable developers, stakeholders, and users to see the vision early and accelerate time-to-market.