AI - Azure AI services Blog

Kickstarting AI Agent Development with Synthetic Data: A GenAI Approach on Azure

Laurentran

Microsoft

Mar 31, 2025

Real-world AI Solutions: Lessons from the Field

Introduction

When building AI agents—especially for internal enterprise use cases—one of the biggest challenges is data access. Real organizational data may be:

Disorganized or poorly labeled
Stored across legacy systems
Gatekept due to security or compliance
Unavailable during early proof-of-concept (PoC) stages, before access has been approved

Instead of waiting months for a data-wrangling project before testing AI Agent feasibility, you can bootstrap your efforts with synthetic data using Azure OpenAI. This lets you validate functionality, test LLM capabilities, and build a working prototype before touching production data.

In this post, we’ll walk through how to use Azure OpenAI to generate realistic, structured synthetic data to power early-stage AI agents for internal tools such as CRM bots, HR assistants, and more.

Use Case: An Internal HR Assistant Agent

Let’s consider an AI agent that answers HR-related questions for your organization. For example:

“How many PTO days do I have left?”
“What’s the parental leave policy?”
“Who is John Smith’s manager?”

The real HR database is undocumented, partially incomplete, and requires security review, so you need realistic mock data to simulate the HR domain for the PoC.

Step 1: Define Data Schema and Agent Capabilities

Start by outlining what data your AI agent needs to function:

Synthetic HR Dataset Fields:

Structured data:

{
  "employee_id": "string",
  "full_name": "string",
  "job_title": "string",
  "department": "string",
  "manager": "string",
  "hire_date": "date",
  "pto_balance": "integer",
  "email": "string",
  "location": "string"
}
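
One way to pin this schema down in code is a small typed record. The sketch below is only an illustration of that idea: the class name `EmployeeRecord` and the `from_dict` helper are assumptions made for this example, not part of any Azure SDK.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class EmployeeRecord:
    """Typed view of one synthetic HR record (hypothetical helper)."""

    employee_id: str
    full_name: str
    job_title: str
    department: str
    manager: str
    hire_date: date
    pto_balance: int
    email: str
    location: str

    @classmethod
    def from_dict(cls, raw: dict) -> "EmployeeRecord":
        """Build a record from a parsed JSON object, failing loudly on gaps."""
        missing = [f.name for f in fields(cls) if f.name not in raw]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return cls(
            employee_id=str(raw["employee_id"]),
            full_name=str(raw["full_name"]),
            job_title=str(raw["job_title"]),
            department=str(raw["department"]),
            manager=str(raw["manager"]),
            # "date" in the schema is read here as an ISO-8601 string.
            hire_date=date.fromisoformat(raw["hire_date"]),
            pto_balance=int(raw["pto_balance"]),
            email=str(raw["email"]),
            location=str(raw["location"]),
        )
```

Keeping the schema in one place like this makes it easier to notice when the generated data and the agent's expectations drift apart.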

Unstructured data:

{
  "policy_docs": ["leave_policy", "expense_policy", "code_of_conduct"]
}

Agent Features to Simulate:

Answering employee info queries
Summarizing HR policies
Extracting PTO balances
Referring to documents

Step 2: Generate Synthetic Data with Azure OpenAI

Use few-shot prompting to generate structured JSON data for fake employees. Below is a sample system message that defines the required fields and gives the LLM an example record, followed by a user prompt.

System Message:

You are a data generation engine that produces high-quality, 
realistic, and schema-compliant synthetic data for enterprise 
use cases.

You generate structured employee profile records in JSON 
format for internal HR systems. Each record should be 
plausible and follow enterprise conventions.

Each employee object must include the following fields:

- employee_id: A unique ID in the format "E" followed by 5 digits (e.g., E10492)
- full_name: A realistic first and last name
- email: Company-style email using the format firstname.lastname@contoso.com
- job_title: A valid job title (e.g., Software Engineer, HR Manager)
- department: One of: Engineering, Marketing, HR, Sales, IT, Finance, Legal
- manager: The full name of a realistic employee (can be shared among records)
- hire_date: A valid ISO-8601 date string between 2015 and 2024
- pto_balance: Integer between 0 and 30
- location: Realistic city and state or city and country (e.g., "Austin, TX", "San Francisco, CA")

To guide your output, use this example as a reference for structure and tone:

```json
[
  {
    "employee_id": "E10231",
    "full_name": "Priya Mehta",
    "email": "priya.mehta@contoso.com",
    "job_title": "HR Business Partner",
    "department": "HR",
    "manager": "Daniel Singh",
    "hire_date": "2019-06-14",
    "pto_balance": 22,
    "location": "Chicago, IL"
  },
…
]
```

Prompt:

Generate 50 synthetic employee records in JSON format for testing an internal HR assistant AI agent. Use the system message guidelines.

Call via Azure OpenAI API:

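import os
from openai import AzureOpenAI

# Credentials and endpoint are read from environment variables.
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

# system_message and prompt must hold the System Message and Prompt text shown above.
response = client.chat.completions.create(
    model="gpt-4o",  # name of your Azure OpenAI model deployment; update as needed
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt},
    ],
)

# The reply is a string containing JSON text; parse it before use.
synthetic_data = response.choices[0].message.content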
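
The model returns text, so it is worth parsing and checking it against the rules stated in the system message before feeding it to an agent. The following is a minimal sketch of such a check, not a definitive validator. The function names are illustrative, and the email rule assumes simple names that map directly to `firstname.lastname`, which will not hold for every realistic name.

```python
import json
import re
from datetime import date

ALLOWED_DEPARTMENTS = {"Engineering", "Marketing", "HR", "Sales", "IT", "Finance", "Legal"}
ID_PATTERN = re.compile(r"^E\d{5}$")
FENCE = "`" * 3  # Markdown code fence marker that models often wrap JSON in


def strip_code_fence(text: str) -> str:
    """Drop a surrounding Markdown code fence if the model added one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


def validate_record(rec: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if not ID_PATTERN.match(str(rec.get("employee_id", ""))):
        problems.append("employee_id must be 'E' plus 5 digits")
    # Assumption: the email is the lowercased name parts joined by dots.
    name = str(rec.get("full_name", ""))
    expected_email = ".".join(name.lower().split()) + "@contoso.com"
    if rec.get("email") != expected_email:
        problems.append(f"email should be {expected_email}")
    if rec.get("department") not in ALLOWED_DEPARTMENTS:
        problems.append("unknown department")
    try:
        year = date.fromisoformat(str(rec.get("hire_date"))).year
        if not 2015 <= year <= 2024:
            problems.append("hire_date outside 2015-2024")
    except ValueError:
        problems.append("hire_date is not an ISO-8601 date")
    pto = rec.get("pto_balance")
    if not isinstance(pto, int) or isinstance(pto, bool) or not 0 <= pto <= 30:
        problems.append("pto_balance must be an integer from 0 to 30")
    return problems


def parse_and_validate(raw_text: str) -> tuple[list[dict], dict[str, list[str]]]:
    """Parse the model output and split records into valid and rejected."""
    records = json.loads(strip_code_fence(raw_text))
    valid, rejected, seen_ids = [], {}, set()
    for rec in records:
        problems = validate_record(rec)
        emp_id = str(rec.get("employee_id"))
        if emp_id in seen_ids:
            problems.append("duplicate employee_id")
        seen_ids.add(emp_id)
        if problems:
            rejected[emp_id] = problems
        else:
            valid.append(rec)
    return valid, rejected
```

Calling `parse_and_validate(synthetic_data)` gives you the usable records plus a per-ID list of reasons for anything rejected, which is also useful feedback for tightening the system message.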

Step 3: Add Generated Policy Documents

To enable your agent to retrieve and summarize policies, use GPT-4o or GPT-4.5 to create synthetic HR documents.

Prompt:

Write an HR policy document titled “Leave Policy” in ~500 words. 
It should include sections on sick leave, parental leave, 
vacation, and approval workflow.

Save the result as a .txt file in Azure Blob Storage or inject it into a RAG pipeline.
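
If you take the RAG route, the generated document first has to be split into retrievable chunks. Here is a minimal sketch, assuming paragraph-aligned chunking with a word budget. The function name, the 120-word default, and the metadata keys are illustrative choices for this example, not Azure recommendations.

```python
def chunk_policy(text: str, doc_name: str, max_words: int = 120) -> list[dict]:
    """Group paragraphs into chunks of roughly max_words words.

    Paragraphs are never split, so a single paragraph longer than
    max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return [
        {"doc": doc_name, "chunk_id": i, "text": chunk, "source": "synthetic"}
        for i, chunk in enumerate(chunks)
    ]
```

Each chunk carries its document name so the agent can cite which policy an answer came from.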

Step 4: Build the AI Agent on Synthetic Data

Now that you have synthetic employee data and documents, build a simple agent that can:

Answer questions by querying the synthetic dataset
Retrieve chunks from the documents
Simulate chat-based interaction using Azure OpenAI
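
The first two capabilities can be sketched as plain Python functions that an agent could later expose as tools. This is a simplified illustration under stated assumptions: the function names are hypothetical, employees are looked up by full name, and retrieval uses naive keyword overlap as a stand-in for a real search index or embedding-based retriever.

```python
import re


def build_index(records: list[dict]) -> dict[str, dict]:
    """Index employee records by lowercased full name."""
    return {r["full_name"].lower(): r for r in records}


def get_pto_balance(index: dict[str, dict], full_name: str) -> str:
    """Answer a PTO question, with an explicit message for unknown employees."""
    rec = index.get(full_name.strip().lower())
    if rec is None:
        return f"No employee named {full_name} was found."
    return f"{rec['full_name']} has {rec['pto_balance']} PTO days left."


def get_manager(index: dict[str, dict], full_name: str) -> str:
    """Answer a manager question, with an explicit message for unknown employees."""
    rec = index.get(full_name.strip().lower())
    if rec is None:
        return f"No employee named {full_name} was found."
    return f"{rec['full_name']}'s manager is {rec['manager']}."


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_chunks(chunks: list[dict], query: str, top_k: int = 2) -> list[dict]:
    """Rank chunks by the number of words they share with the query."""
    query_words = _words(query)
    scored = [(len(query_words & _words(c["text"])), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# Usage with the example record from the system message.
index = build_index([
    {"employee_id": "E10231", "full_name": "Priya Mehta",
     "manager": "Daniel Singh", "pto_balance": 22},
])
print(get_pto_balance(index, "Priya Mehta"))
print(get_manager(index, "John Smith"))
```

Returning a clear "not found" message instead of raising keeps the missing-employee case visible in chat transcripts during testing.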

Create a library of test queries:

“What’s my PTO balance?”
“Who’s my manager?”
“What’s the parental leave policy?”
“Send me the expense policy document.”

Evaluate how the agent responds, where it struggles, and what functions need real data later.

The generated synthetic data lets you test routing logic, tool calls, and handoffs. It also lets you validate system behavior, edge cases, and failure handling (e.g., missing employees).
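
A test library like the one above can be turned into a small regression check for routing. The sketch below is illustrative only: in a real agent the routing decision would typically come from the LLM's tool selection, so the keyword router here is a stand-in, and the tool names are hypothetical.

```python
import re

# Ordered routes: the first tool whose keywords appear in the query wins.
ROUTES = [
    ("get_pto_balance", {"pto"}),
    ("get_manager", {"manager"}),
    ("send_document", {"send"}),
    ("summarize_policy", {"policy"}),
]

# The test queries from this post, paired with the tool each should reach.
TEST_QUERIES = [
    ("What's my PTO balance?", "get_pto_balance"),
    ("Who's my manager?", "get_manager"),
    ("What's the parental leave policy?", "summarize_policy"),
    ("Send me the expense policy document.", "send_document"),
]


def route(query: str) -> str:
    """Pick a tool name for a query, or 'fallback' if nothing matches."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    for tool, keywords in ROUTES:
        if tokens & keywords:
            return tool
    return "fallback"


def run_routing_tests(cases: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Return (query, expected, actual) for every case that routed incorrectly."""
    failures = []
    for query, expected in cases:
        actual = route(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures


if __name__ == "__main__":
    for failure in run_routing_tests(TEST_QUERIES):
        print("MISROUTED:", failure)
```

Swapping `route` for a call into the actual agent keeps the same harness usable once the LLM makes the routing decision.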

Step 5: Transition to Real Data Gradually

Once the concept works, and as your data preparation efforts make progress, follow a gradual transition plan to integrate organizational data:

Document mapping: Align synthetic fields to real data sources
Define integration boundaries: Secure APIs to retrieve real records
Replace modules incrementally: Swap synthetic data for real where permitted
Use synthetic data for staging: Retain as a safe test environment
Model evaluation: Run agents on real data and compare performance
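
One way to make the incremental swap practical is to have the agent depend on a small data-source interface instead of on the synthetic dataset directly. This is a sketch under assumptions: the class names are hypothetical, and the real-system column names in `FIELD_MAP` (`EMP_NO`, `DISPLAY_NAME`, `PTO_REMAINING`) are invented placeholders for whatever your HR system actually uses.

```python
from typing import Callable, Optional, Protocol


class EmployeeSource(Protocol):
    """Anything that can return an employee record in the synthetic schema."""

    def get(self, employee_id: str) -> Optional[dict]: ...


class SyntheticSource:
    """Serves records from the generated dataset."""

    def __init__(self, records: list[dict]) -> None:
        self._by_id = {r["employee_id"]: r for r in records}

    def get(self, employee_id: str) -> Optional[dict]:
        return self._by_id.get(employee_id)


# Hypothetical mapping from real-system columns to synthetic field names.
FIELD_MAP = {
    "EMP_NO": "employee_id",
    "DISPLAY_NAME": "full_name",
    "PTO_REMAINING": "pto_balance",
}


class MappedSource:
    """Wraps a real-data fetch function and renames its columns.

    Columns that are not in the mapping are dropped, so unmapped
    (possibly sensitive) fields never reach the agent.
    """

    def __init__(self, fetch_row: Callable[[str], Optional[dict]],
                 field_map: dict[str, str]) -> None:
        self._fetch_row = fetch_row
        self._field_map = field_map

    def get(self, employee_id: str) -> Optional[dict]:
        row = self._fetch_row(employee_id)
        if row is None:
            return None
        return {self._field_map[k]: v for k, v in row.items() if k in self._field_map}


def pto_answer(source: EmployeeSource, employee_id: str) -> str:
    """Agent-side logic that works the same against either source."""
    rec = source.get(employee_id)
    if rec is None:
        return f"Employee {employee_id} not found."
    return f"{rec['full_name']} has {rec['pto_balance']} PTO days left."
```

Because `pto_answer` only sees the interface, the synthetic source can stay in place for staging while production points at the mapped real source.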

Best Practices

Keep Synthetic Data Realistic: Use realistic organization names, job titles, email formats, and document structures.
Use few-shot prompting & JSON format: Provide examples in the system message, and leverage JSON format for structured data generation.
Separate Generative Phases: Split synthetic data generation into tabular data and unstructured documents, keeping datasets clean and usable independently.
Tag and Track Synthetic Data: Keep track of synthetic data with tags. For example, mark generated records with the following metadata:

 "source": "synthetic",
 "generation_id": "2025-04-HR-Gen1"

Validate Diversity & Realism: Check your data for representation across fields such as departments, locations, and managers. Ensure variability in date ranges and PTO balances.
Monitor Agent Behavior: Log outputs to evaluate hallucinations, error modes, and retrieval issues. Apply guardrails, even during PoC, laying the framework for production roll-out.
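
The tagging and diversity practices above can be sketched in a few lines. This is an illustrative example, not a complete data-quality check: the function names are hypothetical, and the second record in the usage snippet is made up for the demonstration.

```python
from collections import Counter


def tag_records(records: list[dict], generation_id: str) -> list[dict]:
    """Return copies of the records marked as synthetic with a generation ID."""
    return [
        {**rec, "source": "synthetic", "generation_id": generation_id}
        for rec in records
    ]


def diversity_report(records: list[dict],
                     fields: tuple[str, ...] = ("department", "location", "manager")) -> dict:
    """Count values per field and summarize the PTO and hire-year ranges."""
    if not records:
        raise ValueError("no records to analyze")
    report = {field: Counter(rec[field] for rec in records) for field in fields}
    pto_values = [rec["pto_balance"] for rec in records]
    # hire_date is an ISO-8601 string, so the first four characters are the year.
    hire_years = [int(rec["hire_date"][:4]) for rec in records]
    report["pto_range"] = (min(pto_values), max(pto_values))
    report["hire_year_range"] = (min(hire_years), max(hire_years))
    return report


# Usage: the first record is the example from the system message,
# the second is invented here purely for illustration.
records = [
    {"employee_id": "E10231", "department": "HR", "location": "Chicago, IL",
     "manager": "Daniel Singh", "hire_date": "2019-06-14", "pto_balance": 22},
    {"employee_id": "E10232", "department": "Engineering", "location": "Austin, TX",
     "manager": "Daniel Singh", "hire_date": "2021-03-01", "pto_balance": 8},
]
tagged = tag_records(records, "2025-04-HR-Gen1")
report = diversity_report(tagged)
```

A field dominated by a single value in the report, such as one manager for every employee, is a hint to regenerate or adjust the prompt.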

Conclusion

With GenAI, organizations don’t need perfect data to start building useful AI agents. Synthetic data generated by Azure OpenAI enables functional PoCs that prove agent value. By simulating the expected structure and interactions, you can enable developers, stakeholders, and users to see the vision early and accelerate time-to-market.

Additional AI Best Practices blog posts:

Best Practices for Requesting Quota Increase for Azure OpenAI Models

Best Practices for Leveraging Azure OpenAI in Constrained Optimization Scenarios

Best Practices for Structured Extraction from Documents Using Azure OpenAI

Best Practices for Using Generative AI in Automated Response Generation for Complex Decision Making

Best Practices for Leveraging Azure OpenAI in Code Conversion Scenarios

Updated Apr 01, 2025

Version 6.0

