Microsoft Developer Community Blog

How to Build Safe Natural Language-Driven APIs

pratikpanda
Feb 03, 2026

Design patterns for accepting natural language as input without losing control

TL;DR

Building production natural language APIs requires separating semantic parsing from execution. Use LLMs to translate user text into canonical structured requests (via schemas), then execute those requests deterministically.

Key patterns: schema completion for clarification, confidence gates to prevent silent failures, code-based ontologies for normalization, and a deterministic orchestration layer (LangGraph in this post). This keeps language as input, not as your API contract.

Introduction

APIs that accept natural language as input are quickly becoming the norm in the age of agentic AI apps and LLMs. From search and recommendations to workflows and automation, users increasingly expect to "just ask" and get results.

But treating natural language as an API contract introduces serious risks in production systems:

  • Nondeterministic behavior
  • Prompt-driven business logic
  • Difficult debugging and replay
  • Silent failures that are hard to detect

In this post, I'll describe a production-grade architecture for building safe, natural language-driven APIs: one that embraces LLMs for intent discovery and entity extraction while preserving the determinism, observability, and reliability that backend systems require.

This approach is based on building real systems using Azure OpenAI and LangGraph, and on lessons learned the hard way.

The Core Problem with Natural Language APIs

Natural language is an excellent interface for humans. It is a poor interface for systems.

When APIs accept raw text directly and execute logic based on it, several problems emerge:

  • The API contract becomes implicit and unversioned
  • Small prompt changes cause behavioral changes
  • Business logic quietly migrates into prompts

In short: language becomes the contract, and that's fragile.

The solution is not to avoid natural language, but to contain it.

A Key Principle: Natural Language Is Input, Not a Contract

So how do we contain it? The answer lies in treating natural language fundamentally differently from traditional API inputs.

The most important design decision we made was this:

Natural language should be translated into structure, not executed directly.

That single principle drives the entire architecture.

Instead of building "chatty APIs," we split responsibilities clearly:

  • Natural language is used for intent discovery and entity extraction
  • Structured data is used for execution

Two Explicit API Layers

This principle translates into a concrete architecture with two distinct API layers, each with a single, clear responsibility.

1. Semantic Parse API (Natural Language → Structure)

This API:

  • Accepts user text
  • Extracts intent and entities using LLMs
  • Completes a predefined schema
  • Asks clarifying questions when required
  • Returns a canonical, structured request
  • Does not execute business logic

Think of this as a compiler, not an engine.

2. Structured Execution API (Structure → Action)

This API:

  • Accepts only structured input
  • Calls downstream systems to process the request and get results
  • Is deterministic and versioned
  • Contains no natural language handling
  • Is fully testable and replayable

This is where execution happens.
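
As a rough sketch of this split (the function names, handler table, and stubbed helpers below are illustrative, not the post's actual implementation), the two layers reduce to two functions with very different contracts:

from typing import Any, Optional

def llm_extract(user_text: str, state: Optional[dict]) -> dict:
    """Azure OpenAI call that proposes an intent and entities (stubbed for illustration)."""
    return {"intent": "recommend_similar",
            "entities": {"reference_product_id": "blue_backpack_123",
                         "price_bias": -0.8, "quality_bias": 0.0}}

def recommend_similar(entities: dict) -> dict:
    """Downstream recommendation call (stubbed for illustration)."""
    return {"results": []}

INTENT_HANDLERS = {"recommend_similar": recommend_similar}

# Layer 1: Semantic Parse API -- free text in, canonical structure out. No business logic here.
# (Schema validation and clarification are added in the sections that follow.)
def semantic_parse(user_text: str, state: Optional[dict] = None) -> dict[str, Any]:
    draft = llm_extract(user_text, state)
    return {"status": "complete", "canonical_request": draft}

# Layer 2: Structured Execution API -- structured input only, deterministic dispatch per intent.
def execute(canonical_request: dict[str, Any]) -> dict[str, Any]:
    handler = INTENT_HANDLERS[canonical_request["intent"]]
    return handler(canonical_request["entities"])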

Why This Separation Matters

Separating these layers gives you:

  • A stable, versionable API contract
  • Freedom to improve NLP without breaking clients
  • Clear ownership boundaries
  • Deterministic execution paths

Most importantly, it prevents LLM behavior from leaking into core business logic.

Canonical Schemas Are the Backbone

Now that we've established the two-layer architecture, let's dive into what makes it work: canonical schemas.

Each supported intent is defined by a canonical schema that lives in code.

Example (simplified):

This schema is used when a user is looking for similar product recommendations. The entities capture which product to use as reference and how to bias the recommendations toward price or quality.

{
  "intent": "recommend_similar",
  "entities": {
    "reference_product_id": "string",
    "price_bias": "number (-1 to 1)",
    "quality_bias": "number (-1 to 1)"
  }
}

Schemas define:

  • Required vs optional fields
  • Allowed ranges and types
  • Validation rules

They are the contract, not the prompt.
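
Here is one way the contract can live in code: a small declarative spec per intent that both the semantic layer and validation share. This is a sketch; the SCHEMAS structure and validate helper are illustrative, not the post's exact implementation.

# Code-owned schema spec for one intent. Field names mirror the JSON example above.
SCHEMAS = {
    "recommend_similar": {
        "required": ["reference_product_id"],
        "optional": ["price_bias", "quality_bias"],
        "types": {"reference_product_id": str, "price_bias": float, "quality_bias": float},
        "ranges": {"price_bias": (-1.0, 1.0), "quality_bias": (-1.0, 1.0)},
    }
}

def validate(intent: str, entities: dict) -> list:
    """Return a list of validation problems; an empty list means the request is canonical."""
    spec, problems = SCHEMAS[intent], []
    for field in spec["required"]:
        if entities.get(field) is None:
            problems.append(f"missing required field: {field}")
    for field, (lo, hi) in spec["ranges"].items():
        value = entities.get(field)
        if value is not None and not lo <= value <= hi:
            problems.append(f"{field} out of range [{lo}, {hi}]")
    return problems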

When a user says "show me products like the blue backpack but cheaper", the LLM extracts:

  • Intent: recommend_similar
  • reference_product_id: "blue_backpack_123"
  • price_bias: -0.8 (strongly prefer cheaper)
  • quality_bias: 0.0 (neutral)

The schema ensures that even if the user phrased it as "find alternatives to item 123 with better pricing" or "cheaper versions of that blue bag", the output is always the same structure. The natural language variation is absorbed at the semantic layer. The execution layer receives a consistent, validated request every time.

This decoupling is what makes the system maintainable.

Schema Completion, Not Free-Form Chat

But what happens when the user's input doesn't contain all the information needed to complete the schema? This is where structured clarification comes in.

A common misconception is that clarification means "chatting until it feels right."

In production systems, clarification is schema completion.

If required fields are missing or ambiguous, the semantic API responds with:

  • What information is missing
  • A targeted clarification question
  • The current schema state

Example response:

{
  "status": "needs_clarification",
  "missing_fields": ["reference_product_id"],
  "question": "Which product should I compare against?",
  "state": {
    "intent": "recommend_similar",
    "entities": {
      "reference_product_id": null,
      "price_bias": -0.3,
      "quality_bias": 0.4
    }
  }
}

The state object is the memory. The API itself remains stateless.
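
In code, the clarification decision is just a validation pass over the merged schema state, reusing the kind of SCHEMAS spec sketched earlier. The per-field question text below is an illustrative assumption; in practice it would live in configuration.

# Illustrative per-field clarification prompts.
CLARIFICATION_QUESTIONS = {
    "reference_product_id": "Which product should I compare against?",
}

def complete_or_clarify(intent: str, entities: dict) -> dict:
    """Schema completion: return a canonical request, or a targeted question plus the current state."""
    missing = [field for field in SCHEMAS[intent]["required"] if entities.get(field) is None]
    if missing:
        return {
            "status": "needs_clarification",
            "missing_fields": missing,
            "question": CLARIFICATION_QUESTIONS[missing[0]],
            "state": {"intent": intent, "entities": entities},
        }
    return {
        "status": "complete",
        "canonical_request": {"intent": intent, "entities": entities},
    }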

A Complete Conversation Flow

To illustrate how schema completion works in practice, here's a full conversation flow where the user's initial request is missing required information:

Initial Request:

User: "Show me cheaper alternatives with good quality"

API Response (needs clarification):

{
  "status": "needs_clarification",
  "missing_fields": ["reference_product_id"],
  "question": "Which product should I compare against?",
  "state": {
    "intent": "recommend_similar",
    "entities": {
      "reference_product_id": null,
      "price_bias": -0.3,
      "quality_bias": 0.4
    }
  }
}

Follow-up Request:

User: "The blue backpack"

Client sends:

{ 
  "user_input": "The blue backpack", 
  "state": {
    "intent": "recommend_similar",
    "entities": {
      "reference_product_id": null,
      "price_bias": -0.3,
      "quality_bias": 0.4
    }
  }
}

API Response (complete):

{
  "status": "complete",
  "canonical_request": {
    "intent": "recommend_similar",
    "entities": {
      "reference_product_id": "blue_backpack_123",
      "price_bias": -0.3,
      "quality_bias": 0.4
    }
  }
}

The client passes the state back with each clarification. The API remains stateless, while the client manages the conversation context. Once complete, the canonical_request can be sent directly to the execution API.
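
On the client side, this is a small loop: call the semantic API, surface the question when clarification is needed, and send the returned state back with the user's next message. A minimal sketch, assuming hypothetical call_semantic_api and ask_user helpers:

def call_semantic_api(payload: dict) -> dict:
    """POST the payload to the Semantic Parse API (stubbed for illustration)."""
    ...

def ask_user(question: str) -> str:
    """Show the clarification question to the user and return their answer (stubbed)."""
    ...

def resolve_request(first_message: str) -> dict:
    """Drive the stateless semantic API until it returns a complete canonical request."""
    user_input, state = first_message, None
    while True:
        response = call_semantic_api({"user_input": user_input, "state": state})
        if response["status"] == "complete":
            return response["canonical_request"]      # ready for the execution API
        state = response["state"]                     # the client carries the conversation state
        user_input = ask_user(response["question"])   # e.g. "Which product should I compare against?"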

Why LangGraph Fits This Problem Perfectly

With schemas and clarification flows defined, we need a way to orchestrate the semantic parsing workflow reliably. This is where LangGraph becomes valuable.

LangGraph allows semantic parsing to be modeled as a structured, deterministic workflow with explicit decision points:

  1. Classify intent: Determine what the user wants to do from a predefined set of supported actions
  2. Extract candidate entities: Pull out relevant parameters from the natural language input using the LLM
  3. Merge into schema state: Map the extracted values into the canonical schema structure
  4. Validate required fields: Check if all mandatory fields are present and values are within acceptable ranges
  5. Either complete or request clarification: Return the canonical request if complete, or ask a targeted question if information is missing

Each node has a single responsibility. Validation and routing are done in code, not by the LLM.
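
Here is a sketch of how that workflow can be wired up in LangGraph. The node bodies are elided and the state and node names are illustrative; the point is that routing is driven by plain code over the schema state.

from typing import Optional, TypedDict

from langgraph.graph import END, START, StateGraph

class ParseState(TypedDict):
    user_input: str
    intent: Optional[str]
    entities: dict
    missing_fields: list

# Node implementations elided; each takes the state and returns a partial update.
def classify_intent(state: ParseState) -> dict: ...
def extract_entities(state: ParseState) -> dict: ...
def merge_into_schema(state: ParseState) -> dict: ...
def validate_fields(state: ParseState) -> dict: ...
def build_clarification(state: ParseState) -> dict: ...
def build_canonical_request(state: ParseState) -> dict: ...

def route_after_validation(state: ParseState) -> str:
    # Routing is plain code over the validated schema state, not an LLM decision.
    return "clarify" if state["missing_fields"] else "complete"

graph = StateGraph(ParseState)
graph.add_node("classify_intent", classify_intent)
graph.add_node("extract_entities", extract_entities)
graph.add_node("merge_into_schema", merge_into_schema)
graph.add_node("validate_fields", validate_fields)
graph.add_node("clarify", build_clarification)
graph.add_node("complete", build_canonical_request)

graph.add_edge(START, "classify_intent")
graph.add_edge("classify_intent", "extract_entities")
graph.add_edge("extract_entities", "merge_into_schema")
graph.add_edge("merge_into_schema", "validate_fields")
graph.add_conditional_edges("validate_fields", route_after_validation,
                            {"clarify": "clarify", "complete": "complete"})
graph.add_edge("clarify", END)
graph.add_edge("complete", END)

semantic_parser = graph.compile()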

LangGraph provides:

  • Explicit state transitions
  • Deterministic routing
  • Observable execution
  • Safe retries

Used this way, it becomes a powerful orchestration tool, not a conversational agent.

Confidence Gates Prevent Silent Failures

Structured workflows handle the process, but there's another critical safety mechanism we need: knowing when the LLM isn't confident about its extraction.

Even when outputs are structurally valid, they may not be reliable.

We require the semantic layer to emit a confidence score. If confidence falls below a threshold, execution is blocked and clarification is requested.

This simple rule eliminates an entire class of silent misinterpretations that are otherwise very hard to detect.

Example:

When a user says "Show me items similar to the bag", the LLM might extract:

{
  "intent": "recommend_similar",
  "confidence": 0.55,
  "entities": {
    "reference_product_id": "generic_bag_001",
    "confidence_scores": {
      "reference_product_id": 0.4
    }
  }
}

The overall confidence is low (0.55), and the entity confidence for reference_product_id is very low (0.4) because "the bag" is ambiguous. There might be hundreds of bags in the catalog.

Instead of proceeding with a potentially wrong guess, the API responds:

{
  "status": "needs_clarification",
  "reason": "low_confidence",
  "question": "I found multiple bags. Did you mean the blue backpack, the leather tote, or the travel duffel?",
  "confidence": 0.55
}

This prevents the system from silently executing the wrong recommendation and provides a better user experience.
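
In the orchestration layer, the gate itself is a few lines of code rather than a prompt instruction. A minimal sketch; the 0.7 and 0.6 thresholds are assumptions, not values prescribed here:

INTENT_CONFIDENCE_THRESHOLD = 0.7   # assumed thresholds; tune per intent in practice
ENTITY_CONFIDENCE_THRESHOLD = 0.6

def confidence_gate(parsed: dict):
    """Return a clarification response when confidence is too low, otherwise None."""
    if parsed["confidence"] < INTENT_CONFIDENCE_THRESHOLD:
        return {"status": "needs_clarification", "reason": "low_confidence",
                "confidence": parsed["confidence"]}
    scores = parsed["entities"].get("confidence_scores", {})
    weak = [field for field, score in scores.items() if score < ENTITY_CONFIDENCE_THRESHOLD]
    if weak:
        return {"status": "needs_clarification", "reason": "low_confidence",
                "low_confidence_fields": weak}
    return None   # confident enough; hand the canonical request to the execution API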

Lightweight Ontologies (Keep Them in Code)

Beyond confidence scoring, we need a way to normalize the variety of terms users might use into consistent canonical values.

We also introduced lightweight, code-level ontologies that define:

  • Allowed intents
  • Required entities per intent
  • Synonym-to-canonical mappings
  • Cross-field validation rules

These live in code and configuration, not in prompts.

LLMs propose values. Code enforces meaning.

Example:

Consider these user inputs that all mean the same thing:

  • "Show me cheaper options"
  • "Find budget-friendly alternatives"
  • "I want something more affordable"
  • "Give me lower-priced items"

The LLM might extract different values: "cheaper", "budget-friendly", "affordable", "lower-priced".

The ontology maps all of these to a canonical value:

PRICE_BIAS_SYNONYMS = {
    "cheaper": -0.7,
    "budget-friendly": -0.7,
    "affordable": -0.7,
    "lower-priced": -0.7,
    "expensive": 0.7,
    "premium": 0.7,
    "high-end": 0.7
}

When the LLM extracts "budget-friendly", the code normalizes it to -0.7 for the price_bias field.
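
The normalization step is then a lookup with a safe fallback, as sketched below; the clamp to [-1, 1] is an assumption consistent with the schema range, not a rule stated above.

def normalize_price_bias(raw_term: str, llm_value=None):
    """Map an extracted term onto the canonical price_bias value, preferring the ontology."""
    term = raw_term.strip().lower()
    if term in PRICE_BIAS_SYNONYMS:
        return PRICE_BIAS_SYNONYMS[term]
    if llm_value is not None:
        return max(-1.0, min(1.0, llm_value))   # clamp an LLM-proposed number into the schema range
    return None   # unknown term: leave unset and let schema validation ask for clarification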

Similarly, cross-field validation catches logical inconsistencies:

if entities["price_bias"] < -0.5 and entities["quality_bias"] > 0.5:
    return clarification("You want cheaper items with higher quality. This might be difficult. Should I prioritize price or quality?")

The LLM proposes. The ontology normalizes. The validation enforces business rules.

What About Latency?

A common concern with multi-step semantic parsing is performance.

In practice, we observed:

  • Intent classification: ~40 ms
  • Entity extraction: ~200 ms
  • Validation and routing: ~1 ms

Total overhead: ~250–300 ms.

For chat-driven user experiences, this is well within acceptable bounds and far cheaper than incorrect or inconsistent execution.

Key Takeaways

Let's bring it all together.

If you're building APIs that accept natural language in production:

  1. Do not make language your API contract
  2. Translate language into canonical structure
  3. Own schema completion server-side
  4. Use LLMs for discovery and extraction, not execution
  5. Treat safety and determinism as first-class requirements

Natural language is an input format. Structure is the contract.

Closing Thoughts

LLMs make it easy to build impressive demos. Building safe, reliable systems with them requires discipline.

By separating semantic interpretation from execution, and by using tools like Azure OpenAI and LangGraph thoughtfully, you can build natural language-driven APIs that scale, evolve, and behave predictably in production.

Hopefully, this architecture saves you a few painful iterations.

Updated Feb 03, 2026
Version 2.0