A Practitioner’s Guide to Troubleshooting, Fixes, and Hard‑Won Lessons
Introduction
As organizations deploy enterprise AI platforms on Azure, security requirements increasingly drive the adoption of private-first architectures. These typically combine:
- Private networking only
- Centralized firewalls or NVAs
- Hub-and-spoke virtual network architectures
- Private Endpoints for all PaaS services
While these patterns are well understood individually, their interaction often exposes hidden failure modes, particularly around DNS and name resolution.
During a recent production deployment of a private, enterprise-grade AI workload on Azure, several issues surfaced that initially appeared to be platform or service instability. Closer analysis revealed the real cause: gaps in network and DNS design.
This post shares a real-world technical walkthrough of the problem, root causes, resolution steps, and key lessons that now form a reusable blueprint for running AI workloads reliably in private Azure environments.
Problem Statement
The platform was deployed with the following characteristics:
- Hub-and-spoke network topology
- Custom DNS servers running in the hub
- Firewall / NVA enforcing strict egress controls
- AI, data, and platform services exposed through Private Endpoints
- Azure Container Apps using internal load balancer mode
- Centralized monitoring, secrets, and identity services
Despite successful infrastructure deployment, the environment exhibited non-deterministic production issues, including:
- Container Apps intermittently failing to start or scale
- AI platform endpoints becoming unreachable from workload subnets
- Authentication and secret access failures
- DNS resolution working in some environments but failing in others
- Terraform deployments stalling or failing unexpectedly
Because the symptoms varied across subnets and environments, the root cause was hard to pin down at first.
Root Cause Analysis
End-to-end isolation showed that the issue was not in the AI services, authentication, or application logic. The core problem was DNS resolution in a private Azure environment.
1. Custom DNS servers were not Azure-aware
The hub DNS servers correctly resolved:
- Corporate domains
- On‑premises records
However, they could not resolve Azure platform names or Private Endpoint FQDNs by default.
Azure relies on an internal recursive resolver (168.63.129.16) that must be explicitly integrated when using custom DNS.
2. Missing conditional forwarders for private DNS zones
Many Azure services depend on service-specific private DNS zones, such as:
- privatelink.cognitiveservices.azure.com
- privatelink.openai.azure.com
- privatelink.vaultcore.azure.net
- privatelink.search.windows.net
- privatelink.blob.core.windows.net
Without conditional forwarders pointing to Azure’s internal DNS, queries either:
- Failed silently, or
- Resolved to public endpoints that were blocked by firewall rules
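The two failure modes above can be told apart by looking at what a lookup actually returns. The following is a minimal sketch, not the tooling used in the deployment: `classify_addresses` and `resolve` are hypothetical helper names, and the sample addresses are made up. It relies only on the Python standard library.

```python
import ipaddress
import socket


def classify_addresses(addresses):
    """Classify resolved IP strings as 'failed', 'private', or 'public'."""
    if not addresses:
        # No answer at all (NXDOMAIN, timeout, refused, ...).
        return "failed"
    if all(ipaddress.ip_address(a).is_private for a in addresses):
        # Every answer is a private address: the Private Endpoint path is in use.
        return "private"
    # At least one public address: traffic would head for the public endpoint,
    # which a strict egress firewall is likely to block.
    return "public"


def resolve(fqdn):
    """Resolve fqdn with the system resolver; return [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(fqdn, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


# Made-up sample answers, used here instead of live lookups.
print(classify_addresses(["10.20.1.5"]))   # private
print(classify_addresses(["20.50.1.1"]))   # public
print(classify_addresses([]))              # failed
```

Running `classify_addresses(resolve(name))` from a VM in each workload subnet gives a quick view of which subnets see which failure mode.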
3. Container Apps internal DNS requirements were overlooked
When Azure Container Apps are deployed with:
internal_load_balancer_enabled = true
Azure does not automatically create supporting DNS records.
The environment generates:
- A default domain
- .internal subdomains for internal FQDNs
Without explicitly creating:
- A private DNS zone matching the default domain
- * and *.internal wildcard records, plus an @ apex record
internal service-to-service communication fails.
4. Private DNS zones were not consistently linked
Even when DNS zones existed, they were:
- Spread across multiple subscriptions
- Linked to some VNets but not others
- Missing links to DNS server VNets or shared services VNets
As a result, name resolution succeeded in one subnet and failed in another, depending on the lookup path.
Resolution
No application changes were required. Stability was achieved entirely through architectural corrections.
✅ Step 1: Make custom DNS Azure-aware
On all custom DNS servers (or NVAs acting as DNS proxies):
- Configure conditional forwarders for all Azure private DNS zones
- Forward those queries to: 168.63.129.16
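This IP is Azure’s internal recursive resolver and is mandatory for Private Endpoint resolution. It is reachable only from inside an Azure virtual network, and it answers based on the private DNS zones linked to the VNet the query comes from, so the forwarding DNS servers must run in a VNet that is linked to those zones.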
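As an illustration of Step 1, the sketch below renders conditional forwarder stanzas for the zones listed earlier. It assumes the custom DNS servers run BIND, which the deployment described here does not state; other DNS servers configure forwarders differently. The function names `forwarder_stanza` and `render_config` are hypothetical.

```python
AZURE_DNS_IP = "168.63.129.16"  # Azure's internal recursive resolver

# Zone names taken from the list in the root cause analysis.
PRIVATE_LINK_ZONES = [
    "privatelink.cognitiveservices.azure.com",
    "privatelink.openai.azure.com",
    "privatelink.vaultcore.azure.net",
    "privatelink.search.windows.net",
    "privatelink.blob.core.windows.net",
]


def forwarder_stanza(zone, forwarder=AZURE_DNS_IP):
    """Render one BIND-style forward zone (assumes BIND; adapt for other servers)."""
    return (
        f'zone "{zone}" {{\n'
        f"    type forward;\n"
        f"    forward only;\n"
        f"    forwarders {{ {forwarder}; }};\n"
        f"}};\n"
    )


def render_config(zones):
    """Render stanzas for all zones, de-duplicated and sorted for stable diffs."""
    return "\n".join(forwarder_stanza(z) for z in sorted(set(zones)))


print(render_config(PRIVATE_LINK_ZONES))
```

Generating the stanzas from one list keeps every DNS server in the hub on the same set of forwarders, which matters because a single server with a missing zone is enough to produce intermittent failures.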
✅ Step 2: Centralize and link private DNS zones
A centralized private DNS model was adopted:
- All private DNS zones hosted in a shared subscription
- Linked to:
  - Hub VNet
  - All spoke VNets
  - DNS server VNet
  - Any operational or virtual desktop VNets
This ensured consistent resolution regardless of workload location.
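Link coverage is easy to audit once the zone-to-VNet links are exported into a simple mapping. The following is a rough sketch under that assumption: `missing_links` is a hypothetical helper, and the VNet names are placeholders rather than names from the actual environment.

```python
def missing_links(required_vnets, zone_links):
    """Return {zone: [missing VNets]} for zones not linked to every required VNet."""
    required = set(required_vnets)
    gaps = {}
    for zone, linked in zone_links.items():
        missing = required - set(linked)
        if missing:
            gaps[zone] = sorted(missing)
    return gaps


# Placeholder inventory; in practice this would come from an export of the
# virtual network links on each private DNS zone.
required = ["vnet-hub", "vnet-spoke-ai", "vnet-dns"]
links = {
    "privatelink.openai.azure.com": ["vnet-hub", "vnet-spoke-ai", "vnet-dns"],
    "privatelink.vaultcore.azure.net": ["vnet-hub"],
}

for zone, vnets in missing_links(required, links).items():
    print(f"{zone} is not linked to: {', '.join(vnets)}")
```

An empty result means every zone is linked to every required VNet; anything else points at a subnet where resolution can fail while it works elsewhere.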
✅ Step 3: Explicitly handle Container Apps DNS
For Container Apps using internal ingress:
- Create a private DNS zone matching the environment’s default domain
- Add:
  - * wildcard record
  - @ apex record
  - *.internal wildcard record
- Point all records to the Container Apps Environment static IP
- Add a conditional forwarder for the default domain if using custom DNS
This step alone resolved multiple internal connectivity issues.
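The record set from Step 3 can be derived mechanically from two values: the environment's default domain and its static IP. The sketch below builds that plan as plain data; it does not call any Azure API. The function names, the example domain, and the example IP are all made up for illustration.

```python
RECORD_NAMES = ("*", "@", "*.internal")  # wildcard, apex, internal wildcard


def container_apps_dns_plan(default_domain, static_ip):
    """Describe the private DNS zone and records for an internal environment."""
    return {
        "zone": default_domain,
        "records": {name: static_ip for name in RECORD_NAMES},
    }


def record_fqdn(name, zone):
    """Expand a relative record name to the FQDN pattern it covers."""
    return zone if name == "@" else f"{name}.{zone}"


# Hypothetical values; read the real ones from the Container Apps Environment.
plan = container_apps_dns_plan(
    "example-env-1a2b3c4d.westeurope.azurecontainerapps.io", "10.30.0.10"
)
for name, ip in plan["records"].items():
    print(f"{record_fqdn(name, plan['zone'])} -> {ip}")
```

Keeping the plan as data makes it straightforward to feed the same values into whatever provisions the zone, for example Terraform variables, and into the conditional forwarder list on the custom DNS servers.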
✅ Step 4: Align routing, NSGs, and service tags
Firewall, NSG, and route table rules were aligned to:
- Allow DNS traffic (TCP/UDP 53)
- Allow Azure service tags such as:
  - AzureCloud
  - CognitiveServices
  - AzureActiveDirectory
  - Storage
  - AzureMonitor
- Ensure certain subnets (e.g., Container Apps, Application Gateway) retained direct internet access where required by Azure platform services
Key Learnings
1. DNS is a Tier‑0 dependency for AI platforms
Many AI “service issues” are DNS failures in disguise. DNS must be treated as foundational platform infrastructure.
2. Private Endpoints require Azure DNS integration
If you use:
- Custom DNS ✅
- Private Endpoints ✅
Then forwarding to 168.63.129.16 is non‑negotiable.
3. Container Apps internal ingress has hidden DNS requirements
Internal Container Apps environments will not function correctly without manually created DNS zones and .internal records.
4. Centralized DNS prevents environment drift
Decentralized or subscription-local DNS zones lead to fragile, inconsistent environments. Centralization improves reliability and operability.
5. Validate networking first, then the platform
Before escalating issues to service teams:
- Validate DNS resolution
- Verify routing
- Check Private Endpoint connectivity
In many cases, the perceived “platform issue” disappears.
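The order of those checks matters: a TCP test against a name that resolves to a public IP says nothing about the Private Endpoint. A minimal sketch of that ordering follows, assuming HTTPS on port 443 and the system resolver. `check_endpoint` is a hypothetical helper, and routing verification (effective routes, firewall logs) is outside what it covers.

```python
import ipaddress
import socket


def check_endpoint(fqdn, port=443, timeout=3.0,
                   resolver=socket.gethostbyname,
                   connector=socket.create_connection):
    """Check DNS first, then TCP reachability; return a short status string."""
    try:
        ip = resolver(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"dns-failed: {exc}"
    if not ipaddress.ip_address(ip).is_private:
        # Resolution bypassed the Private Endpoint; stop before testing TCP.
        return f"dns-public: {ip}"
    try:
        conn = connector((ip, port), timeout=timeout)
    except OSError as exc:  # refused, unreachable, or timed out
        return f"tcp-failed: {ip}:{port} ({exc})"
    conn.close()
    return f"ok: {ip}:{port}"
```

Calling `check_endpoint("<your-resource-fqdn>")` from a VM or container in each workload subnet separates DNS problems from routing or firewall problems before anything is escalated. The `resolver` and `connector` parameters exist only so the function can be exercised without network access.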
Quick Production Validation Checklist
Before go-live, always validate:
- ✅ Private FQDNs resolve to private IPs from all required VNets
- ✅ UDR/NSG rules allow required Azure service traffic
- ✅ Managed identities can access all dependent resources
- ✅ AI portal user workflows succeed (evaluations, agents, etc.)
- ✅ terraform plan shows only intended changes
Conclusion
Running private, enterprise-grade AI workloads on Azure is absolutely achievable, but it requires intentional DNS and networking design.
By:
- Making custom DNS Azure-aware
- Centralizing private DNS zones
- Explicitly handling Container Apps DNS
- Aligning routing and firewall rules
an unstable environment was transformed into a repeatable, production-ready platform pattern.
If you are building AI solutions on Azure with Private Endpoints and hub–spoke networking, getting DNS right early will save weeks of troubleshooting later.