Azure Architecture Blog
7 MIN READ

Building a Scalable IoT Platform for Facility Management with Azure Serverless Services

nishantmv · Microsoft
Apr 14, 2026

Facility management at scale requires real-time visibility into thousands of connected devices across geographically distributed sites. HVAC systems, generators, occupancy sensors, and independent monitoring devices each produce continuous telemetry streams that must be ingested, processed, stored, and acted upon — often within seconds.

The business required a platform that could handle multi-provider device telemetry ingestion, apply configurable business rules, send automated notifications, and provide operations teams with a modern portal for monitoring and control. The platform also had to support multi-tenant role-based access, template-driven device management, and historical data analytics.

This article walks through the architecture we built on Azure Functions, Azure Event Grid, Azure Cosmos DB, Azure Cache for Redis, and Azure Data Lake Storage Gen2. It covers the design choices we made, the patterns that worked well, and the critical security and resilience improvements identified through rigorous code review. The content is intended for cloud architects, IoT solution designers, and backend engineers who are building or evaluating event-driven IoT platforms on Azure.

The platform follows a microservices architecture with six independently deployable services, all built on Azure Functions v4 with TypeScript. Each service is containerized and deployed to Azure Container Apps.

┌──────────────────────────────────────────────────────────────────────────────┐
│                           Enterprise IoT Platform                            │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│ ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────────────┐   │
│ │  IoT Portal UI   │──>│  IoT Portal API  │──>│     Azure Cosmos DB      │   │
│ │  (React / Vite)  │   │ (Azure Functions)│   │  (Sites, Assets, Rules)  │   │
│ └──────────────────┘   └──────────────────┘   └──────────────────────────┘   │
│                              │                           ▲                   │
│ ┌──────────────────┐   ┌─────▼────────────┐              │                   │
│ │ IoT Profile API  │──>│   Redis Cache    │              │                   │
│ │      (RBAC)      │   └──────────────────┘              │                   │
│ └──────────────────┘                                     │                   │
│                                                          │                   │
│ ┌──────────────────┐   ┌──────────────────┐   ┌──────────┴───────────────┐   │
│ │  HTTP Ingestion  │──>│    Event Grid    │──>│   Telemetry Processor    │   │
│ │ (Multi-Provider) │   │   (CloudEvents)  │   │  (Standardize & Store)   │   │
│ └──────────────────┘   └──────────────────┘   └──────────┬───────────────┘   │
│                                                          │                   │
│ ┌──────────────────┐                                     │                   │
│ │   Rule Engine    │<─────────────────────────────────────┘                   │
│ │ (Notifications)  │                                                         │
│ └──────────────────┘                                                         │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

The main building blocks of the architecture are:

  • IoT Portal UI — React 18 frontend with Chakra UI, TanStack Query, and MSAL for Azure AD authentication
  • IoT Portal API — Main backend for asset management, locations, templates, rules, and telemetry queries
  • IoT Profile API — User profile management and RBAC with custom JWT token generation
  • HTTP Ingestion — Provider-agnostic telemetry ingestion endpoints
  • Telemetry Processor — Event-driven standardization, state management, and historical archival
  • Rule Engine — Configurable rule evaluation with automated notifications via email and SMS

Telemetry Ingestion Pipeline

The ingestion layer follows a two-stage pipeline pattern designed for provider independence and horizontal scalability.

Stage 1: HTTP Reception

Telemetry arrives via authenticated HTTP POST endpoints. Azure API Management (APIM) handles authentication using subscription keys, rate limiting, and request validation at the gateway level. Each provider (occupancy sensors, MQTT-based gensets, etc.) gets a dedicated route and controller, but all converge into a shared publishing pipeline.

Stage 2: Event Forwarding

Validated payloads are transformed into CloudEvents with the type Telemetry.Http.Ingested.{Provider} and published to Azure Event Grid. Event Grid routes these to Azure Storage Queues for downstream processing.
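As a rough sketch, the envelope construction can look like the following. The `toCloudEvent` helper and the `/ingestion/http` source URI are illustrative, not the platform's actual code; only the `Telemetry.Http.Ingested.{Provider}` event type comes from the design above.

```typescript
import { randomUUID } from "node:crypto";

// Minimal CloudEvents 1.0 envelope (JSON format, structured mode)
interface CloudEvent<T> {
  specversion: "1.0";
  id: string;
  type: string;
  source: string;
  time: string;
  datacontenttype: "application/json";
  data: T;
}

function toCloudEvent<T>(provider: string, payload: T): CloudEvent<T> {
  return {
    specversion: "1.0",
    id: randomUUID(),                             // unique per event for dedup and tracing
    type: `Telemetry.Http.Ingested.${provider}`,  // provider-specific event type
    source: "/ingestion/http",                    // logical origin of the event (assumed)
    time: new Date().toISOString(),
    datacontenttype: "application/json",
    data: payload,
  };
}
```

Because the provider name only appears in the event type, Event Grid subscriptions can filter per provider without the core pipeline knowing anything about individual vendors.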

Key design decision: New device providers are onboarded by adding a route, a controller, and an APIM configuration — no changes to the core pipeline are required. This keeps the ingestion layer truly extensible without risking regressions.

Telemetry Processing and Standardization

The Telemetry Processor consumes events from Storage Queues and performs four critical functions:

Stage | Function | Technology
----- | -------- | ----------
Standardize | Transform provider-specific payloads into a unified StandardizedEvent schema using JSONPath-based mappings | Capability templates + Redis-cached metadata
Update State | Upsert current asset state to Cosmos DB with monotonic update logic | Azure Cosmos DB (hierarchical partition keys)
Archive | Write historical telemetry as JSON files organized by date hierarchy | Azure Data Lake Storage Gen2
Trigger Rules | Publish trigger events for rule evaluation | Azure Event Grid → Storage Queues

The monotonic update logic ensures that only data with newer timestamps overwrites existing state — a critical safeguard against out-of-order event delivery in distributed systems.
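The guard itself is small. This sketch is illustrative (the `AssetState` shape and `shouldApply` name are assumptions, not the platform's code): an incoming reading only replaces stored state when its timestamp is strictly newer.

```typescript
interface AssetState {
  assetId: string;
  value: unknown;
  timestamp: string; // ISO-8601
}

function shouldApply(current: AssetState | undefined, incoming: AssetState): boolean {
  if (!current) return true; // first observation always wins
  // ISO-8601 timestamps compare correctly once parsed to epoch milliseconds
  return new Date(incoming.timestamp).getTime() > new Date(current.timestamp).getTime();
}
```

In Cosmos DB this check would typically be paired with optimistic concurrency (the document ETag) so that two processors racing on the same asset cannot both win.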

Template-Driven Device Management

Rather than hardcoding device capabilities, the platform uses a three-tier template hierarchy that decouples device definition from device instances.

Tier | Purpose | Examples
---- | ------- | --------
Capability Templates | Define individual data points (telemetry, commands, parameters) with data types, validation rules, and units | Temperature, Humidity, ON/OFF Command
Asset Templates | Define device types by combining capabilities with physical specifications | Sensor, Gateway, Genset Controller
Location Templates | Define physical spaces and specify required assets, with optional strict policy enforcement | Building, Floor, Room, Equipment Area

The template system supports seven data types (String, Integer, Double, Boolean, DateTime, JSON, Binary) and three validation rule types (value limits, allowed value lists, regex patterns). Templates use semantic versioning starting at 1.0.0, and deletion protection prevents removing templates that are referenced by higher-tier templates or instances.
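The three validation rule types can be sketched as follows; the rule shapes and the `isValid` helper are assumptions for illustration, not the platform's actual schema.

```typescript
// One variant per validation rule type named in the text:
// value limits, allowed value lists, and regex patterns.
type ValidationRule =
  | { kind: "limits"; min?: number; max?: number }
  | { kind: "allowedValues"; values: (string | number)[] }
  | { kind: "pattern"; regex: string };

function isValid(value: string | number, rule: ValidationRule): boolean {
  switch (rule.kind) {
    case "limits":
      return typeof value === "number"
        && (rule.min === undefined || value >= rule.min)
        && (rule.max === undefined || value <= rule.max);
    case "allowedValues":
      return rule.values.includes(value);
    case "pattern":
      return typeof value === "string" && new RegExp(rule.regex).test(value);
  }
}
```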

Why this matters: When onboarding a new device model, the operations team defines its capabilities through templates rather than writing code. The telemetry processor automatically uses these templates (via Redis-cached lookups with 5-minute TTL) to standardize incoming data.
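The standardization step can be illustrated like this. A real implementation would use a JSONPath library against the cached capability templates; the dotted-path lookup below is a simplified stand-in, and the `Mapping` shape is an assumption.

```typescript
type Mapping = { capability: string; path: string };

// Simplified stand-in for JSONPath: resolve a dotted path like "env.t"
function getByPath(obj: unknown, path: string): unknown {
  return path
    .split(".")
    .reduce((acc: any, key) => (acc == null ? undefined : acc[key]), obj);
}

// Map a provider-specific payload onto capability names from the template
function standardize(payload: object, mappings: Mapping[]): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const m of mappings) {
    const value = getByPath(payload, m.path);
    if (value !== undefined) out[m.capability] = value; // skip fields the payload lacks
  }
  return out;
}
```

The mappings live in the capability templates, so supporting a vendor that nests temperature under `env.t` instead of `temperature` is a template edit, not a code change.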

Rule Engine: Event-Driven Automation

The Rule Engine evaluates configurable business rules against device state and triggers automated actions. It uses a template-to-implementation pattern where reusable rule templates are instantiated as concrete implementations bound to specific assets or locations.

Rule Trigger Types

Trigger Type | Mechanism | Use Case
------------ | --------- | --------
State Change | Storage Queue from Telemetry Processor | "Alert when temperature exceeds 30°C"
Time-Based | CRON patterns evaluated every minute | "Check if equipment has been idle for 4 hours"
HTTP | Direct HTTP call (debug only) | Development and testing

Condition Evaluation

Conditions support comparison operators, temporal operators (olderThan, newerThan), composite logic (all/any with short-circuit evaluation), and history-based conditions (totalDuration, stateChangeCount).
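A recursive evaluator with short-circuiting `all`/`any` can be sketched as follows. The `Condition` shape and operator names other than `olderThan`, `all`, and `any` are illustrative assumptions, and history-based conditions are omitted for brevity.

```typescript
type Condition =
  | { op: "gt" | "lt" | "eq"; field: string; value: number }
  | { op: "olderThan"; field: string; ms: number }
  | { op: "all" | "any"; conditions: Condition[] };

function evaluate(state: Record<string, any>, c: Condition, now = Date.now()): boolean {
  switch (c.op) {
    case "gt": return state[c.field] > c.value;
    case "lt": return state[c.field] < c.value;
    case "eq": return state[c.field] === c.value;
    case "olderThan":
      // temporal operator: true when the field's timestamp is older than `ms`
      return now - new Date(state[c.field]).getTime() > c.ms;
    case "all":
      return c.conditions.every(x => evaluate(state, x, now)); // stops at first false
    case "any":
      return c.conditions.some(x => evaluate(state, x, now));  // stops at first true
  }
}
```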

Notification Actions

When rules trigger, the engine publishes notification events to Event Grid, which routes them through Storage Queues to the notification service. This service batches and sends email and SMS notifications via a third-party messaging API.

Rule Triggered → Event Grid → Storage Queue → Notification Service → Messaging Provider (Email/SMS)

Role-Based Access Control and Identity

Security is handled through a layered approach combining Microsoft Entra ID (Azure AD) with a custom IoT token system.

Authentication Flow

  1. User authenticates with Microsoft Entra ID
  2. APIM validates the AAD token at the gateway
  3. APIM calls the Profile API to generate a custom platform token — a JWT signed with RSA certificates from Azure Key Vault
  4. The platform token contains flattened user permissions with scope information
  5. APIM caches the token (60-second TTL) and forwards it as a custom header to backend APIs
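The "flattened user permissions" in step 4 can be sketched as below. The `Assignment` shape, the role map, and the `perm@scope:target` claim format are assumptions for illustration, not the platform's actual schema.

```typescript
interface Assignment {
  role: string;
  scope: "platform" | "site" | "client" | "siteAndClient";
  target: string; // specific id, or "GLOBAL"
}

type RoleMap = Record<string, string[]>; // role name -> atomic permissions

// Expand role assignments into a flat, deduplicated list of scoped claims
// that can be embedded directly in the platform JWT.
function flattenPermissions(assignments: Assignment[], roles: RoleMap): string[] {
  const claims = new Set<string>();
  for (const a of assignments) {
    for (const perm of roles[a.role] ?? []) {
      claims.add(`${perm}@${a.scope}:${a.target}`); // e.g. "asset.read@site:SITE-42"
    }
  }
  return Array.from(claims).sort();
}
```

Flattening at token-generation time means backend APIs never have to resolve roles themselves; they check a claim string against the request's scope and target.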

Four Scope Types

Scope | Description
----- | -----------
platform | Global administrative access
site | Access limited to specific sites
client | Access limited to specific clients
siteAndClient | Combined scope for fine-grained access

Each scope supports a GLOBAL target for unrestricted access within that scope level. The data model in Cosmos DB consists of three entities: Permissions (atomic access rights), Roles (permission groups), and User Profiles (role assignments with scoping).
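The GLOBAL target makes the enforcement check simple. This is an illustrative sketch (the `Grant` shape and `hasAccess` name are assumptions): a grant matches either the exact target or the GLOBAL wildcard within its scope level.

```typescript
interface Grant {
  scope: string;  // "platform" | "site" | "client" | "siteAndClient"
  target: string; // specific id, or "GLOBAL"
}

function hasAccess(grants: Grant[], scope: string, target: string): boolean {
  return grants.some(
    g => g.scope === scope && (g.target === "GLOBAL" || g.target === target)
  );
}
```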

Hardening: What We Found and Fixed

Building the platform is one thing. Making it production-ready is another. A comprehensive code review uncovered critical security and resilience issues across all services. Here are the most impactful findings and the patterns we applied to resolve them.

Race Conditions in Distributed Rule Processing

The problem: When multiple Azure Function instances process time-based triggers simultaneously, duplicate rule executions generate duplicate notifications.

The fix: Distributed locking using Azure Blob Storage leases. Each timer execution generates a unique lock key (timer-execute-${timestamp}), and only the instance that acquires the lock processes triggers.

const lockKey = `timer-execute-${myTimer.scheduleStatus.next}`;
const lock = await acquireDistributedLock(lockKey, 60000);
if (!lock) {
  context.log("Another instance is processing, skipping");
  return;
}
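In production the lock is backed by an Azure Blob Storage lease (a blob lease can only be held by one client at a time). As a local illustration of the acquire-once semantics only — not the platform's implementation — an in-memory stand-in for the lock store might look like:

```typescript
interface LockStore {
  tryAcquire(key: string, ttlMs: number, now?: number): boolean;
}

// In-memory stand-in for a blob-lease-backed lock store. The TTL mirrors the
// lease duration: an expired lease can be re-acquired by any instance.
class InMemoryLockStore implements LockStore {
  private leases = new Map<string, number>(); // key -> expiry (epoch ms)

  tryAcquire(key: string, ttlMs: number, now = Date.now()): boolean {
    const expiry = this.leases.get(key);
    if (expiry !== undefined && expiry > now) return false; // held by another instance
    this.leases.set(key, now + ttlMs);
    return true;
  }
}
```

Because the lock key embeds the timer's scheduled time, every scheduled tick gets its own lock, so one slow execution cannot block the next tick indefinitely.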

Circuit Breaker for External APIs

The problem: Direct HTTP calls to the third-party messaging API had no circuit breaker protection. When the external provider is down, every request waits 30+ seconds for a timeout, causing cascading failures.

The fix: A circuit breaker library wraps all external API calls. After 50% failure rate, the circuit opens and requests fail fast. Recovery is tested automatically after 60 seconds.

import CircuitBreaker from 'opossum';

this.emailCircuitBreaker = new CircuitBreaker(this.sendNotification.bind(this), {
  timeout: 30000,                // fail any call that takes longer than 30 s
  errorThresholdPercentage: 50,  // open the circuit at a 50% failure rate
  resetTimeout: 60000            // half-open and probe recovery after 60 s
});

Authentication Bypass Risk

The problem: A configuration flag could completely disable authentication checks. A misconfiguration in production would expose all data and operations.

The fix: Remove the bypass entirely for production. For development environments, tie the bypass to environment detection with multiple safeguards rather than a single boolean flag.

Weak Token Validation

The problem: The JWT validation used decodeToken() (base64 decode only) instead of cryptographic signature verification. Attackers could craft fake tokens with any claims.

The fix: Replace with verifyToken() using Azure AD public signing keys fetched from Microsoft's OpenID configuration endpoint, validating issuer, audience, expiration, and algorithm.
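The cryptographic part of the fix uses a JWT library with Azure AD's published signing keys. The claim checks that follow signature verification can be made concrete with this sketch (the `checkClaims` helper and `Claims` shape are illustrative, not the platform's code):

```typescript
interface Claims {
  iss: string; // issuer
  aud: string; // audience
  exp: number; // expiration (epoch seconds)
}

// What "validating issuer, audience, expiration, and algorithm" means in
// practice, applied after the signature itself has been verified.
function checkClaims(
  claims: Claims,
  alg: string,
  expected: { issuer: string; audience: string },
  nowSec = Math.floor(Date.now() / 1000)
): boolean {
  return alg === "RS256"                  // reject "none" and symmetric algorithms
    && claims.iss === expected.issuer     // token minted by our tenant
    && claims.aud === expected.audience   // token intended for this API
    && claims.exp > nowSec;               // token not expired
}
```

Pinning the algorithm is the step most often skipped; without it, a token signed with `alg: "none"` or an attacker-chosen key can pass a naive decode-only check.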

Additional Hardening

Issue | Severity | Resolution
----- | -------- | ----------
Missing rate limiting for notifications | High | Per-notification-group rate limiting with Redis counters and TTL
ReDoS vulnerability in multipart parsing | Medium | Non-backtracking regex with character classes and length limits
No request size limits | Medium | maxRequestBodySize configured in Azure Functions host.json
Array splitting without length limits | Medium | Explicit count limits (max 100) before processing comma-separated inputs
CORS wildcard in production | High | Replaced with explicit allowed origins configured per environment
Anonymous function authorization | Medium | Changed authLevel to "function" for defense in depth
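The per-notification-group rate limiter from the first row amounts to fixed-window counting. In production this is a Redis INCR with an EXPIRE on the key; the in-memory class below is only a stand-in to show the semantics, and its names are illustrative.

```typescript
// Fixed-window rate limiter: at most `limit` notifications per group per window.
class FixedWindowLimiter {
  private counts = new Map<string, { count: number; windowEnd: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(group: string, now = Date.now()): boolean {
    const entry = this.counts.get(group);
    if (!entry || entry.windowEnd <= now) {
      // new window: reset the counter (in Redis, the key's TTL does this)
      this.counts.set(group, { count: 1, windowEnd: now + this.windowMs });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}
```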

Technology Decisions and Trade-offs

Aspect | Choice | Rationale
------ | ------ | ---------
Compute | Azure Functions v4 (TypeScript) | Serverless auto-scaling, pay-per-execution, no infrastructure management
Messaging | Event Grid + Storage Queues (CloudEvents) | Reliable event delivery with at-least-once semantics and dead-letter support
Primary Store | Azure Cosmos DB (hierarchical partition keys) | Global distribution capability, sub-10ms reads, flexible schema
Historical Store | Azure Data Lake Storage Gen2 | Cost-effective long-term storage with analytics-ready file organization
Caching | Azure Cache for Redis | Read-through caching with explicit invalidation for templates and triggers
API Gateway | Azure API Management | Centralized auth, rate limiting, CORS, and subscription key management
Configuration | Azure App Configuration | Centralized config with feature flags across all services
Secrets | Azure Key Vault | RSA signing certificates, API keys, connection strings
Frontend | React 18 + Vite + Chakra UI | Fast development, modern DX, accessible component library
API Spec | TypeSpec → OpenAPI v3 | Type-safe API definitions with auto-generated OpenAPI specs

Trade-off worth noting: We chose Storage Queues over Service Bus for inter-service messaging. Storage Queues are simpler and cheaper, and sufficient for our throughput requirements. If ordering guarantees or sessions become necessary in the future, upgrading to Service Bus is a well-understood migration path.

Lessons Learned

  1. Provider isolation pays off early. By designing the ingestion pipeline to be provider-agnostic from day one, onboarding new device vendors became a configuration task rather than an engineering project.
  2. Template-driven data models reduce code changes. Instead of modifying code when device capabilities change, teams update templates. The telemetry processor handles transformation automatically.
  3. Distributed systems need distributed locking. Timer-based triggers in serverless functions will execute on multiple instances simultaneously. Without distributed locking, every timer trigger becomes a source of duplicate processing.
  4. Code review is a production readiness gate. Our review caught authentication bypass risks, ReDoS vulnerabilities, and missing circuit breakers — issues that would have caused outages or security incidents in production.
  5. Schema validation at the edge saves downstream trouble. Using Zod for input validation at HTTP boundaries prevents malformed data from propagating through the event pipeline and causing failures in downstream services.

Conclusion and Next Steps

This IoT platform demonstrates how Azure serverless services can be composed into a production-grade IoT solution that handles multi-provider telemetry ingestion, real-time rule evaluation, and automated notifications — all while maintaining fine-grained access control and template-driven extensibility.

The most important architectural takeaways are:

  1. Event-driven decoupling through Event Grid and Storage Queues creates natural scaling boundaries between services
  2. Template-driven device management eliminates code changes when onboarding new devices or capabilities
  3. Layered security combining Entra ID, APIM policies, custom platform tokens, and function-level auth provides defense in depth
  4. Resilience patterns (circuit breakers, distributed locking, rate limiting) are not optional for production IoT workloads

Recommended Next Steps

If you are building a similar IoT platform on Azure:

  1. Start with the ingestion and processing pipeline — this is the backbone of any IoT solution
  2. Design your template system before building device management — it will save significant rework later
  3. Implement distributed locking for any timer-triggered function from the beginning
  4. Add circuit breakers to every external API call before going to production
  5. Run a security-focused code review specifically looking for authentication bypass, injection, and DoS vectors

For deeper guidance, the Azure IoT reference architecture and the Azure Well-Architected Framework provide additional patterns and checklists that complement the approach described here.

Updated Apr 14, 2026
Version 1.0