Facility management at scale requires real-time visibility into thousands of connected devices across geographically distributed sites. HVAC systems, generators, occupancy sensors, and independent monitoring devices each produce continuous telemetry streams that must be ingested, processed, stored, and acted upon — often within seconds.

The business required a platform that could handle multi-provider device telemetry ingestion, apply configurable business rules, send automated notifications, and provide operations teams with a modern portal for monitoring and control. The platform also had to support multi-tenant role-based access, template-driven device management, and historical data analytics.

This article walks through the architecture we built on Azure Functions, Azure Event Grid, Azure Cosmos DB, Azure Redis Cache, and Azure Data Lake Storage Gen2. It covers design choices we made, patterns that worked well, and critical security and resilience improvements identified through rigorous code review. This content is intended for cloud architects, IoT solution designers, and backend engineers who are building or evaluating event-driven IoT platforms on Azure.
The platform follows a microservices architecture with six independently deployable services, all built on Azure Functions v4 with TypeScript. Each service is containerized and deployed to Azure Container Apps.
```
┌────────────────────────────────────────────────────────────────────────────┐
│                          Enterprise IoT Platform                           │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│ ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────────────┐ │
│ │  IoT Portal UI   │──>│  IoT Portal API  │──>│     Azure Cosmos DB      │ │
│ │  (React / Vite)  │   │ (Azure Functions)│   │  (Sites, Assets, Rules)  │ │
│ └──────────────────┘   └──────────────────┘   └──────────────────────────┘ │
│                              │                          ▲                  │
│ ┌──────────────────┐   ┌─────▼────────────┐             │                  │
│ │ IoT Profile API  │──>│   Redis Cache    │             │                  │
│ │      (RBAC)      │   └──────────────────┘             │                  │
│ └──────────────────┘                                    │                  │
│                                                         │                  │
│ ┌──────────────────┐   ┌──────────────────┐   ┌─────────┴────────────────┐ │
│ │  HTTP Ingestion  │──>│    Event Grid    │──>│   Telemetry Processor    │ │
│ │ (Multi-Provider) │   │   (CloudEvents)  │   │  (Standardize & Store)   │ │
│ └──────────────────┘   └──────────────────┘   └─────────┬────────────────┘ │
│                                                         │                  │
│ ┌──────────────────┐                                    │                  │
│ │   Rule Engine    │<───────────────────────────────────┘                  │
│ │ (Notifications)  │                                                       │
│ └──────────────────┘                                                       │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```
The main building blocks of the architecture are:
- IoT Portal UI — React 18 frontend with Chakra UI, TanStack Query, and MSAL for Azure AD authentication
- IoT Portal API — Main backend for asset management, locations, templates, rules, and telemetry queries
- IoT Profile API — User profile management and RBAC with custom JWT token generation
- HTTP Ingestion — Provider-agnostic telemetry ingestion endpoints
- Telemetry Processor — Event-driven standardization, state management, and historical archival
- Rule Engine — Configurable rule evaluation with automated notifications via email and SMS
Telemetry Ingestion Pipeline
The ingestion layer follows a two-stage pipeline pattern designed for provider independence and horizontal scalability.
Stage 1: HTTP Reception
Telemetry arrives via authenticated HTTP POST endpoints. Azure API Management (APIM) handles authentication using subscription keys, rate limiting, and request validation at the gateway level. Each provider (occupancy sensors, MQTT-based gensets, etc.) gets a dedicated route and controller, but all converge into a shared publishing pipeline.
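The convergence of per-provider routes into one shared publishing step can be sketched as a small registry. This is an illustrative sketch, not the platform's actual code — the names (`registerProvider`, `ingest`) and payload shapes are assumptions:

```typescript
// Illustrative sketch of provider-route convergence: each provider
// registers a normalizer, and every route funnels into one shared step.
type RawPayload = Record<string, unknown>;
type NormalizedTelemetry = { provider: string; deviceId: string; body: RawPayload };

const providers = new Map<string, (raw: RawPayload) => NormalizedTelemetry>();

function registerProvider(
  name: string,
  normalize: (raw: RawPayload) => NormalizedTelemetry
): void {
  providers.set(name, normalize);
}

// Shared pipeline: in the real platform, the normalized result is
// published onward as a CloudEvent (Stage 2).
function ingest(provider: string, raw: RawPayload): NormalizedTelemetry {
  const normalize = providers.get(provider);
  if (!normalize) throw new Error(`Unknown provider: ${provider}`);
  return normalize(raw);
}

// Onboarding a new provider is a registration call, not a core change.
registerProvider("occupancy", (raw) => ({
  provider: "occupancy",
  deviceId: String(raw.sensorId),
  body: raw,
}));
```

The registry is what keeps the ingestion layer open for extension: new providers plug in at the edges while the shared path stays untouched.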
Stage 2: Event Forwarding
Validated payloads are transformed into CloudEvents with the type Telemetry.Http.Ingested.{Provider} and published to Azure Event Grid. Event Grid routes these to Azure Storage Queues for downstream processing.
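A minimal sketch of that envelope, following the CloudEvents 1.0 field names; the `source` URI and the `toCloudEvent` helper are illustrative assumptions, not the platform's actual code:

```typescript
import { randomUUID } from "node:crypto";

// CloudEvents 1.0 envelope with the Telemetry.Http.Ingested.{Provider} type.
interface CloudEvent<T> {
  specversion: "1.0";
  id: string;
  source: string;
  type: string;
  time: string;
  datacontenttype: "application/json";
  data: T;
}

function toCloudEvent<T>(provider: string, payload: T): CloudEvent<T> {
  return {
    specversion: "1.0",
    id: randomUUID(),                              // unique per event
    source: "/ingestion/http",                     // illustrative source URI
    type: `Telemetry.Http.Ingested.${provider}`,   // provider-specific event type
    time: new Date().toISOString(),
    datacontenttype: "application/json",
    data: payload,
  };
}
```

Encoding the provider in the event type lets Event Grid subscriptions filter per provider without inspecting payloads.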
Key design decision: New device providers are onboarded by adding a route, a controller, and an APIM configuration — no changes to the core pipeline are required. This keeps the ingestion layer truly extensible without risking regressions.
Telemetry Processing and Standardization
The Telemetry Processor consumes events from Storage Queues and performs four critical functions:
| Stage | Function | Technology |
|---|---|---|
| Standardize | Transform provider-specific payloads into unified StandardizedEvent schema using JSONPath-based mappings | Capability templates + Redis-cached metadata |
| Update State | Upsert current asset state to Cosmos DB with monotonic update logic | Azure Cosmos DB (hierarchical partition keys) |
| Archive | Write historical telemetry as JSON files organized by date hierarchy | Azure Data Lake Storage Gen2 |
| Trigger Rules | Publish trigger events for rule evaluation | Azure Event Grid → Storage Queues |
The monotonic update logic ensures that only data with newer timestamps overwrites existing state — a critical safeguard against out-of-order event delivery in distributed systems.
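The guard itself is small. A sketch, assuming an ISO-8601 timestamp on each state document (the `AssetState` shape is illustrative):

```typescript
// Monotonic update: an incoming reading only replaces stored state if its
// timestamp is strictly newer; out-of-order or duplicate events are ignored.
interface AssetState {
  value: unknown;
  observedAt: string; // ISO-8601 timestamp of the reading
}

function applyMonotonic(
  current: AssetState | undefined,
  incoming: AssetState
): AssetState {
  if (!current) return incoming; // first reading for this asset
  return Date.parse(incoming.observedAt) > Date.parse(current.observedAt)
    ? incoming   // strictly newer reading wins
    : current;   // stale or out-of-order event is dropped
}
```

In the real upsert this comparison runs against the Cosmos DB document before writing, so a delayed queue message can never roll state backwards.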
Template-Driven Device Management
Rather than hardcoding device capabilities, the platform uses a three-tier template hierarchy that decouples device definition from device instances.
| Tier | Purpose | Examples |
|---|---|---|
| Capability Templates | Define individual data points (telemetry, commands, parameters) with data types, validation rules, and units | Temperature, Humidity, ON/OFF Command |
| Asset Templates | Define device types by combining capabilities with physical specifications | Sensor, Gateway, Genset Controller |
| Location Templates | Define physical spaces and specify required assets with optional strict policy enforcement | Building, Floor, Room, Equipment Area |
The template system supports seven data types (String, Integer, Double, Boolean, DateTime, JSON, Binary) and three validation rule types (value limits, allowed value lists, regex patterns). Templates use semantic versioning starting at 1.0.0, and deletion protection prevents removing templates that are referenced by higher-tier templates or instances.
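An evaluator for the three validation rule types can be sketched as follows; the `ValidationRule` shape is an assumption for illustration, not the platform's actual template schema:

```typescript
// The three validation rule types: value limits, allowed value lists,
// and regex patterns, as a discriminated union.
type ValidationRule =
  | { kind: "limits"; min?: number; max?: number }
  | { kind: "allowed"; values: Array<string | number> }
  | { kind: "regex"; pattern: string };

function validate(value: string | number, rule: ValidationRule): boolean {
  switch (rule.kind) {
    case "limits": {
      const n = Number(value);
      if (Number.isNaN(n)) return false;
      if (rule.min !== undefined && n < rule.min) return false;
      if (rule.max !== undefined && n > rule.max) return false;
      return true;
    }
    case "allowed":
      return rule.values.includes(value); // e.g. ON/OFF command values
    case "regex":
      return new RegExp(rule.pattern).test(String(value));
  }
}
```

Because rules are data rather than code, the same evaluator serves every capability template, and new constraints ship as template updates.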
Why this matters: When onboarding a new device model, the operations team defines its capabilities through templates rather than writing code. The telemetry processor automatically uses these templates (via Redis-cached lookups with 5-minute TTL) to standardize incoming data.
Rule Engine: Event-Driven Automation
The Rule Engine evaluates configurable business rules against device state and triggers automated actions. It uses a template-to-implementation pattern where reusable rule templates are instantiated as concrete implementations bound to specific assets or locations.
Rule Trigger Types
| Trigger Type | Mechanism | Use Case |
|---|---|---|
| State Change | Storage Queue from Telemetry Processor | "Alert when temperature exceeds 30°C" |
| Time-Based | CRON patterns evaluated every minute | "Check if equipment has been idle for 4 hours" |
| HTTP | Direct HTTP call (debug only) | Development and testing |
Condition Evaluation
Conditions support comparison operators, temporal operators (olderThan, newerThan), composite logic (all/any with short-circuit evaluation), and history-based conditions (totalDuration, stateChangeCount).
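These operators compose naturally as a recursive evaluator. A sketch under assumed shapes (the `Condition` union below is illustrative; `all`/`any` short-circuit via `Array.prototype.every`/`.some`):

```typescript
// Recursive condition evaluation: comparisons, a temporal operator, and
// composite all/any logic with short-circuiting.
type Condition =
  | { op: "eq" | "gt" | "lt"; path: string; value: number }
  | { op: "olderThan"; path: string; ms: number }
  | { op: "all" | "any"; conditions: Condition[] };

type State = Record<string, number | string>;

function evaluate(c: Condition, state: State, now = Date.now()): boolean {
  switch (c.op) {
    case "eq": return state[c.path] === c.value;
    case "gt": return Number(state[c.path]) > c.value;
    case "lt": return Number(state[c.path]) < c.value;
    case "olderThan": // true when the timestamp at `path` is older than `ms`
      return now - Date.parse(String(state[c.path])) > c.ms;
    case "all": return c.conditions.every((sub) => evaluate(sub, state, now));
    case "any": return c.conditions.some((sub) => evaluate(sub, state, now));
  }
}
```

Passing `now` explicitly keeps temporal conditions deterministic, which matters for both testing and replaying queued trigger events.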
Notification Actions
When rules trigger, the engine publishes notification events to Event Grid, which routes them through Storage Queues to the notification service. This service batches and sends email and SMS notifications via a third-party messaging API.
Rule Triggered → Event Grid → Storage Queue → Notification Service → Messaging Provider (Email/SMS)
Role-Based Access Control and Identity
Security is handled through a layered approach combining Microsoft Entra ID (Azure AD) with a custom IoT token system.
Authentication Flow
- User authenticates with Microsoft Entra ID
- APIM validates the AAD token at the gateway
- APIM calls the Profile API to generate a custom platform token — a JWT signed with RSA certificates from Azure Key Vault
- The platform token contains flattened user permissions with scope information
- APIM caches the token (60-second TTL) and forwards it as a custom header to backend APIs
Four Scope Types
| Scope | Description |
|---|---|
| platform | Global administrative access |
| site | Access limited to specific sites |
| client | Access limited to specific clients |
| siteAndClient | Combined scope for fine-grained access |
Each scope supports a GLOBAL target for unrestricted access within that scope level. The data model in Cosmos DB consists of three entities: Permissions (atomic access rights), Roles (permission groups), and User Profiles (role assignments with scoping).
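A scope-aware permission check over the flattened claims might look like the following; the claim shape and the `"GLOBAL"` sentinel usage are assumptions based on the description above, not the platform's exact token format:

```typescript
// Flattened permission claim as carried in the platform token.
interface PermissionClaim {
  permission: string;                                  // e.g. "assets:read"
  scope: "platform" | "site" | "client" | "siteAndClient";
  siteId?: string;   // "GLOBAL" grants all sites at this scope level
  clientId?: string; // "GLOBAL" grants all clients at this scope level
}

function canAccess(
  claims: PermissionClaim[],
  permission: string,
  siteId: string,
  clientId: string
): boolean {
  return claims.some((c) => {
    if (c.permission !== permission) return false;
    switch (c.scope) {
      case "platform":
        return true; // global administrative access
      case "site":
        return c.siteId === "GLOBAL" || c.siteId === siteId;
      case "client":
        return c.clientId === "GLOBAL" || c.clientId === clientId;
      case "siteAndClient":
        return (c.siteId === "GLOBAL" || c.siteId === siteId)
            && (c.clientId === "GLOBAL" || c.clientId === clientId);
    }
  });
}
```

Flattening roles into claims at token-generation time means backend APIs make this check locally, without a round trip to the Profile API per request.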
Hardening: What We Found and Fixed
Building the platform is one thing. Making it production-ready is another. A comprehensive code review uncovered critical security and resilience issues across all services. Here are the most impactful findings and the patterns we applied to resolve them.
Race Conditions in Distributed Rule Processing
The problem: When multiple Azure Function instances process time-based triggers simultaneously, duplicate rule executions generate duplicate notifications.
The fix: Distributed locking using Azure Blob Storage leases. Each timer execution generates a unique lock key (timer-execute-${timestamp}), and only the instance that acquires the lock processes triggers.
```typescript
const lockKey = `timer-execute-${myTimer.scheduleStatus.next}`;
const lock = await acquireDistributedLock(lockKey, 60000);
if (!lock) {
  context.log("Another instance is processing, skipping");
  return;
}
```
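The lease semantics underneath can be illustrated with an in-memory store standing in for Azure Blob Storage leases — in production the acquire step is a blob lease, which atomically fails if another holder exists. The `LeaseStore` class and its method names are illustrative:

```typescript
// In-memory stand-in for blob-lease locking: acquire succeeds only if no
// unexpired lease exists for the key; leases auto-expire so a crashed
// instance cannot hold the lock forever.
interface Lease { holder: string; expiresAt: number }

class LeaseStore {
  private leases = new Map<string, Lease>();
  constructor(private now: () => number = Date.now) {}

  // Returns the holder id on success, or null if the lock is already held.
  acquire(key: string, holder: string, durationMs: number): string | null {
    const existing = this.leases.get(key);
    if (existing && existing.expiresAt > this.now()) return null; // held elsewhere
    this.leases.set(key, { holder, expiresAt: this.now() + durationMs });
    return holder;
  }

  // Only the current holder may release early.
  release(key: string, holder: string): void {
    if (this.leases.get(key)?.holder === holder) this.leases.delete(key);
  }
}
```

Deriving the key from the timer's schedule timestamp is what makes the lock per-execution: every instance woken for the same tick competes for the same key, and exactly one wins.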
Circuit Breaker for External APIs
The problem: Direct HTTP calls to the third-party messaging API without circuit breaker protection. When the external provider is down, every request waits 30+ seconds for timeout, causing cascading failures.
The fix: A circuit breaker library wraps all external API calls. After 50% failure rate, the circuit opens and requests fail fast. Recovery is tested automatically after 60 seconds.
```typescript
import CircuitBreaker from 'opossum';

// Wrap the outbound messaging call; once the failure rate crosses the
// threshold, the circuit opens and callers fail fast instead of waiting
// on timeouts.
this.emailCircuitBreaker = new CircuitBreaker(this.sendNotification.bind(this), {
  timeout: 30000,               // give up on a single call after 30 s
  errorThresholdPercentage: 50, // open the circuit at a 50% failure rate
  resetTimeout: 60000           // probe for recovery after 60 s
});
```
Authentication Bypass Risk
The problem: A configuration flag could completely disable authentication checks. A misconfiguration in production would expose all data and operations.
The fix: Remove the bypass entirely for production. For development environments, tie the bypass to environment detection with multiple safeguards rather than a single boolean flag.
Weak Token Validation
The problem: The JWT validation used decodeToken() (base64 decode only) instead of cryptographic signature verification. Attackers could craft fake tokens with any claims.
The fix: Replace with verifyToken() using Azure AD public signing keys fetched from Microsoft's OpenID configuration endpoint, validating issuer, audience, expiration, and algorithm.
Additional Hardening
| Issue | Severity | Resolution |
|---|---|---|
| Missing rate limiting for notifications | High | Per-notification-group rate limiting with Redis counters and TTL |
| ReDoS vulnerability in multipart parsing | Medium | Non-backtracking regex with character classes and length limits |
| No request size limits | Medium | maxRequestBodySize configured in Azure Functions host.json |
| Array splitting without length limits | Medium | Explicit count limits (max 100) before processing comma-separated inputs |
| CORS wildcard in production | High | Replaced with explicit allowed origins configured per environment |
| Anonymous function authorization | Medium | Changed authLevel to "function" for defense in depth |
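The per-notification-group rate limiting from the table above follows a fixed-window counter pattern. A sketch with an in-memory counter standing in for the Redis counter-plus-TTL (the class, limits, and group names are illustrative):

```typescript
// Fixed-window rate limiter per notification group: the first increment in
// a window also (re)sets the window's expiry, mirroring INCR + TTL in Redis.
class GroupRateLimiter {
  private counters = new Map<string, { count: number; windowEnd: number }>();
  constructor(
    private limit: number,
    private windowMs: number,
    private now: () => number = Date.now
  ) {}

  // True if this notification may be sent; false if the group is over limit.
  tryAcquire(group: string): boolean {
    const t = this.now();
    const entry = this.counters.get(group);
    if (!entry || entry.windowEnd <= t) {
      this.counters.set(group, { count: 1, windowEnd: t + this.windowMs });
      return true; // new window opened
    }
    if (entry.count >= this.limit) return false; // over the limit: drop or defer
    entry.count++;
    return true;
  }
}
```

Scoping the counter to the notification group rather than the whole service means one noisy rule cannot starve alerts for unrelated sites.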
Technology Decisions and Trade-offs
| Aspect | Choice | Rationale |
|---|---|---|
| Compute | Azure Functions v4 (TypeScript) | Serverless auto-scaling, pay-per-execution, no infrastructure management |
| Messaging | Event Grid + Storage Queues (CloudEvents) | Reliable event delivery with at-least-once semantics and dead-letter support |
| Primary Store | Azure Cosmos DB (hierarchical partition keys) | Global distribution capability, sub-10ms reads, flexible schema |
| Historical Store | Azure Data Lake Storage Gen2 | Cost-effective long-term storage with analytics-ready file organization |
| Caching | Azure Redis Cache | Read-through caching with explicit invalidation for templates and triggers |
| API Gateway | Azure API Management | Centralized auth, rate limiting, CORS, and subscription key management |
| Configuration | Azure App Configuration | Centralized config with feature flags across all services |
| Secrets | Azure Key Vault | RSA signing certificates, API keys, connection strings |
| Frontend | React 18 + Vite + Chakra UI | Fast development, modern DX, accessible component library |
| API Spec | TypeSpec → OpenAPI v3 | Type-safe API definitions with auto-generated OpenAPI specs |
Trade-off worth noting: We chose Storage Queues over Service Bus for inter-service messaging. Storage Queues are simpler and cheaper, and sufficient for our throughput requirements. If ordering guarantees or sessions become necessary in the future, upgrading to Service Bus is a well-understood migration path.
Lessons Learned
- Provider isolation pays off early. By designing the ingestion pipeline to be provider-agnostic from day one, onboarding new device vendors became a configuration task rather than an engineering project.
- Template-driven data models reduce code changes. Instead of modifying code when device capabilities change, teams update templates. The telemetry processor handles transformation automatically.
- Distributed systems need distributed locking. Timer-based triggers in serverless functions will execute on multiple instances simultaneously. Without distributed locking, every timer trigger becomes a source of duplicate processing.
- Code review is a production readiness gate. Our review caught authentication bypass risks, ReDoS vulnerabilities, and missing circuit breakers — issues that would have caused outages or security incidents in production.
- Schema validation at the edge saves downstream trouble. Using Zod for input validation at HTTP boundaries prevents malformed data from propagating through the event pipeline and causing failures in downstream services.
Conclusion and Next Steps
This IoT platform demonstrates how Azure serverless services can be composed into a production-grade IoT solution that handles multi-provider telemetry ingestion, real-time rule evaluation, and automated notifications — all while maintaining fine-grained access control and template-driven extensibility.
The most important architectural takeaways are:
- Event-driven decoupling through Event Grid and Storage Queues creates natural scaling boundaries between services
- Template-driven device management eliminates code changes when onboarding new devices or capabilities
- Layered security combining Entra ID, APIM policies, custom platform tokens, and function-level auth provides defense in depth
- Resilience patterns (circuit breakers, distributed locking, rate limiting) are not optional for production IoT workloads
Recommended Next Steps
If you are building a similar IoT platform on Azure:
- Start with the ingestion and processing pipeline — this is the backbone of any IoT solution
- Design your template system before building device management — it will save significant rework later
- Implement distributed locking for any timer-triggered function from the beginning
- Add circuit breakers to every external API call before going to production
- Run a security-focused code review specifically looking for authentication bypass, injection, and DoS vectors
For deeper guidance, the Azure IoT reference architecture and the Azure Well-Architected Framework provide additional patterns and checklists that complement the approach described here.