Facility management at scale requires real-time visibility into thousands of connected devices across geographically distributed sites. HVAC systems, generators, occupancy sensors, and independent monitoring devices each produce continuous telemetry streams that must be ingested, processed, stored, and acted upon — often within seconds.
Business Requirements
The business required a platform that could:
- Handle multi-provider device telemetry ingestion
- Apply configurable business rules in real time
- Send automated notifications via email and SMS
- Provide operations teams with a modern portal for monitoring and control
- Support multi-tenant role-based access
- Enable template-driven device management
- Provide historical data analytics capabilities
Our Solution
This article walks through the architecture we built on Azure Functions, Azure Event Grid, Azure Cosmos DB, Azure Cache for Redis, and Azure Data Lake Storage Gen2. It covers design choices we made, patterns that worked well, and critical security and resilience improvements identified through rigorous code review.
Architecture Overview
The platform follows a microservices architecture with six independently deployable services, all built on Azure Functions v4 with TypeScript. Each service is containerized and deployed to Azure Container Apps.
Main Building Blocks
- IoT Portal UI: React 18 frontend with Chakra UI, TanStack Query, and MSAL for Azure AD authentication
- IoT Portal API: Main backend for asset management, locations, templates, rules, and telemetry queries
- IoT Profile API: User profile management and RBAC with custom JWT token generation
- HTTP Ingestion: Provider-agnostic telemetry ingestion endpoints
- Telemetry Processor: Event-driven standardization, state management, and historical archival
- Rule Engine: Configurable rule evaluation with automated notifications via email and SMS
Telemetry Ingestion Pipeline
The ingestion layer follows a two-stage pipeline pattern designed for provider independence and horizontal scalability.
Stage 1: HTTP Reception
Telemetry arrives via authenticated HTTP POST endpoints. Azure API Management (APIM) handles authentication using subscription keys, rate limiting, and request validation at the gateway level. Each provider (occupancy sensors, MQTT-based gensets, and others) gets a dedicated route and controller, but all converge into a shared publishing pipeline.
Stage 2: Event Forwarding
Validated payloads are transformed into CloudEvents with the type Telemetry.Http.Ingested.{Provider} and published to Azure Event Grid. Event Grid routes these to Azure Storage Queues for downstream processing.
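A minimal sketch of this envelope step is shown below. The field values (such as `source`) and the function name are illustrative, and the actual publish call to Event Grid is omitted:

```typescript
// Wrap a validated provider payload in a CloudEvents v1.0 envelope whose
// type follows the Telemetry.Http.Ingested.{Provider} convention.
import { randomUUID } from "crypto";

interface CloudEvent<T> {
  specversion: "1.0";
  id: string;
  source: string;
  type: string;
  time: string;
  datacontenttype: "application/json";
  data: T;
}

function toIngestedEvent<T>(provider: string, payload: T): CloudEvent<T> {
  return {
    specversion: "1.0",
    id: randomUUID(),                  // unique per event
    source: "/ingestion/http",         // illustrative source URI
    type: `Telemetry.Http.Ingested.${provider}`,
    time: new Date().toISOString(),
    datacontenttype: "application/json",
    data: payload,
  };
}
```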
Key design decision: New device providers are onboarded by adding a route, a controller, and an APIM configuration with no changes to the core pipeline. This keeps the ingestion layer extensible without risking regressions.
Telemetry Processing and Standardization
The Telemetry Processor consumes events from Storage Queues and performs four critical functions:
| Stage | Function | Technology |
|---|---|---|
| Standardize | Transform provider-specific payloads into unified StandardizedEvent schema using JSONPath-based mappings | Capability templates + Redis-cached metadata |
| Update State | Upsert current asset state to Cosmos DB with monotonic update logic | Azure Cosmos DB (hierarchical partition keys) |
| Archive | Write historical telemetry as JSON files organized by date hierarchy | Azure Data Lake Storage Gen2 |
| Trigger Rules | Publish trigger events for rule evaluation | Azure Event Grid to Storage Queues |
The monotonic update logic ensures that only data with newer timestamps overwrites existing state, a critical safeguard against out-of-order event delivery in distributed systems.
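The monotonic check can be sketched as a pure function (shapes are illustrative; the real implementation upserts to Cosmos DB):

```typescript
// Apply an incoming state update only if its event timestamp is strictly
// newer than the stored one; stale or duplicate events are discarded.
interface AssetState {
  assetId: string;
  values: Record<string, unknown>;
  eventTimestamp: string; // ISO 8601
}

function applyIfNewer(
  current: AssetState | undefined,
  incoming: AssetState
): AssetState {
  if (
    current &&
    Date.parse(incoming.eventTimestamp) <= Date.parse(current.eventTimestamp)
  ) {
    return current; // out-of-order or duplicate event: keep existing state
  }
  return incoming;
}
```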
Template-Driven Device Management
Rather than hardcoding device capabilities, the platform uses a three-tier template hierarchy that decouples device definition from device instances.
| Tier | Purpose | Examples |
|---|---|---|
| Capability Templates | Define individual data points (telemetry, commands, parameters) with data types, validation rules, and units | Temperature, Humidity, ON/OFF Command |
| Asset Templates | Define device types by combining capabilities with physical specifications | Sensor, Gateway, Genset Controller |
| Location Templates | Define physical spaces and specify required assets with optional strict policy enforcement | Building, Floor, Room, Equipment Area |
The template system supports seven data types (String, Integer, Double, Boolean, DateTime, JSON, Binary) and three validation rule types (value limits, allowed value lists, regex patterns). Templates use semantic versioning starting at 1.0.0, and deletion protection prevents removing templates that are referenced by higher-tier templates or instances.
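The three validation rule types can be sketched as follows (the rule shapes are illustrative, not the platform's actual template schema):

```typescript
// Evaluate a capability value against value limits, an allowed-value list,
// or a regex pattern; all rules must pass.
type ValidationRule =
  | { kind: "limits"; min?: number; max?: number }
  | { kind: "allowedValues"; values: (string | number)[] }
  | { kind: "pattern"; regex: string };

function validate(value: string | number, rules: ValidationRule[]): boolean {
  return rules.every((rule) => {
    switch (rule.kind) {
      case "limits": // numeric range check
        return (
          typeof value === "number" &&
          (rule.min === undefined || value >= rule.min) &&
          (rule.max === undefined || value <= rule.max)
        );
      case "allowedValues": // enumeration check
        return rule.values.includes(value);
      case "pattern": // regex check for string values
        return typeof value === "string" && new RegExp(rule.regex).test(value);
    }
  });
}
```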
Why this matters: When onboarding a new device model, the operations team defines capabilities through templates rather than writing code. The telemetry processor automatically uses these templates (via Redis-cached lookups with a 5-minute TTL) to standardize incoming data.
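The read-through lookup can be sketched like this, using an in-memory map as a stand-in for Redis so the example stays self-contained (names are illustrative):

```typescript
// Read-through template cache with a 5-minute TTL: serve from cache when
// fresh, otherwise load from the source of truth and cache the result.
const TTL_MS = 5 * 60 * 1000;
const cache = new Map<string, { value: unknown; expiresAt: number }>();

async function getTemplate(
  id: string,
  load: (id: string) => Promise<unknown> // e.g. a Cosmos DB point read
): Promise<unknown> {
  const hit = cache.get(id);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit
  const value = await load(id); // cache miss: fetch and repopulate
  cache.set(id, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```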
Rule Engine: Event-Driven Automation
The Rule Engine evaluates configurable business rules against device state and triggers automated actions. It uses a template-to-implementation pattern where reusable rule templates are instantiated as concrete implementations bound to specific assets or locations.
Rule Trigger Types
| Trigger Type | Mechanism | Use Case |
|---|---|---|
| State Change | Storage Queue from Telemetry Processor | Alert when temperature exceeds 30 °C |
| Time-Based | CRON patterns evaluated every minute | Check if equipment has been idle for 4 hours |
| HTTP | Direct HTTP call (debug only) | Development and testing |
Condition Evaluation
Conditions support comparison operators, temporal operators (olderThan, newerThan), composite logic (all/any with short-circuit evaluation), and history-based conditions (totalDuration, stateChangeCount).
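A condensed sketch of the evaluator (the condition shapes are illustrative, and the history-based conditions are omitted for brevity):

```typescript
// Recursively evaluate a condition tree against current asset state.
// `all`/`any` short-circuit via every()/some(); `olderThan` compares a
// timestamp field against a freshness window.
type Condition =
  | { op: "gt" | "lt" | "eq"; field: string; value: number }
  | { op: "olderThan"; field: string; ms: number }
  | { op: "all" | "any"; conditions: Condition[] };

function evaluate(
  cond: Condition,
  state: Record<string, unknown>,
  now: number = Date.now()
): boolean {
  switch (cond.op) {
    case "gt": return (state[cond.field] as number) > cond.value;
    case "lt": return (state[cond.field] as number) < cond.value;
    case "eq": return state[cond.field] === cond.value;
    case "olderThan": // true when the field's timestamp is older than `ms`
      return now - Date.parse(state[cond.field] as string) > cond.ms;
    case "all": return cond.conditions.every((c) => evaluate(c, state, now));
    case "any": return cond.conditions.some((c) => evaluate(c, state, now));
  }
}
```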
Notification Actions
When rules trigger, the engine publishes notification events to Event Grid, which routes them through Storage Queues to the notification service. This service batches and sends email and SMS notifications via a third-party messaging API.
Rule Triggered -> Event Grid -> Storage Queue -> Notification Service -> Messaging Provider (Email/SMS)
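The batching step in the notification service can be sketched as a simple group-by-channel pass (names and shapes are illustrative; the provider call itself is omitted):

```typescript
// Group queued notifications by channel so each channel's batch can be
// sent to the messaging provider in a single call.
interface Notification {
  channel: "email" | "sms";
  to: string;
  message: string;
}

function batchByChannel(items: Notification[]): Map<string, Notification[]> {
  const batches = new Map<string, Notification[]>();
  for (const n of items) {
    const group = batches.get(n.channel) ?? [];
    group.push(n);
    batches.set(n.channel, group);
  }
  return batches;
}
```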
Role-Based Access Control and Identity
Security is handled through a layered approach combining Microsoft Entra ID (Azure AD) with a custom IoT token system.
Authentication Flow
1. User authenticates with Microsoft Entra ID
2. APIM validates the AAD token at the gateway
3. APIM calls the Profile API to generate a custom platform token, a JWT signed with RSA certificates from Azure Key Vault
4. The platform token contains flattened user permissions with scope information
5. APIM caches the token (60-second TTL) and forwards it as a custom header to backend APIs
Four Scope Types
| Scope | Description |
|---|---|
| platform | Global administrative access |
| site | Access limited to specific sites |
| client | Access limited to specific clients |
| siteAndClient | Combined scope for fine-grained access |
Each scope supports a GLOBAL target for unrestricted access within that scope level. The data model in Cosmos DB consists of three entities: Permissions (atomic access rights), Roles (permission groups), and User Profiles (role assignments with scoping).
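An authorization check against the flattened permissions in the platform token might look like the sketch below. The shapes are illustrative, and the combined siteAndClient matching is simplified to a single target for brevity:

```typescript
// Check whether a set of scoped permissions grants an action on a given
// site. Platform scope and the GLOBAL target are unrestricted.
interface ScopedPermission {
  permission: string;                                   // e.g. "assets:read"
  scope: "platform" | "site" | "client" | "siteAndClient";
  target: string;                                       // site/client id, or "GLOBAL"
}

function isAllowed(
  perms: ScopedPermission[],
  permission: string,
  siteId: string
): boolean {
  return perms.some(
    (p) =>
      p.permission === permission &&
      (p.scope === "platform" || // platform scope: unrestricted
        p.target === "GLOBAL" || // GLOBAL target within the scope level
        p.target === siteId)     // scoped to this specific site
  );
}
```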
Hardening: What We Found and Fixed
Building the platform is one thing. Making it production-ready is another. A comprehensive code review uncovered critical security and resilience issues across all services. Here are the most impactful findings and the patterns we applied to resolve them.
Race Conditions in Distributed Rule Processing
The problem: When multiple Azure Function instances process time-based triggers simultaneously, duplicate rule executions generate duplicate notifications.
The fix: Distributed locking using Azure Blob Storage leases. Each timer execution generates a unique lock key (`timer-execute-${timestamp}`), and only the instance that acquires the lock processes triggers.

```typescript
const lockKey = `timer-execute-${myTimer.scheduleStatus.next}`;
const lock = await acquireDistributedLock(lockKey, 60000);
if (!lock) {
  context.log("Another instance is processing, skipping");
  return;
}
```
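A sketch of the `acquireDistributedLock` helper is shown below. To keep the example self-contained and testable, the blob lease is modeled with an in-memory map; in production, `tryAcquireLease` would wrap `BlobLeaseClient.acquireLease` from `@azure/storage-blob` (an assumption for illustration, not the platform's actual code):

```typescript
// First-writer-wins lock: only one caller can hold the lease for a key
// until it expires, mirroring blob-lease semantics.
interface Lock {
  key: string;
  leaseId: string;
}

const leases = new Map<string, number>(); // key -> lease expiry (epoch ms)

async function tryAcquireLease(key: string, ttlMs: number): Promise<string | null> {
  const now = Date.now();
  const expiry = leases.get(key);
  if (expiry !== undefined && expiry > now) return null; // lease still held
  leases.set(key, now + ttlMs);
  return `lease-${now}`;
}

async function acquireDistributedLock(key: string, ttlMs: number): Promise<Lock | null> {
  const leaseId = await tryAcquireLease(key, ttlMs);
  return leaseId ? { key, leaseId } : null; // null => another instance won
}
```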
Circuit Breaker for External APIs
The problem: Direct HTTP calls to the third-party messaging API without circuit breaker protection. When the external provider is down, every request waits 30+ seconds for timeout, causing cascading failures.
The fix: A circuit breaker library wraps all external API calls. After 50% failure rate, the circuit opens and requests fail fast. Recovery is tested automatically after 60 seconds.
```typescript
import CircuitBreaker from 'opossum';

this.emailCircuitBreaker = new CircuitBreaker(this.sendNotification.bind(this), {
  timeout: 30000,               // fail a call that takes longer than 30s
  errorThresholdPercentage: 50, // open the circuit at a 50% failure rate
  resetTimeout: 60000           // attempt a half-open probe after 60s
});
```
Authentication Bypass Risk
The problem: A configuration flag could completely disable authentication checks. A misconfiguration in production would expose all data and operations.
The fix: Remove the bypass entirely for production. For development environments, tie the bypass to environment detection with multiple safeguards rather than a single boolean flag.
Weak Token Validation
The problem: The JWT validation used decodeToken() (base64 decode only) instead of cryptographic signature verification. Attackers could craft fake tokens with any claims.
The fix: Replace with verifyToken() using Azure AD public signing keys fetched from Microsoft's OpenID configuration endpoint, validating issuer, audience, expiration, and algorithm.
Additional Hardening
| Issue | Severity | Resolution |
|---|---|---|
| Missing rate limiting for notifications | High | Per-notification-group rate limiting with Redis counters and TTL |
| ReDoS vulnerability in multipart parsing | Medium | Non-backtracking regex with character classes and length limits |
| No request size limits | Medium | maxRequestBodySize configured in Azure Functions host.json |
| Array splitting without length limits | Medium | Explicit count limits (max 100) before processing comma-separated inputs |
| CORS wildcard in production | High | Replaced with explicit allowed origins configured per environment |
| Anonymous function authorization | Medium | Changed authLevel to function for defense in depth |
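The notification rate limiting from the table above can be sketched as a fixed-window counter. An in-memory map stands in for Redis so the example is self-contained; in production the same logic maps onto Redis INCR plus a TTL on the key (names are illustrative):

```typescript
// Allow at most `limit` notifications per group within a rolling window;
// the counter resets when the window expires, mirroring a Redis key TTL.
const counters = new Map<string, { count: number; resetAt: number }>();

function allowNotification(
  group: string,
  limit: number,
  windowMs: number,
  now: number = Date.now()
): boolean {
  const entry = counters.get(group);
  if (!entry || entry.resetAt <= now) {
    counters.set(group, { count: 1, resetAt: now + windowMs }); // new window
    return true;
  }
  entry.count++; // equivalent of Redis INCR on the group key
  return entry.count <= limit;
}
```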
Technology Decisions and Trade-offs
Key Technology Choices
| Aspect | Choice | Rationale |
|---|---|---|
| Compute | Azure Functions v4 (TypeScript) | Serverless auto-scaling, pay-per-execution, no infrastructure management |
| Messaging | Event Grid + Storage Queues (CloudEvents) | Reliable event delivery with at-least-once semantics and dead-letter support |
| Primary Store | Azure Cosmos DB (hierarchical partition keys) | Global distribution capability, sub-10ms reads, flexible schema |
| Historical Store | Azure Data Lake Storage Gen2 | Cost-effective long-term storage with analytics-ready file organization; supports immutable storage (WORM) for compliance |
| Caching | Azure Cache for Redis | Read-through caching with explicit invalidation for templates and triggers |
| API Gateway | Azure API Management | Centralized auth, rate limiting, CORS, and subscription key management |
| Configuration | Azure App Configuration | Centralized config with feature flags across all services |
| Secrets | Azure Key Vault | RSA signing certificates, API keys, connection strings |
| Frontend | React 18 + Vite + Chakra UI | Fast development, modern DX, accessible component library |
| API Spec | TypeSpec to OpenAPI v3 | Type-safe API definitions with auto-generated OpenAPI specs |
Trade-off worth noting: We chose Storage Queues over Service Bus for inter-service messaging. Storage Queues are simpler and cheaper, and sufficient for our throughput requirements. If ordering guarantees or sessions become necessary in the future, upgrading to Service Bus is a well-understood migration path.
Compliance Considerations: WORM Storage
Do you need Write-Once-Read-Many (WORM) storage? Even without PII, regulatory requirements may mandate immutable storage.
When WORM Is Required
| Scenario | Requirement | Implementation |
|---|---|---|
| Healthcare facilities | FDA 21 CFR Part 11 for temperature logs | Enable time-based retention on Data Lake Gen2 containers |
| Food service | FDA FSMA cold chain compliance | 7-year retention with legal hold capability |
| Financial institutions | SOX compliance for operational data | Immutable storage with audit logging |
| Utility/energy | Meter data for billing disputes | Tamper-proof storage with version immutability |
| Manufacturing | Equipment warranty/maintenance audit trails | Retention policies matching warranty periods |
Azure Data Lake Gen2 already supports WORM through immutable blob storage policies:
```typescript
const retentionPolicy = {
  immutabilityPeriodSinceCreationInDays: 2555, // ~7 years
  allowProtectedAppendWrites: true             // allow appending audit logs
};
```
Cost impact: WORM policies add no extra storage cost; the only change is the storage lifecycle management configuration.
Recommendation: If your facility management platform serves regulated industries (healthcare, food service, finance), enable WORM policies on historical telemetry containers from day one. Retrofitting immutability later is complex.
Lessons Learned
1. Provider isolation pays off early. By designing the ingestion pipeline to be provider-agnostic from day one, onboarding new device vendors became a configuration task rather than an engineering project.
2. Template-driven data models reduce code changes. Instead of modifying code when device capabilities change, teams update templates. The telemetry processor handles transformation automatically.
3. Distributed systems need distributed locking. Timer-based triggers in serverless functions will execute on multiple instances simultaneously. Without distributed locking, every timer trigger becomes a source of duplicate processing.
4. Code review is a production readiness gate. Our review caught authentication bypass risks, ReDoS vulnerabilities, and missing circuit breakers: issues that would have caused outages or security incidents in production.
5. Schema validation at the edge saves downstream trouble. Using Zod for input validation at HTTP boundaries prevents malformed data from propagating through the event pipeline and causing failures in downstream services.
6. Plan for compliance early. If serving regulated industries (healthcare, food service, finance), enable WORM storage policies and audit logging from the start. Retrofitting immutability and compliance controls is significantly harder than building them in from day one.
Conclusion and Next Steps
This IoT platform demonstrates how Azure serverless services can be composed into a production-grade IoT solution that handles multi-provider telemetry ingestion, real-time rule evaluation, and automated notifications while maintaining fine-grained access control and template-driven extensibility.
Key Architectural Takeaways
1. Event-driven decoupling through Event Grid and Storage Queues creates natural scaling boundaries between services.
2. Template-driven device management eliminates code changes when onboarding new devices or capabilities.
3. Layered security combining Entra ID, APIM policies, custom platform tokens, and function-level auth provides defense in depth.
4. Resilience patterns (circuit breakers, distributed locking, rate limiting) are not optional for production IoT workloads.
Recommended Next Steps
If you are building a similar IoT platform on Azure:
1. Start with the ingestion and processing pipeline; it is the backbone of any IoT solution.
2. Design your template system before building device management; it will save significant rework later.
3. Implement distributed locking for any timer-triggered function from the beginning.
4. Add circuit breakers to every external API call before going to production.
5. Run a security-focused code review specifically looking for authentication bypass, injection, and DoS vectors.
6. Evaluate compliance requirements early; if serving regulated industries, configure WORM policies and audit trails before ingesting production data.
For deeper guidance, the Azure IoT reference architecture and the Azure Well-Architected Framework provide additional patterns and checklists that complement the approach described here.