
Azure Architecture Blog

Building a Scalable IoT Platform for Facility Management with Azure Serverless Services

nishantmv
Apr 14, 2026

Facility management at scale requires real-time visibility into thousands of connected devices across geographically distributed sites. HVAC systems, generators, occupancy sensors, and independent monitoring devices each produce continuous telemetry streams that must be ingested, processed, stored, and acted upon — often within seconds.

Business Requirements

The business required a platform that could:

- Handle multi-provider device telemetry ingestion

- Apply configurable business rules in real time

- Send automated notifications via email and SMS

- Provide operations teams with a modern portal for monitoring and control

- Support multi-tenant role-based access

- Enable template-driven device management

- Provide historical data analytics capabilities

Our Solution

This article walks through the architecture we built on Azure Functions, Azure Event Grid, Azure Cosmos DB, Azure Cache for Redis, and Azure Data Lake Storage Gen2. It covers the design choices we made, the patterns that worked well, and the critical security and resilience improvements identified through rigorous code review.

 

Architecture Overview

The platform follows a microservices architecture with six independently deployable services, all built on Azure Functions v4 with TypeScript. Each service is containerized and deployed to Azure Container Apps.

Main Building Blocks

- IoT Portal UI: React 18 frontend with Chakra UI, TanStack Query, and MSAL for Azure AD authentication

- IoT Portal API: Main backend for asset management, locations, templates, rules, and telemetry queries

- IoT Profile API: User profile management and RBAC with custom JWT token generation

- HTTP Ingestion: Provider-agnostic telemetry ingestion endpoints

- Telemetry Processor: Event-driven standardization, state management, and historical archival

- Rule Engine: Configurable rule evaluation with automated notifications via email and SMS

Telemetry Ingestion Pipeline

The ingestion layer follows a two-stage pipeline pattern designed for provider independence and horizontal scalability.

Stage 1: HTTP Reception

Telemetry arrives via authenticated HTTP POST endpoints. Azure API Management (APIM) handles authentication using subscription keys, rate limiting, and request validation at the gateway level. Each provider (occupancy sensors, MQTT-based gensets, and others) gets a dedicated route and controller, but all converge into a shared publishing pipeline.

Stage 2: Event Forwarding

Validated payloads are transformed into CloudEvents with the type Telemetry.Http.Ingested.{Provider} and published to Azure Event Grid. Event Grid routes these to Azure Storage Queues for downstream processing.
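To make the envelope concrete, here is a minimal sketch of wrapping a validated provider payload in a CloudEvents 1.0 envelope before publishing. The attribute names follow the CloudEvents spec; `buildTelemetryEvent` and the `source` value are illustrative, not the platform's actual code.

```typescript
import { randomUUID } from "node:crypto";

// Minimal CloudEvents 1.0 envelope for ingested telemetry.
interface CloudEvent<T> {
  specversion: "1.0";
  id: string;
  source: string;
  type: string;
  time: string;
  datacontenttype: "application/json";
  data: T;
}

function buildTelemetryEvent<T>(provider: string, payload: T): CloudEvent<T> {
  return {
    specversion: "1.0",
    id: randomUUID(),                  // unique per event
    source: "/iot/http-ingestion",     // illustrative service identifier
    type: `Telemetry.Http.Ingested.${provider}`,
    time: new Date().toISOString(),
    datacontenttype: "application/json",
    data: payload,
  };
}
```

Event Grid subscriptions can then filter on the `type` prefix to route each provider's events to the right Storage Queue.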

Key design decision: New device providers are onboarded by adding a route, a controller, and an APIM configuration with no changes to the core pipeline. This keeps the ingestion layer extensible without risking regressions.

Telemetry Processing and Standardization

The Telemetry Processor consumes events from Storage Queues and performs four critical functions:

| Stage | Function | Technology |
| --- | --- | --- |
| Standardize | Transform provider-specific payloads into a unified StandardizedEvent schema using JSONPath-based mappings | Capability templates + Redis-cached metadata |
| Update State | Upsert current asset state to Cosmos DB with monotonic update logic | Azure Cosmos DB (hierarchical partition keys) |
| Archive | Write historical telemetry as JSON files organized by date hierarchy | Azure Data Lake Storage Gen2 |
| Trigger Rules | Publish trigger events for rule evaluation | Azure Event Grid to Storage Queues |

The monotonic update logic ensures that only data with newer timestamps overwrites existing state, a critical safeguard against out-of-order event delivery in distributed systems.
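The guard itself is small. This is an illustrative sketch (the type and function names are assumptions); in the real service the check is applied during the Cosmos DB upsert:

```typescript
// Current state of an asset as stored in Cosmos DB (simplified).
interface AssetState {
  assetId: string;
  value: unknown;
  eventTimestamp: string; // ISO 8601 timestamp from the device
}

// Monotonic update guard: an incoming reading only replaces stored state
// if its timestamp is strictly newer than what we already have.
function shouldApplyUpdate(
  current: AssetState | undefined,
  incoming: AssetState
): boolean {
  if (!current) return true; // first reading for this asset
  return Date.parse(incoming.eventTimestamp) > Date.parse(current.eventTimestamp);
}
```

With at-least-once queue delivery, a redelivered or late-arriving older event simply fails this check and is discarded instead of rolling state backwards.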

Template-Driven Device Management

Rather than hardcoding device capabilities, the platform uses a three-tier template hierarchy that decouples device definition from device instances.

 

| Tier | Purpose | Examples |
| --- | --- | --- |
| Capability Templates | Define individual data points (telemetry, commands, parameters) with data types, validation rules, and units | Temperature, Humidity, ON/OFF Command |
| Asset Templates | Define device types by combining capabilities with physical specifications | Sensor, Gateway, Genset Controller |
| Location Templates | Define physical spaces and specify required assets with optional strict policy enforcement | Building, Floor, Room, Equipment Area |

The template system supports seven data types (String, Integer, Double, Boolean, DateTime, JSON, Binary) and three validation rule types (value limits, allowed value lists, regex patterns). Templates use semantic versioning starting at 1.0.0, and deletion protection prevents removing templates that are referenced by higher-tier templates or instances.

Why this matters: When onboarding a new device model, the operations team defines capabilities through templates rather than writing code. The telemetry processor automatically uses these templates (via Redis-cached lookups with a 5-minute TTL) to standardize incoming data.
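The standardization step driven by those templates looks roughly like this. The real templates use full JSONPath expressions; for brevity this sketch resolves simple dot paths, and the `CapabilityMapping` and `standardize` names are illustrative:

```typescript
// One entry of a capability template: where a value lives in the provider
// payload and what standardized capability name it maps to.
interface CapabilityMapping {
  capability: string; // standardized field, e.g. "temperature"
  path: string;       // location in the provider payload, e.g. "data.temp_c"
}

// Simplified path resolver (stand-in for a JSONPath library).
function resolvePath(obj: any, path: string): unknown {
  return path.split(".").reduce((o, key) => (o == null ? undefined : o[key]), obj);
}

// Apply every mapping from the (Redis-cached) template to a raw payload.
function standardize(
  payload: object,
  mappings: CapabilityMapping[]
): Record<string, unknown> {
  const event: Record<string, unknown> = {};
  for (const m of mappings) {
    const value = resolvePath(payload, m.path);
    if (value !== undefined) event[m.capability] = value; // skip absent fields
  }
  return event;
}
```

Onboarding a new device model then means authoring new mappings, not new transformation code.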

Rule Engine: Event-Driven Automation

The Rule Engine evaluates configurable business rules against device state and triggers automated actions. It uses a template-to-implementation pattern where reusable rule templates are instantiated as concrete implementations bound to specific assets or locations.

Rule Trigger Types

| Trigger Type | Mechanism | Use Case |
| --- | --- | --- |
| State Change | Storage Queue from Telemetry Processor | Alert when temperature exceeds 30 °C |
| Time-Based | CRON patterns evaluated every minute | Check if equipment has been idle for 4 hours |
| HTTP | Direct HTTP call (debug only) | Development and testing |

Condition Evaluation

Conditions support comparison operators, temporal operators (olderThan, newerThan), composite logic (all/any with short-circuit evaluation), and history-based conditions (totalDuration, stateChangeCount).
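A reduced sketch of that condition model, showing comparison, temporal, and composite nodes with short-circuit evaluation. The real engine supports more operators (including the history-based ones); the type names here are assumptions for illustration:

```typescript
// Discriminated union of condition nodes (subset of the real operator set).
type Condition =
  | { op: "gt" | "lt" | "eq"; field: string; value: number }
  | { op: "olderThan"; field: string; ms: number } // field holds an ISO timestamp
  | { op: "all" | "any"; conditions: Condition[] };

function evaluate(
  state: Record<string, any>,
  c: Condition,
  now: number = Date.now()
): boolean {
  switch (c.op) {
    case "gt": return state[c.field] > c.value;
    case "lt": return state[c.field] < c.value;
    case "eq": return state[c.field] === c.value;
    case "olderThan":
      return now - Date.parse(state[c.field]) > c.ms;
    case "all":
      // every() stops at the first false branch (short-circuit)
      return c.conditions.every((sub) => evaluate(state, sub, now));
    case "any":
      // some() stops at the first true branch (short-circuit)
      return c.conditions.some((sub) => evaluate(state, sub, now));
  }
}
```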

Notification Actions

When rules trigger, the engine publishes notification events to Event Grid, which routes them through Storage Queues to the notification service. This service batches and sends email and SMS notifications via a third-party messaging API.

Rule Triggered -> Event Grid -> Storage Queue -> Notification Service -> Messaging Provider (Email/SMS)

Role-Based Access Control and Identity

Security is handled through a layered approach combining Microsoft Entra ID (Azure AD) with a custom IoT token system.

Authentication Flow

1. User authenticates with Microsoft Entra ID

2. APIM validates the AAD token at the gateway

3. APIM calls the Profile API to generate a custom platform token, a JWT signed with RSA certificates from Azure Key Vault

4. The platform token contains flattened user permissions with scope information

5. APIM caches the token (60-second TTL) and forwards it as a custom header to backend APIs

Four Scope Types

| Scope | Description |
| --- | --- |
| platform | Global administrative access |
| site | Access limited to specific sites |
| client | Access limited to specific clients |
| siteAndClient | Combined scope for fine-grained access |

Each scope supports a GLOBAL target for unrestricted access within that scope level. The data model in Cosmos DB consists of three entities: Permissions (atomic access rights), Roles (permission groups), and User Profiles (role assignments with scoping).
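An authorization check against these flattened scopes can be sketched as follows. The four scope types and the GLOBAL target come from the design above; the function and type names are illustrative:

```typescript
type ScopeType = "platform" | "site" | "client" | "siteAndClient";

// One flattened permission scope as carried in the platform token.
interface PermissionScope {
  type: ScopeType;
  siteId?: string;   // "GLOBAL" grants every site at this scope level
  clientId?: string; // "GLOBAL" grants every client at this scope level
}

function matches(granted: string | undefined, requested: string | undefined): boolean {
  return granted === "GLOBAL" || (requested !== undefined && granted === requested);
}

// Does this scope permit access to the given site/client?
function hasAccess(scope: PermissionScope, siteId?: string, clientId?: string): boolean {
  switch (scope.type) {
    case "platform": return true; // global administrative access
    case "site": return matches(scope.siteId, siteId);
    case "client": return matches(scope.clientId, clientId);
    case "siteAndClient":
      return matches(scope.siteId, siteId) && matches(scope.clientId, clientId);
  }
}
```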

Hardening: What We Found and Fixed

Building the platform is one thing. Making it production-ready is another. A comprehensive code review uncovered critical security and resilience issues across all services. Here are the most impactful findings and the patterns we applied to resolve them.

Race Conditions in Distributed Rule Processing

The problem: When multiple Azure Function instances process time-based triggers simultaneously, duplicate rule executions generate duplicate notifications.

The fix: Distributed locking using Azure Blob Storage leases. Each timer execution generates a unique lock key (timer-execute-${timestamp}), and only the instance that acquires the lock processes triggers.

const lockKey = `timer-execute-${myTimer.scheduleStatus.next}`; // unique per scheduled tick
const lock = await acquireDistributedLock(lockKey, 60000); // backed by a blob lease
if (!lock) {
  context.log("Another instance is processing, skipping");
  return;
}
// ...evaluate time-based triggers, then release the lock

Circuit Breaker for External APIs

The problem: Direct HTTP calls to the third-party messaging API without circuit breaker protection. When the external provider is down, every request waits 30+ seconds for timeout, causing cascading failures.

The fix: A circuit breaker library wraps all external API calls. After 50% failure rate, the circuit opens and requests fail fast. Recovery is tested automatically after 60 seconds.

import CircuitBreaker from 'opossum';

this.emailCircuitBreaker = new CircuitBreaker(this.sendNotification.bind(this), {
  timeout: 30000,
  errorThresholdPercentage: 50,
  resetTimeout: 60000
});

Authentication Bypass Risk

The problem: A configuration flag could completely disable authentication checks. A misconfiguration in production would expose all data and operations.

The fix: Remove the bypass entirely for production. For development environments, tie the bypass to environment detection with multiple safeguards rather than a single boolean flag.

Weak Token Validation

The problem: The JWT validation used decodeToken() (base64 decode only) instead of cryptographic signature verification. Attackers could craft fake tokens with any claims.

The fix: Replace with verifyToken() using Azure AD public signing keys fetched from Microsoft's OpenID configuration endpoint, validating issuer, audience, expiration, and algorithm.
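Signature verification against the fetched signing keys is only half of the fix; the decoded claims must still be checked. A minimal sketch of those claim checks, assuming verification has already succeeded (claim names follow the JWT spec; everything else here is illustrative):

```typescript
// Standard JWT claims relevant to the validation described above.
interface TokenClaims {
  iss: string; // issuer
  aud: string; // audience
  exp: number; // expiry, seconds since the Unix epoch
}

// Run after cryptographic signature verification (not shown): confirm the
// token was issued by the expected tenant, for this API, and is not expired.
function validateClaims(
  claims: TokenClaims,
  expectedIssuer: string,
  expectedAudience: string,
  nowSeconds: number = Math.floor(Date.now() / 1000)
): boolean {
  return (
    claims.iss === expectedIssuer &&
    claims.aud === expectedAudience &&
    claims.exp > nowSeconds
  );
}
```

The algorithm check belongs in the verification step itself: accept only the expected asymmetric algorithm, never `none`.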

Additional Hardening

| Issue | Severity | Resolution |
| --- | --- | --- |
| Missing rate limiting for notifications | High | Per-notification-group rate limiting with Redis counters and TTL |
| ReDoS vulnerability in multipart parsing | Medium | Non-backtracking regex with character classes and length limits |
| No request size limits | Medium | maxRequestBodySize configured in Azure Functions host.json |
| Array splitting without length limits | Medium | Explicit count limits (max 100) before processing comma-separated inputs |
| CORS wildcard in production | High | Replaced with explicit allowed origins configured per environment |
| Anonymous function authorization | Medium | Changed authLevel to function for defense in depth |
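The per-notification-group rate limiting is worth sketching. The platform uses Redis counters with TTL; to keep the example self-contained, an in-memory Map stands in for Redis, and the class name and limit/window values are illustrative:

```typescript
// A counting window for one notification group.
interface Window {
  count: number;
  resetAt: number; // epoch ms when the window expires (Redis: key TTL)
}

class GroupRateLimiter {
  private windows = new Map<string, Window>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if this notification may be sent, false if the group
  // has exhausted its budget for the current window.
  tryAcquire(group: string, now: number = Date.now()): boolean {
    const w = this.windows.get(group);
    if (!w || now >= w.resetAt) {
      // Fresh window (Redis equivalent: INCR plus EXPIRE on first hit).
      this.windows.set(group, { count: 1, resetAt: now + this.windowMs });
      return true;
    }
    if (w.count >= this.limit) return false; // over budget: drop or defer
    w.count++;
    return true;
  }
}
```

With Redis behind it, the same shape works across function instances, which an in-memory map alone cannot do.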

Technology Decisions and Trade-offs

Key Technology Choices

| Aspect | Choice | Rationale |
| --- | --- | --- |
| Compute | Azure Functions v4 (TypeScript) | Serverless auto-scaling, pay-per-execution, no infrastructure management |
| Messaging | Event Grid + Storage Queues (CloudEvents) | Reliable event delivery with at-least-once semantics and dead-letter support |
| Primary Store | Azure Cosmos DB (hierarchical partition keys) | Global distribution capability, sub-10ms reads, flexible schema |
| Historical Store | Azure Data Lake Storage Gen2 | Cost-effective long-term storage with analytics-ready file organization; supports immutable storage (WORM) for compliance |
| Caching | Azure Cache for Redis | Read-through caching with explicit invalidation for templates and triggers |
| API Gateway | Azure API Management | Centralized auth, rate limiting, CORS, and subscription key management |
| Configuration | Azure App Configuration | Centralized config with feature flags across all services |
| Secrets | Azure Key Vault | RSA signing certificates, API keys, connection strings |
| Frontend | React 18 + Vite + Chakra UI | Fast development, modern DX, accessible component library |
| API Spec | TypeSpec to OpenAPI v3 | Type-safe API definitions with auto-generated OpenAPI specs |

Trade-off worth noting: We chose Storage Queues over Service Bus for inter-service messaging. Storage Queues are simpler and cheaper, and sufficient for our throughput requirements. If ordering guarantees or sessions become necessary in the future, upgrading to Service Bus is a well-understood migration path.

Compliance Considerations: WORM Storage

Do you need Write-Once-Read-Many (WORM) storage? Even without PII, regulatory requirements may mandate immutable storage.

When WORM Is Required

| Scenario | Requirement | Implementation |
| --- | --- | --- |
| Healthcare facilities | FDA 21 CFR Part 11 for temperature logs | Enable time-based retention on Data Lake Gen2 containers |
| Food service | FDA FSMA cold chain compliance | 7-year retention with legal hold capability |
| Financial institutions | SOX compliance for operational data | Immutable storage with audit logging |
| Utility/energy | Meter data for billing disputes | Tamper-proof storage with version immutability |
| Manufacturing | Equipment warranty/maintenance audit trails | Retention policies matching warranty periods |

Azure Data Lake Gen2 already supports WORM through immutable blob storage policies:

const retentionPolicy = {
  immutabilityPeriodSinceCreationInDays: 2555, // ~7 years
  allowProtectedAppendWrites: true // Allow appending audit logs
};

Cost impact: WORM policies add no additional storage cost; they only require storage lifecycle management configuration.

Recommendation: If your facility management platform serves regulated industries (healthcare, food service, finance), enable WORM policies on historical telemetry containers from day one. Retrofitting immutability later is complex.

Lessons Learned

1. Provider isolation pays off early. By designing the ingestion pipeline to be provider-agnostic from day one, onboarding new device vendors became a configuration task rather than an engineering project.

2. Template-driven data models reduce code changes. Instead of modifying code when device capabilities change, teams update templates. The telemetry processor handles transformation automatically.

3. Distributed systems need distributed locking. Timer-based triggers in serverless functions will execute on multiple instances simultaneously. Without distributed locking, every timer trigger becomes a source of duplicate processing.

4. Code review is a production readiness gate. Our review caught authentication bypass risks, ReDoS vulnerabilities, and missing circuit breakers: issues that would have caused outages or security incidents in production.

5. Schema validation at the edge saves downstream trouble. Using Zod for input validation at HTTP boundaries prevents malformed data from propagating through the event pipeline and causing failures in downstream services.

6. Plan for compliance early. If serving regulated industries (healthcare, food service, finance), enable WORM storage policies and audit logging from the start. Retrofitting immutability and compliance controls is significantly harder than building them in from day one.
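Lesson 5 in practice: validation at the HTTP boundary rejects malformed telemetry before it enters the event pipeline. The platform uses Zod for this; the sketch below shows the same fail-fast idea without the library so it runs anywhere, and the reading shape and helper name are illustrative:

```typescript
// The standardized shape we require at the edge (illustrative).
interface SensorReading {
  deviceId: string;
  timestamp: string; // ISO 8601
  value: number;
}

// Fail fast on malformed input; only validated readings are published.
function parseReading(input: unknown): SensorReading {
  const o = input as Record<string, unknown>;
  if (typeof o?.deviceId !== "string" || o.deviceId.length === 0)
    throw new Error("deviceId must be a non-empty string");
  if (typeof o.timestamp !== "string" || Number.isNaN(Date.parse(o.timestamp)))
    throw new Error("timestamp must be an ISO 8601 string");
  if (typeof o.value !== "number" || !Number.isFinite(o.value))
    throw new Error("value must be a finite number");
  return { deviceId: o.deviceId, timestamp: o.timestamp, value: o.value };
}
```

With Zod the same contract is a declarative schema (`z.object({...}).parse(input)`), which also yields structured error messages for API responses.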

Conclusion and Next Steps

This IoT platform demonstrates how Azure serverless services can be composed into a production-grade IoT solution that handles multi-provider telemetry ingestion, real-time rule evaluation, and automated notifications while maintaining fine-grained access control and template-driven extensibility.

Key Architectural Takeaways

1. Event-driven decoupling through Event Grid and Storage Queues creates natural scaling boundaries between services.

2. Template-driven device management eliminates code changes when onboarding new devices or capabilities.

3. Layered security combining Entra ID, APIM policies, custom platform tokens, and function-level auth provides defense in depth.

4. Resilience patterns (circuit breakers, distributed locking, rate limiting) are not optional for production IoT workloads.

Recommended Next Steps

If you are building a similar IoT platform on Azure:

1. Start with the ingestion and processing pipeline; it is the backbone of any IoT solution.

2. Design your template system before building device management; it will save significant rework later.

3. Implement distributed locking for any timer-triggered function from the beginning.

4. Add circuit breakers to every external API call before going to production.

5. Run a security-focused code review specifically looking for authentication bypass, injection, and DoS vectors.

6. Evaluate compliance requirements early: if serving regulated industries, configure WORM policies and audit trails before ingesting production data.

For deeper guidance, the Azure IoT reference architecture and the Azure Well-Architected Framework provide additional patterns and checklists that complement the approach described here.

Updated Apr 27, 2026
Version 2.0

3 Comments

  • Great practical write-up on building a scalable IoT platform with Azure serverless services.
    I especially liked the focus on event-driven design, provider flexibility, template-driven device management, and production hardening.

  • Nice, practical write-up. I liked how you explained the event-driven flow and kept the design flexible for different device providers—that’s something many teams struggle with. The production hardening bits felt very real, not just theory. Maybe adding a quick note on cost trade-offs or scale limits would help readers planning deployments. Overall, a solid and relatable reference.

nishantmv

      Thanks for the feedback, glad the event‑driven design and provider flexibility resonated. Great call on cost trade‑offs and scale limits; adding a short note there would definitely help teams planning real‑world deployments. Appreciate you taking the time to share this.