This post explores the relationship between physics, information theory, and practical reliability in cloud‑native systems. Building on thermodynamic entropy, Shannon entropy, and chaos theory, it looks at how these ideas show up in real Azure estates, from SLAs and multi‑region deployments to architectural and epistemic entropy, and lands in concrete patterns for designing and operating highly reliable systems that mitigate downtime and improve resilience.
From Physics and Information Theory to Practical Reliability in the Cloud
Two years ago I wrote about mitigating downtime and increasing reliability by managing complexity in cloud‑native systems. I used ideas from physics and chaos theory as a lens to explain why cloud architectures feel fragile as they grow: more moving parts, more states, more surprising failure modes.
Since then, I’ve gone deeper, especially into three concepts:
- Thermodynamic entropy: energy spreading out, order decaying.
- Shannon entropy: uncertainty and information in signals.
- Chaos theory: deterministic systems behaving unpredictably due to sensitivity to initial conditions.
What is striking is how naturally these ideas line up with what we experience every day in Azure estates:
- Systems that we design and deploy against published SLAs that look good on paper but still experience weird outages.
- Architectures that are theoretically resilient but operationally brittle.
- Systems that only seem understandable right up until the moment they fail.
This post is the “Part 2” I did not write in 2023. I will start with entropy and information, map them onto cloud and distributed systems, and then land in concrete patterns for designing and operating highly reliable systems on Azure. Along the way, I’ll use a small SLA example (revisited, corrected, and extended to multi‑region) to connect the theory to real design decisions.
1. Entropy: from physics to information
You do not need full statistical mechanics or information theory to make these concepts useful as an architect. You just need a few core ideas.
1.1 Thermodynamic entropy: energy vs usefulness
At the physical level, two things are true:
- Energy is conserved. It does not disappear.
- Entropy tends to increase in a closed system.
Entropy, in thermodynamics, measures how spread-out energy is and how many microscopic arrangements (microstates) correspond to the same macroscopic appearance (macrostate).
- When energy is concentrated, for example, in a small number of particles with high energy, entropy is low and it’s easy to extract useful work.
- When energy is dispersed, many particles with a little energy each, entropy is high and the energy is harder to harness.
Same total energy, remarkably different usefulness.
A classic example: the Earth and the Sun.
- Earth receives low‑entropy energy: relatively few high‑energy photons.
- Earth radiates back high‑entropy energy: many lower‑energy photons.
Everything interesting that happens in between (weather, chemistry, life) is the process of converting concentrated energy into dispersed heat.
1.2 Shannon entropy: uncertainty and information
Claude Shannon came at entropy from a completely different direction: communication.
He wanted to quantify:
- How unpredictable a source of messages is.
- How much information each message carries.
- How much we can compress data or correct errors over a noisy channel.
The key idea:
- Information is linked to surprise.
- A highly predictable event carries little information.
- A rare, surprising event carries a lot.
Shannon entropy is the average surprise of a source. It is high when:
- Many outcomes are possible, and
- They are all roughly equally likely.
It turns out the formula Shannon derived for entropy in information theory has the same mathematical form as entropy in statistical physics. The interpretations differ:
- Thermodynamic entropy: missing information about the exact microscopic state of a physical system.
- Shannon entropy: missing information (uncertainty) about the outcome of a random variable or source.
But they are connected by the same idea:
Entropy is about uncertainty and the number of possible states.
That is going to matter a lot when we talk about observability and incident response.
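To make the idea concrete, here is a tiny, illustrative Python sketch (not tied to any particular tooling) of Shannon entropy for a few simple sources:

```python
import math

def shannon_entropy(probs):
    """Average surprise of a source, in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin: two equally likely outcomes.
print(shannon_entropy([0.5, 0.5]))    # 1.0 bit

# A heavily biased coin: almost no surprise per toss, so little information.
print(shannon_entropy([0.99, 0.01]))  # ~0.08 bits

# Eight equally likely outcomes: more possible states, more uncertainty.
print(shannon_entropy([1 / 8] * 8))   # 3.0 bits
```

A fair coin is maximally uncertain for two outcomes; the biased coin barely tells you anything you did not already expect; the eight‑outcome source carries the most uncertainty of the three.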
1.3 Chaos: when deterministic systems feel random
Chaos theory completes the picture.
Chaotic systems are:
- Deterministic: future states are fully determined by the current state and rules.
- But highly sensitive to initial conditions: tiny differences in the starting point grow into large differences in behaviour over time.
Weather is the classic example:
- The equations governing fluid flow are known.
- But a tiny uncertainty in measurements today can grow into a completely different storm pattern in a week.
From an information point of view:
- Chaos amplifies small unknowns into large uncertainties about the future.
- Shannon entropy of “where the system might be” grows quickly as you look further ahead.
Keep those three in your head:
- Thermodynamic entropy: the number of microstates and how energy spreads.
- Shannon entropy: how uncertain we are about state or signals.
- Chaos: how uncertainty evolves over time.
Now let’s apply that lens to a cloud system.
2. Cloud systems as entropy machines
In the first post, I argued that cloud‑native architectures are highly ordered structures sitting on top of an unreliable substrate. That intuition still holds, but we can add more clarity to it.
2.1 Resources vs structured capability
On Azure, we rarely run out of raw resources:
- Compute, storage, and network are elastic.
- We can usually scale up and out.
What we run out of is structured capability:
- Clear domain boundaries and ownership.
- Clean, stable APIs and schemas.
- Well‑understood patterns for retries, timeouts, and idempotency.
- Bounded configuration and feature flag complexity.
- Telemetry that tells us what’s happening, not just that “something is wrong”.
- Human attention to support and evolve all the above.
Every time we:
- Add a new microservice or function,
- Introduce a new data store,
- Add another “temporary” feature flag,
- Create a special‑case code path for one tenant or one region,
- Wire in another third‑party dependency,
we increase the number of possible states the system can be in.
Many of those states are benign. Some are edge cases waiting to be discovered at 4 a.m.
This is architectural entropy: the space of states your design permits.
2.2 Architectural entropy vs epistemic entropy
It helps to split entropy into two types:
- Architectural entropy: how many states the system can be in.
  - Number of services and instances.
  - Number of data stores, schemas, and caches.
  - Number of configuration combinations and feature modes.
  - Number of possible call paths and feedback loops.
- Epistemic (Shannon) entropy: how much uncertainty we have about which state it’s actually in right now.
  - What do we know from logs, metrics, and traces?
  - How many different hypotheses are still plausible when something goes wrong?
You can have:
- Low architectural entropy but high epistemic entropy (simple system, almost no telemetry), or
- High architectural entropy but low epistemic entropy (complex system, but excellent observability, automation, and discipline).
Most real systems are somewhere in between.
Chaos matters because:
- As systems grow, small unknowns (tiny config difference, slight timing change) can lead to very different behaviours.
- The “distance” between your architecture diagram and real runtime behaviour grows.
In other words: Diagrams are macrostates; incidents happen in the microstates.
Our job is to make sure the microstate space is constrained, observable, and survivable.
3. Four kinds of entropy in cloud architectures
To make this actionable in design reviews, I want to talk about four dimensions of entropy.
3.1 State entropy: the many ways reality can differ from the diagram
Questions:
- How many data stores are involved (SQL, NoSQL, queues, caches, blobs, search, etc.)?
- For each business concept (customer, order, balance):
  - Is there exactly one source of truth?
  - Or multiple systems that can write it?
- How many schema versions or representations exist at once?
- Do we have dual‑writes or “temporary” copies that became permanent?
Higher state entropy means more ways the system can be “mostly right but subtly wrong”.
3.2 Configuration entropy: the number of behavioural knobs
Questions:
- How many feature flags are active in production?
- How many environment, region, or tenant‑specific settings are there?
- Is there a lifecycle for config and flags (creation > rollout > retirement)?
- Who can change config, and how (pipeline, portal, manual script)?
Higher config entropy increases the number of “if these three flags are on, in that region, for that tenant, after that deploy…” states.
3.3 Interaction entropy: tangles in the dependency graph
Questions:
- For each critical user journey:
  - How many services and external dependencies are in the hot path?
  - What does the call graph look like (fan‑in, fan‑out, cycles)?
- How many asynchronous links: queues, topics, streams, background jobs?
- Are there shared components multiple domains depend on?
Higher interaction entropy means more ways failures can cascade and more opportunities for chaotic feedback (retries, timeouts, autoscaling) to bite.
3.4 Organisational entropy: how many mental models are involved
Questions:
- How many teams are involved in a single critical journey?
- For each component, is there a clear accountable owner?
- Are docs and runbooks up to date and used?
- How many teams typically end up on incident calls?
Higher organisational entropy means larger gaps between:
- What people think the system does, and
- What it actually does under pressure.
These four dimensions give you a way to talk about an entropy budget: Where are we comfortable adding more possible states and interactions, and where are we not?
Before we go into patterns, it’s worth doing a short SLA calculation with some Azure numbers, because it illustrates both the power of multi‑region and Availability Zones and the point at which the maths stops reflecting the true limiting factors.
4. Sidebar: SLAs, regions, zones, and where the maths stops helping
Let’s revisit a simple example using three common services:
- Azure App Service: 99.95%
- Azure SQL Database: 99.99%
- Azure Storage (Blob): 99.9%
Assume a request needs all three to succeed. If any one is unavailable, the scenario fails.
4.1 Single region, no zones
In one region (say, UK South), convert the SLAs to decimals:
- Web App SLA = 99.95% = 0.9995
- SQL DB SLA = 99.99% = 0.9999
- Blob SLA = 99.9% = 0.9990
Composite regional SLA (all three must be up):
- SLA_region = 0.9995 × 0.9999 × 0.9990
- SLA_region = 0.99840065 (≈ 99.8401%)
Failure probability for the region:
- FailureRate_region = 1 − SLA_region
- FailureRate_region = 1 − 0.99840065
- FailureRate_region ≈ 0.00159935
Annual downtime (minutes per year):
- MinutesPerYear = 365 × 24 × 60 = 525,600
- DowntimeMinutes_region = FailureRate_region × MinutesPerYear
- DowntimeMinutes_region ≈ 0.00159935 × 525,600
- DowntimeMinutes_region ≈ 840.6 minutes ≈ 14 hours per year
That’s our baseline for a single region.
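If you want to reproduce these numbers, here is a minimal Python sketch of the same calculation (the SLA figures are the published ones quoted above; everything else is straightforward arithmetic):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def composite_sla(slas):
    """All components must be up at once, so availabilities multiply."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result

def annual_downtime_minutes(sla):
    """Expected unavailable minutes per year for a given availability."""
    return (1 - sla) * MINUTES_PER_YEAR

# App Service, SQL Database, Blob Storage in a single region.
region_sla = composite_sla([0.9995, 0.9999, 0.9990])
print(f"Composite SLA: {region_sla:.8f}")                              # ~0.99840065
print(f"Downtime: {annual_downtime_minutes(region_sla):.1f} min/year")  # ~840.6
```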
4.2 Two regions (UK South + Sweden Central), active‑active
Now deploy the same stack in UK South and Sweden Central, and consider the system “up” if either region can serve the request.
Assumptions:
- Both regions behave like the baseline above.
- Regional failures are independent (simplifying assumption, standard for SLA math).
Failure probability for one region (from above):
- FailureRate_region ≈ 0.00159935
Probability that both regions are down at the same time:
- FailureRate_both = FailureRate_region × FailureRate_region
- FailureRate_both ≈ 0.00159935 × 0.00159935
- FailureRate_both ≈ 0.00000256 (2.56 × 10⁻⁶)
Multi‑region SLA:
- SLA_multi = 1 − FailureRate_both
- SLA_multi ≈ 1 − 0.00000256
- SLA_multi ≈ 0.99999744 (≈ 99.999744%)
Annual downtime:
- DowntimeMinutes_multi = FailureRate_both × MinutesPerYear
- DowntimeMinutes_multi ≈ 0.00000256 × 525,600
- DowntimeMinutes_multi ≈ 1.34 minutes ≈ 80 seconds per year
So on paper, going from one region to two reduces expected downtime by about 600× – from ~14 hours to ~1.3 minutes per year.
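As a sketch, the active‑active calculation is just one more step: square the per‑region failure probability (because both assumed‑independent regions must fail together) and convert back.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# Per-region failure probability from the single-region baseline above.
region_failure = 1 - (0.9995 * 0.9999 * 0.9990)   # ~0.00159935

# The system is only down when both (assumed independent) regions are down.
both_down = region_failure ** 2                    # ~2.56e-06

print(f"Two-region SLA: {1 - both_down:.6%}")                     # ~99.999744%
print(f"Downtime: {both_down * MINUTES_PER_YEAR:.2f} min/year")   # ~1.34
```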
4.3 Single region with 3 Availability Zones
Now consider a single region that uses three Availability Zones for each service. For this model, we assume zonal failures are independent.
For each service:
- Convert SLA to a per‑zone failure rate.
- Cube that value (failure in all three zones).
- Convert back to an “across‑zones” SLA.
Web App (99.95%)
- SLA_web = 0.9995
- FailureRate_web = 1 − SLA_web = 1 − 0.9995 = 0.0005
Failure across all 3 zones:
- FailureRate_web_all_zones = 0.0005 × 0.0005 × 0.0005
- FailureRate_web_all_zones = 0.000000000125 (1.25 × 10⁻¹⁰)
SLA across 3 zones:
- SLA_web_AZ = 1 − FailureRate_web_all_zones
- SLA_web_AZ = 1 − 0.000000000125
- SLA_web_AZ ≈ 0.999999999875
SQL Database (99.99%)
- SLA_sql = 0.9999
- FailureRate_sql = 1 − 0.9999 = 0.0001
Failure across all 3 zones:
- FailureRate_sql_all_zones = 0.0001 × 0.0001 × 0.0001
- FailureRate_sql_all_zones = 0.000000000001 (1 × 10⁻¹²)
SLA across 3 zones:
- SLA_sql_AZ = 1 − FailureRate_sql_all_zones
- SLA_sql_AZ ≈ 0.999999999999
Blob Storage (99.9%)
- SLA_blob = 0.9990
- FailureRate_blob = 1 − 0.9990 = 0.001
Failure across all 3 zones:
- FailureRate_blob_all_zones = 0.001 × 0.001 × 0.001
- FailureRate_blob_all_zones = 0.000000001 (1 × 10⁻⁹)
SLA across 3 zones:
- SLA_blob_AZ = 1 − FailureRate_blob_all_zones
- SLA_blob_AZ ≈ 0.999999999
Composite SLA for the zone‑redundant stack (one region)
- SLA_region_AZ = SLA_web_AZ × SLA_sql_AZ × SLA_blob_AZ
- SLA_region_AZ ≈ 0.999999999875 × 0.999999999999 × 0.999999999
- SLA_region_AZ ≈ 0.999999998874
So:
- Single region with 3 AZs ≈ 99.9999998874%
- FailureRate_region_AZ = 1 − 0.999999998874
- FailureRate_region_AZ ≈ 0.000000001126 (1.126 × 10⁻⁹)
Annual downtime:
- DowntimeMinutes_region_AZ = FailureRate_region_AZ × MinutesPerYear
- DowntimeMinutes_region_AZ ≈ 0.000000001126 × 525,600
- DowntimeMinutes_region_AZ ≈ 0.000592 minutes ≈ 0.035 seconds per year
On paper, that’s a huge improvement.
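Here is the zone‑redundant version of the sketch: cube each service’s per‑zone failure rate (all three zones must fail), then compose the stack as before.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def zone_redundant_sla(sla, zones=3):
    """The service is only down if it fails in every zone (independence assumed)."""
    return 1 - (1 - sla) ** zones

web_az  = zone_redundant_sla(0.9995)   # ~0.999999999875
sql_az  = zone_redundant_sla(0.9999)   # ~0.999999999999
blob_az = zone_redundant_sla(0.9990)   # ~0.999999999

stack_az = web_az * sql_az * blob_az   # ~0.999999998874
downtime_seconds = (1 - stack_az) * MINUTES_PER_YEAR * 60
print(f"Zone-redundant regional SLA: {stack_az:.12f}")
print(f"Downtime: {downtime_seconds:.3f} seconds/year")  # ~0.036
```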
4.4 Two regions + AZ: where the maths stops being the bottleneck
Finally, if you deploy that zone‑redundant stack to both UK South and Sweden Central, and assume:
- Independence between zones within a region, and
- Independence between regions,
then:
Per‑region failure rate with AZ:
- FailureRate_region_AZ ≈ 0.000000001126
Both regions down at once:
- FailureRate_both_AZ = FailureRate_region_AZ × FailureRate_region_AZ
- FailureRate_both_AZ ≈ 0.000000001126 × 0.000000001126
- FailureRate_both_AZ ≈ 0.00000000000000000127 (1.27 × 10⁻¹⁸)
Multi‑region, multi‑AZ SLA:
- SLA_multi_AZ = 1 − FailureRate_both_AZ
- SLA_multi_AZ ≈ 1 − 0.00000000000000000127
- SLA_multi_AZ ≈ 0.9999999999999999987 (effectively “18 nines”)
Annual downtime:
- DowntimeMinutes_multi_AZ = FailureRate_both_AZ × MinutesPerYear
- DowntimeMinutes_multi_AZ ≈ 1.27 × 10⁻¹⁸ × 525,600
- DowntimeMinutes_multi_AZ is on the order of 10⁻¹³ minutes
- That’s about 10⁻¹¹ seconds per year – effectively zero on human timescales
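For completeness, here is the final step as a sketch: the two‑region, zone‑redundant figure is just the square of the per‑region failure rate from above (again assuming full independence).

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Per-region failure rate of the zone-redundant stack (from the previous sketch).
region_az_failure = 1.126e-9

# Both zone-redundant regions down at the same time.
both_down = region_az_failure ** 2                 # ~1.27e-18

print(f"Downtime: {both_down * SECONDS_PER_YEAR:.1e} seconds/year")  # ~4.0e-11
```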
At this point, the maths is telling you:
Under these assumptions, the likelihood of every layer failing at the same time is vanishingly small.
And this is exactly where the assumptions break down:
- Real‑world failures are not fully independent between zones and regions.
- There are shared dependencies (control planes, DNS, identity, networking, global services).
- And most importantly, there is us – our designs, our changes, our operational mistakes.
So while multi‑region + AZ is absolutely the way forward for building highly reliable systems, it is fair (and important) to say:
The system itself is only as reliable as its known failure modes and recovery behaviours, irrespective of how available the underlying platform looks on paper.
SLAs and redundancy describe the potential for reliability.
Whether you realise that potential depends on how well you manage entropy in your architecture, information, and operations.
5. The system is only as reliable as its failure and recovery behaviours
It is tempting to look at the SLA math and conclude that:
“If I have enough regions and zones, I’m basically okay.”
In practice, once you’re using zones and (where appropriate) multiple regions, the limiting factors for reliability are usually:
- Unknown or unmanaged failure modes
  - Features that assume a dependency will “never fail”.
  - Data paths that haven’t been exercised under partial failure.
- Poorly defined recovery behaviour
  - Services that don’t fail fast and cause cascading timeouts.
  - Workflows that get stuck in “unknown” or “zombie” states.
- Gaps in observability
  - You can’t see which parts of the system are broken quickly enough.
  - You can’t distinguish between “region issue”, “dependency issue”, and “our bad deployment”.
- Architectural and organisational entropy
  - Too many flags, modes, and special cases.
  - Too many teams involved in a single change or incident.
  - No clear owners for shared components.
In other words:
The platform might support “five nines” and beyond.
But your system will only be as reliable as:
- The failure states you’ve anticipated,
- The recovery paths you’ve designed and automated, and
- The signals you have to detect and act on them.
That’s where entropy, information, and chaos stop being metaphors and start being design inputs.
How do we fight this kind of entropy in practice?
6. Patterns for fighting entropy (and building real reliability)
6.1 Reduce state entropy: sharpen domains and data ownership
Aim: fewer ways for reality to drift from your mental model.
- Establish strong domain boundaries:
  - One team owns each core business concept and its data.
  - Others consume via APIs or events, not direct writes to its store.
- Limit data technology sprawl per domain:
  - Use the right tool (OLTP, analytics, search), but don’t allow every team to pick a different DB “just because”.
  - Each new engine adds states and failure modes to manage.
- Version schemas and contracts explicitly:
  - Don’t rely on “schema by convention”.
  - Use forwards/backwards compatible changes, versioned contracts, and automated tests for compatibility.
  - Time‑box support for old versions and automate discovery of stragglers.
6.2 Design for partial knowledge: assume you are never fully in sync
Aim: remain correct even when services have incomplete or delayed information.
Patterns:
- Idempotent APIs for all non‑trivial writes:
  - Every request has a stable identity.
  - Safe to retry without knowing if the last attempt “took”.
- Exactly‑once effects on top of at‑least‑once delivery:
  - Use deduplication tables, sequence numbers, or commutative operations to ensure business effects happen once even if messages do not.
- Explicit consistency boundaries:
  - Decide where you require strong consistency (often within a bounded context).
  - Everywhere else, accept eventual consistency and make the UX honest about it: “Pending”, “processing”, last updated timestamps, etc.
- Saga and workflow patterns for multi‑step business operations:
  - Model long‑lived operations as state machines with compensations.
  - Do not pretend a distributed transaction is a local one.
These patterns recognise that:
- No node has perfect knowledge.
- The network is asynchronous.
- Timing and ordering are subject to chaotic behaviour.
They are your tools for living with that reality.
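To ground the first two patterns, here is a minimal, illustrative sketch (hypothetical names, with an in‑memory dict standing in for a durable deduplication table) of an idempotent handler that turns at‑least‑once delivery into an exactly‑once business effect:

```python
from dataclasses import dataclass, field

@dataclass
class PaymentHandler:
    """Illustrative only: a real system would back this with a durable store and a unique key."""
    processed: dict = field(default_factory=dict)  # idempotency key -> result

    def handle(self, idempotency_key: str, amount: float) -> str:
        # Redelivery of the same message? Return the stored outcome instead of charging twice.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]

        result = f"charged {amount:.2f}"  # stand-in for the real side effect
        self.processed[idempotency_key] = result
        return result

handler = PaymentHandler()
print(handler.handle("order-123", 49.99))  # performs the charge
print(handler.handle("order-123", 49.99))  # retried delivery: no second charge
```

The key design choice is that the caller supplies a stable identity for the operation, so a retry is safe whether or not the previous attempt “took”.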
6.3 Use observability to reduce Shannon entropy
Aim: shrink the set of plausible system states during an incident as fast as possible.
Start from invariants:
- Identify 3–5 non‑negotiable properties:
  - “No double‑billing.”
  - “No lost confirmed orders.”
  - “Data accepted with a 200 must be durable.”
Design signals to track those:
- SLIs/SLOs that reflect user‑visible health, not just CPU and memory.
- Metrics that segment by region, tenant, and version: the axes you debug along.
- Correlation that lets you follow a business transaction across services, queues, and background jobs.
Think information theory:
- Every log, metric, and trace is a message.
- Good observability maximises the information those messages carry about what you care about and minimises noise.
You know observability is working when:
- In an incident, you can rule out entire classes of failure with one or two dashboards.
- The set of hypotheses you are considering collapses quickly instead of expanding.
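To tie this back to Shannon entropy, here is a toy sketch: if an incident starts with several equally plausible hypotheses, a good dashboard is a high‑information signal that eliminates most of them.

```python
import math

def hypothesis_entropy(n_hypotheses):
    """Uncertainty, in bits, across n equally plausible failure hypotheses."""
    return math.log2(n_hypotheses)

# Start of the incident: region issue, dependency issue, bad deploy, config drift,
# capacity, DNS, expired certificate, data corruption: eight live hypotheses.
before = hypothesis_entropy(8)   # 3.0 bits

# One good dashboard rules out all but two candidates.
after = hypothesis_entropy(2)    # 1.0 bit

print(f"Information gained from that dashboard: {before - after:.1f} bits")
```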
6.4 Map your dynamics with load and chaos
Aim: understand how your system moves through its state space under stress.
Think in terms of attractors:
- Healthy steady state.
- Graceful degradation states.
- Unhealthy “meltdown” states (retry storms, cascading timeouts, data corruption).
Use:
- Load testing to explore capacity edges:
  - How do autoscaling, throttling, and back-pressure behave?
  - Do you see oscillations (scale up/down thrash)?
- Chaos experiments to explore failure edges:
  - Inject realistic faults: latency, packet loss, node failures, partial region impairments.
  - Observe whether the system falls back to a safe state or spirals.
On Azure, this typically means combining services like Azure Load Testing and Azure Chaos Studio with your own playbooks and SLOs.
The goal is not to “break production for fun”. It is to:
- Discover failure states on your terms,
- Exercise recovery paths, and
- Continuously refine both the architecture and the runbooks.
6.5 Govern with an entropy budget
Aim: make complexity an explicit trade‑off, not a by‑product.
In reviews and planning, ask:
- How much state entropy can this journey tolerate?
- How much config entropy before we lose control?
- How much interaction entropy before another dependency is a liability?
- How much organisational entropy before ownership becomes a bottleneck?
Set budgets:
- For mission‑critical journeys (payments, identity, safety):
  - Extremely tight budgets on state and interaction entropy.
  - Aggressive simplification and standardisation.
- For less critical or internal workloads:
  - Higher budgets but nonetheless bounded.
  - Emphasis on blast radius control and observability.
Then hold teams (and yourself) to those budgets:
“If we want to introduce this new data store / dependency / feature flag, what are we removing or standardising elsewhere to pay for it? And how are we making it observable and operable?”
That turns “this feels too complex” into a repeatable pattern.
7. Closing: living in the productive middle
Entropy, Shannon information, and chaos theory are not just metaphors. They are honest descriptions of the forces acting on modern cloud systems:
- Structures decay.
- The number of possible states grows faster than we like to admit.
- Our knowledge of those states is limited and noisy.
- Small uncertainties can grow into large incidents over time.
The good news is: interesting systems live in the middle.
- Too little entropy and your architecture cannot scale or adapt.
- Too much and it becomes inoperable.
Our job as architects and leaders is to keep our systems in that productive middle zone:
- Use zones and regions to raise the ceiling on what’s possible.
- Use patterns, observability, and governance to make that potential real.
- Continuously learn from incidents and experiments to push complexity back into structure.
Or, in the language we’ve been using:
Cloud is a war against entropy.
Infra redundancy wins you the right battles.
Failure and recovery design decide whether you win the war.
Lavan Nallainathan
Director - Customer Success UK
Cloud & AI