Storage at Microsoft

Deep Dive: The Storage Pool in Storage Spaces Direct

Cosmos_Darwin
Apr 10, 2019
First published on TECHNET on Nov 21, 2016
Hi! I’m Cosmos. Follow me on Twitter @cosmosdarwin.

Review


The storage pool is the collection of physical drives which form the basis of your software-defined storage. Those familiar with Storage Spaces in Windows Server 2012 or 2012 R2 will remember that pools took some managing – you had to create and configure them, and then manage membership by adding or removing drives. Because of scale limitations, most deployments had multiple pools, and because data placement was essentially static (more on this later), you couldn’t really expand them once created.

We’re introducing some exciting improvements in Windows Server 2016.

What’s new


With Storage Spaces Direct, we now support up to 416 drives per pool, the same as our per-cluster maximum, and we strongly recommend you use exactly one pool per cluster. When you enable Storage Spaces Direct (for example, with the Enable-ClusterS2D cmdlet), this pool is automatically created and configured with the best possible settings for your deployment. Eligible drives are automatically discovered and added to the pool and, if you scale out, any new drives are added to the pool too, and data is moved around to make use of them. When drives fail, they are automatically retired and removed from the pool. In fact, you really don’t need to manage the pool at all anymore except to keep an eye on its available capacity.

Nonetheless, understanding how the pool works can help you reason about fault tolerance, scale-out, and more. So if you’re curious, read on!

To help illustrate certain key points, I’ve written a script (open-source, available at the end) which produces this view of the pool’s drives, organized by type, by server ('node'), and by how much data they’re storing. The fastest drives in each server, listed at the top, are claimed for caching.

[Figure: The storage pool forms the physical basis of your software-defined storage.]

The confusion begins: resiliency, slabs, and striping


Let’s start with three servers forming one Storage Spaces Direct cluster.

Each server has 2 x 800 GB NVMe drives for caching and 4 x 2 TB SATA SSDs for capacity.



We can create our first volume ('Storage Space'), choosing 1 TiB in size and two-way mirroring. This means we will maintain two identical copies of everything in that volume, always on different drives in different servers, so that if hardware fails or is taken down for maintenance, we’re sure to still have access to all our data. Consequently, this 1 TiB volume will actually occupy 2 TiB of physical capacity on disk, its so-called 'footprint' on the pool.

[Figure: Our 1 TiB two-way mirror volume occupies 2 TiB of physical capacity, its 'footprint' on the pool.]

(Storage Spaces offers many resiliency types with differing storage efficiency. For simplicity, this blog will show two-way mirroring. The concepts we’ll cover apply regardless of which resiliency type you choose, but two-way mirroring is by far the most straightforward to draw and explain. Likewise, although Storage Spaces offers chassis and/or rack awareness, this blog will assume the default server-level awareness for simplicity.)

Okay, so we have 2 TiB of data to write to physical media. But where will these two tebibytes of data actually land?

You might imagine that Spaces just picks any two drives, in different servers, and places the copies in whole on those drives. Alas, no. What if the volume were larger than the drive size? Okay, perhaps it spans several drives in both servers? Closer, but still no.

What actually happens can be surprising if you’ve never seen it before.

[Figure: Storage Spaces starts by dividing the volume into many 'slabs', each 256 MB in size.]

Storage Spaces starts by dividing the volume into many 'slabs', each 256 MB in size. This means our 1 TiB volume has some 4,000 such slabs!
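As a quick check on that number, here is a minimal PowerShell sketch of the arithmetic. It assumes the 256 MB slab size is binary (256 MiB), which is also how PowerShell interprets its `MB` suffix; the variable names are made up for illustration.

```powershell
# Illustrative arithmetic only; variable names are hypothetical.
# PowerShell's numeric suffixes are binary: 1TB = 2^40 bytes, 256MB = 2^28 bytes.
$VolumeSize = 1TB       # our 1 TiB volume
$SlabSize   = 256MB     # assumed to mean 256 MiB
$Copies     = 2         # two-way mirror keeps two copies of every slab

$SlabCount  = $VolumeSize / $SlabSize   # slabs in the volume
$SlabCopies = $SlabCount * $Copies      # slab copies that must land on drives
$Footprint  = $VolumeSize * $Copies     # physical capacity consumed in the pool

"{0} slabs, {1} slab copies, {2} TiB footprint" -f $SlabCount, $SlabCopies, ($Footprint / 1TB)
```

Under that assumption the volume has 4,096 slabs, so the pool has 8,192 slab copies to place.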

For each slab, two copies are made and placed on different drives in different servers. This decision is made independently for each slab, successively, with an eye toward equilibrating utilization – you can think of it like dealing playing cards into equal piles. This means every single drive in the storage pool will store some copies of some slabs!

[Figure: The placement decision is made independently for each slab, like dealing playing cards into equal piles.]

This can be non-obvious, but it has some real consequences you can observe. For one, it means all drives in all servers will gradually "fill up" in lockstep, in 256 MB increments. This is why we rarely pay attention to how full specific drives or servers are – because they’re (almost) always (almost) the same!
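To make the card-dealing picture concrete, here is a toy simulation in PowerShell. It is not the real Storage Spaces allocator, whose logic isn't described here; it simply assumes that each slab's two copies go to the least-full drive in each of the two least-full servers. All names are hypothetical, and it may take a few seconds to run.

```powershell
# Toy model only: NOT the real Storage Spaces placement logic.
$Servers         = 3
$DrivesPerServer = 4       # capacity drives per server
$SlabCount       = 4096    # 1 TiB volume, assuming 256 MiB slabs

# One counter object per capacity drive: how many slab copies it holds.
$Drives = foreach ($s in 1..$Servers) {
    foreach ($d in 1..$DrivesPerServer) {
        [PSCustomObject]@{ Server = $s; Drive = $d; Slabs = 0 }
    }
}

foreach ($Slab in 1..$SlabCount) {
    # The least-full drive in each server is that server's candidate.
    $Candidates = $Drives | Group-Object Server | ForEach-Object {
        $_.Group | Sort-Object Slabs, Drive | Select-Object -First 1
    }
    # The two copies go to the two least-full candidates, which are
    # in different servers by construction.
    $Candidates | Sort-Object Slabs, Server | Select-Object -First 2 |
        ForEach-Object { $_.Slabs++ }
}

$Drives | Format-Table Server, Drive, Slabs
```

Even with such a naive rule, every drive in every server ends up holding roughly the same number of slab copies, which is the "lockstep" behavior described above.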



[Figure: Slabs of our two-way mirrored volume have landed on every drive in all three servers.]

(For the curious reader: the pool keeps a sprawling mapping of which drive holds each copy of each slab, called the 'pool metadata', which can reach up to several gigabytes in size. It is replicated to at least five of the fastest drives in the cluster, and synchronized and repaired with the utmost aggressiveness. To my knowledge, pool metadata loss has never taken down an actual production deployment of Storage Spaces.)

Why? Can you spell parallelism?


This may seem complicated, and it is. So why do it? Two reasons.

Performance, performance, performance!


First, striping every volume across every drive unlocks truly awesome potential for reads and writes – especially larger sequential ones – to activate many drives in parallel, vastly increasing IOPS and IO throughput. The unrivaled performance of Storage Spaces Direct compared to competing technologies is largely attributable to this fundamental design. (There is more complexity here, with the infamous column count and interleave you may remember from 2012 or 2012 R2, but that’s beyond the scope of this blog. Spaces automatically sets appropriate values for these in 2016 anyway.)

(This is also why members of the core Spaces engineering team take some offense if you compare mirroring directly to RAID-1.)

Improved data safety


The second is data safety – it’s related, but worth explaining in detail.

In Storage Spaces, when drives fail, their contents are reconstructed elsewhere based on the surviving copy or copies. We call this ‘repairing’, and it happens automatically and immediately in Storage Spaces Direct. If you think about it, repairing must involve two steps – first, reading from the surviving copy; second, writing out a new copy to replace the lost one.

Bear with me for a paragraph, and imagine if we kept whole copies of volumes. (Again, we don’t.) Imagine one drive has every slab of our 1 TiB volume, and another drive has the copy of every slab. What happens if the first drive fails? The other drive has the only surviving copy. Of every slab. To repair, we need to read from it. Every. Last. Byte. We are obviously limited by the read speed of that drive. Worse yet, we then need to write all that out again to the replacement drive or hot spare, where we are limited by its write speed. Yikes! Inevitably, this leads to contention with ongoing user or application IO activity. Not good.

Storage Spaces, unlike some of our friends in the industry, does not do this.

Consider again the scenario where some drive fails. We do lose all the slabs stored on that drive. And we do need to read from each slab's surviving copy in order to repair. But, where are these surviving copies? They are evenly distributed across almost every other drive in the pool! One lost slab might have its other copy on Drive 15; another lost slab might have its other copy on Drive 03; another lost slab might have its other copy on Drive 07; and so on. So, almost every other drive in the pool has something to contribute to the repair!

Next, we do need to write out the new copy of each – where can these new copies be written? Provided there is available capacity, each lost slab can be reconstructed on almost any other drive in the pool!

(For the curious reader: I say almost because the requirement that slab copies land in different servers precludes any drives in the same server as the failure from having anything to contribute, read-wise. They were never eligible to get the other copy. Similarly, those drives in the same server as the surviving copy are ineligible to receive the new copy, and so have nothing to contribute write-wise. This detail turns out not to be terribly consequential.)
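To put numbers on "almost", here is a small counting sketch in PowerShell for the three-server example. It assumes two-way mirroring, server-level awareness, and a single failed capacity drive; the variable names are hypothetical.

```powershell
# Counting sketch for the example cluster; variable names are hypothetical.
$Servers         = 3
$DrivesPerServer = 4      # capacity drives per server

# One capacity drive fails, leaving this many survivors in the pool.
$Survivors = $Servers * $DrivesPerServer - 1

# Reads: surviving copies can only live in the OTHER servers.
$ReadSources = ($Servers - 1) * $DrivesPerServer

# Writes, for any ONE lost slab: every surviving drive except those in the
# server that already holds that slab's surviving copy.
$WriteTargetsPerSlab = $Survivors - $DrivesPerServer

"{0} of {1} surviving drives can serve repair reads; each lost slab has {2} possible new homes" -f $ReadSources, $Survivors, $WriteTargetsPerSlab
```

Because different lost slabs have their surviving copies in different servers, the write load across all slabs can still reach every surviving drive, even though each individual slab has fewer eligible targets.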

While this can be non-obvious, it has some significant implications. Most importantly, repairing data faster minimizes the risk that multiple hardware failures will overlap in time, improving overall data safety. It is also more convenient, as it reduces the 'resync' wait time during rolling cluster-wide updates or maintenance. And because the read/write burden is spread thinly among all surviving drives, the load on each drive individually is light, which minimizes contention with user or application activity.

Reserve capacity


For this to work, you need to set aside some extra capacity in the storage pool. You can think of this as giving the contents of a failed drive "somewhere to go" to be repaired. For example, to repair from one drive failure (without immediately replacing it), you should set aside at least one drive’s worth of reserve capacity. (If you are using 2 TB drives, that means leaving 2 TB of your pool unallocated.) This serves the same function as a hot spare, but unlike an actual hot spare, the reserve capacity is taken evenly from every drive in the pool.
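Here is a rough PowerShell sketch of that reserve arithmetic for the example cluster. It ignores pool metadata and any other overhead, assumes only the capacity drives count toward the pool's usable space, and uses made-up variable names. Drive sizes are decimal (2 TB = 2e12 bytes), so it deliberately avoids PowerShell's binary `TB` suffix.

```powershell
# Illustrative arithmetic only; names are hypothetical and overhead is ignored.
$Servers         = 3
$DrivesPerServer = 4       # capacity drives only
$DriveSize       = 2e12    # 2 TB, decimal bytes
$ReserveDrives   = 1       # set aside one drive's worth
$Copies          = 2       # two-way mirror

$PoolCapacity = $Servers * $DrivesPerServer * $DriveSize
$Reserve      = $ReserveDrives * $DriveSize
$Allocatable  = $PoolCapacity - $Reserve     # room left for volume footprints
$Usable       = $Allocatable / $Copies       # volume capacity when two-way mirrored

"Pool {0} TB, reserve {1} TB, usable {2} TB" -f ($PoolCapacity / 1e12), ($Reserve / 1e12), ($Usable / 1e12)
```

With these assumptions, reserving one drive's worth leaves 22 TB for footprints, or 11 TB of two-way mirrored volumes.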

[Figure: Reserve capacity gives the contents of a failed drive "somewhere to go" to be repaired.]

Reserving capacity is not enforced by Storage Spaces, but we highly recommend it. The more you have, the less urgently you will need to scramble to replace drives when they fail, because your volumes can (and will automatically) repair into the reserve capacity, completely independent of the physical replacement process.

When you do eventually replace the drive, it will automatically take its predecessor’s place in the pool.

Check out our capacity calculator for help with determining appropriate reserve capacity.

Automatic pooling and re-balancing


New in Windows 10 and Windows Server 2016, slabs and their copies can be moved around between drives in the storage pool to equilibrate utilization. We call this 'optimizing' or 're-balancing' the storage pool, and it’s essential for scalability in Storage Spaces Direct.

For instance, what if we need to add a fourth server to our cluster?
Add-ClusterNode -Name <Name>


The new drives in this new server will be added automatically to the storage pool. At first, they’re empty.

[Figure: The capacity drives in our fourth server are empty, for now.]

After 30 minutes, Storage Spaces Direct will automatically begin re-balancing the storage pool – moving slabs around to even out drive utilization. This can take some time (many hours) for larger deployments. You can watch its progress using the following cmdlet.
Get-StorageJob
If you’re impatient, or if your deployment uses Shared SAS Storage Spaces with Windows Server 2016, you can kick off the re-balance yourself using the following cmdlet.
Optimize-StoragePool -FriendlyName "S2D*"
[Figure: The storage pool is 're-balanced' whenever new drives are added to even out utilization.]

Once completed, we see that our 1 TiB volume is (almost) evenly distributed across all the drives in all four servers.
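For a sense of scale, this small PowerShell sketch estimates how much of our volume's footprint each capacity drive holds before and after the re-balance. It assumes a perfectly even spread, and the variable names are hypothetical.

```powershell
# Illustrative arithmetic only; assumes a perfectly even spread.
$Footprint    = 2 * 1TB    # two copies of our 1 TiB volume (binary units)
$DrivesBefore = 3 * 4      # three servers, four capacity drives each
$DrivesAfter  = 4 * 4      # after adding the fourth server

$PerDriveBefore = [Math]::Round([double]$Footprint / $DrivesBefore / 1GB, 1)
$PerDriveAfter  = [Math]::Round([double]$Footprint / $DrivesAfter / 1GB, 1)

"Per drive: about {0} GiB before, {1} GiB after" -f $PerDriveBefore, $PerDriveAfter
```

In other words, the re-balance has to move roughly a quarter of each existing drive's slabs over to the new server.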



[Figure: The slabs of our 1 TiB two-way mirrored volume are now spread evenly across all four servers.]

And going forward, when we create new volumes, they too will be distributed evenly across all drives in all servers.

This can explain one final phenomenon you might observe – that when a drive fails, every volume is marked 'Incomplete' for the duration of the repair. Can you figure out why?

Conclusion


Okay, that’s it for now. If you’re still reading, wow, thank you!

Let’s review some key takeaways.

  • Storage Spaces Direct automatically creates one storage pool, which grows as your deployment grows. You do not need to modify its settings, add or remove drives from the pool, nor create new pools.

  • Storage Spaces does not keep whole copies of volumes – rather, it divides them into tiny 'slabs' which are distributed evenly across all drives in all servers. This has some practical consequences. For example, using two-way mirroring with three servers does not leave one server empty. Likewise, when drives fail, all volumes are affected for the very short time it takes to repair them.

  • Leaving some unallocated 'reserve' capacity in the pool allows this fast, non-invasive, parallel repair to happen even before you replace the drive.

  • The storage pool is 're-balanced' whenever new drives are added, such as on scale-out or after replacement, to equilibrate how much data every drive is storing. This ensures all drives and all servers are always equally "full".


U Can Haz Script


In PowerShell, you can see the storage pool by running the following cmdlet.
Get-StoragePool S2D*
And you can see the drives in the pool with this simple pipeline.
Get-StoragePool S2D* | Get-PhysicalDisk
Throughout this blog, I showed the output of a script which essentially runs the above, cherry-picks interesting properties, and formats the output all fancy-like. That script is included below, and is also available at http://cosmosdarwin.com/Show-PrettyPool.ps1 to spare you the 200-line copy/paste. There is also a simplified version here which forgoes my extravagant helper functions to reduce running time by about 20x and lines of code by about 2x. :-)

Let me know what you think!
# Written by Cosmos Darwin, PM
# Copyright (C) 2017 Microsoft Corporation
# MIT License
# 08/2017

Function ConvertTo-PrettyCapacity {
    <#
    .SYNOPSIS Convert raw bytes into prettier capacity strings.
    .DESCRIPTION Takes an integer of bytes, converts to the largest unit (kilo-, mega-, giga-, tera-) that will result in at least 1.0, rounds to given precision, and appends standard unit symbol.
    .PARAMETER Bytes The capacity in bytes.
    .PARAMETER UseBaseTwo Switch to toggle use of binary units and prefixes (mebi, gibi) rather than standard (mega, giga).
    .PARAMETER RoundTo The number of decimal places for rounding, after conversion.
    #>

    Param (
        [Parameter(
            Mandatory = $True,
            ValueFromPipeline = $True
            )
        ]
        [Int64]$Bytes,
        [Int64]$RoundTo = 0,
        [Switch]$UseBaseTwo # Base-10 by Default
    )

    If ($Bytes -Gt 0) {
        $BaseTenLabels = ("bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
        $BaseTwoLabels = ("bytes", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB")
        If ($UseBaseTwo) {
            $Base = 1024
            $Labels = $BaseTwoLabels
        }
        Else {
            $Base = 1000
            $Labels = $BaseTenLabels
        }
        $Order = [Math]::Floor( [Math]::Log($Bytes, $Base) )
        $Rounded = [Math]::Round($Bytes/( [Math]::Pow($Base, $Order) ), $RoundTo)
        [String]($Rounded) + $Labels[$Order]
    }
    Else {
        0
    }
    Return
}


Function ConvertTo-PrettyPercentage {
    <#
    .SYNOPSIS Convert (numerator, denominator) into prettier percentage strings.
    .DESCRIPTION Takes two integers, divides the former by the latter, multiplies by 100, rounds to given precision, and appends "%".
    .PARAMETER Numerator Really?
    .PARAMETER Denominator C'mon.
    .PARAMETER RoundTo The number of decimal places for rounding.
    #>

    Param (
        [Parameter(Mandatory = $True)]
        [Int64]$Numerator,
        [Parameter(Mandatory = $True)]
        [Int64]$Denominator,
        [Int64]$RoundTo = 1
    )

    If ($Denominator -Ne 0) { # Cannot Divide by Zero
        $Fraction = $Numerator/$Denominator
        $Percentage = $Fraction * 100
        $Rounded = [Math]::Round($Percentage, $RoundTo)
        [String]($Rounded) + "%"
    }
    Else {
        0
    }
    Return
}

Function Find-LongestCommonPrefix {
    <#
    .SYNOPSIS Find the longest prefix common to all strings in an array.
    .DESCRIPTION Given an array of strings (e.g. "Seattle", "Seahawks", and "Season"), returns the longest starting substring ("Sea") which is common to all the strings in the array. Not case sensitive.
    .PARAMETER Array The input array of strings.
    #>

    Param (
        [Parameter(
            Mandatory = $True
            )
        ]
        [Array]$Array
    )

    If ($Array.Length -Gt 0) {

        $Exemplar = $Array[0]

        $PrefixEndsAt = $Exemplar.Length # Initialize
        0..$Exemplar.Length | ForEach {
            $Character = $Exemplar[$_]
            ForEach ($String in $Array) {
                If ($String[$_] -Eq $Character) {
                    # Match
                }
                Else {
                    # Mismatch: the common prefix ends at the earliest mismatching index
                    $PrefixEndsAt = [Math]::Min($_, $PrefixEndsAt)
                }
            }
        }
        # Prefix
        $Exemplar.SubString(0, $PrefixEndsAt)
    }
    Else {
        # None
    }
    Return
}

Function Reverse-String {
    <#
    .SYNOPSIS Takes an input string ("Gates") and returns the character-by-character reversal ("setaG").
    #>

    Param (
        [Parameter(
            Mandatory = $True,
            ValueFromPipeline = $True
            )
        ]
        $String
    )

    $Array = $String.ToCharArray()
    [Array]::Reverse($Array)
    -Join($Array)
    Return
}

Function New-UniqueRootLookup {
    <#
    .SYNOPSIS Creates hash table that maps strings, particularly server names of the form [CommonPrefix][Root][CommonSuffix], to their unique Root.
    .DESCRIPTION For example, given ("Server-A2.Contoso.Local", "Server-B4.Contoso.Local", "Server-C6.Contoso.Local"), returns key-value pairs:
        {
            "Server-A2.Contoso.Local" -> "A2"
            "Server-B4.Contoso.Local" -> "B4"
            "Server-C6.Contoso.Local" -> "C6"
        }
    .PARAMETER Strings The keys of the hash table.
    #>

    Param (
        [Parameter(
            Mandatory = $True
            )
        ]
        [Array]$Strings
    )

    # Find Prefix

    $CommonPrefix = Find-LongestCommonPrefix $Strings

    # Find Suffix

    $ReversedArray = @()
    ForEach($String in $Strings) {
        $ReversedString = $String | Reverse-String
        $ReversedArray += $ReversedString
    }

    $CommonSuffix = $(Find-LongestCommonPrefix $ReversedArray) | Reverse-String

    # String -> Root Lookup

    $Lookup = @{}
    ForEach($String in $Strings) {
        $Lookup[$String] = $String.Substring($CommonPrefix.Length, $String.Length - $CommonPrefix.Length - $CommonSuffix.Length)
    }

    $Lookup
    Return
}

### SCRIPT... ###

$Nodes = Get-StorageSubSystem Cluster* | Get-StorageNode
$Drives = Get-StoragePool S2D* | Get-PhysicalDisk

$Names = @()
ForEach ($Node in $Nodes) {
    $Names += $Node.Name
}

$UniqueRootLookup = New-UniqueRootLookup $Names

$Output = @()

ForEach ($Drive in $Drives) {

    If ($Drive.BusType -Eq "NVMe") {
        $SerialNumber = $Drive.AdapterSerialNumber
        $Type = $Drive.BusType
    }
    Else { # SATA, SAS
        $SerialNumber = $Drive.SerialNumber
        $Type = $Drive.MediaType
    }

    If ($Drive.Usage -Eq "Journal") {
        $Size = $Drive.Size | ConvertTo-PrettyCapacity
        $Used = "-"
        $Percent = "-"
    }
    Else {
        $Size = $Drive.Size | ConvertTo-PrettyCapacity
        $Used = $Drive.VirtualDiskFootprint | ConvertTo-PrettyCapacity
        $Percent = ConvertTo-PrettyPercentage $Drive.VirtualDiskFootprint $Drive.Size
    }

    $NodeObj = $Drive | Get-StorageNode -PhysicallyConnected
    If ($NodeObj -Ne $Null) {
        $Node = $UniqueRootLookup[$NodeObj.Name]
    }
    Else {
        $Node = "-"
    }

    # Pack

    $Output += [PSCustomObject]@{
        "SerialNumber" = $SerialNumber
        "Type" = $Type
        "Node" = $Node
        "Size" = $Size
        "Used" = $Used
        "Percent" = $Percent
    }
}

$Output | Sort Used, Node | FT
  • amatesi

    Hi Cosmos and thank you for this extremely useful explanation.

    I am hereby assuming (please correct me if I'm wrong) that you can manually "regulate" slabs size by tweaking the "New-VirtualDisk -AllocationUnitSize" parameter.

    If my assumption is correct, then why does creating a virtual disk with New-VirtualDisk in PowerShell give me a 1 GB slab versus 256 MB from the GUI?

    Also, given that the term "allocation unit" has already been taken (whenever you mention allocation unit, everyone automatically thinks of the cluster size), wouldn't it have been better to call the New-VirtualDisk parameter "-SlabSize"?

    Please clarify for everyone's benefit and feel free to share (any) performance consequences between the two values (1 GB vs. 256 MB slabs).

     

    Thanks,

    Andrea.

  • tadakk

    Links in the page for the PowerShell script now go to a malicious Chinese site. Please remove or fix.

  • alexluplus

    Thanks, I have some questions:

    1. In the New-Volume and New-VirtualDisk cmdlets, is the slab size the same as AllocationUnitSize?

    2. Is AllocationUnitSize the equivalent of the file system block size in Linux?

    3. In ESXi, a datastore has a block size parameter, usually 1 MB. Are that block and a slab the same thing?

    4. In RHEL, LVM has the notion of a physical or logical extent, usually 4 MB. Are that extent and a slab the same thing?

    Slab, allocation unit, datastore block, LVM extent: if they are not the same thing, what are the differences between them?

     

    Cosmos_Darwin, could you please help me understand those terms before I commit them into my brain.

     

    BR

  • Hi Cosmos_Darwin,

     

    We are using this S2D in Azure, so basically we're eliminating the probability for drive failure. So in this case do we still need to reserve some space?

     

    Regards

    Guru!

  • i3vi3v

    Thanks for the great article!

    I really understood a lot. However, I still have a lot of questions...

    1. You mention Windows 10, but the article title is about "Storage spaces direct". I've heard that it's very different from the regular "Storage spaces". Is everything this article talks about applicable to a single-node Windows 10 Storage Spaces as well?
    2. Provided that slabs are now placed randomly, is the famous "You should also add disks in multiples of the column count" limitation still valid?
    3. If independent 256MB slabs are placed on several different physical disks, and they could be potentially read/written in parallel, to serve somewhat-random-access outstanding IO requests, is the recommendation "Set column count = number of physical disks. Use PowerShell when configuring more than 8 disks" still valid? It sounds like it should only be valid for sequential IO, shouldn't it?
  • Really nice gif.
    would like to know how you created it 🙂