First published on TECHNET on Aug 11, 2017
Hi! I’m Cosmos. Follow me on Twitter @cosmosdarwin.
Storage Spaces Direct in Windows Server 2016 and Windows Server 2019 features a built-in, persistent, read and write cache to maximize storage performance. You can read all about it at Understanding the cache in Storage Spaces Direct. In all-flash deployments, NVMe drives typically cache for SATA/SAS SSDs; in hybrid deployments, NVMe or SATA/SAS SSDs cache for HDDs.
In any case, the cache drives will serve the overwhelming majority of IO, including 100% of writes. This is essential to delivering the unrivaled performance of Storage Spaces Direct, whether you measure that in millions of IOPS, Tb/s of IO throughput, or consistent sub-millisecond latency.
But nothing is free: these cache drives are liable to wear out quickly.
Solid-state drives today are almost universally built from NAND flash, which wears out with use. Each flash memory cell can only be written so many times before it becomes unreliable. (There are numerous great write-ups online that cover all the gory details, including on Wikipedia.)
You can watch this happen in Windows by looking at the Wear reliability counter in PowerShell:
PS C:\> Get-PhysicalDisk | Get-StorageReliabilityCounter | Select Wear
Here’s the output from my laptop – my SSD is about 5% worn out after two years.
Note: Not all drives accurately report this value to Windows. In some cases, the counter may be blank. Check with your manufacturer to see if they have proprietary tooling you can use to retrieve this value.
Generally, reads do not wear out NAND flash.
Measuring wear is one thing, but how can we predict the longevity of an SSD?
Flash “endurance” is commonly measured in two ways: Drive Writes Per Day (DWPD) and Terabytes Written (TBW).
Both approaches are based on the manufacturer’s warranty period for the drive, its so-called “lifetime”.
Drive Writes Per Day (DWPD) measures how many times you could overwrite the drive’s entire size each day of its life. For example, suppose your drive is 200 GB and its warranty period is 5 years. If its DWPD is 1, that means you can write 200 GB (its size, one time) into it every single day for the next five years.
If you multiply that out, that’s 200 GB per day × 365 days/year × 5 years = 365 TB of cumulative writes before you may need to replace it.
If its DWPD was 10 instead of 1, that would mean you can write 10 × 200 GB = 2 TB (its size, ten times) into it every day. Correspondingly, that’s 3,650 TB = 3.65 PB of cumulative writes over 5 years.
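If you'd rather not do this multiplication by hand, here is a minimal Python sketch of it. The function name `dwpd_to_tbw` is made up for illustration, and it assumes decimal units (1 TB = 1,000 GB) and 365-day years, as in the examples above.

```python
DAYS_PER_YEAR = 365

def dwpd_to_tbw(dwpd, size_gb, warranty_years):
    """Cumulative terabytes written implied by a DWPD rating (decimal units)."""
    gb_written = dwpd * size_gb * DAYS_PER_YEAR * warranty_years
    return gb_written / 1000  # GB -> TB

print(dwpd_to_tbw(1, 200, 5))   # 365.0
print(dwpd_to_tbw(10, 200, 5))  # 3650.0
```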
Terabytes Written (TBW) directly measures how much you can write cumulatively into the drive over its lifetime. Essentially, it just includes the multiplication we did above in the measurement itself.
For example, if your drive is rated for 365 TBW, that means you can write 365 TB into it before you may need to replace it.
If its warranty period is 5 years, that works out to 365 TB ÷ (5 years × 365 days/year) = 200 GB of writes per day. If your drive was 200 GB in size, that’s equivalent to 1 DWPD. Correspondingly, if your drive was rated for 3.65 PBW = 3,650 TBW, that works out to 2 TB of writes per day, or 10 DWPD.
As you can see, if you know the drive’s size and warranty period, you can always get from DWPD to TBW or vice-versa with some simple multiplications or divisions. The two measurements are really very similar.
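Going the other way is just the division. A small sketch, again with a hypothetical function name and assuming decimal units and 365-day years:

```python
def tbw_to_dwpd(tbw, size_gb, warranty_years):
    """Drive writes per day implied by a cumulative TBW rating (decimal units)."""
    gb_per_day = tbw * 1000 / (365 * warranty_years)  # TB -> GB, spread over the lifetime
    return gb_per_day / size_gb

print(tbw_to_dwpd(365, 200, 5))   # 1.0
print(tbw_to_dwpd(3650, 200, 5))  # 10.0
```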
The only real difference is that DWPD depends on the drive’s size whereas TBW does not.
For example, consider an SSD which can take 1,000 TB of writes over its 5-year lifetime.
Suppose the SSD is 200 GB:
1,000 TB ÷ (5 years × 365 days/year × 200 GB) = 2.74 DWPD
Now suppose the SSD is 400 GB:
1,000 TB ÷ (5 years × 365 days/year × 400 GB) = 1.37 DWPD
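The same comparison can be sketched in a few lines of Python. The helper name `dwpd_by_size` is invented for this example; it holds TBW and warranty fixed and varies only the drive size, assuming decimal units and 365-day years.

```python
def dwpd_by_size(tbw, warranty_years, sizes_gb):
    """Map each drive size (GB) to the DWPD implied by one fixed TBW rating."""
    gb_per_day = tbw * 1000 / (365 * warranty_years)
    return {size: round(gb_per_day / size, 2) for size in sizes_gb}

for size, dwpd in dwpd_by_size(1000, 5, [200, 400]).items():
    print(f"{size} GB drive: {dwpd} DWPD")
# 200 GB drive: 2.74 DWPD
# 400 GB drive: 1.37 DWPD
```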
The resulting DWPD is different! What does that mean?
On the one hand, the larger 400 GB drive can do the exact same cumulative writes over its lifetime as the smaller 200 GB drive. Looking at TBW, this is very clear – both drives are rated for 1,000 TBW. But looking at DWPD, the larger drive appears to have just half the endurance! You might argue that, because both drives would last “the same” under the same workload, TBW is the better measurement.
On the other hand, you might argue that the 400 GB drive can provide storage for more workload because it is larger, and therefore its 1,000 TBW spreads more thinly, and it really does have just half the endurance! By this reasoning, using DWPD is better.
You can use the measurement you prefer. It is almost universal to see both TBW and DWPD appear on drive spec sheets today. Depending on your assumptions, there is a compelling case for either.
Our minimum recommendation for Storage Spaces Direct is listed on the Hardware requirements page. As of mid-2017, for cache drives:
Often, one of these measurements will work out to be slightly less strict than the other.
You may use whichever measurement you prefer.
There is no minimum recommendation for capacity drives.
You may be tempted to reason about endurance from IOPS numbers, if you know them. For example, if your workload generates (on average) 100,000 IOPS, each (on average) 4 KiB in size, of which (on average) 30% are writes, you may think:
100,000 × 30% × 4 KiB = approx. 120 MB/s of writes
120 MB/s × 60 secs/min × 60 mins/hour × 24 hours = approx. 10 TBW/day
If you have four servers with two cache drives each, that’s:
10 TBW/day ÷ (8 total cache drives) = approx. 1.25 TBW/day per drive
Interesting! Less than 4 TBW/day!
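Here is that back-of-the-envelope estimate as a small Python sketch. The function and parameter names are made up for illustration, and it assumes decimal terabytes and writes spread evenly across the cache drives. Without rounding at each step, it lands at roughly 1.33 TB per drive per day rather than 1.25. Either way, it only counts writes as the application sees them.

```python
SECONDS_PER_DAY = 60 * 60 * 24

def naive_tb_per_day_per_drive(iops, write_fraction, io_size_bytes, cache_drives):
    """Application-level write volume per cache drive per day, in decimal TB.

    Naive on purpose: it ignores everything that happens below the application.
    """
    write_bytes_per_sec = iops * write_fraction * io_size_bytes
    tb_per_day = write_bytes_per_sec * SECONDS_PER_DAY / 1e12
    return tb_per_day / cache_drives

# 100,000 IOPS, 30% writes, 4 KiB each, four servers with two cache drives each
estimate = naive_tb_per_day_per_drive(100_000, 0.30, 4 * 1024, 8)
print(f"{estimate:.2f} TB/day per drive")
```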
Unfortunately, this is flawed math because it does not account for write amplification.
Write amplification is when one write (at the user or application layer) becomes multiple writes (at the physical device layer). Write amplification is inevitable in any storage system that guarantees resiliency and/or crash consistency. The most blatant example in Storage Spaces Direct is three-way mirror: it writes everything three times, to three different drives.
There are other sources of write amplification too: repair jobs generate additional IO; data deduplication generates additional IO; the filesystem, and many other components, generate additional IO by persisting their metadata and log structures; etc. In fact, the drive itself generates write amplification from internal activities such as garbage collection! (If you're interested, check out the JESD218 standard methodology for how to factor this into endurance calculations.)
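To get a feel for how quickly this adds up, here is a rough sketch that scales an application-level figure by amplification factors. Only the three-way mirror factor of 3 comes from the text above. The `other_amplification` parameter is a hypothetical placeholder for everything else (metadata, repair jobs, the drive's own garbage collection), and the value of 1.5 in the example is invented purely for illustration, not a measurement.

```python
def device_tb_per_day(app_tb_per_day, mirror_copies=3, other_amplification=1.0):
    """Scale application-level writes to an estimate of device-level writes.

    mirror_copies: 3 for three-way mirror (every write lands on three drives).
    other_amplification: hypothetical catch-all factor; real values vary by workload.
    """
    return app_tb_per_day * mirror_copies * other_amplification

# The naive 1.25 TB/day per drive, with three-way mirroring only
print(device_tb_per_day(1.25))                           # 3.75
# Same, with an invented 1.5x factor for all other overheads
print(device_tb_per_day(1.25, other_amplification=1.5))  # 5.625
```

Even with mirroring alone, the estimate roughly triples, which is why the application-level number on its own says little about drive wear.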
This is all necessary and good, but it makes it difficult to derive drive-level IO activity at the bottom of the stack from application-level IO activity at the top of the stack in any consistent way. That’s why, based on our experience, we publish the minimum DWPD and TBW recommendation.
Let us know what you think! 🙂