Dec 24 2017 07:43 AM
Good day all!
I'm trying to troubleshoot some S2D performance issues (or so I assume they are issues). I'm seeing very low write speeds.
My Setup:
4x Nodes with each node having
- 4x 2TB Samsung 850 Pro
- 8x 4TB HGST 7.2k RPM spinning rust
- 64GB Memory
- E5-2670v2 processors
- LSI/Avago/Broadcom/Whoever they are today 9300-8i HBA running latest firmware
- 2x Mellanox ConnectX-3 flashed with 2.4.5030 firmware (read that the latest was buggy)
My initial configuration was 100% 3-way mirror of HDD with SSD as cache. Saw poor performance so I yanked all the HDD and created a 3-way mirror out of just SSD. Performance wasn't much better.
Read performance isn't horrible, but when transferring a VHDX to S2D I was only getting about 75-100MB/s. I realize a file transfer is not an optimal test, but at those speeds it will literally take ages for me to live migrate VMs to this storage. I could also tell that the VMs already running on S2D were impacted by the import of other VMs, so I'm fairly certain something is wrong.
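A file copy mostly exercises the buffered, single-threaded copy path rather than the storage stack, so a synthetic tool such as DiskSpd usually gives a more representative number. A minimal sketch, assuming Microsoft's diskspd.exe is present on a node and the CSV path is adjusted to your volume (path, file size, and IO mix here are illustrative assumptions):

```powershell
# Run from one cluster node against a CSV path (path is an assumption).
# -b4K  : 4 KiB IO size             -d60 : run for 60 seconds
# -o32  : 32 outstanding IOs        -t4  : 4 worker threads
# -w30  : 30% writes, 70% reads     -r   : random IO
# -Sh   : disable software and hardware write caching
# -L    : capture latency statistics
# -c10G : create a 10 GiB test file
.\diskspd.exe -b4K -d60 -o32 -t4 -w30 -r -Sh -L -c10G C:\ClusterStorage\Volume1\diskspd-test.dat
```

Comparing a run like this against the 75-100MB/s file-copy number helps separate a genuine storage problem from a copy-engine artifact.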
All servers are reporting RDMA = True on all NICs.
If I do a live migration from source storage to a different destination, transfer rates are 600-700MB/s, so I know the source is not the bottleneck.
Any ideas as to what I may be doing wrong?
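One thing worth noting: RDMA = True on the adapter only means the NIC is capable, not that SMB Direct is actually carrying the traffic. A sketch of checks to confirm it end to end (output columns can vary by OS build):

```powershell
# NIC-level: is RDMA enabled and operational on each adapter?
Get-NetAdapterRdma | Format-Table Name, Enabled

# SMB-level: does the SMB client see the interfaces as RDMA-capable?
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable

# Live traffic: during a transfer these counters should be non-zero
# if SMB Direct is actually in use.
Get-Counter '\RDMA Activity(*)\RDMA Inbound Bytes/sec',
            '\RDMA Activity(*)\RDMA Outbound Bytes/sec'
```

If the RDMA Activity counters stay at zero during a migration, traffic is falling back to plain TCP, which would fit the symptoms described above.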
Dec 21 2020 09:46 AM
I'm with you on trying to understand what normal performance looks like, and how to test or fine-tune it.
My old 2012 SATA-HDD-only arrays on hardware RAID 6 are still kicking the pants off my four-node 2019 S2D cluster with NVMe+SSD+HDD (NVMe journal, 3-way mirror performance tier, dual-parity capacity tier, which is what S2D does when you have four nodes; formatted ReFS if that matters). Each node has 2 x 10G NICs for cluster-only traffic and a 2G NIC team for ClusterAndClient (I'm thinking of adding a 1G NIC for ClusterAndClient and making the NIC team non-cluster traffic). I've even set my 10G NICs to allow 4K jumbo packets, and my network latency is definitely under 5ms (and all NICs are RDMA). Yes, yes, all very vague, but conceptually it's a proper S2D setup with no Test-Cluster problems.
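Enabling jumbo frames on the NICs only helps if every hop honors them, so a quick end-to-end check is a ping with the don't-fragment bit set at the jumbo payload size. A sketch, assuming a 4088-byte jumbo MTU (addresses and node names are placeholders; subtract the 28-byte ICMP/IP header from the MTU):

```powershell
# For a 4088-byte MTU the largest unfragmented ICMP payload is 4060 bytes.
# If this fails but a small ping succeeds, some hop is dropping jumbo frames.
ping 10.0.0.12 -f -l 4060

# Windows PowerShell 5.1 equivalent (node name is a placeholder):
Test-Connection -ComputerName node2-cluster -BufferSize 4060 -DontFragment
```

A cluster that silently fragments or drops jumbo frames can show exactly this kind of "everything looks configured, performance is still bad" behavior.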
I'm unsure if I'm getting a lot of cache misses: I see maybe one to twenty every couple of seconds, but note that I'm dumping everything on it for a sustained period (robocopy seeding to a dozen CSVs/virtual disks for weeks). I don't know whether we're supposed to see none at all to be right-sized; i.e. I only have a pair of NVMe in each node, so maybe I should up that to four? (Overkill won't hurt once my migration is done, and normal user load would have been fine with just a pair per node.) See ClusterNode.SblCache.Iops.Read.Miss below, which makes sense: 80/s would be the cluster total, given I see maybe 20 on any one node via perfmon. But why doesn't Get-ClusterPerf show a stat for cache writes (which is what I think my problem might be)? I'm guessing that cache Size.Dirty not matching Size.Total means I'm not using all of my cache (bad word, "dirty"; it should be "currentlyBeingUsed")?
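For what it's worth, the default Get-ClusterPerf output is only a last-value summary of a subset of series; the performance history holds more, queryable by name over time, and write-side cache activity tends to surface in perfmon counter sets rather than in that summary. A sketch, assuming the documented series/parameter names for S2D performance history (verify the exact names on your build):

```powershell
# Pull a specific series over time instead of the one-line summary
# (TimeFrame values include LastHour, LastDay, LastWeek).
Get-ClusterNode |
    Get-ClusterPerf -ClusterNodeSeriesName ClusterNode.SblCache.Iops.Read.Miss -TimeFrame LastHour

# Discover which cluster-storage perfmon counter sets exist on a node;
# the hybrid-disk (SBL cache) sets include write-side activity.
Get-Counter -ListSet 'Cluster Storage*' | Select-Object -ExpandProperty Counter
```

Watching the miss series over an hour, rather than the instantaneous value, makes it much easier to tell whether the cache is genuinely undersized during the robocopy seeding.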
My performance tier is 10% of the capacity tier; I could add four more SSDs to each node to get that up to 12%, so I'm unsure if I should add more.
I'm not running Storage Replica (if I had an equal cluster it would be a fabulous just-in-case server), but I am running dedup (CPU and memory usage are always low); not running any VMs right now.
PS> Get-ClusterPerf
Series Time Value Unit
------ ---- ----- ----
ClusterNode.Cpu.Usage 12/21/2020 11:31:19 10.15 %
ClusterNode.Cpu.Usage.Host 12/21/2020 11:31:19 10.15 %
ClusterNode.CsvCache.Iops.Read.Hit 12/21/2020 11:31:20 0 /s
ClusterNode.CsvCache.Iops.Read.HitRate 12/21/2020 11:31:20 100 %
ClusterNode.CsvCache.Iops.Read.Miss 12/21/2020 11:31:20 0 /s
ClusterNode.Memory.Available 12/21/2020 11:31:19 1.17 TB
ClusterNode.Memory.Total 12/21/2020 11:31:19 1.5 TB
ClusterNode.Memory.Usage 12/21/2020 11:31:19 340.48 GB
ClusterNode.Memory.Usage.Host 12/21/2020 11:31:19 340.48 GB
ClusterNode.SblCache.Iops.Read.Hit 12/21/2020 11:31:20 943 /s
ClusterNode.SblCache.Iops.Read.HitRate 12/21/2020 11:31:20 92.14 %
ClusterNode.SblCache.Iops.Read.Miss 12/21/2020 11:31:20 80 /s
PhysicalDisk.Cache.Size.Dirty 12/21/2020 11:31:16 3.75 TB
PhysicalDisk.Cache.Size.Total 12/21/2020 11:31:16 22.9 TB
PhysicalDisk.Capacity.Size.Total 12/21/2020 11:31:25 1.38 PB
PhysicalDisk.Capacity.Size.Used 12/21/2020 11:31:25 730.61 TB
Volume.IOPS.Read 12/21/2020 11:31:25 9 /s
Volume.IOPS.Total 12/21/2020 11:31:25 397 /s
Volume.IOPS.Write 12/21/2020 11:31:25 388 /s
Volume.Latency.Average 12/21/2020 11:31:25 16.09 ms
Volume.Latency.Read 12/21/2020 11:31:25 3.06 ms
Volume.Latency.Write 12/21/2020 11:31:25 16.38 ms
Volume.Size.Available 12/21/2020 11:31:25 47.82 TB
Volume.Size.Total 12/21/2020 11:31:25 298.26 TB
Volume.Throughput.Read 12/21/2020 11:31:25 727.8 KB/S
Volume.Throughput.Total 12/21/2020 11:31:25 19.45 MB/S
Volume.Throughput.Write 12/21/2020 11:31:25 18.74 MB/S
Not an answer, just joining the conversation (and nudging your years-old stale question).