vNext 25951 deduplication profile "VDI/Hyper-V" data corruption (applies to Server 2022 too BTW)

Hello Community!

Edit: It seems to be a race condition. Reproducing it needs a rather modern and fast CPU, otherwise the problem does not show up at all. My i7-4960x is borderline: the problem does not always show up there.

Problem: Using the deduplication profile "VDI" (internally called "hyperv") causes data corruption and, in some cases, filesystem errors at the host level.

Ready-to-reproduce package with a clean, fresh ENGLISH vNext install: https://joumxyzptlk.de/tmp/microsoft/SNEXT-25951.7z

Follow the text file on the desktop. If you skip the "configure deduplication as VDI for drive D: + Start-Dedupjob" step you won't have issues, it will work fine.

The behaviour is the same on current Server 2022 and vNext 25951: https://joumxyzptlk.de/tmp/microsoft/S2022-Nested_Deduplication_VDI-Hyperv_profile_kills_filesystem_...

Reproduced on Ryzen 5950x with ECC RAM and i7-4960x, quite different hardware.

The insider feedback hub is still broken... Sadly.

EDIT: How to reproduce (contents of the text file on the desktop of that VM):

Creation host: Ryzen 5950x, 64 GB ECC RAM, Windows 11 21H2.
(Issue reproduced on host: Intel i7-4960x, 32 GB RAM (non-ECC), Server 2022 on host.)
This guest: Standard as you can see except for:
"Set-VMProcessor -VMName SNEXT-25931 -ExposeVirtualizationExtensions:$true" and "Allow network spoofing"
No second VHDX yet.
Server 25931 VM PREPARATION:
- Standard install.
- Two GPEDIT.MSC changes: allow empty passwords, and Lanman Workstation "Enable insecure Guest logons",
which makes copying those test VMs easier on my home network.
- Be in a network which allows internet access for Windows Updates, including for the VMs.
- Windows/Microsoft updates right after installation.
- Activate the Deduplication and Hyper-V roles, do not configure deduplication. Reboot.
- Add VHDX for second drive, add folder \Hyper-V.
- Copy two test VMs, here two Server 2012 R2 VMs with an update state of 6th June 2022,
and import them into Hyper-V.
- Remove the Hyper-V role, because depending on the host CPU a different virtualization setup
seems to be chosen when that role is activated.
- Update 2. Sept 2023: In-Place upgrade to Server vNext 25941.
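
The role-related preparation steps above could also be scripted. This is only a minimal sketch: the original preparation was done through the GUI, the feature names FS-Data-Deduplication and Hyper-V are assumed to apply to the build in use, and the second VHDX is assumed to be mounted as D: (the drive letter used in the import step).

```powershell
# Sketch only: run inside the guest VM from an elevated PowerShell session.
# Install the Data Deduplication and Hyper-V roles without configuring deduplication.
Install-WindowsFeature -Name FS-Data-Deduplication, Hyper-V -IncludeManagementTools

# Reboot to finish the role installation.
Restart-Computer

# Run after the reboot, once the second VHDX has been added (assumed to be D:).
New-Item -ItemType Directory -Path 'D:\Hyper-V' -Force
```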

This is the export you see now.

These are the steps to reproduce the issue from here on, see screenshots attached / YouTube video link:
- Activate Hyper-V in this VM. Do not configure a virtual switch during that step. Reboot.
- Import the two machines in drive D:\Hyper-V into Hyper-V Manager.
- Configure the virtual switch. You may have to untick the "Hyper-V Extensible Virtual Switch" in the
network properties of the adapter, a misbehaviour of the host OS. Connect the nested VMs to that virtual switch.
- Activate deduplication for second volume, choose template "VDI", no adjustment of anything.
- Run from PowerShell: Start-DedupJob -Volume <volume> -Type Optimization -Full -Wait
- Wait until it is finished.
- Start the two test VMs and try to run Windows Update. This step is important, because the
corruption occurs in deduplicated files which then get written to.
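
The two deduplication steps above can be sketched in PowerShell as follows. This assumes the second volume is D: and that the "VDI" template in the GUI corresponds to -UsageType HyperV, as described at the top of this post. Nothing else is adjusted.

```powershell
$volume = 'D:'   # assumption: the second (non-system) volume holding \Hyper-V

# Enable deduplication with the "VDI" profile (internally "HyperV").
Enable-DedupVolume -Volume $volume -UsageType HyperV

# Run a full optimization and block until it is finished.
Start-DedupJob -Volume $volume -Type Optimization -Full -Wait

# Optional check (not part of the original steps): show the optimization
# statistics before starting the nested VMs.
Get-DedupStatus -Volume $volume | Format-List
```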

My test result:
- https://joumxyzptlk.de/tmp/microsoft/S2022-Nested_Deduplication_VDI-Hyperv_profile_kills_filesystem_...
- Both machines may blue screen during Windows Updates or show weird errors. Both may run into a recovery boot loop.
- Both have defective filesystems, if they manage to boot at all instead of getting stuck in a boot loop.
- After the blue screen disaster: Start-Dedupjob -Volume <volume> -Type Optimization -Full -Wait may not work anymore.
- Get-DedupVolume | fl still shows valid statistics even if Start-Dedupjob fails.
- CHKDSK /f may show file system damage, usually in <NUMBER>.ccc and <NUMBER>.cd files of the deduplication
chunk store, which usually resides in \System Volume Information\Dedup\ChunkStore\{UNIQUE IDENTIFIER}.ddp\Data
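
The checks in the result list above can be run in one go after the nested VMs have crashed. This is a rough sketch, again assuming the deduplicated volume is D:. The scrubbing job at the end was not part of the original test and is only a suggestion for an additional integrity check of the chunk store.

```powershell
$volume = 'D:'   # assumption: the deduplicated volume

# Statistics may still look valid even though the optimization job fails.
Get-DedupVolume -Volume $volume | Format-List

# Check the host-level filesystem. CHKDSK may ask to dismount the volume.
chkdsk $volume /f

# Not part of the original test: a full scrubbing job validates the chunk store.
Start-DedupJob -Volume $volume -Type Scrubbing -Full -Wait
```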

Counter verification:
- Run Windows Updates on those VMs without deduplication. It will work fine.

How I discovered that bug:
I have a Server 2019 test server VM with nested VMs, with many test OS variants installed there. I upgraded that VM to Server 2022. This is where I ran into the issues with Windows Updates within the nested VMs, and then traced it down to Server 2022 corrupting the filesystem on the host level.

Kind regards,

Joachim Otahal
Germany

1 Reply

A little update on this: I could reproduce it on an HP DL380 Gen10 with dual Xeon Gold 6226R, full flash and 1 TB RAM, running Server 2019 on the host.
It seems this bug only surfaces on reasonably fast machines, and shows up with a higher probability when the "Ready-To-Reproduce" package is NOT on C:. The i7-4960x seems to be the lower limit: there it does not always show up when using the tiered SSD/HDD storage, but always when using SSD storage. On my Ryzen it shows up on any SSD storage except when the package is on C:. On the above-mentioned DL380, C: is not big enough, so it was a non-system volume there too.
Ready-To-Reproduce Package for Server 2022 for a Server 2019 host (didn't have time to do that for insider server yet): https://joumxyzptlk.de/tmp/microsoft/S2022-nested-2023-09-30-exported-from-S2019-host.7z
That package does not require an internet connection to reproduce, therefore no LAN is configured for that VM.