BSOD on Server 2016 hal.dll (WATCHDOG_VIOLATION)


We have a client with a Windows Server 2016 Standard installation that blue-screens at random times. It happens only three times a month, but since the server holds important roles (Hyper-V, AD, DNS, DHCP) we're investigating the issue. I am aware that you'd normally want to split some of the roles, but that is not within our power to change.

 

We ran BlueScreenView, which reports hal.dll together with ntoskrnl.exe as the culprit for each of the blue screens that we still have a .dmp file for. Looking into the MEMORY.dmp file, I can see:
DPC_WATCHDOG_VIOLATION (133)
DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED
DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

Unfortunately I'm unable to dig deeper into this, as I have never had to before. I ran some commands in the Windows debugger and uploaded all the output in the attached text file. Is anyone experienced with troubleshooting this kind of problem?
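For anyone skimming a long debugger text file like the one attached, here is a minimal Python sketch that pulls out the bug check banner and the `KEY:  value` fields quoted above. It assumes the text was saved from `!analyze -v` style output; the function name `summarize_analysis` and the regular expressions are illustrative and not part of any debugger tooling.

```python
import re

# Matches "KEY:  value" lines such as "DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT".
FIELD_RE = re.compile(r"^([A-Z][A-Z0-9_]+):\s+(\S.*)$")
# Matches the bug check banner, e.g. "DPC_WATCHDOG_VIOLATION (133)".
BUGCHECK_RE = re.compile(r"^([A-Z][A-Z0-9_]+) \(([0-9a-fA-F]+)\)\s*$")


def summarize_analysis(text):
    """Collect the bug check name/code and the first value of each KEY: field."""
    summary = {"fields": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = BUGCHECK_RE.match(line)
        if m and "bugcheck" not in summary:
            summary["bugcheck"] = m.group(1)
            # The debugger prints the bug check code in hexadecimal.
            summary["code"] = int(m.group(2), 16)
            continue
        m = FIELD_RE.match(line)
        if m:
            # Keep the first occurrence of each field.
            summary["fields"].setdefault(m.group(1), m.group(2).strip())
    return summary
```

To use it, read the saved debugger output into a string and pass it to `summarize_analysis`. Fields such as the faulting module only show up if they are present in the saved output.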

 

We couldn't run sfc /scannow at first, so we repaired the image with DISM using a local source and then ran sfc /scannow successfully. The repaired files pointed towards Windows Defender, and I do not think it's related. Good to know: Symantec is running on the server, and Defender real-time protection is off.
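To double-check which files sfc actually touched, its entries in CBS.log (normally under %windir%\Logs\CBS) are tagged with `[SR]`. A minimal Python sketch for filtering them follows; the helper names are hypothetical, and the "repair" keyword match is an assumption about the log wording rather than a documented format.

```python
def sfc_entries(cbs_log_text):
    """Return the CBS.log lines written by System File Checker (tagged [SR])."""
    return [line.rstrip() for line in cbs_log_text.splitlines() if "[SR]" in line]


def repair_entries(cbs_log_text):
    """Narrow the [SR] lines down to those that mention a repair (assumed wording)."""
    return [line for line in sfc_entries(cbs_log_text) if "repair" in line.lower()]
```

Read the log with something like `open(path, encoding="utf-8", errors="replace").read()` from an elevated session and pass the text to `repair_entries`.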

 

All I need is a push in the right direction, and then I'll dig into it myself :)

 

Thanks,

Dennis

 

5 Replies

Host or guest BSOD? Hopefully the Hyper-V role is the only role on the host and Active Directory Domain Services is on a separate Windows VM instance.

 

 

@Dave Patrick Host BSOD, unfortunately. And nope, all the roles are installed on the host itself. It is not something done by me or my colleagues, and it's not within our power to change. The host is powerful enough to virtualise the roles, but the client is reluctant to do so.

 

So the system is more or less contaminated with (at least) the following:

- AD/DNS/DHCP/IIS/NAP roles installed

- Symantec antivirus manager (not just the client)

 

This makes it all the harder to really guarantee a stable system, even if we end up finding what causes this. The host has not crashed since we last ran some basic repair commands, but we'll keep an eye on it.

 

 

I'd work to move the roles (other than Hyper-V) off the host by standing up the required guests. You can do this rather easily. I'd use the dcdiag / repadmin tools to verify health, correcting all errors found before starting. Then stand up the new guest, patch it fully, license it, join it to the existing domain, add Active Directory Domain Services, and promote it, also making it a GC (recommended). Transfer the FSMO roles over (optional), transfer the PDC emulator role (optional), and use the dcdiag / repadmin tools to verify health again. When all is good you can decommission / demote the old one.
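The "verify health" steps above can be partly scripted. Below is a minimal Python sketch that lists the tests dcdiag reports as failed. It assumes the English-language summary lines of the form `......................... DC01 failed test SystemLog`; the function name `failed_tests` and the sample server name are hypothetical.

```python
import re

# dcdiag summary lines look like:
#   "......................... DC01 passed test Connectivity"
RESULT_RE = re.compile(r"^\.+\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)\s*$")


def failed_tests(dcdiag_output):
    """Return (target, test_name) pairs for every test dcdiag marked as failed."""
    failures = []
    for line in dcdiag_output.splitlines():
        m = RESULT_RE.match(line.strip())
        if m and m.group(2) == "failed":
            failures.append((m.group(1), m.group(3)))
    return failures
```

On a domain controller the input could come from `subprocess.run(["dcdiag"], capture_output=True, text=True).stdout`. An empty result only means no summary line said "failed", so the full output is still worth reading before and after the migration.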

 

As for the BSOD issues, I'd check here and with the manufacturer about support for Server 2019:

https://www.windowsservercatalog.com/

 

Also check with the manufacturer for the latest ROM BIOS, firmware, chipset and driver support pack.

 

 

He already said he can't do that... and whilst I agree it is far from ideal having them all on the same box, it shouldn't be causing a BSOD, should it? So just moving the DC role to another machine is almost certainly not going to fix this anyway.

 

It seems much more likely this is from a hardware driver. @DLans would it be possible to ZIP the MEMORY.dmp file and upload it somewhere for us to take a look at? You've done a great job with that initial text file showing various outputs from the debugger, but there are a few more things I want to take a look at, and it will be a lot quicker to explore the file than to relay commands and results back and forth between us on here.

In my experience you always fix everything you know is wrong as first steps, then work from there.