Forum Discussion
Persistent problem with DPMRA.EXE crashing and multiple inconsistent replicas in DPM 2019
We use DPM 2019 to protect 16 Windows servers, and for the last year it has been running with no issues. Recently, DPM has started reporting inconsistent replicas for numerous servers (currently 8 servers with 11 datasources). I run consistency checks on them, and these consistently fail. The failure is linked to DPMRA.EXE crashing, which writes an Application Error (Event ID 1000) to the Application event log, citing DPMRA.exe as the Faulting Application and KERNELBASE.dll as the Faulting Module Name.
This first happened at the beginning of last month. I copied dpmra.exe from another DPM server, which did not help, and after a long period of troubleshooting I eventually restored the DPM database. That appeared to resolve the issue, but now that it has recurred I am reluctant to go down this path again.
I have tried throttling the DPM clients to 70 Mbps and then to 50 Mbps, but this hasn't helped. All clients are on the local LAN; the DPM server has a 10 Gb NIC, while the protected servers have 1 Gb connections. The problem coincides with Windows update week, but I'm not convinced that is the cause: in both cases it started 1 to 4 days after the servers were rebooted, and I would have expected it to happen much sooner.
The DPMRACurr.errlog files are not much help: they log so many benign errors that it is really hard to work out what is an issue and what isn't. What I have seen recurring in the logs are entries like these:
62A4 41D8 10/05 07:04:29.876 31 failedfilehelper.cpp(548) <REDACTED> WARNING Assertion Failure :: flatRecord->dwFilePathLen > 0
62A4 41D8 10/05 07:04:29.876 31 failedfilestable.cpp(357) [<REDACTED-2>] <REDACTED> NORMAL Hr: = [0x00000000] CFailedFilesTable::Loading Failed File Record: Id:0, File:
62A4 41D8 10/05 07:04:29.876 31 failedfilehelper.cpp(333) <REDACTED> FATAL Process Abort: Exp: = "(recordSize == 0)" Hr: = [0x80070057]
62A4 41D8 10/05 07:04:29.907 22 watsonintegration.cpp(73) <REDACTED> NORMAL Inside Watson Handler
I have looked online and can find nothing relating to this. There are also multiple DPMRACurr.errlog.<timestamp>.Crash files in the log directory, one for each time dpmra.exe crashes (when it does, it does not crash the DPM console, it just fails the job).
Does anyone have any suggestions, or has anyone seen the same issue before? I am going around in circles at the moment and need to fix the underlying issue rather than just restoring the database, as it seems to be a recurring incident.
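For anyone else wading through these logs, here is a rough Python sketch for pulling out only the entries that look relevant. The field layout in the regular expression is an assumption inferred purely from the four lines quoted above (hex process and thread IDs, date, time, a number, source file and line, redacted tokens, severity, message); it is not a documented DPM log format. The function names and the keyword list are likewise made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: this layout is guessed from the excerpt in this thread only,
# not taken from any official description of the DPMRA errlog format.
LINE_RE = re.compile(
    r"^(?P<pid>[0-9A-Fa-f]+)\s+(?P<tid>[0-9A-Fa-f]+)\s+"
    r"(?P<date>\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<code>\d+)\s+(?P<source>\S+\.cpp)\((?P<line>\d+)\)\s+"
    r"(?P<middle>.*?)\s*\b(?P<severity>NORMAL|WARNING|FATAL)\b\s+(?P<message>.*)$"
)

# The four lines quoted in the post, used here as sample input.
SAMPLE = [
    '62A4 41D8 10/05 07:04:29.876 31 failedfilehelper.cpp(548) <REDACTED> WARNING Assertion Failure :: flatRecord->dwFilePathLen > 0',
    '62A4 41D8 10/05 07:04:29.876 31 failedfilestable.cpp(357) [<REDACTED-2>] <REDACTED> NORMAL Hr: = [0x00000000] CFailedFilesTable::Loading Failed File Record: Id:0, File:',
    '62A4 41D8 10/05 07:04:29.876 31 failedfilehelper.cpp(333) <REDACTED> FATAL Process Abort: Exp: = "(recordSize == 0)" Hr: = [0x80070057]',
    '62A4 41D8 10/05 07:04:29.907 22 watsonintegration.cpp(73) <REDACTED> NORMAL Inside Watson Handler',
]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def interesting_entries(lines, severities=("FATAL",),
                        keywords=("Assertion Failure", "Watson")):
    """Keep entries with a chosen severity or a chosen keyword in the message."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that does not fit the assumed layout
        if entry["severity"] in severities or any(
            keyword in entry["message"] for keyword in keywords
        ):
            hits.append(entry)
    return hits


def count_by_source(entries):
    """Count hits per source location, e.g. 'failedfilehelper.cpp(333)'."""
    return Counter(f'{e["source"]}({e["line"]})' for e in entries)


if __name__ == "__main__":
    for entry in interesting_entries(SAMPLE):
        print(entry["time"], entry["severity"], entry["source"], entry["message"])
```

To run it against a real log, pass the file's lines to `interesting_entries` instead of `SAMPLE`, opening the file with whatever encoding it actually uses. Counting by source location may help separate the benign noise from the entries that only show up around a crash.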
2 Replies
- Chris48 (Copper Contributor)
As an update to this: I stopped all jobs from running during the troubleshooting process and used PowerShell to view the logs in real time on both the DPM server and the clients while running consistency checks, one at a time, on the previously failed jobs. From this I noticed that the Watson error coincides with the DPMRA.exe Application Error in the event log, although the backup job does not fail immediately. After cancelling the failed job, I removed the datasource from its Protection Group, importantly ensuring no disk or tape data was retained, and continued this process until all offending datasources had been identified and removed (2 needed removing). After this, the remaining failed jobs all completed successfully, and once they had, I could add the two problem datasources back into their Protection Groups.
This has worked around the problem for me and I now have a functioning DPM server, but as to why those datasources caused the problem, I am still none the wiser. This is the second time in 2 months this has happened, and I don't know if the same datasources caused the previous problem (a DPM database restore was how I got around it last time), but I will monitor this and check next month to see if it happens again.
If anyone has any ideas as to why this may be occurring, or even how to investigate the cause further, that would be great. Also please let me know if anyone else experiences this problem. I have worked with DPM for a number of years now, since DPM 2010, and have not had many issues - certainly not ones as disruptive as this. I do find the DPM logs not the most intuitive, as they seem to record a lot of warnings and errors as part of normal behaviour, which makes using them for troubleshooting quite hard at times. They are also not the easiest to decipher, which adds to the frustration.
- bchapman65536 (Copper Contributor)
I had the same thing happen, and it began with DPM 2019 UR4.
I thought I had isolated it to certain replicas containing files/folders with paths longer than 250 characters.
Stopping protection of those members usually mitigates the issue, whereas putting them back into protection causes DPMRA.exe to crash every 15 minutes (according to the Event Viewer's Application log).
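If you want to test the long-path theory on a protected volume, a minimal Python sketch like the one below lists every file or folder whose full path exceeds a given length. The 250-character threshold comes from the observation above and is not a confirmed DPM limit. The function name `find_long_paths` is made up for illustration, and the length is measured on the path as you pass it in, which may differ from how DPM sees the path on the replica.

```python
import os


def find_long_paths(root, limit=250):
    """Yield (length, path) for each file or folder under root whose
    full path is longer than `limit` characters.

    ASSUMPTION: 250 is the threshold suspected in this thread, not a
    documented DPM limit. On Windows, walking very deep trees may need
    long-path support enabled or a root given with the \\\\?\\ prefix.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full_path = os.path.join(dirpath, name)
            if len(full_path) > limit:
                yield len(full_path), full_path


if __name__ == "__main__":
    import sys

    # Usage: python find_long_paths.py <root-folder>
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for length, path in sorted(find_long_paths(start), reverse=True):
        print(length, path)
```

Running this on the protected server against the volumes or folders in the affected datasources should show whether the problem members really are the ones with long paths, which would at least narrow down what to exclude or shorten.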
I've also had this happen when using one DPM server for secondary protection of a primary server's protected items (or in some cases, just the database of the primary server).
Either way, it's bad design to have a backup destination crash your entire backup product.