First published on TechNet on Jun 01, 2012
[This is another guest post from our pal Mark in Oz. Even if you don’t care about DFSR, I highly recommend this post; it teaches some very clever log analysis techniques, useful in a variety of troubleshooting scenarios – The Neditor]
Hi there! It’s Mark Renoden – Premier Field Engineer in Sydney, Australia – here again. Today I’m going to talk about an issue where a customer's DFSR environment lost file updates and they’d see regular alerts in SCOM about replication backlog. While I was on site working with them, I came up with a few creative ideas about how to use the DFSR debug logs that led us to the root cause.
The problem at hand was that a large DFS replication backlog would accumulate from time to time. Finding the root cause meant understanding the trend in changes to files in the replica. To do this, we needed to use the debug logs as our data source: to manipulate them so that they would tell the story.
With the aid of some custom scripts and tools, and a lab environment, I’m going to simulate their experience and talk through the investigation.
Test Lab Setup
The test lab I’ve used for this post is pretty simple: I’ve got two file servers, DPS1 and DPS2 configured with a single replication group called RG1 that replicates C:\ReplicatedFolder between the servers. There are 100,000 files in C:\ReplicatedFolder.
Prepare for Battle
In Ned’s previous post Understanding DFSR Debug Logging (Part 1: Logging Levels, Log Format, GUID’s) , the various options for debug log verbosity are discussed. For this scenario, the only one I’ll change is the number of debug log files. 1000 is generally a good number to choose for troubleshooting purposes:
Harvest the Data
After reproducing the symptoms, we want to harvest the logs from the server that has pending replication. When on site at the customer, I just copied DFSR*.log.gz from the %windir%\debug folder, but the best possible practice would be to stop DFSR, copy the logs and then start the service again. This would prevent log rollover while you harvest the logs.
After you copy the logs for investigation, they need to be un-g-zipped. Use your favourite gzip-aware decompression utility for that.
Understand the Log Format
Before we can mine the debug logs for interesting information, we need to look at what we’re dealing with. Opening up one of the logs files, I want to look for a change and to understand the log format –
20120522 12:39:57.764 2840 USNC 2450 UsnConsumer::UpdateIdRecord LDB Updating ID Record:
+ fid 0x2000000014429
+ usn 0x693e0ef8
+ uidVisible 1
+ filtered 0
+ journalWrapped 0
+ slowRecoverCheck 0
+ pendingTombstone 0
+ internalUpdate 0
+ dirtyShutdownMismatch 0
+ meetInstallUpdate 0
+ meetReanimated 0
+ recUpdateTime 20120521 01:04:21.513 GMT
+ present 1
+ nameConflict 0
+ attributes 0x20
+ ghostedHeader 0
+ data 0
+ gvsn {5442ADD7-04C7-486B-B665-2CB036997A67}-v937024
+ uid {5442ADD7-04C7-486B-B665-2CB036997A67}-v615973
+ parent {8A6CF487-2D5A-456C-A235-09F312D631C8}-v1
+ fence Default (3)
+ clockDecrementedInDirtyShutdown 0
+ clock 20120522 02:39:57.764 GMT (0x1cd37c42d5a9268)
+ createTime 20120516 00:41:05.011 GMT
+ csId {8A6CF487-2D5A-456C-A235-09F312D631C8}
+ hash 00000000-00000000-00000000-00000000
+ similarity 00000000-00000000-00000000-00000000
+ name file0000021380.txt
+
20120522 12:39:59.326 2840 USNC 2453 UsnConsumer::UpdateIdRecord ID record updated from USN_RECORD:
+ USN_RECORD:
+ RecordLength: 96
+ MajorVersion: 2
+ MinorVersion: 0
+ FileRefNumber: 0x2000000014429
+ ParentFileRefNumber: 0x4000000004678
+ USN: 0x693e0ef8
+ TimeStamp: 20120522 12:39:57.764 AUS Eastern Standard Time
+ Reason: Close Security Change
+ SourceInfo: 0x0
+ SecurityId: 0x0
+ FileAttributes: 0x20
+ FileNameLength: 36
+ FileNameOffset: 60
+ FileName: file0000021380.txt
What I can see here is a local database update followed by the USN record update that triggered it. If I can gather together all of the date stamps for USN record updates, perhaps I can profile the change behaviour on the file server…
The command above finds every line in every log file that contains USN_RECORD: and then excludes lines that contain a + (thereby eliminating occurrences of + USN_RECORD: as seen in the log excerpt above). Finally, it directs that output into USN.csv.
Let’s open our CSV file in Excel and see what we can do with it.
Graph the Data
I'd ideally like to graph the data that I've got, to make it easy to spot trends. The data I have right now isn't super-easy to work with, so I'm going to sanitize it a bit, and then make a Pivot Table and chart from the sanitized data.
Here is a single column of USN_RECORD: timestamps:
I’d like to figure out the rate of change on the file system for files in the replicated folder so I’ll use text to columns. I’m using a fixed width conversion and I’m going to split out my timestamp to the minute (so I can see how many changes per minute I have) and I’ll split USN_RECORD: off the end of the line so that I have something to count:
Now I’ve got columns like this:
I delete the columns I don’t need (A and C). My result is a column of timestamps down to the minute and a column of identical values (column B) which I can count to understand the rate of change:
To do this, I insert a pivot table. I simply select columns A and B and then choose PivotTable from the Insert menu in Excel.
Now I configure my PivotTable Field List as follows:
After configuring the PivotTable, it looks like this
All that’s left for me to do is to click on one of the row labels and to select a chart from the Insert menu. The resulting chart tells us quite a lot:
Here I can see that there are constant changes at roughly 230 per minute, and that for a two-hour window, the changes increase to about 1500 per minute.
Conclusions so far
For the entire duration of the logs, a roughly consistent level of change was occurring. However, for a two-hour window, lots of change was occurring. There are two possibilities here: either the cause of change has become more aggressive during this time or this chart represents two different activities.
We need more investigation …
Back to the Debug Logs
I start by skimming a log that contains the timestamps from the two-hour window where we see many changes, and look at the USN record updates. Skimming through, I can see two different types of change:
20120523 10:54:41.249 2840 USNC 2453 UsnConsumer::UpdateIdRecord ID record updated from USN_RECORD:
+ USN_RECORD:
+ RecordLength: 96
+ MajorVersion: 2
+ MinorVersion: 0
+ FileRefNumber: 0x20000000175FA
+ ParentFileRefNumber: 0x4000000004678
+ USN: 0xbf0ad430
+ TimeStamp: 20120523 10:54:39.827 AUS Eastern Standard Time
+ Reason: Close Security Change
+ SourceInfo: 0x0
+ SecurityId: 0x0
+ FileAttributes: 0x20
+ FileNameLength: 36
+ FileNameOffset: 60
+ FileName: file0000031231.txt
And:
20120523 10:54:41.249 2840 USNC 2085 UsnConsumer::UpdateUsnOnly USN-only update from USN_RECORD:
+ USN_RECORD:
+ RecordLength: 96
+ MajorVersion: 2
+ MinorVersion: 0
+ FileRefNumber: 0x2000000019AD4
+ ParentFileRefNumber: 0x4000000004678
+ USN: 0xbf0ad4f0
+ TimeStamp: 20120523 10:54:39.843 AUS Eastern Standard Time
+ Reason: Basic Info Change Close
+ SourceInfo: 0x0
+ SecurityId: 0x0
+ FileAttributes: 0x20
+ FileNameLength: 36
+ FileNameOffset: 60
+ FileName: file0000038828.txt
Skimming a log that covers a timeframe with a low rate of change, I can only seem to find:
20120522 23:28:54.953 2840 USNC 2453 UsnConsumer::UpdateIdRecord ID record updated from USN_RECORD:
+ USN_RECORD:
+ RecordLength: 96
+ MajorVersion: 2
+ MinorVersion: 0
+ FileRefNumber: 0x2000000022440
+ ParentFileRefNumber: 0x4000000004678
+ USN: 0x7e7f5188
+ TimeStamp: 20120522 23:28:52.984 AUS Eastern Standard Time
+ Reason: Close Security Change
+ SourceInfo: 0x0
+ SecurityId: 0x0
+ FileAttributes: 0x20
+ FileNameLength: 36
+ FileNameOffset: 60
+ FileName: file0000072204.txt
Now I have a theory – Basic Info Change Close events only occur during the two-hour window where there are many changes and there’s an underlying and ongoing security change the rest of the time. I can prove this if I extract the timestamps for Basic Info Change Close changes and similarly, extract timestamps for Close Security Change changes.
Looking back at the log entries, I can see I have a time stamp followed by a series of lines that start with a + . I need to parse the log with something (I chose PowerShell) that takes note of the timestamp line and when a Basic Info Change Close or Close Security Change follows soon after, return the timestamp.
Here’s my PS script:
$files = Get-ChildItem *.log
 $processingBlock = $False 
$usnBlock = $False 
 foreach ($file in $files) 
{ 
$content = Get-Content $file 
foreach ($line in $content) 
{ 
if (!($line.ToString().Contains("+"))) 
{ 
$outLine = $line 
$processingBlock = $True 
} 
if ($processingBlock) 
{ 
if ($line.ToString().Contains("+    USN_RECORD:")) 
{ 
$usnBlock = $True 
} 
} 
if ($usnBlock) 
{ 
if ($line.ToString().Contains("+    Reason:              Basic Info Change Close")) 
{ 
$outLine.ToString() 
$processingBlock = $False 
$usnBlock = $False 
} 
} 
} 
} 
And:
$files = Get-ChildItem *.log
 $processingBlock = $False 
$usnBlock = $False 
 foreach ($file in $files) 
{ 
$content = Get-Content $file 
foreach ($line in $content) 
{ 
if (!($line.ToString().Contains("+"))) 
{ 
$outLine = $line 
$processingBlock = $True 
} 
if ($processingBlock) 
{ 
if ($line.ToString().Contains("+    USN_RECORD:")) 
{ 
$usnBlock = $True 
} 
} 
if ($usnBlock) 
{ 
if ($line.ToString().Contains("+    Reason:              Close Security Change")) 
{ 
$outLine.ToString() 
$processingBlock = $False 
$usnBlock = $False 
} 
} 
} 
} 
I run each of these (they take a while) against the debug log files and then chart the results in exactly the same way as I’ve done above.
First, Basic Info Change Close (look at the time range covered and number plotted):
And Close Security Change , below:
This confirms the theory – Basic Info Change Close takes place in the two hours where there’s a high rate of change and Close Security Change is ongoing.
Root Cause Discovery
If this is an ongoing pattern where the high rate of change occurs during the same two hours each day, I can capture both activities using Process Monitor .
Once I have a trace, it’s time to filter it and see what’s happening:
Here I’ve reset the filter and added Operation is SetBasicInformationFile then Include . I chose SetBasicInformationFile because it looks like a good fit for the USN record updates labelled Basic Info Change Close . After clicking OK, my filtered trace has the answer…
As it turns out, the backup window matches nicely with the storm of Basic Info Change Close updates.
Clearly, this is my own little application replicating behaviour but in the case of my customer, it was actually their backup application causing this change. They were able to talk to their vendor and configure their backup solution so that it wouldn’t manipulate file attributes during backups.
Now all we need to do is identify the source of Close Security Change updates. Once again, I reset the filter and look for an operation that sounds like a good match. SetSecurityFile looks good.
What I found this time is that no entries show up in Process Monitor
What explains this? Either I chose the wrong operation or the filter is broken in some other way. I can’t see any other sensible operation values to filter with so I’ll consider other options. Looking at the filter, I realize that perhaps System is responsible for the change and right now, Procmon filters that activity out. I remove the exclusion of System activity from my filter and see what happens:
Aha! Now I’ve got something:
Now I need to understand what System is doing with these files. I right click the path for one of these entries and select “ Include C:\ReplicatedFolder\file… ”:
I also need to remove the filter for SetSecurityFile:
In summary, I’m interested in everything that happened to file0000033459.txt:
If I look at operations on the file that took place prior to SetSecurityFile , I can see a CreateFile operation. This is where System obtained a handle to the file. Looking at this entry, adding the Details column to Process Monitor and examining the fine print I find:
System is making this change in the context of the account CONTOSO\ACLingApp that just happens to be the service account of an application used to change permissions on resources in the environment.
Conclusion
The process I've described today is a good example of the need to Understand the System from my earlier post . The Event Logs - and even the debug logs - won’t always tell you the answer straight away. Know what you’re trying to achieve, know how to use the tools in your arsenal, and know how they can be made to produce the outcome you need.
Knowing what I know now, I might have found the root cause by starting with Process Monitor but there’s a chance I’d have missed Close Security Change updates (considering that System is excluded from Process Monitor by default). I may have also missed the Basic Info Change Close updates if the tracing interval wasn’t aligned with the backup window. By mining the debug logs, I was able to establish there were two separate behaviours and the appropriate times to gather Process Monitor logs.
- Mark “Spyglass” Renoden