Public Folder Replication Troubleshooting - Part 2: Troubleshooting the Replication of Existing Data

The_Exchange_Team · ‎Jan 19 2006

This is the second blog post about troubleshooting public folder replication issues. In the first post we covered Troubleshooting the Replication of New Changes. This blog post covers Troubleshooting the Replication of Existing Data, and the last post in this series will cover Troubleshooting Replica Deletion and Common Problems. To get the full picture, please read all referenced material!

Troubleshooting the Replication of Existing Data

When new changes replicate, but old unchanged stuff doesn't, you have a backfill problem. The most typical situation where hierarchy backfill occurs is when a new public store is created. The most typical content backfill scenario is when a public store has been added to the replica list of a folder.

When you have a backfill problem like this, it may have already occurred to you that there’s an easy workaround – just make a change to all the items. By doing this you circumvent the broken backfill process and replicate everything as new changes. Despite the fact that I wrote both of the tools that are typically used for this (PFDAVAdmin and ModifyItems), it’s usually best to troubleshoot the backfill process and fix the root cause. If you just change everything to make it replicate, you may end up with the same backfill problem in the future when the replicas get out of sync again. That said, let’s move on to discussing backfill. To understand the backfill process, it's first necessary to understand how changes are tracked.

Every folder and message in the store is assigned a Change Number (CN) when it is created and every time it is modified. When replication occurs, the CNs on each object are used to determine whether that object needs to be replicated. A group of CNs is called a CNSet. The CNSet for a particular folder on a particular server is called status information. This status information is included on every replication message. Every message type 0x2 contains the hierarchy status for the sending server. Likewise, every message type 0x4 contains the status information for that particular folder for the sending server. All other replication message types contain status information for their respective folders as well.

When a new public store mounts for the first time, it sends a status request (type 0x20) for the hierarchy to all the existing public stores. Similarly, when a new store is added to the replica list of a folder, that store will send a 0x20 to all other replicas of that folder. Like every replication message, a status request contains a CNSet of all CNs for the folder in question (or the hierarchy) that the originating store has, and asks the other stores to respond if they have CNs that the originator doesn't. Note that prior to Exchange 2003 SP2, not every replica was asked to respond to the status request, so some stores would ignore the status request even if they had changes that the originating store did not. A 2003 SP2 server will ask for responses from all replicas, and will respond even when the originating server did not specifically ask it to, as long as it has changes that the originating server does not. This can greatly improve the decisions made during the backfill process. The unique thing about a status request is that it doesn't contain any data to replicate - it just has a list of change numbers. The other stores respond with status messages (0x10), which list their own CNSets for that same folder (or the hierarchy). When the originating server receives the 0x10 messages, it compares the CNSet contained within to its own CNSet. If the 0x10 contains changes that the store doesn't already have, the backfill process begins.
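The comparison a store makes when a status message arrives can be sketched in a few lines of Python. This is only an illustration of the idea, not how the store implements it: CNs are modelled as plain integers, a CNSet as a Python set, and the function names are made up for this example.

```python
def missing_changes(local_cnset: set[int], remote_cnset: set[int]) -> set[int]:
    """Return the CNs listed in the remote status that the local store lacks."""
    return remote_cnset - local_cnset


def should_respond(own_cnset: set[int], requester_cnset: set[int]) -> bool:
    """A replica has something to offer if it holds CNs the requester lacks."""
    return bool(own_cnset - requester_cnset)


# The new store has CNs 1-3; the status (0x10) from another replica lists 1-5.
new_store = {1, 2, 3}
other_replica = {1, 2, 3, 4, 5}

to_backfill = missing_changes(new_store, other_replica)
print(sorted(to_backfill))                        # [4, 5]
print(should_respond(other_replica, new_store))   # True
print(should_respond(new_store, other_replica))   # False
```

An empty result from missing_changes means the incoming status gives the store nothing to backfill.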

The first step in the backfill process is to add entries to the backfill array for the folder in question. These entries have a CNSet that describes the missing changes, and a timeout value describing when the store will request the missing data. The backfill timeout will vary depending on the situation. In the case of a new public store being brought online or a new replica of a folder being added, the initial timeout is 15 minutes.

Backfill entries may be added to the backfill array during the course of normal operation as well. Consider a situation where a particular public store has broadcast two changes in two separate 0x2 messages. Let's say the administrator deletes the first 0x2 message out of the queue, but the second one makes it through. When the other servers receive this 0x2, they will find that the CNSet in the status information contains CNs that they never got. As a result, they will create backfill entries for that data. Backfill entries for missing data that was discovered during the normal course of replication will start with a timeout of 6 hours if the data is available in the same Routing Group (RG), or 12 hours if it is only available in a different RG. Each time a backfill request is issued, the next timeout will be 12 and then 24 hours for intra-RG requests, or 24 and 48 hours for inter-RG requests.
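The timeout progression for entries discovered during normal replication can be written down as a small lookup. This is a sketch under assumptions: the function and constant names are hypothetical, and holding the timeout at the last value after the third interval is inferred from the 24/48 hour maximums mentioned later in this post. The 15 minute initial timeout for a new store or new replica is shown only as a constant, since the intervals that follow it are not described here.

```python
from datetime import timedelta

# Initial timeout when a new store or a new replica is brought online.
NEW_REPLICA_INITIAL_TIMEOUT = timedelta(minutes=15)

# Timeouts for missing data discovered during normal replication.
INTRA_RG_TIMEOUTS = [timedelta(hours=h) for h in (6, 12, 24)]
INTER_RG_TIMEOUTS = [timedelta(hours=h) for h in (12, 24, 48)]


def backfill_timeout(requests_issued: int, same_routing_group: bool) -> timedelta:
    """Timeout to apply after `requests_issued` backfill requests have gone out.

    Assumption: once the last interval is reached, it is reused.
    """
    schedule = INTRA_RG_TIMEOUTS if same_routing_group else INTER_RG_TIMEOUTS
    index = min(requests_issued, len(schedule) - 1)
    return schedule[index]


for n in range(4):
    print(n, backfill_timeout(n, True), backfill_timeout(n, False))
```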

Every five minutes the store will check to see if any backfill entries have reached their timeout. If they have, a backfill request (type 0x8) is issued for the missing CNs, and the timeout is set to the next interval. A backfill request is not a broadcast; it is directed at a single server - one of the servers that previously indicated it had the missing CNs in the status information it sent to the requesting server. When that server receives the incoming 0x8, it immediately processes the request and responds with one or more backfill responses (0x80000002 for hierarchy or 0x80000004 for content), which contain the actual data for the requested change numbers. Like backfill requests, backfill responses are not broadcasts - they are sent only to the requesting server.
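When reading the application log it helps to keep the message types mentioned so far in one place. The table below is just a convenience lookup written in Python; the descriptive names are paraphrased from this post, and the helper name is made up for the example.

```python
# Replication message types discussed in this post.
MESSAGE_TYPES = {
    0x2: "hierarchy replication",
    0x4: "content replication",
    0x8: "backfill request",
    0x10: "status",
    0x20: "status request",
    0x80000002: "hierarchy backfill response",
    0x80000004: "content backfill response",
}

# Types described above as directed at a single server rather than broadcast.
DIRECTED_TYPES = {0x8, 0x80000002, 0x80000004}


def describe(message_type: int) -> str:
    """Return a short human-readable label for a replication message type."""
    name = MESSAGE_TYPES.get(message_type, "unknown")
    target = " (directed)" if message_type in DIRECTED_TYPES else ""
    return f"{message_type:#x} {name}{target}"


print(describe(0x8))         # 0x8 backfill request (directed)
print(describe(0x20))        # 0x20 status request
print(describe(0x80000004))  # 0x80000004 content backfill response (directed)
```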

If the requesting server successfully processes the incoming backfill response, the CNs it contained are cleared from the backfill array on that store. Actually, any incoming message that contains CNs that are outstanding in the backfill array will cause those CNs to be cleared from the array.
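To make the bookkeeping concrete, here is a toy model of a backfill array in Python. It is a sketch for illustration only: the class and method names are hypothetical, CNs are plain integers, and the real store's data structures are not described in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BackfillEntry:
    missing: set[int]   # CNs this store knows it lacks
    due: datetime       # when a backfill request (0x8) should be issued


class BackfillArray:
    def __init__(self) -> None:
        self.entries: list[BackfillEntry] = []

    def add(self, missing: set[int], now: datetime, timeout: timedelta) -> None:
        """Record missing CNs along with the time at which to request them."""
        self.entries.append(BackfillEntry(set(missing), now + timeout))

    def due_entries(self, now: datetime) -> list[BackfillEntry]:
        """Entries whose timeout has been reached (the periodic check)."""
        return [e for e in self.entries if e.due <= now]

    def clear(self, received: set[int]) -> None:
        """Any incoming message carrying outstanding CNs clears them."""
        for entry in self.entries:
            entry.missing -= received
        self.entries = [e for e in self.entries if e.missing]


start = datetime(2006, 1, 19, 12, 0)
array = BackfillArray()
array.add({4, 5, 6}, start, timedelta(minutes=15))
print(len(array.due_entries(start)))                          # 0
print(len(array.due_entries(start + timedelta(minutes=15))))  # 1
array.clear({4, 5})
print(array.entries[0].missing)                               # {6}
array.clear({6})
print(array.entries)                                          # []
```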

Troubleshooting

As you can see, there are a lot more questions to answer when troubleshooting the backfill process.

1. Does the store know it's missing data?

First you should determine if the server even realizes that other stores have changes that it needs to request. Unfortunately, there is no supported tool or utility that will let you view the backfill array directly to see if it has anything in it. However, there are other more indirect ways of figuring this out.

One way is to wait. If the server knows it's missing data, it will be requesting it at least once every 24 or 48 hours. This means you can simply turn up logging and wait to see if a 0x8 message ever goes out. If you never see a 0x8 for the folder in question, but you are seeing 0x8's for other folders, you may have hit the outstanding backfill limit, which we'll discuss shortly.

Another option is to make sure the server receives the latest status information. Remember, the server only sends a status request that one time after you add the new replica. After that, the only status information it receives will be through the normal course of replication. So if its initial attempt to get status was lost because the 0x20 or the 0x10 in response was lost or deleted, it may sit there indefinitely and not even realize that it's missing anything. There are several ways to make sure the server has received status information for a folder.

- Go to a server that has all the data and make a change to the folder by adding, deleting, or modifying a message. In the case of the hierarchy, create, delete, or change the properties of a folder. The resulting 0x4 or 0x2 will contain status information for that folder or the hierarchy, respectively. When the server that's missing the data successfully processes the incoming replication message, you know that it has added any appropriate entries to the backfill array.

- Use the Synchronize Content option in Exchange 2003 ESM. This is a well-hidden but very useful option. To find it, go under the Public Folders tree and go to the folder in question. Highlight the folder in the left-hand pane. In the right-hand pane click the Status tab. Right-click on the server that has all the data and choose Synchronize Content. This does two things - it causes the server to issue a status request 0x20 for the folder, and it causes it to immediately time out any backfill entries. Notice that I said you should Synchronize Content from the server that already has the data. You may wonder why you would do that, when it's the other server that has the backfill entries that need to be timed out. Remember that at this point we're just trying to ensure that the server missing the data KNOWS it has something to backfill. To that end, we can use Synchronize Content from the server that has the data to send a 0x20 to the server that doesn't. In this case we're not really interested in seeing a status 0x10 response to the 0x20. We just want the store missing content to receive a replication message for the folder from a store with content, so it can add the appropriate entries to the backfill array. The 0x20 from the server with the data serves this purpose. Note that in Exchange 2003 SP2, Synchronize Content is now available for the hierarchy by right-clicking on the Public Folders node itself.

- Use the Replication Flags registry value (KB813629). If you put this value in place, along with the Enable Replication Messages At Startup value from KB321082, it causes the store to send a status request 0x20 for every folder on startup. Again, you would want to use this on the server that has the content - the point of this step is to get the server that has content to send its status information to the server that's missing content.

- Use 2003 ESM to send a backfill response. In 2003 SP1, you could use the Send Hierarchy option to send a hierarchy backfill response and the Send Contents option to send a folder content backfill response. In 2003 SP2, both of these options became Resend Changes. This sends a backfill response for the range of data you specify, but you probably shouldn't specify the whole range of data since that might satisfy all outstanding backfill entries and end up working around the original problem. Instead, specify a range of only a day or two. This causes a 0x80000002 or 0x80000004 to go to the target server, which again serves the purpose of giving it status information for the store that has the data.

Once you've used one of these options to force status information, and you've verified that the store missing the data received the incoming message by watching the application log, then you know it knows it's missing the data.

2. Does the store request the missing data?

After you've made sure the store knows it needs to backfill some data, does it ever issue a backfill request? Recall that after it has tried to backfill the data a couple of times, you may have to wait 24 or 48 hours for the next backfill request, since those are the longest timeout intervals for intra-RG and inter-RG backfills, respectively. There is one way to speed this up, and that is to use Synchronize Content again, but this time from the server that's missing the data. This will immediately time out the backfill entries for that folder. However, you may still find that the store does not issue a backfill request for the folder you're focusing on. If this is the case, watch the app log for the next 24-48 hours. If the store is sending backfill requests for other folders, but not for the folder you're focusing on, it may have hit the outstanding backfill limit.

When you experience a situation where you've added replicas of a lot of folders to a new store, and replication seems fine at first but then grinds to a halt over the next day or two, you have probably hit the outstanding backfill limit (OBL). The OBL is a mechanism intended to throttle replication. By default, the store will only allow 50 outstanding backfill requests at a time. Once it has 50 outstanding, it will re-request those 50 over and over until they are satisfied. Once any one outstanding entry has been satisfied, that opens up a slot in the OBL for a new set of data to be requested. This means that if 50 requests are having problems being satisfied for whatever reason, replication cannot proceed.
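The stall is easy to reproduce in a toy simulation. The Python below is a simplified model, not the store's actual algorithm: it tracks one outstanding request per folder, the function name is made up, and "satisfied" is decided by a caller-supplied predicate.

```python
from collections import deque
from typing import Callable


def run_backfill(folders: list[str], limit: int,
                 can_be_satisfied: Callable[[str], bool], rounds: int):
    """Simulate `rounds` request cycles with at most `limit` outstanding requests.

    Returns (completed, still_outstanding, never_requested).
    """
    pending = deque(folders)
    outstanding: list[str] = []
    completed: list[str] = []
    for _ in range(rounds):
        # Fill any free slots with folders that have not been requested yet.
        while pending and len(outstanding) < limit:
            outstanding.append(pending.popleft())
        # Re-request everything outstanding; satisfied requests free their slot.
        still_waiting = []
        for folder in outstanding:
            if can_be_satisfied(folder):
                completed.append(folder)
            else:
                still_waiting.append(folder)
        outstanding = still_waiting
    return completed, outstanding, list(pending)


# 50 folders whose backfill never succeeds, followed by 10 healthy ones.
folders = [f"stuck-{i}" for i in range(50)] + [f"ok-{i}" for i in range(10)]
healthy = lambda name: name.startswith("ok")

done, waiting, untouched = run_backfill(folders, limit=50,
                                        can_be_satisfied=healthy, rounds=20)
print(len(done), len(waiting), len(untouched))  # 0 50 10
```

With the limit at 50, the 50 failing folders hold every slot and the healthy folders are never requested. With 49 failing folders there is one free slot, and the healthy folders trickle through one per round.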

If you are seeing this behavior, you should watch the application log to see what the store is requesting. You'll be seeing periodic 0x8 messages for the current 50 outstanding backfill requests, and you'll find that no backfill response is received, which is why they're still outstanding. At that point you should change your focus to troubleshooting one of the folders the store is currently trying to backfill, since resolving the problem will allow it to move on to other folders.

There is one other option, and that is to increase the Outstanding Backfill Limit (OBL). You can do this by creating a registry value called Replication Outstanding Backfill Limit under the registry key for that store. The maximum value is 5000 decimal. However, once you do this the replication floodgates will open and it will be hard to determine which 50 folders caused it to choke. You'll need to postpone troubleshooting until things settle down again. Typically I recommend leaving the limit at 50 and fixing the problem, instead of working around it by increasing the limit.

If the OBL doesn’t appear to be a problem, and you still aren’t seeing outgoing 0x8 messages for the folder in question, see “Common Problems” in the last post of this series.

3. Does the other store respond to the request?

Once you have a backfill request to focus on, you need to determine if the backfill target ever got the request. Check the application log on that server for the incoming 0x8. You can also search the application log for the message ID mentioned in the outgoing event from the sending side. If you can find no sign of it in the application log, use message tracking to see how far it got. If it received the 0x8, it should respond almost immediately with one or more 0x80000002 or 0x80000004 messages (you will often see many backfill responses to a single backfill request, since the changes are not all sent in a single message). Of course, the time it takes to generate the backfill response messages will vary based on the data in the folder and the replication message size limit. For instance, if you set the maximum replication message size to 1 GB, the responding server could try to pack the entire hierarchy into a single backfill response, which might take an hour or more just to pack up!
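To see why one backfill request can produce many responses, and why a very large size limit produces one huge message, consider this rough Python sketch. It assumes a simple greedy packing of changes in order, which is only a guess for illustration; the function name and the sizes are hypothetical.

```python
def pack_responses(change_sizes: list[int], max_message_size: int) -> list[list[int]]:
    """Group changes into messages without exceeding the size limit.

    A single change larger than the limit still gets a message of its own.
    """
    messages: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in change_sizes:
        # Start a new message when adding this change would exceed the limit.
        if current and current_size + size > max_message_size:
            messages.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        messages.append(current)
    return messages


changes_kb = [300, 300, 300, 300]
print(len(pack_responses(changes_kb, 700)))        # 2
print(len(pack_responses(changes_kb, 1_000_000)))  # 1
```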

4. Does the requesting server get the response?

Now it's time to check the application log on the requesting server to see if it received the backfill response. If not, track the message and see how far it got. If it received the backfill response and logged it in the application log, then that backfill request should have been satisfied and it should be able to move on.

As mentioned earlier, if you find that message tracking shows that one of these messages was delivered to the store, yet the application log does not show the incoming replication message, have a look at “Common Problems” in the last post of this series.
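The four questions above form a simple checklist that is worked through in order. As a memory aid, here it is as a small Python function; the function name and the wording of the returned hints are my own summary of the steps in this post, not output from any tool.

```python
def next_step(knows_missing: bool, request_sent: bool,
              target_responded: bool, response_received: bool) -> str:
    """Return the next thing to check, following the four questions in order."""
    if not knows_missing:
        return "Force status information from a server that has the data"
    if not request_sent:
        return ("Use Synchronize Content on the server missing the data, "
                "then check for the outstanding backfill limit")
    if not target_responded:
        return ("Look for the incoming 0x8 on the target server; "
                "use message tracking if it is absent")
    if not response_received:
        return ("Track the backfill response and check the application log "
                "on the requesting server")
    return "Backfill request satisfied"


print(next_step(True, True, False, False))
```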

In the next blog post: Troubleshooting Replica Deletion and Common Problems.

- Bill Long

Jan 21 2006

Another GREAT post!!

Like most people I never realized that Synchronize Content was available on the "status tab" in ESM.

Any chance of a post on hidden gems in ESM ??

Ian

Jan 22 2006

Where can we download PFDAVAdmin and Modifyitems?

Jan 22 2006

PFDAVADMIN is a download on the Exchange Download Center Homepage...

Jan 23 2006

I have PFDAVAdmin now, but modifyitems is the one I'm more interested in and I can't seem to find that one anywhere.

Jan 24 2006

ModifyItems isn't a supported tool, so it's not available anywhere on the public web site. You can get it from PSS. Really the only reason to use ModifyItems is when you need something that will run against 5.5 - since ModifyItems goes through MAPI.

Remember this tool is not an official tool and isn't supported. If it doesn't work for you for whatever reason, I most likely will never get around to fixing it. :) Also remember that this really only works around the backfill problem, and it's preferable to find and fix the root cause.

Jan 26 2006

I have a question about the send hierarchy and send content function. You mentioned that these 2 features send backfill response to the store that is lacking the content. Does this mean you do not really need a backfill request message coming from the store that is lacking the content when you use these features?

Jan 27 2006

That's correct, Kevin. Think of a backfill response as simply a normal replication message but with a specific target server (instead of a broadcast to all replicas). Normally, a replication message that replicates data to one specific server would be generated only in response to a backfill request, but these new ESM options allow you to generate one even when no backfill request has occurred.

Jan 27 2006

Hi Bill,

Thanks for the clarification. I have another quick question for you. When you force a status request message to be sent from a public folder store that has all the data to a store that is lacking the data, would the receiving store send a status response back to the sending server? From what I understand, the status response contains the missing CNSets and is used to trigger the backfill process. In this situation, the receiving store has the missing content, it should be the one triggering the backfill process therefore there should be no need for the receiving store to send a status response. Am I right or wrong?

Jan 27 2006

That's right. We wouldn't expect the store that's missing data to respond with a status message (0x10) to the status request (0x20) from the store that has the data. And in that situation we really don't care if it does. The important thing is that the status request (0x20), which itself contains status information, makes it to the store that's missing data. When that store receives it, it will create entries in its backfill array for that missing data. It would only respond if it somehow had data that the other store didn't.

A store will create entries in the backfill array any time it receives status information that includes CNs that it doesn't have. Therefore, the backfill process can be triggered by any type of replication message - not just a status response - because all replication messages contain status for their respective folders. In this scenario we're triggering it with a status request (0x20), and that's really just because we don't have the option of forcing a status response (0x10) in ESM.

Jan 31 2006

I'm struggling to find a solution for a replication problem on one of my Exchange 5.5 public folder servers.

This server hosts around 80% of our organization's PFs and has a database of around 160 GB. The server crashed at the end of December, and we used an online backup to restore the Directory, Priv.edb and PUB.edb.

Since then the server has stopped accepting replication from other servers, and no replication messages are sent out from this server either.

While tracking replication messages from a remote server, I could see that the message is delivered to this ill server's information store, but Outlook is not displaying any data in the PFs.

Tracking messages generated from the ill server does not show that any message is generated from this server.

No errors or warnings are displayed in the event log, even after enabling all the necessary diagnostic logging.

Just for testing, I reset pub and priv.edb, and I could see that replication refill was happening (I waited only for some time, and saw that the hierarchy was getting populated).

Email sent to a public folder email address is delivered properly, so internal communication is fine.

Since I have replicas of all these folders in Exchange 2003, all Exchange 5.5 servers were pointed to use Exchange 2003 as the PF server, so data up to the 21st of December 2005 is available, but I'm looking for a way to extract the data that changed from the 21st of December to the 21st of January 2006.

Not much help from Microsoft.

Any help will be much appreciated.

Feb 01 2006

It could be that the ReplState table on your 5.5 server is damaged, or that something else is wrong with one of the tables. If your disaster recovery involved running an eseutil /p, this could very easily be the result of that. In 5.5, we have no ReplState test to fix up any ReplState problems. On top of that, it's hard to even troubleshoot these problems on 5.5 since we don't have the tracing options that we have in 2000 and beyond.

The only likely solution is to get the data out of that store and start over with a clean store on that server. I'd suggest using Outlook to pull the folders down to a PST (note that when you drag and drop a top-level folder into a PST, it will automatically grab all the subfolders as long as you have permissions to them). Then copy the data up to either the 2003 server or a clean store on the 5.5 server.

Sorry I don't have a better solution for you, but there is probably no way to fix that 5.5 store to make it replicate again.

Aug 07 2008

Hi, long time since any posts here...

Question: I wish I had found this BEFORE I started my Exchange 2003 migration from box a to box b. I was having difficulty confirming that PUBLIC FOLDER REPLICATION was working.

I just manually created the one public folder that had content, which was a distribution list.

Now, I've moved all mailboxes from box a to box b, and I need to/want to shut down the old Exchange server. I tried to do an uninstall, but it failed.

Should I restart box a Exchange services, and try to replicate the public folders per the post above, THEN try to uninstall Exchange on box a?

Thanks.

Aug 07 2008

David, this will lead you through it all:

http://technet.microsoft.com/en-us/library/bb288905.aspx
