Hi Folks,
If you are managing a hierarchy for sometime and dealt with Replication issues before; I assume you would have an understanding of the DRS Initialization process. If you feel you need some guidance then refer:
But what is very obscure and often called a Black box is the DRS Message flow and the synchronizations of groups when the site is ACTIVE.
So in short how is the flow in a day to day life when things are good. You create a package on the console on CAS and it replicates back to the Primary Site or vice versa.
How does this happen?
A 100 feet flow that we have told before is when the data gets inserted in the given site we extract the changes via Change Tracking and then convert them into message and send it via Service broker.
Well all is good when talking in high overview. At CSS, there is no high overview, we have to know the skin of the things and the internals.
For each replication groups much more needs to happens for them being to show as ACTIVE individually when the Site is ACTIVE.
So the million dollar question is After Init is finished and the site is ACTIVE, how do we monitor the Synchronization and changes of the groups?
Note: The below concepts are there to clarify the concepts of DRS. Do not change or run any modification queries in the production without any expert guidance. For any recommendations or issues, please open a ticket with Microsoft CSS.
DRS Sync Process – On the initiating site
Part 1 : To decide whether to send changes for a group
EXEC dbo.spDRSSendChangesForGroup @ReplicationGroup
Part 2 : The main extraction process on the initiating site
Now this procedure sends the changes by extracting the changes and send them as a message. The process starts as follows [expanding on the flow chart steps] -
It checks if the last two syncs for this Replication group has completed or not [SyncCompleteTime is NULL] . We will learn more about SyncCompleteTime later as to how it gets updated. If not it puts it in the temp table @SitesNotToSend and would not send the changes to those Sites for this replication group.
It will log the below in Vlogs –
'Not sending changes to sites <ReceivingSiteCode> for Replication group <ReplicationGroup> since last 2 syncs to these sites have not completed.'
Here is a sample of the TSQL code that does that –
-- get sites which have not returned a sync complete within the throttle window
;with ReplicationGroupSendHistory (ID, TargetSite, SyncCompleteTime) as
(
select ID, TargetSite, SyncCompleteTime from DrsSendHistory with (nolock) where ReplicationGroupID = @ReplicationGroupID
)
insert into @SitesNotToSend (SiteCode)
select TargetSite from
(select *, ROW_NUMBER() over (partition by TargetSite order by ID desc) TopRows from ReplicationGroupSendHistory)as Temp where Temp.TopRows <= @ThrottlingWindow AND SyncCompleteTime is null group by TargetSite having count(Temp.TopRows) >= @ThrottlingWindow
if exists (select * from @SitesNotToSend)
begin
set @LogText = @SyncIDLogString + N'Not sending changes to sites ' + (select SiteCode + N', ' from @SitesNotToSend for xml path(N'')) + N' since last ' + CONVERT(NVARCHAR(12),@ThrottlingWindow) + N' syncs to these sites have not completed.';
exec dbo.spLogEntry @LogText=@LogText, @ProcedureName=@procedureName;
end
If yes, we log the below entry in Vlogs:
‘The dialog handle for sites @PendingSyncSitesWithChangedDialog has changed and previous sync has not completed. Not processing sync for them during this cycle. Will try again in the next cycle.'
Here is the code which checks this:
-- Check for any sites for this replication group which had a new dialog and previous sync is pending
insert into @PendingSyncSitesWithChangedDialog
select SiteCode from DRS_MessageActivity_Send where LastSendResult=-20 and ReplicationID=@ReplicationGroupID and Active=1 and
SiteCode in (select TargetSite from DrsSendHistory with (nolock) where SyncCompleteTime is null and ReplicationGroupID=@ReplicationGroupID);
if exists (select * from @PendingSyncSitesWithChangedDialog)
begin
set @LogText = @SyncIDLogString + N'The dialog handle for sites ' + (select SiteCode + N', ' from @PendingSyncSitesWithChangedDialog for xml path(N'')) +
N' has changed and previous sync has not completed. Not processing sync for them during this cycle. Will try again in the next cycle.';
exec dbo.spLogEntry @LogText=@LogText, @ProcedureName=@procedureName;
end
LastSendResult -20 means the same that the DialogHandle changed. We have this code later which updates the same.
insert into @TargetSites select SiteCode from DRS_MessageActivity_Send where CurrentSendSyncID = @SyncID;
while exists (select * from @TargetSites)
begin
set @SiteCode = (select top(1) SiteCode from @TargetSites);
exec @Rval = dbo.spDRSGetDialogHandle @SiteCode=@SiteCode, @ReplicationGroupID=@ReplicationGroupID, @Handle=@SendDialogHandle output;
if (@Rval != 0 and @Rval != 2)
begin
rollback tran;
return 0;
end
-- If new dialog was created and there are pending sync for this group and target site, then do not sync for this group and site
if (@Rval = 2 and exists(select * from DrsSendHistory with (nolock) where SyncCompleteTime is null and ReplicationGroupID=@ReplicationGroupID and TargetSite=@SiteCode))
begin
set @LogText = @SyncIDLogString + N'The dialog handle for site ' + @SiteCode + N' has changed and previous sync has not completed. Not processing its sync during this cycle. Will try again in the next cycle.';
exec dbo.spLogEntry @LogText=@LogText, @ProcedureName=@procedureName;
-- Store the fact that this happened in the LastSendResult
update DRS_MessageActivity_Send set LastSendResult=-20 where ReplicationID=@ReplicationGroupID and SiteCode=@SiteCode;
end
Here is the full key for LastSendResult in DRSSendHistory
-99: Snapshot isolation error
-20: Sync handle changed while previous syncs were not done
-10: LastSendVersion changed while sync was in progress
-3: Replication group needs re-init and invalid subscription is sent by message builder
-2: Received exception while sending DRS_SyncStart or DRS_SyncEnd
-1: Any other error during sync
We basically call spGet<table>Changes (“Site” replication pattern) or function fnGet<Table>Changes (“global” replication pattern) or function fnGet<Table>ChangesSec (“global_proxy” replication pattern) to extract changes.
Now if we have DebugLogging enabled for this Vlogs we do log useful information for when we extract the changes.
To enable DebugLogging we can alter the function like below –
ALTER FUNCTION [dbo].[fnIsDebugLoggingEnabled]()
RETURNS BIT
AS
BEGIN
RETURN 1;
END
From Vlogs (Debug Enabled):
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Start extract and send of changes, pattern is global and group id is 3
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Encoding and Compression is enabled for this replication group.
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Starting scan, group ID is 3
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] - Extracted 1 changes for table CM_SiteConfiguration
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Extracted 1 total changes.
From Vlogs (Debug Enabled) on initiator site:
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Sending DRS_SyncStart message to 2 sites.
Once the Sync start message is processed on receiving site we log.
INFO: Received message DRS_SyncStart.
INFO: Reading site code and replication group from XML message.
INFO: Read site code CAS and replication group Configuration Data.
INFO: Updating DRS_MessageActivity_Receive CurrentReceiveDialogHandle: 6f9c1d3c-9464-e911-80f1-001dd8b85882, LastReceiveStartTime: 06/01/2019 20:08:35 and LastReceiveResult: 0.
INFO: Updated DRS_MessageActivity_Receive.
From Vlogs (Debug Enabled) on initiator site:
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Called spDRSSendDataMsg 1 times.
Once the SyncData message is processed on receiving site we log.
INFO: DRS_SyncDataCompressed message processing started.
INFO: DRS_SyncDataCompressed needs decompression.
INFO: DRS_SyncDataCompressed message decompressed from length 762 to 2376.
INFO: Read site code: CAS from the DRS_SyncData.
INFO: Parsed data from DRS_SyncData XML: lastVersionReceived: 4099493, lastSyncVersionToSource: 17479330, list of upsert tables: CM_SiteConfiguration, list of delete tables:
INFO: Building dependency map for all upserts.
INFO: Processed all upserts.
INFO: Building dependency map for all deletes.
INFO: Processed all deletes.
INFO: Updating DRS_MessageActivity_Receive with LastVersionReceived: 4099493 for ReceiveDialogHandle: 6f9c1d3c-9464-e911-80f1-001dd8b85882.
INFO: Updated DRS_MessageActivity_Receive.
INFO: Processed message DRS_SyncDataCompressed
7. Call proc spDRSSendEndMsg to mark “ending message send”.
From Vlogs (Debug Enabled) on initiator site:
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Sending DRS_SyncPing and DRS_SyncEnd message to 2 sites and updating DRS_MessageActivity.
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Completed operation in 1 seconds.
Once the SyncEnd message is processed on receiving site we log.
INFO: Received message DRS_SyncEnd.
INFO: Reading last version received from XML message.
INFO: Read last version received 4099494.
INFO: Updating DRS_MessageActivity_Receive with LastVersionReceived: 4099494 and LastReceivedEndTime: 06/01/2019 20:10:12 for DialogHandle: 6f9c1d3c-9464-e911-80f1-001dd8b85882.
INFO: Processed message DRS_SyncEnd.
Now it is important to note that just by sending a SyncEnd message doesn’t mark the completion of the Sync. The sync is not completed until the initiating site receives a SyncComplete message from the receiving site which sends it after it receives the SyncEnd message.
So when we send the SyncEnd message from the Initiating site we update the DRSSendHistory (Leaving the ProcessedTime and SyncComplete as NULL)
-- update sync history
update DRS_MessageActivity_Send set LastSendEndTime=@EndTime, LastVersionSent= CASE WHEN @IsSchedulingOn = 0 THEN LastVersionSent ELSE @MaxVersion END, SkipContextCheck=0, SyncDataSize = SyncDataSize + @StartMsgSize + @EndMsgSize
where ReplicationID=@ReplicationGroupID and LastSendStartTime=@StartTime and LastSendResult>=0 and SiteCode in (select SiteCode from @TargetSitesAndHandleTable WHERE RtmInterop = 1)
update DRS_MessageActivity_Send set LastSendEndTime=@EndTime, LastVersionSent= CASE WHEN @IsSchedulingOn = 0 THEN LastVersionSent ELSE @MaxVersion END, SkipContextCheck=0, SyncDataSize = SyncDataSize + @StartMsgSize + @EndMsgSize + @PingMsgSize
where ReplicationID=@ReplicationGroupID and LastSendStartTime=@StartTime and LastSendResult>=0 and SiteCode in (select SiteCode from @TargetSitesAndHandleTable WHERE RtmInterop = 0)
IF @ReplicationPattern!=N'cloud'
BEGIN
exec dbo.spDrsInsertSendHistory @SyncID=@SyncID, @ReplicationGroupId=@ReplicationGroupID, @ChangeCount=@ChangeCount, @StartTime=@StartTime, @EndTime=@EndTime
END
Note that the @SyncID for each sync of a replication group is created by the NEWID() inbuild function which creates the GUID. It remains the same for a current sync throughout. For the next sync it would be again created as a different GUID.
Note that the LastProcessedTime and the SyncCompleteTime are not set by the above highlighted procedure in the DRSSendHistory of the Sending table. If you check at the record it would be NULL for those two columns. So how do they get updated?
spDRSSendEndMsg sends the ‘DRS_SyncEnd’ messagetype with the @SyncId to the target site.
It is in ProcessSyncEnd() as well
ExecuteStoredProcedure "dbo.spDRSSendSyncComplete"
Here is the logic that goes for processing the DRS_SyncComplete on the Initiating Site.
ELSE IF @ReceivedMessageType = N'DRS_SyncComplete'
BEGIN
DECLARE @SyncID UNIQUEIDENTIFIER;
DECLARE @ProcessedTime DATETIME;
SET @MessageXml = CONVERT(xml, @ReceivedMessageBody);
SELECT @SiteCode = T2.X.value(N'./@TargetSite', 'NVARCHAR(3)'),
@SyncID = T2.X.value(N'./@SyncID', 'UNIQUEIDENTIFIER'),
@ProcessedTime = T2.X.value(N'./@ProcessedTime', 'DATETIME')
FROM @MessageXml.nodes(N'DRS_SyncComplete') AS T2(X);
EXEC dbo.spDRSUpdateSendHistory @SiteCode, @SyncID, @ProcessedTime
END
SET @ReplicationID = (SELECT DISTINCT ReplicationGroupID FROM DrsSendHistory WHERE SyncID=@SyncID);
UPDATE DRSSendHistory SET ProcessedTime=@ProcessedTime, SyncCompleteTime=GETUTCDATE() WHERE TargetSite=@TargetSite and SyncID=@SyncID;
UPDATE DRS_MessageActivity_Send SET LastSyncProcessedTime=@ProcessedTime WHERE SiteCode=@TargetSite and ReplicationID=@ReplicationID;
Part 3 : Tracking the Sync Handles in Action
It is very important to know how these handles relate to troubleshoot the issue. Because a mismatch here wouldn’t let a sync to complete.
What can cause the handles to mismatch?
So Lets try to track them for say suppose ConfigMgrDRS queue communications for NormalPriority message for Configuration Data.
On the initiator Primary site I check for the current handles for all conversations for Global data
select * from SSB_DialogPool where ToService like 'ConfigMgrDRS_SiteCAS' and OnContract = 'NormalPriority' order by CreationTime desc
Now these are the ones equivalent to all the global groups which are allowed to send using Normal priority and we can confirm it here
select * from ReplicationData where ReplicationPattern = 'global' and ReplicationPriority = 5
Now how to find which one handle and groupid is being current used by the Configuration Data ?
select * from DRS_MessageActivity_Send where ReplicationID = 3 And Active =1
That’s the second one from the list above we first shared.
Now note that this GroupID is not unique across instances, meaning that we cannot search for the same group id on the receiving site.
There is in fact a different id which is unique [Conversation_ID] and we need to find the corresponding [Conversation_ID] for this group id.
select * from sys.conversation_endpoints where conversation_group_id = 'EB2EE470-A345-4CC2-8F27-73C84A3A7629'
Now we can take this conversation_id and search on the receiving site.
select * from sys.conversation_endpoints where conversation_id ='AD90F806-A1AB-41B9-9130-8DD2473C16DA'
From here we get the conversation_handle and this should match with the handle stored in the DRS_MessageActivity_Receive from what we have learned above for things to work.
select * from DRS_MessageActivity_Receive where replicationid = 3 AND SiteCode = 'PR2'
And there we go it perfectly matches and so it means sync completed just well.
So now it should be clear that when sync starts and we use a conversation handle, we store it in the DRS_MessageActivity_Receive table and if someone flushes the dialogpool (manually or restoring the snaphot of old db) then this breaks this sync and the handle stored here would not match the new handle.
And we will as a result we uncompleted sync forever in DRSSendHistory where ProcessedTime and SyncCompleteTime would never populate.
There are ways to get out of this state which RLA reciprocates with (exec spFlushSSBDialogPool) which has the logic to get out of this state but don’t run this manually and always use RLA.
If this doesn’t fix the issue then engage Microsoft CSS to deep dive into this who might run some manual queries to get you out of this situation.
Happy learning!
-UK
ConfigMgr | Blog | @TheFrankUK
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.