Issue:
When trying to generate a snapshot, the job is failing with the error:
Error 1- "The connection is broken and recovery is not possible. The connection is marked by the server as unrecoverable. No attempt was made to restore the connection"
After the first error, the customer started to see other errors:
Error 2 - Error: 4891 - "Insert bulk failed due to a schema change of the target table." Message: Replication-Replication Distribution Subsystem: agent mysqlmi01-smdm.-DIS_a -mysqlmi01-smdm.-1 failed. The process could not bulk copy into table '"dbo"."xxx_DISTRIBUTOR_FACT"'.ErrorId = 42179, SourceTypeId = 0 ErrorCode = '' ErrorText = ''
Error 3 - Error: 1326/2250 - The process could not read file X due to OS error 1326 [ERROR_LOGON_FAILURE, The user name or password is incorrect.]. As a result of this failure, the agent attempted to close the connection, however, that also failed with: Disconnected from Azure Storage Y with OS result code: 2250 [ERROR_NOT_CONNECTED, This network connection does not exist.].
Error 4- Error: 13800 - INSERT BULK statement does not support recompile
Error 5- Error: 21036 - Another distribution agent for the subscription or subscriptions is running, or the server is working on a previous request by the same agent.
Investigation/Analysis
All the errors could be seen on Replication monitor. Snapshot and distrubution Agent Job also contains the errors mentioned. Tables MSrepl_errors, MSdistribution_history, and MSsnapshot_history from the distribution database can also be queried.
Mitigation
Error 1 - The error was mitigated by using “concurrent” Sync method. (@sync_method = 'concurrent') while setting up the publication. With concurrent, the snapshot process will only hold a short lock at the start of the snapshot scripting process but will then continue without holding any locks or transactions. Please note that using the concurrent as the sync method will mitigate the issue but also slows down the entire snapshot process as it makes the operation single threaded (Which is otherwise multi – threaded). More details on this can be found on https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-addpublication-transact-sql?view=sql-server-ver16
Error 2 - In general, this error occurs during BCP-in command when there are parallel index rebuilds and/or auto update stats is on while BCP command is executing. In a certain setup, it has been observed that affected subscriber table had more indexes than corresponding publisher table. Regardless of that, customer is advised to remove/disable all indexes on subscriber tables while BCP operations are in progress and also to check and if necessary to turn off the auto update stats on the subscriber side.
Error 3 - The issue was triggered by a code defect in connection sharing module where in rare race condition situation underlining connection state was incorrectly propagated. Engineering team was able to replicate and fix the issue in the local test environment. The issue is mitigated by distribution agent restart as that creates a new clean connection. Since the agent is running in scheduled mode, it restarts itself automatically after running into the issue, so the issue mitigates automatically too. A repair Item for this is being tracked and the fix will be released soon.
Error 4 - While this Error is still under active investigation, initial analysis states that dropping all indexes (except PK) on the subscriber side and retrying the operation might fix the issue, although a second thought process relates this to a Bug. Investigation is underway and testing the index fix with the cx is also well in progress. A third theory also states that the error is produced if Index Rebuild Job and Snapshot loading job in replication step over one another. Disabling the Index rebuild job has helped in some situations.
Error 5 - It already suggests that there are multiple distribution agents running at the same time. This error is thrown only in cases when distribution agent is running and trying to acquire a lock that ensures only one agent is running at a time. If it cannot acquire it, it means there is another agent already running and the abovementioned error is thrown. The easiest solution is to stop all distribution agents and run only one instance.