The Senior Escalation Engineers do various training and mentoring activities. As I do this I thought I would try to propagate some of this information on the blog.
I encountered an interesting issue today that I would like to share. The problem ended up being that the a database snapshot was established for reporting purposes and all the sudden the primary database started encountering performance problems. In this case it got bad enough that the server was reporting I/O timeouts and buffer latch timeouts.
A SQL Server 2005 database can have 1 or more snapshot databases (replicas). Online DBCC also uses a replica to perform checks. As soon as a replica is in place the
Copy On Write
behavior is established for the replica(s). I think
copy before write
provides a better visual.
SQL Server uses the latching mechanisms to protect the physical page data. To make a change on a page a request is made to 'prepare to dirty'. This request checks for any replica(s).
By default pages are not copied to the snapshot files. These files are allocated as spare files and only those pages that have been dirtied after the snapshot creation are copied. This allows the snapshot to retrieve the before image of the pages for those that have been modified in the primary database. For the pages that have not changed they are retrieved from the primary database reducing the size of the snapshot storage.
In this diagram I am showing that page 100 is being modified(dirtied) and it has not been copied to the replica yet (space in the replica has not been allocated yet). When this condition is detected the page is written to the replica.
The replica(s) do not maintain .LDF file(s) instead the during the prepare to dirty the page must be completely written to the replica (hardend to the snapshot stable storage). This becomes the critical path in this performance scenario for a snapshot action.
In this case I was working the issue was two fold.
The snapshot had been placed on a sub-par disk sub-system. This immediately limited the production database modification capabilities to the speed of the replica sub-system.
The primary database also had many snapshot databases. Mon, Tue, Wed, ... so each time an update took place the additional overhead of the replica checks and possible write was also taking place.
The solution for this issue was to manage the location of the replica with the same performance guidelines as the primary database and to evaluate the need for a large set of snapshots for the same database.