The title is actually a little misleading, because the issue is not limited to AlwaysOn, but let's start there. The symptom: you have an AlwaysOn environment with a readable secondary, and when the secondary loses synchronization or the secondary's SQL Server service stops, you may find that query performance on the primary degrades as well.
In this situation, the first suggestion is always to collect the ACTUAL execution plan and compare the plan from before the secondary stopped with the plan from after. If the plans are exactly the same, then this article applies.
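If you want to capture the actual plan from T-SQL rather than from the SSMS "Include Actual Execution Plan" button, one way (assuming you can rerun the query) is:

```sql
-- Returns the actual execution plan as XML alongside the result set
SET STATISTICS XML ON;
SELECT * FROM tb1;
SET STATISTICS XML OFF;
```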
Check the I/O information of the query with SET STATISTICS IO ON, a Profiler trace, or Extended Events (any way that gives you the I/O of a query):
set statistics io on
select * from tb1
set statistics io off
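If you prefer Extended Events over STATISTICS IO, a minimal session sketch is below; the session name and file name are arbitrary choices of mine, not from the original article:

```sql
-- Capture per-statement I/O (logical_reads is a column of this event)
CREATE EVENT SESSION [trace_query_io] ON SERVER
ADD EVENT sqlserver.sql_statement_completed(
    ACTION(sqlserver.sql_text, sqlserver.session_id))
ADD TARGET package0.event_file(SET filename = N'trace_query_io');

ALTER EVENT SESSION [trace_query_io] ON SERVER STATE = START;
```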
Then compare the I/O under healthy and unhealthy AG status. Now let me use a demo so you can see the difference.
CREATE TABLE [dbo].[tb1](
[id] [int] NULL,
[remark] [varchar](10) NULL -- second column, so the two-value inserts below succeed
)
create clustered index ID on tb1(ID)
insert into tb1 values(1,'delete')
go 20100
delete from tb1 -- this will generate ghost rows
select * from tb1
In this situation, the I/O will be small: we already deleted the data, and the ghost cleanup task is expected to release the pages.
Table 'tb1'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Now insert the rows back:
insert into tb1 values(1,'delete')
go 20100
Stop the readable secondary replica, then run the delete on the primary again:
delete from tb1 -- this will generate version ghost rows
select * from tb1
In this situation, the I/O will still be large, because the pages contain version ghost rows and cannot be released:
Table 'tb1'. Scan count 1, logical reads 109, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
In this case, you can use the query below to check the version ghost rows of the table:
use yourdbname
SELECT ghost_record_count, version_ghost_record_count, *
FROM sys.dm_db_index_physical_stats (DB_ID(N'yourdbname'), OBJECT_ID(N'yourtablename'), NULL, NULL, 'DETAILED')
-- DETAILED mode is required: ghost_record_count and version_ghost_record_count are NULL in LIMITED and SAMPLED modes
In my case, it is 20100 rows, and it never gets reduced unless the secondary comes back and synchronizes again.
Actually, it is not that the ghost cleanup thread is lazy; it is that the ghost cleanup thread is unable to clean rows counted in version_ghost_record_count. Only when a version ghost record is 'converted' to a regular ghost record (i.e., moves into ghost_record_count) can the ghost cleanup actually remove the data.
The rows were 'converted' to ghost_record_count after I started the secondary:
Books Online actually describes this issue for AlwaysOn:
"Because read operations are mapped to snapshot isolation transaction level, the cleanup of ghost records on the primary replica can be blocked by transactions on one or more secondary replicas. The ghost record cleanup task will automatically clean up the ghost records for disk-based tables on the primary replica when they are no longer needed by any secondary replica. This is similar to what is done when you run transaction(s) on the primary replica. In the extreme case on the secondary database, you will need to kill a long running read-query that is blocking the ghost cleanup. Note, the ghost clean can be blocked if the secondary replica gets disconnected or when data movement is suspended on the secondary database. "
You can also hit this issue when active RCSI/SI (read committed snapshot isolation / snapshot isolation) transactions still need the deleted versioned rows.
Then how do you resolve this issue? Bring the secondary back into synchronization (or resume data movement), and use the query below to find the active snapshot transactions that still need the row versions:
select t.text, a.*, b.kpid, b.blocked, b.lastwaittype, b.waitresource, b.dbid, b.cpu, b.physical_io,
       b.memusage, b.login_time, b.last_batch, b.open_tran, b.status, b.hostname, b.program_name,
       b.cmd, b.loginame, b.request_id
from sys.dm_tran_active_snapshot_database_transactions a
inner join sys.sysprocesses b
    on a.session_id = b.spid
cross apply sys.dm_exec_sql_text(b.sql_handle) t
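Once you have identified the blockers, the fixes follow from the BOL paragraph quoted earlier; a sketch of the two common actions (the database name is the placeholder from above, and 61 is a hypothetical session_id):

```sql
-- If data movement was suspended on the secondary database, resume it
-- (run on the secondary; replace yourdbname with your database name)
ALTER DATABASE [yourdbname] SET HADR RESUME;

-- If a long-running read query on the secondary is blocking ghost cleanup,
-- kill that session on the secondary (61 is a hypothetical session_id)
KILL 61;
```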