Nov 23 2023 01:39 AM - edited Nov 23 2023 01:56 AM
Hi,
We are testing performance impact of CDC on SQL Server.
There are two identical databases (KST_S001, KST_002) on a SQL Server 2017 instance running in a Linux container. Both have CDC enabled for 180 tables, and a data generator is doing mostly updates on those tables.
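(For completeness, CDC was enabled per table in the usual way, roughly like the sketch below; the schema/table names are just placeholders, not our real ones.)

-- Enable CDC at the database level, then per table (placeholder names)
USE KST_S001;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema        = N'dbo',
     @source_name          = N'SomeTable',
     @role_name            = NULL,   -- no gating role
     @supports_net_changes = 0;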
The data generators are doing around 300k DML operations per minute on each database (600k total).
The CDC jobs are kept at their default configuration (500 transactions and a 5 s polling interval for the capture job, and, I think, a three-day retention period for the cleanup job).
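(The job settings we mean can be read back with sys.sp_cdc_help_jobs, which lists maxtrans, maxscans, the polling interval and the cleanup retention in minutes:)

-- Show the current CDC capture/cleanup job settings for this database
EXEC sys.sp_cdc_help_jobs;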
The host machine is Azure Standard_E8-4as_v4: 4 vCPU, 64 GB RAM, 12800 IOPS, 128GB SSD.
After 1 hour of running the setup with CDC enabled, the average time for 1000 updates is 73 ms.
With CDC disabled, the average time for 1000 updates after 1 hour is 15 ms.
I've attached two Grafana screenshots showing the difference in metrics with CDC disabled versus enabled on these databases.
We are trying to understand the underlying mechanism that is contributing to the impact visible from the metrics.
Why is memory consumption so much higher?
Why do DML operations take longer with CDC enabled?
CDC disabled:
CDC enabled:
Thanks for any answers and suggestions.
Peter
Feb 28 2024 10:34 AM - edited Feb 28 2024 10:36 AM
@PeterK32 Hello there!
CDC is (allegedly) "low impact", not "zero impact" - it sounds like you are simulating a very busy transactional system. DML operations are necessarily going to take longer when the CDC capture process is also scanning the transaction log.
How much longer? That depends.
There really aren't many details forthcoming from Microsoft (or anywhere else) regarding what kind of impact you can expect, which is understandable to a degree, since every situation is different.
Still, it would be nice to have some "official" steps or metrics that can be gathered to measure this.
In our testing (with considerably less activity than yours), we are seeing mainly a CPU increase, not memory.
Are you making sure to back up the logs frequently if running in full recovery? If not, the inactive portion of the log won't get released for reuse, and CDC will have to scan an increasingly large log file (at least that's my understanding, I could be wrong).
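A quick way to check both things at once for your two databases (as far as I know, CDC holding up truncation shows as REPLICATION in the last column):

-- Recovery model and what (if anything) is preventing log truncation
SELECT name,
       recovery_model_desc,
       log_reuse_wait_desc
FROM sys.databases
WHERE name IN (N'KST_S001', N'KST_002');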
Have you experimented with different Polling Intervals etc.?
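For reference, the capture job parameters can be adjusted with sys.sp_cdc_change_job; the values below are only examples, not recommendations:

-- Adjust the CDC capture job (example values only)
EXEC sys.sp_cdc_change_job
     @job_type        = N'capture',
     @maxtrans        = 1000,  -- transactions per scan cycle (default 500)
     @maxscans        = 10,    -- scan cycles per run (default 10)
     @continuous      = 1,     -- keep the job running continuously
     @pollinginterval = 10;    -- seconds between scan cycles (default 5)

-- Restart the capture job so the new settings take effect
EXEC sys.sp_cdc_stop_job  @job_type = N'capture';
EXEC sys.sp_cdc_start_job @job_type = N'capture';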
Good luck!
Mar 05 2024 05:48 AM
Hi @FalseDawn
Thank you for your reply.
The traffic volume is a pessimistic overestimation based on peak traffic on our system, since this was a stress test. The everyday traffic is much more down to earth, but we want to know what to expect in extreme situations.
The recovery model of the databases is SIMPLE.
Since my original post, we've increased the disk tier/speed, and now the main differences between the CDC and non-CDC setups are memory consumption and the size of the data file. The memory part is mostly in the log buffer (roughly tenfold). The data file growth is very significant, but that is understandable and can be mitigated by the cleanup retention period.
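For context, this is where we see the log buffer difference; nothing CDC-specific, just the standard wait-stats DMV:

-- Cumulative log buffer / log flush waits since instance start
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'LOGBUFFER', N'WRITELOG')
ORDER BY wait_time_ms DESC;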
"Have you experimented with different Polling Intervals etc.?"
We've isolated this by first doing a 10M-row update in a single transaction and measuring that, and then letting the capture job process the transaction and measuring that separately. The update itself did take longer (although not significantly after the disk upgrade), but the memory consumption and log buffer waits are much higher. The capture job then proceeded to grow the data file and consume CPU, but that is understandable and expected.
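In case it helps anyone reproducing this, the capture job's side of the work can be watched separately through sys.dm_cdc_log_scan_sessions (the column selection below is just what we found useful):

-- Recent CDC log scan sessions (capture job activity and latency)
SELECT TOP (20)
       session_id,
       start_time,
       end_time,
       duration,          -- seconds spent scanning
       tran_count,        -- transactions processed in the session
       command_count,     -- commands (change rows) written to the change tables
       log_record_count,  -- log records read
       latency            -- seconds between commit and capture
FROM sys.dm_cdc_log_scan_sessions
ORDER BY start_time DESC;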
It would be nice to have official documentation explaining the write-ahead logging overhead for tables with CDC enabled: whether additional data is written to the log when data in a CDC-enabled table changes, and if so, what it is, how much, and how it can be addressed.
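In the meantime, a rough way to quantify the extra log volume ourselves might be to run the same update against a CDC-enabled table and an identical non-CDC table and compare the log records. A sketch (fn_dblog is undocumented, and the table name is a placeholder):

-- Run in a quiet test database
CHECKPOINT;  -- in SIMPLE recovery this normally clears the inactive log,
             -- but note CDC can hold truncation until the capture job catches up

UPDATE dbo.SomeCdcTable   -- placeholder: table with CDC enabled
SET SomeColumn = SomeColumn + 1
WHERE Id <= 1000;

-- Summarize the log records generated by the batch
SELECT [Operation],
       COUNT(*)                 AS record_count,
       SUM([Log Record Length]) AS total_bytes
FROM sys.fn_dblog(NULL, NULL)
GROUP BY [Operation]
ORDER BY total_bytes DESC;

Running the same batch against the identical table without CDC and diffing the two summaries should show whether extra log records are written for the CDC-enabled table, or whether the overhead comes only from the capture job reading the log afterwards.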