SQL Server Database Corruption: Causes, Detection, and Some Details Behind DBCC CHECKDB
Database corruption in SQL Server is rare but usually high-impact. When it occurs, it threatens ACID compliance - the foundation of transactional integrity - and can lead to downtime, data loss, and operational risk. This article explores:
- Common causes of corruption
- How DBCC CHECKDB works under the hood
- Performance tuning tips for running CHECKDB
- Sample error messages and what they mean
- Best practices for prevention and recovery
Why ACID Matters in Corruption Scenarios
Before diving into causes and detection, remember that SQL Server guarantees Atomicity, Consistency, Isolation, and Durability:
- Atomicity: Transactions are all-or-nothing. If any part fails, the whole transaction must roll back. Corruption can break this, leaving partial writes behind.
- Consistency: Every transaction moves the database from one valid state to another. Corruption violates this by possibly introducing invalid states.
- Isolation: Concurrent transactions shouldn’t interfere with each other. Corruption in shared pages can cause concurrent transactions to see incorrect or inconsistent data.
- Durability: Once committed, data must persist, even in the event of system failure or crash. Disk-level corruption undermines durability guarantees.
DBCC CHECKDB exists primarily to validate consistency and durability, ensuring that the logical and physical structures adhere to ACID principles.
In any internal decision the engine makes, ACID is the main motivating factor, overriding performance, high availability, and any other considerations. This is by design: ACID is the single most important principle the engine upholds.
Common Causes of SQL Server Database Corruption
Corruption usually originates outside of SQL Server, often in the I/O path. Below are the main causes, ordered roughly by likelihood; the higher an item appears on the list, the more common it is:
- Hardware Failures
- Disk errors, RAID controller cache issues, or faulty RAM can corrupt pages during writes.
- Even with write-ahead logging, if the physical medium fails, durability is compromised.
- I/O Subsystem Issues
- SAN/NAS instability, outdated drivers, or virtualization misconfigurations can cause torn writes.
- SQL Server relies on the OS and storage stack for atomic page writes; instability breaks this assumption.
- Improper Shutdowns
- Power loss during write operations can leave pages partially written, violating atomicity.
- Page verification (TORN_PAGE_DETECTION or, preferably, CHECKSUM) mitigates this, but only if it is enabled; a snippet after this list shows how to check the setting.
- OS or SQL Server Bugs
- Rare, but missing cumulative updates can expose edge cases in buffer pool or checkpoint logic.
- File System Misconfiguration
- Compressed/encrypted volumes or sector size mismatches can corrupt allocation maps in edge cases.
- Human Error
- Manual deletion of MDF/LDF files or incorrect restore sequences can orphan pages in some unsupported scenarios.
- Malware
- Ransomware or malicious scripts altering system tables can break referential integrity.
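Following up on the torn-write point above, here is a quick way to confirm page verification is enabled. A minimal sketch, assuming a hypothetical database name:

SELECT name, page_verify_option_desc
FROM sys.databases;

-- Switch to CHECKSUM if the database is still on TORN_PAGE_DETECTION or NONE
ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;

Note that a page only gets a checksum the next time it is written, so enabling the option does not retroactively protect pages that are never modified.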
How DBCC CHECKDB Works Under the Hood
DBCC CHECKDB is SQL Server’s integrity verification tool, validating both physical and logical consistency:
1. Snapshot Creation
- Creates a transactionally consistent snapshot using sparse files (before a page is modified at runtime, its original version is copied into the sparse file, keeping the snapshot consistent).
- Ensures checks run without blocking user activity, preserving isolation, unless the WITH TABLOCK option is used, in which case CHECKDB runs against the live database and takes locks during execution (see the example after this section).
2. Phases of Execution
Allocation Checks
- Validates GAM, SGAM, and IAM pages for correct page allocation.
- Detects orphaned extents or double allocations.
System Table Checks
- Verifies metadata in the system catalogs (exposed through views such as sys.objects and sys.indexes).
- Ensures schema-level consistency.
Table and Index Structure Checks
- Traverses B-trees, validating key order and linkage.
- Detects broken pointers or incorrect page splits.
Page-Level Validation
- Reads every page, checks checksums or torn-page bits.
- Critical for durability verification.
LOB Checks
- Ensures integrity of large object chains in IAM and target pages (text, image, XML).
Cross-Object Consistency
- Confirms consistency across related structures, such as indexed views matching their base tables. Note that foreign key constraints are not validated here; that is the job of DBCC CHECKCONSTRAINTS.
3. Error Reporting
- Errors include severity and repair recommendation:
- REPAIR_REBUILD: Non-destructive, fixes structural issues.
- REPAIR_ALLOW_DATA_LOSS: Last resort; may delete corrupt pages, so the integrity of the “repaired” database is not guaranteed when it is used.
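As a concrete illustration of the above, a typical invocation with a hypothetical database name; NO_INFOMSGS suppresses informational messages so that only errors surface:

-- Default snapshot-based check, reporting errors only
DBCC CHECKDB ('YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Alternative: skip the snapshot and take locks on the live database instead
DBCC CHECKDB ('YourDatabase') WITH TABLOCK;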
Performance Tuning Tips for DBCC CHECKDB
CHECKDB is I/O and CPU intensive. Here are some options for reducing its impact while it runs:
- Use PHYSICAL_ONLY for Faster Checks
DBCC CHECKDB ('YourDatabase') WITH PHYSICAL_ONLY;
- Skips most logical checks, reducing load. This works well for frequent (for example, daily) scans, as long as a full check still runs periodically.
- Run on a Restored Copy
- Offload the workload to a non-production server using recent backups (a sketch follows this list).
- Leverage Availability Groups
- Execute on a readable secondary replica to protect the primary node.
- Control Parallelism
DBCC CHECKDB ('YourDatabase') WITH MAXDOP = 4;
- An explicit MAXDOP makes performance predictable; adjust the degree of parallelism to the CPUs available.
- Schedule During Low Activity
- Avoid peak hours; combine with Resource Governor for throttling.
- Break Down Large Databases
- Use DBCC CHECKTABLE for individual large tables if full CHECKDB is too costly.
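To illustrate the restored-copy approach from the list above, a minimal sketch, assuming hypothetical backup paths and logical file names:

-- Restore the latest full backup onto a non-production server
RESTORE DATABASE YourDatabase_Check
FROM DISK = N'\\backupshare\YourDatabase_full.bak'
WITH MOVE N'YourDatabase' TO N'D:\Data\YourDatabase_Check.mdf',
     MOVE N'YourDatabase_log' TO N'E:\Logs\YourDatabase_Check.ldf',
     STATS = 10;

-- Run the full check against the copy; production stays untouched
DBCC CHECKDB ('YourDatabase_Check') WITH NO_INFOMSGS;

A nice side effect: this doubles as a restore test, covering two best practices in one job.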
Sample Error Messages and Interpretation
- Msg 824
“SQL Server detected a logical consistency-based I/O error: incorrect checksum.”
→ Page-level corruption, often disk-related.
- Msg 8905
“Extent (1:12345) in database ID 5 is marked allocated in GAM, but not in SGAM.”
→ Allocation map inconsistency.
- Msg 2533
“Table error: Object ID 123456789, index ID 1. Page (1:98765) failed checksum.”
→ Corruption in an index or data page.
- Msg 8928
“Object ID 123456789, index ID 2: Page (1:54321) could not be processed.”
→ Structural issues in an index B-tree.
Apart from the errors above, you may also see issues directly related to the sparse files used when creating the snapshot:
- The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x00002a3ef96000 in file 'DBFile.mdf:MSSQL_DBCC18'
- The operating system returned error 1450 (Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x00002a3ef96000 in file with handle 0x0000000000000D5C. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists, then immediate action must be taken to correct it.
These errors are caused by the NTFS File Record Segment for the sparse file running out of space as its ATTRIBUTE_LIST_ENTRY list grows out of bounds, not by the CHECKDB command itself. For more details, see https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/database-file-operations/1450-and-665-errors-running-dbcc-checkdb.
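In addition to the error log, SQL Server records pages that hit 823/824 errors, bad checksums, or torn pages in msdb.dbo.suspect_pages; querying it is a quick way to spot recurring I/O problems:

SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;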
Prevention and Recovery Best Practices
- Run DBCC CHECKDB weekly, either directly on the database or on restored backups. This ensures that if corruption happens, you still have a recent known-good backup to restore from.
- Maintain verified backups using WITH CHECKSUM and regular restore tests (see the example after this list). It’s very important to confirm that a created backup can be fully restored, so there are no unpleasant surprises when an issue does happen.
- Use enterprise-grade hardware (RAID 10, ECC memory, UPS).
- Apply cumulative updates for SQL Server and Windows.
- Avoid unsupported storage configurations (compressed/deduplicated volumes).
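As a sketch of the backup verification practice above, with hypothetical paths:

-- Validate page checksums while writing the backup
BACKUP DATABASE YourDatabase
TO DISK = N'\\backupshare\YourDatabase_full.bak'
WITH CHECKSUM;

-- Confirm the backup is readable and its checksums are intact
-- (useful, but not a substitute for an actual restore test)
RESTORE VERIFYONLY
FROM DISK = N'\\backupshare\YourDatabase_full.bak'
WITH CHECKSUM;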
Key Takeaways
- Corruption is usually hardware or I/O related, not SQL Server bugs.
- DBCC CHECKDB is your first line of defence - schedule it regularly.
- Always restore from a clean backup if in any way possible; use repair options only as a last resort.
Critical Note on Repair Options
What triggered this blog was a number of incidents where the plan, in case a database goes suspect, was essentially to:
- Switch database to emergency mode
- Switch to single user mode
- Run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS without any prior checks to confirm that corruption is actually present
- Switch the database back to multi-user mode and use it normally
Note that a database can go suspect for multiple reasons, not all of them related to corruption; essentially, it happens whenever something stops the recovery process from completing. For example, if a deadlock involving the recovery thread causes that thread to be killed, the database goes to suspect mode. The plan above is not the recommended approach, whether the cause is corruption or something else.
Using REPAIR_ALLOW_DATA_LOSS can leave the database logically inconsistent. You always need to validate data post-repair and address root causes (hardware, OS issues) before attempting recovery. For detailed guidance, see https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/database-file-operations/troubleshoot-dbcc-checkdb-errors.
Note that the repair recommendation reported after a run is the lowest repair level that can address all the errors CHECKDB found. However, “minimum” doesn’t mean it will fix everything: some errors simply can’t be repaired. You might also need to run the repair more than once, since removing some data can expose additional unlinked pages that themselves need to be dropped as part of repair. Keep in mind that not every error requires this level of repair, and using REPAIR_ALLOW_DATA_LOSS doesn’t always lead to data loss. The only way to know whether fixing an error will cause data loss is to actually run the repair and verify the data afterwards.
A helpful tip: you can run DBCC CHECKTABLE on any table that shows errors; it will tell you the minimum repair level needed for that specific table.
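For example, with a hypothetical table name:

DBCC CHECKTABLE ('dbo.YourTable') WITH NO_INFOMSGS;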
It’s extremely important to remember that after running CHECKDB repair with data loss, you must manually validate your data. The repair process doesn’t guarantee logical consistency. For example, REPAIR_ALLOW_DATA_LOSS might remove entire data pages with inconsistent data. If that happens, tables with foreign key relationships could end up with rows that no longer have matching parent keys.
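For completeness, here is roughly what that last-resort sequence plus post-repair validation looks like; the database name is hypothetical, and this should only ever run after backing up the damaged database and addressing the root cause:

-- Bring a suspect database to a state where CHECKDB can run
ALTER DATABASE YourDatabase SET EMERGENCY;
ALTER DATABASE YourDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Last resort: may delete corrupt pages, and data loss is possible
DBCC CHECKDB ('YourDatabase', REPAIR_ALLOW_DATA_LOSS);

ALTER DATABASE YourDatabase SET MULTI_USER;

-- Surface foreign key and check constraint violations left behind by removed pages
USE YourDatabase;
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;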
I hope this post was able to bring a bit of clarity to the complicated topic of corruption causes and recovery. It’s important to remember that once corruption happens, your options are limited; you have to be prepared before the issue occurs.
Further Reading
- https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-checkdb-transact-sql
- https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/database-file-operations/troubleshoot-dbcc-checkdb-errors
- https://learn.microsoft.com/en-us/sql/sql-server/failover-clusters/automatic-page-repair-availability-groups-database-mirroring?view=sql-server-ver17