From:
Robert Dorr [mailto:rdorr@microsoft.com]
Sent:
Thursday, November 20, 2008 11:29 AM
To:
A SQL PASS MEMBER - CSS First Aid Station Question About SQLIOSim
As we discussed at SQL PASS – SQLIOSim is an independent utility and has been used as part of the Windows Hardware Compatibility Testing (WHCL) suite for several years. Many of the major hardware manufacturers use it as part of their release criteria testing as well. I recently worked with a major hardware vendor. They were getting ready to release a new HBA controller series and SQLIOSim was reporting errors. After some investigation the root cause was a firmware bug that would have lead to SQL Server corruption. The firmware was corrected before the product was released.
SQLIOSim performs various I/O patterns. These include ones that mimic various SQL Server I/O requests (read ahead, bcp, backup, dbcc, random writers and readers, …) It also performs stress testing by posting un-throttled amounts of I/O and other random patterns. This is why it is perfect for data consistency checking but cannot be used for performance testing.
I had numerous inquiries at PASS about the output of SQLIOSim and I provided the same guidance. If you are getting errors you should consider it an I/O path problem that will lead to SQL Server corruption with the potential of total database loss. Here is a quote from another customer: “Our file system administrators don’t like us to run SQLIOSim because it puts heavy stress on the SAN. However, the last 3 times we have attempted to add a new machine to production it has triggered an error because the machines where incorrectly configured. I won’t put a machine in production if SQLIOSim does not report success.”
The easiest way to make SQLIOSim completes with success is to set the configuration option to STOP ON ERROR. This way if a data integrity error is encountered the utility logs extended details and shutdown.
For your specific issue I took a look at the page dumps and you really need to revisit the hardware and I/O path configuration.
If you have not read my white papers on “SQL Server I/O Basics” Chapter 1 and Chapter 2 please take the time to do so. It outlines how a stale reads, lost writes, torn pages, checksum failures and other I/O failure conditions are detected and what this means to the data.
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx
http://www.microsoft.com/technet/prodtechnol/sql/2005/iobasics.mspx
SQLIOSim used the secure crypto APIs to create random data patterns on the pages. On the page is a small header containing the page id and the expected checksum value. These are used to help determine the state of the page.
Diag File
|
Condition/Characteristics
|
SqlSimErrorDump01863
|
Classic Stale Read Pattern
The PageId between on expected page image and the page read from disk is the only data that is the same. All other data is different. This is usually a condition where the hardware returned a stale image of the page from cache – firmware bug.
NOTE: We re-read this from disk 15 times and still received the older buffer. Probably from hardware cache.
Data mismatch between the expected disk data and the read buffer:
File: F:sqliosim.mdf
Offset: 0xD2F9CE000
Expected FileId: 0x0
Received FileId: 0x0
Expected PageId: 0x697CE7
Received PageId: 0x697CE7
Expected CheckSum: 0xA6DDED06
Received CheckSum: 0x76E039CE (does not match expected)
Calculated CheckSum: 0x76E039CE
Expected Buffer Length: 0x2000
Received Buffer Length: 0x2000
Synchronous read was not successful after 15 attempts
|
SqlSimErrorDump01864
|
Classic Stale Read Pattern
|
SqlSimErrorDump01864
|
Classic Stale Read Pattern
|
SqlSimErrorDump01888
|
Classic Stale Read Pattern
|
SqlSimErrorDump01909
|
Classic Stale Read Pattern
|
Bob Dorr
SQL Server Principal Escalation Engineer
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.