First published on MSDN on Jan 07, 2011

I received a request today for help on how SQL Server I/O behaves.   As I was writing up my e-mail response I thought it would also make an interesting blog post.

Sent: Friday, January 07, 2011 2:53 AM
To: Robert Dorr
Subject: Async IO questions

What we do know :

Considering that in Windows the I/O API's allow for sync requests via calls to the API such as WriteFile that will not return control to the calling code until the operation is complete.
Async hands the request off to the operating system and associated drivers and returns control to the calling code.

[[rdorr]] Correct, this is the key to how SQL does the IO so we can hand off the IO and use other resources (CPU) while the IO is being handled by the hardware.

So , when we look at SQL Server which mostly use Async I/O patterns , it  exposes the pending (async) I/O requests in the sys.dm_io_pending_io_requests ,the column ‘io_pending’ provides insight into the I/O request
and who is responsible for it , if the value is TRUE it indicates that 'HasOverlappedIOCompleted' in the Windows the I/O API returned FALSE and the operating system or driver stack has yet to complete the I/O.

[[rdorr]] Correct and exactly why we exposed the column.   This lets us tell if it is a Non-SQL issue or if the IO has been completed and SQL is not processing it in a timely fashion.

Looking at the io_pending_ms_ticks indicates how long the I/O has been pending ,if the column reports FALSE for io_pending it indicates that the I/O still has been completed at the operating system level and
SQL Server is now responsible for handling the remainder of the request.

[[rdorr]] Correct.  We snap the start time when we post the IO (right before the read or write call) and the column is materialized as the result of HasOverlappedIoCompleted.   If you look at HasOverlappedIoCompleted this is really a macro that checks the INTERNAL member of the OVERLAPPED structures for != STATUS_PENDING (0x103, 259).   So in a dump we already look at the internal status on the OVERLAPPED for pending status value.

What we do NOT know and would like detailed feedback on :

When SQL server hands of the Async I/O request to the windows I/O API ,

A1) What is the order of action to complete a write from the I/O API's side in a SAN environment ?
A2) Lets consider an environment where we have delayed IO as a result of a substandard SAN or SAN configuration

a) In a case where the Windows I/O API could not complete the I/O (write) due to a time out or a missing I/O request below the driver / API stack, would the Windows I/O API try to reissue the I/O (write)
OR would the Windows I/O API  thru some call back mechanism in the SQLOS layer notify the SQLOS of such a failure and then rely on the SQLOS to reissue the I/O write request ?
Provide as much detail as possible.

[[rdorr]] The environment does not matter at all to SQL Server.  It could be direct attached, SAN or other Io subsystem.   SQL Server calls the write routine with the OVERLAPPED structure.   This is a kernel transition which builds an IRP and puts it in motion, returning control from the API to SQL Server.  Each of our IOs are placed on a pending list.  This list is associated with the local SQLOS scheduler the task is running on.  Each time the a switch takes place on the scheduler (< 4ms) the I/O list used HasOverlappedIoCompleted to check the status.   If the IO is no longer STATUS_PENDING the completion routine registered with SQL Server is fired.   This is not a completion routine setup with Windows it is an internal function pointer associated with the IO request structure maintained by SQL Server.   The callback routine will check the sanity of the IO (error code, bytes transferred, read check page id, etc..) and then release the EX latch so the page can be used by any requestor.

Not exactly sure what you mean by timeout of the IO.  SQL Server does not allow CancelIO so the IO will stay pending until it completes or returns an error.   From a SQL Server standpoint there is no timeout only success of failure.    If you are talking about HBA level timeouts it is driver specific and the hardware vendor implements the details.   For example a dual HBA system can have a timeout value.  When the HBA can’t service the IO and the timeout is exceeded it will transfer the IO to the alternate HBA controller.   We have seen issues in the past with this pattern where the timeout is not setup properly.  It was supposed to be set to 5 seconds but instead was 45 seconds.  This meant the IO was pending for 45 secs on an HBA that had lost communications with the SAN and would not move the IO to the second controller until that timeout was expired.   The query timeouts from the application were set to 30 seconds so when we went to flush a log record for a commit the SQL Server queries were getting cancelled at 30 seconds because of the underlying hardware setup.

I am not aware of any of the Windows APIs that retry the IO and even if they do SQL Server would not be aware of it.  The OVERLAPPED structure is only updated by the IRP completion routine here.   When the IRP completes one of the last things that occurs is the kernel callback routine if fired.  This callback routine does the basic cleanup, sets the values in the OVERLAPPED structure such as Internal (status value) InternalHigh (bytes transferred) and such so that a caller to GetOverlappedResult or HasOverlappedIoCompleted can obtain the proper state.   It then checks to see if the OVERLAPPED structure contains a valid event and if so will signal it and finally if the IO request is registered with a completion port it will queue the IO to the completion port.    For disk IO SQL Server does not use the completion port.   SQL Server posts the IOs with the OVERLAPPED and an event.  Every IO on the same SQLOS scheduler used the same event.  This allows a scheduler in an idle wait state wake on the event and check the IO list right away.

With that said there are a couple of retry locations in SQL Server but not like you might expect.   When you first attempt to post the IO Read/Write file you can get the error (1450 or 1452) returned which is an out of resources error indicating that the IO could not be started (no IRP created).  In these cases the SQL Server will Sleep for 1 second and attempt to post the IO again.   In these cases the IO is not on the IO list and may not show up in the DMV because the IO is not pending.

For Reads Only if the SQL Server completion routine detects damage (failure of some sort, bad page header, checksum, …) we can retry the same IO up to 4 times before we consider it a failure.   We have found (SQL and Exchange) that if a read fails you can retry the read and sometimes it works properly.  If you do fail 4 times in a row for the same page it is usually damaged and we will log the error.   In either case SQL Server logs information in the error log about this condition.

A failed write will leave the buffer in the buffer pool, hashed with the BUF->berrno set to a value other than 0.  This will essentially, poison the page and for data integrity SQL Server will no longer allow access to the page nor will it write the page to disk.   Since a write is often offline from the query (checkpoint or lazy writer) the original query that dirtied the data is usually not aware of the situation.  However, if that write was for the log records (WAL protocol requires log records to be flushed before commit if valid) not only is the query notified of the error but the database is often marked suspect.   SQL Server has some mechanisms to detect when a log write fails and will even try to take the database offline and bring it online to rectify the situation.  If this fails the database is marked suspect.  If this succeeds SQL Server has cleared a database error condition and allowed runtime but the DBA should be looking closely at the system and their backup strategies.

With all this said there are some situations we have found with snapshot databases that result in OS error 1450 or 1452 that can’t be resolved on Windows 2003.   The error code was changed by Windows 2008 when the spare file limitation is reached and the 1450/1452 can’t be resolved so SQL will stop the retry attempts and be given a clear error message instead.

b) Taking into account the answer above, How many times would the component responsible retry/reissue the I/O (write) request  before it reports a failure up to the calling components ? Sybase = 10 times.

[[rdorr]] Any write request results in a report to the error log.  As I stated the only failure for a write that is generally retried are (1450 and 1452) error conditions.    I went back to the code in SQL Server that handles acquiring the buffer latch (AcquireIoLatch and AcquireLatchActions) on SQL 2008 and it always checks the BUF->berrorno value as I outlined.  There are some special conditions to allow page level restore on enterprise and such but for the most part once a write fails SQL Server won’t retry the write as we can’t trust the data.   Let me give you an example.

If the database is enabled for checkpoint it enables a feature named constant page validation.   When the page is read into memory or was written and is again clean and has a checksum the LW and other actions may validate the checksum the first time the page prepared to become dirty.  If the checksum fails we know the page has been changed while in memory, incorrectly.  Perhaps bad ram a memory scribbler or such activity.   We would never want to flush this page so the BUF->berrno is set to indicate the checksum failure, preventing any further use of the page for any action.

c) and most importantly, can we influence the wait-time before an I/O request gets re-issued ?
d) Finally what is the difference in SQL2000 and SQL2008 wrt IO retries, and the has the IO engine been changed significantly between these two versions, specifically wrt retries, and layer where IO gets handled.

[[rdorr]] Read retry is the only major difference.  SQL 2000 did not retry a failed read because of a page header problem for example.    SQL 2008 can build a write request that is 2x the size of SQL 2000 but that is a perf issue.  Read ahead sizes have always been controlled by SKU.

We think the answer is YES, so any answer would require some detail discussion on this point, possibly with a follow-up. We will do some tests to validate our suspicions with evidence (Shannon)

B1) Does IO to the ERRORLOG happen in an asynchronous way with queuing or measures to minimise SMP contention as well as Engine sync IO spin-wait?

[[rdorr]] Error log is written sync, it does not use the same IO path as database and log writes.

We have trace flags switched on which log's each login into the SQL-server in the ERRORLOG (details/...)
I.E. if IO to the errorlog disk is slow, or happens from 8 engines, could that slow things down significantly.

[[rdorr]] The error log path is a different animal.   The path for handling a report to the log is really quite long.   It has checks for all the dumptrigger behavior, the message has to be formatted each time which involved FormatMessage and the default heap, the messages are reported to the event log with ReportMessage and this path gets longer on a cluster and we write the information to the log in a serialized way.   Also, this requires the workers to go preemptive to avoid holding the OS scheduler so it puts different pressure on the Windows scheduler than you might expect as well.

One way to remove the Event log from the picture as a test is to start SQL Server with the command line parameter of (-n) to avoid writing to the event log.

So I suppose it would be possible that a slow drive could appear to slow SQL Server down but if you could maintain that logic rate the other parts of the logic path are generally much longer than the error log.  For integrated connections you have to talk to Windows (DC usually), you have to validate that login information such as database context, permissions and such.   The longer part of the login would be outside the actual IO to the drive.

Any of this is really easy to review using the public symbols.   You can use the Windows debugger and set a breakpoint on CreateFileW.  When SQL starts up you can see how it opens the file (which flags) as documented by the Windows API on MSDN.com.

Here is a SQL Server 2008 opening the error log

00000000`05727b08 00000000`035831f1 kernel32!CreateFileWImplementation
00000000`05727b10 00000000`035d4a9f sqlservr!DiskCreateFileW+0xf1
00000000`05727b70 00000000`00cab289 sqlservr!DBCreateFileW+0x1bf
00000000`05727e70 00000000`0123560d sqlservr!CErrorReportingManager::InitErrorLog+0x4e9

FileName = struct _UNICODE_STRING "C:Program FilesMicrosoft SQL ServerMSSQL10.SQL2008MSSQLLogERRORLOG"
dwDesiredAccess = 0x40000000  = GENERIC_WRITE(0x40000000)
dwFlagsAndAttributes = 0x000080   ß Normal (0x80)

By default it allows system cache involvement to avoid some of the performance issues you might be suspecting, but you can force it to use FILE_FLAG_WRITE_THROUGH  (-T3663).   One thing to watch here is registry checkpoints.  We have seen a few system that when the registry is checkpointed the IO path slows it response and if the system drive is shared with the error logs or even database files that are some really limited conditions that can impact performance.

We open the error log with either FILE_ATTRIBUTE_NORMAL(0x00000080) or FILE_FLAG_WRITE_THROUGH(0x80000000) and we do not use FILE_FLAG_OVERLAPPED(0x40000000) so the IO is handled by Windows sync, NOT Async .

Testing this is really easy.  You can produce messages at any rate you want with the following.

raiserror('Bob' ,25,1) with log

00000000`1606b878 00000000`7773328a kernel32!WriteFile
00000000`1606b880 00000000`04013055 kernel32!WriteFileImplementation+0x36
00000000`1606b8c0 00000000`0173ffb6 sqlservr!DiskWriteFile+0xa5
00000000`1606b900 00000000`01740b6e sqlservr!CErrorReportingManager::WriteToErrLog+0xc6
00000000`1606b960 00000000`01746dbb sqlservr!CErrorReportingManager::SendErrorToErrLog+0xb7e

[[rdorr]] The other thing I will caution you on is server based trace here.  This has to do the IO and when tracing to a file we guarantee no event loss.  So I have seen lots of cases where customers setup a trace of logins for example but the drive should be as good as the LDF drive and they have pointed the trace destination to a network share or something and caused a bottleneck

Bob Dorr - Principal SQL Server Escalation Engineer