This time I have been asked to explain how Advanced Queuing behaves -- how retries work and when we decide to reroute mail. Since I’ve had plenty of questions on this in the past, buckle your seatbelts – this one is going to be interesting.
Functions of Queuing:
- Categorize messages – Figure out the final destination (by calling the categorizer sink, which asks the directory to provide us with mailbox information)
- Route messages – Figure out the next hop (calling the Routing sink)
- Submit messages to a delivery agent – SMTP for remote delivery or Store (calling a Store Driver sink) for local or gateway delivery
- Provide retry capability in case of errors during any of these processes
The last one is the one that I want to explain in more detail. In particular, I want to address the remote delivery queues, rather than all the internal queues, although they behave similarly.
Figure: Exchange Advanced queuing engine flow
At first, it sounds simple. But there’s a fine balance between being efficient and responsible with resources vs. getting a message to its destination as quickly as possible. Windows SMTP (with or without Exchange) does an excellent job of balancing the two, with some of the best performance you’ll find from any SMTP product with similar functionality.
Definitions:
- Queues – A group of messages (FIFO) waiting to be delivered with the same final destination
- Links – A group of queues that share the same next hop
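To make the two definitions concrete, here is a minimal sketch of how queues and links relate. The class and field names (`Queue`, `Link`, `enqueue`) are illustrative assumptions, not names from the actual queuing engine.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Queue:
    """Messages waiting for one final destination, kept in FIFO order."""
    destination: str
    messages: deque = field(default_factory=deque)


@dataclass
class Link:
    """A group of queues that share the same next hop."""
    next_hop: str
    queues: dict = field(default_factory=dict)

    def enqueue(self, destination, message):
        # Create the per-destination queue on first use, then append (FIFO).
        queue = self.queues.setdefault(destination, Queue(destination))
        queue.messages.append(message)


# Hypothetical example: two destinations routed through one smart host.
link = Link("smarthost.example.com")
link.enqueue("a.example", "msg-1")
link.enqueue("a.example", "msg-2")
link.enqueue("b.example", "msg-3")
```

In this model, one link holds two queues, and each queue hands out its messages oldest first.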
How it Works:
This overview covers Windows SMTP both with and without Exchange installed. Of course some routing functions are simpler without Exchange, and there is no queue viewer UI without Exchange.
First of all, either the store or SMTP submits a message to queuing (or a message is dropped in via pickup). The message consists of both the physical copy that you see and some less visible envelope information (e.g., BCC). The entire time that the message is being processed, the physical copy of that message stays where it was (either in the store or on disk, respectively). Queuing simply has a pointer to that message as it moves through various queues. Properties (envelope information like timestamps for timeouts and categorizer-specific properties) are written back to the message so that if SMTP needs to be restarted for any reason, processing of the message can still be resumed.
After the categorizer and routing sinks (if used) determine the destination, the message is placed into a remote delivery queue. The SMTP outbound protocol looks at the queues and creates connections as necessary: it looks up the destination IP address in DNS if needed, attempts a connection, and then pulls out messages going to the same destination (link). Because creating a new connection carries overhead, it grabs up to 20 messages at a time for each connection. If there are more than 20 messages, additional connections can be created as needed, each taking up to 20 messages. This value is configured on the Messages tab of the SMTP VS. The session uses the options configured for that connection (AUTH, TLS, etc.).
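The batching behavior can be sketched roughly as follows. This is only an illustration of "up to 20 messages per connection"; the function names `take_batch` and `plan_connections` are hypothetical, and the real engine creates connections on demand rather than planning them up front.

```python
from collections import deque

# Default batch size described in this post; configurable on the SMTP VS.
MAX_MESSAGES_PER_CONNECTION = 20


def take_batch(queue, limit=MAX_MESSAGES_PER_CONNECTION):
    """Remove and return up to `limit` messages from the front of a FIFO queue."""
    batch = []
    while queue and len(batch) < limit:
        batch.append(queue.popleft())
    return batch


def plan_connections(queue, limit=MAX_MESSAGES_PER_CONNECTION):
    """Split a queue into per-connection batches (hypothetical helper)."""
    batches = []
    while queue:
        batches.append(take_batch(queue, limit))
    return batches


pending = deque(f"msg-{i}" for i in range(45))
batches = plan_connections(pending)
batch_sizes = [len(b) for b in batches]
```

With 45 pending messages, this sketch yields three batches: two full ones and a final batch of five.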
There are three main types of failures:
- Per recipient failures: failures that are deemed by the remote host to be particular to a specific recipient (e.g., mailbox too full)
- Per message failures: failures that are deemed by the remote host to be particular to a specific message (e.g., too many recipients, invalid message body, invalid sender)
- Per connection failures: failures that apply to the entire connection
These errors are determined solely by the SMTP response code, not by the human readable error message. Syntax errors on the part of the sender may be treated as one of these, depending on where in the protocol conversation they occur.
Of course there are also two types of severity:
- Retry-able: 400-level SMTP error implying that if the same message gets sent at a later time, it will go through
- Permanent: 500-level SMTP error implying that if the same message were to be tried again, it still would not go through
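As a small illustration of judging severity from the response code alone, the sketch below reads the three-digit code at the start of a reply line and ignores the human-readable text. The function name `classify_severity` and the "other" label are assumptions made for this example.

```python
def classify_severity(reply_line):
    """Classify an SMTP reply by its numeric code only, ignoring the text."""
    code = int(reply_line[:3])
    if 400 <= code <= 499:
        return "retryable"   # trying again later may succeed
    if 500 <= code <= 599:
        return "permanent"   # trying again would fail the same way
    return "other"           # success or intermediate replies (2xx/3xx)


full_mailbox = classify_severity("452 4.2.2 Mailbox full")
unknown_user = classify_severity("550 5.1.1 User unknown")
accepted = classify_severity("250 OK")
```

Note that a misleading text such as "550 please try again later" would still be treated as permanent here, which matches the point that only the code matters.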
If delivery fails for nearly any permanent reason, an NDR (a form of DSN) is generated as a new message going back to the sender. The original message is then deleted from the queue. In the case of per-recipient failures, the successful recipients are not listed as failures in the NDR. Of course, any NDR that cannot be delivered is placed in badmail.
Glitch Retry:
If delivery fails for nearly any retry-able reason, the queue is immediately placed in a glitch-retry state for 60 seconds. If there are three of these failures in a row, then the queue enters a true retry state. Remembering that a queue is FIFO, the same message will be retried two times before being marked as a problem message. Problem messages are sent to the back of the queue so that other messages may have an opportunity to pass through.
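One possible reading of the glitch-retry rules is sketched below. It is a simplified model, not the engine's actual logic: the class name `RemoteQueue` and its methods are hypothetical, timers are omitted, and the thresholds mirror the defaults described in this post (60 seconds, three consecutive failures, two failures per message).

```python
from collections import deque

GLITCH_RETRY_SECONDS = 60             # wait before the next attempt (not simulated here)
GLITCH_FAILURES_BEFORE_RETRY = 3      # consecutive failures before true retry
PER_MSG_FAILURES_BEFORE_PROBLEM = 2   # failures before a message is a "problem"


class RemoteQueue:
    def __init__(self, messages):
        self.messages = deque(messages)   # FIFO: index 0 is tried first
        self.failures = {}                # message -> failure count
        self.consecutive_failures = 0
        self.state = "active"

    def record_success(self):
        """The head message was delivered; clear the failure streak."""
        self.messages.popleft()
        self.consecutive_failures = 0
        self.state = "active"

    def record_retryable_failure(self):
        """The head message hit a retry-able error."""
        head = self.messages[0]
        self.failures[head] = self.failures.get(head, 0) + 1
        if self.failures[head] >= PER_MSG_FAILURES_BEFORE_PROBLEM:
            # Problem message: send it to the back so others can pass.
            self.messages.rotate(-1)
        self.consecutive_failures += 1
        if self.consecutive_failures >= GLITCH_FAILURES_BEFORE_RETRY:
            self.state = "retry"
        else:
            self.state = "glitch-retry"
        return self.state


q = RemoteQueue(["m1", "m2", "m3"])
first = q.record_retryable_failure()
second = q.record_retryable_failure()
order_after_two = list(q.messages)
third = q.record_retryable_failure()
```

In this model, the first two failures both hit `m1`, which is then moved to the back of the queue, and the third consecutive failure puts the whole queue into retry.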
Retry:
When a queue enters retry, the configurable retry options take over. By default, the first three retry attempts are 10 minutes apart, and subsequent attempts are 15 minutes apart. Messages in a queue that is in retry must wait until the queue comes out of retry.
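The default schedule works out as in the sketch below. The constant and function names are made up for illustration; the real intervals are whatever is configured on the SMTP VS.

```python
FIRST_RETRY_MINUTES = 10        # interval for the first three attempts (default)
FIRST_RETRY_ATTEMPTS = 3
SUBSEQUENT_RETRY_MINUTES = 15   # interval for every attempt after that (default)


def retry_delay_minutes(attempt):
    """Minutes to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt <= FIRST_RETRY_ATTEMPTS:
        return FIRST_RETRY_MINUTES
    return SUBSEQUENT_RETRY_MINUTES


def retry_offsets(attempts):
    """Minutes elapsed since the queue entered retry, for each attempt."""
    offsets = []
    elapsed = 0
    for attempt in range(1, attempts + 1):
        elapsed += retry_delay_minutes(attempt)
        offsets.append(elapsed)
    return offsets


schedule = retry_offsets(5)   # [10, 20, 30, 45, 60]
```

So with the defaults, the fifth attempt happens one hour after the queue entered retry.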
Connection Failures:
If the link is set up to use DNS, then the connection will not be placed into retry until all of the records have been exhausted. The only type of connection failure that is permanent is an authoritative response from DNS stating that the domain does not exist.
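The rule can be sketched as a loop over the resolved hosts. Everything here is an assumption for illustration: `resolve` and `connect` are caller-supplied stand-ins for DNS lookup and the SMTP connection attempt, and `DomainNotFound` is a hypothetical exception representing an authoritative "domain does not exist" answer.

```python
class DomainNotFound(Exception):
    """Hypothetical: DNS answered authoritatively that the domain does not exist."""


def attempt_connection(domain, resolve, connect):
    """Try every resolved host before giving up and entering retry."""
    try:
        hosts = resolve(domain)
    except DomainNotFound:
        # The only permanent connection failure described in this post.
        return ("permanent-failure", None)
    for host in hosts:
        if connect(host):
            return ("connected", host)
    # All records exhausted without a connection: retry later.
    return ("retry", None)


def fake_resolve(domain):
    if domain == "missing.example":
        raise DomainNotFound(domain)
    return ["mx1.example.com", "mx2.example.com"]


second_host_up = attempt_connection(
    "example.com", fake_resolve, lambda host: host == "mx2.example.com")
all_down = attempt_connection("example.com", fake_resolve, lambda host: False)
no_domain = attempt_connection("missing.example", fake_resolve, lambda host: True)
```

In this sketch, a failure on the first host alone never puts the connection into retry; only exhausting the whole list does.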
Load balancing:
Load balancing is tricky and depends on the type of link being used, whether or not Exchange is installed, etc. But essentially, a quasi-load balancing should take place in many configurations, if multiple remote hosts are supplied.
Exchange Linkstate:
Advanced Queuing populates Exchange Linkstate data. When a queue has been down (marked as retry) for 5-10 minutes (depending on version), AQ reports to the Routing Engine that the link is down. (This delay prevents unnecessary state changes that would flood the network.) Next, Routing adjusts the Linkstate tables accordingly and may start selecting a different route, if one is available. In the case of SMTP connectors, this can only happen for connectors with a smart host. DNS-based connectors are never marked as down (unless the SMTP VS is not started). When the queue comes out of retry state, there is a similar delay before other routing clients receive notice that the link is available. The act of Routing notifying AQ engines and other routing clients of changes to routing is called ResetRoutes().
Troubleshooting:
The biggest source of problems comes from a destination server giving the wrong type or severity code. Make sure that the error code that the remote host is giving is correct per RFC 2821. Any remote queue should have an error message when it is in retry, and any queue can be “forced”. For more information, see KB284204.
Because of the way smart hosted SMTP connectors can be marked as unavailable by Routing, it is also important to consider the right configuration for SMTP connectors. If this behavior does not work for you, then you can set the SuppressStateChanges or StateChangeDelay registry keys for Routing. For more information, see the Transport & Routing Guide.
Event logs:
- Message delivery failed to the remote domain.
- 4000 could indicate either DNS or SMTP outbound protocol.
- 4001 indicates an error in the SMTP outbound session.
- 4005 ResetRoutes() has been called by routing (the event also logs the time it took since this is an expensive call that should happen only rarely).
Maximizing performance:
For more information about maximizing performance, see the Exchange 2003 Performance & Scalability Guide.
Registry keys:
First of all, in addition to the traditional warning about editing the registry, I will say that these values should only be modified with extreme consideration. The defaults were carefully selected, and for the most part it is better to address the root cause of the problem rather than modifying the queuing behavior. By changing these values, you may be hurting throughput or putting an unnecessary burden on the network or remote SMTP hosts. Some keys were not available in the RTM version and require certain minimum versions of binaries.
All keys are found at HKLM\System\CurrentControlSet\Services\SMTPSVC\Queuing unless otherwise specified.
| Registry key name | Default value | Description |
| --- | --- | --- |
| GlitchRetrySeconds | 60 | Time before attempting to retry a glitch failure |
| PerMsgFailuresBeforeMarkingAsProblem | 2 | Number of message failures to allow before marking a message as problem and queuing differently |
| LocalRetryMinutes | 5 | If local delivery fails, this is the number of minutes before it is tried again |
| CatRetryMinutes | 60 | If categorization fails, then this is the number of minutes before it is tried again |
| MaxPendingCat | 1000 | Maximum number of pending categorizations before AQ starts backing up messages in the pre-cat queue |
| RoutingRetryMinutes | 10 (60 in E2000) | If GetNextHop fails, then this is the number of minutes before it is tried again |
| SubmissionRetryMinutes | 60 | If submission fails, then this is the number of minutes before it is tried again |
| FailedMsgQRetryMinutes | 60 | If a queuing error occurs that causes a message to be placed in this queue, this is the amount of time it has to wait before being reconsidered by the system |
| ResetRoutesRetryMinutes | 10 | If reset routes fails, then this is the number of minutes before it is tried again |
| ResetMessageStatus | 0 | If this value is 1 upon startup, then all messages currently in the queue will have to go through all internal processing queues (e.g. categorizer) again, regardless of any queues they may have been through previously. Note: this key's value is 1 on clustered servers! |
| MaxDSNSize | 10247680 | DSNs for messages larger than this size will contain only the headers |
| PartialHeaderDSNForOverquota | 0 | Flag to enable partial headers for DSNs generated for over-quota mailboxes. If this flag is set, the DSN generated for over-quota mailboxes will have only partial headers |
You Had Me at EHLO.