Update 07/03/2012: Expanded the Workaround section to include more details.
We want to bring to your attention an issue that we have been seeing recently in CSS, in which many mailboxes on the same mailbox database become quarantined, seemingly for no reason. When you inspect the Application log, you will see many 10018 events containing text similar to the following.
Log Name: Application
Source: MSExchangeIS
Event ID: 10018
Task Category: General
Level: Error
Description:
The mailbox for user <guid>: /o=Contoso /ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=UserMailbox has been quarantined. Access to this mailbox will be restricted to administrative logons for the next 6 hours.
Additionally, the following event ID 5410 will appear in the Microsoft-Exchange-Troubleshooters/Operational event log for each mailbox quarantined by the database space troubleshooter:
Log Name: Microsoft-Exchange-Troubleshooters/Operational
Source: Database Space
Event ID: 5410
Level: Warning
Keywords: Classic
Description:
The database space troubleshooter quarantined mailbox <guid> in database <DBName>.
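If you have exported the 5410 event descriptions to text, a quick tally per database shows how widespread the quarantining is. The following is only a minimal sketch: it assumes the description follows the wording shown above, and the mailbox identifiers and database names in the sample are hypothetical placeholders. The `count_quarantines` helper is our own illustration and is not part of Exchange or the Management Pack.

```python
import re
from collections import Counter

# Matches the 5410 description text shown above. The exact wording in a real
# log may differ, so treat this pattern as an assumption.
PATTERN = re.compile(
    r"The database space troubleshooter quarantined mailbox "
    r"(?P<mailbox>\S+) in database (?P<database>.+?)\.\s*$"
)


def count_quarantines(descriptions):
    """Count quarantine events per database from 5410 description strings."""
    counts = Counter()
    for text in descriptions:
        match = PATTERN.search(text.strip())
        if match:
            counts[match.group("database")] += 1
    return counts


# Hypothetical sample descriptions, for illustration only.
sample = [
    "The database space troubleshooter quarantined mailbox mbx-guid-1 in database DB01.",
    "The database space troubleshooter quarantined mailbox mbx-guid-2 in database DB01.",
    "The database space troubleshooter quarantined mailbox mbx-guid-3 in database DB02.",
    "Some unrelated event text.",
]

counts = count_quarantines(sample)
for database, total in sorted(counts.items()):
    print(database, total)
```

A database with an unusually high count in a short period is a likely candidate for the behavior described in this post.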
The key here is the phrase “the database space troubleshooter” in the event description. In environments where System Center Operations Manager has been deployed, a Monitor is in place by default to check free disk space for database and log volumes. This monitor uses the PowerShell script Troubleshoot-DatabaseSpace.ps1, which ships with Exchange 2010 by default. To read more about the Troubleshoot-DatabaseSpace.ps1 script, refer to the following TechNet article:
Manage Database Log Growth by Using the Troubleshoot-DatabaseSpace.ps1 Script in the Shell
http://technet.microsoft.com/en-us/library/ff477617.aspx
The System Center team recently released Service Pack 2 for the Exchange Management Pack. Among other things, this adjusted a timeout value for running the script. Prior to the Service Pack, the timeout value was 300 seconds, and the monitor was set to fire every 5 minutes. This meant that in many large organizations, the script was virtually guaranteed to time out before it could complete. After installing the upgrade, the new timeout value is 1200 seconds. In addition, the monitor now runs once per hour instead of every 5 minutes. The effect of this change is that the troubleshooter now reliably executes code that it did not execute before the SP2 update. This has exposed a bug in the Information Store perfmon counter that the troubleshooter uses to determine the rate of log byte generation for the database (the perfmon rate can exceed 1.0E+19 Bytes/Hr). When the estimated hourly rate of log generation exceeds the remaining free disk space on the database or log volume, the troubleshooter begins to quarantine mailboxes to stop log generation from consuming all free space and causing the database to dismount. The troubleshooter appears to hit this condition when the perfmon counter has a bogus value. The counter is updated once per minute and should not have a bogus value for more than 1 minute, so the problem does not occur frequently, but it does occur with moderate probability.
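To see why a single bogus counter sample is enough to trigger quarantining, it helps to look at the arithmetic. The sketch below is not the troubleshooter's actual code. It assumes the estimate is simply free space divided by the hourly log rate, and the free-space and "normal" rate figures are made up for illustration. Only the 1.0E+19 order of magnitude comes from the observation above.

```python
def hours_until_full(free_bytes, log_bytes_per_hour):
    """Estimate hours until the volume fills at the given log generation rate."""
    if log_bytes_per_hour <= 0:
        # No log growth means the volume never fills at this rate.
        return float("inf")
    return free_bytes / log_bytes_per_hour


GIB = 1024 ** 3

free_bytes = 500 * GIB   # illustrative: 500 GiB free on the log volume
normal_rate = 2 * GIB    # illustrative: 2 GiB of logs per hour
bogus_rate = 1.0e19      # order of magnitude of the bad counter value

normal_hours = hours_until_full(free_bytes, normal_rate)
bogus_hours = hours_until_full(free_bytes, bogus_rate)

print("hours until full at the normal rate:", normal_hours)
print("hours until full at the bogus rate:", bogus_hours)
```

With the bogus rate, the estimated time until the disk fills collapses to a tiny fraction of an hour, so any reasonable time threshold is crossed no matter how much free space actually remains.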
What does this mean for you?
Well, as documented in the TechNet article, the main function of the script is to keep track of the log generation rate, and to check free disk space for the database and log volumes. When the script runs, it uses a store perfmon counter to determine the rate of log generation, and calculates whether the current rate will cause the disk to run out of space within the threshold of hours (default is 12 hours), and if so, it will optionally start quarantining mailboxes (up to 150 at a time). The script will then also optionally quarantine mailboxes when the percentage of free disk space goes below a specified value. The default free disk space percentage value used is 25%. In order to quarantine mailboxes when either of these conditions is met, the -Quarantine parameter must be passed to the script.
The monitor utilized by SCOM 2007 and higher uses the Quarantine parameter by default, and there is no supported option to change this because the Management Pack is sealed. This means that if the free disk space for either the database or log volume for a database goes below 25%, AND if the log generation rate is calculated to consume remaining disk space within the next 12 hours, the heaviest users will be quarantined. When there are many mailboxes on a database, this can mean many mailboxes can be quarantined in a short period of time.
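The decision described above can be sketched roughly as follows. This is an illustration of the logic as described in this post, not the code in Troubleshoot-DatabaseSpace.ps1. The names `Volume`, `should_quarantine`, and `pick_mailboxes` are hypothetical, the two conditions are combined with AND as in the preceding paragraph, and ranking mailboxes by log bytes generated is an assumption about what "heaviest" means.

```python
from dataclasses import dataclass

HOUR_THRESHOLD = 12          # default threshold of hours, per the text above
PERCENT_FREE_THRESHOLD = 25  # default free disk space percentage
MAX_QUARANTINE = 150         # "up to 150 at a time"


@dataclass
class Volume:
    total_bytes: int
    free_bytes: int


def should_quarantine(volume, log_bytes_per_hour, quarantine_enabled):
    """Return True when both conditions hold and quarantining is enabled."""
    if not quarantine_enabled:
        # Without the Quarantine parameter, no mailboxes are quarantined.
        return False
    percent_free = 100 * volume.free_bytes / volume.total_bytes
    low_space = percent_free < PERCENT_FREE_THRESHOLD
    fills_soon = (
        log_bytes_per_hour > 0
        and volume.free_bytes / log_bytes_per_hour < HOUR_THRESHOLD
    )
    return low_space and fills_soon


def pick_mailboxes(log_bytes_by_mailbox):
    """Pick the heaviest mailboxes, capped at MAX_QUARANTINE.

    Assumption: "heaviest" is ranked by log bytes generated per mailbox.
    """
    ranked = sorted(
        log_bytes_by_mailbox, key=log_bytes_by_mailbox.get, reverse=True
    )
    return ranked[:MAX_QUARANTINE]


# Illustrative numbers: a volume with 20% free space that would fill in 2 hours.
volume = Volume(total_bytes=1000, free_bytes=200)
print(should_quarantine(volume, log_bytes_per_hour=100, quarantine_enabled=True))
```

Because the sealed Management Pack always passes the Quarantine parameter, the `quarantine_enabled` flag in this sketch is effectively always true under SCOM, which is why a bad rate reading translates directly into quarantined mailboxes.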
What can you do?
Obviously, having many mailboxes quarantined is not a desired behavior, as it causes outages for those users, though it is still better than the alternative of having the entire database dismount due to lack of disk space, which will cause an outage for all users on that database. Although quarantined mailboxes can be manually released by removing that mailbox GUID from the registry, this is not an optimal solution, as the next time the monitor runs, the same mailboxes may end up being placed into quarantine again.
Quarantined mailboxes are identified in the following location:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\<ServerName>\Private-<DB GUID>\Quarantined Mailboxes\{Mailbox GUID}
We want you to know that we are aware of this situation and are working to implement a fix. While we work to identify the best way to solve this going forward, we want to let you know of a workaround that can be implemented to prevent mailboxes from being quarantined. In the meantime, to help prevent more customers running into this problem, we have temporarily removed the Exchange 2010 SP2 Management Pack from the Download Center.
Workaround
Create an override to disable the monitors that check the free disk space. This will prevent the Troubleshoot-DatabaseSpace.ps1 file from being run at all.
When creating overrides, it is best practice to place them into a new management pack specifically for customizations within System Center Operations Manager. This allows you to quickly and easily manage your overrides or other custom settings. If you do not already have a management pack for this, you can create one by following the steps below:
- In the Operations Console, click the Administration button. In the Administration pane, right-click Management Packs and then click Create Management Pack. The Create a Management Pack wizard displays.
- In the General Properties page, type a name for the management pack in Name, the correct version number in Version, and a short description in Description. Click Next and then Create.
Once you have designated a management pack for the overrides, you will want to create them for the 4 monitors listed here:
In System Center Operations Manager, go to the Authoring module, and then Monitors under Management Pack Objects. The monitors can be disabled by setting an Override and choosing to disable the Monitor for all objects of class Database Copy DB Logical Disk Space.
Check the box next to Enabled, and then set the override value to False.
In the same Override Properties dialog box, be sure to select the management pack designated for customizations.
At this time, we feel that if you are being impacted by mailboxes being quarantined, the option to disable the Monitors is the only available workaround. We recognize that this will potentially prevent administrators from receiving alerts when disk space is low, but feel this is the best option to prevent mailboxes from being quarantined until a fix is released.
Note that when database disk space is low, an alert will still be received if the transaction logs are on the same drive as the database, since this alert is generated by a different rule that is purely perfmon based.
Ben Winzenz, Kevin Carker
You Had Me at EHLO.