Quote:
<* Even with SP1 installed, there seems to be a massive memory leak, at least in a single-server environment. I'm waiting on hold with PSS to get this sorted out right now (and yes, I have the aforementioned hotfix installed).
I'm seeing this, too. Server with Hub, CAS, and Mailbox roles. Committed RAM will rise, rise, rise, but there is no indication of what is taking the RAM. Did PSS give you any info?>
I spent five hours on the phone with the Windows Server support group last night. I'm not 100% sure it is an Exchange issue, so I decided it was better to start with the Windows group and move over if my hunch is verified. Some background:
- This is a single-server environment. Technically there is still an Exchange 2003 server, but all the roles are installed on the Exch2007 box.
- The box is a dual-proc, dual-core 3.0GHz Xeon with 8GB of RAM hosting about 30 mailboxes. It's also a DC and runs SQL 2005 Standard (the production DB is small, <200MB).
- When it's 'freshly' started it runs as expected.
Over a period of a week or two it gradually slows down. If I ignore the calls from the users that the accounting app (which uses SQL) is slow, the box will eventually bugcheck/bluescreen. I got this call yesterday, and when I got there the Task Manager process list showed store.exe using less than 400MB RAM and sqlservr.exe using only 115MB. Nothing else was over 100MB. It showed only 1GB RAM available, but the total RAM in use on the process list was nowhere near 7GB, so RAM is getting 'lost' somewhere.
What is happening is that the page file fills up, the server slows, and eventually, if I don't manually reboot, becomes unresponsive and then bluescreens.
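To put a rough number on the 'lost' RAM, here is a minimal sketch of the arithmetic in Python. The function name and the dictionary are just for illustration; the figures are the ones quoted above, with store.exe entered at its 400MB upper bound and the many sub-100MB processes left out, so the real gap is somewhat smaller than what this prints.

```python
def unaccounted_mb(total_mb, available_mb, process_mb):
    """Return RAM that is in use but not attributed to any listed process (MB)."""
    in_use_mb = total_mb - available_mb
    return in_use_mb - sum(process_mb.values())


# Figures from the post: 8GB installed, about 1GB shown as available.
# Only the two largest processes are listed; everything else was under 100MB each.
listed = {"store.exe": 400, "SQL Server": 115}

gap = unaccounted_mb(8 * 1024, 1 * 1024, listed)
print(f"{gap} MB in use but not attributed to the listed processes")
```

Even allowing for a few dozen small processes, that leaves several GB that the process list does not explain.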
PSS had me bump up the page file (to 1.5xRAM), set up a couple of very granular counters in perfmon, and change the registry to get a full memory dump, rather than just a kernel dump. I uploaded the last kernel dump and the result of the reporting tool they had me run. They are analyzing the data and I am monitoring the server to see if the larger page file fixes the problem or just prolongs the period during which the server runs 'normally'.
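For anyone who wants to check the same settings on their own box, here is a rough Python sketch. It assumes the registry change is the standard CrashDumpEnabled value under the CrashControl key (I'm not claiming that is exactly what PSS changed here), and the helper names are made up for illustration. The registry read only runs on Windows; the page file figure is just 1.5 times the 8GB of RAM.

```python
import sys

# CrashDumpEnabled values under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl
DUMP_MODES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
}


def describe_dump_mode(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_MODES.get(value, f"unknown ({value})")


def page_file_mb(ram_mb, factor=1.5):
    """Page file size in MB for a given amount of RAM and multiplier."""
    return int(ram_mb * factor)


def read_dump_mode():
    """Read CrashDumpEnabled from the registry (Windows only)."""
    import winreg  # imported here so the rest of the script runs anywhere

    path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if sys.platform == "win32":
    print("Dump mode:", describe_dump_mode(read_dump_mode()))
print("Page file at 1.5x of 8GB RAM:", page_file_mb(8 * 1024), "MB")
```

A value of 1 means the next bugcheck should write a complete memory dump, which is what the larger page file is there to hold.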
So there is a small chance it is fixed, but most likely it's not and we're gathering a better set of data from which to make a diagnosis the next time it happens. I'll post back if/when I get more info.
L
PS- Rereading my previous post I see that I said my two biggest issues were actually three :)