Introduction
For the last few years in the Exchange team, we have embarked on a series of programs intended to bring our engineers closer to our customers and the experience they have with actually using our products. We've been doing a number of different things: holding sessions where we bring in customers to talk to our team about their experiences; hands-on labs where we play admin ourselves and provide feedback on our experience; and something we call "Dogfood at Home." Dogfood, as I'm sure many of you already know, is slang for using your own software, particularly of the beta variety (eating your own dog food). Dogfood at Home refers to a project where individuals actually install an Exchange server at home and let friends and family use it for email. My experience doing this is what this blog entry is about.
I have, of course, installed Exchange many times in the ten years I've worked on the team, but always as a test bed for playing around with specific features I was developing. Examples include storage group operations, sync events, and the Exchange Best Practices Analyzer. While I could tell you all about the nuts and bolts of the store process (at least, circa E2K), that doesn't really provide a lot of end-user experience, and I've never actually tried to administer an Exchange system. And although I've been around for some time, most of my direct experience has been in the store code rather than admin, transport, or client access, so while I had a very high-level understanding of those areas, I had never delved too deeply into them. So in many ways I went into this with less experience than many Exchange administrators have. But – and this is pretty key – while I knew I didn't know everything, I did think I knew a lot more than I actually did. I was willing to play around with things that I shouldn't have, and I had confidence I could fix most problems without any help. That puts me at the top of the class when it comes to being a dangerous administrator. One other thing to note before I really begin: I was doing a single-box deployment with everything on it (AD, DNS, and the Exchange Hub Transport, Client Access, and Mailbox roles). This probably isn't typical (most would employ SBS for this type of thing), so some of my experience may be skewed from the norm a bit.
Installation
Installation and initial configuration mostly went pretty smoothly. I did run into a few things that seemed a bit annoying, even though I knew the reason we do things the way we do. The prereq checks (which I had a part in designing and developing) were a little cumbersome, since they had to be run multiple times as the issues were resolved, and in the case of a required .NET patch I actually had to exit setup to let it install. I've of course heard the complaints that if we can identify issues, why don't we just automatically correct them, but experiencing it first-hand like this brought the message home much more directly. However, auto-correction is a much more difficult task than auto-detection (particularly when you are trying to drive the whole process with some kind of simple script). I will continue to think about what we can do to improve this area in the future (I've summarized all my findings at the end).
My first real complication was that my router connects to my ISP with a dynamic IP address, so I needed to get that taken care of. This involved registering my external domain name in one place; creating another host name with a dynamic DNS service, pointing it at my current IP address, and configuring my router to keep it updated; and finally finding a DNS server where I could add an MX record to map from one name to the other. Not having much DNS experience, it took me quite some time to get all of that straightened out. And as it turns out, I made stupid mistakes at just about every step: I didn't point my external domain at the right DNS server, I screwed up the MX record by putting in the wrong destination host name, and finally I messed up the accepted domains on the Exchange server. The upsetting part about all of this is that there really isn't any good way of tracking down these mistakes other than reviewing everything one step at a time. This is particularly upsetting to me because for the last few years I've been involved in our diagnostic tool suite (namely the Exchange Best Practices Analyzer and the Exchange Troubleshooting Assistant).
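In hindsight, a lot of that flailing could have been cut short by checking each DNS hop from the outside. Here's a minimal sketch of that kind of check, assuming the third-party dnspython package (2.x); the domain and host names are placeholders for my real ones.

```python
# Minimal sketch of the DNS checks I wish I had run, using the third-party
# dnspython package (pip install dnspython). Domain names are placeholders.
import dns.resolver

DOMAIN = "example.com"               # the external mail domain (placeholder)
EXPECTED_HOST = "mail.example.com"   # the dynamic-DNS host the MX should point to

# 1. Does the domain have an MX record, and does it point where I think it does?
for mx in dns.resolver.resolve(DOMAIN, "MX"):
    host = str(mx.exchange).rstrip(".")
    print(f"MX preference {mx.preference}: {host}")
    if host.lower() != EXPECTED_HOST:
        print("  -> MX points somewhere unexpected; check the destination host name")

# 2. Does the MX host actually resolve to my current public IP?
for a in dns.resolver.resolve(EXPECTED_HOST, "A"):
    print(f"{EXPECTED_HOST} resolves to {a.address}")
```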
So I finally got outbound mail delivery working, but the next spot of trouble was getting inbound mail going. This turned out to require enabling anonymous users on the receive connectors. It wasn't terribly hard to figure out given the errors I was receiving, but the documentation specifically indicates that the receive connectors created out of the box should be configured exactly as needed to work right away. That is true, but only if you also have an Edge server involved; since I had a single-box deployment, I didn't have one. I decided to give it a shot, and it worked, but the documentation, combined with a little bit of apprehension about whether allowing anonymous users meant I was opening myself up to attack, made me rather nervous about it. I finally had to check with other people on the team before I was sure I had done the right thing. I will be talking to our documentation folks about this area (this is one of the items listed in my findings at the end).
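For what it's worth, the quickest way I know to confirm that anonymous inbound delivery actually works is to pretend to be an outside sender. A rough sketch, assuming port 25 is reachable from wherever you run it; the host and addresses are placeholders.

```python
# Rough test of anonymous inbound delivery from outside the network.
# Standard library only; host and addresses are placeholders.
import smtplib

MX_HOST = "mail.example.com"

try:
    smtp = smtplib.SMTP(MX_HOST, 25, timeout=30)
    code, banner = smtp.ehlo()                    # 250 means the hub answered
    print(f"EHLO -> {code}: {banner.decode(errors='replace')}")
    print("MAIL ->", smtp.mail("outside.sender@example.org"))
    print("RCPT ->", smtp.rcpt("someone@example.com"))
    # A 250 on RCPT means the connector accepts anonymous senders for that
    # recipient; a 5xx response suggests anonymous users are not enabled.
    smtp.quit()
except (smtplib.SMTPException, OSError) as err:
    print("Inbound SMTP test failed:", err)
```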
Next came certificates. I know a little bit about the nuts and bolts of certificates, but actually having them generated and applied to a site was new to me. The documentation seems to contain all the information needed; it's just that there is a lot of documentation to go through. I used a self-signed certificate at first and got that working okay, but I really wanted to avoid having my users get warnings when running OWA, so I decided to get a real certificate issued. My first attempt was a mess because the approval request was sent to my dynamic DNS provider's org admin, who never responded. I then had to step back into the magical world of DNS and add a CNAME record so that the server could be reached through a URL in my external domain. I reissued the certificate and finally got things working as expected.
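Once the new certificate was in place, a quick external check would have told me whether OWA users were still going to see warnings. A small sketch using only the standard library; the host name is a placeholder.

```python
# Quick sanity check on the certificate OWA presents. A self-signed
# certificate fails verification here the same way it triggers browser
# warnings. Host name is a placeholder.
import socket
import ssl

HOST = "mail.example.com"

context = ssl.create_default_context()   # verifies the chain and the host name
try:
    with socket.create_connection((HOST, 443), timeout=30) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            cert = tls.getpeercert()
            print("Subject:", cert.get("subject"))
            print("Issuer: ", cert.get("issuer"))
            print("Expires:", cert.get("notAfter"))
except ssl.SSLCertVerificationError as err:
    print("Certificate failed verification (self-signed, wrong name, expired?):", err)
except OSError as err:
    print("Could not connect:", err)
```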
The next step was to set up Outlook Anywhere. On the Exchange side, this was pretty easy to find and do. The Outlook side was a little more problematic: I wasn't sure where to put the internal name versus the external one, how to set up the Exchange proxy settings, and so on (note: this was for Outlook 2003; Outlook 2007 makes this much, much easier). I did finally get it right (with a little help from my friends), but I still haven't gotten anyone else up on it (one person who did try ran into an issue I hadn't seen). Another point of confusion was that until the certificate was set up correctly, Outlook wouldn't connect, and there was no good explanation of why that I could find.
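The certificate connection only clicked for me later: Outlook Anywhere tunnels RPC over HTTPS through the /rpc virtual directory, so a bad certificate kills it just as surely as a wrong server name. Below is a rough reachability check, with the caveat that treating a 401 as "healthy" is my own rule of thumb rather than an official test; the host name is a placeholder.

```python
# Rough reachability check for the RPC-over-HTTP endpoint used by Outlook
# Anywhere. An unauthenticated request normally comes back 401, which at
# least proves the certificate and the /rpc virtual directory are reachable.
import urllib.error
import urllib.request

URL = "https://mail.example.com/rpc/rpcproxy.dll"

try:
    urllib.request.urlopen(URL, timeout=30)
    print("Unexpected success; the endpoint normally demands authentication")
except urllib.error.HTTPError as err:
    if err.code == 401:
        print("Got 401: endpoint is reachable and the certificate checked out")
    else:
        print(f"Got HTTP {err.code}: something else is wrong")
except urllib.error.URLError as err:
    print("Could not connect at all (certificate or DNS problem?):", err.reason)
```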
Now I had a reasonably fully functional system that I could let people use. I only had a few friends using it sporadically so far, but I felt it had been a reasonable experience. There were a few problems and I noted some things that could be improved (which was the main purpose of the exercise), but overall it hadn't gone too badly.
Then came the calamity.
Big Problems
Everything ran fine for about a week. I then decided I wanted to install MOM 2005 (Microsoft Operations Manager) and get that experience as well. This isn't a recommended configuration (MOM was 32-bit only and I was running on a 64-bit platform), but I asked around and it sounded like it should probably work okay. I first had to install SQL 2005, and then I installed MOM. I also installed SQL Reporting Services so that I could install MOM Reporting. After fixing a few MOM prereq issues, I was able to install all of SQL and most of MOM correctly, except for MOM Reporting: it kept complaining that it was unable to reach the SQL Reporting web service. I checked the config and the SQL Reporting console, and everything looked like it was there okay.
I mulled this over for a couple of days and asked a few people what the problem might be, but I didn't get anything figured out. Then I tried to log on to my own mailbox just to check whether anyone was having problems. I got a server-unavailable error. I soon realized that everyone had been getting these errors ever since I installed SQL and MOM. My friends who were using the system never bothered to tell me about it; they just went back to their old email accounts temporarily and figured I knew what was going on and was working on fixing it.
So now I had a completely unavailable system (OWA wasn't working and I hadn't yet gotten anyone up on Outlook). Even though no one was relying on my system for anything critical, it was still something of a panic-inducing moment. I've had to debug live servers in the past, stepping through complex code looking for hard-to-find problems with hundreds of people screaming at me all the while, but this was somehow different and uniquely unpleasant. I don't think I ever fully appreciated the difficulties involved, or the skill set needed, in being a system administrator before this experience. The first thing I did was look at the IIS protocol logs, as well as the event logs, to see if they provided a clue. I soon discovered that the error was coming from rpcproxy.dll, and it was generating error 0x0000007E, which I looked up and found means a module could not be loaded. I looked at the dependency chain for the DLL, and it indicated it could not find dwmapi.dll (this contains the Desktop Window Manager APIs and is part of Windows).
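As an aside, translating that raw error code is something the OS will do for you; this little standard-library snippet (Windows only) would have saved me a web search.

```python
# Translate a raw Win32 error code like 0x0000007E into its message text.
# Standard library only; ctypes.FormatError is available on Windows.
import ctypes

code = 0x0000007E   # the value rpcproxy.dll was logging
print(f"Error {code:#010x} ({code}): {ctypes.FormatError(code)}")
# 0x7E is ERROR_MOD_NOT_FOUND: "The specified module could not be found."
```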
Okay, I figured I was getting somewhere now. Somehow, installing MOM and SQL had caused this dependency to be added and not resolved (or so I thought). Strange, but I'd just find that DLL, drop it in place, and everything would be fine. I searched the web, but no dice: the 32-bit version was available, but not the 64-bit version. I had to find someone with a 64-bit machine that had the DLL and copy it over. Problem solved? Nope. It was still reporting the same error, but now it was showing a couple of other DLLs and undefined imported functions (as it turns out, these were all clues to the problem but essentially red herrings of the first degree). Well, it was a fun experiment, but I'd just get rid of MOM and SQL and put things back to normal. Problem solved now? Nope. Still messed up in exactly the same way. I enlisted some help from others on the team (a nice resource if you can get it) and tried a few things out, but nothing helped. I played with a few settings (dangerous know-it-all admin, remember?), and that didn't change anything either.
Giving up, I called each of my users (all four of them), and asked them if they had any critical mail they needed me to save off because I was giving up and about to rebuild the system from scratch. I decided to give it one more day.
Finally, someone remembered a KB article about a similar situation in which a 32-bit MOM install flipped IIS into 32-bit mode. I checked that out and did the steps needed to put it back into 64-bit mode. Someone else also suggested making sure ASP.NET 2.0 was still enabled for all the Exchange web sites, because they had heard that MOM 2005 might change this as well. So I checked, and found it had indeed been reset and needed to be specified again. I tried it out, and hoorah, it did something! It got rid of my 401 errors. Unfortunately, now I was getting 500 errors. A little more digging in the event log showed access-denied errors generated by some method somewhere. So I played around with the IIS permissions some more (the same settings I had mucked with while trying to fix things in the first place), and finally got OWA back up and running.
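For anyone who hits the same wall, these are the two things I would now check first. The sketch below drives them from Python via subprocess; the adsutil.vbs and aspnet_regiis.exe paths are the IIS 6 and .NET 2.0 defaults on my box and may differ on yours.

```python
# Sketch of the two checks that would have caught this. Paths below are the
# IIS 6 / .NET Framework 2.0 defaults on my machine (assumptions, not gospel).
import subprocess

ADSUTIL = r"C:\Inetpub\AdminScripts\adsutil.vbs"
ASPNET_REGIIS = r"C:\Windows\Microsoft.NET\Framework64\v2.0.50727\aspnet_regiis.exe"

# 1. Is IIS running its worker processes in 32-bit mode? This should be
#    "false" on a 64-bit Exchange box; the MOM install had flipped it.
result = subprocess.run(
    ["cscript", "//nologo", ADSUTIL, "GET", "W3SVC/AppPools/Enable32BitAppOnWin64"],
    capture_output=True, text=True)
print(result.stdout.strip())

# 2. Which sites and virtual directories have ASP.NET registered, and at what
#    version? The Exchange vdirs should show up with 2.0.50727.
result = subprocess.run([ASPNET_REGIIS, "-lk"], capture_output=True, text=True)
print(result.stdout)
```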
Conclusions
So, there are obviously a number of things I learned from this, and there are things we could do better as a product for these types of situations. Most of these we had already been planning to do. Understand, however, that due to relative priorities, resources, and deadlines, I can't make any guarantees about when any of these will actually be delivered, but this is what I've got:
- Diagnostics in a cloud. As useful as ExBPA and ExTRA are, they are still limited by the fact that they can only test things from inside an organization. The idea here is to have some kind of service running on the Internet that can test a system from an external perspective as well. This could be used both for monitoring, to alert an admin when problems start occurring, and for additional diagnostics and troubleshooting, to assist in root cause analysis for any problems that exist (there's a rough sketch of such an external probe after this list).
- Client troubleshooting. We have a performance troubleshooter, a mailflow troubleshooter, and a database troubleshooter, but we don't have any wizard to help troubleshoot client access problems.
- Documentation improvements. There are a few things I found in the docs that can be improved:
- Hub Transport servers need anonymous users allowed on their receive connectors if there are no Edge servers. The documentation should at least note this.
- Prereq improvements. There are a number of things we can try to do to improve the prereq experience in setup, but we are somewhat limited in our flexibility in this. Nevertheless, this area is worth more consideration.
- New ExBPA rules. A number of new rules suggested themselves over the course of this exercise:
- Generate a warning if anonymous users are not enabled on Hub Transport servers. This one is a bit tricky, because we can't programmatically determine whether Edge servers are in the system (they are not in Active Directory). If they are, you definitely don't want anonymous users allowed on the hubs; if they aren't, you definitely do want them allowed.
- Certificate verification improvements. We already do some certificate validation, but we can do more here.
- Verify that IIS is in 64-bit mode. Not much explanation needed here, although this should probably be a part of both ExBPA and the client access troubleshooter.
- Verify that Exchange web sites have ASP.NET 2.0 set. Same as above.
- Verify that Exchange web sites are configured to use Local System. Same as above.
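To make the first item above a little more concrete, here is a back-of-the-envelope sketch of the kind of external probe I have in mind: something running outside the organization that periodically checks the pieces an outside user depends on. The host names are placeholders and the checks are deliberately shallow; a real service would do far more.

```python
# Back-of-the-envelope sketch of an external "diagnostics in a cloud" probe.
# Standard library only; host names are placeholders, checks are shallow.
import smtplib
import socket
import ssl
import urllib.error
import urllib.request

MX_HOST = "mail.example.com"
OWA_URL = "https://mail.example.com/owa"

def check_smtp():
    # Can an outside sender at least get an EHLO response from the hub?
    with smtplib.SMTP(MX_HOST, 25, timeout=30) as smtp:
        code, _ = smtp.ehlo()
        return code == 250

def check_owa():
    # Is OWA reachable over HTTPS with a valid certificate? An auth
    # challenge (401/403) still counts as "reachable".
    try:
        urllib.request.urlopen(OWA_URL, timeout=30)
        return True
    except urllib.error.HTTPError as err:
        return err.code in (401, 403)
    except (urllib.error.URLError, ssl.SSLError, socket.error):
        return False

for name, check in (("inbound SMTP", check_smtp), ("OWA over HTTPS", check_owa)):
    try:
        print(f"{name}: {'OK' if check() else 'FAILED'}")
    except Exception as err:   # a monitoring probe should report, never crash
        print(f"{name}: FAILED ({err})")
```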
All in all, I've gotten a lot out of this exercise and I'm very happy with the results. As I said at the beginning, a number of members of the product team have done the same thing and gained new perspective on their work. I think you should already have seen some of the results of this in both Exchange 2003 and 2007, and will continue to see more as time goes on.
You Had Me at EHLO.