First published on TechNet on Aug 25, 2013
Hey y’all, Mark here again. A lot of the content we provide here on this blog is meant to keep you guys ahead of potential issues that can arise and to explain how things work in the field. Education like this gives you the ability to investigate "possible hiccups" before they become "giant eruptions of sleepless agony" in your environment. We here in PFE call that type of work “Proactive.” The vast majority of the engagements we deliver are proactive in nature. Our famous RAPs, and now RAP as a Service, are pretty much focused on this.
Sometimes, though, things break and PFE is here to help with that, too. This type of work is classified as “Reactive.” Sometimes these issues are small and sometimes they are huge. When they fall into the "huge problems" category, we have a name for them - you've likely heard of it - we call them Critical Situations or “CritSits.” Usually we are talking massive, "business down" outages affecting company-wide or highly critical systems. Think along the lines of a major cluster outage where the cluster won’t come back online, nobody can log in, extremely slow performance on a system that is affecting the business, etc. You get the picture. While you are working with our amazing support folks, a PFE will sometimes get dispatched to come help if the customer requests it. Recently a few of us here on the blog worked a CritSit and thought it might be a good idea to document some common mistakes that take place in these types of situations. So without further ado…
I know this is easier said than done, especially when your environment is down and management is asking you for a status update (I’ll get to those) every 20-30 seconds, but this is not the time to let things fall apart. When things get tough, stiffen up. Look at the floor and take a few deep breaths. Things can be made much, much worse by a quick, rash decision, and those decisions tend to get made right as an SLA is about to be missed. I’ve made a non-scientific graph to illustrate this:
The red line is the SLA you need to meet before stuff gets seriously bad, or the time the bar closes. The blue line is the likelihood you’ll make a bad decision and everything gets worse. All joking aside, I’ve heard some pretty off-the-wall suggestions when things get desperate. “Let’s bounce the data center." Yikes. How about we don’t do that. Management will typically push harder for you to do something as you get closer to that line. Remember: stay calm. The horse is out of the barn; let’s not have him run clear across the continent by doing something rash.
This is not usually the best time for in-fighting or pointing the index or middle finger. Often, though, this is when you see it the most. Right now you are in a pickle, and the faster this gets resolved the better it will be for everyone involved, period. Make sure everyone who owns a piece of the impacted systems is represented and available to troubleshoot the issue, including your related vendors. You’d be surprised how many times we just sit everyone down, describe the issue, and someone says, “Oh, I made a change sort of around this a few days ago, could that have something to do with it?”
Many times these issues go long into the night and into the next day. This is not the time to pull a marathon session of 40 hours straight. Work your normal 8-, 10- or 12-hour shifts as much as you can. Having a fresh set of eyes and a sharp mind is critical to getting this thing solved. This goes for all teams involved. Even something as simple as having someone from another team “on call” can be a lifesaver when you need them at 3 AM. And to address something people say all the time, ‘I’m fine working 24 hours straight with no issue’: would you want the person flying your plane to have been up for 24 hours when you hit some turbulence, or the one who got 8 hours of sleep? That’s what I thought. Usually, by the time the cause has been identified and you are making a change, the incident has already been open for 24 or 48 hours, or even longer. Plugging the wrong cable into the wrong server is exactly the kind of mistake that happens when you’ve been up way too long.
We’ve all been on these calls. We talk about what we just did and what the results were for 20 minutes. We talk about what we think the problem might be for 20 minutes. Then we talk about the next steps for 20 minutes. Then we have to update everyone else on what we are going to do for 20 minutes. Then we do the actual thing for 20 minutes and have to stop, because we need to start preparing for our next status update. OK, it might not be quite that bad, but it’s probably not far off. Management needs to know what’s going on, and that’s fine, but spending time, every time, explaining the same thing to different people is a real waste. Having the information in one central spot for everyone to read or hear really saves time. It also helps to have one person in charge of this who is not part of the core troubleshooting team. That way, if people are late, shifts are changing, etc., they can get caught up and the troubleshooting train keeps on a-rolling. Another idea is to set up two conference rooms and phone bridges - one for the tech folks and another for the management folks.
One of the most important things to do after determining there’s a problem is to define it. A critical component of troubleshooting an issue, as well as defining it better, is gathering data… or what I like to call evidence. When troubleshooting challenging server issues, we become more like detectives when the cause isn’t so obvious. Whenever there is a significant issue, management wants root cause. How can you determine root cause three weeks after a problem occurs if the data is no longer around to be gathered? Someone reading this is asking, “Really? Three weeks later?” That happens quite often, for a variety of reasons. Could be that others have already attempted to find root cause for some time before contacting Microsoft. Could be that someone noticed on a management report that a server had an issue weeks ago and now wants to know what happened. Event log data and other logs should be gathered as soon as possible after the actual incident. Some logs can be chatty, may have size limits or be circular, and may wrap around and lose history as significant time passes. This is definitely true when troubleshooting issues on server clusters. It is no fun to try to determine root cause with missing data. It is also no fun to try to restore logs from backups weeks later. Management tools that periodically gather event or performance data can be quite helpful as well. Gathering good data in a timely manner can be a great precursor to gathering the right parties to troubleshoot the issue or find root cause.
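As a rough illustration of "grab the evidence before it wraps," here is a minimal sketch in Python that shells out to the built-in wevtutil tool to export event logs to timestamped .evtx files you can stash with the rest of your incident data. The log names and destination folder are just assumptions for the example, not a prescribed layout.

```python
# Minimal sketch: snapshot Windows event logs before they wrap or get cleared.
# Assumes it runs elevated on the affected server; the log names and the
# destination folder below are illustrative assumptions.
import subprocess
from datetime import datetime
from pathlib import Path

LOGS = ["System", "Application", "Microsoft-Windows-FailoverClustering/Operational"]
DEST = Path(r"C:\IncidentData") / datetime.now().strftime("%Y%m%d_%H%M%S")

def snapshot_logs() -> None:
    DEST.mkdir(parents=True, exist_ok=True)
    for log in LOGS:
        out = DEST / (log.replace("/", "-") + ".evtx")
        # "wevtutil epl" exports a log to an .evtx file without clearing it.
        subprocess.run(["wevtutil", "epl", log, str(out)], check=True)
        print(f"Exported {log} -> {out}")

if __name__ == "__main__":
    snapshot_logs()
```

Run it once right after the incident and again before any major change; having the before-and-after snapshots sitting in dated folders makes the root cause conversation much shorter.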
This could really be its own post, but there is already a lot out there on troubleshooting. Our own Hilde has written two posts on the topic, part 1 and part 2, and CTS has done a great job with this post. Guess when you need to rely on these more than ever? What does the evidence point to? Don’t go with your gut, go with solid troubleshooting techniques. Making lots of changes at once “to see what happens” is a surefire way to waste time and probably make things worse. You start to end up like this. One of the things that usually gets overlooked is documenting what you are changing and doing. You think you’ll remember, but that test you ran was 16 hours ago and a full pizza plus 4 fully leaded Mountain Dews are between then and now. Also, start with the basics. I know we immediately like to jump to some crazy, in-the-weeds advanced topic and turn the debug level up to a Spinal Tap 11, but resist this urge. Maybe there is an issue with how the application warms up the cache on first startup and every hour, but only on the even hours, and on hour 7… or maybe the tab on the network cable is broken and it is ever-so-slightly ajar? You get the point. It's often a good use of time to spin up efforts in parallel: get someone to begin recalling the tapes from the recent backups, start building the OS on a recovery server or VM, get it patched, etc. If it's needed, the restore option is now closer at hand.
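If keeping that running record of what was tried sounds like a chore, even a tiny helper works. Here is a minimal sketch, again in Python, that appends a timestamped, attributed entry for every action and its result; the file name, fields, and example entries are purely illustrative assumptions.

```python
# Minimal sketch: keep a timestamped journal of what was tried and what happened,
# so the 3 AM shift (and the root cause write-up) doesn't rely on anyone's memory.
# The file name, fields, and example entries are illustrative assumptions.
from datetime import datetime, timezone

LOG_FILE = "critsit_journal.txt"

def log_action(who: str, action: str, result: str = "pending") -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(f"{stamp} | {who} | {action} | {result}\n")

# Example entries (hypothetical server name and results):
log_action("mark", "Restarted the cluster service on NODE2", "no change, same errors in the System log")
log_action("mark", "Reseated the network cable on NODE2 port 1", "link light back, retesting")
```

One flat, append-only file per incident is enough; the point is that anyone joining the bridge can read it top to bottom and know exactly what has already been ruled out.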
Much like the elusive Beetlejuice, but probably far more troublesome for customers, we’ve arrived at that sensitive topic: backups. There have been many times over the years where, during a critical server outage, we could have had things back online within minutes (time = $$$) by restoring a recent backup. In those situations, valuable root cause data could have been captured, the backup restored, and the crisis mitigated – if only they could restore a backup. I remember one particular incident about a decade ago where an administrator was calling me over and over from a bathroom stall, so as not to let anyone know he was having to call for help, because there was no backup available and they were having an outage. As a result, it took many hours to resolve the situation. Not having a functional backup to restore is a common mistake, and it makes for a very unpleasant surprise on top of an already unplanned outage.
Typical reasons a backup might not be available for restore include:
· Nobody ever tested the ability to restore…and restore doesn’t work
· Backups weren’t capturing what they thought they were
· The person with the ability to access and restore backups is in the Caribbean somewhere with no phone.
· Scheduled backups weren’t actually running, so there is no backup (a simple automated check like the sketch after this list can catch that)
· They thought that since the data was on a RAID set or SAN, a backup wasn’t needed
· Backups are stored offsite for safekeeping and the facility is closed
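On the “scheduled backups weren’t actually running” point, a dumb-but-effective daily check can yell before the outage does. Here is a minimal sketch in Python; the backup location, file pattern, and age/size thresholds are assumptions for illustration, and none of this replaces periodically testing a real restore.

```python
# Minimal sketch: verify that scheduled backups are actually producing output.
# The folder, file pattern, and thresholds are illustrative assumptions; this
# does not replace periodically testing an actual restore.
import sys
import time
from pathlib import Path

BACKUP_DIR = Path(r"\\backupserver\backups\sql01")   # assumed location
MAX_AGE_HOURS = 26        # expect at least one backup per day
MIN_SIZE_BYTES = 1 << 20  # anything under 1 MB is suspicious

def newest_backup(directory: Path):
    files = list(directory.glob("*.bak"))
    return max(files, key=lambda f: f.stat().st_mtime) if files else None

def main() -> int:
    latest = newest_backup(BACKUP_DIR)
    if latest is None:
        print("FAIL: no backup files found at all")
        return 1
    age_hours = (time.time() - latest.stat().st_mtime) / 3600
    if age_hours > MAX_AGE_HOURS:
        print(f"FAIL: newest backup {latest.name} is {age_hours:.1f} hours old")
        return 1
    if latest.stat().st_size < MIN_SIZE_BYTES:
        print(f"FAIL: newest backup {latest.name} looks too small to be real")
        return 1
    print(f"OK: {latest.name}, {age_hours:.1f} hours old")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Schedule something like this to run daily and page someone on a non-zero exit code; it won’t prove the backup is restorable, but it will catch the “nothing has run for three weeks” surprise before you are in a CritSit.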
That’s it for now. What did we miss? What are your tricks? Send us your questions and comments about this topic and anything else.
Mark ‘I need a status update’ Morowczynski + The AskPFEPlat blog team. Much like a bad date, everyone has a troubleshooting horror story.