Hello, my name is Mike Hildebrand (aka 'Hilde') - I'm a Dedicated Premier Field Engineer with Microsoft. Welcome to the first in a multi-part series from us on troubleshooting. From mindsets to toolsets, Microsoft Premier Field Engineers (PFEs) will share a variety of skills, tips and tricks to help you build your ability to troubleshoot issues from the simple to the complex. These will all build on a foundation of patterned “thought” that most of us apply in our day-to-day lives without even thinking about it.
For example: your car won’t start
Do I have the right key?
Does the starter/engine turn over or do I only hear clicks?
Is the car in gear?
Do the lights work?
Is the clock blinking “12:00”?
Likely culprit - the battery is having issues - investigate further.
Installment #1 - The <Not Always So> Obvious
We have all faced situations similar to the dead battery but often, in our “IT lives”, many of us are driven into an almost immediate panic in the face of a severe problem (e.g. a complete datacenter failure or a single, but REALLY mad user or VIP). We start changing a variety of things hoping we’ll get it fixed quickly. We begin pulling cables and rebooting servers, and often either cause more harm or do nothing to reduce the time to resolution for the current event/issue.
First, remind yourself to breathe. Keep your head. Breeeeeathe. Control the situation as much as you can (rather than only reacting to it) and lean on a methodical, repeatable process to help guide and support you while you begin to work the issue.
Second, define the problem as clearly as possible. You’ll often need tact here to keep from fanning the flames. Have you ever asked a super angry end-user who’s reporting a PC problem “Is the PC powered on?” The eerie silence on the other end of that conversation is enough to make anyone’s forehead start to bead up with sweat.
Here are a few examples of “what’s reported” vs “what’s really wrong”:
Reported: Month-end printing is down – No one can print
Really wrong: Printer was powered on but disconnected from the network
Reported: AD is down – no one can log in
Really wrong: Network cable not connected for one VIP user
Ask the focus questions:
Who all is impacted (other users, other sites, other ____, etc.)?
When did it start (just now, after I upgraded my PDF software, after my last reboot, two weeks ago but just now calling in, etc.)?
What is broken? What changed, if anything, that might have caused the issue?
What has been done so far to troubleshoot? (Few issues will get to you before someone has already made other changes to the ‘situation’.)
Clarify what the problem is, as well as clarify what the problem is NOT
Email is not working – VS – I can send emails, but I’m not getting any
The Internet is down – VS – I can get to external websites but not internal ones
Consider likely causes even at the expense of seeming obvious - just be ready to duck.
Is the printer turned on? Are you sure? What does the display read (if anything)? Is the network cable (skinny blue wire) connected to the printer? Oh, there were painters there over the weekend? Perhaps they moved/unplugged it?
Is the cable connected to the VIP’s desktop? Are you sure? Do you see link lights?
If the obvious fails, start diligent but simple troubleshooting at the physical layer and work your way up to the more complex systems/environments (think back to your early MCSE tests and the “OSI model”).
Is the network cable connected? Try reconnecting/reseating it. Examine the cord and plug(s).
Is there link activity - green/yellow lights? Flashing or steady?
TCP/IP settings and name resolution
Never underestimate the power of PING (but don't forget that it might be blocked by firewalls)
Static or DHCP-assigned IP?
If DHCP, did you get an IP?
Is the IP 169.254.x.x (a self-assigned APIPA address, which usually means no DHCP server answered)?
Is a default gateway defined? Is it accurate?
Can you ping the gateway?
Are DNS and/or WINS servers defined? Are those entries accurate?
Perhaps a static DNS entry or IP was set for home ISP access?
Can you get to the intranet?
Can you get to the Internet?
Can you send/receive emails?
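The bottom-up checklist above can be sketched in a few lines of Python. This is only an illustration, not a diagnostic tool: the function names (`classify_ipv4`, `first_failure`), the observation keys and the advice strings are all made up for this example. The one hard fact it leans on is that 169.254.0.0/16 is the link-local (APIPA) range, which the standard `ipaddress` module can detect.

```python
import ipaddress

def classify_ipv4(address: str) -> str:
    """Give a rough label to an IPv4 address (labels are illustrative)."""
    ip = ipaddress.IPv4Address(address)
    if ip.is_loopback:
        return "loopback"
    # Check link-local before private: 169.254.0.0/16 also counts as
    # "private" in the ipaddress module, so the order matters here.
    if ip.is_link_local:
        return "apipa"
    if ip.is_private:
        return "private"
    return "public"

# Hypothetical checks, ordered from the physical layer upward.
CHECKS = [
    ("link_up", "No link: reseat the cable and check the link lights."),
    ("has_ip", "No usable IP: check static settings or the DHCP server."),
    ("gateway_reachable", "Gateway unreachable: verify the gateway setting."),
    ("dns_resolves", "Name resolution failing: verify DNS/WINS entries."),
]

def first_failure(observations: dict) -> str:
    """Return advice for the lowest layer that failed (missing = failed)."""
    for key, advice in CHECKS:
        if not observations.get(key, False):
            return advice
    return "Lower layers look healthy: move up to the application."

if __name__ == "__main__":
    print(classify_ipv4("169.254.10.20"))
    print(first_failure({"link_up": True, "has_ip": True}))
```

The point of the ordered list is the same as the point of the checklist: stop at the lowest layer that fails, because nothing above it can be trusted until that is fixed.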
Once you think you’re onto something, try to make one change at a time to actually discover the root cause.
This is often VERY difficult to do
Mgmt in your face – PRESSURE!! FIX IT!!! IS IT BACK UP YET?!?! BUSINESS IS DOWN!! THE SKY IS FALLING!!! SEV-1!! SEV-A!! CRITSIT!!!! OH THE HUMANITY!!!!
Other people/teams involved possibly making changes
Covering up or repairing their ‘oops’ - "roll-back that SAN controller firmware update we pushed"
Trying to help but working/making changes in silos
Unaware that you’re working the same issue - "I didn't realize your business-unit/site/etc was affected, too?"
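To make “one change at a time” concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `try_changes_one_at_a_time` helper, the tuple layout, the log wording); the idea is simply to apply a single change, test, and roll it back if it did not help, so that whatever finally works can be identified as the cause.

```python
from typing import Callable, List, Optional, Tuple

# A change is (label, apply, undo); all three are supplied by the caller.
Change = Tuple[str, Callable[[], None], Callable[[], None]]

def try_changes_one_at_a_time(
    changes: List[Change],
    is_fixed: Callable[[], bool],
) -> Tuple[Optional[str], List[str]]:
    """Apply changes one by one; undo each change that does not fix the issue.

    Returns the label of the change that fixed the issue (or None) and a
    log of what was tried, which is handy for the post-incident write-up.
    """
    log: List[str] = []
    for label, apply, undo in changes:
        apply()
        if is_fixed():
            log.append(f"{label}: fixed the issue")
            return label, log
        undo()  # roll back so the next test starts from a known state
        log.append(f"{label}: no effect, rolled back")
    return None, log

if __name__ == "__main__":
    # Toy environment: the real fault is the unplugged cable.
    state = {"cable": False, "dns": "old"}
    changes = [
        ("change DNS server", lambda: state.update(dns="new"),
         lambda: state.update(dns="old")),
        ("reseat cable", lambda: state.update(cable=True),
         lambda: state.update(cable=False)),
    ]
    print(try_changes_one_at_a_time(changes, lambda: state["cable"]))
```

In a real incident the "undo" step is rarely this tidy, but keeping even a handwritten log of each change and its result gives you the same benefit.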
Continue to expand on the steps and thought-framework presented here and you’ll be better equipped to manage difficult situations and make progress on simple or complex problem resolution. Troubleshooting is a skill that you should be constantly evolving and growing, so be sure to check out this blog entry on the topic by our Directory Services folks: https://blogs.technet.com/b/askds/archive/2011/12/08/effective-troubleshooting.aspx
Tune in next time for a discussion of the first tool installment in this series where we’ll take a look at the World Famous Event Viewer.