
Troubleshooting guidelines

Agenda
- Troubleshooting definition
- What is the benefit
- Elements into consideration
- General guidelines
- The five basic questions: who, what, where, when and how
- Keep in mind
- Slow server example
- Summary

This guideline was put in place on 14/5/2009.

Definition
Troubleshoot: to isolate the source of a problem and fix it, typically through a process of elimination in which possible sources of the problem are investigated and eliminated, beginning with the most obvious or the easiest to fix.

Definition
This stereo system has a problem: only one of the two speakers is emitting sound. While the left speaker seems to be working just fine, the right speaker is silent regardless of where any of the stereo's controls are set. Identify three possible faults that could cause this problem to occur, and identify what components of the stereo system are known to be okay (be sure to count each cable as a separate component of the system!). Explain how you might go about troubleshooting this problem, using no test equipment whatsoever. Remember that the speaker cables detach easily from the speakers and from the amplifier.

A very good way to determine which of these components is faulty is to swap cables and speakers between sides, but I'll let you determine which component swaps test which components.
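One possible way to reason about the swaps is sketched below in Python. It is only an illustration under stated assumptions: exactly one component on the right side is faulty, each side is modeled as an amplifier output, a cable and a speaker, and all names (Channel, silent_sides, diagnose, "amp-R" and so on) are hypothetical. "amp-R" stands for the right amplifier output or anything upstream of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    amp_output: str
    cable: str
    speaker: str


def silent_sides(left, right, faulty):
    """Return the sides that stay silent, given the single faulty part."""
    sides = {"left": left, "right": right}
    return [name for name, ch in sides.items()
            if faulty in (ch.amp_output, ch.cable, ch.speaker)]


def diagnose(faulty):
    """Locate the fault using only swaps and listening (no test equipment)."""
    left = Channel("amp-L", "cable-L", "speaker-L")
    right = Channel("amp-R", "cable-R", "speaker-R")

    # Test 1: swap the speakers only. If the silence moves to the left side,
    # it followed the speaker, so the speaker is the faulty part.
    after_speaker_swap = silent_sides(
        Channel(left.amp_output, left.cable, right.speaker),
        Channel(right.amp_output, right.cable, left.speaker),
        faulty,
    )
    if after_speaker_swap == ["left"]:
        return "speaker-R"

    # Test 2: speakers back in place, swap the cables only.
    after_cable_swap = silent_sides(
        Channel(left.amp_output, right.cable, left.speaker),
        Channel(right.amp_output, left.cable, right.speaker),
        faulty,
    )
    if after_cable_swap == ["left"]:
        return "cable-R"

    # The silence never moved, so the fault is in whatever was never swapped.
    return "amp-R"
```

The point of the sketch is the elimination logic: a fault that moves with a swapped part belongs to that part, and a fault that stays put belongs to what was left in place.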

What is the benefit


The NOC deals with all types of technical issues, ranging from network and systems to DB, applications and so on, so NOC members need a basic troubleshooting technique to handle these cases. A proper troubleshooting technique saves time, improves performance, and shows the professionalism of the NOC team.

Elements into consideration


The most fundamental aspect of troubleshooting is knowledge. We need to completely understand the components, their interactions, and the sequencing of the equipment that is failing. Otherwise, the solution may go undiscovered because the root cause lies in a gap in the knowledge of the people involved. If you do not have all the knowledge you need, make sure those who do are ready to help you assess the situation.

Elements into consideration


Just as important is access to the environment where the error is occurring. To verify the various execution steps of whatever is broken, you need to be able to take measurements at multiple points in the processing. If you have to rely on an intermediary to interpret those results and relay them to you, information can get lost in translation, leading you to draw incorrect conclusions.

An example of this is the children's party game Telephone. Line up a number of people side by side. Have the first person in the line whisper a phrase to the second person, who then tries to repeat it to the third person, and so on. By the time the last person tries to repeat the phrase for the group, it has changed dramatically.

Elements into consideration


Creativity comes into play when factors not previously considered are influencing the outcome. Based on your knowledge of the situation and of your more general domain expertise, you may need to apply creative thinking in order to see aspects affecting your result that you may not have thought of in earlier phases of your project. When stumped, try to think outside the box and reconsider things you may have assumed away previously.

General guidelines
- Analyze symptoms and factors.
- Check to see whether the problem is a common issue.
- Isolate the source of the problem.
- Define an action plan.
- Consult technical support resources.

The five basic questions: who, what, where, when and how


Part of troubleshooting effectively is simply asking the right questions. At first, asking questions can be time-consuming and tedious. However, after you learn which questions to ask, your troubleshooting and issue resolution skills will improve. The five basic questions are who, what, where, when, and how.

Who? The answer to a who question is not always obvious. In fact, several who questions, such as the following, could be asked:
- Who is seeing the problem?
- Who has the problem but doesn't know about it?
Some of these questions won't be asked until other questions are asked, but they can be important to the resolution of the problem.

What? Several what questions, including the following, can also be asked: [...] he or she knows. Make sure the problem is understood before making assumptions.
- What has changed?

Where? The where question is one of the most important questions. It can identify a trend or a location of the problem. The where questions can provide a branching factor when you are trying to eliminate possibilities. Some of these questions are as follows:
- Where is the problem being reported?

When? This is another important question for eliminating possibilities.
- When does the problem occur: Only when sending? Only when starting the client? Only when a scheduled event takes place?
- When is it seen: Right away? Every time? Only when sending to that user?

How? The how questions are not always applicable, but they can be used in some situations. They can help you verify information. Here are some examples:
- How is it set up?
- How is it seen: The same every time? In the same place every time? With the same error every time?
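The five questions can be kept as a reusable checklist. The Python sketch below is only one possible way to do that: the question texts come from this guideline, while the names QUESTIONS and unanswered and the idea of storing answers in a dictionary keyed by question are assumptions made for illustration.

```python
# Checklist built from the sample questions in this guideline.
QUESTIONS = {
    "who": [
        "Who is seeing the problem?",
        "Who has the problem but doesn't know about it?",
    ],
    "what": ["What has changed?"],
    "where": ["Where is the problem being reported?"],
    "when": ["When does the problem occur?", "When is it seen?"],
    "how": ["How is it set up?", "How is it seen?"],
}


def unanswered(answers):
    """Return the checklist questions that have no non-empty answer yet."""
    return [
        question
        for group in QUESTIONS.values()
        for question in group
        if not answers.get(question, "").strip()
    ]


if __name__ == "__main__":
    notes = {"What has changed?": "Mail client upgraded last night"}
    for question in unanswered(notes):
        print("Still to ask:", question)
```

A blank or whitespace-only answer counts as unanswered, so a placeholder entry in the ticket does not hide a question that still needs asking.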


Keep in Mind
- Sometimes it is not what you ask but how you ask it.
- Avoid defensiveness.
- Don't rush it.
- Open vs. closed questions.
- Avoid "you" and "why" to keep the information coming.

Slow server example


When told that an application (or server) is slow, the first thing to do is to have a serious conversation with someone knowledgeable about the application and to ask some pointed and detailed questions. From many years of asking these questions, we found that three key areas, as follows, often yield meaningful answers:

Where is the problem? We want to determine if the slowdowns occur in the application, the server, the network, and so on. Some questions that we found helpful are:
- When the slowdown occurs, does it occur for all actions you take on that server or only for actions taken in one application?
- If all actions on the server are affected, then what about other servers? Even file servers?
- If all actions on the network are affected, then is it isolated to people in your building, wing, or floor?
- From a slightly different angle, do these problems occur at a particular time of day? Only on certain kinds of machines? Older machines? Machines with the same OS?

Where is the problem (actions)? We want to determine which areas of the application, or which specific user actions, are relevant to the performance problem. We want details, for example:
- Does the slowdown affect you when you open the database? Open views? Run a Job?
- If so, are all actions or certain actions affected? If only certain actions are affected, identify at least one or two of these slow actions to report back.
- Does the slowdown happen in the front end and back end? Or is it limited to just one side?
- Does it occur during a certain time of day?

Questions are good, yet too many questions can get us lost. Write down a list of questions that can get you to the answer. Remember, most issues have core reasons, and then exceptions. A template list of questions is a very valuable tool.
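The "where is the problem?" questions narrow the scope step by step, which can be sketched as a small decision function. This is a hedged illustration only: the function name likely_scope, its two boolean parameters and the returned labels are assumptions, and the result is a first guess to investigate, not a diagnosis.

```python
def likely_scope(all_actions_on_server: bool, other_servers_affected: bool) -> str:
    """Give a first guess at where a slowdown lives, from two answers."""
    # Only actions in one application are slow: start with the application.
    if not all_actions_on_server:
        return "application"
    # Everything on this server is slow but other servers are fine.
    if not other_servers_affected:
        return "server"
    # Other servers (even file servers) are slow too: look at the network,
    # then ask whether it is isolated to a building, wing, or floor.
    return "network"


if __name__ == "__main__":
    print(likely_scope(all_actions_on_server=True, other_servers_affected=False))
```

Each answer removes a whole branch of possibilities, which is why these questions are worth asking before looking at any individual action.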

Slow server example


What changed? In nearly every instance, the customer thinks that nothing noteworthy has changed. If the customer knew something important had changed, he would have tracked down that change instead of talking to you! Some focused questions that help elicit facts may be:

- Has this application (or server) always been slow?
- If not, did it slow down when more users were added? More data?
- Was new functionality recently added to the application?
- Was the software or hardware recently upgraded?
- Can you relay the RFC that was responsible for this change? No RFC?

When you update your ticket with that information, make sure it is clear and detailed. Why? Months from now, it may happen again, and through the knowledgebase someone new in the NOC can search for your notes and find the resolution. Take pride in sharing your knowledge. We all know you have tons of knowledge :)

Spend time browsing through monitoring products: these products are your eyes and ears. Take advantage of them (SolarWinds, SCUM, Keynotes, and whatever else you normally use) to see if a problem is repetitive. Use Google and other resources to see if you can find the issue and its resolution online BEFORE someone tells you it's a problem.

Summary
Prioritize the incident
- Determine impact and severity.
- Does it affect WW customers (P1)?
- How does it compare with other open incidents or problems?
- Don't forget to call Ramy.
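Prioritization can be sketched as a small ranking function. Only the rule that an incident affecting WW customers is a P1 comes from this guideline. The numeric severity scale, the P2 and P3 labels, and the names Incident, priority and work_order are hypothetical and would need to be replaced with the NOC's real definitions.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    affects_ww_customers: bool
    severity: int  # hypothetical scale: 1 is the most severe


def priority(incident):
    """Return a priority label for one incident."""
    # From the guideline: impact on WW customers means P1.
    if incident.affects_ww_customers:
        return "P1"
    # Assumption: all other incidents are split by the hypothetical scale.
    return "P2" if incident.severity <= 2 else "P3"


def work_order(incidents):
    """Sort open incidents so the most urgent one comes first."""
    return sorted(incidents, key=lambda i: (priority(i), i.severity))


if __name__ == "__main__":
    open_incidents = [
        Incident("Slow report on one server", False, 3),
        Incident("Login failures for WW customers", True, 1),
        Incident("Backup job delayed", False, 2),
    ]
    for incident in work_order(open_incidents):
        print(priority(incident), incident.title)
```

Sorting the open incidents together is one way to answer how a new incident compares with the others already open.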

Summary
Develop a problem statement
- State the incident description as you understand it.
- Does the customer agree the description is accurate?
- Does the customer add new, relevant or clarifying details?
- Restate to confirm.

Thanks
Anyone who has never made a mistake has never tried anything new.
