One of the ongoing tasks in industrial automation is troubleshooting. It’s not glamorous, and it’s a quadrant one activity, but it’s necessary. Like all quadrant one activities, the goal is to get it done as fast as possible so you can get back to quadrant two.
Troubleshooting is a process of narrowing the problem domain. The problem domain is all the possible things that could be causing the problem. Let’s say you have a problem getting your computer on the network. The problem can be any one of these things:
- Physical network cable
- Network switch(es)
- Network card
- Software driver
- etc.
To troubleshoot as quickly as possible, you want to eliminate possibilities fast (or at least determine which ones are more likely and which are unlikely). If you don’t have much experience, your best bet is to figure out where the middle point is, then isolate the two halves and determine which half is working right and which isn’t. That cuts the problem domain roughly in half with a single test (assuming there’s only one failure…). So, in the network problem, the physical cable is kind of in the middle. If you unplug it from the back of the computer and plug it into your laptop, can the laptop get on the internet? If yes, the problem’s in your computer; otherwise, it’s upstream. Rinse and repeat.
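The halving approach can be sketched in a few lines of Python. This is only an illustration, not a real diagnostic tool: the chain of components, the `bisect_fault` function, and the `make_probe` helper that simulates a test are all hypothetical names, and the sketch assumes exactly one failure and a test that can tell you whether everything from the upstream end through a given component is working.

```python
from typing import Callable, Sequence


def bisect_fault(chain: Sequence[str], works_up_to: Callable[[int], bool]) -> str:
    """Find the single faulty component in a chain ordered from upstream to downstream.

    works_up_to(i) must return True when everything in chain[0..i] is working.
    Assumes exactly one failure somewhere in the chain.
    """
    lo, hi = 0, len(chain) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1   # the upstream half is fine, so the fault is downstream
        else:
            hi = mid       # the fault is at mid or upstream of it
    return chain[lo]


def make_probe(faulty_index: int) -> Callable[[int], bool]:
    """Simulated test (hypothetical): the path works up to i only if i is before the fault."""
    return lambda i: i < faulty_index


# Hypothetical chain, ordered from the network side toward the computer.
chain = ["network switch", "physical cable", "network card", "software driver"]
found = bisect_fault(chain, make_probe(chain.index("network card")))
```

With four components, this takes two tests instead of up to four. In real life each "probe" is something like plugging the cable into a known-good laptop, so the savings are in walking and swapping, not CPU time.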
As you start to gain experience, you start to get faster because you can start to assign relative probabilities of failure to each component. Maybe you’ve had a rash of bad network cards recently, so you might start by checking that.
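One way to picture this is as a ranked list of suspects. The sketch below is a rough illustration with made-up numbers: the `Suspect` class, the likelihood guesses, and the check times are all hypothetical. Dividing likelihood by the time it takes to check is an added assumption here (a common rule of thumb), not something measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Suspect:
    name: str
    likelihood: float        # gut-feel guess between 0 and 1 (made up)
    minutes_to_check: float  # rough effort to confirm or rule out (made up)


def check_order(suspects: List[Suspect]) -> List[Suspect]:
    """Rank suspects so the most likely, cheapest-to-check ones come first."""
    return sorted(
        suspects,
        key=lambda s: s.likelihood / s.minutes_to_check,
        reverse=True,
    )


# Hypothetical estimates after a rash of bad network cards.
suspects = [
    Suspect("physical cable", likelihood=0.20, minutes_to_check=2),
    Suspect("network switch", likelihood=0.05, minutes_to_check=10),
    Suspect("network card", likelihood=0.60, minutes_to_check=5),
    Suspect("software driver", likelihood=0.15, minutes_to_check=10),
]
ordered_names = [s.name for s in check_order(suspects)]
```

With these guesses the network card comes out on top, followed by the cable. Change the estimates and the order changes, which is really the point: experience is what fills in the numbers.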
In industrial automation, I’ve seen a pattern that pops up again and again that helps me narrow the problem domain, so I thought I’d share. Consider this scenario: someone comes to you with a problem: “the machine works fine for a long time, and then it starts throwing fault XYZ (motion timeout), and then after ten minutes of clearing faults, it’s working again.” These annoying intermittent problems can be a real pain, because it’s sometimes hard to reproduce the problem, and it’s hard to know if you’ve fixed it.
However, if you ask yourself one more question, you can easily narrow it down. “Is the sensor that detects the motion complete condition a discrete or analog sensor?” If it’s a discrete sensor, the chance that the problem is in the logic is almost nil. I know our first temptation is always to break out the laptop, and a lot of people have this unrealistic expectation that we can fix stuff like this with a few timers here or there, but that’s not going to help. If you have discrete logic that runs perfectly for a long time and then suddenly has problems, it’s unlikely there’s a problem in the logic. There’s a 99% certainty that it’s a physical problem. Start looking for physical abnormalities. Does the sensor sense material or a part? If yes, is the sensor position sensitive to normal fluctuations in the material specifications? Is the sensor affected by ambient light? Is the sensor mount loose? Is the air pressure marginal? Is the axis slowing down due to wear?
The old adage, “when all you have is a hammer, every problem is a nail”, is just as true when the only tool you have is a laptop. Don’t break out the laptop when all you need is a wrench.
Great post! I agree with the divide and conquer strategy. I also tend to lean towards components in the field first – as they tend to be in the line of fire (e.g. servo motors and cables are exposed to coolant and stresses that drives in a panel are not).
My experience is that intermittent faults are most often mechanically related in some way (a flexing cable, a sensor that barely makes a flag). I had one motion service call where a shuttle would rub on a scrap steel tray. It would run for about 45 minutes before it started faulting out on following error faults. As it ran, it would rub on the tray, build up heat, and expand until it started to fault. After it cooled down for an hour, it would run again for 45 minutes and the process would repeat. I found it when I reached inside the machine and burnt my hand on the rack.
@Ken: Yeah, one of my first questions is, “did you make any program changes recently?” If the answer is definitely “no”, then the chances are it has to be a mechanical/electrical problem. Software is the one thing that rarely “wears out” unless someone messes with it.
Scott, this is also a good reason for NOT breaking out the laptop. Even if you just open the laptop and go online to see what is going on, you are “the last one to touch it” and will inevitably spend more time defending the fact that you didn’t actually make any changes. So unless someone can prove the code changed, leave the laptop closed, break out some basic tools and common sense, get your hands dirty, and actually troubleshoot the issue at hand!
@Laurens: great point! 🙂 “You touch it, you buy it.”