Why Troubleshooting Is Overrated

This post is the result of a thought I had after someone asked me to describe an interesting problem I’d faced. I think they meant troubleshooting, because that’s how I answered it.

Speak to most network engineers about what they love about the job, and troubleshooting will crop up quite frequently. I’ve got to admit, being able to delve into a complex problem in a high-pressure situation, with the clock against me, does give me a rush of sorts more often than not. If you’ve been on point with your studies, or you’re an experienced engineer, the CLI-fu rolls off your fingers. You methodically tick off what the problem could be, and then there’s a “Eureka” moment where you triumphantly declare the root cause.

But then what?

I don’t mean what’s the solution to the problem. That’s usually obvious. In most cases, the root cause is one of these culprits:
– Poor design. E.g. a 1Gb link in a 10Gb path, or designing for ECMP and then realising that BGP was used to learn a default route without influencing the metrics, so anything outside your little IGP’s domain is going to be deterministically routed.
– A fault. E.g. a link down in the path somewhere.
– A bug. I hate these. They usually make you question your sanity.
– It’s just the way it is (a.k.a. features). E.g. finding out Junipers don’t load balance traffic across links by default, or that there’s some legacy kit in the way that works in a strange way.
– Good old human error. E.g. misconfigured IP addresses, which are my favourite.

None of these will endear you to the core business, because the default position is along the lines of “why did it take you so long to find it?” or “when is it going to be fixed?”. Not “good on you for finding a complex problem in a high-pressure situation with little to no help”. Sure, a good boss will throw you a bone, but 99 times out of 100, the business doesn’t care.

What I do find cool these days is scale. And by scale, I mean automating away as much as possible to save me time, which I can then use to work on things I find interesting. I’d rather spend 3 hours automating away a mundane task that takes 2 hours each time than perform it by hand every time. For example, I often have to create reverse DNS entries. A couple of hours spent once on a basic Python script is much better value than an hour spent, every time, laboriously copying and pasting octet after octet and sanity checking what you’ve put in there. If things go well, your teammates can use it too. So suddenly, what used to take 5 x 1 man hour goes to 0.2 man hours, at the one-off expense of, say, 2 man hours to write the script. That is 2.2 hours instead of 5, a time saving of more than 50% for a single task. How cool is that?
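To give a flavour of what such a script might look like, here is a minimal sketch rather than the one I actually use. It assumes you want BIND-style PTR lines; the function name `ptr_record`, the default TTL and the example hostnames and addresses are all made up for illustration. It leans on Python’s standard `ipaddress` module, so a mistyped address is rejected instead of quietly ending up in the zone file.

```python
import ipaddress


def ptr_record(hostname, ip, ttl=3600):
    """Return one BIND-style PTR line for a hostname and its IP address.

    The name, the TTL default and the output layout are illustrative
    assumptions, not a standard tool.
    """
    # ip_address() raises ValueError on a malformed address, which is
    # the sanity check you would otherwise be doing by eye.
    addr = ipaddress.ip_address(ip)
    # PTR targets should be fully qualified, so make sure of the trailing dot.
    fqdn = hostname.rstrip(".") + "."
    # reverse_pointer yields the in-addr.arpa (or ip6.arpa) name without a trailing dot.
    return f"{addr.reverse_pointer}. {ttl} IN PTR {fqdn}"


if __name__ == "__main__":
    # Hypothetical hosts using documentation address ranges.
    hosts = {
        "router1.example.net": "192.0.2.10",
        "router2.example.net": "192.0.2.11",
        "core1.example.net": "2001:db8::1",
    }
    for name, ip in hosts.items():
        print(ptr_record(name, ip))
```

Feed it a list of hosts exported from wherever you keep your addressing, and the octet reversing is no longer a job for tired eyes.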

And that is why troubleshooting is no longer the most exciting thing I do.
