Why Troubleshooting Is Overrated

This post is the result of a thought I had after someone asked me to describe an interesting problem I’d faced. I think they meant troubleshooting, because that’s how I answered it.

Speak to most network engineers about what they love about the job, and troubleshooting will crop up quite frequently. I’ve got to admit, delving into a complex problem in a high-pressure situation, with the clock against you, more often than not gives me a rush of sorts. The CLI-fu rolls off your fingers if you’ve been on point with your studies or you’re an experienced engineer; you methodically tick off what the problem could be, and there’s a “Eureka” moment where you triumphantly declare the root cause.

But then what?

I don’t mean what’s the solution to the problem. That’s usually obvious. In most cases, the root cause is one of these culprits:
– Poor design. E.g: a 1Gb link in a 10Gb path, or designing for ECMP and then realising you’ve used BGP to learn a default route without influencing your metrics, so anything outside your little IGP’s domain is going to be deterministically routed.
– A fault. E.g: Link down in the path somewhere.
– A bug. I hate these. They usually make you question your sanity.
– It’s just the way it is (a.k.a features). E.g: Finding out Junipers don’t load balance traffic across links by default. Or there’s some legacy kit in the way that works in a strange way.
– Good old human error. E.g: Misconfigured IP addresses are my favourite.

None of these will endear you to the core business, because the default position is along the lines of “why did it take you so long to find it?”, or “when is it going to be fixed?”. Not “good on you for finding a complex problem in a high pressure situation with little to no help”. Sure, a good boss will throw you a bone, but 99 times out of 100, the business doesn’t care.

What I do find cool these days is scale. And by scale, I mean automating away as much as possible to save me time, time I can then use to work on things I find interesting. I’d rather spend 3 hours automating away a mundane task that takes 2 hours each time than repetitively perform it by hand. For example, a lot of the time I have to create reverse DNS entries. A couple of hours spent on a basic Python script is far better value than an hour spent laboriously copying and pasting octet after octet and sanity checking what you’ve put in there. If things go well, then your teammates can use it too. So suddenly, what used to take 5 x 1 man-hours becomes 5 x 0.2 man-hours, at the one-off expense of, say, 2 man-hours of scripting. That’s a 40% saving after only five runs of a single task, and it grows with every run after that. How cool is that?
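To make the reverse DNS example concrete, here’s a minimal sketch of what such a script can look like. The zone format and hostnames are illustrative assumptions, not the script I actually wrote; the heavy lifting is done by Python’s standard ipaddress module:

```python
import ipaddress

def ptr_records(pairs):
    """Build (PTR owner name, hostname) tuples for reverse DNS entries.

    pairs: iterable of (ip_string, hostname) tuples. Works for both
    IPv4 (in-addr.arpa) and IPv6 (ip6.arpa) addresses.
    """
    records = []
    for ip, host in pairs:
        addr = ipaddress.ip_address(ip)
        # reverse_pointer gives e.g. '10.2.0.192.in-addr.arpa' for 192.0.2.10,
        # so no manual octet reversal (or the typos that come with it)
        records.append((addr.reverse_pointer + ".", host))
    return records

# Hypothetical hosts, zone-file style output
for owner, host in ptr_records([("192.0.2.10", "rtr1.example.net"),
                                ("2001:db8::1", "rtr2.example.net")]):
    print(f"{owner} IN PTR {host}.")
```

The win isn’t the few lines of code, it’s that the octet reversal and sanity checking are now done by the library instead of by hand.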

And that is why troubleshooting is no longer the most exciting thing I do.

Working with Junos and Optics

Found myself troubleshooting a pesky fibre connection that wouldn’t come up. I was looking for a command that would show me if a light was being received on the interface and found these beauties:

show interfaces diagnostics optics xe-4/1/0

show chassis pic fpc-slot 4 pic-slot 1

The first shows light levels on the relevant optic. The second helps you figure out what type of cabling you should be using; handy when you don’t know whether the fibre should be single-mode or multi-mode.


When adding a VLAN doesn’t add a VLAN

Vendor: Cisco
Software version: 12.2(33)SXI7
Hardware: 6509-E

So this is a typical stupid question. How do you add VLANs to a trunk?

Assuming you started with a port with default configuration on it, it would be:

 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan
 switchport trunk native vlan

Now, I was interrupted while doing this by someone stating categorically that

 switchport trunk allowed vlan

Should be:

 switchport trunk allowed vlan add

Not really the way I would do it on a new switchport, but, not wanting to hurt any feelings, I proceeded and saw this:

 TEST(config-if)#switchport trunk allowed vlan add 10,20,30
 TEST(config-if)#do show run int gi9/14
 Building configuration...
Current configuration : 279 bytes
 interface GigabitEthernet9/14
 description TEST
 switchport trunk encapsulation dot1q
 switchport mode trunk
 storm-control broadcast level 0.50
 storm-control multicast level 0.50
 no cdp enable
 no lldp transmit
 no lldp receive

To cut a long story short, the switch accepts the configuration but doesn’t apply it. This led to a lot of head scratching, because you’d think it should work. Checking the switchport state with:

 show interface gi9/14 trunk

showed a state of “other”, and

 show interface gi9/14 capabilities

was not much help either.

Any CCNP student can tell you that this is probably due to `switchport trunk allowed vlan add` not being applicable to ports that trunk all VLANs (the Cisco default). My guess is that as soon as `switchport mode trunk` is entered with no explicit allowed list, adding VLANs with `switchport trunk allowed vlan add` silently does nothing.

Still, it would be nice to have an error message confirming it instead of a silent failure.
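To make the suspected semantics concrete, here’s a toy model of how I read the allowed-VLAN list behaviour. This is my interpretation of what the 6509 appeared to do, not Cisco’s actual implementation:

```python
ALL_VLANS = frozenset(range(1, 4095))  # the implicit default: all VLANs allowed

class TrunkPort:
    def __init__(self):
        self.allowed = None  # None means the default "all VLANs", no explicit list

    def allowed_vlan(self, vlans):
        """'switchport trunk allowed vlan 10,20' - replaces the list outright."""
        self.allowed = set(vlans)

    def allowed_vlan_add(self, vlans):
        """'switchport trunk allowed vlan add 10,20' - only extends an
        explicit list. With the default all-VLANs config there is nothing
        to extend, so it silently no-ops, matching the observed behaviour."""
        if self.allowed is not None:
            self.allowed |= set(vlans)
        # else: command accepted but not applied - no error, no config change

    def effective(self):
        return ALL_VLANS if self.allowed is None else frozenset(self.allowed)

port = TrunkPort()
port.allowed_vlan_add([10, 20, 30])   # on a default trunk: nothing happens
print(len(port.effective()))          # still all 4094 VLANs
port.allowed_vlan([10])               # set an explicit list first...
port.allowed_vlan_add([20, 30])       # ...and then 'add' works as expected
print(sorted(port.effective()))
```

Under this model, `allowed vlan` first and `allowed vlan add` afterwards is the only order that behaves predictably, which is exactly how I would have configured the port in the first place.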

Getting Paramiko To Work

I’ve had a lot of struggles getting Paramiko to work and today I’ve finally managed it.
Here’s my setup:

-bash-3.2$ cat /etc/redhat-release
 Red Hat Enterprise Linux Server release 7.1 (Maipo)

This is fairly important.

pip install paramiko

Didn’t work for me. Some Googling led me to believe I needed the python-dev package installed. So I tried:

yum install python-dev

This didn’t work either, so I searched for the right package name using:

yum search python-dev

The above is my new favourite command. It turned up:

$ yum search python-dev
 Loaded plugins: product-id, rhnplugin, subscription-manager
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 =============================== N/S matched: python-dev ===============================
 python-devel.x86_64 : The libraries and header files needed for Python development

So the package I needed was python-devel. I installed it and retried:

yum install python-devel

pip install paramiko

And I was done!

BGP RIB Failure

An infrequent, yet interesting issue that comes up occasionally is when BGP encounters RIB failures. Usually, it takes the form of a prefix which you’d expect a router to learn via eBGP sitting in its RIB via a routing protocol with a better (lower) administrative distance instead.

To understand this problem, we first need to realise that “RIB failure” in a “show ip bgp” output means that a route offered to the RIB by BGP has not been accepted. This is not a cause for concern if you have a static or connected route to that network on the router, but if you’re expecting the route to come via eBGP then you can infer that something is misconfigured in your routing.

This can also be simplified to “BGP does not care about administrative distance when selecting a path”.

For reference, the path selection algorithm goes:

– A valid route (next hop reachable).
– Weight (Cisco proprietary). Bigger is better.
– Local preference. Bigger is better.
– Locally originated routes are preferred.
– AS path length. Shorter is better.
– Origin code. IGP > EGP > Incomplete.
– Multi-Exit Discriminator. Lower is better.
– Neighbour type. eBGP better than iBGP.
– IGP metric to next hop. Lower is better.
– Oldest path, then lowest router ID, as the final tie-breakers.
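The path selection order can be sketched as a sort key. This is a simplified model with attribute names of my own choosing; real implementations have extra wrinkles (per-AS MED comparison, route age and so on) that I’ve deliberately ignored:

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}

@dataclass
class Path:
    """One candidate path for a prefix; field names are illustrative."""
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path_len: int = 0
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    router_id: str = "0.0.0.0"

def preference_key(p: Path):
    """Tuple mirroring the best-path order: each element is arranged so
    that 'smaller sorts first' means 'more preferred'."""
    return (
        -p.weight,                 # higher weight wins
        -p.local_pref,             # higher local preference wins
        not p.locally_originated,  # locally originated wins
        p.as_path_len,             # shorter AS path wins
        ORIGIN_RANK[p.origin],     # IGP > EGP > incomplete
        p.med,                     # lower MED wins
        not p.ebgp,                # eBGP beats iBGP
        p.igp_metric,              # lower IGP metric to next hop wins
        p.router_id,               # lowest router ID (string compare, fine for a sketch)
    )

def best_path(paths):
    return min(paths, key=preference_key)

a = Path(local_pref=200, as_path_len=3)
b = Path(local_pref=100, as_path_len=1)
print(best_path([a, b]) is a)  # local preference is checked before AS path length
```

Note that administrative distance appears nowhere in the key, which is the whole point: BGP picks its best path first, and only then does the RIB decide whether to accept it.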

Verifying SSL Certificate Chains

Found this link very useful doing this:


Some useful commands:
Display a certificate:
openssl x509 -in test-cert-top.pem -noout -text

Display a certificate’s issuer:
openssl x509 -in test-cert-top.pem -noout -issuer

Display a certificate’s subject:
openssl x509 -in test-cert-top.pem -noout -subject

Verify a certificate:
openssl verify test-cert-top.pem

Verify a certificate chain with 3 certificates:
openssl verify -CAfile test-cert-bottom.pem -untrusted test-cert-middle.pem test-cert-top.pem
The -CAfile option indicates which certificate is used as the trusted root certificate, while -untrusted supplies the intermediate certificate in the chain to be validated.

Verify a certificate chain with 2 certificates:
openssl verify -CAfile test-cert-bottom.pem test-cert-middle.pem
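Conceptually, what openssl verify does with those options is walk issuer/subject links from the leaf towards a trusted root. Here’s a structural toy model of just that walk; signature checking, validity dates and everything else cryptographic are deliberately left out:

```python
def build_chain(leaf, untrusted, roots):
    """Walk issuer links from the leaf certificate towards a trusted root.

    Certs are modelled as (subject, issuer) pairs only - this sketches the
    chain-building logic, not real X.509 verification.
    """
    by_subject = {subject: (subject, issuer) for subject, issuer in untrusted}
    trusted_subjects = {s for s, _ in roots}
    chain = [leaf]
    current = leaf
    while True:
        subject, issuer = current
        if issuer in trusted_subjects:
            chain.append(next(r for r in roots if r[0] == issuer))
            return chain
        if issuer not in by_subject:
            raise ValueError(f"unable to get issuer certificate for {subject!r}")
        current = by_subject[issuer]
        chain.append(current)

# Mirrors: openssl verify -CAfile bottom -untrusted middle top
top = ("test-cert-top", "test-cert-middle")
middle = ("test-cert-middle", "test-cert-bottom")
bottom = ("test-cert-bottom", "test-cert-bottom")  # self-issued root
print([c[0] for c in build_chain(top, [middle], [bottom])])
```

This also shows why the two-certificate case needs no -untrusted option: the middle certificate’s issuer is the root itself, so the walk finishes in one step.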

A10 Health Monitors

This post is an equivalence check of A10 vs ACE probes/health monitors.


ACE-A# show probe

probe : tcp-3121-probe-1
type : TCP
state : ACTIVE
port : 3121        address :          addr type : -
interval : 10      pass intvl : 30    pass count : 2
fail count : 2     recv timeout : 5

--------------------- probe results --------------------
probe association probed-address probes failed passed health
------------------- ---------------+----------+----------+----------+-------
serverfarm : vip-
real : ip-[3121] 1286028 1104 1284924 SUCCESS

interval – how often health checks are sent to a server marked “UP”
pass intvl – how often health checks are sent to a server marked “DOWN”
pass count – the number of successful probes required to mark a server as “UP”
fail count – the number of unsuccessful probes required to mark a server as “DOWN”
recv timeout – how long to wait for a response before a probe fails
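The pass count / fail count behaviour above can be modelled as a small state machine. This is my own sketch of the semantics as described, not vendor code:

```python
class HealthMonitor:
    """Tracks a server's state from probe results, using ACE-style
    pass count / fail count thresholds."""

    def __init__(self, pass_count=2, fail_count=2):
        self.pass_count = pass_count
        self.fail_count = fail_count
        self.state = "UP"
        self._streak = 0  # consecutive results pointing the opposite way

    def probe_result(self, success):
        if self.state == "UP":
            self._streak = self._streak + 1 if not success else 0
            if self._streak >= self.fail_count:
                self.state, self._streak = "DOWN", 0
        else:
            self._streak = self._streak + 1 if success else 0
            if self._streak >= self.pass_count:
                self.state, self._streak = "UP", 0
        return self.state

mon = HealthMonitor(pass_count=2, fail_count=2)
for ok in (True, False, False):   # two consecutive failures...
    mon.probe_result(ok)
print(mon.state)                  # ...mark the server DOWN
for ok in (True, True):           # two consecutive passes...
    mon.probe_result(ok)
print(mon.state)                  # ...bring it back UP
```

On the ACE, the probe interval would also switch from “interval” to “pass intvl” on the UP-to-DOWN transition; on the A10, as noted below, the same interval is used in both states.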

a10-1[test-1]#show health monitor
Idle = Not used by any server In use = Used by server
Attrs = Attributes G = GSLB
Monitor Name Interval Retries Timeout Up-Retries Method Status Attrs
tcp-443-monitor-1 30 2 5 2 TCP In use

Interval – how often health checks are sent to a healthy server
Up-Retries – the number of successful probes required to mark a server as “UP” after it has come back from a “DOWN” state.

The A10 has no concept of a “pass-interval”, unlike the ACE. So the same time period used for probes to servers in an “UP” state is used for those in a “DOWN” state. This is not going to matter 9 times out of 10, because the A10 is a lot beefier than the ACE and probe overhead is low. However, it does throw up errors if you’re using a conversion tool.