The Two Principles Of Troubleshooting

  1. Never trust someone else’s configuration.
  2. Don’t trust your own configuration.

But in all seriousness, if you’re migrating configuration, this is a good place to start:

  • Check all your IP addresses are consistent.
  • Check your masks are consistent.
  • Check your interfaces are correct.
  • If you’re working with peers, check your IP addresses for the peers are correct. I mean all 4 octets. Not just the last one, or two, or three. ALL FOUR. If it’s v6, then FML. Bite the bullet and write a script.
  • Is there a naming convention to follow? There’s a temptation when migrating to stick with the old name, but new devices may require that a different convention is adhered to. Reasons for this range from the whimsical to the valid.
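
For the peer check, a script along these lines might do. It's a minimal sketch, assuming you've already pulled the peer addresses out of the old and new configs into two dictionaries keyed by peer name. The function name, peer names and addresses are all made up for illustration.

```python
import ipaddress


def mismatched_peers(old, new):
    """Return {peer: (old_addr, new_addr)} for peers whose addresses differ.

    Addresses are parsed before comparing, so two spellings of the same
    IPv6 address (compressed vs. written out in full) are not flagged.
    A peer missing from `new` is reported with None as its new address.
    """
    problems = {}
    for peer, old_addr in old.items():
        new_addr = new.get(peer)
        if new_addr is None:
            problems[peer] = (old_addr, None)
        elif ipaddress.ip_address(old_addr) != ipaddress.ip_address(new_addr):
            problems[peer] = (old_addr, new_addr)
    return problems


# Hypothetical data: peer-a has a typo in the second octet,
# peer-b and peer-c are the same addresses written differently.
old_peers = {
    "peer-a": "10.20.30.1",
    "peer-b": "2001:db8::1",
    "peer-c": "2001:db8:0:0:0:0:0:2",
}
new_peers = {
    "peer-a": "10.21.30.1",
    "peer-b": "2001:0db8:0000:0000:0000:0000:0000:0001",
    "peer-c": "2001:db8::2",
}

print(mismatched_peers(old_peers, new_peers))
# {'peer-a': ('10.20.30.1', '10.21.30.1')}
```

Parsing with ipaddress rather than comparing strings is the point here: it catches the one-octet typo without raising false alarms over v6 formatting.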

If you’re coming up with something new, and it involves addressing new interfaces, then start with this:

  • First check your IP allocations are correct. By this, I mean check if you have any hierarchy or ordering. For example, do you reserve addresses by site, geographic location or application? If you do, then make sure these are consistent with what you’ve planned.
  • Is your addressing valid? i.e. are the subnets and host addresses you’ve assigned correct? For example, make sure you’ve not assigned subnet addresses to hosts. This is an important detail to keep in mind when you’re allocating away from the border of an octet (/25s to /32s).
  • Is there a naming convention to follow?
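
The "subnet address assigned to a host" check is easy to get a machine to do. Here's a minimal IPv4-only sketch using Python's ipaddress module; the function name and the example addresses are made up, and it treats /31s and /32s as having no reserved addresses.

```python
import ipaddress


def check_host_address(cidr):
    """Return a list of problems with an interface address like '10.1.1.128/25'."""
    iface = ipaddress.IPv4Interface(cidr)
    net = iface.network
    problems = []
    # A /31 (point-to-point) or /32 (loopback) has no separate subnet or
    # broadcast address, so every address in it is usable.
    if net.prefixlen >= 31:
        return problems
    if iface.ip == net.network_address:
        problems.append(f"{cidr} is the subnet address of {net}")
    elif iface.ip == net.broadcast_address:
        problems.append(f"{cidr} is the broadcast address of {net}")
    return problems


for candidate in ("10.1.1.128/25", "10.1.1.127/25", "10.1.1.129/25"):
    print(candidate, check_host_address(candidate) or "ok")
```

The first two candidates are exactly the kind of mistake that's hard to spot by eye once you're away from an octet boundary: .128 and .127 look like ordinary host addresses, but in a /25 they're a subnet address and a broadcast address.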

Apart from this, when you’re filling in a scripted or “run-booked” change, try this for starters:

  • Make notes first.
  • Delete all the example or pre-filled configuration that doesn’t apply to you, to prevent human error.
  • If it’s a large change involving multiple stages – for example, you pre-configure interfaces – then track your work. If you’re using Excel for your runbooks then consider using colours to tag what’s complete (green for done, red for not done, or not being done, yellow for things that need checking etc.).
  • If you’re going to work with technologies you’re not familiar with, then consider educating yourself on them in order to troubleshoot accurately.

Saving Backup/Rescue Config on Juniper

A lot of times I find myself having to back a config up on a Juniper before I start work. Usually, I want a quick point I can restore to if I need to roll back. So enter rescue configurations to the, errr, rescue?

request system configuration rescue save

This saves the currently committed configuration as a rescue configuration, which you can easily roll back to with:

#rollback rescue

You can also save the current configuration to file using:
>file copy /config/juniper.conf.gz /var/tmp/temp_backup.cfg

/config/juniper.conf.gz holds the currently active (committed) configuration.

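Potentially, you could stash several files in /var/tmp/ this way. To restore from your backup, use #load override /var/tmp/temp_backup.cfg and then commit. (load replace only swaps in statements tagged with replace: in the file, so it isn’t the right tool for restoring a whole configuration.)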

View your stashed files using file list /var/tmp

Why Troubleshooting Is Overrated

This post is the result of a thought I had after someone asked me to describe an interesting problem I’d faced. I think they meant troubleshooting, because that’s how I answered it.

Speak to most network engineers about what they love about the job, and troubleshooting will crop up quite frequently. I’ve got to admit, being able to delve into a complex problem in a high-pressure situation with a clock against it does, more often than not, give me a rush of sorts. The CLI-fu rolls off your fingers if you’ve been on point with your studies or you’re an experienced engineer. You methodically tick off what the problems could be, and there’s a “Eureka” moment where you triumphantly declare the root cause.

But then what?

I don’t mean what’s the solution to the problem. That’s usually obvious. In most cases, the root cause is one of these culprits:
– Poor design. E.g: a 1Gb link in a 10Gb path, or designing for ECMP and then realising you’ve used BGP to learn a default route and not influenced your metrics, so anything outside your little IGP’s domain is going to be deterministically routed.
– A fault. E.g: Link down in the path somewhere.
– A bug. I hate these. They usually make you question your sanity.
– It’s just the way it is (a.k.a features). E.g: Finding out Junipers don’t load balance traffic across links by default. Or there’s some legacy kit in the way that works in a strange way.
– Good old human error. E.g: Misconfigured IP addresses are my favourite.

None of these will endear you to the core business, because the default position is along the lines of “why did it take you so long to find it”, or “when is it going to be fixed”. Not “good on you for finding a complex problem in a high-pressure situation with little to no help”. Sure, a good boss will throw you a bone, but 99 times out of 100, the business doesn’t care.

What I do find cool these days is scale. And by scale, I mean automating away as much as possible to save me time, which I can then use to work on things I find interesting. I’d rather spend 3 hours automating away a mundane task that takes 2 hours each time than perform it by hand every time. For example, a lot of times I have to create reverse DNS entries. A couple of hours spent on a basic Python script is a lot better value than an hour spent laboriously copying and pasting octet after octet and sanity checking what you’ve put in there. If things go well, then your teammates can use it too. So suddenly, what used to take 5 x 1 man-hours goes to 0.2 man-hours, at a one-off cost of, say, 2 man-hours. So 2 hours spent writing a script can result in a time saving of more than 50%. How cool is that?
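
To give a flavour of the reverse DNS case, here's a minimal sketch of that sort of script. It assumes the input is a simple hostname-to-address mapping and that a plain "owner IN PTR target" line is what your DNS tooling wants. The function name and the example hosts are made up.

```python
import ipaddress


def ptr_records(hosts):
    """Build PTR record lines from a {hostname: ip_address} mapping."""
    lines = []
    for hostname, addr in hosts.items():
        # reverse_pointer does the octet (or nibble) reversal for us,
        # e.g. 192.0.2.10 -> 10.2.0.192.in-addr.arpa
        owner = ipaddress.ip_address(addr).reverse_pointer
        lines.append(f"{owner}. IN PTR {hostname}.")
    return lines


# Hypothetical hosts, using documentation address ranges
example_hosts = {
    "core1.example.com": "192.0.2.10",
    "core2.example.com": "2001:db8::1",
}

for line in ptr_records(example_hosts):
    print(line)
```

No copying and pasting of octets, and the v6 nibble format comes for free.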

And that is why troubleshooting is no longer the most exciting thing I do.

Working with JunOs and Optics

Found myself troubleshooting a pesky fibre connection that wouldn’t come up. I was looking for a command that would show me if light was being received on the interface and found these beauties:

show interfaces diagnostics optics xe-4/1/0

show chassis pic fpc-slot 4 pic-slot 1

The first shows information on light levels on the relevant optic. The second will help you figure out what type of cabling you need to be using. Handy when you don’t know if it should be single or multi mode.


When adding a VLAN doesn’t add a VLAN

Vendor: Cisco
Software version: 12.2(33)SXI7
Hardware: 6509-E

So this is a typical stupid question. How do you add VLANs to a trunk?

Assuming you started with a port with default configuration on it, it would be:

 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan
 switchport trunk native vlan

Now, I was interrupted while doing this by someone stating categorically that

 switchport trunk allowed vlan

should be:

 switchport trunk allowed vlan add

Not really the way I would do it on a new switchport, but not wanting to hurt feelings I proceeded and saw this:

 TEST(config-if)#switchport trunk allowed vlan add 10,20,30
 TEST(config-if)#do show run int gi9/14
 Building configuration...
Current configuration : 279 bytes
 interface GigabitEthernet9/14
 description TEST
 switchport trunk encapsulation dot1q
 switchport mode trunk
 storm-control broadcast level 0.50
 storm-control multicast level 0.50
 no cdp enable
 no lldp transmit
 no lldp receive

To cut a long story short, the switch takes the configuration, but doesn’t apply it. It led to a lot of head scratching, because you’d think it should work. Checking the switchport state with:

 show interface gi9/14 trunk

shows a state of “other”.

 show interface gi9/14 capabilities

was not much help either.

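Any CCNP student can tell you that this is probably due to `switchport trunk allowed vlan add` not being applicable to ports that have all VLANs trunked on them (which is the Cisco default). The allowed list on a new trunk is already every VLAN, so adding 10, 20 and 30 to it changes nothing, and a default allowed list doesn't show up in the running config. I'd guess that as soon as `switchport mode trunk` is entered in the CLI, adding VLANs using `switchport trunk allowed vlan add` becomes a no-op until the list has been narrowed down with a plain `switchport trunk allowed vlan`.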

Still, it would be nice to have an error message confirming it instead of a silent failure.
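
The reasoning is easier to see if you think of the allowed list as a set. Below is a toy model in Python, not how IOS implements anything; the function names are made up, and it just assumes "add" is a union and the plain command is a replacement.

```python
# Default allowed list on a new trunk: every VLAN, 1-4094
ALL_VLANS = set(range(1, 4095))


def allowed_vlan_add(current, vlans):
    """Toy model of 'switchport trunk allowed vlan add': union with the current list."""
    return current | set(vlans)


def allowed_vlan_set(vlans):
    """Toy model of plain 'switchport trunk allowed vlan': replace the list outright."""
    return set(vlans)


# Adding to the default list leaves it unchanged, so there is nothing new to show
print(allowed_vlan_add(ALL_VLANS, [10, 20, 30]) == ALL_VLANS)

# Setting the list first, then adding, behaves the way you'd expect
print(sorted(allowed_vlan_add(allowed_vlan_set([10]), [20, 30])))
```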

Getting Paramiko To Work

I’ve had a lot of struggles getting Paramiko to work and today I’ve finally managed it.
Here’s my setup:

-bash-3.2$ cat /etc/redhat-release
 Red Hat Enterprise Linux Server release 7.1 (Maipo)

This is fairly important, because package names differ between distributions.

pip install paramiko

Didn’t work for me. Some Googling led me to believe I needed the python-dev package installed. So I tried:

yum install python-dev

This didn’t work either, so I searched for the package using:

yum search python-dev

The above is my new favourite command. It turned up:

$ yum search python-dev
 Loaded plugins: product-id, rhnplugin, subscription-manager
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 ==================================================================================================== N/S matched: python-dev =====================================================================================================
 python-devel.x86_64 : The libraries and header files needed for Python development

After installing python-devel with yum, I then did a:

pip install paramiko

And I was done!
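
A quick way to confirm the install took is to check that Python can actually find the module. This is a minimal sketch using only the standard library; the helper name is made up.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


print("paramiko importable:", is_importable("paramiko"))
```

If that prints False, pip has probably installed into a different Python from the one you're running.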

BGP RIB Failure

An infrequent yet interesting issue is when BGP encounters RIB failures. Usually, it takes the form of a prefix which you’d expect a router to have in its RIB via eBGP being installed from a source with a better (lower) administrative distance instead.

To understand this problem, we first need to realise that “RIB failure” in a “show ip bgp” output implies that a route offered to the RIB by BGP has not been accepted. This is not a cause for concern if you have a static or connected route to that network on the router, but if you’re expecting it to be via eBGP then you can infer that something is misconfigured with your routing.

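This can also be simplified to “BGP does not care about administrative distance when selecting a path”. Administrative distance only comes into play afterwards, when the router decides whether BGP’s best path or a route from another source gets installed in the RIB.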
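
To make that concrete, here's a toy model of the RIB decision for a single prefix. It's a sketch, not how any router implements it. The administrative distances are the Cisco defaults, and the function name and the list-of-sources input are made up.

```python
# Cisco default administrative distances (lower wins)
ADMIN_DISTANCE = {"connected": 0, "static": 1, "ebgp": 20, "ospf": 110, "ibgp": 200}

BGP_SOURCES = ("ebgp", "ibgp")


def install_route(candidates):
    """Return (winning_source, bgp_rib_failure) for one prefix.

    `candidates` lists the sources offering a route for the prefix.
    BGP reports a RIB failure when it offered a route but another
    source won on administrative distance.
    """
    winner = min(candidates, key=ADMIN_DISTANCE.__getitem__)
    bgp_offered = any(source in BGP_SOURCES for source in candidates)
    return winner, bgp_offered and winner not in BGP_SOURCES


print(install_route(["ebgp", "static"]))  # static wins, BGP shows a RIB failure
print(install_route(["ebgp", "ospf"]))    # eBGP wins, no RIB failure
print(install_route(["ibgp", "ospf"]))    # OSPF wins, BGP shows a RIB failure
```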

For reference, the path selection algorithm goes:

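  1. Next-hop reachability. A path whose next hop isn’t reachable isn’t considered.
  2. Weight (Cisco proprietary). Bigger is better.
  3. Local preference. Bigger is better.
  4. Locally originated route. Preferred over learned routes.
  5. AS path length. Shorter is better.
  6. Origin code. IGP > EGP > Incomplete.
  7. Multi-Exit Discriminator. Lower is better.
  8. Neighbour type. eBGP better than iBGP.
  9. IGP metric to next hop. Lower is better.
  10. Router ID. Lowest wins.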
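
The ordering above maps quite naturally onto a sort key. Here's a simplified sketch, assuming paths with unreachable next hops have already been thrown away. It compares MED across all paths (by default a real router only compares it between paths from the same neighbouring AS) and skips the other tie-breakers a real implementation has. The Path class and its field names are made up for illustration.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path_len: int = 0
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    router_id: str = "0.0.0.0"


def preference_key(p):
    """Smaller tuple means more preferred; 'bigger is better' fields are negated."""
    return (
        -p.weight,                 # bigger weight wins
        -p.local_pref,             # bigger local preference wins
        not p.locally_originated,  # locally originated wins
        p.as_path_len,             # shorter AS path wins
        ORIGIN_RANK[p.origin],     # IGP > EGP > Incomplete
        p.med,                     # lower MED wins
        not p.ebgp,                # eBGP beats iBGP
        p.igp_metric,              # lower IGP metric to next hop wins
        tuple(int(octet) for octet in p.router_id.split(".")),  # lowest router ID
    )


def best_path(paths):
    return min(paths, key=preference_key)


# Local preference is checked before AS path length, so "a" wins despite the longer path
a = Path("a", local_pref=200, as_path_len=5)
b = Path("b", local_pref=100, as_path_len=1)
print(best_path([a, b]).name)  # a
```

Tuples compare element by element, so the first attribute that differs decides the winner, which is exactly how the list is meant to be read.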