The Two Principles Of Troubleshooting

  1. Never trust someone else’s configuration.
  2. Don’t trust your own configuration.

But in all seriousness: if you’re migrating configuration, this is a good place to start:

  • Check all your IP addresses are consistent.
  • Check your masks are consistent.
  • Check your interfaces are correct.
  • If you’re working with peers, check that the peer IP addresses are correct. I mean all four octets. Not just the last one, or two, or three. ALL FOUR. If it’s v6, then FML. Bite the bullet and write a script (see the sketch after this list).
  • Is there a naming convention to follow? There’s a temptation when migrating to stick with the old name, but new devices may require that a different convention be followed. Reasons for this range from the whimsical to the valid.
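
For the peer-address check (and especially for v6), even a throwaway script beats eyeballing octets. Here’s a minimal sketch using Python’s ipaddress module; the link data is hypothetical and you’d feed in whatever your runbook or source of truth holds:

```
import ipaddress

# Hypothetical data: (local interface address with mask, configured peer address)
links = [
    ("192.0.2.1/30", "192.0.2.2"),
    ("2001:db8:0:a::1/127", "2001:db8:0:a::2"),
]

for local, peer in links:
    iface = ipaddress.ip_interface(local)      # local address + subnet
    peer_addr = ipaddress.ip_address(peer)     # what you think the far end is
    if peer_addr == iface.ip:
        print(f"MISMATCH: peer {peer} is the same as the local address")
    elif peer_addr not in iface.network:
        print(f"MISMATCH: peer {peer} is not inside {iface.network}")
    else:
        print(f"OK: {iface.ip} <-> {peer_addr}")
```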

If you’re coming up with something new and it involves addressing new interfaces, then start with this:

  • First check your IP allocations are correct. By this, I mean check if you have any hierarchy or ordering. For example, do you reserve addresses by site, geographic location or application? If you do, then make sure these are consistent with what you’ve planned.
  • Is your addressing valid? i.e. are the subnets and host addresses you’ve assigned correct? For example, make sure you’ve not assigned subnet addresses to hosts. This is an important detail to keep in mind when you’re allocating away from the border of an octet (/25 to /32) – see the sketch after this list.
  • Is there a naming convention to follow?
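
If you want to sanity check that addressing programmatically, the ipaddress module again does the heavy lifting. A minimal sketch, assuming a hypothetical /26 allocation and a list of addresses you’ve assigned to hosts:

```
import ipaddress

subnet = ipaddress.ip_network("198.51.100.64/26")
assigned = ["198.51.100.64", "198.51.100.70", "198.51.100.127", "198.51.100.130"]

for addr in assigned:
    ip = ipaddress.ip_address(addr)
    if ip not in subnet:
        print(f"{ip}: not in {subnet} at all")
    elif ip == subnet.network_address:
        print(f"{ip}: this is the subnet address, not a usable host")
    elif ip == subnet.broadcast_address:
        print(f"{ip}: this is the broadcast address, not a usable host")
    else:
        print(f"{ip}: OK")
```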

Apart from this, in general when you’re filling in a scripted or “run-booked” change, try this for starters:

  • Make notes first.
  • Delete any example or pre-filled configuration that doesn’t apply to you, to prevent human error.
  • If it’s a large change involving multiple stages – for example, you pre-configure interfaces – then track your work. If you’re using Excel for your runbooks then consider using colours to tag what’s complete (green for done, red for not done, or not being done, yellow for things that need checking etc.).
  • If you’re going to work with technologies you’re not familiar with, then consider educating yourself on them in order to troubleshoot accurately.

Saving Backup/Rescue Config on Juniper

A lot of the time I find myself having to back a config up on a Juniper before I start work. Usually, I want a quick point I can restore to if I need to roll back. So enter rescue configurations to the, errr, rescue?

request system configuration rescue save

This saves the currently committed configuration as a rescue configuration, which you can easily roll back to with:

#rollback rescue
#commit

You can also save the current configuration to file using:
>file copy /config/juniper.conf.gz /var/tmp/temp_backup.cfg

/config/juniper.conf.gz holds the active (most recently committed) configuration.

Potentially, you could stash backups in /var/tmp/ using the above, and restore from one with #load replace /var/tmp/temp_backup.cfg (remember to commit afterwards).

View your stashed files using file list /var/tmp

Why Troubleshooting Is Overrated

This post is the result of a thought I had after someone asked me to describe an interesting problem I’d faced. I think they meant troubleshooting, because that’s how I answered it.

Speak to most network engineers about what they love about the job, and troubleshooting will crop up quite frequently. I’ve got to admit, delving into a complex problem in a high-pressure situation with the clock against it does, more often than not, give me a rush of sorts. The CLI-fu rolls off your fingers if you’ve been on point with your studies or you’re an experienced engineer; you methodically tick off what the problem could be, and there’s a “Eureka” moment where you triumphantly declare the root cause.

But then what?

I don’t mean what’s the solution to the problem. That’s usually obvious. In most cases, the root cause is one of these culprits:
– Poor design. E.g. a 1Gb link in a 10Gb path, or designing for ECMP and then realising you’ve used BGP to learn a default route without influencing your metrics, so anything outside your little IGP’s domain is going to be deterministically routed.
– A fault. E.g: Link down in the path somewhere.
– A bug. I hate these. They usually make you question your sanity.
– It’s just the way it is (a.k.a features). E.g: Finding out Junipers don’t load balance traffic across links by default. Or there’s some legacy kit in the way that works in a strange way.
– Good old human error. E.g: Misconfigured IP addresses are my favourite.

None of these will endear you to the core business, because the default position is along the lines of “why did it take you so long to find it”, or “when is it going to be fixed”. Not “good on you for finding a complex problem in a high pressure situation with little to no help”. Sure, a good boss will throw you a bone, but 99 times out of 100, the business doesn’t care.

What I do find cool these days is scale. And by scale, I mean automating away as much as possible to save me time. That time I can use to work on things I find interesting. I’d rather spend 3 hours automating away a mundane task that takes 2 hours each time than repetitively perform it every time. For example, a lot of the time I have to create reverse DNS entries. A couple of hours spent on a basic Python script is a lot better value than an hour spent laboriously copying and pasting octet after octet and sanity checking what you’ve put in there. If things go well, then your teammates can use it too. So suddenly, what used to take 5 x 1 man-hour drops to 0.2 man-hours, at the up-front cost of, say, 2 man-hours. So 2 hours spent writing a script can result in a more than 50% time saving on a single task. How cool is that?
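
To give you an idea of how little it takes, here’s a rough sketch of the reverse DNS case using Python’s ipaddress module. The subnet, hostname pattern and zone are hypothetical; a real script would pull names from whatever your source of truth is:

```
import ipaddress

# Hypothetical subnet and naming pattern
subnet = ipaddress.ip_network("192.0.2.0/29")

for host in subnet.hosts():
    owner = f"{host.reverse_pointer}."                       # e.g. 1.2.0.192.in-addr.arpa.
    target = f"host-{str(host).replace('.', '-')}.example.net."
    print(f"{owner}  IN  PTR  {target}")
```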

And that is why troubleshooting is no longer the most exciting thing I do.

Working with Junos and Optics

Found myself troubleshooting a pesky fibre connection that wouldn’t come up. I was looking for a command that would show me if a light was being received on the interface and found these beauties:

show interfaces diagnostics optics xe-4/1/0

show chassis pic fpc-slot 4 pic-slot 1

The first shows information on light levels on the relevant optic. The second will help you figure out what type of cabling you need to be using. Handy when you don’t know if it should be single or multi mode.


When adding a VLAN doesn’t add a VLAN

Vendor: Cisco
Software version: 12.2(33)SXI7
Hardware: 6509-E

So this is a typical stupid question. How do you add VLANs to a trunk?

Assuming you started with a port with default configuration on it, it would be:

 interface
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan
 switchport trunk native vlan

Now, I was interrupted while doing this by someone stating categorically that

```
 switchport trunk allowed vlan
 ```

Should be:

```
 switchport trunk allowed vlan add
 ```

Not really the way I would do it on a new switchport, but not wanting to hurt feelings I proceeded and saw this:

```
 TEST(config-if)#switchport trunk allowed vlan add 10,20,30
 TEST(config-if)#do show run int gi9/14
 Building configuration...
Current configuration : 279 bytes
 !
 interface GigabitEthernet9/14
 description TEST
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk
 shutdown
 storm-control broadcast level 0.50
 storm-control multicast level 0.50
 no cdp enable
 no lldp transmit
 no lldp receive
 end
 ```

To cut a long story short, the switch takes the configuration but doesn’t apply it. It led to a lot of head scratching, because you’d think it should work. The switchport state when running:

```
 show interface gi9/14 trunk
 ```

shows a state of “other”.

```
 show interface gi9/14 capabilities
 ```

was not much help either.

Any CCNP student can tell you that this is probably due to `switchport trunk allowed vlan add` not applying to ports that trunk all VLANs (which is the Cisco default). My guess is that the `add` keyword only appends to an existing explicit allowed list, so with the default of “all” there is no list to add to and the command silently does nothing. Setting `switchport trunk allowed vlan 10,20,30` first, then using `add` for subsequent changes, is the way to go.

Still, it would be nice to have an error message confirming it instead of a silent failure.

Getting Paramiko To Work

I’ve had a lot of struggles getting Paramiko to work and today I’ve finally managed it.
Here’s my setup:

-bash-3.2$ cat /etc/redhat-release
 Red Hat Enterprise Linux Server release 7.1 (Maipo)

This is fairly important.

pip install paramiko

Didn’t work for me. Some Googling led me to believe I needed the python-dev package installed. So I tried:

yum install python-dev

This didn’t work either, so I searched for the right package name using:

yum search python-dev

The above is my new favourite command. It turned up:

$ yum search python-dev
 Loaded plugins: product-id, rhnplugin, subscription-manager
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 ==================================================================================================== N/S matched: python-dev =====================================================================================================
 python-devel.x86_64 : The libraries and header files needed for Python development

After a yum install python-devel, I then did a:

pip install paramiko

And I was done!
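
For completeness, here’s the sort of minimal sanity check I’d run once it installed. The host, credentials and command are hypothetical placeholders; in anything beyond a lab you’d want proper host key checking rather than AutoAddPolicy:

```
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lazy, lab use only
client.connect("192.0.2.10", username="admin", password="changeme", look_for_keys=False)

stdin, stdout, stderr = client.exec_command("show version")
print(stdout.read().decode())

client.close()
```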

BGP RIB Failure

An infrequent yet interesting issue that comes up occasionally is when BGP encounters RIB failures. Usually it takes the form of a prefix you’d expect the router to learn via eBGP showing up in its RIB via a routing protocol with a worse administrative distance.

To understand this problem, we first need to realise that “RIB failure” in a “show ip bgp” output implies that a route offered to the RIB by BGP has not been accepted. This is not a cause for concern if you have a static or connected route to that network on the router, but if you’re expecting the route to be via eBGP then you can infer that something is misconfigured in your routing.

This can also be simplified to “BGP does not care about administrative distance when selecting a path”: BGP runs its best-path selection first, and only the winning path then competes with other protocols on administrative distance. If it loses that contest, you get a RIB failure.

For reference, the path selection algorithm goes:

– Valid route with a reachable next hop.
– Weight (Cisco proprietary). Bigger is better.
– Local preference. Higher is better.
– Locally originated routes are preferred.
– AS path length. Shorter is better.
– Origin code. IGP > EGP > Incomplete.
– Multi-Exit Discriminator. Lower is better.
– Neighbour type. eBGP is better than iBGP.
– IGP metric to the next hop. Lower is better.
– Finally, tiebreakers: the oldest eBGP route, then the lowest router ID, win.

Verifying SSL Certificate Chains

Found this link very useful doing this:

http://www.herongyang.com/Cryptography/OpenSSL-Certificate-Path-Validation-Tests.html

Some useful commands:
Display a certificate:
openssl x509 -in test-cert-top.pem -noout -text

Display a certificate’s issuer:
openssl x509 -in test-cert-top.pem -noout -issuer

Display a certificate’s subject:
openssl x509 -in test-cert-top.pem -noout -subject

Verify a certificate:
openssl verify test-cert-top.pem

Verify a certificate chain with 3 certificates:
openssl verify -CAfile test-cert-bottom.pem -untrusted test-cert-middle.pem test-cert-top.pem
The -CAfile option indicates which certificate is used as the trusted root certificate, while -untrusted supplies the intermediate certificate in the chain.

Verify a certificate chain with 2 certificates:
openssl verify -CAfile test-cert-bottom.pem test-cert-middle.pem
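
If you end up checking a lot of chains, it’s easy to wrap the three-certificate verification above in a few lines of Python. This is just a sketch that shells out to openssl; the file names are the same placeholders as above:

```
import subprocess

def verify_chain(leaf, intermediate, root):
    """Run 'openssl verify' for a 3-certificate chain and return (ok, output)."""
    result = subprocess.run(
        ["openssl", "verify", "-CAfile", root, "-untrusted", intermediate, leaf],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, (result.stdout + result.stderr).strip()

ok, output = verify_chain("test-cert-top.pem", "test-cert-middle.pem", "test-cert-bottom.pem")
print("PASS" if ok else "FAIL", "-", output)
```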

A10 Health Monitors

This post is an equivalence check of A10 vs ACE probes/health monitors.

ACE:

ACE-A# show probe

probe : tcp-3121-probe-1
type : TCP
state : ACTIVE
----------------------------------------------
port : 3121 address : 0.0.0.0 addr type : -
interval : 10 pass intvl : 30 pass count : 2
fail count: 2 recv timeout: 5

--------------------- probe results --------------------
probe association probed-address probes failed passed health
------------------- ---------------+----------+----------+----------+-------
serverfarm : vip-11.95.79.90_3121
real : ip-11.95.79.68[3121]
11.95.79.68 1286028 1104 1284924 SUCCESS

interval – how often health checks are sent to a healthy server
pass intvl – how often health checks are sent to a server marked “DOWN”
pass count – the number of successful probes required to mark a server as “UP”
fail count – the number of failed probes required to mark a server as “DOWN”
recv timeout – how long to wait for a response before a probe fails


A10:

a10-1[test-1]#show health monitor
Idle = Not used by any server In use = Used by server
Attrs = Attributes G = GSLB
Monitor Name Interval Retries Timeout Up-Retries Method Status Attrs
---------------------------------------------------------------------------------
tcp-443-monitor-1 30 2 5 2 TCP In use

Interval – how often health checks are sent to a server
Up-Retries – the number of successful probes required to mark a server as “UP” after it has come back from a “DOWN” state

Unlike the ACE, the A10 has no concept of a “pass interval”, so the same interval used for probing servers in an “UP” state is used for those in a “DOWN” state. Nine times out of ten this won’t matter, because the A10 is a lot beefier than an ACE and the probe overhead is small. However, it does throw up errors if you’re using a conversion tool.
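
If you’re doing this conversion in bulk, the field mapping is simple enough to encode. This is only my reading of the equivalences above, so treat the mapping (particularly fail count to Retries) as an assumption to verify against your own kit:

```
# ACE probe values taken from the example output above
ace_probe = {
    "interval": 10,       # seconds between checks for a healthy server
    "pass_intvl": 30,     # seconds between checks for a DOWN server (no A10 equivalent)
    "pass_count": 2,      # successes needed to mark a server UP
    "fail_count": 2,      # failures needed to mark a server DOWN
    "recv_timeout": 5,    # seconds before a single probe fails
}

# Assumed mapping onto the A10 health monitor fields
a10_monitor = {
    "interval": ace_probe["interval"],
    "up-retries": ace_probe["pass_count"],
    "retries": ace_probe["fail_count"],        # assumption: fail count ~ Retries
    "timeout": ace_probe["recv_timeout"],
    # "pass_intvl" is dropped: the A10 probes DOWN servers at the normal interval
}

print(a10_monitor)
```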

Checking Faulty Cables

I recently had to work with a 3rd party to diagnose a link between our devices and came across this handy command. The link in question was a pretty hefty (75m-ish) UTP cable run between a Cisco and an HP switch. I have visibility of the Cisco switch, into the structured cabling, into the patch panel, and the 3rd party’s cable. Unfortunately I didn’t have a DC Operations tech with access to a Fluke, or the ability to interpret the output of one, but they did have a laptop with a 100Mbps NIC (this becomes important later on).

So I started by running the diagnostic on the production connection. It’s not working, so I don’t have to worry about taking stuff down. This gives me the following:

test cable-diagnostics tdr interface gi7/21
TDR test started on interface Gi7/21
A TDR test can take a few seconds to run on an interface
Use 'show cable-diagnostics tdr' to read the TDR results.

switchA#show cable-diagnostics tdr interface gi7/21

TDR test last run on: July 09 10:30:20
Interface Speed Pair Cable length Distance to fault Channel Pair status
--------- ----- ---- ------------------- ------------------- ------- ------------
Gi7/21 auto 1-2 77 +/- 6 m N/A Invalid Terminated
3-6 75 +/- 6 m N/A Invalid Terminated
4-5 75 +/- 6 m N/A Invalid Terminated
7-8 N/A 8 +/- 6 m Invalid Open

It doesn’t really tell me anything, apart from that there’s most likely an infrastructure fault. This could be a faulty cable, a faulty patch panel, or a faulty switchport.

I now start my troubleshooting. I need to ascertain the path I have visibility of. The tech tells me there is structured cabling to the line card on the switch, so I get him to connect his laptop to the section the B end connects to, and I run the diagnostic.

switchA#show cable-diagnostics tdr interface gi7/21

TDR test last run on: July 09 10:31:55
Interface Speed Pair Cable length Distance to fault Channel Pair status
--------- ----- ---- ------------------- ------------------- ------- ------------
Gi7/21 100 1-2 16 +/- 6 m N/A Pair A Terminated
3-6 16 +/- 6 m N/A Pair B Terminated
4-5 N/A 12 +/- 6 m Invalid Short
7-8 N/A 11 +/- 6 m Invalid Short

Interestingly, this works, but at 100Mbps. I also see that a couple of pairs in the cable are not terminated correctly. Pair D is for 1Gbps and pair C is for PoE. This was turned up by a quick Google search. I also see that the faults are reported at around 11-12m (+/- 6m) from the switchport, suggesting the problem is at my end.

I double check this against a working port:
Another_switch#show cable-diagnostics tdr interface gigabitEthernet 7/16

TDR test last run on: July 09 10:18:43
Interface Speed Pair Cable length Distance to fault Channel Pair status
--------- ----- ---- ------------------- ------------------- ------- ------------
Gi7/16 1000 1-2 0 +/- 6 m N/A Pair B Terminated
3-6 0 +/- 6 m N/A Pair A Terminated
4-5 309 +/- 6 m N/A Pair D Terminated
7-8 99 +/- 6 m N/A Pair C Terminated

I now reconnect the customer end and manually set the port speed to 100Mbps using:

interface Gi7/21
speed 100

Unsurprisingly, the port comes up. The cable diagnostic shows the same output as it did above. I suspect the problem is either a faulty punch down on pair D, an incorrectly crimped cable, or the customer manually setting their speed to 100Mbps for some reason.

Now usually, if this were structured cabling to a new switch, I would ask for the patch panel terminations to be punched down again. However, this time, I know the cable is the newest part in the puzzle and ask the tech to re-crimp it. It works, and is now up at 1Gbps.

I have now learnt something new about cabling pairs. There are a few things to be aware of.
– The “Distance to fault” field can throw up false positives.
– I do not know what the TDR output looks like for a correctly terminated cable connected to a 100Mbps NIC. Would it see the pair D termination even if it is not being used for communication?
– The connection drops momentarily when the diagnostic is run. Remember to be aware of this if troubleshooting a working connection.