A Troubleshooting Conference Call on the Weekend - That Hasn't Happened in a While

Ah, the Memories....

There I was this afternoon, after returning from a morning with the family at the NC State Fair, about to start my MBA homework when my IM popped up. "Mike, there are problems in the HQ's WAN after the IOS upgrade two days ago, they are thinking of backing out the upgrade." Studying? Troubleshooting? Studying? Troubleshooting? OK, the curiosity is too great. Plus, I really wanted this IOS upgrade to be completed. So I jumped on the call and we started talking - all six of us.

This one was a good one. The NetOps team had worked on it for a bit and determined that 12 hours after the IOS upgrade on the core routers, OSPF started acting flaky. Currently, OSPF with the firewall was down, preventing BGP from establishing between the core routers and the Internet routers. That kept the default route from propagating from the Internet routers through the firewall to the core routers. So HQ was routing to another hub site for Internet access. It was working (mostly) for users, but we needed to rectify the situation. The strange thing was that the OSPF problems started 12 hours after the IOS upgrade. If the IOS had bugs, why did it take 12 hours for OSPF to fail? You would expect it to fail shortly after the upgrade. Thus, downgrading the IOS on the core routers did not look like the fix.

Upon further inspection we found four devices off a WAN switch (Cat3750 stack) that were having problems. One router was completely isolated. The problematic firewall was the second. One of the core routers was the third device. And the fourth was a small voice gatekeeper router. All had problems communicating over the VLANs on this WAN switch, but the WAN switch looked fine. Plus, there was nothing in the log that showed an issue 12 hours after the IOS upgrade completed.

First, we tackled the isolated router. We could see the router in CDP on the WAN switch, but there were no CDP neighbors on the router. It was like a unidirectional link, but with copper. After looking for a while to no avail, we power cycled the router.
It came right back up, all problems solved, OSPF and BGP working. OK....???

Next, the firewalls. This one proved trickier. In this case OSPF was showing FULL on the core routers, but was stuck in LOADING on the firewall. While the firewall was stuck in LOADING, no routes from the firewall would show up in OSPF on the core. This was breaking BGP to the Internet routers. After a while (like 30 minutes), OSPF would reset and go right back to being stuck in LOADING on the firewall. We bounced the interfaces to the firewalls and even rebooted both firewalls, and were left with the same problem.

Given the firewalls were stuck in LOADING, we started discussing OSPF MTU issues. Yes, normally OSPF devices get stuck in EXCHANGE/EXSTART with MTU problems, but we noticed the Cisco devices on these VLANs were configured with MTU 9198 while the firewall was at 1500. The strange thing is this was never a problem before. Yes, there was an MTU mismatch, but the firewalls were configured to ignore OSPF MTU and everything had worked fine for - well - years. Only 12 hours after an IOS upgrade did MTU become an issue? Well, apparently it had.

As we looked further, it appeared that during the IOS upgrades on the core routers the small voice gatekeeper router had become the OSPF DR on one WAN VLAN and the OSPF BDR on the other. This little router, running 12.4T code (ugh, "friends don't let friends run T-code in their network"), (1) should not have been a DR/BDR in the first place and (2) was the root of the MTU issue. As soon as we configured "ip ospf mtu-ignore" on the GigabitEthernet interfaces on this router, the firewall went OSPF FULL and BGP to the Internet routers came up. Configuring the "ip ospf mtu-ignore" command forced an OSPF election on both VLANs and allowed the proper DR/BDR - the core routers - to be elected, since they have higher OSPF priority.
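For anyone who has not run into it, the check that "ip ospf mtu-ignore" switches off can be sketched in a few lines of Python. This is only a rough model of the receive-side rule in the OSPF specification (a router rejects a Database Description packet whose Interface MTU field is larger than what it can receive without fragmentation). The function name accepts_dbd and its parameters are made up for illustration, real implementations may behave differently, and the only values taken from this call are the two MTUs.

```python
def accepts_dbd(local_mtu: int, advertised_mtu: int, mtu_ignore: bool = False) -> bool:
    """Simplified receive-side check on the Interface MTU field of a DBD packet."""
    if mtu_ignore:
        # Check disabled, e.g. by "ip ospf mtu-ignore" on a Cisco interface.
        return True
    # Reject when the neighbor advertises a datagram size larger than this
    # interface can receive without fragmentation.
    return advertised_mtu <= local_mtu


CISCO_MTU = 9198     # MTU on the Cisco devices, from the post
FIREWALL_MTU = 1500  # MTU on the firewall, from the post

# A 1500-byte interface looking at a DBD that advertises 9198:
print(accepts_dbd(FIREWALL_MTU, CISCO_MTU))                   # check enabled
print(accepts_dbd(FIREWALL_MTU, CISCO_MTU, mtu_ignore=True))  # check disabled
# A 9198-byte interface looking at a DBD that advertises 1500:
print(accepts_dbd(CISCO_MTU, FIREWALL_MTU))
```

In this model only the first call is rejected, which is why a mismatch like ours can sit quietly for years as long as the side with the smaller MTU is told to ignore it.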
The core routers were correctly configured with higher OSPF priority to make them the DR/BDR, but this little router had not been configured with "ip ospf priority 0", which would have ensured it could never become a DR/BDR. Whatever happened 12 hours after the IOS upgrade caused both cores to isolate from the VLAN, an OSPF election occurred without the higher-priority core routers, and this little voice gatekeeper won the election. With its OSPF MTU wrong, this router then broke the OSPF relationship with the firewall.

Which brings us to the root of the problem here - or the trigger. WHAT HAPPENED 12 HOURS AFTER THE IOS UPGRADE TO INITIATE THIS OSPF CHANGE??? Got me. My guess is the WAN switch, which all four devices connect to, freaked out and messed up the connected devices. It would be very helpful to have a log message that showed the smoking gun. Maybe it will be found tomorrow. At least we know the problem and have a fix. As I write this, BGP from the core routers to the Internet routers has been up 5 1/2 hours. Good!

What a thrilling "network geek" 3 hours this afternoon. I haven't done that in years! (and now I get to do my MBA homework)
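A postscript for anyone wondering how a little gatekeeper can win a DR election: here is a rough Python sketch of the selection rule. It is deliberately simplified. It models a fresh election among whichever routers are present (highest priority wins, highest router ID breaks ties, priority 0 is never eligible) and ignores the fact that a real OSPF election does not preempt an existing DR. The router names, router IDs, and the core priorities are hypothetical. The gatekeeper's priority of 1 is the IOS default, which is an assumption about how it was configured.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Router:
    name: str
    priority: int   # OSPF interface priority; 0 means "never DR/BDR"
    router_id: str  # dotted-quad router ID, used as the tie-breaker


def elect(routers):
    """Return (dr, bdr) names for a fresh election among the routers present."""
    eligible = [r for r in routers if r.priority > 0]
    ranked = sorted(
        eligible,
        key=lambda r: (r.priority, IPv4Address(r.router_id)),
        reverse=True,
    )
    dr = ranked[0].name if ranked else None
    bdr = ranked[1].name if len(ranked) > 1 else None
    return dr, bdr


# Hypothetical values, for illustration only.
core1 = Router("core1", priority=200, router_id="10.0.0.1")
core2 = Router("core2", priority=190, router_id="10.0.0.2")
gatekeeper = Router("gatekeeper", priority=1, router_id="10.0.0.9")
gatekeeper_fixed = Router("gatekeeper", priority=0, router_id="10.0.0.9")

print(elect([core1, core2, gatekeeper]))  # everyone present
print(elect([gatekeeper]))                # both cores isolated from the VLAN
print(elect([gatekeeper_fixed]))          # same outage, with priority 0
```

With everyone present the cores take DR and BDR. With the cores gone the gatekeeper is the only candidate left and takes DR, and with priority 0 nobody is elected at all, which is the behavior we wanted from that box in the first place.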


Copyright © 2010 IDG Communications, Inc.
