Working around a memory leak in Cisco Cat 9000 switches

With switches crashing and causing network outages, we found two workarounds that kept the network going until Cisco found a permanent fix.


Cisco Catalyst 9000 Series switches have become the switch of choice for many enterprises, including the environment that I work in, where Cisco Catalyst 9300 24- and 48-port switches running Gibraltar-16.12.3 code had become the standard for the access layer when more than 12 ports were needed.

That was about two years ago, and a year or so after that we began receiving notifications from an onsite location that there were intermittent network outages and performance degradation at the site. This is an account of how we found workarounds to the problem until Cisco provided a permanent fix.

We started troubleshooting the issue and found the following syslog messages that we had never seen before:

  • %PLATFORM-3-ELEMENT_TMPFS_CRITICAL: Chassis 1 R0/0: smand: 1/RP/0: TMPFS value 55% above critical level 50%.
  • %PLATFORM-4-ELEMENT_WARNING: Switch 2 R0/0: smand: 1/RP/0: Used Memory value 95% exceeds warning level 90%.

Once we discovered these, we opened a case with the Cisco Technical Assistance Center (TAC) and described some of the symptoms: high CPU processes, slow performance, and network outages caused by the switch crashing and rebooting itself. Cisco TAC told us that there was a bug in the Gibraltar-16.12.3 code—a memory leak in the TMPFS, which keeps all its files in virtual memory. They also told us there was no resolution for the problem.

We used the Cisco Bug Search Tool to find out more and discovered this: “Catalyst 9300 might reload due to the memory leak in TMPFS (RAMDISK). In the output of ‘show platform software mount switch active R0’ the memory used by /tmp keeps increasing. The leak is affecting all stack members.”

We also found that this problem started with the Gibraltar 16.11.1 code release and, at the time we became aware of it, there was no confirmed fix, but we did come up with two workarounds that helped make the problem less severe.

Nightly reloads

We noticed that some of the switches would crash and restart themselves, which restored them to normal CPU utilization, so one of the administrators had the idea to use the Embedded Event Manager (EEM, part of Cisco IOS) to stage a scheduled reboot for one of the switches. The admin wrote a script using the “event timer cron cron-entry” command followed by the time to reload the switch, then issued the “action reload” command.
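A minimal EEM applet along those lines might look like the following. This is a sketch, not the admin's exact script: the applet name and the 2:00 a.m. schedule are illustrative assumptions.

```
! Hypothetical EEM applet: reload the switch nightly at 2:00 a.m.
event manager applet NIGHTLY-RELOAD
 ! cron-entry fields: minute hour day-of-month month day-of-week
 event timer cron cron-entry "0 2 * * *"
 ! When the timer fires, reboot the switch
 action 1.0 reload
```

Before scheduling any unattended reload, it is worth confirming that the running configuration is saved and that the reload window falls outside staffed hours.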

After some testing we found that this allowed us to reload switches at night, which emptied the memory and worked to keep the CPU usage from being overwhelmed the next day, preventing the switches from crashing. Depending on your environment, policies, and procedures, you might not be able to restart the switches nightly because it would cause them to be unavailable during times when the network is needed. In our case, certain buildings were staffed 24x7 so the switches there could not be reloaded using this script without causing disruptions.

Splunk alert

For sites that couldn’t allow nightly reloads, we found a way to be warned when crashes seemed likely so we could intervene.

One of the technicians created an alert in Splunk that watched for the TMPFS threshold syslog messages and emailed the team when CPU utilization spiked to 75%. Then we could reboot the switch only if necessary, sometimes remotely but sometimes by physically going to the machine. (This took a lot of time and caused frustration for the network team, as well as for end users who had to wait for the switches and their devices to come back online.)
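A Splunk search for such an alert could look something like the sketch below; the index and sourcetype names are assumptions that vary by environment, not details from our deployment.

```
index=network sourcetype=cisco:syslog
    ("%PLATFORM-3-ELEMENT_TMPFS_CRITICAL" OR "%PLATFORM-4-ELEMENT_WARNING")
| stats count latest(_time) AS last_seen BY host
```

Saved as an alert that triggers when the result count is greater than zero, with an email action attached, a search like this notifies the team per affected switch.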

The Splunk monitoring also allowed us to document which switches reached the 75% threshold and how often that occurred.

Resolution

When the Gibraltar 16.12.4 code release came out, we installed it, but it didn’t solve the problem. We continued to communicate with Cisco TAC about the issue and kept the workarounds running: nightly reloads and Splunk notifications about the TMPFS threshold. After months of staying in touch with the TAC, we were informed that Gibraltar 16.12.5 code was going to be released, and we hoped it would resolve the TMPFS issue. We upgraded our lab switches to the new code, monitored them closely, and found it a great success: upgrading to 16.12.5 solved the memory leak.

So far we have migrated over 60% of our switches to 16.12.5, and there have been fewer issues and better performance.



Copyright © 2021 IDG Communications, Inc.
