Cisco: Yes, cosmic radiation could have caused router bug

Stephen Sauer

Yesterday we reported on the reaction to a Cisco bug report that speculated “partial data traffic loss” on the company’s ASR 9000 Series routers was possibly triggered by “cosmic radiation causing SEU soft errors.”

Reaction to that contention on a Reddit forum ranged from the obvious -- acknowledgment that cosmic radiation is an issue -- to sharp-tongued skepticism and tales of the cosmic radiation villain being used as a tongue-in-cheek placeholder meaning “we really don’t know what caused the problem yet.”

Very early this morning I received an explanation from Cisco that seems only partly satisfying. The company says:  

While we can’t speak to this particular case, Cisco has conducted extensive research, dating back to 2001, on the effects cosmic radiation can have on our service provider networking hardware, system architectures and software designs. Despite being rare, as electronics operate at faster speeds and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems that affect the performance of a router or switch.

Cisco published a blog post on this topic in January 2012. In an effort to minimize the impact of radiation from “Single Event Upsets” (SEUs), we sought to redesign our technology with custom silicon chips and software, and adopt protocols that utilize resiliency features.

It’s the refusal to speak “to this particular case” that is likely to keep raised eyebrows raised, because the Cisco bug report was quite clear in naming cosmic radiation as a primary suspect.

Here’s an excerpt from that 2012 Cisco blog post:

It’s a well known problem for aerospace engineers designing electronics for airplanes and satellites, but these “Single Event Upsets” are an issue even in terrestrial-based systems that must meet high reliability operating requirements (although such problems on the ground would typically be the result of reasons other than cosmic radiation). The key challenge is that as electronics operate at faster speeds (beyond 10G) and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems which affect the performance of a router or switch. And despite being rare, for service providers that are building mission-critical networks, “very rare” is still too often. Our challenge was therefore to figure out how to prevent these unusual events, despite the lack of data or industry standards.

Cisco kicked off a program back in 2001 to research the effects of these rare but real events and determine how to prevent them, especially for our larger, mission-critical systems such as the CRS-3. We’ve even gone as far as to place equipment in a particle accelerator to simulate the effects of cosmic radiation over the long term. One key discovery was that simply making small, incremental changes was insufficient. It was necessary to architect systems from the ground up in order to hit our reliability objectives – and to consider system, component, and software elements working together.

So they’ve been actively addressing the issue since 2001, and if their bug reports are to be taken at face value, it would appear that 15 years later the problem remains less than fully resolved.



Copyright © 2016 IDG Communications, Inc.