
AT&T's big fi



April 13, 1998, is a day that neither AT&T nor many of its big customers will soon forget.

That, of course, was the date AT&T's frame relay network went down, leaving the company's business customers - including many of the country's largest corporations and financial institutions - without service for more than 24 hours.

Over the past year, AT&T has been dealing with the fallout from the outage and working to ensure that the events of April 13 never recur. Today, much of that work is complete. The carrier has improved its testing and upgrade procedures and built a separate network to carry the kind of management traffic that caused the outage in the first place.

In order to understand how the company has addressed its network problems, it's necessary to go back 12 months and examine exactly what went wrong - and why.

Joe Lueckenhoff, AT&T's vice president of Data Network Services, remembers April 13 vividly. A few minutes before 3 p.m., while he was at the company's Bridgewater, N.J., offices to meet with his boss, Tim Murray, his beeper went off.

Lueckenhoff, a 20-year AT&T veteran, was being paged by AT&T's Frame Relay Network Operations Center in nearby Parsippany, N.J. "We have a network operations center that actively monitors all the switches in the network," he explains. "They saw alarms going off on the trunk side and declared a network emergency."

At the time Lueckenhoff was being paged, dozens of other company technicians and managers were also being beeped by the operations center. AT&T has long had what it calls a "Red Book," which enumerates the methods and procedures for dealing with a network emergency. The company drills and trains for these situations in practice scenarios, but as Lueckenhoff and others soon learned, this was no drill.

One procedure calls for the network operations center to set up bridges, or conference calls. In the worst-case scenario, three bridges are established: one for a dozen or so key AT&T managers, another for technical personnel and a third for customer-care and sales people who are charged with alerting business customers of a major outage. This was, without question, a worst-case scenario.

An all-nighter

Throughout the rest of the afternoon and on through the night, AT&T managers and technicians manned the various bridges while emergency teams frantically sought to resolve the problem and determine its cause. The only thing known for sure was that at 2:30 p.m. a technician had undertaken a procedure to upgrade trunk cards on a Cisco Stratacom BPX frame relay switch and something had gone drastically wrong; the switch began producing a stream of error messages. The technician had mistakenly assumed the switch was on standby. In fact, even though the switch didn't have customer traffic on it, it was connected to the network and, therefore, was active. AT&T knew about the procedure right away because the technician had logged what he was doing.

By 11 p.m. an emergency team had separated the individual network elements - the switches and routers - so the company could contain the propagation of error messages should it recur. Soon AT&T had determined which switch and trunk condition had caused the problem. The company also began rebuilding the various subnetworks that together make up the network, ensuring that each was healthy. As the individual subnetworks proved stable, AT&T connected them to one another. "When you're building subnetworks, customers on Subnetwork A can't talk to customers on Subnetwork B, and frankly, connecting the various subnetworks is what took so long to get the network re-created," Lueckenhoff explains.

By 8 a.m. the next morning, all the subnetworks were connected, and AT&T started bringing customer traffic back online. An hour later about 90% of customer traffic was up, but a number of permanent virtual circuits (PVC) used by some of AT&T's biggest customers were proving troublesome to restore. "There were some unique PVCs that took the rest of the morning to get up and working," Lueckenhoff says. "By 2:30 on the afternoon of April 14, 99.9% of the PVCs were up."

Throughout the day, Lueckenhoff and Frank Ianna, AT&T executive vice president and president of AT&T's network unit, repeatedly briefed AT&T CEO C. Michael Armstrong, as well as Cisco CEO John Chambers, on AT&T's progress in restoring service. "I was personally talking to Mr. Chambers on the 14th as well as the following day," Lueckenhoff recalls. "He was very interested in making sure our customers were happy and up and working. Cisco was very cooperative in this."

Armstrong, who had come over from Hughes Electronics just a few months earlier, was facing his first real crisis in the CEO post. "He had experience in this because we were talking about a computer problem basically, and he came out of the computer industry," Lueckenhoff says. "He clearly understood what was going on."

Unlike his predecessor, Robert Allen, a remote, seemingly imperious executive who came out of the analog side of the company, Armstrong was actively showing his support for the data services operations by manning the battle lines with Ianna and his troops during the crisis.

At a press briefing on the morning of April 14, Armstrong said he believed a software problem started in two Cisco Stratacom BPX frame relay switches, propagating itself in about 145 nodes throughout the network. He conceded, however, that AT&T still hadn't pinpointed the root of the problem.

Armstrong also announced that AT&T would not charge any of its frame relay customers for service until the problem was fully repaired, a pronouncement that would prove more costly than the CEO may have realized at the time.

Uncovering the cause

For the next eight days, AT&T set about trying to determine what triggered the outage and to mend fences with its angry customers. Within 48 hours of the network going down, Armstrong had written letters to the CEOs of the customer companies apologizing profusely and explaining what was happening.

Meanwhile, a crisis team gradually put together the pieces of the puzzle behind the network crash.

With AT&T's automated upgrade procedure, there are two ways to upgrade pairs of redundant trunk cards, one of which is active while the other is on standby.

With one approach, an engineer orders the switch's network operating system to upgrade the standby card exclusively. Once that process is complete - and it's clear the newly upgraded card is stable - the active card is put on standby, and the just-upgraded card goes active. Then, and only then, is the second card upgraded.

With this sequential approach, if there is a flaw in one of the cards, it is isolated during the upgrade. "Had we done that, the problem never would have happened," Lueckenhoff says.

Instead, the technician used a second approach. Assuming that the trunk cards were isolated from the network because the switch to which they were connected wasn't carrying traffic, the technician upgraded both cards in one procedure. As it happened, there were flaws in the firmware of both cards, but the automated procedure upgraded the standby card and the active card so quickly that safeguards failed to kick in. "Normally, the system would have checked to see if there were alarms coming from the backup card, but it didn't," Lueckenhoff says. As a result, even though both cards were flawed, the upgrade procedure wasn't aborted. When system safeguards belatedly uncovered the flaws, the trunk cards immediately sent a stream of error messages back to the switch.
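The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration of the sequencing described above, not AT&T's or Cisco's actual upgrade software: the TrunkCard class, the healthy flag and the flawed_versions set are all hypothetical stand-ins for whatever checks the real switch performs.

```python
from dataclasses import dataclass


@dataclass
class TrunkCard:
    name: str
    firmware: str
    active: bool
    healthy: bool = True  # hypothetical result of a post-upgrade stability check


def upgrade(card, new_firmware, flawed_versions):
    """Load firmware on one card and record whether it came up clean."""
    card.firmware = new_firmware
    card.healthy = new_firmware not in flawed_versions


def sequential_upgrade(active, standby, new_firmware, flawed_versions=frozenset()):
    """Upgrade the standby card first; touch the active card only if that worked."""
    upgrade(standby, new_firmware, flawed_versions)
    if not standby.healthy:
        # Abort: the flaw stays isolated on the standby card,
        # and the active card keeps running its old firmware.
        return False
    # The upgraded card is stable, so swap roles before upgrading the other one.
    active.active, standby.active = False, True
    upgrade(active, new_firmware, flawed_versions)
    return active.healthy


if __name__ == "__main__":
    a = TrunkCard("card-A", "1.0", active=True)
    b = TrunkCard("card-B", "1.0", active=False)
    ok = sequential_upgrade(a, b, "2.0", flawed_versions={"2.0"})
    print(ok, a.firmware, a.active)
```

In this sketch a flawed firmware image stops the procedure after the standby card, which is the isolation Lueckenhoff describes. Upgrading both cards in one step removes that checkpoint.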

The exposure of the error in the trunk cards, coupled with another previously undisclosed flaw in the switching fabric of at least one of the Cisco switches, caused error messages to propagate to the other connected nodes in the network.

"What happened is that the stream of messages triggered by the errors filled up the buffers on all the switches connected to the network," Lueckenhoff explains. The overload effectively caused the switches to shut down, putting the network out of commission - all within a matter of a few seconds.

The big fix

The kinds of problems that triggered the outage are certainly not unique to AT&T. "Any other carrier could experience a situation of this kind," notes David Goodtree, group director of Forrester Research in Cambridge, Mass.

Moreover, once the errors were uncovered, they were readily addressed. AT&T permanently shelved the upgrade procedure that triggered the breakdown. "The last time we ever did that procedure was on April 13," a company spokesperson says.

AT&T and Cisco also fixed and changed the flawed trunk cards by the end of April. Finally, Ianna's team and Cisco's lab people repaired the switching fabric problem, though that process wasn't complete until mid-May, a month after the network went down.

The repairs didn't end there, however. The outage and the ensuing restoration efforts exposed some fundamental flaws in AT&T's network. "Part of the problem was that we had recovery techniques for one node, or two, or even a section of the country, but we didn't have techniques that were robust enough to deal with the entire network going down," Lueckenhoff concedes.

Nor was there a fully operational out-of-band network in place to carry network management traffic; the message stream that actually caused the switches to go down was made up of management messages. And finally, AT&T didn't have sufficient planning, testing and simulation capabilities to deal with the entire network. "We had a test bed in place that would mimic about 20 nodes, and our network is much larger than that," Lueckenhoff says. He adds that while AT&T had contingency plans in place for a partial outage, inconceivably, there wasn't a comprehensive plan for a total network failure.

Once AT&T took stock of its network's flaws, the company launched an ambitious effort to enhance reliability - what it calls a four-dimensional technical blueprint - that is now largely complete.

Specifically, the carrier incorporated explicit overload controls for message streaming and has restricted the ways in which switches can reset themselves, as happened in the outage. AT&T also installed a recovery technique that Lueckenhoff claims is sufficient to handle the entire network and restore it in less than an hour should another outage occur. "Now we have a preplanned recovery procedure for an entire network outage," he says. "To tell you the truth, the procedures for a total network failure took too long to implement."
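AT&T has not published how its overload controls work, but the general idea of capping a message stream can be shown with a bounded buffer. The sketch below is an assumption-laden illustration: the BoundedMessageBuffer class, its capacity and its drop-when-full policy are hypothetical, chosen only to show how a switch could shed excess management messages instead of letting its buffers fill without limit.

```python
from collections import deque


class BoundedMessageBuffer:
    """Holds at most `capacity` messages; anything beyond that is shed and counted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def offer(self, message):
        """Accept a message if there is room; otherwise drop it and return False."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(message)
        return True

    def poll(self):
        """Remove and return the oldest message, or None if the buffer is empty."""
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    buf = BoundedMessageBuffer(capacity=3)
    for i in range(5):
        buf.offer(f"error-{i}")
    print(len(buf.queue), buf.dropped)
```

The dropped counter matters as much as the cap: it gives operators a signal that a message storm is under way while the switch itself keeps running.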

AT&T's new blueprint calls for an expanded testing environment and improved network management capabilities. "We now have a test bed in place that will actually mimic the size of the network we have in place," Lueckenhoff says. He declined to provide greater detail on the operational aspects of the new simulation and emulation techniques.

With implementation of a fully operational out-of-band network, management messages are now isolated from customer traffic, and network monitoring features have been upgraded. "We had trunk and switch monitors that notified us whether the trunk was up or down or had an alarm, but now we can actually monitor what the volume of the traffic in the message network is and whether it is approaching any thresholds that might indicate problems," Lueckenhoff says.
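The kind of volume monitoring Lueckenhoff describes can be approximated with a sliding-window counter. The following Python sketch is hypothetical throughout: the MessageVolumeMonitor class, the window length, the threshold and the 80% warning fraction are illustrative assumptions, not details of AT&T's monitoring system.

```python
from collections import deque


class MessageVolumeMonitor:
    """Counts management messages in a sliding time window (times in seconds)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold  # message count regarded as a problem
        self.timestamps = deque()

    def _expire(self, now):
        # Discard messages that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()

    def record(self, now):
        """Note that one management message arrived at time `now`."""
        self.timestamps.append(now)
        self._expire(now)

    def approaching_threshold(self, now, fraction=0.8):
        """True once the windowed volume reaches `fraction` of the threshold."""
        self._expire(now)
        return len(self.timestamps) >= fraction * self.threshold


if __name__ == "__main__":
    monitor = MessageVolumeMonitor(window_seconds=10, threshold=10)
    for t in range(8):
        monitor.record(now=t)
    print(monitor.approaching_threshold(now=7))
```

Unlike an up/down alarm on a trunk, a check of this sort can warn operators before the limit is actually reached.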

Finally, in the process improvements area, AT&T engineers and supervisors have been instructed to treat all network components as live. "Even if a switch doesn't have customer traffic on it, we assume that it does when we perform a routine," Lueckenhoff says. The company also requires technicians to answer a series of questions before they undertake any activity on the network. Explains Lueckenhoff: "These are questions such as, 'Do you know what to do in case of a problem? Do you know what the recovery techniques are in case a situation occurs?' "

Repercussions

Such improvements are positive fallout, but they don't change the fact that AT&T was badly damaged by the outage in several regards. "For one thing the money it gave back to clients was substantial," notes Christine Heckart, an analyst at the TeleChoice consultancy in Boston. Armstrong's offer not to charge customers from the time of the outage until the problem was fully resolved - April 13 through the first week in May - ultimately meant the company lost almost one-twelfth of the estimated $1 billion its data services business generates annually, a significant setback for an organization that was under considerable financial pressure at the time.

In addition, AT&T lost customers after the outage, including a business-consulting firm in St. Paul, Minn., J. Hill Group. AT&T claims, however, that fewer than 10 customers defected, many for reasons that had nothing to do with the outage. "A lot of customers had outstanding bids and contracts they were regarding before this ever happened," Lueckenhoff says.

Another negative: Over the past year, the outage has also become emblematic of the ultimate network disaster. "It continues to be kept alive by the press and to serve as fodder for reporters," Heckart notes.

On the other hand, AT&T generated quite a bit of goodwill in the way it responded to the outage. "I think Armstrong handled the situation extremely well and got a lot of points for his forthrightness about the problem," says Rosemary Cochran, a principal with Vertical Systems Group, a consultancy in Dedham, Mass.

"We rely on carriers like AT&T to bulletproof the network so we don't experience these problems," says one customer, Virgil Palmer, director of telecommunications and networks at Air Products and Chemicals in Allentown, Pa. "But anything manufactured by humans is going to break, so it's just a matter of when, not if, it's going to happen. AT&T was trying its best to recover from a bad situation."

TeleChoice's Heckart notes that under the terms of service-level agreements AT&T had signed off on a few months before the outage, the company was only legally responsible for providing free service during the crisis to a relatively small number of customers. "Armstrong went above and beyond what AT&T was required to do in this regard," Heckart says.

For customers such as Palmer, however, the free service was simply a token gesture. "They gave us the credit, but I would rather have had the service and not had to hassle with all the problems," he says.

Long-term, the outage served as a wake-up call to AT&T, which over the past year has poured substantial funds into overhauling its network and beefing up its data services research and development and technical force, all to safeguard its frame relay service. "We've implemented our technical blueprint, and we're confident that this type of outage will never happen again with our network," Lueckenhoff says.

McCartney is an editor and writer in New York. He can be reached at LatonM@aol.com.
