Basking Ridge, N.J. - When it was all over, everyone agreed: Even a major fiber cable cut would have been better than this.
AT&T last week suffered the catastrophic failure of its frame relay network. All of the company's 145 switches were brought down, leaving thousands of frame relay customers offline for more than 24 hours. The event laid bare numerous weaknesses in the carrier's network infrastructure and product management.
The network failed at about 3 p.m. Monday and did not completely recover until Tuesday evening. Although AT&T on Friday was still uncertain about what caused the collapse of its network, the carrier did say the failure stemmed from a problem between two Cisco Systems, Inc. StrataCom frame relay switches, one in Albany, N.Y., and the other in Cambridge, Mass. From there, the problem propagated across the network, according to AT&T.
With AT&T saying so little, observers were left speculating about the root cause of the crash. Analysts suggested the cause was either a traffic overload that flooded the network with reroute messages or a failure in the forwarding tables that left switches without instructions on where to send traffic.
But the frame relay failure was just the start of the problem for the company, which relies on frame relay for $1 billion in revenue each year. High-volume users frantically seeking backup protection reported that AT&T's shortage of T-1 access capacity forced the company to seek other carriers' facilities to terminate remote dial-up sessions.
AT&T's T-1 shortage first came to a head this past winter (NW, Feb. 16, page 1). The shortage became a major headache last week when AT&T's standard frame relay backup options were rendered useless by the fact that the switches, rather than the physical routes, got knocked out.
At press time, AT&T officials said they had still not discovered the root cause of the failure. AT&T Chairman and CEO C. Michael Armstrong said in a briefing with reporters that the company would not charge customers for frame relay service until the network was restored and the root cause was identified and fixed.
But some users thought AT&T had an obligation to go even further. In January, AT&T announced service-level agreements (SLA) that included 99.99% availability of the frame relay network. At the time, AT&T Data Services Vice President Steve Hindman promised a four-hour mean-time-to-repair guarantee, or customers would receive free ports and permanent virtual circuits (PVC) for a month.
An AT&T spokesman last week said the SLAs had slipped past the scheduled March general availability date. But he confirmed some larger customers had been given the guarantees anyway.
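For a rough sense of scale, the 99.99% availability figure can be turned into a downtime budget. The sketch below is illustrative only: it assumes the percentage is measured over a calendar year (the measurement window is not stated here), it uses 24 hours as a lower bound for the outage, and the function names are hypothetical.

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
# Assumption: availability is measured over a 365-day year; the actual SLA
# measurement window is not given in the article.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_minutes(availability_pct: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Minutes of downtime that a given availability percentage permits."""
    return period_hours * 60 * (1 - availability_pct / 100)


def achieved_availability_pct(outage_hours: float,
                              period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability percentage achieved if the only downtime is one outage."""
    return 100 * (1 - outage_hours / period_hours)


budget_minutes = allowed_downtime_minutes(99.99)      # about 52.6 minutes a year
outage_hours = 24                                     # lower bound for the outage
years_of_budget = outage_hours * 60 / budget_minutes  # more than 27 years' worth
achieved = achieved_availability_pct(outage_hours)    # roughly 99.73%

print(f"Annual downtime budget at 99.99%: {budget_minutes:.1f} minutes")
print(f"A {outage_hours}-hour outage uses {years_of_budget:.1f} years of budget")
print(f"Best-case availability for the year: {achieved:.3f}%")
```

Under those assumptions, a single 24-hour outage consumes more than 27 years of a 99.99% downtime allowance and is six times the promised four-hour mean time to repair.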
Six jets and 9.6K modems
Thrown back on their heels, nearly every user contacted by Network World reported that they did not have enough backup lines to keep their networks running. Many resorted to unusual measures. One giant pharmaceutical company had to call in a fleet of six jets.
Some of the pharmaceutical company's 80 sites had ISDN backup, but others had mere 9.6K modems, said a network manager at the company, who requested anonymity. Only half of the company's orders were able to get through via dial-up, so the company brought in the planes to fly paper orders to distribution centers. The company did not get its frame relay fully restored until mid-morning Wednesday.
One of AT&T's largest frame relay networks, the 7,000-node WorldSpan network connecting airlines to travel agencies, was crippled for 24 hours, according to a WorldSpan spokeswoman. Some travel agencies were able to dial up WorldSpan's Atlanta data center, but most could not.
In fact, retailers, distributors and companies connected to outside agencies in businesses such as travel or insurance seemed to have had a particularly hard time. Loren Wilkinson of Egghead.com reported that the online computer retailer lost communications between its headquarters in Spokane, Wash., and its distribution center in Sacramento, Calif., for 36 hours.
"During that period of time we were unable to pass orders to ship to customers. Thanks, AT&T!" Wilkinson wrote in a posting on Network World Fusion.
Pier 1 Imports, Inc., in Fort Worth, Texas, lost communication among its regional offices, zone offices and the home office for more than 24 hours. The company was getting ready to put out a bid on adding 700 stores across the country to an existing frame relay network, said Brad Williams, the company's senior telecommunications analyst.
"From now on, we're going to query vendors about their backup contingency plans in a similar scenario," Williams said.
One big bank did not wait to seek out other carriers. San Francisco-based Wells Fargo Bank saw more than 1,000 automated teller machines in California go down Monday afternoon because they could not connect over AT&T's frame relay network to the bank's data center.
Not all the ATMs could be accommodated on ISDN dial backup because Wells Fargo did not have enough ISDN Primary Rate Interface lines into its Roseville, Calif., data center. And Roseville Telephone Co., the local exchange carrier, could not get any spare T-1 capacity from AT&T to concentrate dial-in access from points outside the local calling area.
At that point, MCI Communications Corp. stepped in. "We tested and turned up 12 T-1s of capacity in seven hours," said Jeff Jordeson, an MCI senior technical services manager in Sacramento. Overnight, MCI and Wells Fargo converted the remaining sites that were on the frame relay network to MCI dial-up. Sherry Nash, Wells Fargo senior vice president for data networking, said she was so delighted with the arrangement that she kept the dial-up lines open until Wednesday night, 24 hours after the outage had ended.
Some AT&T customers do have financial protection beyond the credits promised by Armstrong. For example, drugstore retailer Walgreens Co., in Deerfield, Ill., was one of the lucky ones assigned an SLA after AT&T's January announcement. Ray Sheedy, Walgreens' director of corporate telecommunications, said 278 of the company's stores connected via AT&T were down for 24 hours.
Walgreens' mail-order locations in Tempe, Ariz., and Orlando - ordinarily on the AT&T network - had backup frame relay capacity with MCI. "But we can't afford to have dual networks in stores," Sheedy said. "And ISDN is too expensive in stores and not available everywhere."
Many analysts did offer AT&T some sympathy, blaming Cisco's StrataCom BPX ATM switches with a frame relay access concentrator shelf as the likely culprit in the outage. In a statement, Cisco said it was working closely with AT&T to resolve the problem, but would not comment further.
Most analysts discounted the possibility of lightning striking twice in the same place. But the longer the root cause is not discovered, the greater the concern that a similar problem could occur. Steve Sazegari, president of Tele.Mac, a Foster City, Calif., consulting firm, noted that data traffic typically spikes on Monday - the day this outage occurred - as order entry systems reflect weekend mail orders and transactions.
"These switches were never put under this kind of test in the public network before," Sazegari said. "I think a fiber cut would have been a lot less devastating for people. The interruption would have been one or two or three minutes because they would have rerouted around the problem, and the rest of the network would not have been affected."
RELATED LINKS
Contact Online Reporter Sandra Gittlen or Senior Editor David Rohde.
Press Briefing: AT&T Frame Relay Network Outage
Transcript of a press conference in which AT&T's CEO explains what happened.
AT&T restores frame relay data service for business customers
AT&T's statement on the outage.
Coping with failure
More user comments. Network World Fusion, 4/16/98
AT&T network goes down for the count
Network World Fusion, 4/14/98
AT&T faces T-1 line shortage
Network World, 2/16/98.
