Network World - It's been a difficult two weeks for Rackspace and its users, with two power outages in a co-location facility interrupting service for an estimated 2,000 customers.
Rackspace, which prides itself on “fanatical support,” has been open about its failures, communicating with customers directly and through the company's official blog and Twitter account. Open communication and a commitment to fixing technical problems will both be crucial for Rackspace as it attempts to repair damaged credibility, says CEO Lanham Napier.
“Any time we have an incident like this, it does impact our credibility,” Napier said in an interview Friday with Network World. “The only way we earn it back is we have to execute at a high level for a long time.”
Power outages on June 29 and July 7 hit Rackspace's 144,000-square-foot data center in the Dallas suburb of Grapevine. Rackspace operates nine data centers worldwide for about 60,000 customers. Within the Dallas facility, some customers experienced about 40 minutes of downtime on June 29, and some suffered 15 to 20 minutes of downtime on July 7.
The facility has three “phases,” or physical areas, and both outages hit the same phase, affecting a total of about 2,000 customers, according to Rackspace. Judging by comments on a recent Network World article, reactions range from anger at Rackspace for not eliminating every point of failure to acceptance that downtime can never be completely prevented and that Rackspace did well in quickly solving the problems and communicating with customers.
“I’m sure there will be some [customers] who are upset with us,” Napier said. “Let’s face it. We let them down. It wouldn't surprise me if some customers leave. I hope most of them stay with us.”
Rackspace has said it will issue between $2.5 million and $3.5 million in service credits to customers. Depending on the service a customer has paid for, service-level agreements can range from 99.9% uptime to 100%, Napier said.
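To put those SLA figures in perspective, the short Python sketch below converts downtime into an uptime percentage. It assumes uptime is measured over a calendar month, which is a common convention but is not confirmed here as Rackspace's method; the function names are illustrative only, and the downtime values are the approximate figures Rackspace reported.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes: float, days_in_month: int) -> float:
    """Uptime over one calendar month (assumed measurement window)."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(sla_percent: float, days_in_month: int) -> float:
    """Downtime a given SLA percentage tolerates in one calendar month."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100.0 - sla_percent) / 100.0


# Approximate downtime reported for each incident.
june_uptime = uptime_percent(40, 30)   # June 29 outage, 30-day month
july_uptime = uptime_percent(20, 31)   # July 7 outage, 31-day month

print(f"June: {june_uptime:.3f}% uptime")
print(f"July: {july_uptime:.3f}% uptime")
print(f"99.9% SLA allows {allowed_downtime_minutes(99.9, 30):.1f} minutes in a 30-day month")
```

Under that monthly assumption, a 99.9% SLA tolerates roughly 43 minutes of downtime in a 30-day month, so a single 40-minute outage would sit just inside the threshold, while any downtime at all breaches a 100% commitment.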
On June 29, Rackspace suffered a utility power interruption, and was forced to move equipment over to generator power. The generators initially held the load and then failed, resulting in 40 minutes of downtime, Napier said.
An incident review cited failure of generators to synchronize with UPS systems, and failure of switches in the electrical infrastructure, preventing transfer of electrical load between different power sources. By July 3, the Rackspace blog reported that maintenance to the generator had “eliminated the excitation failures that caused recent customer disruptions.”
Trouble struck again on July 7 with the failure of a bus duct, a 10-foot, 300-pound piece of copper that distributes electricity. This prevented proper operation of a UPS system, taking customer servers down for about 20 minutes before Rackspace could connect them to generator power. The generators worked this time and carried the load for hours while workers replaced the bus duct, Napier said. Rackspace is still investigating the root cause of the bus duct failure, he said.