It's been a difficult two weeks for Rackspace and its users, with two power outages in a co-location facility interrupting service for an estimated 2,000 customers.
Rackspace, which prides itself on “fanatical support,” has been open about its failures, communicating with customers directly and through the company's official blog and Twitter account. Open communication and a commitment to fixing technical problems will both be crucial for Rackspace as it attempts to repair damaged credibility, says CEO Lanham Napier.
“Any time we have an incident like this, it does impact our credibility,” Napier said in an interview Friday with Network World. “The only way we earn it back is we have to execute at a high level for a long time.”
Power outages on June 29 and July 7 hit Rackspace's 144,000-square-foot data center in the Dallas suburb of Grapevine. Rackspace operates nine data centers worldwide for about 60,000 customers. Within the Dallas facility, some customers experienced about 40 minutes of downtime on June 29, and some suffered 15 to 20 minutes of downtime on July 7.
The facility has three “phases,” or physical areas, and both outages hit the same phase, affecting a total of about 2,000 customers, according to Rackspace. Judging by comments on a recent Network World article, reactions range from anger at Rackspace for not eliminating every point of failure to acceptance that downtime can never be completely prevented and that Rackspace did well in quickly solving the problems and communicating with customers.
“I’m sure there will be some [customers] who are upset with us,” Napier said. “Let’s face it. We let them down. It wouldn't surprise me if some customers leave. I hope most of them stay with us.”
Rackspace has said it will issue between $2.5 million and $3.5 million in service credits to customers. Depending on the service a customer has paid for, service-level agreements can range from 99.9% uptime to 100%, Napier said.
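To put those SLA figures in perspective, the short sketch below works out how much downtime each uptime percentage leaves room for. It is illustrative arithmetic only: the 30-day measurement period and the function names are assumptions, and the article does not describe how Rackspace actually measures uptime or calculates credits.

```python
# Illustrative arithmetic only. The 30-day measurement period is an assumption;
# the article does not state how Rackspace measures uptime or computes credits.

def allowed_downtime_minutes(uptime_percent, period_days=30):
    """Minutes of downtime an uptime percentage leaves room for in the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


def exceeds_allowance(downtime_minutes, uptime_percent, period_days=30):
    """True if the observed downtime is larger than the allowance."""
    return downtime_minutes > allowed_downtime_minutes(uptime_percent, period_days)


if __name__ == "__main__":
    for sla in (99.9, 100.0):
        budget = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime over 30 days leaves {budget:.1f} minutes of downtime")
        # 40 minutes is the downtime reported for the June 29 outage.
        print(f"  40-minute outage exceeds allowance: {exceeds_allowance(40, sla)}")
```

Under these assumptions, a 99.9% commitment leaves roughly 43 minutes a month, so a single 40-minute outage nearly consumes it, while a 100% commitment leaves no margin at all.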
On June 29, Rackspace suffered a utility power interruption, and was forced to move equipment over to generator power. The generators initially held the load and then failed, resulting in 40 minutes of downtime, Napier said.
An incident review cited failure of generators to synchronize with UPS systems, and failure of switches in the electrical infrastructure, preventing transfer of electrical load between different power sources. By July 3, the Rackspace blog reported that maintenance to the generator had “eliminated the excitation failures that caused recent customer disruptions.”
Trouble struck again on July 7 with the failure of a bus duct, a 10-foot, 300-pound piece of copper that distributes electricity. This prevented proper operation of a UPS system, taking customer servers down for about 20 minutes before Rackspace could connect them to generator power. The generators worked this time and carried the load for hours while workers replaced the bus duct, Napier said. Rackspace is still investigating the root cause of the bus duct failure, he said.
Whether an individual customer suffered downtime was in some cases determined by the level of service that customer had paid for. For example, some customers pay for a higher level of service that lets them draw power from different phases of the facility, and those customers were able to avoid downtime, Napier said.
Rackspace offers both traditional hosting of dedicated servers, and a cloud service that offers access to virtualized server instances. In general, cloud customers were not affected by the power outages, Napier said. But the Rackspace cloud system suffered problems of its own this week, with “intermittent slow load times” and error messages. Napier said the Rackspace cloud suffered from a network problem, rather than a power failure.
All in all, the incidents this week were not as bad as another affecting the same Dallas facility in November 2007, Napier said. In that case, a truck missed a turn and took out a transformer, ultimately taking the entire facility offline, he said.
“Both [incidents] were very painful and very disappointing,” he said.
Although the most recent troubles affected just a portion of one of Rackspace's nine data centers, Napier noted that “for each customer that's impacted, it's everything to them. Our job is not to have this happen.”
Rackspace made use of Twitter in the wake of its most recent outages, posting updates and letting customers report issues over the social networking site. But that's just one outlet for communicating with users, Napier says.
“I think Twitter's a good tool,” Napier said. “Twitter has a high usage rate among a certain class of our customers. Other customers prefer a phone call. What we endeavor to do with Twitter and other social media tools is we want to communicate with customers the way they want to.”
Napier said he can’t offer any guarantees that such a service outage won’t happen again.
“It will happen again in our facility, in someone else's facility, it will happen again in our industry,” he said. “We let our customers down. We will do everything we can to prevent this in the future. For the customers who stand by us … we very much appreciate that.”