- 15 Non-Certified IT Skills Growing in Demand
- How 19 Tech Titans Target Healthcare
- Twitter Suffering From Growing Pains (and Facebook Comparisons)
- Agile Comes to Data Integration
Network World - Amazon Web Services makes a big deal out of its availability zones, recommending that customers use multiple AZs to build clouds that are tolerant of failures like the one the company recently experienced.
But experts say using a second availability zone when you're starting up an AWS instance isn't a panacea to prevent your app from going down during an outage, nor is it as easy to do as simply flipping a switch. And it will cost you more -- sometimes more than double the cost of deploying to a single AZ.
AMAZON TAKES THE BLAME: Amazon reveals cause of outage, refunds customers
MORE CLOUD: Top 8 cities for cloud computing jobs
There are a variety of tools for customers to create highly available, fault tolerant systems within Amazon's cloud. One option is to use AWS's own services, specifically its Elastic Load Balancers (ELBs) that allow workloads to be moved from one availability zone into another. However, AWS acknowledged in a post-mortem report about the October outage that even its ELBs were impacted.
There are a variety of vendors, some call them cloud access management tools, that will manage this process of creating highly available cloud systems using AWS. RightScale and enStratus are two of the most popular. RightScale offers customers prepackaged solutions that spread workloads across AZs, or even various providers. But as Gartner IaaS analyst Kyle Hilgendorf says, "It's a cost vs. risk play. It's not easy, and it's not cheap." Highly available cloud systems simply cost more and add complexity.
Fault-tolerant applications have to be built to be horizontally scalable. In an ideal world they would be stateless, meaning they wouldn't be constantly changing with new data being inputted and saved into them. In most cases, it requires a copy of your system to be made somewhere else to ensure fault tolerance during an outage. Even when all of that is taken into account, there can still be problems.
RightScale CTO Thorsten von Eicken says during the latest Amazon outage, internal operations within RightScale had trouble scaling across availability zones in Amazon's cloud. AWS admitted it was "throttling" customers, meaning it limited how much data they could transfer from one AZ to another, something it has vowed it will not be as aggressive doing in the future. The point is that even if a system is architected to be fault tolerant, unexpected problems can still arise.
There are multiple ways to architect fault tolerant systems though, von Eicken says. Customers can create two active-active services, or create one active and a "clone" standby, for example. Each has its own advantages and cost considerations, though.
Basic fault tolerance: In a basic fault tolerant architecture, there is a production architecture and a standby "clone architecture." If there is a fail in the master AZ, then the system can be manually switched to use the cloned version, a process that not only usually requires a manual switch-over, but the databases are usually replicated in Amazon's Simple Storage Service (S3) about every 10 minutes, so when a switch-over does occur, you could lose about the last 10 minutes worth of data, RightScale says.