Kevin Felichko didn’t get as much sleep as he wanted on Monday night.
Felichko is the CTO of PropertyRoom.com, an online auction site for seized goods that runs entirely on Amazon Web Services’ cloud. Late last week AWS announced that it would be rebooting up to 10% of its virtual machines, known as Elastic Compute Cloud (EC2) instances. For a company like PropertyRoom.com, which processes tens of millions of dollars’ worth of online auctions entirely through Amazon’s cloud, that could have been a big problem.
But Felichko says it turned out to be a manageable problem. One key to using IaaS cloud computing resources is to prepare for failure. Amazon’s CTO Werner Vogels even preaches that. And that’s what Felichko and his tech team of four had done when they migrated over to Amazon’s cloud earlier this year.
On Friday PropertyRoom.com got notification from Amazon that most of the reboots of the company’s instances would happen during the late evening hours on Monday. Late Monday night Amazon informed Felichko that the reboots would be delayed until Tuesday morning. After staying up late monitoring the situation, Felichko was slightly frustrated that the maintenance window had been moved on him at the last minute. But on Tuesday the reboots happened and the PropertyRoom.com website never went down.
However much of an inconvenience the whole process was, Felichko says it could have been much worse, and he’s thankful it wasn’t. He credits that outcome to heeding the advice of AWS and cloud experts: prepare your cloud applications to be flexible in the face of uncertainty.
Using a service named CloudWatch, which monitors the health of EC2 instances, Felichko has set up the system so that if any instance serving the front end of the website goes down, CloudFormation, a tool that sets up and deploys AWS services, will automatically bring the front-end web server up on another healthy instance. The services are spread across multiple AWS Availability Zones (AZs), which are separate data center locations within a single region of AWS’s cloud.
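The pattern described here, health checks that trigger automatic replacement across zones, can be illustrated with a small simulation. This is only a toy model under stated assumptions, not PropertyRoom.com’s configuration and not the AWS API: the class names, zone names and instance IDs are all hypothetical, and the "replace in the least-populated zone" rule is an illustrative choice.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


class FrontEndPool:
    """Toy model of a self-healing server pool spread over several zones.

    Hypothetical illustration only; it makes no calls to any cloud provider.
    """

    def __init__(self, zones, desired_per_zone=1):
        self.zones = list(zones)
        self._ids = count(1)
        self.instances = [
            self._launch(zone)
            for zone in self.zones
            for _ in range(desired_per_zone)
        ]
        # The pool always tries to get back to this many running instances.
        self.desired = len(self.instances)

    def _launch(self, zone):
        return Instance(f"i-{next(self._ids):04d}", zone)

    def healthy(self):
        return [inst for inst in self.instances if inst.healthy]

    def reconcile(self):
        """Drop unhealthy instances and launch replacements.

        Each replacement goes to the zone that currently has the fewest
        instances, so the pool stays balanced across zones.
        """
        self.instances = self.healthy()
        launched = []
        while len(self.instances) < self.desired:
            zone = min(
                self.zones,
                key=lambda z: sum(inst.zone == z for inst in self.instances),
            )
            replacement = self._launch(zone)
            self.instances.append(replacement)
            launched.append(replacement)
        return launched


pool = FrontEndPool(["zone-a", "zone-b"], desired_per_zone=2)
pool.instances[0].healthy = False  # simulate a provider-initiated reboot
replacements = pool.reconcile()
```

In this sketch the monitoring step is reduced to a `healthy` flag and the deployment step to `reconcile()`. The point is that capacity is restored without anyone logging in to fix the failed machine.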
So, when Felichko learned about the reboot, he was fairly confident the system would work on its own to migrate the workloads off any instance that shut down and onto a running one. It worked as planned, mostly.
The one issue Felichko ran into was that one of the instances serving a back-end function for managing inventory was stuck in a reboot cycle and would not fully restart. That created something of a domino effect in the system because the company’s order processing system is tied closely to the inventory. Felichko reached out to an AWS customer service representative, who resolved the issue. The cause was a hardware problem in AWS’s data center, and that instance was taken offline.
“We’ve built our system to run in multiple AZs within a region, so we can hopefully survive some instances going down,” he says. “We built this with the idea that the infrastructure can fail, so this was a great time to test it out.”
When Felichko joined PropertyRoom.com the company was hosted by a managed service provider, and one of his first tasks was to move off that Savvis service and onto AWS’s cloud. He had been impressed with the variety of services AWS offers and the customer case studies AWS boasts about, from Netflix to Airbnb.
Since migrating over earlier this year Felichko hasn’t looked back, even with a little hiccup like the recent reboot situation. In the old setup, architecting the website to be spread across multiple data centers was extraordinarily complicated. The company even ran internal backups of some data for caching. Now, all of that has been moved over to AWS’s cloud, and spreading across multiple AZs is as easy as configuring a handful of AWS’s services. “It’s been not that bad of an experience compared to using a dedicated hosting environment,” he says about using the cloud. In the dedicated hosting environment, a hardware issue once took PropertyRoom.com down for six hours. Since transitioning to Amazon it hasn’t experienced any significant downtime.
Overall, the last half a week has been slightly stressful, but Felichko understands why AWS did what it did. Amazon has released scant details as to the actual reason for the mass reboot, but it is related to a security issue that many suspect is in the Xen hypervisor. It’s not likely that all customers had as smooth a process as Felichko. But most users who have taken to Twitter to vent about the issue bemoan a minor inconvenience rather than a devastating outage. Felichko may have lost a couple of hours of sleep on Monday waiting for the maintenance to occur, but in the grand scheme of things he’ll take that tradeoff for all the other advantages the cloud brings.