The sky's the limit
For Priceline CIO Ron Rose, the 'new data center operating system' leads to infinite business possibilities.
By Beth Schultz, Network World, 06/27/05
Granddaddy online travel-bidding site Priceline.com reported gross travel bookings of $1.68 billion for 2004, up 52% over
2003. You don't get to that volume without a superior Web infrastructure - one that Ron Rose, CIO at the Norwalk, Conn., company,
says must support a "data center operating system." In a recent interview, Rose shared his vision of the data center operating
system and explained how such an approach helps Priceline achieve 99.997% database availability.
You use the term "data center operating system." What does it mean?
Just about all companies today use an architecture that I refer to as the "N-tier hairball." The N-tier hairball is when you've
got Web servers, middleware servers, mainframe servers, database servers - this hodgepodge of architectures - that you're
trying to control. The idea of the data center operating system is that all of those components should be controllable in
a consistent manner.
That's a pretty tall order, especially for a company with thousands of servers. Where do you start?
The first important step for a data center operating system is getting an infrastructure in place that provides sophisticated
provisioning, rollback and control capabilities. The No. 1 reason this is important is because configuration variability is
a very bad thing. When you're building servers by hand, the only thing you ever know for sure is that no two servers are going
to be the same. At Priceline, our gut feel is that 40% to 50% of infrastructure variability was caused by bad manual configurations,
and programmers were spending lots of time chasing "ghosts" in the machines. When we began using BladeLogic's [Operations Manager, for automated provisioning], the vast majority of variability in terms of the server build was taken out of the loop. [Automated
provisioning] means less variability in the production plant and less time spent by programmers trying to figure out whether
problems are real or not.
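
To make the configuration-variability point concrete, here is a minimal sketch of drift detection across a set of servers. It is not how BladeLogic works; the function name `find_drift`, the server names and the settings are all hypothetical, and the "expected" value is simply assumed to be whatever most servers report.

```python
from collections import Counter


def find_drift(configs):
    """Return {server: {setting: (actual, expected)}} for every setting
    that differs from the value most servers report.

    configs maps a server name to a dict of settings (hypothetical data).
    """
    # Collect every setting name seen on any server.
    keys = sorted({key for settings in configs.values() for key in settings})

    # Assumption: the majority value for each setting is the intended one.
    baseline = {}
    for key in keys:
        counts = Counter(settings.get(key) for settings in configs.values())
        baseline[key] = counts.most_common(1)[0][0]

    # Report only the servers that deviate from the baseline.
    drift = {}
    for server, settings in configs.items():
        diffs = {
            key: (settings.get(key), baseline[key])
            for key in keys
            if settings.get(key) != baseline[key]
        }
        if diffs:
            drift[server] = diffs
    return drift


# Illustrative, made-up inventory: web03 was "built by hand" differently.
servers = {
    "web01": {"jvm_heap_mb": 1024, "tz": "UTC"},
    "web02": {"jvm_heap_mb": 1024, "tz": "UTC"},
    "web03": {"jvm_heap_mb": 512, "tz": "UTC"},
}
print(find_drift(servers))
```

A report like this is the kind of evidence that separates a real application bug from a "ghost" caused by one box having been configured differently.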
So the BladeLogic tool is central to your data center operating system?
BladeLogic crosses architectural tiers and vendor architectures. So in our case we can use one tool to control Windows, Linux
and Solaris and to roll out - and, just as importantly, roll back - features with precision and speed. We rolled BladeLogic
out three years ago. At that time, a rollback across the entire Web tier could take an hour and a half. Using BladeLogic,
our ability to roll back went from an hour and a half to literally 10 minutes.
Another metric is a 60% reduction in administrative work related to configuration ... and a concomitant increase in programmer
productivity. Morale is better, too, because programmers are not trying to debug configuration problems.
We also found another benefit, to our surprise. As we rolled BladeLogic out we were able to dramatically reduce the number
of people that had powerful permissions to production boxes. BladeLogic enabled application developers to do things they needed
to do in a controlled way rather than by giving them privileges on production boxes. The application developers are happier
because the tools work consistently across multiple chunks of infrastructure and our security people are delighted because
the number of people who have powerful rights has been curtailed.
What are some other methods for improving Web infrastructure availability?
Another one of the keys to getting your availability up and reducing your mean time to repair is knowing, as quickly as possible,
what problems there are in the plumbing and being able to respond to them with precision and speed. So we spend a lot of time
on instrumentation - we have over 30,000 alerting points in the infrastructure. Great, thorough instrumentation is vital.
Exactly how does Priceline use instrumentation?
Like everybody, we use machine-level instrumentation, monitoring CPUs and disk-drive activity using standard tool sets like
BMC Patrol. And we're thorough, never rolling out a chunk of infrastructure without proper instrumentation. Then we go a step further.
We don't roll out applications without at least examining whether they should be instrumented - and, most of the time, we
do instrument them. In this step, we're alerting on specific types of error conditions that the applications are encountering
even if the machine itself is healthy. Then comes the next level of instrumentation - business service metrics, which everybody
is getting all lathered up about.
Business service metrics is buzzy these days. How do you handle it?
We've been doing business service metrics for six years as a basic part of our business model [but we're doing it
better today than we've ever done it before]. We pump alerting events into a big MySQL database, collecting
them on our business-activity monitoring (BAM) infrastructure. The BAM box is able to report on a variety of conditions - application errors, business-oriented conditions like quantities
of itinerary and total tonnage of business driven by product line, variances week on week and all that stuff - so we can understand
trends and business-oriented metrics on an ongoing basis. The reports are continually available, so we can see the pulse of
Priceline's business as it flows through the company's veins. And we can see it in graphic detail. A lot of people talk about
business service metrics and how they'd really like to know on a weekly basis what throughput is. Since Priceline's inception,
we've instrumented the company from CPU load all the way up to business metrics because the nature of an e-commerce company
is that you have to know what's going on with your product lines. And it's real time, all the time.
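
The week-on-week variance by product line mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `week_on_week_change`, the product-line labels and the counts are invented for the example and do not come from Priceline's BAM system.

```python
def week_on_week_change(this_week, last_week):
    """Return the percentage change per product line, or None when there
    is no usable prior-week figure to compare against."""
    changes = {}
    for line, current in this_week.items():
        previous = last_week.get(line)
        if not previous:
            # Missing or zero baseline: a percentage change is undefined.
            changes[line] = None
        else:
            changes[line] = (current - previous) / previous * 100.0
    return changes


# Made-up booking counts, purely for illustration.
last_week = {"hotel": 100, "air": 100}
this_week = {"hotel": 120, "air": 90, "car": 15}
print(week_on_week_change(this_week, last_week))
```

Returning None for a product line with no prior-week figure keeps a newly launched line from showing up as a division error or a misleading percentage.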
Is predictive modeling the next step, then?
We're looking at predictive models with a company called Netuitive. You don't hear about Netuitive much, but its tool set is conceptually significant in that it does statistical analysis
and helps locate what's different in the infrastructure from a statistical rolling average basis. That sounds like a convoluted
way to say what the value is, but what's often very important to know is what's different today than yesterday. Tools like
BMC Patrol, [Mercury Interactive's] SiteScope and NetIQ are great, but generally one of the limitations of that kind of tool
set, and particularly limiting their use for business services metrics, is that humans set the thresholds. Netuitive's predictive
engine automatically derives a set of thresholds based on behavior then reports on anything that's really weird about that
behavior. For example, your CPU may go to 50% every evening between 3 and 6 p.m. while you're doing a data pull, and you set
your threshold manually at 60% to see if it reaches there. But what you really care about is if your CPU doesn't go to 50%.
That would mean something isn't running on that machine [that ought to be].
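
The contrast between a hand-set threshold and a derived one can be sketched as follows. This is not Netuitive's algorithm, which the interview does not describe; it is a simple mean-and-standard-deviation band, and the function name `is_unusual`, the `sigmas` and `min_band` parameters and the sample readings are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_unusual(history, current, sigmas=3.0, min_band=1.0):
    """Flag a reading that falls outside a band derived from past readings
    taken in the same time-of-day window.

    history  -- earlier readings for this window (e.g. CPU % at 3-6 p.m.)
    current  -- today's reading for the same window
    sigmas   -- band width in standard deviations (assumed value)
    min_band -- floor so a perfectly flat history still tolerates noise
    """
    if len(history) < 2:
        return False  # not enough data to derive a baseline
    baseline = mean(history)
    band = max(sigmas * stdev(history), min_band)
    return abs(current - baseline) > band


# Made-up history: CPU sits near 50% during the nightly data pull.
past = [49.0, 51.0, 50.0, 52.0, 48.0, 50.0, 51.0]

print(is_unusual(past, 50.5))  # a normal evening
print(is_unusual(past, 5.0))   # the data pull did not run

# A manual "alert above 60%" rule never fires on the missing job.
static_threshold = 60.0
print(5.0 > static_threshold)
```

Because the band is two-sided, a reading that is too low is flagged just like one that is too high, which is the case a human-set upper threshold misses.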
The same is true for business metrics. We're looking at predictive analytics to help us see trends in the underlying business
behaviors that we wouldn't otherwise detect. On the business side where we have various trends of requests - hotel or airline
types of requests - or inventory types of events, we hope knowing a rolling statistical average of those behaviors and being
able to predict whether the behavior is getting better or worse and being able to analyze what's different about supply chain
behaviors in a statistically valid way, day by day or week by week, will be beneficial. [As far as I know], we're one of the
only shops on the planet trying to apply statistical measures to detect underlying business and infrastructure problems in
a correlated way.
So what lessons learned can you share?
If you're going to automate your data-processing processes, the first two things you should look at are storage - going to
utility storage to avoid the administrative headaches of the most prevalent types of disk storage around today - and provisioning,
since the ROI on that is generally extremely strong. Getting your disk storage and your provisioning flexible and responsive
are two homework assignments that have to get done in advance of being ready to do grid and utility computing properly. Then
comes the really, really hard part, which is virtualization.
Fun fact: Though Ron Rose became Priceline's first CIO six and a half years ago, he was the fifth person in one calendar year to head
technology development at the fledgling company, a fact Rose says he didn't find out until the day he took the job.
What we do now is some server consolidation using VMware, and we're happy enough with the tool. But there is so much more
in this area that needs to happen, and that will happen, and we're really excited by this. A great virtualization approach
is needed to get a data center operating system that operates cohesively and consistently. ... having greater centralization,
greater control across the N-tier hairball, and greater efficiencies within the data center. So we're delighted to see [open-source
and other virtualization] trends take off and we think all of these trends are going to gather speed dramatically in the rest
of 2005 and 2006. From control, security and application provisioning standpoints, the tools will all begin to develop over
the next two years and work more seamlessly together. Box counts could be affected by this, but more importantly, it's a combination
of reduction in box count required per feature and additional security and control.
The question is, 'How do you get to the promised land?' We're doing and seeing a lot of great things, but getting from where
everyone is today with the toolsets that exist to a data center operating system is a long voyage.