The sky's the limit

For Priceline CIO Ron Rose, the 'new data center operating system' leads to infinite business possibilities.

Granddaddy online travel-bidding site Priceline.com reported gross travel bookings of $1.68 billion for 2004, up 52% over 2003. You don't get to that volume without a superior Web infrastructure - one that Ron Rose, CIO at the Norwalk, Conn., company, says must support a "data center operating system." In a recent interview, Rose shared his vision of the data center operating system and explained how such an approach helps Priceline achieve 99.997% database availability.

You use the term "data center operating system." What does it mean?

Just about all companies today use an architecture that I refer to as the "N-tier hairball." The N-tier hairball is when you've got Web servers, middleware servers, mainframe servers, database servers - this hodgepodge of architectures - that you're trying to control. The idea of the data center operating system is that all of those components should be controllable in a consistent manner.

That's a pretty tall order, especially for a company with thousands of servers. Where do you start?

The first important step for a data center operating system is getting an infrastructure in place that provides sophisticated provisioning, rollback and control capabilities. The No. 1 reason this is important is that configuration variability is a very bad thing. When you're building servers by hand, the only thing you ever know for sure is that no two servers are going to be the same. At Priceline, our gut feel is that 40% to 50% of infrastructure variability was caused by bad manual configurations, and programmers were spending lots of time chasing "ghosts" in the machines. When we began using BladeLogic's [Operations Manager, for automated provisioning], the vast majority of variability in the server build was taken out of the loop. [Automated provisioning] means less variability in the production plant and less time spent by programmers trying to figure out whether problems are real or not.
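To make the configuration-variability point concrete, here is a minimal sketch of the idea behind automated provisioning: build and check every server against one declarative manifest so hand-built drift surfaces immediately. This is illustrative Python only, not BladeLogic's product or Priceline's tooling; the file paths and digests are hypothetical.

```python
# Illustrative sketch only -- not BladeLogic's API. It shows the idea behind
# automated provisioning: every server is built from (and checked against) a
# single declarative manifest, so hand-built drift is detected immediately.
import hashlib
import json

# Hypothetical "golden" build manifest: file paths mapped to expected SHA-256 digests.
GOLDEN_MANIFEST = {
    "/etc/httpd/conf/httpd.conf": "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
    "/opt/app/config/app.properties": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def sha256_of(path: str) -> str:
    """Return the SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def drift_report(manifest: dict) -> dict:
    """Compare each managed file against the golden manifest and report drift."""
    report = {}
    for path, expected in manifest.items():
        try:
            actual = sha256_of(path)
            report[path] = "ok" if actual == expected else "drifted"
        except FileNotFoundError:
            report[path] = "missing"
    return report

if __name__ == "__main__":
    print(json.dumps(drift_report(GOLDEN_MANIFEST), indent=2))
```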

So the BladeLogic tool is central to your data center operating system?

BladeLogic crosses architectural tiers and vendor architectures. So in our case we can use one tool to control Windows, Linux and Solaris and to roll out - and, just as importantly, roll back - features with precision and speed. We rolled BladeLogic out three years ago. At that time, a rollback across the entire Web tier could take an hour and a half. Using BladeLogic, our ability to roll back went from an hour and a half to literally 10 minutes.
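The rollback numbers are easier to picture with the general pattern in view: keep each release on disk and switch a pointer, so "roll back" is a metadata change rather than a rebuild. The sketch below shows that pattern in Python under assumed paths; it is not how BladeLogic implements rollback.

```python
# Minimal sketch of the release/rollback pattern that makes fast rollbacks possible:
# keep each release in its own directory and flip a "current" symlink atomically.
# An illustration of the general technique, not how BladeLogic implements it.
import os

RELEASES_DIR = "/opt/app/releases"   # hypothetical layout
CURRENT_LINK = "/opt/app/current"

def activate(release: str) -> None:
    """Point the 'current' symlink at the given release (near-atomic switch)."""
    target = os.path.join(RELEASES_DIR, release)
    tmp_link = CURRENT_LINK + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, CURRENT_LINK)   # atomic rename over the old link

def rollback() -> None:
    """Re-activate the newest release that is not currently live."""
    live = os.path.realpath(CURRENT_LINK)
    for rel in sorted(os.listdir(RELEASES_DIR), reverse=True):
        if os.path.join(RELEASES_DIR, rel) != live:
            activate(rel)
            return
    raise RuntimeError("no previous release to roll back to")
```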

Another metric is a 60% reduction in administrative work related to configuration ... and a concomitant increase in programmer productivity. Morale is better, too, because programmers are not trying to debug configuration problems.

We also found another benefit, to our surprise. As we rolled BladeLogic out, we were able to dramatically reduce the number of people with powerful permissions on production boxes. BladeLogic enabled application developers to do the things they needed to do in a controlled way rather than by giving them privileges on production boxes. The application developers are happier because the tools work consistently across multiple chunks of infrastructure, and our security people are delighted because the number of people with powerful rights has been curtailed.
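One way to picture the "controlled actions instead of broad privileges" arrangement Rose describes is a small broker that runs only pre-approved operations and records who ran what. This is a hypothetical illustration; the action names, commands and log file are assumptions, not Priceline's setup or a BladeLogic feature.

```python
# Hypothetical sketch: developers invoke a broker that only runs allow-listed
# operations and writes an audit trail, instead of holding root on production boxes.
import getpass
import logging
import subprocess

logging.basicConfig(filename="prod-actions.log", level=logging.INFO)

# Allow-list of named operations mapped to the exact commands they may run.
ALLOWED_ACTIONS = {
    "restart-web": ["/sbin/service", "httpd", "restart"],
    "tail-app-log": ["tail", "-n", "200", "/opt/app/logs/app.log"],
}

def run_action(name: str) -> int:
    """Run an approved action and record an audit entry; refuse anything else."""
    if name not in ALLOWED_ACTIONS:
        raise PermissionError(f"action {name!r} is not on the allow-list")
    logging.info("user=%s action=%s", getpass.getuser(), name)
    return subprocess.run(ALLOWED_ACTIONS[name], check=False).returncode
```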

What are some other methods for improving Web infrastructure availability?

Another one of the keys to getting your availability up and reducing your mean time to repair is knowing, as quickly as possible, what problems there are in the plumbing and being able to respond to them with precision and speed. So we spend a lot of time on instrumentation - we have over 30,000 alerting points in the infrastructure. Great, thorough instrumentation is vital.

Exactly how does Priceline use instrumentation?

Like everybody, we use machine-level instrumentation, monitoring CPUs and disk-drive activity using standard tool sets like BMC Patrol. And we're thorough, never rolling out a chunk of infrastructure without proper instrumentation. Then we go a step further. We don't roll out applications without at least examining whether they should be instrumented - and, most of the time, we do instrument them. In this step, we're alerting on specific types of error conditions that the applications are encountering even if the machine itself is healthy. Then comes the next level of instrumentation - business service metrics, which everybody is getting all lathered up about.
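As a concrete, hypothetical illustration of that application-level layer - alerting on error conditions even when the box itself looks fine - the sketch below counts recent ERROR lines in an application log and raises an alert when they cross a threshold. The log format and threshold are assumptions made for the example, not Priceline's actual checks.

```python
# Sketch of application-level alerting: a healthy machine can still be failing
# requests, so instrument the application itself.
import re
import time

ERROR_PATTERN = re.compile(r"\bERROR\b")
WINDOW_SECONDS = 300          # look at the last five minutes
MAX_ERRORS = 25               # hypothetical alerting threshold

def recent_error_count(log_path, now=None):
    """Count ERROR lines whose epoch timestamp falls inside the alert window."""
    now = now or time.time()
    count = 0
    with open(log_path) as log:
        for line in log:
            # assume each line starts with an epoch timestamp, e.g. "1117000000 ERROR ..."
            ts_str, _, rest = line.partition(" ")
            try:
                ts = float(ts_str)
            except ValueError:
                continue
            if now - ts <= WINDOW_SECONDS and ERROR_PATTERN.search(rest):
                count += 1
    return count

def should_alert(log_path):
    """Raise an application-level alert even if CPU and disk look healthy."""
    return recent_error_count(log_path) > MAX_ERRORS
```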

Business service metrics are buzzy these days. How do you handle them?

We've been doing business service metrics for six years as a basic part of our business model [but we're doing it better today than we've ever done it before]. We pump alerting events into a big MySQL database, collecting them on our business-activity monitoring infrastructure. The BAM box is able to report on a variety of conditions - application errors, business-oriented conditions like itinerary volumes and total tonnage of business driven by product line, week-on-week variances and all that stuff - so we can understand trends and business-oriented metrics on an ongoing basis. The reports are continually available, so we can see the pulse of Priceline's business as it flows through the company's veins. And we can see it in graphic detail. A lot of people talk about business service metrics and how they'd really like to know on a weekly basis what throughput is. Since Priceline's inception, we've instrumented the company from CPU load all the way up to business metrics, because the nature of an e-commerce company is that you have to know what's going on with your product lines. And it's real time, all the time.
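A minimal sketch of the BAM pattern described here: events land in a database, and standing queries roll them up by product line and week. SQLite stands in for the MySQL store purely so the example is self-contained; the table, columns and sample rows are hypothetical.

```python
# Toy business-activity-monitoring store: events go into a table, and a standing
# query reports week-by-week volume and gross value per product line.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE business_events (
        event_time  TEXT,   -- ISO-8601 timestamp
        product     TEXT,   -- e.g. 'air', 'hotel', 'rental-car'
        amount_usd  REAL    -- booking value carried on the event
    )
""")
conn.executemany(
    "INSERT INTO business_events VALUES (?, ?, ?)",
    [
        ("2005-05-09T10:00:00", "hotel", 220.0),
        ("2005-05-10T11:30:00", "air",   410.0),
        ("2005-05-16T09:15:00", "hotel", 180.0),
        ("2005-05-17T14:45:00", "air",   530.0),
    ],
)

# Week-over-week booking volume and gross value by product line.
for row in conn.execute("""
    SELECT product,
           strftime('%Y-%W', event_time)  AS week,
           COUNT(*)                       AS bookings,
           ROUND(SUM(amount_usd), 2)      AS gross_usd
    FROM business_events
    GROUP BY product, week
    ORDER BY product, week
"""):
    print(row)
```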

Is predictive modeling the next step, then?

We're looking at predictive models with a company called Netuitive. You don't hear about Netuitive much, but its tool set is conceptually significant in that it does statistical analysis and helps locate what's different in the infrastructure from a rolling statistical average. That sounds like a convoluted way to describe the value, but what's often very important to know is what's different today from yesterday. Tools like BMC Patrol, [Mercury Interactive's] SiteScope and NetIQ are great, but one general limitation of that kind of tool set - and one that particularly limits its use for business service metrics - is that humans set the thresholds. Netuitive's predictive engine automatically derives a set of thresholds based on behavior, then reports on anything that's really weird about that behavior. For example, your CPU may go to 50% every evening between 3 and 6 p.m. while you're doing a data pull, and you set your threshold manually at 60% to see if it reaches that point. But what you really care about is if your CPU doesn't go to 50%. That would mean something isn't running on that machine [that ought to be].
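The learned-threshold idea can be sketched in a few lines: learn what normal looks like for each hour of the day from history, then flag readings that deviate in either direction, including a CPU that fails to rise during its usual batch window. This is a toy statistical illustration on made-up data, not Netuitive's actual engine.

```python
# Toy "behavior-learned threshold": per-hour baseline from history, then flag
# readings far from that baseline in either direction.
import statistics
from collections import defaultdict

def learn_baseline(history):
    """history: iterable of (hour_of_day, cpu_percent). Returns hour -> (mean, stdev)."""
    by_hour = defaultdict(list)
    for hour, cpu in history:
        by_hour[hour].append(cpu)
    return {
        hour: (statistics.mean(vals), statistics.pstdev(vals))
        for hour, vals in by_hour.items()
    }

def is_anomalous(hour, cpu, baseline, k=3.0):
    """Flag readings more than k standard deviations from that hour's learned mean."""
    mean, stdev = baseline[hour]
    return abs(cpu - mean) > k * max(stdev, 1.0)   # floor stdev to avoid zero-width bands

# Hypothetical history: the batch data pull normally pushes CPU to ~50% in its window.
history = [(3, 50), (3, 52), (3, 48), (3, 51), (14, 12), (14, 10), (14, 11)]
baseline = learn_baseline(history)
print(is_anomalous(3, 51, baseline))   # False: normal load during the pull window
print(is_anomalous(3, 8, baseline))    # True: the data pull probably did not run
```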

The same is true for business metrics. We're looking at predictive analytics to help us see trends in the underlying business behaviors that we wouldn't otherwise detect. On the business side, where we have various trends of requests - hotel or airline types of requests - or inventory types of events, we hope that knowing a rolling statistical average of those behaviors, being able to predict whether the behavior is getting better or worse, and being able to analyze what's different about supply-chain behaviors in a statistically valid way, day by day or week by week, will be beneficial. [As far as I know], we're one of the only shops on the planet trying to apply statistical measures to detect underlying business and infrastructure problems in a correlated way.
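Applied to a business metric, the same rolling-average treatment might look like the sketch below: compare the latest week's request volume for a product line against a trailing average to see whether the behavior is trending up or down. The weekly counts are made-up illustrative numbers.

```python
# Toy week-over-week trend check for a business metric such as hotel-request volume.
def weekly_trend(counts, window=4):
    """counts: weekly request volumes, oldest first. Returns (latest, trailing_avg, direction)."""
    if len(counts) <= window:
        raise ValueError("need more than one window of history")
    latest = counts[-1]
    trailing = counts[-window - 1:-1]
    avg = sum(trailing) / window
    direction = "up" if latest > avg else "down" if latest < avg else "flat"
    return latest, avg, direction

# Hypothetical weekly hotel-request counts
print(weekly_trend([9800, 10100, 9950, 10300, 11250]))   # (11250, 10037.5, 'up')
```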

So what lessons learned can you share?

If you're going to automate your data center processes, the first two things you should look at are storage - going to utility storage to avoid the administrative headaches of the most prevalent types of disk storage around today - and provisioning, since the ROI on that is generally extremely strong. Making your disk storage and your provisioning flexible and responsive are two homework assignments that have to get done before you're ready to do grid and utility computing properly. Then comes the really, really hard part, which is virtualization.

Fun fact: Though Ron Rose became Priceline's first CIO six and a half years ago, he was the fifth person in one calendar year to head technology development at the fledgling company - a fact Rose says he didn't find out until the day he took the job.

What we do now is some server consolidation using VMware, and we're happy enough with the tool. But there is so much more in this area that needs to happen, and that will happen, and we're really excited by this. A great virtualization approach is needed to get a data center operating system that operates cohesively and consistently ... having greater centralization, greater control across the N-tier hairball and greater efficiencies within the data center. So we're delighted to see [open-source and other virtualization] trends take off, and we think all of these trends are going to gather speed dramatically in the rest of 2005 and 2006. From control, security and application provisioning standpoints, the tools will all begin to develop over the next two years and work more seamlessly together. Box counts could be affected by this, but more important is the combination of a reduced box count per feature and added security and control.

The question is, 'How do you get to the promised land?' We're doing and seeing a lot of great things, but getting from where everyone is today, with the tool sets that exist, to a data center operating system is a long voyage.

Learn more about this topic

IT monitoring with a twist, 05/23/05

Priceline checks out Netuitive, 08/25/03

