Measuring cloud performance: A different approach needed

When you move applications to the cloud, you give up certain kinds of control. So, your measurement and monitoring strategies have to change.

As Lord Kelvin almost said, "To measure is to know." But this simple dictum is surprisingly hard to follow, because it really has two meanings.

The first meaning is obvious: You cannot really know about something without measuring it. If you want to know how quickly an application works, for instance, take some key functions of the application and measure how long they take. "Good performance" means a function completes within an acceptable threshold; "poor performance" means it takes longer.
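As a minimal sketch of that idea, the snippet below times a function and compares the elapsed time to a threshold. The function, the threshold value and all the names are inventions for the example, not anything from a real system:

```python
import time

# Hypothetical threshold: the name and value are illustrative.
ACCEPTABLE_SECONDS = 0.5

def timed(fn, *args, **kwargs):
    """Run fn, returning (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def lookup_order(order_id):
    # Stand-in for a real application function.
    time.sleep(0.01)
    return {"order_id": order_id, "status": "shipped"}

result, elapsed = timed(lookup_order, 42)
performs_well = elapsed < ACCEPTABLE_SECONDS
print(f"took {elapsed:.3f}s -> {'good' if performs_well else 'poor'} performance")
```

The same `timed` wrapper can be applied to any key function, which is what makes the threshold comparison repeatable rather than anecdotal.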

Other measurements may tell you if the application is "performing well." For instance, does performance vary greatly under different loads or in different parts of the application?

The second meaning is perhaps less obvious (and it was probably not what Lord Kelvin meant), but just as important: By defining your measurements, you are defining the limits of your knowledge. In an application environment you control, this is not such a big deal. When you need a new bit of knowledge, you can measure that function, too. But the same is not always true in the cloud. When you don't control the environment, you may not be able to measure everything. But even where you can measure everything, you can't use one measure as a shorthand for others in quite the same way.

When you move apps to the cloud

This is why I noted in my last post that as you move services to the cloud, you must ask yourself whether a measurement you are taking tells you exactly one thing about a user's experience. If it does not, try to break it down some more or try to create a composite measurement that allows you to focus narrowly on one thing at a time. Otherwise, you can have difficulty determining where to make adjustments.
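One way to picture a composite measurement is to record each stage of a request separately, so an alert can point at exactly one thing rather than "the page is slow." The stage names and timings below are purely illustrative:

```python
# Hypothetical component timings (seconds) for one user request;
# the stage names are inventions for this sketch, not a vendor API.
stages = {
    "dns_lookup": 0.020,
    "tls_handshake": 0.045,
    "backend_query": 0.310,
    "page_render": 0.080,
}

total = sum(stages.values())
# Express each stage as a share of the total so a single measurement
# can be decomposed into narrowly focused ones.
breakdown = {name: round(t / total, 2) for name, t in stages.items()}
worst_stage = max(stages, key=stages.get)
print(worst_stage, breakdown[worst_stage])
```

With the breakdown in hand, the adjustment target is obvious; with only the total, it is a guess.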

For example, in traditional web application environments operating inside your own infrastructure, it is not at all uncommon to use page load or render times as an indicator of issues unrelated to page loading or rendering. A great many page load alerts serve as proxies for some sort of database problem, because in database-backed sites a slow page load usually means you need to reduce load on the database to make everything snappier.

But think of the assumption that this kind of measurement shorthand implies: it assumes the connection between the application and the database is not the problem. In infrastructure you operate yourself, it's probably a good first-pass assumption. Most of the time, your back-end network doesn't break down. High database load happens all the time. So, you can use the shorthand and have an escalation path in your operations playbook in case the alert reveals something other than the usual database load.

The cloud changes things

In a cloud environment, that assumption no longer holds. Your cloud provider changes your environment all the time, reconfiguring things underneath you in ways that you cannot control. That service is, in fact, what you are paying for! Choosing a cloud service where you get the very same capabilities you have in your own data center, only operated by someone else, is just a way to increase costs.

No, the reason to put things in the cloud (even a private one) is to get capabilities that would not be available in a traditional data center-based infrastructure deployment. That means that the underlying infrastructure—the network, the hardware or environment in which a service runs, the storage, all of it—always has to be treated as an independent variable that cannot, even as a first approximation, be assumed to be stable. That's a good thing: It means that the cloud is adapting according to your application's needs in the environment.

But it also means you have to measure many more independent variables in order to respond the right way to undesirable changes and know what is causing those changes. Fortunately, the very cloud environments that present this challenge provide a way to manage it, too:

Get the data: 
Most cloud-based services offer feeds of individual metrics. Use them—and if your provider doesn't offer them, start looking for another vendor. Messages per second on a bus, storage operations, "compute power" use and so on are all evidence of underlying service behavior.
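Consuming such a feed might look like the sketch below. The feed is stubbed out here, since the real call would go to whatever metrics API your vendor exposes; the metric names are illustrative, not any vendor's actual identifiers:

```python
# Sketch of consuming per-metric feeds from a provider. The feed is a
# stub; a real deployment would query the vendor's metrics API instead.
def fetch_metric_samples(metric_name):
    # Stand-in for an HTTP call to a hypothetical metrics endpoint.
    fake_feed = {
        "bus.messages_per_second": [1200, 1250, 1190],
        "storage.ops_per_second": [340, 355, 348],
    }
    return fake_feed.get(metric_name, [])

samples = fetch_metric_samples("bus.messages_per_second")
latest = samples[-1]
print(f"bus.messages_per_second latest={latest}")
```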

Process the data: 
There are cloud services just for aggregating data and reporting on it in a digestible way. Find one you like, and use it. A firehose of data that nobody examines is no use, particularly when a crisis hits. This is not an optional part of your cloud-deployment plan, but an imperative. If you are implementing any sort of cloud-based system, plan your processing of the data from the outset or you will have difficulty evaluating what to do during a crisis.
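A bare-bones version of that processing step condenses a firehose of raw samples into a few digestible numbers per metric. This sketch uses invented metric names and a deliberately crude percentile calculation, for illustration only:

```python
import statistics

# Raw samples per metric; names and numbers are illustrative.
raw = {
    "db.query_ms": [12, 14, 13, 220, 15, 12, 14],
    "page.load_ms": [310, 295, 330, 305, 300],
}

def summarize(samples):
    """Reduce a list of samples to a small, digestible summary."""
    ordered = sorted(samples)
    # Crude nearest-rank 95th percentile, good enough for a sketch.
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "count": len(samples),
        "mean": round(statistics.mean(samples), 1),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }

summary = {name: summarize(s) for name, s in raw.items()}
print(summary["db.query_ms"])
```

Note how the single 220 ms outlier dominates the mean and the p95 here: exactly the kind of signal a summary surfaces and a raw firehose buries.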

Visualize the data: 
Most of the services for data aggregation include various visualization tools. Both for operations staff and for later discussion and explanation to management, you must ensure you have a good "normal" baseline. Knowing what normal "looks like" in your systems will keep you from making mistaken diagnoses in a crisis.
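One simple way to encode a "normal" baseline is to build it from historical readings and flag anything far outside it. The numbers and the three-sigma cutoff below are illustrative choices for the sketch, not a recommendation for any particular system:

```python
import statistics

# Historical readings (e.g. requests/sec); values are illustrative.
history = [100, 104, 98, 101, 99, 103, 97, 102]
baseline_mean = statistics.mean(history)
baseline_stdev = statistics.stdev(history)

def is_anomalous(value, sigmas=3.0):
    """Flag values more than `sigmas` standard deviations from baseline."""
    return abs(value - baseline_mean) > sigmas * baseline_stdev

print(is_anomalous(101), is_anomalous(160))
```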

Ensure your application generates useful data: 
Far too often, application logs have two modes: full debug and radio silence. Make sure your application generates useful and actionable metrics.
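A middle ground between those two modes is to emit one structured, machine-readable event per operation. The metric and field names here are assumptions made up for the sketch:

```python
import json
import time

def emit_metric(name, value, unit, **tags):
    """Emit one structured metric event as a JSON line."""
    event = {
        "ts": time.time(),
        "metric": name,
        "value": value,
        "unit": unit,
        "tags": tags,
    }
    # In production this line would be shipped to the metrics pipeline;
    # printing stands in for that here.
    print(json.dumps(event, sort_keys=True))
    return event

event = emit_metric("checkout.latency", 182, "ms",
                    region="us-east", status="ok")
```

A line like this is terse enough to stay on all the time, yet carries enough structure for the aggregation and visualization steps above to consume directly.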

Don't rely on one source: 
If you use only the measurements provided by your cloud vendor, then you're not auditing independently. An irony of outsourcing part of your infrastructure is that while you need to do less of the operation, you need to know more. Just as good financial controls and audits are the key to profitable outsourcing of business functions, good technical controls and audits are the key to profitable outsourcing of technical functions. Relying only on measurements from your service provider is trusting without verifying.
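An independent audit can be as simple as probing the service yourself and comparing your numbers with the provider's. Everything in this sketch is stubbed and hypothetical: the probe readings, the provider figure and the tolerance are all invented for illustration:

```python
# Sketch of an independent audit: probe the service from outside and
# compare against the provider's reported figure. The probe is stubbed;
# a real one would time live requests.
def probe_latency_ms():
    # Stand-in for latencies measured from outside the provider.
    return [95, 102, 99, 101]

provider_reported_ms = 60      # hypothetical figure from the vendor
samples = probe_latency_ms()
observed_ms = sum(samples) / len(samples)

# Trust, but verify: flag a disagreement worth investigating.
TOLERANCE_MS = 25
disagrees = abs(observed_ms - provider_reported_ms) > TOLERANCE_MS
print(f"observed={observed_ms}ms provider={provider_reported_ms}ms "
      f"disagrees={disagrees}")
```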

Of course, it can always be your cloud provider that is the source of some issue you are seeing, and if that is the case, you will want to know it. Next time, we'll talk about how to measure your provider from outside so you can know you're building better networks.

This article is published as part of the IDG Contributor Network.