I recently read an article in InfoWorld by David Linthicum entitled, "Good Cloud Ops Need Good Cloud Metrics." It caught my attention for two reasons.
First, Linthicum is correct in his assessment that metrics are as important in cloud operations as they are in any other aspect of your operations. To paraphrase Lord Kelvin, "to measure is to know." And knowing lets you do something about it.
Second, while the metrics he describes are important, they are only part of what you need to measure to fully understand your Internet performance. Linthicum focuses on measuring the performance of the systems and applications within the cloud provider. But as I've discussed earlier in this blog, system and application performance is only part of the story: network performance across the Internet between the cloud and end users is a huge factor, too. Those end users, of course, are your customers.
Another way to look at it: tracking metrics within the cloud gives you valuable insight to help keep your applications up and running, but whether your website or online content is up and running is not the most important point. What matters most is whether your website or content is reachable (i.e., can a customer connect via their local Internet service provider?). Optimizing the customer experience can be challenging, since you can't control outages or downtime at the customer's ISP or at other ISPs along the path.
Measuring performance across the network, however, is the first step toward mitigating issues for your customers. With that in mind, let's take a look at Linthicum's three points about cloud metrics:
1. "You can trend data and spot issues with recent operations."
2. "You can use the data to provide predictive analytics."
Once you have enough network performance data for the paths to where your content and applications live, you can explore it to find patterns and make predictions. Does performance change on a regular basis? You might be surprised at how often it does. Many systems used by humans follow a diurnal pattern (the same changes repeat every day). For example, the end of the business day on the U.S. East Coast is evening in Europe and early morning in Asia, when people are waking up, and this busy time can affect traffic and change performance.
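To make the idea of a daily pattern concrete, here is a minimal sketch in Python that groups latency measurements by hour of day and reports the median for each hour. The sample data, the function names, and the choice of median latency as the metric are all assumptions for illustration; real measurements would come from whatever monitoring you already run.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median


def hourly_median_latency(samples):
    """Group (timestamp, latency_ms) samples by UTC hour of day.

    Timestamps are assumed to be timezone-aware datetimes.
    Returns a dict mapping hour (0-23) to the median latency seen in that hour.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[ts.astimezone(timezone.utc).hour].append(latency_ms)
    return {hour: median(values) for hour, values in sorted(buckets.items())}


def slowest_hour(hourly):
    """Return the hour of day with the highest median latency."""
    return max(hourly, key=hourly.get)


# Hypothetical measurements taken on two consecutive days.
samples = [
    (datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc), 40.0),
    (datetime(2024, 1, 1, 21, 0, tzinfo=timezone.utc), 95.0),
    (datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc), 44.0),
    (datetime(2024, 1, 2, 21, 0, tzinfo=timezone.utc), 105.0),
]

hourly = hourly_median_latency(samples)
print(hourly)
print("Slowest hour (UTC):", slowest_hour(hourly))
```

If the same hours stand out day after day, that is the kind of repeating pattern you can plan around.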
3. "You can make your systems in the clouds self-healing."
So once you have the network performance data and are finding patterns in the data, what can you do? There are multiple ways to take action. Since we're talking about network performance, an obvious action is shifting traffic from a poorly performing path to a better one when you detect a degradation or outage. Having worked for a long time with and around DNS, I need to point out that using DNS answers "steered" by performance data is a simple and effective way to route your users to the best-performing path.
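As a rough illustration of steering by performance data, the sketch below picks the endpoint a DNS answer might point to, based on recent probe results. It is not any particular DNS product's logic: the function name, the 5% loss threshold, the use of median latency, and the addresses (taken from the ranges reserved for documentation) are assumptions, and actually serving the chosen answer is left to your DNS provider or server.

```python
from statistics import median


def pick_best_path(measurements, max_loss=0.05):
    """Choose the endpoint with the lowest median latency.

    measurements maps an endpoint to a list of probe results: a latency in
    milliseconds, or None when the probe failed. Endpoints whose failure
    rate exceeds max_loss are treated as degraded and skipped.
    Returns None when no endpoint qualifies.
    """
    scored = {}
    for endpoint, probes in measurements.items():
        successes = [p for p in probes if p is not None]
        if not successes:
            continue  # no data, or every probe failed
        loss = 1 - len(successes) / len(probes)
        if loss > max_loss:
            continue  # too many failed probes: treat as an outage
        scored[endpoint] = median(successes)
    if not scored:
        return None
    return min(scored, key=scored.get)


# Hypothetical probe results from one user region to three endpoints.
measurements = {
    "192.0.2.10": [80, 85, 90, 88],         # healthy but slow
    "198.51.100.20": [30, None, None, 35],  # fast, but dropping probes
    "203.0.113.30": [45, 50, 48, 47],       # healthy and reasonably fast
}

best = pick_best_path(measurements)
print("Answer DNS queries from this region with:", best)
```

Note that the fastest endpoint is skipped here because half of its probes failed, so the choice reflects reachability as well as speed.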
So start measuring, not just within the cloud, but across the entire network path, starting with your users. Once you have the data, you know. And once you know, you can take action.
This article is published as part of the IDG Contributor Network.