Network World

Peter and Rebecca

Looking at the Performance Management "Big Picture"

By Sevcik and Wetzel on Mon, 03/24/08 - 8:47am.

Last week we laid out the subset of ITIL processes that define the essential building blocks of performance management (PM). You probably already operate some of these management processes - incident, availability, capacity and service level. But you probably also have been polishing just one side of your IT coin by applying these processes to a single aspect of performance: just to assets or just to information flows. It's time to look at the big performance management picture.

Performance management has two equally valuable and mutually dependent aspects that we describe as "columns" (infrastructure) and "rows" (flows). If you are like most IT professionals, you are applying performance management either to columns, expecting that if all is well with each column then the user experience is as good as it can get - or to rows, expecting that if the user experience is within norms then the infrastructure must be running well. Although both views are valuable, when looked at independently, each delivers only half of the big picture. Without both views, you invite panicked all-hands-on-deck emergencies.

By gathering data and delivering reports on both views, you are putting together the two halves of performance management into the big picture shown below.

As the figure shows, you can apply the same PM building blocks (incident, availability, capacity and service level management) to columns or rows. Most enterprises and management tool vendors start by applying PM to columns, thus putting in place infrastructure performance management (IPM). This can achieve good IT asset efficiency, which is why most IPM sales pitches talk about the great return on investment you will experience when you tune the assets to stay just ahead of resource needs. On the other hand, measuring and tracking rows provides the clearest insights into the actual user experience - which is application performance management (APM). Here the sales pitch is that users are able to do more with the application, so they can generate more revenue.

Put simply, good IPM saves money and good APM makes money. Now you can see why you really need both. But most organizations do not cover both or, if they do, the coverage is not balanced. Which side are you managing?

Maybe it is the words again

Not blasting ITIL, I like it, but... "why most IPM sales pitches talk about the great return on investment" - because it is a sales pitch, I would think? Performance - what performance? ROI, system performance, infrastructure performance, getting new business, development performance, stock, and so on - are they related? How do you measure them? After 30+ years in this business I think "incident, availability, capacity and service level" has one problem - the word capacity already covers these, and many more, areas. Capacity planning and management has been forgotten for a long time - it used to be one of the main focuses in IT (and corporate) strategy, covering everything in planning and not just IT technology. Maybe this is part of the IT crisis today - to run IT you need people, premises, power, locations, support infrastructure, water, coffee, ISPs, software / hardware vendors, short/long term sustainable business plans, cost of contracts, user acceptance, growth control, etc - all part of your capacity. Now - there is no one software, architecture or any other system which alone can manage all that - you need people. ITIL is very good at covering the technical aspects but it is no miracle cure, it is just a framework to manage some needs in modern IT.

Good points, however automated correlation is key.

Very interesting take on performance management issues. I certainly agree that simply looking at infrastructure metrics or user experience metrics alone will not provide an adequate basis for performance management. However, even if an organization is doing a great job of monitoring user experience and infrastructure components, it is still likely to have a difficult time quickly resolving performance and availability issues. We consistently see organizations armed with multiple, siloed monitoring solutions with static monitoring thresholds sitting on long bridge calls trying to resolve problems. They spend massive amounts of manual effort sifting through alerts from these solutions, trying to understand which are important to solving the problem and which can be ignored. They rely on tribal knowledge of their applications to manually correlate the behaviors they are seeing and identify the problems. It is no wonder that 70%+ of IT budgets are labor cost. With shrinking budgets, increasing management complexity (due to new technologies like virtualization and SOA) and sheer numbers of devices, this approach is unsustainable. With tens of thousands of devices and hundreds of thousands (even millions!) of metrics measured, human or rule-based correlation approaches are impossible.

What is needed is a new approach to performance management that starts with learning the normal behavior of all measured metrics. With the knowledge of normal behavior, the truly abnormal precursor behaviors to problems can be identified. Key indicators can be set for user experience metrics (e.g., key transaction response times), and when the key indicators are breached, automated correlation techniques identify the areas in the infrastructure that are performing abnormally (e.g., the app server and database tiers of the application) to pinpoint troubleshooting efforts. This type of approach eliminates alert storms, bridge calls and the massive manual efforts in identifying and resolving problems. It also allows a truly proactive approach to performance management, as these correlations are captured in models that can predict these problems when they begin to recur. There is a new class of performance management solutions (Real-time Analytics-based Performance Management solutions) that provide these capabilities and allow Operations to continue to scale without increasing labor spend.
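To make the idea concrete, here is a minimal Python sketch of baseline learning followed by breach-triggered correlation. It is only an illustration under stated assumptions, not how any particular product works: the metric names, the sample values, the mean/standard-deviation baseline and the 3-sigma threshold are all hypothetical choices made for this example.

```python
from statistics import mean, stdev

def learn_baseline(history):
    """Learn 'normal' as (mean, standard deviation) for each metric."""
    return {name: (mean(vals), stdev(vals)) for name, vals in history.items()}

def is_abnormal(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from its mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

def correlate(current, baselines, key_indicator, threshold=3.0):
    """When the user-experience key indicator is breached, return the
    other metrics that are also behaving abnormally right now."""
    if not is_abnormal(current[key_indicator], baselines[key_indicator], threshold):
        return []
    return sorted(
        name for name, value in current.items()
        if name != key_indicator and is_abnormal(value, baselines[name], threshold)
    )

# Hypothetical history: one user-experience metric plus three infrastructure metrics.
history = {
    "checkout_response_ms": [200, 210, 190, 205, 195],
    "app_server_cpu_pct": [40, 42, 38, 41, 39],
    "db_query_ms": [20, 22, 18, 21, 19],
    "web_server_cpu_pct": [30, 31, 29, 32, 28],
}
baselines = learn_baseline(history)

# Hypothetical current readings during a slowdown.
current = {
    "checkout_response_ms": 900,
    "app_server_cpu_pct": 95,
    "db_query_ms": 120,
    "web_server_cpu_pct": 30,
}
suspects = correlate(current, baselines, "checkout_response_ms")
print(suspects)
```

With these made-up numbers, the slow checkout transaction points at the app server and database metrics while the web server is left out, which is the kind of narrowing described above. A real deployment would need baselines that account for time of day and seasonality, which this sketch ignores.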

Yes, it is the key

Thank you Steve, you said it better than I did; maybe I'm just a little disillusioned today. And by people I didn't really mean more headcount, but people who are assigned to solve performance and capacity problems. It can't be done ad hoc, and the better the people working on those problems know the business, the easier it gets.

Solving today's IT performance problems is sometimes frustrating. Sometimes it would even be as easy as adding hardware, BUT there is no space, power, or whatever other capacity left. Or the network capacity is already used up and there is no capacity plan to get more. Or there is space but it hasn't been accepted by the local building inspection yet, and so on. Or one Monday morning a new application system starts and eats all the disk / database capacity (space or performance) - you may have extra, but that is planned to be used for another application next week, installed on another floor or in another building. Or you can't get the union workers who have to install the electrical cables until next month. Or the hardware importer / vendor who was supposed to deliver today is out of business and their assets are frozen.

Yes - you are absolutely correct, there is a need for correlation, but in my mind it has to cover the whole business, not just IT; otherwise the "performance management" can turn into a nightmare very fast. By the way, all the previous examples (and more) are from real-life situations.
About App Performance View
NetForecast is an internationally recognized engineering consulting company that benchmarks, analyzes, and improves the performance of networked data, voice, and video applications.