Troubleshoot to repair, or predict and prevent?

By Steve Henning, Network World
June 10, 2008 02:35 PM ET
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.

It sounds simple. Instead of spending hours or days troubleshooting an application slowdown or system outage, why not just avoid it to begin with?

Until recently, the only way for IT organizations to resolve problems was to sift through alerts, log files and trouble tickets and burn the midnight oil on conference calls. Today, powerful analytics and automation capabilities built into system management tools can help organizations identify and resolve issues before they become problems.

Interconnected business services have made management exponentially more difficult. Collecting more data isn’t the answer because:

* Monitoring static thresholds triggers a flood of alerts, most of which do not represent actual problems.
* Problems are identified by groups of abnormal behaviors, not a solitary metric.
* With tens of thousands of devices and millions of metrics, correlating the data manually to identify problems is impractical.

This deterministic approach of checking each metric against fixed thresholds is not only ineffective but also cannot scale to accommodate increasing complexity. Highly complex service infrastructures demand a new, probabilistic approach.

Intelligent system-management solutions now employ sophisticated correlation algorithms to sample subsets of metric data and deliver accurate information about potential system behavior. In addition, new learning technologies continuously refine alert thresholds — providing dynamic thresholds that recognize and accommodate the normal ebbs and flows of business. A probabilistic approach allows organizations to solve problems faster and with far less manual effort.

Intelligent management solutions integrate with existing monitoring infrastructures, automatically collecting and analyzing metrics from across all tiers of an application — such as Web server, application server and database tiers.

The first job for the intelligent management solution is to learn the normal behavior of the application. It should be possible to build a behavior model for each resource in your infrastructure by continuously collecting data and applying dynamic thresholding algorithms to it. This makes it possible to compare the real-time measurement of a metric with its expected range of values to determine when the metric should trigger a threshold violation.
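To make the idea of a learned expected range concrete, here is a minimal sketch in Python. It assumes one simple technique, a rolling mean plus or minus a multiple of the standard deviation; the article does not say which algorithm any particular product uses, and the class name `DynamicThreshold` and its parameters (`window`, `k`, `min_samples`) are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class DynamicThreshold:
    """Learns a rolling baseline for one metric and flags deviations.

    Illustrative only: a rolling mean +/- k standard deviations is an
    assumed stand-in for whatever a real product uses.
    """

    def __init__(self, window=60, k=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # most recent observations
        self.k = k                           # band half-width, in standard deviations
        self.min_samples = min_samples       # observations required before alerting

    def expected_range(self):
        """Return (low, high) from the learned baseline, or None while still learning."""
        if len(self.samples) < self.min_samples:
            return None
        mu = mean(self.samples)
        sigma = stdev(self.samples)
        return mu - self.k * sigma, mu + self.k * sigma

    def observe(self, value):
        """Compare value with the current range, then add it to the baseline.

        Returns True when the value falls outside the learned range.
        """
        rng = self.expected_range()
        violation = rng is not None and not (rng[0] <= value <= rng[1])
        self.samples.append(value)  # the baseline keeps adapting to new data
        return violation


if __name__ == "__main__":
    threshold = DynamicThreshold()
    for reading in [100, 102] * 10:      # normal fluctuation, no alerts
        threshold.observe(reading)
    print(threshold.observe(150))        # far outside the learned band
```

Because the range is recomputed from recent samples, routine fluctuations widen or shift the band instead of raising alerts. A production model would also need to account for time-of-day and day-of-week patterns, which this sketch ignores.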

Unlike traditional static thresholds, which cannot accommodate normal fluctuations without setting off alarms, intelligent management solutions should be able to learn patterns of system behavior and alert only when behavior deviates from normal. This creates a foundation for advanced correlation capabilities, which elevate problem solving to a new level of efficiency.

With the ability to identify truly abnormal system behavior, sophisticated correlation techniques also can accurately determine behavioral relationships between metrics and alerts. Intelligent management systems can actually predict abnormal behaviors that are likely to occur based on currently identified abnormalities. For example, an alert from the application server tier can predict that a key database performance indicator is highly likely to be exceeded in 15 minutes.
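One simple way to picture such a prediction is an empirical estimate drawn from alert history: how often has an application server alert been followed by a database alert within a given lead time? The sketch below is an assumption for illustration, not the correlation algorithm of any product; the function name `follow_probability` and the sample timestamps are hypothetical.

```python
from bisect import bisect_right


def follow_probability(leading, trailing, horizon):
    """Fraction of `leading` alert times that were followed by a `trailing`
    alert strictly later and no more than `horizon` time units afterward."""
    if not leading:
        return 0.0
    trailing = sorted(trailing)
    hits = 0
    for t in leading:
        i = bisect_right(trailing, t)  # index of first trailing alert after t
        if i < len(trailing) and trailing[i] - t <= horizon:
            hits += 1
    return hits / len(leading)


if __name__ == "__main__":
    # Hypothetical alert times, in minutes since the start of the log.
    app_server_alerts = [10, 100, 200, 300]
    database_alerts = [22, 112, 290, 310]
    print(follow_probability(app_server_alerts, database_alerts, horizon=15))
```

With these made-up timestamps, three of the four application server alerts are followed by a database alert within 15 minutes, so the estimate is 0.75. A high value on real history would justify warning operators about the database tier as soon as the application server alert fires.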
