Network fire prevention
You'll need a tool kit that includes performance monitoring gear, application modeling tools and, soon, neural network software.
It's the universal complaint of the network executive: you spend too much time running around fighting fires - attending to the various network flare-ups that threaten to bring down your business.
But it doesn't have to be that way. There are tools available today that, while perhaps stopping short of fire-proofing your network, will lead you to where brush fires are starting, giving you time to douse the flames before they cause any real damage.
At the most basic level, performance monitoring tools help you get a read on your normal network performance and perform trend analysis to determine when things are deviating from that norm. That won't fix your problems, but it will point you to developing trouble spots in time for you to take corrective action.
There are dozens of performance monitoring tools available - from vendors including Concord Communications, FirstSense, Ganymede Software, International Network Services (INS), NextPoint Networks and Tavve Software - many of which get rave reviews from users.
Modeling tools hold another key to predictive net management. Vendors, including Ganymede, MIL 3 and Optimal Networks, offer tools that can help you predict what the addition of a new application or other changes will mean to your network performance. In some cases, the tools enable you to conduct stress tests and what-if scenarios, such as predicting the effect on response time of adding another 100 SAP R/3 users.
"That's proactive in that you can start to find envelopes of behavior where you can function normally and where you're going to start getting in trouble," says John McConnell, president of McConnell Consulting in Boulder, Colo.
Lurking on the horizon is neural network technology, which promises to advance the predictive network management cause further still. Computer Associates already has neural agents, dubbed Neugents, for predicting impending doom on NT servers. The agents "learn" about the behavior of the system they're monitoring and can give advance warnings when situations occur that have led to problems in the past. The company expects to deliver Neugents for routers and switches by year-end.
Baselining and performance monitoring
Performance monitoring tools, such as Concord's Network Health and Ganymede's Pegasus, generally paint a picture of what your network performance looks like under normal, everyday conditions, and then point you to areas in which performance is subpar. Depending on the modules you implement, you can get reports on network performance or on network and application performance.
Ganymede, for example, offers a severity index based on a composite of availability, response time and throughput exceptions. Based on a scale of one to 100, the index helps you identify which elements are operating farthest out of their normal range, theoretically pointing you to the source of a performance problem. Pegasus also provides a trend index that shows which elements are changing most rapidly. "It points you to the issues that will make the phones ring next week," such as a frame relay link that is operating at an unusually high utilization rate, says Jim McQuaid, director of monitoring solutions at Ganymede.
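To make the idea of a composite severity index concrete, here is a minimal sketch in Python. It is not Ganymede's formula, which the article does not describe. The weights, the exception counts and the names `ElementStats`, `severity_index` and `rank_by_severity` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ElementStats:
    """Exception counts for one monitored element (hypothetical structure)."""
    name: str
    availability_exceptions: int    # polls where the element was unreachable
    response_time_exceptions: int   # polls where response time was out of range
    throughput_exceptions: int      # polls where throughput was out of range
    total_polls: int


def severity_index(stats, weights=(0.5, 0.3, 0.2)):
    """Return a score from 1 to 100; higher means farther from normal.

    The weights are assumed values, not anything published by a vendor.
    """
    if stats.total_polls <= 0:
        raise ValueError("total_polls must be positive")
    rates = (
        stats.availability_exceptions / stats.total_polls,
        stats.response_time_exceptions / stats.total_polls,
        stats.throughput_exceptions / stats.total_polls,
    )
    # Weighted share of polls that raised an exception, between 0 and 1.
    score = sum(w * r for w, r in zip(weights, rates))
    # Map onto the one-to-100 scale and clamp to guard against rounding.
    return max(1, min(100, round(1 + 99 * score)))


def rank_by_severity(elements):
    """Sort elements so the one farthest from normal comes first."""
    return sorted(elements, key=severity_index, reverse=True)
```

Ranking elements by such a score is one way a report could surface the link or server that most deserves attention first.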
Nortel Networks is nearly done deploying Pegasus on its worldwide WAN, says Peter Massam, network management technologist for the company in its Maidenhead, England, network operations center. "The first real useful data to come out of this was the aggregated views we had of our locations," Massam says. "We could see full application performance by region and identify the hot spots to concentrate on."
For example, the product helped him troubleshoot a nagging performance problem on a link between the U.K. and Madrid. Pegasus confirmed that application response time was subpar and indicated it was due to sluggish throughput on the link. Bumping up the speed of the link quickly cured the problem.
Similarly, Massam says, Pegasus has pointed Nortel to frame relay links that were underutilized, meaning Nortel was paying for more bandwidth than it needed given the amount of traffic flowing over the link. The company was able to lower the committed information rate on those links, saving money.
Jim Gross, manager of telecommunications for Lockheed Martin in Research Triangle Park, N.C., has had similarly positive results with Tavve's performance monitoring tool, which he has been using for about a year. "It points us to a pending problem almost weekly," he says.
For example, the Tavve tool alerted Lockheed to abnormal bandwidth usage that threatened the performance of a client/server application in a Washington, D.C., office. "We were way ahead of the game in identifying what that additional bandwidth was being used for and were on the phone getting them to curtail that activity before customers were on the phone telling us their application wasn't working anymore," Gross says.
Tavve's product, dubbed tsc/PRM, works with Hewlett-Packard's OpenView or Tivoli's NetView for AIX, feeding off SNMP data those systems collect to create Web pages that graphically show network performance. "And it's up-to-the-second current," says Anthony Edwards, founder and chief technology officer for Tavve.
Users can set thresholds and be warned when a device is within 30, 60 or any other number of days of reaching the threshold. For example, suppose a router's CPU is running at 60% of capacity half the time, and the threshold is 70% of capacity half the time. The Tavve product could determine that if CPU utilization continues to increase at the same rate, the threshold will be reached in 30 days, giving the user time to troubleshoot the problem. There's also a report that lists seemingly minor events that are happening repeatedly, an indication that a problem exists. And a service-level agreement facility alerts net managers well ahead of time if they're in danger of not meeting an SLA.
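The days-to-threshold warning can be approximated with a straight-line trend. The sketch below fits a least-squares line to daily utilization readings and projects when it will cross the threshold. How Tavve's product computes its projection is not stated in the article; the linear fit and the function name `days_until_threshold` are assumptions.

```python
def days_until_threshold(daily_values, threshold):
    """Project how many days remain until a linear trend reaches the threshold.

    daily_values: one utilization reading per day, oldest first.
    Returns None if the trend is flat or falling, 0.0 if already at or over.
    """
    n = len(daily_values)
    if n < 2:
        raise ValueError("need at least two daily readings")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_values) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    fitted_today = intercept + slope * (n - 1)
    if fitted_today >= threshold:
        return 0.0
    return (threshold - fitted_today) / slope
```

With a month of readings that climb steadily from 50% to 60%, a 70% threshold is projected to be reached in roughly 30 more days, which matches the router CPU example.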
"We're using the reports to try to be proactive with our network and look at trends," says Mark Jones, enterprise management systems analyst III at BB&T, a bank based in Winston-Salem, N.C., that has about 600 branches. Besides the bandwidth utilization reports, Jones notes that tsc/PRM alerts him to the top 10 consumers of bandwidth on the network, which helps in troubleshooting.
Tavve users invariably point to the company's correlation engine, tsc/EventWatch, as the crown jewel of Tavve's product lineup. While a correlation engine would not seem to be proactive in nature - it sorts through alerts to identify the source of an outage - users say the Tavve product is so good that it can sometimes help them repair the fault before end users are aware there's an outage. And it definitely reduces the time it takes to fix a problem, users say. [See story for more on Tavve]
Model of success
Network modeling tools are more proactive still, alerting you to potential problems before you go live with changes to your net.
Robert Rohlin, senior consultant with Modis Solutions, a consulting company in Dallas, has been using MIL 3's ERP Network Guru for SAP R/3 to model an SAP installation for a petrochemical company client that has a 6,000-user network.
Each of the company's 27 sites will be running SAP and many of them are overseas, compounding the response time issue. Rohlin already had baseline performance data from Concord tools that he could plug in to the MIL 3 product. To that, he added the expected SAP traffic load.
"The thing we really like about MIL 3 is that we can get real specific, saying how much memory each router has, how much bandwidth and what protocols it's running," he says. You can also define heavy vs. light users, the specific SAP modules employed and various other characteristics. Given all those variables, he says it takes time to build the initial model, but once that's done it's easy to create multiple scenarios to find the one that yields optimal performance.
The modeling tool also points out potential problems that would be easy to miss. For example, if you double bandwidth from one site to a particular application server, it could slow response time for all others using that server because now there is more competition for the same resources.
There are two general ways to conduct network modeling, according to Marc Cohen, CEO and chairman of MIL 3.
The first is analytical, where the modeling tool uses mathematical equations derived to approximate protocol behavior and delays. The other is discrete, which involves modeling actual, individual packet transactions, including how packets are delayed at different points in the network and the effect of any protocols involved. "The discrete model is almost a replication of what's happening in reality," Cohen says.
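The difference between the two approaches can be shown on a toy case: a single link treated as a queue. The sketch below is not how MIL 3's tools work internally. It assumes random (Poisson) packet arrivals and exponentially distributed transmission times, the textbook M/M/1 model. The analytical function uses the closed-form M/M/1 result for mean delay, while the discrete function pushes individual packets through the link one at a time.

```python
import random


def analytical_delay(arrival_rate, service_rate):
    """Mean time a packet spends at the link, from the M/M/1 formula 1/(mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("link is saturated; the formula does not apply")
    return 1.0 / (service_rate - arrival_rate)


def simulated_delay(arrival_rate, service_rate, packets=100_000, seed=1):
    """Mean time at the link, measured by simulating each packet individually."""
    rng = random.Random(seed)
    clock = 0.0          # arrival time of the current packet
    link_free_at = 0.0   # time the link finishes sending the previous packet
    total = 0.0
    for _ in range(packets):
        clock += rng.expovariate(arrival_rate)
        start = max(clock, link_free_at)   # wait if the link is still busy
        link_free_at = start + rng.expovariate(service_rate)
        total += link_free_at - clock      # queueing plus transmission time
    return total / packets
```

For a link that can send 100 packets per second and receives 50, the formula gives 0.02 seconds, and the simulation should land close to that. The per-packet version is slower to run but can be extended with protocol behavior that no simple equation captures.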
With any modeling tool, it's crucial to verify your results before diving in too deep. Cohen says you should start by getting baseline performance data for a given application using existing monitoring tools. Then model that same application and compare the results. As you make changes, again compare the results from the monitoring tools with what the modeling tool predicted.
That's essentially what Rohlin did to verify that MIL 3's tool was giving him accurate SAP modeling data. SAP was already installed at some sites in Europe, and the company was collecting monitoring data. He diagrammed a small site in Germany using the ERP Network Guru, ran some simulations and matched it up to the real-world performance data. "The results were amazingly close," he says.
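The verification step described above amounts to comparing measured and predicted response times and checking that the gap is tolerable. A minimal sketch follows. The 10% tolerance and the name `validate_model` are assumptions, and any site names or numbers passed in are the caller's own, not data from the article.

```python
def validate_model(measured_ms, predicted_ms, tolerance=0.10):
    """Compare measured and modeled response times per site.

    measured_ms, predicted_ms: dicts mapping a site name to response time
    in milliseconds. Returns {site: (relative_error, within_tolerance)}.
    """
    report = {}
    for site, measured in measured_ms.items():
        predicted = predicted_ms[site]
        # Relative error of the model against the monitored baseline.
        error = abs(predicted - measured) / measured
        report[site] = (error, error <= tolerance)
    return report
```

Sites that fall outside the tolerance are the ones where the model's inputs deserve a second look before its predictions are trusted.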
Neural nirvana
Neural network technology, and potentially other forms of predictive logic, may likewise one day prove to have some fairly amazing predictive capabilities.
Frank Dzubeck, president of Communications Network Architects, a consultancy in Washington, D.C., says neural logic is one of three types of logic that may come into play to yield more proactive network management systems. The other two are chaos logic, which determines patterns when none are evident, and genetic logic, which detects patterns in hierarchical structures, from one generation to the next.
Neural technology is furthest along at this point, with Computer Associates already delivering the technology for NT systems. Dzubeck says IBM is working on neural agents of its own. IBM says it doesn't use the term neural, but Tivoli's Distributed Monitoring technology is based on the same principles.
Sorrel Jakins, director of server systems at Brigham Young University in Provo, Utah, has been testing CA's NT-based Neugents since last fall. "We have about 30 to 35 NT servers and it's working very well on those," he says.
Typical statistical inference engines make predictions of future failures based on a large spread of existing data. Neugents don't require this existing knowledge base; rather, they continuously monitor a system or device and determine the complex patterns that indicate abnormal behavior.
"Neugents are pretty good at taking skewed data and still making a good prediction," Jakins says.
Ron Cass, divisional assistant vice president for Neugent research and development at CA, says Neugents train themselves to recognize patterns on an individual system or device by monitoring various parameters. On a router, for example, Neugents will monitor for excessive protocol traffic, header errors, IP address errors, fragment failure, discarding of good traffic - about 20 attributes in all.
Rather than working from a database of information about the same type of router, Neugents start fresh with each router they monitor, Cass says. "Every router is different. The amount of traffic a router responds to depends on where it is."
From the data they gather, Neugents learn what patterns of behavior lead to problems, including performance degradation, failed interfaces or whatever you program the Neugent to look for. Thresholds are set automatically, based on what the Neugent determines is normal behavior for the router, and they change every night to keep up with varying conditions. When a Neugent detects a pattern of behavior that led to a problem in the past, and if it's serious enough to trigger the threshold, the Neugent issues an alert detailing the impending problem.
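CA's pattern-recognition method is proprietary and is not reproduced here. The sketch below illustrates only one idea from the description: a threshold that is derived from a device's own recent behavior and recalculated every night. It substitutes a simple statistical rule (mean plus a multiple of the standard deviation over a rolling window) for the neural network. The class name, the window length and the multiplier are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Per-device threshold recomputed nightly from a rolling window of samples."""

    def __init__(self, window_days=14, k=3.0):
        self.history = deque(maxlen=window_days)  # oldest day drops off automatically
        self.k = k
        self.threshold = None

    def nightly_update(self, todays_samples):
        """Add one day of readings and recompute what counts as abnormal."""
        self.history.append(list(todays_samples))
        flat = [value for day in self.history for value in day]
        self.threshold = mean(flat) + self.k * pstdev(flat)
        return self.threshold

    def is_abnormal(self, value):
        """True if a reading exceeds the current threshold for this device."""
        return self.threshold is not None and value > self.threshold
```

Because each device gets its own instance, a busy core router and a quiet branch router end up with different thresholds, in keeping with the point that every router is different.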
Neugents live on CA's NetworkIT Pro management console and require no software on the device to be monitored other than standard MIB 2 and Remote Monitoring agents. Each Neugent can monitor between 10 and 50 devices, depending on their complexity, but there's no limit to the number of Neugents you can install.
Jakins says BYU has seen occasions where a Neugent says network traffic is increasing such that a server will have a problem in 15 minutes. "We've had our network people take corrective measures, but we've also sat back and watched to see what happened," he says. "The timing has been pretty accurate, and the predicted effects have been accurate, too."
Asked if he sees the Neugent technology translating to a network scenario, where it can make the same sorts of predictions for routers and switches, Jakins answers with a thoroughly optimistic "most definitely."
RELATED LINKS
Tavve: A different take on proactiveness
Religious war peters out
There has been something of a religious war going on among performance monitoring tool vendors over the merits of synthetic transactions versus watching actual network traffic. Network World, 9/20/99.
Outsource the job
If you'd rather hand off the performance management chores, there are companies such as Candle and NetOps that will do it for you remotely. Network World, 9/20/99.
Sandia Labs paper on its approach to measuring response time
In PDF.
Ganymede's Script Library
A collection of predefined application simulation scripts.
Review and buyer's guide: Network monitoring and alerting
We look at six products, provide a searchable database of detailed specs on 46 tools and show you how vendors responded to our network monitoring RFP. Network World, 8/23/99.
Review: Test-drive your network designs
Four network simulation tools make it easier to allocate bandwidth judiciously and identify impending overloads. Network World, 5/24/99.
Big improvements coming to application management
A host of new measurement tools will debut soon, giving net managers better insight into how network applications are performing. Network World, 8/2/99.
Neugents: The thinking man's agent
CA says Neugents are smarter than Einstein, but users need proof. Network World, 7/26/99.
Concord pack diagnoses application health
Network World, 8/23/99.
Ganymede offers net performance monitoring tool
Pegasus overview. Network World, 2/9/98.
Tavve net mgmt. tools extend remote reach
Network World, 9/13/99.
HP bolsters NetMetrix management software
Hewlett-Packard promised to release performance monitoring software that is more scalable, better at watching WAN performance and easier for network managers to set up than the company's previous tools. Network World, 9/6/99.
Packeteer brings SNA reliability to IP
PacketShaper offering delivers class of service, performance monitoring features. Network World, 9/6/99.