Management: Searching for system errors

Search tools have simplified our lives in many ways, so why not network management? So reason the founders of Splunk, a start-up that has released a search product to make sense of logs and other types of event information generated by systems as they go about their business.

Michael Baum, founder and chief executive splunker (yes, it says that on his card), says troubleshooting individual boxes is not hard. The fun begins when you assemble multiple components into a system. No single vendor, developer, architect or administrator owns the problems that crop up, which usually stem from operator error, configuration errors, or integration and dependency problems.

"So customers approach it the old-fashioned way," he says, "with picks and shovels." To find out which of the many things that could go wrong did go wrong, you start digging.

One alternative is the autonomic self-healing approach advocated by IBM. Baum argues that although this approach might be feasible with stand-alone boxes, it is impossible at the complex systems level. "Automation is great, but it adds complexity," he says. "Are you really increasing mean time between failures enough to cover the mean time to recover after a failure in these complex environments?"

Splunk sides with the experts who are exploring recovery-oriented computing. They assume systems are complex and failures are inevitable, so what matters is how fast you can recover.

Enter Splunk's search tool, which is all about fast recovery.

A typical application server, database or Web server can generate 100MB of event data per day, Baum says. "And when something goes wrong, we ask people to make sense of it all." Splunk's search product builds a fingerprint for every event based on its syntax and grammatical structure. The fingerprinted events are then organized into buckets, indexed by time and analyzed for relationships. That helps troubleshooters quickly round up pertinent information from a range of sources and sift through the errors to find unique causal events.
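To make the fingerprint-and-bucket idea concrete, here is a minimal sketch in Python. It is not Splunk's algorithm, which the article does not describe in detail. It assumes that a fingerprint can be approximated by masking the variable parts of a message (IP addresses, hex values, numbers) so that events with the same shape collapse together, and that buckets are one hour wide. The names `fingerprint`, `index_events` and `rare_fingerprints` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed masking rules (not Splunk's): variable-looking tokens are replaced
# with placeholders so that only the structure of the message remains.
# Order matters: IPs and hex values are masked before bare digits.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def fingerprint(message: str) -> str:
    """Reduce a log message to its structural shape."""
    shape = message
    for pattern, placeholder in _MASKS:
        shape = pattern.sub(placeholder, shape)
    # Collapse runs of whitespace so spacing differences do not matter.
    return " ".join(shape.split())


def index_events(events):
    """Group (timestamp, message) pairs by hour bucket, then by fingerprint."""
    index = defaultdict(lambda: defaultdict(list))
    for ts, msg in events:
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        index[bucket][fingerprint(msg)].append(msg)
    return index


def rare_fingerprints(index, max_count=1):
    """Return fingerprints seen at most max_count times across all buckets."""
    counts = defaultdict(int)
    for by_fingerprint in index.values():
        for fp, msgs in by_fingerprint.items():
            counts[fp] += len(msgs)
    return sorted(fp for fp, n in counts.items() if n <= max_count)
```

In this sketch, two "Connection from ... port ... accepted" lines from different hosts share one fingerprint, while a one-off error stands out in `rare_fingerprints`, which is roughly the sifting the article describes.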

Why not just use a Google-like search tool? It's a much different problem, Baum says. Log data changes every millisecond and all log data is different, so it's not like searching documents or photos.

For now, the tool is intended for use in Java 2 Platform, Enterprise Edition (J2EE) and messaging environments, and to augment commercial systems management tools.

A free version of the product, called the Splunk Server, can be downloaded from www.splunk.com and used to index up to 500MB per day. Splunk Professional starts at about $2,500 for an annual license; it can be scheduled to run at set intervals, supports multiple user accounts and includes other features.


Copyright © 2006 IDG Communications, Inc.
