Reducing maximum time to resolution: The key to decreased downtime and increased savings

This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.

Network downtime is an inescapable fact for networks of all sizes, and all the prevention and detection tools in the world won't, on their own, allow analysts to quickly solve a problem once it occurs. What's needed are tools that pinpoint the root cause of the problem and determine the appropriate steps to solve it. This isn't to say that prevention and detection tools aren't necessary. Obviously they are an important part of overall network health, but they should be a part of the solution, not THE solution.

An emerging sector of network technology addresses response and root cause through full-packet capture, which allows analysts to drill down to the epicenter of the incident and drastically shorten the amount of time required to solve a network's most difficult problems. By shrinking maximum-time-to-resolution (MAX-TTR) -- which requires a shift of focus to response and root cause -- organizations can unleash savings that will show a dramatic difference on the bottom line.

The ultimate goal of dedicating resources to response and root cause is the reduction of time-to-resolution (TTR), which is the amount of time it takes to correct a network anomaly. Doing so requires 100% packet capture, which offers clear, historical network visibility. If analysts can quickly retrace each step of the problem, guesswork is all but eliminated and they can expedite efforts to repair.

Organizations, however, tend to make a common mistake when attempting to reduce TTR: They focus on mean-time-to-resolution (MTTR). While knowing the average amount of time required to repair network anomalies can be useful, it doesn't tell the whole story. Cutting MTTR from four hours to three hours and 50 minutes is irrelevant, for the most part. The area where full-packet capture offers the most "bang for the buck" can be found in reducing MAX-TTR. That's where a real impact can be made relatively easily.

For most organizations, the majority of incidents are clustered around the four-hour mark, but a smaller number of events can take days or weeks to fix. While not as frequent, they cause the most network downtime and cost the most to repair. Because technology now exists to rapidly zoom in on where a particular issue was reported or alarmed and to identify exactly what happened, organizations can drastically reduce the length of the network's most frustrating fixes.

If an organization can drop its MAX-TTR from 24 hours to four hours, it will not only reduce the mean TTR but also shrink the amount of resources required to deal with the problem. Less downtime means greater savings.
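The arithmetic behind this argument can be sketched in a few lines of Python. The incident log below is hypothetical (nine incidents at the four-hour mark plus one 24-hour outlier) and exists only to compare two strategies: shaving 10 minutes off every routine incident versus cutting the single worst incident down to four hours.

```python
from statistics import mean

def summarize(ttr_hours):
    """Return (mean, max) time-to-resolution in hours."""
    return mean(ttr_hours), max(ttr_hours)

# Hypothetical incident log: nine incidents at four hours, one 24-hour outlier.
baseline = [4.0] * 9 + [24.0]

# Scenario A: trim each routine incident by 10 minutes, outlier untouched.
trimmed = [4.0 - 10 / 60] * 9 + [24.0]

# Scenario B: leave routine incidents alone, cut the outlier to four hours.
capped = [4.0] * 9 + [4.0]

for name, data in [("baseline", baseline), ("trimmed", trimmed), ("capped", capped)]:
    avg, worst = summarize(data)
    print(f"{name:9s} mean={avg:.2f}h max={worst:.1f}h total={sum(data):.1f}h")
```

With these made-up numbers, trimming the routine incidents recovers 1.5 hours of total downtime, while capping the outlier recovers 20 hours and brings the mean down with it.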

The old model is broken

There's nothing fundamentally new about full-packet capture. For as long as networks have been around, operational teams have been "sniffing" packets for diagnostic and troubleshooting purposes. In the past, network recording was reactive: teams responded to a problem by deploying a recording device -- typically a laptop attached to a SPAN port on a router or switch -- to get a trace file.

In a world where the network wasn't mission critical and TTR wasn't a big deal, this method sufficed. However, in a 10Gbps world where the organization's lifeblood is a fully functional network, this strategy fundamentally doesn't work.

Best practice network management advocates deploying a permanent fabric of network recording appliances on top of the core routing and switching fabric in data center and DMZ environments (top of rack/end of rack). By looking at full packet recording as a permanent feature of core network infrastructure, as opposed to a reactive tool, operational teams can fundamentally re-engineer their workflows to drive down time-to-resolution on a variety of application and network performance problems and unlock a range of other benefits in the security and compliance management domains.

The idea is extremely simple: Rather than guessing what went wrong and wasting valuable time testing a variety of different hypotheses, engineers can go back to the exact point in time that the problem occurred and effectively "replay" the network traffic.
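As a rough illustration of going back to "the exact point in time," the sketch below pulls packets out of a time-sorted capture index with a binary search. The function name, the index layout, and the sample data are all assumptions made for this example; a real recording appliance will have its own index format and query interface.

```python
from bisect import bisect_left, bisect_right

def packets_in_window(timestamps, packets, start, end):
    """Return packets whose timestamp t satisfies start <= t <= end.

    timestamps must be sorted ascending and parallel to packets.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return packets[lo:hi]

# Hypothetical capture index: seconds since start of recording, one label per packet.
timestamps = [0.10, 0.25, 0.90, 1.40, 1.41, 2.75, 3.00]
packets = ["syn", "syn-ack", "ack", "request", "retransmit", "response", "fin"]

# An alarm fired at t=1.5s; inspect the half second on either side of it.
alarm = 1.5
print(packets_in_window(timestamps, packets, alarm - 0.5, alarm + 0.5))
```

Because the index is sorted by time, the lookup cost grows with the logarithm of the capture size rather than with the capture size itself, which is what makes jumping straight to the moment of an alarm practical.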

The laws of physics dictate that accurate network recording at speeds in excess of 2Gbps can only be delivered by using purpose-built recording hardware. A standard 3GHz processor can only retrieve packets from the wire at a certain speed before it saturates and begins dropping packets, and unfortunately, going parallel doesn't help.
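A back-of-the-envelope calculation shows how tight the per-packet budget is. The sketch below assumes the worst case of back-to-back minimum-size Ethernet frames (64 bytes plus 20 bytes of preamble, start-of-frame delimiter and inter-frame gap) in one direction on a 10Gbps link, handled by a single 3GHz core. It says nothing about any particular capture product.

```python
LINK_BPS = 10_000_000_000   # 10Gbps link, one direction
CPU_HZ = 3_000_000_000      # 3GHz core, as in the text
MIN_FRAME_BYTES = 64        # minimum Ethernet frame size
OVERHEAD_BYTES = 20         # preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12)

def packets_per_second(link_bps, frame_bytes):
    """Maximum frame rate for a given frame size, including on-wire overhead."""
    bits_on_wire = (frame_bytes + OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire

def cycles_per_packet(cpu_hz, pps):
    """Clock cycles a single core can spend on each packet before falling behind."""
    return cpu_hz / pps

pps = packets_per_second(LINK_BPS, MIN_FRAME_BYTES)
print(f"{pps / 1e6:.2f} Mpps")
print(f"{1e9 / pps:.1f} ns per packet")
print(f"{cycles_per_packet(CPU_HZ, pps):.1f} cycles per packet")
```

Under these assumptions the link can deliver roughly 14.88 million packets per second, one every 67.2 nanoseconds, which leaves a single core about 200 clock cycles to receive, timestamp and store each packet.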

Organizations interested in recording the tidal wave of traffic in a 10Gbps environment must be aware that having a 10Gbps port on an appliance and actually being able to record at 10Gbps are two entirely different things. Using DMA (Direct Memory Access) packet capture techniques, real line rate 10Gbps performance can be delivered, and still leave processing cycles on the appliance for indexing and analyzing the traffic.

The top attributes organizations should look for when considering network recording infrastructure are:

1. Proven, continuous, 100% accurate packet capture at full-duplex line-rate 10Gbps (in effect, 20Gbps recording)

2. High-resolution packet time-stamping with a minimum accuracy of +/- 50ns to allow accurate comparison of packets captured in different locations

3. High-density local system storage and the ability to offload traffic to a SAN for extended storage, if required

4. Elegant, browser-based application-aware workflow for visualizing, searching and retrieving packets of interest from anywhere across a global network of appliances

5. On-appliance protocol decode capability, so that sensitive traffic does not have to be removed from the data center for analysis (which can cause compliance issues for certain financial institutions)
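To put the storage requirement in perspective, the following sketch estimates how much disk a day of recording consumes at the 20Gbps full-duplex rate mentioned in the first item. The 25% average utilization figure is a hypothetical input, and the calculation ignores any per-packet metadata or index overhead added by the capture format.

```python
FULL_DUPLEX_BPS = 2 * 10_000_000_000  # 10Gbps in each direction
SECONDS_PER_DAY = 86_400
TB = 10**12                           # decimal terabyte

def storage_bytes(rate_bps, seconds, utilization=1.0):
    """Bytes needed to record a link at the given average utilization."""
    return rate_bps * utilization / 8 * seconds

# Worst case: both directions saturated for a full day.
worst_case_tb = storage_bytes(FULL_DUPLEX_BPS, SECONDS_PER_DAY) / TB

# Hypothetical average load of 25% of line rate.
typical_tb = storage_bytes(FULL_DUPLEX_BPS, SECONDS_PER_DAY, utilization=0.25) / TB

print(f"worst case: {worst_case_tb:.0f} TB/day, at 25% load: {typical_tb:.0f} TB/day")
```

A saturated full-duplex 10Gbps link works out to 216 TB per day, and 54 TB per day at the assumed 25% load, which is why local storage density and the option to offload to a SAN both appear on the list.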

There are a number of solutions available today that offer traffic recording as a feature of their detection tools (NPM and APM), and a few that focus on network recording as a piece of infrastructure. Understanding the difference between the two is important.

Solutions that offer recording as a feature of a software product will typically struggle to scale to meet the accuracy and performance demands of higher-speed networks, whereas solutions focused on infrastructure will be able to scale to higher speeds with much better levels of accuracy.

Conclusion: If you care about TTR and you're running a 10Gbps network, a hardware-based approach to network recording is the key.
