Network World
Saturday, August 30, 2008
Brad Reese on Cisco

Mastering VoIP network infrastructure problems: Q & A with the inventor of Xangati rapid problem identification (RPI) system

In our recent blog story that generated substantial interest among Network World readers:

VoIP monitoring: The quest for call quality ubiquity

We covered VoIP monitoring tools and how making VoIP enterprise-class will very quickly become a priority item for network administrators.

At a high level, these tools will enable IT to determine that there is something wrong with VoIP calls or that VoIP calls are being dropped.

However, these tools do not take into account that VoIP is just one component of a very dynamic ecosystem.

There are other applications on this shared network that are contending for the same resources.

So all these monitoring solutions do is confirm what the end-user most likely has already reported – that there is a problem.

When there is a problem with the VoIP service, network operations will need more information on what has affected the equilibrium of these applications to determine the source of the problem that is affecting VoIP quality.

However, some of these conventional monitoring tools may not be sufficient for managing VoIP quality and ensuring end-user satisfaction.

Details about the various network components and applications, such as how they normally behave over the infrastructure and when they are adversely affected, are becoming mandatory information for identifying the complex problems that affect VoIP performance, because problems can originate from anything that uses the same infrastructure.

Knowing where to start in this complex web is the challenge for network operations and it can often take them days or weeks to locate and identify the target problem.

A new technology is now available to empower IT organizations to better understand their networks so that they can proactively detect and control problems associated with VoIP on the network infrastructure.

Called rapid problem identification (RPI), this agentless solution uses NetFlow-based information (and other flow technologies) to provide live views of the networks, applications, users and servers (including call managers) that are sharing the network infrastructure.

RPI analyzes network flows from every endpoint, network and application to establish individual, dynamic profiles for tens of thousands of networking elements that are sharing the infrastructure.

It then leverages these profiles to compare against the live activity of the ecosystem to pinpoint the problem source of a reported issue.
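The profile-and-compare idea can be illustrated with a minimal sketch, assuming an exponentially weighted mean and variance per metric. The `Profile` class, its smoothing factor and its threshold are hypothetical; the article does not describe how RPI actually builds or evaluates its profiles.

```python
import math
from dataclasses import dataclass


@dataclass
class Profile:
    """Running baseline for one metric of one endpoint (hypothetical structure)."""
    mean: float = 0.0
    var: float = 0.0
    samples: int = 0

    def update(self, value: float, alpha: float = 0.1) -> None:
        # Exponentially weighted mean and variance: recent samples count more,
        # so the baseline drifts slowly as normal behavior changes.
        if self.samples == 0:
            self.mean = value
        else:
            delta = value - self.mean
            self.mean += alpha * delta
            self.var = (1 - alpha) * (self.var + alpha * delta * delta)
        self.samples += 1

    def is_anomalous(self, value: float, threshold: float = 3.0,
                     min_samples: int = 10) -> bool:
        # Do not judge an endpoint until a baseline has been learned.
        if self.samples < min_samples:
            return False
        std = math.sqrt(self.var)
        if std == 0:
            return value != self.mean
        # Flag values more than `threshold` standard deviations from the mean.
        return abs(value - self.mean) / std > threshold
```

In use, one such profile would be kept per endpoint and per metric (for example, the packet rate of a single IP phone), updated with each new observation and consulted when a live value arrives.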

For example, to detect problems with VoIP, the RPI technology analyzes the application behavior of all the IP phones, call managers, and voicemail systems deployed across the enterprise.

This behavior analysis includes traditional performance variables like packet rate and bit rate, but also includes affinities to specific applications, periods of time and other endpoints.

The next step for RPI is to correlate and analyze all the VoIP phone-specific issues with other symptoms experienced on the infrastructure to determine a single specific problem source.

The result is actionable information about the source of the core problem that allows for a quick problem resolution to get the VoIP service up and running to expected quality levels.

Traditional monitoring tools do not provide this level of analysis, resulting in a less than enterprise-class service.

Fortunately, the inventor of this new rapid problem identification (RPI) technology, Jagan Jagannathan, Ph.D., agreed to a question-and-answer session about Xangati, the company he founded, and the technology he invented.

Jagan is a veteran of Reactive Network Solutions, Xerox PARC, Sun Microsystems and SRI International and holds a Ph.D. in Computer Science from the University of Waterloo, Canada.

1. There are a number of solutions on the market that leverage NetFlow data, so what makes Xangati’s solution different?

The difference ultimately comes down to the problem area that a given product is trying to address, which then determines that product’s design goals.

The Xangati rapid problem identification (RPI) system is focused on enabling IT to find the problem source when there are complex performance and productivity issues affecting users, the network and/or the applications.

To do this, our product was designed to track the complex inter-relationships of all users/clients, servers, networks and applications in an IT infrastructure.

Our experience shows us that big problems arise when there are subtle changes in this ecosystem and our solution, through targeted live and historical data, delivers actionable information to guide the IT user to the problem source in the shortest amount of time.

In this context, it should be noted that we leverage other ambient data, including SNMP, LDAP and rDNS, to help us concretely map clients, servers, interfaces and subnets.

Comparatively speaking, delivering RPI represents a very different goal than that of other management products that also consume NetFlow.

These products are interface-centric and are specifically designed to provide details on the utilization of a WAN link as well as the breakdown of applications and users per link.

This is just a component of what we do, but it is the central focus for other products.

View a higher resolution video demonstration of the Xangati RPI solution:

http://www.xangati.com/demo/demo_ent.html


2. Why do traditional management solutions have challenges with assisting IT in troubleshooting network and application performance and availability problems?

The question is indicative of one of the frustrations that I hear consistently when I meet with enterprises—almost every one of them is able to recount for me a very recent and vivid story of an extended firefight.

A large reason for this is that traditional management solutions have a bottom up view of the world and are focused on the performance and availability management of a given IT silo: network, application, server etc.

The issue is that there is no integrated understanding of how the elements in the different silos interact with each other and according to Network World columnist Jim Metzler: performance problems transcend silos.


3. What does leveraging NetFlow as a primary data source allow you to do differently than traditional network management tools?

The utility of NetFlow data is that it provides sufficiently comprehensive information about interactions between different parts of a large and distributed IT infrastructure without imposing a burden on the routers that generate the data, while consuming only nominal bandwidth.

It obviates the need for probes all over the infrastructure and the need for potentially multiple agents on endpoints, both of which are prohibitive in cost and maintenance.


4. Who are the typical users of the Xangati RPI system and how are they using the solution in context of their daily workflow?

One of the primary users of the system is the network operations center (NOC) staff which has the twin challenge of managing both applications and networks.

When a problem is escalated to the NOC staff, they can quickly drill into the problem area through the Xangati UI and get a live view of the activity in that realm.

With the situational awareness they gain from the UI, they can seamlessly navigate to more detailed and context-laden views.

In addition to the NOC, we have seen the service/help desk find significant success with the solution.

In their workflow, they can start at the end-user level and literally see what that end-user’s networked application usage is at the exact moment the end-user is reporting a problem.

And at this level, the service desk rep can truly qualify, investigate and likely resolve the issue without an escalation.

This is quite a shift from what they were doing previously as the only solutions that were available to the help desk in the past were optimized for desktop support.


5. What are the areas of innovation you have focused your engineering team on in terms of cultivating unique intellectual property (IP)?

The framework for our IP started with the belief that you have to know about everything on your infrastructure with a great degree of granularity and specificity to catch complex problems.

That thinking led us down a path of creating a highly scalable platform with visibility into and awareness of the activity of each end-user.

Needless to say, we undertook this challenge and put great emphasis on scaling the system, which can support up to 100,000 endpoints.

And scaling the system has an added degree of difficulty because it extends in three dimensions all with a high degree of specificity for each infrastructure element:

1) Delivering live activity views.

2) Learning the normal application experience of each element over time.

3) Fine-grained history reports presented at will.

Through these various mechanisms we enable a user of our system to have unparalleled access to critical troubleshooting information.

There is a substantial amount of back-end work that our engineering team has created over time to sustain this, and we continue to build upon it.

I should also point out that our UI framework is set up in a way that the data our system crunches is presented in an easy to use format that makes the information actionable.

On this front, we have seen this be an attractive aspect of our technology to the extent that the system is now simple enough to use that it can be embedded in the workflow of a help-desk support person.


6. You place a degree of importance on the concept of inter-relationships. Why is that, and what value does it provide to customers?

Inter-relationships are essential for an IT user to understand because networked applications are essentially an amalgamation of cross-silo elements: application (which might actually be multiple applications, for example a web front end, app server and database back end), network, clients and servers.

To understand what is normal, you have to understand the relationships within a networked application ecosystem and then across ecosystems.

And then if you hope to find the root cause of a complex problem, you will want to know where the relationships have changed between the ecosystem elements.


7. Xangati puts a particular emphasis on the end-user experience of networked applications, why is that?

The simple answer is that it is ultimately IT’s role to deliver a high-quality application experience to their business end-users.

Moreover, we place emphasis on it because it helps to make a point that application experience is incredibly important for end-user productivity but not often well understood.

This is surprising given the tremendous investments companies are making in networked applications.

Up to this point, the solutions to deliver visibility all the way down to the end-user have been very rudimentary.

As a result when a user calls to complain about their application experience, it is a great challenge to IT because the helpdesk doesn’t have visibility into what they are doing.

This is where the RPI Virtual Task Manager can be leveraged.

See this video clip for an example:

http://www.xangati.com/taskmanager/Wireless_hog4.html


8. If an enterprise already has a network management system in place, how do you see the Xangati RPI system fitting into the picture?

We see the RPI system as very complementary to traditional management system installations, for example HP OpenView and IBM Tivoli.

Those solutions provide effective manager of managers (MOM) capabilities, and our system can integrate with them by sending traps related to the problems it identifies.

The roles of these solutions and ours are very different.

The big solutions are ideal for providing comprehensive views of the up/down status of many disparate IT infrastructure components.

Our RPI system augments them by helping to find the complex problems that arise even when all the elements being looked at by a MOM are saying things are fine.


9. Can you provide some examples of the common kinds of problems your customers have identified with your solution?

The top issues we have seen are:

Unscheduled back-ups clogging the WAN for an ERP application.
Misconfigured VoIP call processes dragging down call center productivity.
Software as a service application (SaaS) intermittently sluggish due to Internet video streaming.
Server cluster not properly load-balanced.
Hundreds of non-inventoried endpoints accessing centralized servers.
Unmapped/unknown activity by critical servers.

10. Where do you see the market evolving in terms of how Cisco routers can fuel the intelligence of management products like yours?

We think that the richer the data that can be provided by routers and switches the better things are for IT.

In addition to NetFlow, Cisco solutions have NBAR for application recognition and IP SLA for latency and mean opinion score (MOS) measurement.

These technologies will have increasing value over time in the management of your infrastructure.

Since anything critical within your IT infrastructure is leveraging the network, what better data source than the network itself to fuel management products?


Do YOU agree with Jagan that up to this point, the solutions to deliver visibility all the way down to the end-user have been very rudimentary?

Contact Brad Reese
http://www.BradReese.Com


Active & Inactive Timeout

Hello Xangati Team,

I watched your video on RPI and I have a question regarding what appeared to be ~1 second refresh rates when looking at the trends for a specific host's traffic (e.g. Peers, Apps, Performance Details [packets, bits]). Are you doing this with NetFlow?

I thought NetFlow was really only accurate at 1-minute intervals at best. The reason is two Cisco NetFlow commands:

# ip flow-cache timeout active 1
The above breaks up long-lived flows into 1-minute segments. You can choose any number of minutes between 1 and 60. How do you get the routers to export faster than 1 minute?

# ip flow-cache timeout inactive 15
The above command ensures that flows that have finished are exported in a timely manner. The default is 15 seconds; you can choose any value between 10 and 600.

Perhaps my information is old. If not, are you performing a playback of recorded data? How do you do this when the router/switch at times exports the connections made to and from the host out of order?

Also, I have a question regarding your behavioral baseline support, but I'll wait to see how badly you beat up my poor understanding of how NetFlow works. :)

Thanks,

AnonPrime

Response to Questions

AnonPrime,

Above are insightful questions.

We do provide second-by-second refresh rates leveraging NetFlow data. This is made possible by the fact that the router records the start and end times of flows (as the packets pass through the router). So regardless of the flow export delay, the Xangati system leverages the timestamps to determine the time of each packet.
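As a rough illustration of how start and end timestamps can yield per-second views, the sketch below spreads each flow's byte count over 1-second bins. It is not Xangati's implementation. A NetFlow v5 record carries only first and last timestamps plus packet and byte totals, so the uniform rate over the flow's lifetime is an assumption made here, and `spread_flow` and its parameters are hypothetical names.

```python
from collections import defaultdict


def spread_flow(start_ms: int, end_ms: int, octets: int, bins: dict) -> None:
    """Add a flow's bytes to 1-second bins, assuming a uniform rate over its lifetime."""
    if end_ms <= start_ms:
        # Single-packet or zero-duration flow: credit everything to one second.
        bins[start_ms // 1000] += octets
        return
    duration = end_ms - start_ms
    first_sec = start_ms // 1000
    last_sec = (end_ms - 1) // 1000
    for sec in range(first_sec, last_sec + 1):
        # Portion of the flow's lifetime that overlaps this one-second bin.
        lo = max(start_ms, sec * 1000)
        hi = min(end_ms, (sec + 1) * 1000)
        bins[sec] += octets * (hi - lo) / duration


# Usage: bins maps a second index to the estimated bytes seen in that second.
bins = defaultdict(float)
spread_flow(500, 2500, 2000, bins)
```

Because the bins are keyed by the flow's own timestamps, a record that arrives late from the exporter still lands in the seconds where the traffic actually occurred.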

To deal with scenarios where flows may be exported out of order, the Xangati system uses adaptive buffering techniques to facilitate reordering in a seamless manner.
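The reordering step can be sketched with a small hold-back buffer. This is an illustration under stated assumptions, not the adaptive technique Xangati describes: the hold window here is fixed rather than adaptive, and `ReorderBuffer` and `hold_ms` are hypothetical names.

```python
import heapq
from itertools import count


class ReorderBuffer:
    """Hold flow records briefly and release them in start-time order."""

    def __init__(self, hold_ms: int):
        self.hold_ms = hold_ms
        self._heap = []
        self._seq = count()  # tie-breaker so records themselves are never compared
        self._latest = 0     # newest start time seen so far

    def push(self, start_ms: int, record) -> list:
        """Add a record; return the records that are now old enough to release."""
        heapq.heappush(self._heap, (start_ms, next(self._seq), record))
        self._latest = max(self._latest, start_ms)
        released = []
        # Release everything at least hold_ms older than the newest record seen.
        while self._heap and self._heap[0][0] <= self._latest - self.hold_ms:
            ts, _, rec = heapq.heappop(self._heap)
            released.append((ts, rec))
        return released

    def flush(self) -> list:
        """Release all remaining records in start-time order."""
        out = []
        while self._heap:
            ts, _, rec = heapq.heappop(self._heap)
            out.append((ts, rec))
        return out
```

A record that arrives later than the hold window allows would still be emitted out of order, so the window trades display delay against ordering accuracy; an adaptive scheme would presumably tune that window to the observed export delays.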

Real time is sort of a loose term

Hello Xangati Team,

Good job with RPI. I'm sure there is still a 15 - 60 second or so delay in your adaptive buffering technique to display the data. Is the delay longer than this? After all, it is NetFlow and the bottom line is that there is no way around significant delay with Cisco's implementation of NetFlow v5:
# ip flow-cache timeout active 1
# ip flow-cache timeout inactive 15

Let's not digress on Flexible NetFlow.

Realtime is an overused term in the NetFlow business. Do you have a white paper on your web site that explains how it works? I'd like the URL.

I think more NetFlow vendors should implement this cool feature.

Thanks,

AnonPrime

Detailed Questions

AnonPrime,

Our website has a few white papers, but none that are architectural drill-downs in the way I suspect you are seeking. We would be happy to brief you directly if you want to ping us on our website and reference this dialogue.

Thank you.

Xangati

About Brad Reese on Cisco

Brad Reese is research manager at BradReese.Com, advancing the careers of 1 million certified individuals in the growing Cisco Career Certification Program.
