When users complain that an application is slow or when customers abandon online sessions without buying anything, the cause of the problem is often elusive and mysterious. You know that the network connections are tight and the servers are humming, but the problem persists.
Troubleshooting efforts degenerate into finger-pointing. One vendor suggests spending money on faster computers. Someone invariably says the software needs to be completely rewritten. Another vendor sagely recommends faster storage devices. Yet another says you need more bandwidth.
What’s a network executive to do?
The answer: Try an Application Performance Monitoring (APM) tool. An APM tool monitors a multi-tier application’s performance and availability to show exactly how much time each application component takes to respond to a user’s requests. The information helps you decide what network or computing environment changes to make to solve the problem.
The perfect APM tool would have the following capabilities:
- Discovers and enumerates applications, devices and computers
- Supports a variety of applications, servers and devices
- Integrates with a global directory
- Graphically depicts the network
- Monitors application availability, performance and health
- Identifies and analyzes problems
- Produces alerts and notifications
- Issues trouble tickets (or integrates with a help desk tool)
- Supports virtualized environments and clouds
- Produces useful, informative reports
We invited APM tool vendors to submit products to our Alabama lab for evaluation. Five vendors participated. ExtraHop sent its Application Delivery Assurance (ADA) 3.9 EH6000 appliance, Dell shipped its FogLight 5.9.1 appliance and Fluke Networks loaned us a Visual TruView 1.3 appliance. We downloaded BlueStripe’s FactFinder 7.2 and the virtual machine edition of BMC’s Real End User Experience Monitoring (EUEM) 2.0.
While all the tools exhibited a range of APM strengths and abilities, we found that ExtraHop’s appliance did the best job of keeping our users’ performance complaints to a minimum. It was quickest to identify performance problems, its display of application activity was easiest to use and it had the best virtual machine support.
BMC EUEM’s endpoint- and session-oriented transaction analysis quickly and accurately spotted our bottlenecks, but EUEM required that we license a number of other vendors’ products, and it lacked a high level of support for virtual computing and public clouds.
While Dell FogLight excelled at tracking database transaction performance and had comprehensive analysis tools, configuring FogLight was tedious.
Although Fluke Networks’ Visual TruView revealed great volumes of network performance detail, it was packet-centric and technically demanding.
BlueStripe FactFinder accurately mapped transaction paths, graphically charted real-time app service levels, issued alerts and analyzed root causes. Unfortunately, BlueStripe FactFinder lost points for being agent-based.
Here are the individual reviews:
ExtraHop Application Delivery Assurance
The perfectly passive ExtraHop Application Delivery Assurance appliance used historical trends to recognize normal network and application behavior, gave us clear, easy-to-understand visibility into our applications, accurately pinpointed bottlenecks and notified us of problems with its dynamic, intelligent alerting engine.
We especially liked ExtraHop’s Application Inspection (AI) Triggers scripting feature, which we used to trace, monitor and measure entire transactions as they wended their way through multiple servers and connections.
Using what ExtraHop terms Trouble Groups, the appliance detects common performance problems, such as aborted database transactions, aborted HTTP transactions, excessive CIFS metadata queries, MTU mismatches, expiring SSL certificates, virtual packet loss and missing DNS entries.
Alerts were triggered by the occurrence of the common problems previously mentioned, by custom alerts that we configured or by statistical departures from the baselines that the appliance established from watching the network. ExtraHop’s problem notifications appeared as on-screen messages, SNMP traps, email notes and help desk trouble tickets.
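To make the idea of a statistical departure from a baseline concrete, here is a minimal Python sketch. It is not ExtraHop’s alerting engine: the function name, the three-standard-deviation cutoff and the sample response times are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def check_against_baseline(history_ms, current_ms, max_sigmas=3.0):
    """Return (alert, z_score) for current_ms relative to past samples.

    history_ms: earlier response times (the learned baseline).
    max_sigmas: hypothetical cutoff, in standard deviations.
    """
    if len(history_ms) < 2:
        return False, 0.0  # not enough data to form a baseline
    baseline = mean(history_ms)
    spread = stdev(history_ms)
    if spread == 0:
        # Perfectly flat baseline: any change at all is a departure.
        return current_ms != baseline, 0.0
    z_score = (current_ms - baseline) / spread
    return abs(z_score) > max_sigmas, z_score


history = [100, 102, 98, 101, 99]  # hypothetical response times in ms
print(check_against_baseline(history, 101))  # close to the baseline
print(check_against_baseline(history, 150))  # far outside it
```

A real product would presumably also account for time-of-day and day-of-week patterns, which this sketch ignores.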
ExtraHop’s Web interface is responsive, easy to navigate and intuitive. The customizable dashboard window contains widgets for specific applications and tiers that we chose to see. Other summary windows displayed application-sensitive metrics for transactions, device groups and individual devices we set up.
Configuring the summary windows was a simple matter of dragging and dropping widgets and selecting time intervals for charts. Hovering the cursor over a widget or chart caused ExtraHop to show further details, and we could drill down to zoom in on transactions, servers or connections. The Web interface also displays geographical network maps and graphical depictions of network activity.
Flex grids are custom-tailored summary reports that we found easy to assemble and versatile to use. We quickly produced high-level flex grid reports suitable for sending to a CIO (showing, for instance, an application activity summary view) as well as detail-level, targeted reports containing meaningful information for network administrators (showing network traffic levels), database administrators (showing database activity or database errors) and developers (showing application server network activity that highlighted server responsiveness). We especially appreciated the ability to drill down into network or application details.
The ExtraHop appliance continuously and accurately discovered devices and applications on the network. Passively, the appliance noticed new devices when they began using the network (either as a source or destination). The appliance classified devices based on a heuristic analysis of media access control address, IP address, naming protocol and transaction types. Classification of applications into logical groups and tiers relied on network activity (HTTP, database, CIFS, etc.), and we could easily define custom applications using ExtraHop’s Application Inspection (AI) Triggers.
ExtraHop’s AI Triggers are scripts you write at the application-protocol level. We used them in one test to easily isolate and view mobile device application access by segregating HTTP clients by type. Tracing specific front-end transactions across tiers via session IDs was similarly painless and simple.
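The two techniques described above, segregating HTTP clients by type and following a transaction across tiers by its session ID, can be sketched in a few lines of Python. This is an illustration of the general approach only, not ExtraHop’s trigger API; the record fields (`session_id`, `tier`, `elapsed_ms`) and the User-Agent substrings are hypothetical.

```python
from collections import defaultdict

# Assumed substrings that mark a mobile browser's User-Agent header.
MOBILE_MARKERS = ("iphone", "ipad", "android", "mobile")


def client_type(user_agent):
    """Classify an HTTP client as 'mobile' or 'desktop' from its User-Agent."""
    ua = user_agent.lower()
    return "mobile" if any(marker in ua for marker in MOBILE_MARKERS) else "desktop"


def trace_by_session(records):
    """Sum the time each tier spent on each session.

    records: dicts with 'session_id', 'tier' and 'elapsed_ms' keys.
    Returns {session_id: {tier: total_ms}}.
    """
    traces = defaultdict(lambda: defaultdict(float))
    for rec in records:
        traces[rec["session_id"]][rec["tier"]] += rec["elapsed_ms"]
    return {sid: dict(tiers) for sid, tiers in traces.items()}


records = [
    {"session_id": "s1", "tier": "web", "elapsed_ms": 40.0},
    {"session_id": "s1", "tier": "db", "elapsed_ms": 120.0},
    {"session_id": "s2", "tier": "web", "elapsed_ms": 25.0},
]
print(trace_by_session(records))
print(client_type("Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)"))
```

Grouping by session ID is what lets a slow front-end request be attributed to the tier that actually consumed the time.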
We also used AI Triggers to define additional metrics and even ignore irrelevant errors. Impressively, ExtraHop’s device and application discovery produced an accurate network map showing exactly which clients, servers, network devices and applications were talking to each other. The application views recognized protocols (including database, e-mail, file transfer, virtualization, storage, authentication, Java middleware, encryption and directory services) to produce logically related summaries of all application activity.
The support for VMware and Hyper-V virtualized environments was comprehensive and quite helpful. For example, ExtraHop’s objective and reliable network-activity-based view of our virtualized applications gave us an accurate look at an application’s performance when we spread the workload across multiple servers.
Furthermore, ExtraHop’s virtual packet loss metric let us detect congestion problems that we were easily able to fix through resynchronization and modification of overly restrictive timing thresholds. ExtraHop also helped us to track “VM sprawl” (the proliferation of VMs across the data center) and to know when to rein in the uncontrolled spinning up and migration of too many VMs. ExtraHop distinguished between physical hosts and VMs, and it comes with a predefined dynamic group for identifying VMware guests.
For cloud computing, ExtraHop focuses deeply but narrowly on just Amazon AWS. Its APM appliance understands and has metrics for all AWS services, including EC2, RDS, S3, ELB, Elasticache and DNS. ExtraHop showed us, for example, which users accessed particular S3 files. We also used ExtraHop to monitor the performance of various RDS queries.
BMC Real End User Experience Monitor
BMC’s End User Experience Monitor (EUEM) focused clearly and intelligently on our endpoints to show us session-by-session performance details. Its sophisticated analyzer quickly and unerringly revealed where and why transactions were encountering bottlenecks. EUEM also produced a range of reports useful to a wide audience, and its user interface was responsive and intuitive.
On the other hand, EUEM required that we already have Oracle database software, its cloud monitoring required us to install BMC’s Real User Cloud Probe module on a cloud server and its Extended Reports required us to license a variety of SAP BusinessObjects elements. These requirements certainly weren’t showstoppers, but the other products we reviewed were much more self-contained.
BMC’s EUEM consists of an Application Performance Management Console, a Real User Analyzer, a Real User Collector, a Real User Monitor and a Performance Analytics Engine. Each component runs as a separate virtual machine.
The Management Console’s home window displayed a number of useful, at-a-glance, configurable dashboards. Its browser-based interface contained links to the Real User Analyzer, Real User Collector and Performance Analytics Engine Web interfaces.
The Real User Collector intercepted and captured traffic from a network tap or a switch’s mirror port, then delivered the copied packets to the Real User Analyzer. The Collector used easy-to-configure Traffic Inclusion and Exclusion Policies to specify the application data that it captured, and the Collector’s browser-based interface displayed traffic flow details on its Traffic Capture Statistics page.
The Real User Analyzer evaluated the traffic to produce session statistics, Web page statistics, incident detections (i.e., alerts), EUEM report data and the Management Console’s dashboard data. The Performance Analytics Engine worked with the Real User Analyzer to zoom in on the sessions and the network traffic we wanted to investigate further.
For low-volume networks, BMC says you can simplify things by using the Real User Monitor, which combines the Collector and the Analyzer. In our tests, the Real User Monitor was able to handle traffic up to about 500 transactions per minute. In general, however, we tested separate, independently running Collector and Analyzer modules.
We used EUEM’s Watchpoints to help us focus on just the application traffic we wanted to troubleshoot. Watchpoints, EUEM’s best feature, define precisely the application, the group of users, the geographic region, the network segment or the client browser you’re interested in tracing.
Using the Management Console, we set an application Watchpoint that the Console sent to the multiple Real User Analyzers we had running. The Analyzers used the Watchpoint as a filter to then show us just the application network traffic we wanted to see. At five-minute intervals, EUEM summarized traffic volume, system availability and performance statistics for each Watchpoint and displayed the result.
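The filter-then-summarize behavior described above can be sketched as follows. This is a hedged illustration, not BMC’s implementation: the watchpoint is modeled as a plain dictionary of required field values, and the transaction fields (`timestamp`, `app`, `ok`, `elapsed_ms`) are hypothetical.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # five-minute summary interval


def summarize(transactions, watchpoint):
    """Filter transactions by a watchpoint, then summarize per interval.

    watchpoint: dict of field -> required value, e.g. {"app": "quotes"}.
    Returns {bucket_start: {"count", "availability", "avg_ms"}}.
    """
    buckets = defaultdict(list)
    for tx in transactions:
        # Keep only the traffic that matches every watchpoint criterion.
        if all(tx.get(field) == value for field, value in watchpoint.items()):
            start = tx["timestamp"] - tx["timestamp"] % BUCKET_SECONDS
            buckets[start].append(tx)

    summary = {}
    for start, txs in sorted(buckets.items()):
        succeeded = sum(1 for tx in txs if tx["ok"])
        summary[start] = {
            "count": len(txs),                       # traffic volume
            "availability": succeeded / len(txs),    # share of successful requests
            "avg_ms": sum(tx["elapsed_ms"] for tx in txs) / len(txs),
        }
    return summary


traffic = [
    {"timestamp": 0, "app": "quotes", "ok": True, "elapsed_ms": 200},
    {"timestamp": 100, "app": "quotes", "ok": False, "elapsed_ms": 400},
    {"timestamp": 310, "app": "quotes", "ok": True, "elapsed_ms": 100},
    {"timestamp": 50, "app": "search", "ok": True, "elapsed_ms": 900},
]
print(summarize(traffic, {"app": "quotes"}))
```

A subordinate Watchpoint of the kind mentioned in the review would simply add more criteria to the dictionary, for example a page name.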
Impressively, EUEM’s automatic discovery process, based on its analysis of raw network traffic, recommended application Watchpoints to us. And we liked that we could set multiple, hierarchical Watchpoints. We used subordinate Watchpoints, for example, to concentrate on fine-level details regarding our applications’ logon pages and search result pages.
Session timelines are another useful EUEM feature. These visual representations of user sessions were an intuitive window into our applications’ behaviors. Each timeline gave us at-a-glance awareness of a particular user’s progress (or lack of progress) during the execution of each application. Session timelines graphically showed us detail such as Web page errors, database transaction errors and bottlenecks that occurred during a session. Drilling down on each session element exposed the exact nature of each problem that we were investigating.
When EUEM’s passive monitoring of our network didn’t show sufficient detail (such as when part of the processing took place in a cloud), we used the toolkits that BMC provided to instrument our applications. EUEM was then able to show us elapsed times for various detailed activities, including page element load times and cloud server processing times.
And BMC’s Real User Cloud Probe, once installed on a server in the private cloud we tested with, gave us a milestone-by-milestone view of the applications’ workings in the remote cloud. Unfortunately, installing the Real User Cloud Probe in a public cloud requires making special arrangements with the cloud provider. (Note that BMC does already have a relationship with Akamai.)
EUEM helped us identify the root cause of each of the performance problems we tested. Moreover, it gave us a single view of the applications running in multiple data centers, let us follow user sessions that moved across data centers and allowed us to compare application health and performance data across the sites where an application was running.
Regrettably, EUEM lacks the level of virtual machine monitoring that we expected to see, and we’d like EUEM to integrate better with public clouds.
Dell FogLight
Dell FogLight is a superlative database performance monitor. Furthermore, it visually depicted our application topology in clear and easy-to-understand ways, it offers a rich set of analysis tools and FogLight’s script language makes it extremely versatile, both for defining custom applications and adding new metrics. Unfortunately, for all but the simplest and smallest (i.e., FogLight’s default) IT environment, Dell’s APM tool requires considerable configuration effort.
A Foglight appliance contains a Management Server, Archivers (and associated databases), Sniffers, Browser Instrumentation and Relayers. The Foglight Management Server collects packet capture data from Sniffers and archives the packets. Relayers perform load-balancing between Archivers, which both store and analyze the traffic data.
FogLight’s browser-based Management Server interface uses dashboards to show at-a-glance network health and activity. We used it to configure traffic capture options, such as what Dell terms capture groups, as well as set up sessionizing policies. For all but the simplest case, we found these session-identifying policies tedious to create.
Sessionizing is the process FogLight uses to correlate captured Web traffic packets with real user sessions. Sessionizing policies specify how FogLight should identify and manage user sessions within one or more monitored Web applications.
We set up distinct capture groups, one for each geographical location, to separately identify each location. We configured FogLight’s Sniffers to collect data just on the subnets we were interested in. We made sure each capture group had its own Archiver and Sniffer. Running FogLight’s Discovery tool from the Session Identifiers view displayed a list of candidate cookies from which we selected our application-specific cookies.
After telling FogLight the session ID variable name that our application used, we then ran FogLight’s Discovery tool from the Username Rules view to produce a list of candidate variables. From this list, we picked the ones our application used to hold user names.
FogLight can extract user names from HTML forms, XML variables, cookies or HTTP headers. We ran FogLight’s Discovery tool from the URL Prefixes view to see a list of URL prefixes from which we selected the one appropriate for our application. We gave FogLight directions on how to detect the end of a user session by setting such FogLight parameters as maximum hits per session, maximum session duration and Web page timeout. Finally, we used FogLight’s browser instrumentation guidance to insert code snippets in our Web pages.
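The sessionizing steps above (pick a session cookie, then decide when a session ends) can be illustrated with a short Python sketch. It is not FogLight’s algorithm; the hit structure, the cookie name and the default limits are assumptions, and only the three end-of-session rules named in the text are modeled.

```python
def sessionize(hits, cookie_name, page_timeout=1800,
               max_hits=1000, max_duration=14400):
    """Group Web hits into user sessions keyed by a session cookie.

    hits: dicts with a 'timestamp' (seconds) and a 'cookies' dict.
    A session is closed when the gap since its last hit exceeds
    page_timeout, when it would outlast max_duration, or when it
    already holds max_hits hits. All limits are hypothetical defaults.
    """
    open_sessions = {}  # cookie value -> hits in the current session
    finished = []
    for hit in sorted(hits, key=lambda h: h["timestamp"]):
        sid = hit["cookies"].get(cookie_name)
        if sid is None:
            continue  # no session cookie, so the hit cannot be attributed
        current = open_sessions.get(sid)
        if current is not None:
            idle = hit["timestamp"] - current[-1]["timestamp"]
            duration = hit["timestamp"] - current[0]["timestamp"]
            if (idle > page_timeout or duration > max_duration
                    or len(current) >= max_hits):
                finished.append(current)  # close the old session
                current = None
        if current is None:
            current = []
            open_sessions[sid] = current
        current.append(hit)
    finished.extend(open_sessions.values())  # sessions still open at the end
    return finished


hits = [
    {"timestamp": 0, "cookies": {"JSESSIONID": "a"}},
    {"timestamp": 10, "cookies": {"JSESSIONID": "a"}},
    {"timestamp": 5000, "cookies": {"JSESSIONID": "a"}},  # after the timeout
    {"timestamp": 5, "cookies": {"JSESSIONID": "b"}},
    {"timestamp": 7, "cookies": {}},                      # ignored
]
for session in sessionize(hits, "JSESSIONID"):
    print([h["timestamp"] for h in session])
```

The sketch shows why the policies take effort to get right: a wrong cookie name or an ill-chosen timeout silently merges or splits real user sessions.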
FogLight’s replay feature, which analyzes and displays archived packet capture streams, was a godsend when we needed to step through a particular end-user session to diagnose a difficult, non-obvious performance problem behind a user complaint.
In our virtual environment, FogLight displayed a dynamic visualization of vMotion transaction paths in real-time, and it pinpointed the effects of virtualization and shared-resource conflicts in our Web applications. Foglight for Virtualization provides a set of useful tools for managing virtual machines running on VMware, Red Hat or Microsoft virtualization platforms. For example, by comparing vCenter’s manifest of supposedly running VMs with FogLight’s list of actually running VMs, FogLight was able to highlight the differences to us.
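The manifest comparison in that example amounts to a set difference. A minimal sketch follows, assuming both inventories are available as simple lists of VM names; the function and the names are hypothetical and no vCenter or FogLight API is involved.

```python
def compare_vm_inventory(manifest, observed):
    """Report VMs that appear in only one of the two inventories.

    manifest: VM names the management platform believes are running.
    observed: VM names actually seen running.
    """
    manifest, observed = set(manifest), set(observed)
    return {
        "listed_not_running": sorted(manifest - observed),
        "running_not_listed": sorted(observed - manifest),
    }


print(compare_vm_inventory(
    ["web01", "db01", "app01"],    # hypothetical manifest
    ["web01", "app01", "test99"],  # hypothetical observed VMs
))
```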
FogLight determines the actual resources a VM is consuming, such as CPU, memory or disk, and it offered suggestions on how to more efficiently provision that VM. At our option, FogLight was even able to reprovision the resources itself.
Fluke Networks Visual TruView
The pre-configured, rack-mount Visual TruView appliance excelled at teaching itself baselines of normal network and application behavior, gave us at-a-glance, drill-down dashboards of activity and logically analyzed that activity to pinpoint bottlenecks.
The unit intelligently correlated network traffic by source, destination, time and packet contents to identify transaction and application performance problems. However, Visual TruView only recognizes and identifies applications by protocol. A typical report lists HTTP, Oracle DB, Citrix, etc. as individual applications.
In an N-tier environment (e.g., Web server, application server and database server), we had to use our knowledge of the overall application to visualize and understand Visual TruView’s server-by-server and interface-by-interface traffic reports and displays. We concluded that Visual TruView is essentially a protocol analyzer (packet decoder) onto which a layer of transaction recognition and processing had been grafted.
Visual TruView is agentless, installs in less than 30 minutes and collects data from a variety of sources. These include end-to-end transactions, SNMP events and NetFlow (IPFIX) packets from routers. Network-wide device discovery is quick and accurate.
The Visual TruView browser-based interface is thoughtfully designed, highly responsive and easy to navigate. We were never more than a couple of mouse clicks from our next task or from answering our next question.
Visual TruView’s dashboard highlighted atypical network activity, such as high transaction volumes and slow response times. Visual TruView showed us the busiest applications, sites and servers by transaction rate and data volume. It identified the busiest network interfaces, kept a close eye on WAN link utilization and monitored NetFlow-aware devices. Visual TruView also monitors VoIP performance.
The Visual TruView toolset consists of packet capture to disk, packet and transaction analysis, NetFlow analysis, device management and the dashboard. Visual TruView can recognize (and we could select for) VLAN activity, but the appliance doesn’t identify or specially handle cloud connections.
The Site Performance window’s geographic map view depicted our network’s sites with color-coded icons to denote global network status and health information. When we drilled down to a problem site, Visual TruView’s extensive troubleshooting tools helped us quickly recognize the problem’s root cause.
A Visual TruView pop-up window identified site issues and application slowdowns. When we selected a site, Visual TruView revealed transaction and application performance details for individual problematic servers and clients, including round-trip and inter-network, node-to-node response times. At the lowest level of detail, Visual TruView’s network protocol analyzer module decodes and displays individual packets.
The configurable dashboard’s application performance overview showed end user response times, TCP retransmissions, transaction rates and bandwidth utilization for the entire network, neatly categorized by geographic site. We could monitor the network in real time, or we could replay previously stored streams of network activity that we’d saved. We used Visual TruView’s filters to focus on a specific application, site or server.
And we greatly appreciated the ability to associate groups of sites, servers and applications with our simulated company’s line-of-business functions. The result was a clarified, business-oriented view of our network’s activity that helped us prioritize our troubleshooting efforts.
Visual TruView produces useful, easy-to-understand graphs of application, server and site performance, including response times. The graphs are “lively.” Hovering the mouse cursor over a graph causes Visual TruView to unveil further metrics, and clicking on the graph displays a detailed performance breakdown window.
Visual TruView’s trending feature, which graphs and reports on activity levels over time periods, is a great aid for capacity planners. We used the feature to view historical changes in end user response times, round trip network performance, network utilization and even connection setup times.
Visual TruView uncovers a wealth of performance data from the network activity it captures and analyzes, but we found it to be “packet-centric” and more technically demanding than the other products.
BlueStripe FactFinder
FactFinder accurately mapped transaction paths and timings from tier to tier, graphically charted real-time application service levels, quickly notified us when application performance fell below the thresholds we set, thoughtfully distilled a large set of out-of-bounds findings down to a manageable list of possible root causes and tightly integrated with Microsoft’s System Center Operations Manager (SCOM).
However, FactFinder was not perfectly platform-neutral, its usefulness waned in the presence of thick clients (in which the bulk of application processing occurs inside a client computer), it was less than helpful for network device bottlenecks involving multiple connections or mismatched MTUs and FactFinder gave us a subjective, not objective, view of our virtual environments and our cloud processing.
Agents have some advantages. Agents typically consume relatively few computing resources and they provide considerable information about running applications. However, server-based agents don’t see what’s happening inside client computers, they don’t see network traffic details involving multiple switches and routers in a long network path between a client and a server, and agents must be compatible with the operating system(s) you use.
FactFinder consists of a Management Server, the agents (BlueStripe calls them Collectors) and a Console.
BlueStripe Collectors run on specific versions of AIX, Red Hat Enterprise Linux, CentOS Linux, Oracle Enterprise Linux, 64-bit SUSE Linux Enterprise Server, Solaris and Microsoft Windows Server. In addition, FactFinder includes Transaction Plugin agents that run alongside particular versions of Apache HTTP Server, IBM HTTP Server, Microsoft IIS, IBM WebSphere, Oracle WebLogic and Sun ONE Web Server. FactFinder agents for monitoring database activity work with specific versions of Microsoft SQL Server, MySQL Server, Oracle, DB2 and Apache JServ Protocol (AJP). Finally, FactFinder has agents for specific versions of the Tomcat and Resin Java application server environments.
The Java-based FactFinder Management Server needs a dedicated, fast, multi-core server, which the customer supplies.
The FactFinder Console is, ironically, a Java thick client (not a browser-based interface) that communicates with the Management Server. We used the Console’s three views – App Center, Dashboards and Explorers – to identify and locate performance problems as we ran our test applications. The App Center’s summary data showed real-time aggregate activity, while the App Center’s Events and Alerts window displayed errors and warnings related to the machines running Collector agents. App Center’s lists of applications and machines basically confirmed that we had installed Collector agents on the appropriate set of servers. And App Center’s map view provided a graphical depiction of the agented servers.
The Console’s Dashboards view gave us at-a-glance service level status and health information, and we could drill down for more detail from any of the three Dashboard windows (App Dashboard, Top Transactions Dashboard and Tx Dashboard). However, we spent most of our troubleshooting time in the Console’s Explorers views, analyzing various components’ response times, activity levels and errors.
Each different component had its own Explorer view – Process Group, Process, IP Address, Server Port, Database Instance, J2EE Application, etc. The Explorer views we found most useful for troubleshooting application performance problems were App Explorer, Machine Explorer, Transaction Explorer and Trace Explorer. These four views gave us the insight we needed to identify and understand the bottlenecks on our network.
We also noted that FactFinder integrated closely with Microsoft System Center Operations Manager (SCOM) in our tests, using it as a network management tool to store component health, status and alert data.
Editor’s Note: BlueStripe provided this clarification on the capabilities of its product:
“BlueStripe FactFinder is an enterprise Transaction and Application monitoring solution that manages end-to-end application systems within the data center and the cloud. While it has complete visibility of all servers that make up an application system, it is not intended to be a network tool. While FactFinder shows when a network connection between any two servers is problematic, it is not designed nor intended to go into the network device layer.
FactFinder is also not an end user experience monitor, focusing on the parts of the application running on systems owned by IT Operations. Thus, applications that process on PCs are not managed by BlueStripe software, whether an email thick client, or some other office product.”
ExtraHop is our APM tool of choice. It helped us solve our application performance problems better and faster by giving us a clear view of our application activity, quickly identifying our bottlenecks and promptly alerting us to errors and problems.
Nance runs Network Testing Labs and is the author of Network Programming in C, Introduction to Networking, 4th Edition and Client/Server LAN Programming. His e-mail address is email@example.com.
How we tested application performance monitoring (APM)
We custom-programmed two vertical market applications for our tests, and we also monitored e-mail, Web server, database query and miscellaneous office productivity tasks. The two custom-built applications were an automobile insurance rate/quote package and a Web-based search engine process for querying a financial database. The insurance rating software is what an insurance agent or online insurance sales Web site uses to provide you with a price for your auto insurance. A financial adviser would use the financial database search software to help a client build a retirement plan or investment portfolio.
We evaluated each performance monitoring product’s ability to gather comprehensive metrics. We expected a product to reveal bottlenecks and show which network and server resources were being consumed. We looked at a product’s ability to discover, measure and track application response times and network traffic loads. We used each product to measure network and server utilization, identify the various components of an application’s performance and observe the performance aspects of running the application. We tested a product’s ability to show application capacity and scalability. We expected the product itself to be scalable.
Our test environment consisted of six routed Fast Ethernet subnet domains with T-1, T-3 and DSL links to the Internet. We ran the application monitoring software’s server component(s) on a 16-core machine on which we variously installed Windows 2008 Server, Windows 2003 Server and Red Hat Enterprise Linux. The 150 client computers on our network were a mix of Windows 2003, Windows 2008, Windows 7, Windows Vista, Windows 8, Red Hat Linux and Macintosh platforms. Relational databases on the network were Oracle, Sybase Adaptive Server and Microsoft SQL Server. The e-mail servers were Exchange and Sendmail, while the Web servers on the network were Internet Information Server (IIS) and Apache. Our virtualization platforms included VMware, Hyper-V, Red Hat KVM and Citrix XenServer. The test environment had cloud connections to a private cloud, Amazon AWS, Microsoft Azure and Rackspace.