Jesse Rothstein, who was the lead architect of F5's flagship product line, founded ExtraHop in 2007 to develop products to derive IT operations intelligence from data gleaned from the network. Network World Editor in Chief John Dix recently caught up with Rothstein for an update on the company and what it has learned about things like virtual packet loss (hint: it can be the bane of highly virtualized environments).
How does your background at F5 help you at ExtraHop?
My co-founder Raja Mukerji and I were both at F5 for many years. And what we did at F5 was bring application awareness and application fluency to what was the load balancer, and that created a whole new product category called the application delivery controller. Over at ExtraHop, we leverage that same domain expertise in high-speed packet processing and application fluency, but we’ve brought it to a new space, much more on the IT operations side, and we’re starting to call this IT operations intelligence.
Raja and I had conversations with IT organizations and people we’d worked with in the past, and it became apparent to us that the end result of megatrends like server virtualization, where VMs spin up and spin down and jump across the data center, and agile development, where we roll out new versions of applications every two weeks or every two days, was an unprecedented level of scale, complexity and dynamism. And the previous generation of tools and technologies that companies use to manage these environments is no longer tenable. And that’s if they have those tools at all. More often than not companies just throw smart people at the problem of figuring out what’s going on.
So I would say, No.1, the situation has become such that we’re beyond the capability of just throwing smart people at the problem and pulling a few all-nighters and ordering pizza. And No.2, the previous generation of tools was built for much smaller environments that were not dynamic. Those tools basically start off as bricks, and you parachute in teams of sales engineers and systems engineers and consultants to configure them in order to provide the visibility you need. Then if the environment changes, rather than automatically detecting the changes, you have to rinse and repeat that process.
So we started with the notion that these IT megatrends were occurring, that we had the domain expertise to solve some of the problems around scale and dynamism, and that we could provide visibility into these environments.
What are you lumping into the current generation of tools?
This is a taxonomy I’ve been thinking about for a while. In enterprise IT there are four or so sources of data that you can use to derive some intelligence about your environment.
So No.1 we have machine data, and I’m using a term that Splunk popularized. Machine data includes log files, SNMP and WMI, and all of these data sources are largely unstructured. Splunk and others like them realized that enterprises are producing a lot of this unstructured machine data and not really doing anything with it. So they built a platform to index it, archive it, and analyze it to derive some intelligence from it.
I sometimes joke that it’s been transformational in the same way as fracking has been in the energy market. What I mean by that is, the value was always there, but by applying new technology we can now access it and extract it. So I think one source of data in the IT environment is this unstructured machine data.
Another source is what I would call code-level instrumentation. And this is what traditional Application Performance Management is based upon. Wily (acquired by CA) really founded that market, but companies like DynaTrace and AppDynamics and even New Relic make use of code-level instrumentation. They have agents that instrument the Java JVM or the .NET common language runtime, and they can derive some intelligence and some performance metrics around how that service performs. Where are the hotspots and bottlenecks? What’s it doing? These are very useful tools for developers who have intimate knowledge of the code and want to see how it runs in production.
The third source of data I call service checks. There are lots of facilities for doing this. If you’re running some sort of synthetic transaction (basically a script mirroring common user actions), you can use internal checks, which is what HP’s Mercury SiteScope and Nagios do today, or external service checks like Keynote or Compuware’s Gomez. These give you a sense of whether your service or application is up or down and, to some degree, how it is performing. But there are some challenges with this approach because, given that these checks are periodic in nature, there’s an inherent undersampling problem. That means that if you’ve got any sort of intermittent issue, you may very well miss it.
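To make the undersampling point concrete, here is a minimal Python sketch. It assumes a check that fires at fixed intervals starting at time zero and an outage of fixed length that can begin at any whole second within one interval. The function name `missed_fraction` and the numbers used are illustrative assumptions, not figures from the interview.

```python
def missed_fraction(check_interval_s: int, outage_duration_s: int) -> float:
    """Fraction of outage start times (1-second granularity, within one
    check interval) for which no periodic check lands inside the outage.

    Assumes checks fire at t = 0, interval, 2*interval, ... and that an
    outage covers the half-open window [start, start + duration).
    """
    missed = 0
    for start in range(check_interval_s):
        end = start + outage_duration_s
        # Time of the first check at or after the outage begins
        # (ceiling division written with integer arithmetic).
        first_check = -(-start // check_interval_s) * check_interval_s
        if first_check >= end:
            # The outage is already over when the next check fires.
            missed += 1
    return missed / check_interval_s


if __name__ == "__main__":
    # Hypothetical numbers: a 20-second glitch against 5-minute and 1-minute checks.
    for interval in (300, 60):
        share = missed_fraction(interval, 20)
        print(f"interval={interval}s: {share:.1%} of 20s outages go unseen")
```

Under these assumptions, an outage shorter than the check interval is missed roughly in proportion to how much shorter it is, which is why shortening the interval helps but never fully removes the blind spot.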
And finally the fourth fundamental source of data for intelligence is what we call wire data. That’s everything on the network, from the packets to the payload of individual transactions. It is a very deep, very rich source of data. In fact, all indications are that wire data is at least one or two orders of magnitude larger than other sources of data, because there is just so much moving across our networks. And it’s definitive. We know that a transaction completes if we can observe it completing on the wire and we can observe the peers in this conversation acknowledge that that transaction completed.
To a large degree wire data has been neglected. Yes, there have been products like network probes and packet sniffers for three decades or more, but I would say they only scratch the surface of what’s available on the wire. At ExtraHop we founded the company on the premise that there is this tremendously rich, tremendously deep source of data on the wire, and that by leveraging gains in processing power and storage capacity, we could extract, analyze and derive intelligence from that data. It has required a completely different technology approach than you would take for any of the other sources of data. But it is, I believe, every bit as valuable.
I tell organizations that, as a best practice, they should probably have a product that is focused on each of these four sources. I wish I could say that there’s one that does it all, but there isn’t, because these do require pretty fundamentally different approaches.
APM providers argue they can see it all, embedded as they are in the applications. What are you providing they can’t?
APM is really focused on code-level instrumentation, and there are probably three fundamental differences between us and APM. One is philosophical. We define the application differently. APM tends to define the application as the code running on a server and they instrument that. At ExtraHop we define the application as the entire application delivery chain. That includes the client devices, the network transport, the front end, the middleware, the transaction queuing, back-end storage and even other ancillary services. It’s a chain because if any one link fails, the entire application is down, and any one link can be a bottleneck. I can’t tell you how many applications I’ve seen where the code is running fine but the application fails because something like DNS resolution isn’t completing. That has to be considered part of that delivery chain.
No.2 is audience. Traditional APM tends to be used more by developers who have intimate knowledge of the application code, whereas IT operation teams can get more out of our wire data analysis because it is focused on production-level systems. We answer the questions they care about most, like “What’s happening right now? Did something change in my environment? Are transactions succeeding or failing? Is this better or worse than it usually is? What resources are people trying to access?”
And the third difference is between custom applications and off-the-shelf packaged applications. APM solutions are much more popular with organizations that are developing custom applications because they’re writing the code and the code is changing and they need to see how that’s performing. We really sell to both. Yes, we absolutely are used by organizations that are writing custom applications, but we’re also used by organizations that are dependent on packaged applications that they don’t have very intimate knowledge of, but still absolutely care how well they’re working.
You guys deliver as an appliance, right?
Yes. We’re sold as a physical or a virtual appliance.
And where do you plug in?
For us, we just take a copy of the network traffic with no overhead at all. We’re not in line, we’re out of line. And how we get a copy of the traffic really depends on the environment. Sometimes it’s directly from one or more switches using a SPAN port or a VACL capture. Sometimes there is a whole aggregation-tapping layer that’s in place. Some organizations even use some pretty advanced SDN techniques to get us traffic to analyze. At the end of the day, if we get a feed of the traffic, we can make sense of it.
But I want to stress that, even though we’re a network deployment and we analyze what I’m calling the wire data, we’re really answering questions about the health and performance of business-critical applications. So it’s not just network teams that use an ExtraHop system. And that’s an important distinction, because I see that confusion a lot.
Do you have a sweet spot in terms of customer size?
Our high-end physical appliances can support 20 gigabits of line-rate analysis, and hundreds of thousands of transactions per second. So we have large enterprises and carriers that use multiple EH8000 appliances across the data center with an ExtraHop Central Manager to provide a unified view. Our initial customers were larger enterprises, but we’re starting to see more adoption at mid-size organizations because we also have virtual appliances that can analyze a gigabit of traffic and cost less than $10,000.
How are the virtual appliances used?
First of all, a virtual appliance can actually terminate traffic from physical systems as well as virtual systems. So the fact that it runs in a virtual appliance is really just a form factor for us to deliver. But we’re certified by Cisco to run in the Cisco UCS environment, where there is great flexibility around tapping virtual traffic. With VMware vSphere 5.1 and the distributed vSwitch, they introduced support for both RSPAN and ERSPAN and the ability to tap virtual traffic for security and monitoring purposes. And some of the announcements at VMworld around the new NSX offering afford even greater flexibility. So there are a number of approaches to take there, but I think the short answer is that virtual networking has really matured rapidly in the past 24 months or so, and we’re seeing great capabilities for tapping virtual traffic much as you would tap physical traffic.
Do efforts to virtualize everything increase the need for your type of product?
Absolutely. Any time there are additional layers of abstraction it increases the need for not just our product, but solutions to help manage that complexity. That’s a general trend. And certainly server virtualization and SDN are additional layers of abstraction and complexity. But we’ve worked with a lot of customers around things as simple as physical-to-virtual migrations, where they need to prove to the application owners that when they migrate an application from a physical environment to a virtual environment the performance and availability are the same or better. Or if they’re not, they need to be able to measure that they’re not.
And in a virtual environment, you can’t measure performance by looking at resource utilization -- how much CPU it takes or how much memory is required. Resource utilization is not the same thing as performance; it’s not the same thing as response time. In fact, in the virtual environment, we derive greater efficiency and cost savings by not having as much headroom and by utilizing those CPU and memory resources more efficiently. You actually want the CPU of your physical host to be highly utilized, but you don’t want it to be underprovisioned. And that’s the balance.
A great example of additional complexity in these environments is something we call virtual packet loss. Hypervisors are basically schedulers. They have to share resources across multiple guest machines and packets can get delayed, sometimes delayed enough that they’re considered lost by the underlying network stack. Now TCP is very resilient to loss. If loss occurs on the network TCP will retransmit, so you might see extra packets on the network, and that might affect your throughput, but it doesn’t necessarily affect your performance.
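One rough way to see those "extra packets" in a capture is to flag TCP segments whose bytes have already been sent on the same flow. The Python sketch below is a simplified heuristic under stated assumptions, not ExtraHop's method. The function name `count_retransmissions` and the `(flow_id, seq, payload_len)` tuple layout are hypothetical, and the sketch ignores sequence-number wraparound, partial overlaps, and packet reordering (which it would miscount as retransmission).

```python
def count_retransmissions(segments):
    """Count likely TCP retransmissions per flow.

    `segments` is an iterable of (flow_id, seq, payload_len) tuples in the
    order they were observed on the wire. A segment is counted as a
    retransmission when its whole byte range ends at or before the highest
    sequence number already seen for that flow.
    """
    highest_end = {}   # flow_id -> highest (seq + payload_len) seen so far
    retransmits = {}   # flow_id -> number of suspected retransmissions
    for flow, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no payload to retransmit
        end = seq + length
        seen = highest_end.get(flow)
        if seen is not None and end <= seen:
            retransmits[flow] = retransmits.get(flow, 0) + 1
        else:
            highest_end[flow] = end if seen is None else max(seen, end)
    return retransmits


if __name__ == "__main__":
    # Made-up trace: flow "a" resends the segment starting at seq 1000.
    trace = [
        ("a", 1000, 100),
        ("a", 1100, 100),
        ("a", 1000, 100),  # same bytes again: suspected retransmission
        ("b", 5, 10),
        ("a", 1200, 50),
    ]
    print(count_retransmissions(trace))
```

A count like this says only that bytes were resent. It cannot by itself tell a packet dropped on the network from one that was delayed long enough to be treated as lost.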