Is 'in situ' performance monitoring the holy grail for cloud-native apps?

Monitoring the performance of cloud-native apps at scale is daunting, and the traditional approach of doing periodic collection and analysis of statistics is simply impractical

Developers specifically design apps natively for the cloud with the expectation that they will achieve massive scale with millions or billions of concurrent users. While many aspire to be the next Facebook, Twitter, Snapchat or Uber, plenty of app developers for banks, ecommerce sites or SaaS companies design for scale that is still far beyond what was even imagined a decade ago.

Monitoring the performance of cloud applications with this kind of scale, however, is daunting, and the traditional approach of doing periodic collection and analysis of statistics is simply impractical. Only machine learning techniques, applied to intelligent performance data collection, can reduce data loads without inadvertently omitting context- and performance-sensitive data.

Microservices increase complexity

The adoption of microservices introduces more complexity to application performance monitoring. Individual microservices can be updated, replaced or changed continuously, which increases elasticity and supports the move to continuous delivery, but it also renders static performance monitoring useless.

As we divide an application into a number of linked microservices, each inside an individual container, we need a new form of cloud application monitoring—one that aggregates and correlates statistics from these individual containers into a logical monitoring unit suitable for consumption within a cloud-wide workflow.
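To make the idea of a logical monitoring unit concrete, here is a minimal sketch that rolls per-container samples up into one record per microservice. The `ContainerSample` fields, the service names and the choice of metrics are assumptions made for illustration; they do not come from any particular monitoring tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ContainerSample:
    """One hypothetical statistics sample reported by a single container."""
    service: str        # logical microservice the container belongs to
    container_id: str
    cpu_pct: float      # CPU utilization of the container, 0-100
    requests: int       # requests handled during the sampling interval
    errors: int         # failed requests during the sampling interval


def aggregate_by_service(samples):
    """Roll container-level samples up into one summary per service."""
    totals = defaultdict(
        lambda: {"containers": 0, "cpu_sum": 0.0, "requests": 0, "errors": 0}
    )
    for s in samples:
        t = totals[s.service]
        t["containers"] += 1
        t["cpu_sum"] += s.cpu_pct
        t["requests"] += s.requests
        t["errors"] += s.errors

    summary = {}
    for service, t in totals.items():
        summary[service] = {
            "containers": t["containers"],
            "avg_cpu_pct": t["cpu_sum"] / t["containers"],
            "requests": t["requests"],
            # Guard against division by zero for idle services.
            "error_rate": t["errors"] / t["requests"] if t["requests"] else 0.0,
        }
    return summary
```

A real system would also have to correlate samples in time and cope with containers that appear and disappear between intervals, which this sketch ignores.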

New cloud performance tools are now being introduced to address these problems. These tools often automate and adapt the collection of performance statistics on top of a sophisticated data model that provides layers of isolation to ensure successful operation in various cloud infrastructures without manual intervention.

However, these tools largely focus on defining events that need to be tracked across a workflow and signal to a dashboard abnormal trends that require instant attention. At best, this approach offers a coarse-grained event-processing framework that requires intimate knowledge of an application workflow.

The best insight, but it requires code changes

Alternatively, application-specific in-code instrumentation can be inserted into a region of critical code to provide a very granular view of how that region of application code is performing. While this approach gives the best insight, it requires code changes and assessment of critical code regions that are accessible only to application developers. This is a lot of work spread across a lot of developers who are then not focused on the core app.
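For readers who have not seen it, in-code instrumentation typically looks something like the sketch below: a timer wrapped around a critical region by the developer who owns that code. The names `timed_region`, `region_timings` and `handle_request` are hypothetical and exist only for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-process store: region name -> list of elapsed times in seconds.
region_timings = defaultdict(list)


@contextmanager
def timed_region(name):
    """Record how long the wrapped block of code takes to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the timing even if the wrapped code raises.
        region_timings[name].append(time.perf_counter() - start)


def handle_request(payload):
    """Stand-in for application code with one instrumented critical region."""
    with timed_region("handle_request.sort"):
        return sorted(payload)
```

Every region measured this way is a code change that someone has to choose, write and maintain, which is exactly the cost described above.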

What we need in order to understand the exact cause of poor app performance is simple. We want a view from inside the operating system of the code, as it is being executed, without any special monitoring code inserted. I call this “in situ” monitoring. It kind of sounds like I’m asking for the Holy Grail, something with special powers that brings us “app” happiness—and perhaps infinite abundance and eternal life. In the past, a quest for such a monitoring tool was fruitless, but perhaps it’s time to search again.

As has often been the case, several barriers prevent in situ performance monitoring from becoming a reality. From a resource management perspective, operating systems such as Linux are designed for sharing, which means process context switching, system call and interrupt overheads, and unavoidable contention for resources between user space and kernel space. At best, these overheads can be minimized, but they cannot be eliminated. When multi-core CPUs are added into the mix, excessive inter-core communication and unwanted cache invalidations are inadvertently introduced while providing only moderate performance scaling, which eventually plateaus after about four cores.

Reaching the Holy Grail for performance analysis

The key to reaching our Holy Grail of in situ performance analytics, I believe, is to isolate user processes from the above system overheads as much as possible. We don’t need complete isolation; near-complete isolation will improve things immensely. And while perfect isolation may not be possible, a virtual execution environment that allows an application to run in near-complete isolation, with control of its own application compute, network and I/O resources allocated from the operating system, may be viable.

And not all processes are performance-critical. We don’t have to treat all processes or threads equally to find a solution. In fact, only data path processes and threads matter.

Can an application process run in near isolation to enable in situ performance analytics in a standard Linux, container or VM environment? Well, the map for our quest is emerging, but we need to search diligently. We begin by recognizing that modern CPU cores are inherently parallel, meaning once the OS kernel allocates an application to run on a specific core, it pretty much runs on that core in isolation. This provides the foundation for CPU isolation, a critical element of our quest.
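As a small illustration of the starting point, the sketch below pins a process to a single core using Python's `os.sched_setaffinity`, which is available on Linux. The helper name `pin_to_core` is an assumption for this example. Note that pinning alone only keeps the process on that core; it does not stop the kernel from scheduling other work there, so it is a first step toward isolation rather than isolation itself.

```python
import os


def pin_to_core(core=None, pid=0):
    """Pin a process to one CPU core and return its resulting affinity set.

    pid=0 means the calling process. If no core is given, the lowest-numbered
    core the process is currently allowed to use is chosen.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")

    allowed = os.sched_getaffinity(pid)
    if core is None:
        core = min(allowed)
    if core not in allowed:
        raise ValueError(f"core {core} is not in the allowed set {sorted(allowed)}")

    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)
```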

Now looking at network and I/O: with the advent of SR-IOV and its application to Ethernet NICs and storage controllers, it is possible to use a software-defined methodology to give each application access to virtual functions allocated specifically to it. This is the next critical element in our journey.
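On Linux, a quick way to see whether a NIC exposes SR-IOV virtual functions is to read the `sriov_numvfs` and `sriov_totalvfs` attributes under the device's sysfs directory. The sketch below does only that; the helper name `sriov_vf_counts` and the `sysfs_root` parameter are assumptions for illustration, and allocating a virtual function to an application is a separate step that is not shown.

```python
from pathlib import Path


def sriov_vf_counts(interface, sysfs_root="/sys/class/net"):
    """Return the enabled and maximum SR-IOV virtual function counts for a NIC.

    A value of None means the attribute is absent, which usually indicates
    that the device or its driver does not support SR-IOV.
    """
    device = Path(sysfs_root) / interface / "device"
    counts = {}
    for name in ("sriov_numvfs", "sriov_totalvfs"):
        attr = device / name
        counts[name] = int(attr.read_text().strip()) if attr.is_file() else None
    return counts
```

For example, `sriov_vf_counts("eth0")` would report how many virtual functions are currently enabled on a hypothetical `eth0` and how many the hardware can provide.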

Recognizing that CPU isolation and network and I/O application-specific virtualization are enabling technologies available to dynamically construct a virtual execution environment for an application, we see it may be possible to determine—without inserting monitoring code—whether an application is bound by compute, network or I/O right at the time congestion occurs.
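If such an environment can report per-application utilization of each resource, deciding what the application is bound by becomes a simple classification. The sketch below assumes, purely for illustration, a stream of `(timestamp, utilization)` samples with values between 0 and 1 and a fixed saturation threshold of 0.9; neither the sample format nor the threshold comes from an existing tool.

```python
RESOURCES = ("compute", "network", "io")


def first_congestion(samples, threshold=0.9):
    """Find the first sample in which some resource is saturated.

    samples is an iterable of (timestamp, utilization) pairs, where
    utilization maps each resource name to a fraction between 0 and 1.
    Returns (timestamp, resource) for the busiest resource at the first
    saturated sample, or None if nothing reaches the threshold.
    """
    for timestamp, utilization in samples:
        busiest = max(RESOURCES, key=lambda r: utilization.get(r, 0.0))
        if utilization.get(busiest, 0.0) >= threshold:
            return timestamp, busiest
    return None
```

The interesting part is not the comparison but where the numbers come from: they would be observed from outside the application, with no monitoring code inserted into it.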

That, to me, is the Holy Grail. It is the in situ performance monitoring with per-application granularity that we have dreamed of, and that we must now institute if we are to scale performance monitoring to match the needs of cloud-native applications.

This article is published as part of the IDG Contributor Network.