Heat maps can reveal, though not always explain, systems latency, an Oracle engineer argues
While data-center managers have long used heat maps to help determine where to best position racks of servers and cooling units, this mode of visualization can also be handy for better understanding system latency, argues an Oracle engineer in the July issue of Communications of the ACM.
"Presenting latency as a heat map is an effective way to identify subtle characteristics that may otherwise be missed," writes Brendan Gregg, a principal software engineer at Oracle, in the article "Visualizing system latency."
Gregg also cautioned that while such visualization can give a broader overview of what is taking place, it doesn't always explain the behavior being observed. Still, heat maps can provide insight into tackling the next generation of data-center latency issues.
Pinpointing the causes of system sluggishness has long been a frustration for data-center managers and system administrators. Network analysis tools are available to visualize network performance, though other aspects of a system, such as the responsiveness of disks in a storage array, have been harder to quantify.
Sun Microsystems has long offered one tool for its Solaris operating system, called DTrace, that can characterize latency within various parts of a system on a second-by-second basis. The overwhelming data it can produce, however, still needs to be boiled down into a readily understandable form.
Enter Gregg's heat maps. Heat maps are a simple visualization technique in which, on a two-dimensional graph, different values are represented by different colors.
Heat maps can reveal more than the line graphs on most network analysis tools, because while line graphs "would allow average latency to be examined over time, the actual makeup or distribution of that latency cannot be identified beyond a maximum, if provided," he writes.
Heat maps are also good for rapidly identifying outliers, which then can be examined in greater detail, he argued.
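The point about averages hiding the distribution can be illustrated with a minimal Python sketch. The two sets of latency values below are invented for illustration and do not come from Gregg's article; the function names are likewise hypothetical. Both workloads report the same average and maximum, yet a simple bucketed count shows they behave quite differently.

```python
import statistics

def summarize(latencies_ms):
    """Return the two figures a typical line graph would expose."""
    return {"mean": statistics.mean(latencies_ms), "max": max(latencies_ms)}

def histogram(latencies_ms, bucket_ms):
    """Count samples per latency bucket, keyed by the bucket's lower bound."""
    buckets = {}
    for value in latencies_ms:
        lower = int(value // bucket_ms) * bucket_ms
        buckets[lower] = buckets.get(lower, 0) + 1
    return dict(sorted(buckets.items()))

# Hypothetical samples (milliseconds), chosen so mean and max match.
steady = [5.0] * 9 + [15.0]
bimodal = [1.0] * 5 + [10.0] * 4 + [15.0]

print(summarize(steady), histogram(steady, 5))
print(summarize(bimodal), histogram(bimodal, 5))
```

In this sketch the first workload clusters around one value with a single outlier, while the second splits into fast and slow groups, a difference that the average and maximum alone would not show.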
For the article, Gregg plotted a variety of unusual workload conditions, using the Oracle Analytics visualization software to render data gathered by DTrace. He set the X axis to represent time and the Y axis to represent latency. The darkest colors represented the most input/output operations.
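The layout described here, with time on one axis, latency on the other, and darker cells for more I/O, amounts to a two-dimensional histogram. The following Python sketch shows the general idea only; it is not the software Gregg used, and the function names, bucket sizes, and sample values are assumptions made for illustration.

```python
from collections import Counter

def latency_heat_map(samples, time_step, latency_step):
    """Count I/O events per (time bucket, latency bucket) cell.

    samples is an iterable of (timestamp, latency) pairs.
    """
    counts = Counter()
    for timestamp, latency in samples:
        col = int(timestamp // time_step)    # X axis: time
        row = int(latency // latency_step)   # Y axis: latency
        counts[(col, row)] += 1
    return counts

def render(counts, shades=" .:*#"):
    """Draw the grid as text, highest latency on top, denser cells darker."""
    if not counts:
        return ""
    max_col = max(col for col, _ in counts)
    max_row = max(row for _, row in counts)
    peak = max(counts.values())
    lines = []
    for row in range(max_row, -1, -1):
        cells = []
        for col in range(max_col + 1):
            n = counts.get((col, row), 0)
            # Empty cells stay blank; any non-empty cell gets at least
            # the lightest mark, and the busiest cell gets the darkest.
            idx = 0 if n == 0 else max(1, (n * (len(shades) - 1)) // peak)
            cells.append(shades[idx])
        lines.append("".join(cells))
    return "\n".join(lines)

# Hypothetical samples: (seconds since start, latency in milliseconds).
samples = [(0.1, 1.5), (0.2, 1.7), (0.9, 1.2), (1.5, 8.0), (1.6, 1.1)]
grid = latency_heat_map(samples, time_step=1.0, latency_step=1.0)
print(render(grid))
```

In this sketch the single 8 ms sample appears as a faint, isolated mark high above the dense band near 1 ms, which is how an outlier would stand out on such a plot.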
In many cases, he found simple workloads can produce a variety of complex -- and sometimes unexplainable -- patterns.
In one case, a small amount of data was sequentially written to a pool of disks. Gregg expected to see only "white noise" representing random latency. Instead, the heat map showed latency levels rising and falling in distinct patterns for some unknown reason. "Visualizing latency in this way clearly poses more questions than it provides answers," he said.
Another pattern proved equally mysterious. The test involved sending a stream of data to 44 disks. First, data would be sent to only one disk, then to two disks, and so on, until all 44 disks were receiving data.
Gregg expected disk latency to increase in a linear fashion as the system buses became saturated with data.
Instead the latency would increase, then subside somewhat, before increasing some more.
He called this pattern the rainbow pterodactyl, in that the heat map resembled the profile of a colorful flying dinosaur.
"To summarize the rainbow pterodactyl: little is known with accuracy, and much more investigation is needed. What this does show is how deep a simple visualization can become," he writes.
Gregg also used a heat map to reveal the shock effects that loud noise has on servers, a phenomenon that he demonstrated a few years back on YouTube.
Although these heat maps were produced on a system running the Zettabyte File System (ZFS) accessed over the Network File System (NFS) protocol, this approach could be used for characterizing the operations of other file systems, and even other components such as CPUs, Gregg writes.