Kernel space: kerneloops, read-mostly, and I/O port 80

New kernel changes will help organize bug reports, use cache more efficiently, and prevent crashes on some x86-64 systems.

Kerneloops. Triage is an important part of a kernel developer's job. A project as large and as widely-used as the kernel will always generate more bug reports than can be realistically addressed in the amount of time which is available. So developers must figure out which reports are most deserving of their attention. Sometimes the existence of an irate, paying customer makes this decision easy. Other times, though, it is a matter of making a guess at which bugs are affecting the largest numbers of users. And that often comes down to how many different reports have come in for a given problem.

Of course, counting reports is not the easiest thing to do, especially if they are not all sent to the same place. In an attempt to make this process easier, Arjan van de Ven has announced a new site at Arjan has put together some software which scans certain sites and mailing lists for posted kernel oops output; whenever a crash is found, it is stuffed into a database. Then an attempt is made to associate reports with each other based on kernel version and the call trace; from that, a list of the most popular ways to crash can be created. As of this writing, the current fashion for kernel oopses would appear to be in ieee80211_tx() in current development kernels. Some other information is stored with the trace; in particular, it is possible to see what the oldest kernel version associated with the problem is.

This is clearly a useful resource, but there are a couple of problems which make it harder to do the job properly. One is that there is no distinctive marker which indicates the end of an oops listing, so the scripts have a hard time knowing where to stop grabbing information. The other is that multiple reports of the same oops can artificially raise the count for a particular crash. The solution to both problems is to place a marker at the end of the oops output which includes a random UUID generated at system boot time. Patches to this effect are circulating, though getting the random number into the output turns out to be a little harder than one might have expected. So, for 2.6.24, the "random" number may be all zeroes, with the real problem to be solved in 2.6.25.

Read-mostly. Anybody who digs through kernel source for any period of time will notice a number of variables declared in a form like this:

    static int __read_mostly ignore_loglevel;

The __read_mostly attribute says that accesses to this variable are usually (but not always) read operations. There were some questions recently about why this annotation is done; the answer is that it's an important optimization, though it may not always be having the effect that developers are hoping for.

As is well described in What every programmer should know about memory, proper use of processor memory caches is crucial for optimal performance. The idea behind __read_mostly is to group together variables which are rarely changed so they can all share cache lines which need not be bounced between processors on multiprocessor systems. As long as nobody changes a __read_mostly variable, it can reside in a shared cache line with other such variables and be present in cache (if needed) on all processors in the system.

The read-mostly attribute generally works well and yields a measurable performance improvement. There are concerns, though, that this feature could be over-used. Andrew Morton expressed it this way:

So... once we've moved all read-mostly variables into __read_mostly, what is left behind in bss? All the write-often variables. All optimally packed together to nicely maximise cacheline sharing.

Combining frequently-written variables into shared cache lines is a good way to maximize the bouncing of those cache lines between processors - which would be bad for performance. So over-aggressive segregation of read-mostly variables to minimize cache line bouncing could have the opposite of the desired effect: it could make the kernel's cache behavior worse.

The better way, says Andrew, would have been to create a "read often" attribute for variables which are frequently used in a read-only mode. That would leave behind the numerous read-rarely variables to serve as padding keeping the write-often variables nicely separated from each other. Thus far, patches to make this change have not been forthcoming.

I/O port delays. The functions provided by the kernel for access to I/O ports have long included versions which insert delays. A driver would normally read a byte from a port with inb(), but inb_p() could be used if an (unspecified) short delay was needed after the operation. A look through the driver tree shows that quite a few drivers use the delayed versions of the I/O port accessors, even though, in many cases, there is no real need for that delay.

This delay is implemented (on x86 architectures) with a write to I/O port 80. There is generally no hardware listening for an I/O operation on that port, so this write has the sole effect of delaying the processor while the bus goes through an abortive attempt to execute the operation. It is an operation with reasonably well-defined semantics, and it has worked for Linux for many years.

Except that now, it seems, this technique no longer works on a small subset of x86_64 systems. Instead, the write to port 80 will, on occasion, freeze the system hard; this, in turn, generates a rather longer delay than was intended. One could imagine the creation of an elaborate mechanism for restarting I/O operations after the user resets the system, but the kernel developers, instead, chose to look for alternative ways of implementing I/O delays.

In almost every case, the alternative form of the delay is a call to udelay(). The biggest problem here is that udelay() works by sitting in a tight loop; it cannot know how many times to go through the loop until the speed of the processor has been calibrated. That calibration happens reasonably early in the boot process, but there are still tasks to be performed - including I/O port operations - first. This problem is being worked around by removing some delayed operations from the early setup code, but some developers worry that it will never be possible to get them all. It has been suggested that the kernel could just assume it's running on the fastest-available processor until the calibration happens, but, beyond being somewhat inelegant, that could significantly slow the bootstrap process on slower machines - all of which work just fine with the current code.

The real solution is to simply get rid of almost all of the delayed I/O port operations. Very few of them are likely to be needed with any hardware which still works. In some cases, what may really be going on is that the delays are being used to paper over driver bugs - such as failing to force a needed PCI write out by doing a read operation. Just removing the delays outright would probably cause instability in unpredictable places - not a result most developers are striving for. So the task of cleaning up those calls will have to be done carefully over time. Meanwhile, the use of port 80 will probably remain unchanged for 2.6.24.

Learn more about this topic

LWN article with comments

This story, "Kernel space: kerneloops, read-mostly, and I/O port 80" was originally published by LinuxWorld-(US).

Join the Network World communities on Facebook and LinkedIn to comment on topics that are top of mind.

Copyright © 2008 IDG Communications, Inc.

SD-WAN buyers guide: Key questions to ask vendors (and yourself)