Kernel space: Toward better direct I/O scalability
A new function in the kernel could reduce the need for
a kernel lock that can be a bottleneck for high-end
databases. Kernel hacker Nick Piggin posts some
performance numbers.
By Jonathan Corbet, LinuxWorld.com
April 10, 2008 06:01 PM ET
Linux enthusiasts like to point out just how scalable the system is; Linux runs on everything from pocket-size devices to
supercomputers with several thousand processors. What gets discussed less often is that, at the high end, the true
scalability of the system is limited by the sort of workload being run. CPU-intensive scientific computing tasks can make
good use of very large systems, but database-heavy workloads do not scale nearly as well. There is a lot of interest in making
big database systems work better, but it has been a challenging task. Nick Piggin appears to have come up with a logical next
step in that direction, though, with a relatively straightforward set of core memory management changes.
For some time, Linux has supported direct I/O from user space. This, too, is a scalability technology: the idea is to save
processor time and memory by avoiding the need to copy data through the kernel as it moves between the application and the
disks. With sufficient programming effort, the application should be able to make use of its superior knowledge of its own
data access patterns to cache data more effectively than the kernel can; direct I/O allows that caching to happen without
additional overhead. Large database management systems have had just that kind of programming effort applied to them, with
the result that they use direct I/O heavily. To a significant extent, these systems use direct I/O to replace the kernel's
paging algorithms with their own, specialized code.
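As a rough user-space illustration of what a direct I/O request looks like, the sketch below opens a file with O_DIRECT and reads into an aligned buffer. The helper name dio_read() and the 4096-byte alignment are assumptions made for this example; the real alignment requirement depends on the device and filesystem, and some filesystems reject O_DIRECT altogether.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes O_DIRECT in <fcntl.h> on Linux */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0               /* fallback: flag unavailable, I/O stays buffered */
#endif

#define DIO_ALIGN 4096           /* assumed alignment, not a universal constant */

/*
 * Hypothetical helper: read `len` bytes at `offset` from `path` using
 * direct I/O. On success, *out points to an aligned buffer that the
 * caller must free(), and the number of bytes read is returned.
 * On any failure, *out is NULL and -1 is returned.
 */
ssize_t dio_read(const char *path, off_t offset, size_t len, void **out)
{
    void *buf = NULL;
    ssize_t n;
    int fd;

    assert(out != NULL);
    *out = NULL;

    /* Direct I/O generally wants an aligned buffer, length, and offset. */
    if (len == 0 || len % DIO_ALIGN != 0 || offset % DIO_ALIGN != 0)
        return -1;
    if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {                /* missing file, or O_DIRECT not supported */
        free(buf);
        return -1;
    }

    n = pread(fd, buf, len, offset);
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }

    *out = buf;
    return n;
}
```

An application that manages its own cache would keep the returned buffer around rather than relying on the kernel's page cache to hold a second copy of the data.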
When the kernel is asked to carry out a direct I/O operation, one of the first things it must do is to pin all of the relevant
user-space pages into memory and locate their physical addresses. The function which performs this task is get_user_pages():
    int get_user_pages(struct task_struct *tsk,
                       struct mm_struct *mm,
                       unsigned long start,
                       int len,
                       int write,
                       int force,
                       struct page **pages,
                       struct vm_area_struct **vmas);
A successful call to get_user_pages() will pin len pages into memory, starting at the user-space address start as seen in the given mm. Pointers to the corresponding struct page structures will be stored in pages, and the associated VMA pointers in vmas if it is not NULL.
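Since len counts pages rather than bytes, a caller holding a byte-sized buffer has to work out how many pages that buffer touches. The arithmetic can be sketched in plain C as follows; the helper name pages_spanned() and the 4 KiB page size are assumptions for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12     /* assumes 4 KiB pages */

/*
 * Hypothetical helper: number of pages touched by the byte range
 * [start, start + bytes). The range must not be empty.
 */
size_t pages_spanned(uintptr_t start, size_t bytes)
{
    uintptr_t first, last;

    assert(bytes > 0);
    first = start >> SKETCH_PAGE_SHIFT;               /* page of first byte */
    last  = (start + bytes - 1) >> SKETCH_PAGE_SHIFT; /* page of last byte  */
    return (size_t)(last - first + 1);
}
```

A 4096-byte buffer that starts in the middle of a page straddles two pages, so an unaligned buffer can need one more page pinned than its size alone suggests.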
This function works, but it has a problem (beyond the fact that it is a long, twisted, complex mess to read): it requires
that the caller hold mm->mmap_sem. If two processes are performing direct I/O within the same address space - a common scenario for large database management
systems - they will contend for that semaphore. This kind of lock contention quickly kills scalability; as soon as processors
have to wait for each other, there is little to be gained by adding more of them.
There are two common approaches to take when faced with this sort of scalability problem. One is to go with more fine-grained
locking, where each lock covers a smaller part of the kernel. Splitting up locks has been happening since the initial creation
of the Big Kernel Lock, which is the definitive example of coarse-grained locking. There are limits to how much fine-grained
locking can help, though, and the addition of more locks comes at the cost of more complexity and more opportunities to create
deadlocks.
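The difference between the two locking granularities can be sketched in user space with POSIX mutexes. This is only an analogy, not kernel code: the table types and function names below are hypothetical, and a set of counters stands in for whatever data the locks protect.

```c
#include <assert.h>
#include <pthread.h>

#define NBUCKETS 8

/* Coarse-grained: one lock serializes updates to every bucket. */
struct coarse_table {
    pthread_mutex_t lock;
    long count[NBUCKETS];
};

/* Fine-grained: each bucket carries its own lock. */
struct fine_bucket {
    pthread_mutex_t lock;
    long count;
};

struct fine_table {
    struct fine_bucket bucket[NBUCKETS];
};

void coarse_init(struct coarse_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    for (int i = 0; i < NBUCKETS; i++)
        t->count[i] = 0;
}

void fine_init(struct fine_table *t)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&t->bucket[i].lock, NULL);
        t->bucket[i].count = 0;
    }
}

/* Every caller, whatever its key, waits on the same lock. */
void coarse_add(struct coarse_table *t, unsigned key)
{
    pthread_mutex_lock(&t->lock);
    t->count[key % NBUCKETS]++;
    pthread_mutex_unlock(&t->lock);
}

/* Callers contend only when their keys map to the same bucket. */
void fine_add(struct fine_table *t, unsigned key)
{
    struct fine_bucket *b = &t->bucket[key % NBUCKETS];

    pthread_mutex_lock(&b->lock);
    b->count++;
    pthread_mutex_unlock(&b->lock);
}

long coarse_total(struct coarse_table *t)
{
    long sum = 0;

    pthread_mutex_lock(&t->lock);
    for (int i = 0; i < NBUCKETS; i++)
        sum += t->count[i];
    pthread_mutex_unlock(&t->lock);
    return sum;
}

/*
 * The price of splitting: a whole-table operation now takes every lock
 * in turn, and the sum is not a single atomic snapshot of the table.
 */
long fine_total(struct fine_table *t)
{
    long sum = 0;

    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_lock(&t->bucket[i].lock);
        sum += t->bucket[i].count;
        pthread_mutex_unlock(&t->bucket[i].lock);
    }
    return sum;
}
```

The fine-grained version lets threads working on different buckets proceed in parallel, but fine_total() hints at the added complexity: any operation that needs more than one lock must take them in a consistent order to avoid deadlock.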