Intel’s processor flaw is a virtualization nightmare

Design flaw affects the processes behind virtualization the hardest. Users could see a VM slowdown of 20 to 30 percent.

2018 is off to a very bad start for Intel after the disclosure of a flaw deep in the design of its processors, dubbed Meltdown. And while the company has publicly said the issue won’t affect consumers, consumers aren’t the ones who need to be worried.

The issue lies in how Intel processors work with page tables for handling virtual memory. It is believed that an attacker could observe the contents of privileged memory by abusing a technique called speculative execution.

Speculative execution exploit

Speculative execution is part of a methodology called out-of-order execution (OOE), in which the CPU makes an educated guess about what will happen next based on the data it has and starts on that work early. It’s designed to speed up the CPU rather than leave it burning cycles while it waits for earlier instructions to finish. It’s all meant to make the CPU as efficient as possible.

Intel has been mum on how long the problem has been around, but it’s believed to date back to its move to 64-bit processors and the Penryn/Merom family of processors in 2006. Intel was first informed of the problem back in June 2017 by Google researchers, and Google kept quiet about it while Intel and the OS vendors addressed the problem. Google has since published its findings.

How to fix the speculative execution flaw

All told, there are three variants of the problem, all of them unique to how Intel handles speculative execution, and all three can be worked around, but only through the operating system. The errors are baked into the silicon, and no BIOS update will fix them. Only an OS fix will work. Linux distros are already rolling out fixes, and Microsoft is expected to introduce one in a future Patch Tuesday update.

A permanent fix requires rearchitecting the CPUs. How long that will take is open to debate. Jim McGregor of Tirias Research said a design fix could add six to nine months to Intel’s roadmap, while Nathan Brookwood of Insight64 says two to four years. Intel was informed last June, but it’s unclear whether it was able to work changes into the chips on its 2018 roadmap.

Normally, the OS kernel and apps share an address space in memory to optimize performance when the app makes OS calls, so there is no need to switch page tables when an app calls the kernel and the kernel returns data. The solution is to preclude an app from sharing the kernel memory space, which means the page tables have to be switched on every such call. That’s going to add a lot of overhead to every OS call. In effect, the kernel’s view of memory has to be swapped in and the app’s swapped out, and then vice versa.

The worst part is that this has to happen whenever there is an interrupt. What causes an interrupt? Well, let’s start with I/O, like a disk read or write or a network connection. Now, instead of keeping the OS kernel and the app mapped in memory together, CPUs are going to switch between one and the other. It will happen at CPU speeds, which is to say exceptionally fast, but it’s still going to impact performance.

It also impacts any scenario where the OS and an app talk to each other. Can you think of a more intensive situation than a virtualized server running dozens of VMs, each with its own OS instance, talking to the hardware through the hypervisor? Virtualization is going to be hit the hardest by this.
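To make the page-table switching described above concrete, here is a minimal toy model in Python. It is only a sketch: the class `ToyCPU`, the function `count_switches`, and the idea that a system call and an interrupt each cost exactly two switches are assumptions made for illustration, not a description of how any real kernel or patch is implemented.

```python
# Toy model (hypothetical names throughout): count how many page-table
# switches a workload triggers with and without kernel/app isolation.

class ToyCPU:
    def __init__(self, isolated):
        self.isolated = isolated      # True models the patched behavior
        self.active_table = "app"     # which set of page tables is live
        self.switches = 0

    def _use_table(self, name):
        # Only count a switch when the live table actually changes.
        if self.active_table != name:
            self.active_table = name
            self.switches += 1

    def cross_into_kernel_and_back(self):
        # Unpatched: kernel and app share one address space, nothing to switch.
        # Patched: switch to the kernel's tables on entry, back on exit.
        if self.isolated:
            self._use_table("kernel")
            self._use_table("app")


def count_switches(isolated, syscalls, interrupts):
    cpu = ToyCPU(isolated)
    # In this model an interrupt crosses the boundary the same way a
    # system call does, so both are handled identically.
    for _ in range(syscalls + interrupts):
        cpu.cross_into_kernel_and_back()
    return cpu.switches


if __name__ == "__main__":
    print("shared address space:", count_switches(False, 100, 50))
    print("isolated page tables:", count_switches(True, 100, 50))
```

In this sketch the shared-address-space case never switches at all, while the isolated case pays two switches for every system call or interrupt. A host running many VMs multiplies the number of crossings, which is the intuition behind the claim that virtualization suffers most.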

How performance is impacted

How much impact? The Register estimates anywhere from 5 percent to 30 percent, depending on the task, while an open source site called Phoronix ran tests of patched Linux systems and put the hit at between 7 percent and 20 percent for things like databases, but virtually no impact at all on games. One analyst told me of anecdotal stories of Amazon Web Services (AWS) slowing down in the past week as the fixes are rolled out, but I can’t find anything to back that up.
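Why the hit would vary so much by task can be shown with some back-of-the-envelope arithmetic. The sketch below assumes the slowdown is simply the number of kernel crossings per second multiplied by an extra cost per crossing. The function name `extra_runtime_fraction` and every number passed to it are illustrative assumptions, not measurements from The Register, Phoronix, or anyone else.

```python
# Back-of-the-envelope model (all figures are made-up assumptions):
# extra run time = crossings per second * added cost per crossing.

def extra_runtime_fraction(crossings_per_sec, extra_ns_per_crossing):
    """Return the added run time as a fraction of the original run time.

    crossings_per_sec: system calls plus interrupts per second of work
    extra_ns_per_crossing: assumed added cost of each crossing, in nanoseconds
    """
    added_seconds = crossings_per_sec * extra_ns_per_crossing * 1e-9
    return added_seconds


if __name__ == "__main__":
    # Hypothetical I/O-heavy workload, such as a busy database.
    io_heavy = extra_runtime_fraction(200_000, 1_000)
    # Hypothetical compute-bound workload, such as a game.
    compute_bound = extra_runtime_fraction(5_000, 1_000)
    print(f"I/O heavy:     {io_heavy:.1%} longer")
    print(f"compute bound: {compute_bound:.1%} longer")
```

With these assumed inputs, the I/O-heavy case runs about 20 percent longer while the compute-bound case barely moves, which matches the shape of the reported results: workloads that call into the kernel constantly pay the most.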

Intel said it has not seen any exploits in the wild and that the exploit only allows for reading the contents of memory, not altering it. But that’s more than enough. The greatest threat is to multi-tenant scenarios where multiple AWS or Microsoft Azure customers have their VMs on the same CPU and one user is able to peek into the contents of another VM.

That is completely unacceptable to any customer. But so is a VM slowdown of 20 percent to 30 percent. Intel’s year just went into the toilet, and we’re only three days into it.

Little to no exposure for AMD chips

And here’s the kicker: AMD has minimal if any exposure and said so, despite Intel saying AMD is at risk. Even though AMD came up with 64-bit extensions, which Intel licenses, the two firms implemented their 64-bit architectures in completely different ways.

The difference is AMD’s chips don’t do speculative loads if there is the potential for memory access violations. They don’t load data beyond the branch point, so no predicting is done. Intel does the exact opposite. It’s more aggressive in its use of branch prediction, and that bit the company.
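The contrast between the two approaches can be sketched with a toy model. This is a deliberate simplification and not a description of either company's silicon: the class `ToyCore`, its `check_before_load` flag, and the use of a Python set to stand in for the processor cache are all assumptions made for illustration. The one outside fact it leans on is that a speculative load can leave a trace in the cache even when its result is thrown away.

```python
# Toy model (hypothetical, not real microarchitecture): compare a core that
# checks permissions before a speculative load with one that checks after.

class ToyCore:
    def __init__(self, check_before_load):
        self.check_before_load = check_before_load
        self.cache = set()   # addresses touched; survives a rollback

    def speculative_load(self, address, privileged, user_mode=True):
        violation = privileged and user_mode
        if self.check_before_load and violation:
            # Conservative policy: the load is never issued.
            return "blocked"
        # Aggressive policy (or a legal access): the load goes ahead
        # and leaves a footprint in the cache.
        self.cache.add(address)
        if violation:
            # The result is discarded, but the cache entry remains.
            return "rolled back"
        return "completed"


if __name__ == "__main__":
    conservative = ToyCore(check_before_load=True)
    aggressive = ToyCore(check_before_load=False)
    secret = 0x1000  # stand-in for a privileged kernel address

    print(conservative.speculative_load(secret, privileged=True),
          secret in conservative.cache)
    print(aggressive.speculative_load(secret, privileged=True),
          secret in aggressive.cache)
```

In the conservative core the privileged address never reaches the cache, so there is nothing left behind to observe. In the aggressive core the access is rolled back, but the footprint stays, and that leftover state is what an attacker would try to measure.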

More as it develops. This is not a one-day story.
