How server disaggregation could make cloud data centers more efficient

Standard servers are wasteful of resources, but future systems may be configurable to match the requirements of the workload


The growth in cloud computing has shone a spotlight on data centers, which by some estimates already consume at least 7 percent of the global electricity supply, a share that is still growing. This has led the IT industry to search for ways of making infrastructure more efficient, including some efforts that attempt to rethink the way computers and data centers are built in the first place.

In January, IBM researchers presented a paper at the High Performance and Embedded Architecture and Compilation (HiPEAC) conference on their work toward a disaggregated computer architecture. This work is part of the EU-funded dReDBox project, which in turn belongs to the Horizon 2020 research and innovation program.

Disaggregation means separating servers into their constituent compute and memory resources so that these can be allocated according to the needs of each workload. At present, servers are the basic building blocks of IT infrastructure, but a workload cannot use more memory or CPU resources than are available in a single server, nor can servers easily share any spare resources outside their own box.

“Workloads deployed to data centers often have a big disproportionality in the way they use resources. There are some workloads that consume lots of CPU but don’t need much memory, and on the other hand other workloads that will use up to four orders of magnitude more memory than CPU,” said Dr Andrea Reale, Research Engineer for IBM.

Across the data center, this means that some servers will be utilizing all their CPUs but still have lots of spare memory, while for others it will be vice versa, and these resources continue to draw power even if they are not being used. According to Reale, about 16 percent of CPU and 30 percent of memory resources in a typical data center may be wasted this way.
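To make the stranding effect concrete, here is a minimal Python sketch. The server shape (16 CPUs, 64GB), the two workloads and the first-fit placement policy are all assumptions chosen for illustration; they are not figures or methods from the dReDBox project, and each workload is assumed to fit within one server.

```python
from dataclasses import dataclass

# Hypothetical fixed server shape, chosen only for illustration.
SERVER_CPUS = 16
SERVER_MEM_GB = 64


@dataclass
class Workload:
    cpus: int
    mem_gb: int


def place_first_fit(workloads):
    """Place workloads on identical servers; return [free_cpus, free_mem_gb] per server."""
    servers = []
    for w in workloads:
        for s in servers:
            if s[0] >= w.cpus and s[1] >= w.mem_gb:
                s[0] -= w.cpus
                s[1] -= w.mem_gb
                break
        else:
            # No existing server has room on BOTH dimensions, so power up a new one.
            servers.append([SERVER_CPUS - w.cpus, SERVER_MEM_GB - w.mem_gb])
    return servers


def stranded_fraction(workloads):
    """Fraction of provisioned CPU and memory left idle after placement."""
    servers = place_first_fit(workloads)
    total_cpu = len(servers) * SERVER_CPUS
    total_mem = len(servers) * SERVER_MEM_GB
    free_cpu = sum(s[0] for s in servers)
    free_mem = sum(s[1] for s in servers)
    return free_cpu / total_cpu, free_mem / total_mem


# One CPU-heavy and one memory-heavy workload cannot share a box.
cpu_idle, mem_idle = stranded_fraction([Workload(16, 8), Workload(2, 64)])
```

In this toy case two servers are needed even though the combined demand is 18 CPUs and 72GB, so a large share of both resources sits idle while still drawing power.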

But what if you could compose servers under software control to have as many CPUs and as much memory as each particular workload requires?

Separating compute and memory

The dReDBox project aims to address this by using discrete compute and memory modules known as bricks. These are connected together by high speed links, enabling enough compute bricks to be paired with enough memory bricks to meet the requirements of whichever workload is running at a given moment. In theory, this enables a server to be composed for a specific application, with as many CPU cores and as much memory as the job requires, and those resources can then be returned to the pool and used for something else once the workload is no longer required.
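The compose-and-release cycle described above can be sketched as a simple pool of bricks. This is only an illustrative model under stated assumptions: the class name `BrickPool`, its methods and the brick identifiers are hypothetical and do not come from the dReDBox software.

```python
from dataclasses import dataclass


@dataclass
class BrickPool:
    """Hypothetical pool of free bricks, identified by name."""
    free_compute: list
    free_memory: list

    def compose(self, n_compute, n_memory):
        """Reserve bricks for one workload and return the composed 'server'."""
        if n_compute > len(self.free_compute) or n_memory > len(self.free_memory):
            raise RuntimeError("not enough free bricks in the pool")
        compute = [self.free_compute.pop(0) for _ in range(n_compute)]
        memory = [self.free_memory.pop(0) for _ in range(n_memory)]
        return {"compute": compute, "memory": memory}

    def release(self, server):
        """Return a composed server's bricks to the pool for reuse."""
        self.free_compute.extend(server["compute"])
        self.free_memory.extend(server["memory"])


pool = BrickPool(["c0", "c1", "c2"], ["m0", "m1", "m2", "m3"])
analytics = pool.compose(n_compute=1, n_memory=3)  # a memory-heavy job
pool.release(analytics)                            # bricks become available again
```

A real controller would also have to configure the interconnect between the chosen bricks; that step is left out here.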

As part of its research project, the dReDBox team has built a demonstration system where the bricks are built around Xilinx Zynq UltraScale+ ARM-based system-on-chip (SoC) silicon. The compute bricks have a small amount of local memory, while the memory bricks have a much larger amount of DDR4 memory that they serve up for the compute bricks.

There are also two other kinds of brick in the dReDBox architecture: accelerator bricks, which may provide either GPU or FPGA hardware to boost applications like machine learning or analytics, and a controller brick, which is a special type of brick that manages all the others.

To fit in with existing data center infrastructure, the dReDBox team envisages that the bricks in any production deployment would be housed in a 2U enclosure resembling a standard rack-mount server system. These enclosures may contain any mixture of brick types.

The beauty of this modular arrangement is that it also makes for easy upgrades: the operator can simply replace compute bricks with newer, higher-performance ones, or likewise swap memory bricks for ones with greater capacity, rather than scrap the entire server.

However, the key part of the whole architecture is the interconnect technology that links the bricks together. This has to be both high-speed and low-latency, otherwise performance would take a hit whenever a compute brick reads data stored in a memory brick.

Low-latency architecture

For its demonstration system, the dReDBox team has used an electrical switch matrix to connect bricks within an enclosure, while an optical switch matrix links bricks in different enclosures within the rack. Unusually for an IT environment, these switch matrices are circuit switched, meaning they create a dedicated pathway between bricks once configured, unlike a packet-switched network such as Ethernet, where data is routed to its destination based on the address in the data packet.

This arrangement was chosen precisely because of the need for low latency, according to Reale.

“Having circuit switched compared to packet switched allows you to have a much lower latency for memory requests when going from compute brick to memory brick,” he said.

In fact, Reale claims that even with research-grade hardware, the dReDBox system was able to demonstrate end-to-end latency of well below 1 microsecond for remote memory accesses, and that with production-grade processor chips running at full clock speeds, performance would be much better.
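The difference between the two switching styles can be sketched in a few lines of Python. This is a toy model, not the dReDBox switch design: the class `CircuitSwitch`, the function `packet_forward` and the port names are hypothetical. The point it illustrates is that a circuit switch makes its routing decision once, at configuration time, whereas a packet switch inspects an address on every transfer.

```python
class CircuitSwitch:
    """Toy crossbar: each port is wired to at most one other port."""

    def __init__(self, ports):
        self.peer = {p: None for p in ports}

    def connect(self, a, b):
        """Configure a dedicated two-way path between ports a and b."""
        if a == b:
            raise ValueError("cannot connect a port to itself")
        if self.peer[a] is not None or self.peer[b] is not None:
            raise RuntimeError("port already belongs to a circuit")
        self.peer[a] = b
        self.peer[b] = a

    def disconnect(self, a):
        """Tear down the circuit that port a belongs to, if any."""
        b = self.peer[a]
        if b is not None:
            self.peer[a] = None
            self.peer[b] = None

    def forward(self, in_port, payload):
        """No address lookup: data exits at whichever port is wired to in_port."""
        out = self.peer[in_port]
        if out is None:
            raise RuntimeError("no circuit configured for this port")
        return out, payload


def packet_forward(table, packet):
    """Packet switching: read the destination from each packet and look it up."""
    return table[packet["dst"]], packet["payload"]


switch = CircuitSwitch(["cpu0", "mem0", "mem1"])
switch.connect("cpu0", "mem1")                 # set up once, out of band
out_port, data = switch.forward("cpu0", b"read 0x1000")
```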

Another advantage of having circuit-switched links between compute and memory bricks is that, to software, the arrangement looks exactly the same as a standard server where the memory is directly connected to the CPU.

“We are using some existing operating system extensions like NUMA support for non-uniform memory representation in Linux to represent the distance of memory for applications that are aware of the architecture, while for other applications that are not aware, they can just assume it is local memory, they don’t need to know where the memory is,” Reale said.
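As a rough illustration of how NUMA distances can guide placement, the sketch below parses one row of a distance table and picks the nearest node with enough free memory. On Linux, each NUMA node exposes such a row in `/sys/devices/system/node/node<N>/distance`, with 10 conventionally denoting local memory. The specific distances, free-memory figures and the `pick_node` policy are assumptions for illustration, not values or logic from dReDBox.

```python
def parse_distance_row(text):
    """Parse a space-separated row of NUMA distances into integers."""
    return [int(x) for x in text.split()]


def pick_node(distances, free_mb, need_mb):
    """Return the index of the closest node with enough free memory, or None."""
    candidates = [
        (dist, node)
        for node, dist in enumerate(distances)
        if free_mb[node] >= need_mb
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical view from a compute brick: node 0 is its small local memory,
# nodes 1 and 2 stand in for progressively more distant memory bricks.
row = parse_distance_row("10 40 80")
free_mb = [512, 65536, 131072]

small = pick_node(row, free_mb, need_mb=256)    # fits in local memory
large = pick_node(row, free_mb, need_mb=4096)   # spills to the nearest remote node
```

An application that ignores NUMA information would simply see one large memory space, as Reale describes.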

The demonstration setup is on a relatively small scale, comprising just three trays, but the dReDBox team has apparently been able to test it by running actual cloud workloads, although the results of those tests have yet to be disclosed.

“We didn’t want to use benchmarks, as we wanted high fidelity results, so we actually used a combination of a set of real cloud applications, including data analytics and online transaction processing, in-memory caches, and used the message broker to test how this effort could impact the IoT market, for example,” Reale said.

According to the dReDBox team, the demonstration system can at least match a standard scale-out server deployment in terms of performance, while using between 25 and 50 percent fewer resources.

By the end of the project, the team expects to be in a position to demonstrate how an entire rack of dReDBox hardware would perform.

Meanwhile, any production version of the architecture would need to fit in with existing infrastructure, in particular management tools. To this end, the dReDBox control plane would interface with common orchestration tools via APIs.

“The control plane or orchestration plane is basically some out of band server that is used to connect up the CPUs and memory, and the idea is that this interface is exposed as an API, specifically a REST API, and that can be used either manually by the operator of the data center or more likely integrated – as we are already doing in the project – with higher level orchestration software like OpenStack if you want to deploy virtual machines or Kubernetes for containers,” Reale explained.
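To give a feel for what driving such a control plane over REST might look like, the sketch below builds (but does not send) an HTTP request asking for a composed server. Everything specific here is an assumption: the article does not describe the actual API, so the `/compositions` path, the JSON field names and the `controller.example` host are hypothetical placeholders.

```python
import json
import urllib.request


def build_compose_request(base_url, cpu_cores, memory_gb):
    """Build a POST request for a composed server (hypothetical endpoint and schema)."""
    body = json.dumps({"cpu_cores": cpu_cores, "memory_gb": memory_gb}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/compositions",   # assumed resource path, not a documented one
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_compose_request("http://controller.example/api", cpu_cores=8, memory_gb=256)
# An operator script or orchestration plugin would then submit it, for example with
# urllib.request.urlopen(req), and read back an identifier for the composed system.
```

An orchestration layer such as OpenStack or Kubernetes would issue calls of this kind on the operator's behalf rather than requiring them to be made by hand.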

HPE, Intel working on disaggregation, too

The dReDBox team is not the only organization pursuing disaggregation as a possible solution to some of the issues facing existing data center architectures.

Another is HPE’s The Machine research project, which was designed primarily to deliver a system that could support a very large memory space for applications such as big data analytics. It also features separate compute and memory blocks, fitted into a cluster of enclosures that are essentially rack-mount servers, interconnected using a memory fabric. In a demonstration system unveiled last year, HPE used optical links to connect 40 nodes containing 160TB of shared memory.

Meanwhile, Intel has its own initiative called Rack Scale Design (RSD). This started out with similar goals, but Intel has so far focused on disaggregating storage from server nodes, rather than separating compute and memory. Intel has also focused on management APIs built on Redfish, the DMTF's open standard, to provide resource discovery and management at rack scale and enable interoperability among RSD offerings from different vendors.

Intel’s RSD is evolving gradually, in order to allow vendors like Dell EMC, Ericsson and Supermicro to incorporate the technology into their products at a pace they are comfortable with. Meanwhile, the technology and concepts developed in The Machine are likely to be infused into other platforms, such as the Exascale Computing Project at the US Department of Energy, to which HPE contributes.

As for the dReDBox project, it is a collaborative effort between a number of organizations, including several universities and their spin-off companies, and there are many IP agreements between the partners covering the technology. However, the expectation is that when the project concludes, it will deliver something that could be deployed in a target environment with a little extra effort.

With the ability to run workloads using 25 to 50 percent fewer resources, systems based on disaggregated architectures ought to appeal to data center customers. However, as we have often seen before, great ideas do not always succeed in overturning the status quo; anyone remember IBM’s PureSystems?

All too often, vendors find it too risky to invest in anything that is too much of a leap away from the products they currently ship to customers, and it takes a firm with Intel’s clout to really push a new technology into the market. So it remains to be seen whether truly composable hardware will actually reach customers. Perhaps if hyperscale users like Google, Facebook and Amazon show an interest, we can expect it to become a reality.
