How pumped up is your pumped-up cloud data center?

A cloud data center is supposed to scale sky-high, but few know the actual capacity. How many virtual machines can it host, and how will the cloud perform as CPU, storage, networking and I/O utilization climb higher and higher?


"Ve are here to Pump You Up." I can't help but think about the old Saturday Night Live routines with bodybuilders Hans and Franz when looking at today's cloud data centers. They are big. They are bulked up. They are, indeed, pumped up. But how strong are they, really? As we would ask in IT terms: Do they scale? Can they perform? Or are they girly-man clouds?

Those are hard questions.

Knowing the capacity of a data center is next to impossible. The tech specs are easy – so many servers, so many CPUs, so many gigahertz, such-and-such network connectivity, so much storage I/O bandwidth. Those specs are easy, and also meaningless, without actually measuring the complete stack's end-to-end performance.

Last week, I toured a major telco's data center in the Phoenix area, and talked to one of the engineers about capacity. A small part of the data center is designated for the telco's own multi-tenant hosted cloud services based on VMware. The rest of the facility contains hundreds (thousands?) of customer-leased colocation racks and cages, some of which are used to host cloud-oriented services as well.

"How many virtual machines can you host in your cloud?" I asked. The engineer said he didn't know. The operations team monitors both virtual and real CPU and I/O utilization statistics, he said, and when they go above predetermined thresholds, they reallocate the load or add more resources.

The engineer didn't have good answers to my annoying questions about the data center's cloud capacity: where the performance bottlenecks are, how they know how many services the hosting company could offer, how they tune the cloud's hardware and software stack for maximum scalability, and how much excess capacity is in the racks for future growth. The only capacity numbers he was confident in were the total numbers of servers and CPUs, total petabytes of storage, and network bandwidth (and he wouldn't tell me those numbers).
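The threshold-based monitoring the engineer described can be sketched roughly as follows. This is only an illustration: the threshold values, host names, and the `HostSample` and `hosts_over_threshold` names are all assumptions of mine, not anything the telco shared.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the telco's real values were not disclosed.
THRESHOLDS = {"cpu": 0.80, "io": 0.70}


@dataclass
class HostSample:
    host: str
    cpu: float  # fraction of CPU in use, 0.0 to 1.0
    io: float   # fraction of I/O bandwidth in use, 0.0 to 1.0


def hosts_over_threshold(samples, thresholds=THRESHOLDS):
    """Return {host: [metrics above their threshold]} for hosts needing attention."""
    flagged = {}
    for sample in samples:
        over = [name for name, limit in thresholds.items()
                if getattr(sample, name) > limit]
        if over:
            flagged[sample.host] = over
    return flagged


# Made-up utilization samples for three hypothetical hosts.
samples = [
    HostSample("esx-01", cpu=0.91, io=0.40),
    HostSample("esx-02", cpu=0.55, io=0.35),
    HostSample("esx-03", cpu=0.82, io=0.75),
]
print(hosts_over_threshold(samples))
```

A check like this only reacts to load that is already there. It tells the operator which hosts are hot right now, not how much more the cloud as a whole can take.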

That's scary to me, particularly because I've heard those same non-answers before. Telcos, cloud providers, and even enterprises spend tens of millions of dollars to build out cloud data centers – sometimes for internal use and sometimes to rent out for customer hosting. To the best of my knowledge, they have no idea how much load a particular cloud can carry, and at what load levels the performance of the cloud will begin to tank. Whether the cloud service is provisioned by one rack of gear or by dozens, they have no real knowledge, and have no confidence in their capacity estimates.

Part of the problem is that it's really hard to figure this out. The spec sheets don't tell you, and virtual machine hypervisor control panels don't tell you either. If it's a big cloud set-up, the amount of capacity it has is based on best guesses worked out on a napkin or in Excel, based on extrapolating the workload of a single computer, perhaps. The cloud simply doesn't scale that way, especially when you factor in the multiple levels of hardware, software and infrastructure in the stack. Add in dynamic architectures provisioned by Software Defined Networks (SDN), and software-based firewalls and load balancers defined by Network Functions Virtualization (NFV). The result? Nobody knows.

If nobody knows, everyone is guessing – and then watching the operations screens to manage capacity in real time, hoping that admins will see performance slowdowns with enough time to react before there are outages, packets get dropped, and application response times climb above acceptable levels. Even when admins do spot trouble in time, the response must be quick – talking minutes or hours, maybe days. That doesn't lend itself to long-term capacity planning (weeks, months, quarters), and thus the answer is to have lots of spare capacity ready to be dropped into the cloud on short notice.

A few months ago at Interop, I chatted with executives from Spirent Communications, a network test and measurement vendor. I explained my concern that while there are lots of load-testing products (including theirs) for testing individual data center components and small data centers (like one rack), there was nothing commercially available that could stress-test an entire cloud. They told me, "Wait and see."

I waited, and in early September Spirent prebriefed me about their announcement of "HyperScale Test Solution," a load-testing system designed for really, really big virtualized cloud data centers running vSphere/vCenter or OpenStack. How big? Spirent says HyperScale is designed to test a cloud by populating and instrumenting it with up to a million virtual machines, automatically designed, provisioned and deployed.

Each of the test VMs is designed to consume realistic quantities of CPU, storage, and LAN and WAN I/O bandwidth. The goal is to see how the cloud data center performs at various load levels, and see where it slows down or breaks as the load increases.

Now, those are "destructive" tests, meaning that you can only run them on an offline cloud data center; obviously, you can't have live or customer traffic if you're going to stress the cloud until it breaks. Still, I see HyperScale as a breakthrough, in that cloud operators can truly load-test the cloud while it's being built, and have certainty about actual, measured, real-world capacity.

Later, when operators are monitoring the now-in-production cloud system, they know exactly where its performance fall-off points are, and can make decisions based on true knowledge of where the breakpoints are for the end-to-end system, instead of projections extrapolated from individual servers or small clusters. It'll also help admins more realistically predict when the cloud will reach its limits, so managers can determine when (and how) to expand it, or choose to stop adding new services.
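To make the idea of a measured fall-off point concrete, here is a minimal sketch of how an operator might read one out of a load sweep. The numbers are invented for illustration and are not Spirent results; the `find_breakpoint` function and the 50 ms latency limit are my own assumptions, not part of any product.

```python
def find_breakpoint(measurements, max_latency_ms):
    """Locate the fall-off point in a load sweep.

    measurements: iterable of (vm_count, latency_ms) pairs.
    Returns (last_ok_load, first_bad_load). Either value is None
    if no such load level was observed in the sweep.
    """
    last_ok = None
    for vm_count, latency_ms in sorted(measurements):
        if latency_ms > max_latency_ms:
            return last_ok, vm_count
        last_ok = vm_count
    return last_ok, None


# Made-up sweep: (number of test VMs, measured response time in ms).
sweep = [
    (1000, 12.0),
    (5000, 13.5),
    (10000, 15.0),
    (20000, 22.0),
    (40000, 95.0),
    (80000, 400.0),
]
print(find_breakpoint(sweep, max_latency_ms=50.0))  # (20000, 40000)
```

With figures like these, the operator would know the end-to-end system holds up at 20,000 VMs and falls over somewhere before 40,000, which is the kind of measured boundary a napkin extrapolation cannot provide.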

How big a cloud can scale, and how it performs when pumped up under heavy loads like Hans and Franz, are indeed hard questions. I'm delighted that Spirent has apparently figured out the answers.

Copyright © 2015 IDG Communications, Inc.
