Yahoo builds ultimate private cloud

Yahoo's private cloud expands and contracts computing resources nearly instantaneously and does not rely on a public cloud for extra capacity. Here's how they built it.

Imagine the kind of infrastructure needed for a website fielding 1.5 million requests per second. That was one of the challenges faced by Yahoo's Todd Papaioannou, vice president of cloud architecture.

"What's my biggest pain spot? No, it's not Google," he quipped to attendees during his keynote speech at the Cloud Leadership Forum, held last month in Santa Clara, Calif. "My biggest problem is elasticity. VM spin-up time. Virtualization isn't there yet." Ten to 20 minutes is just too long to wait when Yahoo's traffic spikes on big breaking news, such as the Japan tsunami or the death of Osama bin Laden or Michael Jackson.

That's why Yahoo has built itself the ultimate private cloud. And by private cloud, we don't mean just a cluster of virtualized servers -- we mean an infrastructure that can expand or contract as quickly as you can take a deep breath and exhale.

And failover to a public cloud won't cut it, either. By Papaioannou's estimates, it can take 20 to 40 minutes to spin up a VM instance relying on Amazon's Elastic Block Store storage.

True, Yahoo, based in Sunnyvale, Calif., is overshadowed by the 800-pound gorilla a short drive up the 101 in Mountain View -- at least in the U.S. Yet Papaioannou points out that in other countries, such as Taiwan, Yahoo is the most popular Internet destination. This means that the sun never sets on the page requests made of Yahoo's 400,000 servers (compare that to cloud-for-sale Rackspace's 70,000 servers, he notes). Yahoo supports more than 680 million registered users and stores more than 200 petabytes of data, much of that on 42,000 Hadoop servers. It collects and processes 100 billion events per day, and those 1.5 million requests per second add up to 11 billion pages served per month.

Yahoo considers itself to be the cloud -- a personal cloud for consumers. It is the Internet service that stores consumer data like photos, email and other media, and provides users with online services like search, news, games and TV. Its secret sauce is its Web of Objects, or WOO. This is the customization engine that serves up related content as users use its services. Yahoo describes WOO as a "semantic map of web entities." The more visitors use Yahoo, the more WOO can zero in on personalized related content. If a user searches for a band, WOO could show news stories, videos, lyrics and other deep content related to the band and associated with the person's online behavior.

It takes a big engine built on top of a hyper-flexible cloud to collect all of that big data and to analyze it and to keep it up when traffic spikes.

For Papaioannou this means that the private cloud isn't just a fancy marketing phrase. When a spike happens, "currently our only option is to do 'load shedding,'" says Papaioannou. This means the private cloud pauses or moves lower-priority workloads off those servers and dedicates them to the spike. Lower-priority workloads include servers that are running batch workloads, for instance.
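To make the load-shedding idea concrete, here is a minimal Python sketch of the behavior described above: lower-priority servers are paused and handed to the spike, then restored afterward. It is an illustration only, not Yahoo's implementation. The `Server` class, the numeric priority scale, the `protect_priority` threshold and the function names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    workload: str   # hypothetical label, e.g. "batch" or "serving"
    priority: int   # assumption: a lower number means a lower priority
    paused: list = field(default_factory=list)  # workloads set aside during a spike


def shed_load(servers, servers_needed, spike_workload="spike", protect_priority=5):
    """Hand the lowest-priority servers to the spike and return them.

    Servers at or above protect_priority (an assumed threshold) are never touched.
    """
    candidates = sorted(
        (s for s in servers if s.priority < protect_priority),
        key=lambda s: s.priority,
    )
    reassigned = []
    for server in candidates[:servers_needed]:
        # Remember what was running so it can be resumed after the spike.
        server.paused.append((server.workload, server.priority))
        server.workload = spike_workload
        server.priority = protect_priority
        reassigned.append(server)
    return reassigned


def restore(servers):
    """Resume the paused workload on each server once the spike has passed."""
    for server in servers:
        if server.paused:
            server.workload, server.priority = server.paused.pop()
```

In this sketch, a fleet made up of batch and serving machines would give up its batch machines first, which mirrors the article's example of batch jobs as lower-priority work.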

On the bottom of the stack of Yahoo's private cloud are two layers that Papaioannou thinks of as "infrastructure as a service," a familiar term for public cloud providers that offer multitenant bare-metal hardware. In Yahoo's case, it's not sharing its data center with anyone. Instead it has a custom-developed abstraction layer dubbed "Cloud Fabrics." It treats all compute and data center resources as a single pool and doesn't care where any part is physically located as it assigns tasks for the application at hand.
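The article does not describe how Cloud Fabrics works internally, so the following Python sketch shows only the general idea of a location-agnostic resource pool. Every name in it (`Host`, `ResourcePool`, `assign`, the "most free cores" placement rule) is a hypothetical assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    site: str        # physical location: recorded, but ignored when placing tasks
    free_cores: int


class ResourcePool:
    """Treats every host as part of one pool, regardless of data center."""

    def __init__(self, hosts):
        self.hosts = list(hosts)

    def total_free_cores(self):
        return sum(host.free_cores for host in self.hosts)

    def assign(self, task, cores):
        """Place a task on the host with the most free cores (assumed policy)."""
        host = max(self.hosts, key=lambda h: h.free_cores)
        if host.free_cores < cores:
            raise RuntimeError(f"no host can fit {task!r} ({cores} cores)")
        host.free_cores -= cores
        return host.name
```

The point of the sketch is that `assign` never reads `site`: callers ask the pool for capacity and the pool decides where the work lands.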

The next layer in this version of infrastructure as a service is "cloud services," such as the Yahoo Caching Proxy and load balancing around the world. Services like "Traffic Server" live here. Traffic Server is an open source content caching tool that Yahoo released to the open source community in 2009.

The next portion of the stack can be related to platform as a service, which in the public cloud world means a rented space that includes operating systems and middleware. In Yahoo's case, here sits Hadoop, which in another era would have been called a grid computing engine. In today's terminology, Hadoop is open source software for distributed processing of large data sets across clusters of computers. Yahoo is a big proponent of using open source software (see sidebar: "Yahoo and open source"). Here are the "serving containers," storage and hardware plumbing. "By standardizing these services on top of a unified fabric, we want to be able to order racks of servers at a time and slam them into the pool," Papaioannou describes. A vanilla infrastructure offers another benefit: automating management tasks. "With hundreds of thousands of servers, you can't have a human running around with some Perl script to manage your infrastructure. You need to lift up the level of abstraction," he says.

The next layer up is what Papaioannou refers to as "Yahoo's secret sauce, Knowledge as a Service." This includes WOO and other applications that match ads to content. Such apps perform analysis, scoring, optimization and ranking of ads, related links and other user content.

At the top of the stack lives "software as a service," or Yahoo Digital Media Services. This includes connected TV, Yahoo Developer Network, Front Page, Mail, Messenger, user-generated content and so on.

Diagram of Yahoo's cloud stack

When talking about a private cloud that truly works as an automated utility computing model, the "stack" metaphor becomes less accurate. Another way of looking at it is as a circle, with all services feeding into one another and protected by an edge of infrastructure.

Diagram of Yahoo's cloud

Running this cloud has given Papaioannou perspective on the state of the cloud today and tomorrow. "The cloud is ready. You have to choose between hybrid cloud, private and public, however. If you analyze the workload, you should be asking yourself, Does it need to be on an internal server or not? A bunch of businesses are already running in the public cloud. If I were to create a startup tomorrow, I would launch in a public cloud," he says.

And yet, private clouds and private data centers will never go away completely, he believes. If a company grows big enough, it can become less expensive to own the infrastructure than to share it. "With enough size, economy of scale kicks in. It is our business to run data. It is cheaper for us to do our own things."

But then again, we're talking 1.5 million requests per second. That's the kind of scale that brings economies. For most other businesses, the cloud is looming.

Julie Bort is the editor of Network World's Cisco Subnet community. She also writes the Odds and Ends blog for Cisco Subnet, the Microsoft Update blog for Microsoft Subnet and Source Seeker for the Open Source Subnet community sites. Follow Bort on Twitter @Julie188.

