Accept failure, but focus on recovery

Researcher says we can’t prevent system failures, so we should focus on fast recovery.


Yes. If you look at the backgrounds of Dave Patterson and myself, the two principal investigators of ROC, you’ll see his main background is in computer architecture and storage and mine is in systems building and a bit of networking.

In the RAD Lab Project we’re expanding along two axes. One axis starts with the idea that, hey, this stuff worked really well when we applied it to building applications; can we apply the ideas to developing and debugging the networks that connect the applications? Because all interesting applications are going to be distributed by nature. And not just across clusters; they’re going to be distributed across data centers. So, can we take these techniques of statistical machine learning and visualization and apply them to some of the core networking challenges that we face as we move to fully distributed apps?

The second axis is "closing the loop" between these machine learning algorithms, which are good at learning about the system's behavior, and the human operators, who have a tremendous amount of experience with the system. Operators often have good instincts and hunches about what's causing a problem or when a system is drifting towards bad behavior. Wouldn't it be great if they could directly transmit their knowledge into the machine learning algorithms, to speed up the learning rate of an algorithm or to improve its modeling accuracy? And if the algorithm makes a mistake or a bad judgment, wouldn't it be great if the operators could interrogate the algorithm to understand how it made its decision so that they can improve its behavior for future incidents? We're really excited about combining state-of-the-art machine learning with what we've learned about providing better tools for system operators.

Is the goal of this next project to solve something in particular?

You have to have a project mission statement to start. The five-year mission for the RAD Lab is to make it possible for one individual to develop, deploy and operate an enormous-scale, next-generation Internet service.

The initial version of eBay was coded over four days by one guy. But since then eBay has gotten so large it has had to rebuild its entire system twice. Pretty much from the ground up every time. Similarly, Google started out as a university research prototype, but in order to operate Google at the scale it operates at today, they had to raise money to build a Google-sized organization. Our goal is, if you have the idea for the next Google or the next eBay or the next mash-up of interesting stuff, that you can actually get to a scale of deployment comparable to what Google has now without building a Google-sized company to do it.

Presumably you achieve that by building some of the capabilities into the cloud, right?

The infrastructure will be operated in some sense as a utility. So, I deploy my service and if it gets really big it automatically starts taking over additional resources. It starts scaling itself up. In some sense it’s not unlike the way you think about electricity today. You use a little, you pay a little. You use a lot, you pay a lot. But you don’t have to worry about the provisioning part. The grid does the provisioning. All you have to do is regulate how much you’re using. That’s the level of simplicity that we want to achieve. And we actually have a shot at taking a bite out of this with the combination of statistical machine learning and visualization.
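The electricity analogy reduces to a very simple contract: billing is proportional to metered usage, and provisioning is the grid’s problem rather than the user’s. A minimal sketch of that contract, with a made-up rate and unit (neither is from the interview):

```python
# Utility-style billing: pay a little for a little, a lot for a lot.
# RATE_PER_UNIT is a hypothetical price per resource-unit-hour; the
# provider handles provisioning, the customer only sees metered usage.

RATE_PER_UNIT = 0.12  # illustrative, not a real price


def bill(usage_units: float) -> float:
    """Cost is strictly proportional to what was consumed."""
    return usage_units * RATE_PER_UNIT
```

The point of the model is that the customer never writes capacity-planning code at all; `bill` is the entire interface they reason about.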

But the Googles of the world are a rarified breed. Will there be lessons for the enterprise user?

We consider the enterprise class users more important because there aren’t that many Googles around. And in some sense, solving the operations problem for an enterprise-scale company is a bigger challenge than solving it for a huge thing. Because if the hypothesis is we’re going to have thousands of enterprise-scale things all sharing these resources, then each individual enterprise-scale thing has to be negligibly difficult to operate. It’s got to add very little to the operational overhead, because otherwise we’re limited in how many of them can co-exist on this resource grid.

Okay.

Whenever we have research discussions about this we always say, let’s not think about just Google and Amazon and Yahoo. Let’s think about the long tail [a reference to the idea that small companies actually reach more individual consumers than large firms]. Let’s think about the one guy that’s got a medium sized application and his management problems have to be made just as simple as the management problems if we were solving it for a planet-sized thing like Google. So, we actually care a tremendous amount about the long tail.

And if you look at the recent wave of innovation using things like mash-ups and service composition, it looks like we’re finally starting to move into the world of service-oriented architectures. People have been talking about SOA for years, but we’re finally actually seeing some real-life applications.

Craigslist.com has listings of apartments for rent. Google has great maps. Put them together and you can see apartment rentals on a map. This is exactly what the service-oriented architecture people have been hoping would happen for years. And it’s not happening exactly the way they foresaw, but the important point is that a lot of the innovation we expect to see is not going to come from people building whole new applications. It’s going to come from people combining applications and components and then layering some of their own functionality on top of it. So the typical innovative new application is not going to be a huge morass of code. It’s going to be a modest amount of code and depend on many services working correctly.
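The rentals-on-a-map idea can be sketched in a few lines. This is a toy illustration of the composition pattern, not real Craigslist or Google Maps APIs: the two "services" below are stand-in stubs, and all names and data are hypothetical.

```python
# A mash-up is a thin layer of glue over services someone else built.
# Both service functions here are stubs standing in for remote calls.

def listings_service():
    """Stand-in for a classifieds feed of apartment rentals."""
    return [
        {"id": 1, "address": "123 Oak St", "rent": 1800},
        {"id": 2, "address": "456 Pine Ave", "rent": 2200},
    ]


def geocode_service(address):
    """Stand-in for a mapping service that turns an address into coordinates."""
    fake_index = {
        "123 Oak St": (37.77, -122.42),
        "456 Pine Ave": (37.80, -122.27),
    }
    return fake_index[address]


def rentals_on_a_map():
    """The mash-up itself: a modest amount of code, two dependencies."""
    return [
        {**listing, "coords": geocode_service(listing["address"])}
        for listing in listings_service()
    ]
```

Note that `rentals_on_a_map` contains almost no logic of its own; its value, and its fragility, come entirely from the services it composes.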

And the RAD Lab hopes to solve some of the problems that will be exacerbated in that kind of environment?

They’re exacerbated because the way services are being built is, you take an existing app, use that as one of your building blocks and put stuff on top of it. Well, that means that all the things you depend on have to work. And all the things that each of those things depends on have to work. So the level of innovation is greatly accelerated if you can do this. But it’s contingent on making sure that that pyramid doesn’t collapse. And that’s a really interesting research problem.
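The fragility of that pyramid can be made concrete with a bit of arithmetic. If a composed service needs every one of its dependencies to be up, then (assuming independent failures, a simplifying assumption of ours, not a claim from the interview) its availability is roughly the product of theirs:

```python
# Why deep dependency pyramids are fragile: availability multiplies.

def composed_availability(dep_availabilities):
    """Availability of a service that fails whenever any dependency fails,
    under the simplifying assumption that dependencies fail independently."""
    result = 1.0
    for a in dep_availabilities:
        result *= a
    return result


# Ten dependencies, each up 99.9% of the time, already cost about a
# tenth of the error budget: 0.999 ** 10 is roughly 0.99.
ten_deps = composed_availability([0.999] * 10)
```

So a mash-up sitting on ten "three nines" services is itself only about a "two nines" service, which is why keeping each layer of the pyramid healthy matters so much.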

How far along is RAD?

The RAD Lab Project has just started. It is largely funded by industry. In fact more than two-thirds of the financial support for this is coming from industry. Sun, Microsoft and Google -- three companies that are not usually mentioned in the same sentence -- have each contributed a substantial amount of support. Not just in terms of money, but actual research relationships with us. Students go to the companies and the companies send people to our retreats. That’s one of the ways we stay focused on the real problems, by having companies that are grappling with these problems every day advising us. More recently, IBM, HP, Nortel, NTT-MCL, and Oracle have contributed as well.

How do these efforts compare to, say, IBM’s autonomic push?

We’ve had a great relationship with IBM for several years, and when we started the ROC Project we were asked if ROC was the same as autonomic computing. Autonomic computing is a great vision. And, in fact, you could argue that with the RAD Lab we’re taking a step in that direction by saying any one service is going to require only a tiny fraction of a human operator. So that’s getting pretty close to autonomic.

But I think an important difference in the ROC Project was we said we’re not ready to take the human out of the loop yet, because we don’t understand what the human does. The human doesn’t have good tools to do what they do. Human errors are actually responsible for a huge fraction of the downtime in real-life services. So we, the ROC Project, are not going to focus on removing the human from the equation. Instead, we’re going to look at how we can help. I think we have always had the same long-term goal as the autonomic computing guys, but our tactical approach was different for ROC.

For enterprise users, how far off is the benefit of some of this work?

Our long-term vision of creating a prototype platform where a single individual can deploy their service and basically turn it on and forget about it, I wouldn’t hold my breath for that to come out next year. But a lot of the techniques we’re developing as ingredients for that, better statistical machine learning algorithms and better visualization, we plan to develop in the context of existing open standards, so whatever we do will work with existing frameworks.

So, we expect to be deploying pieces of things and have downloadable software artifacts that will work with existing tools. And we plan to invite companies to pick up that stuff and use it. The Berkeley philosophy is to basically give the software away so it can be deployed in real environments. So, I think pieces of what we do are going to be available in the next couple of years.
