Armando Fox believes that, if you can't build fail-proof systems, you should at least build systems that can recover so quickly that service blips become negligible. A Research Associate with the University of California, Berkeley's Reliable, Adaptive Distributed Systems Laboratory (RAD Lab), Fox was one of the leads on the joint Berkeley/Stanford Recovery-Oriented Computing (ROC) Project, which investigated techniques for building dependable Internet services that emphasized "recovery from failures rather than failure-avoidance."

Fox has since brought some of the ROC lessons forward into the RAD Lab, which was launched in 2005 with $7.5 million in funding from Google, Microsoft and Sun. Affiliate members include IBM, HP, Nortel, NTT-MCL and Oracle. The RAD Lab is focused on the problems that plague large Internet-based businesses, because those environments represent an extreme, but Fox says the lessons learned should ultimately trickle down to enterprise users. Network World Editor-in-Chief John Dix asked Fox to explain the vision.

Let's start with a review of ROC. What was that all about?

The philosophy of the ROC Project was: stuff happens. Despite our best efforts to design and debug these complicated Internet systems, they inevitably end up failing in ways we didn't expect. Hardware is not perfect. Software has bugs. Even really, really well-tested software like Oracle turns out to have bugs once it's been out in the field. And humans are in charge of running these systems, and sometimes they make mistakes.

So the ROC Project philosophy was: let's accept that those things are going to happen and start thinking about designing for fast recovery, as opposed to designing to avoid failure, which is not a realistic goal. One way to improve system availability is to never fail.
But another way to improve it is to make recovery from failure so fast that the impact on availability is negligible.

Why do you start off with the assumption that you can never build systems that won't fail?

Because we don't think we're smart enough to counter the last, what, 60 years of computer science history. There are a lot of people working on design-for-correctness and other techniques to improve systems and minimize bugs. And that's a good thing. But so far, despite our best efforts, I cannot think of a single computer system ever designed in which no bugs were found once it was in the field.

So I suppose we could take the position that somehow, in the future, that's all going to change. But we've been saying that for decades. And it's not that we're stupid. In terms of performance, storage density and network communications speeds, look what we've been able to do in 30 years. But then compare that with what we have been able to do in terms of reliability. The complexity of these systems has gotten to the point where it's very difficult for any one individual to understand how one of them works.

Plus, market reality being what it is, it's not as if you polish the whole thing, deploy it and then leave it alone. Systems have to evolve. You add new features, get more users, scale your system up. All of those processes work counter to reliability. Some of the most reliable software out there is the software that runs the Space Shuttle, and ask those guys how they make changes to their software. They have to write thousands of pages of documentation and hold hundreds of hours of design reviews before a single line of code gets touched. So they have super-reliable software, but it comes at a price.

And the reality is most Internet companies can't pay that price.
Amazon can't have hundreds of hours of design meetings before deciding whether to roll out a new feature. So the ROC Project basically said, look, we need to find a way to deal with this issue in the context of commercial realities. Because, yes, these systems evolve rapidly. And, yes, that's bad for reliability. But that innovation is where a lot of the value of these systems comes from. And we're not going to, as academics, propose an approach to the problem that says you can fix your systems, but at the cost of rapid innovation.

So that was the philosophy of ROC. And we actually made a fair amount of progress. We identified some specific techniques that could be built into software systems to help them recover from certain kinds of common problems really fast. In fact, so fast that sometimes you might not even notice anything except a minor blip in performance. That was an important finding, and those ideas are starting to find their way into some commercial products.

How about an example?

Sure. One idea we worked on was called micro rebooting. When you have a weird, unexpected, unrecoverable bug and don't know what else to do, you reboot your machine. Sometimes that's enough to fix the problem. But rebooting takes a long time. So, given that applications have evolved toward a componentized architecture using things like Enterprise JavaBeans (EJB), our idea was to apply the concept of rebooting to a small number of components at a time. Instead of rebooting the whole EJB server, which can take minutes, you micro reboot only the EJB components that appear to have been failing. You reset the thing that was failing, but at much lower cost, because you're only doing it to the component you believe was the actual source of the problem.

And you've seen that practice picked up?

Variants of that practice are being put into some commercial products.
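The micro rebooting idea can be pictured with a toy sketch. The Component class, its fields and the component names below are invented for illustration; they stand in for EJB components in a real container, not for any actual ROC code.

```python
import time

class Component:
    """Stand-in for one application component (e.g., a single EJB)."""
    def __init__(self, name):
        self.name = name
        self.healthy = True            # a real container would probe this
        self.started_at = time.time()

class Container:
    """Toy component container: restart one component, not the whole server."""
    def __init__(self, names):
        self.components = {n: Component(n) for n in names}

    def micro_reboot(self, name):
        # Re-initialize only the suspect component; the others keep serving.
        self.components[name] = Component(name)

    def full_reboot(self):
        # The traditional, much slower alternative: restart everything.
        self.components = {n: Component(n) for n in self.components}

server = Container(["catalog", "cart", "checkout"])
server.components["cart"].healthy = False  # suppose "cart" starts failing
server.micro_reboot("cart")                # reset just that component
assert server.components["cart"].healthy   # fresh instance, back in service
```

The payoff is in the last three lines: recovery touches only one component, so a false alarm costs almost nothing, which is what makes it safe to trigger recovery from an imperfect detector.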
Although I'm not sure I'm allowed to say specifically which ones.

Micro rebooting was one of the seven core research areas under ROC, right? Are there others that have been acted on as well?

Yes. There are some that we're carrying forward into the RAD Lab Project. One of the big ones is the use of statistical machine learning to detect and localize hard-to-find problems in systems. For example, if your whole site crashes, that's not difficult to detect. You're probably hosed, but at least you know you're hosed. The tricky problems are the ones where a certain subset of your customers is getting incorrect page views, or some specific feature of your site is not working correctly, and because of that you're losing traffic.

Some of the more sophisticated sites have various kinds of monitoring in place to try to detect these conditions. But the monitors aren't perfect, so sometimes these conditions will persist for a while before anybody detects that something might be wrong. And even if you do detect it, you still have to figure out what's causing it. So one thing we started looking at in the ROC Project was statistical machine learning. You can summarize the field as: here's an enormous amount of data, find me some interesting patterns in it. That's a gross oversimplification, but the idea is that you gather this enormous amount of data and what you want to do is extract information from it.

In our case, we can capture a lot of instrumentation from Java Enterprise Edition servers and Internet systems as they're running. We can gather information about the way they behave in response to users' workload, and we can mine that information and look for interesting patterns.

For example, we would watch a J2EE application server as regular users were using it and try to capture the different types of paths a user's request would follow through that system.
Some users are going to browse a catalog of items, some are going to put things in a shopping cart, some are going to check out. And it turns out you can group those paths into collections. So if I build up a profile based on that, and then all of a sudden I start seeing a path that doesn't fit into any of those categories, that would be a good time to ask myself: is this a new kind of behavior that no user has ever exercised? Or is there a problem with the system, and because of it users are actually following a path that isn't one of the valid paths I had set up for them to follow?

It turns out that method is very effective at locating some of the kinds of partial problems that only affect some people or only affect certain features, and that don't normally show up in things like regular server logs. Path-based analysis has been used to diagnose performance problems, bugs and system-evolution issues at eBay and Tellme Networks, which operates complex voice-recognition-based phone applications.

Now, of course, if you could foresee all of these possible problems in advance, you could have somebody manually write test cases that monitor the system constantly to see if any of the problems show up. But you can't always predict every problem in advance, and even if you could, it's a lot of work for somebody to code all those cases. And then, of course, if you add or change anything, you've introduced a bunch of new variables, so you'd have to go back and do those tests all over again.

So the idea is to automate the task. I say, okay, watch the system, because right now I believe it's working normally, and you build up a profile of what normal actually means. And then you try to look for deviations from that.

Like they do in some intrusion detection systems?

Yes, but there's an important difference. These statistical algorithms are amazing, but none are perfect.
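Stepping back to the path-profiling idea for a moment, a minimal sketch: learn which component-to-component hops occur in known-good request paths, then score a new path by how many of its hops were never seen in training. The path names and scoring rule below are invented for illustration, not taken from the actual ROC tooling.

```python
from collections import Counter

def train_profile(paths):
    """Build a profile of normal behavior: count each observed
    component-to-component transition in known-good request paths."""
    profile = Counter()
    for path in paths:
        for hop in zip(path, path[1:]):
            profile[hop] += 1
    return profile

def anomaly_score(path, profile):
    """Fraction of transitions in this path never seen during training.
    0.0 means every hop matches the profile; 1.0 means none do."""
    hops = list(zip(path, path[1:]))
    if not hops:
        return 0.0
    unseen = sum(1 for hop in hops if profile[hop] == 0)
    return unseen / len(hops)

# Paths observed while the site was believed to be working normally.
normal = [
    ["front", "catalog", "detail"],
    ["front", "catalog", "cart", "checkout"],
    ["front", "search", "detail", "cart"],
]
profile = train_profile(normal)

# A familiar path scores 0.0; a path that dead-ends in an error page stands out.
assert anomaly_score(["front", "catalog", "detail"], profile) == 0.0
assert anomaly_score(["front", "cart", "error"], profile) > 0.5
```

A high score does not prove a failure, only a deviation from the learned profile; that ambiguity is exactly the false-alarm problem discussed next.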
They all make mistakes, and basically there are two kinds: false negatives, which means something happens and you miss it, and false positives, which means you raise an alarm but in fact nothing is really wrong. So the difference between us and the intrusion detection guys is that if they act on a false alarm and shut the system down, they've inconvenienced a great many people. If we act on a false alarm and do a micro reboot, it's so fast you barely notice it, except as a performance blip.

So micro rebooting has this nice property, and we're working on identifying other techniques that fix a common class of problems and, if you try one and it doesn't work, don't cost you much. The real thing that came out of ROC was this combination of statistical machine learning, which is great at finding these patterns but sometimes makes a mistake, with fast recovery actions that make it okay to act on a false alarm.

Anything else come out of ROC?

If the ROC Project is a three-legged stool, I've just described two of the legs. The third leg was about human operators: How can we reduce the incidence of mistakes and, if operators do make a mistake, how can we give them better tools to identify and recover from it?

One of the issues with statistical machine learning is that the algorithms are not easy to understand. So if we show operators the analysis from one of these algorithms, they'll roll their eyes. Plus, it's their butt on the line if the system fails, so they're hesitant to turn control over to an algorithm they're not even sure they understand.

So an important thing we've been doing near the end of the ROC Project, and now moving into the RAD Lab Project, is to combine the statistical machine learning with visualization.
So we're not only analyzing the output of these algorithms, we're also presenting in-depth, information-rich graphic visualizations based on the same kinds of system behaviors operators have to deal with every day and have developed an intuition for.

For example, we worked with a medium-sized Internet company called E-Bates, which allowed us to use actual server logs from several periods during which they had incidents with their system: features failing, part of the site going down, whatever. What we did was create a simple visualization where operators can see what was happening. I could show you a picture and, without even knowing what the picture represents, you would be able to point to something and say, that's wrong.

Operators get used to seeing visual patterns. And if they see something in a picture that doesn't match their pattern, they immediately go on the alert and say, "Oh, it doesn't usually look that way." When they see that, they can then click on that part of the picture and drill down to what the statistical machine learning algorithm said about that part of the data.

That may show the number of hits certain pages were receiving in the last few hours as being anomalous, by this much, and, in particular, the three pages that contributed the most to the algorithm's decision. Then the operator can go look at those three pages and see if there's some problem relating to the way they are used, or a bug.

And we were able to not only identify all of the failures that had actually occurred, we also found a couple of places in their data where our systems said a failure occurred and their operators didn't know about it.
That caused them to want to go back to their e-mail logs and see if anything had really happened during those incidents.

Over time the operators start trusting the algorithms more, because they get more familiar with how they work and, more importantly, they have increased confidence that the algorithms are actually saying something meaningful.

Okay. So you are carrying these three core concepts forward into the RAD Lab Project?

Yes. If you look at the backgrounds of Dave Patterson and myself, the two principal investigators of ROC, you'll see his main background is in computer architecture and storage, and mine is in systems building and a bit of networking.

In the RAD Lab Project we're expanding along two axes. One axis starts with the idea that, hey, this stuff worked really well when we applied it to building applications; can we apply the ideas to developing and debugging the networks that connect the applications? Because all interesting applications are going to be distributed by nature. And not just across clusters; they're going to be distributed across data centers. So can we take these techniques of statistical machine learning and visualization and apply them to some of the core networking challenges we face as we move to fully distributed apps?

The second axis is "closing the loop" between these machine learning algorithms, which are good at learning about the system's behavior, and the human operators, who have a tremendous amount of experience with the system. Operators often have good instincts and hunches about what's causing a problem or when a system is drifting toward bad behavior. Wouldn't it be great if they could directly transmit their knowledge into the machine learning algorithms, to speed up the learning rate of an algorithm or to improve its modeling accuracy?
And if the algorithm makes a mistake or a bad judgment, wouldn't it be great if the operators could interrogate the algorithm to understand how it made its decision, so that they can improve its behavior for future incidents? We're really excited about combining state-of-the-art machine learning with what we've learned about providing better tools for system operators.

Is the goal of this next project to solve something in particular?

You have to have a project mission statement to start. The five-year mission for the RAD Lab is to make it possible for one individual to develop, deploy and operate an enormous-scale, next-generation Internet service.

The initial version of eBay was coded over four days by one guy. But since then eBay has gotten so large it has had to rebuild its entire system twice, pretty much from the ground up each time. Similarly, Google started out as a university research prototype, but in order to operate Google at the scale it operates at today, they had to raise money to build a Google-sized organization. Our goal is that, if you have the idea for the next Google or the next eBay or the next mash-up of interesting stuff, you can actually get to a scale of deployment comparable to what Google has now without building a Google-sized company to do it.

Presumably you achieve that by building some of the capabilities into the cloud, right?

The infrastructure will be operated, in some sense, as a utility. I deploy my service, and if it gets really big it automatically starts taking over additional resources. It starts scaling itself up. In some sense it's not unlike the way you think about electricity today. You use a little, you pay a little. You use a lot, you pay a lot. But you don't have to worry about the provisioning part. The grid does the provisioning. All you have to do is regulate how much you're using. That's the level of simplicity that we want to achieve.
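The utility model can be sketched as a trivial control loop: capacity follows load, and the bill follows capacity, so the developer never provisions anything by hand. The per-server capacity and hourly cost below are made-up numbers, purely illustrative.

```python
def rescale(load, target_per_server=100):
    """Servers needed so each carries roughly `target_per_server`
    units of load (minimum one server); ceiling division."""
    return max(1, -(-load // target_per_server))

def run_day(hourly_load, cost_per_server_hour=0.10):
    """Walk through a day's load, rescaling each hour; the bill
    tracks actual usage, like a metered electric utility."""
    servers, bill = 1, 0.0
    for load in hourly_load:
        servers = rescale(load)
        bill += servers * cost_per_server_hour
    return servers, round(bill, 2)

# Quiet overnight, a traffic spike at midday, then back down:
# the pool grows to 10 servers for the spike, shrinks to 2 after.
servers, bill = run_day([50, 80, 950, 120])
assert (servers, bill) == (2, 1.4)
```

Real platforms would add smoothing and scale-down delays to avoid thrashing on noisy load, but the contract is the same: use a little, pay a little; use a lot, pay a lot.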
And we actually have a shot at taking a bite out of this with the combination of statistical machine learning and visualization.

But the Googles of the world are a rarefied breed. Will there be lessons for the enterprise user?

We consider the enterprise-class users more important, because there aren't that many Googles around. And in some sense, solving the operations problem for an enterprise-scale company is a bigger challenge than solving it for one huge thing. Because if the hypothesis is that we're going to have thousands of enterprise-scale services all sharing these resources, then each individual one has to be negligibly difficult to operate. It's got to add very little to the operational overhead, because otherwise we're limited in how many of them can co-exist on this resource grid.

Okay.

Whenever we have research discussions about this, we always say, let's not think about just Google and Amazon and Yahoo. Let's think about the long tail [a reference to the idea that small companies collectively reach more individual consumers than large firms]. Let's think about the one guy with a medium-sized application; his management problems have to be made just as simple as they would be if we were solving them for a planet-sized thing like Google. So we actually care a tremendous amount about the long tail.

And if you look at the recent wave of innovation using things like mash-ups and service composition, it looks like we're finally starting to move into the world of service-oriented architectures. People have been talking about SOA for years, but we're finally seeing some real-life applications.

Craigslist.com has listings of apartments for rent. Google has great maps. Put them together and you can see apartment rentals on a map. This is exactly what the service-oriented architecture people have been hoping would happen for years.
And it's not happening exactly the way they foresaw, but the important point is that a lot of the innovation we expect to see is not going to come from people building whole new applications. It's going to come from people combining applications and components and then layering some of their own functionality on top. So the typical innovative new application is not going to be a huge morass of code. It's going to be a modest amount of code that depends on many services working correctly.

And the RAD Lab hopes to solve some of the problems that will be exacerbated in that kind of environment?

They're exacerbated because the way services are being built is, you take an existing app, use it as one of your building blocks and put stuff on top of it. Well, that means that all the things you depend on have to work. And all the things each of those things depends on have to work. So the level of innovation is greatly accelerated if you can do this, but it's contingent on making sure that that pyramid doesn't collapse. And that's a really interesting research problem.

How far along is RAD?

The RAD Lab Project has just started. It is largely funded by industry; in fact, more than two-thirds of the financial support is coming from industry. Sun, Microsoft and Google -- three companies that are not usually mentioned in the same sentence -- have each contributed a substantial amount of support, not just in terms of money, but actual research relationships with us. Students go to the companies, and the companies send people to our retreats. That's one of the ways we stay focused on the real problems: by having companies that are grappling with these problems every day advising us.
More recently, IBM, HP, Nortel, NTT-MCL and Oracle have contributed as well.

How do these efforts compare to, say, IBM's autonomic push?

We've had a great relationship with IBM for several years, and when we started the ROC Project we were asked if ROC was the same as autonomic computing. Autonomic computing is a great vision. And, in fact, you could argue that with the RAD Lab we're taking a step in that direction by saying any one service is going to require only a tiny fraction of a human operator. So that's getting pretty close to autonomic.

But I think an important difference in the ROC Project was we said we're not ready to take the human out of the loop yet, because we don't understand what the human does. The human doesn't have good tools to do what they do. Human errors are actually responsible for a huge fraction of the downtime in real-life services. So we, the ROC Project, were not going to focus on removing the human from the equation. Instead, we looked at how we can help. I think we have always had the same long-term goal as the autonomic computing guys, but our tactical approach was different for ROC.

For enterprise users, how far off is the benefit of some of this work?

As for our long-term vision of creating a prototype platform where a single individual can deploy a service and basically turn it on and forget about it, I wouldn't hold my breath for that to come out next year. But a lot of the techniques we're developing as ingredients for that, such as better statistical machine learning algorithms and better visualization, we plan to develop in the context of existing open standards, so whatever we do will work with existing frameworks.

So we expect to be deploying pieces of things and to have downloadable software artifacts that work with existing tools. And we plan to invite companies to pick that stuff up and use it.
The Berkeley philosophy is to basically give the software away so it can be deployed in real environments. So, I think pieces of what we do are going to be available in the next couple of years.