Dept. of Energy hunting fault tolerance for extreme scale systems

Keeping massively powerful computers humming will be an enormous challenge


A computer simulation of a Type Ia supernova. Argonne National Laboratory's Mira will have enough computing power to help researchers run simulations of exploding stars. More powerful systems in the future will need advanced fault tolerance to support even more advanced apps.

Credit: DOE

The immensely powerful supercomputers of the not-too-distant future will need some serious fault tolerance technology if they are to fulfill their promise of ingenious research.

That’s why the U.S. Department of Energy’s Office of Advanced Scientific Computing Research this week said it is looking for “basic research that significantly improves the resiliency of scientific applications in the context of emerging architectures for extreme scale computing platforms. Extreme scale is defined as approximately 1,000 times the capability available today. The next-generation of scientific discovery will be enabled by research developments that can effectively harness significant or disruptive advances in computing technology.”


According to the DOE, applications running on extreme computing systems will generate results with orders of magnitude higher resolution.

“However, indications are that these new systems will experience hard and soft errors with increasing frequency, necessitating research to develop new approaches to resilience that enable applications to run efficiently to completion in a timely manner and achieve correct results,” the agency stated.

Today, the DOE says that 20% or more of the computing capacity in a large high performance computing facility is wasted due to failures and recoveries. The situation is expected to worsen sharply as systems increase in size and complexity, wasting even more capacity. Research is required to improve the resilience of the systems and the applications that run on them, the agency stated.

The DOE says the research it is looking for focuses on three things:


The DOE states that a variety of factors will contribute to increased rates of faults and/or errors on extreme scale systems, a few of which are:

  • The number of components with both memory and processors will increase by an order of magnitude, and the number of system components is increasing faster than component reliability, resulting in an increase in hard and soft errors;
  • Constraining hardware and software to a power envelope of 20 megawatts will necessitate operating components at near-threshold power levels, and those power levels may vary over time, making errors more likely; and
  • Use of the machines will require managing unprecedented parallelism and complexity, especially at the node level of extreme scale systems, increasing the likelihood of programmer errors.
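To see why component counts matter so much, consider a rough sketch in Python. It assumes, purely for illustration, that components are identical, fail independently at a constant rate, and that any single failure interrupts the system; under those assumptions the system's mean time between failures (MTBF) is the component MTBF divided by the number of components. The function name and the figures are hypothetical and do not come from the DOE.

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a system that is interrupted when any one of n components fails.

    Assumes identical components with independent, constant failure rates.
    """
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return component_mtbf_hours / n_components


# Hypothetical figures, chosen only for illustration (not DOE data).
component_mtbf = 5_000_000.0  # hours between failures for a single component

for n in (100_000, 1_000_000):
    hours = system_mtbf_hours(component_mtbf, n)
    print(f"{n:>9,} components -> system MTBF of {hours:.1f} hours")
```

With these made-up numbers, a tenfold increase in component count cuts the time between interruptions tenfold unless each component becomes correspondingly more reliable, which is the gap the first factor above describes.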

Follow Michael Cooney on Twitter: nwwlayer8 and on Facebook.
