Heftiest supercomputers fall hardest, researcher suggests
As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of componentry they contain. A few researchers at the SC12 conference, held last week in Salt Lake City, Utah, offered possible solutions to this growing problem.
Today's high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12.
The problem is not a new one, of course. When Lawrence Livermore National Laboratory's 600-node ASCI (Accelerated Strategic Computing Initiative) White supercomputer went online in 2001, it had a mean time between failures (MTBF) of only five hours, thanks in part to component failures. Later tuning efforts improved ASCI White's MTBF to 55 hours, Fiala said.
But as the number of supercomputer nodes grows, so will the problem. "Something has to be done about this. It will get worse as we move to exascale," Fiala said, referring to how supercomputers of the next decade are expected to have ten times the computing power that today's models do.
Today's techniques for dealing with system failure may not scale very well, Fiala said. He cited checkpointing, in which a running program is temporarily halted and its state is saved to disk. Should the program then crash, the system is able to restart the job from the last checkpoint.
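To make the mechanism concrete, here is a minimal checkpoint/restart sketch in Python. It illustrates the general technique Fiala described rather than any real HPC system's implementation; the file name, state layout, and toy loop are invented for the example.

```python
# Minimal checkpoint/restart sketch: periodically persist program state
# to disk, and resume from the last saved state after a crash.
import os
import pickle

CHECKPOINT = "state.pkl"  # hypothetical checkpoint file

def load_checkpoint():
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0.0}

def save_checkpoint(state):
    """Temporarily halt progress and save the state to disk."""
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

state = load_checkpoint()
for step in range(state["step"], 1_000_000):
    state["total"] += step * 0.5   # stand-in for real computation
    state["step"] = step + 1
    if step % 10_000 == 0:         # checkpoint interval
        save_checkpoint(state)

print("done:", state["total"])
```

If the process dies mid-run, rerunning it resumes from the last multiple of 10,000 steps rather than from zero: a little disk I/O during normal operation buys a bounded amount of lost work after a failure.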
The problem with checkpointing, according to Fiala, is that as the number of nodes grows, the amount of system overhead needed to do checkpointing grows as well, and grows at an exponential rate. On a 100,000-node supercomputer, for example, only about 35 percent of the activity will be involved in conducting work. The rest will be taken up by checkpointing and, should a system fail, recovery operations, Fiala estimated.
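Fiala's 35 percent figure is an estimate, but a back-of-the-envelope model shows why useful work collapses at scale. The sketch below uses Young's approximation for the near-optimal checkpoint interval, a standard first-order model that was not part of Fiala's talk, with assumed values for per-node MTBF and checkpoint cost.

```python
# Rough utilization model under checkpointing (assumed numbers, not
# Fiala's): with per-node MTBF M and N nodes, system-wide MTBF is about
# M / N. Young's approximation T = sqrt(2 * C * M_sys) picks the
# checkpoint interval T that minimizes total overhead, which to first
# order is C / T (writing checkpoints) plus T / (2 * M_sys) (work lost
# to failures).
import math

NODE_MTBF_H = 5 * 365 * 24   # assume one failure per node every 5 years
CHECKPOINT_H = 0.1           # assume 6 minutes to write one checkpoint

for nodes in (1_000, 10_000, 100_000):
    m_sys = NODE_MTBF_H / nodes
    t_opt = math.sqrt(2 * CHECKPOINT_H * m_sys)           # Young's formula
    overhead = CHECKPOINT_H / t_opt + t_opt / (2 * m_sys)
    print(f"{nodes:>7} nodes: system MTBF {m_sys:7.2f} h, "
          f"useful work ~{max(0.0, 1 - overhead):.0%}")
```

With these assumed inputs the model lands near 93 percent useful work at 1,000 nodes, 79 percent at 10,000, and roughly a third at 100,000, the same order of magnitude as Fiala's estimate.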
Because of all the additional hardware needed for exascale systems, which could be built from a million or more components, system reliability will have to be improved by 100 times in order to keep the same MTBF that today's supercomputers enjoy, Fiala said.
Old, good advice: back up data
Fiala presented technology that he and fellow researchers developed that may help improve reliability. The technology addresses the problem of silent data corruption, when systems make undetected errors writing data to disk.
Basically, the researchers' approach consists of running multiple copies, or "clones," of a program simultaneously and then comparing the answers. The software, called RedMPI, is run in conjunction with the Message Passing Interface (MPI), a library for splitting running applications across multiple servers so the different parts of the program can be executed in parallel.
RedMPI intercepts and copies every MPI message that an application sends, and sends copies of the message to the clone (or clones) of the program. If different clones calculate different answers, then the numbers can be recalculated on the fly, saving the time and resources of running the entire program again.
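The toy sketch below illustrates the clone-and-compare principle. It is not RedMPI itself and contains no MPI calls; the computation, the simulated bit flip, and the voting helper are invented for the example.

```python
# Clone-and-compare sketch: run redundant copies of a computation and
# majority-vote on the result, so one silently corrupted copy can be
# outvoted and corrected without rerunning the whole job.
from collections import Counter

def vote(values):
    """Return the majority value among redundant copies."""
    value, count = Counter(values).most_common(1)[0]
    if count <= len(values) // 2:
        raise RuntimeError("no majority: value cannot be corrected")
    return value

def compute():
    # Stand-in for one clone's share of the real computation.
    return sum(i * i for i in range(1000))

REDUNDANCY = 3  # two backup copies per program, i.e. triple redundancy
results = [compute() for _ in range(REDUNDANCY)]
results[1] ^= 1  # simulate silent data corruption in one clone
print("corrected result:", vote(results))
```

With three copies, a single corrupted answer is simply outvoted; with only two, a mismatch can be detected but not resolved, which is why correction requires triple redundancy.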
"Implementing redundancy is not expensive. IT may be last in the keep down of substance counts that are needed, merely it avoids the deman for rewrites with checkpoint restarts," Fiala aforementioned. "The alternative is, of track, to simply rerun jobs until you guess you have the right solution."
Fiala suggested running two backup copies of each program, for triple redundancy. Though running multiple copies of a program would initially take up more resources, over time it may actually be more efficient, because programs would not need to be rerun to check answers. Also, checkpointing may not be needed when multiple copies are run, which would also save on system resources.
"I suppose the idea of doing redundancy is actually a great idea. [For] very large computations, involving hundreds of thousands of nodes, there certainly is a chance that errors will creep in," said Ethan Miller, a computing professor at the University of Calif. Father Christmas Cruz, who tended to the presentation. But atomic number 2 aforementioned the go about may be not be suitable surrendered the amount of web traffic that such redundance might create. He suggested running completely the applications on the Saami set of nodes, which could minimize internode traffic.
In another presentation, Ana Gainaru, a Ph.D. student from the University of Illinois at Urbana-Champaign, presented a technique for analyzing log files to predict when system failures would occur.
The work combines signal analysis with data mining. Signal analysis is used to characterize normal behavior, so when a failure occurs, it can be easily spotted. Data mining looks for correlations between separately reported failures. Other researchers have shown that multiple failures are sometimes correlated with each other, because a failure in one technology may affect performance in others, according to Gainaru. For instance, when a network card fails, it will soon hobble other system processes that rely on network communication.
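A toy version of this two-step pipeline might look like the following; the components, event counts, and thresholds are invented for illustration, and this is not Gainaru's implementation.

```python
# Failure-prediction sketch: learn each component's normal log rate from
# a quiet training window (the signal-analysis step), flag minutes far
# above that rate, then pair up anomalies on different components that
# fall close together in time (the data-mining step).

# Events per minute per component, as might be parsed from system logs.
counts = {
    "nic": {0: 2, 1: 2, 2: 9},   # network card gets noisy at minute 2
    "mpi": {0: 1, 1: 1, 3: 5},   # MPI layer follows at minute 3
}

TRAINING_MINUTES = 2  # assume the first two minutes are failure-free

anomalies = []
for comp, per_min in counts.items():
    baseline = sum(per_min.get(m, 0)
                   for m in range(TRAINING_MINUTES)) / TRAINING_MINUTES
    for minute, n in per_min.items():
        if minute >= TRAINING_MINUTES and n > 3 * baseline:
            anomalies.append((minute, comp))

# Anomalies on different components within a short window of each other
# suggest a correlated, cascading failure in the making.
anomalies.sort()
for i, (m1, c1) in enumerate(anomalies):
    for m2, c2 in anomalies[i + 1:]:
        if c1 != c2 and m2 - m1 <= 2:
            print(f"{c1} anomaly at minute {m1} precedes {c2} at minute {m2}")
```

In a real system the baseline would be learned continuously and the correlations mined from years of logs, but the principle is the same: the first anomaly is the early warning.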
The researchers found that 70 percent of correlated failures provide a window of opportunity of more than 10 seconds. In other words, when the first sign of a failure has been detected, the system may have up to 10 seconds to save its work, or move the work to another node, before a more critical failure occurs. "Failure prediction can be merged with other fault-tolerance techniques," Gainaru said.
Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
Source: https://www.pcworld.com/article/455762/heftiest-supercomputers-fall-hardest-researcher-suggests.html