Future computer systems will need to be significantly faster than today's supercomputers, scientists believe. One reason is that properly analyzing complex problems, such as climate modeling, takes ever more work. Massive quantities of calculations, performed at high speed and delivered as mistake-free data analysis, are needed for the fresh insights and discoveries expected down the road.
Limitations, though, exist in current storage, processing and software, among other components.
The U.S. Department of Energy’s four-year, $48 million Exascale Computing Project (ECP), started at the end of last year for science and national security purposes, aims to overcome those challenges. On its Argonne National Laboratory website, the project explains some of the potential hiccups it expects to run into. Part of the project is being studied at the lab.
The most radical of the ECP's thrusts is based on the assumption that memory is, and will continue to be, too expensive to be used the way it is today, where it matches processing power.
“Exascale systems will be 50 times faster than existing systems,” says Ian Foster, of the University of Chicago and Argonne, on the website. “But it would be too expensive to build out storage that would be 50 times faster as well.”
Data volumes need to be reduced by a factor of 10, the project’s scientists believe. That reduction could introduce errors, but those errors can’t be allowed to affect the scientific end results. Summarizing data and reducing features from the initial data are among the angles being explored, but lossy compression is the approach they think may well be used. The technique is similar to how a smartphone converts the image captured on its sensor into a smaller JPEG file, discarding superfluous data. Data is lost, but no one notices.
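To illustrate the idea, here is a minimal sketch of lossy compression in Python. It is not ECP's actual compressor; the sample values and error tolerance are invented for the example. It truncates 64-bit simulation values to 32-bit floats, halving the data, then applies ordinary lossless compression on top, and checks that the reconstruction error stays small.

```python
import struct
import zlib

# Hypothetical simulation output: 64-bit floats (8 bytes each).
data = [0.1 * i + 0.001 * (i % 7) for i in range(10000)]

raw = struct.pack(f"{len(data)}d", *data)    # full 64-bit precision
lossy = struct.pack(f"{len(data)}f", *data)  # truncated to 32-bit: half the bytes
compressed = zlib.compress(lossy, level=9)   # lossless compression on top

# Decompress and reconstruct; values differ slightly from the originals,
# but only within 32-bit float tolerance.
restored = struct.unpack(f"{len(data)}f", zlib.decompress(compressed))
max_err = max(abs(a - b) for a, b in zip(data, restored))
```

The lossy step alone halves the storage; the lossless step then squeezes out whatever redundancy remains. A real scientific compressor would bound the error per value rather than rely on float truncation, but the trade is the same: accept a small, controlled error in exchange for a large reduction in data volume.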
Other problems arise when applications run at the new, higher speeds. “Faults in high-performance systems are a common occurrence,” Argonne computer scientist Franck Cappello says. “And some of them lead to failures.”
Cappello reckons that an advanced form of checkpoint/restart should be pursued. Checkpoint/restart is a way to continue calculating after a computer crash, with no loss of data, by reverting to a known checkpoint created before the failure. The same concept is used today, but ECP’s version will be more scalable, the group claims.
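A toy version of the concept, assuming a simple single-process computation (real HPC checkpointing coordinates many nodes and is far more involved): the loop periodically saves its state to disk, a simulated failure kills the first run, and the second run resumes from the last checkpoint instead of starting over. The function names and parameters here are invented for illustration.

```python
import os
import pickle
import tempfile

def run(n_steps, ckpt_path, checkpoint_every=100, crash_at=None):
    """Run a toy computation, checkpointing state every few steps."""
    step, total = 0, 0
    if os.path.exists(ckpt_path):
        # Revert to the known checkpoint created before the failure.
        with open(ckpt_path, "rb") as f:
            step, total = pickle.load(f)
    while step < n_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated node failure")
        total += step          # stand-in for real computation
        step += 1
        if step % checkpoint_every == 0:
            tmp = ckpt_path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((step, total), f)
            os.replace(tmp, ckpt_path)  # atomic swap, so a crash mid-write can't corrupt it
    return total

ckpt = os.path.join(tempfile.mkdtemp(), "sim.ckpt")
try:
    run(1000, ckpt, crash_at=700)  # first attempt dies mid-run
except RuntimeError:
    pass
result = run(1000, ckpt)           # restart resumes from step 700, not step 0
```

The write-to-temp-then-rename step matters: it guarantees the checkpoint on disk is always a complete, consistent snapshot, even if the failure hits during the save itself.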
“Efficiency of memory and storage have to keep up with the increase in computation rates and data movement requirements,” Pete Beckman, of Argonne, adds in one of the website articles. The problem is that a more complicated, larger memory arrangement “affects how quickly you can retrieve it.”
Power is a further issue, the lab believes. “Tens of megawatts of power” will be needed.
Processing cores, too, will need special management, so software must be created and implemented to partition and organize the many cores. “Improving performance by, say, two to three percent, is equivalent to thousands of laptops’ worth of computation,” Beckman says. Containerization, where cores are grouped together and controlled as a unit, will be used to manage the processing.
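The core-grouping idea can be sketched in a few lines. This is only an illustration of partitioning cores into units and handing work to those units, not ECP's actual container software; the function names and the round-robin policy are invented for the example.

```python
def partition_cores(core_ids, group_size):
    """Split a machine's cores into fixed-size groups ('containers')."""
    return [core_ids[i:i + group_size]
            for i in range(0, len(core_ids), group_size)]

def assign_tasks(tasks, containers):
    """Hand each task a whole core group, round-robin across groups."""
    return {task: containers[i % len(containers)]
            for i, task in enumerate(tasks)}

# A hypothetical 8-core node split into two 4-core containers.
containers = partition_cores(list(range(8)), group_size=4)
placement = assign_tasks(["solver", "analysis", "io"], containers)
```

The point of managing cores as a unit rather than individually is exactly the scale problem Beckman describes: with millions of cores, scheduling them one at a time is hopeless, but scheduling a few thousand fixed groups is tractable.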
“Imagine you were able to solve a problem 50 times faster than you can now,” the lab says. “The ability to create such complex simulations helps researchers solve some of the world’s largest, most complex problems.”