Microsoft has apparently firmed up its plans for a DNA-based storage device that it expects to be commercially available within about three years.
The software giant originally unveiled its research into DNA as an archival storage medium last year; it described the technology as being able to store the amount of data in “a big data center compressed into a few sugar cubes. Or all the publicly accessible data on the Internet slipped into a shoebox.
“That is the promise of DNA storage — once scientists are able to scale the technology and overcome a series of technical hurdles,” the company said in a 2016 blog post.
A Microsoft spokesperson declined to comment on the progress of its DNA storage research.
But in an article in MIT Technology Review, Doug Carmean, a partner architect at Microsoft Research, said the company hopes to create a “proto-commercial system in three years storing some amount of data on DNA in one of our data centers, for at least a boutique application.”
Carmean described the storage device as about the size of a large, 1970s-era Xerox copier, with a data write speed of only 400 bytes per second, something he admitted needs to increase to 100MBps to compete with other archival storage media such as magnetic tape drives.
Natalya Yezhkova, a research director at IDC, said that given the staggering rate at which digital data is growing, a DNA-type storage medium will become critical in the next 10 to 15 years.
“Currently, the only way to address this growth is to increase footprint of data optimization techniques, whether that’s compression or deduplication,” Yezhkova said. “Those technologies are great, and mitigate some data growth, but in the longer term, we definitely need something else.”
For example, some healthcare data must be stored for the life of a patient, and federal regulations for auditing and civil litigation purposes require some financial records to be stored for seven or more years.
And, as big data analytics evolve, more companies are finding ways to cull useful marketing information from their sales and customer data archives.
Then there are video, photo and audio files, which every smartphone owner can create at their leisure and which are increasingly stored by cloud services.
Researchers with Microsoft and the University of Washington developed what they described as “a novel approach” to convert the long strings of ones and zeroes in digital data into the four basic building blocks of DNA sequences (adenine, guanine, cytosine and thymine), represented as As, Gs, Cs and Ts.
The digital data is broken down into pieces and stored by synthesizing it as a massive number of tiny DNA molecules, which can be dehydrated and preserved for long-term storage.
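To make the idea of turning bits into bases concrete, here is a minimal Python sketch. It is not the researchers' actual encoding, which the article does not describe; it assumes a naive mapping of two bits per base, and the function names and the 100-base fragment length are hypothetical choices for illustration.

```python
# Illustrative only: a naive two-bits-per-base mapping, NOT the
# Microsoft/UW scheme, which is not detailed in this article.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes into a string of A/C/G/T, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Reverse of encode(): turn a base string back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def split_into_fragments(strand: str, fragment_length: int = 100) -> list[str]:
    """Break one long sequence into short pieces (length is an assumption)."""
    return [strand[i:i + fragment_length]
            for i in range(0, len(strand), fragment_length)]


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)          # CAGACGGC
    print(decode(strand))  # b'Hi'
```

A real system would also have to cope with synthesis and sequencing errors, which this sketch ignores entirely.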
To access the stored data, the researchers encode the equivalent of zip codes and street addresses into the DNA sequences. Polymerase Chain Reaction (PCR) techniques — commonly used in molecular biology — help them more easily identify the zip codes they are looking for.
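The zip code analogy can be modeled loosely in software. The Python sketch below is an assumption-laden illustration, not a description of the lab process: the `Strand` type, its fields and the `retrieve` function are hypothetical, and an exact prefix match stands in for the selective amplification that PCR performs chemically.

```python
from typing import NamedTuple


class Strand(NamedTuple):
    """Hypothetical model of one stored DNA molecule."""
    address: str  # stands in for the "zip code" that PCR targets
    index: int    # stands in for the "street address" within a file
    payload: str  # the bases that carry the data


def retrieve(pool: list[Strand], address: str) -> str:
    """Pick out every strand with the wanted address and reassemble it.

    The pool is unordered, like molecules in a test tube, so the
    pieces are sorted by index before being joined.
    """
    matches = [s for s in pool if s.address == address]
    return "".join(s.payload for s in sorted(matches, key=lambda s: s.index))


if __name__ == "__main__":
    pool = [
        Strand("ACGT", 1, "GGCC"),
        Strand("TTAA", 0, "CCCC"),
        Strand("ACGT", 0, "CAGA"),
    ]
    print(retrieve(pool, "ACGT"))  # CAGAGGCC
```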
DNA has a theoretical limit of more than one exabyte per cubic millimeter, which is eight orders of magnitude denser than magnetic tape. DNA-based storage also has the benefit of eternal relevance: As long as there is DNA-based life, there will be strong reasons to read and manipulate DNA, the researchers said in a research paper.
Cloud service and hyperscale computing providers are constantly seeking new ways to store increasingly cumbersome amounts of data; that’s where DNA storage would likely see its initial home, according to Yezhkova. Cloud archive services such as Amazon Glacier or Google’s Cloud Platform would be likely candidates for a storage medium with vastly better capacities and longevity than today’s most prominent technologies.
“It’s a trade-off of speed versus the economics of storing massive amounts of data for 50 years or more that could be untouched,” Yezhkova said.
“It’s quite possible Amazon or Google could be researching DNA storage as well,” she continued. “They wouldn’t necessarily be talking about this or making it public.”
As promising as DNA storage appears to be, there are still issues that need to be solved before it can be a viable technology for the data center — for example, compatibility with existing applications and hardware. But, if those issues could be solved, “it would have a tremendous impact,” Yezhkova said.
Since 2005, the amount of electronic data has been doubling every two years, according to the Digital Universe, an ongoing study by IDC.
The study estimates that from 2005 to 2020, the amount of electronic data generated throughout the world will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes, which is more than 5,200GB for every person on earth.
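The IDC figures can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted above and assumes decimal units (one exabyte equals one billion gigabytes); the variable names are illustrative.

```python
import math

# Figures quoted from the IDC Digital Universe study.
START_YEAR, END_YEAR = 2005, 2020
START_EB, END_EB = 130, 40_000
GB_PER_PERSON = 5_200
GB_PER_EB = 10**9  # assumption: decimal (SI) units

growth_factor = END_EB / START_EB
# Doubling time implied by that growth over the 15-year span.
implied_doubling_years = (END_YEAR - START_YEAR) / math.log2(growth_factor)
total_gb = END_EB * GB_PER_EB
# World population implied by the per-person figure.
implied_population = total_gb / GB_PER_PERSON

print(f"Growth factor: {growth_factor:.0f}x")
print(f"Implied doubling time: {implied_doubling_years:.1f} years")
print(f"Total: {total_gb / 1e12:.0f} trillion GB")
print(f"Implied population: {implied_population / 1e9:.1f} billion")
```

The implied doubling time comes out a little under two years, which is in line with the study's "doubling every two years" description.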
Only a tiny fraction of the digital universe has been explored for analytic value. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed.
By 2020, nearly 40% of the information in the digital universe will be “touched” by cloud computing providers — meaning that a byte will be stored or processed in a cloud somewhere on its journey from originator to disposal.
Last year, researchers at Microsoft and the University of Washington (UW) said they had broken a world record by storing 200MB of data on synthetic DNA strands.
The researchers said the impressive part about reaching the 200MB milestone is not just how much data they could encode onto synthetic DNA and then decode, it’s also the space they were able to store it in.
Once encoded, the data occupied a spot in a test tube “much smaller than the tip of a pencil,” Carmean said at the time.
DNA storage also has a half-life of 500 years, even in harsh conditions. As with radioactive material, the half-life of DNA describes its rate of decay.
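To show what a 500-year half-life would mean in practice, here is a short Python sketch. It assumes simple exponential decay, in keeping with the comparison to radioactive material; the article does not give a more detailed decay model, and the function name is illustrative.

```python
def fraction_remaining(years: float, half_life_years: float = 500.0) -> float:
    """Fraction of the original DNA still intact after `years`,
    assuming simple exponential decay."""
    return 0.5 ** (years / half_life_years)


if __name__ == "__main__":
    for years in (50, 500, 1000):
        print(f"{years:>5} years: {fraction_remaining(years):.1%} intact")
```

Under that assumption, half of the molecules would survive 500 years and a quarter would survive 1,000 years. In practice, storing many copies of each sequence would offset the loss.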
Today’s most popular storage media (magnetic tape, hard disk drives, optical discs and NAND flash) all have limited lifespans, which top out anywhere from five years to several decades.
Meanwhile, the proportion of data in the digital universe that requires protection is growing faster than the digital universe itself, from less than a third in 2010 to more than 40% in 2020, according to IDC.
“With projects like the Internet of Things and big data analytics, the amount of data will just continue to increase and need to be stored. The problem with how to store all this data is always discussed in the industry,” Yezhkova said.