A DNA-Based Archival Storage System

Luis Ceze and Karin Strauss
University of Washington, Microsoft Research

Joint work with Doug Carmean, Georg Seelig, James Bornholt, Randolph Lopez, Lee Organick, Rob Carlson, Hsing-Yeh Parker, Yuan Chen, Chris Takahashi, Bichlien Nguyen, Sergey Yekhanin, Siena Dumas Ang, Sharon Newman.

© University of Washington and Microsoft Research. All rights reserved.

Library of Congress, Sep 2016.




DNA is the information storage medium for life

Gene → Protein → Function/Characteristic


Using synthetic DNA for data storage

100101010

But why?

Manufacture DNA → Dehydrate & store → Read DNA


DNA molecules for digital data

Extremely dense: 1 exabyte in 1 in³

Extremely durable: half-life > 500 years

Readers never become obsolete! (no migration :)

And consumes very little power at rest.


Comparing storage density

10⁷× potential improvement over tape

Addressing, Redundancy, System overheads


The ultimate storage hierarchy

Tier                  Access time   Capacity   Durability
Flash                 µs-ms         TBs        ~5 yrs
HDD                   10s ms        100s TBs   ~5 yrs
Tape                  minutes       PBs        ~10s yrs
DNA-based Archival    hours         ZBs        ~100s yrs

Our goal: build an integrated DNA storage system.


DNA molecules

Four nucleotides: A (Adenine), C (Cytosine), G (Guanine), T (Thymine)

DNA strand (oligonucleotide) is a linear sequence of these nucleotides

G A C A C C T

Two strands can bind to each other if they are complementary:

C T G T G G A
G A C A C C T

C, G are complementary
A, T are complementary
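The pairing rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple position-by-position pairing; it ignores strand direction, which the slide does not show. The function names `complement` and `can_bind` are made up for the example.

```python
# Pairing rule from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the strand that pairs position by position with `strand`."""
    return "".join(PAIR[base] for base in strand)


def can_bind(top, bottom):
    """True if the two strands are complementary at every position."""
    return len(top) == len(bottom) and complement(top) == bottom


print(complement("GACACCT"))            # CTGTGGA
print(can_bind("GACACCT", "CTGTGGA"))   # True
```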


DNA data storage at 30,000 feet

Write path: 11011101 → Encoding → AGCTATCAG → Synthesis
Read path: Sequencing → AGCTATCAG → Decoding → 11011101



Encoding digital data in DNA

101000111001000111100111110001011001010010111101…

2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1

G G A T G C A C T G C T T A C C G C C A G T T C

A = 0, C = 1, G = 2, T = 3
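The direct base-4 mapping above can be reproduced with a short Python sketch. It takes the bit string from the slide two bits at a time and applies A = 0, C = 1, G = 2, T = 3; the function names are illustrative only.

```python
BASES = "ACGT"  # A = 0, C = 1, G = 2, T = 3, as on the slide


def bits_to_quaternary(bits):
    """Read the bit string two bits at a time as base-4 digits."""
    if len(bits) % 2:
        raise ValueError("expected an even number of bits")
    return [int(bits[i:i + 2], 2) for i in range(0, len(bits), 2)]


def quaternary_to_dna(digits):
    """Map each base-4 digit to its nucleotide."""
    return "".join(BASES[d] for d in digits)


def dna_to_bits(strand):
    """Inverse mapping: each nucleotide back to two bits."""
    return "".join(format(BASES.index(base), "02b") for base in strand)


bits = "101000111001000111100111110001011001010010111101"
digits = bits_to_quaternary(bits)
strand = quaternary_to_dna(digits)
print(strand)  # GGATGCACTGCTTACCGCCAGTTC
```

The output matches the digit and nucleotide rows shown on the slide, and `dna_to_bits(strand)` returns the original bit string.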

Repeated letters are bad:

G C C T A A A

Use base 3 and “rotate” mapping.
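One way to read the base-3 "rotate" idea: each base-3 digit picks one of the three nucleotides that differ from the previous one, so the same letter never appears twice in a row. The sketch below is a minimal illustration under assumptions that are not on the slide: the three candidates are taken in alphabetical order, the strand is treated as if preceded by an A, and `to_trits` is a hypothetical helper for turning a number into base-3 digits.

```python
BASES = "ACGT"


def to_trits(value, width):
    """Fixed-width base-3 digits of a non-negative integer, most significant first."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 3)
        digits.append(remainder)
    if value:
        raise ValueError("value does not fit in the requested width")
    return digits[::-1]


def rotate_encode(trits, prev="A"):
    """Map base-3 digits to nucleotides; each output differs from its predecessor."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # the three allowed nucleotides
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def rotate_decode(strand, prev="A"):
    """Invert rotate_encode, given the same assumed starting nucleotide."""
    trits = []
    for base in strand:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


# Even a run of identical digits produces no repeated letters.
print(rotate_encode([0, 0, 0, 0, 1, 2]))  # CACAGT
```

The price of avoiding repeats is density: each nucleotide now carries one base-3 digit instead of two bits.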

P[Attach] = 99%

Strand being built: G G A T G C A …

Fraction of strands still correct after each attachment: 99%, 98%, 97%, 96.1%, 95.1%, 94.2%, …

At 100 nts: 36.6%

At 200 nts: 13.4%

Synthetic DNA sequences have limited length: Break it up
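The length limit follows from compounding the per-nucleotide attachment probability. A small sketch, assuming each attachment succeeds independently with probability 0.99 (the function name is illustrative):

```python
def full_length_yield(n_nts, p_attach=0.99):
    """Probability that all n_nts attachments succeed, assuming independence."""
    return p_attach ** n_nts


for n in (1, 2, 3, 100, 200):
    print(f"{n:>3} nts: {full_length_yield(n):.1%}")
```

At 100 and 200 nucleotides this gives roughly 36.6% and 13.4%, the figures quoted on the slide.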


Breaking up data into chunks (~150 nts)

G C A C C G A T T G C T G A C G G C T A G C T C

Chunk     Address within the file   File identifiers (“primers”)
1 of N    A A A A                   C A T C C, A T G T T
2 of N    A A A C                   C A T C C, A T G T T
3 of N    A A A G                   C A T C C, A T G T T

~20 bytes per DNA strand. Many strands per file.
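The chunking step can be sketched as follows. The field order used here (primer, data chunk, address, primer) is an assumption for illustration; the slide shows the fields but the exact layout is not recoverable from it. Addresses are written in base 4 with A = 0, which reproduces the A A A A, A A A C, A A A G sequence on the slide. The example uses 8-nucleotide chunks rather than ~150 so the output stays readable.

```python
BASES = "ACGT"


def address(index, width=4):
    """Encode a chunk index as a fixed-width base-4 address (A = 0 ... T = 3)."""
    if not 0 <= index < 4 ** width:
        raise ValueError("index does not fit in the address field")
    digits = []
    for _ in range(width):
        index, remainder = divmod(index, 4)
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def make_strands(payload, chunk_len, fwd_primer, rev_primer):
    """Split the payload into chunks and wrap each with an address and primers."""
    strands = []
    for start in range(0, len(payload), chunk_len):
        chunk = payload[start:start + chunk_len]
        strands.append(fwd_primer + chunk + address(start // chunk_len) + rev_primer)
    return strands


payload = "GCACCGATTGCTGACGGCTAGCTC"
strands = make_strands(payload, 8, "CATCC", "ATGTT")
for s in strands:
    print(s)
```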


Errors in writing/reading DNA

Original: G G A T G C A

Insertions: G G A T A G C A

Deletions: G G A T G A

Substitutions: G G A T C C A

Aggregate error rates ~1%.

Encode redundant data in additional DNA strands. Many possibilities: parity, Reed-Solomon, LDPC, …
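Of the options listed, parity is the simplest to illustrate. The sketch below XORs equal-length strands nucleotide by nucleotide (two bits each, A = 0 ... T = 3) to form one extra strand, from which any single lost strand can be rebuilt. It is only a toy under stated assumptions: it recovers one wholly missing strand per group, does nothing for insertions or deletions inside a strand, and the parity strand may contain repeated letters. The function names are hypothetical.

```python
BASES = "ACGT"


def xor_strands(a, b):
    """XOR two equal-length strands nucleotide by nucleotide (2 bits each)."""
    if len(a) != len(b):
        raise ValueError("strands must have equal length")
    return "".join(BASES[BASES.index(x) ^ BASES.index(y)] for x, y in zip(a, b))


def parity_strand(strands):
    """XOR of all strands; 'A' encodes 0, the identity for XOR."""
    parity = "A" * len(strands[0])
    for s in strands:
        parity = xor_strands(parity, s)
    return parity


def recover_missing(survivors, parity):
    """Rebuild the one missing strand from the survivors and the parity strand."""
    return parity_strand(survivors + [parity])


data = ["GCACCGAT", "TGCTGACG", "GCTAGCTC"]
parity = parity_strand(data)
# Pretend the second strand was lost and rebuild it.
print(recover_missing([data[0], data[2]], parity) == data[1])  # True
```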



DNA Synthesis

Manufacturing DNA strands

GACACCT → G A C A C C T

• Normally used for life sciences and medicine
• Millions of copies of each sequence
• Can make many different sequences in parallel

Twist Bioscience

Photo: Tara Brown / UW

10TB


DNA Storage “Library”

Write path: 11011101 → Encoding → AGCTATCAG → Synthesis → DNA storage (physical) library
Read path: DNA storage (physical) library → Sequencing → AGCTATCAG → Decoding → 11011101

foo.mp4

Data address specifies physical location.

~100TB-1PB



Random access?



Random access!

[Figure: a pool of strands tagged with different file identifiers (“primers”). PCR amplifies the strands whose primers match the requested file, e.g. C A T C C and A T G T T, and a sample of the result is taken.]
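The selection effect of primer-based access can be mimicked in software by filtering a pool of strands on their primers. This is only an analogy: real PCR is a chemical amplification in which primers bind to complementary sequences, and none of that is modeled here. The strand layout, the second file's primers, and the function names are assumptions made for the example.

```python
def pcr_select(pool, fwd_primer, rev_primer):
    """Keep only strands that start with fwd_primer and end with rev_primer."""
    return [s for s in pool if s.startswith(fwd_primer) and s.endswith(rev_primer)]


def strip_primers(strand, fwd_primer, rev_primer):
    """Remove the primers, leaving the data chunk and its address."""
    return strand[len(fwd_primer):len(strand) - len(rev_primer)]


pool = [
    "CATCC" + "GCACGGAT" + "AAAA" + "ATGTT",  # file 1, chunk 0
    "TATCT" + "ACTAGATC" + "AAAA" + "AGCGG",  # hypothetical second file
    "CATCC" + "TGCTTACC" + "AAAC" + "ATGTT",  # file 1, chunk 1
]

selected = pcr_select(pool, "CATCC", "ATGTT")
for s in selected:
    print(strip_primers(s, "CATCC", "ATGTT"))
```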


DNA Sequencing

Reading DNA strands

G A C A C C T → GACACCT

• Normally used for genome sequencing
• Reads many copies of millions of DNA strands at a time
• Currently much higher throughput than synthesis

C T G T G G A

G ACAG CA



Decoding back to digital data

Sequencing → Clustering reads → Error correction → Reassemble data → …110110101101…

[Figure: many noisy reads of the same strands. Each strand carries the file identifiers C A T C C and A T G T T, a data chunk (G C A C G G A T, T G C T T A C C, or G C C A G T T C), and an address (A A A A, A A A C, or A A A G). Some reads contain errors, e.g. C A T G C in place of C A T C C.]
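A toy version of the clustering and error-correction steps: group reads that are within a small Hamming distance of each other, then take a per-position majority vote inside each group. This is a sketch under simplifying assumptions, not the pipeline used in the talk: it handles substitutions only, requires reads of equal length, and uses the first read of each cluster as its representative. Insertions and deletions would need alignment instead.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length reads differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_dist=2):
    """Greedy clustering: join the first cluster whose first read is close enough."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if len(cluster[0]) == len(read) and hamming(cluster[0], read) <= max_dist:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def consensus(cluster):
    """Majority vote at each position across the reads in one cluster."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


reads = [
    "GCACGGATAAAA",
    "GCACGGATAAGA",  # one substitution
    "GCACGGATAAAA",
    "TGCTTACCAAAC",
    "TGCTTACCCAAC",  # one substitution
    "TGCTTACCAAAC",
]

for cluster in cluster_reads(reads):
    print(consensus(cluster))
```

Sequencing many copies of each strand is what makes the vote work: an error in one read is outvoted by the correct copies.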


Results

200MB as of July ’16

1.5 B nucleotides

10M DNA strands

last year: 1MB
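These totals are consistent with the per-strand figures given earlier in the talk (~150 nts and ~20 bytes per strand). A quick check, assuming 1 MB = 10^6 bytes:

```python
data_bytes = 200 * 10**6       # 200 MB, taking 1 MB = 10^6 bytes (assumption)
nucleotides = 1_500_000_000    # 1.5 B nucleotides
strand_count = 10 * 10**6      # 10M DNA strands

bytes_per_strand = data_bytes / strand_count   # user data carried per strand
nts_per_strand = nucleotides / strand_count    # average strand length
bits_per_nt = data_bytes * 8 / nucleotides     # net density after all overheads

print(bytes_per_strand, nts_per_strand, round(bits_per_nt, 2))  # 20.0 150.0 1.07
```

The net density of about 1.07 bits per nucleotide sits below the 2 bits of a direct base-4 mapping because primers, addresses, redundancy, and the repeat-avoiding code all take up nucleotides.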


10MBs/week 100GBs/second


DNA manipulation productivity is growing

[Figure: productivity (log scale, 10^2 to 10^10) versus year (1970 to 2010) for three series: Transistors on Chip, Reading DNA, Writing DNA.]

Source: Robert Carlson

And cost is decreasing…


Molecular Information Systems Lab

Computer architects, coding theorists, molecular biologists, fluidics, algorithms, …