
Page 1: 2013 siam-cse-big-data

Data of Unusual Size in Metagenomics

C. Titus Brown

[email protected]

Asst Professor, Michigan State University

(Microbiology, Computer Science, and BEACON)

Page 2: 2013 siam-cse-big-data

Openness

• Twit me! @ctitusbrown

• My blog: http://ivory.idyll.org/blog/

• Grants, preprints, etc.: http://ged.msu.edu/

• Software: BSD, github.com/ged-lab/.

Page 3: 2013 siam-cse-big-data

Thanks

• My lab, esp. Jason Pell, Arend Hintze, Adina Chuang Howe, Qingpeng Zhang, and Eric McDonald

• Michigan State, USDA and NSF for $$

Page 4: 2013 siam-cse-big-data

“Three types of data scientists.”

(Bob Grossman, U. Chicago, at XLDB 2012)

1. Your data gathering rate is slower than Moore’s Law.

2. Your data gathering rate matches Moore’s Law.

3. Your data gathering rate exceeds Moore’s Law.

Page 5: 2013 siam-cse-big-data

Metagenomics

• Randomly sequence DNA from mixed microbial communities, e.g. soil.

• DNA sequencing rates (volume per unit cost) have been outpacing Moore’s Law for ~5 years now… a terabase costs ~$10k today.

Page 6: 2013 siam-cse-big-data

Analogy: feeding libraries into a paper shredder, digitizing the shreds, and reconstructing the books.

Page 7: 2013 siam-cse-big-data

“Shredding libraries” is a good analogy!

• Lots of copies of Dickens’s A Tale of Two Cities, SAT study guides, etc.

• Not as many copies of <obscure hipster author>.

• Many different editions with minor differences, plus Reader’s Digest versions, excerpts, etc.

• (Although for libraries, we usually know the language.)

Page 8: 2013 siam-cse-big-data

Two points:

1. If we feed all of the libraries in the world into a paper shredder and mix, how do we recover the book content?!

Page 9: 2013 siam-cse-big-data

Two points:

2. That’s actually an awful lot of data…

Page 10: 2013 siam-cse-big-data

Digression: Data of Unusual Size (aka Big Data) in Scientific Research

• Research is already hard enough:

– Novel, fast-moving, heterogeneous data types.

– Unknown answers.

• Big Data => scaling, which requires good engineering:

– Apply or invent new data structures & algorithms.

– Write usable, functioning, reusable software.

(Hint: academics are not good at one of these things.)

Page 11: 2013 siam-cse-big-data

The assembly problem

• The N**2 approach: look at all pairs of overlapping fragments.

• The word-based approach: decompose reads into fixed-length, overlapping, hashable words (k-mers); see the sketch below.

(Only one of these scales…)
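A minimal sketch of the word-based decomposition (the function name and k=21 are illustrative, not from the slides). Once reads are reduced to k-mers, overlap detection becomes constant-time hash lookups rather than N**2 pairwise comparisons:

```python
def kmers(read, k=21):
    """Decompose a read into its overlapping, fixed-length k-mers."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# Example: a 7-base read and k=5 yield 3 overlapping 5-mers.
assert kmers("ACGTACG", k=5) == ["ACGTA", "CGTAC", "GTACG"]
```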

Page 12: 2013 siam-cse-big-data

Shotgun sequencing

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
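As a back-of-the-envelope check (the numbers below are illustrative, not from the slides), coverage is just total sequenced bases divided by genome size:

```python
# Expected coverage C = N * L / G (Lander-Waterman)
num_reads = 30_000_000       # N: number of reads (illustrative)
read_len = 100               # L: read length in bp (illustrative)
genome_size = 300_000_000    # G: genome size in bp (illustrative)

coverage = num_reads * read_len / genome_size
print(coverage)  # 10.0 -- i.e., ~10x, as in the figure
```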

Page 13: 2013 siam-cse-big-data

Reducing to k-mer overlaps

Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.

Page 14: 2013 siam-cse-big-data

Errors create new k-mers

Each single-base error generates ~k new k-mers. Generally, erroneous k-mers show up only once – errors are random.
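A toy demonstration (the sequence and error position are invented for illustration) that a single substitution creates up to k previously unseen k-mers:

```python
K = 5
true_seq = "ACGTACGTACGTACGT"                 # toy repeat sequence
err_seq = true_seq[:8] + "G" + true_seq[9:]   # one substitution at position 8

def kmer_set(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

novel = kmer_set(err_seq) - kmer_set(true_seq)
print(len(novel))  # 5 == K: every k-mer spanning the error is new
```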

Page 15: 2013 siam-cse-big-data

So, our k-mer data contains both true and false k-mers.

Page 16: 2013 siam-cse-big-data

Random sampling => deep sampling needed

Typically 10–100x coverage is needed for robust recovery (~300 Gbp for a human genome).

Page 17: 2013 siam-cse-big-data

Conway TC & Bromage AJ, Bioinformatics 2011;27:479–486.


Can we efficiently distinguish true from false k-mers?

Page 18: 2013 siam-cse-big-data

Uneven representation complicates matters.

Since you’re sequencing at random, you need to sequence deeply in order to be sensitive to rare hipster books.

These rare hipster books may be important to understanding culture: not only best-sellers have influence!

Page 19: 2013 siam-cse-big-data

Streaming algorithm to do so: digital normalization

Pages 20–24: 2013 siam-cse-big-data

Digital normalization

(Figure-only slides stepping through the digital normalization algorithm.)

Page 25: 2013 siam-cse-big-data

Streaming algorithm for lossy compression of data sets.

• Converts random sampling to systematic sampling by building an assembly graph on the fly.

• Can discard up to 99.9% of the data set (and its errors) and still retain all the information necessary for assembly.

• Acts as a prefilter for assemblers; the core loop is ~5 lines of Python (sketched below).

• Each piece of data is only examined once (!)

• Most errors are never collected => low memory.
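A minimal sketch of that core loop, assuming an exact Python dict where the real implementation (khmer) uses a low-memory probabilistic counting structure; the K and CUTOFF values are illustrative:

```python
from collections import defaultdict

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # coverage cutoff C (illustrative)

kmer_counts = defaultdict(int)  # khmer: probabilistic counter, not a dict

def median_kmer_count(read):
    """Estimate the read's coverage as the median abundance of its k-mers."""
    counts = sorted(kmer_counts[read[i:i + K]]
                    for i in range(len(read) - K + 1))
    return counts[len(counts) // 2]

def keep(read):
    """Keep the read only if its estimated coverage is still below the
    cutoff; only kept reads contribute their k-mers to the counts."""
    if median_kmer_count(read) < CUTOFF:
        for i in range(len(read) - K + 1):
            kmer_counts[read[i:i + K]] += 1
        return True
    return False
```

Because erroneous k-mers rarely recur, reads from already well-covered regions are discarded before their errors are ever counted – which is why most errors are never collected.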

Page 26: 2013 siam-cse-big-data

Separately, apply Bloom filters to storing the information/data.

(In the figure, “exact” marks the best possible information-theoretic storage.)

Pell et al., PNAS 2012
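A minimal Bloom filter sketch for k-mer membership; this toy uses hashlib-based hashing, whereas the actual khmer implementation spreads bits across several prime-sized tables:

```python
import hashlib

class BloomFilter:
    """Set membership with false positives but no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, kmer):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos] = 1

    def __contains__(self, kmer):
        return all(self.bits[pos] for pos in self._positions(kmer))

bf = BloomFilter(num_bits=1_000_000, num_hashes=4)
bf.add("ACGTACGTACGTACGTACGT")
print("ACGTACGTACGTACGTACGT" in bf)  # True
print("TTTTTTTTTTTTTTTTTTTT" in bf)  # almost certainly False
```

Storing every k-mer of every read in such a filter yields an implicit, probabilistic de Bruijn graph: to traverse it, query membership of each possible neighbor of the current k-mer.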

Page 27: 2013 siam-cse-big-data

Some details

• This was completely intractable.

• Implemented in C++ and Python; “good practice” (?)

• We’ve changed the scaling behavior from data to information.

• Practical scaling for ~soil metagenomics is 10–100x: need < 1 TB of RAM for ~2 TB of data, ~2 weeks.

• Just beginning to explore threading, multicore, etc. (BIG DATA grant proposal).

• Goal is to scale to 50 Tbp of data (~5–50 TB RAM currently).

Page 28: 2013 siam-cse-big-data

My rules of thumb for Big Data (for a better tomorrow)

1. Write well-understood filters and components, not monolithic programs.

Page 29: 2013 siam-cse-big-data

My rules of thumb for Big Data (for a better tomorrow)

2. Throw away data as quickly as possible.

Page 30: 2013 siam-cse-big-data

My rules of thumb for Big Data (for a better tomorrow)

3. Scripting is an extremely effective way to connect serious software to scientists.

Page 31: 2013 siam-cse-big-data

My rules of thumb for Big Data (for a better tomorrow)

4. Streaming/online approaches are worth the effort to develop them.

(OK, this is obvious to this audience)

Page 32: 2013 siam-cse-big-data

My rules of thumb for Big Data (for a better tomorrow)

1. Write well-understood filters and components, not monolithic programs.

2. Throw away data as quickly as possible.

3. Scripting is an extremely effective way to connect serious software to scientists.

4. Streaming/online approaches are worth the effort to develop them.
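To make the rules concrete, here is a sketch (the script name, cutoff, and format handling are all invented for illustration) of a tiny stdin-to-stdout filter in the spirit of rules 1–4: a small, well-understood component that streams its input once and throws away what it no longer needs:

```python
#!/usr/bin/env python
"""filter_short.py: drop reads shorter than a cutoff, FASTA in/out."""
import sys

MIN_LEN = 50  # illustrative cutoff

def read_fasta(stream):
    """Stream (header, sequence) records one at a time."""
    name, seq = None, []
    for line in stream:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line, []
        else:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

for name, seq in read_fasta(sys.stdin):
    if len(seq) >= MIN_LEN:  # keep only reads long enough to be useful
        print(name)
        print(seq)
```

Such filters compose on the command line, e.g. `gunzip -c reads.fa.gz | python filter_short.py | assembler`, so scientists can rearrange pipelines with shell scripting instead of modifying a monolith.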