Next Generation Sequencing, Assembly, and Alignment Methods Andy Nagar


Agenda
• Background
• Next Generation Sequencing
• Sequence Assembly
• Sequence Alignment
• Traditional Alignment Algorithms
• Next Generation Alignment Algorithms
• Conclusion

Background
• Earlier sequencing methods were based on Sanger sequencing, which dates back to the 1970s.
• Sequencing was slow: bases were read one at a time.
• Fragments are separated by electrophoresis.
• Readout is by fluorescent tags.

Source:[Wikipedia]

Background
• Large genome projects such as the Human Genome Project created a need for faster, high-throughput sequencing.
• Next-generation sequencing technologies are based on various implementations of cyclic array sequencing.
• Cyclic array sequencing means sequencing a dense array of DNA features by iterative cycles of enzymatic manipulation and imaging-based data collection.

Growth in Sequencing

Source:[6]

Growth of next-generation sequencing: doubles every month

Next Generation Sequencing
• Workflow:
  • DNA is fragmented.
  • Adaptors are ligated to the fragments.
  • One of several possible protocols yields an array of PCR colonies.
  • Enzymatic extension with fluorescently tagged nucleotides.
  • Cyclic readout by imaging the array.

Source:[10]

Next Generation Sequencing
• Reads are sequenced in parallel to speed up sequencing.

Source:[11]

NGS - Products
• Products based on cyclic array sequencing include:
  - Roche’s 454
  - Illumina’s Genome Analyzer
  - ABI’s SOLiD
  - HeliScope
• They sequence millions of short sequences (reads) simultaneously and can sequence an entire human genome in a few days [Magi et al 2010].


NGS - Products

Source:[13]


Comparison of existing methods

Source:[4]

Whole Genome Shotgun Sequences (WGS)
• DNA is broken up randomly into numerous small segments.
• Multiple overlapping reads of the target DNA are obtained by performing several rounds of this fragmentation and sequencing.
• Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.


Sequencing

Source:[9]


How to ensure enough coverage

Source:[9]
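The figure for this slide is not preserved, so the following is only a minimal sketch of how coverage is commonly estimated, assuming the usual relation c = N * L / G (N reads of length L over a genome of length G) and the Poisson approximation from the Lander-Waterman model. The function names and the example numbers are illustrative, not taken from the slides.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


def fraction_uncovered(coverage: float) -> float:
    """Poisson approximation: probability that a base is hit by zero reads."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical run: one million 100-base reads over a 5 Mb genome.
    c = expected_coverage(num_reads=1_000_000, read_length=100, genome_length=5_000_000)
    print(f"coverage = {c:.1f}x, uncovered fraction = {fraction_uncovered(c):.2e}")
```

Higher coverage shrinks the expected uncovered fraction exponentially, which is why shotgun projects sequence each base many times over.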


Whole Genome Shotgun Sequences (WGS)

Source: http://www.nature.com/scitable/topicpage/complex-genomes-shotgun-sequencing-609

Assembly - Reconstructing the Genome
• 2 possible methods of assembly:
1. Overlap consensus assembly: uses the overlap between sequence reads to create links between them. A contig is formed by following the links as far as possible.
Problematic for short reads:
  - Overlaps must be calculated over a large proportion of each read.
  - The huge number of reads increases the number of links, so the contig path is difficult to compute.
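The overlap idea can be sketched with a greedy merge: repeatedly join the two reads with the longest suffix-to-prefix overlap. This is a simplified stand-in for illustration, not a full overlap consensus assembler; the function names, the `min_len` threshold, and the toy reads are assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge the best-overlapping pair until no pair overlaps by at least min_len."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # All-pairs comparison: this quadratic step is what becomes costly
        # when there are millions of short reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATG"]))
```

The nested loop makes the two drawbacks listed above concrete: every read is compared against every other read, and each comparison scans a large part of both reads.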

Assembly - Reconstructing the Genome
• 2 possible methods of assembly:
2. de Bruijn graph approach:
  - All k-mers are computed, and each read is represented as a path through the k-mers.
  - A de Bruijn graph is a graph in which the nodes are sequences of symbols (here, short strings of nucleotides) and the edges represent overlaps between them. This is a convenient way to represent data such as overlapping sequence reads.
  - de Bruijn graphs handle redundancy better and can assemble sequences more efficiently.


Assembly - Reconstructing the Genome

Source:[13]


Assembly - Reconstructing the Genome

Source:[12]

Assembly - de Bruijn Graph
• Reads are parsed into 4-mers.
• Matches are found and the de Bruijn graph is created.
• There can be more than one path through the graph, which causes practical problems for assembly.

Source:[12]
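A minimal sketch of the construction described above, assuming one common convention in which each 4-mer becomes an edge between its leading and trailing 3-mers. The function names and the toy reads are illustrative and are not taken from the figure.

```python
from collections import defaultdict


def kmers(read: str, k: int):
    """All overlapping substrings of length k, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k: int = 4):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix node -> suffix node
    return dict(graph)


def branching_nodes(graph):
    """Nodes with more than one successor, where several paths become possible."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


if __name__ == "__main__":
    linear = de_bruijn_graph(["ATGGCG", "GGCGTG", "CGTGCA"], k=4)
    print(branching_nodes(linear))      # no ambiguity in this toy example
    repeated = de_bruijn_graph(["ATGCATGG"], k=4)
    print(branching_nodes(repeated))    # the repeated "ATG" creates a branch
```

Identical k-mers from different reads collapse into the same edge, which is how the graph absorbs redundancy; a repeated k-mer inside the genome produces a branch, which is the source of the multiple-path problem.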

What can we do about repeats?
Two main approaches:
• Cluster the reads
• Link the reads

Source:[9]


Traditional Sequence Alignment
• 2 types of traditional sequence alignment algorithms:
1. Hash-table based
  - e.g. BLAST (and its variants): keeps track of each k-mer in a hash table, with the sequence as the key [14][15].
  - SSAHA: builds a position-sensitive hash table [17].
Advantage: fast search, allows gapped searches.
Drawback: large memory requirement to store the hash table.
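The hash-table idea can be sketched as a k-mer index: every k-mer of the reference is a key, and its value is the list of positions where it occurs. This is only an illustration of the general approach, not the actual BLAST or SSAHA implementation; the function names and the seed-then-verify lookup are assumptions.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int):
    """Hash table: k-mer sequence -> list of start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_exact(reference: str, index, query: str, k: int):
    """Seed with the first k-mer of the query (len(query) >= k), then verify the full match."""
    hits = []
    for pos in index.get(query[:k], []):
        if reference.startswith(query, pos):
            hits.append(pos)
    return hits


if __name__ == "__main__":
    ref = "ACGTACGTGACG"
    idx = build_kmer_index(ref, k=3)
    print(idx["ACG"])                          # every position where the seed occurs
    print(find_exact(ref, idx, "ACGTG", k=3))  # seeds that extend to a full match
```

The lookup itself is a single dictionary access, which is the speed advantage; the index holds one entry per reference position, which is the memory drawback.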

Traditional Sequence Alignment
2. Tree-based search
  - e.g. suffix and prefix tries
Advantage: fast search, can easily search for substrings or patterns.
Drawback: inserting new sequences requires rebuilding the tree.

Traditional Sequence Alignment - Suffix Tree

Suffix tree for the string BANANA. Each substring is terminated with the special character $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links are drawn dashed.

Figure annotations: one internal node represents “ANA” and another represents “NA”. NA is a suffix of ANA, so a suffix link connects the two.

Source:[19]
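The BANANA example can be reproduced with a naive suffix trie, a simpler and more memory-hungry relative of the suffix tree in the figure (no path compression, no suffix links). This is a sketch for illustration; the nested-dict layout and the use of the `None` key to store leaf positions are assumptions. Positions are 0-based.

```python
def build_suffix_trie(text: str):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # leaf marker: start position of this suffix
    return root


def find_occurrences(trie, pattern: str):
    """Start positions of every occurrence of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that begins with the pattern.
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                starts.append(child)
            else:
                stack.append(child)
    return sorted(starts)


if __name__ == "__main__":
    trie = build_suffix_trie("BANANA")
    print(find_occurrences(trie, "ANA"))
    print(find_occurrences(trie, "NA"))
```

Searching costs time proportional to the pattern length plus the number of hits, independent of the text length. Note that this trie also contains a seventh path for the terminator "$" on its own.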

Next Generation Sequence Alignment
• With high-throughput sequencing, millions of reads are obtained in a single run.
• “Read-mapping” problem:
  - How do the reads fit in the reference genome?
  - Find hits where these reads occur in the genome.
  - Report position(s) and frequency of hits.
  - A short read may map to many locations, on many chromosomes, in a genome.
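To make the read-mapping problem concrete, here is a brute-force sketch that reports every exact hit of a read across chromosomes. It is a baseline for illustration only, far too slow for millions of reads; the `genome` dictionary, chromosome names, and sequences are hypothetical.

```python
def map_read(read: str, genome: dict):
    """Return every exact hit of the read as (chromosome, 0-based position)."""
    hits = []
    for chrom, seq in genome.items():
        pos = seq.find(read)
        while pos != -1:
            hits.append((chrom, pos))
            pos = seq.find(read, pos + 1)  # continue after the previous hit
    return hits


if __name__ == "__main__":
    genome = {"chr1": "ACGTACGT", "chr2": "TTACGA"}
    hits = map_read("ACG", genome)
    print(hits)       # positions of the hits
    print(len(hits))  # frequency of the hits
```

Each read triggers a scan of the whole genome, which is why indexed approaches are used in practice.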


Next Generation Sequence Alignment

Source:[25]

Next Generation Sequence Alignment
• The Burrows-Wheeler Transform can be used to find matches of a query string inside a reference string.
Steps:
1. Create an array in which each element is a cyclic rotation of the original string terminated by the end character “$”. Sorting this array in the next step yields the suffix array.
Example: string “googol”.
Original string: googol$
1st cyclic rotation => oogol$g
2nd cyclic rotation => ogol$go
… until $ moves to the front of the string
Last cyclic rotation => $googol

Source:[27]

Next Generation Sequence Alignment
Steps:
2. Sort the elements of the array from step 1 in lexicographic order to obtain the BW array. $ is lexicographically the smallest character.
i is the index of a row in the BW array; S(i) is the suffix array value for row i, that is, the start position in the original string of the suffix in that row.

Note: all suffixes that begin with a given substring are adjacent in the BW array. This range of rows is called the suffix array interval (SA interval). For example, “go” occurs as a prefix in rows 1 and 2, so the SA interval of “go” = [1,2].

BW Array

Source:[27]
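The rotation and sorting steps for “googol” can be sketched as follows. The function names are illustrative, and the SA interval is found here by a simple linear scan, which is an assumption made for clarity rather than how an aligner would do it. Rows and positions are 0-based.

```python
def rotations(text: str):
    """All cyclic rotations of text + '$', starting with the original string."""
    text += "$"
    return [text[i:] + text[:i] for i in range(len(text))]


def bw_array(text: str):
    """Return (sorted rotations, suffix array S, Burrows-Wheeler transform)."""
    rots = rotations(text)
    # '$' sorts before letters in ASCII, so it is the smallest character.
    order = sorted(range(len(rots)), key=lambda i: rots[i])
    sorted_rots = [rots[i] for i in order]
    bwt = "".join(row[-1] for row in sorted_rots)  # last column of the sorted array
    return sorted_rots, order, bwt


def sa_interval(sorted_rots, pattern: str):
    """Inclusive range of rows that start with pattern, or None if there is no match."""
    rows = [i for i, row in enumerate(sorted_rots) if row.startswith(pattern)]
    return (rows[0], rows[-1]) if rows else None


if __name__ == "__main__":
    rows, S, bwt = bw_array("googol")
    for i, row in enumerate(rows):
        print(i, S[i], row)
    print("BWT:", bwt)
    print("SA interval of 'go':", sa_interval(rows, "go"))
```

Because the rows are sorted, all rows beginning with the same pattern are contiguous, so a match is fully described by two row numbers.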

Next Generation Sequence Alignment
Steps:
SA interval of “go” = [1,2].
The values of S(i) give the corresponding positions in the original string.
Here the S(i) values are 3 and 0.

X = googol$

This algorithm has many extensions for finding inexact and gapped matches. More details are in reference [27].

BW Array

Source:[27]
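The SA interval can also be computed from the transformed string alone by backward search, which processes the query from its last character to its first. The sketch below uses the common FM-index formulation with half-open intervals internally; whether it matches the exact notation of the cited reference is not confirmed here, and all names are illustrative.

```python
from collections import Counter


def suffix_array(text: str):
    """Start positions of the suffixes of text (which must end in '$'), in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa):
    """BWT character for each row: the character just before the suffix (wraps to '$')."""
    return "".join(text[i - 1] for i in sa)


def backward_search(pattern: str, bwt: str):
    """Inclusive SA interval of pattern, or None if it does not occur."""
    counts = Counter(bwt)
    first_row, total = {}, 0
    for ch in sorted(counts):  # first_row[ch] = number of characters smaller than ch
        first_row[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt)  # half-open interval covering every row
    for ch in reversed(pattern):
        if ch not in first_row:
            return None
        lo = first_row[ch] + bwt.count(ch, 0, lo)
        hi = first_row[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return None
    return lo, hi - 1


if __name__ == "__main__":
    X = "googol$"
    S = suffix_array(X)
    B = bwt_from_sa(X, S)
    interval = backward_search("go", B)
    print(interval)
    print(sorted(S[i] for i in range(interval[0], interval[1] + 1)))
```

Each query character costs two counting operations here; production aligners precompute occurrence tables so that each step takes constant time.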

Conclusion
• Next Generation Sequencing is transforming the fields of genetics, molecular biology and bioinformatics.
• Sequencing projects produce enormous amounts of data.
• Computing and data analysis are lagging behind.
• More efficient data analysis and storage methods are needed.
• Data mining can find useful information fast, without the need to store the entire data set.

Conclusion
• More efficient assembly and alignment techniques are needed.
• There is a need for “metagenomic” analysis: finding out which organisms or species are present in a biological or environmental sample.


References
