
Steps in a genome sequencing project

Funding and sequencing strategy
• source of funding identified / community drive
• development of sequencing strategy

• random shotgun (chromosome & whole genome): sheared gDNA libraries, physical maps not necessary, fast, whole genome coverage produced quickly, assembly may be problematic
• clone-by-clone (map-as-you-go): BAC, YAC, cosmid libraries & physical maps, slower, data produced less quickly from isolated regions

• procurement of DNA: library construction, test sequencing, analysis of data
• large-scale sequencing of libraries

Assembly and data release
• for shotgun projects:
  at 3X: first assembly, release of genome data
  at 5-6X: ~97% of genes sequenced
  at 8-10X coverage: final assembly
• for clone-by-clone: sequence of clones released as completed

Closure
• gap closure, repeat resolution, identification of mis-assemblies: time-consuming, expensive
• comparison to physical/genetic/optical maps

Gene finding and annotation
• train gene finding algorithms and predict gene models
• genome annotation: auto-annotation vs manual annotation
• genome analysis, comparative genomics, publication, final data release to GenBank

Sequencing strategies for long DNA

We can’t directly sequence long DNA (yet), but we can assemble the master sequence from smaller pieces.

Shotgun Library Construction & Sequencing

Concept:

1) Shred long DNA into lots of random short fragments
2) Sequence both ends of the fragments
3) Reassemble the original DNA from overlapping sequences of the fragments

SOUNDS EASY!

Methods:
• sonication
• syringe
• nebulization

NOT RESTRICTION ENZYMES
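The shotgun concept (random shearing, then reading both ends of each fragment) can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions: the genome is a random string, every fragment has exactly the same length, and reads are error-free. The function names (`shear`, `mate_pair`) and all the numbers are made up for illustration. Picking start positions uniformly at random mimics mechanical shearing; restriction enzymes would always cut at the same sites, so fragments would not overlap randomly.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def shear(genome, n_fragments, insert_size, rng):
    """Cut n_fragments pieces of length insert_size at random start positions."""
    fragments = []
    for _ in range(n_fragments):
        start = rng.randrange(0, len(genome) - insert_size + 1)
        fragments.append(genome[start:start + insert_size])
    return fragments

def mate_pair(fragment, read_length):
    """Read both ends of one fragment.

    The 5' end read is taken directly; the 3' end read is the reverse
    complement of the last read_length bases, as if read off the other strand.
    """
    left = fragment[:read_length]
    right = reverse_complement(fragment[-read_length:])
    return left, right

# Toy example: hypothetical 5 kb "genome", 1 kb inserts, 100 bp reads.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
fragments = shear(genome, n_fragments=200, insert_size=1000, rng=rng)
pairs = [mate_pair(f, read_length=100) for f in fragments]
```

Each element of `pairs` is one mate pair: two short reads known to lie about one insert length apart, which is the information the assembler uses later.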

Size-selected shotgun fragment

Libraries

•Small insert library provides most of the sequence coverage (contigs)

•Large insert libraries help order the contigs (and scaffolds)

[Figure: mate pairs from a small-insert fragment (~1 kb between reads) and from a large-insert fragment (~9 kb between reads); each fragment has a 5’ end read and a 3’ end read]

Assembly of contigs from mate pairs

• must have high-quality (well-trimmed) input DNA, to reduce false overlaps
• reads must be mostly mate pairs (<25% single reads)
• library insert size variance must be kept low (<10%) for accurate prediction of distance between mate-pair sequences
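The two numeric requirements above can be checked with a small script. This is a rough sketch, not a standard QC tool: the function name `library_qc` is hypothetical, and it assumes that "variance < 10%" means the coefficient of variation (standard deviation divided by mean) of the insert sizes, which the slide does not spell out.

```python
from statistics import mean, pstdev

def library_qc(insert_sizes, n_paired_reads, n_single_reads,
               max_single_fraction=0.25, max_cv=0.10):
    """Check a library against the two thresholds quoted on the slide.

    insert_sizes: measured insert sizes (bp) for a sample of clones.
    Assumption: "variance" is interpreted as coefficient of variation.
    """
    total_reads = n_paired_reads + n_single_reads
    single_fraction = n_single_reads / total_reads
    cv = pstdev(insert_sizes) / mean(insert_sizes)
    return {
        "single_fraction": single_fraction,
        "insert_cv": cv,
        "passes": single_fraction < max_single_fraction and cv < max_cv,
    }

# Hypothetical ~1 kb library: 1800 reads in mate pairs, 200 single reads.
report = library_qc([980, 1010, 1005, 995, 1020, 990],
                    n_paired_reads=1800, n_single_reads=200)
```

A library whose `passes` entry is false would give unreliable distance estimates between mate pairs, so it is worth catching before assembly.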

Scaffolds, or ‘Why we sequence mate pairs from longer fragments’

low-complexity/repetitive

Knowing the sizes of inserts can tell us roughly what we don’t know (sometimes).
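To make the idea concrete: if one read of a mate pair lands in contig A and the other in contig B, the known insert size minus the parts of the insert that lie inside the two contigs gives a rough gap size. The sketch below assumes the reads have already been placed on the contigs and that both contigs are in the same orientation; the function names and the numbers are illustrative only.

```python
from statistics import mean

def estimate_gap(insert_size, a_to_end, b_from_start):
    """Estimate the gap between contig A and contig B from one mate pair.

    a_to_end:     bases from the outer end of the read in contig A
                  to the end of contig A.
    b_from_start: bases from the start of contig B to the outer end
                  of the read in contig B.
    """
    return insert_size - a_to_end - b_from_start

def estimate_gap_from_pairs(insert_size, spanning_pairs):
    """Average the estimates from several mate pairs spanning the same gap."""
    return mean(estimate_gap(insert_size, a, b) for a, b in spanning_pairs)

# Hypothetical ~9 kb library with three mate pairs spanning one gap.
spanning = [(3000, 4000), (2500, 4600), (3500, 3300)]
gap = estimate_gap_from_pairs(9000, spanning)
```

Because real insert sizes vary, the estimate is only as good as the library; a strongly negative value would suggest the two contigs actually overlap or that the pair is misplaced.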

Scaffolds into chromosomes

Two ways of thinking about: COVERAGE

What does “8X coverage” mean??

- The average number of times any given base in the genome was sequenced (in this case, each base was read 8 times on average; of course a particular base may have been read more or fewer than 8 times)

also

- The amount of sequence that was obtained, relative to the length of the whole genome (in this case, the aggregate length of all reads was 8 times the genome length)

Lander & Waterman (1988) determined that for an ideal genome project (no ‘difficult’ regions) 8X-10X coverage is sufficient to confidently complete the genome.

NO EUKARYOTIC GENOME IS THAT WELL-BEHAVED

So even with 8X shotgun coverage there’s likely at least ~1% of the genome remaining to be finished, by more laborious and expensive means

(The human genome…are we there yet??)
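For a feel of the numbers, coverage can be computed directly from the second definition, and an idealized Poisson model (reads landing uniformly at random, the kind of assumption behind Lander & Waterman's analysis) gives the expected fraction of bases never read as exp(-coverage). The sketch below is an approximation under those assumptions; it ignores the minimum overlap needed to join two reads, and the genome size, read length and read count are hypothetical.

```python
import math

def coverage(n_reads, read_length, genome_length):
    """Aggregate read length divided by genome length (the 'X' in 8X)."""
    return n_reads * read_length / genome_length

def fraction_unsequenced(c):
    """Expected fraction of bases with zero reads, Poisson approximation."""
    return math.exp(-c)

def expected_contigs(n_reads, c):
    """Rough expected number of contigs, ignoring minimum overlap length."""
    return n_reads * math.exp(-c)

# Hypothetical project: 20 Mb genome, 500 bp reads, 320,000 reads.
c = coverage(n_reads=320_000, read_length=500, genome_length=20_000_000)
missing = fraction_unsequenced(c)
contigs = expected_contigs(320_000, c)
```

Under this ideal model, 8X leaves only a few hundredths of a percent of the genome unread, far less than the ~1% quoted above for real projects. The difference is the point of the slide: repeats, cloning bias and low-quality regions break the uniform-sampling assumption.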

Some genomes are relatively well-behaved: nearly all sequence reads were assembled into contigs → scaffolds → chromosomes, with relatively few or no gaps remaining (e.g., Plasmodium falciparum)

Some genomes are very badly behaved and far from finished; reads may remain unassigned to contigs, much less scaffolds, much less chromosomes. There are lots of gaps (Ns) and lots of repeats. E.g., Trichomonas vaginalis genome: huge, highly repetitive, AT-rich; low-quality sequence was allowed in to increase coverage/gene calls in ‘difficult’ regions.

Finishing

• Closure of gaps between contigs/scaffolds
• Correction of misassemblies
• Resequencing of low-coverage/low-quality regions

This is usually the most time-consuming part of the project. Repeat/low complexity regions can be hard to sequence and hard to know where to ‘put’ in the final assembly.

Sequence hierarchy

genome (all chromosomes)

Chromosome (one or more scaffolds... ultimately one contig!): ordered sets w/gaps

Scaffold (two or more contigs): ordered sets w/gaps, gap size estimated

contig: overlapping, ordered sets of reads, no gaps

reads (mate-pair & single)

Scaffolds and contigs are assembly constructs, not biological entities.
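One way to picture the contig/scaffold relationship is as a small data structure: a scaffold is an ordered list of contigs plus an estimated gap size between each neighbouring pair, and its sequence is written out with runs of N for the gaps. The class and field names below are invented for illustration and do not correspond to any particular assembler's format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Contig:
    name: str
    sequence: str  # consensus of overlapping reads, no gaps

@dataclass
class Scaffold:
    name: str
    contigs: List[Contig]   # ordered along the scaffold
    gap_sizes: List[int]    # estimated gap (bp) between consecutive contigs

    def __post_init__(self):
        # One gap estimate is needed between each pair of neighbours.
        if len(self.gap_sizes) != len(self.contigs) - 1:
            raise ValueError("need exactly one gap size per contig junction")

    def sequence(self):
        """Join contigs, filling each estimated gap with a run of N."""
        parts = [self.contigs[0].sequence]
        for gap, contig in zip(self.gap_sizes, self.contigs[1:]):
            parts.append("N" * gap)
            parts.append(contig.sequence)
        return "".join(parts)

# Tiny hypothetical scaffold: two contigs separated by an estimated 3 bp gap.
scaffold = Scaffold("scaf1",
                    [Contig("c1", "ACGT"), Contig("c2", "GGCC")],
                    gap_sizes=[3])
scaffold_sequence = scaffold.sequence()
```

Finishing a chromosome then amounts to shrinking `gap_sizes` to nothing, at which point the scaffold collapses into a single contig.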

Post-sequencing steps

Automated
• gene calling (setting boundaries)
• annotation (guessing function)

Manual
• refining gene models
• correcting annotation
• should be an ONGOING process…wish it was

OTHER STUFF (demonstrated on the websites)
• Adding columns
• Sorting (some are presorted)
• Gaps: more than one N (within scaffold, gap between scaffolds), vs ambiguities (contig) (see P. falciparum)
• Chromosome as one giant contig…or one giant scaffold
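The gap-versus-ambiguity distinction can be checked on any downloaded sequence by looking at runs of N. The sketch below follows the convention described above (a run of more than one N is treated as a gap, a single N as an ambiguous base call); the function name is hypothetical, and individual databases may use other conventions.

```python
import re

def classify_n_runs(sequence):
    """Split runs of N into gaps (length > 1) and ambiguities (length 1).

    Returns two lists of (start, length) tuples, with 0-based start positions.
    """
    gaps, ambiguities = [], []
    for match in re.finditer(r"N+", sequence.upper()):
        run = (match.start(), match.end() - match.start())
        if run[1] > 1:
            gaps.append(run)
        else:
            ambiguities.append(run)
    return gaps, ambiguities

# Toy sequence with one ambiguous base and one 5 bp gap.
gaps, ambiguities = classify_n_runs("ACGTNACGTNNNNNACGT")
```

A chromosome with an empty `gaps` list is effectively one giant contig; one with gaps remaining is still one giant scaffold.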