Genome Sequencing and Annotation (Part 1). Objective of most genome projects Sequencing – DNA, mRNA Identify genes characterize gene features This chapter

Genome Sequencing and Annotation (Part 1)

Objective of most genome projects
• Sequencing – DNA, mRNA
• Identify genes
• Characterize gene features

This chapter
• How blocks of DNA sequence are obtained
• How these blocks are assembled into contigs, then genomes
• Bioinformatics – how to do sequence alignment, such as cDNA/EST and genome sequences
• Annotation of ORFs
• Other features of genes – repetitive elements, variable distribution of GC content, evolutionarily conserved elements
• Gene annotation by cross-species comparison

2.1 (Part 2) The principle of dideoxy (Sanger) sequencing

Automated DNA sequencing

In 1977, F. Sanger developed the chain-termination method (Sanger sequencing).
Sanger won his second Nobel Prize for inventing this process.

• Most current sequencing projects use the chain-termination method
– Also known as Sanger sequencing, after its inventor
• Based on the action of DNA polymerase
– Adds nucleotides to the complementary strand
• Requires template DNA and a primer

Automated DNA sequencing

• Dideoxynucleotides (ddA, ddT, ddC or ddG) stop synthesis
– Chain terminators (DNA polymerase cannot add another nucleotide)
• Included in limiting amounts, so that some chains terminate at each position where the base appears in the template
• Use four reactions
– One for each base: A, C, G and T

Chain-termination sequencing

Template
3’ ATCGGTGCATAGCTTGT 5’

Sequence reaction products
5’ TAGCCACGTATCGAACA* 3’
5’ TAGCCACGTATCGAA* 3’
5’ TAGCCACGTATCGA* 3’
5’ TAGCCACGTA* 3’
5’ TAGCCA* 3’
5’ TA* 3’
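The products listed for this template can be reproduced with a short sketch. It assumes an idealized reaction in which a chain terminates at every position where the chosen dideoxy base is incorporated; the function name `terminated_products` is illustrative and the primer is not modeled.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def terminated_products(template_3to5, dd_base):
    """Return every chain that ends at dd_base, longest first."""
    # The new strand grows 5'->3' while the template is read 3'->5'.
    new_strand = "".join(COMPLEMENT[b] for b in template_3to5)
    # Chain lengths at which the dideoxy base could have been incorporated.
    ends = [i + 1 for i, b in enumerate(new_strand) if b == dd_base]
    return [new_strand[:n] for n in sorted(ends, reverse=True)]

products = terminated_products("ATCGGTGCATAGCTTGT", "A")
for p in products:
    print(f"5' {p}* 3'")
```

Calling the function with "C", "G" or "T" gives the products of the other three reactions.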

Sequence detection

• To detect the products of the sequencing reaction, include labeled nucleotides
• Formerly, radioactive labels (33P or 35S) were used
• Now fluorescent labels are used
• Use a different fluorescent tag for each nucleotide
• Can run all four reactions in a single gel lane or capillary tube

TAGCCACGTATCGAA*

TAGCCACGTATC*

TAGCCACG*

TAGCCACGT*

Sequence separation

• Terminated chains need to be separated
• Requires one-base-pair resolution
– See the difference between chains of X and X+1 base pairs
• Gel electrophoresis
– Very thin gel
– High voltage applied
– Works with radioactive or fluorescent labels
– Negative pole at the top

[Figure: Sequence separation. Gel lanes C, A, G, T, with the – and + poles marked]

Sequence reading of radioactively labeled reactions

• The final step of sequencing is to read the sequence
• Radioactively labeled reactions
– Gel is dried
– Placed on X-ray film
– Film is developed, and the position of each band becomes visible
• Sequence is read from the bottom up (starting at the positive pole)
• Each of the four lanes gives the positions of a different base: A, T, C or G

[Figure: autoradiograph with lanes A, T, C, G and the + and – poles marked]

Sequence reading of fluorescently labeled reactions

• Fluorescently labeled reactions are scanned by a laser as each fragment passes a particular point
• Color is picked up by a detector
• Output is sent directly to a computer
• The readout is given both in terms of bases and the intensity of each color, so that ambiguous readings are easily identified

Summary of chain termination sequencing

A primer is extended by DNA polymerase according to the sequence of the template strand.

Each chain is terminated by a ddNTP that is complementary to a base in the template strand.

The four reactions are separated on a gel that can resolve one-base differences.

The sequence is then read from the bottom of the gel to the top.
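Reading the gel from the bottom up amounts to sorting the terminated chains by length. The sketch below assumes an idealized gel in which each lane is given as a list of chain lengths; the lane contents are derived from the example template used earlier, and the name `read_gel` is illustrative.

```python
def read_gel(lanes):
    """lanes maps each base to the lengths of the chains terminated at it."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    # The shortest chains migrate furthest, so sorting by length reads bottom to top.
    return "".join(base for _, base in sorted(bands))

lanes = {
    "A": [2, 6, 10, 14, 15, 17],
    "C": [4, 5, 7, 12, 16],
    "G": [3, 8, 13],
    "T": [1, 9, 11],
}
sequence = read_gel(lanes)
print(sequence)
```

The result is the newly synthesized strand, which is complementary to the template.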

The new techniques and equipment include:
(1) Four-color fluorescent dyes have replaced the radioactive label
(2) Rather than stopping the electrophoresis at a particular time, the products are scanned for laser-induced fluorescence just before they run off the end of the electrophoresis medium
(3) Improvements in the chemistry of template purification and the sequencing reaction
(4) Slab gel electrophoresis gave way to capillary electrophoresis with the introduction in 1999 of Applied Biosystems' ABI Prism 3700 automated sequencers, which in turn were replaced by ABI Prism 3730 DNA analyzers in 2003 (these deliver extremely high-quality, long reads and save time and money)

High-Throughput Sequencing

ABI Prism 3730 DNA analyzers

Base-calling – the reading of raw sequence traces

Now routinely performed by automated software that reads bases, aligns similar sequences and supports editing

Program – phred http://www.phrap.org

The program assigns a probability score to the accuracy of each base call as the trace is read
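Such probability scores are conventionally reported on the phred quality scale, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The sketch below only converts between the two quantities; it does not reproduce how phred estimates P from a trace, and the function names are illustrative.

```python
import math

def phred_quality(p_error):
    """Quality value for an estimated probability that the base call is wrong."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Inverse conversion: error probability implied by a quality value."""
    return 10 ** (-q / 10)

for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p)))
```

On this scale a quality of 20 corresponds to one expected error in 100 calls and a quality of 30 to one in 1000.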

Reading sequence traces

2.3 Automated sequence chromatograms

(A) This sequence shows the noisiness of the first 30 bp of a run.
(B) The middle two rows show a segment of two sequences that are polymorphic for both SNPs and an indel.
(C) A decline in sequence quality typically occurs after about 800 bp.

Ex. 2.1 Reading a sequence trace

A base is labeled N when the sequence quality is too poor to call it.
Where two peaks of the same height are observed at the same location, the site is heterozygous for a C and T SNP.

Figure 2.5 An aligned-reads window in consed

Contig Assembly

Assembling DNA seq. fragments
• A microbial genome of ~4 Mbp is stitched together from >50,000 fragments (known as reads)
• Input comes from the sequencing machine's chromatograms or traces (fluorescent profiles with peaks and valleys); no public websites handle this step, only commercial or public software packages
• cDNA is the reverse transcript of an mRNA; an EST is a single-pass, partial cDNA sequence
• A cDNA can be deduced from ESTs, or ESTs can be assembled from the dbEST database
• Contig – recognize significant overlaps between fragments and assemble them into a single sequence
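The idea of joining fragments through their overlap can be sketched as follows. This is a minimal illustration that assumes error-free reads on the same strand and exact suffix-prefix overlaps; real assemblers tolerate mismatches and use quality scores. The names `overlap_length` and `merge` and the two reads are made up for the example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0

def merge(left, right, min_overlap=3):
    """Join two reads into one contig, or return None if they do not overlap."""
    n = overlap_length(left, right, min_overlap)
    return left + right[n:] if n else None

contig = merge("TAGCCACGTATC", "GTATCGAACA")
print(contig)
```

The `min_overlap` threshold stands in for the requirement that an overlap be significant rather than a chance match of a few bases.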

Assembling DNA seq. fragments

NCBI dbEST databases http://www.ncbi.nlm.nih.gov/Database/
• View the EST statistics
• FTP EST files

IFOM assembler http://bio.ifom-firc.it/ASSEMBLY/assemble.html
• Builds a contig from multiple EST sequences
• The maximum number of sequences you can enter is 10,000
• Example: use gi (15744427, 19124086, 8147732, 8147734, 20393914, 13728017)
• Lengths (850, 1062, 634, 596, 869, 768) bp
• Result: a single contig consensus sequence, which can be used for a similarity search against a database


>gi|15744427|gb|BI752849.1|BI752849 603022060F1 NIH_MGC_114 Homo sapiens cDNA clone IMAGE:5192510 5', mRNA sequenceCGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGCGGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGACGTGGAACTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTGCCATGGGAGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGATGTGGACTCAAAGCCCT>gi|19124086|gb|BM807263.1|BM807263 AGENCOURT_6574903 NIH_MGC_124 Homo sapiens cDNA clone IMAGE:5732238 5', mRNA 
sequenceGTCCGGAATTCCCGGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGCGGGGAACAGGGTCCTGAAACCTGACCATTTTGCCCCAGACCTTGACCAATCCACCTCATGGCGATTCTCCCTCCAGGAATTCCCCGACCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAACTTGGAATTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT>gi|8147732|gb|AW958049.1|AW958049 EST370119 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequenceGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTCATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAAGCGAGAATCTTGTGAAGCTGAAGAACAGTCTGGAAGGCAAGATGAGCTTTTTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA>gi|8147734|gb|AW958051.1|AW958051 EST370121 MAGE resequences, MAGE Homo sapiens cDNA, mRNA 
sequenceGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCATATGAGCTTTTTGCTGTGATTGCGCACCTGGGAATGCAAAACTCCGTCATTACTG>gi|20393914|gb|BQ213074.1|BQ213074 AGENCOURT_7559959 NIH_MGC_72 Homo sapiens cDNA clone IMAGE:6055692 5', mRNA sequenceAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAGCTTTTTGCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTACATCCGGAATGCTGTGGATGGAAAATGGTTCTGCTTCAATGACTCCAATATTTGCTTGGTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGGCCAAGGGCAGTGGCAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGAAGAAACACCATTAATTTCCTAATGAATCCAAGTGGTTTGTAACTTGCCTATTCCTTTTATTCCAGCAAAAAATTAATTGATCATCCCCTCCCCCAAAAAATAGGGG>gi|13728017|gb|BG206330.1|BG206330 RST25778 Athersys RAGE Library Homo sapiens cDNA, mRNA 
sequenceTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGCCAAAGGTCAGTGGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAACCAACATTAATTCCATATGAATCAAGTGTTTTGGAACTGCTATTCATTTATTCAGCAAATATTTATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAACACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTTTTTCACAAAATTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAATTTAGAACCCGTTCCTGACGCGGGGGN

Assembling DNA seq. fragments – 6 GI fragments


List of assembled fragments


Overlap details


End of overlap details

Assembled mRNA sequence

• The most important class of bioinformatics tools – pairwise alignment of DNA and protein sequences

         Alignment 1    Alignment 2
Seq. 1   ACGCTGA        ACGCTGA
Seq. 2   A--CTGT        ACTGT--

• Seeks alignments with high sequence identity, few mismatches and few gaps
• Assumption – the observed identity between the aligned sequences results either from chance or from a shared evolutionary origin
• Identity ≠ similarity
• Treating sequence identity as homology is a risky assumption: sequence identity ≠ homology

Box 2.1 Pairwise Sequence Alignment


Figure A Common evolutionary events and their effects on alignment

indel

The same true alignment can arise through different evolutionary events

Scoring scheme: substitution -1, indel -5, match 3

Score 9 5 4 4
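A scoring scheme of this kind can be applied column by column to any gapped alignment. The sketch below scores the two alignments of sequences 1 and 2 from Box 2.1; the function name `alignment_score` is illustrative, and gaps are written as "-".

```python
def alignment_score(row1, row2, match=3, substitution=-1, indel=-5):
    """Sum the per-column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel
        elif a == b:
            score += match
        else:
            score += substitution
    return score

score_1 = alignment_score("ACGCTGA", "A--CTGT")
score_2 = alignment_score("ACGCTGA", "ACTGT--")
print(score_1, score_2)
```

Under this scheme alignment 1 scores higher than alignment 2, because it has four matches rather than three for the same number of indels.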

Find the optimal score → the best guess for the true alignment
Find the optimal pairwise alignment of two sequences: insert gaps into one or both of them to maximize the total alignment score

Dynamic programming (DP) – Needleman and Wunsch (1970), Smith and Waterman (1981); this algorithm guarantees that we find all optimal alignments of two sequences of lengths m and n

BLAST builds on DP, with heuristics that improve speed

Prof. Waterman http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html



The score for alignment of i residues of sequence 1 against j residues of sequence 2 is given by

\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + c(i,j) \\
S(i-1,\,j) + c(i,-) \\
S(i,\,j-1) + c(-,j)
\end{cases}
\]

where c(i,j) = the score for alignment of residues i and j, which takes the value 3 for a match or -1 for a mismatch, and c(i,-) = c(-,j) = the penalty for aligning a residue with a gap, which takes the value -5

• The entry for S(1,1) is the maximum of the following three events:
– S(0,0) + c(A,A) = 0 + 3 = 3 [c(A,A) = c(1,1)]
– S(0,1) + c(A,-) = -5 + -5 = -10 [c(A,-) = c(1,-)]
– S(1,0) + c(-,A) = -5 + -5 = -10 [c(-,A) = c(-,1)]
• Similarly, one finds S(2,1) as the maximum of three values: (-5) - 1 = -6; 3 - 5 = -2; and (-10) - 5 = -15. The best entry is the addition of the C indel to the A-A match, for a score of -2.



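The alignment matrix of sequences 1 and 2

Rows: sequence 1 (ACGCTGA), i = 0 to 7. Columns: sequence 2 (ACTGT), j = 0 to 5.

          -    A    C    T    G    T
   -      0   -5  -10  -15  -20  -25
   A     -5    3   -2   -7  -12  -17
   C    -10   -2    6    1   -4   -9
   G    -15   -7    1    5    4   -1
   C    -20  -12   -4    0    4    3
   T    -25  -17   -9   -1   -1    7
   G    -30  -22  -14   -6    2    2
   A    -35  -27  -19  -11   -3    1

S(2,1) = max { S(1,0) + c(2,1), S(1,1) + c(2,-), S(2,0) + c(-,1) }
       = max { S(1,0) + c(C,A), S(1,1) + c(C,-), S(2,0) + c(-,A) }
       = max { -5 - 1, 3 - 5, -10 - 5 }
       = -2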


Traceback determines the actual alignment, starting from the (7,5) cell, where both sequences have been fully aligned.

[Figure: the alignment matrix of sequences 1 and 2 with the traceback path marked]

For example, the 1 in the (7,5) cell could only be reached by the addition of the mismatch A-T to the 2 in the (6,4) cell.

ACGCTGA
A--CTGT
or
ACGCTGA
AC--TGT

4 matches, 1 mismatch, 2 indels

Ambiguity – this concerns which C in seq. 1 aligns with the C in seq. 2
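The matrix fill and traceback can be sketched in a few lines. This is a minimal global-alignment sketch using the lecture's scores (match 3, mismatch -1, gap -5); it returns only one optimal alignment, and its tie-breaking order (diagonal, then gap in sequence 2, then gap in sequence 1) is an arbitrary choice, so the other optimal alignment is not reported.

```python
def needleman_wunsch(seq1, seq2, match=3, mismatch=-1, gap=-5):
    m, n = len(seq1), len(seq2)
    # S[i][j] = best score for the first i residues of seq1 vs the first j of seq2.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + c, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Traceback from the (m, n) cell back to (0, 0).
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        c = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + c:
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return S, "".join(reversed(row1)), "".join(reversed(row2))

S, aligned1, aligned2 = needleman_wunsch("ACGCTGA", "ACTGT")
print(S[7][5])
print(aligned1)
print(aligned2)
```

Passing a different `gap` value is enough to rerun the same sequences under another scoring scheme.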

Parameter settings – gap penalties
• Default settings are the easiest to use, but they do not necessarily yield the correct alignment
• Constant penalty: independent of the length of the gap, A
• Proportional penalty: proportional to the length L of the gap, BL (this is what we used in this lecture)
• Affine gap penalty: gap-opening penalty + gap-extension penalty = A + BL
• There is no rule for predicting the penalty that best suits the alignment
• Optimal penalties vary from sequence to sequence; it is a matter of trial and error
• Usually A > B, because opening a gap is penalized more than extending it (usually A/B ~ 10)
• Hints: (1) when comparing distantly related sequences, a high A and a very low B often give the best results, since gaps are penalized more for their existence than for their length; (2) when comparing closely related sequences, penalize both the existence and the extension of gaps
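The three penalty schemes differ only in how the cost grows with gap length L. The sketch below writes them as functions returning the size of the penalty, to be subtracted from the alignment score; the values A = 10 and B = 1 are illustrative, chosen to match the A/B ~ 10 rule of thumb.

```python
def constant_penalty(length, A):
    """Same cost for any gap, whatever its length."""
    return A if length > 0 else 0

def proportional_penalty(length, B):
    """Cost grows linearly with gap length: B * L."""
    return B * length

def affine_penalty(length, A, B):
    """Opening cost plus extension cost: A + B * L."""
    return A + B * length if length > 0 else 0

A, B = 10, 1  # illustrative values with A/B ~ 10
for L in (1, 2, 5):
    print(L, constant_penalty(L, A), proportional_penalty(L, B), affine_penalty(L, A, B))
```

With these values, one gap of length 5 costs 15 under the affine scheme while five separate gaps of length 1 cost 55, which is why a high A favors fewer, longer gaps.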


Exercise 2.2 Computing an optimal sequence alignment

Two scoring schemes
(1) Gap penalty = -5, mismatch = -1, match = 3
(2) Gap penalty = -1, mismatch = -1, match = 3

Scheme (1): first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-5) = 8
Scheme (2): first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-1) = 16
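The arithmetic for the two schemes can be checked directly. In this sketch the counts of matches, mismatches and gaps are taken from the exercise (5 matches and 2 mismatches for the ungapped alignment, 6 matches and 2 gaps for the gapped ones); the function name `score` is illustrative.

```python
def score(matches, mismatches, gaps, gap_penalty, match=3, mismatch=-1):
    return matches * match + mismatches * mismatch + gaps * gap_penalty

results = {}
for gap_penalty in (-5, -1):
    ungapped = score(5, 2, 0, gap_penalty)
    gapped = score(6, 0, 2, gap_penalty)
    results[gap_penalty] = "ungapped" if ungapped > gapped else "gapped"
    print(gap_penalty, ungapped, gapped, results[gap_penalty])
```

Changing only the gap penalty reverses which alignment is preferred.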

A more serious problem – the choice of gap penalty can lead to identifying the wrong alignment

[Figure: Exercise 2.2 alignment matrix (axis labels A G C G T A T and A C G G T A T) with gap penalty = -5. The boundary cells run from 0 to -35 in steps of 5, and the final cell holds the score 13]

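[Figure: Exercise 2.2 alignment matrix (axis labels A G C G T A T and A C G G T A T) with gap penalty = -1. The boundary cells run from 0 to -7 in steps of 1, and the final cell holds the score 16]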

Costs of genome sequencing
• Mid-2000s: $30-50 million to sequence a mammalian genome
• Target: $1000 per human genome by the year 2010
• J. Craig Venter Science Foundation: $500,000 award for the first person to achieve this goal

New technologies
1. Sequencing by hybridization (SBH) – detects whether or not an exact match is present in a sample of DNA
2. Mass spectrometric techniques – ionized fragments, time of flight
3. Nanopore sequencing strategies – ultrafast and relatively inexpensive sequencing of long DNA fragments
4. Single-molecule approaches – Solexa, VisiGen and Genovoxx
5. Single-molecule polony sequencing

Emerging Sequencing Methods

Figure 2.6 Single-molecule polony sequencing

Emerging Sequencing Methods
• Dilute solutions of DNA are plated onto a glass microscope slide.
• In situ PCR produces thousands of tiny colonies of DNA, into which single dye-labeled dNTPs are then incorporated. Polony – PCR colony.
• The slide is read after each cycle of incorporation of a new base, allowing short sequences to be determined.
• Each numbered polony produces a short 20-25 nucleotide sequence, as shown.
• These can then be assembled computationally into a contiguous sequence.

Figure 2.7 (Part 1) Hierarchical versus shotgun sequencing

Genome Sequencing
Whole genome sequences are assembled from ~10^5 fragments, each typically between 500 and 1000 bp in length.

Two general approaches for fragmentation and assembly: (1) hierarchical sequencing and (2) shotgun sequencing.
For a historical overview, see http://www.sciencemag.org/feature/plus/sfg/human/timeline1.shtml

(1) Hierarchical sequencing: first develop a low-resolution physical map, so that the sequence is obtained in large, ordered pieces.
(2) Shotgun sequencing: break the genome into small fragments and use computer algorithms to assemble them (see Figure 2.7).

Most new genome projects adopt the shotgun approach.

• Top-down, map-based or clone-by-clone strategy (from the late 1980s)
• The genome is broken into small fragments
• The relative locations of the fragments are known BEFORE sequencing
Advantages
(1) It fostered (helped develop) the assembly of high-resolution physical and genetic maps
(2) It allowed groups working around the globe to share the sequencing effort

Technology for cloning large fragments of genomes progressed rapidly throughout the 1990s, with projects such as E. coli, S. cerevisiae, C. elegans and A. thaliana.

Top-down sequencing: clone sequences as manageable units of fragments (50-200 kb in length)

Cloning vectors – BAC (~300 kb), PAC (~100 kb), phage-derived cosmids

Genome Sequencing – hierarchical sequencing

Figure 2.7 (Part 2) Shotgun sequencing

Genome Sequencing – Shotgun sequencing

In the shotgun approach, no attempt is made to order the clones in advance. Instead, the whole genome is assembled using computer algorithms that order contigs based on their overlapping sequences.

Figure 2.8 Cloning vectors used in genome sequencing

Cloning vectors used in genome sequencing

DNA libraries
• Fragments are produced by restriction enzyme (RE) digestion or sonication
• Fragments are ligated into a multiple cloning site (MCS) in the vector
• Aim for 5- to 10-fold redundancy, that is, a library whose total insert length is 5 to 10 times the genome size
• Each clone will have different ends, so it is possible to select a scaffold of clones that forms a contiguous sequence coverage – a tiling path
• Clones are ordered by aligning the regions of overlap (Fig. 2.9)
• The tiling path can be assembled using a combination of 3 methods: (1) hybridization, (2) fingerprinting, and (3) end-sequencing
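The redundancy target translates into a number of clones once the genome and insert sizes are fixed. The sketch below uses a hypothetical 100 Mb genome and 100 kb inserts; the estimate of the uncovered fraction additionally assumes that clone positions are random (a Poisson model), which the slides do not state.

```python
import math

def redundancy(n_clones, insert_length, genome_size):
    """Fold coverage: total cloned bases divided by genome size."""
    return n_clones * insert_length / genome_size

def clones_needed(fold, insert_length, genome_size):
    """Number of clones required to reach a given fold redundancy."""
    return math.ceil(fold * genome_size / insert_length)

def fraction_uncovered(fold):
    # Assumption: clone start points are random, so a base is missed
    # by every clone with probability exp(-fold).
    return math.exp(-fold)

genome = 100_000_000   # hypothetical 100 Mb genome
insert = 100_000       # hypothetical 100 kb inserts
for fold in (5, 10):
    print(fold, clones_needed(fold, insert, genome), fraction_uncovered(fold))
```

Under that model, going from 5-fold to 10-fold redundancy doubles the number of clones and shrinks the expected uncovered fraction from under 1% to well under 0.01%.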


Figure 2.9 Hierarchical assembly of a sequence-contig scaffold (supercontig)


• A minimal tiling path through a library of aligned BAC clones is chosen that ensures complete coverage of the chromosome.

• After independent shotgun libraries have been sequenced for each BAC, small gaps in the sequenced clone contigs remain.

• These are closed as far as possible by merging the two BAC sequences, as well as by adding mate-pair information (yellow) and cDNA structural information (red), which establishes the orientation of and distance between cloned segments.

Genome Sequencing – hierarchical sequencing

Hybridization
• All of the clones in a library that carry a particular sequence can be identified rapidly by hybridizing a small radioactively or chemically labeled probe containing the sequence to a filter on which an array of ~10,000 clones is printed (Fig. 2.10A)

Fingerprinting
• Study the restriction enzyme (RE) patterns
• Contigs of large-insert clones are assembled by comparing and aligning the clones according to their RE patterns
• An RE with a 6-bp recognition site cuts on average once every 4^6 = 2^12 ≈ 4000 bp
• For a 100 kb BAC: 100 kb / 4 kb ≈ 20-30 fragments
• These fragments can be separated by electrophoresis → fingerprint profile → BAC alignment by gel software → overlapping contigs → assembly of ~Mb-length contigs
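The fragment-count estimate follows from the length of the recognition site. The sketch below assumes equal base frequencies and independent positions, which is the simplification behind the 4^6 figure; the function names are illustrative.

```python
def mean_fragment_length(site_length, alphabet_size=4):
    # A given site of this length is expected once every alphabet_size ** site_length bases.
    return alphabet_size ** site_length

def expected_fragments(clone_length, site_length):
    return clone_length / mean_fragment_length(site_length)

print(mean_fragment_length(6))
print(expected_fragments(100_000, 6))
```

For a 100 kb BAC and a 6-bp cutter this gives about 24 fragments, within the 20-30 range quoted for a fingerprint.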

Figure 2.10 Aligning BAC clones by hybridization and fingerprinting


(A) A macroarray of BAC clones is probed with a short, radioactive fragment to identify all BACs that carry a specific fragment.

(B) These clones are digested with an RE, end-labeled, and separated by gel electrophoresis.

(C) Software converts the bands to a virtual profile, shown hypothetically for a small portion of four bands (highlighted box in part B). Shared bands (red or blue) imply that two clones share the same sequence. Green indicates the vector band common to all clones.

(D) The fingerprint profile is then converted into a BAC alignment. In this example, clone 2 does not share any bands with the others and so is placed into a separate BAC contig, while the other three clones form a tiling path.

End-sequencing
• Fills in the gaps after fingerprinting. How?
• Sequence both ends of the collection of BAC clones
• Once a critical threshold of sequences has been reached, the clones overlap
• For example, along a 10 Mbp genome, the end sequences of 10,000 BAC clones provide a sequence tag every 5 kb (for 5-fold coverage)
– 10 Mbp / 10,000 BACs → 1 kbp per BAC
– At five-fold coverage: 10 Mb / 2000 BACs ≈ 5 kb (the distance between sequence tags)
• Given this tag density, it is possible to close gaps < 50 kb
• Once the tiling path is chosen, shotgun the BAC clones into small fragments
• Subcloning: use an M13 phagemid (~1 kb; exists as dsDNA and ssDNA)
• or clone 2-3 kb fragments into a plasmid vector



• Use computer algorithms to assemble the sequences (~100,000)
• About 5- to 10-fold redundancy for each fragment
• Library – from a single whole genome
• After MSA: screen out repetitive sequences → overlap reads of the same sequence → generate unitigs and scaffolds → >90% of the sequences are assembled
• Finishing phase – closing gaps and cleaning up ambiguities can take as much time as the shotgun phase
• Users are asked to trust the assemblies

Celera Genomics used the following software to assemble the sequences:
• Screener – masks (does not remove) sequences that contain repetitive DNA (such as microsatellites, LINEs, Alu repeats, retrotransposons and ribosomal DNA)
• Overlapper – compares every unscreened read against every other unscreened read, searching for overlaps of a predetermined length and identity
– Parallel processing on 40 supercomputers, each with 4 GB RAM, allowed the 27 M screened human sequence reads to be overlapped in < 5 days
• Unitigger – resolves repeat-induced overlaps of a sequence (see Figure 2.11)
• Scaffolder – uses mate-pair information to link U-unitigs into scaffold contigs


Figure 2.11 U-unitigs and repeat resolution

Figure 2.11
(A) Sequence alignment between two or more shotgun clones can arise between unique sequences (left) or repetitive sequences (right).

(B) The Overlapper aligns unitigs, which are identified as unique sequence alignments (U-unitigs) or overcollapsed repeats (blue).

Two contigs can be aligned and oriented by using mate-pair sequence information from the ends of longer (10- or 50-kb) clones, as shown at the bottom, while mate pairs from 2-kb fragments allow assembly of scaffolds despite the presence of simple repeats such as microsatellites (blue) that are masked before performing alignments.


Figure 2.12 Proportion of fly and human genomes in large scaffolds

Figure 2.12 shows the estimated coverage of the fly and human whole genomes after initial assembly: in both cases, 84% or more of the genome was covered by scaffolds at least 100 kb in length, while most scaffolds were in the Mb range. Increasing sequence coverage from 5x to 10x gave a 10% increase in the proportion of scaffolds of lengths up to 1 Mb.

The plot shows the percentage of scaffolds that have a length greater than that indicated for the fly 10x, human 8x (CSA) and human 5x (whole genome assembly, WGA) sequences generated by Celera. The fly and CSA assemblies include shredded sequences generated from BAC clones by public genome sequencing efforts.

NCTS http://math.cts.nthu.edu.tw/Mathematics/conference-PT2005.html

UCSD

http://research.calit2.net/recomb-workshop05/
