Genome Sequencing & DNA Sequence Analysis

7.91 / 7.36 / BE.490 Lecture #1

Feb. 24, 2004

Genome Sequencing & DNA Sequence Analysis

Chris Burge

What is a Genome?

A genome is NOT a bag of proteins

What’s in the Human Genome?

Outline of Unit II: DNA/RNA Sequence Analysis

Reading*

3/2 M Ch. 4

3/4 M Ch. 4

3/9 DNA Sequence Evolution

3/11 RNA Structure Prediction & Applications M Ch. 5

3/16 Literature Discussion TBA

Genome Sequencing & DNA Sequence Analysis M Ch. 3

DNA Sequence Comparison & Alignment M Ch. 7

DNA Motif Modeling & Discovery

Markov and Hidden Markov Models for DNA

M Ch. 6

* M = Mount, “Bioinformatics: Sequence and Genome Analysis”

Feedback to Instructor

Examples from past years:

• Comic font looks stupid

• Burge uses too much genomics jargon

• Better synergy between Yaffe/Burge sections

• Asks questions to the class, student answers, but I didn’t hear/understand the answer…

DNA vs Protein Sequence Analysis

Protein Sequence Analysis

- emphasis on chemistry
- protein structure
- selection is everywhere
- multiple alignment
- comparative proteomics
- data: O(10^8) aa

DNA Sequence Analysis

- emphasis on regulation
- RNA structure
- signal vs noise (statistics)
- motif finding
- comparative genomics
- data: O(10^10) nt

Read your probability/statistics primer!

Genome Sequencing & DNA Sequence Analysis

• The Language of Genomics

• Shotgun Sequencing

• DNA Sequence Alignment I

• Comparative Genomics Examples

- Progress: genomes, transcriptomes, etc.

- How to choose a mismatch penalty

- PipMaker, Phylogenetic Shadowing

Recent Media Attention

Genomespeak

Bork, Peer, and Richard Copley. "Genome Speak." Nature 409 (15 February 2001): 815.

Learn to speak genomic

In the following article, note the use of the following genomic terms: euchromatic, whole-genome shotgun sequencing, sequence reads, 5.11-fold coverage, plasmid clones, whole-genome assembly, regional chromosome assembly.

Venter, JC, MD Adams, EW Myers, PW Li, RJ Mural, GG Sutton, HO Smith, … "The Sequence of The Human Genome." Science 291, no. 5507 (16 February 2001): 1304-51.

Types of Nucleotides

• ribonucleotides

• deoxyribonucleotides

• dideoxyribonucleotides

DNA Sequencing

Adapted from Fig. 4.2 of “Genomes” by T. A. Brown, John Wiley & Sons, NY, 1999

Shotgun Sequencing a BAC or a Genome

200 kb (NIH) or 3 Gb (Celera)

Sequence, Assemble

Sonicate, Subclone

Subclones

Shotgun Contigs

What would cause problems with assembly?

Shotgun Coverage (Poisson distribution)

Sequence N reads, 500 bp each, from a 200 kb BAC

Coverage/read: p = 500/200,000 = 0.0025

Total coverage: c = Np

Y = no. of reads covering the point x

$P(Y=k) = \frac{N!}{(N-k)!\,k!}\, p^k (1-p)^{N-k} \approx \frac{e^{-c} c^k}{k!}$

$P(Y=0) = e^{-c}$

Examples: $e^{-2} \approx 0.14$, $e^{-4} \approx 0.02$

What could cause reality to differ from theory?
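The coverage arithmetic on this slide can be checked numerically. The following is a minimal sketch, not part of the original lecture; the function names are hypothetical, and the model assumes reads land uniformly at random and ignores edge effects at the ends of the BAC.

```python
import math

READ_LEN = 500        # bp per read (from the slide)
TARGET_LEN = 200_000  # bp in the BAC (from the slide)


def coverage(n_reads, read_len=READ_LEN, target_len=TARGET_LEN):
    """Total coverage c = N * p, where p = read_len / target_len."""
    return n_reads * read_len / target_len


def binomial_pmf(k, n_reads, p):
    """Exact probability that k of n_reads reads cover a given point."""
    return math.comb(n_reads, k) * p**k * (1 - p) ** (n_reads - k)


def poisson_pmf(k, c):
    """Poisson approximation e^{-c} c^k / k! to the binomial above."""
    return math.exp(-c) * c**k / math.factorial(k)


def reads_for_uncovered_fraction(fraction, read_len=READ_LEN, target_len=TARGET_LEN):
    """Smallest N whose expected uncovered fraction e^{-c} is <= fraction."""
    c = -math.log(fraction)
    return math.ceil(c * target_len / read_len)


if __name__ == "__main__":
    p = READ_LEN / TARGET_LEN
    for n_reads in (800, 1600):
        c = coverage(n_reads)
        print(n_reads, c, binomial_pmf(0, n_reads, p), poisson_pmf(0, c))
```

At 2-fold and 4-fold coverage the exact binomial and the Poisson approximation agree closely, which is why the slide can use $e^{-c}$ directly.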

Clickable Genomes

Eukaryotes: S. cerevisiae, S. pombe, C. elegans, Drosophila, Anopheles, Ciona, Arabidopsis, Human, Mouse, Tetraodon, Fugu, Zebrafish, Neurospora, Aspergillus, …

Protists: Plasmodium, Giardia, … (several)

Eubacteria: E. coli, B. subtilis, S. aureus, … (>100)

Archaea: Methanococcus, Sulfolobus, … (total of ~16)

Phages/Viruses: Lots

Organelles: Lots

Large-scale Transcript Sequencing

Please see the following example article that uses large-scale transcript sequencing.

Okazaki, Y, M Furuno, T Kasukawa, J Adachi, H Bono, S Kondo, … "Analysis of The Mouse Transcriptome Based On Functional Annotation of 60,770 Full-length cDNAs." Nature 420, no. 6915 (5 December 2002): 563-73.

EST Sequencing

dbEST release 022004

No. of public entries: 20,039,613

Summary by Organism - as of February 20, 2004

Homo sapiens (human) 5,472,005
Mus musculus + domesticus (mouse) 4,055,481
Rattus sp. (rat) 583,841
Triticum aestivum (wheat) 549,926
Ciona intestinalis 492,511
Gallus gallus (chicken) 460,385
Danio rerio (zebrafish) 450,652
Zea mays (maize) 391,417
Xenopus laevis (African clawed frog) 359,901
Hordeum vulgare + subsp. vulgare (barley) 352,924

Source: NCBI - http://ncbi.nlm.nih.gov

*-omes and -omics

Proteome: Mass spec, Y2H, ?

Variome: SNPs, haplotypes

Transcriptome: ESTs, cDNAs, microarrays

Genome: Genome sequences

Ribonome?

Glycome ???

*Warning: some of the words on this slide may not be in Webster’s dictionary

DNA Sequence Alignment I

How does DNA alignment differ from protein alignment?

Use BLASTN instead of BLASTP (nucleotide-nucleotide BLAST web server)

    Query:     1 ttgacctagatgagatgtcgttcacttttactgagctacagaaaa 45
                 |||| |||||||||||| | |||||||||||||||||||||||||
    Subject: 403 ttgatctagatgagatgccattcacttttactgagctacagaaaa 447

DNA Sequence Alignment II

Translating searches: translate in all possible reading frames, then search the peptides against a protein database (BLASTP)

    ttgacctagatgagatgtcgttcacttttactgagctacagaaaa

    ttg|acc|tag|atg|aga|tgt|cgt|tca|ctt|tta|ctg|agc|tac|aga|aaa
     L   T   x   M   R   C   R   S   L   L   L   S   Y   R   K

    t|tga|cct|aga|tga|gat|gtc|gtt|cac|ttt|tac|tga|gct|aca|gaa|aa
       x   P   R   x   D   V   V   H   F   Y   x   A   T   E

    tt|gac|cta|gat|gag|atg|tcg|ttc|act|ttt|act|gag|cta|cag|aaa|a
        D   L   D   E   M   S   F   T   F   T   E   L   Q   K

(x = stop codon)

Also consider reading frames on complementary DNA strand
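As an illustration of a translating search's first step, here is a minimal sketch of six-frame translation using the standard genetic code. It is not the BLAST implementation; the function names are hypothetical, stop codons are written as `x` to match the slide, and the input is assumed to contain only A, C, G and T.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; stop codons shown as "x" as on the slide.
AMINO_ACIDS = "FFLLSSSSYYxxCCxWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate dna starting at offset frame (0, 1 or 2); drop a trailing partial codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frame_translation(dna):
    """Map (strand, frame number) to the peptide for all six reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            frames[(strand, frame + 1)] = translate(seq, frame)
    return frames


if __name__ == "__main__":
    query = "ttgacctagatgagatgtcgttcacttttactgagctacagaaaa"
    for key, peptide in six_frame_translation(query).items():
        print(key, peptide)
```

The three forward frames reproduce the translations shown on the slide; the three reverse frames cover the complementary strand.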

DNA Sequence Alignment III

Common flavors of BLAST:

    Program    Query          Database
    BLASTP     aa             aa
    BLASTN     nt             nt
    BLASTX     nt (⇒ aa)      aa
    TBLASTN    aa             nt (⇒ aa)
    TBLASTX    nt (⇒ aa)      nt (⇒ aa)
    PsiBLAST   aa (aa msa)    aa

Which would be best for searching ESTs against a genome?

DNA Sequence Alignment IV

Which alignments are significant?

Identify high scoring segments whose score S exceeds a cutoff x using dynamic programming.

Scores follow an extreme value distribution:

$P(S > x) = 1 - \exp\left[-Kmn\, e^{-\lambda x}\right]$

for sequences of length m, n, where K, λ depend on the score matrix and the composition of the sequences being compared

(Same theory as for protein sequence alignments)
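The formula above can be evaluated directly. This is a minimal sketch with hypothetical function names; the values of K and λ used in the example are placeholders chosen only for illustration, not tabulated values for any real scoring matrix.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance segments scoring at least `score`: K*m*n*e^{-lam*score}."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """P(S > score) = 1 - exp(-E), using expm1 to stay accurate when E is tiny."""
    return -math.expm1(-e_value(score, m, n, K, lam))


if __name__ == "__main__":
    # Assumed, illustrative parameters: a 45 nt query against a 3 Gb database.
    K, lam = 0.3, 1.1
    for s in (10, 20, 39):
        print(s, e_value(s, 45, 3e9, K, lam), p_value(s, 45, 3e9, K, lam))
```

When the expected count E is small, P and E are nearly equal, which is why the simpler expression $Kmn\,e^{-\lambda x}$ is often quoted in place of the full formula.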

      1 ttgacctagatgagatgtcgttcacttttactgagctacagaaaa 45
        |||| |||||||||||| | |||||||||||||||||||||||||
    403 ttgatctagatgagatgccattcacttttactgagctacagaaaa 447

Notes (cont.), from M. Yaffe Lecture #2

• The random sequence alignment scores would give rise to an “extreme value” distribution – like a skewed gaussian.

• Called Gumbel extreme value distribution

For a normal distribution with mean m and standard deviation σ (variance σ²), the height of the curve is described by $Y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right]$

For an extreme value distribution, the height of the curve is described by $Y = \exp\left[-x - e^{-x}\right]$ …and $P(S > x) = 1 - \exp\left[-e^{-\lambda(x-u)}\right]$, where $u = (\ln Kmn)/\lambda$

Can show that the mean extreme score is ~ $\log_2(nm)$, and the probability of getting a score that exceeds some number of “standard deviations” x is: $P(S > x) \sim Kmn\, e^{-\lambda x}$. *** K and λ are tabulated for different matrices ***

For the less statistically inclined: $E \sim Kmn\, e^{-\lambda S}$

Figure: Probability values for the extreme value distribution (A) and the normal distribution (B). The area under each curve is 1.

DNA Sequence Alignment V

How is λ related to the score matrix?

λ is the unique positive solution to the equation*:

$\sum_{i,j} p_i\, p_j\, e^{\lambda s_{ij}} = 1$

$p_i$ = frequency of nt i, $s_{ij}$ = score for aligning an i,j pair

What kind of an equation is this? (transcendental)

What would happen to λ if we doubled all the scores? (reduced by half)

What does this tell us about the nature of λ? (scaling factor)

*Karlin & Altschul, 1990
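Because the equation for λ is transcendental, it is normally solved numerically. The following is a minimal sketch using bisection; the function names are hypothetical, and the example assumes uniform base frequencies and a +1/-1 match-mismatch scheme purely for illustration.

```python
import math
from itertools import product


def solve_lambda(freqs, score, iterations=200):
    """Positive root of sum_ij p_i p_j exp(lam * s_ij) = 1, found by bisection.

    freqs maps each nucleotide to its frequency; score(i, j) returns s_ij.
    Requires a negative expected score and at least one positive score.
    """
    pairs = list(product(freqs, repeat=2))
    expected = sum(freqs[i] * freqs[j] * score(i, j) for i, j in pairs)
    if expected >= 0 or max(score(i, j) for i, j in pairs) <= 0:
        raise ValueError("need negative expected score and some positive score")

    def f(lam):
        return sum(freqs[i] * freqs[j] * math.exp(lam * score(i, j)) for i, j in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0:      # f < 0 just above 0, so grow hi until f > 0
        hi *= 2
    lo = 0.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def match_mismatch(match, mismatch):
    """Return a scoring function for a simple match-mismatch scheme."""
    return lambda i, j: match if i == j else mismatch


if __name__ == "__main__":
    uniform = {nt: 0.25 for nt in "ACGT"}
    print(solve_lambda(uniform, match_mismatch(1, -1)))   # equals ln 3 for this case
    print(solve_lambda(uniform, match_mismatch(2, -2)))   # doubled scores
```

For the +1/-1 case with uniform frequencies the equation reduces to a quadratic in $e^{\lambda}$ with positive root λ = ln 3, which gives a convenient check on the solver. Doubling every score halves λ, consistent with its role as a scaling factor.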

DNA Sequence Alignment VI

What scoring matrix to use for DNA?

Usually use simple match-mismatch matrices:

    s_{i,j}:   i \ j   A   C   G   T
               A       1   m   m   m
               C       m   1   m   m
               G       m   m   1   m
               T       m   m   m   1

m = “mismatch penalty” (must be negative)
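To make the matrix concrete, here is a minimal sketch that builds a match-mismatch matrix and scores an ungapped alignment. The function names are hypothetical, and the two sequences are the 45 nt example alignment shown earlier in this lecture.

```python
def match_mismatch_matrix(mismatch, match=1, alphabet="ACGT"):
    """Build s_ij as a dict: `match` on the diagonal, `mismatch` elsewhere."""
    if mismatch >= 0:
        raise ValueError("mismatch penalty must be negative")
    return {(a, b): (match if a == b else mismatch) for a in alphabet for b in alphabet}


def ungapped_score(seq1, seq2, matrix):
    """Sum s_ij over aligned positions of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(matrix[(a, b)] for a, b in zip(seq1.upper(), seq2.upper()))


if __name__ == "__main__":
    query = "ttgacctagatgagatgtcgttcacttttactgagctacagaaaa"
    subject = "ttgatctagatgagatgccattcacttttactgagctacagaaaa"
    for m in (-1, -3):
        print(m, ungapped_score(query, subject, match_mismatch_matrix(m)))
```

The example alignment has 42 matches and 3 mismatches, so a harsher mismatch penalty lowers its score while leaving a perfect match unaffected.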

DNA Sequence Alignment VII

How to choose the mismatch penalty?

Use theory of High Scoring Segment composition*

High scoring alignments will have composition:

$q_{ij} = p_i\, p_j\, e^{\lambda s_{ij}}$

where $q_{ij}$ = frequency of i,j pairs (“target frequencies”), and $p_i$, $p_j$ = freq of i, j bases in sequences being compared

What would happen to the target frequencies if we doubled all of the scores?

*Karlin & Altschul, 1990
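One way to explore the choice of mismatch penalty is to compute the fraction of identical pairs implied by the target frequencies. The sketch below is illustrative only: the function names are hypothetical, and it assumes uniform base frequencies (each $p_i$ = 0.25), under which the equation for λ becomes $0.25\,e^{\lambda \cdot \mathrm{match}} + 0.75\,e^{\lambda m} = 1$.

```python
import math


def solve_lambda_uniform(mismatch, match=1.0, iterations=200):
    """Positive root of 0.25*exp(lam*match) + 0.75*exp(lam*mismatch) = 1 (bisection)."""
    if 0.25 * match + 0.75 * mismatch >= 0 or match <= 0:
        raise ValueError("need a positive match score and negative expected score")

    def f(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    hi = 1.0
    while f(hi) <= 0:
        hi *= 2
    lo = 0.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def target_identity(mismatch, match=1.0):
    """Sum of the four diagonal target frequencies q_ii = (1/16) * exp(lam * match)."""
    lam = solve_lambda_uniform(mismatch, match)
    return 0.25 * math.exp(lam * match)


if __name__ == "__main__":
    for m in (-1, -2, -3):
        print(m, target_identity(m))
```

Under these assumptions a penalty of m = -1 targets alignments that are 75% identical, while harsher penalties target progressively more similar sequences.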
