45
Methods and Algorithms for Gene Prediction Chaochun Wei 韦朝春 Sc.D. [email protected] http://cbb.sjtu.edu.cn/~ccwei Shanghai Jiao Tong University Shanghai Center for Bioinformation Technology 5/12/2011 K-J-C Bioinformatics Training Course 1

Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Methods and Algorithms

for Gene Prediction

Chaochun Wei 韦朝春 Sc.D.

[email protected]

http://cbb.sjtu.edu.cn/~ccwei

Shanghai Jiao Tong UniversityShanghai Center for Bioinformation Technology

5/12/2011 K-J-C Bioinformatics Training Course1

Page 2: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Outline

1. Introduction

2. Gene prediction methods Gene prediction methods

HMM

TWINSCAN and N-SCAN

Using ESTs for gene prediction

Resources

Latest progress

3. Gene Prediction FAQs

2

Page 3: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

1. Introduction: DNA

DNA contains genetic

information.

DNA can be

expressed as a

sequence of letters

A,C,G and T.

Eg: ACGTTTCGAGGT

3

Page 4: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

DNARNAProtein

4

& processing

Page 5: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

RNA Processing

5

5’Primary

mRNA

Page 6: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Introduction: Gene Structure

6

5’UTR

Initial

Exon

Internal

Exon

Terminal

Exon

3’UTR

Intron

ATG TAA

TAG

TGA

Intron Intron

GT AG

A gene is a highly structured region of DNA, it is a functional

unit of inheritance.

Page 7: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

7

Josep F. Abril et al. Genome Res. 2005; 15: 111-119

Patterns in Splice Sites

Donor Sites Acceptor Sites

Sequence data from RefSeq of human, mouse, rat and chicken.

Page 8: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

A Typical Human Gene

Structure

8

Page 9: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Genes in a Genome

9

Page 10: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

In a Mammalian Genome

Finding all the genes is hard

Mammalian genomes are large

8,000 km of 10pt type

Only about 1% protein coding

10

Page 11: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

DNA, mRNA, cDNA and EST

EST: Short (~650bs)

High error rate (~1-5%)

Contains only UTRs or coding regions11

5’EST 3’EST

Full-length cDNA

5’EST 3’EST

5’ Truncated cDNA

cDNA

mRNA

Reverse Transcription

5’ 3’

3’5’ 5’ 3’

Primary

mRNARNA Processing

5’ 3’

DNA5’

Transcription

3’Coding RegionUTR UTR

Page 12: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

The Challenge and Opportunity ~3000 genomes

222 animals

93 plants

12

Better

Gene Structure

Annotation

Numbers from http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

0

1000

2000

3000

Genome numbers

Page 13: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

2. Gene Prediction Method History

Generation

Date Feature Systems Information or Methods used

1st Early 1980s

Approximate ends of protein coding regions and non-coding regions

TestCode, Fickett 1982

GRAIL, Uberbacher and Mural 1991

splice sites

promoters

Codon usage bias

Neuro-network methods

2nd Early 1990s

A complete single gene in a short sequence

GeneID,Guigo et al. 1992

GeneParser, Snyder and Stormo 1993

FGENSH, Solovyev et al. 1994

+

Translation start sites

Stop sites

Method:HMM

3rd Mid-

1990s

Multiple complete or partial genes in a long sequence

Genscan, Burge and Karlin, 1997

UTR

Method: Generalized HMM

4th 2000s Complete gene structures in whole genomes

Twinscan, Korf, et al.2001

N-SCAN, Brown,et al. 2006

Twinscan_EST, N-SCAN_EST, Wei and Brent, 2006

Multiple-genomes

Transcript products

Method:Generalized HMM

Bayesian approach 13

Page 14: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene Prediction Methods (1)

Categorization: by input information1. Ab initio methods Only need genomic sequences as input

GENSCAN (Burge 1997; Burge and Karlin 1997)

GeneFinder (Green, unpublished)

Fgenesh (Solovyev and Salamov 1997)

Can predict novel genes

2. Transcript-alignment-based methods Use cDNA, mRNA or Protein similarity as major clues

ENSEMBL (Birney, Clamp, et. al. 2004)

Highly accurate

Can only find genes with transcript evidences cDNA coverage 50-60%

+ EST coverage up to 85%14

Page 15: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene Prediction Methods (2)

Categorization: by input information

3. Hybrid Methods

Integrate cDNA, mRNA, protein and EST alignments

into ab initio methods

Genie (Kulp, Haussler et al. 1996)

Fgenesh+ (Solovyev and Salamov 1997)

Genomescan (Yeh, Lim et al. 2001)

GAZE (Howe, Chothiea et al. 2002)

AUGUSTUS+ (Stanke, Schoffmann et al. 2006)

15

Page 16: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene Prediction Methods(3)

Comparative-Genomics-Based Methods

TWINSCAN and N-SCAN

De novo

Assumption:

Coding regions are more conserved.

No transcript similarity information (like EST, cDNA,

mRNA, or protein) is used

TWINSCAN-EST and N-SCAN_EST

Hybrid

Use EST to improve prediction accuracy

16

Page 17: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

17

TWINSCAN: A Novel Gene Prediction

System Using Dual Genomes

Conservation sequences represent the

conservation patterns between two genomes

No transcript similarity information (like EST,

cDNA, mRNA, and protein) is used

Page 18: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Hidden Markov Model:

Model behind gene predictors

),(maxarg

xP

18

HMM for two biased coins flipping

1 2

7.0)(,3.0)(,2.0)(,8.0)( 2211 TeHeTeHe

TTHHTTHTTTTTHTHHHHHTHTH Observed sequence x

11221111122221111112222 Hidden state sequence

0.9

0.1

0.7

0.2

End0.1

TTHHTTHTTTTTHTHHHHHTHTH Observed sequence x

Page 19: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Most Probable Path and Viterbi

Algorithm

).)1((maxarg)(

);)1((max)()(

klkk

i

klkk

ill

aiflptr

aifxeif

)),,...,,,...,(Pr(max)( 1010},...,{ 10

lxxif iiili

)( 2LNO )(NLO19

1

2sta

tes

max

Recursion (i=1…L)

Let

Time complexity space complexity

0 1 2 i-1 i L-1 L

Page 20: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Probability of All the Possible Paths

and Forward Algorithm

k

k LfxPxP )(),()(

k

klkill aifxeif ))1(()()(

),,...,Pr()( 10 lxxif iil

20

sta

tes

Recursion (i=1…L)

Let

1

2+

0 1 2 i-1 i L-1 L

Probability of all the

probable paths

Page 21: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Posterior Probability and Forward

and Backward Algorithm

)(

),()|(

xP

xkPxkP i

i

21

sta

tes

Posterior Probability

1

2+

0 1 2 i-1 i L-1 L

Page 22: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Posterior Probability and Forward

and Backward Algorithm

l

lll bxeaxPxP )1()(),()( 10

l

lilklk ibxeaib )1()()( 1

)(

)()(

)(

),()|(

xP

ibif

xP

xkPxkP kki

i

22

sta

tes

Recursion (i=L-1,…,1)

Posterior Probability

)|,...,Pr()( 1 kxxLb iLik Let

1

2+

0 1 2 i-1 i L-1 L

Page 23: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

TWINSCAN Model

Generalized HMM

Each feature in a gene

structure corresponds to

one state.

State-specific length

models.

State-specific sequence

models

Use Conservation

information

23

Page 24: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Conservation Sequence

Pair each nucleotide of the target with

“|” if it is aligned and identical

“:” if it is aligned to mismatch

“.” if it is unaligned

24

CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC

Generated by projecting local alignments to the target

sequence

human

mouse

||||||. . . . . . . . .|:|||||||||::||:|| . . . ||:

CTAGAG---------AGACAGGTACCATAGGGCTCTCCT

Page 25: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

N-SCAN: A Novel Gene Prediction

System Using Multiple Genomes

Uses Bayesian

model to include

phylogenetic tree

information

Predicts introns in

5’UTR

Has Conserved

non-coding regions

25

(Brown, Gross and Brent, Genome Res. 2005)

Page 26: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Using ESTs for Gene Prediction:

TWINSCAN_EST

Integrating EST alignment information into TWINSCAN to

improve its accuracy where EST evidence exits and not to

compromise its ability to predict novel genes.

26

Page 27: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Sequence Representation of EST

Alignments

27

1. Use EST-to-genome alignment programs

• BLAT (Kent 2002)

2. Project the top alignment for each EST to the target

genomic sequence

EST

Alignments

ESTseq

EEE III NNN NNNNNN NNNEEE EEEIII

Page 28: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Accuracy Measurement

Annotated data sets for training/testing

RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/)

CCDS (http://www.ncbi.nlm.nih.gov/CCDS/)

Accuracy in different levels

Nucleotide level

Exon level

Gene level

Transcript level

Sensitivity and specificity28

Page 29: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Accuracy Measurement (continue)

AnnotationTotal

edictionCorrectySensitivit

_

Pr_

edictionTotal

edictionCorrectySpecificit

Pr_

Pr_

29

Annotation Prediction

Correct

Prediction

Page 30: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

TWINSCAN_EST and N-SCAN_EST

on the Whole Human Genome

24

14

67

56

34

17

81

58

38

22

85

59

44

23

88

60

0

10

20

30

40

50

60

70

80

90

100

exact gene

sensitivity

exact gene

specificity

exact exon

sensitivity

exact exon

specificity

perc

ent

TWINSCAN2.03

TWINSCAN_EST

N-SCAN

N-SCAN_EST

30

Page 31: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

An Example of N-SCAN_EST

Prediction

31(Hg17, chr21:33,459,500-33,465,411)

Page 32: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

An Example of N-SCAN_EST Prediction

32

Page 33: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Experimental Validation of Predictions

33

Siepel, Genome Research, 2007

Page 34: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Experimental Validation of Predictions

See

The MGC Project Team, “The Completion of the

Mammalian Gene Collection (MGC)”, Genome Research,

2009, 19:2324-2333

Wei, C., et al., “Closing in on the C.elegans ORFeome by

Cloning TWINSCAN predictions”, Genome Research,

2005, 15:577-582.

Tenney, A. E. et al., “Gene prediction and verification in a

compact genome with numerous small introns”, Genome

Research, 2004

34

Page 35: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

N-SCAN/TWINSCAN Webserver:

http://mblab.wustl.edu/nscan/submit

35

Page 36: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Resources for Gene Prediction

Sequence data sets

Nucleotide Sequences (NCBI)

dbEST

mRNA

cDNA

Annotations

RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/)

CCDS (http://www.ncbi.nlm.nih.gov/CCDS/)

Genome Browser

UCSC Genome Browser(http://genome.ucsc.edu/)

36

Page 37: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Latest Progress in Gene Prediction

New Methods

Conrad: gene prediction using conditional random fields.

Decaprio et al., Genome Res. 2007 Sep;17(9):1389-98. Not working for vertebrate genomes

SVM for splice site

Sonneburg et al., BMC Bioinformatics. 2007;8 Suppl 10:S7.

CONTRAST: Gross et al., Genome Biology 2007, 8:R269

Best de novo gene predictor for human (gene level accuracy ~50%)

Used SVM and conditional random field

37

Page 38: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

3. Gene Prediction FAQs (from Ian Korf)

Algorithms vs. experts

Q: are expert biologists better than computer programs?

A: Yes and no.

Next-generation sequencing

Q: Will next-gen transcript sequencing replace gene

prediction?

A: No. Rare transcripts may require directed experiments

to validate.

Prediction accuracy

Q: Why are gene prediction programs inaccurate?

A: We don’t always know why. 38

Page 39: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene Prediction FAQs (continue)

Genes in my favorate genome…

Q: There is no gene predictor for it, what should I do?

39

Page 40: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene prediction for algal genomes

40

Macroalgae

Microalgae

Page 41: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

A Typical Alga Gene Structure

41

A Typical Human Gene Structure

Page 42: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene Prediction FAQs (continue)

Genes in my favorate genome…

Q: There is no gene predictor for it, what should I do?

A: Training a gene predictor or use one that is for

another organism that is close to this genome. But it

may be inaccurate.

Difficult genes

Q: why some genes are not predicted by any program?

A: They are statistical outliers.

42

Page 43: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene Prediction FAQs (continue)

Just coding exons…

Q: why other parts are not predicted, such as non-

coding exons, alternative isoforms, non-canonical

splice sites, gene within genes?

A: There are trade offs.

Pseudogenes

Q: why do some gene predictions have tiny introns?

A: Retro-pseudo genes often have very strong coding

signals, because they are derived from highly

expressed genes.43

Page 44: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Gene Prediction FAQs (end)

How can I tell a good gene prediction from a

bad one?

Scores have been assigned to every exon

and intron of a gene. People can tell if a gene

prediction is good or not by the scores of

exons and introns of this gene.

You may have to run the program on your

own computer to figure them out!

44

Page 45: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input

Acknowledgement

Training program organizers

Former lab members (WashU) Dr. Michael Brent

Dr. Paul Flicek

Dr. Ian Korf

Samuel Gross

45