1 BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction 10/22/07 BCB 444/544 Lecture 26 Gene Prediction #26_Oct22

BCB 444/544



Page 1: BCB 444/544

1BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction 10/22/07

BCB 444/544

Lecture 26

Gene Prediction

#26_Oct22

Page 2: BCB 444/544


Mon Oct 22 - Lecture 26

Gene Prediction • Chp 8 - pp 97 - 112

Wed Oct 24 - Lecture 27 (will not be covered on Exam 2)

Regulatory Element Prediction

• Chp 9 - pp 113 - 126

Thurs Oct 25 - Review Session & Project Planning

Fri Oct 26 - EXAM 2

Required Reading (before lecture)

Page 3: BCB 444/544


Assignments & Announcements

Sun Oct 21 - Study Guide for Exam 2 was posted

Mon Oct 22 - HW#4 Due (no "correct" answer to post)

Thu Oct 25 - Lab = Optional Review Session for Exam; 544 Project Planning/Consult with DD & MT

Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading:
  Chps 6 (beginning with HMMs), 7-8, 12-16
  Eddy: What is an HMM
  Ginalski: Practical Lessons…

Page 4: BCB 444/544


BCB 544 "Team" Projects

• 544 Extra HW#2 is next step in Team Projects
  • Write ~ 1 page outline
  • Schedule meeting with Michael & Drena to discuss topic
  • Read a few papers
  • Write a more detailed plan

• You may work alone if you prefer

• Last week of classes will be devoted to Projects
  • Written reports due: Mon Dec 3 (no class that day)

• Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
  • 1 or 2 teams will present during each class period

See Guidelines for Projects posted online

Page 5: BCB 444/544


BCB 544 Only: New Homework Assignment

544 Extra#2 (posted online Thurs?) No - sorry! sent by email on Sat…

Due: PART 1 - ASAP

PART 2 - Fri Nov 2 by 5 PM

Part 1 - Brief outline of Project, email to Drena & Michael

after response/approval, then:

Part 2 - More detailed outline of project

Read a few papers and summarize status of problem

Schedule meeting with Drena & Michael to discuss ideas

Page 6: BCB 444/544


Seminars this Week

BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html

• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB

• Dave Segal UC Davis Zinc Finger Protein Design

• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI

• Guang Song ComS, ISU Probing functional mechanisms by structure-based modeling and simulations

Page 7: BCB 444/544


Chp 16 - RNA Structure Prediction

SECTION V STRUCTURAL BIOINFORMATICS

Xiong: Chp 16 RNA Structure Prediction (Terribilini)

• RNA Function
• Types of RNA Structures
• RNA Secondary Structure Prediction Methods
• Ab Initio Approach
• Comparative Approach
• Performance Evaluation

Page 8: BCB 444/544

Fig 6.2, Baxevanis & Ouellette 2005

Covalent & non-covalent bonds in RNA

Primary: Covalent bonds

Secondary/Tertiary Non-covalent bonds

• H-bonds (base-pairing)
• Base stacking

This is a new slide

Page 9: BCB 444/544


RNA Pseudoknots & Tetraloops

http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif

This is a new slide

http://www.lbl.gov/Science-Articles/Research-Review/Annual-Reports/1995/images/rna.gif

• Often have important regulatory or catalytic functions

Pseudoknot Tetraloop

Page 10: BCB 444/544


Base Pairing in RNA

G-C, A-U, G-U ("wobble") & many variants

See: IMB Image Library of Biological Molecules
http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs

This slide has been changed

Page 11: BCB 444/544


RNA Secondary Structure Prediction Methods

Two (three, recently) main types of methods:

1. Ab initio - based on calculating most energetically favorable secondary structure(s)

Energy minimization (thermodynamics)

2. Comparative approach - based on comparisons of multiple evolutionarily-related RNA sequences

Sequence comparison (co-variation)

3. Combined computational & experimental - use experimental constraints when available

This slide has been changed

Page 12: BCB 444/544


RNA Secondary structure prediction - 3

3) Combined experimental & computational

• Experiments: Map single-stranded vs double-stranded regions in folded RNA

• How?
  Enzymes: S1 nuclease, T1 RNase
  Chemicals: kethoxal, DMS, •OH (hydroxyl radical)

• Software: Mfold, Sfold, RNAStructure, RNAFold, RNAlifold

This is a new slide

[Figure: structure-probing map of a folded RNA (nucleotide positions ~200-240), marking kethoxal and DMS modification sites as mild or strong]

Page 13: BCB 444/544


Ab Initio Prediction: Clarifications

• Free energy is calculated based on parameters determined in the wet lab

• Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking) (not base-pair)

• Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs - cooperative - because of base-stacking interactions

• Bulges and loops adjacent to base-pairs have a free energy penalty

This slide has been changed

Page 14: BCB 444/544


Energy minimization: What are the rules?

[Figure: two examples of stacked A=U base pairs]

AA / UU (A=U stacked on A=U): ΔG = -1.2 kcal/mole

AU / UA (A=U stacked on U=A): ΔG = -1.6 kcal/mole

What gives here?

C Staben 2005

This is a new slide

Page 15: BCB 444/544


Energy minimization calculations:Base-stacking is critical

Stacked pair (top strand / bottom strand)    ΔG (kcal/mole)
AA / UU                                      -1.2
AU / UA or UA / AU                           -1.6
AG, AC, CA, GA / UC, UG, GU, CU              -2.1
GU / UG                                      -0.3
XG, GX / YU, UY                               0
CG / GC                                      -3.0
GC / CG                                      -4.3
CC / GG                                      -4.8

- Tinoco et al.

C Staben 2005
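The stacking values in the table above can be summed along a helix to estimate its stability. Below is a minimal Python sketch, assuming a fully Watson-Crick-paired helix described only by its top strand (5' to 3'). The names `STACK_DG` and `helix_stacking_energy` are hypothetical, the entries for a stack read from the opposite strand (e.g. UU for AA / UU) are filled in by symmetry as an assumption, and the wobble (GU / UG) row and all loop or bulge penalties are left out.

```python
# Stacking free energies (kcal/mole) taken from the table above, keyed by the
# top-strand dinucleotide (5'->3') of a fully Watson-Crick-paired helix.
# Assumption: a stack read from the opposite strand has the same energy,
# so UU gets the AA / UU value, CU gets the AG / UC value, and so on.
# The wobble row (GU / UG, -0.3) and loop/bulge penalties are not modeled.
STACK_DG = {
    "AA": -1.2, "UU": -1.2,
    "AU": -1.6, "UA": -1.6,
    "CG": -3.0,
    "GC": -4.3,
    "CC": -4.8, "GG": -4.8,
    "AG": -2.1, "AC": -2.1, "CA": -2.1, "GA": -2.1,
    "CU": -2.1, "GU": -2.1, "UG": -2.1, "UC": -2.1,
}


def helix_stacking_energy(top_strand: str) -> float:
    """Sum nearest-neighbor stacking terms along a fully paired helix."""
    seq = top_strand.upper()
    return sum(STACK_DG[seq[i:i + 2]] for i in range(len(seq) - 1))


if __name__ == "__main__":
    for helix in ("AA", "AU", "GGCAU"):
        print(helix, round(helix_stacking_energy(helix), 1))
```

The energy depends on neighboring pairs rather than on individual pairs, which is why AA / UU and AU / UA differ even though both contain two A=U base pairs.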

This is a new slide

Page 16: BCB 444/544


Ab Initio Energy Calculation

• Search for all possible base-pairing patterns

• Calculate total energy of each structure based on all stabilizing and destabilizing forces

Fig 6.3Baxevanis & Ouellette 2005

Total free energy for a specific RNA conformation = Sum of incremental energy terms for:

• helical stacking (sequence dependent)
• loop initiation
• unpaired stacking

(favorable "increments" are < 0)

This slide has been changed

Page 17: BCB 444/544


Dynamic Programming

• Finding optimal secondary structure is difficult - lots of possibilities

• Compare RNA sequence with itself

• Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)

• Find path that represents most energetically favorable secondary structure
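To make the "compare the sequence with itself" idea concrete, here is a minimal Python sketch of a Nussinov-style dynamic program. It is a simplified stand-in, not the energy-based recursion used by Mfold-type programs: it maximizes the number of base pairs (G-C, A-U, G-U) instead of minimizing free energy. The function name `max_base_pairs` and the `min_loop` parameter (minimum unpaired bases enclosed by a pair) are assumptions made for illustration.

```python
# Allowed pairs: Watson-Crick plus G-U wobble.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Nussinov-style DP: maximum number of nested base pairs in seq."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])      # i or j unpaired
            if (seq[i], seq[j]) in PAIRS:               # i pairs with j
                best = max(best, dp[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                   # split into two parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

An energy-based method fills the same kind of table but scores each cell with stacking and loop terms rather than a count of pairs.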

This slide has been changed

Page 18: BCB 444/544


3 - Popular Programs that use Combined Computational & Experimental Approaches

• Mfold
• Sfold
• RNAStructure
• RNAFold
• RNAlifold

Page 19: BCB 444/544


Comparison of Predictions for Single RNA using Different Methods

[Figure: four predicted secondary structures of the same RNA, each with stem-loops SL X, SL Y, and SL Z labeled]

Mfold: -54.84 kcal/mol
RNAstructure: -71.3 kcal/mol
RNAfold: -80.16 kcal/mol
Sfold: -51.14 kcal/mol

JH Lee 2007

Page 20: BCB 444/544


Mfold plus constraints -54.84 kcal/mol

Mfold -126.05 kcal/mol

Comparison of Mfold Predictions: -/+ Constraints

JH Lee 2007

Page 21: BCB 444/544


Performance Evaluation

• Ab initio methods? correlation coefficient = 20-60%
• Comparative approaches? correlation coefficient = 20-80%
  • Programs that require user to supply MSA are more accurate
  • Comparative programs are consistently more accurate than ab initio

• Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace

• BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)
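As an illustration of how such accuracy figures can be computed, here is a minimal Python sketch that compares predicted base pairs with a reference structure. The slide does not say which correlation coefficient is meant; the Matthews correlation coefficient is used here as an assumption. The function name `base_pair_accuracy` is hypothetical, and pairs are assumed to be given as `(i, j)` tuples with `i < j`.

```python
from math import sqrt


def base_pair_accuracy(predicted, reference, length):
    """Return (sensitivity, PPV, Matthews correlation) for predicted pairs.

    predicted, reference: iterables of (i, j) index pairs with i < j.
    length: sequence length, used to count all possible pairs.
    """
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)                      # correctly predicted pairs
    fp = len(pred - ref)                      # predicted but not in reference
    fn = len(ref - pred)                      # reference pairs that were missed
    tn = length * (length - 1) // 2 - tp - fp - fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, ppv, mcc


if __name__ == "__main__":
    reference = {(0, 8), (1, 7), (2, 6)}
    predicted = {(0, 8), (1, 7), (3, 5)}
    print(base_pair_accuracy(predicted, reference, length=9))
```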

This slide has been changed

Page 22: BCB 444/544


Chp 8 - Gene Prediction

SECTION III GENE AND PROMOTER PREDICTION

Xiong: Chp 8 Gene Prediction

• Categories of Gene Prediction Programs

• Gene Prediction in Prokaryotes

• Gene Prediction in Eukaryotes

Page 23: BCB 444/544


What is a gene? segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"

• Genes can encode:
  • mRNA (for protein)
  • other types of RNA (tRNA, rRNA, miRNA, etc.)

• Genes differ in eukaryotes vs prokaryotes (& archaea), both structure & regulation

What is a Gene?

Page 24: BCB 444/544


Gene Finding

Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences

ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT

ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT

Steps:
• Search against protein / EST database
• Apply gene prediction programs (many programs available)
• Analyze regulatory regions
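As a small illustration of the problem statement, the example sequence on this slide contains one complete open reading frame. The following is a minimal Python sketch, assuming the simplest possible model (ATG start, in-frame stop, forward strand only, no introns). The names `find_orfs` and `min_codons` are hypothetical, and real gene prediction programs use far richer models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, sequence) for each ATG...stop ORF on this strand.

    start/end are 0-based, end-exclusive; the stop codon is included.
    min_codons counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                break
    return orfs


if __name__ == "__main__":
    example = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
    for start, end, orf in find_orfs(example):
        print(start, end, orf)
```

The second ATG in the example (overlapping the TAA stop) has no in-frame stop codon before the sequence ends, so it is not reported as a complete ORF.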

Page 25: BCB 444/544


Gene Prediction in Prokaryotes vs Eukaryotes

Prokaryotes
• Small genomes: 0.5 - 10 × 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%

Eukaryotes
• Large genomes: 10^7 - 10^10 bp
• Often less than 2% coding
• Complicated gene structure (splicing, long introns)
• Prediction success 50-95%

[Figure, prokaryotic gene: promoter, then open reading frame (ORF) from start codon (ATG) to stop codon (TAA)]

[Figure, eukaryotic gene: promoter, 5' UTR, exons and introns with splice sites between start codon (ATG) and stop codon (TAA), 3' UTR]

Page 26: BCB 444/544


DNA "Signals" Used by Gene Finding Algorithms

1. Exploit the regular gene structure
   ATG-Exon1-Intron1-Exon2-…-ExonN-STOP

2. Recognize "coding bias"
   CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…

3. Recognize splice sites
   Intron-cAGt-Exon-gGTgag-Intron

4. Model the duration of regions
   Introns tend to be much longer than exons, in mammals
   Exons are biased to have a given minimum length

5. Use cross-species comparison
   Gene structure is conserved in mammals
   Exons are more similar (~85%) than introns

Page 27: BCB 444/544


Computational Gene Finding Approaches

• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: test statistical properties distinguishing coding from non-coding DNA

• Similarity based methods
  • Database search: exploit similarity to proteins, ESTs, and cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?

• Hybrid methods - best

Page 28: BCB 444/544


Examples of Gene Prediction Software

Ab initio: Genscan, GeneMark.hmm, Genie, GeneID…

Similarity-based: BLAST, Procrustes…

Hybrids: GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

BEST?
Ab initio - Genscan (according to some assessments)
Hybrid - GeneSeqer
But depends on organism & specific task

Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/

Page 29: BCB 444/544


Synthesis & Processing of Eukaryotic mRNA

[Figure: flow from gene to mature mRNA]

Gene in DNA: 5' - exon 1 - intron - exon 2 - intron - exon 3 - 3'

Transcription

1° transcript (RNA)

Splicing (remove introns)

Capping & polyadenylation

Mature mRNA: 5' 7MeG cap … AAAAA 3'

Export to cytoplasm

Page 30: BCB 444/544


What are cDNAs & ESTs?

cDNA libraries are important for determining gene structure & studying regulation of gene expression

• Isolate RNA (always from a specific organism, region, and time point)
• Convert RNA to complementary DNA (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or Expressed Sequence Tags
• ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain

[Figure: cDNA clone showing vector and insert]

Page 31: BCB 444/544


UniGene: Unique genes via ESTs

• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene

• UniGene clusters contain many ESTs

• UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

Page 32: BCB 444/544


Gene Prediction

• Overview of steps & strategies
  • What sequence signals can be used?
  • What other types of information can be used?

• Algorithms
  • HMMs, Bayesian models, neural nets

• Gene prediction software
  • 3 major types
  • many, many programs!

Page 33: BCB 444/544


Overview of Gene Prediction Strategies

What sequence signals can be used?

• Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
• Processing signals: Splice donor/acceptors, polyA signal
• Translation: Start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage

What other types of information can be used?

• Homology (sequence comparison, BLAST) • cDNAs & ESTs (experimental data, pairwise alignment)

Page 34: BCB 444/544


Gene prediction: Eukaryotes vs prokaryotes

Gene prediction is easier in microbial genomes

Why?
• Smaller genomes
• Simpler gene structures
• Many more sequenced genomes! (for comparative approaches)

Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available, e.g.:
• GeneMark.hmm
• TIGR Comprehensive Microbial Resource (CMR)
• NCBI Microbial Genomes

Page 35: BCB 444/544


Predicting Genes - Basic steps:

• Obtain genomic sequence

• BLAST it!
  • Perform database similarity search (with EST & cDNA databases, if available)
  • Translate in all 6 reading frames (i.e., "6-frame translation")
  • Compare with protein sequence databases

• Use Gene Prediction software to locate genes
• Analyze regulatory sequences
• Refine gene prediction
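The "6-frame translation" step can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code only; the names `CODON_TABLE`, `reverse_complement`, `translate`, and `six_frame_translation` are hypothetical, and ambiguous bases are simply translated to `X`.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    example = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
    for name, protein in six_frame_translation(example).items():
        print(name, protein)
```

Each of the six translated frames can then be compared with a protein database, which is what a translated search such as BlastX does internally.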

Page 36: BCB 444/544


Predicting Genes - Details:

1. 1st, mask to "remove" repetitive elements (ALUs, etc.)

2. Perform database search on translated DNA (BlastX, TFasta)

3. Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)

4. Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences

5. Repeat

Page 37: BCB 444/544


• Perform pairwise alignment with large gaps in one sequence (due to introns)

• Align genomic DNA with cDNA, ESTs, protein sequences

• Score semi-conserved sequences at splice junctions
  • Using Bayesian model or Markov model (MM)

• Score coding constraints in translated exons
  • Using a Bayesian model or MM

Spliced Alignment Algorithm

Brendel 2005

GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

[Figure: intron bounded by splice sites, with GT at the donor site and AG at the acceptor site]

Brendel et al (2004) Bioinformatics 20: 1157
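A spliced aligner has to consider where introns could begin and end. The following is a minimal Python sketch of only that first step: listing candidate introns that start with GT (donor) and end with AG (acceptor). It is not the GeneSeqer algorithm, which scores splice sites and alignments probabilistically; the function name `candidate_introns` and the `min_len` parameter are assumptions made for illustration.

```python
def candidate_introns(genomic: str, min_len: int = 4):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive, so genomic[start:end]
    is the candidate intron and genomic[:start] + genomic[end:] is
    the sequence after removing it.
    """
    g = genomic.upper()
    donors = [i for i in range(len(g) - 1) if g[i:i + 2] == "GT"]
    acceptor_ends = [j + 2 for j in range(len(g) - 1) if g[j:j + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptor_ends if a - d >= min_len]


if __name__ == "__main__":
    genomic = "CCGTAAAGCC"
    for start, end in candidate_introns(genomic):
        print(start, end, genomic[start:end], genomic[:start] + genomic[end:])
```

Because GT and AG dinucleotides are common, the number of candidates grows quickly with sequence length, which is why the splice-site and coding scores on this slide are needed to choose among them.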

Page 38: BCB 444/544


Brendel - Spliced Alignment II:Compare with protein probes

Genomic DNA

Start codon Stop codon

Protein

Brendel 2005

Page 39: BCB 444/544


• Information content $I_i$:

$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 f_{iB}$$

• Extent of splice signal window:

$$I_i \le \bar{I} + 1.96\,\sigma_{\bar{I}}$$

$i$: ith position in sequence
$f_{iB}$: frequency of base $B$ at position $i$
$\bar{I}$: avg information content over all positions >20 nt from splice site
$\sigma_{\bar{I}}$: avg sample standard deviation of $\bar{I}$

Splice Site Detection

Brendel 2005

Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal?

YES
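The per-position information content can be computed directly from a set of aligned splice-site sequences. Below is a minimal Python sketch, assuming equal-length, gap-free sequences over the four bases and no small-sample correction; the function names `information_content` and `information_profile` and the toy donor-site alignment are hypothetical.

```python
from collections import Counter
from math import log2


def information_content(column: str) -> float:
    """I = 2 + sum over bases of f * log2(f), in bits, for one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return 2 + sum((c / total) * log2(c / total) for c in counts.values())


def information_profile(aligned_sites):
    """Information content at each position of equal-length aligned sites."""
    return [information_content("".join(col)) for col in zip(*aligned_sites)]


if __name__ == "__main__":
    # Toy alignment of donor-site windows, for illustration only.
    sites = ["GUAAG", "GUGAG", "GUAAG", "GUCAG"]
    print(information_profile(sites))
```

A fully conserved position scores 2 bits and a position with all four bases equally frequent scores 0 bits, so conserved positions around the GT or AG consensus stand out against the background.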

Page 40: BCB 444/544


Information content vs position

[Figure: two plots of information content (y-axis, 0.0 to 0.8) vs position (x-axis, -50 to 50); panels: HumanT2_GT and HumanT2_AG]

Brendel 2005

Which sequences are exons & which are introns? How can you tell?

Brendel et al (2004) Bioinformatics 20: 1157

Page 41: BCB 444/544


Markov Model for Spliced Alignment

[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels: PG, PA(n)PG, (1-PG)PD(n+1), (1-PG)(1-PD(n+1)), 1-PA(n)]

Brendel 2005