1BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction 10/22/07
BCB 444/544
Lecture 26
Gene Prediction
#26_Oct22
Required Reading (before lecture)
Mon Oct 22 - Lecture 26
Gene Prediction
• Chp 8 - pp 97 - 112
Wed Oct 24 - Lecture 27 (will not be covered on Exam 2)
Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2
Assignments & Announcements
Sun Oct 21 - Study Guide for Exam 2 was posted
Mon Oct 22 - HW#4 Due (no "correct" answer to post)
Thu Oct 25 - Lab = Optional Review Session for Exam
544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading:
Chps 6 (beginning with HMMs), 7-8, 12-16
Eddy: What is an HMM
Ginalski: Practical Lessons…
BCB 544 "Team" Projects
• 544 Extra HW#2 is next step in Team Projects
• Write ~ 1 page outline
• Schedule meeting with Michael & Drena to discuss topic
• Read a few papers
• Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
• 1 or 2 teams will present during each class period
See Guidelines for Projects posted online
BCB 544 Only: New Homework Assignment
544 Extra#2 (posted online Thurs?) No - sorry! sent by email on Sat…
Due: PART 1 - ASAP
PART 2 - Fri Nov 2 by 5 PM
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
• Dave Segal UC Davis Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Guang Song ComS, ISU Probing functional mechanisms by structure-based modeling and simulations
Chp 16 - RNA Structure Prediction
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 16 RNA Structure Prediction (Terribilini)
• RNA Function
• Types of RNA Structures
• RNA Secondary Structure Prediction Methods
• Ab Initio Approach
• Comparative Approach
• Performance Evaluation
Covalent & non-covalent bonds in RNA
(Fig 6.2, Baxevanis & Ouellette 2005)
Primary: Covalent bonds
Secondary/Tertiary: Non-covalent bonds
• H-bonds (base-pairing)
• Base stacking
This is a new slide
RNA Pseudoknots & Tetraloops
• Often have important regulatory or catalytic functions
[Figure: pseudoknot and tetraloop structures]
Image sources:
http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif
http://www.lbl.gov/Science-Articles/Research-Review/Annual-Reports/1995/images/rna.gif
This is a new slide
Base Pairing in RNA
G-C, A-U, G-U ("wobble") & many variants
See: IMB Image Library of Biological Molecules
http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs
This slide has been changed
RNA Secondary Structure Prediction Methods
Two (three, recently) main types of methods:
1. Ab initio - based on calculating most energetically favorable secondary structure(s)
Energy minimization (thermodynamics)
2. Comparative approach - based on comparisons of multiple evolutionarily-related RNA sequences
Sequence comparison (co-variation)
3. Combined computational & experimental
Use experimental constraints when available
This slide has been changed
RNA Secondary structure prediction - 3
3) Combined experimental & computational
• Experiments: Map single-stranded vs double-stranded regions in folded RNA
• How?
Enzymes: S1 nuclease, T1 RNase
Chemicals: kethoxal, DMS, OH
• Software: Mfold, Sfold, RNAStructure, RNAFold, RNAlifold
This is a new slide
[Figure: kethoxal modification (mild, strong) and DMS modification (mild, strong); lane labels G and DMS; positions 200, 220, 240]
Ab Initio Prediction: Clarifications
• Free energy is calculated based on parameters determined in the wet lab
• Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking) (not base-pair)
• Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs - cooperative - because of base-stacking interactions
• Bulges and loops adjacent to base-pairs have a free energy penalty
This slide has been changed
Energy minimization: What are the rules?
[Figure: two base-pair stacks compared]
A=U stacked on A=U (strands AA / UU): ΔG = -1.2 kcal/mole
A=U stacked on U=A (strands AU / UA): ΔG = -1.6 kcal/mole
What gives here?
C Staben 2005
This is a new slide
Energy minimization calculations: Base-stacking is critical
Stacked pairs (top strand / bottom strand)    ΔG (kcal/mole)
AA / UU                                       -1.2
CG / GC                                       -3.0
AU or UA / UA or AU                           -1.6
GC / CG                                       -4.3
AG, AC, CA, GA / UC, UG, GU, CU               -2.1
GU / UG                                       -0.3
CC / GG                                       -4.8
XG, GX / YU, UY                                0
- Tinoco et al.
C Staben 2005
This is a new slide
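The stacking table can be turned into a small calculation. The following is a minimal sketch, not the energy model used by any of the programs named in this lecture: it sums the stacking increments over a perfect helix only. The names `STACK`, `stack_energy`, and `helix_energy` are hypothetical, the values are transcribed from the slide, and the strand orientation (top written 5'->3', bottom 3'->5') is an assumption. The generic XG/YU row (0) is left out.

```python
# Stacking increments in kcal/mole, transcribed from the slide's table.
# Key = (top-strand dinucleotide, bottom-strand dinucleotide), both read
# left to right. ASSUMPTION: top is written 5'->3' and bottom 3'->5'.
STACK = {
    ("AA", "UU"): -1.2,
    ("AU", "UA"): -1.6,
    ("UA", "AU"): -1.6,
    ("AG", "UC"): -2.1,
    ("AC", "UG"): -2.1,
    ("CA", "GU"): -2.1,
    ("GA", "CU"): -2.1,
    ("CG", "GC"): -3.0,
    ("GC", "CG"): -4.3,
    ("GU", "UG"): -0.3,
    ("CC", "GG"): -4.8,
}


def stack_energy(top, bottom):
    """Return the increment for one stack, trying both strand readings."""
    if (top, bottom) in STACK:
        return STACK[(top, bottom)]
    # The same stack read from the other strand: swap strands and reverse.
    flipped = (bottom[::-1], top[::-1])
    if flipped in STACK:
        return STACK[flipped]
    raise KeyError(f"no table entry for {top}/{bottom}")


def helix_energy(top, bottom):
    """Sum stacking increments over every adjacent pair of base pairs."""
    if len(top) != len(bottom):
        raise ValueError("strands must have equal length")
    return sum(
        stack_energy(top[i:i + 2], bottom[i:i + 2])
        for i in range(len(top) - 1)
    )


# Stacks GC/CG (-4.3) + CA/GU (-2.1) + AA/UU (-1.2)
print(round(helix_energy("GCAA", "CGUU"), 1))  # -7.6
```

Note that `stack_energy("AA", "UU")` and `stack_energy("AU", "UA")` differ even though both stacks contain two A-U pairs, which is the point of the "What gives here?" slide: the energy depends on the neighbors, not on the base pairs alone.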
Ab Initio Energy Calculation
• Search for all possible base-pairing patterns
• Calculate total energy of each structure based on all stabilizing and destabilizing forces
(Fig 6.3, Baxevanis & Ouellette 2005)
Total free energy for a specific RNA conformation = Sum of incremental energy terms for:
• helical stacking (sequence dependent)
• loop initiation
• unpaired stacking
(favorable "increments" are < 0)
This slide has been changed
Dynamic Programming
• Finding optimal secondary structure is difficult - lots of possibilities
• Compare RNA sequence with itself
• Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)
• Find path that represents most energetically favorable secondary structure
This slide has been changed
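To make the "compare the sequence with itself" idea concrete, here is a minimal sketch of a Nussinov-style dynamic program. It is a deliberately simplified stand-in: it maximizes the number of base pairs rather than minimizing free energy, so it is not the scoring scheme described on the slide or the algorithm inside Mfold. The function names and the `min_loop` parameter (minimum number of unpaired bases in a hairpin loop) are assumptions made for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return (a, b) in PAIRS


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j];
    # rows and columns both index the same sequence.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])  # i or j left unpaired
            if can_pair(seq[i], seq[j]):
                best = max(best, dp[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):  # split into two independent parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_pairs("GGGAAAUCC"))  # 3
```

In the example, the three pairs are G-C, G-C, and G-U closing an AAA hairpin loop. An energy-based method fills the same kind of table but replaces the "+1 per pair" score with stacking increments and loop penalties.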
3 - Popular Programs that use Combined Computational Experimental Approaches
• Mfold
• Sfold
• RNAStructure
• RNAFold
• RNAlifold
Comparison of Predictions for Single RNA using Different Methods
[Figure: four predicted secondary structures of the same RNA, each with stem-loops labeled SL X, SL Y, SL Z]
• Mfold: -54.84 kcal/mol
• RNAstructure: -71.3 kcal/mol
• RNAfold: -80.16 kcal/mol
• Sfold: -51.14 kcal/mol
JH Lee 2007
Comparison of Mfold Predictions: -/+ Constraints
• Mfold plus constraints: -54.84 kcal/mol
• Mfold: -126.05 kcal/mol
JH Lee 2007
Performance Evaluation
• Ab initio methods? correlation coefficient = 20-60%
• Comparative approaches? correlation coefficient = 20-80%
• Programs that require user to supply MSA are more accurate
• Comparative programs are consistently more accurate than ab initio
• Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace
• BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)
This slide has been changed
Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes
What is a Gene?
What is a gene? Segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
• Genes can encode:
• mRNA (for protein)
• other types of RNA (tRNA, rRNA, miRNA, etc.)
• Genes differ in eukaryotes vs prokaryotes (& archaea), both structure & regulation
Gene Finding
Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
Steps:
• Search against protein / EST database
• Apply gene prediction programs (many programs available)
• Analyze regulatory regions
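As a rough illustration of the gene finding problem, the sketch below scans the example sequence from this slide for open reading frames on the forward strand. It is a toy, not one of the gene prediction programs discussed in the lecture: it only looks for an ATG followed by an in-frame stop codon. The function name `find_orfs` and the `min_codons` parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for forward-strand ORFs in all 3 frames.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon (stop included); end is exclusive, 0-based coordinates.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ATG in this frame
    return orfs


slide_seq = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
print(find_orfs(slide_seq))
# [(6, 27, 'ATGGGGCAGGGTCAGATATAA')]
```

The second ATG in the sequence (position 26) never reaches an in-frame stop codon, so it is not reported. A real gene finder would also scan the reverse strand and would not rely on ORFs alone in eukaryotes, where introns interrupt the coding sequence.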
Gene Prediction in Prokaryotes vs Eukaryotes
Prokaryotes
• Small genomes: 0.5 - 10 × 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
[Figure: Promoter, then open reading frame (ORF) running from start codon (ATG) to stop codon (TAA)]
Eukaryotes
• Large genomes: 10^7 - 10^10 bp
• Often less than 2% coding
• Complicated gene structure (splicing, long exons)
• Prediction success 50-95%
[Figure: Promoter, 5' UTR, ATG, exons separated by introns (splice sites marked), TAA, 3' UTR]
DNA "Signals" Used by Gene Finding Algorithms
1. Exploit the regular gene structure
ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
2. Recognize "coding bias"
CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
Intron - cAGt - Exon - gGTgag - Intron
4. Model the duration of regions
Introns tend to be much longer than exons, in mammals
Exons are biased to have a given minimum length
5. Use cross-species comparison
Gene structure is conserved in mammals
Exons are more similar (~85%) than introns
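The splice-site signal can be illustrated with a naive scan for the GT...AG dinucleotides that flank introns. This is only a sketch of "search by signal": it uses the bare dinucleotides rather than the longer consensus shown on the slide, and the toy sequence, the function name `candidate_introns`, and the `min_len` parameter are all made up for illustration.

```python
def candidate_introns(seq, min_len=10):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based with an exclusive end, so seq[start:end]
    is the candidate intron including both dinucleotides.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


# Toy gene: exon ATGGCC, intron GTAAGTCCCTTTTCAG, exon GATTAA
toy = "ATGGCC" + "GTAAGTCCCTTTTCAG" + "GATTAA"
print(candidate_introns(toy))  # [(6, 22), (10, 22)]

start, end = 6, 22
print(toy[:start] + toy[end:])  # ATGGCCGATTAA
```

Even this tiny sequence yields two candidates for one real intron, which is why gene finders combine splice signals with coding bias, length models, and similarity evidence instead of relying on the dinucleotides alone.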
Computational Gene Finding Approaches
• Ab initio methods
• Search by signal: find DNA sequences involved in gene expression
• Search by content: test statistical properties distinguishing coding from non-coding DNA
• Similarity based methods
• Database search: exploit similarity to proteins, ESTs, and cDNAs
• Comparative genomics: exploit aligned genomes
• Do other organisms have similar sequence?
• Hybrid methods - best
Examples of Gene Prediction Software
Ab initio: Genscan, GeneMark.hmm, Genie, GeneID…
Similarity-based: BLAST, Procrustes…
Hybrids: GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.
BEST?
Ab initio - Genscan (according to some assessments)
Hybrid - GeneSeqer
But depends on organism & specific task
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Synthesis & Processing of Eukaryotic mRNA
[Figure: gene in DNA (exon 1 - intron - exon 2 - intron - exon 3), the 1° transcript (RNA), and the mature mRNA with a 5' 7MeG cap and a 3' AAAAA tail]
Steps labeled in the figure:
• Transcription
• Splicing (remove introns)
• Capping & polyadenylation
• Export to cytoplasm
What are cDNAs & ESTs?
cDNA libraries are important for determining gene structure & studying regulation of gene expression
• Isolate RNA (always from a specific organism, region, and time point)
• Convert RNA to complementary DNA (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or Expressed Sequence Tags
• ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain
[Figure: cDNA insert cloned into a vector]
UniGene: Unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression
Gene Prediction
• Overview of steps & strategies
• What sequence signals can be used?
• What other types of information can be used?
• Algorithms
• HMMs, Bayesian models, neural nets
• Gene prediction software
• 3 major types
• many, many programs!
Overview of Gene Prediction Strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
• Processing signals: Splice donor/acceptors, polyA signal
• Translation: Start (AUG = Met) & stop (UGA, UAA, UAG), ORFs, codon usage
What other types of information can be used?
• Homology (sequence comparison, BLAST)
• cDNAs & ESTs (experimental data, pairwise alignment)
Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
• Smaller genomes
• Simpler gene structures
• Many more sequenced genomes! (for comparative approaches)
Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available, e.g.:
• GeneMark.hmm
• TIGR Comprehensive Microbial Resource (CMR)
• NCBI Microbial Genomes
Predicting Genes - Basic steps:
• Obtain genomic sequence
• BLAST it!
• Perform database similarity search (with EST & cDNA databases, if available)
• Translate in all 6 reading frames (i.e., "6-frame translation")
• Compare with protein sequence databases
• Use Gene Prediction software to locate genes
• Analyze regulatory sequences
• Refine gene prediction
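The "6-frame translation" step can be sketched in a few lines. This is an illustrative sketch, not the implementation used by BLAST or any other tool: it builds the standard genetic code from the conventional TCAG ordering and translates three frames on each strand. The function names and the frame labels (`+1` to `+3`, `-1` to `-3`) are assumptions chosen for readability.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with non-ACGT symbols
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Translate all three frames of both strands."""
    frames = {}
    for strand, s in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for name, protein in six_frame("ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT").items():
    print(name, protein)
```

Each of the six translations can then be compared against a protein database; a frame with a long stretch free of `*` and a database hit is a candidate coding region.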
Predicting Genes - Details:
1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA (BlastX, TFasta)
3. Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)
4. Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences
5. Repeat
Spliced Alignment Algorithm
GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
• Perform pairwise alignment with large gaps in one sequence (due to introns)
• Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
• Using Bayesian model or MM (Markov model)
• Score coding constraints in translated exons
• Using a Bayesian model or MM
[Figure: intron flanked by splice sites, GT at the donor site and AG at the acceptor site]
Brendel 2005
Brendel et al (2004) Bioinformatics 20: 1157
Brendel - Spliced Alignment II: Compare with protein probes
[Figure: protein aligned to genomic DNA between the start codon and stop codon]
Brendel 2005
Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information content I_i:
$I_i = 2 + \sum_{B \in \{U,C,A,G\}} f_{iB} \log_2 f_{iB}$
• Extent of splice signal window:
$I_i \le \bar{I} + 1.96\,\sigma_{\bar{I}}$
i: ith position in sequence
f_iB: frequency of base B at position i
Ī: avg information content over all positions >20 nt from splice site
σ_Ī: avg sample standard deviation of Ī
Brendel 2005
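The information content formula on this slide can be evaluated directly from a set of aligned splice-site sequences. The sketch below is an illustration under stated assumptions, not the GeneSeqer implementation: the function name `information_content` is hypothetical, all input sequences are assumed to be the same length and aligned at the splice site, DNA input is converted to the U alphabet used on the slide, and terms with zero frequency are skipped (treating 0·log 0 as 0).

```python
import math
from collections import Counter


def information_content(aligned_sites):
    """Per-position I_i = 2 + sum over B of f_iB * log2(f_iB), in bits."""
    sites = [s.upper().replace("T", "U") for s in aligned_sites]
    length = len(sites[0])
    profile = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        total = sum(counts.values())
        info = 2.0
        for base in "UCAG":
            f = counts.get(base, 0) / total  # f_iB: frequency of base at position i
            if f > 0:  # skip zero frequencies (0 * log 0 taken as 0)
                info += f * math.log2(f)
        profile.append(info)
    return profile


# Made-up toy alignment: an invariant GU followed by a random position.
print(information_content(["GUA", "GUC", "GUG", "GUU"]))  # [2.0, 2.0, 0.0]
```

An invariant position carries the maximum 2 bits and a position with all four bases equally frequent carries 0 bits, so a plot of I_i against position shows how far the conserved signal extends around the splice site.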
Information content vs position
[Figure: two panels, HumanT2_GT and HumanT2_AG; x-axis: position (-50 to 50); y-axis: information content (0.0 to 0.8)]
Which sequences are exons & which are introns? How can you tell?
Brendel 2005
Brendel et al (2004) Bioinformatics 20: 1157
Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels: P_G, P_A(n)·P_G, (1-P_G)·P_D(n+1), (1-P_G)·(1-P_D(n+1)), 1-P_A(n)]
Brendel 2005