RNA Structure Prediction Chapter 16

RNA Structure Prediction

Chapter 16

Primary, Secondary and Tertiary Structures

RNA Structures

Ab Initio

Prediction based on a single RNA sequence
• Search for the RNA structure with the lowest energy
• Free energy ranked G-C < A-U < G-U < unpaired bases (G-C pairs are the most stabilizing)
• Stacking between aromatic rings (van der Waals interactions) gives rise to cooperativity
• Neighboring loops or bulges impose an unfavorable entropic change
• Find all possible base-pairing interactions
• Calculate the energy of each and choose the lowest-energy configuration

Dot Matrices
• Plot all interactions in a self-alignment plot
• Find diagonals after applying a sliding window
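The dot-matrix idea can be sketched in a few lines. This is a minimal illustration, not how any particular tool does it: the function names `can_pair` and `find_stems` and the parameters `min_len` and `min_loop` are assumptions made for this example, and a minimum run length stands in for the sliding-window filter.

```python
def can_pair(a, b):
    """True for Watson-Crick (A-U, G-C) or wobble (G-U) pairs."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def find_stems(seq, min_len=3, min_loop=3):
    """Return (i, j, length) for each run of consecutive pairs.

    A run means i pairs with j, i+1 with j-1, and so on, which shows up
    as a diagonal in the self-comparison plot. Runs shorter than min_len
    are filtered out, and at least min_loop bases must stay unpaired
    inside the innermost pair.
    """
    n = len(seq)
    stems = []
    for i in range(n):
        for j in range(i + 1, n):
            if not can_pair(seq[i], seq[j]):
                continue
            # Skip cells that merely continue a run that started further out.
            if i > 0 and j < n - 1 and can_pair(seq[i - 1], seq[j + 1]):
                continue
            length = 0
            while (i + length < j - length - min_loop
                   and can_pair(seq[i + length], seq[j - length])):
                length += 1
            if length >= min_len:
                stems.append((i, j, length))
    return stems
```

Each reported stem is only a candidate helix; choosing a compatible, low-energy set of stems is the job of the folding algorithm.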

Dynamic Programming
• Find the single optimal match
• Use Watson-Crick and wobble base-pairing scores
• Conformations with slightly higher energies may exist without optimal base pairing
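As a rough sketch of the dynamic programming step, the code below uses a Nussinov-style recursion that maximizes a weighted base-pair score. It is a simplification: the weights in `PAIR_SCORE` are hypothetical values that only mirror the G-C > A-U > G-U ordering, whereas real programs use measured free-energy parameters that include stacking and loop terms.

```python
from functools import lru_cache

# Hypothetical pair scores chosen for illustration only.
PAIR_SCORE = {
    ("G", "C"): 3, ("C", "G"): 3,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 1, ("U", "G"): 1,
}


def best_pairing_score(seq, min_loop=3):
    """Best total pair score over all nested (pseudoknot-free) structures."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Too short to hold a pair with min_loop unpaired bases inside.
        if j - i <= min_loop:
            return 0
        # Case 1: base j is unpaired.
        score = best(i, j - 1)
        # Case 2: base j pairs with some k, splitting the problem in two.
        for k in range(i, j - min_loop):
            s = PAIR_SCORE.get((seq[k], seq[j]))
            if s:
                score = max(score, best(i, k - 1) + s + best(k + 1, j - 1))
        return score

    return best(0, len(seq) - 1) if seq else 0
```

The recursion returns only the optimal score; recovering the structure itself needs a traceback, and near-optimal alternatives are not reported, which is the limitation the last bullet points at.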

Partition Function

Use a probability distribution to generate sub-optimal structures within a given energy range
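To illustrate what such a probability distribution looks like, the sketch below applies Boltzmann weighting to an explicitly enumerated set of structures. It assumes the free energies are already known (the values passed in are hypothetical), and it is not the dynamic programming recursion that real partition-function programs use to sum over all structures.

```python
import math

# R * T in kcal/mol at about 37 °C (R = 1.987e-3 kcal/(mol·K), T = 310.15 K).
RT = 1.987e-3 * 310.15


def boltzmann_probabilities(energies):
    """Map each structure name to its equilibrium probability.

    energies: dict of structure name -> free energy in kcal/mol.
    """
    weights = {name: math.exp(-e / RT) for name, e in energies.items()}
    z = sum(weights.values())  # the partition function
    return {name: w / z for name, w in weights.items()}
```

With this weighting, a structure 1 kcal/mol above the minimum is still populated at a noticeable level, which is why sub-optimal structures are worth reporting.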

Mfold
• http://mfold.bioinfo.rpi.edu/applications/mfold/
• Dynamic programming and thermodynamic calculation

RNAfold
• http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
• Extend alignment to more than one diagonal in dotplot to calculate thermodynamic stability of structures

Comparative Approach

Assumption that homologous RNA sequences fold into same structure

Covariation
• Covariant regions in homologous sequences are likely to be base-paired
• Predict consensus structure based on predictions for all aligned sequences

RNAalifold
• http://rna.tbi.univie.ac.at/cgi-bin/RNAalifold.cgi
• Requires prealignment
• Predictions based on covariance and minimum free energy; dynamic programming finds the optimal structure for the entire alignment

Foldalign
• No prealignment
• http://foldalign.ku.dk/
• Clustal alignment and dynamic programming
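One common way to quantify covariation between two alignment columns is mutual information. The sketch below is an illustration under that assumption; the slides do not say which covariation measure the listed tools use, and the function name `mutual_information` is made up for this example.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information in bits between alignment columns i and j.

    alignment: list of equal-length aligned sequences.
    """
    n = len(alignment)
    col_i = Counter(s[i] for s in alignment)
    col_j = Counter(s[j] for s in alignment)
    pairs = Counter((s[i], s[j]) for s in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi
```

In a toy alignment such as `["GAAAC", "CAAAG", "AAAAU", "UAAAA"]`, columns 0 and 4 change together and score 2 bits, while an invariant column scores 0. Note that mutual information only detects that two columns co-vary; checking that the co-varying residues are actually complementary is a separate step.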

Chapter 17

Genome Mapping, Assembly and Comparison

Definitions

Genomics – study of genomes

Structural genomics (genome analysis) – identification of genes, annotation of gene features, comparison of genome structures

Functional genomics – analysis of genome wide gene expression and gene functions

Genome Mapping

Cytological map
• Banding pattern of metaphase chromosomes
• Low resolution (Dustin units)

Genetic map
• Relative positions of genetic markers
• Marker associated with a specific genetic trait
• The closer two markers are, the lower the probability that a crossover event separates them and that they are inherited independently

Physical map
• Order of clone fragments, determined using a library of radio-labeled probes

Genome Sequencing

Shotgun approach
• Sequence a large number of randomly cloned DNA fragments
• The number of fragments sequenced must be large enough for overlaps to reconstruct the entire genome
• Requires no knowledge of the physical map
• Typically the equivalent of 6 genome lengths (“6× coverage”) must be sequenced to ensure correct assembly
• Gaps filled in with PCR “chromosome walking” (successive sequencing from primers designed from the last round of sequencing results)
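A back-of-the-envelope calculation shows why gaps remain even at 6× coverage. The sketch assumes reads land uniformly at random, so the number of reads covering a base is approximately Poisson distributed (the idealization behind Lander-Waterman statistics); the function names and the genome size in the usage note are hypothetical.

```python
import math


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read."""
    return math.exp(-coverage)


def reads_needed(genome_length, read_length, coverage):
    """Number of reads required to reach a given average coverage."""
    return math.ceil(coverage * genome_length / read_length)
```

For example, `reads_needed(1_000_000, 500, 6)` gives 12,000 reads for a 1 Mb genome, and `expected_uncovered_fraction(6)` is about 0.0025, so roughly 0.25% of bases are still expected to be missed and must be closed by walking.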

Hierarchical approach
• Clone very large fragments (100-300 kb) into Bacterial Artificial Chromosomes (BACs)
• Map BAC inserts by restriction enzyme analysis
• Arrange in order
• Choose the smallest number of BACs that cover the entire genome (“golden tiling path”)
• Sub-clone BAC insert fragments into bacterial vectors and sequence
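Choosing the smallest set of mapped clones that covers the genome is an interval-cover problem, which a greedy scan solves. The sketch below assumes each clone has already been placed on the map as a `(name, start, end)` interval; the function name and the coordinates in the usage note are hypothetical.

```python
def minimal_tiling_path(clones, genome_length):
    """Greedy minimal cover of [0, genome_length) by clone intervals.

    clones: list of (name, start, end) with end exclusive.
    Returns clone names in map order; raises ValueError on a coverage gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = 0
    idx = 0
    while covered < genome_length:
        best = None
        # Among clones starting inside the covered region,
        # keep the one that reaches furthest to the right.
        while idx < len(ordered) and ordered[idx][1] <= covered:
            if best is None or ordered[idx][2] > best[2]:
                best = ordered[idx]
            idx += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path
```

With clones `A (0-150)`, `B (50-200)`, `C (100-300)`, `D (180-320)` and `E (290-400)` on a 400 kb region, the function returns `["A", "C", "E"]`.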

Genome Sequence Assembly

Short sequence runs (500 bp) → 5-10 kb contigs → 30-50 kb supercontigs (scaffolds)

Major challenges

• Sequence errors
• Vector DNA contamination (filtering programs)
• Repetitive sequence regions (RepeatMasker)

Dealing with repeats (almost…)

•Forward-reverse constraint

Base calling: Phred

• http://www.phrap.org/
• Fourier analysis to resolve fluorescent traces
• Assignment to base giving probability score
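The probability score attached to each base call is conventionally expressed as a Phred quality value, Q = -10 · log10(P), where P is the estimated probability that the call is wrong. A small sketch of the conversion in both directions (the function names are chosen for this example):

```python
import math


def phred_quality(error_probability):
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Convert a Phred quality score back into an error probability."""
    return 10 ** (-quality / 10)
```

A score of 20 therefore corresponds to a 1-in-100 chance of a wrong base, and a score of 30 to 1 in 1,000.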

Sequence assembly: Phrap
• http://www.phrap.org/
• Takes Phred files as input
• Performs Smith-Waterman local alignment
• Progressively merges sequence pairs from highest to lowest similarity score, removing overlaps
• Outputs contigs
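The local alignment step can be illustrated with a compact Smith-Waterman scorer. This is a textbook sketch with a linear gap penalty, not Phrap's implementation: the scoring values are arbitrary defaults chosen for the example, and a real assembler would also weight positions by base quality.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Two reads that share a region score highly even if their flanks differ, which is what makes local alignment suitable for detecting overlaps between reads.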

Base calling and assembly programs

→ Nucleotide sequence

Additional software

VecScreen
• To remove “contaminating” vector DNA sequences from genomes
• http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
• Performs BLAST screen of submitted sequence against the UniVec non-redundant vector database
• Matches are displayed

TIGR Assembler (last updated 2003)
• http://www.jcvi.org/cms/research/software/
• Uses forward-reverse constraints
• Smith-Waterman sequence assembly

ARACHNE
• http://www.broad.mit.edu/wga/
• Gives statistical scores to overlaps
• Corrects errors in multiple overlaps
• Outputs contigs or supercontigs

EULER
• http://nbcr.sdsc.edu/euler/
• Uses an Eulerian path algorithm (de Bruijn graph approach)
• Useful for assembly of sequences with repeats

Genome Annotation

•Sequence

•Gene structures (GenScan, FgenesH)

•Predictions verified by BLAST against sequence database, cDNA and EST (GeneWise, Spidey, SIM4, EST2Genome)

•Manually verified by human curators

•Functional assignment of proteins by BLAST searches of protein database

•Further functional description from Pfam and InterPro and literature

Gene Ontology

Uses limited vocabulary to describe

• Cellular components
• Biological processes
• Molecular functions

Vocabulary arranged in a hierarchical manner from widest to most specific description
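The hierarchy means that annotating a gene with a specific term implies every broader term above it. The sketch below walks a toy child-to-parent map; the four term names are a heavily simplified, illustrative chain, and the real ontology has intermediate terms and terms with several parents.

```python
# Toy child -> parents map, for illustration only.
PARENTS = {
    "cytochrome-c oxidase activity": ["oxidoreductase activity"],
    "oxidoreductase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return all broader terms implied by `term`, nearest first."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen
```

Calling `ancestors("cytochrome-c oxidase activity")` walks from the most specific description up to the root of the molecular function branch.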

GO: “cytochrome c oxidase gene” in Ensembl

Automated Genome Annotation

• Genome data generated at an exponential rate requires automatic genome annotation
• Based on homologies

Genequiz
• http://swift.cmbi.kun.nl/swift/genequiz/
• BLAST and FASTA homology searches of databases
• Domain analysis with PROSITE and Blocks databases
• Analysis of secondary and supersecondary structure (e.g. coiled coils)
• All results compiled to produce a summary with an assigned confidence level

Annotation of hypothetical proteins

• In a newly sequenced genome, as many as 40% of proteins are “hypothetical”

To assign function:
• Homology searches in databases
• Search for similar motifs, domains and secondary structures
• Identify conserved functional sites by HMM
• Predict structure with fold recognition or threading
• Assign broad function to protein
• Test assigned function experimentally

How many genes in a genome?

• Total number of human genes ~25,000
• Equivalent to that in mouse
• 4× more than Saccharomyces cerevisiae
• It is not the number of cells in an organism that counts, but the number of specialized cells (tissues) and response conditions

Genome Economy

• One gene → one protein is not true
• EST data suggest >100,000 proteins in humans (from 25,000 genes?)

Alternative splicing
• Joining different exons from a single transcript to form different proteins

Exon shuffling
• Joining exons from different genes

• Drosophila Dscam gene contains 115 exons, 20 of which are constitutively spliced and 95 of which are alternatively spliced
• Expresses 38,016 different mRNAs by virtue of alternative splicing
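The figure of 38,016 follows from simple combinatorics if one alternative exon is chosen from each of four mutually exclusive clusters. The cluster sizes used below (12, 48, 33 and 2) come from the published description of Dscam rather than from these slides, so treat them as an outside assumption; they do add up to the 95 alternative exons quoted above.

```python
import math

# Number of mutually exclusive alternatives in each variable exon cluster
# (values taken from the literature, not from the slides).
ALTERNATIVE_EXON_CLUSTERS = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

total_alternative_exons = sum(ALTERNATIVE_EXON_CLUSTERS.values())
possible_isoforms = math.prod(ALTERNATIVE_EXON_CLUSTERS.values())

print(total_alternative_exons)  # 95
print(possible_isoforms)        # 38016
```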

Trans-splicing
• Drosophila mdg4 gene
• Joins 4 exons on the sense strand and 2 exons on the anti-sense strand

• A single transcript encodes both dentin phosphoprotein and dentin sialoprotein; the protein product is cleaved to form the two different proteins

• Human transcript for Prostate Specific Antigen (PSA) also encodes PSA-LM in the 4th intron

Comparative Genomics

• Compare genomes from different organisms

Whole Genome Alignment
• Extent of genome conservation
• Mechanism of genome evolution
• MUMmer and BLASTZ (BLASTZ is a modified BLAST for aligning long genome sequences)

Finding a minimal genome
• What is the minimum number of genes needed to support a free-living cellular entity?
• Useful to identify genes constituting essential metabolic pathways

Lateral Gene Transfer
• Identify by G-C skew
• GC%
• Codon bias
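The first two signals are straightforward to compute along a sequence. The sketch below uses the common definitions GC% = (G + C) / length and GC skew = (G - C) / (G + C) over fixed windows; the function names, window size and step are choices made for this example.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_skew(seq):
    """(G - C) / (G + C), or 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed(seq, size, step, func):
    """Apply func to successive windows of seq."""
    return [func(seq[i:i + size]) for i in range(0, len(seq) - size + 1, step)]
```

A window whose GC% or skew departs sharply from the genome-wide value is a candidate for laterally transferred DNA, although such deviations are suggestive rather than conclusive.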

Gene order comparisons

• Where gene order is conserved between genomes, it is called synteny
• Synteny may indicate functional relationships
• Often indicates physical interaction of proteins
• Genes encoding proteins that catalyze consecutive steps of a metabolic pathway are sometimes ordered accordingly (co-regulation as an “operon”?)
• MAL cluster in yeast: a multigene complex that encodes the MAL23 trans-acting MAL-activator, MAL21 maltose permease, and MAL22 maltase, in order, on chromosomes 2, 3, 7, 9 and 10
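A very simple way to look for conserved gene order is to list the gene pairs that are immediate neighbors in both genomes. The sketch assumes orthologous genes have already been given the same identifier in both genomes; the function name and the gene labels in the usage note are hypothetical, and strand and larger rearrangements are ignored.

```python
def conserved_neighbors(order_a, order_b):
    """Gene pairs adjacent in both genomes, listed in genome A order.

    order_a, order_b: lists of ortholog identifiers in chromosomal order.
    Adjacency is treated as unordered, so inverted segments still count.
    """
    def adjacent(order):
        return {frozenset(pair) for pair in zip(order, order[1:])}

    shared = adjacent(order_a) & adjacent(order_b)
    return [(x, y) for x, y in zip(order_a, order_a[1:])
            if frozenset((x, y)) in shared]
```

For `["g1", "g2", "g3", "g4", "g5"]` compared with `["g9", "g3", "g2", "g1", "g5"]`, the function returns `[("g1", "g2"), ("g2", "g3")]`, recovering the inverted three-gene block.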