Marcella A. McClure, Ph.D.
Department of Microbiology and the Center for Computational Biology
Montana State University, Bozeman MT
mars@parvati.msu.montana.edu
Lectures in Computational Virology
Bioinformatic Studies on the Evolution, Structure and Function of RNA-based Life Forms
McCLURE
LAB
OF
BIOINFORMATICS
Living on the edge of statistics!
Bioinformatics is the creation of new knowledge from existing data. This
type of research takes place in silico and includes the development
and testing of the software tools necessary to analyze the data.
McClure, 2000
What is Bioinformatics?
The practice of bioinformatics is an interplay between knowledge of empirically derived data,
bioinformatic tools and human decision making. Exactly which information and tools are
accessed depends on the nature of the question of interest.
McClure, 2000
The Practice of Bioinformatics
1) Potential Multiple Endonuclease Functions and a Ribonuclease H Encoded in Retroposon Genomes
2) Hypothesis: The Reverse Transcriptase Domain Shares Common Ancestry with the RNA-dependent RNA polymerase of both positive and negative-stranded RNA viruses: a Test of Protein Motif Finding Methods.
3) A Functional Genomics Challenge: the Transcription/ Replication Complex of the Order Mononegavirales.
Rabies, measles and Ebola viruses belong to the order Mononegavirales and are true RNA-based life-forms that have no DNA stage. To date, little has been learned about the distribution of functions within, or the actual structure of, the replication/transcription complex (three proteins and the RNA template). The goal is to elucidate potential regions and residues of protein:protein interaction in the replication/transcription complex without structural information. The studies will proceed along three paths: prediction of disorder; determination of compensatory mutations; and the assessment of evolutionary dynamics. Correlating the results of these methods will provide high-probability candidates for the protein:protein contacts.
Recent and Current Projects
Recent and Current Projects, cont.
4) Mapping of All Genomic Retroid Agents: Prototype Human Genome.
The Retroid Agents (e.g., HIV, Hepatitis B, retrotransposons, etc.) encode the reverse transcriptase, thereby providing the interface for the transfer of genetic information from RNA-based to DNA-based replication systems. The goal of this project is to identify, classify and map all Retroid Agents of a specific genome. The Genome Parsing Suite is the prototype software that not only identifies and classifies these agents, but also determines Retroid genome boundaries, architecture, and gene complement, and assesses the host environment of each agent. These data are then used to create a browseable database that will be available for display through the UCSC Genome Browser. The creation of this database is necessary for hypothesis testing regarding the roles that Retroid Agents play in the reproduction, development, evolution and disease processes of Eukaryotes, including humans.
1) Introduction to RNA-based life forms
2) Methods to test the hypothesis.
3) Testing the hypothesis.
4) Predicting protein contacts.
Summary Lecture I
DNA viruses
ssDNA, dsDNA
RNA viruses
ssRNA (+ ssRNA, - ssRNA), dsRNA
Does the RT domain of the RdDp share common ancestry with the RdRp of negative and positive polarity, single-stranded viruses?
RdDp
RdRp
host Pol II
The World of Viruses
Paramyxoviridae
Filoviridae
Retroviridae
Picornaviridae
Rhabdoviridae
Retroviruses, retrotransposons, pararetroviruses, retroposons, retroplasmids, retrointrons, and retrons
Replication by
DNA-dependent
DNA polymerase
PROTEIN SYNTHESIS
snRNAs, ribozymes, tRNA, rRNA
translation
transcription
reverse transcriptase mediated replication
or transposition
RNA DNA
RNA viruses e.g.,
Ebola, rabies,
influenza, polio
Replication by
RNA-dependent
RNA Polymerase
All cellular systems
& most DNA Viruses
McClure, 2000
Retroid Agents
Mononegavirales
“OLD” FOES: rabies (Rhabdoviridae); measles, RSV, mumps (Paramyxoviridae)
“EMERGING” THREATS: Ebola, Marburg (Filoviridae); equine morbillivirus, Nipah virus (Paramyxoviridae)
MODEL AGENT: vesicular stomatitis virus (Rhabdoviridae)
1) Disease:
   a) retroviruses:
      1) exogenous infectious: HIV, HTLV
      2) endogenous associations: breast cancer, testicular tumors, insulin dependent diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia and systemic lupus erythematosus
   b) LINEs insertional mutagenesis:
      1) Hemophilia A
      2) muscular dystrophies; Duchenne and Fukuyama-congenital type
      3) X-linked disorders; Alport Syndrome-Diffuse Leiomyomatosis and Chronic Granulomatous Disease
2) Regulation of cellular genes and reproduction
3) Telomere maintenance
4) Repair of broken dsDNA
5) Exchange of genetic information among and between organisms
Roles of Retroid Agents:
Plus-strand RNA Virus Families and Human Diseases
Togaviridae - Rift Valley Fever
Flaviviridae - Dengue Fever virus, West Nile virus
Coronaviridae - Infectious Bronchitis
Caliciviridae - Hepatitis E virus
Picornaviridae - Human poliovirus, Hepatitis A
MMLV Genome
5’LTR GAG RdDp ENV 3’LTR
PRO RT/RH INT
Paramyxoviridae Genome
N P/C/V M F HN RdRp
Rhabdoviridae Genome
N P M G RdRp
Filoviridae Genome
N VP35 VP40 G VP30 VP24 RdRp
Picornaviridae Genome
VPg L P4 P2 P3 P1 2A 2B 2C 3A 3B 3C 3D Poly(A)
RdRp
Figure: model of VSV transcription and replication on the RNA template. Labels: L, P and N proteins, leader, read through, 3' and 5' ends, CO-ASSEMBLY?
VSV Transcription
VSV Replication
Model of a poliovirus polymerase-dsRNA complex based on the structure of HIV-1 RT complexed to dsDNA (Huang et al., 1998).
HIV-1 Reverse Transcriptase
Poliovirus Polymerase
Poliovirus Polymerase Oligomer
Model of a poliovirus polymerase-dsRNA complex
Analysis of Multiple Alignment
Search Databases
Multiple Alignment of Sequences
Refined Multiple Alignment
Annotation and Preparation of Sequences
McClure, 2000
Basic Strategy
McClure, 2000
Biological Patterns
“Whether randomness can be measured is a difficult problem. One cannot judge the absence of pattern without specifying which pattern, and what is a pattern to you may not be a pattern to me.”
McClure 2002
An OSM, which may span hundreds of residues, is defined as a set of conserved or semi-conserved motifs (1-9 contiguous amino acid residues) found in the same arrangement relative to one another in all sequences of a protein family. The amino acids of these patterns are involved in catalysis or structural integrity. The spacing between motifs, the motif intervening regions (MIRs), can be highly variable, reflecting the regions of a protein that are less restricted by functional or structural constraints. MIRs may evolve more rapidly and be more subject to insertion/deletion events and duplications than the OSM.
Why is OSM identification important?
The OSM of a protein family can be used to predict function. The identification of an OSM common among protein sequences with as little as 8% amino acid identity has led to successful prediction of function. If a multiple alignment method (be it global or local) cannot correctly identify the highly conserved residues of a given sequence that are critical for function and structure, then it is of little value.
What is an ordered series of motifs (OSM)?
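The OSM definition above can be illustrated with a minimal Python sketch: each motif must be found, in order, with any spacing between motifs. The two patterns are illustrative only (they are loosely taken from RT core residues shown in these slides), and `find_osm` is a hypothetical helper, not part of any published tool.

```python
import re

# Illustrative patterns only: two RT core motifs as they appear in these slides.
RT_MOTIFS = [r"LPQG", r"Y[MV]DD"]


def find_osm(sequence, motifs=RT_MOTIFS):
    """Return motif start positions if every motif occurs in order, else None.

    Gap characters ('.' and '-') are removed first, so positions refer to
    the ungapped sequence. The spacing between motifs (the MIRs) is free.
    """
    ungapped = sequence.replace(".", "").replace("-", "")
    positions = []
    search_from = 0
    for pattern in motifs:
        match = re.compile(pattern).search(ungapped, search_from)
        if match is None:
            return None  # a motif is missing or out of order
        positions.append(match.start())
        search_from = match.end()
    return positions


sequences = {
    "HV04": "GIRYQYNVLPQGWKGSPAIFQSSMTKILDPFRRDNPELEICQYMDDLYVGSDLPLTEHRKRIELLREHLY",
    "SP01": "GKQYCWTRLPQGFLNSPALFTADAVDLLKEVP.N.....VQVYVDDIYLSHDNP.HEHIQQLEKVFQILL",
    "CO01": "VKH..TFIRILMSVVVDQD.LELEQMDVKTAFLHGELEEELYMEQPEGFIS.............EDGKNK",
}
for name, seq in sequences.items():
    print(name, find_osm(seq))
```

Because only the order of the motifs is enforced, the same check tolerates the highly variable MIR lengths that the definition describes. Real motif detection would use profiles or probabilistic models rather than fixed patterns.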
LEVELS of SEQUENCE COMPARISONS

> 25% IDENTICAL = HOMOLOGY
HT01 PGNNPVFPVKKANGTWRFIHDLRATNSLTIDLSSSSPGPPDLSSLPTTLAHLQTIDLKDAFFQIPLPKQF
HT02 PGNNPVFPVKKANGTWRFIHDLRATNSLTIDLSSSSPGPPDLSSLPTTLAHLQTIDLKDAFFQIPLPKQF
HT11 PGNNPVFPVKKPNGKWRFIHDLRATNAITTTLTSPSPGPPDLTSLPTALPHLQTIDLTDAFFQIPLPKQY
BL01 PGNNPVFPVRKPNGAWRFVHDLRATNALTKPIPALSPGPPDLTAIPTHPPHIICLDLKDAFFQIPVEDRF
BL02 PGNNPVFPVRKPNGAWRFVHDLRVTNALTKPIPALSPGPPDLTAIPTHLPHIICLDLKDAFFQIPVEDRF

< 25% IDENTICAL + OSM = HOMOLOGY
HV04 GIRYQYNVLPQGWKGSPAIFQSSMTKILDPFRRDNPELEICQYMDDLYVGSDLPLTEHRKRIELLREHLY
SV22 GIRYQFNCLPQGWKGSPTIFQNTAANILEEIKRHTPGLEIVQYMDDLWLASDHDETRHNQQVDIVRKMLL
BL01 HRRFAWRVLPQGFINSPALFERALQEPLRQVSAAFSQSLLVSYMDDILYASPTEEQRS.QCYQALAARLR
SP01 GKQYCWTRLPQGFLNSPALFTADAVDLLKEVP.N.....VQVYVDDIYLSHDNP.HEHIQQLEKVFQILL
MM29 TTQYTWTQLPQRFKNSPTIFGEALARDLQKFPTRDLGCVLLQYVDDLLLGHPTAVGWP.REQMLYSGTWR

< 25% IDENTICAL - OSM = NO HOMOLOGY
IN10 DHH..MLEKLLVKHFQDQSFIDLYWKMVKAGYVEFDKDKSSMIGVPQGGIASPMLSNLVLNELDEFVQNI
CO01 VKH..TFIRILMSVVVDQD.LELEQMDVKTAFLHGELEEELYMEQPEGFIS.............EDGKNK
NL04 IEAGQSAMRFRRTNGRDNRFLLVVSMDVKNAFNTASWQAIATALQMKGVPAG..............LQRI
BA01 LKR............VGNK.KVFSKFDLKSGFHQVAMAEESIPWTAFWVPQG..................
HB78 FYHIPISPAAVPHLLVGSP..GLERFNTCLSYSTHNRNDSQLQTMHNLCTRH..............VYSS
McClure, 2000
I II III IV V VI
HT13 pvkKa-- t- IDLkdaf - LPQG-fk qYMDDIll shGL-- kFLGqii
NVV0 ikk K--- ti LDIgday - LPQG-wk -YMDDIyi qyGFM- kWLGfel
SFV1 pvp Kp-- tt LDLtngf - LPQG-fl aYVDDIyi naGYVv eFLGfni
HERVC pvp Kp-- tc LDLkdaf - LPQR-fk qYVDDLll tvGIRc cYLGfti
GMG1 mvr Ka-- tk VDVraaf - CPFG-la aYLDDIli --GLN- kYLGfiv
GM17 v-p Kkqd tt IDLakgf - MPFG-lk vYLDDIiv --NLK- tFLG-hv
MDG1 lvp Kksl sc LDLmsgf - LPFG-lk lYMDDLvv --NLK- tYLG-hk
MORG vvr Kk-- tt MDLqngf - APFG-fk lYMDDIiv --GLK- hFLG-hi
CAT1 lvd Kpkd eq MDVktaf k SLYG-lk lYVDDMli --EMK- rILGidi
CMC1 tit Krpe hq MDVktaf k AIYG-lk lYVDDVvi --- KR- hFIGiri
CST4 ftk Krng t- LDInhaf k ALYG-lk vYVDDCvi inKLK- dILGmdl
C1095 fnr Krdg tq LDIssay k SLYG-lk lFVDDMil itTLKk dILGlei
NDM0 mih Kt-- af LDIqqaf g VPQGsvl tYADDTav tsGL-- kYLGitl
NL13 lip Kp-- s- IDAekaf g TRQGcpl lFADDMiv vsGYK- kYLGiql
NLOA fip Ka-- af LDIegaf g CPQGgvl gYADDIvi evGLN- kYLGvi-
NTC0 vlr Kp-- am LDGrnay g VRQGmvl aYLDDVtv alGIE- rVLGagv
ICD0 eip Kp-- vd IDIk-gf g TPQGgil rYADDFki kqWLKv dFLGfkl
IAG0 fkk Kt-- ie GDIks-f g VPQGgii rYADDWlv -qELKi -FLGvnl
ICS0 wip Kp-- ld ADIsk-c g TPQGgvi rYADDFvi emGLE- nFLGfnv
IPL0 yip Ks-- le ADIr-gf g VPQGgpi rYADDFvv -rGLV- dFVGfnf
The six motifs of the reverse transcriptase (RT) comprising the ordered-series-of-motifs (OSM) involved in enzymatic function are indicated by roman numerals (I-VI). The bold and capitalized letters represent the core amino acids of each motif that are highly conserved among all RT sequences. Abbreviations on the left side bar indicate the names of different types of Retroid agents.
Adapted from Hudak and McClure, 1999.
McClure, 2000
Example of local subsequences or OSM
RT
GDNQ
FADDM
HYPOTHESIS: The Reverse Transcriptase domain of the RNA-dependent DNA Polymerase shares common ancestry with the RNA-dependent RNA Polymerase of the Order Mononegavirales and Plus Strand RNA viruses.
RH
RdDp
GDD
RdRp of Mononegavirales
RdRp of Plus strand viruses
Support for homology: Statistical tests
Protein Sequence Data
SEQUENCE COMPARISON
>30% identical = homology
<30% identical
MOTIF DETECTION
OSM present = functionally equivalent = likely homologue
OSM absent = unlikely homologue
Functional identification, Phylogenetic analysis, Structural prediction
McClure, 2000
Support for homology: Gene order and size, common function
Strategy for Assessing Protein Sequence Homology
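The decision flow of this strategy can be sketched in Python as follows. This is a minimal illustration, not the author's software: `percent_identity` uses one common definition (identical residues over aligned, non-gap positions), the 30% threshold is the one shown on this slide, and both function names are hypothetical.

```python
GAP_CHARS = "-."


def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap positions at which two sequences are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (x, y)
        for x, y in zip(aligned_a, aligned_b)
        if x not in GAP_CHARS and y not in GAP_CHARS
    ]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def assess_homology(identity, osm_present, threshold=30.0):
    """Apply the slide's decision flow: sequence comparison first, then motif detection."""
    if identity > threshold:
        return "homology"
    # Below the threshold, identity alone is inconclusive, so the OSM decides.
    return "likely homologue" if osm_present else "unlikely homologue"


print(assess_homology(percent_identity("ACDE", "ACDF"), osm_present=False))
print(assess_homology(20.0, osm_present=True))
```

In practice the low-identity branch would be backed by the statistical tests, gene order and size, and common function listed as support for homology.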
Globin Motifs
Program     A    1(7)      2(5)      3(5)      4(5)      5(3)
AMULT       NW   100       100       100       100       100
CLUSTAL V   WL   100       92        100       100       100
DFALIGN     NW   100       100       100       100       100
GENALIGN    CW   67+25     100       100       67+17     67+25
MULTAL      NW   100       90        100       100       100
PIMA        SW   100       100       100       100       100
PRALIGN     CW   67        33+17+17  33+25+17  33+17+17  67+17
SAM         BW   100       100       100       100       100

Kinase Motifs
Program     A    1(6)   2(1)    3(1)    4(9)   5(3)    6(3)    7(8)    8(1)
AMULT       NW   100    83      92      100    100     100     100     100
CLUSTAL V   WL   100    92      50+42   100    100     100     100     58+42
DFALIGN     NW   100    100     100     100    100     100     100     100
GENALIGN    CW   100    42+33   83      100    100     100     50+50   67+25
MULTAL      NW   100    58+17   50+33   100    100     58+42   100     100
PIMA        SW   100    92      92      100    100     100     100     100
PRALIGN     CW   100    42+42   33+17   33     42+33   42+33   33      33
SAM         BW   100    100     100     100    100     100     100     100

Protease Motifs
Program     A    1(3)    2(5)    3(3)
AMULT       NW   92      58      83
CLUSTAL V   WL   100     50+25   25+25
DFALIGN     NW   100     70+30   100
GENALIGN    CW   92      42+25   25+17+17
MULTAL      NW   83      33+25   50+25
PIMA        SW   100     25+17   25+17
PRALIGN     CW   33+33   17+17   25+25+17
SAM         BW   100     92      100

Ribonuclease H Motifs
Program     A    1(3)    2(1)       3(3)       4(5)
AMULT       NW   92      58+17      50+17      25+17+17
CLUSTAL V   WL   100     75         58+17      33+25+17
DFALIGN     NW   100     100        83         100
GENALIGN    CW   83+17   58         33+17+17   33+25+17
MULTAL      NW   75+17   58+17+17   50+25      83
PIMA        SW   83      75         33+17+17   42+33+17
PRALIGN     CW   75      33+33      33+17      17
SAM         BW   100     83         75         83

Column A lists the algorithm employed by each method: NW = Needleman-Wunsch, WL = Wilbur-Lipman, CW = consensus word, SW = Smith-Waterman and BW = Baum-Welch.
Values above columns indicate the number of motifs and values in parentheses indicate the number of residues in each motif. Values in columns indicate the percentage of sequences in which the motif is correctly identified.
McClure, 2000
Comparison of HMM/SAM to Classical Multiple Alignment Methods
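How a "percentage of sequences in which the motif is correctly identified" might be computed from an alignment can be sketched as below. This is an assumption-laden illustration: it treats a motif as correctly identified for a group of sequences when their known motif start positions land in the same alignment column, and it reads an entry such as 50+42 as two separately aligned subsets. The slides do not spell out the scoring procedure, and both function names are hypothetical.

```python
from collections import Counter

GAP_CHARS = "-."


def motif_column(aligned_row, motif_start):
    """Map a residue index in the ungapped sequence to its alignment column."""
    residue_index = -1
    for column, char in enumerate(aligned_row):
        if char not in GAP_CHARS:
            residue_index += 1
            if residue_index == motif_start:
                return column
    raise ValueError("motif start lies beyond the end of the sequence")


def motif_scores(alignment, motif_starts):
    """Percent of sequences sharing each motif column, largest group first."""
    columns = Counter(
        motif_column(row, start) for row, start in zip(alignment, motif_starts)
    )
    total = len(alignment)
    return sorted((round(100 * n / total) for n in columns.values()), reverse=True)


# Toy alignment: the GDD motif is aligned in three rows and misplaced in one.
alignment = ["AC-GDDK", "ACWGDDK", "A--GDDR", "GDD-ACK"]
motif_starts = [2, 3, 1, 0]  # index of G in each ungapped sequence
print("+".join(str(score) for score in motif_scores(alignment, motif_starts)))
```

With the toy data the motif is aligned in one group of three sequences and one stray sequence, so the sketch prints a split score rather than a single percentage.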
Bench Mark Sequences: Biologically informative markers; Sequence length distribution; Evolutionary distribution; Set size
Methods: Appropriateness; Availability; Assumptions; Limitations; User specific parameters
Evaluate Results for Correct Identification of Biologically Informative Marker
Parameter Range Tests; Types of Test Data
Test hypothesis: RdRp share common ancestry with RdDp
Method (s) that Accurately Identify Biologically Informative Marker
RdRp and RdDp sequences
Experimental Design for Testing Motif Detection Methods
Search Databases: Sequence, Literature, Structural, Other??
Data: Retrieve, Annotate, Manage
Analyze Data: Multiple Alignment of Sequences
OSM/MIR Determination; 2D and 3D Modeling
Phylogenetic Reconstruction; Gene and Genome Architecture
Determine Methodological Limitations
McClure, 2001
Blockmaker
Matchbox
Meme
Pima
Pralign
SAM
Motif-detection Programs
McClure, 2002
Motif Detection Programs
                                                                   USER SPECIFICATIONS(d)
PROGRAM      ALGORITHM(a)  MATRIX(b)   INDEL PENALTY(c)  RUN TIME  # MOTIFS  WIDTH  # SEQUENCES
BLOCKMAKER   Motifj        PAM 250     none              ~1m       N         N      Ne
ITERALIGN    SI            PAM 250     C                 ~1h40m    N         Y      Y
MATCHBOX     Scanning      BLOSUM 62   none              ~45m      N         N      Ni
MEME         MM/EM         PAM 250     none              ~2m       Y         Y      Y
PIMA         SW            AACH        I + E             ~2m       N         N      Ni
PROBE        SW+G+GA       PAM 250     I + E             ~2h30m    N         N      Y
SAM          BW            none        none              ~2h20m    N         N      Ni

(a) Algorithms are:
SI = Symmetric-Iterative protocol
MM = Mixture Model that uses (EM) Expectation Maximization
SW = Smith-Waterman
G = Gibbs Sampling
GA = Genetic Algorithm
BW = Baum-Welch

(b) Matrices are:
PAM = point accepted mutation as defined by Dayhoff
BLOSUM = sum of conserved blocks as defined by Henikoff
AACH = Amino Acid Cluster Hierarchy (patgen, class 1; and class 2) as defined by R. Smith

(c) The insertion/deletion penalties are:
C = constant
I + E = initial + extension

(d) User specific parameters are:
# MOTIFS = number of motifs to be detected
WIDTH = width of motifs to be detected
# SEQUENCES = number of sequences that contain the motif
N = user cannot specify
Ne = user cannot specify and program excludes sequences
Ni = user cannot specify, but program automatically includes all sequences
Y = user can specify, but it is not required
Sequence Length, Percent Identity and Distance Values of
Globin, Kinase, Aspartic Acid Protease, Ribonuclease H and
Reverse Transcriptase Test Sequence Sets
DATA SET   SEQUENCE LENGTH      PERCENT IDENTITY     DISTANCE
           Range      Average   Range     Average    Range          Average
GLOB12     141-153    147       14-84     30         9.1-174.8      109.1
KIN12      255-340    273       16-44     26         71.0-170.4     130.0
PRO12      98-160     127       9-72      20         27.5-205.8     169.2
RH12       126-158    141       9-41      19         100.2-237.6    176.1
RT20       297-412    348       11-40     20         70.5-205.7     163.7

GLOB174    115-161    145       10-99     39         0.1-204.7      85.8
KIN186     246-409    286       9-99      28         1.3-212.1      130.9
PRO114     97-150     108       7-99      28         0.1-282.9      146.8
RH169      122-246    144       5-99      25         0.1-283.0      160.0
RT178      288-434    347       10-99     25         0.1-230.4      153.2

The range and average sequence length, percent identity, and distance value are given for each data set.
Data sets are either small (12-20 sequences), representing a rather smooth distribution of the entire sequence collection, or large (114-186 sequences), randomly selected from the entire sequence collection. The large data sets more accurately represent the unequal distribution of sequence relationships encountered in real data.
Percent identity is the percentage of identical amino acid residues among all sequence pairs.
Distance (D) is a measure of difference between all sequence pairs that takes into account the probability of amino acid substitutions and the ease of converting from one codon to another; D = -ln[(S_real - S_random)/(S_identical - S_random)] x 100, where S = similarity score.
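The distance formula above translates directly into code. The sketch below assumes the three similarity scores have already been computed by some scoring scheme; how S_random is estimated is not stated in these slides, and the function name `distance` is hypothetical.

```python
import math


def distance(s_real, s_random, s_identical):
    """D = -ln[(S_real - S_random) / (S_identical - S_random)] x 100."""
    if s_identical == s_random:
        raise ValueError("S_identical must differ from S_random")
    ratio = (s_real - s_random) / (s_identical - s_random)
    if ratio <= 0:
        # The real score is no better than random: D is undefined.
        raise ValueError("S_real must exceed S_random")
    return -math.log(ratio) * 100


# Identical sequences give D = 0; halving the normalized score gives about 69.3.
print(round(distance(100.0, 0.0, 100.0), 1))
print(round(distance(50.0, 0.0, 100.0), 1))
```

The logarithm makes D grow without bound as the real score approaches the random score, which is consistent with the distances above 200 reported for the most divergent pairs in the table.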
McClure 2002
Program      GLOB(12)  KIN(12)  PRO(12)  RT(20)  RH(12)  AVG
BLOCKMAKER   80        63       53       31      31      52
ITERALIGN    98        94       22       49      23      57
MATCHBOX     38        85       61       67      37      58
MEME         90        96       67       93      73      84
PIMA         98        99       55       71      87      82
PROBE        93        95       81       94      83      89
Scores are reported as the percentage of sequences in which motifs were correctly identified. Values in parentheses are the number of sequences in each data set.
Summary of small data set analysis
Program   GLOB(174)  KIN(186)  PRO(114)  RT(178)  RH(169)  AVG
PIMA      43         46        69        47       43       50
          12         35        19        16       22       21
MEME      85         97        87        84       76       86
PROBE     98         98        91        85       93       93
Two sets of scores are reported for the results of testing the PIMA method. In each case this method finds two subsets of alignments with the OSM correctly identified, but fails to merge these two into a final multiple alignment. Scores are reported as percentages of sequences in which the OSM is correctly identified. Values in parentheses are the number of sequences in each data set.
Summary of Large Data Set Analysis
RT
GDNQ
FADDM
HYPOTHESIS: The Reverse Transcriptase domain of the RNA-dependent DNA Polymerase shares common ancestry with the RNA-dependent RNA Polymerase of the Order Mononegavirales and Plus Strand RNA viruses.
RH
RdDp
GDD
RdRp of Mononegavirales
RdRp of Plus strand viruses
MEME Output
Datasets         1     2     3     4     5     6     Parameters
Pol16            N/A   75%   100%  100%  63%   N/A   mod oops, nmotifs = 20
RT16             94%   100%  100%  100%  100%  100%  mod oops, nmotifs = 20
L16              N/A   100%  100%  100%  88%   N/A   mod oops, nmotifs = 20
RT16.Pol16       N/A   0%    69%   100%  0%    N/A   mod oops, nmotifs = 20
L16.Pol16        N/A   0%    0%    94%   0%    N/A   mod oops, nmotifs = 20
L16.RT16         N/A   0%    0%    0%    0%    N/A   mod oops, nmotifs = 20
L16.RT16.Pol16   N/A   0%    0%    94%   0%    N/A   mod oops, nmotifs = 20

PROBE Output
Datasets         1     2     3     4     5     6     Parameters
Pol16            N/A   57%   100%  100%  75%   N/A   default
RT16             100%  100%  100%  100%  100%  100%  default
L16              N/A   100%  100%  100%  100%  N/A   default
RT16.Pol16       N/A   0%    88%   100%  0%    N/A   default
L16.Pol16        N/A   0%    0%    94%   0%    N/A   default
L16.RT16         N/A   0%    0%    0%    0%    N/A   default
L16.RT16.Pol16   N/A   0%    0%    100%  0%    N/A   default
1) Protein disorder
   A) Low hydrophobicity and high mean net charge are good indicators of natively unfolded proteins
   B) Predictors of Natural Disordered Regions (PONDR): utilizes neural networks to distinguish disordered from ordered regions
2) Evolutionary Dynamic Approaches
   A) Intermolecular compensatory mutations (Pazos and Valencia)
      1) predicting interacting partners
      2) detecting correlated mutations between two interacting proteins
      3) extending to three interacting partners
   B) Evolutionary-Structure-Function (ESF) (Simon and Sidow): determines the number of amino acid replacements given a fixed phylogenetic topology, ranking constrained regions
   C) Intramolecular compensatory mutations (Pollack): calculates likelihood estimates allowing for rate variation and robustly discriminates coevolution of sites within a protein from random effects
A Functional Genomics Approach to Inferring Amino Acid Contacts Among the L, P and N proteins of the Replication/Transcription Complex of the Order Mononegavirales
3) Use experimental results to model and validate expectations
4) Test the predicted structure for the Ebola
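The first disorder indicator in the outline (low hydrophobicity combined with high mean net charge) can be computed in a few lines of Python. This is a sketch under stated assumptions, not the lab's pipeline: hydropathy uses the Kyte-Doolittle scale rescaled to 0-1, net charge counts only K/R and D/E, and the boundary line is the charge-hydropathy relation reported by Uversky and co-workers (2000), which these slides do not cite. All function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _residues(sequence):
    """Keep only the 20 standard residues; gaps and unknowns are ignored."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return residues


def mean_hydropathy(sequence):
    """Mean Kyte-Doolittle hydropathy, rescaled so each residue lies in 0-1."""
    residues = _residues(sequence)
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in residues) / len(residues)


def mean_net_charge(sequence):
    """Absolute net charge per residue, counting K/R as +1 and D/E as -1."""
    residues = _residues(sequence)
    positive = sum(aa in "KR" for aa in residues)
    negative = sum(aa in "DE" for aa in residues)
    return abs(positive - negative) / len(residues)


def looks_natively_unfolded(sequence):
    """True if the sequence lies above the assumed boundary R = 2.785 H - 1.151."""
    return mean_net_charge(sequence) > 2.785 * mean_hydropathy(sequence) - 1.151


print(looks_natively_unfolded("EEEEEEEEEE"))  # highly charged, hydrophilic
print(looks_natively_unfolded("LLLLIIVVAA"))  # uncharged, hydrophobic
```

A whole-protein average like this only flags globally unfolded proteins. Locating disordered regions within the N, P and L proteins would require a windowed calculation or a trained predictor such as PONDR.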
New work
Figure (repeated): model of VSV transcription and replication on the RNA template. Labels: L, P and N proteins, leader, read through, 3' and 5' ends, CO-ASSEMBLY?
VSV Transcription
VSV Replication
Paramyxoviridae Genome
N P/C/V M F HN RdRp
Rhabdoviridae Genome
N P M G RdRp
Figure: domain maps of the L, N and P proteins of Sendai virus and VSV.
L protein: motifs I-VI; labels PPBS, RSR, RNA-BS; coordinates shown: 1-2228 and 1-2109.
N protein: labels PPBS, RNA-BS, PCS, region required for replication; coordinates shown: 1-524 and 1-422.
P protein: labels LPBS, RSR, NPBS, RNA-BS, oligomerization domain, GTP binding; coordinates shown: 1-568 and 265.
Legend labels: RES, NPBS, MT.
N, P and L Proteins
N, P and L sequences
Predict regions of disorder
Multiple Alignment
Inter-CM analysis
Evolutionary Dynamics Analysis
Update Mononegavirales Sequence and Literature Database
Calculate H/R
PONDR
Phylogenetic reconstruction
ESF-analysis
Intra-CM analysis
Annotated N, P, L protein maps with ALL information regarding positions of experimentally determined functions and interactions
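The inter-CM (intermolecular compensatory mutation) step of this workflow looks for alignment columns in two proteins that change together across species. The sketch below is a simplified stand-in: it ranks column pairs by mutual information, which is a different and simpler covariation measure than the correlated-mutation method of Pazos and Valencia named in the outline. It assumes the two alignments list the same species in the same row order, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two equally long alignment columns."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    return sum(
        (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )


def covarying_pairs(alignment_a, alignment_b, top=5):
    """Rank (score, column in A, column in B) by mutual information, highest first."""
    if len(alignment_a) != len(alignment_b):
        raise ValueError("alignments must contain the same species in the same order")
    columns_a = list(zip(*alignment_a))
    columns_b = list(zip(*alignment_b))
    scored = [
        (mutual_information(col_a, col_b), i, j)
        for i, col_a in enumerate(columns_a)
        for j, col_b in enumerate(columns_b)
    ]
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored[:top]


# Toy data: column 1 of protein A switches K/E exactly when column 0 of B switches D/K.
protein_a = ["AKL", "AEL", "AKL", "AEL"]
protein_b = ["DG", "KG", "DG", "KG"]
print(covarying_pairs(protein_a, protein_b, top=1))
```

Raw mutual information is inflated by shared phylogeny and small sample sizes, which is one reason the outline pairs this analysis with phylogenetic reconstruction and likelihood-based methods.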
Dr. Marcella McClure, P.I. (Marcie)
Dustin Lee, M.S., Bioinformatics Programmer
Brad Crowther, B.S., Bioinformatician I/Lab Manager
Aaron Juntunen, Undergraduate programmer
Dr. Ruth Angeletti Hogue, Adjunct Professor (visiting from Albert Einstein School of Medicine)
Kelly Burningham, Undergraduate