Prediction of functional/structural sites in a protein using conservation and hyper-variation (ConSeq, ConSurf, Selecton)
Empirical findings: variation among genes
"Important" proteins evolve slower than "unimportant" ones.
Histone H4 protein
Empirical findings: variation among genes
Functional regions evolve slower than nonfunctional regions.
Conservation = functional/structural importance
Alignment of preproinsulin (Xenopus vs. Bos):

Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos     MALWTRLRPLLALLALWPPPPARAFVNQHL
        **** : * *.*: *:..* :. *:****

Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos     CGSHLVEALYLVCGERGFFYTPKARREVEG
        **************:******** :*::*

Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos     PQVG---ALELAGGPGAGGLEGPPQKRGIV
        .**. ** * * *****

Xenopus EQCCHSTCSLFQLENYCN
Bos     EQCCASVCSLYQLENYCN
        **** *.***:*******
Conservation-based inference
Conserved sites:
• Important for the function or structure
• Not allowed to mutate
• "Slow evolving" sites
• Low rate of evolution
Variable sites:
• Less important (usually)
• Change more easily
• "Fast evolving" sites
• High rate of evolution
Detecting conservation: evolutionary rates
$$r = \frac{d}{2T}$$
• Rate (~speed) = distance / time
• Distance (d) = number of substitutions per site
• Time = 2T, where T is the number of years since the sequences diverged (doubled because the two sequences evolved independently)
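The rate formula can be illustrated with a minimal sketch. The function name and the example numbers (0.16 substitutions per site, 80 million years since divergence) are hypothetical values chosen for illustration, not data from these slides.

```python
def substitution_rate(distance: float, divergence_years: float) -> float:
    """Return r = d / (2T): substitutions per site per year.

    distance         -- number of substitutions per site between two sequences (d)
    divergence_years -- years since the two sequences diverged (T)
    """
    if divergence_years <= 0:
        raise ValueError("divergence time must be positive")
    # The elapsed time is doubled because both lineages evolved independently.
    return distance / (2.0 * divergence_years)


# Hypothetical example: d = 0.16 substitutions/site, T = 80 million years.
rate = substitution_rate(0.16, 80e6)
print(f"{rate:.2e} substitutions/site/year")
```

With these made-up inputs the result is on the order of 10^-9 substitutions/site/year.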
Mean rate of nucleotide substitution in mammalian genomes
~10^-9 substitutions/site/year
Evolution is a very slow process at the molecular level ("Nothing happens…")
Rate computation

Position        1  2  3  4  5  6  7
Human           D  M  A  A  H  A  M
Chimp           D  E  A  A  G  G  C
Cow             D  Q  A  A  W  A  P
Fish            D  L  A  A  C  A  L
S. cerevisiae   D  D  G  A  F  A  A
S. pombe        D  D  G  A  L  G  E
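As a rough illustration of how sites differ in variability, the sketch below scores each column of the toy alignment by the number of distinct residues and by Shannon entropy. This is only a naive proxy and an assumption of this example: it ignores the phylogenetic tree and the substitution model, so it is not the site-specific rate method used by ConSeq. The function names are hypothetical.

```python
from collections import Counter
from math import log2

# Toy alignment from the rate-computation slide (7 positions, 6 species).
alignment = {
    "Human": "DMAAHAM",
    "Chimp": "DEAAGGC",
    "Cow": "DQAAWAP",
    "Fish": "DLAACAL",
    "S. cerevisiae": "DDGAFAA",
    "S. pombe": "DDGALGE",
}


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# zip(*...) transposes the rows (sequences) into columns (sites).
columns = ["".join(residues) for residues in zip(*alignment.values())]
distinct_counts = [len(set(col)) for col in columns]
entropies = [column_entropy(col) for col in columns]

for pos, (col, n, h) in enumerate(zip(columns, distinct_counts, entropies), start=1):
    print(f"position {pos}: {col}  distinct={n}  entropy={h:.2f}")
```

Positions 1 and 4 are invariant in this toy alignment, while positions 5 and 7 differ in every sequence.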
http://conseq.tau.ac.il
Site-specific rate computation method
Using the ConSeq server
ConSeq results:
Crash course in protein structure
Why protein structure?
• Each protein has a particular 3D structure that determines its function
• Protein structure is better conserved than protein sequence and more closely related to function
• Analyzing a protein structure is more informative than analyzing its sequence for function inference
PDB: Protein Data Bank
http://www.rcsb.org
• Holds 3D models of biological macromolecules (protein, RNA, DNA, small molecules)
• All data are available to the public
• X-ray crystal structures (84%), NMR models (16%)
• Submitted by biologists and biochemists from around the world
PDB model
• Defines the 3D coordinates (x, y, z) of each of the atoms in one or more molecules (i.e., a complex)
• There are models of proteins, protein complexes, proteins and DNA, protein segments, etc.
• The models also include the positions of ligand molecules, solvent molecules, metal ions, etc.
• PDB code: a digit followed by 3 digits/letters (e.g., 1a14)
The PDB file – text format
Record types:
• ATOM: usually protein or DNA
• HETATM: usually ligand, ion, water
Fields in each record: atom number, atom identity, residue identity, chain, residue number, the coordinates (X, Y, Z) of each atom in the structure, temperature factor
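The fields listed for ATOM/HETATM records sit in fixed columns of each line, so they can be read by slicing. The sketch below assumes the standard fixed-column PDB layout; the function name, the dictionary keys, and the sample record (built here with a format string so the columns line up) are hypothetical and for illustration only.

```python
def parse_atom_record(line: str):
    """Parse one ATOM/HETATM line of a PDB file; return None for other records."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,                  # columns 1-6
        "atom_number": int(line[6:11]),    # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                 # column 22
        "residue_number": int(line[22:26]),   # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "temperature_factor": float(line[60:66]),  # columns 61-66
    }


# Hypothetical sample record: record, serial, atom name, altLoc, residue,
# chain, residue number, insertion code, x, y, z, occupancy, temperature factor.
sample = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}".format(
    "ATOM", 1, " N", "", "MET", "A", 1, "", 27.340, 24.430, 2.614, 1.00, 9.67
)
atom = parse_atom_record(sample)
print(atom)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real PDB files can run together without a separating space.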
Viewing structures
Wireframe, Spacefill, Backbone
Conservation in the structure
• Protein core (hydrophobic core): structurally constrained - usually conserved
• Active site: functionally constrained - usually conserved
• Surface loops: tolerant to mutations - usually variable
http://consurf.tau.ac.il
Same algorithm as ConSeq, but here the results are projected onto the 3D structure of the protein
Using the ConSurf server
ConSurf example: potassium channel
• An integral membrane protein with sequence similarity to all known K+ channels, particularly in the pore region.
• PDB code: 1bl8, chain A
ConSurf results:
• Alignment of homologs found by PSI-BLAST
• Neighbor-Joining reconstructed phylogenetic tree
Conservation scores:
• The scores are standardized: the average score for all residues is zero, and the standard deviation is one
• The lowest score represents the most conserved site in the protein
  negative values: slowly evolving (= low evolutionary rate), conserved sites
• The highest score represents the most variable site in the protein
  positive values: rapidly evolving (= high evolutionary rate), variable sites
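The standardization described above is a z-score transformation. The following sketch applies it to a handful of made-up per-site rates; the rates, the function name, and the use of the population standard deviation are assumptions of this example, not details confirmed by the slides.

```python
from statistics import mean, pstdev


def standardize(rates):
    """Rescale per-site rates so that their mean is 0 and standard deviation is 1."""
    mu = mean(rates)
    sigma = pstdev(rates)  # population standard deviation (an assumption here)
    if sigma == 0:
        raise ValueError("all sites have the same rate; scores are undefined")
    return [(r - mu) / sigma for r in rates]


# Hypothetical evolutionary rates for five sites.
site_rates = [0.1, 0.4, 2.5, 0.9, 1.1]
scores = standardize(site_rates)

most_conserved = scores.index(min(scores))  # lowest (most negative) score
most_variable = scores.index(max(scores))   # highest (most positive) score
print(scores, most_conserved, most_variable)
```

Because the scores are relative to the mean of the same protein, a negative score says a site is conserved compared with the other sites in that protein, not in absolute terms.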
ConSurf results: amino-acid conservation scores
ConSurf result with FirstGlance in Jmol:
ConSeq/ConSurf user intervention (advanced options)
1. Method of calculating the amino acid conservation scores: Bayesian/Max Likelihood
2. Enter your own MSA file
3. Multiply Align Sequences using: MUSCLE/CLUSTALW
4. Collect the Homologues from: SWISS-PROT/UniProt
5. Max. Number of Homologues (default = 50)
6. No. of PSI-BLAST Iterations (default = 1)
7. PSI-BLAST E-value Cutoff (default = 0.001)
8. Model of substitution for proteins: JTT/Dayhoff/mtREV/cpREV/WAG
9. Enter your own PDB file
10. Enter your own TREE file
Codon-level selection
• ConSeq/ConSurf: compute the evolutionary rate of amino-acid sites → the data are amino acids.
• But codons encode amino acids…
• 61 codons vs. 20 amino acids!
• Aren't we losing information?
Darwin – the theory of natural selection
Adaptive evolution: favorable traits will become more frequent in the population
M. Kimura – the neutral theory of molecular evolution
• Most of the DNA variation between species is neutral with regard to the phenotype
• Selection operates to preserve a trait
Synonymous (silent) and non-synonymous (non-silent) substitutions
Synonymous vs. nonsynonymous substitutions
UUU → UUC (Phe → Phe): synonymous
UUU → CUU (Phe → Leu): non-synonymous
synonymous substitutions = silent substitutions
non-synonymous substitutions = non-silent or amino-acid altering substitutions
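The two examples above can be checked mechanically by translating each codon and comparing the encoded amino acids. The sketch below assumes the standard (nuclear) genetic code; the function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon: str) -> str:
    """Translate one codon (DNA or RNA alphabet) to a one-letter amino acid."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def classify_substitution(before: str, after: str) -> str:
    """Label a codon change as synonymous (silent) or non-synonymous."""
    return "synonymous" if translate(before) == translate(after) else "non-synonymous"


sense_codons = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]
print(len(sense_codons), "sense codons encode",
      len({CODON_TABLE[c] for c in sense_codons}), "amino acids")
print(classify_substitution("UUU", "UUC"))  # Phe -> Phe
print(classify_substitution("UUU", "CUU"))  # Phe -> Leu
```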
Synonymous vs. non-synonymous substitutions
For most proteins, it is observed that the rate of synonymous substitutions is much higher than the non-synonymous rate.
This is called purifying selection (= conservation; this is what ConSeq/ConSurf compute).
Synonymous vs. non-synonymous substitutions
Structural proteins
Saturation of synonymous substitutions
Histone H4 between human and wheat: saturation of synonymous substitutions
Synonymous vs. non-synonymous substitutions
There are rare cases where the non-synonymous rate is much larger than the synonymous rate.
This is called positive selection.
Positive selection
Examples:
• Proteins of the immune system
• Pathogen proteins evading the host immune system
• Pathogen proteins that are drug targets
• Proteins that are products of gene duplication
• Proteins involved in the reproductive system
The hypothesis: promotes the fitness of the organism
Computing synonymous and non-synonymous rates
• Codon-based MSA: translate DNA to amino acids, align, then backtrack to the DNA while keeping the alignment
• Phylogenetic tree: 5 replacements in 10 positions between human and chimp is a lot, but between human and cucumber is nothing
• Different replacement probabilities between two amino acids: Lys→Arg ≠ Lys→Cys
• Positive selection occurs at only a few sites!
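The "backtrack to the DNA but keep the alignment" step of the codon-based MSA can be sketched as follows: each residue in the aligned protein is replaced by its original codon, and each gap by three gap characters. This is a minimal illustration with a hypothetical function name and toy sequences; it does not verify that the DNA actually translates to the given protein.

```python
def thread_codons(aligned_protein: str, dna: str, gap: str = "-") -> str:
    """Map an unaligned coding sequence onto its aligned protein sequence.

    aligned_protein -- one row of a protein alignment, gaps included
    dna             -- the unaligned coding sequence for that protein
    """
    n_residues = sum(1 for residue in aligned_protein if residue != gap)
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x the number of aligned residues")
    # Hand out codons in order, one per non-gap residue.
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(gap * 3 if residue == gap else next(codons)
                   for residue in aligned_protein)


# Toy example: two residues separated by one alignment gap.
print(thread_codons("M-K", "ATGAAA"))
```

Aligning at the protein level first keeps gaps in multiples of three, so the reading frame is preserved in the resulting codon alignment.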
Inferring positive selection
Divide the rate of non-silent substitutions (Ka) by the rate of silent substitutions (Ks):
$$\frac{K_a}{K_s}$$
Inferring positive selection
Basic assumptions:
• Selection score (Ka/Ks) > 1 → positive selection
• Selection score (Ka/Ks) < 1 → purifying selection
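The decision rule above amounts to a one-line comparison. The sketch below uses a hypothetical function name and made-up rate values; labelling the boundary case Ka/Ks = 1 as "neutral" is an assumption of this example, since the slides only state the two inequalities.

```python
def classify_selection(ka: float, ks: float):
    """Return (Ka/Ks, label) for one site or gene."""
    if ks <= 0:
        raise ValueError("Ks must be positive for the ratio to be defined")
    ratio = ka / ks
    if ratio > 1:
        label = "positive selection"
    elif ratio < 1:
        label = "purifying selection"
    else:
        label = "neutral"  # boundary case, not covered by the slides
    return ratio, label


# Made-up rates for illustration.
print(classify_selection(ka=0.30, ks=0.10))
print(classify_selection(ka=0.02, ks=0.40))
```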
Not so fast!!!
• Our computational model assumes there is positive selection in the data
• There is a good chance our model will find a few positively selected sites whatever the case
• Is this really indicative of positive selection or plain randomness?
• So, maybe there's no positive selection after all
Statistics helps us to compare between hypotheses
H0: There's no positive selection
H1: There is positive selection
• H0: compute the probability (likelihood) of the data using a model that does not account for positive selection
• H1: compute the probability (likelihood) of the data using a model that does account for positive selection
• Perform a likelihood ratio test (LRT):
$$2\ln\left(\frac{L(\text{Data}\mid M(H_1))}{L(\text{Data}\mid M(H_0))}\right) \sim \chi^2$$
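A minimal sketch of the likelihood ratio test, given the log-likelihoods of the two models. It assumes the models differ by a single free parameter, so the statistic is compared with a chi-square distribution with 1 degree of freedom; that choice, the function name, and the example log-likelihood values are assumptions of this illustration.

```python
import math


def likelihood_ratio_test(lnl_h0: float, lnl_h1: float):
    """Return (statistic, p_value) for nested models, assuming 1 degree of freedom.

    lnl_h0 -- log-likelihood of the data under the model without positive selection
    lnl_h1 -- log-likelihood of the data under the model with positive selection
    """
    # 2 * ln(L1 / L0) written with log-likelihoods.
    statistic = 2.0 * (lnl_h1 - lnl_h0)
    if statistic <= 0:
        return 0.0, 1.0
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up log-likelihoods for illustration.
stat, p_value = likelihood_ratio_test(lnl_h0=-1000.0, lnl_h1=-998.0792)
print(f"statistic = {stat:.4f}, p = {p_value:.4f}")
```

A small p-value means the model that allows positive selection fits significantly better than the one that does not, so H0 is rejected.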
http://selecton.tau.ac.il
Using the Selecton server
Input = a coding sequence at the codon level
• The user must provide the sequences – no PSI-BLAST option
• The sequences' lengths must be divisible by 3 (ORF) and must not include any stop codons
• An alignment should be a codon alignment (RevTrans)
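The input requirements can be checked before submission with a short script. This sketch assumes the standard nuclear genetic code for the stop codons (a mitochondrial code would need a different set); the function name and the example sequences are hypothetical.

```python
# Stop codons of the standard nuclear genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(sequence: str):
    """Return a list of problems found in an unaligned coding sequence."""
    seq = sequence.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append(f"length {len(seq)} is not divisible by 3")
    # Scan only complete codons, in frame from the first base.
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if codon in STOP_CODONS:
            problems.append(f"stop codon {codon} at codon position {start // 3 + 1}")
    return problems


print(check_coding_sequence("ATGAAATTT"))  # a clean toy ORF
print(check_coding_sequence("ATGTAAAA"))   # wrong length and an in-frame stop
```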
Similar to ConSurf
optional
Nuclear/mitochondria different species
Default run: M8 (H1) and M8a (H0)
Selecton example: HIV protease
The protease is an essential enzyme for viral infectivity
PDB ID: 1hxw
Selecton results:
Selection scores (Ka/Ks):
• The scores are normalized
• Ka/Ks > 1: positively selected site
• Ka/Ks < 1: site under purifying selection
Color coding scheme of Selecton
• The coloring used for visualization is based on the continuous Ka/Ks scores.
• The color grades (1-7): 1 for positively selected sites (blue), 7 for sites under purifying selection (bordeaux)