10/5/07 BCB 444/544 F07 ISU Dobbs #19 - Protein Structure Basics & Classification
BCB 444/544
Lecture 19
A bit of: Protein Structure - Basics
Protein Structure Visualization, Classification & Comparison
#19_Oct05
Required Reading (before lecture)
√ Mon Oct 1 - Lecture 17
Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96
√ Wed Oct 3 - Lecture 18
Protein Structure: Basics (Note change in Lecture Schedule online)
• Chp 12 - pp 173-186
√ Thurs Oct 4 & Fri Oct 5 - Lab 6 & Lecture 19
Protein Structure: Basics, Databases, Visualization, Classification & Comparison
• Chp 13 - pp 187-199
BCB 544 - Extra Required Reading
Assigned Mon Sept 24
BCB 544 Extra Required Reading Assignment: for 544 Extra HW#1 Task 2
• Pollard KS, …., Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
• doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
BCB 544 Projects (Optional for BCB 444)
• For a better idea about what's involved in the Team Projects, please look over last year's expectations for projects: http://www.public.iastate.edu/~f2007.com_s.544/project.htm
• Criteria for evaluation of projects (oral presentations) are summarized here: http://www.public.iastate.edu/%7Ef2007.com_s.544/homework/HW7.pdf
Please note: a wrong URL (instead of the one shown above) was included in the originally posted 544ExtraHW#1; a corrected version is posted now
Assignments & Announcements - #1
Students registered for BCB 444: Two Grading Options
1) Take Final Exam per original Grading Policies
2) Instead of taking Final Exam - you may participate in a Team Research Project
If you choose #2, please do 3 things:
• Contact Drena (in person)
• Send email to Michael Terribilini ([email protected])
• Complete 544 Extra HW#1 - Task 1.1 by noon on Mon Oct 1
Assignments & Announcements - #2
BCB 444s (Standard):
200 pts Midterm Exams = 100 points each
200 pts Homework & Laboratory assignments
100 pts Final Exam
500 pts Total for BCB 444
BCB 444p (Project):
200 pts Midterm Exams = 100 points each
200 pts Homework & Laboratory assignments
190 pts Team Research Project
590 pts Total for BCB 444p
BCB 544:
200 pts Midterm Exams = 100 points each
200 pts Homework & Laboratory assignments
100 pts Final Exam
200 pts Discussion Questions & Team Research Projects
700 pts Total for BCB 544
Assignments & Announcements #3
ALL: HomeWork #3 Due: Mon Oct 8 by 5 PM
• HW544: HW544Extra #1
  √ Due: Task 1.1 - Mon Oct 1 by noon
  Due: Task 1.2 & Task 2 - Fri Oct 12 by 5 PM (not Monday)
• 444 "Project-instead-of-Final" students should also submit:
  • HW544Extra #1
  • Due: Task 1.1 - Mon Oct 8 by noon
  • Due: Task 1.2 - Fri Oct 12 by 5 PM (not Monday)
  Task 2 NOT required!
QUESTIONS re: HW#3? Due Mon
HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction
This is a new slide
An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
But, where do you start? "Begin" state not shown
Occasionally Dishonest Casino - HW#3
"Begin" state? 50:50 chance of starting with F vs L die
Calculating Different Paths to an Observed Sequence
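Observed sequence: $x = x_1, x_2, x_3 = 6, 2, 6$

Three possible state paths: $\pi^{(1)} = FFF$, $\pi^{(2)} = LLL$, $\pi^{(3)} = LFL$

$$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$$

$$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$$

$$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$$

$a_{kl}$ = transition probability
$e_k(x_i)$ = emission probability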
This slide has been changed
Calculations such as these are used to fill a matrix with probability values for every state at every position
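The three path probabilities on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not course code: the names `START`, `TRANS`, `EMIT` and `joint_probability` are made up for illustration. The slides only give the loaded die's emission probabilities for a 2 (0.1) and a 6 (0.5); the value 0.1 for the other faces is an assumption.

```python
# Occasionally dishonest casino HMM, parameters as given on the slides.
START = {"F": 0.5, "L": 0.5}                      # "Begin" state: 50:50
TRANS = {"F": {"F": 0.99, "L": 0.01},             # Fair -> Loaded = 0.01
         "L": {"F": 0.2, "L": 0.8}}               # Loaded -> Fair = 0.2
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},   # fair die
    # Loaded die: 6 with prob 0.5; 0.1 for faces 1-5 (assumed for 1, 3, 4, 5)
    "L": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},
}


def joint_probability(rolls, path):
    """Pr(x, pi): probability of the rolls AND one specific state path."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for prev, cur, roll in zip(path, path[1:], rolls[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][roll]
    return prob


rolls = [6, 2, 6]
for path in ("FFF", "LLL", "LFL"):
    # approximately 0.00227, 0.008 and 0.0000417 respectively
    print(path, joint_probability(rolls, path))
```

Each additional roll multiplies in one transition and one emission term, which is why enumerating every path quickly becomes impractical for long sequences.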
Calculate optimal path? Construct a matrix of probability values for every state at every residue
How: one way = Viterbi Algorithm
• Initialization (i = 0): $v_0(0) = 1,\quad v_k(0) = 0 \text{ for } k > 0$
• Recursion (i = 1, . . . , L): For each state k: $v_k(i) = e_k(x_i)\, \max_r \big( v_r(i-1)\, a_{rk} \big)$
• Termination: $\Pr(x, \pi^*) = \max_k \big( v_k(L)\, a_{k0} \big)$
To find π*, use trace-back, as in dynamic programming
Viterbi for Calculating Most Probable Path*
$v_k(i) = e_k(x_i)\, \max_r \big( v_r(i-1)\, a_{rk} \big)$

State π | Begin | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L | 0 | (1/2)(1/2) = 1/4 | (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008

* Path within HMM that matches query sequence with highest probability
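The matrix fill and trace-back can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the course's code: the function and variable names are invented, the loaded-die emission of 0.1 for faces other than 2 and 6 is assumed, and, as in the worked matrix, no explicit end-state transition is applied at termination.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},  # 0.1 assumed for 1, 3, 4, 5
}


def viterbi(rolls):
    """Return (most probable state path, its probability) for the rolls."""
    # First column: leave the Begin state and emit the first roll.
    column = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    pointers = []  # pointers[i][k] = best predecessor of state k at step i+1
    for roll in rolls[1:]:
        new_column, ptr = {}, {}
        for k in STATES:
            # Recursion: v_k(i) = e_k(x_i) * max_r( v_r(i-1) * a_rk )
            best = max(STATES, key=lambda r: column[r] * TRANS[r][k])
            ptr[k] = best
            new_column[k] = EMIT[k][roll] * column[best] * TRANS[best][k]
        column = new_column
        pointers.append(ptr)
    # Termination and trace-back from the best final state.
    state = max(STATES, key=lambda k: column[k])
    best_prob = column[state]
    path = [state]
    for ptr in reversed(pointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path)), best_prob


print(viterbi([6, 2, 6]))  # best path is LLL with probability about 0.008
```

Only one column of the matrix is kept at a time, plus the back-pointers; storing the full matrix would work equally well and is closer to the slide.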
Total Probability
Several different paths can result in observation x
Probability that our model will emit x is:
$$\Pr(x) = \sum_{\pi} \Pr(x, \pi)$$
Calculating the Total Probability:
State π | Begin | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L | 0 | (1/2)(1/2) = 1/4 | (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144

Total probability = $\sum_{\pi} \Pr(x, \pi)$ = 0 + 0.004313 + 0.008144 ≈ 0.012
This slide has been changed
Note: This is not the same as the matrix on the previous slide! Here, the last column contains the summed probabilities for each row (sum replaces max)
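The same total can be computed programmatically, and checked against a brute-force sum over all state paths. This is a hedged sketch: the names are illustrative, the loaded-die emission of 0.1 for faces other than 2 and 6 is assumed, and no explicit end-state transition is used, matching the worked matrix.

```python
from itertools import product

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},  # 0.1 assumed for 1, 3, 4, 5
}


def forward(rolls):
    """Pr(x): fill the matrix column by column, summing over predecessors."""
    column = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    for roll in rolls[1:]:
        column = {
            k: EMIT[k][roll] * sum(column[r] * TRANS[r][k] for r in STATES)
            for k in STATES
        }
    return sum(column.values())  # add up the last column


def brute_force(rolls):
    """Pr(x) by explicitly summing Pr(x, pi) over every possible path."""
    total = 0.0
    for path in product(STATES, repeat=len(rolls)):
        prob = START[path[0]] * EMIT[path[0]][rolls[0]]
        for prev, cur, roll in zip(path, path[1:], rolls[1:]):
            prob *= TRANS[prev][cur] * EMIT[cur][roll]
        total += prob
    return total


print(forward([6, 2, 6]), brute_force([6, 2, 6]))  # both about 0.01246
```

The brute-force version visits 2^L paths for L rolls, while the matrix version does a fixed amount of work per roll, which is the practical reason for the dynamic-programming formulation.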
A few more Details re: Profiles & HMMs
• Smoothing or "Regularization" - method used to avoid "over-fitting"
• Common problem in machine learning (data-driven) approaches
• Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters
• Result? Miss members of family not yet sampled
(too many false negative hits)
• Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set
• Treated as 'real' values in calculating probabilities
• Improve predictive power of profiles & HMMs
• Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment
• To "correct" problems in an observed alignment based on limited number of sequences
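To make the pseudocount idea concrete, here is a small sketch that turns one alignment column into amino acid probabilities. It uses the simplest possible scheme (the same pseudocount added to every residue), not a Dirichlet mixture; the function name, the example column, and the assumption that the column contains no gap characters are all illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Estimate residue probabilities for one alignment column.

    `column` is a string of residues (assumed gap-free). Every amino acid
    receives `pseudocount` extra observations, so unobserved residues get
    a small non-zero probability.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


column = "LLLIV"  # hypothetical column from a 5-sequence alignment
raw = column_probabilities(column, pseudocount=0.0)
smoothed = column_probabilities(column, pseudocount=1.0)
print(raw["M"], smoothed["M"])  # 0.0 without pseudocounts, 0.04 with them
print(raw["L"], smoothed["L"])  # 0.6 without pseudocounts, 0.16 with them
```

With only five sequences, a uniform pseudocount of 1 pulls the estimate strongly toward a flat distribution; that over-correction is part of the motivation for more informed schemes such as Dirichlet mixtures.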
Chp 7 - Protein Motifs & Domain Prediction
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 7
Protein Motifs and Domain Prediction
• √ Identification of Motifs & Domains in MSAs
• √ Motif & Domain Databases Using Regular Expressions
• √ Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √ Sequence Logos
Motifs & Domains
• Motif - short conserved sequence pattern
  • Associated with distinct function in protein or DNA
  • Avg = 10 residues (usually 6-20 residues)
  • e.g., zinc finger motif - in protein
  • e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit
  • Avg = 100 residues (range from 40-700 in proteins)
  • e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs
2 Approaches for Representing "Consensus" Information in Motifs & Domains
• Regular expression - symbolic representation of information from MSA
  • e.g., protein phosphorylation site motif: [S,T]-X-[R,K]
  • Symbols represent specific or unspecified residues, spaces, etc.
  • 2 mechanisms for matching:
    • Exact
    • "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches"
• Statistical model - includes probability information derived from MSA
• e.g., PSSM, Profile, or HMM
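As an illustration of exact regular-expression matching, the phosphorylation site pattern [S,T]-X-[R,K] can be written with Python's standard `re` module. This is a sketch only: the test sequence is made up, X is assumed to mean any single uppercase letter, and fuzzy matching is not shown.

```python
import re

# [S,T]-X-[R,K]: Ser or Thr, then any residue, then Arg or Lys.
# The lookahead (?=...) lets overlapping occurrences be reported too.
MOTIF = re.compile(r"(?=([ST][A-Z][RK]))")


def find_motif(sequence):
    """Return (1-based start position, matched residues) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


sequence = "MASPRKTARSGK"  # hypothetical protein fragment
print(find_motif(sequence))  # [(3, 'SPR'), (7, 'TAR'), (10, 'SGK')]
```

A pattern like this says nothing about how often each residue occurs at each position, which is the limitation that statistical models are meant to address.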
Motif & Domain Databases
Based on regular expressions:
• Prosite (Interpro includes Prosite, PRINTS, etc.)
• Emotif
Limitation: these don't take probability info into account
Based on statistical models:
• PRINTS
• BLOCKS
• ProDom
• Pfam
• SMART
• CDART
• Reverse PsiBLAST
• READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each
• TAKE HOME LESSON: Always try several methods! (not just one!)
Protein Family Databases
• In addition to databases of "related" protein sequences, based on shared motifs or domains (Pfam, BLOCKS, CDART), some databases "cluster" sequences into families based on near full-length sequence comparisons
• COGs - Clusters of Orthologous Groups (at NCBI)
  • Mostly prokaryotic sequences
  • KOG = newer eukaryotic version
  • COGnitor - software to search database
• ProtoNet - also clusters of homologous protein sequences
  • Advantages: tree-like hierarchical structure
  • Provides GO (gene ontology) annotations
  • Provides InterPro keywords
Motif Discovery in Unaligned Sequences
Expectation Maximization - generate "random" alignment of all sequences, derive PSSM, iteratively match individual sequences to PSSM to edit & improve it
• Problems? Can hit a local optimum (premature convergence); sensitive to initial alignment
• MEME - Multiple EM for Motif Elicitation - modified EM, avoids local optimum issues; two-step procedure
Gibbs Sampling - generate "trial" PSSM from random alignment first, as in EM, but leave one sequence out of initial alignment, then iteratively match PSSM to left-out sequences
• Gibbs Sampler - web-based motif search via Gibbs sampling
• Not mentioned in textbook:
  • Stochastic context-free grammars
  • Other "state of the art" approaches in recent literature, but not available in web-based servers (yet)
Chp 12 - Protein Structure Basics
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 12
Protein Structure Basics
• LAB 6
  • Introduction to Protein DataBank - PDB
  • PyMol
  • Cn3D?
Chp 12 - Protein Structure Basics
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 12
Protein Structure Basics
• Amino Acids
• Peptide Bond Formation
• Dihedral Angles
• Hierarchy
• Secondary Structures
• Tertiary Structures
• Determination of Protein 3-Dimensional Structure
• Protein Structure DataBank (PDB)
Protein Structure & Function
• Protein structure - primarily determined by sequence
• Protein function - primarily determined by structure
• Globular proteins: compact hydrophobic core & hydrophilic surface
• Membrane proteins: special hydrophobic surfaces
• Folded proteins are only marginally stable
• Some proteins do not assume a stable "fold" until they bind to something = Intrinsically disordered
Predicting protein structure and function can be very hard -- & fun!
4 Basic Levels of Protein Structure
Primary & Secondary Structure
• Primary
  • Linear sequence of amino acids
  • Description of covalent bonds linking aa's
• Secondary
  • Local spatial arrangement of amino acids
  • Description of short-range non-covalent interactions
  • Periodic structural patterns: α-helix, β-sheet
Tertiary & Quaternary Structure
• Tertiary
  • Overall 3-D "fold" of a single polypeptide chain
  • Spatial arrangement of 2° structural elements; packing of these into compact "domains"
  • Description of long-range non-covalent interactions (plus disulfide bonds)
• Quaternary
  • In proteins with > 1 polypeptide chain, spatial arrangement of subunits
"Additional" Structural Levels
• Super-secondary elements
• Motifs
• Domains
• Foldons
Amino Acids
• Each of the 20 different amino acids has a different "R-Group" or side chain attached to Cα
Peptide Bond is Rigid and Planar
Hydrophobic Amino Acids
Charged Amino Acids
Polar Amino Acids
Certain Side-chain Configurations are Energetically Favored (Rotamers)
Ramachandran plot: "Allowable" psi & phi angles
Glycine is Smallest Amino Acid
R group = H atom
• Glycine residues increase backbone flexibility because their R group is only an H atom (no bulky side chain)
Proline is Cyclic
• Proline residues reduce flexibility of polypeptide chain
• Proline cis-trans isomerization is often a rate-limiting step in protein folding
• Recent work suggests it may also regulate ligand binding in native proteins - Andreotti (BBMB)
Cysteines can Form Disulfide (S-S) Bonds
• Disulfide bonds (covalent) stabilize 3-D structures
• In eukaryotes, disulfide bonds are often found in secreted proteins or extracellular domains
Globular Proteins Have a Compact Hydrophobic Core
• Packing of hydrophobic side chains into interior is main driving force for folding
• Problem? Polypeptide backbone is highly polar (hydrophilic) due to polar -NH and C=O in each peptide unit (which carry partial charges at the neutral pH ≈ 7 found in biological systems); these polar groups must be neutralized
• Solution? Form regular secondary structures
  • e.g., α-helix, β-sheet, stabilized by H-bonds
Exterior Surface of Globular Proteins is Generally Hydrophilic
• Hydrophobic core formed by packed secondary structural elements provides compact, stable core
• "Functional groups" of protein are attached to this framework; exterior has more flexible regions (loops) and polar/charged residues
• Hydrophobic "patches" on protein surface are often involved in protein-protein interactions
Protein Secondary Structures
• Helices
• Sheets
• Loops
• Coils
α-Helix: Stabilized by H-bonds between every ~4th residue in Backbone
C = black, O = red, N = blue, H = white
Look! - Charges on backbone are "neutralized" by hydrogen bonds (H-bonds) - red fuzzy vertical bonds
Certain Amino Acids are "Preferred" & Others are Rare in Helices
• Ala, Glu, Leu, Met = good helix formers
• Pro, Gly, Tyr, Ser = very poor
• Amino acid composition & distribution varies, depending on location of helix in 3-D structure
β-Sheets - also Stabilized by H-bonds Between Backbone Atoms
Anti-parallel Parallel
Loops
• Connect helices and sheets
• Vary in length and 3-D configurations
• Are located on surface of structure
• Are more "tolerant" of mutations
• Are more flexible and can adopt multiple conformations
• Tend to have charged and polar amino acids
• Are frequently components of active sites
• Some fall into distinct structural families (e.g., hairpin loops, reverse turns)
Coils
• Regions of 2° structure that are not helices, sheets, or recognizable turns
• Intrinsically disordered regions appear to play important functional roles
Chp 13 - Protein Structure Basics
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 13
Protein Structure Visualization, Comparison & Classification
• Protein Structural Visualization
• Protein Structure Comparison
• Protein Structure Classification