10/5/07 BCB 444/544 F07 ISU Dobbs #19- Protein Structure Basics & Classification 1 BCB 444/544 Lecture 19 A bit of: Protein Structure - Basics Protein Structure Visualization, Classification & Comparison #19_Oct05


Page 1: BCB 444/544


BCB 444/544

Lecture 19

A bit of: Protein Structure - Basics

Protein Structure Visualization, Classification & Comparison

#19_Oct05

Page 2: BCB 444/544


Required Reading (before lecture)

√ Mon Oct 1 - Lecture 17
Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96

√ Wed Oct 3 - Lecture 18
Protein Structure: Basics (note change in online Lecture Schedule)
• Chp 12 - pp 173-186

√ Thurs Oct 4 & Fri Oct 5 - Lab 6 & Lecture 19
Protein Structure: Basics, Databases, Visualization, Classification & Comparison
• Chp 13 - pp 187-199

Page 3: BCB 444/544


BCB 544 - Extra Required Reading

Assigned Mon Sept 24

BCB 544 Extra Required Reading Assignment: for 544 Extra HW#1 Task 2

• Pollard KS, …., Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.

• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html (doi:10.1038/nature05113)

• PDF available on class website - under Required Reading Link

Page 4: BCB 444/544


BCB 544 Projects (Optional for BCB 444)

• For a better idea about what's involved in the Team Projects, please look over last year's expectations for projects: http://www.public.iastate.edu/~f2007.com_s.544/project.htm

• Criteria for evaluation of projects (oral presentations) are summarized here: http://www.public.iastate.edu/%7Ef2007.com_s.544/homework/HW7.pdf

Please note: a wrong URL (instead of the one shown above) was included in the originally posted 544ExtraHW#1; a corrected version is posted now

Page 5: BCB 444/544


Assignments & Announcements - #1

Students registered for BCB 444: Two Grading Options

1) Take Final Exam per original Grading Policies
2) Instead of taking Final Exam - you may participate in a Team Research Project

If you choose #2, please do 3 things:
• Contact Drena (in person)
• Send email to Michael Terribilini ([email protected])
• Complete 544 Extra HW#1 - Task 1.1 by noon on Mon Oct 1

Page 6: BCB 444/544


Assignments & Announcements - #2

BCB 444s (Standard):
200 pts  Midterm Exams = 100 points each
200 pts  Homework & Laboratory assignments
100 pts  Final Exam
500 pts  Total for BCB 444

BCB 444p (Project):
200 pts  Midterm Exams = 100 points each
200 pts  Homework & Laboratory assignments
190 pts  Team Research Project
590 pts  Total for BCB 444p

BCB 544:
200 pts  Midterm Exams = 100 points each
200 pts  Homework & Laboratory assignments
100 pts  Final Exam
200 pts  Discussion Questions & Team Research Projects
700 pts  Total for BCB 544

Page 7: BCB 444/544


Assignments & Announcements #3

ALL: HomeWork #3 Due: Mon Oct 8 by 5 PM

• HW544: HW544Extra #1
  • √ Due: Task 1.1 - Mon Oct 1 by noon
  • Due: Task 1.2 & Task 2 - Fri Oct 12 by 5 PM (not Monday)

• 444 "Project-instead-of-Final" students should also submit HW544Extra #1:
  • Due: Task 1.1 - Mon Oct 8 by noon
  • Due: Task 1.2 - Fri Oct 12 by 5 PM (not Monday)
  • Task 2 NOT required!

Page 8: BCB 444/544


QUESTIONS re: HW#3? Due Mon

Page 9: BCB 444/544


HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction

This is a new slide

Page 10: BCB 444/544


An HMM for Occasionally Dishonest Casino

Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2

But, where do you start? "Begin" state not shown

Page 11: BCB 444/544


Occasionally Dishonest Casino - HW#3

"Begin" state? 50:50 chance of starting with F vs L die

Page 12: BCB 444/544


Calculating Different Paths to an Observed Sequence

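Calculations such as those shown below are used to fill a matrix with probability values for every state at every position

Observed sequence: $x = x_1, x_2, x_3 = 6, 2, 6$

Notation: $a_{kl}$ = transition probability; $e_k(x_i)$ = emission probability

Path $\pi^{(1)} = FFF$:

$$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$$

Path $\pi^{(2)} = LLL$:

$$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$$

Path $\pi^{(3)} = LFL$:

$$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$$

This slide has been changed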
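The three path calculations can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the dictionary names and the function `joint_probability` are made up for illustration, and the loaded-die emission of 0.1 for rolls 1, 3, 4 and 5 is an assumption (the slides only show 0.1 for a 2 and 0.5 for a 6).

```python
# Casino HMM parameters as used on the slides (F = fair die, L = loaded die).
START = {"F": 0.5, "L": 0.5}                       # a_0k: 50:50 begin state
TRANS = {"F": {"F": 0.99, "L": 0.01},              # a_kl: transition probabilities
         "L": {"F": 0.2, "L": 0.8}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},      # e_F(x): fair die
        # e_L(x): 0.5 for a six; 0.1 for every other face is an assumption
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}


def joint_probability(rolls, path):
    """Return Pr(x, pi) for observed rolls x and a state path pi such as 'LFL'."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
    return prob


rolls = [6, 2, 6]
for path in ("FFF", "LLL", "LFL"):
    print(path, joint_probability(rolls, path))
```

Of the three paths, LLL has the highest joint probability (0.008), which is why an all-loaded explanation is favored for two sixes in three rolls.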

Page 13: BCB 444/544


Calculate optimal path? Construct a matrix of probability values for every state at every residue

How: one way = Viterbi Algorithm

• Initialization (i = 0): $v_0(0) = 1,\quad v_k(0) = 0 \text{ for } k > 0$

• Recursion (i = 1, . . . , L): For each state k: $v_k(i) = e_k(x_i)\, \max_r \left( v_r(i-1)\, a_{rk} \right)$

• Termination: $\Pr(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$

To find π*, use trace-back, as in dynamic programming

Page 14: BCB 444/544


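Viterbi for Calculating Most Probable Path*

$v_k(i) = e_k(x_i)\, \max_r \left( v_r(i-1)\, a_{rk} \right)$

Matrix of $v_k(i)$: states π = B (begin), F (fair), L (loaded); observed x = 6, 2, 6

Position 0 (begin):
• B: 1
• F: 0
• L: 0

Position 1 (x = 6):
• B: 0
• F: (1/6)(1/2) = 1/12
• L: (1/2)(1/2) = 1/4

Position 2 (x = 2):
• B: 0
• F: (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375
• L: (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02

Position 3 (x = 6):
• B: 0
• F: (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
• L: (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008

* Path within HMM that matches query sequence with highest probability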
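The matrix fill and trace-back can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the course's reference code: the function name `viterbi` and the parameter dictionaries are hypothetical, the loaded die is assumed to emit every non-six face with probability 0.1, and, because the casino model has no explicit end state, termination simply takes the best cell in the last column.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}  # 0.1 for non-sixes is assumed


def viterbi(rolls, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (probability of the most probable path, that path as a string)."""
    # columns[i][k] = probability of the best path that ends in state k at position i
    columns = [{k: start[k] * emit[k][rolls[0]] for k in states}]
    pointers = []  # pointers[i][k] = best predecessor of state k at position i + 1
    for x in rolls[1:]:
        prev = columns[-1]
        col, ptr = {}, {}
        for k in states:
            best_r = max(states, key=lambda r: prev[r] * trans[r][k])
            col[k] = emit[k][x] * prev[best_r] * trans[best_r][k]
            ptr[k] = best_r
        columns.append(col)
        pointers.append(ptr)
    # Termination (no end state): pick the best final cell, then trace back.
    last = max(states, key=lambda k: columns[-1][k])
    path = [last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return columns[-1][last], "".join(reversed(path))


print(viterbi([6, 2, 6]))
```

For the rolls 6, 2, 6 the sketch returns the path LLL with probability 0.008, the same value obtained by multiplying out Pr(x, π) for the all-loaded path directly.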

Page 15: BCB 444/544


Total Probability

Several different paths can result in observation x

Probability that our model will emit x is:

$$\Pr(x) = \sum_{\pi} \Pr(x, \pi)$$

Page 16: BCB 444/544


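Calculating the Total Probability:

Matrix of summed probabilities: states π = B (begin), F (fair), L (loaded); observed x = 6, 2, 6

Position 0 (begin):
• B: 1
• F: 0
• L: 0

Position 1 (x = 6):
• B: 0
• F: (1/6)(1/2) = 1/12
• L: (1/2)(1/2) = 1/4

Position 2 (x = 2):
• B: 0
• F: (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083
• L: (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083

Position 3 (x = 6):
• B: 0
• F: (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
• L: (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144

Total probability = $\sum_{\pi} \Pr(x, \pi)$ = 0 + 0.004313 + 0.008144 = 0.012

This slide has been changed

Note: This is not the same as the matrix on the previous slide! Here, the last column contains sums for each row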
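Replacing max with sum in the recursion gives what is usually called the forward algorithm. The sketch below is illustrative only: the name `forward` and the parameter dictionaries are hypothetical, and the loaded die is again assumed to emit each non-six face with probability 0.1.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}  # 0.1 for non-sixes is assumed


def forward(rolls, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return Pr(x): the probability of the rolls summed over all state paths."""
    # f[k] = total probability of all paths that end in state k at the current position
    f = {k: start[k] * emit[k][rolls[0]] for k in states}
    for x in rolls[1:]:
        f = {k: emit[k][x] * sum(f[r] * trans[r][k] for r in states)
             for k in states}
    # No explicit end state in this model, so sum the last column.
    return sum(f.values())


print(round(forward([6, 2, 6]), 6))
```

Because every one of the 2^3 = 8 possible paths contributes, the total (about 0.0125) is larger than the 0.008 carried by the single best path.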

Page 17: BCB 444/544


A few more Details re: Profiles & HMMs

• Smoothing or "Regularization" - method used to avoid "over-fitting"

• Common problem in machine learning (data-driven) approaches

• Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters

• Result? Miss members of family not yet sampled

(too many false negative hits)

• Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set

• Treated as 'real' values in calculating probabilities

• Improve predictive power of profiles & HMMs

• Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment

• To "correct" problems in an observed alignment based on limited number of sequences
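A small sketch may make the effect of pseudocounts concrete. It uses the simplest possible scheme, adding the same constant to every amino acid count (add-one or Laplace smoothing); this is an assumption chosen for illustration, and it is simpler than the Dirichlet-mixture priors mentioned above. The function name `column_probabilities` and the example column are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Estimate residue probabilities for one alignment column.

    `column` is a string of the residues observed at that position;
    `pseudocount` is added to the count of each of the 20 amino acids.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


observed = "LLLLI"                        # five training sequences at this column
raw = column_probabilities(observed, pseudocount=0.0)
smoothed = column_probabilities(observed, pseudocount=1.0)
print(raw["L"], raw["V"])                 # V was never seen, so its raw estimate is zero
print(smoothed["L"], smoothed["V"])       # with pseudocounts V gets a small non-zero value
```

Without pseudocounts a family member carrying V at this position would score as impossible; with them it is merely less likely, which is how smoothing reduces false negatives.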

Page 18: BCB 444/544


Chp 7 - Protein Motifs & Domain Prediction

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 7

Protein Motifs and Domain Prediction
• √ Identification of Motifs & Domains in MSAs
• √ Motif & Domain Databases Using Regular Expressions
• √ Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √ Sequence Logos

Page 19: BCB 444/544


Motifs & Domains

• Motif - short conserved sequence pattern
  • Associated with distinct function in protein or DNA
  • Avg = 10 residues (usually 6-20 residues)
  • e.g., zinc finger motif - in protein
  • e.g., TATA box - in DNA

• Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit
  • Avg = 100 residues (range from 40-700 in proteins)
  • e.g., kinase domain or transmembrane domain - in protein

• Domains may (or may not) include motifs

Page 20: BCB 444/544


2 Approaches for Representing "Consensus" Information in Motifs & Domains

• Regular expression - symbolic representation of information from MSA

• e.g., protein phosphorylation site motif: [S,T]- X- [R,K]
• Symbols represent specific or unspecified residues, spaces, etc.
• 2 mechanisms for matching:
  • Exact
  • "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches"

• Statistical model - includes probability information derived from MSA

• e.g., PSSM, Profile, or HMM
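As a small illustration of exact regular-expression matching, the phosphorylation-site pattern [S,T]-X-[R,K] can be written with Python's standard `re` module. This is a sketch only: the test sequence and the helper name `find_motif` are made up for the example, and real motif databases apply their own pattern syntax and scanning tools.

```python
import re

# [S,T]-X-[R,K] as a Python regular expression:
# [ST] = Ser or Thr, "." = any residue (X), [RK] = Arg or Lys.
# The lookahead (?=...) lets overlapping sites be reported as well.
MOTIF = re.compile(r"(?=([ST].[RK]))")


def find_motif(sequence):
    """Return (1-based start position, matched residues) for every exact match."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


print(find_motif("MASVRKTPKSLR"))  # hypothetical sequence
# prints [(3, 'SVR'), (7, 'TPK'), (10, 'SLR')]
```

An exact match like this either hits or misses; it carries no information about how often each residue occurs at each position, which is the gap that statistical models such as PSSMs, profiles and HMMs fill.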

Page 21: BCB 444/544


Motif & Domain Databases

Based on regular expressions:
• Prosite (InterPro includes Prosite, PRINTS, etc)
• Emotif

Limitation: these don't take probability info into account

Based on statistical models:
• PRINTS
• BLOCKS
• ProDom
• Pfam
• SMART
• CDART
• Reverse PSI-BLAST

• READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each

• TAKE HOME LESSON: Always try several methods! (not just one!)

Page 22: BCB 444/544


Protein Family Databases

• In addition to databases of "related" protein sequences, based on shared motifs or domains (Pfam, BLOCKS, CDART), some databases "cluster" sequences into families based on near full-length sequence comparisons

• COGs - Clusters of Orthologous Groups (at NCBI)
  • Mostly Prokaryotic sequences
  • KOG = newer Eukaryotic version
  • COGnitor - software to search database

• ProtoNet - also clusters of homologous protein sequences
  • Advantages: tree-like hierarchical structure
  • Provides GO (gene ontology) annotations
  • Provides InterPro keywords

Page 23: BCB 444/544


Motif Discovery in Unaligned Sequences

Expectation Maximization - generate "random" alignment of all sequences, derive PSSM, iteratively match individual sequences to PSSM to edit & improve it
• Problems?
  • Can hit a local optimum (premature convergence)
  • Sensitive to initial alignment
• MEME - Multiple EM for Motif Elicitation - modified EM, avoids local optimum issues; two step procedure

Gibbs Sampling - generate "trial" PSSM from random alignment first, as in EM, but leave one sequence out of initial alignment, then iteratively match PSSM to left-out sequences
• Gibbs Sampler - web-based motif search via Gibbs sampling

• Not mentioned in textbook:
  • Stochastic context-free grammars
  • Other "state of the art" approaches in recent literature, but not available in web-based servers (yet)
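One "leave one sequence out, rebuild the PSSM, rescan" update can be sketched as below. Several simplifications are assumed and should be read as such: the sketch uses DNA rather than protein, a uniform background, add-one pseudocounts, and it deterministically takes the best-scoring window, whereas a true Gibbs sampler draws the new position at random in proportion to the window scores. All function names and the toy sequences are hypothetical.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0):
    """Log-odds PSSM (uniform background) from equal-length aligned sites."""
    pssm = []
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({c: math.log2(((column.count(c) + pseudocount) / total) / 0.25)
                     for c in ALPHABET})
    return pssm


def best_window(seq, pssm):
    """Return the start index of the window in seq that scores highest."""
    width = len(pssm)
    scores = [sum(pssm[j][seq[i + j]] for j in range(width))
              for i in range(len(seq) - width + 1)]
    return max(range(len(scores)), key=scores.__getitem__)


def update_left_out(seqs, starts, width, left_out):
    """Rebuild the PSSM without one sequence, then re-place the motif in it."""
    sites = [s[p:p + width]
             for k, (s, p) in enumerate(zip(seqs, starts)) if k != left_out]
    return best_window(seqs[left_out], build_pssm(sites))


seqs = ["GGTATAAAGG", "CCCTATAAAC", "TATAAAGGCC", "GCGCTATAAA"]
starts = [2, 3, 0, 0]          # the last start is a deliberately poor initial guess
starts[3] = update_left_out(seqs, starts, width=6, left_out=3)
print(starts)
```

In a full run this update would be repeated over all sequences in turn, starting from random positions, until the motif positions stop changing.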

Page 24: BCB 444/544


Chp 12 - Protein Structure Basics

SECTION V STRUCTURAL BIOINFORMATICS

Xiong: Chp 12

Protein Structure Basics

• LAB 6
  • Introduction to Protein DataBank - PDB
  • PyMol
  • Cn3D?

Page 25: BCB 444/544


Chp 12 - Protein Structure Basics

SECTION V STRUCTURAL BIOINFORMATICS

Xiong: Chp 12

Protein Structure Basics

• Amino Acids
• Peptide Bond Formation
• Dihedral Angles
• Hierarchy
• Secondary Structures
• Tertiary Structures
• Determination of Protein 3-Dimensional Structure
• Protein Structure DataBank (PDB)

Page 26: BCB 444/544


Protein Structure & Function

• Protein structure - primarily determined by sequence

• Protein function - primarily determined by structure

• Globular proteins: compact hydrophobic core & hydrophilic surface

• Membrane proteins: special hydrophobic surfaces

• Folded proteins are only marginally stable

• Some proteins do not assume a stable "fold" until they bind to something = Intrinsically disordered

Predicting protein structure and function can be very hard -- & fun!

Page 27: BCB 444/544


4 Basic Levels of Protein Structure

Page 28: BCB 444/544


Primary & Secondary Structure

• Primary
  • Linear sequence of amino acids
  • Description of covalent bonds linking aa's

• Secondary
  • Local spatial arrangement of amino acids
  • Description of short-range non-covalent interactions
  • Periodic structural patterns: α-helix, β-sheet

Page 29: BCB 444/544


Tertiary & Quaternary Structure

• Tertiary
  • Overall 3-D "fold" of a single polypeptide chain
  • Spatial arrangement of 2° structural elements; packing of these into compact "domains"
  • Description of long-range non-covalent interactions (plus disulfide bonds)

• Quaternary
  • In proteins with > 1 polypeptide chain, spatial arrangement of subunits

Page 30: BCB 444/544


"Additional" Structural Levels

• Super-secondary elements

• Motifs
• Domains
• Foldons

Page 31: BCB 444/544


Amino Acids

• Each of the 20 different amino acids has a different "R-Group" or side chain attached to Cα

Page 32: BCB 444/544


Peptide Bond is Rigid and Planar

Page 33: BCB 444/544


Hydrophobic Amino Acids

Page 34: BCB 444/544


Charged Amino Acids

Page 35: BCB 444/544


Polar Amino Acids

Page 36: BCB 444/544


Certain Side-chain Configurations are Energetically Favored (Rotamers)

Ramachandran plot: "Allowable" psi & phi angles
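The phi and psi values plotted on a Ramachandran plot are dihedral (torsion) angles defined by four consecutive backbone atoms: C(i-1), N(i), Cα(i), C(i) for phi and N(i), Cα(i), C(i), N(i+1) for psi. The sketch below computes a dihedral angle from four 3-D coordinates using only the standard library. The function name `dihedral` and the toy coordinates are illustrative, not taken from any structure file; in practice the coordinates would come from a PDB entry.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p1, p2, p3, p4):
    """Dihedral angle in degrees (-180 to 180) about the p2-p3 bond."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)            # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)            # normal of the plane through p2, p3, p4
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates: the first three points are fixed, the fourth is rotated.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # cis arrangement
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))   # quarter turn
```

Computing phi and psi for every residue of a structure and plotting one against the other gives the Ramachandran plot; sterically "allowable" combinations cluster in a few regions.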

Page 37: BCB 444/544


Glycine is Smallest Amino Acid
R group = H atom

• Glycine residues increase backbone flexibility because their side chain is only an H atom

Page 38: BCB 444/544


Proline is Cyclic

• Proline residues reduce flexibility of polypeptide chain

• Proline cis-trans isomerization is often a rate-limiting step in protein folding

• Recent work suggests it may also regulate ligand binding in native proteins (Andreotti, BBMB)

Page 39: BCB 444/544


Cysteines can Form Disulfide (S-S) Bonds

• Disulfide bonds (covalent) stabilize 3-D structures

• In eukaryotes, disulfide bonds are often found in secreted proteins or extracellular domains

Page 40: BCB 444/544


Globular Proteins Have a Compact Hydrophobic Core

• Packing of hydrophobic side chains into interior is main driving force for folding

• Problem? Polypeptide backbone is highly polar (hydrophilic) due to the polar -NH and C=O groups in each peptide unit (which carry partial charges at the neutral pH, ~7, found in biological systems); these polar groups must be neutralized

• Solution? Form regular secondary structures
  • e.g., α-helix, β-sheet, stabilized by H-bonds

Page 41: BCB 444/544


Exterior Surface of Globular Proteins is Generally Hydrophilic

• Hydrophobic core formed by packed secondary structural elements provides compact, stable core

• "Functional groups" of protein are attached to this framework; exterior has more flexible regions (loops) and polar/charged residues

• Hydrophobic "patches" on protein surface are often involved in protein-protein interactions

Page 42: BCB 444/544


Protein Secondary Structures

• Helices

• Sheets

• Loops

• Coils

Page 43: BCB 444/544


α-Helix: Stabilized by H-bonds between every ~ 4th residue in Backbone

C = black
O = red
N = blue
H = white

Look! - Charges on backbone are "neutralized" by hydrogen bonds (H-bonds) - red fuzzy vertical bonds

Page 44: BCB 444/544


Certain Amino Acids are "Preferred" & Others are Rare in Helices

• Ala, Glu, Leu, Met = good helix formers
• Pro, Gly, Tyr, Ser = very poor
• Amino acid composition & distribution varies, depending on location of helix in 3-D structure

Page 45: BCB 444/544


β-Sheets - also Stabilized by H-bonds Between Backbone Atoms

Anti-parallel Parallel

Page 46: BCB 444/544


Loops

• Connect helices and sheets
• Vary in length and 3-D configurations
• Are located on surface of structure
• Are more "tolerant" of mutations
• Are more flexible and can adopt multiple conformations
• Tend to have charged and polar amino acids
• Are frequently components of active sites
• Some fall into distinct structural families (e.g., hairpin loops, reverse turns)

Page 47: BCB 444/544


Coils

• Regions of 2° structure that are not helices, sheets, or recognizable turns

• Intrinsically disordered regions appear to play important functional roles

Page 48: BCB 444/544


Chp 13 - Protein Structure Basics

SECTION V STRUCTURAL BIOINFORMATICS

Xiong: Chp 13

Protein Structure Visualization, Comparison & Classification

• Protein Structural Visualization
• Protein Structure Comparison
• Protein Structure Classification