JM - http://folding.chmcc.org 1
Introduction to Bioinformatics: Lecture II
From Molecular Processes to String Matching
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
Outline of the lecture
Sequence approximation in computational molecular biology: the premise and the limits
Getting ready for analysis of exact string matching and sequence alignment algorithms: some definitions and interplay with biology
The notion of string/sequence similarity
Substitution matrices for sequence alignment
Before we start: literature watch
A draft of the Rat genome has been published! RGSPC Nature 428
What are the first conclusions from the comparison with other mammalian genomes?
What approaches and tools have been used to perform this comparative analysis?
Genome sizes (H: human, M: mouse, R: rat):
H: 2.9 Gb
M: 2.5 Gb
R: 2.75 Gb
R: unique, 0.7 Gb; common with both H and M, 1.1 Gb
Biological Polymers and Central Dogma
Bio-Polymer (alphabet)    Process (algorithm)
DNA (A,T,G,C)             replication
                          transcription
mRNA (U,A,C,G)            splicing
                          translation
Proteins (20 a.a.)        folding
                          interactions
Lipids, polysaccharides, membranes, signal transduction, environmental signals etc.
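The "process as algorithm" view in the table above can be illustrated with a minimal sketch of transcription and translation as string operations. It assumes the standard genetic code and a coding-strand DNA input, and it ignores splicing entirely; the function names `transcribe` and `translate` are illustrative choices, not part of the lecture.

```python
# Standard genetic code, with codons enumerated in base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA string (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGTTCTGTAA")
print(mrna, translate(mrna))  # AUGGUUCUGUAA MVL
```

Each step maps a string over one alphabet to a string over another, which is what makes the sequence approximation discussed in this lecture possible.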
Complexity of “DNA computing”
http://www.genecrc.org/site/lc/lc2d.htm
Get the relevant sequences to compare them: conservation and differences
Problem | Algorithms | Programs
Sequencing | Fragment assembly problem; the Shortest Superstring Problem | Phrap (Green, 1994)
Gene finding | Hidden Markov Models, pattern recognition methods | GenScan (Burge & Karlin, 1997)
Sequence comparison | Pairwise and multiple sequence alignments; dynamic programming, heuristic methods | BLAST (Altschul et al., 1990)
Redundancy in biological systems
An example: sperm whale vs. human myoglobin:

Query: 1   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 60
           M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
Sbjct: 1   MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60

Query: 61  DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH 120
           DLKKHG TVLTALG ILKKKGHHEAE+KP AQSHATKHKIP+KYLEFISE II VL S+H
Sbjct: 61  DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120

Query: 121 PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG 154
           PG+FGADAQGAMNKALELFRKD+A+ YKELG+QG
Sbjct: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154

Ex. Find the sequence of 1mba in the PDB and “blast” it against nr using NCBI
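The degree of conservation in the myoglobin alignment can be quantified as percent identity. The following is a minimal sketch, assuming two already-aligned strings of equal length, with `-` as a hypothetical gap symbol (the alignment shown here has no gaps). The `query` and `sbjct` strings are copied from the Query and Sbjct rows of that alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


query = (
    "MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE"
    "DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH"
    "PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG"
)
sbjct = (
    "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE"
    "DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH"
    "PGDFGADAQGAMNKALELFRKDMASNYKELGFQG"
)

print(f"{percent_identity(query, sbjct):.1f}% identity over {len(query)} columns")
```

Percent identity ignores the conservative substitutions that BLAST marks with `+`; the substitution matrices introduced later in the lecture account for them.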
Limits of the sequence approximation
• All the information, and the various fingerprints of information processing at the molecular level (via interactions etc.), including adjustment to physiologically relevant external signals, seems to be encoded in nucleotide and protein sequences.
• However, there are limits to this simple approximation: an actual understanding of molecular processes requires structure, chemistry, kinetics and thermodynamics.
• On the other hand, a deeper understanding of the nature of biological objects and processes greatly facilitates sequence-based studies by suggesting critical features, similarity measures etc.
Strings, sequences and string operations
String vs. sequence duality will be important for exact vs. inexact string matching
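A minimal sketch of the string vs. sequence distinction, assuming the usual convention (not spelled out on this slide) that a substring consists of contiguous characters while a subsequence only has to preserve their order. The function names are illustrative.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if pattern occurs in text as a contiguous block (exact matching)."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to and including the match,
    # so later characters of pattern can only match further to the right.
    return all(symbol in remaining for symbol in pattern)


print(is_substring("GTC", "AGATTC"))    # False: G, T, C are not adjacent
print(is_subsequence("GTC", "AGATTC"))  # True: A-G-A-T-T-C contains G..T..C in order
```

Exact matching asks substring questions; alignment with gaps and mismatches is closer to a subsequence question.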
Beyond the letters: how to find better models (e.g. GC content for gene finding)
http://www.imb-jena.de/IMAGE_BPDIR.html
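As one illustration of a feature that goes beyond individual letters, GC content can be computed over a sliding window. This is a minimal sketch; the window and step sizes are arbitrary illustrative parameters, and real gene finders combine such signals with much richer models.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C characters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int = 1) -> list[float]:
    """GC content of each full window of the given size, moving by step."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("GGCCAATT"))                  # 0.5
print(windowed_gc("GGCCAATT", window=4, step=4))  # [1.0, 0.0]
```

The same overall composition can hide strong local variation, which is why windowed profiles are more informative than a single global value.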
Another example: active sites, functional motifs and multiple alignment
Distance and similarity measures
Edit distance vs. substitution score
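For reference, a minimal sketch of edit distance computed by dynamic programming, assuming unit cost for each insertion, deletion and substitution (the Levenshtein variant; the slide does not fix the costs). A substitution score differs in that the per-pair cost comes from a matrix and the total is maximized rather than minimized.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character edits that turn s into t."""
    # previous[j] holds the distance between the current prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from s
                current[j - 1] + 1,          # insert b into s
                previous[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept, so memory use is linear in the length of `t` while the running time remains proportional to `len(s) * len(t)`.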
Substitution matrices for protein sequence alignment: learning and extrapolating from examples
PAM matrices (Dayhoff et al.): extrapolating to longer evolutionary times from data for very similar proteins with more than 85% sequence identity (short evolutionary time):
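$s(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}$, where $q_b$ is the background frequency of residue $b$; e.g. $P(b \mid a,2) = \sum_c P(b \mid c,1)\,P(c \mid a,1)$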
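The extrapolation step amounts to repeated matrix multiplication. The sketch below uses a made-up two-letter alphabet with invented probabilities purely for illustration (real PAM matrices are 20 x 20 and estimated from data); the names `compose`, `extrapolate` and `pam_score` are hypothetical, and the natural logarithm is assumed since the slide does not fix the base.

```python
import math

Matrix = dict[str, dict[str, float]]


def compose(p_first: Matrix, p_second: Matrix) -> Matrix:
    """P(b|a, t1+t2) = sum_c P(c|a, t1) * P(b|c, t2); matrices are indexed P[a][b]."""
    letters = list(p_first)
    return {
        a: {b: sum(p_first[a][c] * p_second[c][b] for c in letters) for b in letters}
        for a in letters
    }


def extrapolate(p1: Matrix, t: int) -> Matrix:
    """Transition probabilities after t unit time steps (t >= 1)."""
    result = p1
    for _ in range(t - 1):
        result = compose(result, p1)
    return result


def pam_score(p_t: Matrix, q: dict[str, float], a: str, b: str) -> float:
    """Log-odds score s(a,b|t) = log(P(b|a,t) / q_b)."""
    return math.log(p_t[a][b] / q[b])


# Toy data (assumption): symmetric 1% change per unit time, uniform background.
p1 = {"A": {"A": 0.99, "B": 0.01}, "B": {"A": 0.01, "B": 0.99}}
q = {"A": 0.5, "B": 0.5}

p2 = extrapolate(p1, 2)
print(p2["A"])
print(pam_score(p2, q, "A", "A"), pam_score(p2, q, "A", "B"))
```

As t grows, each row of the extrapolated matrix approaches the background distribution and all scores shrink toward zero, which is why high-numbered PAM matrices suit distant relationships.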
BLOSUM matrices (Henikoff & Henikoff): derived from multiple alignments of more distantly related proteins (e.g. BLOSUM50, with sequences clustered at 50% sequence identity):
$s(a,b) = \log \frac{p_{ab}}{q_a q_b}$, where $p_{ab} = F_{ab} / \sum_{c,d} F_{cd}$
Expected score: $\sum_{a,b} q_a q_b\, s(a,b) = -\sum_{a,b} q_a q_b \log \frac{q_a q_b}{p_{ab}} = -H(q \,\|\, p)$
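The log-odds construction and the sign of the expected score can be checked numerically. This is a minimal sketch on invented pair counts for a two-letter alphabet. It assumes that the counts `F` cover ordered pairs symmetrically and that the background frequencies are the marginals of `p`, which simplifies the actual BLOSUM procedure; the natural logarithm is used.

```python
import math

# Toy aligned-pair counts F_ab (assumption: invented numbers, ordered pairs).
F = {("A", "A"): 40, ("A", "B"): 10, ("B", "A"): 10, ("B", "B"): 40}
letters = sorted({a for a, _ in F})

total = sum(F.values())
p = {pair: count / total for pair, count in F.items()}          # p_ab
q = {a: sum(p[(a, b)] for b in letters) for a in letters}       # q_a as marginals

# Log-odds scores s(a,b) = log(p_ab / (q_a * q_b)).
s = {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for a in letters for b in letters}

# Expected score of a random (unrelated) pair, i.e. under q_a * q_b.
expected = sum(q[a] * q[b] * s[(a, b)] for a in letters for b in letters)

print(s)
print(expected)  # negative: equals minus the relative entropy H(q||p)
```

A negative expected score under the background model is what keeps random alignments from accumulating positive scores by chance.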
Summary
Web resources and materials for the course
http://folding.chmcc.org
http://folding.chmcc.org/protlab/protlab.html
http://folding.chmcc.org/intro2bioinfo/intro2bioinfo.html
Protein Modeling Lab
Remote access to PML and the Citrix software
All lectures and other materials available electronically from the PML servers
Electronic tests and homework, web submission interfaces
The web site for the Introduction to Bioinformatics course
Updates