Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

JM - http://folding.chmcc.org 1


Jarek Meller

Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC


Outline of the lecture

Sequence approximation in computational molecular biology: the premise and the limits

Getting ready for analysis of exact string matching and sequence alignment algorithms: some definitions and interplay with biology

The notion of string/sequence similarity

Substitution matrices for sequence alignment


Before we start: literature watch

A draft of the rat genome has been published! (RGSPC, Nature 428)

What are the first conclusions from the comparison with other mammalian genomes?

What approaches and tools have been used to perform this comparative analysis?

H: 2.9 Gb

M: 2.5 Gb

R: 2.75 Gb

R: unique - 0.7 Gb; common with both H and M - 1.1 Gb


Biological Polymers and Central Dogma

Bio-Polymer (alphabet)      Process (algorithm)

DNA (A, T, G, C)            replication
                            transcription
mRNA (U, A, C, G)           splicing
                            translation
Proteins (20 a.a.)          folding
                            interactions

Lipids, polysaccharides, membranes, signal transduction, environmental signals etc.
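
Viewed as string operations, the transcription and translation steps listed above can be sketched as follows. This is only a minimal illustration: it assumes the DNA coding strand is given 5' to 3', ignores splicing and reading-frame detection, and uses the standard genetic code. The function names `transcribe` and `translate` are hypothetical, chosen for this example.

```python
# Standard genetic code, with codons enumerated in the order U, C, A, G
# at each of the three positions ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand):
    """Map a DNA coding strand to mRNA by replacing T with U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGTGCTGTCTTAA")))  # prints MVLS
```

Each process in the table maps a string over one alphabet to a string over another, which is what makes the sequence approximation convenient for algorithmic work.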


Complexity of “DNA computing”

http://www.genecrc.org/site/lc/lc2d.htm


Get the relevant sequences to compare them: conservation and differences

Problem               Algorithms                                Programs

Sequencing            fragment assembly problem,                Phrap (Green, 1994)
                      the Shortest Superstring Problem

Gene finding          Hidden Markov Models,                     GenScan (Burge & Karlin, 1997)
                      pattern recognition methods

Sequence comparison   pairwise and multiple sequence            BLAST (Altschul et al., 1990)
                      alignments: dynamic programming,
                      heuristic methods
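
To make the fragment assembly entry concrete, here is a minimal sketch of a greedy heuristic for the Shortest Superstring Problem: repeatedly merge the two fragments with the largest suffix-prefix overlap. It is a toy illustration that assumes error-free fragments from a single strand. It is not the method used by Phrap, and the greedy result is not guaranteed to be the shortest superstring. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy heuristic: merge the best-overlapping pair until one string is left."""
    # Remove duplicates and fragments already contained in another fragment.
    unique = list(dict.fromkeys(fragments))
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        # Pick the ordered pair (i, j) with the largest suffix-prefix overlap.
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(frags)
            for j, b in enumerate(frags)
            if i != j
        )
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0] if frags else ""


print(greedy_superstring(["AGGT", "GGTC", "GTCA"]))  # prints AGGTCA
```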


Redundancy in biological systems

An example: sperm whale vs. human myoglobin:

Query: 1   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 60
           M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
Sbjct: 1   MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60

Query: 61  DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH 120
           DLKKHG TVLTALG ILKKKGHHEAE+KP AQSHATKHKIP+KYLEFISE II VL S+H
Sbjct: 61  DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120

Query: 121 PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG 154
           PG+FGADAQGAMNKALELFRKD+A+ YKELG+QG
Sbjct: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154

Ex. Find the sequence of 1mba in the PDB and “blast” it against nr using NCBI
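
The degree of conservation in the myoglobin alignment can be quantified by counting identical positions. The sketch below copies the Query and Sbjct rows of the alignment as Python strings and relies on the fact that this particular alignment has no gaps. The helper name `identity_count` is hypothetical.

```python
# Query and Sbjct rows of the myoglobin alignment, concatenated (154 residues each).
QUERY = (
    "MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE"
    "DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH"
    "PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG"
)
SBJCT = (
    "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE"
    "DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH"
    "PGDFGADAQGAMNKALELFRKDMASNYKELGFQG"
)


def identity_count(a, b):
    """Number of positions with identical residues in an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b))


matches = identity_count(QUERY, SBJCT)
print(f"{matches}/{len(QUERY)} identical ({100.0 * matches / len(QUERY):.1f}%)")
# prints 128/154 identical (83.1%)
```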


Limits of the sequence approximation

All the information and the various fingerprints of information processing at the molecular level (via interactions etc.), including adjustment to physiologically relevant external signals, seem to be encoded in nucleotide and protein sequences

However, there are limits to this simple approximation: actual understanding of molecular processes requires structure, chemistry, kinetics and thermodynamics

On the other hand, a deeper understanding of the nature of biological objects and processes greatly facilitates sequence-based studies by suggesting critical features, similarity measurements etc.


Strings, sequences and string operations

String vs. sequence duality will be important for exact vs. inexact string matching
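
A minimal sketch of the distinction, assuming the usual definitions: a substring is a contiguous block of characters, whereas a subsequence keeps the order of characters but may skip some. Exact matching looks for substrings, while inexact matching and alignment allow skipped characters (gaps). The function names below are hypothetical.

```python
def is_substring(pattern, text):
    """True if pattern occurs in text as a contiguous block."""
    return pattern in text


def is_subsequence(pattern, text):
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to the matched character.
    return all(ch in remaining for ch in pattern)


def occurrences(pattern, text):
    """Naive exact matching: all start positions of pattern in text."""
    n = len(pattern)
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == pattern]


print(is_substring("GTC", "AGATTACA"))    # False: not contiguous in the text
print(is_subsequence("GTC", "AGATTACA"))  # True: G, T, C appear in this order
print(occurrences("A", "AGATTACA"))       # [0, 2, 5, 7]
```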


Beyond the letters: how to find better models (e.g. GC content for gene finding)

http://www.imb-jena.de/IMAGE_BPDIR.html
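
As a simple example of a feature that goes beyond individual letters, GC content can be computed over sliding windows along a sequence. The sketch below is illustrative only: the window and step sizes are arbitrary assumptions, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C characters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=1):
    """GC content of each full window as a list of (start, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("ATGC"))            # 0.5
print(windowed_gc("AAGGCC", 2, 2))   # [(0, 0.0), (2, 1.0), (4, 1.0)]
```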


Another example: active sites, functional motifs and multiple alignment


Distance and similarity measures


Edit distance vs. substitution score
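
The two notions can be contrasted in a short sketch. Edit distance, here the standard Levenshtein distance, counts the minimum number of insertions, deletions and substitutions needed to turn one string into another, so lower values mean closer strings. A substitution score sums per-position scores over an alignment, so higher values mean more similar strings. The match and mismatch values of +1 and -1 are hypothetical placeholders for a real substitution matrix.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def ungapped_score(a, b, match=1, mismatch=-1):
    """Similarity score of two equal-length strings aligned without gaps."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs equal-length strings")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


print(edit_distance("kitten", "sitting"))  # 3
print(ungapped_score("ACGT", "ACGA"))      # 2
```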


Substitution matrices for protein sequence alignment: learning and extrapolating from examples

PAM matrices (Dayhoff et al.): extrapolating to longer evolutionary times from data for very similar proteins with more than 85% sequence identity (short evolutionary time):

$$s(a,b \mid t) = \log \frac{P(b \mid a, t)}{q_b}, \qquad \text{e.g.} \quad P(b \mid a, 2) = \sum_{c} P(b \mid c, 1)\, P(c \mid a, 1)$$
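
The extrapolation step amounts to repeated multiplication of the one-step substitution matrix. The sketch below uses a toy two-letter alphabet with made-up probabilities, not Dayhoff's data. It assumes `P1[a][b]` stores P(b|a,1) and that each score divides by the background frequency of the target residue b. All names and numbers are hypothetical.

```python
import math

ALPHABET = "AB"  # toy alphabet; real PAM matrices use the 20 amino acids

# P1[a][b] = P(b | a, t=1); each row sums to 1. Values are invented.
P1 = {"A": {"A": 0.99, "B": 0.01},
      "B": {"A": 0.02, "B": 0.98}}

# Background frequencies, chosen here as the stationary distribution of P1.
Q = {"A": 2.0 / 3.0, "B": 1.0 / 3.0}


def step(Pt, P1):
    """One more unit of time: P(b|a,t+1) = sum_c P(b|c,1) * P(c|a,t)."""
    return {a: {b: sum(P1[c][b] * Pt[a][c] for c in ALPHABET)
                for b in ALPHABET}
            for a in ALPHABET}


def extrapolate(P1, t):
    """Substitution probabilities after t units of evolutionary time (t >= 1)."""
    Pt = P1
    for _ in range(t - 1):
        Pt = step(Pt, P1)
    return Pt


def pam_scores(Pt, q):
    """Log-odds scores s(a,b|t) = log(P(b|a,t) / q_b)."""
    return {a: {b: math.log(Pt[a][b] / q[b]) for b in ALPHABET}
            for a in ALPHABET}


for t in (1, 2, 250):
    scores = pam_scores(extrapolate(P1, t), Q)
    print(t, round(scores["A"]["A"], 3), round(scores["A"]["B"], 3))
```

In this toy model the score for conserved residues shrinks toward zero as t grows, because P(b|a,t) approaches the background frequency of b.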

BLOSUM matrices (Henikoff & Henikoff): multiple alignments of more distantly related proteins (e.g. BLOSUM50 with 50% sequence identity):

$$s(a,b) = \log \frac{p_{ab}}{q_a q_b}, \qquad \text{where} \quad p_{ab} = \frac{F_{ab}}{\sum_{cd} F_{cd}}$$

Expected score:

$$\sum_{ab} q_a q_b\, s(a,b) = -\sum_{ab} q_a q_b \log \frac{q_a q_b}{p_{ab}} = -H(q \,\|\, p)$$
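
A minimal numerical sketch of the log-odds construction and of the expected score, using invented pair counts F over a toy two-letter alphabet. It assumes F is a symmetric table of aligned-pair counts and that the background frequencies q are the marginals of p. Neither assumption is stated in the lecture, and the function names are hypothetical.

```python
import math

# Hypothetical counts of aligned residue pairs (symmetric, toy alphabet).
F = {"A": {"A": 40, "B": 10},
     "B": {"A": 10, "B": 40}}


def log_odds(F):
    """Return pair frequencies p, background q and scores s = log(p_ab / (q_a q_b))."""
    total = sum(sum(row.values()) for row in F.values())
    p = {a: {b: F[a][b] / total for b in F[a]} for a in F}
    q = {a: sum(p[a].values()) for a in p}  # assumption: q is the marginal of p
    s = {a: {b: math.log(p[a][b] / (q[a] * q[b])) for b in p[a]} for a in p}
    return p, q, s


def expected_score(q, s):
    """Expected score of a random pair: sum over a, b of q_a * q_b * s(a,b)."""
    return sum(q[a] * q[b] * s[a][b] for a in q for b in q)


p, q, s = log_odds(F)
print(s["A"]["A"] > 0, s["A"]["B"] < 0)  # True True
print(expected_score(q, s) <= 0)         # True
```

The expected score is the negative of a relative entropy, so it cannot be positive. This is what lets random, unrelated pairs score low on average while related pairs accumulate positive scores.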


Summary


Web resources and materials for the course

http://folding.chmcc.org

http://folding.chmcc.org/protlab/protlab.html

http://folding.chmcc.org/intro2bioinfo/intro2bioinfo.html

Protein Modeling Lab

Remote access to PML and the Citrix software

All lectures and other materials available electronically from the PML servers

Electronic tests and homework, web submission interfaces

The web site for the Introduction to Bioinformatics course

Updates

Documents
