
Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Page 1: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

JM - http://folding.chmcc.org 1

Introduction to Bioinformatics: Lecture II
From Molecular Processes to String Matching

Jarek Meller

Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC

Page 2: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Outline of the lecture

Sequence approximation in computational molecular biology: the premise and the limits

Getting ready for analysis of exact string matching and sequence alignment algorithms: some definitions and interplay with biology

The notion of string/sequence similarity

Substitution matrices for sequence alignment

Page 3: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Before we start: literature watch

A draft of the rat genome has been published! (Rat Genome Sequencing Project Consortium, RGSPC, Nature 428)

What are the first conclusions from the comparison with other mammalian genomes?

What approaches and tools have been used to perform this comparative analysis?

Genome sizes (H: human, M: mouse, R: rat):

H: 2.9 Gb

M: 2.5 Gb

R: 2.75 Gb

R: unique: 0.7 Gb; common with both H and M: 1.1 Gb

Page 4: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Biological Polymers and Central Dogma

Bio-Polymer (alphabet) | Process (algorithm)
DNA (A,T,G,C) | replication, transcription
mRNA (U,A,C,G) | splicing, translation
Proteins (20 a.a.) | folding, interactions

Lipids, polysaccharides, membranes, signal transduction, environmental signals etc.
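The "polymer as alphabet, process as algorithm" view above can be illustrated with a minimal Python sketch that treats transcription and translation as string operations. It is a simplification under stated assumptions: the input is taken to be the coding strand (so transcription reduces to replacing T with U), splicing is ignored, and the standard genetic code is used. The names `transcribe` and `translate` are illustrative, not taken from the lecture.

```python
# Standard genetic code, built from the conventional U, C, A, G codon ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (assumption: no splicing, T simply becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0 until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGTGCTGTAA")
    print(mrna, translate(mrna))
```

Real transcripts require strand selection, splicing and reading-frame detection, which this sketch deliberately leaves out.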

Page 5: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Complexity of “DNA computing”

http://www.genecrc.org/site/lc/lc2d.htm

Page 6: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Get the relevant sequences to compare them: conservation and differences

Problem | Algorithms | Programs
Sequencing | Fragment assembly problem; the Shortest Superstring Problem | Phrap (Green, 1994)
Gene finding | Hidden Markov Models, pattern recognition methods | GenScan (Burge & Karlin, 1997)
Sequence comparison | Pairwise and multiple sequence alignments; dynamic programming, heuristic methods | BLAST (Altschul et al., 1990)
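To make the Shortest Superstring formulation of fragment assembly concrete, here is a small Python sketch of the well-known greedy heuristic: repeatedly merge the two fragments with the largest suffix-prefix overlap. This is an illustrative approximation only; it is not how Phrap works, it assumes error-free fragments from a single strand, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy approximation to the shortest common superstring."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = sorted(
        f for f in set(fragments)
        if not any(f != g and f in g for g in fragments)
    )
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    print(greedy_superstring(["ATGGT", "GGTGC", "TGCTG"]))
```

The greedy result is a valid superstring but is not guaranteed to be the shortest one; the exact problem is NP-hard.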

Page 7: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Redundancy in biological systems

An example: sperm whale vs. human myoglobin:

Query: 1   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 60
           M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
Sbjct: 1   MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60

Query: 61  DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH 120
           DLKKHG TVLTALG ILKKKGHHEAE+KP AQSHATKHKIP+KYLEFISE II VL S+H
Sbjct: 61  DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120

Query: 121 PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG 154
           PG+FGADAQGAMNKALELFRKD+A+ YKELG+QG
Sbjct: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154

Exercise: find the sequence of 1mba in the PDB and “blast” it against nr using NCBI.
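A simple way to quantify the conservation visible in such a pairwise alignment is percent identity. The sketch below is a minimal Python illustration, assuming two already-aligned strings of equal length in which "-" marks a gap; the function name `percent_identity` and the choice to ignore gapped columns are assumptions made here, not part of the lecture.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # First eight residues of the query and subject rows of the alignment.
    print(percent_identity("MVLSEGEW", "MGLSDGEW"))
```

Other conventions exist (for example, dividing by the length of the shorter sequence), so reported identities should always state which denominator was used.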

Page 8: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Limits of the sequence approximation

• All the information and the various fingerprints of information processing at the molecular level (via interactions etc.), including adjustment to physiologically relevant external signals, seem to be encoded in nucleotide and protein sequences.

• However, there are limits to this simple approximation: actual understanding of molecular processes requires structure, chemistry, kinetics and thermodynamics.

• On the other hand, a deeper understanding of the nature of biological objects and processes greatly facilitates sequence-based studies by suggesting critical features, similarity measures etc.

Page 9: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Strings, sequences and string operations

String vs. sequence duality will be important for exact vs. inexact string matching
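One common reading of this duality, assumed here, is that a substring must occupy contiguous positions of the text, whereas a subsequence only has to preserve the order of characters and may skip positions. The Python sketch below contrasts the two; the function names are illustrative.

```python
def is_substring(pattern, text):
    """True if pattern occurs in text as a contiguous block (exact matching)."""
    return pattern in text


def is_subsequence(pattern, text):
    """True if pattern can be obtained from text by deleting characters."""
    remaining = iter(text)
    # Each 'ch in remaining' consumes the iterator up to the first match,
    # so the characters of pattern must appear in order.
    return all(ch in remaining for ch in pattern)


if __name__ == "__main__":
    text = "AGATTA"
    for pattern in ("GAT", "GTA", "TAG"):
        print(pattern, is_substring(pattern, text), is_subsequence(pattern, text))
```

Exact string matching asks the substring question; allowing gaps, as in sequence alignment, moves toward the subsequence question.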

Page 10: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Beyond the letters: how to find better models (e.g. GC content for gene finding)

http://www.imb-jena.de/IMAGE_BPDIR.html
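As a small illustration of a model that goes beyond individual letters, the following Python sketch computes GC content in sliding windows along a DNA string. The window and step sizes are arbitrary assumptions for illustration, and a real gene finder would combine such a signal with many others.

```python
def gc_content(seq):
    """Fraction of G and C characters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window, step=1):
    """GC content of each full window of the given size along seq."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    print(sliding_gc("AATTGGCCGCGCAATT", window=4, step=4))
```

Windows with unusually high GC content can then be flagged as candidate regions for closer inspection.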

Page 11: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Another example: active sites, functional motifs and multiple alignment

Page 12: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Distance and similarity measures

Page 13: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Edit distance vs. substitution score
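The relationship between the two notions can be sketched with two small dynamic programming routines in Python: unit-cost edit (Levenshtein) distance, which is minimized, and a global alignment score with a simple match/mismatch/gap scheme, which is maximized. The scoring parameters and function names are assumptions chosen for illustration; real protein alignments would use a substitution matrix instead of a flat mismatch penalty.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best global alignment score under a linear gap penalty."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * gap]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            curr.append(max(prev[j] + gap,
                            curr[j - 1] + gap,
                            prev[j - 1] + sub))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(global_alignment_score("kitten", "sitting", match=0))
```

With `match=0`, `mismatch=-1` and `gap=-1`, the best alignment score is exactly the negative of the edit distance, which shows that the two views differ only in sign and in how substitutions are weighted.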

Page 14: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Substitution matrices for protein sequence alignment: learning and extrapolating from examples

PAM matrices (Dayhoff et al.): extrapolating to longer evolutionary times from data for very similar proteins with more than 85% sequence identity (short evolutionary time):

$$s(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}, \qquad \text{e.g. } P(b \mid a,2) = \sum_{c} P(b \mid c,1)\,P(c \mid a,1)$$

BLOSUM matrices (Henikoff & Henikoff): multiple alignments of more distantly related proteins (e.g. BLOSUM50, derived from sequences clustered at 50% sequence identity):

$$s(a,b) = \log \frac{p_{ab}}{q_a q_b}, \qquad \text{where } p_{ab} = \frac{F_{ab}}{\sum_{cd} F_{cd}}$$

Expected score:

$$\sum_{ab} q_a q_b\, s(a,b) = -\sum_{ab} q_a q_b \log \frac{q_a q_b}{p_{ab}} = -H(q \,\|\, p)$$
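The BLOSUM-style log-odds construction and the expected score can be sketched in Python on a toy two-letter alphabet. The pair counts below are invented purely for illustration, the background frequencies q_a are assumed to be the marginals of p_ab, and base-2 logarithms are an arbitrary choice of units; real matrices are estimated from curated alignment blocks and then scaled and rounded.

```python
import math


def log_odds_scores(pair_counts):
    """Return (scores, q) from symmetric ordered pair counts F[(a, b)].

    Assumes every listed count is positive and that F[(a, b)] == F[(b, a)].
    """
    total = sum(pair_counts.values())
    p = {pair: count / total for pair, count in pair_counts.items()}
    # Background frequency q_a taken as the marginal of p_ab (assumption).
    q = {}
    for (a, _b), prob in p.items():
        q[a] = q.get(a, 0.0) + prob
    scores = {
        (a, b): math.log2(prob / (q[a] * q[b])) for (a, b), prob in p.items()
    }
    return scores, q


def expected_score(scores, q):
    """Expected score of a random (unrelated) pair: sum_ab q_a q_b s(a, b)."""
    return sum(q[a] * q[b] * s for (a, b), s in scores.items())


if __name__ == "__main__":
    toy_counts = {("A", "A"): 40, ("A", "B"): 10, ("B", "A"): 10, ("B", "B"): 40}
    scores, q = log_odds_scores(toy_counts)
    print(scores[("A", "A")], scores[("A", "B")], expected_score(scores, q))
```

In this toy case identical pairs score positively, mismatched pairs score negatively, and the expected score for random pairs is negative, in line with the relative-entropy expression above.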

Page 15: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Summary

Page 16: Introduction to Bioinformatics: Lecture II From Molecular Processes to String Matching

Web resources and materials for the course

http://folding.chmcc.org
http://folding.chmcc.org/protlab/protlab.html
http://folding.chmcc.org/intro2bioinfo/intro2bioinfo.html

Protein Modeling Lab

Remote access to PML and the Citrix software

All lectures and other materials available electronically from the PML servers

Electronic tests and homework, web submission interfaces

The web site for the Introduction to Bioinformatics course

Updates