31
2/8/07 CAP5510 1 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 [email protected] www.cis.fiu.edu/~giri/teach/BioinfS07.html

CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 1

CAP 5510: Introduction to Bioinformatics

Giri NarasimhanECS 254; Phone: x3748

[email protected]/~giri/teach/BioinfS07.html

Page 2: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 2

Pattern Discovery

Page 3: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 3

What we have discussed so far

Why Pattern discovery?Types of patternsHow to represent and store patterns?Types of pattern discovery

Supervised pattern discoveryMotif Detection

Unsupervised pattern discoveryEvaluation of pattern discovery

Page 4: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 4

Unaligned Pattern DiscoveryUnaligned Pattern Discovery

Rigoutsos & Floratos, Bioinformatics, ’98

TEIRESIAS: The algorithm is similar to that used in GYM for aligned Pattern discovery.

TEIRESIASProtein SequenceDatabase

Seqlet Dictionary

A..GV

L..H…H

Y.C..C…F

V..G..G.G.T.L•••

Page 5: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 5

TEIRESIAS: Key Features

Starts with a set of seed patterns (Enumeration step) Convolution operator applied to all pairs of patterns:

A..GV.S ⊕ V.S.GR = A..GV.S.GROrder of Evaluation carefully chosen so that long patterns get longer firstFinds all maximal patterns.Combinatorial explosion avoided by generating only relevant maximal patterns.

Rigoutsos & Floratos, Bioinformatics, ’98

Page 6: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 6

SPLASH

Structural Pattern Localization Analysis by Sequential Histogram (SPLASH)Not limited to fixed alphabet sizePatterns are modeled by a homology metric and thus allow mismatchesEarly pruning of inconsistent seed patterns, leading to increased efficiency. Easily parallelized with availability of extra resources.

Califano, Bioinformatics, ’00; Califano et al., J Comput Biol, ’00

Page 7: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 7

Precomputed Sequence Patterns

PROSITEBLOCKS and PRINTSeMOTIFSPATPRODOMPfam

Page 8: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 8

Motif Detection ToolsPROSITE (Database of protein families & domains)

Try PDOC00040. Also Try PS00041PRINTS Sample OutputBLOCKS (multiply aligned ungapped segments for highly conserved regions of proteins; automatically created) Sample OutputPfam (Protein families database of alignments & HMMs)

Multiple Alignment, domain architectures, species distribution, links: TryMoSTPROBEProDomDIP

Page 9: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 9

Protein Information Sites

SwissPROT & GenBankInterPRO is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. See sample.

PIR Sample Protein page

Page 10: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 10

Modular Nature of Proteins

Proteins are collections of “modular”domains. For example,

F2 E E

EF2

F2

K K

K Catalytic Domain

Catalytic Domain

PLAT

Coagulation Factor XII

Page 11: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 11

Domain Architecture Tools

CDARTProtein Domain ArchitectureAAH24495; ;It’s domain relatives; Multiple alignment for 2nd domain

SMART

Page 12: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 12

Predicting Specialized Structures

COILS – Predicts coiled coil motifsTMPred – predicts transmembrane regionsSignalP – predicts signal peptidesSEG – predicts nonglobular regions

Page 13: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 13

Patterns in DNA Sequences

Signals in DNA sequence control eventsStart and end of genesStart and end of intronsTranscription factor binding sites (regulatory elements)Ribosome binding sites

Detection of these patterns are useful for Understanding gene structureUnderstanding gene regulation

Page 14: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 14

Motifs in DNA Sequences

Given a collection of DNA sequences of promoter regions, locate the transcription factor binding sites (also called regulatory elements)

Example:

http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Page 15: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 15

Motifs

http://weblogo.berkeley.edu/examples.html

Page 16: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 16

Motifs in DNA Sequences

http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Page 17: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 17

More Motifs in E. Coli DNA Sequences

http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Page 18: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 18

http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Page 19: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 19

Other Motifs in DNA

Sequences: Human Splice

Junctions

http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Page 20: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 20

Motifs in DNA Sequences

Page 21: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 21

Gene

Gene-Specific TFBinding Sites

TATA BoxCAT Box

Basal TFBinding Sites

coding regionupstream region

Transcription Regulation

Page 22: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

Transcription Regulation

2/8/07 CAP5510 22

TF

DNA

TFBS

Gene

Page 23: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

Transcription Regulation

2/8/07 CAP5510 23

DNA

TF

TFBS

Gene Co-regulated genes

Page 24: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 24

Problem: Given the upstream regions of all genes in the genome, find all over-represented sequence signatures.

Motif-prediction: Whole genome

Page 25: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 25

Basic Principle: If a TF co-regulates many genes, then all these genes should have at least 1 binding site for it in their upstream region.

Motif-prediction: Whole genome

Gene 1 Gene 2 Gene 3 Gene 4 Gene 5

Binding sites for TF

Page 26: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 26

Motif Detection (TFBMs)

See evaluation by Tompa et al.[bio.cs.washington.edu/assessment]

Gibbs Sampling Methods: AlignACE, GLAM, SeSiMCMC, MotifSamplerWeight Matrix Methods: ANN-Spec, Consensus, EM: Improbizer, MEMECombinatorial & Misc.: MITRA, oligo/dyad, QuickScore, Weeder, YMF

Page 27: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 27

Predicting Motifs in Whole Genome

MEME: EM algorithm [ Bailey et al., 1994 ]

AlignACE: Gibbs Sampling Approach [ Hughes et al., 2000 ]

Consensus: Greedy Algorithm Based [ Hertz et al., 1990 ]

ANN-Spec: Artificial Neural Network and a Gibbs sampling method [ Workman et al., 2000 ]

YMF: Enumerative search [Sinha et al., 2003 ]

Page 28: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 28

Input: upstream sequencesX = {X1, X2, …, Xn},

Motif profile: 4×k matrix θ = (θrp), r ∈ {A,C,G,T} 1 ≤ p ≤ kθrp = Pr(residue r in position p of motif)

Background distribution: θr0 = Pr(residue r in background)

EM Method: Model Parameters

Page 29: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 29

Z = {Zij}, where1, if motif instance starts at

Zij = position i of Xj0, otherwise

Iterate over probabilistic models that could generate X and Z, trying to converge on this solution

EM Method: Hidden Information

Page 30: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

2/8/07 CAP5510 30

M-step: Build a new profile by using every m-window, but weighting each one with value zij.

Initialize: random profile

EM Algorithm

E-step: Using profile, compute a likelihood value zijfor each m-window at position i in input sequence j.

Stop if convergedMEME [Bailey, Elkan 1994]

Goal: Find θ, Z that maximize Pr (X, Z | θ)

Page 31: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm

Gibbs Sampling for Motif Detection

2/8/07 CAP5510 31