Previous Lecture: Multiple Alignment


Page 1: Previous Lecture:  Multiple Alignment

Previous Lecture: Multiple Alignment

Page 2: Previous Lecture:  Multiple Alignment

This Lecture

Introduction to Biostatistics and Bioinformatics

Motifs

Page 3: Previous Lecture:  Multiple Alignment

Learning Objectives

• Restriction sites

• Finding genes in DNA sequences

• Regulatory sites in DNA

• Protein signals (transport and processing)

• Protein functional domains & motif databases

• Regular Expressions

• Position Specific Scoring Matrix & Hidden Markov Models

Page 4: Previous Lecture:  Multiple Alignment

Restriction Sites

• Bacteria make restriction enzymes that cut DNA at specific sequences (4-8 base patterns)

• Very simple to find these patterns - can even use the “Find” function of your web browser or word processor

• Open any page of text and look for “CAT” – you now have a restriction site search program!
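The "Find" idea above can be sketched in a few lines of Python. This is only an illustration: the two recognition sequences (EcoRI = GAATTC, BamHI = GGATCC) are well-known examples, while the helper name `find_sites` and the short test sequence are made up for this sketch.

```python
import re

# Recognition sequences for two well-known enzymes (both are palindromic).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}

def find_sites(dna, site):
    """Return the 0-based start of every match, including overlapping ones."""
    # The lookahead (?=...) consumes no characters, so overlapping matches are found.
    pattern = "(?=%s)" % re.escape(site)
    return [m.start() for m in re.finditer(pattern, dna.upper())]

# Hypothetical test sequence, not taken from a real genome.
dna = "GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGAATTCAA"
hits = {name: find_sites(dna, site) for name, site in SITES.items()}
print(hits)  # {'EcoRI': [38], 'BamHI': [23]}
```

Because these sites are palindromic, searching one strand is enough; a non-palindromic site would also need a search of the reverse complement.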

Page 5: Previous Lecture:  Multiple Alignment

NEBcutter2

http://tools.neb.com/NEBcutter2/

Page 6: Previous Lecture:  Multiple Alignment

Finding Genes in Genomic DNA

• Translate (in all 6 reading frames) and look for similarity to known protein sequences

• Look for long Open Reading Frames (ORFs) between start and stop codons (start=ATG, stop=TAA, TAG, TGA)

• Look for known gene markers

  • TATA box, intron splice sites, etc.

• Statistical methods (codon preference)
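The ORF approach from the list above can be sketched as follows. This is a minimal illustration, not a gene finder: it scans all 6 reading frames for ATG...stop stretches and ignores introns, alternative start codons and codon preference. The function names and the `min_codons` cutoff are assumptions of this sketch.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_codons=30):
    """Yield (strand, frame, start, end) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive (stop codon included) and are
    relative to the strand being read. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None  # look for the next ORF in this frame

# Hypothetical toy sequence: one 5-codon ORF in frame 2 of the + strand.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(list(find_orfs(toy, min_codons=3)))  # [('+', 2, 2, 20)]
```

A real sequence would normally be scanned with a much larger `min_codons`, since short ORFs occur by chance in every frame.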

Page 7: Previous Lecture:  Multiple Alignment

GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAAAAATCAACTCCAGATGGATCTAAGATTTAAATCTAACACCTGAAACCATAAAAATTCTAGGAGATAACACTGGCAAAGCTATTCTAGACATTGGCTTAGGCAAAGAGTTCGTGACCAAGAACCCAAAAGCAAATGCAACAAAAACAAAAATAAATAGGTGGGACCTGATTAAACTGAAAAGCCTCTGCACAGCAAAAGAAATAATCAGCAGAGTAAACAGACAACCCACAGAATGAGAGAAAATATTTGCAAACCATGCATCTGATGACAAAGGACTAATATCCAGAATCTACAAGGAACTCAAACAAATCAGCAAGAAAAAAATAACCCCATCAAAAAGTGGGCAAAGGAATGAATAGACAATTCTCAAAATATACAAATGGCCAATAAACATACGAAAAACTGTTCAACATCACTAATTATCAGGGAAATGCAAATTAAAACCACAATGAGATGCCACCTTACTCCTGCAAGAATGGCCATAATAAAAAAAAATCAAAAAAGAATAAATGTTGGTGTGAATGTGGTGAAAAGAGAACACTTTGACACTGCTGGTGGGAATGGAAACTAGTACAACCACTGTGGAAAACAGTACCGAGATTTCTTAAAGAACTACAAGTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAGAGGAAAAGAAGTCATTATTTGAAAAAGACACTTGTACATACATGTTTATAGCAGCACAATTTGCAATTGCAAAGATATGGAACCAGTCTAAATGCCCATCAACCAACAAATGGATAAAGAAAATATGGTATATATACACCATGGAACACTACTCAGCCATAAAAAGGAACAAAATAATGGCAACTCACAGATGGAGTTGGAGACCACTATTCTAAGTGAAATAACTCAGGAATGGAAAACCAAATATTGTATGTTCTCACTTATAAGTGGGAGCTAAGCTATGAGGACAAAAGGCATAAGAATTATACTATGGACTTTGGGGACTCGGGGGAAAGGGTGGGAGGGGGATGAGGGACAAAAGACTACACATTGGGTGCAGTGTACACTGCTGAGGTGATGGGTGCACCAAAATCTCAGAAATTACCACTAAAGAACTTATCCATGTAACTAAAAACCACCTCTACCCAAATAATTTTGAAATAAAAAATAAAAATATTTTAAAAAGAACTCTTTAAAATAAATAATGAAAAGCACCAACAGACTTATGAACAGGCAATAGAAAAAATGAGAAATAGAAAGGAATACAAATAAAAGTACAGAAAAAAAATATGGCAAGTTATTCAACCAAACTGGTAATTTGAAATCCAGATTGAAATAATGCAAAAAAAAGGCAATTTCTGGCACCATGGCAGACCAGGTACCTGGATGATCTGTTGCTGAAAACAACTGAAAATGCTGGTTAAAATATATTAACACATTCTTGAATACAGTCATGGCCAAAGGAAGTCACATGACTAAGCCCACAGTCAAGGAGTGAGAAAGTATTCTCTACCTACCATGAGGCCAGGGCAAGGGTGTGCACTTTTTTTTTTCTTCTGTTCATTGAATACAGTCACTGTGTATTTTACATACTTTCATTTAGTCTTATGACAATCCTATGAAACAAGTACTTTTAAAAAAATTGAGATAACAGTTGCATACCGTGAAATTCATCCATTTAAAGTGAGCAATTCACAGGTGCAGCTAGCTCAGTCAGCAGAGCATAAGACTCTTAAAGTGAACAATTCAGTGCTTTTTAGTATATTCACAGAGTTGTGCAACCATCACCACTATCTAATTGGTCTTAGTCTGTTTGGGCTGCCATAACAAAATACCACAAACTGGATAGCTCATAAACAACAGGCATTTATTGCTCACAGTTCTAGAGGCTGGAAGTGCAAGATTAAGATGCCAGCAGATTCTGTGTCTGCTGAGGGCCTGTTCCTCATAGAAGGTGCCCTCTTG
CTGAATTCTCACATGGTGGAAGGGGGAAAACAAGCTTGCATTGCAAAGAGGTGGGCCTCTTTAATCCCAAAGGCCCCACCTCTAAAAGGCCCCACTTCTGAATACCATTACATTGAGAATTAAGTTTCAACATAGGAATTTGGGGGAACACAAATATCCAGACTGTAGCATAATTCCAGAACGGATTCAT

Page 8: Previous Lecture:  Multiple Alignment

Intron/Exon structure

• Gene finding programs work well in bacteria

• None of the gene prediction programs do a very good job of predicting eukaryotic intron/exon boundaries

• The only reasonable gene models are based on alignment of cDNAs to genome sequence

• >50% of all human genes still do not have an accurate coding sequence defined (transcription start, intron splice sites)

Page 9: Previous Lecture:  Multiple Alignment
Page 10: Previous Lecture:  Multiple Alignment

Gene Finding on the Web

GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN– http://compbio.ornl.gov/grailexp

ORFfinder: NCBI– http://www.ncbi.nlm.nih.gov/gorf/gorf.html

DNA translation: Univ. of Minnesota Med. School– http://alces.med.umn.edu/webtrans.html

GenLang– http://cbil.humgen.upenn.edu/~sdong/genlang.html

BCM GeneFinder: Baylor College of Medicine, Houston, TX
– http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
– http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

Page 11: Previous Lecture:  Multiple Alignment

Truth?

• There may not be a "correct" answer to the gene finding problem

• Some genes have more than one start and stop position on the DNA

• Alternative splicing (a portion of the DNA is sometimes in an exon, sometimes in an intron)

• Pseudogenes - look like genes, but no longer function

• All computational gene predictions need to be experimentally verified (RNA-seq!!)

Page 12: Previous Lecture:  Multiple Alignment

Genomic Sequence

• Once each gene is located on the chromosome, it becomes possible to get upstream genomic sequence

• This is where transcription factor (TF) binding sites are located – promoters and enhancers

• Search for known TF sites, and discover new ones (among co-regulated genes)

Page 13: Previous Lecture:  Multiple Alignment

Figure: Phage Cro repressor bound to DNA (Andrew Coulson & Roger Sayle with RasMol, Univ. of Edinburgh, 1993)

Page 14: Previous Lecture:  Multiple Alignment

Sequence Logos

Page 15: Previous Lecture:  Multiple Alignment

Many DNA Regulatory Sequences are Known

– JASPAR: a curated, non-redundant set of transcription factor binding sites from published articles (currently 593 non-redundant matrices).

– UniProbe: binding sites of transcription factors determined by in vitro protein binding microarray (data for 406 DNA binding proteins on all k-mers)

– TransFac
  • Became a private for-profit company (BIOBASE/Qiagen)
  • Stopped adding new entries to public data in 2005

– The Eukaryotic Promoter Database (EPD)
  • 1314 entries taken directly from scientific literature

Page 16: Previous Lecture:  Multiple Alignment

JASPAR page for CTCF

Page 17: Previous Lecture:  Multiple Alignment

Position Scoring Matrix

Biopython Bio.motifs package (similar to BioPerl TFBS)

>>> m.consensus
Seq('CACGTG', IUPACUnambiguousDNA())

>>> m.weblogo("mymotif.png")

Count matrix:

        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00

Normalized position weight matrix (with pseudocounts) = probability of each base:

        0      1      2      3      4      5
A:   0.22   0.69   0.09   0.09   0.09   0.09
C:   0.59   0.09   0.72   0.09   0.09   0.09
G:   0.09   0.12   0.09   0.72   0.09   0.72
T:   0.09   0.09   0.09   0.09   0.72   0.09

Position Specific Scoring Matrix (log odds ratios of matrix vs background):

        0      1      2      3      4      5
A:  -0.19   1.46  -1.42  -1.42  -1.42  -1.42
C:   1.25  -1.42   1.52  -1.42  -1.42  -1.42
G:  -1.42  -1.00  -1.42   1.52  -1.42   1.52
T:  -1.42  -1.42  -1.42  -1.42   1.52  -1.42

Positive scores show that a base is more likely to come from the motif; negative scores show that it is more likely to come from background.
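The step from counts to probabilities to log-odds can be sketched in plain Python. Two values here are assumptions of this sketch rather than something stated on the slide: a pseudocount of 3 per base and a uniform background of 0.25. They were chosen because together they reproduce the numbers in the matrices on this slide, with log-odds taken in base 2.

```python
import math

# Count matrix from the slide (6 positions, 20 sites).
counts = {
    "A": [4, 19, 0, 0, 0, 0],
    "C": [16, 0, 20, 0, 0, 0],
    "G": [0, 1, 0, 20, 0, 20],
    "T": [0, 0, 0, 0, 20, 0],
}

def to_pwm(counts, pseudocount):
    """Convert counts to per-position probabilities, adding a pseudocount to every cell."""
    length = len(next(iter(counts.values())))
    pwm = {base: [] for base in counts}
    for i in range(length):
        total = sum(counts[b][i] + pseudocount for b in counts)
        for b in counts:
            pwm[b].append((counts[b][i] + pseudocount) / total)
    return pwm

def to_pssm(pwm, background=0.25):
    """Log2 odds of each motif probability against a uniform background."""
    return {b: [math.log2(p / background) for p in col] for b, col in pwm.items()}

pwm = to_pwm(counts, pseudocount=3)   # assumed pseudocount, see lead-in
pssm = to_pssm(pwm)                   # assumed uniform background
for base in "ACGT":
    print(base, " ".join("%6.2f" % s for s in pssm[base]))
```

Without pseudocounts, every zero count would give a probability of 0 and a log-odds score of minus infinity, so a single unseen base would veto a site outright.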

Page 18: Previous Lecture:  Multiple Alignment

Motif Search Methods

Exact Match

>>> match = seq.count('CACGTG')

Regular Expression Match

>>> import re
>>> match = re.search(r'[CA][AG]CG[TC]G', str(seq))

PSSM Search

>>> from Bio import motifs
>>> for position, score in pssm.search(seq, threshold=3.0):
...     print("Position %d: score = %5.3f" % (position, score))
...
Position 0: score = 5.622
Position -20: score = 4.601
Position 10: score = 3.037
Position 13: score = 5.738

Negative positions are on the - strand.

A threshold of log-odds 7 means a site is about 100x (2^7 = 128) more likely to occur in the motif than in random background. The hits listed above all score below 7, so they were found with a lower threshold (3.0 here).

A highly selective motif should only match once (or zero times) in each sequence tested.
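For readers without Biopython, the PSSM search can be sketched in plain Python. The scores are the log-odds values from the Position Scoring Matrix slide; the function names, the test sequence and the way hits are reported (window start on the + strand plus a strand label) are choices made for this sketch and differ from Biopython's negative-position convention.

```python
# Log-odds scores copied from the Position Scoring Matrix slide.
PSSM = {
    "A": [-0.19, 1.46, -1.42, -1.42, -1.42, -1.42],
    "C": [1.25, -1.42, 1.52, -1.42, -1.42, -1.42],
    "G": [-1.42, -1.00, -1.42, 1.52, -1.42, 1.52],
    "T": [-1.42, -1.42, -1.42, -1.42, 1.52, -1.42],
}
WIDTH = 6
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def score_window(window):
    """Sum the position-specific score of each base in a WIDTH-long window."""
    return sum(PSSM[base][i] for i, base in enumerate(window))

def scan(dna, threshold):
    """Return (position, strand, score) for every window scoring >= threshold.

    position is the 0-based window start on the + strand; the - strand is
    scored by reverse-complementing the same window. Expects only A, C, G, T.
    """
    hits = []
    for pos in range(len(dna) - WIDTH + 1):
        window = dna[pos:pos + WIDTH]
        reverse = window.translate(COMPLEMENT)[::-1]
        for strand, site in (("+", window), ("-", reverse)):
            score = score_window(site)
            if score >= threshold:
                hits.append((pos, strand, round(score, 2)))
    return hits

# Hypothetical test sequence containing the consensus CACGTG once.
print(scan("TTCACGTGAA", threshold=7.0))  # [(2, '+', 8.79), (2, '-', 8.79)]
```

CACGTG is its own reverse complement, which is why the single site is reported on both strands.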

Page 19: Previous Lecture:  Multiple Alignment

TF Binding sites lack information

• Most TF binding sites are determined by just a few base pairs (typically 6-12)

• Sequence is variable (consensus)

• This is not enough information for proteins to locate unique promoters for each gene in a 3 billion base genome

• TFs bind cooperatively and combinatorially
  – The key is in the location in relation to each other and to the transcription units of genes + epigenetic factors

• Can use phylogenetic conservation to help predict binding sites

DE IFI-6-16 (interferon-induced gene 6-16); G000176.
SQ gGGAAAaTGAAACT
SF -127
ST -89
BF T00428 ISGF-3; Quality: 6; Species: human, Homo sapiens.
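A rough calculation makes the "not enough information" point concrete. The sketch below assumes random DNA with equal base frequencies and an exact match to one fixed k-mer, which is a simplification (real genomes are not random and real sites tolerate mismatches); the function name is made up for this example.

```python
GENOME_SIZE = 3_000_000_000  # "3 billion base genome" from the slide

def expected_random_hits(k, genome_size=GENOME_SIZE, both_strands=True):
    """Expected exact matches to one fixed k-mer in random, equal-frequency DNA."""
    positions = genome_size - k + 1          # possible window starts
    strands = 2 if both_strands else 1       # a palindromic site would be counted twice
    return strands * positions / 4 ** k      # each window matches with probability 4**-k

for k in (6, 8, 12, 16):
    print("k=%2d  expected chance matches: %.1f" % (k, expected_random_hits(k)))
```

Under these assumptions a 6-base site is expected well over a million times by chance and a 12-base site still hundreds of times; only at around 16 bases does the expectation drop to about one, which is why location, cooperativity and conservation are needed to pick out functional sites.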

Page 20: Previous Lecture:  Multiple Alignment

Web tools for TFBS

Promoter Scan: NIH Bioinformatics (BIMAS)
http://www-bimas.cit.nih.gov/molbio/proscan/

Signal Scan: NIH Bioinformatics (BIMAS) – uses old TransFac database
http://www-bimas.cit.nih.gov/molbio/signal/

TFSEARCH (uses 1998 version of TransFac)
http://www.cbrc.jp/research/db/TFSEARCH.html

JASPAR (search motifs in one sequence), ConSite
http://jaspar.genereg.net/
http://consite.genereg.net/

Toucan workbench for regulatory sequence analysis
https://gbiomed.kuleuven.be/english/research/50000622/lcb/tools/toucan

TargetFinder: Telethon Inst. of Genetics and Medicine, Milan, Italy
http://www.targetfinder.org/index.php/findtargets

RSAT: Regulatory Sequence Analysis Toolkit
http://rsat.ulb.ac.be/rsat/

MotifMogul: A web server that enables the analysis of multiple DNA sequences with PWMs from JASPAR and TRANSFAC using 3 different algorithms (CLOVER, MotifLocator, MotifScanner)
http://xerad.systemsbiology.net/MotifMogulServer/index.html

Page 21: Previous Lecture:  Multiple Alignment

Protein Sequence

Page 22: Previous Lecture:  Multiple Alignment

Protein Sequence Analysis

• Molecular properties (pH, mol. wt., isoelectric point, hydrophobicity)

• Motifs (signal peptide, coiled-coil, trans-membrane, etc.)

• Protein Families

• Secondary Structure (helix vs. beta-sheet)

• 3-D prediction, Threading

Page 23: Previous Lecture:  Multiple Alignment

Chemical Properties of Proteins

• Proteins are linear polymers of 20 amino acids

• Chemical properties of the protein are determined by its amino acids

• Molecular wt., pH, isoelectric point are simple calculations from amino acid composition

• Hydrophobicity is a property of groups of amino acids - best examined as a graph

Page 24: Previous Lecture:  Multiple Alignment

Hydrophobicity Plot

Figure: P53_HUMAN (P04637) human cellular tumor antigen p53
Kyte-Doolittle hydrophilicity, window=19
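A plot like this one comes from a sliding-window average over a per-residue scale. The sketch below uses the published Kyte-Doolittle hydropathy values and the window of 19 named on the slide; the function name and the toy peptide are made up. On this scale positive values are hydrophobic, so a hydrophilicity plot is the same curve with the sign flipped.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of each window; entry i covers residues i .. i+window-1."""
    protein = protein.upper()
    profile = []
    for i in range(len(protein) - window + 1):
        segment = protein[i:i + window]
        profile.append(sum(KD[aa] for aa in segment) / window)
    return profile

# Hypothetical peptide: a charged stretch followed by a hydrophobic stretch.
toy = "DEKRDEKRDEKR" + "LIVALIVALIVALIVALIVA" + "DEKR"
profile = hydropathy_profile(toy)
print(["%.2f" % x for x in profile])
```

Plotting `profile` against the window position gives the graph; stretches of about 19 residues with a high mean are the usual candidates for membrane-spanning helices.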

Page 26: Previous Lecture:  Multiple Alignment

EMBOSS Protein Analysis Toolkit

• plotorf: simple open reading frame finder

• Garnier: predicts secondary structure

• Charge: plot of protein charge

• Octanol: hydrophobicity plot

• Pepwindow: hydropathy plot

• pepinfo: plots protein secondary structure and hydrophobicity in parallel panels

• tmap: predicts transmembrane regions

• Topo: draws a map of a transmembrane protein

• Pepwheel: shows protein sequence as a helical wheel

• Pepcoil: predicts coiled-coil domains

• Helixturnhelix: predicts helix-turn-helix domains

Page 27: Previous Lecture:  Multiple Alignment

Simple Motifs

Common structural motifs
– Membrane spanning
– Signal peptide
– Coiled coil
– Helix-turn-helix

Page 28: Previous Lecture:  Multiple Alignment

Protein Signal Peptides

• Proteins are sorted within the cell using 20-25 amino acid tags at their N-terminus (beginning)

• Chopped off once they reach their destination

Page 29: Previous Lecture:  Multiple Alignment

Protein Signal Prediction

• ChloroP - Prediction of chloroplast transit peptides

• LipoP - Prediction of lipoproteins and signal peptides in Gram negative bacteria

• MITOPROT - Prediction of mitochondrial targeting sequences

• PATS - Prediction of apicoplast targeted sequences

• PlasMit - Prediction of mitochondrial transit peptides in Plasmodium falciparum

• Predotar - Prediction of mitochondrial and plastid targeting sequences

• PTS1 - Prediction of peroxisomal targeting signal 1 containing proteins

• SignalP - Prediction of signal peptide cleavage sites

Page 30: Previous Lecture:  Multiple Alignment

“Super-secondary” Structure

Common structural motifs
– Membrane spanning (EMBOSS: tmap, topo)
– Signal peptide (EMBOSS: sigcleave)
– Coiled coil (EMBOSS: pepcoil)
– Helix-turn-helix (EMBOSS: helixturnhelix)

• Predicted from abundance of specific amino acids in a window and patterns of hydrophobic/hydrophilic residues

Page 31: Previous Lecture:  Multiple Alignment

Web servers that predict these structures

Predict Protein server: EMBL Heidelberg– http://www.embl-heidelberg.de/predictprotein/

SOSUI: Tokyo Univ. of Ag. & Tech., Japan– http://www.tuat.ac.jp/~mitaku/adv_sosui/submit.html

TMpred (transmembrane prediction): ISREC (Swiss Institute for Experimental Cancer Research)– http://www.isrec.isb-sib.ch/software/TMPRED_form.html

COILS (coiled coil prediction): ISREC– http://www.isrec.isb-sib.ch/software/COILS_form.html

SignalP (signal peptides): Tech. Univ. of Denmark – http://www.cbs.dtu.dk/services/SignalP/

Page 32: Previous Lecture:  Multiple Alignment

Protein Domains/Motifs

• Proteins are built out of functional units known as domains (or motifs)

• These domains have conserved sequences

• Often much more similar than their respective proteins

• Exon splicing theory (W. Gilbert)

• Exons correspond to folding domains which in turn serve as functional units

• Unrelated proteins may share a single similar exon (e.g. ATPase or DNA binding function)

Page 33: Previous Lecture:  Multiple Alignment

Protein Domains (Pattern analysis)

Page 34: Previous Lecture:  Multiple Alignment

Motifs are built from Multiple Alignments

Page 35: Previous Lecture:  Multiple Alignment

Protein Motif Databases

• Known protein motifs have been collected in databases

• Best database is PROSITE
  – The Dictionary of Protein Sites and Patterns
  – maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland
  – contains a comprehensive list of documented protein domains constructed by expert molecular biologists
  – Alignments and patterns built by hand!

Page 36: Previous Lecture:  Multiple Alignment

PROSITE is based on Patterns

Each domain is defined by a simple pattern
– Patterns can have alternate amino acids in each position and defined spaces, but no gaps
– Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)

ID CBD_FUNGAL; PATTERN.
AC PS00562;
DT DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (UPDATE).
DE Cellulose-binding domain, fungal type.
PA C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C
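A PROSITE pattern maps almost directly onto a regular expression, which is one way to search with it. The converter below is a sketch that covers the common syntax elements (single residues, x, [alternatives], {exclusions}, repeat counts and the < > anchors); the function name and the test peptide are made up, and less common PROSITE constructs are not handled.

```python
import re

# One PROSITE element: optional '<', a residue / x / [set] / {excluded set},
# an optional (n) or (n,m) repeat, and an optional '>'.
ELEMENT = re.compile(r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)")

def prosite_to_regex(pattern):
    """Translate a PROSITE PA-line pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported PROSITE element: %r" % element)
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # anything except these residues
        repeat = ""
        if low is not None:
            repeat = "{%s,%s}" % (low, high) if high is not None else "{%s}" % low
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

regex = prosite_to_regex("C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C")
print(regex)  # CGG.{4,7}G.{3}C.{5}C.{3,5}[NHG].[FYWM].{2}QC

# Hypothetical peptide built to satisfy the pattern, starting at index 2.
peptide = "AACGGSSSSGAAACTTTTTCSSSNAYTTQCKK"
print(re.search(regex, peptide).start())  # 2
```

The x(4,7) and x(3,5) elements are the "defined spaces" mentioned above: the spacing may vary within fixed limits, but there is no general gap model.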

Page 37: Previous Lecture:  Multiple Alignment
Page 38: Previous Lecture:  Multiple Alignment
Page 39: Previous Lecture:  Multiple Alignment

Tools for Pattern searching

EMBOSS

fuzznuc: DNA pattern search

fuzzpro: protein pattern search

preg: regular expression search of a protein sequence

Page 40: Previous Lecture:  Multiple Alignment

Tools for PROSITE searches

Free Mac program: MacPattern– ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx

Free PC program (DOS): PATMAT– ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos

EMBOSS has the programs: patmatdb, patmatmotifs

Also in virtually all commercial programs: MacVector, VectorNTI, CLC-Bio, LaserGene, etc.

Page 41: Previous Lecture:  Multiple Alignment

Websites for PROSITE Searches

ScanProsite at ExPASy: Univ. of Geneva– http://expasy.hcuge.ch/sprot/scnpsit1.html

Network Protein Sequence Analysis: Institut de Biologie et Chimie des Protéines, Lyon, France– http://pbil.ibcp.fr/NPSA/npsa_prosite.html

PPSRCH: EBI, Cambridge, UK– http://www2.ebi.ac.uk/ppsearch/

Page 42: Previous Lecture:  Multiple Alignment

Pattern Search Methods

[Figure: pattern search methods arranged along a complexity axis: Consensus, Pattern, PSSM, HMM. Labels: exact match; fuzzy match; regular expression (defined mismatches); scores for each type of match in each position, gapped alignment; position-specific gap scores]

Challenges to define statistical significance, sensitivity, & specificity

What are all the true positives & false negatives in a genome-wide search?

Page 43: Previous Lecture:  Multiple Alignment

Profiles

• Profiles are tables of amino acid frequencies at each position in a motif

• They are built from multiple alignments

• PROSITE entries also contain profiles built from an alignment of proteins that match the pattern

• Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps
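The first bullet, a table of amino acid frequencies at each position, can be sketched directly from a multiple alignment. This is a minimal illustration: the function name and the four-sequence toy alignment are made up, gaps are simply skipped, and no pseudocounts or sequence weighting are applied as a real profile tool would.

```python
from collections import Counter

def build_profile(alignment):
    """Return one {residue: frequency} dict per alignment column ('-' gaps are skipped)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):            # walk the alignment column by column
        residues = Counter(aa for aa in column if aa != "-")
        total = sum(residues.values())
        profile.append({aa: n / total for aa, n in residues.items()} if total else {})
    return profile

# Hypothetical toy alignment (4 sequences, 4 columns, one gap).
alignment = ["ACDE", "ACDE", "AC-E", "GCDQ"]
for position, column in enumerate(build_profile(alignment)):
    print(position, column)
```

Dividing each frequency by a background frequency and taking the logarithm turns this table into the log-ratio PSSM shown for DNA earlier in the lecture.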

Page 44: Previous Lecture:  Multiple Alignment

Protein PSSM with log ratios

Page 45: Previous Lecture:  Multiple Alignment

Profile Alignment

Gribskov et al. 1987

• Position specific scores

• Allows addition of extra sequence(s) to an alignment

• Allows alignment of alignments

• Gaps introduced as whole columns in the separate alignments

• Optimal alignment in time O(a^2 l^2), where a = alphabet size, l = sequence length

• Information about the degree of conservation of sequence positions is included (similar amino acids)

Page 46: Previous Lecture:  Multiple Alignment

Good reasons to use profile alignments

– Adding a new sequence to an existing multiple alignment that you want to keep fixed (align sequence to profile)

– Searching a database for new members of your protein family (pfsearch)

– Searching a database of profiles to find out which one your sequence belongs to (pfscan)

– Combining two multiple sequence alignments (profile to profile)

Page 47: Previous Lecture:  Multiple Alignment

EMBOSS ProfileSearch

• EMBOSS has a set of profile analysis tools.

• Start with a multiple alignment
  – prophecy: creates a profile
  – profit: scans a database with your profile
  – prophet: makes pairwise alignments between a single sequence and a profile

Page 48: Previous Lecture:  Multiple Alignment

Websites for Profile searching

• PROSITE ProfileScan: ExPASy, Geneva– http://www.isrec.isb-sib.ch/software/PFSCAN_form.html

• BLOCKS (builds profiles from PROSITE entries and adds all matching sequences in SwissProt): Fred Hutchinson Cancer Research Center, Seattle, Washington, USA– http://www.blocks.fhcrc.org/blocks_search.html

• PRINTS (profiles built from automatic alignments of OWL non-redundant protein databases): http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

Page 49: Previous Lecture:  Multiple Alignment

More Protein Motif Databases

• PFAM (1344 protein family HMM profiles built by hand): Washington Univ., St. Louis– http://pfam.wustl.edu/hmmsearch.shtml

• ProDom (profiles built from PSI-BLAST automatic multiple alignments of the SwissProt database): INRA, Toulouse, France– http://www.toulouse.inra.fr/prodom/doc/blast_form.html

[This is my favorite protein database - nicely colored results]

Page 50: Previous Lecture:  Multiple Alignment

Sample ProDom Output

Page 51: Previous Lecture:  Multiple Alignment

Profile searching using PSI-BLAST

• Position Specific Iterative

• Perform search – construct profile – perform search

• Convergence (hopefully…)

• Increased sensitivity for distantly related sequences

• Only as good as your first set of hits

• Available on-line (NCBI)

Page 52: Previous Lecture:  Multiple Alignment

Probabilistic Models of Sequence Alignment

• Hidden Markov Models
  – sequence of states and associated symbol probabilities

• Produces a probabilistic model of a sequence alignment

• Align a sequence to a Profile Hidden Markov Model
  – Algorithms exist to find the most efficient pathway through the model

Page 53: Previous Lecture:  Multiple Alignment

Markov Chain: A sequence of ‘things’. The probability of the next thing depends only on the current thing. Based on finite state automata.

Hidden Markov Model: A sequence of states which form a Markov Chain. The states are not observable. The observable characters have “emission” probabilities which depend on the current state.
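These two definitions can be made concrete with a toy two-state HMM and the Viterbi algorithm, a standard way to find the most probable path of hidden states for an observed sequence. Everything numeric below is hypothetical: the state names and all start, transition and emission probabilities were invented for illustration and do not come from the lecture or from any real model.

```python
import math

# Hypothetical toy model: two hidden states that emit DNA bases.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
EMIT = {"AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(observed):
    """Return the most probable hidden-state path for a non-empty observed sequence."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][observed[0]]) for s in STATES}
    backpointers = []
    for symbol in observed[1:]:
        previous, best, pointers = best, {}, {}
        for s in STATES:
            # Markov property: only the previous state matters for the transition.
            prev_state = max(STATES, key=lambda p: previous[p] + math.log(TRANS[p][s]))
            best[s] = (previous[prev_state] + math.log(TRANS[prev_state][s])
                       + math.log(EMIT[s][symbol]))
            pointers[s] = prev_state
        backpointers.append(pointers)
    # Trace back from the best final state to recover the whole path.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("AAAATTTTGGGGCCCC"))
```

The observed bases are visible and the states are not: the algorithm infers where the sequence most plausibly switched from one hidden state to the other.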

Page 54: Previous Lecture:  Multiple Alignment

Hidden Markov Models

• Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis.

• Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next, as well as gaps.

• Pfam is built with HMMs.

• Free HMM software: HMMER

• HMMs can be used for a wide range of bioinformatics problems, not just alignment motifs.

Page 55: Previous Lecture:  Multiple Alignment

Profile HMM

• The sequence at each position is a “hidden state.” The model contains probabilities of transitions between states. The “M” box is a Match, which is further modeled by probabilities for each possible amino acid. There is a specific probability for Insertion “I” and Deletion “D” at each transition.

• Any sequence can be matched to this model, and its best probability calculated. The log-odds score is a measure of probability of a sequence being emitted by an HMM rather than any random (null) model.

Page 56: Previous Lecture:  Multiple Alignment

Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.

Page 57: Previous Lecture:  Multiple Alignment

Discovery of new Motifs

• All of the tools discussed so far rely on a database of existing domains/motifs

• How to discover new motifs
  – Start with a set of related proteins
  – Make a multiple alignment
  – Build a pattern or profile
  – You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.

Page 58: Previous Lecture:  Multiple Alignment

Patterns in Unaligned Sequences

• Sometimes sequences may share just a small common region

  – transcription factors

• MEME: San Diego Supercomputing Facility
http://www.sdsc.edu/MEME/meme/website/meme.html

• Gibbs Sampler

• Sombrero (Self-organizing maps)

Page 59: Previous Lecture:  Multiple Alignment

MEME Details

• The E-value of a motif is based on its log likelihood ratio, width, sites, the background letter frequencies and the size of the training set. The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and site count, that one would find in a similarly sized set of random sequences.

• Each motif describes a pattern of a fixed width, as no gaps are allowed in MEME motifs

• Log likelihood ratio: the logarithm of the ratio of the probability of the occurrences of the motif given the motif model (likelihood given the motif) versus their probability given the background model (likelihood given the null model). (Normally the background model is a 0-order Markov model using the background letter frequencies, but higher order Markov models may be specified via the -bfile option to MEME.)

• Information content: the information content of the motif in bits. It is equal to the sum of the uncorrected information content, R(), in the columns of the LOGO. This is equal to the relative entropy of the motif relative to a uniform background frequency model.

• Relative entropy: the relative entropy of the motif, computed in bits and relative to the background letter frequencies. It is equal to the log-likelihood ratio (llr) divided by the number of contributing sites of the motif times 1/ln(2),

re = llr / (sites * ln(2)).
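The two quantities above can be sketched numerically. The per-column information content uses the standard uncorrected formula for a uniform background, log2(alphabet size) + sum(p * log2 p); the relative entropy function is just the formula quoted above. The function names are made up, and the example column reuses the first column of the count matrix from the Position Scoring Matrix slide (A = 4, C = 16 out of 20 sites) without pseudocounts.

```python
import math

def column_information(freqs, alphabet_size=4):
    """Uncorrected information content (bits) of one column vs a uniform background."""
    return math.log2(alphabet_size) + sum(p * math.log2(p) for p in freqs.values() if p > 0)

def motif_information(columns, alphabet_size=4):
    """Information content of a motif = sum over its columns."""
    return sum(column_information(col, alphabet_size) for col in columns)

def relative_entropy(llr, sites):
    """re = llr / (sites * ln(2)), with llr in natural-log units."""
    return llr / (sites * math.log(2))

conserved = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}   # 2 bits: fully conserved
mixed = {"A": 0.2, "C": 0.8, "G": 0.0, "T": 0.0}       # column 0 of the count matrix
print(round(column_information(conserved), 3))
print(round(column_information(mixed), 3))
print(round(motif_information([mixed, conserved]), 3))
```

A fully conserved DNA column carries 2 bits and a column with equal base frequencies carries 0, so the height of each stack in a sequence logo is this per-column value.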

Page 60: Previous Lecture:  Multiple Alignment

True significance of Motifs?

• All motif sampling methods will find common words in a set of sequences.

• This is essentially a “least common denominator” approach.

• All sets of biological sequences have some words above random frequencies.

• Need to compare to an appropriate background model for motif finding.

• Test found motifs against appropriate positive and negative controls (how to define?)
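One simple negative control, offered here as an illustration rather than as the recommended answer to "how to define?", is to shuffle the input sequence and ask how often the word turns up as many times by chance. The function names, the seed and the toy sequence are assumptions of this sketch. Shuffling single letters keeps base composition but destroys dinucleotide and higher-order structure, so it is a fairly weak background model.

```python
import random

def count_overlapping(seq, word):
    """Count occurrences of word in seq, including overlapping ones."""
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i))

def shuffle_pvalue(seq, word, trials=1000, seed=0):
    """Return (observed count, empirical p-value) against letter-shuffled versions of seq."""
    rng = random.Random(seed)                 # fixed seed keeps the sketch reproducible
    observed = count_overlapping(seq, word)
    letters = list(seq)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(letters)                  # same composition, random order
        if count_overlapping("".join(letters), word) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (trials + 1)

# Hypothetical sequence with five planted copies of CACGTG.
toy = "CACGTG" * 5 + "AT" * 20
observed, p = shuffle_pvalue(toy, "CACGTG", trials=200)
print(observed, p)
```

The smallest p-value this procedure can report is 1 / (trials + 1), so the number of shuffles sets the resolution of the test.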

Page 61: Previous Lecture:  Multiple Alignment

Summary

        0      1      2      3      4      5
A:  -0.19   1.46  -1.42  -1.42  -1.42  -1.42
C:   1.25  -1.42   1.52  -1.42  -1.42  -1.42
G:  -1.42  -1.00  -1.42   1.52  -1.42   1.52
T:  -1.42  -1.42  -1.42  -1.42   1.52  -1.42

• Restriction sites

• Finding genes in DNA sequences

• Regulatory sites in DNA

• Protein signals (transport and processing)

• Protein functional domains & motif databases

• Regular Expressions

• Position Specific Scoring Matrix & Hidden Markov Models

Page 62: Previous Lecture:  Multiple Alignment

Next Lecture: Phylogenetics