Introduction to Bioinformatics - Tutorial no. 5

MEME – Discovering motifs in sequences

MAST – Searching for motifs in databanks

TRANSFAC – The Transcription Factor DB

http://weblogo.berkeley.edu

WebLogo - Input

Aligned sequences (e.g. output of ClustalW)

RUN!

WebLogo - Output

Genes:

Proteins:

MEME

http://meme.sdsc.edu/

Motif discovery from unaligned sequences

Genomic or protein sequences

Identifies profile motifs

Multiple motifs for any input

Flexible model of motif presence:

Motif can be absent in some sequences

Motif can appear several times in one sequence

MEME Input

Email address

Multiple input sequences

How many times in each sequence?

How many motifs?

How many sites?

Range of motif lengths

MEME Output (1)

Motif length

Number of times

Like BLAST

“Position-Specific Probability Matrix” = Motif Profile

Divergence of each motif position from background

Most popular symbols
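The labels above (probability matrix, divergence from background, most popular symbols) can be illustrated with a small sketch. It is not MEME's own code: the four example sites are hypothetical, the background is assumed uniform, and the function names are made up for this illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def pspm(sites, alphabet=ALPHABET):
    """Return one {symbol: probability} dict per motif column."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        matrix.append({a: counts[a] / len(sites) for a in alphabet})
    return matrix

def information_content(column, background=None):
    """Relative entropy (in bits) of one column against the background."""
    if background is None:
        # Assumption: uniform background over the alphabet.
        background = {a: 1 / len(column) for a in column}
    return sum(p * math.log2(p / background[a])
               for a, p in column.items() if p > 0)

def consensus(matrix):
    """Most frequent symbol at each position."""
    return "".join(max(col, key=col.get) for col in matrix)

# Hypothetical aligned motif instances, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
matrix = pspm(sites)
for pos, col in enumerate(matrix, start=1):
    print(pos, col, round(information_content(col), 2))
print(consensus(matrix))
```

A fully conserved DNA column reaches 2 bits against a uniform background; a column that matches the background carries 0 bits.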

MEME Output (2)

Sequence names

Reverse complement (genomic input only)

Position in sequence

Strength of match

Motif within sequence

MEME Output (3)

Overall strength of motif matches

Original sequence lengths

Motif instance

MAST

Searches for motifs (one or more) in sequence databases:

Like BLAST, but with motifs as input

Similar to iterations of PSI-BLAST

Profile defines strength of match

Multiple motif matches per sequence

Combined E value for all motifs

MEME uses MAST to summarize results: each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
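How a profile "defines strength of match" can be sketched as a log-odds scan of a sequence on both strands. This is only an illustration: MAST itself converts match scores to p-values and combines them into an E value, which is not reproduced here. The uniform background, the pseudocount, and all function names are assumptions.

```python
import math

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def log_odds_matrix(pspm, background=0.25, pseudo=0.01):
    """Turn column probabilities into log2(p / background) scores.

    The pseudocount (an assumed value) avoids log(0) for unseen symbols.
    """
    norm = 1 + len(ALPHABET) * pseudo
    return [{a: math.log2((col[a] + pseudo) / norm / background)
             for a in ALPHABET} for col in pspm]

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def scan(seq, lom):
    """Yield (score, start, strand) for every window on both strands.

    start is always given in forward-strand coordinates (0-based).
    The sequence must contain only A, C, G and T.
    """
    width = len(lom)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            score = sum(lom[j][s[i + j]] for j in range(width))
            start = i if strand == "+" else len(seq) - width - i
            yield score, start, strand

def best_hit(seq, lom):
    return max(scan(seq, lom))

# Hypothetical profile that matches exactly "TGAC".
profile = [{a: 1.0 if a == b else 0.0 for a in ALPHABET} for b in "TGAC"]
lom = log_odds_matrix(profile)
print(best_hit("AATGACAA", lom))   # match on the given strand
print(best_hit("AAGTCAAA", lom))   # match on the reverse complement
```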

MAST Input

Email address

Database (like BLAST)

Motif file (e.g. MEME output)

Consider matched sequence length

E value threshold

MAST Output (1)

Matched accession

Match E value

Length of sequence

Link to GenBank

MAST Output (2)

Motif diagram

MAST Output (3)

Position of each instance

P value of instance

Matched parts of sequence

Motif ‘consensus’

Motif and orientation

TRANSFAC

Database of eukaryotic DNA transcription regulation:

Individual regulatory sites (SITES table): genes to which they belong, proteins which bind them

Proteins which bind sites (FACTORS table): cellular source of protein, nucleotide motif profile for binding, some grouping and classification

Classification of factors (CLASS table)

Position-specific matrices for select factors (MATRIX table)

Cell localization (CELL table)

Searching TRANSFAC

www.gene-regulation.com

Search a single table: by identifier, factor name, gene name; by species, author

Browse your way from table to table

Search within a sequence: MatInspector, TFScan (EMBOSS package)

TRANSFAC Factor

DT: Date; author

FA: Factor name

GE: Encoding gene

SF: Structural features

CP: Cell specificity (positive)

CN: Cell specificity (negative)

EX: Expression pattern

FF: Functional features

IN: Interacting factors

MX: Matrix

BS: Binding site

DR: External databases

References:

RN: Reference no.

RX: MEDLINE ID

RA: Reference authors

RT: Reference title

RL: Reference data

TRANSFAC Matrix

Accession

Position Specific Matrix

Statistical basis

Consensus (subset of IUPAC symbols)
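A consensus written in a subset of IUPAC symbols (single bases, two-base codes, and N) can be derived from a count matrix roughly as follows. The thresholds below are an assumption chosen for illustration, not a rule confirmed by these slides, and the count matrix is made up.

```python
# Two-base IUPAC ambiguity codes.
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("AT"): "W", frozenset("CG"): "S",
}

def iupac_symbol(counts):
    """Pick one symbol for a column of {base: count}.

    Assumed rule (illustrative only):
      - a single base if it has more than 50% of the counts and more than
        twice the count of the runner-up;
      - otherwise a two-base code if the top two bases together have more
        than 75% of the counts;
      - otherwise N.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    (b1, c1), (b2, c2) = ranked[0], ranked[1]
    if c1 > 0.5 * total and c1 > 2 * c2:
        return b1
    if c1 + c2 > 0.75 * total:
        return IUPAC_PAIRS[frozenset(b1 + b2)]
    return "N"

def iupac_consensus(count_matrix):
    return "".join(iupac_symbol(col) for col in count_matrix)

# Hypothetical count matrix with four positions.
counts = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 5, "C": 0, "G": 5, "T": 0},
    {"A": 3, "C": 3, "G": 2, "T": 2},
    {"A": 1, "C": 6, "G": 0, "T": 3},
]
print(iupac_consensus(counts))
```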

TRANSFAC Site (1)

Accession number

DNA or RNA

Gene

Gene region

Sequence of regulatory element

Position range of factor binding site

TRANSFAC Site (2)

Binding factor accession

Factor name

Binding ‘quality’:

1 functionally confirmed

2 binding of pure protein

3 immunologically characterized extract

4 via known binding sequence

5 extract protein binding to bona fide element

6 unassigned

Organism

Cellular source

Methods of identifying site

External links

TRANSFAC Factor (1)

AC: Accession number

FA: Factor name

SY: Other names

OS: Organism

OC: Taxonomy

HO: Homologs

CL: Classification

SZ: Size

SQ: Amino acid sequence

TRANSFAC Factor (2)

Protein sequence reference

Features and positions

Structural features

Cell specificity

Question

A biologist at your university has found 15 target genes that she thinks are co-regulated. She gives you the 15 upstream regions, each 50 base pairs long, in FASTA format (file DNASample50.txt) and asks you to identify the motif and, if possible, the potential regulating protein. She tells you the sequences are from Homo sapiens, and her intuition is that the motif is 8 bases long. She wants you to suggest only the best candidate motif.
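Before submitting the file, it may be worth checking that it really holds 15 sequences of 50 bases each. The following is a minimal sketch; the function names are hypothetical, and the demonstration uses two tiny made-up records rather than the contents of DNASample50.txt.

```python
def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first word of the header as the name.
            name = parts[0] if parts else f"seq{len(records) + 1}"
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in records.items()}

def check_input(records, n_expected=15, length_expected=50):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    if len(records) != n_expected:
        problems.append(
            f"expected {n_expected} sequences, found {len(records)}")
    for name, seq in records.items():
        if len(seq) != length_expected:
            problems.append(
                f"{name}: length {len(seq)}, expected {length_expected}")
        unexpected = sorted(set(seq) - set("ACGTN"))
        if unexpected:
            problems.append(f"{name}: unexpected symbols {unexpected}")
    return problems

# Made-up demonstration input; for the exercise one would instead use
#   with open("DNASample50.txt") as handle:
#       records = read_fasta(handle)
demo = read_fasta([">a upstream region", "ACGT", "acgt", ">b", "ACGTAC"])
print(check_input(demo, n_expected=2, length_expected=8))
```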

Question

After you have run all the programs, your biologist friend confesses that she is not sure whether her intuition about the motif length was correct. Re-run the tool without specifying the motif length. Do you get the same results?
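The tutorial uses the MEME web form, but the two runs (fixed length 8, then no length given) can also be sketched for a locally installed command-line `meme`. This assumes the MEME Suite is installed and on the PATH; the output directory names and the choice of the `zoops` model (zero or one occurrence per sequence) are assumptions, not settings given in the exercise.

```python
import os
import shutil
import subprocess

def meme_command(fasta, out_dir, width=None, min_width=None,
                 max_width=None, nmotifs=1, model="zoops"):
    """Build the argument list for a command-line MEME run on DNA input."""
    cmd = ["meme", fasta, "-dna", "-oc", out_dir,
           "-mod", model, "-nmotifs", str(nmotifs), "-revcomp"]
    if width is not None:
        cmd += ["-w", str(width)]            # exact motif width
    else:
        if min_width is not None:
            cmd += ["-minw", str(min_width)]  # lower bound on width
        if max_width is not None:
            cmd += ["-maxw", str(max_width)]  # upper bound on width
    return cmd

# First question: only the best motif, width fixed at 8.
fixed = meme_command("DNASample50.txt", "meme_w8", width=8)
# Second question: no width given, so MEME searches its default range.
free = meme_command("DNASample50.txt", "meme_free")

print(" ".join(fixed))
print(" ".join(free))

# Run only when both the program and the input file are available.
if shutil.which("meme") and os.path.exists("DNASample50.txt"):
    for cmd in (fixed, free):
        subprocess.run(cmd, check=True)
```

Comparing the motif reported by the two runs answers whether the length-8 assumption changes the result.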

Determine a potential DNA binding protein using TRANSFAC