Patterns and regular expressions for analysis of ...bio.lundberg.gu.se/courses/ht05/bio1/bjorn_olsson.pdf · 5 Zink finger domain: – DNA binding (usually by 3 or more zink fingers)


Patterns and regular expressions for analysis of biological sequences

Björn Olsson
Bioinformatics Research Group
University College of Skövde

[email protected]

Ajit Narayanan, Professor

Björn Olsson, Senior lecturer

Dan Lundh, Lecturer

Kim Laurio

Simon Lundell

Zelmina Lubovac

Jonas Gamalielsson

Jane Synnergren

Angelica Lindlöf

PhD students


Why patterns?

• Patterns summarize the most important features of a group of sequences.

• Patterns often correspond to functional or structural features. Example: When constructing patterns, PROSITE developers look for motifs corresponding to:
– Cysteines forming disulphide bonds
– Regions involved in binding to other molecules
– Catalytically active regions
– etc.

• Pattern matching is easy to implement and pattern search is extremely fast

• Patterns are interpretable and simple models

CMLPIFECKQFSECNSYTGRCECIEGFAGDDC
CDQNPCLNGGACLPYLINEVTHLYTCTCENGFQGDKC
CFQSDCKNDGFCQSPSDEYACTCQPGFEGDDC
CFDNNCEYQCQPVGRSEHKCICAEGFAPVPGAPHKC
CTIKNQCMNFGTCEPVLQGNAIDFICHCPVGFTDKVC
CGILNGCENGRCVRVQEGYTCDCLDGYHLDTAKMTC
CKTKEAGFVCKHGCRSTGKAYECTCPSGSTVAEDGITC
CRKNACLHNAECRNTWNDYTCKCPNGYKGKKC
CPRNVSECSHDCVLTSEGPLCFCPEGSVLERDGKTC
CTDFVDVPCSHFCNNFIGGYFCSCPPEYFLHDDMKNC
CLDNNGGCSHVCNDLKIGYECLCPDGFQLVAQRRC
CEEDNGGCSHLCLLSPREPFYSCACPTGVQLQDNGKTC
CADGNGGCSHLCFFTPRATKCGCPIGLELLSDMKTC
CARDNGGCSHICIAKGDGTPRCSCPVHLVLLQNLLTC
CKLRKGNCSSTVCGQDLQSHLCMCAEGYALSRDRKYC
CPKANCSHFCIDRRDVGHQCFCAPGYILSENQKDC
CDSHPCLHGGTCEDDGREFTCRCPAGKGGAVC
CVLEPNCIHGTCNKPWTCICNEGWGGLYC
CNPLPCNEDGFMTCKDGQATFTCICKSGWQGEKC
CNAEFQNFCIHGECKYIEHLEAVTCKCQQEYFGERC
CPLSHDGYCLHDGVCMYIEALDKYACNCVVGYIGERC
CPKQYKHYCIKGRCRFVVAEQTPSCVCDEGYIGARC
CNDDYKNYCLNNGTCFTVALNNVSLNPFCACHINYVGSRC


CMLPIF..ECKQF..SECNSYT..........GRCECIEG.....FAGDDC
CDQN....PCLNG..GACLPYLINE...VTHLYTCTCENG.....FQGDKC
CFQS....DCKND..GFCQSPSD........EYACTCQPG.....FEGDDC
CFDN....NCEY....QCQPVGR.......SEHKCICAEGFAPVPGAPHKC
CTIKN...QCMNF..GTCEPVLQG....NAIDFICHCPVG.....FTDKVC
CGILN...GCEN...GRCVRVQEG........YTCDCLDG.YHLDTAKMTC
CKTKEAGFVCKH....GCRSTGK........AYECTCPSG.STVAEDGITC
CRKN....ACLHN..AECRNTW........NDYTCKCPNG.....YKGKKC
CPRNVS..ECS....HDCVLTSE........GPLCFCPEG.SVLERDGKTC
CTDFVDV.PCSH....FCNNFIG........GYFCSCPPE.YFLHDDMKNC
CLDNN..GGCSH....VCNDLKI........GYECLCPDG..FQLVAQRRC
CEEDNG..GCSH....LCLLSPR......EPFYSCACPTG.VQLQDNGKTC
CADGNG..GCSH....LCFFTPR........ATKCGCPIG.LELLSDMKTC
CARDNG..GCSH....ICIAKGD.......GTPRCSCPVH.LVLLQNLLTC
CKLRKG..NCSS...TVCGQDLQ........SHLCMCAEG.YALSRDRKYC
CPKA....NCSH....FCIDRRD.......VGHQCFCAPG.YILSENQKDC
CDSH....PCLHG..GTCEDDG........REFTCRCPAG.....KGGAVC
CVLEP...NCIH...GTCNKP...........WTCICNEG.....WGGLYC
CNPL....PCNEDGFMTCKDGQ........ATFTCICKSG.....WQGEKC
CNAEFQN.FCIH...GECKYIEH......LEAVTCKCQQE.....YFGERC
CPLSHDG.YCLHD..GVCMYIEALD......KYACNCVVG.....YIGERC
CPKQYKH.YCIK...GRCRFVVA......EQTPSCVCDEG.....YIGARC
CNDDYKN.YCLNN..GTCFTVALNN...VSLNPFCACHIN.....YVGSRC


EGF-like domain:
- First identified in epidermal growth factor, but present in many other proteins
- Function unclear, but found in extracellular part of membrane proteins
- Includes six cysteines involved in disulfide bonds

[Figure: the alignment above with the six conserved cysteines numbered 1 to 6. Disulfide bridges connect cysteines 1-3, 2-4 and 5-6.]

C-x(0,48)-C-x(3,12)-C-x(1,70)-C-x(1,6)-C-x2-G-x(0,21)-G-x2-C

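Zinc finger domain:
– DNA binding (usually by 3 or more zinc fingers)
– The C2H2 zinc finger has a loop of 12 amino acids, anchored by two cysteines and two histidines coordinating a zinc atom.

[Figure: C2H2 zinc finger. Two cysteines (C) and two histidines (H) coordinate the zinc atom (Z); the loop between them binds DNA. One loop position is aromatic or aliphatic: LIVMFYWC.]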

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

Thiolases. Enzymes of two types:
– Thiolase I involved in degradative pathways such as fatty acid beta-oxidation
– Thiolase II involved in biosynthetic pathways such as poly beta-hydroxybutyrate synthesis or steroid biogenesis

[LIVM]-[NST]-x(2)-C-[SAGLI]-[ST]-[SAG]-[LIVMFYNS]-x-[STAG]-[LIVM]-x(6)-[LIVM]

TSNIAREAALLAGVPDKIPAHTVTLACISSNVAMTTGMGMLATGNANAIIAGGVELL
TSNVAREAALGAGFSDKTPAHTVTMACISSNQAMTTAVGLIASGQCDVVVAGGVELM
GQNPARQAAIKAGLPAMVPAMTINKVCGSGLKAVMLAANAIMAGDAEIVVAGGQENM
GQNPARQTTLMAGLPHTVPAMTINKVCGSGLKAVHLAMQAVACGDAEIVIAGGQESM
GQNPARQAHIKVGLPRESAAWVINQVCGSGLRTVALAAQQVLLGDARIVVAGGQESM
GQNPARQAAMKAGVPQEATAWGMNQLCGSGLRAVALGMQQIATGDASIIVAGGMESM
GQNPARQSAIKGGLPNSVSAITINDVCGSGLKALHLATQAIQCGEADIVIAGGQENM
GQNPARQALLKSGLAETVCGFTVNKVCGSGLKSVALAAQAIQAGQAQSIVAGGMENM
GQNPARQAALKAGIEKEIPSLTINKVCGSGLKSVALGAQSIISGDADIVVVGGMENM
GQNPARQASFKAGLPVEIPAMTINKVCGSGLRTVSLAAQIIKAGDADVIIAGGMENM
GQMPARQAAVAAGIGWDVPALTINKMCLSGLDAIALADQLIRAGEFDVVVAGGQESM
GQIPSRLAARLAGMPWSVPSETLNKVCASGLRAVTLCDQMIRAQDADILVAGGMESM
GQAPARQVALKAGLPDSIIASTINKVCASGMKAVIIGAQNIICGTSDIVVVGGAESM

Basic syntax for PROSITE patterns

[AS]     A or S allowed.
D        D allowed.
x        Any symbol.
x4       Four arbitrary symbols.
{PG}     Any symbol except P and G.
[FY]2    Two positions where F or Y allowed.
x(3,7)   Minimum 3 and maximum 7 residues.

D-[IVL]-x(4)-{PG}-C-x(4,10)-[DE]-R-[FY]2-Q

Additional syntax (rarely used)

• Some patterns are anchored to the beginning or end of the sequence:

[KRHQSA]-[DENQ]-E-L>

must match at the end of the sequence.

VTLACISSNVAMTTGMGMLATGNANAIIAGQNEL
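The syntax rules above map almost one-to-one onto ordinary regular expressions. The following is a minimal Python sketch of such a translation, not a tool from the slides: the function name `prosite_to_regex` is hypothetical, only the elements listed above are handled, and the `<` start anchor is assumed to work as the counterpart of the `>` end anchor shown on the slide.

```python
import re

# One pattern element: a symbol (x, a residue, [..] or {..}) plus an optional
# repeat written either as (n), (n,m) or as a bare number (x4, [FY]2).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\)|(\d+))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into Python regular-expression syntax."""
    pattern = pattern.strip().rstrip(".")
    prefix = "^" if pattern.startswith("<") else ""   # anchored at sequence start
    suffix = "$" if pattern.endswith(">") else ""     # anchored at sequence end
    parts = []
    for element in pattern.strip("<>").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        symbol, low, high, bare = match.groups()
        if symbol == "x":
            regex = "."                               # any symbol
        elif symbol.startswith("{"):
            regex = "[^" + symbol[1:-1] + "]"         # any symbol except these
        else:
            regex = symbol                            # residue or [..] class
        low = low or bare
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return prefix + "".join(parts) + suffix

print(prosite_to_regex("D-[IVL]-x(4)-{PG}-C-x(4,10)-[DE]-R-[FY]2-Q"))
print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>"))
```

The resulting string can be passed to `re.search` to scan a sequence, for example the anchored pattern against the example sequence ending in QNEL.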


Wildcards

G-x2-[LIV]-x(4,7)-R-x-[GA]

• x represents an arbitrary symbol.
• xN represents a sub-sequence of N arbitrary symbols.
• x(i,j) represents a sub-sequence of arbitrary symbols with minimum length i and maximum length j.

Wildcards

G-x2-[LIV]-x(4,7)-R-x(5,6)-[GA]

• The wildcard x(i,j) has flexibility j-i+1.

• Total flexibility of a pattern can be defined as the product of the flexibilities of its wildcards, i.e. in this example pattern (7-4+1) × (6-5+1) = 4 × 2 = 8.

• High flexibility increases search time, since the number of potential matches grows exponentially with the number of flexible wildcards.

A-x(4,7)-C-x(5,6)-D

ATLACCCCCVAMTTGMDMLATGNANAIIAGQNEL
A....C.....D
A....C......D
A.....C.....D
A.....C......D
A......C.....D
A......C......D
A.......C.....D
A.......C......D

• Flexibility N means N potential matches per position.
• Pattern construction methods typically search for low-flexibility patterns.
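As a small illustration of the flexibility calculation, the Python sketch below computes the flexibility of each x(i,j) wildcard and lists the wildcard-length combinations a matcher may have to try at each start position. The helper names are hypothetical, and only wildcards written as x(i,j) are considered.

```python
import re
from itertools import product
from math import prod

WILDCARD = re.compile(r"x\((\d+),(\d+)\)")

def wildcard_ranges(pattern):
    """Return (i, j) for every x(i,j) wildcard in a PROSITE pattern."""
    return [(int(i), int(j)) for i, j in WILDCARD.findall(pattern)]

def total_flexibility(pattern):
    """Product of the per-wildcard flexibilities j - i + 1."""
    return prod(j - i + 1 for i, j in wildcard_ranges(pattern))

def gap_choices(pattern):
    """All combinations of wildcard lengths that may need to be tried."""
    return list(product(*(range(i, j + 1) for i, j in wildcard_ranges(pattern))))

print(total_flexibility("G-x2-[LIV]-x(4,7)-R-x(5,6)-[GA]"))
print(gap_choices("A-x(4,7)-C-x(5,6)-D"))
```

For A-x(4,7)-C-x(5,6)-D the eight length combinations correspond to the eight dotted layouts listed under the sequence above.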

Zinc-containing alcohol dehydrogenases
G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]

Bacterial quinoprotein dehydrogenases
[DEN]-[WV]-x(3)-G-[RKNM]-x(6)-[FYW]-[SV]-x(4)-[LIVM]-N-x(2)-N-V-x(2)-L-[RKT]

Phosphoketolase
E-G-G-E-L-G-Y

Ferrochelatase
[LIVMF]-[LIVMFC]-x-[ST]-x-H-[GS]-[LIVM]-P-x(4,5)-[DENQKRLHAFSTI]-x-[GN]-[DPC]-x(1,4)-[YA]

• Note that general elements also contribute to increased search time by increasing the number of matches that will be evaluated.

PROSITE
http://www.expasy.org/prosite/

• First and most "famous" pattern database.
• Release 19.10 (12-Sep-2005) contains 1374 documentation entries that describe 1328 patterns, 4 rules and 550 profiles/matrices.
• ".. with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs".
• But:
– Some domains are not informative of family
– Some patterns (10-15) for post-translational modifications
– PROSITE patterns often have low sensitivity

PROSITE’s structure

Documentation of family (PDOCxxxxx)

Pattern files (PSxxxxx)

Note:
- Some PS entries contain profiles or rules
- Many PDOC entries describe domains, rather than families

General information about the entry
Entry name: FLAVODOXIN
Accession number: PS00201
Entry type: PATTERN
Date: APR-1990 (CREATED); DEC-2004 (DATA UPDATE); AUG-2005 (INFO UPDATE).
PROSITE documentation: PDOC00178

Name and characterization of the entry
Description: Flavodoxin signature.
Pattern: [LIV]-[LIVFY]-[FY]-x-[ST]-{V}-x-[AGC]-x-T-{P}-x(2)-A-x(2)-[LIV].

Numerical results
• UniProtKB/Swiss-Prot release number: 48.0, total number of sequence entries in that release: 194317.
• Total number of hits in UniProtKB/Swiss-Prot: 55 hits in 55 different sequences
• Number of hits on proteins that are known to belong to the set under consideration: 51 hits in 51 different sequences
• Number of hits on proteins that could potentially belong to the set under consideration: 0 hits in 0 different sequences
• Number of false hits (on unrelated proteins): 4 hits in 4 different sequences
• Number of known missed hits: 9
• Number of partial sequences which belong to the set under consideration, but which are not hit by the pattern or profile because they are partial (fragment) sequences: 2
• Precision (true hits / (true hits + false positives)): 92.73 %
• Recall (true hits / (true hits + false negatives)): 85.00 %

Comments
• Taxonomic range: Archaebacteria, Eukaryotes, Prokaryotes (Bacteria)
• Maximum known number of repetitions of the pattern in a single protein: 1
• VERSION: 1

Cross-references

UniProtKB/Swiss-Prot

True positive hits: DFA2_SYNEL (Q8DJ55), DFA3_SYNY3 (P74373), DFA4_ANASP (Q8YNW7), DFA4_SYNY3 (P72721), DFA5_ANASP (Q8Z0C1), DFA6_ANASP (Q8YQE2), FLAV_ANASO (P0A3E0), FLAV_ANASP (P0A3D9), FLAV_AQUAE (O67866), FLAV_AZOCH (P23001), FLAV_AZOVI (P00324), FLAV_BUCAI (P57385), FLAV_BUCAP (Q8K9N5), FLAV_BUCBP (Q89AK0), FLAV_CHOCR (P14070), FLAV_CLOBE (P00322), FLAV_CLOSA (P18855), FLAV_DESDE (P26492), FLAV_DESGI (Q01095), FLAV_DESSA (P18086), FLAV_DESVH (P00323), FLAV_DESVM (P71165), FLAV_ECO57 (P61951), FLAV_ECOL6 (P61950), FLAV_ECOLI (P61949), FLAV_ENTAG (P28579), FLAV_HAEIN (P44562), FLAV_KLEPN (O07026), FLAV_MEGEL (P00321), FLAV_NOSSM (P35707), FLAV_RHOCA (P52967), FLAV_SALTY (Q8ZQX1), FLAV_SHIFL (Q83S80), FLAV_SYNP2 (P31158), FLAV_SYNP7 (P10340), FLAV_SYNY3 (P27319), FLAV_TREPA (O83895), FLAV_TRIER (O52659), FLAW_BACSU (O34589), FLAW_DESDE (P80312), FLAW_DESGI (Q01096), FLAW_ECOLI (P41050), FLAW_ENTAG (P71169), FLAW_KLEOX (P56268), FLAW_KLEPN (P04668),

Access to PROSITE

• Quick search in PROSITE by AC, ID or documentation text (prefix and append wildcard '*' to words)
• Browse PROSITE documentation entries
• Search by author
• Search by citation
• Search by description
• Search by full text search
• SRS - Sequence Retrieval System
• Download by FTP

Tools for PROSITE

• Quick scan - Scan PROSITE patterns, profiles and rules with a Swiss-Prot/TrEMBL AC, ID or your own pasted sequence (for more options, use the ScanProsite form); patterns with a high probability of occurrence can be excluded
• ScanProsite - Scan a sequence against PROSITE or a pattern against Swiss-Prot or PDB and visualize matches on structures with graphical view and feature detection
• MotifScan - Scan a sequence against the profile entries in PROSITE and Pfam
• InterProScan - Scan a sequence against all the motif databases in InterPro
• ps_scan - Perl program to scan PROSITE locally
• pftools - Standalone programs to create and scan PROSITE profiles
• PRATT - Interactively generates conserved patterns from a series of unaligned proteins
• Other pattern and profile search tools

Patterns with high probability of occurrence

ID PKC_PHOSPHO_SITE; PATTERN.
AC PS00005;
DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE Protein kinase C phosphorylation site.
PA [ST]-x-[RK].
CC /TAXO-RANGE=??E?V;
CC /SITE=1,phosphorylation;
CC /SKIP-FLAG=TRUE;
CC /VERSION=1;
DO PDOC00005;

Classification accuracy

• Also called ”discriminatory power” or ”diagnostic power”.

• Ideally PROSITE should be a database of diagnostic patterns:

”.. with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs”.


• Positive: Sequence matching a given pattern.
• Negative: Sequence not matching a given pattern.
• True: A correct result.
• False: An incorrect result.
• True positive: A matching family sequence.
• True negative: A non-matching non-family sequence.
• False positive: A matching non-family sequence.
• False negative: A non-matching family sequence.

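[Figure: Venn diagram. Within the set of all sequences, the family sequences and the matching (positive) sequences overlap. The overlap is TP, family sequences outside it are FN, matching sequences outside it are FP, and all remaining (negative, non-family) sequences are TN.]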

We want to minimize FN + FP.

A diagnostic pattern matches all family sequences and no other sequence, i.e. FN = FP = 0.

Classification accuracy

• PROSITE's PS files give the following measures for each pattern:

Precision (specificity): $\frac{TP}{TP+FP}$

Recall (sensitivity): $\frac{TP}{TP+FN}$

• Ex: The flavodoxin family (PS00201)
True positives: 51
False positives: 4
False negatives: 9

Precision: $\frac{51}{51+4} \approx 0.93$

Recall: $\frac{51}{51+9} = 0.85$
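The two measures are easy to recompute from the counts reported in a PROSITE entry. The Python sketch below (function names are illustrative) reproduces the flavodoxin figures from TP = 51, FP = 4 and FN = 9.

```python
def precision(tp, fp):
    """Fraction of matching sequences that really belong to the family."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of family sequences that the pattern matches."""
    return tp / (tp + fn)

# Flavodoxin signature PS00201: TP = 51, FP = 4, FN = 9
print(f"precision = {precision(51, 4):.4f}")
print(f"recall    = {recall(51, 9):.4f}")
```

These values agree with the 92.73 % precision and 85.00 % recall listed in the entry's numerical results.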


Classification accuracy varies...

Family TP FP FN Spec Sens

14_3_3        110    0    0   1.00  1.00
POU domain     84    0    0   1.00  1.00
Germin         42    0    6   1.00  0.87
Claudin        62    6    0   0.91  1.00
Cecropin       67   13   13   0.81  0.81
Copper blue    80   34    8   0.70  0.91
Lipocalin      90  118   45   0.43  0.67

• Average values ~0.95 for both Spec and Sens.
• But: this is classification accuracy on the training set!
• On test set: Spec ~0.7, Sens ~0.3

PROSITE matches

[Figure: bar chart. x-axis: number of matching patterns (0 to 5); y-axis: % of sequences (0 to 70).]

Oat cold acclimation project
• 10,000 ESTs
• 2,500 genes
• 63% characterised
• 23% by PROSITE

[Figure: pie chart of functional categories. Unknown 21%; Unknown, oat specific 16%; Metabolism 16%; Signal transduction and perception 7%; Transport and secretion 6%; Transcription 6%; Protein synthesis 6%; Energy 6%; Cell growth, development, cell cycle, maintenance 5%; Cell rescue, defence, stress, aging, death 5%; Protein catabolism 4%; Others 2%.]

38% have a PROSITE match, but not all matches are useful for characterization.

PROSITE’s semi-automated method

1. Study literature on the protein family.

2. Study one or more alignments of sequences belonging to the family and look for motifs corresponding to, for example:
– Catalytically active region
– Cysteines forming disulphide bonds
– Regions involved in binding to other molecules (including other proteins)

3. Identify a conserved sequence of 4-5 residues within the chosen motif.

4. Search SWISS-PROT for sequences matching this initial pattern.

5. Extend the pattern until its classification accuracy is ”adequate”.

Automated pattern construction

• Patterns can be built bottom-up or top-down:
– BU: Enumerate possible patterns, and rank them according to the number of occurrences in the sequences.
– TD: Compare (align) the sequences and find similarities from which patterns can be built.
• Advantages of BU:
– Guarantees finding the best pattern for a given pattern size
– Not affected by multiple alignment quality
• Advantages of TD:
– Faster
– Conserved positions in multiple alignment are likely to be biologically meaningful

PRATT
http://www.ii.uib.no/~inge/Pratt.html

• Uses a heuristic bottom-up approach.

Sequences:
tcgacgga
ttgacgca
gctacgaa

Enumeration of patterns (pattern and its count):

Length 2: aa 1, ac 3, ag 0, at 0, ca 1, cc 0, cg 3, ct 1, ....
Length 3: aca 0, acc 0, acg 3, act 0, cga 1, cgc 1, cgg 1, cgt 0
Length 4: aacg 0, cacg 0, gacg 2, tacg 1
Length 5: agacg 0, cgacg 1, ggacg 0, tgacg 1

Search time can be reduced by ordering of candidate patterns and pruning of the search space.
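A bottom-up enumeration of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not PRATT's actual algorithm: it assumes the 24 letters on the slide are three sequences of eight letters each, counts how many sequences contain each substring, and stops growing the length as soon as no substring reaches the required support (a simple form of pruning, since every part of a frequent pattern must itself be frequent).

```python
from collections import Counter

# Assumed split of the slide's example into three sequences of length 8.
SEQUENCES = ["tcgacgga", "ttgacgca", "gctacgaa"]

def count_supporting_sequences(sequences, length):
    """For each substring of the given length, count the sequences containing it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each sequence is counted at most once per substring.
        counts.update({seq[i:i + length] for i in range(len(seq) - length + 1)})
    return counts

def frequent_patterns(sequences, min_support):
    """Collect all substrings contained in at least min_support sequences."""
    result = {}
    length = 1
    while True:
        counts = count_supporting_sequences(sequences, length)
        kept = {p: n for p, n in counts.items() if n >= min_support}
        if not kept:          # nothing frequent at this length: stop extending
            return result
        result.update(kept)
        length += 1

print(count_supporting_sequences(SEQUENCES, 4)["gacg"])
print(max(frequent_patterns(SEQUENCES, 3), key=len))
```

With the assumed split, the counts for ac, cg, acg, gacg, tacg, cgacg and tgacg agree with the numbers on the slide.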


• A candidate pattern can be generalized so that it matches more sequences.
• How to choose between alternative generalizations?
• PRATT uses pattern information content as a guide, and favours more informative patterns.

g-a-c-g (2) → [gt]-a-c-g (3)

g-a-c-g (2) → g-x(0,2)-a-c-g (3)
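The effect of the two generalizations can be checked by counting matching sequences. The Python sketch below assumes the same three example sequences as in the PRATT enumeration slide (split into eight letters each) and writes the patterns directly as regular expressions; the `support` helper is hypothetical.

```python
import re

# Assumed split of the example sequences from the enumeration slide.
SEQUENCES = ["tcgacgga", "ttgacgca", "gctacgaa"]

def support(regex, sequences):
    """Number of sequences that contain at least one match of the regex."""
    return sum(1 for seq in sequences if re.search(regex, seq))

print(support("gacg", SEQUENCES))         # g-a-c-g
print(support("[gt]acg", SEQUENCES))      # [gt]-a-c-g
print(support("g.{0,2}acg", SEQUENCES))   # g-x(0,2)-a-c-g
```

Both generalizations raise the support from two to three sequences, which is why an additional criterion such as information content is needed to choose between them.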

Score of pattern P is defined as:

$$I(P) = \sum_{i=1}^{p} I(A_i) - c \cdot \sum_{k=1}^{p-1} (j_k - i_k)$$

where:
p = number of elements
c = a constant (0.5)
i_k = lower length for wildcard k
j_k = upper length for wildcard k

The second term measures information content of wildcards. For the pattern C-x(3,5)-[DE]-x(2)-A we get:

$$c \cdot \sum_{k=1}^{p-1} (j_k - i_k) = 0.5 \cdot \big((5-3) + (2-2)\big) = 1.0$$

The first term measures information content of symbol elements, which is defined as:

$$I(A_i) = -\sum_{a \in A} p_a \log(p_a) + \sum_{a \in A_i} \frac{p_a}{p_{A_i}} \log\frac{p_a}{p_{A_i}}$$

where:
A = the amino acid alphabet
A_i = the set of symbols allowed at element i
p_a = probability of amino acid a
$p_{A_i} = \sum_{a \in A_i} p_a$

The pattern C-x(3,5)-[DE]-x(2)-A contains symbols CDEA. For example, with base-10 logarithms:

for C ($p_C = 0.02$): $0.02 \cdot \log 0.02 = -0.034$
for A ($p_A = 0.08$): $0.08 \cdot \log 0.08 = -0.088$
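The score can be computed directly from the two terms. The Python sketch below is an illustration under stated assumptions: it uses base-10 logarithms (which reproduce the slide's values 0.02 · log 0.02 ≈ -0.034), and, because the slide gives only two amino acid probabilities, it uses a uniform background of 0.05 per amino acid purely as a placeholder. The function names are hypothetical.

```python
import math

# Placeholder background: uniform frequencies (an assumption, not real data).
UNIFORM = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

def element_information(allowed, freqs):
    """I(A_i): background entropy plus the within-element term."""
    background = -sum(p * math.log10(p) for p in freqs.values())
    p_set = sum(freqs[a] for a in allowed)
    within = sum((freqs[a] / p_set) * math.log10(freqs[a] / p_set) for a in allowed)
    return background + within

def wildcard_penalty(wildcards, c=0.5):
    """c times the summed flexibility (j - i) of the wildcards."""
    return c * sum(j - i for i, j in wildcards)

def pattern_information(elements, wildcards, freqs, c=0.5):
    """I(P) = sum of element information minus the wildcard penalty."""
    symbol_term = sum(element_information(e, freqs) for e in elements)
    return symbol_term - wildcard_penalty(wildcards, c)

# C-x(3,5)-[DE]-x(2)-A: elements C, [DE], A and wildcards x(3,5), x(2,2)
print(wildcard_penalty([(3, 5), (2, 2)]))
print(pattern_information(["C", "DE", "A"], [(3, 5), (2, 2)], UNIFORM))
```

With real amino acid frequencies the element scores would differ; only the wildcard penalty of 1.0 is independent of the background.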

PRATT parameters

C%   Minimum percentage of matching seqs
PP   Pattern position (anywhere, complete seq, start)
PL   Max pattern length (C-x(3,5)-[DE] has length 7)
PN   Max pattern symbols (elements)
PX   Max wildcard length (wildcard x(3,5) has length 5)
FN   Max nr of flexible wildcards
FL   Max flexibility of wildcards (wildcard x(3,5) has 2)
FP   Max product of flexibilities
BI   Use file of symbols for initial pattern search and during refinement (in pattern C-x(3,5)-[DE] symbols are C and DE)
BN   Specify that the first BN lines of the file contain symbols for initial pattern search
S    Scoring method (information content, and various other)

eMOTIF maker
http://motif.stanford.edu/emotif/emotif-maker.html

• Creates "fuzzy" patterns.
• Method of making patterns more general, so that remote homologues can be found.
• Amino acids grouped, e.g.:
– Aromatic: FYW
– Basic: HKR
– Hydrophobic: ILVM
– etc.

Fuzzy Regular Expressions

ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ

Create pattern:
[AS]-D-[IVL]-G-x5-C-[DE]-R-[FY]2-Q

Generalize:
[ASGPT]-D-[IVLM]-G-x5-C-[DENQ]-R-[FYW]2-Q

eMOTIF maker

• Makes fuzzy pattern from multiple alignment using overlapping groups:

1. AG
2. ST
3. PAGST
4. QN
5. ED
6. QNED
7. KR
8. KRH
9. VLI
10. VLIM
11. FYW
12. PAGSTQNEDHKR
13. CVLIMFYW

• For each alignment column, chooses smallest group containing all symbols from the column.

Group hierarchy:

PAGSTQNEDHKR
  PAGST
    AG
    ST
  QNED
    QN
    ED
  KRH
    KR
CVLIMFYW
  VLIM
    VLI
  FYW

 1  2  3  4  5  6  7  8  9 10 11
 L  A  S  P  T  H  E  D  L  S  L
 L  A  S  P  T  N  E  E  L  S  M
 I  A  S  P  I  E  V  Q  V  S  L
 I  A  G  P  I  V  Q  Q  I  S  L

[VLI]-A-[PAGST]-P-x3-[QNED]-[VLI]-S-[VLIM]
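The column rule can be written down compactly. The Python sketch below is a simplified illustration of the rule stated on the slide, not the eMOTIF implementation: it assumes a gap-free alignment, uses the thirteen groups listed above, emits a wildcard when no group covers a column, and merges runs of wildcards into xN. Function names are hypothetical.

```python
# The 13 overlapping groups from the slide.
GROUPS = ["AG", "ST", "PAGST", "QN", "ED", "QNED", "KR", "KRH",
          "VLI", "VLIM", "FYW", "PAGSTQNEDHKR", "CVLIMFYW"]

def column_element(column):
    """Pattern element for one alignment column."""
    symbols = set(column)
    if len(symbols) == 1:
        return column[0]                       # fully conserved position
    candidates = [g for g in GROUPS if symbols <= set(g)]
    if not candidates:
        return "x"                             # no group covers the column
    return "[" + min(candidates, key=len) + "]"

def fuzzy_pattern(alignment):
    """Build a pattern from a gap-free alignment (list of equal-length rows)."""
    elements = [column_element(col) for col in zip(*alignment)]
    merged, run = [], 0
    for element in elements + [None]:          # None flushes a trailing run
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else f"x{run}")
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)

alignment = ["LASPTHEDLSL", "LASPTNEELSM", "IASPIEVQVSL", "IAGPIVQQISL"]
print(fuzzy_pattern(alignment))
```

Applied to the four-row alignment above, the sketch yields the same pattern as shown on the slide.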


Pros and cons of patterns

+ Fast search.
+ Simple and interpretable models.
- Limited power of expression.
- Discrete results ("yes/no").
- Lack of reliable methodology for automated design of patterns.
- Only use information from a single motif.

Patterns do not use probabilities

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

[AT]-[CG]-[AC]-x(0,3)-[TG]-[CG]

• Note that the pattern matches:
– Consensus sequence: ACACATC
– and an exceptional sequence, such as: TGCAGG

• Probabilistic approaches (e.g. profiles and hidden Markov models) would assign these matches different probabilities.

What patterns can’t do

• A pattern is a regular expression, i.e. an expression in a regular grammar.

• Regular grammars cannot describe palindrome languages or copy languages, i.e. cannot handle long-range interactions.

Palindrome language: a a b b a a

Copy language: a a b a a b

The Chomsky hierarchy of transformational grammars

regular

context-free

context-sensitive

unrestricted


Palindrome languages

• Do palindromes occur in biological sequences?
• Consider RNA secondary structure (stem loops, "hairpins"), or anti-parallel beta strands, and the pairwise dependencies between positions that these result in (in the sequence).

And E.T. saw waste DNA

A Toyota

Dammit, I'm mad!

Flee to me, remote elf

Harpo sees Oprah

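[Figure: three RNA stem-loop drawings. Sequences 1 and 2 each fold into a hairpin with three base pairs (marked •) closing a four-base loop; in sequence 3 the corresponding positions do not pair (marked x).]

1 CAGGAAACUG
2 GCUGCAAAGC
3 GCUGCAACUG

[Table: number of identical positions for each pair of sequences 1 to 3.]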

Sequences 1 and 2 are the most dissimilar, but they can fold to the same structure, because they share the same pattern of A-U and G-C pairs (purine-pyrimidine).

The structure imposes a set of nested pairwise constraints, just like a palindrome language.

[Figure: the three stem-loop drawings for sequences 1 to 3, repeated.]

1 CAGGAAACUG
2 GCUGCAAAGC
3 GCUGCAACUG

How to formulate a pattern, when the allowed symbols at position 10 depend on which symbol we see at position 1?

A context-free grammar is required:

S → aW1u | cW1g | gW1c | uW1a
W1 → aW2u | cW2g | gW2c | uW2a
W2 → aW3u | cW3g | gW3c | uW3a
W3 → gaaa | gcaa
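Because each production of this grammar wraps a complementary pair around the next nonterminal, membership can be tested with a short recursive function. The Python sketch below is one possible recognizer for exactly this grammar (three nested pairs around a gaaa or gcaa loop); the function name `derives` is hypothetical.

```python
# Complementary pairs allowed by the productions of S, W1 and W2.
PAIRS = {("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")}
# Terminal strings produced by W3.
LOOPS = {"gaaa", "gcaa"}

def derives(seq, stem_pairs=3):
    """True if seq can be derived from S in the grammar above."""
    seq = seq.lower()
    if stem_pairs == 0:
        return seq in LOOPS                     # W3 -> gaaa | gcaa
    return (len(seq) >= 2
            and (seq[0], seq[-1]) in PAIRS      # outer symbols must pair
            and derives(seq[1:-1], stem_pairs - 1))

for name, seq in [("1", "CAGGAAACUG"), ("2", "GCUGCAAAGC"), ("3", "GCUGCAACUG")]:
    print(name, seq, derives(seq))
```

Sequences 1 and 2 are accepted although they share few identical positions, while sequence 3 is rejected because its first and last symbols (G and G) cannot pair.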

Regular expressions

• Patterns are represented by regular expressions.
• Regular expressions (with varying syntax) are used in:
– operating system tools (awk, and grep in UNIX)
– programming languages (e.g. PERL)
– text editors
– search engines

• In all cases to search text for occurrence of a pattern.

Regular expressions in UNIX/Linux

ID ASN_GLYCOSYLATION; PATTERN.
AC PS00001;
DT APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE N-glycosylation site.
PA N-{P}-[ST]-{P}.
CC /TAXO-RANGE=??E?V;
CC /SITE=1,carbohydrate;
CC /SKIP-FLAG=TRUE;
DO PDOC00001;

> grep 'CC' file1
CC /TAXO-RANGE=??E?V;
CC /SITE=1,carbohydrate;
CC /SKIP-FLAG=TRUE;

> grep 'PA' file1
ID ASN_GLYCOSYLATION; PATTERN.
PA N-{P}-[ST]-{P}.

> grep '^PA' file1
PA N-{P}-[ST]-{P}.

> grep '^PA' prosite.dat > prosite_patterns.dat

> grep '^D[EOT]' file1
DT APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE N-glycosylation site.
DO PDOC00001;

> grep '^D.' file1
DT APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE N-glycosylation site.
DO PDOC00001;

> grep ';$' file1
AC PS00001;
CC /TAXO-RANGE=??E?V;
CC /SITE=1,carbohydrate;
CC /SKIP-FLAG=TRUE;
DO PDOC00001;

> grep '\.$' file1
ID ASN_GLYCOSYLATION; PATTERN.
DT APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE N-glycosylation site.
PA N-{P}-[ST]-{P}.

• The grep program uses:

Symbol   Action
.        Match any character (except newline)
*        Match zero or more of the preceding character
^        Match the beginning of the line
$        Match the end of the line
\        Turn off special meaning of following character
[ ]      Match one of the enclosed characters
\{ \}    Match a range of instances

\{n\}     Matches exactly n occurrences.
\{n,\}    Matches at least n occurrences.
\{n,m\}   Matches at least n and at most m occurrences.

SKKIGLFYGT
AKIGLFFGSN
KIGIFFSTST
MKISILYSSK
MKIVYWSGTG

Which sequences contain two aliphatic hydrophobes?
> grep '[LIV].*[LIV]' file2
SKKIGLFYGT
AKIGLFFGSN
KIGIFFSTST
MKISILYSSK
MKIVYWSGTG

Which sequences contain two aliphatic hydrophobes in succession?
> grep '[LIV]\{2\}' file2
MKISILYSSK
MKIVYWSGTG

Two tiny residues that are exactly one residue apart?
> grep '[ACGS].[ACGS]' file2
KIGIFFSTST
MKIVYWSGTG

Two tiny residues that are one or two residues apart?
> grep '[AGCS].\{1,2\}[AGCS]' file2
AKIGLFFGSN
KIGIFFSTST
MKIVYWSGTG

Two tiny residues that are at least five residues apart?
> grep '[AGCS].\{5,\}[AGCS]' file2
SKKIGLFYGT
AKIGLFFGSN
KIGIFFSTST
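The same queries can be reproduced in Python, which may help when grep is not at hand. This is a sketch only: `grep_lines` is a hypothetical helper that mimics grep's "print every line containing a match" behaviour, and the five sequences are the contents of file2 as shown above. Note that Python's `re` module writes repetition counts as `{n,m}` without backslashes.

```python
import re

# The five sequences of file2 from the slide.
FILE2 = ["SKKIGLFYGT", "AKIGLFFGSN", "KIGIFFSTST", "MKISILYSSK", "MKIVYWSGTG"]

def grep_lines(pattern, lines):
    """Return the lines that contain at least one match, like grep does."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

print(grep_lines(r"[LIV]{2}", FILE2))              # two aliphatic in succession
print(grep_lines(r"[ACGS].[ACGS]", FILE2))         # tiny, one apart
print(grep_lines(r"[AGCS].{1,2}[AGCS]", FILE2))    # tiny, one or two apart
print(grep_lines(r"[AGCS].{5,}[AGCS]", FILE2))     # tiny, at least five apart
```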


PROSITE patterns in grep

• Simply using grep, we could find all lines in a sequence file that match a PROSITE pattern:

G-x2-[LIV]-R-x-[GA]

> grep 'G.\{2\}[LIV]R.[GA]' seq_file

G-x2-[LIV]-x(4,7)-R-x(5,6)-[GA]

> grep 'G..[LIV].\{4,7\}R.\{5,6\}[GA]' seq_file

Regular expressions in PERL

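• The programming language PERL uses a regular expression syntax that is similar to grep's, except that repetition counts are written {n,m} without backslashes.

# SPROT is assumed to be an already opened filehandle on a file in which
# each name line is followed by one sequence line.
$pattern = "G.{2}[LIV]R.[GA]";
while ($name = <SPROT>) {
    $seq = <SPROT>;
    if ($seq =~ /$pattern/) {
        $nr_matching_seqs++;
        # print rather than printf, so that a '%' in the data is not
        # interpreted as a format directive
        print STDOUT $name;
        print STDOUT $seq;
    }
}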

Exercise idea

1) Write a program that finds patterns in a multiple alignment by looking for columns:

• containing no gap-characters
• containing a set of amino acids in which all pairs have positive scores in the PAM250 substitution matrix
• being at most 2 columns apart

A valid pattern must model at least four columns.

2) Test your program on alignments downloaded from Pfam and check your patterns through the PROSITE ScanProsite interface.

3) Compare the patterns your program finds with those in PROSITE for the same family.
