Patterns and regular expressions for analysis of biological sequences
Björn Olsson
Bioinformatics Research Group, University College of Skövde

Ajit Narayanan, Professor
Björn Olsson, Senior lecturer
Dan Lundh, Lecturer
PhD students: Kim Laurio, Simon Lundell, Zelmina Lubovac, Jonas Gamalielsson, Jane Synnergren, Angelica Lindlöf
Why patterns?
• Patterns summarize the most important features of a group of sequences.
• Patterns often correspond to functional or structural features. Example: When constructing patterns, PROSITE developers look for motifs corresponding to:
  – Cysteines forming disulphide bonds
  – Regions involved in binding to other molecules
  – Catalytically active regions
  – etc.
• Pattern matching is easy to implement and pattern search is extremely fast
• Patterns are interpretable and simple models
CMLPIFECKQFSECNSYTGRCECIEGFAGDDC
CDQNPCLNGGACLPYLINEVTHLYTCTCENGFQGDKC
CFQSDCKNDGFCQSPSDEYACTCQPGFEGDDC
CFDNNCEYQCQPVGRSEHKCICAEGFAPVPGAPHKC
CTIKNQCMNFGTCEPVLQGNAIDFICHCPVGFTDKVC
CGILNGCENGRCVRVQEGYTCDCLDGYHLDTAKMTC
CKTKEAGFVCKHGCRSTGKAYECTCPSGSTVAEDGITC
CRKNACLHNAECRNTWNDYTCKCPNGYKGKKC
CPRNVSECSHDCVLTSEGPLCFCPEGSVLERDGKTC
CTDFVDVPCSHFCNNFIGGYFCSCPPEYFLHDDMKNC
CLDNNGGCSHVCNDLKIGYECLCPDGFQLVAQRRC
CEEDNGGCSHLCLLSPREPFYSCACPTGVQLQDNGKTC
CADGNGGCSHLCFFTPRATKCGCPIGLELLSDMKTC
CARDNGGCSHICIAKGDGTPRCSCPVHLVLLQNLLTC
CKLRKGNCSSTVCGQDLQSHLCMCAEGYALSRDRKYC
CPKANCSHFCIDRRDVGHQCFCAPGYILSENQKDC
CDSHPCLHGGTCEDDGREFTCRCPAGKGGAVC
CVLEPNCIHGTCNKPWTCICNEGWGGLYC
CNPLPCNEDGFMTCKDGQATFTCICKSGWQGEKC
CNAEFQNFCIHGECKYIEHLEAVTCKCQQEYFGERC
CPLSHDGYCLHDGVCMYIEALDKYACNCVVGYIGERC
CPKQYKHYCIKGRCRFVVAEQTPSCVCDEGYIGARC
CNDDYKNYCLNNGTCFTVALNNVSLNPFCACHINYVGSRC
CMLPIF..ECKQF..SECNSYT..........GRCECIEG.....FAGDDC
CDQN....PCLNG..GACLPYLINE...VTHLYTCTCENG.....FQGDKC
CFQS....DCKND..GFCQSPSD........EYACTCQPG.....FEGDDC
CFDN....NCEY....QCQPVGR.......SEHKCICAEGFAPVPGAPHKC
CTIKN...QCMNF..GTCEPVLQG....NAIDFICHCPVG.....FTDKVC
CGILN...GCEN...GRCVRVQEG........YTCDCLDG.YHLDTAKMTC
CKTKEAGFVCKH....GCRSTGK........AYECTCPSG.STVAEDGITC
CRKN....ACLHN..AECRNTW........NDYTCKCPNG.....YKGKKC
CPRNVS..ECS....HDCVLTSE........GPLCFCPEG.SVLERDGKTC
CTDFVDV.PCSH....FCNNFIG........GYFCSCPPE.YFLHDDMKNC
CLDNN..GGCSH....VCNDLKI........GYECLCPDG..FQLVAQRRC
CEEDNG..GCSH....LCLLSPR......EPFYSCACPTG.VQLQDNGKTC
CADGNG..GCSH....LCFFTPR........ATKCGCPIG.LELLSDMKTC
CARDNG..GCSH....ICIAKGD.......GTPRCSCPVH.LVLLQNLLTC
CKLRKG..NCSS...TVCGQDLQ........SHLCMCAEG.YALSRDRKYC
CPKA....NCSH....FCIDRRD.......VGHQCFCAPG.YILSENQKDC
CDSH....PCLHG..GTCEDDG........REFTCRCPAG.....KGGAVC
CVLEP...NCIH...GTCNKP...........WTCICNEG.....WGGLYC
CNPL....PCNEDGFMTCKDGQ........ATFTCICKSG.....WQGEKC
CNAEFQN.FCIH...GECKYIEH......LEAVTCKCQQE.....YFGERC
CPLSHDG.YCLHD..GVCMYIEALD......KYACNCVVG.....YIGERC
CPKQYKH.YCIK...GRCRFVVA......EQTPSCVCDEG.....YIGARC
CNDDYKN.YCLNN..GTCFTVALNN...VSLNPFCACHIN.....YVGSRC
EGF-like domain:
- First identified in epidermal growth factor, but present in many other proteins
- Function unclear, but found in extracellular part of membrane proteins
- Includes six cysteines involved in disulfide bonds (1-3, 2-4 and 5-6)

          +-------------------+        +-------------------+
          |                   |        |                   |
C-x(0,48)-C-x(3,12)-C-x(1,70)-C-x(1,6)-C-x2-G-x(0,21)-G-x2-C
|                   |
+-------------------+
1         2         3         4        5                   6
C-x(0,48)-C-x(3,12)-C-x(1,70)-C-x(1,6)-C-x2-G-x(0,21)-G-x2-C
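Zinc finger domain:
– DNA binding (usually by 3 or more zinc fingers)
– The C2H2 zinc finger has a loop of 12 amino acids, anchored by two cysteines and two histidines co-ordinating a zinc atom.

[Figure: C2H2 zinc finger. Two cysteines (C) and two histidines (H) co-ordinate the zinc atom (Z); the loop between them binds DNA. The [LIVMFYWC] position of the pattern is aromatic or aliphatic.]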
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
Thiolases. Enzymes of two types:
– Thiolase I involved in degradative pathways such as fatty acid beta-oxidation
– Thiolase II involved in biosynthetic pathways such as poly beta-hydroxybutyrate synthesis or steroid biogenesis

[LIVM]-[NST]-x(2)-C-[SAGLI]-[ST]-[SAG]-[LIVMFYNS]-x-[STAG]-[LIVM]-x(6)-[LIVM]

TSNIAREAALLAGVPDKIPAHTVTLACISSNVAMTTGMGMLATGNANAIIAGGVELL
TSNVAREAALGAGFSDKTPAHTVTMACISSNQAMTTAVGLIASGQCDVVVAGGVELM
GQNPARQAAIKAGLPAMVPAMTINKVCGSGLKAVMLAANAIMAGDAEIVVAGGQENM
GQNPARQTTLMAGLPHTVPAMTINKVCGSGLKAVHLAMQAVACGDAEIVIAGGQESM
GQNPARQAHIKVGLPRESAAWVINQVCGSGLRTVALAAQQVLLGDARIVVAGGQESM
GQNPARQAAMKAGVPQEATAWGMNQLCGSGLRAVALGMQQIATGDASIIVAGGMESM
GQNPARQSAIKGGLPNSVSAITINDVCGSGLKALHLATQAIQCGEADIVIAGGQENM
GQNPARQALLKSGLAETVCGFTVNKVCGSGLKSVALAAQAIQAGQAQSIVAGGMENM
GQNPARQAALKAGIEKEIPSLTINKVCGSGLKSVALGAQSIISGDADIVVVGGMENM
GQNPARQASFKAGLPVEIPAMTINKVCGSGLRTVSLAAQIIKAGDADVIIAGGMENM
GQMPARQAAVAAGIGWDVPALTINKMCLSGLDAIALADQLIRAGEFDVVVAGGQESM
GQIPSRLAARLAGMPWSVPSETLNKVCASGLRAVTLCDQMIRAQDADILVAGGMESM
GQAPARQVALKAGLPDSIIASTINKVCASGMKAVIIGAQNIICGTSDIVVVGGAESM
Basic syntax for PROSITE patterns
[AS]     A or S allowed.
D        D allowed.
x        Any symbol.
x4       Four arbitrary symbols.
{PG}     Any symbol except P and G.
[FY]2    Two positions where F or Y allowed.
x(3,7)   Minimum 3 and maximum 7 residues.
D-[IVL]-x(4)-{PG}-C-x(4,10)-[DE]-R-[FY]2-Q
Additional syntax (rarely used)
• Some patterns are anchored to the beginning or end of the sequence:
[KRHQSA]-[DENQ]-E-L>
must match at the end of the sequence.
VTLACISSNVAMTTGMGMLATGNANAIIAGQNEL
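The syntax rules above map almost one-to-one onto ordinary regular expressions. The following is a minimal sketch of that translation in Python; the function name `prosite_to_regex` is hypothetical, and the sketch only covers the elements listed on these slides (including the shorthand forms x4 and [FY]2 and the anchors < and >), not the complete PROSITE syntax.

```python
import re

# One pattern element: a symbol, optionally followed by a repeat count
# written either as (n), (n,m) or as the slide shorthand "n".
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\)|(\d+))?"
)

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):          # anchored at the sequence start
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):            # anchored at the sequence end
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognised element: {element!r}")
        symbol, low, high, bare = m.groups()
        if symbol == "x":
            regex = "."                              # any symbol
        elif symbol.startswith("{"):
            regex = "[^" + symbol[1:-1] + "]"        # excluded symbols
        else:
            regex = symbol                           # letter or [..] class
        if bare is not None:                         # x4, [FY]2
            regex += "{%s}" % bare
        elif high is not None:                       # x(3,7)
            regex += "{%s,%s}" % (low, high)
        elif low is not None:                        # x(4)
            regex += "{%s}" % low
        parts.append(regex)
    return prefix + "".join(parts) + suffix

end_pattern = prosite_to_regex("[KRHQSA]-[DENQ]-E-L>")
print(end_pattern)
print(bool(re.search(end_pattern, "VTLACISSNVAMTTGMGMLATGNANAIIAGQNEL")))
```

The anchored example matches the sequence above because its last four residues are Q-N-E-L; appending any residue to the sequence would break the match.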
Wildcards
G-x2-[LIV]-x(4,7)-R-x-[GA]
• x represents an arbitrary symbol.
• xN represents a sub-sequence of N arbitrary symbols.
• x(i,j) represents a sub-sequence of arbitrary symbols with minimum length i and maximum length j.
Wildcards
G-x2-[LIV]-x(4,7)-R-x(5,6)-[GA]
• The wildcard x(i,j) has flexibility j-i+1.
• Total flexibility of a pattern can be defined as the product of the flexibilities of its wildcards, i.e. in this example pattern (7-4+1) × (6-5+1) = 4 × 2 = 8.
• High flexibility increases search time, since the number of potential matches grows exponentially with the number of flexible wildcards.
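The flexibility calculation can be sketched in a few lines of Python. The function names are hypothetical; the sketch simply reads every x(i,j) wildcard out of the pattern text and multiplies the values j-i+1.

```python
import re
from math import prod

def wildcard_flexibilities(pattern: str) -> list[int]:
    """Return j - i + 1 for every x(i,j) wildcard in a PROSITE-style pattern."""
    return [int(j) - int(i) + 1
            for i, j in re.findall(r"x\((\d+),(\d+)\)", pattern)]

def total_flexibility(pattern: str) -> int:
    """Product of the wildcard flexibilities (1 if there are no x(i,j) wildcards)."""
    return prod(wildcard_flexibilities(pattern))

print(wildcard_flexibilities("G-x2-[LIV]-x(4,7)-R-x(5,6)-[GA]"))  # [4, 2]
print(total_flexibility("G-x2-[LIV]-x(4,7)-R-x(5,6)-[GA]"))       # 8
```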
A-x(4,7)-C-x(5,6)-D
ATLACCCCCVAMTTGMDMLATGNANAIIAGQNEL
A....C.....D
A....C......D
A.....C.....D
A.....C......D
A......C.....D
A......C......D
A.......C.....D
A.......C......D

• Flexibility N means N potential matches per position.
• Pattern construction methods typically search for low-flexibility patterns.
Zinc-containing alcohol dehydrogenases
G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]

Bacterial quinoprotein dehydrogenases
[DEN]-[WV]-x(3)-G-[RKNM]-x(6)-[FYW]-[SV]-x(4)-[LIVM]-N-x(2)-N-V-x(2)-L-[RKT]

Phosphoketolase
E-G-G-E-L-G-Y

Ferrochelatase
[LIVMF]-[LIVMFC]-x-[ST]-x-H-[GS]-[LIVM]-P-x(4,5)-[DENQKRLHAFSTI]-x-[GN]-[DPC]-x(1,4)-[YA]

• Note that general elements (those allowing many alternative residues) also contribute to increased search time by increasing the number of matches that will be evaluated.
PROSITE
http://www.expasy.org/prosite/
• First and most ”famous” pattern database.
• Release 19.10 (12-Sep-2005) contains 1374 documentation entries that describe 1328 patterns, 4 rules and 550 profiles/matrices.
• ”.. with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs”.
• But:
  – Some domains are not informative of family
  – Some patterns (10-15) for post-translational modifications
  – PROSITE patterns often have low sensitivity
PROSITE’s structure
Documentation of family (PDOCxxxxx)
Pattern files (PSxxxxx)
Note:
- Some PS entries contain profiles or rules
- Many PDOC entries describe domains, rather than families
PROSITE documentation: PDOC00178

General information about the entry
  Entry name: FLAVODOXIN
  Accession number: PS00201
  Entry type: PATTERN
  Date: APR-1990 (CREATED); DEC-2004 (DATA UPDATE); AUG-2005 (INFO UPDATE).

Name and characterization of the entry
  Description: Flavodoxin signature.
  Pattern: [LIV]-[LIVFY]-[FY]-x-[ST]-{V}-x-[AGC]-x-T-{P}-x(2)-A-x(2)-[LIV].

Numerical results
• UniProtKB/Swiss-Prot release number: 48.0, total number of sequence entries in that release: 194317.
• Total number of hits in UniProtKB/Swiss-Prot: 55 hits in 55 different sequences
• Number of hits on proteins that are known to belong to the set under consideration: 51 hits in 51 different sequences
• Number of hits on proteins that could potentially belong to the set under consideration: 0 hits in 0 different sequences
• Number of false hits (on unrelated proteins): 4 hits in 4 different sequences
• Number of known missed hits: 9
• Number of partial sequences which belong to the set under consideration, but which are not hit by the pattern or profile because they are partial (fragment) sequences: 2
• Precision (true hits / (true hits + false positives)): 92.73 %
• Recall (true hits / (true hits + false negatives)): 85.00 %

Comments
• Taxonomic range: Archaebacteria, Eukaryotes, Prokaryotes (Bacteria)
• Maximum known number of repetitions of the pattern in a single protein: 1
• VERSION: 1

Cross-references
UniProtKB/Swiss-Prot
True positive hits: DFA2_SYNEL (Q8DJ55), DFA3_SYNY3 (P74373), DFA4_ANASP (Q8YNW7), DFA4_SYNY3 (P72721), DFA5_ANASP (Q8Z0C1), DFA6_ANASP (Q8YQE2), FLAV_ANASO (P0A3E0), FLAV_ANASP (P0A3D9), FLAV_AQUAE (O67866), FLAV_AZOCH (P23001), FLAV_AZOVI (P00324), FLAV_BUCAI (P57385), FLAV_BUCAP (Q8K9N5), FLAV_BUCBP (Q89AK0), FLAV_CHOCR (P14070), FLAV_CLOBE (P00322), FLAV_CLOSA (P18855), FLAV_DESDE (P26492), FLAV_DESGI (Q01095), FLAV_DESSA (P18086), FLAV_DESVH (P00323), FLAV_DESVM (P71165), FLAV_ECO57 (P61951), FLAV_ECOL6 (P61950), FLAV_ECOLI (P61949), FLAV_ENTAG (P28579), FLAV_HAEIN (P44562), FLAV_KLEPN (O07026), FLAV_MEGEL (P00321), FLAV_NOSSM (P35707), FLAV_RHOCA (P52967), FLAV_SALTY (Q8ZQX1), FLAV_SHIFL (Q83S80), FLAV_SYNP2 (P31158), FLAV_SYNP7 (P10340), FLAV_SYNY3 (P27319), FLAV_TREPA (O83895), FLAV_TRIER (O52659), FLAW_BACSU (O34589), FLAW_DESDE (P80312), FLAW_DESGI (Q01096), FLAW_ECOLI (P41050), FLAW_ENTAG (P71169), FLAW_KLEOX (P56268), FLAW_KLEPN (P04668),
Access to PROSITE
• Quick search in PROSITE by AC, ID or documentation text (prefix and append wildcard '*' to words)
• Browse PROSITE documentation entries
• Search by author
• Search by citation
• Search by description
• Search by full text search
• SRS - Sequence Retrieval System
• Download by FTP

Tools for PROSITE
• Quick scan - Scan PROSITE patterns, profiles and rules with a Swiss-Prot/TrEMBL AC, ID or a pasted sequence (for more options, use the ScanProsite form); patterns with a high probability of occurrence can be excluded
• ScanProsite - Scan a sequence against PROSITE or a pattern against Swiss-Prot or PDB and visualize matches on structures, with graphical view and feature detection
• MotifScan - Scan a sequence against the profile entries in PROSITE and Pfam
• InterProScan - Scan a sequence against all the motif databases in InterPro
• ps_scan - Perl program to scan PROSITE locally
• pftools - Standalone programs to create and scan PROSITE profiles
• PRATT - Interactively generates conserved patterns from a series of unaligned proteins
• Other pattern and profile search tools
Patterns with high probability of occurrence

ID   PKC_PHOSPHO_SITE; PATTERN.
AC   PS00005;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE   Protein kinase C phosphorylation site.
PA   [ST]-x-[RK].
CC   /TAXO-RANGE=??E?V;
CC   /SITE=1,phosphorylation;
CC   /SKIP-FLAG=TRUE;
CC   /VERSION=1;
DO   PDOC00005;
Classification accuracy
• Also called ”discriminatory power” or ”diagnostic power”.
• Ideally PROSITE should be a database of diagnostic patterns:
”.. with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs”.
• Positive: Sequence matching a given pattern.
• Negative: Sequence not matching a given pattern.
• True: A correct result.
• False: An incorrect result.
• True positive: A matching family sequence.
• True negative: A non-matching non-family sequence.
• False positive: A matching non-family sequence.
• False negative: A non-matching family sequence.
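[Figure: Venn diagram of all sequences. Matching sequences are the positives and non-matching sequences the negatives; the overlap between matching sequences and family sequences is TP, matching non-family sequences are FP, non-matching family sequences are FN, and all remaining sequences are TN.]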
We want to minimize FN + FP.
A diagnostic pattern matches all family sequences and no other sequence, i.e. FN = FP = 0.
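The four outcome classes follow directly from set operations on the family and the set of matching sequences. Below is a minimal sketch in Python; the function name and the sequence identifiers s1 to s6 are hypothetical and serve only as an illustration.

```python
def confusion_counts(all_seqs, family, matching):
    """Return (TP, FP, FN, TN) for a pattern, given three collections of sequence ids."""
    all_seqs, family, matching = set(all_seqs), set(family), set(matching)
    tp = len(family & matching)              # matching family sequences
    fp = len(matching - family)              # matching non-family sequences
    fn = len(family - matching)              # non-matching family sequences
    tn = len(all_seqs - family - matching)   # non-matching non-family sequences
    return tp, fp, fn, tn

# Hypothetical toy database of six sequences.
database = ["s1", "s2", "s3", "s4", "s5", "s6"]
family = ["s1", "s2", "s3"]
matching = ["s2", "s3", "s4"]
print(confusion_counts(database, family, matching))  # (2, 1, 1, 2)
```

A diagnostic pattern in the sense above would give FP = FN = 0, i.e. the matching set would equal the family set.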
Classification accuracy
• PROSITE’s PS files give the following measures for each pattern:
  Precision (specificity): TP / (TP + FP)
  Recall (sensitivity): TP / (TP + FN)
• Ex: The flavodoxin family (PS00201)
  True positives: 51
  False positives: 4
  False negatives: 9
  Precision = 51 / (51 + 4) ≈ 0.93
  Recall = 51 / (51 + 9) = 0.85
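As a quick check of the arithmetic, the two measures can be computed from the counts reported for the flavodoxin pattern. This is a small Python sketch; the function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of matching sequences that belong to the family."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of family sequences that are matched by the pattern."""
    return tp / (tp + fn)

# Flavodoxin (PS00201): TP = 51, FP = 4, FN = 9
print(f"precision = {precision(51, 4):.4f}")  # precision = 0.9273
print(f"recall    = {recall(51, 9):.4f}")     # recall    = 0.8500
```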
Classification accuracy varies...
Family        TP    FP    FN    Spec   Sens
14_3_3        110   0     0     1.00   1.00
POU domain    84    0     0     1.00   1.00
Germin        42    0     6     1.00   0.87
Claudin       62    6     0     0.91   1.00
Cecropin      67    13    13    0.81   0.81
Copper blue   80    34    8     0.70   0.91
Lipocalin     90    118   45    0.43   0.67

• Average values ~0.95 for both Spec and Sens.
• But: this is classification accuracy on the training set!
• On test set: Spec ~0.7, Sens ~0.3
PROSITE matches
[Figure: bar chart of % of sequences (y-axis, 0 to 70) against nr of matching patterns (x-axis, 0 to 5).]
[Figure: pie chart of functional categories]
Unknown: 21%
Unknown, oat specific: 16%
Metabolism: 16%
Signal transduction and perception: 7%
Transport and secretion: 6%
Transcription: 6%
Protein synthesis: 6%
Energy: 6%
Cell growth, development, cell cycle, maintenance: 5%
Cell rescue, defence, stress, aging, death: 5%
Protein catabolism: 4%
Others: 2%
Oat cold acclimation project
• 10,000 ESTs
• 2,500 genes
• 63% characterised
• 23% by PROSITE

38% have a PROSITE match, but not all matches are useful for characterization.
PROSITE’s semi-automated method
1. Study literature on the protein family.
2. Study one or more alignments of sequences belonging to the family and look for motifs corresponding to, for example:
   – Catalytically active region
   – Cysteines forming disulphide bonds
   – Regions involved in binding to other molecules (including other proteins)
3. Identify a conserved sequence of 4-5 residues within the chosen motif.
4. Search SWISS-PROT for sequences matching this initial pattern.
5. Extend the pattern until its classification accuracy is ”adequate”.
Automated pattern construction
• Patterns can be built bottom-up or top-down:
  – BU: Enumerate possible patterns, and rank them according to the number of occurrences in the sequences.
  – TD: Compare (align) the sequences and find similarities from which patterns can be built.
• Advantages of BU:
  – Guarantees finding the best pattern for a given pattern size
  – Not affected by multiple alignment quality
• Advantages of TD:
  – Faster
  – Conserved positions in multiple alignment are likely to be biologically meaningful
PRATT
http://www.ii.uib.no/~inge/Pratt.html

• Uses a heuristic bottom-up approach.

Sequences:
tcgacgga
ttgacgca
gctacgaa

Enumeration of patterns (number of sequences containing each pattern):
Length 2: aa 1, ac 3, ag 0, at 0, ca 1, cc 0, cg 3, ct 1, ....
Length 3: aca 0, acc 0, acg 3, act 0, cga 2, cgc 1, cgg 1, cgt 0
Length 4: aacg 0, cacg 0, gacg 2, tacg 1
Length 5: agacg 0, cgacg 1, ggacg 0, tgacg 1

Search time can be reduced by ordering of candidate patterns and pruning of the search space.
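The enumeration step can be illustrated with a short Python sketch that counts, for each fixed-length pattern, how many sequences contain it. It assumes the example consists of three 8-letter sequences (tcgacgga, ttgacgca, gctacgaa), a split that is consistent with the counts shown for acg, gacg and tacg. The function name is hypothetical, and the sketch enumerates exhaustively; it does not reproduce PRATT's candidate ordering or pruning.

```python
from collections import Counter

def support_counts(sequences, k):
    """Count, for every length-k substring, the number of sequences containing it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each substring is counted once per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

sequences = ["tcgacgga", "ttgacgca", "gctacgaa"]  # assumed split of the example
for k in range(2, 6):
    counts = support_counts(sequences, k)
    shared = sorted(p for p, n in counts.items() if n == len(sequences))
    print(f"length {k}: patterns found in all sequences: {shared}")
```

Only substrings of length 2 and 3 are shared by all three sequences here, which is why longer candidates such as gacg have to be generalized before they can cover the whole set.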
• A candidate pattern can be generalized so that it matches more sequences.
• How to choose between alternative generalizations?
• PRATT uses pattern information content as a guide, and favours more informative patterns.

Two alternative generalizations of g-a-c-g (number of matching sequences in parentheses):
g-a-c-g (2)  →  [gt]-a-c-g (3)
g-a-c-g (2)  →  g-x(0,2)-a-c-g (3)
Score of pattern P is defined as:

$$ I(P) = \sum_{i=1}^{p} I(A_i) - c \cdot \sum_{k=1}^{p-1} (j_k - i_k) $$

where:
p = number of elements
c = a constant (0.5)
i_k = lower length for wildcard k
j_k = upper length for wildcard k

The second term measures information content of wildcards. For the pattern C-x(3,5)-[DE]-x(2)-A we get:

$$ c \cdot \sum_{k=1}^{p-1} (j_k - i_k) = 0.5 \cdot \big( (5-3) + (2-2) \big) = 1.0 $$
The first term measures information content of symbol elements, which is defined as:

$$ I(A_i) = -\sum_{a \in A} p_a \log(p_a) + \sum_{a \in A_i} \frac{p_a}{p_{A_i}} \log\!\left(\frac{p_a}{p_{A_i}}\right) $$

where:
p_a = probability of amino acid a
$p_{A_i} = \sum_{a \in A_i} p_a$

The pattern C-x(3,5)-[DE]-x(2)-A contains symbols CDEA. For example (base-10 logarithms):
C: $p_C \log p_C = 0.02 \cdot \log 0.02 = -0.034$
A: $p_A \log p_A = 0.08 \cdot \log 0.08 = -0.088$
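The score can be sketched in Python as follows. Several things here are assumptions rather than PRATT's actual implementation: the function names are hypothetical, logarithms are taken to base 10 (which agrees with the numerical values above), a uniform background of 1/20 per amino acid is used because only two frequencies are given, and exclusion elements such as {P} are not handled.

```python
import math
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies, for illustration only.
UNIFORM = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def symbol_information(allowed: str, freqs: dict) -> float:
    """I(A_i) for one pattern element that allows the residues in `allowed`."""
    background = -sum(p * math.log10(p) for p in freqs.values() if p > 0)
    p_set = sum(freqs[a] for a in allowed)
    within = sum((freqs[a] / p_set) * math.log10(freqs[a] / p_set) for a in allowed)
    return background + within

def wildcard_penalty(pattern: str, c: float = 0.5) -> float:
    """c times the sum of (j_k - i_k) over all wildcards; x(n) contributes 0."""
    total = 0
    for low, high in re.findall(r"x\((\d+)(?:,(\d+))?\)", pattern):
        if high:
            total += int(high) - int(low)
    return c * total

def pattern_score(pattern: str, freqs: dict, c: float = 0.5) -> float:
    """I(P): summed symbol information minus the wildcard penalty."""
    elements = [e for e in pattern.split("-") if not e.startswith("x")]
    info = sum(symbol_information(e.strip("[]"), freqs) for e in elements)
    return info - wildcard_penalty(pattern, c)

print(wildcard_penalty("C-x(3,5)-[DE]-x(2)-A"))  # 1.0
print(round(pattern_score("C-x(3,5)-[DE]-x(2)-A", UNIFORM), 3))
```

Under the uniform assumption a single fixed residue contributes log10(20) ≈ 1.30 and the two-residue element [DE] contributes 1.0, so more permissive elements lower the score, as do more flexible wildcards.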
PRATT parameters
C%   Minimum percentage of matching seqs
PP   Pattern position (anywhere, complete seq, start)
PL   Max pattern length (C-x(3,5)-[DE] has length 7)
PN   Max pattern symbols (elements)
PX   Max wildcard length (wildcard x(3,5) has length 5)
FN   Max nr of flexible wildcards
FL   Max flexibility of wildcards (wildcard x(3,5) has 2)
FP   Max product of flexibilities
BI   Use file of symbols for initial pattern search and during refinement (in pattern C-x(3,5)-[DE] symbols are C and DE)
BN   Specify that the first BN lines of the file contain symbols for initial pattern search
S    Scoring method (information content, and various others)
eMOTIF maker
http://motif.stanford.edu/emotif/emotif-maker.html

• Creates “fuzzy” patterns (fuzzy regular expressions).
• Method of making patterns more general, so that remote homologues can be found.
• Amino acids grouped, e.g.:
  – Aromatic: FYW
  – Basic: HKR
  – Hydrophobic: ILVM
  – etc.

ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ

Create pattern:
[AS]-D-[IVL]-G-x5-C-[DE]-R-[FY]2-Q

Generalize:
[ASGPT]-D-[IVLM]-G-x5-C-[DENQ]-R-[FYW]2-Q
eMOTIF maker
• Makes fuzzy pattern from multiple alignment using overlapping groups:

   1. AG       7. KR
   2. ST       8. KRH
   3. PAGST    9. VLI
   4. QN      10. VLIM
   5. ED      11. FYW
   6. QNED    12. PAGSTQNEDHKR
              13. CVLIMFYW

• For each alignment column, chooses smallest group containing all symbols from the column.

Group hierarchy:
PAGSTQNEDHKR
  PAGST
    AG, ST
  QNED
    QN, ED
  KRH
    KR
CVLIMFYW
  VLIM
    VLI
  FYW

Column:  1 2 3 4 5 6 7 8 9 10 11
         L A S P T H E D L S  L
         L A S P T N E E L S  M
         I A S P I E V Q V S  L
         I A G P I V Q Q I S  L

[VLI]-A-[PAGST]-P-x3-[QNED]-[VLI]-S-[VLIM]
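The column rule can be sketched in Python. This is an illustrative reading of the slide, not eMOTIF's actual code: the function names are hypothetical, a column with a single residue is written as that residue, a column covered by no group becomes a wildcard, and consecutive wildcards are merged into xN.

```python
GROUPS = ["AG", "ST", "PAGST", "QN", "ED", "QNED", "KR", "KRH",
          "VLI", "VLIM", "FYW", "PAGSTQNEDHKR", "CVLIMFYW"]

def column_element(column) -> str:
    """Return the pattern element for one alignment column."""
    residues = set(column)
    if len(residues) == 1:
        return column[0]                      # fully conserved position
    candidates = [g for g in GROUPS if residues <= set(g)]
    if not candidates:
        return "x"                            # no group covers the column
    return "[" + min(candidates, key=len) + "]"

def emotif_pattern(alignment) -> str:
    """Build a fuzzy pattern from equal-length aligned sequences."""
    elements = [column_element(col) for col in zip(*alignment)]
    merged, run = [], 0
    for element in elements + [None]:         # None flushes a trailing run
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else f"x{run}")
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)

alignment = ["LASPTHEDLSL", "LASPTNEELSM", "IASPIEVQVSL", "IAGPIVQQISL"]
print(emotif_pattern(alignment))
# [VLI]-A-[PAGST]-P-x3-[QNED]-[VLI]-S-[VLIM]
```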
Pros and cons of patterns
+ Fast search.
+ Simple and interpretable models.
- Limited power of expression.
- Discrete results (”yes/no”).
- Lack of reliable methodology for automated design of patterns.
- Only use information from a single motif.
Patterns do not use probabilities
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

[AT]-[CG]-[AC]-x(0,3)-A-[TG]-[CG]

• Note that the pattern matches:
  – Consensus sequence: ACACATC
  – and an exceptional sequence, such as: TGCAGG
• Probabilistic approaches (e.g. profiles and hidden Markov models) would assign these matches different probabilities.
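The yes/no nature of pattern matching is easy to demonstrate with Python's `re` module. The sketch below assumes the pattern is read with the invariant A column of the alignment written out, i.e. [AT]-[CG]-[AC]-x(0,3)-A-[TG]-[CG].

```python
import re

# Regular-expression form of [AT]-[CG]-[AC]-x(0,3)-A-[TG]-[CG] (assumed reading).
pattern = re.compile(r"[AT][CG][AC].{0,3}A[TG][CG]")

aligned_rows = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
sequences = [row.replace("-", "") for row in aligned_rows]

for seq in sequences + ["ACACATC", "TGCAGG"]:
    # A match is a plain True/False: the consensus-like and the
    # exceptional sequence are indistinguishable to the pattern.
    print(seq, bool(pattern.fullmatch(seq)))
```

Every line prints True, including the unlikely TGCAGG, which is exactly the information a probabilistic model would be able to separate.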
What patterns can’t do
• A pattern is a regular expression, i.e. an expression in a regular grammar.
• Regular grammars cannot describe palindrome languages or copy languages, i.e. cannot handle long-range interactions.

Palindrome language: a a b b a a
Copy language:       a a b a a b

The Chomsky hierarchy of transformational grammars:
regular ⊂ context-free ⊂ context-sensitive ⊂ unrestricted
Palindrome languages
• Do palindromes occur in biological sequences?
• Consider RNA secondary structure (stem loops, ”hairpins”), or anti-parallel beta strands, and the pairwise dependencies between positions that these result in (in the sequence).
And E.T. saw waste DNA
A Toyota
Dammit, I'm mad!
Flee to me, remote elf
Harpo sees Oprah
 A A        C A        C A
G   A      G   A      G   A
 G•C        U•A        UxC
 A•U        C•G        CxU
 C•G        G•C        GxG
  1          2          3

1 CAGGAAACUG
2 GCUGCAAAGC
3 GCUGCAACUG

[Table: number of identical positions between each pair of sequences 1, 2 and 3.]

Sequences 1 and 2 are the most dissimilar, but they can fold to the same structure, because they share the same pattern of A-U and G-C pairs (purine-pyrimidine).
The structure imposes a set of nested pairwise constraints, just like a palindrome language.
How to formulate a pattern, when allowed symbols at position 10 depends on which symbol we see at position 1?
A context-free grammar is required:
S  → aW1u | cW1g | gW1c | uW1a
W1 → aW2u | cW2g | gW2c | uW2a
W2 → aW3u | cW3g | gW3c | uW3a
W3 → gaaa | gcaa
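Because every production wraps a complementary base pair around the next nonterminal, membership in this grammar can be checked by peeling pairs off both ends of the sequence. Below is a minimal Python sketch of such a recognizer; the function name is hypothetical, and it is specific to this small grammar (three stem pairs around a gaaa or gcaa loop).

```python
PAIRS = {("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")}
LOOPS = {"gaaa", "gcaa"}

def derives(seq: str, stem_pairs: int = 3) -> bool:
    """True if seq can be derived from S in the stem-loop grammar."""
    seq = seq.lower()
    if stem_pairs == 0:
        return seq in LOOPS                       # W3 -> gaaa | gcaa
    if len(seq) < 2 or (seq[0], seq[-1]) not in PAIRS:
        return False                              # outermost bases must pair
    return derives(seq[1:-1], stem_pairs - 1)     # S, W1, W2 productions

for name, seq in [("1", "CAGGAAACUG"), ("2", "GCUGCAAAGC"), ("3", "GCUGCAACUG")]:
    print(name, seq, derives(seq))
```

Sequences 1 and 2 are accepted and sequence 3 is rejected, since its outermost bases G and G cannot pair. The dependency between the first and the last position is what a pattern with independent positions cannot express.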
Regular expressions
• Patterns are represented by regular expressions.
• Regular expressions (with varying syntax) are used in:
  – operating system tools (awk, and grep in UNIX)
  – programming languages (e.g. PERL)
  – text editors
  – search engines
• In all cases to search text for occurrence of a pattern.
Regular expressions in UNIX/Linux
Contents of file1:

ID   ASN_GLYCOSYLATION; PATTERN.
AC   PS00001;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
CC   /TAXO-RANGE=??E?V;
CC   /SITE=1,carbohydrate;
CC   /SKIP-FLAG=TRUE;
DO   PDOC00001;

> grep 'CC' file1
CC   /TAXO-RANGE=??E?V;
CC   /SITE=1,carbohydrate;
CC   /SKIP-FLAG=TRUE;

> grep 'PA' file1
ID   ASN_GLYCOSYLATION; PATTERN.
PA   N-{P}-[ST]-{P}.

> grep '^PA' file1
PA   N-{P}-[ST]-{P}.

> grep '^PA' prosite.dat > prosite_patterns.dat
> grep '^D[EOT]' file1
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE   N-glycosylation site.
DO   PDOC00001;

> grep '^D.' file1
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE   N-glycosylation site.
DO   PDOC00001;

> grep ';$' file1
AC   PS00001;
CC   /TAXO-RANGE=??E?V;
CC   /SITE=1,carbohydrate;
CC   /SKIP-FLAG=TRUE;
DO   PDOC00001;

> grep '\.$' file1
ID   ASN_GLYCOSYLATION; PATTERN.
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE).
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
• The grep program uses:

Symbol   Action
.        Match any character (except newline)
*        Match zero or more of the preceding character
^        Match the beginning of the line
$        Match the end of the line
\        Turn off special meaning of following character
[ ]      Match one of the enclosed characters
\{ \}    Match a range of instances

\{n\}    Matches exactly n occurrences.
\{n,\}   Matches at least n occurrences.
\{n,m\}  Matches at least n and at most m occurrences.

Contents of file2:

SKKIGLFYGT
AKIGLFFGSN
KIGIFFSTST
MKISILYSSK
MKIVYWSGTG

Which sequences contain two aliphatic hydrophobes?
> grep '[LIV].*[LIV]' file2
SKKIGLFYGT
AKIGLFFGSN
KIGIFFSTST
MKISILYSSK
MKIVYWSGTG

Which sequences contain two aliphatic hydrophobes in succession?
> grep '[LIV]\{2\}' file2
MKISILYSSK
MKIVYWSGTG
Two tiny residues that are exactly one residue apart?
> grep '[ACGS].[ACGS]' file2
KIGIFFSTST
MKIVYWSGTG

Two tiny residues that are one or two residues apart?
> grep '[AGCS].\{1,2\}[AGCS]' file2
AKIGLFFGSN
KIGIFFSTST
MKIVYWSGTG

Two tiny residues that are at least five residues apart?
> grep '[AGCS].\{5,\}[AGCS]' file2
SKKIGLFYGT
AKIGLFFGSN
KIGIFFSTST
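The same queries can be reproduced with Python's `re` module, which is a convenient way to check the expected output of each grep command. In this sketch the helper `grep` is a hypothetical stand-in for the command-line tool; note that Python writes interval expressions as {n,m} without the backslashes that basic grep syntax requires.

```python
import re

sequences = ["SKKIGLFYGT", "AKIGLFFGSN", "KIGIFFSTST", "MKISILYSSK", "MKIVYWSGTG"]

def grep(regex: str, lines: list[str]) -> list[str]:
    """Return the lines that contain a match for regex, like grep does."""
    compiled = re.compile(regex)
    return [line for line in lines if compiled.search(line)]

queries = {
    "two aliphatic hydrophobes": r"[LIV].*[LIV]",
    "two aliphatic hydrophobes in succession": r"[LIV]{2}",
    "two tiny residues exactly one apart": r"[ACGS].[ACGS]",
    "two tiny residues one or two apart": r"[ACGS].{1,2}[ACGS]",
    "two tiny residues at least five apart": r"[ACGS].{5,}[ACGS]",
}
for description, regex in queries.items():
    print(f"{description}: {grep(regex, sequences)}")
```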
PROSITE patterns in grep
• Simply using grep, we could find all lines in a sequence file that match a PROSITE pattern:
G-x2-[LIV]-R-x-[GA]
> grep 'G.\{2\}[LIV]R.[GA]' seq_file

G-x2-[LIV]-x(4,7)-R-x(5,6)-[GA]

> grep 'G..[LIV].\{4,7\}R.\{5,6\}[GA]' seq_file
Regular expressions in PERL
• The programming language PERL uses a regular expression syntax that is similar to grep’s.
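use strict;
use warnings;

# Assumes a file with alternating name and sequence lines; the file name is hypothetical.
open(SPROT, '<', 'sprot.txt') or die "Cannot open sprot.txt: $!";
my $pattern = 'G.{2}[LIV]R.[GA]';    # PROSITE pattern G-x2-[LIV]-R-x-[GA]
my $nr_matching_seqs = 0;
while (my $name = <SPROT>) {
    my $seq = <SPROT>;               # the sequence follows its name line
    last unless defined $seq;
    if ($seq =~ /$pattern/) {
        $nr_matching_seqs++;
        print STDOUT $name, $seq;    # report name and sequence of each match
    }
}
close(SPROT);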
Exercise idea
1) Write a program that finds patterns in a multiple alignment by looking for columns:
   • containing no gap-characters
   • containing a set of amino acids in which all pairs have positive scores in the PAM250 substitution matrix
   • being at most 2 columns apart
   A valid pattern must model at least four columns.
2) Test your program on alignments downloaded from Pfam and check your patterns through the PROSITE ScanProsite interface.
3) Compare the patterns your program finds with those in PROSITE for the same family.