A short description of Perly grammar processors leading up to Regexp::Grammars. Develops two R::G modules, one for single-line logfile entries, another for larger FASTA format entries in the NCBI "nr.gz" file. The second example shows how to derive one grammar from another by overriding tags in the base grammar.
Perly Parsers:
Perl-byacc, Parse::Yapp, Parse::RecDescent, Regexp::Grammars

Steven Lembark
Workhorse Computing
lembark@wrkhors.com
Grammars are the guts of compilers

Compilers convert text from one form to another
– C compilers convert C source to CPU-specific assembly.
– Databases compile SQL into RDBMS ops.

Grammars define structure, precedence, valid inputs
– Realistic ones are often recursive or context-sensitive.
– The complexity of defining grammars led to a variety of tools for writing them.
– The standard format for a long time has been "BNF", which is the input to YACC.

They are wasted on flat text
– If "split /\t/" does the job, skip grammars entirely.
The first Yet Another: YACC

Yet Another Compiler Compiler
– YACC takes in a standard-format grammar structure.
– It processes tokens and their values, organizing the results according to the grammar into a structure.

Between the source and YACC is a tokenizer
– This parses the inputs into individual tokens defined by the grammar.
– It doesn't know about structure, only breaking the text stream up into tokens.
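The tokenizer's side of the bargain can be sketched in a few lines of Perl. This is an illustrative lexer, not yacc's own: it just chops a stream into typed tokens and leaves all structure to the grammar.

```perl
#!/usr/bin/env perl
# Illustrative tokenizer: break an expression into [ type => value ]
# pairs. It knows nothing about structure or precedence -- that is
# the grammar's job, not the lexer's.
use strict;
use warnings;

sub tokenize
{
    my $text    = shift;
    my @tokenz  = ();

    for( $text =~ m{ ( \d+ | [-+*/^=()] | \w+ ) }gx )
    {
        my $type
        = m{\A \d+ \z}x        ? 'NUM'
        : m{\A [-+*/^=()] \z}x ? 'OP'
        :                        'VAR'
        ;

        push @tokenz, [ $type => $_ ];
    }

    \@tokenz
}

# tokenize( 'a = 3 + 42' ) yields:
# [ VAR a ] [ OP = ] [ NUM 3 ] [ OP + ] [ NUM 42 ]
```

A yacc-generated parser then consumes those typed tokens one at a time via yylex(); the glue between the two halves is where the pain starts.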
Parsing is a pain in the lex

The real pain is gluing the parser and tokenizer together
– Tokenizers deal in the language of patterns.
– Grammars are defined in terms of structure.

Passing data between them makes for most of the difficulty
– One issue is the global yylex() call, which makes having multiple parsers difficult.
– Context-sensitive grammars with multiple sub-grammars are painful.
The perly way

Regexen, logic, glue... hmm, been there before
– The first approach most of us try is lexing with regexen.
– Then add captures and if-blocks, or execute (?{ code }) blocks inside of each regex.

The problem is that the grammar is defined by your code structure
– Modifying the grammar requires re-coding it.
– Hubris, maybe, but Truly Lazy it ain't.
– Which was the whole reason for developing standard grammars & their handlers in the first place.
Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code. Intentionally looked like YACC
– Able to re-cycle existing YACC grammar files.
– Benefit from using Perl as a built-in lexer.
– Perl-byacc & Parse::Yapp.

Good: Recycles knowledge for YACC users.
Bad: Still not lazy. The grammars are difficult to maintain, and you still have to plug in post-processing code to deal with the results.
    %right  '='
    %left   '-' '+'
    %left   '*' '/'
    %left   NEG
    %right  '^'

    %%

    input:  # empty
        |   input line      { push( @{$_[1]}, $_[2] ); $_[1] }
    ;

    line:   '\n'            { $_[1] }
        |   exp '\n'        { print "$_[1]\n" }
        |   error '\n'      { $_[0]->YYErrok }
    ;

    exp:    NUM
        |   VAR             { $_[0]->YYData->{VARS}{$_[1]} }
        |   VAR '=' exp     { $_[0]->YYData->{VARS}{$_[1]} = $_[3] }
        |   exp '+' exp     { $_[1] + $_[3] }
        |   exp '-' exp     { $_[1] - $_[3] }
        |   exp '*' exp     { $_[1] * $_[3] }
    ;

Example: Parse::Yapp grammar
The Swiss Army Chainsaw

Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers.

Grammars are largely declarative, using OO Perl to do the heavy lifting
– OO interface allows multiple, context-sensitive parsers.
– Rules with Perl { code } blocks allow the code to do anything.
– Results can be acquired from a hash, an array, or $1.
– Left, right, associative tags simplify messy situations.
Example: PRD

This is part of an infix formula compiler I wrote. It compiles equations to a sequence of closures:

    add_op  : '+' | '-'                         { $item[ 1 ] }
    mult_op : '*' | '/' | '^'                   { $item[ 1 ] }

    add     : <leftop: mult add_op mult>        { compile_binop @{ $item[1] } }
    mult    : <leftop: factor mult_op factor>   { compile_binop @{ $item[1] } }
Just enough rope to shoot yourself

The biggest problem: PRD is sloooooooow.

Learning curve is perl-ish: shallow and long
– Unless you really know what all of it does, you may not be able to figure out the pieces.
– Lots of really good docs that most people never read.

Perly blocks also made it look too much like a job-dispatcher
– People used it for a lot of things that are not compilers.
– Good & Bad thing: it really is a compiler.
– Bad rap for not doing well what it wasn't supposed to do at all.
RIP PRD

Supposed to be replaced with Parse::FastDescent
– Damian dropped work on PFD for Perl6.
– His goal was to replace the shortcomings of PRD with something more complete and quite a bit faster.

The result is Perl6 Grammars
– Declarative syntax extends matching with rules.
– Built into Perl6 as a structure, not an add-on.
– Much faster.
– Not available in Perl5.
Regexp::Grammars

Perl5 implementation derived from Perl6
– Back-porting an idea, not the Perl6 syntax.
– Much better performance than PRD.

Extends the v5.10 recursive matching syntax, leveraging the regex engine
– Most of the speed issues are with regex design, not the parser itself.
– Simplifies mixing code and matching.
– Single place to get the final results.
– Cleaner syntax with automatic whitespace handling.
Extending regexen

"use Regexp::Grammars" turns on the added syntax
– block-scoped (avoids collisions with existing code).

You will probably want to add "xm" or "xs"
– extended syntax avoids whitespace issues.
– multi-line mode (m) simplifies line anchors for line-oriented parsing.
– single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
– I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
What you get

The parser is simply a regex-ref
– You can bless it or have multiple parsers in the same program.

Grammars can reference one another
– Extending grammars via objects or modules is straightforward.

Comfortable for incremental development or refactoring
– Largely declarative syntax helps.
– OOP provides inheritance with overrides for rules.
    my $compiler
    = do
    {
        use Regexp::Grammars;

        qr
        {
            <data>

            <rule: data>    <[text]>+
            <rule: text>    .+
        }xm;
    };
Example: Creating a compiler

Context can be a do-block, subroutine, or branch logic.

"data" is the entry rule.

All this does is read lines into an array with automatic ws handling.
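Using it is just a match: on success, Regexp::Grammars leaves the parse tree in %/. A minimal sketch, assuming the $compiler regex above and log text on STDIN:

```perl
my $input = do { local $/; <STDIN> };   # slurp the whole log

if( $input =~ $compiler )
{
    # %/ holds the parse tree: "data" is the entry rule,
    # "text" the array of parsed lines.
    my $linz = $/{ data }{ text };

    print scalar @$linz, " lines parsed\n";
}
```

The hashref layout is exactly what the next slide dumps with Data::Dumper.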
Results: %/

The results of parsing are in a tree-hash named %/
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or debugging).
– Easy to handle with Data::Dumper.

The hash has at least one key for the entry rule, plus one empty key for the input data if context is being saved.

For example, feeding two lines of a Gentoo emerge log through the "line" grammar gives:
    '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
    1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    'data' =>
    {
        '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
    1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
        'text' =>
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
        ]
    }

Parsing a few lines of logfile
Getting rid of context

The empty-keyed values are useful for development or explicit error messages.

They also get in the way, and can cost a lot of memory on large inputs.

You can turn them on and off with <context:> and <nocontext:> in the rules.
    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data>    <text>+     # oops: left off the []!
        <rule: text>    .+
    }xm;
    warn
    | Repeated subrule <text>+ will only capture its final match
    | (Did you mean <[text]>+ instead?)
    |

    'data' =>
    {
        'text' => '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
    }
You usually want [] with +

    'data' =>
    {
        'text' =>   # the [text] parses to an array of text
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  emerge --jobs --autounmask-write ...'
        ]
    }
    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data>    <[text]>+
        <rule: text>    (.+)
    }xm;

An array[ref] of text.
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

    <data>
    <rule: data>        <[line]>+
    <rule: line>        <ref_id> <[text]>
    <token: ref_id>     ^(\d+)
    <rule: text>        .+

    'line' =>
    [
        {
            'ref_id' => '1367874132',
            'text'   => 'Started emerge on: May 06, 2013 21:02:12'
        },
        ...
    ]
Removing cruft: "ws"

Be nice to remove the leading ": " from text lines. In this case the "whitespace" needs to include a colon along with the spaces.

Whitespace is defined by <ws: ... >:

    <rule: line>    <ws:[:\s]+> <ref_id> <text>

    'ref_id' => '1367874132',
    'text'   => 'emerge --jobs --autounmask-wr...'
The prefix means something

Be nice to know what type of line was being processed.

<prefix=( regex )> assigns the regex's capture to the "prefix" tag:

    <rule: line>    <ws:[:\s]+> <ref_id> <entry>

    <rule: entry>
        <prefix=([*][*][*])> <text>
    |   <prefix=([>][>][>])> <text>
    |   <prefix=([=][=][=])> <text>
    |   <prefix=([:][:][:])> <text>
    |   <text>

    'entry' => { 'text' => 'Started emerge on: May 06, 2013 21:02:12' },
    'ref_id' => '1367874132'

    'entry' => { 'prefix' => '***', 'text' => 'emerge --jobs --autounmask-write ...' },
    'ref_id' => '1367874132'

    'entry' => { 'prefix' => '>>>', 'text' => 'emerge (1 of 2) sys-apps/...' },
    'ref_id' => '1367874256'

"entry" now contains the optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results.

The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix=([*][*][*])> <command=text>
    |   <prefix=([>][>][>])> <stage=text>
    |   <prefix=([=][=][=])> <status=text>
    |   <prefix=([:][:][:])> <final=text>
    |   <message=text>

    'entry' => { 'message' => 'Started emerge on: May 06, 2013 21:02:12' },
    'ref_id' => '1367874132'

    'entry' => { 'command' => 'emerge --jobs --autounmask-write ...', 'prefix' => '***' },
    'ref_id' => '1367874132'

    'entry' => { 'command' => 'terminating.', 'prefix' => '***' },
    'ref_id' => '1367874133'

Generic "text" replaced with a type.
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading '.' tells R::G to parse but not store the results in %/:

    <rule: entry>
        <.prefix=([*][*][*])> <command=text>
    |   <.prefix=([>][>][>])> <stage=text>
    |   <.prefix=([=][=][=])> <status=text>
    |   <.prefix=([:][:][:])> <final=text>
    |   <message=text>

    'entry' => { 'message' => 'Started emerge on: May 06, 2013 21:02:12' },
    'ref_id' => '1367874132'

    'entry' => { 'command' => 'emerge --jobs --autounmask-write ...' },
    'ref_id' => '1367874132'

    'entry' => { 'command' => 'terminating.' },
    'ref_id' => '1367874133'

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

    <ws:[:\s]+> <ref_id>
    (
        <.prefix=([*][*][*])> <command=text>
    |   <.prefix=([>][>][>])> <stage=text>
    |   <.prefix=([=][=][=])> <status=text>
    |   <.prefix=([:][:][:])> <final=text>
    |   <message=text>
    )

    'data' =>
    {
        'line' =>
        [
            { 'message' => 'Started emerge on: May 06, 2013 21:02:12', 'ref_id' => '1367874132' },
            { 'command' => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', 'ref_id' => '1367874132' },
            { 'command' => 'terminating.', 'ref_id' => '1367874133' },
            { 'message' => 'Started emerge on: May 06, 2013 21:02:17', 'ref_id' => '1367874137' }
        ]
    }

Result: array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

    <rule: entry>   <ws:[:\s]+> <ref_id> <type>? <text>

    <token: type>
    (
        [*][*][*]
    |   [>][>][>]
    |   [=][=][=]
    |   [:][:][:]
    )

Entries now have "text" and "type":

    'entry' =>
    [
        { 'ref_id' => '1367874132', 'text' => 'Started emerge on: May 06, 2013 21:02:12' },
        { 'ref_id' => '1367874133', 'text' => 'terminating.', 'type' => '***' },
        { 'ref_id' => '1367874137', 'text' => 'Started emerge on: May 06, 2013 21:02:17' },
        { 'ref_id' => '1367874137', 'text' => 'emerge --jobs --autounmask-write ...', 'type' => '***' }
    ]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, a character class with a count does the job:

    [*>:=]{3}
    qr
    {
        <nocontext:>

        <data>
        <rule: data>    <[entry]>+

        <rule: entry>   <ws:[:\s]+> <ref_id> <prefix>? <text>

        <token: ref_id> ^(\d+)
        <token: prefix> [*>:=]{3}
        <token: text>   .+
    }xm;

This is the skeleton parser

Doesn't take much
– Declarative syntax.
– No Perl code at all!

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines
– Sub-rules can be defined for the different line types:

    <rule: command>     emerge <ws> <[switch]>+
    <token: switch>     ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
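Plugged into the skeleton, the extra rules might look like the sketch below. The alternation and the exact switch pattern are my assumptions here, not the talk's finished grammar:

```perl
use Regexp::Grammars;

my $parser = qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    # an entry is either a recognized emerge command or generic text
    <rule: entry>   <ws:[:\s]+> <ref_id> <prefix>? ( <command> | <text> )

    <rule: command> emerge <[switch]>+
    <token: switch> ([-][-]\S+)

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;
```

Command lines then come back with their switches already split into an array, while everything else still lands in "text".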
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.

This makes modular or context-sensitive grammars relatively simple to compose
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
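A minimal sketch of the building-block approach; the grammar and rule names here are invented for illustration:

```perl
use Regexp::Grammars;

# A named grammar: defines rules but is not a parser by itself.
my $base = qr
{
    <grammar: Base::Line>

    <rule: line>    <ref_id> <text>
    <token: ref_id> ^(\d+)
    <token: text>   .+
}xm;

# A derived parser: inherits Base::Line's rules, adds an entry
# point, and overrides "text".
my $parser = qr
{
    <extends: Base::Line>

    <line>
    <token: text>   \S+     # override: first word only
}xm;
```

The nr.gz example that follows uses exactly this pattern: a generic FASTA grammar in one place, with the header rules overridden per file format.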
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters...]
    >Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters
– Each species has a set of source & identifier pairs followed by a single description.
– The within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: Parse FASTA

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>   <start> <head> <ws> <[body]>+

        <rule: head>    .+ <ws>
        <rule: body>    ( <[seq]> | <comment> ) <ws>

        <token: start>      ^ [>]
        <token: comment>    ^ [;] .+
        <token: seq>        ^ [\n\w\-]+
    }xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta"
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... })
– Allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.

    seq =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY...
    ...DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP...VQKLLNPDQ'
    ]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines
– join + split won't work, because split uses a regex.

    <rule: body>
        ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })
One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

    body =>
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTD...KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'

    <rule: fasta>
        <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

Result: a generic FASTA parser:

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
            head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
        }
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results.

In this case the derived parser references the grammar and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+[\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works, but is ugly
– Have two rules, for the main list and tail.
– Alias the tail to get them all in one place.

    <rule: head>
        <[ident]>+ <[ident=final]>
        (?{
            # remove the matched anchors
            tr/\cA\n//d for @{ $MATCH{ ident } };
        })

    <token: ident>  .+? \cA
    <token: final>  .+ \n
Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

    head =>
    {
        ident =>
        [
            'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
            'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
            'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
            'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
        ]
    }
Dealing with separators: <sep>

Separators happen often enough
– "1, 2, 3, 4, 13, 91"      numbers by commas, spaces
– "g-c-a-g-t-t-a-c-a"       characters by dashes
– "/usr/local/bin"          basenames by dir markers
– "/usr:/usr/local/bin"     dirs separated by colons

that R::G has special syntax for dealing with them: combine the item with "%" and a separator:

    <rule: list>        <[item]>+ % <separator>     # one-or-more
    <rule: list_zom>    <[item]>* % <separator>     # zero-or-more
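On its own, the separator syntax looks like the sketch below (the rule names are invented); the separator after "%" is given as a simple bracketed pattern here, and the commas stay out of the captured items:

```perl
use Regexp::Grammars;

my $csv = qr
{
    <nocontext:>
    <list>

    # "%" consumes the separators without capturing them
    <rule: list>    <[item]>+ % [,]
    <token: item>   \d+
}x;

if( '1, 2, 3, 4, 13, 91' =~ $csv )
{
    # $/{ list }{ item } holds just the numbers,
    # e.g. [ 1, 2, 3, 4, 13, 91 ]
    print join ' ' => @{ $/{ list }{ item } };
}
```

The same idiom handles the \cA-separated nr.gz headers in the next slide.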
Cleaner nr.gz header rule

Separator syntax cleans things up
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>

        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        <token: ident>  .+?
    }xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers. Replace $MATCH from the "head" rule with the nested identifier contents:

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>
            <[ident]>+ % [\cA]
            (?{
                $MATCH = delete $MATCH{ ident };
            })

        <token: ident>  .+?
    }xm;
Result

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDT...NPDQ',
            head =>
            [
                'gi|66816243|ref|XP_6...rt=CAF-1',
                'gi|793761|dbj|BAA0626...oideum]',
                'gi|60470106|gb|EAL68086...m discoideum AX4]'
            ]
        }
    ]

The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause, "% ( \s* [|] \s* )"
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical...

becomes:

    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical...'
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources & "desc"

    fasta =>
    {
        body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
        head =>
        [
            {
                desc => '30S ribosomal protein S18 [Lactoco...',
                gi   => '15674171',
                ref  => 'NP_268346.1'
            },
            {
                desc => '30S ribosomal protein S18 [Lactoco...',
                gi   => '116513137',
                ref  => 'YP_812044.1'
            },
            ...
        ]
    }
Balancing R::G with calling code

The regex engine could process all of nr.gz
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking the input on ">" removes it as a leading character.
– Making it optional with <start>? fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>
            <start> <head> <ws> <[body]>+
            (?{
                $MATCH{ body } = $MATCH{ body }[0];
            })

        <rule: head>    .+ <ws>

        <rule: body>
            ( <[seq]> | <comment> ) <ws>
            (?{
                $MATCH = join '' => @{ delete $MATCH{ seq } };
                $MATCH =~ tr/\n//d;
            })

        <token: start>      ^ [>]
        <token: comment>    ^ [;] .+
        <token: seq>        ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

        <rule: head>    <[ident]>+ % [\cA]
        <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
        <token: taxa>   .+?
    }xm;
Result: Use grammars!

Most of the "real" work is done under the hood
– Regexp::Grammars does the lexing, basic compilation.
– Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar
– Too much code makes them hard to maintain.
– Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G's import, which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of PRD
– It is worth taking the time to learn how to optimize NFA regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.
Grammars are the guts of compilers
Compilers convert text from one form to anotherndash C compilers convert C source to CPU-specific assembly
ndash Databases compile SQL into RDBMS ops
Grammars define structure precedence valid inputsndash Realistic ones are often recursive or context-sensitive
ndash The complexity in defining grammars led to a variety of tools for defining them
ndash The standard format for a long time has been ldquoBNFrdquo which is the input to YACC
They are wasted on flat textndash If ldquosplit trdquo does the job skip grammars entirely
The first Yet Another YACC
Yet Another Compiler Compiler ndash YACC takes in a standard-format grammar structure
ndash It processes tokens and their values organizing the results according to the grammar into a structure
Between the source and YACC is a tokenizerndash This parses the inputs into individual tokens defined by the grammar
ndash It doesnt know about structure only breaking the text stream up into tokens
Parsing is a pain in the lex
The real pain is gluing the parser and tokenizer togetherndash Tokenizers deal in the language of patterns
ndash Grammars are defined in terms of structure
Passing data between them makes for most of the difficultyndash One issue is the global yylex call which makes having multiple parsers
difficult
ndash Context-sensitive grammars with multiple sub-grammars are painful
The perly way
Regexen logic glue hmm been there beforendash The first approach most of us try is lexing with regexen
ndash Then add captures and if-blocks or excute (code) blocks inside of each regex
The problem is that the grammar is defined by your code structurendash Modifying the grammar requires re-coding it
ndash Hubris maybe but Truly Lazy it aint
ndash Was the whole reason for developing standard grammars amp their handlers in the first place
Early Perl Grammar Modules
These take in a YACC grammar and spit out compiler code Intentionally looked like YACC
ndash Able to re-cycle existing YACC grammar files
ndash Benefit from using Perl as a built-in lexer
ndash Perl-byacc amp ParseYapp
Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you
still have to plug in post-processing code to deal with the results
right =left - +left left NEGright ^
input empty
| input line push($_[1]$_[2]) $_[1]
line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok
exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]
Example ParseYapp grammar
The Swiss Army Chainsaw
ParseRecDescent extended the original BNF syntax combining the tokens amp handlers
Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers
ndash Rules with Perl blocks allows the code to do anything
ndash Results can be acquired from a hash an array or $1
ndash Left right associative tags simplify messy situations
Example PRD
This is part of an infix formula compiler I wrote
It compiles equations to a sequence of closures
add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]
add ltleftop mult add_op multgt compile_binop $item[1]
mult ltleftop factor mult_op factorgt compile_binop $item[1]
Just enough rope to shoot yourself
The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long
ndash Unless you really know what all of it does you may not be able to figure out the pieces
ndash Lots of really good docs that most people never read
Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers
ndash Good amp Bad thing it really is a compiler
ndash Bad rap for not doing well what it wasnt supposed to do at all
RIP PRD
Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6
ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster
The result is Perl6 Grammarsndash Declarative syntax extends matching with rules
ndash Built into Perl6 as a structure not an add-on
ndash Much faster
ndash Not available in Perl5
RegexGrammars
Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax
ndash Much better performance than PRD
Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself
ndash Simplifies mixing code and matching
ndash Single place to get the final results
ndash Cleaner syntax with automatic whitespace handling
Extending regexen
ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)
You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues
ndash multi-line mode (m) simplifies line anchors for line-oriented parsing
ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic
ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary
What you get
The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program
Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward
Comfortable for incremental development or refactoringndash Largely declarative syntax helps
ndash OOP provides inheritance with overrides for rules
my $compiler= do use RegexpGrammars
qr ltdatagt
ltrule data gt lt[text]gt+ ltrule text gt +
xm
Example Creating a compiler
Context can be a do-block subroutine or branch logic
ldquodatardquo is the entry rule
All this does is read lines into an array with automatic ws handling
Results
The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results
ndash Empty keys () hold input text (for errors or debugging)
ndash Easy to handle with DataDumper
The hash has at least one key for the entry rule one empty key for input data if context is being saved
For example feeding two lines of a Gentoo emerge log through the line grammar gives
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something

It would be nice to know what type of line was being processed. <prefix=( regex )> assigns the regex's capture to the "prefix" tag:

    <rule: line> <ws:[\s:]+> <ref_id> <entry>
    <rule: entry>
        <prefix=([*][*][*])> <text>
      | <prefix=([>][>][>])> <text>
      | <prefix=([=][=][=])> <text>
      | <prefix=([:][:][:])> <text>
      | <text>

    { entry  => { text => 'Started emerge on: May 06, 2013 21:02:12' },
      ref_id => '1367874132'
    },
    { entry  => { prefix => '***', text => 'emerge --jobs –autounmask-write' },
      ref_id => '1367874132'
    },
    { entry  => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps' },
      ref_id => '1367874256'
    }

"entry" now contains the optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results. The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix=([*][*][*])> <command=text>
      | <prefix=([>][>][>])> <stage=text>
      | <prefix=([=][=][=])> <status=text>
      | <prefix=([:][:][:])> <final=text>
      | <message=text>

    { entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
      ref_id => '1367874132'
    },
    { entry  => { command => 'emerge --jobs --autounmask-write –',
                  prefix  => '***'
                },
      ref_id => '1367874132'
    },
    { entry  => { command => 'terminating.', prefix => '***' },
      ref_id => '1367874133'
    }

Generic "text" replaced with a type.
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.
A leading "." tells R::G to parse but not store the results in %MATCH:

    <rule: entry>
        <.prefix=([*][*][*])> <command=text>
      | <.prefix=([>][>][>])> <stage=text>
      | <.prefix=([=][=][=])> <status=text>
      | <.prefix=([:][:][:])> <final=text>
      | <message=text>

    { entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
      ref_id => '1367874132'
    },
    { entry  => { command => 'emerge --jobs --autounmask-write -' },
      ref_id => '1367874132'
    },
    { entry  => { command => 'terminating.' },
      ref_id => '1367874133'
    }

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

    <rule: line>
        <ws:[\s:]+> <ref_id>
        (?:
            <.prefix=([*][*][*])> <command=text>
          | <.prefix=([>][>][>])> <stage=text>
          | <.prefix=([=][=][=])> <status=text>
          | <.prefix=([:][:][:])> <final=text>
          | <message=text>
        )

    data =>
    { line =>
      [ { message => 'Started emerge on: May 06, 2013 21:02:12',
          ref_id  => '1367874132'
        },
        { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
          ref_id  => '1367874132'
        },
        { command => 'terminating.',
          ref_id  => '1367874133'
        },
        { message => 'Started emerge on: May 06, 2013 21:02:17',
          ref_id  => '1367874137'
        }
      ]
    }

Result: an array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

    <rule: entry> <ws:[\s:]+> <ref_id> <type>? <text>
    <token: type> ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

    entry =>
    [ { ref_id => '1367874132',
        text   => 'Started emerge on: May 06, 2013 21:02:12'
      },
      { ref_id => '1367874133',
        text   => 'terminating.',
        type   => '***'
      },
      { ref_id => '1367874137',
        text   => 'Started emerge on: May 06, 2013 21:02:17'
      },
      { ref_id => '1367874137',
        text   => 'emerge --jobs --autounmask-write –',
        type   => '***'
      }
    ]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable. Given the way these are used, use a character class with a count:

    [*>:=]{3}
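One thing worth noting: the class-with-count is looser than the alternation, since it would also accept mixed triples like ">=*". emerge never emits those, so the tradeoff is safe here. A quick check:

```perl
use strict;
use warnings;

# The counted class accepts the four real prefixes and, as a side
# effect, any three-character mix drawn from the same class.
my $prefix = qr{ \A [*>:=]{3} \z }x;

my @real  = grep { $_ =~ $prefix } qw( *** >>> === ::: );

my $mixed = '>=*' =~ $prefix ? 1 : 0;   # looser, but harmless in logs
my $short = '**'  =~ $prefix ? 1 : 0;   # the count is still enforced
```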
    qr
    {
        <nocontext:>

        <data>
        <rule: data>    <[entry]>+
        <rule: entry>   <ws:[\s:]+> <ref_id> <prefix>? <text>

        <token: ref_id> ^(\d+)
        <token: prefix> [*>:=]{3}
        <token: text>   .+
    }xm;
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines:
– Sub-rules can be defined for the different line types

    <rule: command> "emerge" <ws:[\s:]+> <[switch]>+
    <token: switch> ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
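A flat-regex sketch of what the switch token collects, run over a sample command line taken from the log excerpts above:

```perl
use strict;
use warnings;

# Same pattern as <token: switch>: "--" followed by non-space,
# matched globally to gather every switch on the line.
my $cmd = 'emerge --jobs --autounmask-write --keep-going --deep talk';

my @switch = $cmd =~ m{ ( [-][-] \S+ ) }xg;
```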
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.
This makes modular or context-sensitive grammars relatively simple to compose:
– References can cross package or module boundaries
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters]
    >Heading 2
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters:
– Each species has a set of source & identifier pairs followed by a single description
– The within-species separator is a pipe ("|") with optional whitespace
– Species counts in some headers run into the thousands

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: Parse FASTA

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>    <start> <head> <ws> <[body]>+
        <rule: head>     .+ <ws>
        <rule: body>     ( <[seq]> | <comment> ) <ws>

        <token: start>   ^ [>]
        <token: comment> ^ [;] .+
        <token: seq>     ^ [\n\w\-]+
    }xm;
Instead of defining an entry rule, this just defines a name, "ParseFasta":
– This cannot be used to generate results by itself
– Accessible anywhere via Regexp::Grammars
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }):
– Allows inserting almost-arbitrary code into the regex
– "almost" because the code cannot include regexen

    seq =>
    [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ ]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines:
– join + split won't work because split uses a regex

    <rule: body> ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })
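The join-then-tr step behaves the same outside the grammar. This sketch shows why tr/// is the right tool where split is off-limits inside (?{ ... }) (the sequence chunks are shortened stand-ins):

```perl
use strict;
use warnings;

# join flattens the captured chunks; tr/// deletes the embedded
# newlines without compiling another regex inside the code block.
my @seq = ( "MASTQNIVEE\n", "VQKMLDTYDT\n", "NKDGEITKAE\n" );

my $body = join '' => @seq;
$body =~ tr/\n//d;
```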
One more step: Remove the arrayref

Now the body is a single string.
No need for an arrayref to contain one string: since the body has one entry, assign offset zero:

    body =>
    [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ]

    <rule: fasta> <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })
Result: a generic FASTA parser

    fasta =>
    [ { body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
      }
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting out identifiers for individual species.
Catch: \cA is a separator, not a terminator:
– The tail item on the list doesn't have a \cA to anchor on
– Using ".+? [\cA\n]" walks off the header onto the sequence
– This is a common problem with separators & tokenizers
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way
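The pitfall is easy to reproduce with a flat regex: anchoring every item on a trailing \cA silently drops the final, newline-terminated item. The identifiers below are shortened stand-ins:

```perl
use strict;
use warnings;

# Three header items separated (not terminated) by \cA.
my $head = "gi|123|ref|A\cAgi|456|ref|B\cAgi|789|ref|C\n";

# Anchoring on \cA loses the tail item...
my @anchored = $head =~ m{ (.+?) \cA }xg;

# ...while splitting on the separator keeps it.
my @split_up = split /\cA/, $head;
chomp @split_up;
```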
First pass: Literal "tail" item

This works, but is ugly:
– Have two rules, for the main list and the tail
– Alias the tail to get them all in one place

    <rule: head> <[ident]>+ <[ident=final]>
        (?{
            # remove the matched anchors

            tr/\cA\n//d for @{ $MATCH{ ident } };
        })

    <token: ident> .+? \cA
    <token: final> .+? \n
Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

    head =>
    { ident =>
      [ 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
      ]
    }
Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91        # numbers by commas, spaces
– g-c-a-g-t-t-a-c-a         # characters by dashes
– /usr/local/bin            # basenames by dir markers
– /usr:/usr/local/bin       # dirs separated by colons

that R::G has special syntax for dealing with them: combine the item with "%" and a separator:

    <rule: list>     <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias
– No code block required to strip the separators and trailing newline
– Non-greedy match ".+?" avoids capturing separators

    qr
    {
        <nocontext:>

        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>   <[ident]>+ % [\cA]
        <token: ident> .+?
    }xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers. Replace $MATCH from the "head" rule with the nested identifier contents:

    qr
    {
        <nocontext:>

        <extends: ParseFasta>

        <[fasta]>+

        <rule: head> <[ident]>+ % [\cA]
            (?{
                $MATCH = delete $MATCH{ ident };
            })

        <token: ident> .+?
    }xm;
Result

    fasta =>
    [ { body => 'MASTQNIVEEVQKMLDT…NPDQ',
        head =>
        [ 'gi|66816243|ref|XP_6…rt=CAF-1',
          'gi|793761|dbj|BAA0626…oideum]',
          'gi|60470106|gb|EAL68086…m discoideum AX4]'
        ]
      }
    ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )"
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1|hypothetical

becomes:

    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical'
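What the taxa separator buys can be sketched with a plain split: break on optionally-padded pipes, pop the trailing description, and pair up what is left (the entry is abbreviated from the example above):

```perl
use strict;
use warnings;

# Split the ident on pipes with optional surrounding whitespace,
# peel off the trailing description, and turn the remaining
# source/identifier pairs into a hash.
my $ident = 'gi|66816243|ref|XP_642131.1|hypothetical protein';

my @taxa = split m{ \s* [|] \s* }x, $ident;
my $desc = pop @taxa;

my %entry = ( @taxa, desc => $desc );
```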
Munging the separated input

    <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

    <rule: head>   <[ident]>+ % [\cA]
    <token: ident> <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>  .+?
Result: head with sources, "desc"

    fasta =>
    { body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
      head =>
      [ { desc => '30S ribosomal protein S18 [Lactococ',
          gi   => '15674171',
          ref  => 'NP_268346.1'
        },
        { desc => '30S ribosomal protein S18 [Lactoco',
          gi   => '116513137',
          ref  => 'YP_812044.1'
        },
Balancing R::G with calling code

The regex engine could process all of nr.gz:
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character
– Making it optional with <start> fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;

        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
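The chunking loop runs fine against an in-memory handle, which makes the off-by-one on the first read easy to see (the file contents below are invented):

```perl
use strict;
use warnings;

# ">" as the input record separator splits the stream into FASTA
# entries; the very first read returns only the leading ">" itself.
my $fasta = ">head one\nMAST\n>head two\nQNIV\n";

open my $fh, '<', \$fasta or die "open: $!";

local $/ = '>';

my @entry;

while( my $chunk = readline $fh )
{
    chomp $chunk;               # strip the trailing ">" separator
    length $chunk or next;      # skip the empty leading record

    push @entry, $chunk;
}
```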
Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta> <start> <head> <ws> <[body]>+
            (?{
                $MATCH{ body } = $MATCH{ body }[0];
            })

        <rule: head> .+ <ws>
        <rule: body> ( <[seq]> | <comment> ) <ws>
            (?{
                $MATCH = join '' => @{ delete $MATCH{ seq } };
                $MATCH =~ tr/\n//d;
            })

        <token: start>   ^ [>]
        <token: comment> ^ [;] .+
        <token: seq>     ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>

        <extends: ParseFasta>

        <fasta>
            (?{
                my $identz = delete $MATCH{ fasta }{ head }{ ident };

                for( @$identz )
                {
                    my $pairz = $_->{ taxa };
                    my $desc  = pop @$pairz;

                    $_ = { @$pairz, desc => $desc };
                }

                $MATCH{ fasta }{ head } = $identz;
            })

        <rule: head>  <[ident]>+ % [\cA]
        <rule: ident> <[taxa]>+ % ( \s* [|] \s* )
        <token: taxa> .+?
    }xm;
Result: Use grammars!

Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing, basic compilation
– Code only needed for cleanups or re-arranging structs

Code can simplify your grammar:
– Too much code makes them hard to maintain
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'"
– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner:
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD:
– It is worth taking time to learn how to optimize NFD regexen, however

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
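The recursive matching that perlre documents for v5.10+ is worth a look on its own: a named group can call itself, which is enough to match balanced parentheses, a classic job flat regexen cannot do:

```perl
use strict;
use warnings;

# (?<paren>...) names the group; (?&paren) re-enters it, so the
# pattern recurses through nested parentheses.
my $balanced = qr{ \A (?<paren> \( (?: [^()]++ | (?&paren) )* \) ) \z }x;

my $good = '(a(b)(c(d)))' =~ $balanced ? 1 : 0;
my $bad  = '(a(b'         =~ $balanced ? 1 : 0;
```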
The first Yet Another YACC
Yet Another Compiler Compiler ndash YACC takes in a standard-format grammar structure
ndash It processes tokens and their values organizing the results according to the grammar into a structure
Between the source and YACC is a tokenizerndash This parses the inputs into individual tokens defined by the grammar
ndash It doesnt know about structure only breaking the text stream up into tokens
Parsing is a pain in the lex
The real pain is gluing the parser and tokenizer togetherndash Tokenizers deal in the language of patterns
ndash Grammars are defined in terms of structure
Passing data between them makes for most of the difficultyndash One issue is the global yylex call which makes having multiple parsers
difficult
ndash Context-sensitive grammars with multiple sub-grammars are painful
The perly way
Regexen logic glue hmm been there beforendash The first approach most of us try is lexing with regexen
ndash Then add captures and if-blocks or excute (code) blocks inside of each regex
The problem is that the grammar is defined by your code structurendash Modifying the grammar requires re-coding it
ndash Hubris maybe but Truly Lazy it aint
ndash Was the whole reason for developing standard grammars amp their handlers in the first place
Early Perl Grammar Modules
These take in a YACC grammar and spit out compiler code Intentionally looked like YACC
ndash Able to re-cycle existing YACC grammar files
ndash Benefit from using Perl as a built-in lexer
ndash Perl-byacc amp ParseYapp
Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you
still have to plug in post-processing code to deal with the results
right =left - +left left NEGright ^
input empty
| input line push($_[1]$_[2]) $_[1]
line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok
exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]
Example ParseYapp grammar
The Swiss Army Chainsaw
ParseRecDescent extended the original BNF syntax combining the tokens amp handlers
Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers
ndash Rules with Perl blocks allows the code to do anything
ndash Results can be acquired from a hash an array or $1
ndash Left right associative tags simplify messy situations
Example PRD
This is part of an infix formula compiler I wrote
It compiles equations to a sequence of closures
add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]
add ltleftop mult add_op multgt compile_binop $item[1]
mult ltleftop factor mult_op factorgt compile_binop $item[1]
Just enough rope to shoot yourself
The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long
ndash Unless you really know what all of it does you may not be able to figure out the pieces
ndash Lots of really good docs that most people never read
Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers
ndash Good amp Bad thing it really is a compiler
ndash Bad rap for not doing well what it wasnt supposed to do at all
RIP PRD
Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6
ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster
The result is Perl6 Grammarsndash Declarative syntax extends matching with rules
ndash Built into Perl6 as a structure not an add-on
ndash Much faster
ndash Not available in Perl5
RegexGrammars
Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax
ndash Much better performance than PRD
Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself
ndash Simplifies mixing code and matching
ndash Single place to get the final results
ndash Cleaner syntax with automatic whitespace handling
Extending regexen
ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)
You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues
ndash multi-line mode (m) simplifies line anchors for line-oriented parsing
ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic
ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary
What you get
The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program
Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward
Comfortable for incremental development or refactoringndash Largely declarative syntax helps
ndash OOP provides inheritance with overrides for rules
my $compiler= do use RegexpGrammars
qr ltdatagt
ltrule data gt lt[text]gt+ ltrule text gt +
xm
Example Creating a compiler
Context can be a do-block subroutine or branch logic
ldquodatardquo is the entry rule
All this does is read lines into an array with automatic ws handling
Results
The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results
ndash Empty keys () hold input text (for errors or debugging)
ndash Easy to handle with DataDumper
The hash has at least one key for the entry rule one empty key for input data if context is being saved
For example feeding two lines of a Gentoo emerge log through the line grammar gives
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
    (?{ $MATCH = delete $MATCH{ident} })

    <token: ident> .+?
}xm;
Result

fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head =>
        [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]',
        ]
    }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )"
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical...

becomes:

{
    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical...',
}
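The same pairing logic can be sketched with a plain split in core Perl, no grammar involved (the truncated description is filled in only as sample data):

```perl
# Split a source|id|source|id|description header entry on pipes,
# pair up the leading fields, and keep the tail as "desc".
my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein';

my @taxa  = split /\s*[|]\s*/, $ident;
my $desc  = pop @taxa;              # trailing description
my %entry = ( @taxa, desc => $desc );

die unless $entry{gi}   eq '66816243';
die unless $entry{ref}  eq 'XP_642131.1';
die unless $entry{desc} eq 'hypothetical protein';
```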
Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{fasta}{head}{ident};

    for( @$identz )
    {
        my $pairz = $_->{taxa};
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{fasta}{head} = $identz;
})

<rule: head>  <[ident]>+ % [\cA]
<rule: ident> <[taxa]>+ % ( \s* [|] \s* )
<token: taxa> .+?
Result: head with sources & "desc"

fasta =>
{
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head =>
    [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => '15674171',
            ref  => 'NP_268346.1',
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => '116513137',
            ref  => 'YP_812044.1',
        },
    ]
}
Balancing R::G with calling code

The regex engine could process all of nr.gz:
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
– Making it optional with "<start>?" fixes the problem.

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
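A self-contained sketch of that chunked read, using an in-memory filehandle and made-up records in place of the real nr.gz stream:

```perl
# Set the input record separator to '>' so readline returns one
# FASTA record at a time; the leading '>' produces one empty
# chunk that has to be skipped.
my $fasta = ">one\nMAST\n>two\nQNIV\n";
open my $fh, '<', \$fasta or die $!;

local $/ = '>';

my @records;

while ( my $chunk = readline $fh )
{
    chomp $chunk;               # strip the trailing '>' separator
    length $chunk or next;      # leading '>' yields an empty chunk
    push @records, $chunk;
}

die unless @records == 2;
die unless $records[0] eq "one\nMAST\n";
```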
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta> <start> <head> <ws> <[body]>+
    (?{
        $MATCH{body} = $MATCH{body}[0];
    })

    <rule: head> .+ <ws>
    <rule: body> ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{seq} };
        $MATCH =~ tr/\n//d;
    })

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{fasta}{head}{ident};

        for( @$identz )
        {
            my $pairz = $_->{taxa};
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{fasta}{head} = $identz;
    })

    <rule: head>  <[ident]>+ % [\cA]
    <rule: ident> <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa> .+?
}xm;
Result: Use grammars

Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing, basic compilation.
– Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar:
– Too much code makes them hard to maintain.
– Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );

Look up $^H in perlvars to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize non-deterministic regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
The demo directory has a number of working – if un-annotated – examples.
"perldoc perlre" shows how recursive matching works in v5.10+.
PerlMonks has plenty of good postings.
Perl Review article by brian d foy on recursive matching in Perl 5.10.
Parsing is a pain in the lex

The real pain is gluing the parser and tokenizer together:
– Tokenizers deal in the language of patterns.
– Grammars are defined in terms of structure.

Passing data between them makes for most of the difficulty:
– One issue is the global yylex call, which makes having multiple parsers difficult.
– Context-sensitive grammars with multiple sub-grammars are painful.
The perly way

Regexen, logic, glue... hmm, been there before:
– The first approach most of us try is lexing with regexen.
– Then add captures and if-blocks, or execute (?{ code }) blocks inside of each regex.

The problem is that the grammar is defined by your code structure:
– Modifying the grammar requires re-coding it.
– Hubris, maybe, but Truly Lazy it ain't.
– Was the whole reason for developing standard grammars & their handlers in the first place.
Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code. Intentionally looked like YACC:
– Able to re-cycle existing YACC grammar files.
– Benefit from using Perl as a built-in lexer.
– Perl-byacc & Parse::Yapp.

Good: Recycles knowledge for YACC users.
Bad: Still not lazy. The grammars are difficult to maintain, and you still have to plug in post-processing code to deal with the results.
Example: Parse::Yapp grammar

%right  '='
%left   '-' '+'
%left   '*' '/'
%left   NEG
%right  '^'

%%

input:  # empty
        | input line    { push( @{ $_[1] }, $_[2] ); $_[1] }
;

line:   '\n'            { $_[1] }
        | exp '\n'      { print "$_[1]\n" }
        | error '\n'    { $_[0]->YYErrok }
;

exp:    NUM
        | VAR           { $_[0]->YYData->{VARS}{ $_[1] } }
        | VAR '=' exp   { $_[0]->YYData->{VARS}{ $_[1] } = $_[3] }
        | exp '+' exp   { $_[1] + $_[3] }
        | exp '-' exp   { $_[1] - $_[3] }
        | exp '*' exp   { $_[1] * $_[3] }
;
The Swiss Army Chainsaw

Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers.
Grammars are largely declarative, using OO Perl to do the heavy lifting:
– OO interface allows multiple, context-sensitive parsers.
– Rules with Perl blocks allow the code to do anything.
– Results can be acquired from a hash, an array, or $1.
– Left/right associative tags simplify messy situations.
Example: P::RD

This is part of an infix formula compiler I wrote. It compiles equations to a sequence of closures.

add_op  : '+' | '-' | '%'       { $item[ 1 ] }
mult_op : '*' | '/' | '^'       { $item[ 1 ] }

add     : <leftop: mult add_op mult>
          { compile_binop @{ $item[1] } }

mult    : <leftop: factor mult_op factor>
          { compile_binop @{ $item[1] } }
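compile_binop itself is not shown in the slide; one way it could work is to fold the left-associative operand/operator list into a single closure. The operator table and list layout below are assumptions, not the talk's actual implementation:

```perl
# Fold [ operand, op, operand, op, ... ] left-to-right into one
# closure; operands are themselves closures returning numbers.
my %binop =
(
    '+' => sub { $_[0] + $_[1] },
    '*' => sub { $_[0] * $_[1] },
);

sub compile_binop
{
    my @stack = @_;
    my $code  = shift @stack;       # leftmost operand closure

    while( @stack )
    {
        my ( $op, $rhs ) = splice @stack, 0, 2;
        my ( $lhs, $fun ) = ( $code, $binop{ $op } );

        $code = sub { $fun->( $lhs->(), $rhs->() ) };
    }

    $code
}

my $two   = sub { 2 };
my $three = sub { 3 };
my $four  = sub { 4 };

# ( 2 + 3 ) * 4, left-associative
my $add  = compile_binop( $two, '+', $three );
my $expr = compile_binop( $add, '*', $four  );

die unless $expr->() == 20;
```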
Just enough rope to shoot yourself

The biggest problem: P::RD is sloooooooow.
Learning curve is perl-ish: shallow and long.
– Unless you really know what all of it does, you may not be able to figure out the pieces.
– Lots of really good docs that most people never read.

Perly blocks also made it look too much like a job-dispatcher:
– People used it for a lot of things that are not compilers.
– Good & Bad thing: it really is a compiler.
– Bad rap for not doing well what it wasn't supposed to do at all.
RIP P::RD

Supposed to be replaced with Parse::FastDescent:
– Damian dropped work on P::FD for Perl6.
– His goal was to replace the shortcomings of P::RD with something more complete, and quite a bit faster.

The result is Perl6 Grammars:
– Declarative syntax extends matching with rules.
– Built into Perl6 as a structure, not an add-on.
– Much faster.
– Not available in Perl5.
Regexp::Grammars

Perl5 implementation derived from Perl6:
– Back-porting an idea, not the Perl6 syntax.
– Much better performance than P::RD.

Extends the v5.10 recursive matching syntax, leveraging the regex engine:
– Most of the speed issues are with regex design, not the parser itself.
– Simplifies mixing code and matching.
– Single place to get the final results.
– Cleaner syntax with automatic whitespace handling.
Extending regexen

"use Regexp::Grammars" turns on the added syntax:
– block-scoped (avoids collisions with existing code).

You will probably want to add "xm" or "xs":
– extended syntax avoids whitespace issues.
– multi-line mode (m) simplifies line anchors for line-oriented parsing.
– single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
– I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
What you get

The parser is simply a regex-ref:
– You can bless it or have multiple parsers in the same program.

Grammars can reference one another:
– Extending grammars via objects or modules is straightforward.

Comfortable for incremental development or refactoring:
– Largely declarative syntax helps.
– OOP provides inheritance with overrides for rules.
Example: Creating a compiler

my $compiler
= do
{
    use Regexp::Grammars;

    qr
    {
        <data>

        <rule: data> <[text]>+
        <rule: text> .+
    }xm;
};

Context can be a do-block, subroutine, or branch logic.
"data" is the entry rule.
All this does is read lines into an array with automatic ws handling.
Results

The results of parsing are in a tree-hash named %/.
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or debugging).
– Easy to handle with Data::Dumper.

The hash has at least one key for the entry rule, plus one empty key for the input data if context is being saved.

For example, feeding two lines of a Gentoo emerge log through the line grammar gives:

{
    '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    data =>
    {
        '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
        text =>
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
        ],
    },
}

Parsing a few lines of logfile.
Getting rid of context

The empty-keyed values are useful for development or explicit error messages.
They also get in the way, and can cost a lot of memory on large inputs.
You can turn them on and off with <context:> and <nocontext:> in the rules.

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data> <text>+    # oops: left off the []
    <rule: text> .+
}xm;

warn:
| Repeated subrule <text>+ will only capture its final match
| (Did you mean <[text]>+ instead?)
|

data =>
{
    text => '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
}
You usually want [] with +

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data> <[text]>+
    <rule: text> (.+)
}xm;

An array[ref] of text:

data =>
{
    text =>     # the [text] parses to an array of text
    [
        '1367874132:  Started emerge on: May 06, 2013 21:02:12',
        '1367874132:  *** emerge --jobs --autounmask-write ...',
    ],
}
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

<data>
<rule: data>    <[line]>+
<rule: line>    <ref_id> <[text]>
<token: ref_id> ^(\d+)
<rule: text>    .+

line =>
[
    {
        ref_id => '1367874132',
        text   => ':  Started emerge on: May 06, 2013 21:02:12',
    },
    ...
]
Removing cruft: "ws"

Be nice to remove the leading ":  " from text lines. In this case the "whitespace" needs to include a colon along with the spaces. Whitespace is defined by <ws: ... >:

<rule: line> <ws: [\s:]+ > <ref_id> <text>

{
    ref_id => '1367874132',
    text   => 'emerge --jobs --autounmask-wr...',
}
The prefix means something

Be nice to know what type of line was being processed. <prefix= (regex) > assigns the regex's capture to the "prefix" tag:

<rule: line> <ws: [\s:]+ > <ref_id> <entry>
<rule: entry>
    <prefix= ( [*][*][*] ) > <text>
|   <prefix= ( [>][>][>] ) > <text>
|   <prefix= ( [=][=][=] ) > <text>
|   <prefix= ( [:][:][:] ) > <text>
|   <text>

entry =>
{
    text => 'Started emerge on: May 06, 2013 21:02:12',
},
ref_id => '1367874132',

entry =>
{
    prefix => '***',
    text   => 'emerge --jobs --autounmask-write ...',
},
ref_id => '1367874132',

entry =>
{
    prefix => '>>>',
    text   => 'emerge (1 of 2) sys-apps/...',
},
ref_id => '1367874256',

"entry" now contains the optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results.
The match from "text" is aliased to a named type of log entry:

<rule: entry>
    <prefix= ( [*][*][*] ) > <command=text>
|   <prefix= ( [>][>][>] ) > <stage=text>
|   <prefix= ( [=][=][=] ) > <status=text>
|   <prefix= ( [:][:][:] ) > <final=text>
|   <message=text>

entry =>
{
    message => 'Started emerge on: May 06, 2013 21:02:12',
},
ref_id => '1367874132',

entry =>
{
    command => 'emerge --jobs --autounmask-write ...',
    prefix  => '***',
},
ref_id => '1367874132',

entry =>
{
    command => 'terminating.',
    prefix  => '***',
},
ref_id => '1367874133',

Generic "text" replaced with a type.
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.
A leading "." tells R::G to parse but not store the results in %/:

<rule: entry>
    <.prefix= ( [*][*][*] ) > <command=text>
|   <.prefix= ( [>][>][>] ) > <stage=text>
|   <.prefix= ( [=][=][=] ) > <status=text>
|   <.prefix= ( [:][:][:] ) > <final=text>
|   <message=text>

entry =>
{
    message => 'Started emerge on: May 06, 2013 21:02:12',
},
ref_id => '1367874132',

entry =>
{
    command => 'emerge --jobs --autounmask-write ...',
},
ref_id => '1367874132',

entry =>
{
    command => 'terminating.',
},
ref_id => '1367874133',

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

<ws: [\s:]+ > <ref_id>
(
    <.prefix= ( [*][*][*] ) > <command=text>
|   <.prefix= ( [>][>][>] ) > <stage=text>
|   <.prefix= ( [=][=][=] ) > <status=text>
|   <.prefix= ( [:][:][:] ) > <final=text>
|   <message=text>
)

data =>
{
    line =>
    [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => '1367874132',
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => '1367874132',
        },
        {
            command => 'terminating.',
            ref_id  => '1367874133',
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => '1367874137',
        },
    ]
}

Result: array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

<rule: entry> <ws: [\s:]+ > <ref_id> <type>? <text>

<token: type> ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry =>
[
    {
        ref_id => '1367874132',
        text   => 'Started emerge on: May 06, 2013 21:02:12',
    },
    {
        ref_id => '1367874133',
        text   => 'terminating.',
        type   => '***',
    },
    {
        ref_id => '1367874137',
        text   => 'Started emerge on: May 06, 2013 21:02:17',
    },
    {
        ref_id => '1367874137',
        text   => 'emerge --jobs --autounmask-write ...',
        type   => '***',
    },
]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.
Given the way these are used, use a character class:

    [*>:=]{3}
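The character class is easy to check against a sample line with core Perl (the log line below is abbreviated):

```perl
# One emerge-log line: numeric ref_id, ":  ", an optional 3-char
# prefix from the [*>:=] set, then the message text.
my $line = '1367874132:  *** emerge --jobs --deep talk';

my ( $ref_id, $prefix, $text ) =
    $line =~ m{ ^ (\d+) [:\s]+ (?: ([*>:=]{3}) \s+ )? (.+) $ }x;

die unless $ref_id eq '1367874132';
die unless $prefix eq '***';
die unless $text   eq 'emerge --jobs --deep talk';
```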
qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+
    <rule: entry>   <ws: [\s:]+ > <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;

This is the skeleton parser.

Doesn't take much:
– Declarative syntax.
– No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines:
– Sub-rules can be defined for the different line types.

<rule: command> "emerge" <.ws> <[switch]>+

<token: switch> ( [-][-]\S+ )

This is what makes the grammars useful: nested, context-sensitive content.
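A core-Perl check of the switch token against a sample command line:

```perl
# Pull "--" switches out of an emerge command line using the same
# pattern as the switch token above.
my $text = 'emerge --jobs --autounmask-write --deep talk';

my @switchz = $text =~ m/ ( [-][-]\S+ ) /xg;

die unless "@switchz" eq '--jobs --autounmask-write --deep';
```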
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.
This makes modular or context-sensitive grammars relatively simple to compose:
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed.
The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters]
>Heading 2
...
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters:
– Each species has a set of source & identifier pairs followed by a single description.
– The within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta> <start> <head> <ws> <[body]>+

    <rule: head>  .+ <ws>
    <rule: body>  ( <[seq]> | <comment> ) <ws>

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta":
– This cannot be used to generate results by itself.
– It is accessible anywhere via Regexp::Grammars.

The output needs help, however: the "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }):
– Allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.

seq =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY
DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ'
]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines:
– join + split won't work, because split uses a regex.

<rule: body> ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{seq} };
    $MATCH =~ tr/\n//d;
})
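The join-then-tr step in isolation, with a made-up two-line sequence:

```perl
# Join the captured seq lines with nothing, then strip the
# embedded newlines with tr (split is out: it takes a regex).
my @seq = ( "MASTQNIV\n", "EEVQKMLD\n" );

my $body = join '' => @seq;
$body =~ tr/\n//d;

die unless $body eq 'MASTQNIVEEVQKMLD';
```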
One more step: Remove the arrayref

Now the body is a single string:

body =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

There is no need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

<rule: fasta> <start> <head> <ws> <[body]>+
(?{
    $MATCH{body} = $MATCH{body}[0];
})
Result: a generic FASTA parser

fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    }
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case:

<extends: ParseFasta>

<[fasta]>+

references the grammar and extracts a list of fasta entries.

Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.
The perly way
Regexen logic glue hmm been there beforendash The first approach most of us try is lexing with regexen
ndash Then add captures and if-blocks or excute (code) blocks inside of each regex
The problem is that the grammar is defined by your code structurendash Modifying the grammar requires re-coding it
ndash Hubris maybe but Truly Lazy it aint
ndash Was the whole reason for developing standard grammars amp their handlers in the first place
Early Perl Grammar Modules
These take in a YACC grammar and spit out compiler code Intentionally looked like YACC
ndash Able to re-cycle existing YACC grammar files
ndash Benefit from using Perl as a built-in lexer
ndash Perl-byacc amp ParseYapp
Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you
still have to plug in post-processing code to deal with the results
right =left - +left left NEGright ^
input empty
| input line push($_[1]$_[2]) $_[1]
line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok
exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]
Example ParseYapp grammar
The Swiss Army Chainsaw
ParseRecDescent extended the original BNF syntax combining the tokens amp handlers
Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers
ndash Rules with Perl blocks allows the code to do anything
ndash Results can be acquired from a hash an array or $1
ndash Left right associative tags simplify messy situations
Example PRD
This is part of an infix formula compiler I wrote
It compiles equations to a sequence of closures
add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]
add ltleftop mult add_op multgt compile_binop $item[1]
mult ltleftop factor mult_op factorgt compile_binop $item[1]
Just enough rope to shoot yourself
The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long
ndash Unless you really know what all of it does you may not be able to figure out the pieces
ndash Lots of really good docs that most people never read
Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers
ndash Good amp Bad thing it really is a compiler
ndash Bad rap for not doing well what it wasnt supposed to do at all
RIP PRD
Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6
ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster
The result is Perl6 Grammarsndash Declarative syntax extends matching with rules
ndash Built into Perl6 as a structure not an add-on
ndash Much faster
ndash Not available in Perl5
RegexGrammars
Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax
ndash Much better performance than PRD
Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself
ndash Simplifies mixing code and matching
ndash Single place to get the final results
ndash Cleaner syntax with automatic whitespace handling
Extending regexen
ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)
You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues
ndash multi-line mode (m) simplifies line anchors for line-oriented parsing
ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic
ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary
What you get
The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program
Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward
Comfortable for incremental development or refactoringndash Largely declarative syntax helps
ndash OOP provides inheritance with overrides for rules
my $compiler= do use RegexpGrammars
qr ltdatagt
ltrule data gt lt[text]gt+ ltrule text gt +
xm
Example Creating a compiler
Context can be a do-block subroutine or branch logic
ldquodatardquo is the entry rule
All this does is read lines into an array with automatic ws handling
Results
The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results
ndash Empty keys () hold input text (for errors or debugging)
ndash Easy to handle with DataDumper
The hash has at least one key for the entry rule one empty key for input data if context is being saved
For example feeding two lines of a Gentoo emerge log through the line grammar gives
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
{
    <nocontext:>    # turn off globally

    <data>

    <rule: data>    <[text]>+
    <rule: text>    (.+)
}xm;
An array[ref] of text
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds structure:

<data>

<rule: data>        <[line]>+
<rule: line>        <ref_id> <[text]>
<token: ref_id>     ^(\d+)
<rule: text>        .+
line => [
    {
        ref_id => '1367874132',
        text   => ': Started emerge on: May 06, 2013 21:02:12',
    },
    ...
]
Removing cruft: "ws"

Be nice to remove the leading ": " from the text lines. In this case the "whitespace" needs to include a colon along with the spaces.
Whitespace is defined by <ws: ... >:

<rule: line>    <ws:[:\s]+> <ref_id> <text>

{
    ref_id => '1367874132',
    text   => 'emerge --jobs --autounmask-wr...',
}
The prefix means something

Be nice to know what type of line was being processed. <prefix=( regex )> assigns the regex's capture to the "prefix" tag:

<rule: line>    <ws:[:\s]*> <ref_id> <entry>

<rule: entry>
    <prefix=([*][*][*])> <text>
|   <prefix=([>][>][>])> <text>
|   <prefix=([=][=][=])> <text>
|   <prefix=([:][:][:])> <text>
|   <text>

{
    entry => {
        text => 'Started emerge on: May 06, 2013 21:02:12',
    },
    ref_id => '1367874132',
},
{
    entry => {
        prefix => '***',
        text   => 'emerge --jobs --autounmask-write ...',
    },
    ref_id => '1367874132',
},
{
    entry => {
        prefix => '>>>',
        text   => 'emerge (1 of 2) sys-apps/...',
    },
    ref_id => '1367874256',
},
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results

Aliases assign a key to rule results.
The match from "text" is aliased to a named type of log entry:
<rule: entry>
    <prefix=([*][*][*])> <command=text>
|   <prefix=([>][>][>])> <stage=text>
|   <prefix=([=][=][=])> <status=text>
|   <prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry => {
        message => 'Started emerge on: May 06, 2013 21:02:12',
    },
    ref_id => '1367874132',
},
{
    entry => {
        command => 'emerge --jobs --autounmask-write -- ...',
        prefix  => '***',
    },
    ref_id => '1367874132',
},
{
    entry => {
        command => 'terminating.',
        prefix  => '***',
    },
    ref_id => '1367874133',
},
Generic ldquotextrdquo replaced with a type
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.
A leading "." tells R::G to parse but not store the results in %/:

<rule: entry>
    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry => {
        message => 'Started emerge on: May 06, 2013 21:02:12',
    },
    ref_id => '1367874132',
},
{
    entry => {
        command => 'emerge --jobs --autounmask-write -',
    },
    ref_id => '1367874132',
},
{
    entry => {
        command => 'terminating.',
    },
    ref_id => '1367874133',
},
ldquoentryrdquo now has typed keys
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

<rule: line>
    <ws:[:\s]*> <ref_id>
    (
        <.prefix=([*][*][*])> <command=text>
    |   <.prefix=([>][>][>])> <stage=text>
    |   <.prefix=([=][=][=])> <status=text>
    |   <.prefix=([:][:][:])> <final=text>
    |   <message=text>
    )

data => {
    line => [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => '1367874132',
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => '1367874132',
        },
        {
            command => 'terminating.',
            ref_id  => '1367874133',
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => '1367874137',
        },
    ],
}
Result: an array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

<rule: entry>   <ws:[:\s]*> <ref_id> <type>? <text>

<token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry => [
    {
        ref_id => '1367874132',
        text   => 'Started emerge on: May 06, 2013 21:02:12',
    },
    {
        ref_id => '1367874133',
        text   => 'terminating.',
        type   => '***',
    },
    {
        ref_id => '1367874137',
        text   => 'Started emerge on: May 06, 2013 21:02:17',
    },
    {
        ref_id => '1367874137',
        text   => 'emerge --jobs --autounmask-write -- ...',
        type   => '***',
    },
]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.
Given the way these are used, use a character class with a count:

    [*>:=]{3}
qr
{
    <nocontext:>

    <data>

    <rule: data>    <[entry]>+
    <rule: entry>   <ws:[:\s]*> <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;
This is the skeleton parser

Doesn't take much:
- Declarative syntax.
- No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
- Sub-rules can be defined for the different line types:

    <rule: command>     "emerge" <.ws> <[switch]>+
    <token: switch>     ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
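One way to graft the command rule onto the skeleton parser is an alternation inside "entry". This is a sketch, not the deck's exact code; it assumes Regexp::Grammars from CPAN, and the fallback-to-text design is illustrative:

```perl
# Sketch: extend the skeleton so emerge lines parse into a
# command with its switches, falling back to plain text.
# Assumes Regexp::Grammars is installed from CPAN.
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr
{
    <nocontext:>

    <data>

    <rule: data>     <[entry]>+
    <rule: entry>    <ws:[:\s]*> <ref_id> <prefix>? ( <command> | <text> )

    <rule: command>  emerge <.ws> <[switch]>+   # only emerge lines match
    <token: switch>  ([-][-]\S+)                # long options: --jobs, --deep ...

    <token: ref_id>  ^(\d+)
    <token: prefix>  [*>:=]{3}
    <token: text>    .+
}xm;
```

Lines that start an emerge command come back with a "command" key holding an arrayref of switches; everything else still lands under "text".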
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings.
This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed.
The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters...]
>Heading 2
...

Example: A short nr.gz FASTA entry
Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- The within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.

(^A below marks the ctrl-A separator)

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>       <.start> <head> <.ws> <[body]>+
    <rule: head>        .+ <.ws>
    <rule: body>        ( <[seq]> | <comment> ) <.ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a grammar name, "ParseFasta".
- This cannot be used to generate results by itself.
- It is accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{...}).
- This allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.

seq => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY
DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ',
]

Munging results: $MATCH
The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work, because split uses a regex.

<rule: body>
    ( <[seq]> | <comment> ) <.ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
One more step: Remove the arrayref

Now the body is a single string.
No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

body => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
]

<rule: fasta>
    <.start> <head> <.ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })
Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    },
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case:

    <extends: ParseFasta>

    <[fasta]>+

references the grammar and extracts a list of fasta entries.
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting out the identifiers for individual species.
Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Anchoring the items on \cA alone walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works, but is ugly:
- Have two rules, for the main list and the tail.
- Alias the tail to get them all in one place.

<rule: head>
    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors

        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

<token: ident>  .+? \cA
<token: final>  .+ \n
Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    ],
}
Dealing with separators: <sep>

Separators happen often enough:
- 1, 2, 3, 4, 13, 91            # numbers by commas, spaces
- g-c-a-g-t-t-a-c-a             # characters by dashes
- /usr/local/bin                # basenames by dir markers
- /usr:/usr/local/bin           # dirs separated by colons

that R::G has special syntax for dealing with them: combining the item with "%" and a separator:

<rule: list>        <[item]>+ % <separator>     # one-or-more
<rule: list_zom>    <[item]>* % <separator>     # zero-or-more
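A tiny, self-contained example of the separated-list syntax: parsing a comma-separated list of numbers. This is a sketch; the rule names are illustrative, and it assumes a Regexp::Grammars version with the "%" separator operator:

```perl
# Sketch: parse '1, 2, 3, 4' into an array of numbers with
# the separated-list operator. Assumes Regexp::Grammars (CPAN).
use strict;
use warnings;
use Regexp::Grammars;
use Data::Dumper;

my $list = qr
{
    <nocontext:>
    <numlist>

    <rule: numlist>  <[num]>+ % [,]     # items separated by commas
    <token: num>     \d+
}xm;

'1, 2, 3, 4' =~ $list
    and print Dumper $/{ numlist }{ num };  # arrayref of the four numbers
```

The separator itself is never captured: only the "num" matches land in the result tree.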
Cleaner nr.gz header rule

Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- The non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>
        <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

    <token: ident>  .+?
}xm;
Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
            'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
            'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
            'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        ],
    },
]

The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical...

becomes:

    {
        gi   => '66816243',
        ref  => 'XP_642131.1',
        desc => 'hypothetical...',
    }
Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?
Result: head with sources, "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => '15674171',
            ref  => 'NP_268346.1',
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => '116513137',
            ref  => 'YP_812044.1',
        },
        ...
    ],
}
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries. But chunking the input on ">" removes it as a leading character.
- Making the <start> token optional fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;

    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>
        <.start> <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

    <rule: head>    .+ <.ws>

    <rule: body>
        ( <[seq]> | <comment> ) <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;
Result: Use grammars

Most of the "real" work is done under the hood.
- Regexp::Grammars does the lexing and basic compilation.
- Code is only needed for cleanups or re-arranging structs.

Code can simplify your grammar.
- Too much code makes them hard to maintain.
- The trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
- This requires "use re 'eval'" and "no strict 'vars'".
- One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
- Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of PRD.
- It is worth taking the time to learn how to optimize NFA regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive. [comfortable chair, enjoyable beverage suggested]
The demo directory has a number of working (if un-annotated) examples.
"perldoc perlre" shows how recursive matching works in v5.10+.
PerlMonks has plenty of good postings.
Perl Review article by brian d foy on recursive matching in Perl 5.10.
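As a core-Perl taste of the recursive matching perlre describes, v5.10+ can match balanced parentheses with a named, recursive pattern and no modules at all (the rule name "parens" is illustrative):

```perl
# Core Perl v5.10+: recursive matching with a named group.
# Matches balanced parentheses without any CPAN modules.
use strict;
use warnings;
use feature 'say';

my $balanced = qr
{
    ^ (?&parens) $

    (?(DEFINE)
        # a parens group: '(' then non-paren runs or nested groups, then ')'
        (?<parens> [(] (?: [^()]++ | (?&parens) )* [)] )
    )
}x;

say '(a(b)c)' =~ $balanced ? 'balanced' : 'not balanced';  # balanced
say '(a(b c)' =~ $balanced ? 'balanced' : 'not balanced';  # not balanced
```

This is the v5.10 machinery Regexp::Grammars builds on: its rules and tokens compile down to named groups and recursive calls much like (?&parens) here.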
Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code. They intentionally looked like YACC:
- Able to re-cycle existing YACC grammar files.
- Benefit from using Perl as a built-in lexer.
- Perl-byacc & Parse::Yapp.

Good: Recycles knowledge for YACC users.
Bad: Still not lazy. The grammars are difficult to maintain, and you still have to plug in post-processing code to deal with the results.
%right  '='
%left   '-' '+'
%left   '*' '/'
%left   NEG
%right  '^'

%%

input:  # empty
    |   input line      { push( @{ $_[1] }, $_[2] ); $_[1] }
    ;

line:   '\n'            { $_[1] }
    |   exp '\n'        { print "$_[1]\n" }
    |   error '\n'      { $_[0]->YYErrok }
    ;

exp:    NUM
    |   VAR             { $_[0]->YYData->{VARS}{ $_[1] } }
    |   VAR '=' exp     { $_[0]->YYData->{VARS}{ $_[1] } = $_[3] }
    |   exp '+' exp     { $_[1] + $_[3] }
    |   exp '-' exp     { $_[1] - $_[3] }
    |   exp '*' exp     { $_[1] * $_[3] }
    ;

Example: Parse::Yapp grammar
The Swiss Army Chainsaw

Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers.
Grammars are largely declarative, using OO Perl to do the heavy lifting.
- The OO interface allows multiple, context-sensitive parsers.
- Rules with Perl blocks allow the code to do anything.
- Results can be acquired from a hash, an array, or $1.
- Left and right associative tags simplify messy situations.
Example: PRD

This is part of an infix formula compiler I wrote.
It compiles equations to a sequence of closures.

    add_op  : '+' | '-'         { $item[ 1 ] }
    mult_op : '*' | '/' | '^'   { $item[ 1 ] }

    add     : <leftop: mult add_op mult>
              { compile_binop $item[ 1 ] }

    mult    : <leftop: factor mult_op factor>
              { compile_binop $item[ 1 ] }
Just enough rope to shoot yourself

The biggest problem: PRD is sloooooooow.
The learning curve is perl-ish: shallow and long.
- Unless you really know what all of it does, you may not be able to figure out the pieces.
- Lots of really good docs that most people never read.

Perly blocks also made it look too much like a job-dispatcher.
- People used it for a lot of things that are not compilers.
- Good & Bad thing: it really is a compiler.
- It got a bad rap for not doing well what it wasn't supposed to do at all.
RIP: PRD

It was supposed to be replaced with Parse::FastDescent.
- Damian dropped work on PFD for Perl6.
- His goal was to replace the shortcomings of PRD with something more complete, and quite a bit faster.

The result is Perl6 Grammars.
- Declarative syntax extends matching with rules.
- Built into Perl6 as a structure, not an add-on.
- Much faster.
- Not available in Perl5.
Regexp::Grammars

A Perl5 implementation derived from Perl6.
- Back-porting an idea, not the Perl6 syntax.
- Much better performance than PRD.

It extends the v5.10 recursive matching syntax, leveraging the regex engine.
- Most of the speed issues are with regex design, not the parser itself.
- Simplifies mixing code and matching.
- A single place to get the final results.
- Cleaner syntax with automatic whitespace handling.
Extending regexen

"use Regexp::Grammars" turns on the added syntax.
- It is block-scoped (avoids collisions with existing code).

You will probably want to add "xm" or "xs":
- Extended syntax avoids whitespace issues.
- Multi-line mode (m) simplifies line anchors for line-oriented parsing.
- Single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
- I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
What you get

The parser is simply a regex-ref.
- You can bless it or have multiple parsers in the same program.

Grammars can reference one another.
- Extending grammars via objects or modules is straightforward.

Comfortable for incremental development or refactoring.
- The largely declarative syntax helps.
- OOP provides inheritance with overrides for rules.
my $compiler= do use RegexpGrammars
qr ltdatagt
ltrule data gt lt[text]gt+ ltrule text gt +
xm
Example Creating a compiler
Context can be a do-block subroutine or branch logic
ldquodatardquo is the entry rule
All this does is read lines into an array with automatic ws handling
Results
The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results
ndash Empty keys () hold input text (for errors or debugging)
ndash Easy to handle with DataDumper
The hash has at least one key for the entry rule one empty key for input data if context is being saved
For example feeding two lines of a Gentoo emerge log through the line grammar gives
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident> .+?
}xm
Result

fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head =>
        [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]',
        ],
    },
]
The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description:
- Add a separator clause: "% (?: \s* [|] \s* )"
- This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1|hypothetical...

Becomes:

{
    gi   => 66816243,
    ref  => 'XP_642131.1',
    desc => 'hypothetical...',
}
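The same split-and-pop munging works in plain Perl outside the grammar (the sample identifier follows the slide's format; the variable names are mine):

```perl
my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827';

# Split on the pipe with optional surrounding whitespace:
my @taxa = split m/ \s* [|] \s* /x, $ident;

my $desc  = pop @taxa;                  # trailing description
my %entry = ( @taxa, desc => $desc );   # source => identifier pairs

# %entry: ( gi => '66816243', ref => 'XP_642131.1',
#           desc => 'hypothetical protein DDB_G0277827' )
```

Popping the odd element off first is what turns an odd-length list of fields into even source/identifier pairs plus a description.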
Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>   <[ident]>+ % [\cA]
<token: ident> <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa>  .+?
Result: head with sources & "desc"

fasta =>
{
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head =>
    [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => 15674171,
            ref  => 'NP_268346.1',
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => 116513137,
            ref  => 'YP_812044.1',
        },
        ...
Balancing R::G with calling code

The regex engine could process all of nr.gz:
- Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
- Making it optional with <start>? fixes the problem:

local $/ = "\n>";

while( my $chunk = readline )
{
    chomp;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
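The chunking behavior is worth seeing on its own. A minimal sketch using an in-memory filehandle (record contents invented):

```perl
my $fasta = ">head1\nSEQAAA\n>head2\nSEQBBB\n";
open my $fh, '<', \$fasta or die "open: $!";

local $/ = "\n>";           # chunk input on record boundaries

my @chunks;
while ( my $chunk = <$fh> )
{
    chomp $chunk;           # strips a trailing "\n>" where present
    push @chunks, $chunk;
}

# The first record keeps its leading '>'; later records have lost
# theirs to $/ -- which is why <start> has to be optional.
```

After the loop, $chunks[0] begins with ">" while $chunks[1] does not, so a mandatory <start> token would fail on every record after the first.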
Fasta base grammar: 3 lines of code

qr
{
    <grammar: Parse::Fasta>
    <nocontext:>

    <rule: fasta> <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head> .+ <ws>
    <rule: body> ( <[seq]> | <comment> )+ <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ ( [\n\w\-]+ )
}xm
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>  <[ident]>+ % [\cA]
    <rule: ident> <[taxa]>+ % (?: \s* [|] \s* )
    <token: taxa> .+?
}xm
Result: Use grammars!

Most of the "real" work is done under the hood:
- Regexp::Grammars does the lexing, basic compilation.
- Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar:
- Too much code makes them hard to maintain.
- Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
- This requires "use re 'eval'" and "no strict 'vars'".
- One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );

Look up $^H in perlvars to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
- Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of PRD.
- It is worth taking time to learn how to optimize NDF regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working -- if un-annotated -- examples.

"perldoc perlre" shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.
Example: Parse::Yapp grammar

%right  '='
%left   '-' '+'
%left   '*' '/'
%left   NEG
%right  '^'

%%

input:  # empty
    |   input line      { push @{ $_[1] }, $_[2]; $_[1] }
    ;

line:   '\n'            { $_[1] }
    |   exp '\n'        { print "$_[1]\n" }
    |   error '\n'      { $_[0]->YYErrok }
    ;

exp:    NUM
    |   VAR             { $_[0]->YYData->{VARS}{ $_[1] } }
    |   VAR '=' exp     { $_[0]->YYData->{VARS}{ $_[1] } = $_[3] }
    |   exp '+' exp     { $_[1] + $_[3] }
    |   exp '-' exp     { $_[1] - $_[3] }
    |   exp '*' exp     { $_[1] * $_[3] }
    ;
The Swiss Army Chainsaw

Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers.

Grammars are largely declarative, using OO Perl to do the heavy lifting:
- OO interface allows multiple, context-sensitive parsers.
- Rules with Perl blocks allow the code to do anything.
- Results can be acquired from a hash, an array, or $1.
- Left, right, associative tags simplify messy situations.
Example: PRD

This is part of an infix formula compiler I wrote. It compiles equations to a sequence of closures:

add_op  : '+' | '-'         { $item[ 1 ] }
mult_op : '*' | '/' | '^'   { $item[ 1 ] }

add     : <leftop: mult add_op mult>
        { compile_binop @{ $item[1] } }

mult    : <leftop: factor mult_op factor>
        { compile_binop @{ $item[1] } }
Just enough rope to shoot yourself

The biggest problem: PRD is sloooooooow.

Learning curve is perl-ish: shallow and long.
- Unless you really know what all of it does, you may not be able to figure out the pieces.
- Lots of really good docs that most people never read.

Perly blocks also made it look too much like a job-dispatcher:
- People used it for a lot of things that are not compilers.
- Good & Bad thing: it really is a compiler.
- Bad rap for not doing well what it wasn't supposed to do at all.
RIP PRD

Supposed to be replaced with Parse::FastDescent.
- Damian dropped work on PFD for Perl6.
- His goal was to replace the shortcomings of PRD with something more complete and quite a bit faster.

The result is Perl6 Grammars:
- Declarative syntax extends matching with rules.
- Built into Perl6 as a structure, not an add-on.
- Much faster.
- Not available in Perl5.
Regexp::Grammars

Perl5 implementation derived from Perl6:
- Back-porting an idea, not the Perl6 syntax.
- Much better performance than PRD.

Extends the v5.10 recursive matching syntax, leveraging the regex engine:
- Most of the speed issues are with regex design, not the parser itself.
- Simplifies mixing code and matching.
- Single place to get the final results.
- Cleaner syntax with automatic whitespace handling.
Extending regexen

"use Regexp::Grammars" turns on added syntax:
- block-scoped (avoids collisions with existing code).

You will probably want to add "xm" or "xs":
- extended syntax avoids whitespace issues.
- multi-line mode (m) simplifies line anchors for line-oriented parsing.
- single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
- I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
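What the flags buy you can be shown in two lines of plain Perl:

```perl
# /x: whitespace in the pattern is layout, not something to match.
die "no /x match" unless 'ab' =~ m/ a b /x;

# /m: ^ and $ anchor at embedded newlines, line by line.
my @words = "one\ntwo\nthree" =~ m/^(\w+)$/mg;   # ('one', 'two', 'three')
```

Without /m the second match would only anchor at the very start of the string, returning 'one' alone.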
What you get

The parser is simply a regex-ref:
- You can bless it or have multiple parsers in the same program.

Grammars can reference one another:
- Extending grammars via objects or modules is straightforward.

Comfortable for incremental development or refactoring:
- Largely declarative syntax helps.
- OOP provides inheritance with overrides for rules.
Example: Creating a compiler

my $compiler
= do
{
    use Regexp::Grammars;

    qr
    {
        <data>

        <rule: data> <[text]>+
        <rule: text> .+
    }xm;
};

Context can be a do-block, subroutine, or branch logic. "data" is the entry rule. All this does is read lines into an array with automatic ws handling.
Results

The results of parsing are in a tree-hash named %/:
- Keys are the rule names that produced the results.
- Empty keys ('') hold input text (for errors or debugging).
- Easy to handle with Data::Dumper.

The hash has at least one key for the entry rule, plus one empty key for input data if context is being saved.

For example, feeding two lines of a Gentoo emerge log through the line grammar gives:

'' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
data =>
{
    '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    text =>
    [
        '1367874132:  Started emerge on: May 06, 2013 21:02:12',
        '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    ],
},

Parsing a few lines of logfile.
Getting rid of context

The empty-keyed values are useful for development or explicit error messages.

They also get in the way and can cost a lot of memory on large inputs.

You can turn them on and off with <context:> and <nocontext:> in the rules:

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data> <text>+    # oops: left off the []
    <rule: text> .+
}xm;
warn
| Repeated subrule <text>+ will only capture its final match
| (Did you mean <[text]>+ instead?)
|

data =>
{
    text => '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
}

You usually want [] with +:

data =>
{
    text =>     # the [text] parses to an array of text
    [
        '1367874132:  Started emerge on: May 06, 2013 21:02:12',
        '1367874132:  emerge --jobs --autounmask-write ...',
    ],
}
qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data> <[text]>+
    <rule: text> (.+)
}xm;

An array[ref] of text.
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

<data>
<rule: data>    <[line]>+
<rule: line>    <ref_id> <[text]>
<token: ref_id> ^(\d+)
<rule: text>    .+

line =>
[
    {
        ref_id => 1367874132,
        text   => ':  Started emerge on: May 06, 2013 21:02:12',
    },
    ...
]
Removing cruft: "ws"

Be nice to remove the leading ":  " from text lines. In this case the "whitespace" needs to include a colon along with the spaces. Whitespace is defined by <ws: ... >:

<rule: line> <ws: [\s:]+ > <ref_id> <text>

ref_id => 1367874132,
text   => 'emerge --jobs --autounmask-wr...',
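The effect of folding the colon into whitespace can be sketched with a plain regex (the sample line follows the emerge-log format used above):

```perl
my $line = '1367874132:  Started emerge on: May 06, 2013 21:02:12';

# [\s:]+ plays the role of <ws: [\s:]+ >, eating the ':  ' between
# the id and the text:
my ( $ref_id, $text ) = $line =~ m/^ (\d+) [\s:]+ (.+) /x;

# $ref_id is '1367874132'
# $text   is 'Started emerge on: May 06, 2013 21:02:12'
```

The class is greedy, but it stops at the first character that is neither space nor colon, so the colon inside the text survives.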
The prefix means something

Be nice to know what type of line was being processed. <prefix= (regex) > assigns the regex's capture to the "prefix" tag:

<rule: line> <ws: [\s:]+ > <ref_id> <entry>

<rule: entry>
    <prefix= ([*][*][*]) > <text>
|   <prefix= ([>][>][>]) > <text>
|   <prefix= ([=][=][=]) > <text>
|   <prefix= ([:][:][:]) > <text>
|   <text>

entry =>
{
    text => 'Started emerge on: May 06, 2013 21:02:12',
},
ref_id => 1367874132,

entry =>
{
    prefix => '***',
    text   => 'emerge --jobs --autounmask-write ...',
},
ref_id => 1367874132,

entry =>
{
    prefix => '>>>',
    text   => 'emerge (1 of 2) sys-apps...',
},
ref_id => 1367874256,

"entry" now contains an optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results. The match from "text" is aliased to a named type of log entry:

<rule: entry>
    <prefix= ([*][*][*]) > <command=text>
|   <prefix= ([>][>][>]) > <stage=text>
|   <prefix= ([=][=][=]) > <status=text>
|   <prefix= ([:][:][:]) > <final=text>
|   <message=text>

entry =>
{
    message => 'Started emerge on: May 06, 2013 21:02:12',
},
ref_id => 1367874132,

entry =>
{
    command => 'emerge --jobs --autounmask-write ...',
    prefix  => '***',
},
ref_id => 1367874132,

entry =>
{
    command => 'terminating',
    prefix  => '***',
},
ref_id => 1367874133,

Generic "text" replaced with a type.
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled. A leading "." tells R::G to parse but not store the results in %/:

<rule: entry>
    <.prefix= ([*][*][*]) > <command=text>
|   <.prefix= ([>][>][>]) > <stage=text>
|   <.prefix= ([=][=][=]) > <status=text>
|   <.prefix= ([:][:][:]) > <final=text>
|   <message=text>

entry =>
{
    message => 'Started emerge on: May 06, 2013 21:02:12',
},
ref_id => 1367874132,

entry =>
{
    command => 'emerge --jobs --autounmask-write ...',
},
ref_id => 1367874132,

entry =>
{
    command => 'terminating',
},
ref_id => 1367874133,

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

<rule: line> <ws: [\s:]+ > <ref_id>
(
    <.prefix= ([*][*][*]) > <command=text>
|   <.prefix= ([>][>][>]) > <stage=text>
|   <.prefix= ([=][=][=]) > <status=text>
|   <.prefix= ([:][:][:]) > <final=text>
|   <message=text>
)

data =>
{
    line =>
    [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => 1367874132,
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => 1367874132,
        },
        {
            command => 'terminating',
            ref_id  => 1367874133,
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => 1367874137,
        },
    ],
}

Result: array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text. You can store an optional token followed by text:

<rule: entry> <ws: [\s:]+ > <ref_id> <type>? <text>

<token: type> ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry =>
[
    {
        ref_id => 1367874132,
        text   => 'Started emerge on: May 06, 2013 21:02:12',
    },
    {
        ref_id => 1367874133,
        text   => 'terminating',
        type   => '***',
    },
    {
        ref_id => 1367874137,
        text   => 'Started emerge on: May 06, 2013 21:02:17',
    },
    {
        ref_id => 1367874137,
        text   => 'emerge --jobs --autounmask-write ...',
        type   => '***',
    },
]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a character class with a count:

    [*>:=]{3}
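A quick check that the class-with-count form matches the same prefixes (sample lines invented):

```perl
my ( $p1 ) = '*** terminating'      =~ m/^ ( [*>:=]{3} ) /x;   # '***'
my ( $p2 ) = '>>> emerge (1 of 2)'  =~ m/^ ( [*>:=]{3} ) /x;   # '>>>'
```

One three-character class replaces a four-way alternation, and the capture still tells you which prefix was seen.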
This is the skeleton parser:

qr
{
    <nocontext:>

    <data>
    <rule: data> <[entry]>+

    <rule: entry> <ws: [\s:]+ > <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;

Doesn't take much:
- Declarative syntax.
- No Perl code at all!

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines:
- Sub-rules can be defined for the different line types.

<rule: command> "emerge" <.ws> <[switch]>+

<token: switch> ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
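The switch token can be exercised on its own with plain Perl (the command line is taken from the log sample):

```perl
my $cmd = 'emerge --jobs --autounmask-write --deep talk';

# Same pattern as <token: switch>, applied globally:
my @switchz = $cmd =~ m/ ( [-][-] \S+ ) /xg;
# ('--jobs', '--autounmask-write', '--deep')
```

Bare arguments like "talk" fall through, which is exactly what the grammar's command rule relies on.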
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.

This makes modular or context-sensitive grammars relatively simple to compose:
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed.

The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters]
>Heading 2
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters:
- Each species has a set of source & identifier pairs followed by a single description.
- Within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: Parse FASTA

qr
{
    <grammar: Parse::Fasta>
    <nocontext:>

    <rule: fasta> <start> <head> <ws> <[body]>+

    <rule: head> .+ <ws>
    <rule: body> ( <[seq]> | <comment> )+ <ws>

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "Parse::Fasta":
- This cannot be used to generate results by itself.
- Accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }):
- Allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.

seq =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    ...
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    'VQKLLNPDQ',
]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines:
- join + split won't work, because split uses a regex.

<rule: body> ( <[seq]> | <comment> )+ <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})
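The join-then-tr step is easy to check outside the grammar (sample sequence lines invented):

```perl
my @seq = ( "MASTQ\n", "NIVEE\n", "VQKML\n" );

my $body = join '' => @seq;
$body =~ tr/\n//d;      # tr/// is not a regex, so it is legal inside (?{...})

# $body is 'MASTQNIVEEVQKML'
```

The same two statements inside the (?{...}) block collapse the captured seq array into one clean string.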
One more step: Remove the arrayref

Now the body is a single string:
- No need for an arrayref to contain one string.
- Since the body has one entry, assign offset zero:

body =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
]

<rule: fasta> <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})
Result: a generic FASTA parser

fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    },
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

<extends: Parse::Fasta>

<[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Example PRD
This is part of an infix formula compiler I wrote. It compiles equations to a sequence of closures:

add_op  : '+' | '-'                         { $item[ 1 ] }
mult_op : '*' | '/' | '^'                   { $item[ 1 ] }

add     : <leftop: mult add_op mult>        { compile_binop $item[ 1 ] }
mult    : <leftop: factor mult_op factor>   { compile_binop $item[ 1 ] }
Just enough rope to shoot yourself
The biggest problem: PRD is sloooooooow. The learning curve is perl-ish: shallow and long.
- Unless you really know what all of it does, you may not be able to figure out the pieces.
- Lots of really good docs that most people never read.

Perly blocks also made it look too much like a job-dispatcher.
- People used it for a lot of things that are not compilers.
- Good & bad thing: it really is a compiler.
- Bad rap for not doing well what it wasn't supposed to do at all.
RIP PRD
Supposed to be replaced with Parse::FastDescent.
- Damian dropped work on PFD for Perl6.
- His goal was to replace the shortcomings of PRD with something more complete and quite a bit faster.

The result is Perl6 grammars.
- Declarative syntax extends matching with rules.
- Built into Perl6 as a structure, not an add-on.
- Much faster.
- Not available in Perl5.
Regexp::Grammars

Perl5 implementation derived from Perl6.
- Back-porting an idea, not the Perl6 syntax.
- Much better performance than PRD.

Extends the v5.10 recursive matching syntax, leveraging the regex engine.
- Most of the speed issues are with regex design, not the parser itself.
- Simplifies mixing code and matching.
- Single place to get the final results.
- Cleaner syntax with automatic whitespace handling.
Extending regexen
"use Regexp::Grammars" turns on the added syntax.
- Block-scoped (avoids collisions with existing code).

You will probably want to add "xm" or "xs".
- Extended syntax (x) avoids whitespace issues.
- Multi-line mode (m) simplifies line anchors for line-oriented parsing.
- Single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
- I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
What you get
The parser is simply a regex-ref.
- You can bless it or have multiple parsers in the same program.

Grammars can reference one another.
- Extending grammars via objects or modules is straightforward.

Comfortable for incremental development or refactoring.
- Largely declarative syntax helps.
- OOP provides inheritance with overrides for rules.

my $compiler = do
{
    use Regexp::Grammars;

    qr
    {
        <data>

        <rule: data>    <[text]>+
        <rule: text>    .+
    }xm;
};
Example Creating a compiler
Context can be a do-block, subroutine, or branch logic.

"data" is the entry rule.

All this does is read lines into an array with automatic ws handling.
Results
The results of parsing are in a tree-hash named %/.
- Keys are the rule names that produced the results.
- Empty keys ('') hold input text (for errors or debugging).
- Easy to handle with Data::Dumper.

The hash has at least one key for the entry rule, plus one empty key for input data if context is being saved.

For example, feeding two lines of a Gentoo emerge log through the line grammar gives:

{
    '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    data =>
    {
        '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
        text =>
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
        ]
    }
}
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with <context:> and <nocontext:> in the rules.

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>    <text>+     # oops: left off the []
    <rule: text>    .+
}xm;

warn | Repeated subrule <text>+ will only capture its final match
     | (Did you mean <[text]>+ instead?)
     |

data =>
{
    text => '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
}
You usually want [] with +
data =>
{
    text =>     # the [text] parses to an array of text
    [
        '1367874132:  Started emerge on: May 06, 2013 21:02:12',
        '1367874132:  *** emerge --jobs --autounmask-write ...'
    ]
}

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>    <[text]>+
    <rule: text>    (.+)
}xm;
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

<data>
<rule: data>    <[line]>+
<rule: line>    <ref_id> <[text]>
<token: ref_id> ^(\d+)
<rule: text>    .+

line =>
[
    {
        ref_id => '1367874132',
        text   => ':  Started emerge on: May 06, 2013 21:02:12'
    },
    ...
]
Removing cruft: "ws"

Be nice to remove the leading ":  " from text lines. In this case the "whitespace" needs to include a colon along with the spaces. Whitespace is defined by <ws: ... >:

<rule: line>    <ws:[:\s]+> <ref_id> <text>

{
    ref_id => '1367874132',
    text   => 'emerge --jobs --autounmask-wr'
}
The prefix means something
Be nice to know what type of line was being processed. <prefix=( regex )> assigns the regex's capture to the "prefix" tag:

<rule: line>    <ws:[:\s]+> <ref_id> <entry>

<rule: entry>
    <prefix=([*][*][*])> <text>
|   <prefix=([>][>][>])> <text>
|   <prefix=([=][=][=])> <text>
|   <prefix=([:][:][:])> <text>
|   <text>

{
    entry =>
    {
        text => 'Started emerge on: May 06, 2013 21:02:12'
    },
    ref_id => '1367874132'
},
{
    entry =>
    {
        prefix => '***',
        text   => 'emerge --jobs --autounmask-write'
    },
    ref_id => '1367874132'
},
{
    entry =>
    {
        prefix => '>>>',
        text   => 'emerge (1 of 2) sys-apps'
    },
    ref_id => '1367874256'
}

"entry" now contains an optional prefix.
Aliases can also assign tag results
Aliases assign a key to rule results
The match from "text" is aliased to a named type of log entry:

<rule: entry>
    <prefix=([*][*][*])> <command=text>
|   <prefix=([>][>][>])> <stage=text>
|   <prefix=([=][=][=])> <status=text>
|   <prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry =>
    {
        message => 'Started emerge on: May 06, 2013 21:02:12'
    },
    ref_id => '1367874132'
},
{
    entry =>
    {
        command => 'emerge --jobs --autounmask-write ...',
        prefix  => '***'
    },
    ref_id => '1367874132'
},
{
    entry =>
    {
        command => 'terminating.',
        prefix  => '***'
    },
    ref_id => '1367874133'
}

Generic "text" replaced with a type.
Parsing without capturing
At this point we don't really need the prefix strings, since the entries are labeled.

A leading '.' tells R::G to parse but not store the results in %/:

<rule: entry>
    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry =>
    {
        message => 'Started emerge on: May 06, 2013 21:02:12'
    },
    ref_id => '1367874132'
},
{
    entry =>
    {
        command => 'emerge --jobs --autounmask-write ...'
    },
    ref_id => '1367874132'
},
{
    entry =>
    {
        command => 'terminating.'
    },
    ref_id => '1367874133'
}

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

<rule: line>
    <ws:[:\s]+> <ref_id>
    (
        <.prefix=([*][*][*])> <command=text>
    |   <.prefix=([>][>][>])> <stage=text>
    |   <.prefix=([=][=][=])> <status=text>
    |   <.prefix=([:][:][:])> <final=text>
    |   <message=text>
    )

data =>
{
    line =>
    [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => '1367874132'
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => '1367874132'
        },
        {
            command => 'terminating.',
            ref_id  => '1367874133'
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => '1367874137'
        }
    ]
}

Result: array of "line" with ref_id & type.
Funny names for things
Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

<rule: entry>   <ws:[:\s]+> <ref_id> <type>? <text>

<token: type>
(
    [*][*][*]
|   [>][>][>]
|   [=][=][=]
|   [:][:][:]
)

Entries now have "text" and "type":

entry =>
[
    {
        ref_id => '1367874132',
        text   => 'Started emerge on: May 06, 2013 21:02:12'
    },
    {
        ref_id => '1367874133',
        text   => 'terminating.',
        type   => '***'
    },
    {
        ref_id => '1367874137',
        text   => 'Started emerge on: May 06, 2013 21:02:17'
    },
    {
        ref_id => '1367874137',
        text   => 'emerge --jobs --autounmask-write ...',
        type   => '***'
    }
]
prefix alternations look ugly
Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, a single character class with a count does the job:

    [*>:=]{3}
qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws:[:\s]+> <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;
This is the skeleton parser
Doesn't take much:
- Declarative syntax.
- No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser
Given the different line types it will be useful to extract commands, switches, outcomes from appropriate lines.
- Sub-rules can be defined for the different line types.

<rule: command> "emerge" <ws> <[switch]>+
<token: switch> ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
Inheriting amp Extending Grammars
<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr// without having to eval or deal with messy quote strings.

This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File
NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters]
>Heading 2
...
Example A short nrgz FASTA entry
Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- Within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;
Instead of defining an entry rule, this just defines a name, "ParseFasta".
- This cannot be used to generate results by itself.
- Accessible anywhere via Regexp::Grammars.
The output needs help, however.

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{ ... }).
- Allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.

seq =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    'VQKLLNPDQ'
]
Munging results $MATCH
The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work because split uses a regex.

<rule: body>    ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})
One more step: remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

body =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

<rule: fasta>   <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})
Result: a generic FASTA parser

fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    }
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar
Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived grammar references ParseFasta and extracts a list of fasta entries:

<extends: ParseFasta>
<[fasta]>+
Splitting the head into identifiers
Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using ".+ [\cA\n]" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: literal "tail" item

This works but is ugly.
- Have two rules, for the main list and the tail.
- Alias the tail to get them all in one place.

<rule: head>    <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors

    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident>  .+? \cA
<token: final>  .+? \n
Breaking up the header
The last header item is aliased to "ident". Breaks up all of the entries:

head =>
{
    ident =>
    [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}
Dealing with separators: <sep>

Separators happen often enough:
- 1, 2, 3, 4, 13, 91        numbers by commas, spaces
- g-c-a-g-t-t-a-c-a         characters by dashes
- /usr/local/bin            basenames by dir markers
- /usr:/usr/local/bin       dirs separated by colons

that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

<rule: list>        <[item]>+ % <separator>     # one-or-more
<rule: list_zom>    <[item]>* % <separator>     # zero-or-more
Cleaner nr.gz header rule

Separator syntax cleans things up.
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>

    <extends: ParseFasta>
    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident>  .+?
}xm;
Result
fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head =>
        [
            'gi|66816243|ref|XP_6...',
            'gi|1705556|sp|P5467...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]'
        ]
    }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure idents
Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical...

becomes:

{
    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical...'
}
Munging the separated input
<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+  % ( \s* [|] \s* )
<token: taxa>   .+?
Result: head with sources & "desc"

fasta =>
{
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head =>
    [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => '15674171',
            ref  => 'NP_268346.1'
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => '116513137',
            ref  => 'YP_812044.1'
        },
        ...
    ]
}
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
- Making it optional with <start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;

    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head>    .+ <ws>

    <rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+  % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;
Result: use grammars

Most of the "real" work is done under the hood.
- Regexp::Grammars does the lexing, basic compilation.
- Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar.
- Too much code makes them hard to maintain.
- Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
- This requires "use re 'eval'" and "no strict 'vars'".
- One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
- Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of PRD.
- It is worth taking time to learn how to optimize NDF regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
The demo directory has a number of working (if un-annotated) examples.
"perldoc perlre" shows how recursive matching works in v5.10+.
PerlMonks has plenty of good postings.
Perl Review article by brian d foy on recursive matching in Perl 5.10.
Just enough rope to shoot yourself
The biggest problem: P::RD is sloooooooow.
The learning curve is perl-ish: shallow and long.
- Unless you really know what all of it does, you may not be able to figure out the pieces.
- Lots of really good docs that most people never read.
Perly blocks also made it look too much like a job-dispatcher.
- People used it for a lot of things that are not compilers.
- Good & Bad thing: it really is a compiler.
- Bad rap for not doing well what it wasn't supposed to do at all.
RIP P::RD

It was supposed to be replaced with Parse::FastDescent.
- Damian dropped work on P::FD for Perl6.
- His goal was to replace the shortcomings of P::RD with something more complete and quite a bit faster.
The result is Perl6 grammars.
- Declarative syntax extends matching with rules.
- Built into Perl6 as a structure, not an add-on.
- Much faster.
- Not available in Perl5.
Regexp::Grammars

Perl5 implementation derived from Perl6.
- Back-porting an idea, not the Perl6 syntax.
- Much better performance than P::RD.
Extends the v5.10 recursive matching syntax, leveraging the regex engine.
- Most of the speed issues are with regex design, not the parser itself.
- Simplifies mixing code and matching.
- Single place to get the final results.
- Cleaner syntax with automatic whitespace handling.
Extending regexen

"use Regexp::Grammars" turns on the added syntax.
- block-scoped (avoids collisions with existing code).
You will probably want to add "xm" or "xs".
- extended syntax (x) avoids whitespace issues.
- multi-line mode (m) simplifies line anchors for line-oriented parsing.
- single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
- I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
What you get

The parser is simply a regex-ref.
- You can bless it or have multiple parsers in the same program.
Grammars can reference one another.
- Extending grammars via objects or modules is straightforward.
Comfortable for incremental development or refactoring.
- Largely declarative syntax helps.
- OOP provides inheritance with overrides for rules.
    my $compiler
    = do
    {
        use Regexp::Grammars;

        qr
        {
            <data>

            <rule: data>    <[text]>+
            <rule: text>    .+
        }xm;
    };
Example: Creating a compiler

Context can be a do-block, subroutine, or branch logic.
"data" is the entry rule.
All this does is read lines into an array, with automatic ws handling.
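For readers without Regexp::Grammars handy, the effect of those two rules can be sketched in core Perl; the input string here is invented for illustration:

```perl
# Core-Perl sketch of what <rule: data> <[text]>+ produces:
# split the input into lines, trim the automatic-ws noise,
# and keep the non-empty lines as an array.
my $input = "line one\n  line two\n\nline three\n";

my @text
= grep { length }               # drop blank lines
  map  { s/^\s+|\s+$//gr }      # strip leading/trailing whitespace
  split /\n/, $input;

# @text is now ( 'line one', 'line two', 'line three' )
```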
Results

The results of parsing are in a tree-hash named %/.
- Keys are the rule names that produced the results.
- Empty keys ("") hold input text (for errors or debugging).
- Easy to handle with Data::Dumper.
The hash has at least one key for the entry rule, plus one empty key for the input data if context is being saved.
For example, feeding two lines of a Gentoo emerge log through the line grammar gives:

    {
        "" => "1367874132:  Started emerge on: May 06, 2013 21:02:12\n1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk",
        data =>
        {
            "" => "1367874132:  Started emerge on: May 06, 2013 21:02:12\n1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk",
            text =>
            [
                "1367874132:  Started emerge on: May 06, 2013 21:02:12",
                "1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk",
            ],
        },
    }

Parsing a few lines of logfile
Getting rid of context

The empty-keyed values are useful for development or explicit error messages.
They also get in the way, and can cost a lot of memory on large inputs.
You can turn them on and off with <context:> and <nocontext:> in the rules.
    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data>    <text>+     # oops: left off the []
        <rule: text>    .+
    }xm;

    warn
    | Repeated subrule <text>+ will only capture its final match
    | (Did you mean <[text]>+ instead?)
    |

    {
        data =>
        {
            text => "1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk",
        },
    }
You usually want [] with +

    {
        data =>
        {
            # the [text] parses to an array of text
            text =>
            [
                "1367874132:  Started emerge on: May 06, 2013 21:02:12",
                "1367874132:  *** emerge --jobs --autounmask-write ...",
            ],
        },
    }

    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data>    <[text]>+
        <rule: text>    (.+)
    }xm;

An array[ref] of text.
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

    <data>
    <rule: data>      <[line]>+
    <rule: line>      <ref_id> <[text]>
    <token: ref_id>   ^(\d+)
    <rule: text>      .+

    line =>
    [
        {
            ref_id => "1367874132",
            text   => ":  Started emerge on: May 06, 2013 21:02:12",
        },
        ...
    ]
Removing cruft: "ws"

Be nice to remove the leading ":  " from the text lines. In this case the "whitespace" needs to include a colon along with the spaces.
Whitespace is defined by <ws: ... >:

    <rule: line>  <ws: [:\s]+ > <ref_id> <text>

    {
        ref_id => "1367874132",
        text   => "emerge --jobs --autounmask-wr...",
    }
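The same id-plus-cruft split can be sketched with a plain v5.10 named-capture regex, no Regexp::Grammars required; the log line is the sample used above:

```perl
# Peel the ref_id and the ':' whitespace cruft off one log line
# with named captures; %+ holds the named results after a match.
my $line = '1367874132:  Started emerge on: May 06, 2013 21:02:12';

my %entry;
$line =~ m{^ (?<ref_id> \d+ ) [:\s]+ (?<text> .+ ) }x
    and %entry = %+;

# %entry is ( ref_id => '1367874132',
#             text   => 'Started emerge on: May 06, 2013 21:02:12' )
```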
The prefix means something

Be nice to know what type of line was being processed.
<prefix=( regex )> assigns the regex's capture to the "prefix" tag:

    <rule: line>    <ws: [:\s]* > <ref_id> <entry>
    <rule: entry>
        <prefix=([*][*][*])> <text>
    |   <prefix=([>][>][>])> <text>
    |   <prefix=([=][=][=])> <text>
    |   <prefix=([:][:][:])> <text>
    |   <text>

    {
        entry  => { text => "Started emerge on: May 06, 2013 21:02:12" },
        ref_id => "1367874132",
    },
    {
        entry  => { prefix => "***", text => "emerge --jobs --autounmask-write ..." },
        ref_id => "1367874132",
    },
    {
        entry  => { prefix => ">>>", text => "emerge (1 of 2) sys-apps/..." },
        ref_id => "1367874256",
    },

"entry" now contains an optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results.
The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix=([*][*][*])> <command=text>
    |   <prefix=([>][>][>])> <stage=text>
    |   <prefix=([=][=][=])> <status=text>
    |   <prefix=([:][:][:])> <final=text>
    |   <message=text>

    {
        entry  => { message => "Started emerge on: May 06, 2013 21:02:12" },
        ref_id => "1367874132",
    },
    {
        entry  => { command => "emerge --jobs --autounmask-write ...", prefix => "***" },
        ref_id => "1367874132",
    },
    {
        entry  => { command => "terminating.", prefix => "***" },
        ref_id => "1367874133",
    },

Generic "text" replaced with a type.
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.
A leading "." tells R::G to parse but not store the results in %/:

    <rule: entry>
        <.prefix=([*][*][*])> <command=text>
    |   <.prefix=([>][>][>])> <stage=text>
    |   <.prefix=([=][=][=])> <status=text>
    |   <.prefix=([:][:][:])> <final=text>
    |   <message=text>

    {
        entry  => { message => "Started emerge on: May 06, 2013 21:02:12" },
        ref_id => "1367874132",
    },
    {
        entry  => { command => "emerge --jobs --autounmask-write ..." },
        ref_id => "1367874132",
    },
    {
        entry  => { command => "terminating." },
        ref_id => "1367874133",
    },

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

    <rule: line>
        <ws: [:\s]* > <ref_id>
        (
            <.prefix=([*][*][*])> <command=text>
        |   <.prefix=([>][>][>])> <stage=text>
        |   <.prefix=([=][=][=])> <status=text>
        |   <.prefix=([:][:][:])> <final=text>
        |   <message=text>
        )

    data =>
    {
        line =>
        [
            {
                message => "Started emerge on: May 06, 2013 21:02:12",
                ref_id  => "1367874132",
            },
            {
                command => "emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk",
                ref_id  => "1367874132",
            },
            {
                command => "terminating.",
                ref_id  => "1367874133",
            },
            {
                message => "Started emerge on: May 06, 2013 21:02:17",
                ref_id  => "1367874137",
            },
        ],
    }

Result: array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

    <rule: entry>   <ws: [:\s]* > <ref_id> <type>? <text>
    <token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

    entry =>
    [
        {
            ref_id => "1367874132",
            text   => "Started emerge on: May 06, 2013 21:02:12",
        },
        {
            ref_id => "1367874133",
            text   => "terminating.",
            type   => "***",
        },
        {
            ref_id => "1367874137",
            text   => "Started emerge on: May 06, 2013 21:02:17",
        },
        {
            ref_id => "1367874137",
            text   => "emerge --jobs --autounmask-write ...",
            type   => "***",
        },
    ]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.
Given the way these are used, a character class with a count does the job:

    [*>:=]{3}

    qr
    {
        <nocontext:>

        <data>
        <rule: data>      <[entry]>+

        <rule: entry>     <ws: [:\s]* > <ref_id> <prefix>? <text>

        <token: ref_id>   ^(\d+)
        <token: prefix>   [*>:=]{3}
        <token: text>     .+
    }xm;

This is the skeleton parser.
Doesn't take much:
- Declarative syntax.
- No Perl code at all.
Easy to modify by extending the definition of "text" for specific types of messages.
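A core-Perl analogue of the skeleton, applied line by line, produces the same shape of output; the two sample lines are invented for illustration:

```perl
# One regex per line: id, optional [*>:=]{3} prefix, then text,
# accumulated into an array of hashes like the grammar's result.
my @entryz;

for my $line
(
    '1367874132:  Started emerge on: May 06, 2013 21:02:12',
    '1367874133:  *** terminating.',
)
{
    $line =~ m{^ (?<ref_id> \d+ ) [:\s]+ (?: (?<prefix> [*>:=]{3} ) \s+ )? (?<text> .+ ) }x
        or next;

    push @entryz, { %+ };   # copy the named captures
}
```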
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines.
- Sub-rules can be defined for the different line types:

    <rule: command>   "emerge" <.ws> <[switch]>+
    <token: switch>   ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.
This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed.
The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters]
    >Heading 2
    ...
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- Within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ
    KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK
    VQKLLNPDQ

(Here "^A" marks the ctrl-A separators.)
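Before reaching for the grammar, the header structure can be sketched in core Perl; the two header fragments below are shortened versions of the entry above:

```perl
# Split one nr.gz header: species on ctrl-A ("\cA"), then each
# species into source|id pairs plus a trailing description.
my $head
= join "\cA" =>
  'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827',
  'gi|793761|dbj|BAA06266.1| calfumirin-1';

my @identz;

for my $species ( split /\cA/, $head )
{
    my ( $pairs, $desc ) = $species =~ m{^ (\S+) \s+ (.+) }x;

    my %ident = split /[|]/, $pairs;    # gi => ..., ref => ...
    $ident{ desc } = $desc;

    push @identz, \%ident;
}
```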
First step: Parse FASTA

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>     <.start> <head> <.ws> <[body]>+

        <rule: head>      .+ <.ws>
        <rule: body>      ( <[seq]> | <.comment> ) <.ws>

        <token: start>    ^ [>]
        <token: comment>  ^ [;] .+
        <token: seq>      ^ [\n\w\-]+
    }xm;
Instead of defining an entry rule, this just defines a name, "ParseFasta".
- This cannot be used to generate results by itself.
- Accessible anywhere via Regexp::Grammars.

The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }).
- Allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.

    seq =>
    [
        "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY",
        "DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP",
        "VQKLLNPDQ",
    ]
Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work because split uses a regex.

    <rule: body>  ( <[seq]> | <.comment> ) <.ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{seq} };
        $MATCH =~ tr/\n//d;
    })
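The join-then-tr cleanup is plain Perl and can be tried stand-alone; the sequence fragments here are invented:

```perl
# Join the captured chunks with nothing, then strip the embedded
# newlines with tr: the same cleanup the (?{ ... }) block performs.
my @seq  = ( "MASTQNIVEE\nVQKMLDTYDT\n", "NKDGEITKAE\n" );

my $body = join '' => @seq;
$body    =~ tr/\n//d;

# $body is now 'MASTQNIVEEVQKMLDTYDTNKDGEITKAE'
```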
One more step: Remove the arrayref

Now the body is a single string.
No need for an arrayref to contain one string: since the body has one entry, assign offset zero:

    body =>
    [
        "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ",
    ]

    <rule: fasta>   <.start> <head> <.ws> <[body]>+
    (?{
        $MATCH{body} = $MATCH{body}[0];
    })
Result: a generic FASTA parser

    fasta =>
    [
        {
            body => "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ",
            head => "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
        },
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results.
In this case the derived grammar references ParseFasta and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.
Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using ".+? [\cA\n]" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works, but is ugly.
- Have two rules, for the main list and the tail.
- Alias the tail to get them all in one place.

    <rule: head>    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ident} };
    })

    <token: ident>  .+? \cA
    <token: final>  .+ \n
Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

    head =>
    {
        ident =>
        [
            "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
            "gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1",
            "gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]",
            "gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
        ],
    }
Dealing with separators: % <sep>

Separators happen often enough:
- 1, 2, 3, 4, 13, 91        (numbers by commas, spaces)
- g-c-a-g-t-t-a-c-a         (characters by dashes)
- /usr/local/bin            (basenames by dir markers)
- /usr:/usr/local/bin       (dirs separated by colons)
that R::G has special syntax for dealing with them: combining the item with "%" and a separator:

    <rule: list>      <[item]>+ % <separator>    # one-or-more
    <rule: list_zom>  <[item]>* % <separator>    # zero-or-more
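For flat lists the % separator behaves like a disciplined split: items are kept, separators are dropped. A core-Perl equivalent for the comma case:

```perl
# split discards the separators and keeps only the items,
# which is what <[item]>+ % <separator> captures.
my @itemz = split /\s*,\s*/, '1, 2, 3, 4, 13, 91';

# @itemz is ( '1', '2', '3', '4', '13', '91' )
```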
Cleaner nr.gz header rule

Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        <token: ident>  .+?
    }xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents:

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ident};
        })

        <token: ident>  .+?
    }xm;
Result

    fasta =>
    [
        {
            body => "MASTQNIVEEVQKMLDT...NPDQ",
            head =>
            [
                "gi|66816243|ref|XP_6...",
                "...rt=CAF-1",
                "gi|793761|dbj|BAA0626...oideum]",
                "gi|60470106|gb|EAL68086...m discoideum AX4]",
            ],
        },
    ]

The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical

becomes:

    {
        gi   => "66816243",
        ref  => "XP_642131.1",
        desc => "hypothetical",
    }
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{fasta}{head}{ident};

        for( @$identz )
        {
            my $pairz = $_->{taxa};
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{fasta}{head} = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+  % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources & "desc"

    fasta =>
    {
        body => "MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...",
        head =>
        [
            {
                desc => "30S ribosomal protein S18 [Lactococ...",
                gi   => "15674171",
                ref  => "NP_268346.1",
            },
            {
                desc => "30S ribosomal protein S18 [Lactoco...",
                gi   => "116513137",
                ref  => "YP_812044.1",
            },
            ...
        ],
    }
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
- Making it optional with "<.start>?" fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>   <.start> <head> <.ws> <[body]>+
        (?{
            $MATCH{body} = $MATCH{body}[0];
        })

        <rule: head>    .+ <.ws>
        <rule: body>    ( <[seq]> | <.comment> ) <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{seq} };
            $MATCH =~ tr/\n//d;
        })

        <token: start>    ^ [>]
        <token: comment>  ^ [;] .+
        <token: seq>      ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{fasta}{head}{ident};

            for( @$identz )
            {
                my $pairz = $_->{taxa};
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{fasta}{head} = $identz;
        })

        <rule: head>    <[ident]>+ % [\cA]
        <rule: ident>   <[taxa]>+  % ( \s* [|] \s* )
        <token: taxa>   .+?
    }xm;
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
RIP PRD
Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6
ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster
The result is Perl6 Grammarsndash Declarative syntax extends matching with rules
ndash Built into Perl6 as a structure not an add-on
ndash Much faster
ndash Not available in Perl5
RegexGrammars
Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax
ndash Much better performance than PRD
Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself
ndash Simplifies mixing code and matching
ndash Single place to get the final results
ndash Cleaner syntax with automatic whitespace handling
Extending regexen
ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)
You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues
ndash multi-line mode (m) simplifies line anchors for line-oriented parsing
ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic
ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary
What you get
The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program
Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward
Comfortable for incremental development or refactoringndash Largely declarative syntax helps
ndash OOP provides inheritance with overrides for rules
my $compiler= do use RegexpGrammars
qr ltdatagt
ltrule data gt lt[text]gt+ ltrule text gt +
xm
Example Creating a compiler
Context can be a do-block subroutine or branch logic
ldquodatardquo is the entry rule
All this does is read lines into an array with automatic ws handling
Results
The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results
ndash Empty keys () hold input text (for errors or debugging)
ndash Easy to handle with DataDumper
The hash has at least one key for the entry rule one empty key for input data if context is being saved
For example feeding two lines of a Gentoo emerge log through the line grammar gives
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, the derived grammar references ParseFasta and extracts a list of fasta entries:

    <extends: ParseFasta>
    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting the identifiers for individual species.

Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using "[^\cA]+" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
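The missing-anchor catch is easy to show in core Perl with a made-up two-identifier header: a terminator-style match loses the tail item, while splitting on the separator keeps it.

```perl
use strict;
use warnings;

# Hypothetical two-identifier header: \cA separates, "\n" ends it.
my $head = "gi|123|ref|A\cAgi|456|ref|B\n";

# Terminator-style token: every item must be followed by \cA.
# The tail item has no trailing \cA, so it is never captured.
my @terminated = $head =~ m{ ([^\cA\n]+) \cA }gx;

# Separator-style split finds the tail as well.
chomp( my $copy = $head );
my @separated = split /\cA/, $copy;

print scalar @terminated, " vs ", scalar @separated, "\n";   # 1 vs 2
```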
First pass: literal "tail" item

This works but is ugly.
- Two rules are needed: one for the main list, one for the tail.
- Alias the tail to get them all in one place.

    <rule: head> <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident> [^\cA]+ \cA
    <token: final> [^\cA\n]+ \n
Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:
head =>
{
    ident =>
    [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}
Dealing with separators: "%"

Separators happen often enough:
- 1, 2, 3, 4, 13, 91        numbers separated by commas, spaces
- g-c-a-g-t-t-a-c-a         characters by dashes
- /usr/local/bin            basenames by dir markers
- /usr:/usr/local/bin       dirs separated by colons

...that R::G has special syntax for dealing with them, combining the item with "%" and a separator:

    <rule: list>     <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
Cleaner nr.gz header rule

Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>   <[ident]>+ % [\cA]
        <token: ident> .+?
    }xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents:

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head> <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

        <token: ident> .+?
    }xm;
Result

fasta => [ body => MASTQNIVEEVQKMLDT...NPDQ head => [ gi|66816243|ref|XP_6... ...rt=CAF-1 gi|793761|dbj|BAA0626... ...oideum] gi|60470106|gb|EAL68086... ...m discoideum AX4] ] ]
The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs, followed by a description.

Add a separator clause: "% (?: \s* [|] \s* )"
- This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical

becomes:

    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical'
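The pairs-plus-description munging can be sketched in core Perl, without the grammar; the identifier string below is a shortened, hypothetical example:

```perl
use strict;
use warnings;

# Hypothetical single identifier from an nr.gz header.
my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein';

# Split on a pipe with optional whitespace; the last field is the
# description, the rest are source => identifier pairs.
my @taxa = split /\s*[|]\s*/, $ident;
my $desc = pop @taxa;

my %entry = ( @taxa, desc => $desc );

print "$_ => $entry{$_}\n" for sort keys %entry;
```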
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>   <[ident]>+ % [\cA]
    <token: ident> <[taxa]>+  % (?: \s* [|] \s* )
    <token: taxa>  .+?
Result: head with sources + "desc"

fasta =>
{
    body => MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...
    head =>
    [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => '15674171',
            ref  => 'NP_268346.1'
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => '116513137',
            ref  => 'YP_812044.1'
        },
        ...
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries; but chunking the input on ">" removes it as a leading character.
- Making it optional, with "<start>?", fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or next;

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
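The chunking idea can be tried out with an in-memory filehandle and made-up records; this sketch keeps ">" in the record separator so each readline returns one entry:

```perl
use strict;
use warnings;

# Hypothetical two-record FASTA string and an in-memory filehandle.
my $fasta = ">one\nAAAA\n>two\nCCCC\n";
open my $fh, '<', \$fasta or die "open: $!";

# Chunk the input on '>' by making it part of the record separator.
local $/ = "\n>";

my @records;
while( my $chunk = readline $fh )
{
    chomp $chunk;          # strip a trailing "\n>" separator, if any
    $chunk =~ s/^>//;      # the first record keeps its leading '>'
    length $chunk or next;

    push @records, $chunk; # each record: "head\nsequence..."
}

print scalar @records, " records\n";   # 2 records
```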
Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta> <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

        <rule: head> .+ <ws>

        <rule: body> ( <[seq]> | <comment> )+ <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

        <token: start>   ^ [>]
        <token: comment> ^ [;] .+
        <token: seq>     ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

        <rule: head>  <[ident]>+ % [\cA]
        <rule: ident> <[taxa]>+  % (?: \s* [|] \s* )
        <token: taxa> .+?
    }xm;
Result: use grammars

Most of the "real" work is done under the hood.
- Regexp::Grammars does the lexing and basic compilation.
- Code is only needed for cleanups or re-arranging structs.

Code can simplify your grammar.
- Too much code makes them hard to maintain.
- The trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
- This requires "use re 'eval'" and "no strict 'vars'".
- One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
- Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD.
- It is worth taking the time to learn how to optimize NFA regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working - if un-annotated - examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review article by brian d foy on recursive matching in Perl 5.10.
Regexp::Grammars

Perl5 implementation derived from Perl6.
- Back-porting an idea, not the Perl6 syntax.
- Much better performance than P::RD.

Extends the v5.10 recursive matching syntax, leveraging the regex engine.
- Most of the speed issues are with regex design, not the parser itself.
- Simplifies mixing code and matching.
- Single place to get the final results.
- Cleaner syntax with automatic whitespace handling.
Extending regexen

"use Regexp::Grammars" turns on the added syntax.
- It is block-scoped (avoids collisions with existing code).

You will probably want to add "xm" or "xs".
- Extended syntax avoids whitespace issues.
- Multi-line mode (m) simplifies line anchors for line-oriented parsing.
- Single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
- I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
What you get

The parser is simply a regex-ref.
- You can bless it or have multiple parsers in the same program.

Grammars can reference one another.
- Extending grammars via objects or modules is straightforward.

Comfortable for incremental development or refactoring.
- Largely declarative syntax helps.
- OOP provides inheritance with overrides for rules.
    my $compiler = do
    {
        use Regexp::Grammars;

        qr
        {
            <data>

            <rule: data> <[text]>+
            <rule: text> .+
        }xm
    };
Example: creating a compiler

The context can be a do-block, subroutine, or branch logic.

"data" is the entry rule.

All this does is read lines into an array, with automatic ws handling.
Results

The results of parsing are in a tree-hash named %/.
- Keys are the rule names that produced the results.
- Empty keys ('') hold input text (for errors or debugging).
- Easy to handle with Data::Dumper.

The hash has at least one key for the entry rule, plus one empty key for the input data if context is being saved.

For example, feeding two lines of a Gentoo emerge log through the line grammar gives:
    '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12 ...',
    data =>
    {
        '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12 ...',
        text =>
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
        ]
    }

Parsing a few lines of logfile
Getting rid of context

The empty-keyed values are useful for development or explicit error messages.

They also get in the way, and can cost a lot of memory on large inputs.

You can turn them on and off with <context:> and <nocontext:> in the rules:

    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data> <text>+    # oops: left off the []
        <rule: text> .+
    }xm;

    warn
    | Repeated subrule <text>+ will only capture its final match
    | (Did you mean <[text]>+ instead?)
    |

    data => { text => '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk' }
You usually want [] with +

    data =>
    {
        text =>    # the [text] parses to an array of text
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  *** emerge --jobs --autounmask-write ...'
        ]
    }

    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data> <[text]>+
        <rule: text> (.+)
    }xm;

An array[ref] of text.
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

    <data>
    <rule: data>    <[line]>+
    <rule: line>    <ref_id> <[text]>
    <token: ref_id> ^(\d+)
    <rule: text>    .+

    line =>
    [
        {
            ref_id => '1367874132',
            text   => ':  Started emerge on: May 06, 2013 21:02:12'
        },
        ...
    ]
Removing cruft: "ws"

It would be nice to remove the leading ":  " from the text lines. In this case the "whitespace" needs to include a colon along with the spaces.

Whitespace is defined by <ws: ... >:

    <rule: line> <ws: [\s:]+ > <ref_id> <text>

    ref_id => '1367874132',
    text   => 'emerge --jobs --autounmask-wr...'
The prefix means something

It would be nice to know what type of line was being processed. <prefix= regex > assigns the regex's capture to the "prefix" tag:

    <rule: line>  <ws: [\s:]+ > <ref_id> <entry>

    <rule: entry>
        <prefix= ( [*][*][*] ) > <text>
      | <prefix= ( [>][>][>] ) > <text>
      | <prefix= ( [=][=][=] ) > <text>
      | <prefix= ( [:][:][:] ) > <text>
      | <text>

    entry => { text => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry => { prefix => '***', text => 'emerge --jobs --autounmask-write ...' },
    ref_id => '1367874132',

    entry => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps...' },
    ref_id => '1367874256'

"entry" now contains the optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results.

The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix= ( [*][*][*] ) > <command=text>
      | <prefix= ( [>][>][>] ) > <stage=text>
      | <prefix= ( [=][=][=] ) > <status=text>
      | <prefix= ( [:][:][:] ) > <final=text>
      | <message=text>

    entry => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry => { command => 'emerge --jobs --autounmask-write ...', prefix => '***' },
    ref_id => '1367874132',

    entry => { command => 'terminating.', prefix => '***' },
    ref_id => '1367874133'

Generic "text" replaced with a type.
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading "." tells R::G to parse, but not store the results in %/:

    <rule: entry>
        <.prefix= ( [*][*][*] ) > <command=text>
      | <.prefix= ( [>][>][>] ) > <stage=text>
      | <.prefix= ( [=][=][=] ) > <status=text>
      | <.prefix= ( [:][:][:] ) > <final=text>
      | <message=text>

    entry => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry => { command => 'emerge --jobs --autounmask-write ...' },
    ref_id => '1367874132',

    entry => { command => 'terminating.' },
    ref_id => '1367874133'

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

    <ws: [\s:]+ > <ref_id>
    (
        <.prefix= ( [*][*][*] ) > <command=text>
      | <.prefix= ( [>][>][>] ) > <stage=text>
      | <.prefix= ( [=][=][=] ) > <status=text>
      | <.prefix= ( [:][:][:] ) > <final=text>
      | <message=text>
    )

    data =>
    {
        line =>
        [
            { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' },
            { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => '1367874132' },
            { command => 'terminating.', ref_id => '1367874133' },
            { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => '1367874137' },
            ...

Result: an array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by the text:

    <rule: entry> <ws: [\s:]+ > <ref_id> <type>? <text>

    <token: type> ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

    entry =>
    [
        { ref_id => '1367874132', text => 'Started emerge on: May 06, 2013 21:02:12' },
        { ref_id => '1367874133', text => 'terminating.', type => '***' },
        { ref_id => '1367874137', text => 'Started emerge on: May 06, 2013 21:02:17' },
        { ref_id => '1367874137', text => 'emerge --jobs --autounmask-write ...', type => '***' },
        ...
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a character class:

    [*>:=]{3}
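A quick core-Perl check that the character class accepts the same three-character prefixes as the alternation; note the class is deliberately looser, since it also accepts mixed runs:

```perl
use strict;
use warnings;

my $alternation = qr{ ^ (?: [*]{3} | [>]{3} | [:]{3} | [=]{3} ) $ }x;
my $class       = qr{ ^ [*>:=]{3} $ }x;

# The four emerge.log prefixes match either form.
for my $prefix ( '***', '>>>', ':::', '===' )
{
    print "$prefix both\n"
        if $prefix =~ $alternation and $prefix =~ $class;
}

# Caveat: the class also accepts mixed runs the alternation rejects.
print "mixed run matches class only\n" if '*>=' =~ $class;
```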
    qr
    {
        <nocontext:>

        <data>
        <rule: data>  <[entry]>+

        <rule: entry> <ws: [\s:]+ > <ref_id> <prefix>? <text>

        <token: ref_id> ^(\d+)
        <token: prefix> [*>:=]{3}
        <token: text>   .+
    }xm;
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
- Sub-rules can be defined for the different line types:

    <rule: command> "emerge" <ws> <[switch]>+
    <token: switch> ( [-][-]\S+ )

This is what makes the grammars useful: nested, context-sensitive content.
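The switch pattern can be exercised on its own with a global match in core Perl; the command line below is a made-up sample:

```perl
use strict;
use warnings;

# Hypothetical command line from an emerge.log entry.
my $text = 'emerge --jobs --autounmask-write --keep-going talk';

# Same shape as the <switch> token: a literal '--' then non-space.
my @switchz = $text =~ m{ ( [-][-] \S+ ) }gx;

print "$_\n" for @switchz;
# --jobs
# --autounmask-write
# --keep-going
```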
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: parse FASTA

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>    <start> <head> <ws> <[body]>+

        <rule: head>     .+ <ws>
        <rule: body>     ( <[seq]> | <comment> )+ <ws>

        <token: start>   ^ [>]
        <token: comment> ^ [;] .+
        <token: seq>     ^ ( [\n\w\-]+ )
    }xm;
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Extending regexen
ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)
You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues
ndash multi-line mode (m) simplifies line anchors for line-oriented parsing
ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic
ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary
What you get
The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program
Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward
Comfortable for incremental development or refactoringndash Largely declarative syntax helps
ndash OOP provides inheritance with overrides for rules
my $compiler= do use RegexpGrammars
qr ltdatagt
ltrule data gt lt[text]gt+ ltrule text gt +
xm
Example Creating a compiler
Context can be a do-block subroutine or branch logic
ldquodatardquo is the entry rule
All this does is read lines into an array with automatic ws handling
Results
The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results
ndash Empty keys () hold input text (for errors or debugging)
ndash Easy to handle with DataDumper
The hash has at least one key for the entry rule one empty key for input data if context is being saved
For example feeding two lines of a Gentoo emerge log through the line grammar gives
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.
Given the way these are used, a character class with a count is simpler:

    [*>:=]{3}
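A quick core-Perl check that the class-with-count form accepts the same prefixed lines as the four literal alternations (sample log lines are made up for the demo):

```perl
#!/usr/bin/env perl
# Compare the four-branch alternation against the single class+count.
use strict;
use warnings;

my $alt   = qr/ ^ (?: [*]{3} | [>]{3} | [:]{3} | [=]{3} ) /x;
my $class = qr/ ^ [*>:=]{3} /x;

my @lines = (
    '*** terminating.',
    '>>> emerge (1 of 2)',
    '=== sync',
    '::: completed',
    'plain message',
);

my @by_alt   = grep { /$alt/   } @lines;   # the four prefixed lines
my @by_class = grep { /$class/ } @lines;   # same four lines
```

Note the class is slightly looser: it would also accept a mixed run like "*>=", which the alternation would not — acceptable here only because real emerge logs never mix the characters.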
qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws:[\s:]*> <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm
This is the skeleton parser

Doesn't take much:
– Declarative syntax.
– No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
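For comparison, a minimal core-Perl approximation of the same skeleton — one named-capture regex instead of a grammar. The sample log lines and the colon-flavored "whitespace" are assumptions matching the emerge-log examples in these slides:

```perl
#!/usr/bin/env perl
# Core-Perl sketch: named captures stand in for the grammar's
# <ref_id>, <prefix>, and <text> rules.
use strict;
use warnings;

my $line_re = qr{
    ^ (?<ref_id> \d+ ) [:\s]+           # entry id plus ":"-flavored whitespace
    (?: (?<prefix> [*>:=]{3} ) \s+ )?   # optional ***/>>>/===/::: marker
    (?<text> .+ )
}x;

my @log = (
    '1367874132:  Started emerge on: May 06, 2013 21:02:12',
    '1367874133:  *** terminating.',
);

my @entries;
for my $line ( @log ) {
    push @entries, { %+ } if $line =~ $line_re;   # copy named captures
}
# each entry has ref_id/text, plus prefix where one was present
```

The grammar version stays this readable as rules are added; the hand-rolled regex does not, which is the point of the slides.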
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines:
– Sub-rules can be defined for the different line types.

<rule: command> "emerge" <ws> <[switch]>+

<token: switch> ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
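A plain-Perl sanity check of the switch token's pattern on one of the example command lines (illustrative only; in the grammar the token is applied by <rule: command>, not with //g):

```perl
#!/usr/bin/env perl
# Apply the switch token's pattern directly to a command line.
use strict;
use warnings;

my $cmd      = 'emerge --jobs --autounmask-write --deep talk';
my @switches = $cmd =~ / ( [-][-] \S+ ) /gx;
# @switches => ( '--jobs', '--autounmask-write', '--deep' )
```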
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.

This makes modular or context-sensitive grammars relatively simple to compose:
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters]

>Heading 2
...
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters:
– Each species has a set of source & identifier pairs followed by a single description.
– Within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK
VQKLLNPDQ
First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ [\n\w-]+
}xm

Instead of defining an entry rule, this just defines a name, "ParseFasta":
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }):
– Allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.

seq => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    'VQKLLNPDQ'
]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines:
– join + split won't work, because split uses a regex.

<rule: body>
( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{seq} };
    $MATCH =~ tr/\n//d;
})
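The join-then-tr step on its own, in plain Perl (the sample sequence strings are made up):

```perl
#!/usr/bin/env perl
# Join captured sequence lines, then delete the embedded newlines
# with tr///d -- no regex involved, so it is legal inside (?{ ... }).
use strict;
use warnings;

my @seq  = ( "MASTQ\n", "NIVEE\n", "VQKML\n" );
my $body = join '' => @seq;
$body =~ tr/\n//d;
# $body => 'MASTQNIVEEVQKML'
```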
One more step: Remove the arrayref

Now the body is a single string.
No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

body => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

<rule: fasta>
<start> <head> <ws> <[body]>+
(?{
    $MATCH{body} = $MATCH{body}[0];
})
Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    }
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case:
– References the grammar and extracts a list of fasta entries.

<extends: ParseFasta>

<[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting out identifiers for individual species.
Catch: \cA is a separator, not a terminator:
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works but is ugly:
– Have two rules, for the main list and the tail.
– Alias the tail to get them all in one place.

<rule: head>
    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ident} };
    })

<token: ident>  .+? \cA
<token: final>  .+ \n
Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}
Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91      numbers by commas, spaces
– g-c-a-g-t-t-a-c-a       characters by dashes
– /usr/local/bin          basenames by dir markers
– /usr:/usr/local/bin     dirs separated by colons

that R::G has special syntax for dealing with them, combining the item with "%" and a separator:

<rule: list>      <[item]>+ % <separator>    # one-or-more
<rule: list_zom>  <[item]>* % <separator>    # zero-or-more
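The separator-only-between-items behavior can be mimicked in core Perl, which shows the bookkeeping that R::G's "%" operator hides (a sketch, not how R::G implements it):

```perl
#!/usr/bin/env perl
# Match items with a separator required only *between* items:
# \G anchors each match at the end of the previous one, and the
# final item is allowed to end at end-of-string instead.
use strict;
use warnings;

my $text  = '1, 2, 3, 4, 13, 91';
my @items = $text =~ / \G (\d+) (?: \s* , \s* | \z ) /gx;
# @items => ( 1, 2, 3, 4, 13, 91 ) -- no separators captured
```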
Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>

    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents:

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>
        <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ident};
        })

    <token: ident>  .+?
}xm
Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]'
        ]
    }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.
Add a separator clause, "% ( \s* [|] \s* )":
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical ...

becomes:

{
    gi   => 66816243,
    ref  => 'XP_642131.1',
    desc => 'hypothetical ...'
}
Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{fasta}{head}{ident};

    for( @$identz )
    {
        my $pairz = $_->{taxa};
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{fasta}{head} = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?
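The pairs-plus-description trick is easy to see in isolation: split one identifier on pipes, pop the trailing description, and let the even-sized remainder become hash pairs.

```perl
#!/usr/bin/env perl
# One nr.gz identifier -> hash of source/id pairs plus description.
use strict;
use warnings;

my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827';
my @taxa  = split /\s*\|\s*/, $ident;   # ( gi, 66816243, ref, XP_642131.1, desc... )
my $desc  = pop @taxa;                  # odd item at the end is the description
my %entry = ( @taxa, desc => $desc );   # even-sized list => key/value pairs
```

Popping the description first is what leaves an even-sized list, so the list-to-hash assignment pairs up cleanly.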
Result: head with sources, "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => 15674171,
            ref  => 'NP_268346.1'
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => 116513137,
            ref  => 'YP_812044.1'
        },
        ...
    ]
}
Balancing R::G with calling code

The regex engine could process all of nr.gz:
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
– Making it optional with <start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
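The chunking behavior is easy to try with an in-memory filehandle (a sketch with made-up records; the real loop reads the decompressed nr stream and hands each chunk to the grammar):

```perl
#!/usr/bin/env perl
# Split a FASTA stream into records by setting the input record
# separator ($/) to '>': each read ends at the *next* header's '>'.
use strict;
use warnings;

my $fasta = ">head one\nMASTQ\n>head two\nNIVEE\n";
open my $fh, '<', \$fasta or die "open: $!";

local $/ = '>';

my @chunks;
while ( my $chunk = readline $fh ) {
    chomp $chunk;            # chomp now strips the trailing '>', not "\n"
    length $chunk or next;   # first read is the bare leading '>'
    push @chunks, $chunk;
}
# @chunks => ( "head one\nMASTQ\n", "head two\nNIVEE\n" )
```

This is also why the records no longer start with ">", and why the grammar's <start> token has to become optional.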
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>
        <start> <head> <ws> <[body]>+
        (?{
            $MATCH{body} = $MATCH{body}[0];
        })

    <rule: head>    .+ <ws>

    <rule: body>
        ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{seq} };
            $MATCH =~ tr/\n//d;
        })

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ ( [\n\w-]+ )
}xm
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{fasta}{head}{ident};

        for( @$identz )
        {
            my $pairz = $_->{taxa};
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{fasta}{head} = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm
Result: Use grammars

Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing, basic compilation.
– Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar:
– Too much code makes them hard to maintain.
– Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller.
Look up $^H in perlvar to see how it works.

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner:
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD:
– It is worth taking the time to learn how to optimize NDF regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
The demo directory has a number of working – if un-annotated – examples.
"perldoc perlre" shows how recursive matching works in v5.10+.
PerlMonks has plenty of good postings.
Perl Review article by brian d foy on recursive matching in Perl 5.10.
What you get

The parser is simply a regex-ref:
– You can bless it or have multiple parsers in the same program.

Grammars can reference one another:
– Extending grammars via objects or modules is straightforward.

Comfortable for incremental development or refactoring:
– Largely declarative syntax helps.
– OOP provides inheritance with overrides for rules.
my $compiler
= do
{
    use Regexp::Grammars;

    qr
    {
        <data>

        <rule: data>  <[text]>+
        <rule: text>  .+
    }xm
};

Example: Creating a compiler

Context can be a do-block, subroutine, or branch logic.
"data" is the entry rule.
All this does is read lines into an array with automatic ws handling.
Results

The results of parsing are in a tree-hash named %/:
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or debugging).
– Easy to handle with Data::Dumper.

The hash has at least one key for the entry rule, one empty key for input data if context is being saved.
For example, feeding two lines of a Gentoo emerge log through the line grammar gives:

{
    '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    data => {
        '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
        text => [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
        ]
    }
}

Parsing a few lines of logfile
Getting rid of context

The empty-keyed values are useful for development or explicit error messages.
They also get in the way and can cost a lot of memory on large inputs.
You can turn them on and off with <context:> and <nocontext:> in the rules:

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>  <text>+    # oops: left off the []
    <rule: text>  .+
}xm
warn
| Repeated subrule <text>+ will only capture its final match
| (Did you mean <[text]>+ instead?)
|

data => {
    text => '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
}

You usually want [] with +.
data => {
    text => [    # the [text] parses to an array of text
        '1367874132:  Started emerge on: May 06, 2013 21:02:12',
        '1367874132:  emerge --jobs --autounmask-write –',
        ...
    ]
}

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>  <[text]>+
    <rule: text>  (.+)
}xm

An array[ref] of text.
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

<data>
<rule: data>    <[line]>+
<rule: line>    <ref_id> <[text]>
<token: ref_id> ^(\d+)
<rule: text>    .+

line =>
[
    {
        ref_id => 1367874132,
        text   => ':  Started emerge on: May 06, 2013 21:02:12'
    },
    ...
]
Removing cruft: "ws"

Be nice to remove the leading ":  " from text lines. In this case the "whitespace" needs to include a colon along with the spaces.
Whitespace is defined by <ws: ... >:

<rule: line> <ws:[\s:]+> <ref_id> <text>

ref_id => 1367874132,
text   => 'emerge --jobs --autounmask-wr...'
The prefix means something

Be nice to know what type of line was being processed.
<prefix=( regex )> assigns the regex's capture to the "prefix" tag:

<rule: line> <ws:[\s:]*> <ref_id> <entry>

<rule: entry>
    <prefix=([*][*][*])> <text>
  | <prefix=([>][>][>])> <text>
  | <prefix=([=][=][=])> <text>
  | <prefix=([:][:][:])> <text>
  | <text>

entry => { text => 'Started emerge on: May 06, 2013 21:02:12' },
ref_id => 1367874132,
entry => { prefix => '***', text => 'emerge --jobs --autounmask-write...' },
ref_id => 1367874132,
entry => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps/...' },
ref_id => 1367874256,

"entry" now contains optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results.
The match from "text" is aliased to a named type of log entry:

<rule: entry>
    <prefix=([*][*][*])> <command=text>
  | <prefix=([>][>][>])> <stage=text>
  | <prefix=([=][=][=])> <status=text>
  | <prefix=([:][:][:])> <final=text>
  | <message=text>
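Core Perl can fake the same labeling with named captures — the capture's name becomes the typed key (a comparison sketch, not R::G's mechanism):

```perl
#!/usr/bin/env perl
# Name the capture after the line type; %+ then carries a typed key.
use strict;
use warnings;

my $entry_re = qr{
    ^ [*]{3} \s+ (?<command> .+ ) $
  | ^ [>]{3} \s+ (?<stage>   .+ ) $
  | ^ [=]{3} \s+ (?<status>  .+ ) $
  | ^ [:]{3} \s+ (?<final>   .+ ) $
  | ^ (?<message> .+ ) $
}x;

'*** terminating.' =~ $entry_re;
my ($type) = keys %+;   # only the branch that matched populates %+
# $type => 'command', $+{command} => 'terminating.'
```

The grammar version scales better: each alias points at a shared <text> rule instead of repeating the capture body per branch.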
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
my $compiler= do use RegexpGrammars
qr ltdatagt
ltrule data gt lt[text]gt+ ltrule text gt +
xm
Example Creating a compiler
Context can be a do-block subroutine or branch logic
ldquodatardquo is the entry rule
All this does is read lines into an array with automatic ws handling
Results
The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results
ndash Empty keys () hold input text (for errors or debugging)
ndash Easy to handle with DataDumper
The hash has at least one key for the entry rule one empty key for input data if context is being saved
For example feeding two lines of a Gentoo emerge log through the line grammar gives
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed. <prefix=( regex )> assigns the regex's capture to the "prefix" tag:

<rule: line>    <ws: [\s:]+ >  <ref_id>  <entry>
<rule: entry>
    <prefix=([*][*][*])> <text>
|   <prefix=([>][>][>])> <text>
|   <prefix=([=][=][=])> <text>
|   <prefix=([:][:][:])> <text>
|   <text>

{
    entry  => { text => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',
},
{
    entry  => { prefix => '***', text => 'emerge --jobs --autounmask-write –' },
    ref_id => '1367874132',
},
{
    entry  => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps' },
    ref_id => '1367874256',
}
"entry" now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from "text" is aliased to a named type of log entry:

<rule: entry>
    <prefix=([*][*][*])> <command=text>
|   <prefix=([>][>][>])> <stage=text>
|   <prefix=([=][=][=])> <status=text>
|   <prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',
},
{
    entry  => { command => 'emerge --jobs --autounmask-write –', prefix => '***' },
    ref_id => '1367874132',
},
{
    entry  => { command => 'terminating.', prefix => '***' },
    ref_id => '1367874133',
}

Generic "text" replaced with a type
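The same classification works as a core-Perl sketch with an optional capture; the prefix-to-type mapping here is an assumption based on the grammar above:

```perl
# Assumed mapping of the three-character prefixes to line types.
my %type_for =
(
    '***' => 'command',
    '>>>' => 'stage',
    '===' => 'status',
    ':::' => 'final',
);

my $line = '>>> emerge (1 of 2) sys-apps/foo';   # hypothetical entry text

# Optional prefix capture, then the remaining text.
my ( $prefix, $text ) = $line =~ m{^ ([*>=:]{3})? \s* (.+) }x;
my $type              = $prefix ? $type_for{ $prefix } : 'message';
```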
Parsing without capturing
At this point we don't really need the prefix strings, since the entries are labeled.
A leading "." tells R::G to parse, but not store the results in %/:

<rule: entry>
    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',
},
{
    entry  => { command => 'emerge --jobs --autounmask-write -' },
    ref_id => '1367874132',
},
{
    entry  => { command => 'terminating.' },
    ref_id => '1367874133',
}

"entry" now has typed keys
The "entry" nesting gets in the way
The named subrule is not hard to get rid of: just move its syntax up one level.

<rule: line> <ws: [\s:]+ > <ref_id>
(
    <.prefix=([*][*][*])> <command=text>
  | <.prefix=([>][>][>])> <stage=text>
  | <.prefix=([=][=][=])> <status=text>
  | <.prefix=([:][:][:])> <final=text>
  | <message=text>
)

data =>
{
    line =>
    [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => '1367874132',
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => '1367874132',
        },
        {
            command => 'terminating.',
            ref_id  => '1367874133',
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => '1367874137',
        },
    ],
}
Result: array of "line" with ref_id & type
Funny names for things
Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

<rule: entry>   <ws: [\s:]+ >  <ref_id>  <type>?  <text>
<token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry =>
[
    {
        ref_id => '1367874132',
        text   => 'Started emerge on: May 06, 2013 21:02:12',
    },
    {
        ref_id => '1367874133',
        text   => 'terminating.',
        type   => '***',
    },
    {
        ref_id => '1367874137',
        text   => 'Started emerge on: May 06, 2013 21:02:17',
    },
    {
        ref_id => '1367874137',
        text   => 'emerge --jobs --autounmask-write –',
        type   => '***',
    },
]
Prefix alternations look ugly
Using a count works:

[*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.
Given the way these are used, use a single character class:

[*>=:]{3}
qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+
    <rule: entry>   <ws: [\s:]+ >  <ref_id>  <prefix>?  <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>=:]{3}
    <token: text>   .+
}xm;
This is the skeleton parser
Doesn't take much:
– Declarative syntax
– No Perl code at all

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser
Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines:
– Sub-rules can be defined for the different line types

<rule: command>     "emerge"  <ws>  <[switch]>+
<token: switch>     ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
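In plain Perl the switch extraction is a one-line global match; the command string here is a hypothetical emerge line:

```perl
my $cmd = 'emerge --jobs --autounmask-write --load-average=4.0 talk';

# Every "--switch" (with any =value attached) in one pass.
my @switchz = $cmd =~ m{ ( [-][-] \S+ ) }xg;
```

The grammar version does the same thing, but leaves the switches nested under the command that owns them.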
Inheriting & Extending Grammars
<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.
This makes modular or context-sensitive grammars relatively simple to compose:
– References can cross package or module boundaries
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+ GB uncompressed. The file consists of a simple FASTA format with headings separated
by ctrl-A chars:
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example: A short nr.gz FASTA entry
Headings are grouped by species, separated by ctrl-A ("\cA") characters:
– Each species has a set of source & identifier pairs followed by a single
description
– Within-species separator is a pipe ("|") with optional whitespace
– Species counts in some headers run into the thousands
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK
VQKLLNPDQ
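The ctrl-A layout is easy to demonstrate in miniature with core Perl; the two identifier strings are shortened versions of the header above:

```perl
# Build a two-species header joined on ctrl-A, then split it back apart.
my $head = join "\cA" =>
    'gi|66816243|ref|XP_642131.1| hypothetical protein',
    'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1';

my @species = split /\cA/, $head;
```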
First step: Parse FASTA
qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;
Instead of defining an entry rule, this just defines a name, "ParseFasta":
– This cannot be used to generate results by itself
– Accessible anywhere via Regexp::Grammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    'VQKLLNPDQ',
]
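A minimal core-Perl demonstration of a regex code block, independent of Regexp::Grammars: each time the "a" matches, the (?{ ... }) block runs and bumps a counter.

```perl
my $count = 0;

# The code block executes once per repetition of the group.
'aaa' =~ m{ (?: a (?{ $count++ }) )+ }x;
```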
Munging results: $MATCH
$MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines:
– join + split won't work because split uses a regex

<rule: body>    ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})
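The join-then-tr cleanup behaves the same way in isolation; the sequence lines here are made up:

```perl
my @seq = ( "MAST\n", "QNIV\n", "EE\n" );

# Join with nothing, then delete the embedded newlines in place.
( my $string = join '' => @seq ) =~ tr/\n//d;
```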
One more step: Remove the arrayref
Now the body is a single string.
No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

body =>
[
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
]

<rule: fasta>   <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})
Result: a generic FASTA parser

fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    },
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar
Existing grammars are "extended". The derived grammars are capable of producing results. In this case:

# references the grammar and extracts a list of fasta entries
<extends: ParseFasta>
<[fasta]>+
Splitting the head into identifiers
Overloading fasta's "head" rule allows splitting identifiers for individual species.
Catch: \cA is a separator, not a terminator
– The tail item on the list doesn't have a \cA to anchor on
– Using ".+ [\cA\n]" walks off the header onto the sequence
– This is a common problem with separators & tokenizers
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way
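split already has the separator semantics the grammar wants: the tail item needs no trailing \cA, and no empty trailing field shows up.

```perl
# Three items, two separators: split handles the anchorless tail for free.
my @idents = split /\cA/, "one\cAtwo\cAthree";
```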
First pass: Literal "tail" item
This works but is ugly:
– Have two rules for the main list and tail
– Alias the tail to get them all in one place

<rule: head>    <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors
    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident>  .+? \cA
<token: final>  .+ \n
Breaking up the header
The last header item is aliased to "ident". Breaks up all of the entries:

head =>
{
    ident =>
    [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    ],
}
Dealing with separators: "%"
Separators happen often enough:
– 1, 2, 3, 4, 13, 91 (numbers by commas, spaces)
– g-c-a-g-t-t-a-c-a (characters by dashes)
– /usr/local/bin (basenames by dir markers)
– /usr:/usr/local/bin (dirs separated by colons)

that R::G has special syntax for dealing with them: combining the item with "%" and a separator:

<rule: list>        <[item]>+ % <separator>     # one-or-more
<rule: list_zom>    <[item]>* % <separator>     # zero-or-more
Cleaner nr.gz header rule
Separator syntax cleans things up:
– No more tail rule with an alias
– No code block required to strip the separators and trailing newline
– Non-greedy match ".+?" avoids capturing separators
qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous
Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier
contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident>  .+?
}xm;
Result
fasta =>
[
    {
        body => 'MASTQNIVEEVQKMLDT…NPDQ',
        head =>
        [
            'gi|66816243|ref|XP_6…',
            '…rt=CAF-1',
            'gi|793761|dbj|BAA0626…oideum]',
            'gi|60470106|gb|EAL68086…m discoideum AX4]',
        ],
    },
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents
Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )"
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical…

Becomes:

{
    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical…',
}
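A core-Perl sketch of the same munging on a single hypothetical identifier: split on pipes, pop the odd trailing description, hash the rest.

```perl
my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827';

# Pipe separator with optional surrounding whitespace.
my @taxa = split m{ \s* [|] \s* }x, $ident;

# Odd element count: the last field is the description, the rest are pairs.
my $desc  = pop @taxa;
my %entry = ( @taxa, desc => $desc );
```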
Munging the separated input
<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?
Result: head with sources & "desc"

fasta =>
{
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN…',
    head =>
    [
        {
            desc => '30S ribosomal protein S18 [Lactococ…',
            gi   => '15674171',
            ref  => 'NP_268346.1',
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco…',
            gi   => '116513137',
            ref  => 'YP_812044.1',
        },
        …
    ],
}
Balancing R::G with calling code
The regex engine could process all of nr.gz:
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in
the heads
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character
– Making it optional with <start> fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp;
    length $chunk or next;

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
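The chunking loop can be exercised in miniature with an in-memory filehandle standing in for the nr file, with $/ set to the FASTA ">" marker:

```perl
my $nr = ">head1\nSEQA\n>head2\nSEQB\n";

open my $fh, '<', \$nr or die "open: $!";

local $/ = '>';

my @chunkz;

while( my $chunk = readline $fh )
{
    chomp $chunk;               # strip the trailing ">" record separator
    length $chunk or next;      # first read is the bare leading ">"

    push @chunkz, $chunk;
}
```

Each chunk is one record, minus its leading ">", which is exactly why the grammar's <start> token has to be optional.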
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;
Result: Use grammars
Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing, basic compilation
– Code only needed for cleanups or re-arranging structs

Code can simplify your grammar:
– Too much code makes them hard to maintain
– Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18
v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'"
– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller.
Look up $^H in perlvar to see how it works.

require re;     re->import( qw( eval ) );
require strict; strict->unimport( qw( vars ) );
Use Regexp::Grammars
Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
– Frankly, even if you do have old grammars…

Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize NFA regexen, however

Or, better yet, use Perl 6 grammars, available today at your local copy of Rakudo Perl 6.
More info on Regexp::Grammars
The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
The demo directory has a number of working – if un-annotated – examples.
"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review article by brian d foy on recursive matching in Perl
5.10.
Results
The results of parsing are in a tree-hash named %/:
– Keys are the rule names that produced the results
– Empty keys ('') hold input text (for errors or debugging)
– Easy to handle with Data::Dumper

The hash has at least one key for the entry rule, plus one empty key for input data if context is being saved.
For example, feeding two lines of a Gentoo emerge log through the line grammar gives:
{
    ''   => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    data =>
    {
        ''   => '1367874132:  Started emerge on: May 06, 2013 21:02:12
1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
        text =>
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
        ],
    },
}
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]
Parsing a few lines of logfile
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data>    <text>+     # oops: left off the []
        <rule: text>    .+
    }xm;
    warn:
    | Repeated subrule <text>+ will only capture its final match.
    | (Did you mean <[text]>+ instead?)

    data =>
    {
        text => '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
    }
You usually want [] with +

    data =>
    {
        text =>     # the [text] parses to an array of text
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  emerge --jobs --autounmask-write ...',
        ]
    }
    qr
    {
        <nocontext:>    # turn off globally

        <data>
        <rule: data>    <[text]>+
        <rule: text>    (.+)
    }xm;
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds structure to the results:
    <data>
    <rule: data>        <[line]>+
    <rule: line>        <ref_id> <[text]>
    <token: ref_id>     ^(\d+)
    <rule: text>        .+

    line =>
    [
        {
            ref_id => '1367874132',
            text   => 'Started emerge on: May 06, 2013 21:02:12',
        },
        ...
    ]
Removing cruft: "ws"

Be nice to remove the leading ":  " from the text lines. In this case the "whitespace" needs to include a colon along with the spaces. Whitespace is defined by <ws: ... >:

    <rule: line>    <ws: [\s:]+ > <ref_id> <text>
    ref_id => '1367874132',
    text   => 'emerge --jobs --autounmask-wr...'
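Outside the grammar, the effect of folding the colon into whitespace can be checked with a plain match. A stand-alone core-Perl sketch (the log line here is a made-up sample in emerge.log style):

```perl
# The character class matches <ws: [\s:]+ >: ':' plus spaces are skippable.
my $line = '1367874132:  emerge --jobs --autounmask-write';

my( $ref_id, $text ) = $line =~ m/^ (\d+) [\s:]+ (.+) /x;

print "$ref_id\n";  # 1367874132
print "$text\n";    # emerge --jobs --autounmask-write
```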
The prefix means something

Be nice to know what type of line was being processed. <prefix= regex > assigns the regex's capture to the "prefix" tag:
    <rule: line>    <ws: [\s:]+ > <ref_id> <entry>

    <rule: entry>
        <prefix=([*][*][*])> <text>
      | <prefix=([>][>][>])> <text>
      | <prefix=([=][=][=])> <text>
      | <prefix=([:][:][:])> <text>
      | <text>
    entry  => { text => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { prefix => '***', text => 'emerge --jobs --autounmask-write ...' },
    ref_id => '1367874132',

    entry  => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps/...' },
    ref_id => '1367874256',
"entry" now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results.
The match from "text" is aliased to a named type of log entry:
    <rule: entry>
        <prefix=([*][*][*])> <command=text>
      | <prefix=([>][>][>])> <stage=text>
      | <prefix=([=][=][=])> <status=text>
      | <prefix=([:][:][:])> <final=text>
      | <message=text>
    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { command => 'emerge --jobs --autounmask-write ...', prefix => '***' },
    ref_id => '1367874132',

    entry  => { command => 'terminating.', prefix => '***' },
    ref_id => '1367874133',
Generic "text" replaced with a type
Parsing without capturing
At this point we don't really need the prefix strings, since the entries are labeled.
A leading "." tells R::G to parse but not store the results in %MATCH:
    <rule: entry>
        <.prefix=([*][*][*])> <command=text>
      | <.prefix=([>][>][>])> <stage=text>
      | <.prefix=([=][=][=])> <status=text>
      | <.prefix=([:][:][:])> <final=text>
      | <message=text>
    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { command => 'emerge --jobs --autounmask-write ...' },
    ref_id => '1367874132',

    entry  => { command => 'terminating.' },
    ref_id => '1367874133',
"entry" now has typed keys
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

    <rule: line>
        <ws: [\s:]+ > <ref_id>
        (
            <.prefix=([*][*][*])> <command=text>
          | <.prefix=([>][>][>])> <stage=text>
          | <.prefix=([=][=][=])> <status=text>
          | <.prefix=([:][:][:])> <final=text>
          | <message=text>
        )
    data =>
    { line =>
      [
        { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' },
        { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => '1367874132' },
        { command => 'terminating.', ref_id => '1367874133' },
        { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => '1367874137' },
        ...
Result: array of "line" with ref_id & type
Funny names for things
Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

    <rule: entry>   <ws: [\s:]+ > <ref_id> <type>? <text>

    <token: type>
    (
        [*][*][*]
      | [>][>][>]
      | [=][=][=]
      | [:][:][:]
    )
Entries now have "text" and "type"

    entry =>
    [
        { ref_id => '1367874132', text => 'Started emerge on: May 06, 2013 21:02:12' },
        { ref_id => '1367874133', text => 'terminating.', type => '***' },
        { ref_id => '1367874137', text => 'Started emerge on: May 06, 2013 21:02:17' },
        { ref_id => '1367874137', text => 'emerge --jobs --autounmask-write ...', type => '***' },
        ...
Prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.
Given the way these are used, a character class with a count is simpler:

    [*>:=]{3}
    qr
    {
        <nocontext:>

        <data>
        <rule: data>    <[entry]>+

        <rule: entry>   <ws: [\s:]+ > <ref_id> <prefix>? <text>

        <token: ref_id> ^(\d+)
        <token: prefix> [*>:=]{3}
        <token: text>   .+
    }xm;
This is the skeleton parser
Doesn't take much:
- Declarative syntax.
- No Perl code at all!

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser
Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines.
- Sub-rules can be defined for the different line types.

    <rule: command>  emerge <ws> <[switch]>+
    <token: switch>  ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
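The <switch> token can be exercised on its own with a plain global match. A core-Perl sketch of what the sub-rule would capture (the command string is an assumed sample, not taken from a real log):

```perl
my $text = 'emerge --jobs --autounmask-write --deep talk';

# same pattern as <token: switch>, applied globally under /x
my @switchz = $text =~ m/ ( [-][-] \S+ ) /gx;

print "@switchz\n";  # --jobs --autounmask-write --deep
```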
Inheriting & Extending Grammars
<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents for a qr{} without having to eval or deal with messy quote strings.
This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File
NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format, with headings separated by ctrl-A chars:
    >Heading 1
    [amino-acid sequence characters]
    >Heading 2
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- Within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.
    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1
    gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]
    gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ
    KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK
    VQKLLNPDQ
First step: Parse FASTA

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>   <.start> <head> <.ws> <[body]>+

        <rule: head>    .+ <.ws>
        <rule: body>    ( <[seq]> | <.comment> ) <.ws>

        <token: start>      ^ [>]
        <token: comment>    ^ [;] .+
        <token: seq>        ^ [\n\w\-]+
    }xm;
Instead of defining an entry rule, this just defines a name, "ParseFasta".
- This cannot be used to generate results by itself.
- Accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{ ... })
- Allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.
    seq =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY...',
        ...
        'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP...',
        'VQKLLNPDQ',
    ]
Munging results: $MATCH

The $MATCH and %MATCH values can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work, because split uses a regex.
    <rule: body>
    ( <[seq]> | <.comment> ) <.ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
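The cleanup in that code block is plain Perl and can be tried in isolation. A sketch with made-up sequence fragments standing in for the <[seq]> captures:

```perl
# what <[seq]> hands back: an arrayref of newline-terminated chunks
my $seq = [ "MASTQNIVEE\n", "VQKMLDTYDT\n" ];

my $body = join '' => @$seq;   # join + split won't do: split takes a regex
$body =~ tr/\n//d;             # tr strips the embedded newlines

print "$body\n";  # MASTQNIVEEVQKMLDTYDT
```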
One more step: Remove the arrayref

Now the body is a single string.
No need for an arrayref to contain one string: since the body has one entry, assign offset zero.
    body =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
    ]

    <rule: fasta>   <.start> <head> <.ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })
Result: a generic FASTA parser

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
            head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        }
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar
Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived grammar references ParseFasta and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers
Overloading fasta's "head" rule allows splitting identifiers for individual species.
Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using ".+ [\cA\n]" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works but is ugly:
- Have two rules, for the main list and the tail.
- Alias the tail to get them all in one place.

    <rule: head>
    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident>  .+? \cA
    <token: final>  .+? \n
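The per-identifier cleanup is an ordinary tr///. A stand-alone sketch, with the trailing \cA and \n anchors attached by hand to made-up idents:

```perl
# last item ends in "\n", the rest in "\cA" (ctrl-A), as the tokens match them
my @identz = ( "gi|793761|dbj|BAA06266.1|\cA", "gi|60470106|gb|EAL68086.1|\n" );

tr/\cA\n//d for @identz;   # strip the matched anchors in place

print "$_\n" for @identz;
```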
Breaking up the header
The last header item is aliased to "ident". Breaks up all of the entries:

    head =>
    { ident =>
      [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
      ]
    }
Dealing with separators: <sep>

Separators happen often enough:
- 1, 2, 3, 4, 13, 91        (numbers by commas, spaces)
- g-c-a-g-t-t-a-c-a         (characters by dashes)
- /usr/local/bin            (basenames by dir markers)
- /usr:/usr/local/bin       (dirs separated by colons)

that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

    <rule: list>        <[item]>+ % <separator>     # one-or-more
    <rule: list_zom>    <[item]>* % <separator>     # zero-or-more
Cleaner nr.gz header rule

Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>

        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        <token: ident>  .+?
    }xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

        <token: ident>  .+?
    }xm;
Result:

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDT...NPDQ',
            head =>
            [
                'gi|66816243|ref|XP_6...rt=CAF-1',
                'gi|793761|dbj|BAA0626...oideum]',
                'gi|60470106|gb|EAL68086...m discoideum AX4]',
            ]
        }
    ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical ...

becomes:

    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical ...'
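That pair-list-to-hash step is core Perl. A sketch assuming the taxa list has already been split on the pipes:

```perl
# what the <[taxa]> captures would hold for one ident (assumed sample)
my @taxa = ( 'gi', '66816243', 'ref', 'XP_642131.1', 'hypothetical protein' );

my $desc  = pop @taxa;                 # trailing item is the description
my %ident = ( @taxa, desc => $desc );  # even list => source/identifier pairs

# %ident is ( gi => '66816243', ref => 'XP_642131.1', desc => '...' )
```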
Munging the separated input
    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources, "desc"

    fasta =>
    {
        body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
        head =>
        [
            {
                desc => '30S ribosomal protein S18 [Lactococ...',
                gi   => '15674171',
                ref  => 'NP_268346.1',
            },
            {
                desc => '30S ribosomal protein S18 [Lactoco...',
                gi   => '116513137',
                ref  => 'YP_812044.1',
            },
            ...
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
- Making it optional with <.start>? fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>   <.start> <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

        <rule: head>    .+ <.ws>
        <rule: body>    ( <[seq]> | <.comment> ) <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

        <token: start>      ^ [>]
        <token: comment>    ^ [;] .+
        <token: seq>        ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

        <rule: head>    <[ident]>+ % [\cA]
        <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
        <token: taxa>   .+?
    }xm;
Result: Use grammars

Most of the "real" work is done under the hood:
- Regexp::Grammars does the lexing, basic compilation.
- Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar:
- Too much code makes them hard to maintain.
- Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way the result is going to be more maintainable than hardwiring the grammar into code.
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Getting rid of context
The empty-keyed values are useful for development or explicit error messages
They also get in the way and can cost a lot of memory on large inputs
You can turn them on and off with ltcontextgt and ltnocontextgt in the rules
qr
ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +
xm
warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk
You usually want [] with +
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
qr{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>    <text>+     # oops: left off the [ ]
    <rule: text>    .+
}xm;

warn:
    | Repeated subrule <text>+ will only capture its final match
    | (Did you mean <[text]>+ instead?)

    data => { text => '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk' }

You usually want [] with +.

The <[text]> parses to an array of text:

    data =>
    { text =>
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  emerge --jobs --autounmask-write -',
            ...
        ]
    }

qr{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>    <[text]>+
    <rule: text>    (.+)
}xm;

An array[ref] of text.
Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

    <data>
    <rule: data>        <[line]>+
    <rule: line>        <ref_id> <[text]>
    <token: ref_id>     ^(\d+)
    <rule: text>        .+

    line =>
    [
        {
            ref_id => '1367874132',
            text   => ':  Started emerge on: May 06, 2013 21:02:12',
        },
        ...
    ]
Removing cruft: "ws"

Be nice to remove the leading ":  " from the text lines. In this case the "whitespace" needs to include a colon along with the spaces. Whitespace is defined by <ws: ... >:

    <rule: line>    <ws: [\s:]+ > <ref_id> <text>

    {
        ref_id => '1367874132',
        text   => 'emerge --jobs --autounmask-wr',
    }
The prefix means something

Be nice to know what type of line was being processed. <prefix= ( regex ) > assigns the regex's capture to the "prefix" tag:

    <rule: line>    <ws: [\s:]* > <ref_id> <entry>

    <rule: entry>
        <prefix= ( [*][*][*] ) > <text>
    |   <prefix= ( [>][>][>] ) > <text>
    |   <prefix= ( [=][=][=] ) > <text>
    |   <prefix= ( [:][:][:] ) > <text>
    |   <text>

    entry  => { text => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { prefix => '***', text => 'emerge --jobs --autounmask-write' },
    ref_id => '1367874132',

    entry  => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps' },
    ref_id => '1367874256',

"entry" now contains the optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results.

The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix= ( [*][*][*] ) > <command=text>
    |   <prefix= ( [>][>][>] ) > <stage=text>
    |   <prefix= ( [=][=][=] ) > <status=text>
    |   <prefix= ( [:][:][:] ) > <final=text>
    |   <message=text>

    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { command => 'emerge --jobs --autounmask-write -', prefix => '***' },
    ref_id => '1367874132',

    entry  => { command => 'terminating.', prefix => '***' },
    ref_id => '1367874133',

Generic "text" replaced with a type.
Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading "." tells R::G to parse but not store the results in %/:

    <rule: entry>
        <.prefix= ( [*][*][*] ) > <command=text>
    |   <.prefix= ( [>][>][>] ) > <stage=text>
    |   <.prefix= ( [=][=][=] ) > <status=text>
    |   <.prefix= ( [:][:][:] ) > <final=text>
    |   <message=text>

    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { command => 'emerge --jobs --autounmask-write -' },
    ref_id => '1367874132',

    entry  => { command => 'terminating.' },
    ref_id => '1367874133',

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

    <ws: [\s:]* > <ref_id>
    (
        <.prefix= ( [*][*][*] ) > <command=text>
    |   <.prefix= ( [>][>][>] ) > <stage=text>
    |   <.prefix= ( [=][=][=] ) > <status=text>
    |   <.prefix= ( [:][:][:] ) > <final=text>
    |   <message=text>
    )

    data =>
    { line =>
        [
            { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' },
            { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => '1367874132' },
            { command => 'terminating.', ref_id => '1367874133' },
            { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => '1367874137' },
        ]
    }

Result: an array of "line" with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

    <rule: entry>   <ws: [\s:]* > <ref_id> <type>? <text>

    <token: type>
        ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

    entry =>
    [
        { ref_id => '1367874132', text => 'Started emerge on: May 06, 2013 21:02:12' },
        { ref_id => '1367874133', text => 'terminating.', type => '***' },
        { ref_id => '1367874137', text => 'Started emerge on: May 06, 2013 21:02:17' },
        { ref_id => '1367874137', text => 'emerge --jobs --autounmask-write -', type => '***' },
        ...
    ]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a character class with a count:

    [*>=:]{3}
qr{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws: [\s:]* > <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>=:]{3}
    <token: text>   .+
}xm;

This is the skeleton parser

Doesn't take much:
- Declarative syntax.
- No Perl code at all!

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from the appropriate lines.
- Sub-rules can be defined for the different line types.

    <rule: command> emerge <ws> <[switch]>+
    <token: switch> ( [-][-]\S+ )

This is what makes the grammars useful: nested, context-sensitive content.
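As a sketch of where such a sub-rule could plug in, the skeleton's generic text can gain a <command> alternative for lines that invoke emerge. The alternation below is illustrative, not from the slides, and assumes the Regexp::Grammars CPAN module is installed:

```perl
# Sketch only: adds a <command> alternative to the skeleton entry rule.
# Assumes Regexp::Grammars is installed; the ( <command> | <text> )
# arrangement is an illustration, not the talk's final parser.
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <nocontext:>
    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws: [\s:]* > <ref_id> <prefix>?
                    ( <command> | <text> )

    <rule: command> emerge <ws> <[switch]>+
    <token: switch> ( [-][-]\S+ )

    <token: ref_id> ^(\d+)
    <token: prefix> [*>=:]{3}
    <token: text>   .+
}xm;
```

Lines that match the <command> branch then carry a "switch" array instead of a flat "text" string.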
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed.

The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

    >Heading 1

    [amino-acid sequence characters]

    >Heading 2
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- The within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1
    gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]
    gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ
    KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK
    VQKLLNPDQ
First step: Parse FASTA

qr{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta".
- This cannot be used to generate results by itself.
- It is accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{ ... }).
- Allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.

    seq =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
        'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
        'VQKLLNPDQ',
    ]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work because split uses a regex.

    <rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
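The join-then-tr cleanup behaves the same in plain Perl, which makes it easy to check outside the grammar. A standalone sketch, with the sequence lines invented for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Invented sample: sequence lines as the <[seq]> subrule would capture them.
my @seq = ( "MASTQNIVEEV\n", "QKMLDTYDTNK\n", "DGEITKAEAVE\n" );

# Join the chunks, then strip literal newlines with tr///, which does not
# touch the regex engine (off-limits inside a (?{ ... }) block).
my $body = join '' => @seq;
$body =~ tr/\n//d;

print "$body\n";    # MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVE
```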
One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

    body =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
    ]

    <rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })
Result: a generic FASTA parser

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
            head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        }
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived grammar references ParseFasta and extracts a list of fasta entries:

    <extends: ParseFasta>
    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting out the identifiers for individual species.

Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using ".+? [\cA\n]" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
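The separator-vs-terminator catch is easy to reproduce with core Perl's split, which also treats \cA as a separator and so handles the unanchored tail item for free. The three-identifier header below is made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up header: three identifiers separated, not terminated, by ctrl-A.
my $head = "gi|1|ref|A.1| x\cAgi|2|ref|B.1| y\cAgi|3|ref|C.1| z";

# Splitting on the separator: the tail item needs no trailing \cA to anchor on.
my @ident = split /\cA/, $head;

printf "%d identifiers, last = '%s'\n", scalar @ident, $ident[-1];
# 3 identifiers, last = 'gi|3|ref|C.1| z'
```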
First pass: Literal "tail" item

This works, but is ugly:
- There are two rules, for the main list and the tail.
- Alias the tail to get them all in one place.

    <rule: head>    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident>  .+? \cA
    <token: final>  .+ \n
Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

    head =>
    { ident =>
        [
            'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
            'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
            'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
            'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        ]
    }
Dealing with separators: <sep>

Separators happen often enough:
- 1, 2, 3, 4, 13, 91        numbers by commas, spaces
- g-c-a-g-t-t-a-c-a         characters by dashes
- /usr/local/bin            basenames by dir markers
- /usr:/usr/local/bin       dirs separated by colons

that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

    <rule: list>        <[item]>+ % <separator>    # one-or-more
    <rule: list_zom>    <[item]>* % <separator>    # zero-or-more
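Under the hood, <[item]>+ % <separator> matches like one item followed by any number of ( separator item ) pairs. A plain-regex sketch of that expansion, with no Regexp::Grammars needed, for the comma/space list above:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Plain-regex equivalent of  <[item]>+ % <separator>  for a number list:
# one item, then zero-or-more ( separator, item ) pairs.
my $sep_list = qr{ \d+ (?: [,\s]+ \d+ )* }x;

my $input = '1, 2, 3, 4, 13, 91';

if( $input =~ m{ \A ( $sep_list ) \z }x )
{
    my @item = split /[,\s]+/, $1;
    print scalar @item, " items\n";    # 6 items
}
```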
Cleaner nr.gz header rule

Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- The non-greedy match ".+?" avoids capturing separators.

qr{
    <nocontext:>
    <extends: ParseFasta>
    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr{
    <nocontext:>
    <extends: ParseFasta>
    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })
    <token: ident>  .+?
}xm;
Result:

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDT...NPDQ',
            head =>
            [
                'gi|66816243|ref|XP_6...rt=CAF-1',
                'gi|793761|dbj|BAA0626...oideum]',
                'gi|60470106|gb|EAL68086...m discoideum AX4]',
            ]
        }
    ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical ...

becomes:

    {
        gi   => '66816243',
        ref  => 'XP_642131.1',
        desc => 'hypothetical ...',
    }
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+  % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources, "desc"

    fasta =>
    {
        body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
        head =>
        [
            {
                desc => '30S ribosomal protein S18 [Lactococ',
                gi   => '15674171',
                ref  => 'NP_268346.1',
            },
            {
                desc => '30S ribosomal protein S18 [Lactoco',
                gi   => '116513137',
                ref  => 'YP_812044.1',
            },
            ...
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries. But chunking the input on ">" removes it as a leading character.
- Making it optional with "<start>?" fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;

        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
Fasta base grammar: 3 lines of code

qr{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code

qr{
    <nocontext:>
    <extends: ParseFasta>
    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+  % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]
qr
ltnocontextgt turn off globally
ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)
xm
An array[ref] of text
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Breaking up lines
Each log entry is prefixed with an entry id Parsing the ref_id off the front adds
ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +
line =gt[
ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212
hellip
]
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed. <prefix= regex > assigns the regex's capture to the "prefix" tag:

    <rule: line>  <ws: [:\s]* > <ref_id> <entry>
    <rule: entry>
        <prefix=([*][*][*])> <text>
      | <prefix=([>][>][>])> <text>
      | <prefix=([=][=][=])> <text>
      | <prefix=([<][<][<])> <text>
      | <text>

    entry => { text => 'Started emerge on: May 06, 2013 21:02:12' }, ref_id => 1367874132,
    entry => { prefix => '***', text => 'emerge --jobs --autounmask-write' }, ref_id => 1367874132,
    entry => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps' }, ref_id => 1367874256,

"entry" now contains an optional prefix.
Aliases can also assign tag results
Aliases assign a key to rule results. The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix=([*][*][*])> <command=text>
      | <prefix=([>][>][>])> <stage=text>
      | <prefix=([=][=][=])> <status=text>
      | <prefix=([<][<][<])> <final=text>
      | <message=text>

    entry => { message => 'Started emerge on: May 06, 2013 21:02:12' }, ref_id => 1367874132,
    entry => { command => 'emerge --jobs --autounmask-write', prefix => '***' }, ref_id => 1367874132,
    entry => { command => 'terminating', prefix => '***' }, ref_id => 1367874133,

Generic "text" replaced with a type.
Parsing without capturing
At this point we don't really need the prefix strings, since the entries are labeled.
A leading "." tells R::G to parse but not store the results in the match:

    <rule: entry>
        <.prefix=([*][*][*])> <command=text>
      | <.prefix=([>][>][>])> <stage=text>
      | <.prefix=([=][=][=])> <status=text>
      | <.prefix=([<][<][<])> <final=text>
      | <message=text>

    entry => { message => 'Started emerge on: May 06, 2013 21:02:12' }, ref_id => 1367874132,
    entry => { command => 'emerge --jobs --autounmask-write' }, ref_id => 1367874132,
    entry => { command => 'terminating' }, ref_id => 1367874133,

"entry" now has typed keys.
The "entry" nesting gets in the way
The named subrule is not hard to get rid of: just move its syntax up one level:

    <rule: line> <ws: [:\s]* > <ref_id>
    (
        <.prefix=([*][*][*])> <command=text>
      | <.prefix=([>][>][>])> <stage=text>
      | <.prefix=([=][=][=])> <status=text>
      | <.prefix=([<][<][<])> <final=text>
      | <message=text>
    )

    data => {
      line => [
        { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => 1367874132 },
        { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => 1367874132 },
        { command => 'terminating', ref_id => 1367874133 },
        { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => 1367874137 },
      ]
    }

Result: array of "line" with ref_id & type.
Funny names for things
Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

    <rule: entry> <ws: [:\s]* > <ref_id> <type>? <text>
    <token: type>
        ( [*][*][*] | [>][>][>] | [=][=][=] | [<][<][<] )

Entries now have "text" and "type":

    entry => [
      { ref_id => 1367874132, text => 'Started emerge on: May 06, 2013 21:02:12' },
      { ref_id => 1367874133, text => 'terminating', type => '***' },
      { ref_id => 1367874137, text => 'Started emerge on: May 06, 2013 21:02:17' },
      { ref_id => 1367874137, text => 'emerge --jobs --autounmask-write', type => '***' },
Prefix alternations look ugly
Using a count works:

    [*]{3} | [>]{3} | [<]{3} | [=]{3}

but isn't all that much more readable.
Given the way these are used, use a character-class block:

    [<*>=]{3}
    qr{
    <nocontext:>

    <data>
    <rule: data>  <[entry]>+

    <rule: entry> <ws: [:\s]* > <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [<*>=]{3}
    <token: text>   .+
    }xm;

This is the skeleton parser.
Doesn't take much:
- Declarative syntax.
- No Perl code at all.
Easy to modify by extending the definition of "text" for specific types of messages.
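As a cross-check of what the skeleton matches, here is a core-Perl approximation of the same line shape: an id, an optional three-character prefix from the [<*>=] class, then free text. The sample lines are shortened stand-ins, not verbatim log data.

```perl
use strict;
use warnings;

# One regex per line: "<id>: [optional ***-style prefix] text".
my $line_re = qr{ ^ (\d+) : \s+ (?: ([<*>=]{3}) \s+ )? (.+) }x;

my @lines =
(
    '1367874132:  Started emerge on: May 06, 2013 21:02:12',
    '1367874256:  >>> emerge (1 of 2) sys-apps',
);

for( @lines )
{
    my ($ref_id, $prefix, $text) = /$line_re/;

    printf "%-12s %-3s %s\n", $ref_id, $prefix // '', $text;
}
```

The grammar gets the same effect declaratively, and returns a structure instead of positional captures.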
Finishing the parser
Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines.
- Sub-rules can be defined for the different line types:

    <rule: command> "emerge" <ws> <[switch]>+
    <token: switch> ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
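The switch token's job, isolated into plain Perl: a global match pulls every long-form switch out of a command line. The command string is a shortened stand-in for a real emerge invocation.

```perl
use strict;
use warnings;

# Extract "--switch" words from an emerge command line.
my $cmd = 'emerge --jobs --autounmask-write --keep-going --deep talk';

my @switchz = $cmd =~ m{ ( [-][-] \S+ ) }xg;

print "@switchz\n";   # --jobs --autounmask-write --keep-going --deep
```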
Inheriting & Extending Grammars
<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr// without having to eval or deal with messy quoted strings.
This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File
NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters]
    >Heading 2

Example: a short nr.gz FASTA entry
Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- The within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: Parse FASTA

    qr{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta> <start> <head> <ws> <[body]>+

    <rule: head>  .+ <ws>
    <rule: body>  ( <[seq]> | <comment> ) <ws>

    <token: start>   ^ [>]
    <token: comment> ^ ; .+
    <token: seq>     ^ [\n\w\-]+
    }xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta".
- This cannot be used to generate results by itself.
- Accessible anywhere via Regexp::Grammars.
The output needs help, however
The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{ ... }).
- Allows inserting almost-arbitrary code into the regex.
- "Almost" because the code cannot include regexen.

    seq => [
      'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ'
    ]
Munging results: $MATCH
The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work because split uses a regex.

    <rule: body> ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
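The join-plus-tr munging step can be tried on its own in core Perl. The sequence fragments below are shortened stand-ins for real "seq" captures.

```perl
use strict;
use warnings;

# "seq" arrives as one array element per matched line, newlines intact.
my @seq = ( "MASTQNIVEE\n", "VQKMLDTYDT\n", "NKDGEITKAE\n" );

# join '' glues the pieces; tr///d strips the embedded newlines.
# join + split would not do here: split's separator argument is a regex.
my $body = join '' => @seq;
$body =~ tr/\n//d;

print "$body\n";   # MASTQNIVEEVQKMLDTYDTNKDGEITKAE
```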
One more step: Remove the arrayref
Now the body is a single string.
- No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

    body => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ]

    <rule: fasta> <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })
Result: a generic FASTA parser

    fasta => [
      {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
      },
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar
Existing grammars are "extended". The derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers
Overloading fasta's "head" rule allows splitting out the identifiers for individual species.
Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using ".+ [\cA\n]" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
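Core Perl's split shows the separator semantics the grammar wants: \cA between items, nothing after the tail. A sketch with a made-up two-species header.

```perl
use strict;
use warnings;

# Two ^A-separated idents; note there is no trailing \cA to anchor on.
my $head = "gi|123|ref|XP_1| protein A\cAgi|456|ref|XP_2| protein B";

my @ident = split /\cA/, $head;

print scalar @ident, "\n";   # 2 -- the tail item needs no terminator
```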
First pass: Literal "tail" item
This works but is ugly:
- Have two rules, for the main list and the tail.
- Alias the tail to get them all in one place.

    <rule: head> <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident> .+? \cA
    <token: final> .+ \n
Breaking up the header
The last header item is aliased to "ident". Breaks up all of the entries:

    head => {
      ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
      ]
    }
Dealing with separators: % <sep>
Separators happen often enough:
- "1, 2, 3, 4, 13, 91"      numbers by commas, spaces
- "g-c-a-g-t-t-a-c-a"       characters by dashes
- "/usr/local/bin"          basenames by dir markers
- "/usr:/usr/local/bin"     dirs separated by colons
...that R::G has special syntax for dealing with them: combining the item with % and a separator.

    <rule: list>     <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
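What the "%" separator syntax buys can be mimicked by hand with split, using two of the example lists above.

```perl
use strict;
use warnings;

# Items come back clean when you split on the separator pattern.
my @numz = split m{ \s* , \s* }x, '1, 2, 3, 4, 13, 91';
my @dirz = split m{ : }x,         '/usr:/usr/local/bin';

print "@numz\n";   # 1 2 3 4 13 91
print "@dirz\n";   # /usr /usr/local/bin
```

The grammar form does the same separation while still returning each item as a structured (possibly nested) match.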
Cleaner nr.gz header rule
Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- Non-greedy match ".+?" avoids capturing separators.

    qr{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>   <[ident]>+ % [\cA]
    <token: ident> .+?
    }xm;
Nested "ident" tag is extraneous
Simpler to replace the "head" with a list of identifiers. Replace $MATCH from the "head" rule with the nested identifier contents:

    qr{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })
    <token: ident> .+?
    }xm;
Result

    fasta => [
      {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
          'gi|66816243|ref|XP_6...rt=CAF-1',
          'gi|793761|dbj|BAA0626...oideum]',
          'gi|60470106|gb|EAL68086...m discoideum AX4]',
        ],
      },
    ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents
Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical...

becomes:

    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical...'
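That pairing can be sketched in core Perl for a single ident string: split on the pipes, pop the trailing description, and the remaining pairs flatten straight into a hash. The accession punctuation in the sample is an assumption.

```perl
use strict;
use warnings;

# source|id pairs with a trailing description, as in one nr.gz ident.
my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein';

my @taxa  = split m{ \s* [|] \s* }x, $ident;
my $desc  = pop @taxa;
my %entry = ( @taxa, desc => $desc );

print "$entry{ gi } $entry{ ref }\n$entry{ desc }\n";
```

The grammar's code block on the next slide does exactly this, once per ident, inside the match.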
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>   <[ident]>+ % [\cA]
    <token: ident> <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>  .+?
Result: head with sources plus "desc"

    fasta => {
      body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
      head => [
        { desc => '30S ribosomal protein S18 [Lactococ...', gi => '15674171',  ref => 'NP_268346.1' },
        { desc => '30S ribosomal protein S18 [Lactoco...',  gi => '116513137', ref => 'YP_812044.1' },
        ...
Balancing R::G with calling code
The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries. But chunking input on ">" removes it as a leading character.
- Making it optional with <start> fixes the problem.

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
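The chunking loop can be exercised against an in-memory filehandle; here the grammar match is replaced by a simple record count, and the two records are made up.

```perl
use strict;
use warnings;

# Read '>'-delimited chunks, as the nr.gz driver loop does.
my $fasta = ">seq one\nMAST\nQNIV\n>seq two\nVQKM\n";

open my $fh, '<', \$fasta or die "open: $!";
local $/ = '>';

my @recordz;

while( my $chunk = readline $fh )
{
    chomp $chunk;               # chomp strips the trailing '>', not "\n"
    length $chunk or next;      # the first read is just the leading '>'

    push @recordz, $chunk;
}

print scalar @recordz, "\n";   # 2
```

Each $chunk comes back without its leading ">", which is why the grammar's <start> token has to be optional.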
Fasta base grammar: 3 lines of code

    qr{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta> <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head> .+ <ws>
    <rule: body> ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>   ^ [>]
    <token: comment> ^ ; .+
    <token: seq>     ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta: 6 lines of code

    qr{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>  <[ident]>+ % [\cA]
    <rule: ident> <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa> .+?
    }xm;
Result: Use grammars!
Most of the "real" work is done under the hood.
- Regexp::Grammars does the lexing, basic compilation.
- Code is only needed for cleanups or re-arranging structs.
Code can simplify your grammar.
- Too much code makes them hard to maintain.
- The trick is keeping the balance between simplicity in the grammar and cleanup in the code.
Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Removing cruft ldquowsrdquo
Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the
spaces Whitespace is defined by ltws hellip gt
ltrule linegt ltws[s]+gt ltref_idgt lttextgt
ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
The prefix means something
Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag
ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt
entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256
ldquoentryrdquo now contains optional prefix
Aliases can also assign tag results
Aliases assign a key to rule results
The match from ldquotextrdquo is aliased to a named type of log entry
ltrule entrygt
ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133
Generic ldquotextrdquo replaced with a type
Parsing without capturing
At this point we dont really need the prefix strings since the entries are labeled
A leading tells RG to parse but not store the results in
ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt
entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133
ldquoentryrdquo now has typed keys
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =>
[
  {
    ref_id => '1367874132',
    text   => 'Started emerge on: May 06, 2013 21:02:12',
  },
  {
    ref_id => '1367874133',
    text   => 'terminating.',
    type   => '***',
  },
  {
    ref_id => '1367874137',
    text   => 'Started emerge on: May 06, 2013 21:02:17',
  },
  {
    ref_id => '1367874137',
    text   => 'emerge --jobs --autounmask-write –',
    type   => '***',
  },
]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

...but isn't all that much more readable.

Given the way these are used, a character class with a count is simpler:

    [*>:=]{3}
qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws: [\s]* > <ref_id> <prefix> <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;
This is the skeleton parser

Doesn't take much:
- Declarative syntax.
- No Perl code at all!

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines.
- Sub-rules can be defined for the different line types.

<rule: command>  "emerge" <ws> <[switch]>+

<token: switch>  ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
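As a quick illustration of such a sub-rule in action, here is a sketch (not from the slides: the sample command line and scaffolding are invented, and it assumes Regexp::Grammars is installed):

```perl
use strict;
use warnings;
use Regexp::Grammars;   # overloads qr{} in this lexical scope

my $command_re = qr{
    <nocontext:>
    <command>

    <rule: command>   emerge <[switch]>+
    <token: switch>   ([-][-]\S+)
}xm;

if( 'emerge --jobs --deep --update world' =~ $command_re )
{
    # matched results land in the %/ hash
    print join ' ', @{ $/{ command }{ switch } };
}
```

Each "--switch" lands in its own array slot under the "command" key, ready for further processing.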
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents for a qr{} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
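A toy sketch of that building-block approach (the grammar name "Toy::Num" and the sum rule are invented for illustration; assumes Regexp::Grammars is installed):

```perl
use strict;
use warnings;
use Regexp::Grammars;

# define-only grammar: produces no results by itself
my $base = qr{
    <grammar: Toy::Num>
    <token: num> \d+
}xm;

# derived grammar: references the base and produces results
my $sum_re = qr{
    <nocontext:>
    <extends: Toy::Num>
    <sum>

    <rule: sum> <x=num> [+] <y=num>
}xm;

'3 + 4' =~ $sum_re
    and print $/{ sum }{ x } + $/{ sum }{ y };
```

The base grammar can live in one module; any number of parsers can <extends:> it.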
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters...]
>Heading 2
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- Within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

(here "^A" stands in for the invisible ctrl-A separator)
First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <.start> <head> <.ws> <[body]>+

    <rule: head>    .+ <.ws>
    <rule: body>    ( <[seq]> | <.comment> ) <.ws>

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ [\n\w\-]+
}xm;
Instead of defining an entry rule, this just defines a name, "ParseFasta".
- This cannot be used to generate results by itself.
- Accessible anywhere via Regexp::Grammars.

The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{ ... }).
- Allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.
seq =>
[
  'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ'
]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work because split uses a regex.

<rule: body> ( <[seq]> | <.comment> ) <.ws>
  (?{
      $MATCH = join '' => @{ delete $MATCH{ seq } };
      $MATCH =~ tr/\n//d;
  })
One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

body =>
[
  'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

<rule: fasta> <.start> <head> <.ws> <[body]>+
  (?{
      $MATCH{ body } = $MATCH{ body }[0];
  })
Result: a generic FASTA parser

fasta =>
[
  {
    body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
    head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
  }
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended": the derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

<extends: ParseFasta>

<[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule handles splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using ".+? [\cA\n]" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works but is ugly:
- Have two rules, for the main list and the tail.
- Alias the tail to get them all in one place.

<rule: head> <[ident]>+ <[ident=final]>
  (?{
      # remove the matched anchors

      tr/\cA\n//d for @{ $MATCH{ ident } };
  })

<token: ident>  .+? \cA
<token: final>  .+  \n
Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

head =>
{
  ident =>
  [
    'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
    'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
    'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
  ]
}
Dealing with separators: <sep>

Separators happen often enough:
- 1, 2, 3, 4, 13, 91          (numbers by commas, spaces)
- g-c-a-g-t-t-a-c-a           (characters by dashes)
- /usr/local/bin              (basenames by dir markers)
- /usr:/usr/local/bin         (dirs separated by colons)

...that R::G has special syntax for dealing with them: combining the item with "%" and a separator:

<rule: list>      <[item]>+ % <separator>   # one-or-more
<rule: list_zom>  <[item]>* % <separator>   # zero-or-more
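For example, a minimal comma-separated list parser using the "%" separator syntax (a sketch with invented rule names; assumes Regexp::Grammars is installed):

```perl
use strict;
use warnings;
use Regexp::Grammars;   # overloads qr{} in this lexical scope

my $list_re = qr{
    <nocontext:>
    <list>

    <rule: list>    <[item]>+ % ( , )
    <token: item>   \d+
}xm;

if( '1, 2, 3, 4, 13, 91' =~ $list_re )
{
    # the separators are not captured: only the items survive
    print join '|', @{ $/{ list }{ item } };
}
```

The commas and surrounding whitespace never show up in the results, which is exactly what the hand-rolled "tail" rule above had to do with a code block.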
Cleaner nr.gz header rule

Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
      (?{
          $MATCH = delete $MATCH{ ident };
      })

    <token: ident> .+?
}xm;
Result

fasta =>
[
  {
    body => 'MASTQNIVEEVQKMLDT...NPDQ',
    head =>
    [
      'gi|66816243|ref|XP_6...rt=CAF-1',
      'gi|793761|dbj|BAA0626...oideum]',
      'gi|60470106|gb|EAL68086...m discoideum AX4]',
    ]
  }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1|hypothetical

becomes:

{
  gi   => '66816243',
  ref  => 'XP_642131.1',
  desc => 'hypothetical',
}
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result: head with sources, "desc"

fasta =>
{
  body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
  head =>
  [
    {
      desc => '30S ribosomal protein S18 [Lactococ',
      gi   => '15674171',
      ref  => 'NP_268346.1',
    },
    {
      desc => '30S ribosomal protein S18 [Lactoco',
      gi   => '116513137',
      ref  => 'YP_812044.1',
    },
  ]
}
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
- Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
- Making it optional with "<start>?" fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta> <.start> <head> <.ws> <[body]>+
      (?{
          $MATCH{ body } = $MATCH{ body }[0];
      })

    <rule: head> .+ <.ws>
    <rule: body> ( <[seq]> | <.comment> ) <.ws>
      (?{
          $MATCH = join '' => @{ delete $MATCH{ seq } };
          $MATCH =~ tr/\n//d;
      })

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
      (?{
          my $identz = delete $MATCH{ fasta }{ head }{ ident };

          for( @$identz )
          {
              my $pairz = $_->{ taxa };
              my $desc  = pop @$pairz;

              $_ = { @$pairz, desc => $desc };
          }

          $MATCH{ fasta }{ head } = $identz;
      })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+  % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;
Result: Use grammars

Most of the "real" work is done under the hood:
- Regexp::Grammars does the lexing, basic compilation.
- Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar:
- Too much code makes them hard to maintain.
- Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
- This requires "use re 'eval'" and "no strict 'vars'".
- One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller. Look up $^H in perlvar to see how it works.

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
- Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD.
- It is worth taking time to learn how to optimize NDFA regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working (if un-annotated) examples.

"perldoc perlre" shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.
entry =>
{
  text   => 'Started emerge on: May 06, 2013 21:02:12',
  ref_id => '1367874132',
},
entry =>
{
  prefix => '***',
  text   => 'emerge --jobs --autounmask-write',
  ref_id => '1367874132',
},
entry =>
{
  prefix => '>>>',
  text   => 'emerge (1 of 2) sys-apps',
  ref_id => '1367874256',
},

"entry" now contains the optional prefix.
Aliases can also assign tag results

Aliases assign a key to rule results.

The match from "text" is aliased to a named type of log entry:

<rule: entry>
    <prefix=([*][*][*])> <command=text>
  | <prefix=([>][>][>])> <stage=text>
  | <prefix=([=][=][=])> <status=text>
  | <prefix=([:][:][:])> <final=text>
  | <message=text>
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
entry => { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => 1367874132 }
entry => { command => 'emerge --jobs --autounmask-write ...', prefix => '***', ref_id => 1367874132 }
entry => { command => 'terminating.', prefix => '***', ref_id => 1367874133 }

Generic "text" replaced with a type.
Parsing without capturing
At this point we don't really need the prefix strings, since the entries are labeled.

A leading "." tells R::G to parse but not store the results in %/:

    <rule: entry>
        <.prefix=([*][*][*])> <command=text>
    |   <.prefix=([>][>][>])> <stage=text>
    |   <.prefix=([=][=][=])> <status=text>
    |   <.prefix=([!][!][!])> <final=text>
    |   <message=text>
entry => { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => 1367874132 }
entry => { command => 'emerge --jobs --autounmask-write ...', ref_id => 1367874132 }
entry => { command => 'terminating.', ref_id => 1367874133 }

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

    <ws: [\s]*> <ref_id>
    (
        <.prefix=([*][*][*])> <command=text>
    |   <.prefix=([>][>][>])> <stage=text>
    |   <.prefix=([=][=][=])> <status=text>
    |   <.prefix=([!][!][!])> <final=text>
    |   <message=text>
    )
data => {
    line => [
        { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => 1367874132 },
        { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => 1367874132 },
        { command => 'terminating.', ref_id => 1367874133 },
        { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => 1367874137 },
    ],
}

Result: array of "line" with ref_id & type.
Funny names for things
Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

    <rule: entry>   <ws: [\s]*> <ref_id> <type>? <text>

    <token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [!][!][!] )

Entries now have "text" and "type":

entry => [
    { ref_id => 1367874132, text => 'Started emerge on: May 06, 2013 21:02:12' },
    { ref_id => 1367874133, text => 'terminating.', type => '***' },
    { ref_id => 1367874137, text => 'Started emerge on: May 06, 2013 21:02:17' },
    { ref_id => 1367874137, text => 'emerge --jobs --autounmask-write ...', type => '***' },
]
Prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [!]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, a character class with a count is simpler:

    [*>!=]{3}
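Sketched in core Perl (with the four prefixes assumed from the log format), both forms match the same homogeneous prefixes; the class form is looser, and that looseness is the trade-off being accepted:

```perl
use strict;
use warnings;

# Alternation form: exactly one of the four homogeneous prefixes.
my $alt = qr/ ^ (?: [*]{3} | [>]{3} | [!]{3} | [=]{3} ) $ /x;

# Class-with-count form: any three chars drawn from the set.
my $class = qr/ ^ [*>!=]{3} $ /x;

for my $prefix ( '***', '>>>', '!!!', '===' )
{
    die "alt misses $prefix"   unless $prefix =~ $alt;
    die "class misses $prefix" unless $prefix =~ $class;
}

# Trade-off: the class form also accepts mixed strings like '*=>',
# which never occur in the real log lines.
die 'alt too loose'   if     '*=>' =~ $alt;
die 'class is looser' unless '*=>' =~ $class;
```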
qr
{
    <nocontext: >

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws: [\s]*> <ref_id> <prefix> <text>

    <token: ref_id> ^ ( \d+ )
    <token: prefix> [*>!=]{3}
    <token: text>   .+
}xm
This is the skeleton parser. It doesn't take much:
– Declarative syntax
– No Perl code at all

Easy to modify by extending the definition of "text" for specific types of messages.
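For comparison, the skeleton can be approximated in core Perl with named captures. The line shape here is an assumption based on the sample output (epoch id, colon, optional prefix, text):

```perl
use strict;
use warnings;

# Approximate the skeleton parser with core named captures.
# Assumed line shape: "<epoch>: <prefix> <text>", prefix optional.
my $entry_re = qr{
    ^ (?<ref_id> \d+ ) [:]
    \s* (?: (?<prefix> [*>!=]{3} ) \s* )?
    (?<text> .+ ) $
}x;

my %entry;

if ( '1367874133: *** terminating.' =~ $entry_re )
{
    %entry = %+;    # copy the named captures out
}

# %entry = ( ref_id => 1367874133, prefix => '***', text => 'terminating.' )
```

The grammar version wins as soon as the structure nests; named captures stay flat.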
Finishing the parser
Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines:
– Sub-rules can be defined for the different line types

    <rule: command>  emerge <ws> <[switch]>+

    <token: switch>  ( [-][-] \S+ )

This is what makes the grammars useful: nested, context-sensitive content.
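The switch token works the same way as a plain global match; the command string below is an assumed sample:

```perl
use strict;
use warnings;

# An assumed sample command line from the log.
my $command = 'emerge --jobs --autounmask-write --deep talk';

# Same pattern as the switch token: two dashes, then non-space.
my @switchz = $command =~ m/ ( [-][-] \S+ ) /gx;

# @switchz = ( '--jobs', '--autounmask-write', '--deep' )
```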
Inheriting & Extending Grammars
<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose:
– References can cross package or module boundaries
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
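The same building-block idea is visible in plain Perl: compiled qr// objects interpolate directly into larger patterns, with no eval or string-quoting games. A minimal sketch (names are illustrative):

```perl
use strict;
use warnings;

# Base patterns: compiled once, reusable anywhere.
my $ident  = qr/ \w+ /x;
my $number = qr/ \d+ /x;

# Derived pattern: assembled by interpolation, not string eval.
my $assign = qr/ ^ ( $ident ) \s* = \s* ( $number ) $ /x;

my ( $name, $value ) = 'answer = 42' =~ $assign;

# $name = 'answer', $value = '42'
```

R::G extends this: grammar references cross package boundaries the same way qr// objects cross lexical scopes.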
The Non-Redundant File
NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+ GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters]
    >Heading 2

Example: A short nr.gz FASTA entry
Headings are grouped by species, separated by ctrl-A ("\cA") characters:
– Each species has a set of source & identifier pairs followed by a single description
– Within-species separator is a pipe ("|") with optional whitespace
– Species counts in some headers run into the thousands

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...VQKLLNPDQ
First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext: >

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ [\n\w\-]+
}xm
Instead of defining an entry rule, this just defines a name, "ParseFasta":
– This cannot be used to generate results by itself
– Accessible anywhere via Regexp::Grammars
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }):
– Allows inserting almost-arbitrary code into the regex
– "almost" because the code cannot include regexen
seq => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    'VQKLLNPDQ'
]
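A tiny core-Perl illustration of the code-block syntax itself, independent of R::G:

```perl
use strict;
use warnings;

my $seen;

# The (?{ ... }) block runs when the match engine reaches it.
'abc' =~ m/ a (?{ $seen = 'saw a' }) bc /x;

# $seen = 'saw a'
```

Literal code blocks like this need no pragma; only patterns that interpolate code at run time require "use re 'eval'".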
Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %/, join them with nothing, and use "tr" to strip the newlines:
– join + split won't work because split uses a regex

    <rule: body> ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })
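The same munging, runnable standalone on a hypothetical captured list:

```perl
use strict;
use warnings;

# Stand-in for the captured "seq" arrayref: one string per matched line.
my $seq = [ "MASTQNIVEE\n", "VQKMLDTYDT\n", "NKDGEITKAE\n" ];

# Join with nothing, then strip the newlines in place.
my $body = join '' => @$seq;
$body =~ tr/\n//d;

# $body = 'MASTQNIVEEVQKMLDTYDTNKDGEITKAE'
```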
One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

body => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ]

    <rule: fasta> <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })
Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    },
]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar
Existing grammars are "extended". The derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers
Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator:
– The tail item on the list doesn't have a \cA to anchor on
– Using ".+ [\cA\n]" walks off the header onto the sequence
– This is a common problem with separators & tokenizers
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way
First pass: Literal "tail" item

This works but is ugly:
– Have two rules, for the main list and the tail
– Alias the tail to get them all in one place

    <rule: head> <[ident]>+ <[ident=final]>
        (?{
            # remove the matched anchors
            tr/\cA\n//d for @{ $MATCH{ ident } };
        })

    <token: ident>  .+? \cA
    <token: final>  .+ \n
Breaking up the header
The last header item is aliased to "ident". Breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    ],
}
Dealing with separators: "%"

Separators happen often enough:
– 1, 2, 3, 4, 13, 91        numbers by commas, spaces
– g-c-a-g-t-t-a-c-a         characters by dashes
– /usr/local/bin            basenames by dir markers
– /usr:/usr/local/bin       dirs separated by colons

that R::G has special syntax for dealing with them, combining the item with "%" and a separator:

    <rule: list>     <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
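In core Perl the closest analogue is split, which also treats its pattern as a separator rather than a terminator:

```perl
use strict;
use warnings;

# Separator, not terminator: no dangling empty field at the end.
my @numberz = split /\s*,\s*/, '1, 2, 3, 4';
# @numberz = ( 1, 2, 3, 4 )

my @basez = split m{/}, 'usr/local/bin';
# @basez = ( 'usr', 'local', 'bin' )
```

The R::G form keeps the list inside the grammar, where split would force a second pass over the captured text.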
Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias
– No code block required to strip the separators and trailing newline
– Non-greedy match ".+?" avoids capturing separators

qr
{
    <nocontext: >

    <extends: ParseFasta>
    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext: >
    <extends: ParseFasta>
    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

    <token: ident> .+?
}xm
Result:

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]',
        ],
    },
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )"
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1|hypothetical ...

becomes:

    {
        gi   => '66816243',
        ref  => 'XP_642131.1',
        desc => 'hypothetical ...',
    }
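The intended transform, sketched with core split on a sample identifier:

```perl
use strict;
use warnings;

# One identifier from the header.
my $ident = 'gi|66816243|ref|XP_642131.1|hypothetical protein';

# Split on the pipe separator, with optional whitespace.
my @pairz = split /\s* [|] \s*/x, $ident;

# The trailing description is the odd element out.
my $desc = pop @pairz;

my %taxa = ( @pairz, desc => $desc );

# %taxa = ( gi => 66816243, ref => 'XP_642131.1',
#           desc => 'hypothetical protein' )
```

Popping the description first leaves an even-sized list, so it flattens straight into a hash of source => identifier pairs.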
Munging the separated input

    <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources, "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => '15674171',
            ref  => 'NP_268346.1',
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => '116513137',
            ref  => 'YP_812044.1',
        },
        ...
Balancing R::G with calling code

The regex engine could process all of nr.gz:
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character
– Making it optional with "<start>?" fixes the problem

    local $/ = ">";

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
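The chunking loop is runnable against an in-memory filehandle; the two-record sample is made up. Note the empty first chunk, produced by the leading ">", that the length check discards:

```perl
use strict;
use warnings;

my $fasta = ">head1\nAAAA\n>head2\nCCCC\n";

open my $fh, '<', \$fasta or die "open: $!";

my @chunkz;
{
    # Chunk records on the FASTA heading marker.
    local $/ = '>';

    while ( my $chunk = readline $fh )
    {
        chomp $chunk;               # strip the trailing '>' separator

        # The text before the very first '>' is empty: skip it.
        length $chunk or next;

        push @chunkz, $chunk;
    }
}

# @chunkz = ( "head1\nAAAA\n", "head2\nCCCC\n" )
```

Each chunk is a complete heading-plus-sequence record with the leading ">" already consumed, which is why the grammar's start token has to become optional.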
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext: >

    <rule: fasta> <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

    <rule: head> .+ <ws>
    <rule: body> ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ ( [\n\w\-]+ )
}xm
Extension to Fasta: 6 lines of code

qr
{
    <nocontext: >
    <extends: ParseFasta>

    <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

    <rule: head>   <[ident]>+ % [\cA]
    <rule: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>  .+?
}xm
Result: Use grammars

Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing, basic compilation
– Code only needed for cleanups or re-arranging structs

Code can simplify your grammar:
– Too much code makes them hard to maintain
– Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'"
– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner:
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD:
– It is worth taking time to learn how to optimize NDF regexen, however

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
entry =>
[
    { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => 1367874132 },
    { command => 'emerge --jobs --autounmask-write -', ref_id => 1367874132 },
    { command => 'terminating.', ref_id => 1367874133 },
]

"entry" now has typed keys.
The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

    <ws: \s* > <ref_id>
    (
        <prefix=([*][*][*])> <command=text>
      | <prefix=([>][>][>])> <stage=text>
      | <prefix=([=][=][=])> <status=text>
      | <prefix=([:][:][:])> <final=text>
      | <message=text>
    )
data =>
{
    line =>
    [
        { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => 1367874132 },
        { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => 1367874132 },
        { command => 'terminating.', ref_id => 1367874133 },
        { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => 1367874137 },
    ]
}

Result: an array of "line" entries with ref_id & type.
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by the text:

    <rule: entry>  <ws: \s* > <ref_id> <type>? <text>

    <token: type>  ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )
Entries now have "text" and "type":

    entry =>
    [
        { ref_id => 1367874132, text => 'Started emerge on: May 06, 2013 21:02:12' },
        { ref_id => 1367874133, text => 'terminating.', type => '***' },
        { ref_id => 1367874137, text => 'Started emerge on: May 06, 2013 21:02:17' },
        { ref_id => 1367874137, text => 'emerge --jobs --autounmask-write -', type => '***' },
    ]
prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a single character class with a count:

    [*>:=]{3}
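A quick core-Perl sanity check that the collapsed class accepts the same three-character prefixes as the long alternation (the four sample prefixes are the ones from the slides):

```perl
#!/usr/bin/env perl
# Verify that the single class-with-count accepts the same three-char
# prefixes as the four-branch alternation.
use strict;
use warnings;

for my $prefix ( '***', '>>>', ':::', '===' )
{
    my $alt   = $prefix =~ m{ \A (?: [*]{3} | [>]{3} | [:]{3} | [=]{3} ) \z }x;
    my $class = $prefix =~ m{ \A [*>:=]{3} \z }x;

    print "$prefix: ", ( $alt && $class ? 'both match' : 'mismatch' ), "\n";
}
```

Note that the class form is looser: it also accepts mixed runs like "*>=", which never occur in this log format, so the simplification is safe here.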
qr
{
    <nocontext:>
    <data>

    <rule: data>    <[entry]>+
    <rule: entry>   <ws: \s* > <ref_id> <prefix>? <text>

    <token: ref_id> ^ ( \d+ )
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;
This is the skeleton parser

Doesn't take much:
- Declarative syntax.
- No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
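A minimal end-to-end sketch of the skeleton in use, assuming the Regexp::Grammars CPAN module is installed; the two log lines are invented for the example:

```perl
#!/usr/bin/env perl
# Sketch: apply the skeleton grammar to a couple of fabricated
# emerge-style lines, then walk the result tree in %/.
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module, not core

my $parser = qr{
    <nocontext:>
    <data>

    <rule: data>    <[entry]>+
    <rule: entry>   <ws: \s* > <ref_id> <prefix>? <text>

    <token: ref_id> ^ ( \d+ )
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;

my $log = <<'LOG';
1367874132 >>> emerge --jobs --deep talk
1367874133 *** terminating.
LOG

if( $log =~ $parser )
{
    for my $entry ( @{ $/{ data }{ entry } } )
    {
        print join( "\t", @{ $entry }{ qw( ref_id text ) } ), "\n";
    }
}
```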
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
- Sub-rules can be defined for the different line types:

    <rule: command>  emerge <.ws> <[switch]>+

    <token: switch>  ( [-][-] \S+ )

This is what makes the grammars useful: nested, context-sensitive content.
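A hedged sketch of exercising the command sub-rule on its own (again assuming the Regexp::Grammars CPAN module; the input string is made up):

```perl
#!/usr/bin/env perl
# Sketch: pull the switches out of an emerge command line with the
# nested command/switch rules.
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module

my $command = qr{
    <command>

    <rule: command>  emerge <[switch]>+
    <token: switch>  [-][-]\S+
}x;

if( 'emerge --jobs --autounmask-write --deep' =~ $command )
{
    print "$_\n" for @{ $/{ command }{ switch } };
}
```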
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents for a qr{} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose.
- References can cross package or module boundaries.
- Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed.

The file consists of a simple FASTA format with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters]
    >Heading 2
    ...
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
- Each species has a set of source & identifier pairs followed by a single description.
- The within-species separator is a pipe ("|") with optional whitespace.
- Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

(The \cA separators between the species entries are not visible above.)
First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>    <start> <head> <.ws> <[body]>+

    <rule: head>     .+ <.ws>
    <rule: body>     ( <[seq]> | <comment> ) <.ws>

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta".
- This cannot be used to generate results by itself.
- It is accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }).
- This allows inserting almost-arbitrary code into the regex.
- "almost" because the code cannot include regexen.

    seq =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY...',
        '...DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP...',
        '...VQKLLNPDQ'
    ]
Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
- join + split won't work because split uses a regex.

    <rule: body> ( <[seq]> | <comment> ) <.ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
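The cleanup in that (?{ ... }) block is plain Perl; a tiny core-only demo of the join-then-tr idiom (the sequence fragments are invented):

```perl
#!/usr/bin/env perl
# Demonstrate the join + tr/\n//d cleanup used in the body rule.
use strict;
use warnings;

my @seq = ( "MASTQNIVE\nEVQKML\n", "DTYDTNKDG\n" );

my $body = join '' => @seq;   # single string, newlines intact
$body =~ tr/\n//d;            # delete newlines without a regex

print $body, "\n";            # MASTQNIVEEVQKMLDTYDTNKDG
```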
One more step: Remove the arrayref

Now the body is a single string:

    body =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
    ]

No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

    <rule: fasta> <start> <head> <.ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })
Result: a generic FASTA parser

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
            head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        },
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived parser references the grammar and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting out the identifiers for the individual species.

Catch: \cA is a separator, not a terminator.
- The tail item on the list doesn't have a \cA to anchor on.
- Using ".+ [\cA\n]" walks off the header onto the sequence.
- This is a common problem with separators & tokenizers.
- This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works but is ugly:
- Have two rules: one for the main list, one for the tail.
- Alias the tail to get them all in one place.

    <rule: head> <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident>  .+? \cA
    <token: final>  .+  \n
Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

    head =>
    {
        ident =>
        [
            'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
            'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
            'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
            'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        ]
    }
Dealing with separators: <sep>

Separators happen often enough:
- 1, 2, 3, 4, 13, 91        numbers separated by commas, spaces
- g-c-a-g-t-t-a-c-a         characters separated by dashes
- /usr/local/bin            basenames separated by dir markers
- /usr:/usr/local/bin       dirs separated by colons

that R::G has special syntax for dealing with them: combine the item with "%" and a separator:

    <rule: list>     <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
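A small sketch of the "%" separator syntax on the comma-separated example above (assumes the Regexp::Grammars CPAN module):

```perl
#!/usr/bin/env perl
# Sketch: parse "1, 2, 3, 4, 13, 91" with the one-or-more-with-
# separator form; the separators themselves are not captured.
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module

my $numbers = qr{
    <list>

    <rule: list>   <[item]>+ % [,]
    <token: item>  \d+
}x;

if( '1, 2, 3, 4, 13, 91' =~ $numbers )
{
    print join ' ', @{ $/{ list }{ item } };
}
```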
Cleaner nr.gz header rule

Separator syntax cleans things up:
- No more tail rule with an alias.
- No code block required to strip the separators and trailing newline.
- Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>
    <extends: ParseFasta>
    <[fasta]>+

    <rule: head>   <[ident]>+ % [\cA]
    <token: ident> .+?
}xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents:

qr
{
    <nocontext:>
    <extends: ParseFasta>
    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident> .+?
}xm;
Result

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDT...NPDQ',
            head =>
            [
                'gi|66816243|ref|XP_6...rt=CAF-1',
                'gi|793761|dbj|BAA0626...oideum]',
                'gi|60470106|gb|EAL68086...m discoideum AX4]',
            ],
        },
    ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
- This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical

becomes:

    { gi => 66816243, ref => 'XP_642131.1', desc => 'hypothetical' }
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources + "desc"

    fasta =>
    {
        body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
        head =>
        [
            { desc => '30S ribosomal protein S18 [Lactococ', gi => 15674171,  ref => 'NP_268346.1' },
            { desc => '30S ribosomal protein S18 [Lactoco',  gi => 116513137, ref => 'YP_812044.1' },
        ]
    }
Balancing R::G with calling code

The regex engine could process all of nr.gz.
- Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
- Better approach: match <fasta> on single entries, but chunking the input on ">" removes it as a leading character.
- Making the start token optional ("<start>?") fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;

        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
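Fleshed out, the chunked driver might look like the sketch below; the filehandle setup and the stand-in pattern are assumptions, with the real grammar from these slides plugged in for $nr_gz:

```perl
#!/usr/bin/env perl
# Sketch: read one ">"-delimited record at a time and hand each
# chunk to the single-entry parser.  The pattern below is a
# stand-in; real code compiles the R::G parser from the slides.
use strict;
use warnings;

my $nr_gz = qr{ \A . }xs;    # placeholder for the compiled grammar

# decompress on the fly rather than expanding 140+GB to disk
open my $nr_fh, '-|', 'gzip -dc nr.gz'
    or die "gzip: $!";

local $/ = '>';              # chunk input on the heading marker

while( my $chunk = readline $nr_fh )
{
    chomp $chunk;                         # strip the trailing ">"

    length $chunk or do { --$.; next };   # skip the empty leading chunk

    $chunk =~ $nr_gz
        or next;

    # a real parser leaves the record's structure in %/
}
```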
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <.ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head>    .+ <.ws>

    <rule: body>    ( <[seq]> | <comment> ) <.ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>   <[ident]>+ % [\cA]
    <rule: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>  .+?
}xm;
Result: Use grammars

Most of the "real" work is done under the hood.
- Regexp::Grammars does the lexing and basic compilation.
- Code is only needed for cleanups or re-arranging structs.

Code can simplify your grammar.
- Too much code makes them hard to maintain.
- The trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
- This requires "use re 'eval'" and "no strict 'vars'".
- One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
- Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD.
- It is worth taking time to learn how to optimize NDF regexen, however.

Or, better yet, use Perl6 grammars, available today in your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working - if un-annotated - examples.

"perldoc perlre" shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.
The ldquoentryrdquo nesting gets in the way
The named subrule is not hard to get rid of just move its syntax up one level
ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137
Result array of ldquolinerdquo with ref_id amp type
Funny names for things
Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text
You can store an optional token followed by text
ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
– join + split won't work, because split uses a regex.

    <rule: body> ( <[seq]> | <.comment> ) <.ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
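The join-then-tr cleanup is easy to try in isolation. A small sketch with made-up sequence chunks, each still carrying the newline the <seq> token would leave on it:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up captured chunks, one per matched line.
my @seq = ( "MASTQNIVEE\n", "VQKMLDTYDT\n", "NKDGEITKAE\n" );

# Join with nothing, then delete embedded newlines with tr.
my $body = join '' => @seq;
$body =~ tr/\n//d;

print $body, "\n";   # MASTQNIVEEVQKMLDTYDTNKDGEITKAE
```

tr/// is a transliteration, not a regex, which is exactly why it is usable inside the (?{ ... }) blocks where split is not.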
One more step: Remove the arrayref

Now the body is a single string.
No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

    body => [ "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ" ]

    <rule: fasta> <start> <head> <.ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[ 0 ];
    })
Result: a generic FASTA parser

    fasta => [
        {
            body => "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ",
            head => "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
        },
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived grammar references ParseFasta and extracts a list of fasta entries:

    <extends: ParseFasta>
    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.
Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item

This works, but is ugly:
– Have two rules, for the main list and the tail.
– Alias the tail to get them all in one place.

    <rule: head> <[ident]>+ <[ident= final]>
    (?{
        # remove the matched anchors

        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident>  .+? \cA
    <token: final>  .+? \n
Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

    head => {
        ident => [
            "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
            "gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1",
            "gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]",
            "gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
        ],
    }
Dealing with separators: %

Separators happen often enough:
– 1, 2, 3, 4, 13, 91: numbers by commas, spaces.
– g-c-a-g-t-t-a-c-a: characters by dashes.
– /usr/local/bin: basenames by dir markers.
– /usr:/usr/local/bin: dirs separated by colons.

...that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

    <rule: list>      <[item]>+ % <separator>    # one-or-more
    <rule: list_zom>  <[item]>* % <separator>    # zero-or-more
Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>
        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        <token: ident>  .+?
    }xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers. Replace $MATCH from the "head" rule with the nested identifier contents:

    qr
    {
        <nocontext:>
        <extends: ParseFasta>
        <[fasta]>+

        <rule: head> <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

        <token: ident>  .+?
    }xm;
Result

    fasta => [
        {
            body => "MASTQNIVEEVQKMLDT ... NPDQ",
            head => [
                "gi|66816243|ref|XP_6 ... rt=CAF-1",
                "gi|793761|dbj|BAA0626 ... oideum]",
                "gi|60470106|gb|EAL68086 ... m discoideum AX4]",
            ],
        },
    ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: ( \s* [|] \s* )
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical ...

becomes:

    {
        gi   => '66816243',
        ref  => 'XP_642131.1',
        desc => 'hypothetical ...',
    }
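The pairs-plus-trailing-description shape means the hash can be built with a split, a pop, and a list assignment. A core-Perl sketch of that transform, using a shortened, hypothetical ident string:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein';

# Split on pipe-with-optional-whitespace, pop the description,
# and treat what remains as key/value pairs.
my @taxa   = split /\s* [|] \s*/x, $ident;
my $desc   = pop @taxa;
my %fields = ( @taxa, desc => $desc );

print "$fields{ gi }\n";     # 66816243
print "$fields{ desc }\n";   # hypothetical protein
```

The grammar version on the next slide does the same thing, only inside the parse, where the pairs arrive pre-split as the "taxa" list.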
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing R::G with calling code

The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries. But chunking input on ">" removes it as a leading character.
– Making the ">" optional in <start> fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
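The chunking itself is easy to exercise with an in-memory filehandle. A sketch with two tiny, made-up records; the --$. bookkeeping from the loop above is omitted for brevity:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $fasta = ">head1\nMAST\nQNIV\n>head2\nKDGE\n";
open my $fh, '<', \$fasta or die "open: $!";

local $/ = '>';    # input records now end at the next '>'

my @records;
while( my $chunk = readline $fh )
{
    chomp $chunk;              # strip the trailing '>' separator
    length $chunk or next;     # first chunk is the bare leading '>'

    push @records, $chunk;     # header + sequence, minus the '>'
}

print scalar @records, "\n";   # 2
```

Each record arrives without its leading ">", which is exactly why the grammar's <start> token has to treat that character as optional.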
Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta> <start> <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[ 0 ];
        })

        <rule: head> .+ <.ws>
        <rule: body> ( <[seq]> | <.comment> ) <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

        <token: start>    ^ [>]
        <token: comment>  ^ [;] .+
        <token: seq>      ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

        <rule: head>   <[ident]>+ % [\cA]
        <rule: ident>  <[taxa]>+ % ( \s* [|] \s* )
        <token: taxa>  .+?
    }xm;
Result: Use grammars!

Most of the "real" work is done under the hood.
– Regexp::Grammars does the lexing and basic compilation.
– Code is only needed for cleanups or re-arranging structs.

Code can simplify your grammar.
– Too much code makes grammars hard to maintain.
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller.
Look up $^H in perlvar to see how it works.

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.
You can store an optional token followed by text:

    <rule: entry> <ws: [\s]* > <ref_id> <type>? <text>

    <token: type> ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )
Entries now have "text" and "type":

    entry => [
        { ref_id => 1367874132, text => "Started emerge on: May 06, 2013 21:02:12" },
        { ref_id => 1367874133, text => "terminating.", type => "***" },
        { ref_id => 1367874137, text => "Started emerge on: May 06, 2013 21:02:17" },
        { ref_id => 1367874137, text => "emerge --jobs --autounmask-write ...", type => "***" },
    ]
Prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

...but isn't all that much more readable.

Given the way these are used, use a character class with a count:

    [*>:=]{3}
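A quick check of the class-with-count form against sample prefixes. This sketch assumes the four prefix styles are "***", ">>>", ":::", and "===":

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $prefix = qr{ [*>:=]{3} }x;

# All four assumed prefix styles match the single class.
for my $try ( '***', '>>>', ':::', '===' )
{
    print "$try matches\n" if $try =~ $prefix;
}

# Ordinary log text without a prefix does not match.
print "no match\n" unless 'Started emerge' =~ $prefix;
```

Strictly, the class also admits mixed runs like "*>="; since the prefix characters never mix in this data, the looser match is an acceptable trade for readability.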
    qr
    {
        <nocontext:>
        <data>

        <rule: data>   <[entry]>+
        <rule: entry>  <ws: [\s]* > <ref_id> <prefix>? <text>

        <token: ref_id>  ^ ( \d+ )
        <token: prefix>  [*>:=]{3}
        <token: text>    .+
    }xm;
This is the skeleton parser

Doesn't take much:
– Declarative syntax.
– No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
– Sub-rules can be defined for the different line types:

    <rule: command>  "emerge" <.ws> <[switch]>+
    <token: switch>  ( [-][-] \S+ )

This is what makes the grammars useful: nested, context-sensitive content.
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Entrys now have ldquotextrdquo and ldquotyperdquo
entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt
prefix alternations look ugly
Using a count works
[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable
Given the way these are used use a block
[gt=] 3
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. The Perl Review has an article by brian d foy on recursive matching in Perl 5.10.
Prefix alternations look ugly

Using a count works:

    [<]{3} | [>]{3} | [*]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a character class:

    [<>*=]{3}
qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws: [\s]*> <ref_id> <prefix> <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [<>*=]{3}
    <token: text>   .+
}xm;
This is the skeleton parser.

Doesn't take much:
– Declarative syntax.
– No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
– Sub-rules can be defined for the different line types:

    <rule: command> "emerge" <.ws> <[switch]>+
    <token: switch> ( [-] [-]? \S+ )

This is what makes the grammars useful: nested, context-sensitive content.
Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.

This makes modular or context-sensitive grammars relatively simple to compose.
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+ GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters]
    >Heading 2
    ...
Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
– Each species has a set of source & identifier pairs followed by a single description.
– The within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <.ws> <[body]>+

    <rule: head>    .+ <.ws>
    <rule: body>    ( <[seq]> | <.comment> )+ <.ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta".
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string:

    seq => [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
        'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
        'VQKLLNPDQ',
    ]

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }).
– Allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.
Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out, join them with nothing, and use "tr" to strip the newlines.
– join + split won't work because split uses a regex.

    <rule: body> ( <[seq]> | <.comment> )+ <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })
One more step: remove the arrayref

Now the body is a single string:

    body => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ]

No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

    <rule: fasta> <start> <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })
Result: a generic FASTA parser

    fasta => [
        {
            body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
            head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        },
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule handles splitting the identifiers for individual species.

Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– It can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: literal "tail" item

This works, but is ugly:
– Two rules, one each for the main list and the tail.
– Alias the tail to get them all in one place.

    <rule: head> <[ident]>+ <[ident=final]>
        (?{
            # remove the matched anchors
            tr/\cA\n//d for @{ $MATCH{ ident } };
        })

    <token: ident> .+? \cA
    <token: final> .+? \n
Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

    head => {
        ident => [
            'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
            'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
            'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
            'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        ],
    }
Dealing with separators: <sep>

Separators happen often enough:
– "1, 2, 3, 4, 13, 91" – numbers by commas and spaces.
– "g-c-a-g-t-t-a-c-a" – characters by dashes.
– "/usr/local/bin" – basenames by dir markers.
– "/usr:/usr/local/bin" – dirs separated by colons.

that R::G has special syntax for dealing with them: combine the item with "%" and a separator.

    <rule: list>     <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

    <token: ident> .+?
}xm;
Result

    fasta => [
        {
            body => 'MASTQNIVEEVQKMLDT...NPDQ',
            head => [
                'gi|66816243|ref|XP_6...rt=CAF-1',
                'gi|793761|dbj|BAA0626...oideum]',
                'gi|60470106|gb|EAL68086...m discoideum AX4]',
            ],
        },
    ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1| hypothetical ...

becomes:

    { gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical ...' }
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+  % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources, "desc"

    fasta => {
        body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
        head => [
            {
                desc => '30S ribosomal protein S18 [Lactococ...',
                gi   => '15674171',
                ref  => 'NP_268346.1',
            },
            {
                desc => '30S ribosomal protein S18 [Lactoco...',
                gi   => '116513137',
                ref  => 'YP_812044.1',
            },
            ...
Balancing R::G with calling code

The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries. But chunking the input on ">" removes it as a leading character.
– Making it optional with <start>? fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

    <rule: head>    .+ <.ws>
    <rule: body>    ( <[seq]> | <.comment> )+ <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
qr ltnocontextgt
ltdatagt ltrule data gt lt[entry]gt+
ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt
lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm
This is the skeleton parser
Doesnt take muchndash Declarative syntax
ndash No Perl code at all
Easy to modify by extending the definition of ldquotextrdquo for specific types of messages
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Finishing the parser
Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types
ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+
lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive
content
Inheriting amp Extending Grammars
ltgrammar namegt and ltextends namegt allow a building-block approach
Code can assemble the contents of for a qr without having to eval or deal with messy quote strings
This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries
ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
Inheriting & Extending Grammars
<grammar: name> and <extends: name> allow a building-block approach.
Code can assemble the contents of a qr// without having to eval or deal with messy quoted strings.
This makes modular or context-sensitive grammars relatively simple to compose.
– References can cross package or module boundaries.
– It is easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
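The same building-block idea exists in core Perl, for comparison: a (?(DEFINE)) block names the pieces and (?&name) references them. This is a hypothetical core-Perl sketch, not Regexp::Grammars syntax; the "key"/"value" names are invented for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical core-Perl analogue of grammar composition:
# (?(DEFINE)) names the building blocks, (?&name) calls them.
my $pair = qr{
    (?(DEFINE)
        (?<key>   \w+   )
        (?<value> [^;]+ )
    )
    ^ (?&key) = (?&value) $
}x;

print "host=localhost" =~ $pair ? "match\n" : "no match\n";   # prints "match"
```

Regexp::Grammars wraps this mechanism in cleaner syntax and adds the result-structure handling.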
The Non-Redundant File
NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format, with headings separated
by ctrl-A chars:
>Heading 1
[amino-acid sequence characters]
>Heading 2
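For a format this flat, split alone can recover the structure — a minimal core-Perl sketch of the heading/sequence layout above (the sample data is invented):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Break the text on leading '>', then split each chunk into its
# heading line and sequence lines.
my $text = ">Heading 1\nMASTQN\nIVEEVQ\n>Heading 2\nKDTELL\n";

my @entries;
for my $chunk ( grep { length } split /^>/m, $text )
{
    my( $head, @seq ) = split /\n/, $chunk;
    push @entries, { head => $head, body => join '' => @seq };
}

print "$_->{ head }: $_->{ body }\n" for @entries;
```

The grammar approach pays off once the headers themselves have internal structure, as the nr.gz entries do.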
Example: A short nr.gz FASTA entry
Headings are grouped by species, separated by ctrl-A ("\cA") characters.
– Each species has a set of source & identifier pairs, followed by a single
description.
– The within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
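Before the grammar is developed, the header layout can be checked with plain split: \cA between species, pipe between fields. A hedged core-Perl sketch using an abbreviated, invented header:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Two abbreviated species entries joined by a literal ctrl-A.
my $head =
    "gi|66816243|ref|XP_642131.1| hypothetical protein"
  . "\cA"
  . "gi|1705556|sp|P54670.1|CAF1_DICDI RecName";

my @species = split /\cA/, $head;                       # one entry per species
my @fieldz  = map { [ split /\s*[|]\s*/ ] } @species;   # source/identifier pairs + description

printf "%d species, first desc: %s\n",
    scalar @species, $fieldz[0][-1];
```

This flat approach loses track of which fields pair up; the grammar developed below keeps the structure.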
First step: Parse FASTA
qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;
Instead of defining an entry rule, this just defines a name, "ParseFasta".
– This cannot be used to generate results by itself.
– It is accessible anywhere via Regexp::Grammars.
The output needs help, however
The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }).
– This allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.
seq => [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ ]
Munging results: $MATCH
$MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of @{ ... }, join them with '', and use "tr" to strip the newlines.
– join + split won't work, because split uses a regex.
<rule: body> ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})
One more step: Remove the arrayref
Now the body is a single string.
No need for an arrayref to contain one string: since the body has one entry, assign offset zero.
body => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ]
<rule: fasta> <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})
Result: a generic FASTA parser
fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    },
]
The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar
Existing grammars are "extended". The derived grammars are capable of producing results. In this case
the derived grammar references ParseFasta and extracts a list of fasta entries:
<extends: ParseFasta>
<[fasta]>+
Splitting the head into identifiers
Overloading fasta's "head" rule allows splitting out the identifiers for individual species.
Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– It can be handled with special tokens in the grammar, but R::G provides a cleaner way.
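The catch is easy to reproduce with plain regexen: anchoring each item on the separator silently drops the tail item, which has nothing to anchor on. A small core-Perl demonstration (data invented):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $head = "first\cAsecond\cAthird";

# Treating \cA as a terminator silently loses the tail item...
my @with_anchor = $head =~ /(.+?)\cA/g;

# ...while treating it as a separator keeps all three.
my @with_split  = split /\cA/, $head;

printf "anchored: %d items, split: %d items\n",
    scalar @with_anchor, scalar @with_split;    # anchored: 2 items, split: 3 items
```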
First pass: Literal "tail" item
This works but is ugly:
– It needs two rules, one for the main list and one for the tail.
– Alias the tail to get them all in one place.
<rule: head> <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors
    tr/\cA\n//d for @{ $MATCH{ ident } };
})
<token: ident>  .+? \cA
<token: final>  .+ \n
Breaking up the header
The last header item is aliased to "ident". This breaks up all of the entries:
head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    ],
}
Dealing with separators: <sep>
Separators happen often enough:
– 1, 2, 3, 4, 13, 91        numbers separated by commas and spaces
– g-c-a-g-t-t-a-c-a         characters separated by dashes
– /usr/local/bin            basenames separated by dir markers
– /usr:/usr/local/bin       dirs separated by colons
that R::G has special syntax for dealing with them: combine the item with "%" and a separator.
<rule: list>     <[item]>+ % <separator>    # one-or-more
<rule: list_zom> <[item]>* % <separator>    # zero-or-more
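For comparison, the same separated lists fall apart with core split, one line each (paths and data are the examples from above):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @numberz = split /\s*,\s*/, '1, 2, 3, 4, 13, 91';          # commas + spaces
my @basez   = grep { length } split m{/}, '/usr/local/bin';   # dir markers
my @dirz    = split /:/, '/usr:/usr/local/bin';               # colons

print "@numberz | @basez | @dirz\n";
```

The "%" syntax earns its keep when the items are structured rules rather than flat strings.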
Cleaner nr.gz header rule
Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– The non-greedy match ".+?" avoids capturing separators.
qr
{
    <nocontext:>
    <extends: ParseFasta>
    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous
Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier
contents.
qr
{
    <nocontext:>
    <extends: ParseFasta>
    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })
    <token: ident>  .+?
}xm;
Result
fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]',
        ],
    },
]
The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents
Species have <source> | <identifier> pairs followed by a description.
Add a separator clause: "% ( \s* [|] \s* )".
– This can be parsed into a hash, something like:
gi|66816243|ref|XP_642131.1| hypothetical...
becomes:
{ gi => 66816243, ref => 'XP_642131.1', desc => 'hypothetical...' }
Munging the separated input
<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };
    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;
        $_ = { @$pairz, desc => $desc };
    }
    $MATCH{ fasta }{ head } = $identz;
})
<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?
Result: head with sources, "desc"
fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head => [
        { desc => '30S ribosomal protein S18 [Lactococ...', gi => 15674171,  ref => 'NP_268346.1' },
        { desc => '30S ribosomal protein S18 [Lactoco...', gi => 116513137, ref => 'YP_812044.1' },
        ...
    ],
}
Balancing R::G with calling code
The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in
the heads.
– Better approach: <fasta> on single entries. But chunking input on ">" removes it as a leading character.
– Making the start token optional with "<start>?" fixes the problem:
local $/ = '>';
while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };
    $chunk =~ $nr_gz;
    # process a single fasta record in %/
}
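The chunking loop above can be exercised standalone with an in-memory filehandle; here a trivial head-extracting pattern stands in for the compiled $nr_gz grammar so the sketch is self-contained:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $data = ">one\nMASTQN\n>two\nKDTELL\n";
open my $fh, '<', \$data or die "open: $!";

my @heads;
local $/ = '>';                 # chunk input on the FASTA lead-in

while( my $chunk = readline $fh )
{
    chomp $chunk;               # strip the trailing '>' separator
    length $chunk or next;      # first read is the empty lead-in

    my( $head ) = $chunk =~ /^(.+)$/m;   # stand-in for $chunk =~ $nr_gz
    push @heads, $head;
}

print "@heads\n";               # prints "one two"
```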
Fasta base grammar: 3 lines of code
qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code
qr
{
    <nocontext:>
    <extends: ParseFasta>
    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };
        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;
            $_ = { @$pairz, desc => $desc };
        }
        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;
Result: Use grammars
Most of the "real" work is done under the hood.
– Regexp::Grammars does the lexing and basic compilation.
– Code is only needed for cleanups or re-arranging structs.
Code can simplify your grammar.
– Too much code makes them hard to maintain.
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.
Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18
v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.
The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller.
Look up $^H in perlvar to see how it works.
require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );
Use Regexp::Grammars
Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
– Frankly, even if you do have old grammars...
Regexp::Grammars avoids the performance pitfalls of P::RD (Parse::RecDescent).
– It is worth taking time to learn how to optimize NDF regexen, however.
Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars
The POD is thorough and quite descriptive [comfortable chair and enjoyable beverage suggested].
The demo directory has a number of working – if un-annotated – examples.
"perldoc perlre" shows how recursive matching works in v5.10+.
PerlMonks has plenty of good postings.
Perl Review article by brian d foy on recursive matching in Perl 5.10.
The Non-Redundant File
NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear
It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated
by ctrl-A chars
gtHeading 1
[amino-acid sequence characters]
gtHeading 2
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Example A short nrgz FASTA entry
Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single
description
ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace
ndash Species counts in some header run into the thousands
gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself
ndash Accessible anywhere via RexepGrammars
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
First step Parse FASTA
qr ltgrammar ParseFastagt ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+
ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt
lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm
Instead of defining an entry rule, this just defines a name, "ParseFasta".
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
The output needs help, however.
The "<seq>" token captures newlines that need to be stripped out to get a single string.
Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{ ... })
– Allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.

    seq => [ "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ" ]
Munging results: $MATCH
The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
– join + split won't work because split uses a regex.

    <rule: body> ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
One more step: Remove the arrayref
Now the body is a single string.
No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

    body => [ "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ" ]

    <rule: fasta> <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })
Result: a generic FASTA parser

    fasta => [
        {
            body => "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ",
            head => "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
        },
    ]

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar
Existing grammars are "extended". The derived grammars are capable of producing results. In this case it references the grammar and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+
Splitting the head into identifiers
Overloading fasta's "head" rule allows splitting out the identifiers for individual species.
Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
First pass: Literal "tail" item
This works but is ugly:
– Have two rules, one for the main list and one for the tail.
– Alias the tail to get them all in one place.

    <rule: head> <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident> .+? \cA
    <token: final> .+ \n
Breaking up the header
The last header item is aliased to "ident", which breaks up all of the entries:

    head => {
        ident => [
            "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
            "gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1",
            "gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]",
            "gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
        ],
    }
Dealing with separators: <sep>
Separators happen often enough:
– "1, 2, 3, 4, 13, 91" (numbers by commas, spaces)
– "g-c-a-g-t-t-a-c-a" (characters by dashes)
– "/usr/local/bin" (basenames by dir markers)
– "/usr:/usr/local/bin" (dirs separated by colons)
...that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

    <rule: list>     <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
Cleaner nr.gz header rule
Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>   <[ident]>+ % [\cA]
        <token: ident> .+?
    }xm;
Nested "ident" tag is extraneous
Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head> <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

        <token: ident> .+?
    }xm;
Result

    fasta => [
        {
            body => "MASTQNIVEEVQKMLDT...NPDQ",
            head => [
                "gi|66816243|ref|XP_6...",
                "...rt=CAF-1",
                "gi|793761|dbj|BAA0626...oideum]",
                "gi|60470106|gb|EAL68086...m discoideum AX4]",
            ],
        },
    ]

The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.
Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents
Species have <source> | <identifier> pairs, followed by a description.
Add a separator clause: "% ( \s* [|] \s* )"
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1|hypothetical

becomes:

    { gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical' }
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result: head with sources, "desc"

    fasta => {
        body => "MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...",
        head => [
            {
                desc => "30S ribosomal protein S18 [Lactococ...",
                gi   => "15674171",
                ref  => "NP_268346.1",
            },
            {
                desc => "30S ribosomal protein S18 [Lactoco...",
                gi   => "116513137",
                ref  => "YP_812044.1",
            },
            ...
Balancing R::G with calling code
The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys, with literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking the input on ">" removes it as a leading character.
– Making <start> optional with "<start>?" fixes the problem:

    local $/ = ">";

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta> <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

        <rule: head> .+ <ws>
        <rule: body> ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

        <token: start>   ^ [>]
        <token: comment> ^ ; .+
        <token: seq>     ^ ( [\n\w\-]+ )
    }xm;
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result: Use grammars
Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing and basic compilation.
– Code is only needed for cleanups or re-arranging structs.
Code can simplify your grammar:
– Too much code makes them hard to maintain.
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.
Either way, the result is going to be more maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18
v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.
The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller.
Look up $^H in perlvar to see how it works.

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on Regexp::Grammars
The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
The demo directory has a number of working – if un-annotated – examples.
"perldoc perlre" shows how recursive matching works in v5.10+.
PerlMonks has plenty of good postings.
There is a Perl Review article by brian d foy on recursive matching in Perl 5.10.
The output needs help however
The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string
Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex
ndash ldquoalmostrdquo because the code cannot include regexen
seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Munging results $MATCH
The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse
In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex
ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
One more step Remove the arrayref
Now the body is a single string
No need for an arrayref to contain one string Since the body has one entry assign offset zero
body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]
ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
Result a generic FASTA parser
fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
The head and body are easily accessible Next parse the nr-specific header
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller. Look up $^H in perlvar to see how it works.

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );
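A minimal standalone sketch of what that pragma pair buys the calling code (this is a plain inline-code regex, not the actual R::G fix; the variable name is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use re qw( eval );      # permit (?{ ... }) code blocks in patterns
no strict qw( vars );   # let the code block touch package vars, R::G-style

# The (?{ ... }) block runs when the engine matches 'b'.
'abc' =~ m{ b (?{ $hit = 'found b' }) }x;

print "$hit\n";
```

Without "no strict 'vars'", the bareword-style package variable inside the code block would be a compile-time strict violation; that is the same trap $MATCH and friends fall into under v5.18.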
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize NFA regexen, however.

Or, better yet, use Perl6 grammars: available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive. [comfortable chair, enjoyable beverage suggested]

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review article by brian d foy on recursive matching in Perl 5.10.
Result: a generic FASTA parser

{
    fasta => [
        {
            body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
            head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
        }
    ]
}

The head and body are easily accessible. Next: parse the nr-specific header.
Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, the derived grammar references the base grammar and extracts a list of fasta entries:

<extends: ParseFasta>

<[fasta]>+
Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
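The separator-vs-terminator asymmetry is easy to reproduce with a plain split; this sketch uses a made-up two-identifier header rather than real nr.gz data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \cA ("\x01") separates identifiers; the last one ends in "\n" instead.
my $head = "gi|1|ref|A\x01gi|2|ref|B\n";

# Splitting on the separator leaves the tail item holding its newline:
my @ident = split /\x01/, $head;

# The tail needs its own cleanup, which is what the grammar's
# separate "final" token below ends up doing.
chomp $ident[-1];

print "$_\n" for @ident;
```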
First pass: literal "tail" item

This works, but is ugly.
– Have two rules: one for the main list, one for the tail.
– Alias the tail to get them all in one place.

<rule: head> <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors

    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident>  .+? \cA
<token: final>  .+ \n
Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}
Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91        numbers by commas, spaces
– g-c-a-g-t-t-a-c-a         characters by dashes
– /usr/local/bin            basenames by dir markers
– /usr:/usr/local/bin       dirs separated by colons

...that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

<rule: list>     <[item]>+ % <separator>    # one-or-more
<rule: list_zom> <[item]>* % <separator>    # zero-or-more
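Outside a grammar the same separated lists usually fall to split; these patterns mirror the four examples above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @numberz = split /\s*,\s*/, '1, 2, 3, 4, 13, 91';    # commas + spaces
my @basez   = split /-/,       'g-c-a-g-t-t-a-c-a';     # dashes
my @namez   = split m{/},      'usr/local/bin';         # dir markers
my @dirz    = split /:/,       '/usr:/usr/local/bin';   # colons

print "@numberz\n";     # 1 2 3 4 13 91
```

split discards the separators for you, which is exactly what the "%" syntax does inside a rule; the difference is that the grammar also names and nests what it keeps.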
Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident>  .+?
}xm;
Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]'
        ]
    }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs, followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical...

becomes:

( gi => 66816243, ref => 'XP_642131.1', desc => 'hypothetical...' )
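The pairs-plus-description munge can be sketched in core Perl (hypothetical input string; in the talk the real work happens inside the grammar's code block):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical protein';

# taxa are the '|'-separated fields; the description is the tail item.
my @taxa = split /\s*[|]\s*/, $ident;
my $desc = pop @taxa;

# The remaining fields alternate source => identifier, so they
# flatten straight into a hash alongside the description.
my %entry = ( @taxa, desc => $desc );

print "$_ => $entry{ $_ }\n" for sort keys %entry;
```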
Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?
Result: head with sources & "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head => [
        { desc => '30S ribosomal protein S18 [Lactococ...', gi => 15674171,  ref => 'NP_268346.1' },
        { desc => '30S ribosomal protein S18 [Lactoco...', gi => 116513137, ref => 'YP_812044.1' },
        ...
    ]
}
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Deriving a grammar
Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case
References the grammar and extracts a list of fasta entries
ltextends ParseFastagt
lt[fasta]gt+
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Splitting the head into identifiers
Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species
Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on
ndash Using ldquo+[cAn] walks off the header onto the sequence
ndash This is a common problem with separators amp tokenizers
ndash This can be handled with special tokens in the grammar but RG provides a cleaner way
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
First pass Literal ldquotailrdquo item
This works but is uglyndash Have two rules for the main list and tail
ndash Alias the tail to get them all in one place
ltrule headgt lt[ident]gt+ lt[ident=final]gt (
remove the matched anchors
trcAnd for $MATCH ident )
lttoken ident gt + cAlttoken final gt + n
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Breaking up the header
The last header item is aliased to ldquoidentrdquo Breaks up all of the entries
head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]
Dealing with separators ltsepgt
Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces
ndash g-c-a-g-t-t-a-c-a characters by dashes
ndash usrlocalbin basenames by dir markers
ndash usrusrlocalbin dirs separated by colons
that RG has special syntax for dealing with them Combining the item with and a seprator
ltrule listgt lt[item]gt+ ltseparatorgt one-or-more
ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize NDF regexen, however.

Or better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
The demo directory has a number of working – if un-annotated – examples.
"perldoc perlre" shows how recursive matching works in v5.10+.
PerlMonks has plenty of good postings.
Perl Review article by brian d foy on recursive matching in Perl 5.10.
Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91 (numbers by commas, spaces)
– g-c-a-g-t-t-a-c-a (characters by dashes)
– /usr/local/bin (basenames by dir markers)
– /usr:/usr/local/bin (dirs separated by colons)

...that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

    <rule: list>        <[item]>+ % <separator>    # one-or-more
    <rule: list_zom>    <[item]>* % <separator>    # zero-or-more
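For contrast, a plain-Perl sketch of the manual splitting the separator syntax replaces (the split patterns are my own approximations):

```perl
use strict;
use warnings;

# Manual equivalents of <[item]>+ % <separator> for the examples above:
# split does the item/separator bookkeeping by hand.
my @numbers = split /\s*,\s*|\s+/, '1, 2, 3, 4, 13, 91';
my @bases   = split /-/,           'g-c-a-g-t-t-a-c-a';
my @dirs    = split /:/,           '/usr:/usr/local/bin';

print "@numbers\n";   # 1 2 3 4 13 91
print "@bases\n";     # g c a g t t a c a
print "@dirs\n";      # /usr /usr/local/bin
```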
Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        <token: ident>  .+?
    }xm
Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule: head>    <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })
        <token: ident>  .+?
    }xm
Result

    fasta => [
      {
        body => "MASTQNIVEEVQKMLDTNPDQ...",
        head => [
          "gi|66816243|ref|XP_6... rt=CAF-1...",
          "gi|793761|dbj|BAA0626... oideum]",
          "gi|60470106|gb|EAL68086... m discoideum AX4]",
        ],
      },
      ...
    ]

The fasta content is broken into the usual "body" plus a "head", broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.
One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1|hypothetical ...

becomes:

    { gi => 66816243, ref => 'XP_642131.1', desc => 'hypothetical ...' }
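The intended transform can be sketched in plain core Perl (my own standalone version of the grammar's munging step; the input string follows the slide's example):

```perl
use strict;
use warnings;

# Sketch of the ident munging: taxa pairs followed by a description.
my $ident = 'gi|66816243|ref|XP_642131.1|hypothetical protein';

my @taxa = split /\s*[|]\s*/, $ident;
my $desc = pop @taxa;                 # trailing field is the description

my %head = ( @taxa, desc => $desc );  # remaining fields pair up as key => value

# %head is now: gi => 66816243, ref => 'XP_642131.1',
#               desc => 'hypothetical protein'
```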
Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
Result: head with sources & "desc"

    fasta => {
      body => "MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...",
      head => [
        {
          desc => "30S ribosomal protein S18 [Lactococ...",
          gi   => 15674171,
          ref  => "NP_268346.1",
        },
        {
          desc => "30S ribosomal protein S18 [Lactoco...",
          gi   => 116513137,
          ref  => "YP_812044.1",
        },
        ...
      ],
    }
Balancing R::G with calling code

The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries; but chunking the input on ">" removes it as a leading character.
– Making it optional with <start> fixes the problem.
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Cleaner nrgz header rule Separator syntax cleans things up
ndash No more tail rule with an alias
ndash No code block required to strip the separators and trailing newline
ndash Non-greedy match ldquo+rdquo avoids capturing separators
qr ltnocontextgt
ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Nested ldquoidentrdquo tag is extraneous
Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier
contents
qr ltnocontextgt ltextends ParseFastagt
lt[fasta]gt+
ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )
lttoken ident gt + xm
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Result
fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]
The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries
Not bad for a dozen lines of grammar with a few lines of code
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
One more level of structure idents
Species have ltsource gt | ltidentifiergt pairs followed by a description
Add a separator clause ldquo (s|s)rdquo
ndash This can be parsed into a hash something like
gi|66816243|ref|XP_6421311|hypothetical
Becomes
gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Munging the separated input
ltfastagt ( my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz )
ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Result head with sources ldquodescrdquo
fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on RegexpGrammars
The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]
The demo directory has a number of working ndash if un-annotated ndash examples
ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl
510
Balancing RG with calling code
The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in
the heads
ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor
ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
while( my $chunk = readline )
chomplength $chunk or do --$ next
$chunk =~ $nr_gz
process single fasta record in
Fasta base grammar 3 lines of codeqr
ltgrammar ParseFastagt
ltnocontextgt
ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(
$MATCH body = $MATCH body [0])
ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(
$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd
)
lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )
xm
Extension to Fasta 6 lines of codeqr
ltnocontextgtltextends ParseFastagtltfastagt(
my $identz = delete $MATCH fasta head ident
for( $identz )
my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc
$MATCH fasta head = $identz)
ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +
xm
Result Use grammars
Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation
ndash Code only needed for cleanups or re-arranging structs
Code can simplify your grammarndash Too much code makes them hard to maintain
ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code
Either way the result is going to be more maintainable than hardwiring the grammar into code
Aside KwikFix for Perl v518
v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front
ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo
ndash One for the Perl code the other for $MATCH and friends
The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller
Look up $^H in perlvars to see how it works
require re re-gtimport( eval )require strict strict-gtunimport( vars )
Use RegexpGrammars
Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars
RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however
Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6
More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

“perldoc perlre” shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.
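The recursive matching that perlre documents can be shown with a classic core-Perl sketch (unrelated to the Fasta grammars): a named group that calls itself via (?&NAME) to match arbitrarily nested balanced parentheses.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

# v5.10+ recursive subpattern: <paren> matches one balanced (...) group,
# recursing into itself via (?&paren) for nested groups.
my $balanced = qr/ (?<paren> \( (?: [^()]++ | (?&paren) )* \) ) /x;

for my $text ( '(a(b)c)', '(a(b)c' )
{
    say "$text: ", ( $text =~ /\A$balanced\z/ ? 'balanced' : 'not balanced' );
}
```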