kato-ryoichi
record-oriented grep
mlr-grep
ryo1kato@github @gmail @twitter @facebook
motivation
Want to "grep" multi-line entries in a file
✦ multi-line log files, or *.ini, etc.
✦ semi-structured text like an ifconfig output
for example...
$ cat data.txt
[one]
two
three
[foo]
bar
baz
[hoge]
piyo
huga
} want to extract all lines of each record that contains a pattern, where a record …
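The idea can be sketched in a few lines of Python. This is only an illustration of record-oriented matching, not the code of any mlr-grep implementation: the function names `split_records` and `grep_records` are hypothetical, and it assumes a record starts at each line matching a separator regex (here `^\[`, as for the data.txt example).

```python
import re

def split_records(text, rs):
    """Split text into records; a new record starts at each line matching rs."""
    sep = re.compile(rs)
    records, current = [], []
    for line in text.splitlines(keepends=True):
        # A separator line closes the record collected so far.
        if sep.search(line) and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records

def grep_records(text, pattern, rs):
    """Return every whole record that contains pattern."""
    pat = re.compile(pattern)
    return [r for r in split_records(text, rs) if pat.search(r)]

# Contents of data.txt from the example above.
DATA = "[one]\ntwo\nthree\n[foo]\nbar\nbaz\n[hoge]\npiyo\nhuga\n"

if __name__ == "__main__":
    print("".join(grep_records(DATA, "ba", r"^\[")), end="")
```

Searching for 'ba' returns the whole `[foo]` record, not only the two lines that match.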
Typical way
✦ grep -A 12 -B 34 -C 56
✦ pcregrep --multiline
✦ awk -v RS='\n\n' "/$re/"
✦ perl -e …
But
✦ pcregrep: You often need a very long regex.
✦ Note that this is NOT about finding a multi-line pattern (a pattern containing '\n'), but about extracting a multi-line record that contains a pattern.
✦ AWK: Possible using RS (needs gawk)
✦ Actually, it's difficult to get it right using pcregrep or awk.
✦ perl, python: well, if you go that far ...
But, do you want to write a one-liner / X script for these?
✦ zgrep
✦ grep -c (--count)
✦ grep -i (--ignore-case)
✦ grep -v (--invert-match)
✦ grep --color
So I wrote it for you!
✦ mlr-grep
✦ Multi-Line Record Grep
✦ AWK, Haskell, Python
✦ named amlgrep, hmlgrep, and pmlgrep
✦ They have almost identical features.
e.g.
$ amlgrep 'ba' …
[foo]
bar
baz
} A whole record containing the pattern
✦ amlgrep - AWK implementation
  ✦ Needs gawk.
  ✦ Fastest
  ✦ --rs regex is slightly broken in RHEL5.
  ✦ Automatically decompresses *.gz, *.bz2, and *.xz files
  ✦ --color, --count, --invert-match
  ✦ AND / OR matching of multiple keywords.
✦ hmlgrep - Haskell implementation
  ✦ Has almost the same feature set as the AWK version.
  ✦ Sometimes 1.5 to 2x slower, on files with short lines and many matches.
✦ pymlgrep - Python implementation
  ✦ Slowest (4x the run time of the AWK version)
  ✦ Doesn't support multiple keywords
Multiple Keywords
$ amlgrep [--or] h t [FILE]
[one]
two
three
[hoge]
piyo
huga
≒ egrep 'h|t', but with fewer keystrokes.
$ amlgrep --and h t [FILE]
[one]
two
three
≒ egrep 'h.*t|t.*h', but with fewer keystrokes.
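The OR and AND semantics shown above can be sketched as follows. This is a minimal Python illustration under the assumption that each keyword is tested against the whole record, so the keywords may sit on different lines; `match_record` and `grep_keywords` are hypothetical names, not part of mlr-grep.

```python
import re

def split_records(text, rs):
    """Split text into records; a new record starts at each line matching rs."""
    sep = re.compile(rs)
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if sep.search(line) and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records

def match_record(record, keywords, mode="or"):
    """OR: any keyword occurs in the record. AND: every keyword occurs."""
    hits = [re.search(k, record) is not None for k in keywords]
    return all(hits) if mode == "and" else any(hits)

def grep_keywords(text, keywords, rs, mode="or"):
    """Return whole records selected by the OR / AND keyword test."""
    return [r for r in split_records(text, rs) if match_record(r, keywords, mode)]

# Contents of data.txt from the earlier example.
DATA = "[one]\ntwo\nthree\n[foo]\nbar\nbaz\n[hoge]\npiyo\nhuga\n"

if __name__ == "__main__":
    print("".join(grep_keywords(DATA, ["h", "t"], r"^\[", mode="or")), end="")
    print("--")
    print("".join(grep_keywords(DATA, ["h", "t"], r"^\[", mode="and")), end="")
```

With `mode="and"` only the `[one]` record survives, because `[hoge]` contains 'h' but no 't'.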
--timestamp
For multi-line log files in which each entry begins with a timestamp
$ cat datetime.log
2014-01-23 12:34:56 log 1 foo bar
2014-01-24 12:34:57 log 2 one two
2014-01-25 12:34:58 log 3 hoge piyo
$ amlgrep -t 'one' …
2014-01-24 12:34:57 log 2 one two
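A rough Python sketch of timestamp-delimited matching follows. It is not the tool's implementation: `grep_log` is a hypothetical name, the `TIMESTAMP` regex is a simplified stand-in that only recognizes a leading `20YY-MM-DD` date, and the line breaks inside each sample entry are an assumption made for illustration.

```python
import re

# Simplified assumption: an entry starts with a date such as 2014-01-24.
TIMESTAMP = re.compile(r"^20[0-9]{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])")

def grep_log(text, pattern):
    """Return whole log entries (timestamp line plus continuation lines) matching pattern."""
    pat = re.compile(pattern)
    entries, current = [], []
    for line in text.splitlines(keepends=True):
        # A line starting with a timestamp opens a new entry.
        if TIMESTAMP.match(line) and current:
            entries.append("".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("".join(current))
    return [e for e in entries if pat.search(e)]

# Hypothetical multi-line layout of the three entries in datetime.log.
LOG = (
    "2014-01-23 12:34:56 log 1\nfoo\nbar\n"
    "2014-01-24 12:34:57 log 2\none\ntwo\n"
    "2014-01-25 12:34:58 log 3\nhoge\npiyo\n"
)

if __name__ == "__main__":
    print("".join(grep_log(LOG, "one")), end="")
```

Only the entry stamped 2014-01-24 is returned, together with its continuation lines.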
$ amlgrep -t --dump foo
gawk -W re-interval -F \n -v RS='\n(((Mon|Tue|Wed|Thu|Fri|Sat),?[ \t]+)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec),?[ \t]*[0-9]{1,2},?[ \t][0-2][0-9]:[0-5][0-9](:[0-5][0-9])?(,?[ \t]20[0-9][0-9])?|20[0-9][0-9]-(0[0-9]|11|12)-(0[1-9]|[12][0-9]|3[01]))' '-v' 'ORS=' 'oldRT $0 ~ /foo/ {i++;if(substr(oldRT,1,1)=="\n"){h=substr(oldRT,2)}else{h=oldRT};;gsub(/foo/,"&",h);print h;gsub(/foo/, "&");print;if(RT != "")printf "\n"} {oldRT=RT} END{if (i>0){exit 0}else{exit 1}}'
Change the record separator
✦ --rs '^$'
  ✦ Empty lines
✦ --rs '^----'
  ✦ A line beginning with four or more dashes
✦ --rs '^[[:alnum:]]'
  ✦ An alphanumeric character in the first column (for ifconfig-like output)
✦ --rs '^\['
  ✦ A line beginning with '[' (for *.ini files)
✦ --timestamp
  ≒ --rs '^(((Mon|Tue|Wed|Thu|Fri|Sat),?[ \t]+)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec),?[ \t]*[0-9]{1,2},?[ \t][0-2][0-9]:[0-5][0-9](:[0-5][0-9])?(,?[ \t]20[0-9][0-9])?|20[0-9][0-9]-(0[0-9]|11|12)-(0[1-9]|[12][0-9]|3[01]))'
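To show how the choice of separator changes what counts as a record, here is a small Python sketch. The `SEPARATORS` table and `grep_records` are hypothetical names, the sample inputs are made up, and because Python's `re` module has no POSIX `[[:alnum:]]` class, `[A-Za-z0-9]` is used as an approximation.

```python
import re

# Separator regexes mirroring the --rs examples (Python approximations).
SEPARATORS = {
    "blank": r"^$",            # empty lines
    "dashes": r"^----",        # lines beginning with four or more dashes
    "alnum": r"^[A-Za-z0-9]",  # alphanumeric in the first column
    "ini": r"^\[",             # lines beginning with '['
}

def grep_records(text, pattern, rs):
    """Return whole records containing pattern; a record starts at each line matching rs."""
    sep, pat = re.compile(rs), re.compile(pattern)
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if sep.search(line) and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return [r for r in records if pat.search(r)]

# Made-up ifconfig-like text: interface names start in the first column.
IFCONFIG = (
    "eth0      Link encap:Ethernet\n"
    "          inet addr:192.168.0.2\n"
    "lo        Link encap:Local Loopback\n"
    "          inet addr:127.0.0.1\n"
)

# Made-up paragraphs separated by an empty line.
PARAGRAPHS = "a\nb\n\nc\nd\n"

if __name__ == "__main__":
    print("".join(grep_records(IFCONFIG, r"127\.0\.0\.1", SEPARATORS["alnum"])), end="")
```

In this sketch the separator line stays at the start of the record it opens, so with the "blank" separator a matching record begins with the empty line.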
http://github.com/ryo1kato/mlr-grep