Advanced Text Processing
Lecture Overview
Character manipulation commands cut, paste, tr
Line manipulation commands sort, uniq, diff
Regular expressions and grep
Text replacement using sed
Cutting Lines – cut
The cut command extracts sections from each line of the input file
Command line options for cut:
  -c – output only these characters
  -f – output only these fields
  -d – use this character as the field delimiter
cut options [files]
Cutting Lines – cut
With cut, at least one of the selection options (-c or -f) must be specified
The value given with -c or -f can be:
  A number – specifies a single character position
  A range – specifies a sequence of positions
  A comma separated list – specifies multiple positions or ranges
cut – Examples
Given a file called 'my_phones.txt':
ADAMS, Andrew 7583
BARRETT, Bruce 6466
BAYES, Ryan 6585
BECK, Bill 6346
BENNETT, Peter 7456
GRAHAM, Linda 6141
HARMER, Peter 7484
MAKORTOFF, Peter 7328
MEASDAY, David 6494
NAKAMURA, Satoshi 6453
REEVE, Shirley 7391
ROSNER, David 6830
cut – Examples
head -3 my_phones.txt | cut -c3-16
AMS, Andrew 75
RRETT, Bruce 6
YES, Ryan 6585
head -3 my_phones.txt | cut -d" " -f2
Andrew
Bruce
Ryan
head -3 my_phones.txt | cut -c1-3,10,12,15-18
ADAde7583
BARBu 646
BAYa 85
Merging Files – paste
The paste command merges multiple files by concatenating corresponding lines
Command line options for paste:
  -d – provide a list of separator characters
  -s – paste one file at a time instead of in parallel (each file becomes a single line)
paste [options] [files]
paste – Examples
Assume that we are given 3 input files (shown side by side):
first.txt   last.txt    num.txt
Andrew      ADAMS       7583
Bruce       BARRETT     6466
Ryan        BAYES       6585
Bill        BECK        6346
Peter       BENNETT     7456
Linda       GRAHAM      6141
Peter       HARMER      7484
Peter       MAKORTOFF   7328
David       MEASDAY     6494
Satoshi     NAKAMURA    6453
paste – Examples
paste first.txt last.txt num.txt | head -3
Andrew ADAMS 7583
Bruce BARRETT 6466
Ryan BAYES 6585
paste -d" :" first.txt last.txt num.txt | head -3
Andrew ADAMS:7583
Bruce BARRETT:6466
Ryan BAYES:6585
paste -s last.txt first.txt num.txt | cut -f1-5,10
ADAMS BARRETT BAYES BECK BENNETT NAKAMURA
Andrew Bruce Ryan Bill Peter Satoshi
7583 6466 6585 6346 7456 6453
Translating Characters – tr
The tr command is used to translate between one character set and another
Input is read from standard input and written to standard output (no files)
With no options, tr accepts two character sets with equal lengths, and replaces each character with the corresponding one
tr [options] set1 [set2]
Deleting or Squeezing Characters – tr
Sets contain literal characters, or character ranges, such as: 'a-z' or 'DEFa-z'
With command line options, tr can also be used to delete or squeeze characters
Command line options for tr:
  -d – delete characters in set1
  -s – squeeze each run of a repeated character from set1 into a single occurrence
Defining Sets for tr
tr has some interpreted sequences to simplify the definition of sets:
  [:alpha:] – all letters
  [:digit:] – all digits
  [:alnum:] – all letters and digits
  [:space:] – all whitespace
  [:punct:] – all punctuation characters
  [CHAR*REPEAT] – REPEAT copies of CHAR (in set2)
  [CHAR*] – in set2, copies of CHAR until the length of set1 is reached
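The set definitions above can be tried on a short string. This is a minimal sketch, assuming a tr that supports the listed sequences (GNU tr does); the sample text and variable names are made up for illustration.

```shell
# Made-up sample text for illustration.
text='Call 555-1234, now!'

# Delete every digit using the [:digit:] class.
no_digits=$(printf '%s\n' "$text" | tr -d '[:digit:]')
echo "$no_digits"

# Map every punctuation character to '_': [_*] repeats '_' in set2
# until set2 is as long as set1.
masked=$(printf '%s\n' "$text" | tr '[:punct:]' '[_*]')
echo "$masked"

# Squeeze runs of whitespace into a single character.
squeezed=$(printf 'one   two\n' | tr -s '[:space:]')
echo "$squeezed"
```

Quoting the sets keeps the shell from treating the brackets as a filename pattern.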
tr – Examples
Change lower case to capital, and replace the digits 6, 7, 8 with the letters x, y, z
head -3 padded_phones.txt
ADAMS     Andrew    7583
BARRETT   Bruce     6466
BAYES     Ryan      6585
head -3 padded_phones.txt | tr 'a-z678' 'A-Zxyz'
ADAMS     ANDREW    y5z3
BARRETT   BRUCE     x4xx
BAYES     RYAN      x5z5
tr – Examples
Delete spaces, and digits 7 and 8:
head -3 padded_phones.txt | tr -d " 78"
ADAMSAndrew53
BARRETTBruce6466
BAYESRyan655
Squeeze sequences of spaces into one:
head -3 padded_phones.txt | tr -s " "
ADAMS Andrew 7583
BARRETT Bruce 6466
BAYES Ryan 6585
Reading from Standard Input
Many UNIX commands accept one or more input files listed in the command line (tr is one of the few that don't)
If no input file is given, these commands will read from the standard input
Alternatively, if the file list contains a '-', the standard input will be inserted in its place
Standard Input – Example
cat last.txt | tr "A-Z" "a-z" | \
  paste -d"_" first.txt - number.txt | head -10
Andrew_adams_7583
Imelda_aguilar_6518
Daniel_albers_7540
Pierre_amaudruz_7567
Friedhelm_ames_7581
Willy_andersson_6238
Andrei_andreyev_6491
Jonathan_aoki_6820
Donald_arseneau_6295
Danny_ashery_6188
Sorting Files – sort
The sort command reorders the lines in a file (or files), and sends the result to the standard output
Command line options for sort:
  -f – ignore case (fold lowercase to uppercase)
  -r – sort in reverse order
  -n – sort in numeric order
sort [options] [files]
Sorting Files – sort
With no options given, the input is sorted in the collating order of the current locale (ASCII code order in the C locale)
The sort command has many more options for selecting which fields to sort by, and for changing the way input is treated
As always, you should read the man pages for the full details
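As an illustration of sorting by a chosen field, the sketch below uses the -t (field separator) and -k (sort key) options of sort. The colon-separated sample data and the variable names are assumptions made for this example.

```shell
# Made-up sample data: name:number
data='carol:30
alice:25
bob:7'

# Sort by the first field only (names).
by_name=$(printf '%s\n' "$data" | sort -t: -k1,1)
printf '%s\n' "$by_name"

# Sort by the second field, compared numerically.
by_num=$(printf '%s\n' "$data" | sort -t: -k2,2n)
printf '%s\n' "$by_num"
```

Writing the key as -k2,2 limits it to the second field; a bare -k2 would extend the key to the end of the line.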
sort – Example: Using Ignore-Case
Input     sort      sort -f
Bruce     Andrew    Andrew
Ryan      Bruce     bill
peter     Ryan      Bruce
Andrew    bill      peter
bill      peter     Ryan
sort – Example: Sorting Numbers
Input      sort       sort -n
38         1256875    18
18         18         38
1256875    38         66
66         575        575
575        66         1256875
Removing Duplicate Lines – uniq
The uniq command removes adjacent duplicate lines from its input file
If the input is sorted, this removes all duplicate lines
Command line options for uniq:
  -i – ignore case
  -c – prefix lines by the number of occurrences
  -d – only print duplicate lines
  -u – only print unique lines
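Because uniq only compares adjacent lines, unsorted input can keep duplicates. A small sketch of the difference, using made-up names and hypothetical variable names:

```shell
# Made-up, unsorted input with non-adjacent duplicates.
names='Peter
David
Peter
Peter
David'

# Only the adjacent pair of "Peter" lines is collapsed.
adjacent_only=$(printf '%s\n' "$names" | uniq)
printf '%s\n' "$adjacent_only"

# Sorting first makes all duplicates adjacent, so every repeat is removed.
all_removed=$(printf '%s\n' "$names" | sort | uniq)
printf '%s\n' "$all_removed"
```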
uniq – Example
Input     uniq      uniq -c
Andrew    Andrew    1 Andrew
Bill      Bill      1 Bill
David     David     2 David
David     Peter     3 Peter
Peter     Ryan      1 Ryan
Peter
Peter
Ryan
uniq – Example
Input     uniq -u   uniq -d
Andrew    Andrew    David
Bill      Bill      Peter
David     Ryan
David
Peter
Peter
Peter
Ryan
Example – File Processing Using Pipes
Task – go over the book "War and Peace" and count the appearances of each word
Step 1: remove all punctuation marks
cat war_and_peace.txt | tr -d '[:punct:]'
Step 2: put each word in a separate line
cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n"
Step 3: sort words
cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort
Example – File Processing Using Pipes
Step 4: count appearances of each word
cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c
Step 5: sort result by number of appearances
cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr
Step 6: write output to file
cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr > words.txt
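The same pipeline can be tried on a short string instead of a whole book. The sketch below is a variant, not the exact commands above: it also folds upper case to lower case and squeezes all whitespace into single newlines, so that "The" and "the" are counted together and no empty lines are counted. The sample sentence and variable names are made up.

```shell
# Made-up sample text standing in for the book.
sample='The cat saw the dog. The dog ran!'

counts=$(printf '%s\n' "$sample" |
    tr -d '[:punct:]' |             # step 1: remove punctuation
    tr '[:upper:]' '[:lower:]' |    # extra step: fold case
    tr -s '[:space:]' '\n' |        # step 2: one word per line
    sort |                          # step 3: sort words
    uniq -c |                       # step 4: count each word
    sort -nr)                       # step 5: most frequent first
printf '%s\n' "$counts"

# The most frequent word, with the count padding squeezed to one space.
top=$(printf '%s\n' "$counts" | head -1 | tr -s ' ')
echo "$top"
```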
Comparing Text Files – diff
The diff command takes two input files, and compares them
The output contains only the lines that differ, with their line numbers
Command line options for diff:
  -i – ignore case
  -b – ignore changes in amount of white space
  -B – ignore insertion or deletion of blank lines
diff – Examples
First file:
ADAMS Andrew 7583
BARRETT Bruce 6466
BAYES Ryan 6585
BECK Bill 6346
BENNETT Peter 7456
Second file (line 3 differs only in the amount of white space):
ADAMS Andrew 7583
BARRETT Bruce 3333
BAYES   Ryan 6585
BECK Bill 6346
Bennett peter 7456
diff output:
2,3c2,3
< BARRETT Bruce 6466
< BAYES Ryan 6585
---
> BARRETT Bruce 3333
> BAYES   Ryan 6585
5c5
< BENNETT Peter 7456
---
> Bennett peter 7456
diff – Examples
Comparing the same two files:
First file:
ADAMS Andrew 7583
BARRETT Bruce 6466
BAYES Ryan 6585
BECK Bill 6346
BENNETT Peter 7456
Second file:
ADAMS Andrew 7583
BARRETT Bruce 3333
BAYES   Ryan 6585
BECK Bill 6346
Bennett peter 7456
diff -b output:
2c2
< BARRETT Bruce 6466
---
> BARRETT Bruce 3333
5c5
< BENNETT Peter 7456
---
> Bennett peter 7456
diff -bi output:
2c2
< BARRETT Bruce 6466
---
> BARRETT Bruce 3333
Maintaining Output Consistency
During program development, assume that we have reached the correct output
We want to verify that it does not change
Create reference output file:
prog > prog.out
After changing the program, compare output:
prog | diff - prog.out
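The check can be scripted using the exit status of diff, which is 0 when the inputs are identical and 1 when they differ. A minimal sketch, where the shell function prog is a hypothetical stand-in for the program under development and the temporary file plays the role of prog.out:

```shell
# Hypothetical stand-in for the program under development.
prog() { printf 'result: %d\n' 42; }

# Create the reference output once.
ref=$(mktemp)
prog > "$ref"

# Later: compare fresh output (read from standard input via '-')
# against the reference file.
if prog | diff - "$ref" > /dev/null; then
    status="unchanged"
else
    status="changed"
fi
echo "$status"

rm -f "$ref"
```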
Searching For Matching Patterns – grep
The grep command searches files for patterns, and prints matching lines
The mandatory regexp argument defines a regular expression
A regular expression is a formula for matching strings that follow some pattern
grep [options] regexp [files]
Searching For Matching Patterns – grep
The simplest regular expression is just a sequence of characters
This regular expression matches only a single string – itself
The following command prints all lines from any of the files that contain word:
grep word files
Searching For Matching Patterns – grep
The power of grep lies in using more sophisticated regular expressions
Command line options for grep:
  -v – print all lines that don't match
  -c – print only a count of matched lines
  -n – print line numbers
  -h – don't print file names (for multiple files)
  -l – print file name but not matching line
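The options above can be compared on two small files. This is a sketch only: the file names a.txt and b.txt, their contents, and the variable names are assumptions made for the example.

```shell
# Two made-up input files in a temporary directory.
dir=$(mktemp -d)
printf 'big boy\nbad bug\nbetter\n' > "$dir/a.txt"
printf 'bag\nboogie nights\n' > "$dir/b.txt"

count=$(grep -c 'b.g' "$dir/a.txt")       # number of matching lines
numbered=$(grep -n 'bug' "$dir/a.txt")    # matching lines with line numbers
inverted=$(grep -v 'b.g' "$dir/a.txt")    # lines that do NOT match
files=$(cd "$dir" && grep -l 'bag' a.txt b.txt)   # names of matching files
plain=$(cd "$dir" && grep -h 'b.g' a.txt b.txt)   # matches without file names

printf '%s\n' "$count" "$numbered" "$inverted" "$files" "$plain"
rm -rf "$dir"
```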
Regular Expressions
Regular expressions are a powerful tool for searching and selecting text
They originate in automata theory, and were popularized by UNIX tools such as the grep command
They have since been copied into many other tools and languages such as awk, sed, perl and Java
Regular Expressions vs. Filename Expansion
Note that regular expressions are different from filename expansion
Filename expansion uses some regular expression concepts and symbols, but:
  Filename expansion is done by the shell
  Regular expressions are passed as arguments to specific commands or utilities
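The difference can be seen by listing files both ways. A minimal sketch, assuming three made-up file names in a temporary directory:

```shell
# Made-up files for illustration.
dir=$(mktemp -d)
touch "$dir/main.c" "$dir/util.h" "$dir/notes.txt"

# Filename expansion: the shell itself replaces *.c with matching names
# before echo runs.
globbed=$(cd "$dir" && echo *.c)
echo "$globbed"

# Regular expression: the quotes stop the shell from touching the pattern,
# so grep receives it as an argument and interprets it.
matched=$(cd "$dir" && ls | grep '\.[ch]$')
printf '%s\n' "$matched"

rm -rf "$dir"
```

Quoting regular expressions in single quotes is a common habit for this reason: characters such as * and [ would otherwise be candidates for filename expansion.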
Matching a Single Character
A period (.) matches any single character
For example:
Regular Expression   Matches              Doesn't Match
b.g                  bag, debug, bigger   brag, bg, bad
U..X                 UNIX                 unix
.                    a, b, c              An empty line
Matching a Character Class
Square brackets ([]) match any single character within the brackets
If the first character following the left bracket is a '^', the expression matches any character not in the brackets
A '-' can be used to indicate a range, such as: [a-z]
Matching a Character Class
Regular Expression   Matches                  Doesn't Match
[Bb]ill              Bill, bill, got billed   Dill, ill, kill
t[aeiou].k           talk, stack, stink       track, take
number [^0-5]        number xxx, number 8:    number 59
Matching a Character Class
The same predefined character classes used for tr can also be used here
For portability reasons, [:alpha:] is always preferable to [A-Za-z]
Note: the brackets are part of the symbolic names, and must be included in addition to the enclosing brackets, i.e. [[:alpha:]]
Matching Repetitions
An asterisk (*) represents zero or more matches of the regular expression it follows
Regular Expression   Matches                   Doesn't Match
ab*c                 ac, abc, aaabbbc          aba, ca, cb
t.*ing               thing, string, thinking   king
Matching Special Characters
Sometimes we want to literally match a character that has a special meaning, such as '*' or '['
There are two ways to do that:
  Precede the character with a '\'
  Use square brackets – most special characters inside are taken literally
Matching Special Characters
Regular Expression   Matches                 Doesn't Match
a\.c                 a.c                     abc
\.\.\.*              the end..., more.....   abc, stop.
[*.]                 *, start *, Sys.print   Hello world, abc
C:\\bin              C:\bin                  C:\\bin
Matching the Beginning or the End of a Line
A regular expression that begins with a caret (^) can match a string only at the beginning of a line
Similarly, a regular expression that ends with a dollar sign ($) can match a string only at the end of a line
Matching the Beginning or the End of a Line
Regular Expression   Matches                Doesn't Match
^T                   This line, That bug    START, My Tag
^num.*[0-9]$         num5, num99, number 1  my num1, the number 6, num 6a
^t.*k$               talk, track, tk        stack, take
Using Regular Expressions with grep – Examples
cat bugs.txt
big boy
bad bug
bag
bigger bag
better
boogie nights
grep 'b.g' bugs.txt
big boy
bad bug
bag
bigger bag
grep 'b.g.' bugs.txt
big boy
bigger bag
grep 'b.*g.' bugs.txt
big boy
bigger bag
boogie nights
Using Regular Expressions with grep – Examples
cat f.txt
ADAMS,
Andrew
7583
BARRETT,
Bruce
6466
BAYES,
Ryan
6585
grep '[[:alpha:]],' f.txt
ADAMS,
BARRETT,
BAYES,
grep '^[C-Z][[:lower:]]*$' f.txt
Ryan
grep '^[^[:alpha:]0-3]*$' f.txt
6466
6585
Pipes and Regular Expressions – Example
Task: create a file containing the names of all source files in the current directory, sorted by the number of lines in each file
Step 1: count lines in each file
wc -l *
Step 2: leave only '.c' and '.h' files
wc -l * | grep '\.[ch]$'
Step 3: sort in reverse order (largest first)
wc -l * | grep '\.[ch]$' | sort -nr
Pipes and Regular Expressions – Example
Step 4: squeeze leading spaces (into one)
wc -l * | grep '\.[ch]$' | sort -nr | tr -s " "
Step 5: remove number field
wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3
Step 6: write output to file
wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3 > sorted_source_files.txt
Which grep to Use?
In addition to grep itself, there are two more variants of it: egrep and fgrep
  Use grep for most standard text finding tasks
  Use egrep for complex tasks, where basic regular expressions are just not enough, and you need to use extended regular expressions
  Use fgrep when only fixed strings are searched, and speed is of the essence
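The three behaviours can be compared on the same input. The sketch below uses grep -E and grep -F, the option spellings of the same variants (recent GNU grep releases label the egrep and fgrep command names as obsolescent). The sample lines and variable names are made up.

```shell
# Made-up sample lines.
lines='a.c
abc
Barret
Bennet
Bonnet'

# Fixed string: the dot is an ordinary character.
fixed=$(printf '%s\n' "$lines" | grep -F 'a.c')
printf '%s\n' "$fixed"

# Basic regular expression: the dot matches any single character.
basic=$(printf '%s\n' "$lines" | grep 'a.c')
printf '%s\n' "$basic"

# Extended regular expression: grouping and alternation.
extended=$(printf '%s\n' "$lines" | grep -E 'B(arr|enn)et')
printf '%s\n' "$extended"
```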
Extended Regular Expressions – egrep
Extended regular expressions support all basic regular expression syntax, plus some additional special characters:
  + – similar to '*', but at least one appearance
  ? – similar to '*', but zero or one appearances
  () – grouping
  a|b – the OR operator – matches either regular expression a or regular expression b
Extended Regular Expressions – egrep
Regular Expression   Matches          Doesn't Match
num6+                num666, num654   num566, number
num6?5               num65, num555    num6, num665
Barret|Bennet        Barret, Bennet
B(arr|enn)et         Barret, Bennet
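The '+' and '?' rows can be checked from the command line. This is a sketch using grep -E on made-up input lines; note that grep matches anywhere within a line, so a line is printed if any part of it fits the pattern.

```shell
# Made-up input lines.
nums='num666
num654
num566
num65
num555
num6'

# At least one '6' directly after "num".
plus=$(printf '%s\n' "$nums" | grep -E 'num6+')
printf '%s\n' "$plus"

# "num", an optional single '6', then a '5'.
opt=$(printf '%s\n' "$nums" | grep -E 'num6?5')
printf '%s\n' "$opt"
```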
Stream Editor – sed
sed is a script editor for text streams, which supports basic regular expressions
It performs transformations on an input stream, based on simple instructions
sed has many commands, but the most commonly used is the substitute command:
sed 's/pattern/replacement/[g]' [file]
Stream Editor – sed
pattern is any basic regular expression
replacement is a string that will replace one or more matches of pattern
The optional g flag defines whether the operation is global – without it only the first match in every line is replaced
The special character '&' can be used inside replacement to refer to the matched text
Using Regular Expressions with sed – Examples
cat bugs.txt
big boy
bad bug
bag
bigger bag
better
sed 's/b.g/XXX/' bugs.txt
XXX boy
bad XXX
XXX
XXXger bag
better
sed 's/b.g/XXX/g' bugs.txt
XXX boy
bad XXX
XXX
XXXger XXX
better
sed – Examples
head -2 my_phones.txt
ADAMS, Andrew 7583
BARRETT, Bruce 6466
head -2 my_phones.txt | sed 's/ [[:upper:]]/<&>/g'
ADAMS,< A>ndrew 7583
BARRETT,< B>ruce 6466
head -2 my_phones.txt | sed 's/[[:digit:]]*$/###/g'
ADAMS, Andrew ###
BARRETT, Bruce ###
Matching and Reusing Portions of a Pattern in sed
It is also possible to reuse portions of the matched text
Within the pattern, portions should be enclosed between '\(' and '\)'
In replacement, the special sequences '\1', '\2', etc. can be used to refer to the matched portions
Matching and Reusing Portions of a Pattern in sed – Examples
Remove the first name from each line:
head -2 my_phones.txt | sed 's/ [[:upper:]][[:lower:]]* / /'
ADAMS, 7583
BARRETT, 6466
Replace first name with initial:
head -2 my_phones.txt | sed 's/ \([[:upper:]]\)[[:lower:]]* / \1. /'
ADAMS, A. 7583
BARRETT, B. 6466
Matching and Reusing Portions of a Pattern in sed – Examples
Switch between first and last names:
head -2 my_phones.txt | sed 's/\(.*\), \(.*\) /\2 \1 /'
Andrew ADAMS 7583
Bruce BARRETT 6466
Switch names and parenthesize number:
head -2 my_phones.txt | sed 's/\(.*\), \(.*\) \(.*\)/\2 \1: (03-555\3)/'
Andrew ADAMS: (03-5557583)
Bruce BARRETT: (03-5556466)