CIS 191: Linux and Unix Class 4 October 7th, 2015


Next week

• Lecture on Makefiles
• Xiuruo OOO

Running at

• In Ubuntu, you’ll probably need to install at
  – sudo apt-get install at
  – It should just work after this…
• In OS X, at relies on the atrun daemon to manage its jobs
  – See man atrun

“The atrun utility runs commands queued by at(1). It is invoked periodically by launchd(8) as specified in the com.apple.atrun.plist property list. By default the property list contains the Disabled key set to true, so atrun is never invoked. Execute the following command as root to enable atrun: launchctl load -w /System/Library/LaunchDaemons/com.apple.atrun.plist”

Outline

Language Theory Overview

Grep Regular Expressions

Examples of Grep Regular Expressions

Sed

Languages

• A set of strings of symbols
• These symbols form an “alphabet”
• The language is “decided” by some process which decides if a string is in the language or not

Regular Languages

• A regular language is a set of strings that can be decided by viewing a single character at a time, using a fixed amount of memory!
  – Specifically, regular languages are languages that can be decided by a DFA (deterministic finite automaton); you’ll learn more about this in CIS 262 if you haven’t taken it already.
• It doesn’t matter how long the string is!

Regular Expressions

• A regular expression exactly describes a regular language
  – That is, every regular language can be described by some regular expression
  – And every regular expression describes a regular language

Regular Expressions Illustrated

• Suppose A and B are regular languages.

Regular Extensions

• A few extensions to classical regular expressions that stay within regular languages
  – If A is an RE, then A+ matches one or more copies of A
  – If A is an RE, then A? matches zero or one copy of A

Core regex in one page

• ABC
  – Sequence of A, B, and C, exactly one copy of each
• A | B
  – A or B
• *
  – >= 0 copies
• +
  – >= 1 copies
• ?
  – 0 or 1 copies
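The operators on this page can be tried directly with grep. The following is a minimal sketch, assuming a grep that supports -E and -x (GNU grep does); the sample strings are made up for illustration.

```shell
# Sample lines, one candidate string per line (made up for illustration).
words='ac
abc
abbc
adc'

# -E selects extended syntax; -x requires the whole line to match.
star=$(printf '%s\n' "$words" | grep -xE 'ab*c')    # b*  : zero or more b
plus=$(printf '%s\n' "$words" | grep -xE 'ab+c')    # b+  : one or more b
opt=$(printf '%s\n' "$words" | grep -xE 'ab?c')     # b?  : zero or one b
alt=$(printf '%s\n' "$words" | grep -xE 'a(b|d)c')  # b|d : either b or d

# The variables are left unquoted on purpose so the lines join with spaces.
printf 'star: %s\n' "$(echo $star)"
printf 'plus: %s\n' "$(echo $plus)"
printf 'opt : %s\n' "$(echo $opt)"
printf 'alt : %s\n' "$(echo $alt)"
```

Only the repetition operator changes between the first three patterns, so comparing the printed lines shows what each operator admits.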

Truly Regular Expressions

• abc matches only the string “abc”
• (ab)* matches the empty string “”, “ab”, “abab”, …
• (a|b)+ matches any non-empty string made up only of ‘a’s and ‘b’s
• (a*b)+ matches any number of ‘a’s followed by a single ‘b’, repeated at least once
  – In other words, any string of ‘a’s and ‘b’s which ends in a ‘b’.
• a(b|c)*a matches any string which starts and ends with an ‘a’ and has only ‘b’s and ‘c’s in between.
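These claims can be checked mechanically. The sketch below assumes GNU grep; matches is a hypothetical helper introduced here, not something from the slides, and it succeeds only when the whole string matches the pattern.

```shell
# matches PATTERN STRING: succeed only if the entire string matches.
# (Hypothetical helper: -q suppresses output, -x anchors to the whole line.)
matches() { printf '%s\n' "$2" | grep -qxE "$1"; }

matches '(ab)*' ''         && echo '(ab)* accepts the empty string'
matches '(ab)*' 'abab'     && echo '(ab)* accepts abab'
matches '(a|b)+' 'abba'    && echo '(a|b)+ accepts abba'
matches '(a|b)+' 'abc'     || echo '(a|b)+ rejects abc (c is not allowed)'
matches '(a*b)+' 'aabab'   && echo '(a*b)+ accepts aabab'
matches '(a*b)+' 'aba'     || echo '(a*b)+ rejects aba (does not end in b)'
matches 'a(b|c)*a' 'abcba' && echo 'a(b|c)*a accepts abcba'
matches 'a(b|c)*a' 'aa'    && echo 'a(b|c)*a accepts aa'
```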

More Regular Expression Extensions

• There are a number of extensions that allow for more concise representation
  – . (dot) matches any single character (any character at all)
  – [cde] matches any single character listed between the square brackets (here: c, d, or e)
  – [h-l] matches any single character in the range from h to l
    • To match any character not in the list, place a caret (^) first inside the brackets.
    • [^0-9] matches anything that is not a digit.
  – If A is an RE, then A{n,m} matches anywhere between n and m copies of A, inclusive.
  – A{n} matches exactly n copies of A.
• On this slide, ., [, ], {, and } are metacharacters.
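A short sketch of the extensions on this slide, assuming GNU grep; the sample lines are invented for the example.

```shell
# Sample lines (made up for illustration).
lines='cat
cut
c9t
ct
caat'

dot=$(printf '%s\n' "$lines" | grep -xE 'c.t')         # . is any one character
set=$(printf '%s\n' "$lines" | grep -xE 'c[au]t')      # [au] is a or u
neg=$(printf '%s\n' "$lines" | grep -xE 'c[^0-9]t')    # [^0-9] is any non-digit
rep=$(printf '%s\n' "$lines" | grep -xE 'ca{1,2}t')    # between 1 and 2 copies of a
exact=$(printf '%s\n' "$lines" | grep -xE 'ca{2}t')    # exactly 2 copies of a

# Unquoted on purpose so the matching lines join with spaces.
printf 'dot  : %s\n' "$(echo $dot)"
printf 'set  : %s\n' "$(echo $set)"
printf 'neg  : %s\n' "$(echo $neg)"
printf 'rep  : %s\n' "$(echo $rep)"
printf 'exact: %s\n' "$(echo $exact)"
```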

Metacharacters

• A certain number of predefined shortcuts (character classes) are provided.
  – [[:space:]], or ‘\s’, matches any whitespace character.
  – [[:alnum:]], or ‘\w’, matches any “word” character
    • By which we mean letters and numbers, though some implementations (GNU grep among them) also count underscores (_) for ‘\w’
  – [[:digit:]], or ‘\d’, matches any digit (0-9)
    • ‘\d’ is Perl-style shorthand; GNU grep -E does not understand it, so use [[:digit:]] or [0-9] there
  – ^ matches “beginning-of-line”
  – $ matches “end-of-line”
  – \< and \> match word boundaries

Metacharacters

• \\ matches a backslash (\)
  – Since \ is normally used to specify other metacharacters
• \* matches an asterisk
  – Since * normally means “zero or more copies of the preceding item”
• \. matches a dot
• Metacharacters need to be preceded by a backslash in order to match the literal character
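To see the difference escaping makes, here is a small sketch assuming GNU grep; the sample lines are made up. The patterns are single-quoted so that the shell passes the backslashes through to grep untouched.

```shell
# Sample lines (made up for illustration).
text='3.14
3514
a*b
aab
C:\temp'

# Unescaped dot matches any character, so both 3.14 and 3514 are counted.
dot_any=$(printf '%s\n' "$text" | grep -cE '3.14')
# Escaped dot matches only a literal dot.
dot_lit=$(printf '%s\n' "$text" | grep -cE '3\.14')
# Escaped asterisk matches a literal *, so aab is no longer accepted.
star_lit=$(printf '%s\n' "$text" | grep -xE 'a\*b')
# Two backslashes in the pattern match one literal backslash in the text.
backslash=$(printf '%s\n' "$text" | grep -E '\\')

printf 'lines matching 3.14  : %s\n' "$dot_any"
printf 'lines matching 3\\.14 : %s\n' "$dot_lit"
printf 'literal asterisk     : %s\n' "$star_lit"
printf 'literal backslash    : %s\n' "$backslash"
```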

“Regular” Expressions: a Misnomer

• Just about any name but “regular” would have been better!
  – Many extensions describe non-regular languages
  – The syntax and behavior are different for just about every system involving regular expressions!
  – What needs escaping changes based on implementation
    • In fact, Vim has four different settings for this.
    • See “:help magic”
  – The way we describe or apply regular expressions and gather the matches differs across settings.

New Skill

xkcd.com/208

Our focus: grep and sed

• As we’ve discussed, grep applies a regular expression to each line of its input file or files
• sed is a stream editor
  – More on this soon…


Motivating Examples

• We’re usually searching for a particular kind of text
  – An integer, maybe with a minus sign in front
  – A decimal number (for example 2.718)
  – A first name followed by a last name
    • Or maybe a last, first
  – An email address
  – Sentences beginning with the word “The”, ending with punctuation.
  – A phone number
  – Prime numbers
    • This really does exist, but it relies on backreferences and is rather inefficient…

Integers and Decimals

• Integers start with an optional -, followed by one or more digits. The perfect regular expression is therefore…
  – -?[[:digit:]]+
  – -?\d+ (Perl-style shorthand; GNU grep -E does not accept \d)
• How about decimals? First, we need a characterization.
  – There is an optional minus sign, then an optional string of digits, followed by a ., then a string of digits.
  – -?[[:digit:]]*\.[[:digit:]]+
  – -?\d*\.\d+ (Perl-style shorthand again)
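A sketch of both expressions in action, assuming GNU grep and invented sample input. One practical wrinkle: because the pattern itself begins with a -, it is passed with -e so that grep does not mistake it for an option.

```shell
# Sample lines (made up for illustration).
numbers='42
-7
2.718
-.5
abc
3.'

# -e marks the next argument as the pattern, even though it starts with "-".
ints=$(printf '%s\n' "$numbers" | grep -xE -e '-?[[:digit:]]+')
decimals=$(printf '%s\n' "$numbers" | grep -xE -e '-?[[:digit:]]*\.[[:digit:]]+')

# Unquoted on purpose so the matching lines join with spaces.
printf 'integers: %s\n' "$(echo $ints)"
printf 'decimals: %s\n' "$(echo $decimals)"
```

Note that “3.” is rejected by the decimal pattern, since the characterization requires at least one digit after the dot.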

Names

• Let’s begin with a characterization of First Name Last Name format.
  – A capital letter, followed by any number of letters, then a space, then another capital followed by any number of letters
• Now, let’s come up with the regular expression
  – [A-Z]\w*\s[A-Z]\w*
• Do you see any potential issues with this approach?
  – What about hyphenated names? Multiple names? Middle initials? Middle names written out?
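The issues above show up quickly on sample input. This is a sketch assuming GNU grep (for \w and \s); the names are placeholder data, and LC_ALL=C is set so that [A-Z] means exactly the 26 ASCII capitals.

```shell
# Use the C locale so [A-Z] covers only ASCII capital letters.
export LC_ALL=C

# Placeholder names chosen to exercise the problem cases.
names='Ada Lovelace
Mary-Jane Watson
Grace B. Hopper
alan turing'

# Whole-line match: only the plain "First Last" line survives.
full=$(printf '%s\n' "$names" | grep -xE '[A-Z]\w*\s[A-Z]\w*')

# -o prints just the matched text, which exposes the partial matches
# produced by the hyphenated name and the middle initial.
partial=$(printf '%s\n' "$names" | grep -oE '[A-Z]\w*\s[A-Z]\w*')

printf 'whole-line matches:\n%s\n' "$full"
printf 'matched fragments:\n%s\n' "$partial"
```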

Aside: Solve the Problem You Want to

• Many regular expressions will match the target
  – But some are easier to construct (and to understand) than others.
• If you know a little more about the text you will be handling, you can sometimes take shortcuts
  – This will become more apparent when we get to replacing (rather than just matching) text.
• Modifying the problem is a major theme throughout computer science, and in this course as well!

Aside #2: Evil Regular Expressions!!!

• There are two main kinds of RE engines.
  – NFA (Nondeterministic Finite Automaton) engines step through the regex and may backtrack on the input text
  – DFA (Deterministic Finite Automaton) engines always move forward in the string character by character
  – Nonbacktracking NFA engines do exist…
  – See http://swtch.com/~rsc/regexp/regexp1.html for more details on the differences.
• The runtime can increase drastically for the following
  – Repetitions of overlapping alternations
  – Repetitions within repetitions
  – Repetitions containing both wildcards and normal characters

Aside #2: Some evil examples

• Can you figure out why these might be “evil”?
  – (x*)*
  – (x.)*
  – (x|xx)*
  – (x|x?)*
  – The prime number checker we mentioned earlier
• Think about how they behave on the string
  – xxxxxxxxxxxxxxxxy
• On a backtracking engine, matching (x*)* can take exponential time because each ‘x’ can be consumed either by the inner x* or by a fresh pass through the outer (…)*; every additional ‘x’ in the input doubles the number of potential matching paths!
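For comparison, the same pattern can be handed to grep. This is only a sketch: it assumes GNU grep, which as far as I know uses automaton-based matching for patterns without backreferences and so should not exhibit the blowup. A backtracking engine (Perl-style libraries, for instance) may behave very differently on the same input.

```shell
# The troublesome input from the slide: sixteen x's followed by a y.
input='xxxxxxxxxxxxxxxxy'

# Anchor the pattern so the trailing y forces the match to fail.
# grep -c prints the number of matching lines; "|| true" keeps the
# non-zero exit status of "no match" from aborting a script run with set -e.
count=$(printf '%s\n' "$input" | grep -cE '^(x*)*$' || true)

printf 'matching lines: %s\n' "$count"
```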

ReDoS

• Regular expression denial of service
• Use an evil regex to attack a service that accepts arbitrary regexes
• https://en.wikipedia.org/wiki/ReDoS





grep with extended regex

• Generally, we want to use extended regular expressions (as we discussed earlier)
  – So when you call grep, call it with the -E flag

ps -aux

• All processes
• You can look up a particular process using grep…

ps aux

$ ps -aux | grep yes | less

ps aux with word boundary

$ ps -aux | grep -w yes | less

C identifiers

• Suppose we want to find all uses of the function strfry in the directory chef
• We can use Bash expansions and grep together!

$ grep -E strfry *.c
chef.c: strfry(p_str);
chef.c: cond ? strfry(uuname) : uuname
recipes.c: is_strfry_ingredient(p_src)

• But grep included results that we didn’t want, such as is_strfry_ingredient
• What can we do?
  – Include word boundaries! Quote the pattern so that Bash does not strip the backslashes before grep sees them.

$ grep -E '\<strfry\>' *.c
chef.c: strfry(p_str);
chef.c: cond ? strfry(uuname) : uuname
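The effect of word boundaries, and of quoting them, can be reproduced in a scratch directory. This is a sketch assuming GNU grep; the two .c files are stand-ins created by the script, not the real chef sources.

```shell
# Build a throwaway directory with two stand-in source files.
dir=$(mktemp -d)
printf '%s\n' 'strfry(p_str);' > "$dir/chef.c"
printf '%s\n' 'is_strfry_ingredient(p_src);' > "$dir/recipes.c"

# Plain search: also hits the longer identifier.
loose=$(cd "$dir" && grep -E 'strfry' *.c)

# Quoted word boundaries: only the standalone identifier matches,
# because _ counts as a word character.
strict=$(cd "$dir" && grep -E '\<strfry\>' *.c)

# Unquoted, the shell removes the backslashes, so grep searches for
# the literal text <strfry> and finds nothing.
unquoted=$(cd "$dir" && grep -E \<strfry\> *.c || true)

rm -rf "$dir"

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
printf 'unquoted: [%s]\n' "$unquoted"
```

grep -w strfry is a shorter way to get the same result as the quoted pattern when the whole pattern should be a single word.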

Grepping for Hardware…

• Another common scenario: attempting to find a particular piece of hardware

• The lspci command will spit out a list of available PCI (Peripheral Component Interconnect) devices

$ lspci | grep -i Network
Ethernet controller: Intel 82566MM Gigabit
Network controller: Intel PRO/Wireless

Grepping for Hardware

• Which kernel modules are related?

$ lsmod | grep -i iwl
iwl4965 202721 0
iwl_legacy 146875 1 iwl4965
mac80211 267163 2 iwl4965,iwl_legacy
cfg80211 170485 3 iwl4965,iwl_legacy,mac80211

Display only the matching text

• Generally, when grep finds a match, it will display the entire line
• Most of the time this is what you want!
• But when you are trying to extract a match from the text
  – Like when you are looking for an address or a phone number…
• You may want to only display the match.
• You can do this with the -o option
  – grep -oE 'regular expression' file_list
  – displays just the matches on separate lines
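A minimal sketch of -o, assuming GNU grep; the sentence and the phone numbers in it are invented for the example.

```shell
# One line of text containing two (made-up) phone numbers.
text='Call 215-555-0100 or 215-555-0199 before noon.'

# Without -o, grep prints the whole line once.
whole=$(printf '%s\n' "$text" | grep -E '[0-9]{3}-[0-9]{3}-[0-9]{4}')

# With -o, each match is printed on its own line.
numbers=$(printf '%s\n' "$text" | grep -oE '[0-9]{3}-[0-9]{3}-[0-9]{4}')

printf 'whole line:\n%s\n' "$whole"
printf 'matches only:\n%s\n' "$numbers"
```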

Greedy Matching

• Let’s write a regular expression to match all instances of HTML tags of the form <title>…
  – <.*>
• What if we run this on a line that contains several tags, such as the text “Hi! I’m an example!” with markup around parts of it?
• We’ll get a single match that runs from the first < on the line to the last >, covering the tags and all the text between them.

What went wrong?

• Grep matches expressions greedily.
• This means that it will try to match as much as it can (if there is more to match in a line, it will do so, even if it has already found a match!)
• While there are some syntaxes (such as Perl) which allow for lazy matching, Grep’s extended regex syntax does not allow this!
• You can use Perl syntax with grep -P, but we are not allowing that for assignments in this class.

A right answer (without greed)

• Hi! I’m an example! (the same tagged line as before)
• What if we try the following expression:
  – <[^>]*>
• We’ll match an opening angle bracket, then every character that is not a closing angle bracket, followed by a closing angle bracket.
• Hallelujah! Success! We get each tag as its own match
• Just as we expected.
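The two expressions can be compared side by side. This is a sketch assuming GNU grep; the tagged line is a hypothetical stand-in, since the exact markup used on the slide is not shown here.

```shell
# Hypothetical tagged line standing in for the slide's example.
line='<b>Hi!</b> I am <i>an example!</i>'

# Greedy: .* runs to the last > on the line, so there is one long match.
greedy=$(printf '%s\n' "$line" | grep -oE '<.*>')

# Negated class: [^>]* cannot cross a >, so each tag matches separately.
tags=$(printf '%s\n' "$line" | grep -oE '<[^>]*>')

printf 'greedy match:\n%s\n' "$greedy"
printf 'individual tags:\n%s\n' "$tags"
```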


Sed Introduction

• The man page for sed describes it as “a stream editor for filtering and transforming text.”
• You should always run sed with the -r option, which allows for extended regular expressions
  – Noticing a pattern here?
• You also always want to give sed its regular expressions in single quotes, which tells Bash not to expand dollar signs, asterisks, question marks, and so on

Sed Syntax

• sed substitution commands take the syntax
  – s/regex/replacement/flags
• The g flag tells sed not to stop after the first replacement on a line
  – Think “globally”
• Patterns can be captured in parentheses, and used in the replacement with backreferences
  – Sort of like storing matched information in variables…
  – Tell sed to store this information using extra parentheses in your expression. Refer to them later with \1 for the first group, \2 for the second group…

Regular Expression Parenthesis Groups

• Groups are numbered from the outside in first, then from left to right (that is, in the order of their opening parentheses).
• Recall the Name example from earlier
  – [A-Z]\w*\s[A-Z]\w*
• If we rewrite the expression as
  – (([A-Z]\w*)\s([A-Z]\w*))
• Group “1” matches the full name
• Group “2” matches the first name
• Group “3” matches the last name
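One way to confirm the numbering is to replace the whole match with a single group and see what is left. A sketch assuming GNU sed (for -r, \w and \s); the name is placeholder data.

```shell
# Use the C locale so [A-Z] covers only ASCII capital letters.
export LC_ALL=C

name='Ada Lovelace'                      # placeholder input
re='(([A-Z]\w*)\s([A-Z]\w*))'            # the grouped expression from the slide

# Replace the match with one group at a time, wrapped in brackets.
g1=$(printf '%s\n' "$name" | sed -r "s/$re/[\\1]/")
g2=$(printf '%s\n' "$name" | sed -r "s/$re/[\\2]/")
g3=$(printf '%s\n' "$name" | sed -r "s/$re/[\\3]/")

printf 'group 1: %s\n' "$g1"
printf 'group 2: %s\n' "$g2"
printf 'group 3: %s\n' "$g3"
```

The sed scripts are double-quoted here only so that $re can be expanded; the backslash before each group number is doubled to survive that quoting.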

Sed Examples

$ echo "hello" | sed -r 's/lo/p/'
help
$ echo "Here is a sentence" | sed -r 's/is/was/'
Here was a sentence
$ echo "This is a sentence" | sed -r 's/is/is not/'
This not is a sentence
$ echo "This is a sentence" | sed -r 's/is/XXX/'
ThXXX is a sentence
$ echo "This is a sentence" | sed -r 's/is/is not/g'
This not is not a sentence
$ echo "This is a sentence" | sed -r 's/\<is\>/is not/g'
This is not a sentence

Another Sed example

• Consider translating a list of phone numbers from (xxx)-xxx-xxxx to xxx-xxx-xxxx
• We need to replace the parenthesized part of the numbers with its contents…
• sed -r 's/\(([0-9]{3})\)/\1/' numbers
  – The escaped parentheses \( and \) match the literal parentheses in the input
  – Extra parentheses tell sed to store the matched number
  – \1 grabs the matched text as a backreference
• But there’s a simpler solution… Remove the parentheses!
  – sed -r 's/[()]//g' numbers
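Both approaches can be run on the same input to confirm they agree. A sketch assuming GNU sed; the phone numbers are made up, and the input is piped in rather than read from a numbers file.

```shell
# Made-up phone numbers in (xxx)-xxx-xxxx form.
numbers='(215)-555-0100
(267)-555-0199'

# Capture the three digits between literal parentheses and keep only them.
with_group=$(printf '%s\n' "$numbers" | sed -r 's/\(([0-9]{3})\)/\1/')

# Simpler: delete every parenthesis. The g flag is needed so that both
# the opening and the closing parenthesis on each line are removed.
stripped=$(printf '%s\n' "$numbers" | sed -r 's/[()]//g')

printf 'backreference version:\n%s\n' "$with_group"
printf 'delete-parentheses version:\n%s\n' "$stripped"
```

The second version only works because we know parentheses appear nowhere else in the input, which is the kind of shortcut the earlier aside about solving the problem you want describes.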

Another Example

• Consider changing a list of names from (Last, First) to (First, Last)
• As usual, we need to characterize the input first
  – A capital letter, followed by any number of letters, then a comma and a space; finally, one more capital letter and any number of other letters.
• And the sed expression?
  – sed -r 's/([A-Z]\w*),\s([A-Z]\w*)/\2, \1/g'
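A quick run of this expression, as a sketch assuming GNU sed; the names are placeholder data.

```shell
# Use the C locale so [A-Z] covers only ASCII capital letters.
export LC_ALL=C

# Placeholder names in "Last, First" form.
names='Lovelace, Ada
Hopper, Grace'

# Group 1 captures the last name and group 2 the first name;
# the replacement writes them back in the opposite order.
swapped=$(printf '%s\n' "$names" | sed -r 's/([A-Z]\w*),\s([A-Z]\w*)/\2, \1/g')

printf '%s\n' "$swapped"
```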
