Introduction To Regex in Lasso 8.5

DESCRIPTION

Presentation at LDC09: Introduction To Regex in Lasso 8.5

1. Beginner Track: Introduction to Regular Expressions (aka regex)
Bil Corry, lasso.pro

2. What is regex?
Regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. (Wikipedia: http://en.wikipedia.org/wiki/Regex)
In plain English: regex is a text-searching language.

3. How regex works
Three components are needed:
  • A regex engine that uses a regular expression (search string) to search against the text and return results (we're using Lasso).
4. Some text to search against
5. A regular expression that defines what to search for (e.g. \d to find a digit)

6. #1 Regex Engine

  • Lasso provides regex processing via:
  • [string_findregexp] (we'll be covering just this)
7. [string_replaceregexp]
8. [regexp]
9. [compare_regexp]
10. [compare_notregexp]
11. [match_regexp]
12. [match_notregexp]

13. #2 Some Text To Search Against

  • Text should be of type [string]; if you use type [bytes], you may get odd results.
14. There may be performance and memory challenges when using regex against a very large [string].

15. #3 Regular Expressions: The regex language

  • Literals
16. Dot
17. White Space
18. Character Classes
19. Shorthand Character Classes
  • Positional Matching
20. Alternation

21. Quantifiers
22. Grouping

23. Literals
All characters search for their literal selves except for the following: [\^$.|?*+()
These must be escaped when searched for as a literal.
Example:
[string_findregexp('LDC is fun!',-find='fun')]
LP8: array: (fun)
L9: array(fun)

24. Literals (cont)
By default, regex is case-sensitive. Use the (?i) switch to make it case-insensitive.
Examples:
[string_findregexp('ABC abc',-find='abc')]
LP8: array: (abc)
L9: array(abc)
[string_findregexp('ABC abc',-find='(?i)abc')]
LP8: array: (ABC), (abc)
L9: array(ABC, abc)

25. Escaping Characters
In regular expressions, depending on the context, various characters have special meaning. To specify the literal character, you must escape it with a backslash (\). And because the backslash also has special meaning in Lasso, you must double the backslashes in Lasso (\\).

26. Escaping Characters (cont)
Example:
[string_findregexp('[date] returns the date', -find='\\[date\\]')]
LP8: array: ([date])
L9: array([date])
[string_findregexp('[date] returns the date', -find='[date]')]
LP8: array: (d), (a), (t), (e), (e), (t), (t), (e), (d), (a), (t), (e)
L9: array(d, a, t, e, e, t, t, e, d, a, t, e)

27. Dot
A dot (aka period symbol .) will match any single character except line returns. Use the switch (?s) to make it match line returns too.
Example:
[string_findregexp('LDC is fun! Turn on a fan.', -find='f.n')]
LP8: array: (fun), (fan)
L9: array(fun, fan)

28. Dot (cont)
[string_findregexp('1\n2\n3',-find='.')]
LP8: array: (1), (2), (3)
L9: array(1, 2, 3)
[string_findregexp('1\n2\n3',-find='(?s).')]
LP8: array: (1), ( ), (2), ( ), (3)
L9: array(1, , 2, , 3)

29. White Space
To find white space, use the Lasso equivalents:
Return = \r
Newline = \n
Tab = \t
Example:
[string_findregexp('1\n2\n3',-find='\n')]
LP8: array: ( ), ( )
L9: array( , )

30. Character Classes
Used to match against a set of characters contained within square brackets [ ]. Order of characters within the class does not matter (i.e. [abc] == [cba]). Reserved characters are ^ - ] \.
Example:
[string_findregexp('New Years Eve is 2009-12-31', -find='[123ae]')]
LP8: array: (e), (e), (a), (e), (2), (1), (2), (3), (1)
L9: array(e, e, a, e, 2, 1, 2, 3, 1)

31. Character Classes (cont)
A hyphen denotes a range (e.g. [0-9] means 0,1,2,...,9 and [a-z] means a,b,c,...,z).
Example:
[string_findregexp('abcdef',-find='[b-d]')]
LP8: array: (b), (c), (d)
L9: array(b, c, d)

32. Character Classes (cont)
A caret directly after the opening square bracket negates the class: it matches any character that is not listed.
Example:
[string_findregexp('abcdef',-find='[^b-d]')]
LP8: array: (a), (e), (f)
L9: array(a, e, f)

33. Shorthand Character Classes
\\d = [0-9]
\\D = [^0-9]
\\w = [a-zA-Z0-9_]
\\W = [^a-zA-Z0-9_]
\\s = [ \\t\\r\\n]
\\S = [^ \\t\\r\\n]
Example:
[string_findregexp('1a2b3c',-find='\\d')]
LP8: array: (1), (2), (3)
L9: array(1, 2, 3)
[string_findregexp('1a2b3c',-find='\\D')]
LP8: array: (a), (b), (c)
L9: array(a, b, c)

34. Shorthand Character Classes (cont)
Example:
[string_findregexp('1a2b3c',-find='\\w')]
LP8: array: (1), (a), (2), (b), (3), (c)
L9: array(1, a, 2, b, 3, c)
[string_findregexp('1 2 3',-find='\\s')]
LP8: array: ( ), ( )
L9: array( , )

35. Positional Matching
^ matches the beginning of the text, $ matches the end of the text, and the (?m) switch makes ^ and $ match the beginning and end of each line.
Example:
[string_findregexp('1\n2\n3',-find='^\\d')]
LP8: array: (1)
L9: array(1)
[string_findregexp('1\n2\n3',-find='(?m)^\\d')]
LP8: array: (1), (2), (3)
L9: array(1, 2, 3)

36. Positional Matching (cont)
\\b matches a word boundary (the position between a word character and a non-word character or start/end of line).
Example:
[string_findregexp('cape and ape',-find='\\bape')]
LP8: array: (ape)
L9: array(ape)
[string_findregexp('cape and ape',-find='ape')]
LP8: array: (ape), (ape)
L9: array(ape, ape)

37. Alternation
The vertical bar (|) is the OR operator for regex.
Example:
[string_findregexp('cat and rat',-find='cat|rat')]
LP8: array: (cat), (rat)
L9: array(cat, rat)

38. Quantifiers
Specifies the number to find:
* = 0 or more
+ = 1 or more
? = 0 or 1
{n} = n times
{n,m} = min n, max m times
{n,} = min n, no max
Example:
[string_findregexp('123aaabbb', -find='0*1+2?3{1}a{1,2}ab{2,}')]
LP8: array: (123aaabbb)
L9: array(123aaabbb)

39. Grouping
Round brackets ( ) group part of the regex together, allowing quantifiers to be used on the group or to perform AND/OR with regex. They also create backreferences, which we won't cover in this session, but know that Lasso returns the group match in addition to the overall match.
Example:
[string_findregexp('cat and rat',-find='(c|r)at')]
LP8: array: (cat), (c), (rat), (r)
L9: array(cat, c, rat, r)

40. Grouping (cont)
There is an option for non-capturing groups: (?: regex here...)
Example:
[string_findregexp('cat and rat',-find='(?:c|r)at')]
LP8: array: (cat), (rat)
L9: array(cat, rat)

41. Tips for Regular Expressions

  • Be sure the text is of type [string]; type [bytes] may give odd results.
42. When using regular expressions obtained from outside sources, you'll need to double up the backslashes (\) for Lasso (e.g. \d+ becomes \\d+).
43. User input used as part of a regular expression must be encoded (http://tagswap.net/lp_regexp_encode).

44. Putting it all together

  • When building a complex regex, try breaking the regex into smaller pieces and confirm each piece matches correctly

45. Often, there are several ways to match. If one approach doesn't work, try another.
46. Great reference and tutorial site: www.regular-expressions.info

47. Examples
Extract names from a comma-delimited list:
[string_findregexp('Abe Smith, Bob Jones, Cindy Hart, Darla King',-find='\\w+\\s+\\w+')]
LP8: array: (Abe Smith), (Bob Jones), (Cindy Hart), (Darla King)
L9: array(Abe Smith, Bob Jones, Cindy Hart, Darla King)

48. Examples (cont)
Extract phone numbers into a packed format:
[string_findregexp('(213) 555-1212',-find='\\d')->join('')]
[string_findregexp('213-555-1212',-find='\\d')->join('')]
[string_findregexp('213 555 1212',-find='\\d')->join('')]
LP8: 2135551212 2135551212 2135551212
L9: 2135551212 2135551212 2135551212

49. Examples (cont)
Extract data from HTML:
[string_findregexp('<input name="secret" value="123">',-find='name="secret" value="[^"]+')]
LP8: array: (name="secret" value="123)
L9: array(name="secret" value="123)
[string_findregexp('<input name="secret" value="123">',-find='name="secret" value="([^"]+)')]
LP8: array: (name="secret" value="123), (123)
L9: array(name="secret" value="123, 123)

50. Examples (cont)
Extract data from HTML:
[string_findregexp('<input name="secret" value="123">',-find='(?:name="secret" value=")[^"]+')]
LP8: array: (name="secret" value="123)
L9: array(name="secret" value="123)
[string_findregexp('',-find='(?
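
Cross-check outside Lasso (not part of the original slides): the tips above about backslashes and user input can be tried in any regex engine. The sketch below uses Python's standard `re` module as an assumed stand-in. In a Python raw string the backslashes stay single, where Lasso needs them doubled, and `re.escape` does roughly the job the slides assign to lp_regexp_encode. The helper names `pack_phone` and `find_literal` are made up for illustration.

```python
import re

# Names from a comma-delimited list.
# In Lasso the same pattern would be written with doubled backslashes.
names = re.findall(r"\w+\s+\w+", "Abe Smith, Bob Jones, Cindy Hart, Darla King")

def pack_phone(text):
    """Keep only the digits of a phone number (hypothetical helper)."""
    return "".join(re.findall(r"\d", text))

def find_literal(user_input, text):
    """Search for user input as literal text by escaping it first (hypothetical helper)."""
    return re.findall(re.escape(user_input), text)

print(names)                         # ['Abe Smith', 'Bob Jones', 'Cindy Hart', 'Darla King']
print(pack_phone("(213) 555-1212"))  # 2135551212
print(find_literal("[date]", "[date] returns the date"))  # ['[date]']
```

Without the escaping step, the input [date] would be read as a character class and match the single letters d, a, t and e instead of the bracketed word.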