Searching and Regular Expressions in ELAN
Johanna Lorenz, Bielefeld University, 13.11.2015
Overview
Searching options in ELAN
(multiple) files
(multiple) layers
query languages
display search
save/load queries and export results
Regular Expressions/RegEx
introduction
types of characters
character classes
special characters
13/11/2015 2 Searching and Regular Expressions in ELAN
Searching options in ELAN:
Overview
ELAN provides the possibility to search
in one file or multiple files
in tiers or types (or speakers)
in one tier/type or multiple tiers/types
with literal strings, regular expressions or variables
Searching options in ELAN:
(multiple) files
ELAN provides the possibility to search
in one file: Find (and Replace)
in multiple files: Find and Replace in Multiple Files, Search Multiple eaf, Structured Search Multiple eaf
Find (and replace) – one file
Elements of the dialog: main menu, tier selection, query language, search string, Replace function, number of results, single results
Double-click on a result to jump to the ELAN file.
Find (and replace) – multiple files
Elements of the dialog: search domain, domain creation, query language, search string, replace string, tiers to be searched
domain selection: single files or folders; you can store defined search domains and name them
Search multiple eaf
Elements of the dialog: search domain, query language, search string
no selection of tiers
Searching options in ELAN:
(multiple) layers
Structured search multiple eaf
1. Substring search
• no definition of layer
• no regular expressions
Elements of the dialog: search domain, search string
2. Single layer search
• selection of 1 layer (tier, type, speaker)
• regular expressions possible
Elements of the dialog: search domain, search string, query language, layer to be searched
3. Multiple layer search
• multiple layers and columns
• regular expressions possible
Elements of the dialog: search domain, searching modes (case sensitivity and query language), search strings (white cells), search constraints (green cells), layers to be searched, add/remove columns and/or layers
Search constraints
in a row: set the number of annotations or milliseconds between the annotations containing the results
in a column: set constraints regarding the time interval between annotations on different tiers
Searching options in ELAN:
query languages
Search modes
case sensitivity
case insensitive: no difference between upper and lower case letters, e.g. ‘hello’ matches ‘HellO’
case sensitive: difference between upper and lower case letters, e.g. ‘hello’ doesn't match ‘HellO’
query language
substring match: results contain the search string, e.g. ‘road’ matches ‘road’, ‘abroad’, ‘roads’ … in glosses, sentences …
exact match: the result exactly matches the search string, e.g. ‘road’ only matches an annotation that consists of ‘road’ alone (such as a gloss), not an annotation that contains the word ‘road’ in a sentence
regular expression: (see below)
variable match: variables search for annotations with the same strings
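The search modes above can be pictured with a small Python sketch. This is not how ELAN is implemented; the function `matches` and its parameters are made up for illustration, each annotation is assumed to be a plain string, and the regular expression mode is assumed to accept a match anywhere in the annotation.

```python
import re

def matches(query, annotation, mode="substring", case_sensitive=False):
    """Return True if the annotation counts as a hit for the query (illustrative only)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    if mode == "substring":
        # the query may occur anywhere inside the annotation
        return re.search(re.escape(query), annotation, flags) is not None
    if mode == "exact":
        # the whole annotation must be equal to the query
        return re.fullmatch(re.escape(query), annotation, flags) is not None
    if mode == "regex":
        # assumption: the pattern may match any part of the annotation
        return re.search(query, annotation, flags) is not None
    raise ValueError("unknown mode: " + mode)

print(matches("hello", "HellO"))                          # case insensitive
print(matches("hello", "HellO", case_sensitive=True))     # case sensitive
print(matches("road", "abroad"))                          # substring match
print(matches("road", "the road is long", mode="exact"))  # exact match
```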
Searching options in ELAN:
display search
By right-clicking on the search hits, you can choose between different visualizations:
alignment view: hits in an aligned time-based view
concordance view: list of all hits
frequency view (by frequency): count/percentage of hits, in numerical order
frequency view (by annotation): count/percentage of hits, in alphabetical order
If you want to view a hit in the timeline viewer of the corresponding file, just double-click on the hit. ELAN will open the file and highlight the search result.
Searching options in ELAN:
save/load queries and export results
Queries (not hits) can be saved (.xml) and loaded in ELAN.
By right-clicking in the concordance view or alignment view, you can export hits and hit statistics in a .csv-format.
By right-clicking in the frequency view (by annotation or frequency), you can export frequency info in a .csv-format.
To open the exported search results in a spreadsheet program, import them as data from a text file, state that the data are delimited, and select tab stops as the delimiter/separator.
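The same tab-delimited import can be done in a script. The following is a minimal sketch using Python's standard `csv` module; the column names and values in `exported` are invented stand-ins for an export, and a real file would have its own layout.

```python
import csv
import io

# Hypothetical stand-in for the content of an exported file (tab-separated).
exported = "annotation\tcount\nroad\t3\nabroad\t1\n"

# For a real export, open the file instead of using io.StringIO,
# e.g. open(path_to_export, newline="") with a path of your own.
reader = csv.reader(io.StringIO(exported), delimiter="\t")
rows = list(reader)
header, data = rows[0], rows[1:]

print(header)
print(data)
```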
RegEx:
introduction
Regular Expressions (also called RegEx or RegExp) …
What to use them for? … can be used to search for/replace all occurrences of substrings (e.g. a word) in strings (e.g. a sentence)
What are they? … are special strings of characters describing search patterns
What do they do? … match pieces/sequences of text (strings) whose format corresponds to the search pattern described by the regular expression
to sum up, RegEx are a way of describing patterns in texts
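As an illustration of searching for and replacing all occurrences of a substring, here is a short sketch with Python's `re` module. Python is used only for demonstration; the sentence and the replacement string are invented.

```python
import re

sentence = "the road goes abroad and the roads meet"
pattern = r"road"

# every occurrence of the pattern, also inside longer words
hits = re.findall(pattern, sentence)

# replace every occurrence with an upper-case version
replaced = re.sub(pattern, "ROAD", sentence)

print(hits)
print(replaced)
```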
RegEx:
types of characters
Regular Expressions can be formulated with different types of characters:
literal characters: normal text characters, e.g. letters and digits
metacharacters/special characters: have a special meaning with regard to string matching, e.g. ‘.’ matches any character
some literal characters have a special meaning when they are marked by a preceding backslash
e.g. \b marks a word boundary (the beginning or end of a word)
the literal value of a metacharacter can be obtained by escaping it with a preceding backslash
e.g. ‘\.’ matches a dot
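The difference between a metacharacter and its escaped, literal form can be tried out with a few lines of Python. This is only a sketch with invented test strings; `re.escape` is Python's helper for escaping a whole string and is not part of ELAN.

```python
import re

text = "a.b axb"

# '.' as a metacharacter: any character between a and b
any_char = re.findall(r"a.b", text)

# '\.' escaped: only a literal dot between a and b
literal_dot = re.findall(r"a\.b", text)

# re.escape adds the backslashes for you
escaped = re.escape("a.b")

# '\b' is a literal 'b' that becomes special through the backslash
word_start = re.findall(r"\bson\w*", "son lesson song")

print(any_char, literal_dot, escaped, word_start)
```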
RegEx:
character classes
characters can be grouped by putting them between square brackets
[…] > square brackets define a set of characters, e.g. [aeiou] matches any vowel
most metacharacters lose their special meaning inside the brackets, e.g. [?.] matches . or ?
[^…] > the caret defines a set of negated characters, e.g. [^aeiou] matches any character that is not one of these vowels
there are predefined range sets of characters: a-z, A-Z, 0-9
[…-…] > a hyphen defines a range of characters, e.g. [a-e] matches a, b, c, d, e
the hyphen gets a special meaning within square brackets
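The character classes above behave the same way in Python's `re` module, which is used here only as a convenient way to try them out. The test strings are invented.

```python
import re

text = "Is it 42? No."

# [aeiou] is case sensitive here: the capital 'I' is not in the set
vowels = re.findall(r"[aeiou]", text)

# inside brackets '?' and '.' are ordinary characters
punctuation = re.findall(r"[?.]", text)

# a negated set also matches digits, spaces and punctuation
not_vowels = re.findall(r"[^aeiou]", "tea 2")

# a range given with a hyphen
a_to_e = re.findall(r"[a-e]", "abcdefg")

print(vowels, punctuation, not_vowels, a_to_e)
```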
character sets can be joined
set union > the operand classes are written one after another; matches every character that is a member of any of them
e.g. [a-z0-9] matches any lowercase letter or digit
set intersection > matches every character that is in both of its operand classes
e.g. [0-7&&[5-9]] matches 5, 6, 7
set subtraction > matches every character that is in one operand class, but not in the other
e.g. [a-z&&[^aeiou]] matches all lowercase consonants
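The `&&` notation shown above is not available in Python's `re` module, so the sketch below reproduces the same three results with a different technique: a lookahead placed in front of a character class. It is meant only to make the set logic tangible; in ELAN the patterns are written as on the slide.

```python
import re

# union: simply list both ranges in one class
union = re.findall(r"[a-z0-9]", "aZ9_")

# intersection of [0-7] and [5-9], emulated with a positive lookahead
intersection = re.findall(r"(?=[5-9])[0-7]", "0123456789")

# subtraction: [a-z] without the vowels, emulated with a negative lookahead
subtraction = re.findall(r"(?![aeiou])[a-z]", "banana")

print(union, intersection, subtraction)
```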
short-hand classes
RegEx   class
.       any character (including white space)
matches one of several characters:
\d      digit character [0-9]
\w      word character [a-zA-Z0-9_]
\s      whitespace character
does not match one of several characters:
\D      anything but a digit [^0-9]
\W      anything but a word character [^a-zA-Z0-9_]
\S      anything but a whitespace character
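A quick way to see what the short-hand classes pick out is the following Python sketch. The `re.ASCII` flag is passed so that `\d` and `\w` are limited to the ranges given in the table; without it Python also accepts non-ASCII digits and letters. The sample string is invented.

```python
import re

text = "Tier_1: 42 ms"

digits = re.findall(r"\d", text, re.ASCII)       # single digit characters
words = re.findall(r"\w+", text, re.ASCII)       # runs of word characters
spaces = re.findall(r"\s", text)                 # whitespace characters
non_words = re.findall(r"\W", text, re.ASCII)    # everything that is not a word character
non_spaces = re.findall(r"\S+", text)            # runs of non-whitespace characters

print(digits, words, spaces, non_words, non_spaces)
```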
RegEx:
special characters
logical operators
RegEx   Operator              Meaning
ab      sequence (‘and’)      a followed by b
a|b     alternatives (‘or’)   either a or b
(ab)    grouping              a group with a followed by b
examples (possible matches within the word ‘bananas’)
ban matches ‘ban’
(b|n)an matches ‘ban’, ‘nan’
(an|b)an matches ‘ban’, ‘anan’
(b|na)(s|a|n) matches ‘ba’, ‘nan’, ‘nas’
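The operators can be tried out on the word ‘bananas’ with a short Python sketch. The helper `all_matches` is a made-up name; it tries the pattern at every start position so that overlapping matches are listed too, which an ordinary left-to-right search would skip.

```python
import re

def all_matches(pattern, text):
    """List the match found at each start position, including overlapping ones."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        m = compiled.match(text, start)  # match must begin exactly at 'start'
        if m:
            found.append(m.group(0))
    return found

for pattern in [r"ban", r"(b|n)an", r"(an|b)an", r"(b|na)(s|a|n)"]:
    print(pattern, all_matches(pattern, "bananas"))
```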
RegEx:
special characters
repetitions/quantifiers
greedy: first matches as much as possible
reluctant: first matches as little as possible
possessive: like the greedy quantifier, but doesn't backtrack
Greedy   Reluctant   Possessive   Meaning
X?       X??         X?+          X, once or not at all ({0,1}, optional)
X*       X*?         X*+          X, zero or more times ({0,})
X+       X+?         X++          X, one or more times ({1,})
X{n}     X{n}?       X{n}+        X, exactly n times
X{n,}    X{n,}?      X{n,}+       X, at least n times
X{n,m}   X{n,m}?     X{n,m}+      X, at least n but not more than m times
We won't go into detail here; for anyone who is interested in this I recommend Friedl (2006).
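The difference between the three kinds of quantifier shows up as soon as there is more than one way to match. The sketch below uses Python's `re` module and an invented test string; possessive quantifiers are only understood by Python 3.11 or later, so that part is guarded by a version check.

```python
import re
import sys

text = "<a><b>"

# greedy: '.+' takes as much as possible, then gives back just enough for '>'
greedy = re.search(r"<.+>", text).group(0)

# reluctant: '.+?' takes as little as possible
reluctant = re.search(r"<.+?>", text).group(0)

# possessive: '.++' takes everything and never gives anything back,
# so the final '>' cannot be matched and there is no match at all
possessive_match = None
if sys.version_info >= (3, 11):
    possessive_match = re.search(r"<.++>", text)

print(greedy, reluctant, possessive_match)
```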
An asterisk marks a string that is not matched; in longer strings the pattern may match only a part of the string.
RegEx    Meaning                                   Example             Matches
X?       X, once or not at all                     (grand)?child       child, grandchild, grandgrandchild, grandgrandgrandchild
X*       X, zero or more times                     (grand)*child       child, grandchild, grandgrandchild, grandgrandgrandchild
X+       X, one or more times                      (grand)+child       *child, grandchild, grandgrandchild, grandgrandgrandchild
X{n}     X, exactly n times                        (grand){2}child     *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,}    X, at least n times                       (grand){2,}child    *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,m}   X, at least n but not more than m times   (grand){1,2}child   *child, grandchild, grandgrandchild, grandgrandgrandchild
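To see which part of each word a quantified pattern actually matches, the examples can be run through a small Python sketch. The helper name `matched_part` is made up; it reports the matched substring, or `None` where the table would show an asterisk.

```python
import re

words = ["child", "grandchild", "grandgrandchild", "grandgrandgrandchild"]
patterns = [
    r"(grand)?child",
    r"(grand)*child",
    r"(grand)+child",
    r"(grand){2}child",
    r"(grand){2,}child",
    r"(grand){1,2}child",
]

def matched_part(pattern, word):
    """Return the part of the word that the pattern matches, or None."""
    m = re.search(pattern, word)
    return m.group(0) if m else None

for pattern in patterns:
    print(pattern, [matched_part(pattern, w) for w in words])
```

With (grand)?child, for example, only the last ‘grandchild’ of ‘grandgrandchild’ is matched, because at most one ‘grand’ may precede ‘child’.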
with a backslash and a subsequent number after a group you can use a backreference to match the same string that was previously matched by that group
(ed)\1 matches ‘eded’ in needed
(\w{2})\1 matches hehe, ‘eded’ in needed, ‘emem’ in remember, 1818
this is not the same as using curly brackets, which repeat the pattern and not the matched string
(ed){2} matches ‘eded’ in needed
but (\w{2}){2} matches any sequence of four word characters
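The contrast between a backreference and a repeated group can be checked with a few lines of Python. The helper name `found` and the extra test word ‘word’ are invented for this sketch.

```python
import re

def found(pattern, word):
    """Return the matched part of the word, or None if there is no match."""
    m = re.search(pattern, word)
    return m.group(0) if m else None

# backreference: the second half must repeat the text matched by group 1
for word in ["hehe", "needed", "remember", "1818", "word"]:
    print(word, found(r"(\w{2})\1", word))

# repeated group: any two word characters, twice, so 'word' matches as well
print(found(r"(\w{2}){2}", "word"))
```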
RegEx:
special characters
anchors
An asterisk marks a string that is not matched.
domain       boundary    RegEx   Example
line         beginning   ^…      ^watch > watch this watch (the first ‘watch’ is matched)
line         end         …$      watch$ > watch this watch (the last ‘watch’ is matched)
annotation   beginning   \A…     \Awatch > watch this watch (the first ‘watch’ is matched)
annotation   end         …\Z     watch\Z > watch this watch (the last ‘watch’ is matched)
word         beginning   \b…     \bson > son, song, *lesson, *persons
word         end         …\b     son\b > son, *song, lesson, *persons
non-word     beginning   \B…     \Bson > *son, *song, lesson, persons
non-word     end         …\B     son\B > *son, song, *lesson, persons
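The anchor examples can be reproduced in Python, which is used here only for illustration. The whole string stands in for one annotation, and the helper name `hits` is made up. Details of `\Z` around a trailing line break can differ between regex implementations, so the sketch avoids that case.

```python
import re

annotation = "watch this watch"

# \A and \Z tie the match to the beginning and end of the whole string
first = re.search(r"\Awatch", annotation).span()
last = re.search(r"watch\Z", annotation).span()
print(first, last)

words = ["son", "song", "lesson", "persons"]

def hits(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

for pattern in [r"\bson", r"son\b", r"\Bson", r"son\B"]:
    print(pattern, hits(pattern))
```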
RegEx:
resources
Useful sites:
http://en.wikipedia.org/wiki/Regular_expression
http://www.regular-expressions.info/
http://etext.virginia.edu/services/helpsheets/unix/regex.html
Online tutorial:
http://www.zvon.org/comp/r/tut-Regexp.html
Literature:
Friedl, Jeffrey E. F. 2006. Mastering Regular Expressions. Beijing, Cambridge etc.: O'Reilly.
the end
Compendium RegEx
character types
literal characters
metacharacters/special characters
some literal characters have a special meaning when they are marked by a preceding backslash
e.g. \b marks a word boundary (the beginning or end of a word)
the literal value of metacharacters can be obtained by escaping them with a preceding backslash
e.g. \. matches a dot
character groups
[…] > square brackets define a set of characters
e.g. [aeiou] matches any vowel
most metacharacters lose their meaning, e.g. [?.] matches . or ?
[^…] > the caret defines a set of negated characters
e.g. [^aeiou] matches anything other than a vowel
there are predefined range sets of characters: a-z, A-Z, 0-9
[…-…] > a hyphen defines a range of characters
e.g. [a-e] matches a, b, c, d, e
the hyphen gets a special meaning in square brackets
connections of groups
set union > the operand classes are written one after another; matches every character that is a member of any of them
e.g. [a-z0-9] matches any lowercase letter or digit
set intersection > matches every character that is in both of its operand classes
e.g. [0-7&&[5-9]] matches 5, 6, 7
set subtraction > matches every character that is in one operand class, but not in the other
e.g. [a-z&&[^aeiou]] matches all lowercase consonants
short-hand classes
RegEx   class
.       any character (including white space)
\d      digit character [0-9]
\w      word character [a-zA-Z0-9_]
\s      whitespace character
\D      anything but a digit [^0-9]
\W      anything but a word character [^a-zA-Z0-9_]
\S      anything but a whitespace character
logical operators
RegEx   Operator              Meaning
ab      sequence (‘and’)      a followed by b
a|b     alternatives (‘or’)   either a or b
(ab)    grouping              a group with a followed by b
repetition/quantifiers
RegEx    Meaning                                   Example             Matches
X?       X, once or not at all                     (grand)?child       child, grandchild, grandgrandchild, grandgrandgrandchild
X*       X, zero or more times                     (grand)*child       child, grandchild, grandgrandchild, grandgrandgrandchild
X+       X, one or more times                      (grand)+child       *child, grandchild, grandgrandchild, grandgrandgrandchild
X{n}     X, exactly n times                        (grand){2}child     *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,}    X, at least n times                       (grand){2,}child    *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,m}   X, at least n but not more than m times   (grand){1,2}child   *child, grandchild, grandgrandchild, grandgrandgrandchild
with a backslash and a subsequent number after a group you can use a backreference to match the same string that was previously matched by that group
(ed)\1 matches ‘eded’ in needed
(\w{2})\1 matches hehe, ‘eded’ in needed, ‘emem’ in remember, 1818
this is not the same as using curly brackets, which repeat the pattern and not the matched string
(ed){2} matches ‘eded’ in needed
but (\w{2}){2} matches any sequence of four word characters
anchors
domain       boundary    RegEx   Example
annotation   beginning   \A…     \Awatch > watch this watch (the first ‘watch’ is matched)
annotation   end         …\Z     watch\Z > watch this watch (the last ‘watch’ is matched)
word         beginning   \b…     \bson > son, song, *lesson, *persons
word         end         …\b     son\b > son, *song, lesson, *persons
non-word     beginning   \B…     \Bson > *son, *song, lesson, persons
non-word     end         …\B     son\B > *son, song, *lesson, persons