Course
• Webpage for lecture slides and Panopto recordings:
  – http://dingo.sbs.arizona.edu/~sandiway/ling581-15/
• Meeting information
Course Objectives
• Follow-on course to LING/C SC/PSYC 438/538 Computational Linguistics:
  – continue with selected material from the 538 textbook (J&M):
    • 25 chapters, a lot of material not covered in 438/538
• And gain more extensive experience
  – with new stuff not in textbook
  – dealing with natural language software packages
  – installation, input data formatting
  – operation
  – project exercises
  – useful “real-world” computational experience
  – abilities gained will be of value to employers
Computational Facilities
• Use your own laptop/desktop
  – can also make use of the computers in this lab (Shantz 338)
    • but you don’t have installation rights on these computers
    • plus the alarm goes off after hours and campus police will arrive…
• Platforms
  Windows is maybe possible but you really should run some variant of Unix… (for your task #1 for this week)
  – Linux (separate bootable partition or via virtualization software)
    • de facto standard for advanced/research software
    • https://www.virtualbox.org/ (free!)
  – Cygwin on Windows
    • http://www.cygwin.com/
    • Linux-like environment for Windows making it possible to port software running on POSIX systems (such as Linux, BSD, and Unix systems) to Windows.
  – OSX
    • Not quite Linux, some porting issues, especially with C programs; can use VirtualBox (Linux under OSX)
Grading
• Completion of all homework tasks will result in a satisfactory grade (A)
• Tasks should be completed before the next class.
  – email me your work ([email protected]).
  – also be prepared to come up and present your work (if called upon).
Homework Task 1: Install Tregex
Computer language: Java
• http://nlp.stanford.edu/software/tregex.shtml
• (538: Perl regex on strings)
• 581: regex for trees …
Homework Task 1: Install Tregex
• We’ll use the program tregex from Stanford University to explore the Penn Treebank
  – current version:
Penn Treebank
• Availability
  – Source:
    • Linguistic Data Consortium (LDC)
    • U. of Arizona is a (fee-paying) member of this consortium
    • Resources are made available to the community through the main library
    • URL
      – http://sabio.library.arizona.edu/search/X
Penn Treebank (V3)
• Call Record
I have it on a USB drive here that I will pass around: TREEBANK_3.zip (65.2 MB)
tregex
• Tregex is a Tgrep2-style utility for matching patterns in trees.
• written in Java
• run-tregex-gui.command shell script
• -mx flag: the 300m default memory size may need to be increased depending on the platform
Minimum Edit Distance
• general string comparison
• edit operations are insertion, deletion and substitution
• not limited to strings that are a single operation apart
• we can ask how different string a is from string b, measured by the minimum edit distance
Minimum Edit Distance
• applications
  – could be used for multi-typo correction
  – used in Machine Translation Evaluation (MTEval)
  – example
    • Source: 生産工程改善について
    • Translations:
      • (Standard) For improvement of the production process
      • (MT-A) About a production process betterment
      • (MT-B) About the production process improvement
    • method
      – compute edit distance between MT-A and Standard and MT-B and Standard in terms of word insertion/substitution etc.
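The MTEval method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: words are obtained by splitting on whitespace, comparison is case-sensitive, and every word insertion, deletion and substitution costs 1. The function name word_edit_distance is made up for this sketch and is not part of any MT evaluation package.

```python
def word_edit_distance(hyp, ref):
    """Unit-cost edit distance between two sentences, counted in words."""
    h, r = hyp.split(), ref.split()
    # prev[j] = cost of turning the hypothesis words seen so far
    # into the first j reference words
    prev = list(range(len(r) + 1))
    for i in range(1, len(h) + 1):
        curr = [i] + [0] * len(r)
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # delete hypothesis word
                          curr[j - 1] + 1,    # insert reference word
                          prev[j - 1] + sub)  # substitute, or free match
        prev = curr
    return prev[-1]


standard = "For improvement of the production process"
mt_a = "About a production process betterment"
mt_b = "About the production process improvement"

print(word_edit_distance(mt_a, standard))  # 5
print(word_edit_distance(mt_b, standard))  # 4
```

Under these assumptions MT-B is one edit closer to the Standard translation than MT-A.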
Minimum Edit Distance
• cost models
  – Levenshtein
    • insertion, deletion and substitution all have unit cost
  – Levenshtein (alternate)
    • insertion, deletion have unit cost
    • substitution is twice as expensive
    • substitution = one insert followed by one delete
  – Typewriter
    • insertion, deletion and substitution all have unit cost
    • modified by key proximity
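To see how the choice of cost model changes the answer, here is a minimal Python sketch in which the three operation costs are parameters. The function name edit_distance and its keyword arguments are assumptions made for this illustration. The Typewriter model is not covered, since it would need a key-proximity table that these slides do not give.

```python
def edit_distance(s, t, ins=1, dele=1, sub=1):
    """Minimum cost of turning string s into string t.

    ins, dele, sub are the costs of one insertion, deletion, substitution.
    """
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + dele      # delete every letter of s[:i]
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins       # insert every letter of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            # matching letters cost nothing on the diagonal step
            diag = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub)
            d[i][j] = min(diag, d[i - 1][j] + dele, d[i][j - 1] + ins)
    return d[-1][-1]


print(edit_distance("leader", "adapter"))         # Levenshtein: 4
print(edit_distance("leader", "adapter", sub=2))  # Levenshtein (alternate): 5
```

With unit-cost substitution the pair leader/adapter comes out at 4; when substitution costs 2 it comes out at 5.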
Minimum Edit Distance
• Dynamic Programming
  – divide-and-conquer
    • to solve a problem we divide it into sub-problems
  – sub-problems may be repeated
    • don’t want to re-solve a sub-problem the 2nd time around
  – idea: put solutions to sub-problems in a table
    • and just look up the solution 2nd time around, thereby saving time
    • memoization
we’ll use a spreadsheet…
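Outside a spreadsheet, the memoization idea can be sketched in Python with functools.lru_cache acting as the lookup table. This is an illustration only: the name edit_distance_memo is invented here, and the costs assume the alternate Levenshtein model (insert and delete cost 1, substitution costs 2).

```python
from functools import lru_cache


def edit_distance_memo(s, t, sub_cost=2):
    """Top-down edit distance; returns (distance, number of cached sub-problems)."""

    @lru_cache(maxsize=None)
    def cost(i, j):
        # cost of turning the first i letters of s into the first j letters of t
        if i == 0:
            return j          # insert j letters
        if j == 0:
            return i          # delete i letters
        diag = cost(i - 1, j - 1) + (0 if s[i - 1] == t[j - 1] else sub_cost)
        return min(diag, cost(i - 1, j) + 1, cost(i, j - 1) + 1)

    distance = cost(len(s), len(t))
    return distance, cost.cache_info().currsize


print(edit_distance_memo("leader", "adapter"))  # (5, 56)
```

Each of the 7 x 8 = 56 sub-problems is solved once and then looked up, instead of being re-solved every time the recursion reaches it.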
Minimum Edit Distance
• Consider a simple case: xy ⇄ yx
• Minimum # of operations:
  – insert and delete
  – cost = 2
• Minimum # of operations:
  – swap
  – cost = ?
Minimum Edit Distance Computation
• Or in Microsoft Excel, file: eds.xls (on course webpage)
$ in a cell reference means don’t change when copied from cell to cell
e.g. in C$1, 1 stays the same
in $A3, A stays the same
Minimum Edit Distance
• Task: transform string s1..si into string t1..tj
  – each sn and tn are letters
  – string s is of length i, t is of length j
• Example:
  – s = leader, t = adapter
  – i = 6, j = 7
  – Let’s say you’re allowed just three operations: (1) delete a letter, (2) insert a letter, or (3) substitute a letter for another letter
  – What is one possible way to generate t from s?
Minimum Edit Distance
• Example:
  – s = leader, t = adapter
  – What is one possible way to generate t from s?
  – leader ⇄ adapter: delete l and e, keep ad, insert a, p and t, keep er
  – cost is 2 deletes and 3 inserts, total 5 operations
  – Question: is this the minimum possible?
• Simplest method: delete every letter of s, then insert every letter of t
  – leader → leade → lead → lea → le → l → (empty) → a → ad → ada → adap → adapt → adapte → adapter
  – cost: 13 operations
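The simplest method gives an upper bound of len(s) + len(t) operations. A small Python sketch (the function name delete_then_insert is hypothetical) that lists every intermediate string:

```python
def delete_then_insert(s, t):
    """List the strings visited when s is erased letter by letter and t is typed in."""
    steps = [s]
    cur = s
    while cur:                 # one delete per letter of s
        cur = cur[:-1]
        steps.append(cur)
    for ch in t:               # one insert per letter of t
        cur += ch
        steps.append(cur)
    return steps


steps = delete_then_insert("leader", "adapter")
print(" -> ".join(x or "(empty)" for x in steps))
print(len(steps) - 1)  # 13 operations: 6 deletes + 7 inserts
```

Any smarter method only has to beat this bound, which the 5-operation solution above already does.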
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0
1  l
2  e
3  a
4  d
5  e
6  r
• rows: prefixes of s = leader; columns: prefixes of t = adapter
• cell (2,3): cost of transforming le into ada
• cell (6,7): cost of transforming leader into adapter
• cell (3,0): cost of transforming lea into (empty)
• cell (0,4): cost of transforming (empty) into adap
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0
1  l
2  e
3  a
4  d
5  e                    k
6  r                       k
• cell (5,6) = k: cost of transforming leade into adapte
• one diagonal step to cell (6,7): both strings gain the same letter r, so the cost stays k
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0
1  l
2  e           k  k+1
3  a
4  d
5  e
6  r
• cell (2,3) = k: cost of transforming le into ada
• cell (2,4) = k+1: cost of transforming le into adap
  – one step to the right: le → ada, then insert p
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0
1  l              k
2  e              k+1
3  a
4  d
5  e
6  r
• cell (1,4) = k: cost of transforming l into adap
• cell (2,4) = k+1: cost of transforming le into adap
  – one step down: delete e, then l → adap
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0
1  l           k
2  e              k+2
3  a
4  d
5  e
6  r
• cell (1,3) = k: cost of transforming l into ada
• cell (2,4) = k+2: cost of transforming le into adap
  – one diagonal step: l → ada, then swap e for p
  – assuming the cost of swapping e for p is 2
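Minimum Edit Distance
      0     1     2     3     4     5     6     7
            a     d     a     p     t     e     r
0
1  l                    k1,3  k1,4
2  e                    k2,3  ?
3  a
4  d
5  e
6  r
• cell (2,4): minimum of the three costs to get here in one step
  – diagonal, from cell (1,3): k1,3 + 2 (swap e for p)
  – down, from cell (1,4): k1,4 + 1 (delete e)
  – right, from cell (2,3): k2,3 + 1 (insert p)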
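The one-step rule can be written as a tiny Python function. This is a sketch under the costs used on these slides (insert and delete cost 1, swapping one letter for another costs 2, matching letters cost 0); the name cell_cost and its parameters are invented for the illustration. The sample values are cells of the leader/adapter table that appear later in this deck.

```python
def cell_cost(diag, up, left, same_letter, sub_cost=2):
    """Cost of a cell given its three already-computed neighbours.

    diag: cell (i-1, j-1), up: cell (i-1, j), left: cell (i, j-1).
    same_letter: True when the row letter equals the column letter.
    """
    via_diag = diag + (0 if same_letter else sub_cost)  # match or swap
    via_up = up + 1                                     # delete the row letter
    via_left = left + 1                                 # insert the column letter
    return min(via_diag, via_up, via_left)


print(cell_cost(diag=6, up=7, left=5, same_letter=False))  # cell (3,5): 6
print(cell_cost(diag=5, up=6, left=6, same_letter=True))   # cell (5,6): 5
```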
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0     0
1  l  1
2  e  2
3  a
4  d
5  e
6  r
• cell (3,0): cost of transforming lea into (empty)
• column 0: cost of transforming le into (empty) = cost for l, plus the cost of deleting the e
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0     0  1  2  3  4  5  6  7
1  l  1
2  e  2
3  a  3
4  d  4
5  e  5
6  r  6
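The filled-in first row and first column can be produced as follows. This is a minimal sketch assuming unit insert and delete costs; init_table is a hypothetical helper name, and None marks cells that are not yet computed.

```python
def init_table(s, t):
    """Build the table with only row 0 and column 0 filled in."""
    table = [[None] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        table[i][0] = i   # s[:i] -> (empty): delete i letters
    for j in range(len(t) + 1):
        table[0][j] = j   # (empty) -> t[:j]: insert j letters
    return table


table = init_table("leader", "adapter")
print(table[0])                      # [0, 1, 2, 3, 4, 5, 6, 7]
print([row[0] for row in table])     # [0, 1, 2, 3, 4, 5, 6]
```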
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0     0  1  2  3  4  5  6  7
1  l  1  2
2  e  2
3  a  3
4  d  4
5  e  5
6  r  6
• cell (1,1) = 2: minimum over the three one-step routes (diagonal 0 + 2, down 1 + 1, right 1 + 1)
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0     0  1  2  3  4  5  6  7
1  l  1
2  e  2
3  a  3
4  d  4              5  6
5  e  5              6  5
6  r  6
• cell (5,6) = 5: e matches e, so the diagonal value from cell (4,5) carries over at no cost
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0     0  1  2  3  4  5  6  7
1  l  1
2  e  2           6  7
3  a  3           5  6
4  d  4
5  e  5
6  r  6
• cell (3,5) = 6: a and t differ, so take the minimum of 6 + 2 (diagonal), 7 + 1 (down) and 5 + 1 (right)
Minimum Edit Distance
      0  1  2  3  4  5  6  7
         a  d  a  p  t  e  r
0     0  1  2  3  4  5  6  7
1  l  1
2  e  2
3  a  3
4  d  4
5  e  5              6  5
6  r  6              7  6
• cell (6,6) = 6: r and e differ, so take the minimum of 6 + 2 (diagonal), 5 + 1 (down) and 7 + 1 (right)
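To check the hand-filled cells, the whole table can be computed and printed. This is a sketch only, assuming the costs used in these slides (insert and delete cost 1, swap costs 2); fill_table is a name made up for the illustration and is not the eds.xls spreadsheet.

```python
def fill_table(s, t, sub_cost=2):
    """Return the full cost table; d[i][j] = cost of turning s[:i] into t[:j]."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                       # column 0: deletions only
    for j in range(len(t) + 1):
        d[0][j] = j                       # row 0: insertions only
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


s, t = "leader", "adapter"
table = fill_table(s, t)
print("     " + "  ".join(t))             # column letters
for letter, row in zip(" " + s, table):   # row letters, blank for row 0
    print(letter + " " + " ".join(f"{v:2d}" for v in row))
print("minimum edit distance:", table[len(s)][len(t)])  # 5
```

The bottom-right cell (6,7) comes out as 5, which answers the earlier question: under this cost model, 2 deletes and 3 inserts is the minimum.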