Extracting Names Using Layout Clues in Genealogical Books
Aaron Stewart
David W. Embley
March 20, 2010
Finding Names
Stanford Named Entity Recognizer
Apache UIMA Framework
CRF, MEMM
Natural Language Processing
?
Ancestry.com Data
• Word text
• Word bounding boxes
• Genres:
  – Genealogical Books
  – City Directories
  – Yearbooks
  – Newspapers
Margin Finder – Future Work
• ABBYY FineReader handles:
  – Paragraphs
  – Newspaper columns
• But has trouble with:
  – Hanging indents
  – Outline indentation (possibly)
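The slides do not say how the margin finder works, so the following is only a minimal sketch of one way indentation levels could be recovered from word bounding boxes. The `Word` fields, the pixel tolerances, and the clustering rule are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    left: int  # left edge of the bounding box (hypothetical pixel units)
    top: int   # top edge of the bounding box

def group_into_lines(words, line_tol=5):
    """Group words whose top edges lie within line_tol of the line's first word."""
    lines = []
    for w in sorted(words, key=lambda w: (w.top, w.left)):
        if lines and abs(lines[-1][0].top - w.top) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda w: w.left) for line in lines]

def margin_levels(lines, indent_tol=10):
    """Assign each line a level (1 = leftmost margin) by clustering left edges."""
    margins = []
    for x in sorted({line[0].left for line in lines}):
        # Start a new margin only when the gap exceeds the tolerance.
        if not margins or x - margins[-1] > indent_tol:
            margins.append(x)
    def level(x):
        return 1 + max(i for i, m in enumerate(margins) if x >= m)
    return [level(line[0].left) for line in lines]

def insert_markers(lines, levels):
    """Prefix each line's text with its margin marker."""
    return [f"LEVEL {lvl} " + " ".join(w.text for w in line)
            for line, lvl in zip(lines, levels)]

# Invented sample: a hanging indent whose continuation line starts further right.
words = [
    Word("Smith,", 50, 100), Word("John", 120, 100),
    Word("b.", 80, 120), Word("1850", 110, 120),
    Word("Jones,", 52, 140),
]
lines = group_into_lines(words)
marked = insert_markers(lines, margin_levels(lines))
for row in marked:
    print(row)
```

A fixed tolerance is the simplest choice here; skewed scans or noisy bounding boxes would likely need something more robust.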
Pattern Finding
1. Apply baseline name extractor (OntoES)
2. Apply margin finder and insert markers
3. Find left and right context for each name
4. Apply common contexts to extract more names
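Steps 3 and 4 above can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the baseline extractor (OntoES) is replaced by hand-supplied name spans, margin markers are represented as single hypothetical tokens (`LEVEL1`, `LEVEL2`), and a context is just one token on each side of a name.

```python
from collections import Counter

def find_contexts(tokens, name_spans, width=1):
    """Count the (left, right) token contexts around each known name span."""
    contexts = Counter()
    for start, end in name_spans:  # end is exclusive
        left = tuple(tokens[max(0, start - width):start])
        right = tuple(tokens[end:end + width])
        contexts[(left, right)] += 1
    return contexts

def apply_context(tokens, left, right, max_len=3):
    """Return the shortest spans (up to max_len tokens) between left and right."""
    found = []
    n = len(tokens)
    for start in range(len(left), n):
        if tuple(tokens[start - len(left):start]) != left:
            continue
        for end in range(start + 1, min(start + max_len, n) + 1):
            if tuple(tokens[end:end + len(right)]) == right:
                found.append((start, end))
                break
    return found

# Invented sample text with margin markers already inserted (step 2).
tokens = ("LEVEL1 John Smith , b. 1850 LEVEL2 m. Mary Jones "
          "LEVEL1 Peter Brown , b. 1852 LEVEL1 Anna Clark , d. 1901").split()

# Stand-in for step 1: spans a baseline extractor might have found.
baseline_spans = [(1, 3), (11, 13)]

# Step 3: collect contexts and pick the most common one.
(left, right), count = find_contexts(tokens, baseline_spans).most_common(1)[0]

# Step 4: apply that context to the whole text to pick up additional names.
names = [" ".join(tokens[s:e]) for s, e in apply_context(tokens, left, right)]
print(names)
```

In this toy input the context "margin marker on the left, comma on the right" recovers both baseline names plus "Anna Clark", which the stand-in baseline missed. "Mary Jones" stays unfound because it appears in a different context.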
Pattern Finding
[Figure: sample page with lines labeled LEVEL 1 and LEVEL 2]
2. Apply margin finder and insert markers
Pattern Finding
[Figure: sample page with lines labeled LEVEL 1 and LEVEL 2]
3. Find left and right context for each name
Pattern Finding
[Figure: sample page with lines labeled LEVEL 1 and LEVEL 2]
4. Apply common context patterns to extract more names
Pattern Finding – Sample Results
Baseline Results
• Precision: 40%
• Recall: 31.25%
• F1: 35.09%
Results of Most Salient Pattern
• Precision: 51.52%
• Recall: 53.12%
• F1: 52.31%
Not all results are this good!
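As a quick consistency check, the reported F1 values can be recomputed from the reported precision and recall, assuming F1 here is the usual harmonic mean of the two. The function name below is a hypothetical helper, not something from the slides.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported on the slide, in percent.
baseline_f1 = f1_score(40.0, 31.25)
pattern_f1 = f1_score(51.52, 53.12)

print(f"Baseline F1: {baseline_f1:.2f}%")
print(f"Most salient pattern F1: {pattern_f1:.2f}%")
```

Both values round to the figures on the slide (35.09% and 52.31%).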