Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010

Extracting Names Using Layout Clues

in Genealogical Books

Aaron Stewart

David W. Embley

March 20, 2010

Problem

Process

Finding Names

• Name recognition in genealogical texts

• Focus: Lists, Directories

Finding Names

It’s easy for us to spot names… But how does a computer do it?

Which side was easier?

Finding Names

Stanford Named Entity Recognizer

Apache UIMA Framework

CRF MEMM

Natural Language Processing

?

BYU OntoES Ontology Extraction System

• Dictionary

• Regular Expressions
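
To make the dictionary-plus-regular-expression idea concrete, here is a minimal sketch of a baseline name extractor. It is not the OntoES implementation: the tiny `GIVEN_NAMES` dictionary, the `NAME_PATTERN` expression, and the `extract_names` helper are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical given-name dictionary; a real system would load a much larger list.
GIVEN_NAMES = {"John", "Mary", "William", "Sarah"}

# Assumed pattern: "Given [M.] Surname", i.e. capitalized tokens
# with an optional middle initial.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+)(?: [A-Z]\.)? ([A-Z][a-z]+)\b")

def extract_names(text):
    """Return candidate full names whose first token is in the dictionary."""
    names = []
    for match in NAME_PATTERN.finditer(text):
        # The regular expression proposes candidates; the dictionary filters them.
        if match.group(1) in GIVEN_NAMES:
            names.append(match.group(0))
    return names

sample = "John Q. Smith married Mary Jones in Salem Ohio."
found = extract_names(sample)
print(found)
```

In this sample, "Salem Ohio" matches the capitalization pattern but is rejected because "Salem" is not in the dictionary. A given name missing from the dictionary would be lost in the same way, which is the kind of gap that layout clues are meant to help close.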

Part 1: Preprocessing

Ancestry.com Data

• Word text

• Word bounding boxes

• Genres:
  – Genealogical Books
  – City Directories
  – Yearbooks
  – Newspapers

Page Separator

Line Segment Identifier
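
The slides do not say how line segments are identified, so the following is only one plausible sketch that works from word bounding boxes. The `Word` tuple, the `max_gap` threshold, and the grouping rule (vertical overlap plus a small horizontal gap) are all assumptions.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    left: float
    top: float
    right: float
    bottom: float

def vertical_overlap(a, b):
    """True if the two boxes share some vertical extent (same text row)."""
    return min(a.bottom, b.bottom) - max(a.top, b.top) > 0

def group_into_segments(words, max_gap=40.0):
    """Chain words left to right into line segments.

    A word joins a segment when it overlaps the segment's last word
    vertically and starts no more than max_gap units to its right.
    """
    segments = []
    for word in sorted(words, key=lambda w: w.left):
        for seg in segments:
            last = seg[-1]
            if vertical_overlap(last, word) and word.left - last.right <= max_gap:
                seg.append(word)
                break
        else:
            segments.append([word])
    # Return segments in top-to-bottom, left-to-right order.
    segments.sort(key=lambda seg: (seg[0].top, seg[0].left))
    return segments

words = [
    Word("Smith,", 10, 100, 60, 112),
    Word("John", 66, 100, 100, 112),
    Word("Salem", 300, 100, 350, 112),   # same row, far to the right
    Word("Jones,", 10, 120, 62, 132),    # next row
]
segments = [[w.text for w in seg] for seg in group_into_segments(words)]
print(segments)
```

The large gap before "Salem" starts a new segment on the same row, which is how a sketch like this would keep separate columns apart.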

RANSAC Margin Finder
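
As a rough illustration of how RANSAC can locate a margin, the sketch below fits a near-vertical line `x = a*y + b` to the left edges of line segments and treats indented lines as outliers. Only the use of RANSAC comes from the slides; the line model, the `tolerance` and `iterations` parameters, and the function name `ransac_margin` are assumptions.

```python
import random

def ransac_margin(points, tolerance=3.0, iterations=200, seed=0):
    """Fit x = a*y + b to (x, y) left-edge points.

    Returns (a, b, inliers) for the candidate line with the most inliers.
    """
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(iterations):
        # Sample a minimal set: two points define a candidate margin line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if y1 == y2:
            continue  # points on the same row cannot define a near-vertical line
        a = (x1 - x2) / (y1 - y2)
        b = x1 - a * y1
        # Count points lying within the tolerance of the candidate line.
        inliers = [(x, y) for x, y in points if abs(x - (a * y + b)) <= tolerance]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best

# Eight lines flush with a left margin at x = 50, plus two indented lines.
points = [(50, 100 * i) for i in range(1, 9)] + [(80, 250), (120, 450)]
a, b, inliers = ransac_margin(points)
print(a, b, len(inliers))
```

Lines that fall outside the tolerance of the fitted margin could then be examined again to find further indentation levels, though the slides do not say whether the authors proceed this way.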

Margin Finder – Future Work

Key: Left, Center, Right

Margin Finder – Future Work

• ABBYY FineReader handles –
  – Paragraphs
  – Newspaper columns

• But has trouble with –
  – Hanging indents
  – Outline indentation (possibly)

Part 2: Pattern Finding

Pattern Finding

1. Apply baseline name extractor (OntoES)

2. Apply margin finder and insert markers

3. Find left and right context for each name

4. Apply common contexts to extract more names
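
The four steps above can be sketched on a toy token stream. This is an illustration under stated assumptions, not the authors' code: the `<LEVEL1>` and `<LEVEL2>` marker tokens, the one-token left and right contexts, the `max_len` limit, and both function names are hypothetical.

```python
from collections import Counter

def find_contexts(tokens, seed_spans):
    """Step 3: count (left token, right token) pairs around each seed name."""
    counts = Counter()
    for start, end in seed_spans:
        if start > 0 and end < len(tokens):
            counts[(tokens[start - 1], tokens[end])] += 1
    return counts

def apply_context(tokens, context, max_len=3):
    """Step 4: return spans of 1..max_len tokens enclosed by the context."""
    left, right = context
    spans = []
    for i, tok in enumerate(tokens):
        if tok != left:
            continue
        start = i + 1
        last_end = min(start + max_len, len(tokens) - 1)
        for end in range(start + 1, last_end + 1):
            if tokens[end] == right:
                spans.append((start, end))
                break
    return spans

# Step 2 output (assumed format): margin markers inserted at each line start.
tokens = [
    "<LEVEL1>", "John", "Smith", ",", "farmer",
    "<LEVEL1>", "Mary", "Jones", ",", "teacher",
    "<LEVEL1>", "Ezekiel", "Abernathy", ",", "clerk",
    "<LEVEL2>", "see", "also", "Smith",
]
# Step 1 output (assumed): the baseline extractor found only the first two names.
seed_spans = [(1, 3), (6, 8)]

context = find_contexts(tokens, seed_spans).most_common(1)[0][0]
spans = apply_context(tokens, context)
names = [" ".join(tokens[s:e]) for s, e in spans]
print(context, names)
```

Here the most common context, a level-1 margin marker on the left and a comma on the right, recovers "Ezekiel Abernathy", which the assumed baseline missed.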

Pattern Finding

1. Apply baseline name extractor (OntoES)

Pattern Finding

Figure: sample page with margin markers (six lines marked LEVEL 1, two marked LEVEL 2)

2. Apply margin finder and insert markers

Pattern Finding

Figure: sample page with margin markers (six lines marked LEVEL 1, two marked LEVEL 2)

3. Find left and right context for each name

Pattern Finding

Figure: sample page with margin markers (six lines marked LEVEL 1, two marked LEVEL 2)

4. Apply common context patterns to extract more names

Pattern Finding – Sample Results

Baseline Results
• Precision: 40%
• Recall: 31.25%
• F1: 35.09%

Results of Most Salient Pattern
• Precision: 51.52%
• Recall: 53.12%
• F1: 52.31%

Not all results are this good!
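
For reference, the F1 figures above follow from the standard definition of F1 as the harmonic mean of precision and recall. The short check below reproduces both values; the function name `f1_score` is just a local helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

baseline_f1 = f1_score(0.40, 0.3125)
pattern_f1 = f1_score(0.5152, 0.5312)
print(f"{baseline_f1:.2%}")  # 35.09%
print(f"{pattern_f1:.2%}")   # 52.31%
```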

Challenges

• Evaluation
  – More aligned data
  – Annotation tool

• Other books
  – Centered and right-aligned text
  – Knowing when to apply patterns