A Training and Classification System in Support of Automated Metadata Extraction PhD Proposal Paul K Flynn 14 May 2009

Page 1: A Training and Classification System in Support of Automated Metadata Extraction

A Training and Classification System in Support of Automated Metadata Extraction

PhD Proposal

Paul K Flynn

14 May 2009

Page 2: A Training and Classification System in Support of Automated Metadata Extraction

PhD Proposal – Paul K Flynn

Motivation

• bootstrap


Page 3: A Training and Classification System in Support of Automated Metadata Extraction

Overview

• Background
  – Metadata Extraction Overview
  – System Description
  – Nonform classification
• Proposal
• Classification
  – Background
  – Potential Methods
  – Complexities
  – Early experiments
  – Avenues of Investigation
• Training System
  – Design Overview
• Testing and Evaluation
• Schedule

Page 4: A Training and Classification System in Support of Automated Metadata Extraction

Overview

Current section: Background
  – Metadata Extraction Overview
  – System Description
  – Nonform classification

Page 5: A Training and Classification System in Support of Automated Metadata Extraction

Metadata Extraction System

• Large, heterogeneous, evolving collections consisting of documents with diverse layout and structure
  – Defense Technical Information Center (DTIC)
  – U.S. Government Printing Office (GPO) – Environmental Protection Agency (EPA)
• Two types of targets
  – Forms: documents containing a standardized form filled out with metadata of interest
  – Non-forms: all others
• Developing a system which can be transitioned into a production system

Page 6: A Training and Classification System in Support of Automated Metadata Extraction


Form Document

Page 7: A Training and Classification System in Support of Automated Metadata Extraction


Nonform Document

Page 8: A Training and Classification System in Support of Automated Metadata Extraction

Design concepts

Heterogeneity
• A new document is classified, assigning it to a group of documents of similar layout, which reduces the problem to multiple homogeneous collections
• Associated with each class of document layouts is a template, a scripted description of how to associate blocks of text in the layout with metadata fields

Evolution
• New classes of documents are accommodated by writing a new template
• Templates are comparatively simple; no lengthy retraining is required
• Potentially rapid response to changes in the collection

Robustness
• Use of validation techniques to detect extraction problems and to select templates

Page 9: A Training and Classification System in Support of Automated Metadata Extraction

Architecture & Implementation

[Figure: system architecture diagram. Labeled components: Input Documents (PDF), Input Processing & OCR, XML model of document, Form Processing, Form Templates (sf298_1, sf298_2, ...), Unresolved Documents, Nonform Processing, Nonform Templates (au, eagle, ...), Extracted Metadata, Post Processing, Cleaned Metadata, Validation, trusted outputs, Untrusted Metadata Outputs, Human Review & Correction, corrected metadata, Final Metadata Output]

Page 10: A Training and Classification System in Support of Automated Metadata Extraction

Validation Scripts

Final Validation Classification

Page 11: A Training and Classification System in Support of Automated Metadata Extraction

Post-Hoc Classification

[Figure: post hoc classification flow. Labels: Unresolved Document, Clean XML, Nonform Templates (au, eagle, ...), Extract Metadata, Candidate Metadata Sets, Validation Spec. (validation rules), Select Best Metadata, Selected Metadata, Final Nonform Output]

• Apply all templates to document
  – results in multiple candidate sets of metadata
• Score each candidate using the validator
  – Select the best-scoring set
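The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a template is modeled as a function from document text to a metadata dictionary, the validator as a function returning a numeric score (higher is better), and `au_template`, `eagle_template`, and `toy_validator` are toy stand-ins, not the system's actual templates or validation scripts.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]
Template = Callable[[str], Metadata]      # assumption: text in, fields out
Validator = Callable[[Metadata], float]   # assumption: higher = more plausible


def post_hoc_classify(document: str,
                      templates: Dict[str, Template],
                      validator: Validator) -> Tuple[Optional[str], Metadata, float]:
    """Apply every template, score each candidate set, keep the best one."""
    best_name, best_meta, best_score = None, {}, float("-inf")
    for name, template in templates.items():
        candidate = template(document)    # one candidate metadata set
        score = validator(candidate)      # validator scores the candidate
        if score > best_score:
            best_name, best_meta, best_score = name, candidate, score
    return best_name, best_meta, best_score


# Toy stand-ins, named after template classes mentioned on the slides.
def au_template(doc: str) -> Metadata:
    lines = doc.splitlines()
    return {"title": lines[0] if lines else "",
            "author": lines[1] if len(lines) > 1 else ""}


def eagle_template(doc: str) -> Metadata:
    lines = doc.splitlines()
    return {"title": lines[-1] if lines else "", "author": ""}


def toy_validator(meta: Metadata) -> float:
    # Counts non-empty fields; a real validator would apply semantic rules.
    return sum(1.0 for value in meta.values() if value.strip())
```

Because every template runs on every document, the cost grows with the number of templates, and a wrong template can win whenever its output happens to validate well.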

Page 12: A Training and Classification System in Support of Automated Metadata Extraction


Post hoc classification shortcomings

Correct

Selected

Page 13: A Training and Classification System in Support of Automated Metadata Extraction

Classification (a priori)

[Figure: a priori classification flow. Labels: Unresolved Document, Clean XML, Nonform Templates (au, eagle, ...), Classify (select best template), selected template, Extract Metadata, Extracted Metadata, Final Nonform Output]

• Replace post hoc classification alone with a new classification module
• Continue to use Validator to provide semantic verification of extracts

Page 14: A Training and Classification System in Support of Automated Metadata Extraction

Focus: Non-Form Processing

• Classification – compare document against known document layouts
  – Select template written for closest matching layout
• Apply non-form extraction engine to document and template
• Send to validator for scoring
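As a rough sketch of this pipeline, the Python below picks the closest known layout first and only then extracts and validates. The layout representation (a short feature vector), the Euclidean distance, and all function names are assumptions made for illustration; the slides do not specify how layouts are compared.

```python
from typing import Callable, Dict, Sequence, Tuple

Layout = Sequence[float]                  # assumption: layout as a feature vector
Metadata = Dict[str, str]


def layout_distance(a: Layout, b: Layout) -> float:
    """Euclidean distance between two layout vectors (illustrative choice)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_template(doc_layout: Layout, known_layouts: Dict[str, Layout]) -> str:
    """Return the name of the known layout closest to the document."""
    return min(known_layouts,
               key=lambda name: layout_distance(doc_layout, known_layouts[name]))


def process_nonform(doc_text: str,
                    doc_layout: Layout,
                    known_layouts: Dict[str, Layout],
                    templates: Dict[str, Callable[[str], Metadata]],
                    validator: Callable[[Metadata], float]) -> Tuple[str, Metadata, float]:
    """Classify, extract with the selected template, then score the result."""
    name = select_template(doc_layout, known_layouts)
    metadata = templates[name](doc_text)
    return name, metadata, validator(metadata)
```

Only one template is applied per document here, so the validator acts as a check on the extraction rather than as the classifier.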

Page 15: A Training and Classification System in Support of Automated Metadata Extraction

Extraction System

Page 16: A Training and Classification System in Support of Automated Metadata Extraction

Overview

Current section: Proposal

Page 17: A Training and Classification System in Support of Automated Metadata Extraction

Proposal

• Investigate classification methodologies and implementation
• Create Training System for managing and creating templates
• Specific questions we will attempt to answer:
  – Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
  – Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
  – Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?
  – Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?

Page 18: A Training and Classification System in Support of Automated Metadata Extraction

Overview

Current section: Classification
  – Background
  – Potential Methods
  – Complexities
  – Early experiments
  – Avenues of Investigation

Page 19: A Training and Classification System in Support of Automated Metadata Extraction

Previous Research

• Extensive coverage in literature
  – Model and methodology match purpose
  – No universal classifier
  – Many experiments use limited number of classes
• Use visual similarity, logical structures, or textual features
  – Machine learning
  – Decision trees
  – Distance measures
• Multiple classifiers

Page 20: A Training and Classification System in Support of Automated Metadata Extraction

Issues and Complexities

• Self-imposed constraints
  – Must be simple to maintain
  – Automated process for deploying from Training System
  – Avoid adding dependency on additional 3rd party packages

Page 21: A Training and Classification System in Support of Automated Metadata Extraction


Unpredictable OCR Results

Page 22: A Training and Classification System in Support of Automated Metadata Extraction


Unpredictable Page Segmentation

Page 23: A Training and Classification System in Support of Automated Metadata Extraction


Second page relevancy

Page 1

Page 2

Page 24: A Training and Classification System in Support of Automated Metadata Extraction


Manual Classification

• Documents may appear visually similar at thumbnail scale

• Closer inspection reveals semantic differences

Page 25: A Training and Classification System in Support of Automated Metadata Extraction


Incorrect Manual Classification

Position of Date Field different – detectable by post hoc classification

Page 26: A Training and Classification System in Support of Automated Metadata Extraction

Initial experiments

• Methods tested
  – Block distances: tries to match blocks and measure distances
  – MxN Overlap: divide page into bins and count matches
  – Common Vocab: find common words in pages of training class
    • Vocab1: looks at only 1st page
    • Vocab5: looks at 1st five pages
  – MXY tree: variant encodes structure as string, uses edit distance to measure similarity
  – MXY + MxN: sums two methods
• Used best 4/5 votes to declare match
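To make three of these ideas concrete, here is a minimal Python sketch of an MxN bin overlap, a string edit distance of the kind the MXY tree variant relies on, and a vote rule. Everything beyond the slide text is an assumption: blocks are taken to be rectangles in page-normalized coordinates, the overlap is scored as shared bins over all occupied bins, and "best 4/5 votes" is read as "at least 4 of 5 comparisons pass a threshold", which is only one possible interpretation.

```python
from typing import Iterable, Sequence, Set, Tuple

Block = Tuple[float, float, float, float]   # (x0, y0, x1, y1), each in [0, 1]


def occupied_bins(blocks: Iterable[Block], m: int, n: int) -> Set[Tuple[int, int]]:
    """Divide the page into m rows by n columns; return bins touched by any block."""
    bins = set()
    for x0, y0, x1, y1 in blocks:
        c0, c1 = min(int(x0 * n), n - 1), min(int(x1 * n), n - 1)
        r0, r1 = min(int(y0 * m), m - 1), min(int(y1 * m), m - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                bins.add((r, c))
    return bins


def mxn_overlap(a: Iterable[Block], b: Iterable[Block], m: int = 8, n: int = 8) -> float:
    """Share of occupied bins that both pages have in common (assumed scoring)."""
    bins_a, bins_b = occupied_bins(a, m, n), occupied_bins(b, m, n)
    union = bins_a | bins_b
    return len(bins_a & bins_b) / len(union) if union else 1.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between two structure strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def declare_match(scores: Sequence[float], threshold: float, needed: int = 4) -> bool:
    """Assumed vote rule: match when at least `needed` scores reach the threshold."""
    return sum(1 for s in scores if s >= threshold) >= needed
```

A layout string such as "HVVH" is a hypothetical encoding of horizontal and vertical cuts; the actual MXY encoding is not given on the slide.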

Page 27: A Training and Classification System in Support of Automated Metadata Extraction

Sample Results

[Figure: results summary; panel: MXY Tree]

Page 28: A Training and Classification System in Support of Automated Metadata Extraction

Sample Results

[Figure: results summary; panels: MXY Tree, MXY Tree -MxN, MxN, Block Distance]

Page 29: A Training and Classification System in Support of Automated Metadata Extraction

Experimental Results

• Precision = #correct / #total in class
• Recall = #correct / #answers

Method          Precision  Recall
Block Distance  43%        92%
MxN             50%        92%
Vocab 1         43%        62%
Vocab 5         61%        64%
MXY tree        30%        94%
MXY + MxN       36%        96%
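The two ratios can be computed per class as in the sketch below, which follows the definitions exactly as stated on this slide. Note that in conventional information-retrieval usage the denominators are the other way around (precision divides by the number of answers, recall by the class size), so the labels here should be read with the slide's definitions in mind. The function name and the use of `None` for "no answer" are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def per_class_scores(true_labels: Sequence[str],
                     answers: Sequence[Optional[str]]) -> Dict[str, Tuple[float, float]]:
    """Return {class: (precision, recall)} using the slide's definitions.

    precision = #correct / #total in class
    recall    = #correct / #answers given for that class
    An answer of None means the classifier declared no match.
    """
    total_in_class = Counter(true_labels)
    answers_for = Counter(a for a in answers if a is not None)
    correct = Counter(t for t, a in zip(true_labels, answers) if a == t)

    scores = {}
    for cls, total in total_in_class.items():
        precision = correct[cls] / total
        recall = correct[cls] / answers_for[cls] if answers_for[cls] else 0.0
        scores[cls] = (precision, recall)
    return scores
```

Under this sketch, a class for which the classifier never answers scores 0% on both measures.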

Page 30: A Training and Classification System in Support of Automated Metadata Extraction

Experimental Results

Each cell shows Precision / Recall.

CLASS           Layout Distance  MxN Matching  Vocab Match 5 Page  Vocab Match 1 Page  MXY Tree     MXY Tree Plus MxN
ABSTRACT1-2COL  0% / 0%          0% / 0%       100% / 100%         100% / 100%         0% / 0%      0% / 0%
ATOM            0% / 0%          0% / 0%       100% / 100%         0% / 0%             0% / 0%      0% / 0%
AU              99% / 100%       97% / 100%    91% / 100%          97% / 100%          98% / 100%   99% / 100%
BOTTOM-BLOCK    13% / 100%       0% / 0%       50% / 80%           0% / 0%             0% / 0%      0% / 0%
CPRC            0% / 0%          17% / 100%    83% / 45%           0% / 0%             0% / 0%      0% / 0%
EAGLE-IMAGE     100% / 91%       100% / 100%   78% / 93%           0% / 0%             50% / 94%    94% / 100%
EAGLE-TEXT      100% / 100%      69% / 100%    54% / 100%          100% / 100%         100% / 93%   100% / 93%
ERDC            69% / 95%        92% / 100%    65% / 81%           54% / 93%           19% / 100%   46% / 100%
HORIZ           80% / 100%       80% / 100%    100% / 15%          0% / 0%             20% / 100%   80% / 100%
LOGI            15% / 100%       7% / 100%     26% / 100%          96% / 70%           11% / 60%    11% / 75%
RAND-ARC        0% / 0%          89% / 89%     56% / 83%           0% / 0%             0% / 0%      11% / 100%
RAND-ARROYO     50% / 86%        50% / 100%    75% / 82%           0% / 0%             0% / 0%      0% / 0%
RAND-ARROYO2    14% / 100%       68% / 79%     75% / 78%           71% / 80%           0% / 0%      0% / 0%
RAND-BRIEF1     33% / 100%       67% / 100%    100% / 100%         67% / 100%          33% / 100%   33% / 100%
RAND-BRIEF2     60% / 100%       90% / 100%    80% / 100%          45% / 100%          50% / 91%    60% / 86%
RAND-LEFT       0% / 0%          0% / 0%       83% / 100%          33% / 8%            0% / 0%      0% / 0%
RAND-NOTE       57% / 100%       79% / 100%    86% / 100%          100% / 100%         0% / 0%      0% / 0%
RANDTECH        50% / 73%        13% / 67%     81% / 57%           0% / 0%             0% / 0%      0% / 0%
RESEARCH        0% / 0%          0% / 0%       89% / 47%           67% / 67%           0% / 0%      0% / 0%
SIGNATUR        0% / 0%          0% / 0%       100% / 91%          100% / 100%         0% / 0%      0% / 0%
TOPLOG-2COL     0% / 0%          44% / 100%    56% / 100%          56% / 100%          0% / 0%      0% / 0%
WARCOLLEGE      0% / 0%          0% / 0%       100% / 71%          100% / 38%          0% / 0%      0% / 0%

• Precision = #correct / #total in class
• Recall = #correct / #answers

Page 31: A Training and Classification System in Support of Automated Metadata Extraction

Avenues of Investigation

• Implement and test a variety of methods
• Handling multiple page classification
• Multiple classifiers
  – Methods of combining
  – Weighting best for each class
  – Deriving signature for specifying combination rules
• Clustering methods for bootstrapping
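One way to picture "weighting best for each class" is a per-class weighted average of the individual classifiers' similarity scores, as in the hedged Python sketch below. The weighting scheme, the score range, and the method names used in the usage note are assumptions; the proposal leaves the actual combination rules as an open question.

```python
from typing import Dict, Optional

Scores = Dict[str, Dict[str, float]]    # {method: {class: similarity in [0, 1]}}
Weights = Dict[str, Dict[str, float]]   # {class: {method: weight}}


def combine_scores(scores_by_method: Scores, weights_by_class: Weights) -> Dict[str, float]:
    """Weighted average of method scores, with a separate weight set per class."""
    combined = {}
    for cls, weights in weights_by_class.items():
        total_weight = sum(weights.values())
        if total_weight == 0:
            continue    # no usable method for this class
        weighted = sum(w * scores_by_method.get(method, {}).get(cls, 0.0)
                       for method, w in weights.items())
        combined[cls] = weighted / total_weight
    return combined


def best_class(combined: Dict[str, float]) -> Optional[str]:
    """Class with the highest combined score, or None if nothing was scored."""
    return max(combined, key=combined.get) if combined else None
```

For example, a class that the vocabulary method handles poorly could give that method a weight of zero while another class splits its weight evenly between two methods.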

Page 32: A Training and Classification System in Support of Automated Metadata Extraction

Overview

Current section: Training System
  – Design Overview

Page 33: A Training and Classification System in Support of Automated Metadata Extraction

Training System

• Manual classification
  – Inaccurate
  – Time consuming
  – Not dynamic
• Need automated verification of extractions
  – Developers open original documents multiple times
  – Correctness open to interpretation
  – Regression testing only compares against previous extraction attempts
• Need to allow multiple template writers to interact
• Need to measure effects of individual templates against the whole

Page 34: A Training and Classification System in Support of Automated Metadata Extraction

Metadata Training System

[Figure: training system diagram. Labeled components: Training Docs, Clustered Docs, Candidates, Template Maker, Training Evaluator, BootStrap Classifier, Baseline Data Manager, Trained Pool, Baseline Data Pool, Templates and Classification Signatures, Production System]

Page 35: A Training and Classification System in Support of Automated Metadata Extraction

Persistence Layer

• Manage complete set of Training documents
  – Track Baseline data input by users
    • Allows for independent confirmation
    • Change tracking
  – Track subset of documents without templates developed
  – Track trained documents
• Allow for multiple access

Page 36: A Training and Classification System in Support of Automated Metadata Extraction

Baseline Data Manager

• GUI for establishing Baseline
  – Highlight and copy
  – Work on OCR to account for errors
• Provide auditing and tracking of changes

Page 37: A Training and Classification System in Support of Automated Metadata Extraction

Bootstrap Classifier

• Dynamic GUI to allow:
  – Differing classifiers
  – Clustering method
• User can flag documents to ignore
• User can designate single doc as matcher
• Provides output to Template Maker
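A simple way to imagine the bootstrap step is greedy clustering against a "matcher" document, sketched below in Python. This is an assumption-laden illustration, not the proposed design: documents are represented as sets of features, similarity is Jaccard overlap, the first member of each cluster serves as its matcher, and user-flagged documents are skipped.

```python
from typing import Callable, Dict, FrozenSet, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared features over all features; 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_cluster(docs: Dict[str, Set[str]],
                   similarity: Callable[[Set[str], Set[str]], float],
                   threshold: float,
                   ignored: FrozenSet[str] = frozenset()) -> List[List[str]]:
    """Group document ids; each cluster's first member acts as its matcher."""
    clusters: List[List[str]] = []
    for doc_id, features in docs.items():
        if doc_id in ignored:             # user flagged this document to ignore
            continue
        for cluster in clusters:
            matcher = docs[cluster[0]]
            if similarity(features, matcher) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])     # no cluster matched: start a new one
    # Largest clusters first: the most promising candidates for a new template.
    return sorted(clusters, key=len, reverse=True)
```

The largest clusters would then be the natural candidate groups to hand to the Template Maker.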

Page 38: A Training and Classification System in Support of Automated Metadata Extraction


Template Maker

Template being developed

Results for document

Page 39: A Training and Classification System in Support of Automated Metadata Extraction

Template Maker

• Final form depends on Engine replacement
• Sample documents come from Bootstrap Classifier
• Also prepares classification spec for export to production system

Page 40: A Training and Classification System in Support of Automated Metadata Extraction

Training Evaluator

• Verifies operation of template and classification spec
• Identifies other documents in the pool which are correctly extracted
  – Moved to trained pool
  – Removed from further bootstrapping
• Supports robust regression testing
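The pool bookkeeping described above might look like the following Python sketch, which compares a template's output against user-entered baseline metadata. The exact-equality check, the data shapes, and the name `evaluate_pool` are assumptions for illustration; a real evaluator would likely tolerate OCR noise when comparing fields.

```python
from typing import Callable, Dict, List, Tuple

Metadata = Dict[str, str]


def evaluate_pool(untrained: Dict[str, str],
                  baseline: Dict[str, Metadata],
                  template: Callable[[str], Metadata]) -> Tuple[List[str], List[str]]:
    """Split document ids into (trained, remaining).

    A document counts as trained when the template's extraction equals the
    baseline metadata recorded for it; documents without baseline data stay
    in the remaining pool.
    """
    trained, remaining = [], []
    for doc_id, text in untrained.items():
        if doc_id in baseline and template(text) == baseline[doc_id]:
            trained.append(doc_id)      # moved to the trained pool
        else:
            remaining.append(doc_id)    # stays available for bootstrapping
    return trained, remaining
```

Re-running the same check over the trained pool after a template changes gives a regression test against the baseline rather than against earlier extraction attempts.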

Page 41: A Training and Classification System in Support of Automated Metadata Extraction

Overview

Current section: Testing and Evaluation

Page 42: A Training and Classification System in Support of Automated Metadata Extraction

Testing and Evaluation

• Proposed Tests
  – Evaluate effectiveness of pre-classification module
  – Evaluate effectiveness of adding similarity score to validation
  – Evaluate the effectiveness of the bootstrap classification
  – End to End evaluation
• Other tests as needed

Page 43: A Training and Classification System in Support of Automated Metadata Extraction

Evaluate effectiveness of pre-classification module

• Use simple Baseline classifier to test
  – Select candidate templates
  – Adjust post hoc score

Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?

Page 44: A Training and Classification System in Support of Automated Metadata Extraction

Evaluate effectiveness of adding similarity score to validation

• Use Baseline classifier
• Determine a baseline cluster of 5 documents to serve as the “signature” targets for measuring similarity
• Apply score to final validation
• Remove templates to measure effects
• Assessment: Evaluate the percent of documents which are correctly flagged as resolved.

Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
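A hedged sketch of how a similarity score could be folded into the final validation decision is shown below. The averaging over signature documents, the linear blend, and the `weight` and `threshold` values are all hypothetical choices made for illustration; the proposal does not fix how the two scores are combined.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Doc = TypeVar("Doc")


def signature_similarity(doc: Doc,
                         signature_docs: Sequence[Doc],
                         similarity: Callable[[Doc, Doc], float]) -> float:
    """Mean similarity between a document and a class's signature documents."""
    if not signature_docs:
        raise ValueError("at least one signature document is required")
    return sum(similarity(doc, s) for s in signature_docs) / len(signature_docs)


def accept(validation_score: float,
           layout_similarity: float,
           weight: float = 0.5,
           threshold: float = 0.7) -> Tuple[bool, float]:
    """Blend both scores (assumed to lie in [0, 1]) and compare to a threshold.

    Returns (resolved, combined_score). weight and threshold are illustrative
    defaults, not values taken from the proposal.
    """
    combined = (1.0 - weight) * validation_score + weight * layout_similarity
    return combined >= threshold, combined
```

With a blend like this, an extraction that validates well but comes from a poorly matching layout can be routed to human review instead of being accepted.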

Page 45: A Training and Classification System in Support of Automated Metadata Extraction

Evaluate the effectiveness of the bootstrap classification

Measure the amount of time needed to run completely through the training collection

Isolate template development time by providing an appropriate template

Assessment: Compare the time needed to classify the documents to the manual method.

Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?

Page 46: A Training and Classification System in Support of Automated Metadata Extraction

End to End evaluation

Create a mini-collection by downloading 100 documents from DTIC. Assign two separate teams of trained template writers to create templates to correctly extract metadata from a minimum of 80 documents.

One team will perform the task using manual classification, a version of the Template Maker with the training system enhancements disabled and a production system (with no templates) for extraction. The other team will use the complete training system.

Assessment: The teams will use logs to record work time. We will evaluate logs to assess time usage and conduct interviews to compile observations and impressions of the system.

Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?

Page 47: A Training and Classification System in Support of Automated Metadata Extraction

Overview

Current section: Schedule

Page 48: A Training and Classification System in Support of Automated Metadata Extraction


Schedule