A Training and Classification System in Support of Automated Metadata Extraction


PhD Proposal
Paul K Flynn

14 May 2009

PhD Proposal – Paul K Flynn

Motivation

• bootstrap


Overview
• Background
  – Metadata Extraction Overview
  – System Description
  – Nonform classification
• Proposal
  – Classification: Background, Potential Methods, Complexities, Early experiments, Avenues of Investigation
  – Training System: Design Overview
• Testing and Evaluation
• Schedule




Metadata Extraction System
• Large, heterogeneous, evolving collections consisting of documents with diverse layout and structure
  – Defense Technical Information Center (DTIC)
  – U.S. Government Printing Office (GPO) – Environmental Protection Agency (EPA)
• Two types of targets
  – Forms – documents containing a standardized form filled out with metadata of interest
  – Non-forms – all others
• Developing a system which can be transitioned into a production system


Form Document


Nonform Document


Design concepts
• Heterogeneity
  – A new document is classified, assigning it to a group of documents of similar layout, which reduces the problem to multiple homogeneous collections
  – Associated with each class of document layouts is a template: a scripted description of how to associate blocks of text in the layout with metadata fields
• Evolution
  – New classes of documents are accommodated by writing a new template
  – Templates are comparatively simple and require no lengthy retraining
  – Potentially rapid response to changes in the collection
• Robustness
  – Use of validation techniques to detect extraction problems and to select templates



Architecture & Implementation

[Figure: system architecture. Input documents (PDF) pass through input processing and OCR, producing an XML model of each document. Form processing applies the form templates (sf298_1, sf298_2, ...); unresolved documents go to nonform processing, which applies the nonform templates (au, eagle, ...). Extracted metadata is cleaned in post-processing and checked by validation scripts. Trusted outputs become final metadata; untrusted metadata outputs go to human review and correction, which yields corrected metadata.]


Final Validation Classification


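Post-Hoc Classification

[Figure: post hoc classification data flow. An unresolved document is passed to Extract Metadata with the nonform templates (au, eagle, ...), producing candidate metadata sets. Select Best Metadata scores them against the validation spec (validation rules), and the selected metadata is passed on as clean XML to the final nonform output.]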

• Apply all templates to the document
  – Results in multiple candidate sets of metadata
• Score each candidate using the validator
  – Select the best-scoring set
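The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual engine: the templates are modeled as plain functions, and the `validate` rules, template names, and sample document are all hypothetical.

```python
import re
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]
Template = Callable[[str], Metadata]  # assumption: a template is a function over document text


def validate(candidate: Metadata) -> float:
    """Toy validator: fraction of illustrative checks that pass."""
    checks = [
        bool(candidate.get("title", "").strip()),                       # title is present
        bool(re.fullmatch(r"(19|20)\d{2}", candidate.get("date", ""))),  # date looks like a year
    ]
    return sum(checks) / len(checks)


def post_hoc_classify(document: str, templates: Dict[str, Template]) -> Tuple[str, Metadata, float]:
    """Apply every template, score each candidate, and keep the best-scoring set."""
    scored = []
    for name, extract in templates.items():
        candidate = extract(document)
        scored.append((validate(candidate), name, candidate))
    best_score, best_name, best_candidate = max(scored, key=lambda item: item[0])
    return best_name, best_candidate, best_score


def title_first(document: str) -> Metadata:
    lines = document.splitlines()
    return {"title": lines[0], "date": lines[-1]}


def title_last(document: str) -> Metadata:
    lines = document.splitlines()
    return {"title": lines[-1], "date": lines[0]}


TEMPLATES = {"title_first": title_first, "title_last": title_last}
DOCUMENT = "Sensor Fusion Study\nJ. Smith\n2004"

name, metadata, score = post_hoc_classify(DOCUMENT, TEMPLATES)
print(name, metadata, score)
```

Every template is run on every document, so the cost grows with the number of templates, and the choice rests entirely on the validator's score.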


Post hoc classification shortcomings

[Figure: two extraction results shown side by side, labeled "Correct" and "Selected".]


Classification (a priori)

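[Figure: a priori classification data flow. An unresolved document goes to Classify (select best template), which picks one of the nonform templates (au, eagle, ...). Extract Metadata applies the selected template, and the extracted metadata is passed on as clean XML to the final nonform output.]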

• Replace Post hoc classification alone with a new classification module

• Continue to use Validator to provide semantic verification of extracts


Focus: Non-Form Processing

• Classification – compare the document against known document layouts
  – Select the template written for the closest matching layout
• Apply the non-form extraction engine to the document and template
• Send the result to the validator for scoring
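The three steps above might be wired together as follows. This is only a sketch: the layout features (a tuple of text-block counts), the L1 distance, and the stand-in templates and validator are assumptions made for illustration. Only the template names `au` and `eagle` come from the slides.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[int]   # hypothetical layout features, e.g. text-block counts per page third
Metadata = Dict[str, str]


def layout_distance(a: Features, b: Features) -> int:
    """L1 distance between two feature vectors (an illustrative choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(features: Features, known_layouts: Dict[str, Features]) -> str:
    """Return the name of the closest known layout."""
    return min(known_layouts, key=lambda name: layout_distance(features, known_layouts[name]))


def process_nonform(
    features: Features,
    text: str,
    known_layouts: Dict[str, Features],
    templates: Dict[str, Callable[[str], Metadata]],
    validate: Callable[[Metadata], float],
) -> Tuple[str, Metadata, float]:
    layout_class = classify(features, known_layouts)   # 1. classification
    metadata = templates[layout_class](text)           # 2. extraction with the selected template
    return layout_class, metadata, validate(metadata)  # 3. validator score


KNOWN_LAYOUTS = {"au": (3, 1, 0), "eagle": (1, 2, 2)}
TEMPLATES = {
    "au": lambda text: {"title": text.splitlines()[0]},
    "eagle": lambda text: {"title": text.splitlines()[-1]},
}


def has_title(metadata: Metadata) -> float:
    return 1.0 if metadata.get("title") else 0.0


result = process_nonform((3, 1, 1), "Example Report Title\nA. Author",
                         KNOWN_LAYOUTS, TEMPLATES, has_title)
print(result)
```

Unlike the post hoc approach, only one template is applied per document; the validator still checks the extract afterwards.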


Extraction System








Proposal
• Investigate classification methodologies and implementation
• Create a Training System for managing and creating templates
• Specific questions we will attempt to answer:
  – Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
  – Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
  – Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?
  – Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?








Previous Research
• Extensive coverage in literature
  – Model and methodology match purpose
  – No universal classifier
  – Many experiments use limited number of classes
• Use visual similarity, logical structures, or textual features
  – Machine learning
  – Decision trees
  – Distance measures
• Multiple classifiers


Issues and Complexities
• Self-imposed constraints
  – Must be simple to maintain
  – Automated process for deploying from Training System
  – Avoid adding dependency on additional 3rd party packages


Unpredictable OCR Results


Unpredictable Page Segmentation


Second page relevancy

Page 1

Page 2


Manual Classification

• Documents may appear visually similar at thumbnail scale

• Closer inspection reveals semantic differences


Incorrect Manual Classification

Position of Date Field different – detectable by post hoc classification


Initial experiments
• Methods tested
  – Block distances – tries to match blocks and measure distances
  – MxN Overlap – divides the page into bins and counts matches
  – Common Vocab – finds common words in pages of the training class
    • Vocab1 – looks at only the 1st page
    • Vocab5 – looks at the 1st five pages
  – MXY tree – variant encodes structure as a string, uses edit distance to measure similarity
  – MXY + MxN – sums the two methods
• Used best 4/5 votes to declare a match
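As one illustration of the methods listed above, the MxN Overlap idea can be sketched as follows. The slides do not give the details, so several things here are assumptions: blocks are bounding boxes in page fractions, the match count is normalized by the union of occupied bins, and the "4/5 votes" rule is read as "at least four of five training exemplars agree". The 8x8 grid and the 0.5 threshold are likewise arbitrary.

```python
from typing import Sequence, Set, Tuple

Block = Tuple[float, float, float, float]  # (x0, y0, x1, y1) as fractions of page size


def _bin(value: float, count: int) -> int:
    """Map a page fraction in [0, 1] to a bin index in [0, count - 1]."""
    return min(max(int(value * count), 0), count - 1)


def occupied_bins(blocks: Sequence[Block], m: int = 8, n: int = 8) -> Set[Tuple[int, int]]:
    """Divide the page into m rows by n columns and mark bins touched by any block."""
    bins = set()
    for x0, y0, x1, y1 in blocks:
        for row in range(_bin(y0, m), _bin(y1, m) + 1):
            for col in range(_bin(x0, n), _bin(x1, n) + 1):
                bins.add((row, col))
    return bins


def mxn_overlap(a: Sequence[Block], b: Sequence[Block], m: int = 8, n: int = 8) -> float:
    """Matching bins divided by all occupied bins (normalization is an assumption)."""
    bins_a, bins_b = occupied_bins(a, m, n), occupied_bins(b, m, n)
    union = bins_a | bins_b
    return len(bins_a & bins_b) / len(union) if union else 0.0


def matches_class(document: Sequence[Block], exemplars: Sequence[Sequence[Block]],
                  threshold: float = 0.5, votes_needed: int = 4) -> bool:
    """Declare a match when enough exemplars of the class vote for the document."""
    votes = sum(mxn_overlap(document, exemplar) >= threshold for exemplar in exemplars)
    return votes >= votes_needed


TITLE_AT_TOP = [(0.1, 0.05, 0.9, 0.15)]
TITLE_AT_BOTTOM = [(0.1, 0.80, 0.9, 0.90)]

print(mxn_overlap(TITLE_AT_TOP, TITLE_AT_BOTTOM))
print(matches_class(TITLE_AT_TOP, [TITLE_AT_TOP] * 4 + [TITLE_AT_BOTTOM]))
```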


Sample Results Summary

[Figure: sample result panels for MXY Tree, MXY Tree + MxN, MxN, and Block Distance.]


Experimental Results

• Precision = #correct / #total in class
• Recall = #correct / #answers

Method          Precision  Recall
Block Distance  43%        92%
MxN             50%        92%
Vocab 1         43%        62%
Vocab 5         61%        64%
MXY tree        30%        94%
MXY + MxN       36%        96%
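The two ratios can be computed per class as sketched below, using the definitions exactly as the slide states them. Note that conventional information-retrieval usage gives these two ratios the opposite names. The labels in the example are made up, and treating a class with no answers as 0% is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def per_class_scores(truth: List[str],
                     answers: List[Optional[str]]) -> Dict[str, Tuple[float, float]]:
    """Return {class: (precision, recall)} using the slide's definitions.

    precision = #correct / #total in class
    recall    = #correct / #answers   (0.0 when the classifier gave no answers)
    """
    scores = {}
    for cls in sorted(set(truth)):
        total_in_class = sum(t == cls for t in truth)
        n_answers = sum(a == cls for a in answers)
        correct = sum(t == cls and a == cls for t, a in zip(truth, answers))
        precision = correct / total_in_class
        recall = correct / n_answers if n_answers else 0.0
        scores[cls] = (precision, recall)
    return scores


# Hypothetical labels: None means the classifier declared no match.
TRUTH = ["au", "au", "au", "au", "eagle", "eagle"]
ANSWERS = ["au", "au", None, "eagle", "eagle", "eagle"]

for cls, (precision, recall) in per_class_scores(TRUTH, ANSWERS).items():
    print(f"{cls}: precision={precision:.0%} recall={recall:.0%}")
```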


Experimental Results

Each cell shows Precision / Recall.

CLASS           Layout Distance     MxN Matching        Vocab Match 5 Page  Vocab Match 1 Page  MXY Tree            MXY Tree Plus MxN
ABSTRACT1-2COL  0% / 0%             0% / 0%             100% / 100%         100% / 100%         0% / 0%             0% / 0%
ATOM            0% / 0%             0% / 0%             100% / 100%         0% / 0%             0% / 0%             0% / 0%
AU              99% / 100%          97% / 100%          91% / 100%          97% / 100%          98% / 100%          99% / 100%
BOTTOM-BLOCK    13% / 100%          0% / 0%             50% / 80%           0% / 0%             0% / 0%             0% / 0%
CPRC            0% / 0%             17% / 100%          83% / 45%           0% / 0%             0% / 0%             0% / 0%
EAGLE-IMAGE     100% / 91%          100% / 100%         78% / 93%           0% / 0%             50% / 94%           94% / 100%
EAGLE-TEXT      100% / 100%         69% / 100%          54% / 100%          100% / 100%         100% / 93%          100% / 93%
ERDC            69% / 95%           92% / 100%          65% / 81%           54% / 93%           19% / 100%          46% / 100%
HORIZ           80% / 100%          80% / 100%          100% / 15%          0% / 0%             20% / 100%          80% / 100%
LOGI            15% / 100%          7% / 100%           26% / 100%          96% / 70%           11% / 60%           11% / 75%
RAND-ARC        0% / 0%             89% / 89%           56% / 83%           0% / 0%             0% / 0%             11% / 100%
RAND-ARROYO     50% / 86%           50% / 100%          75% / 82%           0% / 0%             0% / 0%             0% / 0%
RAND-ARROYO2    14% / 100%          68% / 79%           75% / 78%           71% / 80%           0% / 0%             0% / 0%
RAND-BRIEF1     33% / 100%          67% / 100%          100% / 100%         67% / 100%          33% / 100%          33% / 100%
RAND-BRIEF2     60% / 100%          90% / 100%          80% / 100%          45% / 100%          50% / 91%           60% / 86%
RAND-LEFT       0% / 0%             0% / 0%             83% / 100%          33% / 8%            0% / 0%             0% / 0%
RAND-NOTE       57% / 100%          79% / 100%          86% / 100%          100% / 100%         0% / 0%             0% / 0%
RANDTECH        50% / 73%           13% / 67%           81% / 57%           0% / 0%             0% / 0%             0% / 0%
RESEARCH        0% / 0%             0% / 0%             89% / 47%           67% / 67%           0% / 0%             0% / 0%
SIGNATUR        0% / 0%             0% / 0%             100% / 91%          100% / 100%         0% / 0%             0% / 0%
TOPLOG-2COL     0% / 0%             44% / 100%          56% / 100%          56% / 100%          0% / 0%             0% / 0%
WARCOLLEGE      0% / 0%             0% / 0%             100% / 71%          100% / 38%          0% / 0%             0% / 0%



Avenues of Investigation
• Implement and test a variety of methods
• Handling multiple page classification
• Multiple classifiers
  – Methods of combining
  – Weighting best for each class
  – Deriving signature for specifying combination rules
• Clustering methods for bootstrapping
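One possible way to combine multiple classifiers with per-class weighting is a weighted sum of each method's score, sketched below. The proposal leaves the combination rule open, so this is only one candidate; the method names, scores, and weights are invented for illustration.

```python
from typing import Dict

Scores = Dict[str, float]  # class name -> score in [0, 1]


def combine(method_scores: Dict[str, Scores], class_weights: Dict[str, Scores]) -> Scores:
    """Weighted sum per class: each method's score counts more for classes it handles well."""
    classes = {cls for scores in method_scores.values() for cls in scores}
    return {
        cls: sum(
            class_weights[method].get(cls, 0.0) * scores.get(cls, 0.0)
            for method, scores in method_scores.items()
        )
        for cls in classes
    }


# Hypothetical similarity scores for one document, per method and class.
METHOD_SCORES = {
    "mxn": {"AU": 0.6, "RAND-NOTE": 0.5},
    "vocab5": {"AU": 0.3, "RAND-NOTE": 0.9},
}
# Hypothetical weights, e.g. derived from how well each method did on each class in training.
CLASS_WEIGHTS = {
    "mxn": {"AU": 0.9, "RAND-NOTE": 0.2},
    "vocab5": {"AU": 0.1, "RAND-NOTE": 0.8},
}

combined = combine(METHOD_SCORES, CLASS_WEIGHTS)
print(max(combined, key=combined.get), combined)
```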









Training System
• Manual classification
  – Inaccurate
  – Time consuming
  – Not dynamic
• Need automated verification of extractions
  – Developers open original documents multiple times
  – Correctness open to interpretation
  – Regression testing only compares against previous extraction attempts
• Need to allow multiple template writers to interact
• Need to measure effects of individual templates against the whole


Metadata Training System

[Figure: Metadata Training System. Components shown: Training Docs, Bootstrap Classifier, Clustered Docs, Candidates, Template Maker, Training Evaluator, Baseline Data Manager, Baseline Data Pool, Trained Pool, Templates and Classification Signatures, and the Production System.]


Persistence Layer
• Manage complete set of training documents
• Track baseline data input by users
  – Allows for independent confirmation
  – Change tracking
• Track subset of documents without templates developed
• Track trained documents
• Allow for multiple access


Baseline Data Manager
• GUI for establishing baseline
  – Highlight and copy
  – Work on OCR to account for errors
• Provide auditing and tracking of changes


Bootstrap Classifier
• Dynamic GUI to allow:
  – Differing classifiers
  – Clustering method
• User can flag documents to ignore
• User can designate single doc as matcher
• Provides output to Template Maker


Template Maker

[Figure: Template Maker screenshot showing the template being developed and the results for a document.]


Template Maker
• Final form depends on Engine replacement
• Sample documents come from Bootstrap Classifier
• Also prepares classification spec for export to production system


Training Evaluator
• Verifies operation of template and classification spec
• Identifies other documents in the pool which are correctly extracted
  – Moved to trained pool
  – Removed from further bootstrapping
• Supports robust regression testing
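The pool bookkeeping described for the Training Evaluator could look roughly like this. It is a sketch under stated assumptions: "correctly extracted" is taken to mean an exact match against the user-entered baseline data, and the template, pool, and baseline contents are hypothetical.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]


def evaluate_template(
    template: Callable[[str], Metadata],
    untrained_pool: Dict[str, str],
    baseline: Dict[str, Metadata],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split the pool into documents the template extracts correctly and the rest.

    A document counts as correct only if baseline data exists for it and the
    extracted metadata matches that baseline exactly (an assumption).
    """
    trained, remaining = {}, {}
    for doc_id, text in untrained_pool.items():
        expected = baseline.get(doc_id)
        if expected is not None and template(text) == expected:
            trained[doc_id] = text      # moved to the trained pool
        else:
            remaining[doc_id] = text    # stays available for further bootstrapping
    return trained, remaining


def first_line_title(text: str) -> Metadata:
    return {"title": text.splitlines()[0]}


POOL = {"d1": "Report A\nbody", "d2": "body\nReport B", "d3": "Report C\nbody"}
BASELINE = {
    "d1": {"title": "Report A"},
    "d2": {"title": "Report B"},
    "d3": {"title": "Report C"},
}

trained, remaining = evaluate_template(first_line_title, POOL, BASELINE)
print(sorted(trained), sorted(remaining))
```

Because the comparison is against independently entered baseline data rather than a previous extraction run, rerunning it after a template change acts as a regression test.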









Testing and Evaluation
• Proposed tests
  – Evaluate effectiveness of pre-classification module
  – Evaluate effectiveness of adding similarity score to validation
  – Evaluate the effectiveness of the bootstrap classification
  – End to End evaluation
• Other tests as needed


Evaluate effectiveness of pre-classification module

• Use simple Baseline classifier to test
  – Select candidate templates
  – Adjust post hoc score

Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
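A pre-classification step of this kind might shortlist templates by layout similarity and then adjust the post hoc score for the shortlisted ones. The sketch below assumes both scores are already available per template and multiplies them; the adjustment rule, the shortlist size `k`, and all numbers are hypothetical.

```python
import heapq
from typing import Dict, List


def candidate_templates(similarity: Dict[str, float], k: int = 3) -> List[str]:
    """Keep the k templates whose layouts are most similar to the document."""
    return heapq.nlargest(k, similarity, key=similarity.get)


def adjusted_scores(validation: Dict[str, float],
                    similarity: Dict[str, float], k: int = 3) -> Dict[str, float]:
    """Post hoc validation score scaled by layout similarity, for shortlisted templates only."""
    return {name: validation[name] * similarity[name]
            for name in candidate_templates(similarity, k)}


# Hypothetical scores for one document.
SIMILARITY = {"au": 0.9, "eagle": 0.2, "erdc": 0.7}
VALIDATION = {"au": 0.8, "eagle": 1.0, "erdc": 0.5}

adjusted = adjusted_scores(VALIDATION, SIMILARITY, k=2)
print(max(VALIDATION, key=VALIDATION.get))  # choice by validation score alone
print(max(adjusted, key=adjusted.get))      # choice after pre-classification
```

In this made-up case the validator alone would pick `eagle`, while the shortlist excludes it because its layout is dissimilar.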


Evaluate effectiveness of adding similarity score to validation

• Use Baseline classifier
• Determine a baseline cluster of 5 documents to serve as the “signature” targets for measuring similarity
• Apply score to final validation
• Remove templates to measure effects
• Assessment: Evaluate the percent of documents which are correctly flagged as resolved.

Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
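One way the similarity score could feed into final validation is sketched below: the document's layout is compared against the five signature documents, and the average is blended with the validator's score. The layout representation (sets of occupied grid cells), the Jaccard measure, the equal weighting, and the 0.7 acceptance threshold are all assumptions, since the proposal leaves the combination open.

```python
from typing import FrozenSet, Sequence, Tuple

Layout = FrozenSet[Tuple[int, int]]  # hypothetical: occupied (row, col) cells of a page grid


def jaccard(a: Layout, b: Layout) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature_similarity(layout: Layout, signature: Sequence[Layout]) -> float:
    """Average similarity to the signature documents of a class."""
    return sum(jaccard(layout, target) for target in signature) / len(signature)


def accept(validation_score: float, similarity_score: float,
           weight: float = 0.5, threshold: float = 0.7) -> Tuple[bool, float]:
    """Blend the two scores and flag the document as resolved if the blend is high enough."""
    combined = (1 - weight) * validation_score + weight * similarity_score
    return combined >= threshold, combined


CLASS_LAYOUT: Layout = frozenset({(0, 0), (0, 1), (1, 0)})
SIGNATURE = [CLASS_LAYOUT] * 5                 # baseline cluster of 5 documents
OTHER_LAYOUT: Layout = frozenset({(5, 5)})

print(accept(0.6, signature_similarity(CLASS_LAYOUT, SIGNATURE)))
print(accept(0.9, signature_similarity(OTHER_LAYOUT, SIGNATURE)))
```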


Evaluate the effectiveness of the bootstrap classification

Measure the amount of time needed to run completely through the training collection

Isolate template development time by providing appropriate template

Assessment: Compare the time needed to classify the documents to the manual method.

Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?


End to End evaluation

Create a mini-collection by downloading 100 documents from DTIC. Assign two separate teams of trained template writers to create templates to correctly extract metadata from a minimum of 80 documents.

One team will perform the task using manual classification, a version of the Template Maker with the training system enhancements disabled and a production system (with no templates) for extraction. The other team will use the complete training system.

Assessment: The teams will use logs to record work time. We will evaluate logs to assess time usage and conduct interviews to compile observations and impressions of the system.

Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?









Schedule