A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System

Alan Wessman
Brigham Young University
MS Thesis Defense

Based in part on research funded by the National Science Foundation.

Presentation Overview

  • Background of legacy Ontos
  • Assumptions, challenges, concerns
  • Framework as solution
  • Explain framework
  • Explain reference implementation
  • Evaluation of system
  • Future work and conclusion

Data Extraction

  • Goals of data extraction
    • Find relevant data in unstructured or semi-structured documents
    • Map extracted data to a formal structure
  • Approaches
    • Wrappers (ROADRUNNER, TSIMMIS)
    • NLP and machine learning (RAPIER, WHISK)
    • Ontologies (Ontos)

Ontos

  • Developed by Data Extraction Group (DEG) at BYU
  • Based on OSM ontologies and data frames
  • Focuses on multiple-record extraction
  • Good precision/recall
  • Resilient to document changes

How Ontos Works

Ontos Assumptions

  • OSML ontologies
  • Single- or multiple-record text documents
  • Each document/record relevant to domain
  • Heuristics produce accurate mappings
  • Output to relational database

Some Current Challenges

Challenge                        Example
New/evolving ontology features   Enhanced data frames
Variety of documents             PDF, plaintext, XML
Content filtering                Extract from certain HTML attributes (ALT, SRC, HREF)
Locating values                  On-the-fly lexicon
Optimizing mappings              Better heuristics; HMM-based mapping

Architectural Concerns

  • Variety of technologies
  • Different OSM representations
  • Highly coupled code
  • Difficult to install elsewhere
  • Difficult to upgrade or extend

Thesis Statement

A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research.

We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.

Frameworks

  • Abstract architecture
  • Decouple independent functions
  • Define interfaces
  • Use abstract classes, interfaces, declarative configuration files
  • Allow quick adjustment of system settings without re-coding
  • Make a system customizable

Image from http://www.mcoe.org

Creating an Extraction Framework

  • Analyze systems
  • Generalize functionality
  • Define interfaces
  • Create supporting code
  • Document framework

[Figure: framework class diagram. DataExtractionEngine (public void doExtraction()) and ExtractionAlgorithm (execute()) are joined by a "uses" relationship. The components listed are ExtractionPlan, DocumentRetriever, DocumentStructureRecognizer, DocumentStructureParser, ContentFilter, ValueRecognizer, ValueMapper, and OntologyWriter, annotated "Dynamically loaded components" and "Config parameters".]
Managing the Process

  • DataExtractionEngine
    • Main class
    • Initialize, perform extraction, finalize
  • ExtractionPlan
    • Defines order of steps in the extraction process
    • Can be imperative, declarative, or dynamic (like SQL execution plan)
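One way to picture the split between engine and plan is the sketch below: the engine owns initialization and finalization, while the plan owns the order of steps. The step names, the log parameter, and the SequentialPlan class are assumptions for illustration; the slides name only DataExtractionEngine, doExtraction(), ExtractionPlan, and execute().

```java
import java.util.ArrayList;
import java.util.List;

// The plan decides which steps run and in what order (assumed signature).
interface ExtractionPlan {
    void execute(List<String> log);
}

// An imperative plan: the step order is fixed in code.
class SequentialPlan implements ExtractionPlan {
    public void execute(List<String> log) {
        String[] steps = {"retrieve", "parse", "filter", "recognize", "map", "write"};
        for (String step : steps) {
            log.add(step); // a real plan would invoke the matching component here
        }
    }
}

class DataExtractionEngine {
    private final ExtractionPlan plan;
    private final List<String> log = new ArrayList<>();

    DataExtractionEngine(ExtractionPlan plan) {
        this.plan = plan;
    }

    // Initialize, delegate the extraction order to the plan, then finalize.
    public void doExtraction() {
        log.add("initialize");
        plan.execute(log);
        log.add("finalize");
    }

    List<String> getLog() {
        return log;
    }
}
```

A declarative or dynamic plan would implement the same interface but derive its step list from configuration or from properties of the input rather than from a hard-coded array.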

Handling Documents

  • DocumentRetriever
    • Responsible for locating relevant documents
    • Search engine, local filesystem, CMS
  • DocumentStructureRecognizer
    • Decides which DocumentStructureParser to use
  • DocumentStructureParser
    • Breaks document into individual records or sub-documents
    • Record separator, table analyzer
  • ContentFilter
    • Normalizes document text
    • Strips out unwanted markup, stopwords, etc.

[Figure: component interfaces]

DocumentRetriever (input: URI)
    public Iterator retrieveDocuments()

DocumentStructureRecognizer (selects a Single, Multi, Tabular, Hierarchical, ... parser)
    public DocumentStructureParser getDocumentParser(Document doc)

DocumentStructureParser
    public Document parse(Document doc)

ContentFilter
    public Document filterDocument(Document doc)
    Example: <p>Price:<br>$452.00</p> becomes Price:$452.00

ValueRecognizer
    public void findValues(Ontology ont, Document doc)
    Example: given the ontology and Price:$452.00, "Price:" is marked as a keyword and "$452.00" as a value
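The ContentFilter example in the figure, where <p>Price:<br>$452.00</p> becomes Price:$452.00, can be reproduced with a very small filter. The sketch below assumes a minimal Document class holding only text. The real framework type is not shown in the slides, and the tag-stripping regex is a naive stand-in for whatever the actual filter does.

```java
// Minimal stand-in for the framework's Document type (assumed shape).
class Document {
    private final String text;

    Document(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

// Signature follows the slide: filterDocument takes and returns a Document.
interface ContentFilter {
    Document filterDocument(Document doc);
}

class TagStrippingFilter implements ContentFilter {
    public Document filterDocument(Document doc) {
        // Naive: deletes everything between '<' and '>'.
        // It does not decode entities or skip script/style content.
        return new Document(doc.getText().replaceAll("<[^>]*>", ""));
    }
}
```

Returning a new Document rather than mutating the input keeps each pipeline stage free of side effects, which makes filters easy to chain or reorder.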

Extracting Values

  • ValueRecognizer
    • Uses matching rules defined in ontology
    • Produces set of candidate matches (like data record table)
  • ValueMapper
    • Accepts or rejects candidate matches
    • Assigns accepted matches to elements of the ontology (e.g., object sets)
  • OntologyWriter
    • Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
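The recognize-then-map division can be sketched as two small classes: one that turns every regex match into a candidate, and one that accepts a single candidate and rejects the rest. The Candidate class, the class names, and the "nearest value after the keyword" rule are assumptions chosen for brevity; they are not the heuristics Ontos actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A candidate match: the object set it may belong to, the matched text, and its offset.
class Candidate {
    final String objectSet;
    final String text;
    final int start;

    Candidate(String objectSet, String text, int start) {
        this.objectSet = objectSet;
        this.text = text;
        this.start = start;
    }
}

class RegexValueRecognizer {
    // Every match of the value pattern becomes a candidate; nothing is rejected here.
    static List<Candidate> findValues(String objectSet, Pattern valuePattern, String text) {
        List<Candidate> candidates = new ArrayList<>();
        Matcher m = valuePattern.matcher(text);
        while (m.find()) {
            candidates.add(new Candidate(objectSet, m.group(), m.start()));
        }
        return candidates;
    }
}

class NearestKeywordMapper {
    // Accepts the candidate starting soonest after the first keyword match.
    // Returns null when the keyword is absent or no candidate follows it.
    static Candidate accept(List<Candidate> candidates, Pattern keyword, String text) {
        Matcher k = keyword.matcher(text);
        if (!k.find()) {
            return null;
        }
        int keywordEnd = k.end();
        Candidate best = null;
        for (Candidate c : candidates) {
            boolean afterKeyword = c.start >= keywordEnd;
            if (afterKeyword && (best == null || c.start < best.start)) {
                best = c;
            }
        }
        return best;
    }
}
```

For the text "Was $500.00. Price: $452.00", the recognizer reports both dollar amounts as candidates for Price, and the mapper keeps only the one that follows the "Price:" keyword.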

Implementing the Framework

[Figure: the framework pipeline alongside the classes the reference implementation supplies for each stage.]

Framework element        Reference implementation
Application ontology     OSMX ontology
Source descriptor        URI
Document retriever       LocalDocumentRetriever
Document                 DOMDocument, TextDocument
Structure recognizer     (no DocumentStructureRecognizer)
Structure parser         FanoutRecordSeparator
Content filter           HTMLFilter
Value recognizer         DataFrameMatcher
Value mapper             HeuristicBasedMapper
Ontology writer          ObjectRelationshipWriter
Structure output         (no structural output)
Data output              HTML representation

The application ontology supplies object sets, relationship sets, and constraints, along with value and keyword matching rules. The value recognizer produces candidate matches; the value mapper produces extracted objects and relationships.

OSMX

  • Legacy Ontos: OSML
  • OntologyEditor: OSM.dtd
  • New standard is OSMX
    • XML Schema (better constraints; validation)
    • JAXB generates corresponding Java classes
    • Common language for DEG tools
    • Allows data to be stored inline with model

Managing the Process

  • OntosEngine
    • Main class for Ontos system
    • Takes parameters from command line or configuration file
  • OntosExtractionPlan
    • Sequentially retrieves, parses, filters, and extracts from individual documents
    • Imperative (hard-coded) algorithm

Handling Documents

  • LocalDocumentRetriever
    • Retrieves documents from local filesystem
    • Filename filter excludes irrelevant files
  • FanoutRecordSeparator
    • Implements DocumentStructureParser
    • Locates record boundaries and creates sub-documents
  • HTMLFilter
    • Removes all HTML markup from documents
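A local retriever with a filename filter could look roughly like the sketch below, which uses the standard java.io.File.listFiles(FilenameFilter) call. The class name LocalRetrieverSketch, the suffix-based filter, and the Iterator<File> return type are assumptions; the slides show only an untyped Iterator and do not say how the actual filter decides relevance.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class LocalRetrieverSketch {
    private final File directory;
    private final String suffix;

    LocalRetrieverSketch(File directory, String suffix) {
        this.directory = directory;
        this.suffix = suffix.toLowerCase();
    }

    // Returns files whose names end with the suffix, sorted by pathname.
    // A missing or unreadable directory yields an empty iterator.
    public Iterator<File> retrieveDocuments() {
        File[] files = directory.listFiles((dir, name) -> name.toLowerCase().endsWith(suffix));
        List<File> result = new ArrayList<>();
        if (files != null) {
            result.addAll(Arrays.asList(files));
        }
        Collections.sort(result);
        return result.iterator();
    }
}
```

For example, new LocalRetrieverSketch(new File("obituaries"), ".html") would iterate over the HTML files in a hypothetical obituaries directory and skip everything else.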

Recognizing Values: DataFrameMatcher

  • Uses data frame enhancements:
    • Keyword affinity (left and right)
    • Require context for left, right, or both
    • Value phrase-specific keywords
    • Link matches back to specific patterns
  • Other improvements:
    • Consistent regular expression handling
    • Unlimited recursive macro definition
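Recursive macro definition can be illustrated with a small expander in which macros may reference other macros to any depth. The {Name} reference syntax, the MacroExpander class, and the cycle check are assumptions for this sketch; the slides do not show how data frame macros are actually written or resolved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MacroExpander {
    // Letters-only names, so regex quantifiers such as {2} or {1,2} are left alone.
    private static final Pattern REF = Pattern.compile("\\{([A-Za-z]+)\\}");
    private final Map<String, String> macros = new HashMap<>();

    void define(String name, String body) {
        macros.put(name, body);
    }

    // Replaces every {Name} reference with its fully expanded body.
    String expand(String expr) {
        return expand(expr, new ArrayDeque<>());
    }

    private String expand(String expr, Deque<String> inProgress) {
        Matcher m = REF.matcher(expr);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String name = m.group(1);
            String body = macros.get(name);
            if (body == null) {
                throw new IllegalArgumentException("Undefined macro: " + name);
            }
            if (inProgress.contains(name)) {
                throw new IllegalArgumentException("Circular macro: " + name);
            }
            out.append(expr, last, m.start());
            inProgress.push(name);
            // Wrap in a non-capturing group so alternations stay contained.
            out.append("(?:").append(expand(body, inProgress)).append(")");
            inProgress.pop();
            last = m.end();
        }
        out.append(expr, last, expr.length());
        return out.toString();
    }
}
```

With Month defined as Jan|Feb|Oct|Nov, Day as \d{1,2}, and Date as {Month} {Day}, \d{4}, expanding {Date} yields a single regular expression that matches a string such as "Oct 7, 2004".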

Mapping Values: HeuristicBasedMapper

  • New algorithm
  • Fully recursive with respect to ontology structure
  • ContextualHeuristic generates objects
  • Connection-based heuristics (singleton, nested-group, etc.) generate relationships
  • See paper for additional details

Output

  • Human-readable HTML format
  • Easier to count correct, partial, incorrect mappings

DeceasedPerson osmx3113

  • has DeceasedName Sandoval, Ernesto J.
  • has DeathDate October 7, 2004
  • has BirthDate November 9, 1923
  • has Age 63
  • DeceasedPerson has Relationship to RelativeName
    • RelativeName Agullar Sandoval
    • Relationship daughter
  • DeceasedPerson has Relationship to RelativeName
    • RelativeName Lalo Sandoval
    • Relationship brother

Using the Framework and Reference Implementation

  • Adding new features
    • Create new implementation classes
    • Extend (subclass) existing implementations
  • Switching feature set
    • Change class name in config file
    • Override class on command line
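The precedence between the two switching mechanisms, where the config file supplies a default and the command line overrides it, can be sketched as below. The key=value argument form, the valueMapper key, and the ExperimentalMapper class name are hypothetical; the slides do not show the actual option syntax.

```java
import java.util.Properties;

class ComponentConfig {
    // Starts from the file-provided settings, then lets key=value arguments override them.
    static Properties resolve(Properties fileConfig, String[] args) {
        Properties merged = new Properties();
        merged.putAll(fileConfig);
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq > 0) {
                merged.setProperty(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties fileConfig = new Properties();
        fileConfig.setProperty("valueMapper", "HeuristicBasedMapper"); // default from config file
        Properties effective = resolve(fileConfig, new String[] {"valueMapper=ExperimentalMapper"});
        System.out.println(effective.getProperty("valueMapper"));
    }
}
```

Keeping the override on the command line lets a single experiment run with a different mapper without editing the shared configuration file.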

Evaluating the Framework

Object set                  New Ontos            Legacy Ontos
                            Recall  Precision    Recall  Precision
Age                         60%     50%          57%     38%
FuneralDate                 68%     76%          63%     75%
Viewing                     80%     63%          93%     18%
Relationship/RelativeName   74%     43%          73%     41%

Four of eighteen object sets shown above.

Data from Salt Lake Tribune and Arizona Daily Star

Input:

  • Obituaries ontology
  • 25 obituaries from two newspapers
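For readers less familiar with the two measures in the table, the sketch below computes them from raw counts using the standard definitions. The class name and parameter names are illustrative, and the slides do not say how partially correct mappings were scored, so partial matches are not modelled here.

```java
class ExtractionScore {
    // Recall: the share of values that should have been extracted that actually were.
    static double recall(int correctlyExtracted, int expected) {
        return expected == 0 ? 0.0 : (double) correctlyExtracted / expected;
    }

    // Precision: the share of extracted values that were correct.
    static double precision(int correctlyExtracted, int extracted) {
        return extracted == 0 ? 0.0 : (double) correctlyExtracted / extracted;
    }
}
```

With made-up counts of 8 correct extractions, 10 expected values, and 16 values extracted in total, recall is 8/10 and precision is 8/16. These numbers are purely illustrative and are not the counts behind the table.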

Statistics about the System

                     Files   Lines of code*
Framework            38      2,868
OntologyEditor       141     22,249
OSMX (XML Schema)    1       1,918
OSMX (Java)**        60      6,912
Ontos                29      6,295

* Includes comments and whitespace.
** JAXB-generated classes add 197 files and 62,888 lines of code.

Future Work

  • Algorithm improvements
    • On-the-fly lexicons
    • Machine learning techniques
    • Confidence values
    • Canonicalization
    • Expected participation cardinality
    • Negative-indicator keywords
  • Integration
    • Online search engines
    • Semantic Web annotator and query engine
    • Web interface to extraction engine

Contributions

  • Design and construction of a data-extraction framework
  • Reference implementation
    • Ontos upgrade
    • Pattern for future use of framework
  • OSMX
    • Standardized storage format
    • http://www.deg.byu.edu/xml/osmx.xsd

Contributions

  • Uniform codebase and language
  • OntologyEditor migration
    • New graphics classes
    • Extended data frame support
  • Modular heuristic-based mapper
  • Concept of extraction plans
  • Flexible research platform

Conclusion

Framework gives us the flexibility we need for further data-extraction research

Framework is capable of supporting Ontos functionality

OSMX and reference implementation provide solid base for future research applications