SEMI-STRUCTURE DATA EXTRACTION Rajendra Akerkar (with David Camacho, Maria D. R-Moreno, David F. Barrero) Bonn, June 2007


Page 1: Semi structure data extraction

SEMI-STRUCTURE DATA EXTRACTION

Rajendra Akerkar (with David Camacho, Maria D. R-Moreno, David F. Barrero)

Bonn, June 2007

Page 2: Semi structure data extraction

INDEX

Introduction

Semantic Generators

The WebMantic architecture

A practical example

Some experimental issues

Conclusions

Page 3: Semi structure data extraction

INTRODUCTION

Page 4: Semi structure data extraction

INTRODUCTION

Web information: unstructured; non-semantic; designed for humans, not for crawlers

Problems: representation (HTML vs XML); extracting, filtering and reusing data; sharing information; volatility; fault tolerance

Page 5: Semi structure data extraction

INTRODUCTION

Information Extraction techniques: machine learning; pattern recognition; wrapper technologies; tools for automatic and semi-automatic Web data extraction

This work presents: a rule-based method for data identification; an approach to Web data extraction; a particular implementation of the previous method

Page 6: Semi structure data extraction

SEMANTIC GENERATORS

Page 7: Semi structure data extraction

SEMANTIC GENERATORS

Def: A Semantic Generator (Sg) is a nonempty set of rules (HTML2XML) that can be used to translate HTML documents into XML documents

A Semantic Generator (Sg) is built from several rules which transform a set of non-semantic HTML tags into a set of semantic XML tags

HTML2XML rule format

HTML2XML_i = <header> IS <body> #num
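The rule format above can be read as a small textual grammar. The following is a minimal sketch of a parser for it, not the WebMantic implementation. It assumes the header is a dot-separated HTML tag path, the body is a single XML tag name, and the #num part is optional. The names Html2XmlRule and parse_rule are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Html2XmlRule(NamedTuple):
    header: tuple          # HTML tag path, e.g. ("table", "tr", "td")
    body: str              # XML tag name to emit
    num: Optional[int]     # number of cells to process, if given


# Assumed textual syntax: "<a.b.c> IS <xml-tag> #n" (the "#n" part is optional).
_RULE_RE = re.compile(r"^\s*<([\w.\-]+)>\s+IS\s+<([\w\-]+)>(?:\s+#(\d+))?\s*$")


def parse_rule(text: str) -> Html2XmlRule:
    """Split one rule string into its header path, XML tag and cell count."""
    match = _RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not an HTML2XML rule: {text!r}")
    path, tag, num = match.groups()
    return Html2XmlRule(tuple(path.split(".")), tag, int(num) if num else None)


rule = parse_rule("<table.tr.td> IS <my-xml-tag> #3")
print(rule)
```

Keeping the header as a tuple of tag names makes it easy to compare against the path of open tags while walking an HTML document.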

Page 8: Semi structure data extraction

SEMANTIC GENERATORS

HTML2XML: <table.tr.td> IS <my-xml-tag>

Tags such as <table>, <tr>, <td>, <A href…>, etc. will be removed; only the data will be extracted

#num: provides the number of cells to be processed

<my-xml-tag> Madrid </my-xml-tag>

Page 9: Semi structure data extraction

SEMANTIC GENERATORS

Semantic generator

Page 10: Semi structure data extraction

THE WEBMANTIC ARCHITECTURE

Page 11: Semi structure data extraction

WEBMANTIC ARCHITECTURE

WebMantic allows:

Automatically generating the Sg

Generalizing HTML2XML rules

Guiding the extraction process

Automatically generating wrappers

Page 12: Semi structure data extraction

WEBMANTIC ARCHITECTURE

Page 13: Semi structure data extraction

WEBMANTIC ARCHITECTURE

Tidy HTML parser (http://tidy.sourceforge.net). It translates HTML documents into well-formed HTML documents

The HTML Tidy program (HTML parser and pretty printer) has been integrated as the first preprocessing module in WebMantic.

Tree generator module. Once the HTML page is preprocessed by the Tidy parser, a tree representation of the structures stored in the page is built

In this representation any table or list tag generates a node, and the leaves of the tree are: cells for tables (th,td,tr) or items for lists (li,lo)
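The tree generator step can be sketched with Python's standard html.parser module. This is an illustration under assumptions, not the WebMantic code. It treats table, ul and ol as the structure tags and th, td and li as the leaf tags, and it expects well-formed input (as produced by Tidy). Node and TreeBuilder are hypothetical names.

```python
from html.parser import HTMLParser

STRUCTURES = {"table", "ul", "ol"}   # assumed: tags that open a tree node
LEAVES = {"th", "td", "li"}          # assumed: tags whose text becomes a leaf


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []           # nested Node objects or leaf strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._stack = [self.root]    # currently open structure nodes
        self._text = None            # text collected inside the current leaf

    def _flush(self):
        # Store any pending leaf text on the innermost open structure.
        if self._text is not None:
            text = " ".join("".join(self._text).split())
            if text:
                self._stack[-1].children.append(text)
            self._text = None

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURES:
            self._flush()
            node = Node(tag)
            self._stack[-1].children.append(node)
            self._stack.append(node)
        elif tag in LEAVES and len(self._stack) > 1:
            self._flush()
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in LEAVES:
            self._flush()
        elif tag in STRUCTURES and len(self._stack) > 1:
            self._flush()
            self._stack.pop()


page = ("<html><body><table><tr><th>City</th><th>Country</th></tr>"
        "<tr><td>Madrid</td><td>Spain</td></tr></table>"
        "<ul><li>one</li><li>two</li></ul></body></html>")
builder = TreeBuilder()
builder.feed(page)
table, items = builder.root.children
print(table.tag, table.children)
print(items.tag, items.children)
```

Text outside tables and lists is ignored on purpose, since only the semi-structured parts of the page are of interest here.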

Page 14: Semi structure data extraction

WEBMANTIC ARCHITECTURE

Page 15: Semi structure data extraction

WEBMANTIC ARCHITECTURE

HTML2XML: Rule generator module. The tree representation obtained is used by this module to generate a set of rules (Sg) that represent the information to be translated

HTML2XML rules

Page 16: Semi structure data extraction

WEBMANTIC ARCHITECTURE

Page 17: Semi structure data extraction

WEBMANTIC ARCHITECTURE

Subsumption module. The previous module generates a rule for each structure to be translated. However, some of those rules can be generalized if the XML tag represents the same concept (e.g. the rules in the previous example that represent the concepts of <data-record> and <country>)
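The slides do not spell out the subsumption algorithm. One possible reading, given here only as a hedged sketch, is that rules sharing both the HTML path and the XML tag collapse into a single rule whose #num is the sum of the originals. The function name subsume and the tuple layout (header, tag, num) are assumptions made for this illustration.

```python
def subsume(rules):
    """Merge rules with the same HTML path and XML tag, summing their cell counts.

    Each rule is a (header, tag, num) tuple. This merging criterion is an
    assumption for illustration; the source does not define it.
    """
    merged = {}
    for header, tag, num in rules:
        key = (header, tag)
        merged[key] = merged.get(key, 0) + num
    # dict preserves insertion order, so the first occurrence fixes the position
    return [(header, tag, num) for (header, tag), num in merged.items()]


rules = [
    ("table.tr.td", "country", 1),
    ("table.tr.td", "population", 1),
    ("table.tr.td", "country", 1),
]
generalized = subsume(rules)
print(generalized)
```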

Page 18: Semi structure data extraction

WEBMANTIC ARCHITECTURE

Page 19: Semi structure data extraction

WEBMANTIC ARCHITECTURE

XML Parser module. This module receives both the Semantic Generator obtained in the previous module and the (well-formed) HTML document

[Figure labels: XML Parser, Semantic Generator, Yahoo! Weather]
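A rough sketch of what this last module could do is given below, under stated assumptions. A Semantic Generator is modelled as a list of (path, xml_tag, num) tuples, a rule matches any element whose chain of open tags ends with the rule's path, and num limits how many matching cells are emitted (None means all). RuleApplier and apply_generator are hypothetical names, and the input is assumed to be well-formed HTML.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class RuleApplier(HTMLParser):
    """Collects the text of every element whose tag path ends with `path`."""

    def __init__(self, path):
        super().__init__()
        self.path = list(path)
        self.open_tags = []
        self.values = []
        self._buffer = None          # text of the cell currently being read

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if self.open_tags[-len(self.path):] == self.path:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._buffer is not None and tag == self.path[-1]:
            self.values.append(" ".join("".join(self._buffer).split()))
            self._buffer = None
        if tag in self.open_tags:
            # Pop up to and including the most recent matching open tag,
            # which also discards unclosed void elements such as <br>.
            while self.open_tags.pop() != tag:
                pass


def apply_generator(html, rules):
    """Translate the matching HTML cells into lines of XML, one per cell."""
    lines = []
    for path, xml_tag, num in rules:
        applier = RuleApplier(path)
        applier.feed(html)
        for value in applier.values[:num]:
            lines.append(f"<{xml_tag}>{escape(value)}</{xml_tag}>")
    return "\n".join(lines)


page = ("<table><tr><td><a href='x'>Madrid</a></td></tr>"
        "<tr><td>Paris</td></tr></table>")
xml = apply_generator(page, [(("table", "tr", "td"), "city", 1)])
print(xml)
```

The inner link markup is dropped and only its text survives, which matches the idea that HTML tags are removed and only data is extracted.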

Page 20: Semi structure data extraction

A PRACTICAL EXAMPLE

Page 21: Semi structure data extraction

WEBMANTIC GUI

WebMantic’s GUI

Page 22: Semi structure data extraction

WEBMANTIC GUI

www.citypopulation.de

Page 23: Semi structure data extraction

WEBMANTIC GUI

www.citypopulation.de

Page 24: Semi structure data extraction

WEBMANTIC GUI

First tables & list are rejected

Page 25: Semi structure data extraction

WEBMANTIC GUI

First data-table is rejected

Page 26: Semi structure data extraction

WEBMANTIC GUI

data-table target

Page 27: Semi structure data extraction

WEBMANTIC GUI

XML tags generation (user interaction)

Page 28: Semi structure data extraction

WEBMANTIC GUI

XML tags & HTML2XML rules

Page 29: Semi structure data extraction

WEBMANTIC HTML PROCESSING

Tree generated from HTML document

Relation between the HTML tree and the XML-tags provided by the user

Page 30: Semi structure data extraction

WEBMANTIC HTML PROCESSING

HTML2XML rules

Semantic Generator: HTML2XML subsumed rules

Page 31: Semi structure data extraction

EXPERIMENTAL RESULTS

Page 32: Semi structure data extraction

EXPERIMENTAL RESULTS

Experimental tests (Web sites used):

Population (www.citypopulation.de)

Page 33: Semi structure data extraction

EXPERIMENTAL RESULTS

Experimental tests (Web sites used):

Yahoo Weather (weather.yahoo.com)

Page 34: Semi structure data extraction

EXPERIMENTAL RESULTS

Experimental tests (Web sites used):

Iberia airlines (www.iberia.com)

Page 35: Semi structure data extraction

EXPERIMENTAL RESULTS

Several parameters have been evaluated:

1. Number of pages tested from each Web site

2. Number of accessible structures

3. Maximum nested structure

4. Average number of HTML2XML rules for each Semantic Generator (Sg), once the subsumption process has finished

5. Average time (seconds) to generate the Sg (Time Sg)

6. Average time (seconds) to translate from HTML to XML for the set of training pages (transformation time)

Page 36: Semi structure data extraction

EXPERIMENTAL RESULTS

Page 37: Semi structure data extraction

CONCLUSIONS

Page 38: Semi structure data extraction

CONCLUSIONS AND FUTURE WORK

Conclusions:

We define a technique which is able to provide a semantic representation (using XML-tags) to semi-structured (tables and lists) Web pages through a set of rules (encapsulated in a Semantic Generator)

Rules are created and automatically generalized

These rules can be used to preprocess Web pages with a similar structure, and convert them into XML documents with semantic tags

These can be integrated into information agents

Page 39: Semi structure data extraction

CONCLUSIONS AND FUTURE WORK

In the near future:

Other Web technologies, such as DOM

Ontologies

Machine learning algorithms to automatically learn new (similar) Web pages

Statistical knowledge extraction