Semi-structured data extraction

SEMI-STRUCTURED DATA EXTRACTION

Rajendra Akerkar (with David Camacho, Maria D. R-Moreno, David F. Barrero)

Bonn, June 2007

INDEX

Introduction

Semantic Generators

The WebMantic architecture

A practical example

Some experimental issues

Conclusions

INTRODUCTION


Web information:

Unstructured

Non-semantic

Designed for humans, not for crawlers

Problems:

Representation (HTML vs XML)

Extract, filter and reuse data

Share information

Volatility

Fault tolerance

Information Extraction techniques:

Machine learning

Pattern recognition

Wrapper technologies

Tools for automatic and semi-automatic Web data extraction

This work presents:

A rule-based method for data identification

An approach to Web data extraction

A particular implementation of the previous method

SEMANTIC GENERATORS


Def: A Semantic Generator (Sg) is a nonempty set of rules (HTML2XML) that can be used to translate HTML documents into XML documents

A Semantic Generator (Sg) is built from several rules, which transform a set of non-semantic HTML tags into a set of semantic XML tags

HTML2XML rule format

HTML2XML_i = <header> IS <body> #num


HTML2XML: <table.tr.td> IS <my-xml-tag>

Tags such as <table>, <tr>, <td>, <A href…>, etc. will be removed; only the data will be extracted

#num: provides the number of cells to be processed

<my-xml-tag> Madrid </my-xml-tag>
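The rule format above can be illustrated with a minimal Python sketch. It assumes a rule is written as one string such as `<table.tr.td> IS <my-xml-tag> #1`, that the header is a dot-separated tag path, and that `#num` caps how many matching cells are emitted. The names `Rule`, `parse_rule`, `CellCollector` and `apply_rule`, and the sample HTML, are hypothetical and not taken from WebMantic.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from xml.sax.saxutils import escape


@dataclass
class Rule:
    path: tuple   # header as a tag path, e.g. ("table", "tr", "td")
    xml_tag: str  # semantic tag named in the body
    num: int      # number of cells to process (#num)


def parse_rule(text):
    """Parse the assumed textual form '<table.tr.td> IS <my-xml-tag> #1'."""
    match = re.fullmatch(r"\s*<([\w.]+)>\s+IS\s+<([\w-]+)>\s*#(\d+)\s*", text)
    if match is None:
        raise ValueError(f"not an HTML2XML rule: {text!r}")
    header, xml_tag, num = match.groups()
    return Rule(tuple(header.split(".")), xml_tag, int(num))


class CellCollector(HTMLParser):
    """Collect the text of each element whose open-tag path ends with `path`."""

    def __init__(self, path):
        super().__init__()
        self.path = path
        self.stack = []         # currently open tags
        self.cell_depth = None  # stack depth of the cell being read, if any
        self.cells = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if self.cell_depth is None and tuple(self.stack[-len(self.path):]) == self.path:
            self.cell_depth = len(self.stack)
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:  # also drops unclosed tags such as <br>
                pass
        if self.cell_depth is not None and len(self.stack) < self.cell_depth:
            self.cell_depth = None

    def handle_data(self, data):
        if self.cell_depth is not None:  # markup is dropped, only data is kept
            self.cells[-1] += data


def apply_rule(rule, html):
    """Return one XML fragment per extracted cell, at most rule.num of them."""
    collector = CellCollector(rule.path)
    collector.feed(html)
    collector.close()
    cells = [cell.strip() for cell in collector.cells[: rule.num]]
    return [f"<{rule.xml_tag}> {escape(cell)} </{rule.xml_tag}>" for cell in cells]


html = '<table><tr><td><a href="/madrid">Madrid</a></td><td>Barcelona</td></tr></table>'
rule = parse_rule("<table.tr.td> IS <my-xml-tag> #1")
print(apply_rule(rule, html))  # ['<my-xml-tag> Madrid </my-xml-tag>']
```

With `#1` only the first matching cell is translated; raising the number to 2 would also emit the second cell.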


Semantic generator

THE WEBMANTIC ARCHITECTURE


WebMantic allows:

Automatically generating Sg

Generalizing HTML2XML rules

Guiding the extraction process

Automatically generating wrappers



Tidy HTML parser (http://tidy.sourceforge.net). It translates HTML documents into well-formed HTML documents

The HTML Tidy program (HTML parser and pretty printer) has been integrated as the first preprocessing module in WebMantic.

Tree generator module. Once the HTML page has been preprocessed by the Tidy parser, a tree representation of the structures stored in the page is built

In this representation, any table or list tag generates a node, and the leaves of the tree are cells for tables (th, td, tr) or items for lists (li, lo)
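The tree representation described above might look like the following Python sketch, built on the standard `html.parser` module. Which tags count as structure nodes (`NODE_TAGS`) and which as leaves (`LEAF_TAGS`) is an assumption made for illustration, as are the names `Node`, `TreeBuilder`, `build_tree` and `outline`; the input is assumed to be well formed already.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

NODE_TAGS = {"table", "tr", "ul", "ol"}  # structures (assumed set)
LEAF_TAGS = {"th", "td", "li"}           # cells and list items (assumed set)


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Keep only table/list structure; every other tag is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.open_nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in NODE_TAGS or tag in LEAF_TAGS:
            node = Node(tag)
            self.open_nodes[-1].children.append(node)
            self.open_nodes.append(node)

    def handle_endtag(self, tag):
        # Relies on well-formed input: the closing tag matches the last open node.
        if len(self.open_nodes) > 1 and self.open_nodes[-1].tag == tag:
            self.open_nodes.pop()

    def handle_data(self, data):
        current = self.open_nodes[-1]
        if current.tag in LEAF_TAGS:
            current.text += data


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return an indented, one-line-per-node view of the tree."""
    text = node.text.strip()
    lines = ["  " * depth + (f"{node.tag}: {text}" if text else node.tag)]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


page = (
    "<table><tr><th>City</th><th>Country</th></tr>"
    "<tr><td>Madrid</td><td>Spain</td></tr></table>"
    "<ul><li>Europe</li></ul>"
)
print("\n".join(outline(build_tree(page))))
```

Tags outside the two sets (links, formatting, paragraphs) never create nodes, so their text simply accumulates in the enclosing cell or item.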


HTML2XML: Rule generator module. The tree representation obtained is used by this module to generate a set of rules (Sg) that represent the information to be translated
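As a rough illustration of rule generation from the tree, the Python sketch below walks a tree given as nested `(tag, children)` tuples and emits one rule per structure that directly contains leaves. The tuple encoding, the `LEAF_TAGS` set, the user-supplied `xml_tags` mapping from header to semantic tag, the `unnamed` fallback, and the choice of `#num` as the count of leaves in that structure are all assumptions of this sketch, not WebMantic's documented behaviour.

```python
LEAF_TAGS = {"th", "td", "li"}  # cells and list items (assumed set)


def generate_rules(tree, xml_tags, path=()):
    """Emit one 'header IS body #num' rule per structure that holds leaves."""
    tag, children = tree
    path = path + (tag,)
    rules = []

    # Count the direct leaf children of this structure, grouped by tag.
    leaf_counts = {}
    for child_tag, _ in children:
        if child_tag in LEAF_TAGS:
            leaf_counts[child_tag] = leaf_counts.get(child_tag, 0) + 1

    for leaf_tag, count in leaf_counts.items():
        header = ".".join(path + (leaf_tag,))
        xml_tag = xml_tags.get(header, "unnamed")  # tag chosen by the user
        rules.append(f"<{header}> IS <{xml_tag}> #{count}")

    # Recurse into nested structures (rows, nested tables, nested lists).
    for child in children:
        if child[0] not in LEAF_TAGS:
            rules.extend(generate_rules(child, xml_tags, path))
    return rules


tree = ("table", [
    ("tr", [("th", []), ("th", [])]),
    ("tr", [("td", []), ("td", [])]),
    ("tr", [("td", []), ("td", [])]),
])
xml_tags = {"table.tr.th": "header", "table.tr.td": "data-record"}
for rule in generate_rules(tree, xml_tags):
    print(rule)
```

The two data rows yield two identical rules here, which is the kind of redundancy a later generalization step can remove.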

HTML2XML rules



Subsumption module. The previous module generates a rule for each structure to be translated. However, some of those rules can be generalized if the XML tag represents the same concept (e.g. the rules in the previous example that represent the concepts of <data-record> and <country>)
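One possible reading of this generalization step is sketched below in Python. It assumes that rules are `(header, xml_tag, num)` tuples, that two rules describe the same concept when both header and XML tag are equal, and that the generalized rule covers the sum of the cells of the rules it replaces. The slides do not specify the merge criterion or how `#num` is combined, so both choices, like the name `subsume`, are assumptions.

```python
def subsume(rules):
    """Merge rules that share header and XML tag into one generalized rule."""
    merged = {}  # (header, xml_tag) -> total number of cells; keeps first-seen order
    for header, xml_tag, num in rules:
        key = (header, xml_tag)
        merged[key] = merged.get(key, 0) + num
    return [(header, xml_tag, num) for (header, xml_tag), num in merged.items()]


rules = [
    ("table.tr.th", "header", 2),
    ("table.tr.td", "data-record", 2),
    ("table.tr.td", "data-record", 2),
    ("table.tr.td", "country", 1),
]
print(subsume(rules))
# [('table.tr.th', 'header', 2), ('table.tr.td', 'data-record', 4), ('table.tr.td', 'country', 1)]
```

Rules with the same header but different XML tags stay separate, since they name different concepts.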


XML Parser module. This module receives both the Semantic Generator obtained in the previous module and the (well-formed) HTML document
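Because the input at this stage is well formed, the translation step can be sketched with Python's standard `xml.etree.ElementTree`. The sketch assumes the Semantic Generator is a list of `(header, xml_tag, num)` tuples, that the document carries no XML namespace, and that the output is wrapped in a hypothetical `<document>` root; `matching_cells` and `translate` are illustrative names, not WebMantic's API.

```python
import xml.etree.ElementTree as ET


def matching_cells(element, path, trail=()):
    """Yield elements whose chain of enclosing tags ends with `path`."""
    trail = trail + (element.tag,)
    if trail[-len(path):] == path:
        yield element
    for child in element:
        yield from matching_cells(child, path, trail)


def translate(sg, html, root_tag="document"):
    """Apply every rule of the Semantic Generator and return an XML string."""
    source = ET.fromstring(html)  # raises ParseError if the input is not well formed
    out = ET.Element(root_tag)
    for header, xml_tag, num in sg:
        path = tuple(header.split("."))
        for cell in list(matching_cells(source, path))[:num]:
            item = ET.SubElement(out, xml_tag)
            # itertext() drops nested markup such as <a> and keeps only the data
            item.text = "".join(cell.itertext()).strip()
    return ET.tostring(out, encoding="unicode")


html = (
    '<html><body><table><tr><td><a href="/madrid">Madrid</a></td>'
    "<td>Barcelona</td></tr></table><ul><li>Spain</li></ul></body></html>"
)
sg = [("table.tr.td", "city", 2), ("ul.li", "country", 1)]
print(translate(sg, html))
# <document><city>Madrid</city><city>Barcelona</city><country>Spain</country></document>
```

Parsing with an XML parser only works because the preprocessing step guarantees well-formed input; raw HTML from the Web would usually fail at `ET.fromstring`.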

[Figure: XML Parser module; labels: Semantic Generator, Yahoo! Weather]

A PRACTICAL EXAMPLE

WEBMANTIC GUI

WebMantic’s GUI

www.citypopulation.de

First tables & lists are rejected

First data-table is rejected

data-table target

XML tags generation (user interaction)

XML tags & HTML2XML rules

WEBMANTIC HTML PROCESSING

Tree generated from HTML document

Relation between the HTML tree and the XML-tags provided by the user


HTML2XML rules

Semantic Generator: HTML2XML subsumed rules

EXPERIMENTAL RESULTS

Experimental tests (Web sites used):

Population (www.citypopulation.de)


Yahoo Weather (weather.yahoo.com)


Iberia Airlines (www.iberia.com)

Several parameters have been evaluated:

1. Number of pages tested from each Web site

2. Number of accessible structures

3. Maximum nested structure

4. Average number of HTML2XML rules for each Semantic Generator (Sg), once the subsumption process has finished

5. Average time (seconds) to generate the Sg (Time Sg)

6. Average time (seconds) to translate from HTML to XML for the set of training pages (transformation time)
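The two timing parameters (5 and 6) are per-page averages. A minimal Python sketch of how such an average could be measured is shown below; `average_seconds` is a hypothetical helper, and `step` stands for whichever stage is being timed (Sg generation or HTML-to-XML translation). The slides do not describe how the measurements were actually taken.

```python
import time


def average_seconds(step, pages):
    """Average wall-clock seconds that `step` takes per page."""
    if not pages:
        raise ValueError("need at least one page")
    start = time.perf_counter()
    for page in pages:
        step(page)
    return (time.perf_counter() - start) / len(pages)


# Placeholder stage and pages, used only to show the call shape.
pages = ["<table></table>", "<ul></ul>"]
print(average_seconds(len, pages))
```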


CONCLUSIONS

CONCLUSIONS AND FUTURE WORK

Conclusions:

We define a technique that provides a semantic representation (using XML tags) of semi-structured (tables and lists) Web pages through a set of rules (encapsulated in a Semantic Generator)

Rules are created and automatically generalized

These rules can be used to preprocess Web pages with a similar structure and convert them into XML documents with semantic tags

These can be integrated into information agents


In the near future:

Other Web technologies, such as DOM

Ontologies

Machine learning algorithms to automatically learn new (similar) Web pages

Statistical knowledge extraction
