8/9/2019 Syntactical Integration of Product Information From Semi-Structured Sources
Department of Computer Science, Institute for Systems Architecture, Chair of Computer Networks
Diplomarbeit
SYNTACTICAL INTEGRATION OF PRODUCT INFORMATION FROM
SEMI-STRUCTURED SOURCES
Ludwig Hähne
Mat.-Nr.: 2959267
Supervised by:
Dipl.-Medieninf. Maximilian Walther
Prof. Dr. rer. nat. habil. Dr. h. c. Alexander Schill
Submitted on July 16, 2009
ABSTRACT
This thesis presents a novel product information retrieval and extraction system. The
goal is to provide a solution which automatically locates the manufacturer's page of a
given product and extracts relevant product attributes. The document retrieval subsys-
tem exploits multiple web search services and uses various heuristics to improve the
ranking. The unsupervised extraction of product attributes is based on syntactic fea-
tures of the product pages. XPath queries are used to cluster and select genuine product
attributes from web documents. Three different extraction rule induction algorithms are
presented. One variant uses multiple training documents, another incorporates already
extracted data, and a supervised solution falls back on user-supplied examples. A web
crawler was developed which automatically retrieves pages sharing common underlying
page templates.
The implementation extends an experimental federated search engine developed at
the TU Dresden. The extracted product attributes are meant to spice up already available
data with first-hand information gathered from the respective manufacturer sites. The
system was evaluated according to a gold standard. Considering the low expenses in
terms of user guidance effort and execution time, the system exhibits good precision
and recall metrics.
CONFIRMATION
I confirm that I independently prepared the thesis and that I used only the references
and auxiliary means indicated in the thesis.
Dresden, July 16, 2009
CONTENTS
1 Introduction 1
2 State of the Art 3
2.1 Document Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Document Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Retrieval Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Web Crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Wrapper Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Supervised Information Extraction . . . . . . . . . . . . . . . . . . . 16
2.2.4 Semi-Supervised Information Extraction . . . . . . . . . . . . . . . 16
2.2.5 Unsupervised Information Extraction . . . . . . . . . . . . . . . . . 16
2.2.6 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Information Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Legal Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Fedseeko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Producer Information Integration . . . . . . . . . . . . . . . . . . . 23
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Requirements 25
3.1 Information Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Product Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Functional Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Behavioral Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Validation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Design 31
4.1 Data Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 Retrieving Product Pages . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.2 Information Extraction from Product Pages . . . . . . . . . . . . . 34
4.2 Architectural Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Fedseeko Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Implementation 43
5.1 Product Page Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1.1 Locating the Producer Site . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.2 Locating the Product Page . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.3 Crawling Related Product Pages . . . . . . . . . . . . . . . . . . . . 47
5.1.4 Locator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Information Extraction Prototype . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.1 Data Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.2 Phrase Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 Phrase Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.4 XPath Query Generalization . . . . . . . . . . . . . . . . . . . . . . 50
5.2.5 Wrapper Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Information Extraction Implementation . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Wrapper Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.2 Attribute Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.3 Selecting a Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.4 Architecture of the Web IE Subsystem . . . . . . . . . . . . . . . . . 57
5.4 Fedseeko Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6 Evaluation 61
6.1 Feature Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Effectiveness and Performance Evaluation . . . . . . . . . . . . . . . . . . 62
6.2.1 Test Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.2 Product Page Retrieval Effectiveness . . . . . . . . . . . . . . . . . . 63
6.2.3 Related Page Crawling Effectiveness . . . . . . . . . . . . . . . . . . 64
6.2.4 Information Extraction Effectiveness . . . . . . . . . . . . . . . . . . 66
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7 Conclusion 73
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A Glossary 75
LIST OF FIGURES
2.1 Interplay of document retrieval, information extraction and integration in
web data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Template-driven web page creation from database records . . . . . . . . . 9
2.3 Different wrapper induction strategies [CKGS06] . . . . . . . . . . . . . . 11
2.4 General tree mapping example [ZL05] . . . . . . . . . . . . . . . . . . . . . 13
2.5 Iterative partial tree alignment example [ZL05] . . . . . . . . . . . . . . . . 14
2.6 Wrapper induction example for RoadRunner [CMM01] . . . . . . . . . . . 18
2.7 Input pages in ExAlg [AGM03] . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8 Generalized nodes and data regions in DEPTA [ZL05] . . . . . . . . . . . 21
2.9 Fedseeko architecture [WSS09] . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Overview of information flow . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Product page example with the extraction targets being highlighted . . . 27
4.1 Information flow during extraction . . . . . . . . . . . . . . . . . . . . . 31
4.2 Selecting a product page from a set of candidates using multiple techniques 33
4.3 Navigating to a related product page (Nikon D90 to Nikon D3X) . . . . . 34
4.4 Examples of specification data embedded in different containers . . . . . 36
4.5 Clustering text nodes from multiple documents . . . . . . . . . . . . . . . 38
4.6 Source code of the two pages from figure 4.5 . . . . . . . . . . . . . . . . . 38
4.7 Architecture overview of the complete system . . . . . . . . . . . . . . . . 40
5.1 Ranking a set of candidate documents using multiple techniques . . . . . 44
5.2 Architecture of the DR subsystem . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Supervised retrieval and extraction . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Architecture of the IE subsystem . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Fedseeko product administration view . . . . . . . . . . . . . . . . . . . . 59
6.1 Word cloud visualizing the most common terms in key phrases . . . . . . 62
6.2 Effectiveness of locating the right producer sites and product pages . . . 63
6.3 Product page retrieval runtime performance distribution . . . . . . . . . . 65
6.4 Number of successful operations of each isolated component . . . . . . . 67
6.5 Correctness and completeness of extraction results . . . . . . . . . . . . . 68
6.6 Example of a nested template page . . . . . . . . . . . . . . . . . . . . . . . 68
6.7 Example of specification page for multiple products . . . . . . . . . . . . . 69
6.8 Information extraction runtime performance . . . . . . . . . . . . . . . . . 70
1 INTRODUCTION
The World Wide Web is a place where millions if not billions of products are marketed,
searched, sold, bought and reviewed. Potential customers have a multitude of different
sources at their disposal to facilitate a purchase decision. There are various product
review sites, web shops provide product descriptions, blogs gain popularity as informa-
tion resources, and there is the information published by the product's manufacturer. An
important factor is the reliability of the individual information sources. When it comes
to buying an expensive product, a customer probably prefers to resort to the most reli-
able source of information. However, it is getting increasingly difficult to find first-hand
product information via a simple web search. To reach potential customers, manufacturers
have to compete with many other information providers in order to receive attention
and a good search engine rank.
Nowadays, web search engines are the single point of contact interfacing to the exuberant
information in the World Wide Web. However, today's web search engines predominantly
only inform about the whereabouts of data and can still not answer complex
queries. It is very difficult to do better as long as the web content is not semantically
interwoven.
Not only Tim Berners-Lee believes the Semantic Web to be the future of the Internet
[BLHL01]. Instead of phrasing keyword queries and wading through search results to
find relevant information, the vision is letting the Semantic Web answer actual questions.
In the context of product information retrieval one might want to ask questions
like: How much power does the latest Siemens refrigerator consume compared to its
predecessor and the new flagship product of Penguin Electrics? As old as this vision
is, it still has a long way to go. Web developers are required to semantically describe
their data in languages that may seem too complex and lavish to pick up easily. Espe-
cially the lack of obvious short term benefits may impede the adoption of Semantic Web
technologies. It is not helpful either that a semantic query system needs a somewhat
complete knowledge base in the target domain to be valuable for a potential user. But
what if semantic data could be condensed out of existing web pages?
One idea is to bridge the gap between the "syntactic" Web and the Semantic Web
by automatically transferring information from traditional web pages into a semantic
context with the help of information extraction techniques. Acknowledged, information
extraction systems will not immediately provide the anticipated power of the Semantic
Web without further efforts. But these systems might help to facilitate the migration
process in some well-defined domains, one of which might be product information
extraction.
With an automatic product information extraction and integration system at hand, it
would be possible to find similar products based on all kinds of feature-related criteria.
It would also relieve the customer of retrieving the producer information for the inter-
esting products manually. Furthermore, such a system would be manufacturer- and
vendor-independent.
This work presents a novel approach towards automatic Web information extraction
and strives to become an enabling technology for product information integration.
A prototype implementation was developed and integrated into a federated search
engine, demonstrating the practical viability for product information integration and its
immanent challenges: locating product pages, automatically collecting training data for
pattern mining and identifying and extracting valuable product data.
The product page location component resorts to multiple web search services and
incorporates various heuristics to optimize the retrieval precision. The extraction
exploits structural characteristics of template-generated web pages. Extraction rules
are stored as XPath queries in the system. A low complexity clustering algorithm is
utilized to derive these extraction rules. Three algorithms are proposed, corresponding
to different degrees of automation.
Chapter two provides the theoretical background and discusses the state of the art in
Web information extraction and related fields of research. In chapter three the
requirements of the novel product information extraction system are analyzed. The
subsequent chapters deal with the design and implementation of the software system.
Chapter six dissects the advantages and drawbacks of the presented solution and eval-
uates the system according to a gold standard. Finally, a summary and an outlook are
given.
2 STATE OF THE ART
Integrating information from the World Wide Web into a local database relies on three
major components, as depicted in figure 2.1. In this chapter, important concepts of
document retrieval and information extraction are outlined and an overview of the state
of the art in each field is given. This work strongly focuses on information extraction
and thus presents a selection of existing information extraction systems. Information
integration is covered briefly for the sake of completeness. The chapter closes with the
presentation of Fedseeko, the system into which the new information extraction system
shall be integrated.
Figure 2.1: Interplay of document retrieval, information extraction and integration in
web data extraction
2.1 DOCUMENT RETRIEVAL
"Knowledge is of two kinds. We know a subject ourselves, or we know where we can
find information upon it." Samuel Johnson
Information retrieval (IR) is often only loosely defined. Moreover, in the context of
most retrieval systems information retrieval actually refers to document retrieval. In effect,
information retrieval shall be synonymous to document retrieval (DR) in this thesis.
Being less ambiguous, the latter term is preferred. Lancaster gives the following
definition of IR that also draws a dividing line separating related fields of research like
fact retrieval or question answering [Lan68]:
Definition 1 (Information Retrieval)
An information retrieval system does not inform (i.e. change the knowledge of) the user on the
subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts
of documents relating to his request.
Document retrieval aims to find relevant information from a large corpus of docu-
ments. Given a user query, traditional DR systems identify and rank documents in a
corporate or library network or on a single host (e.g. desktop search). In the context of
the Internet, DR is an important foundation of web search technologies with web pages
building the document corpus. Due to the vast amount of web content with trillions of
web pages, web search systems have different requirements than traditional DR systems.
User queries normally are lists of words. Based on a query, the DR system finds
relevant documents by matching the query tokens with the documents' contents. In
the simplest case, each word occurring in the query must also occur in the document.
Phrase queries are also a very common instrument in IR. In addition, the query may
contain Boolean operators or means to express that two tokens must occur near each
other. However, complex query constructs are rarely used in practice as those make the
DR task more difficult for the users.
In the following, DR document models, effectiveness metrics and web crawlers are
discussed.
2.1.1 Document Model
The document model specifies how the documents and queries are represented and
governs how the relevance of a document in respect to a query is computed.
A document can be modeled in many different ways. It is common to most mod-
els that documents and queries are treated as a "bag of words or terms" in which term
sequence and position are ignored [Liu06]. An important characteristic of document
models is whether and how term-interdependencies are modeled. In the simplest case,
each word is treated independently. According to Kuropka, the various approaches can
be divided into set-theoretic models (e.g. Boolean model), algebraic models (e.g. vector-
space model) and probabilistic models [Kur04]. The different models will be briefly
presented in the following.
In the Boolean model each term is only checked for its presence or absence in a
document. A query in a Boolean retrieval system can be given as a logical equation
combining terms with logic operators, e.g. "James Joyce" AND Trieste. A document
is relevant in respect to the query if the contained set of terms make the query logically
true. Boolean models have the disadvantage that no ranking can be derived from the
simple definition of the problem. Neither the term frequency is examined nor does the
model permit inexact matches.
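The mechanics of the Boolean model can be made concrete with a short sketch (an illustration, not part of the thesis; the whitespace tokenizer and the AND / AND NOT semantics are simplifying assumptions):

```python
def tokenize(text):
    """Lower-case a document and reduce it to its set of terms,
    ignoring term order and frequency (bag/set of words)."""
    return set(text.lower().split())

def matches(doc_terms, required, excluded=()):
    """A document satisfies the query if it contains every required
    term and none of the excluded ones (AND / AND NOT semantics)."""
    return (all(t in doc_terms for t in required)
            and not any(t in doc_terms for t in excluded))

docs = {
    "d1": "James Joyce wrote Ulysses while living in Trieste",
    "d2": "James Joyce was born in Dublin",
}

# Query: "james" AND "joyce" AND "trieste"
hits = [d for d, text in docs.items()
        if matches(tokenize(text), ["james", "joyce", "trieste"])]
print(hits)  # ['d1']
```

Note that the result is an unranked set: as described above, the model only decides relevance, it cannot order the hits.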
In the vector-space model a document is represented by an n-dimensional vector,
in which each dimension represents a distinct term of the vocabulary from the whole
document corpus. The weight of the term is computed from its occurrence characteristic
in the document. The query is also modeled as such a vector. Now the relevance of the
document in respect to the query can be computed as the cosine of the angle between
the two vectors, defined as the cosine similarity (see equation (2.1)).

cos θ = (d · q) / (‖d‖ ‖q‖)    (2.1)
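Equation (2.1) can be sketched directly in code (an illustrative sketch; the toy vocabulary and term weights below are invented, and real systems would typically use tf-idf weights over a large vocabulary):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query
    vector q: dot product divided by the product of the norms."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)

# Toy vocabulary ("joyce", "trieste", "dublin"); one dimension per term.
d = [2.0, 1.0, 0.0]   # document mentions "joyce" twice, "trieste" once
q = [1.0, 1.0, 0.0]   # query: "joyce trieste"
print(round(cosine_similarity(d, q), 3))  # 0.949
```

Because the cosine is normalized by the vector lengths, a long document is not favored over a short one that matches the query equally well.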
An example for a probabilistic approach are language models which were first pro-
posed for document retrieval by Ponte and Croft [PC98]. In a statistical language model
a probability distribution of the n-grams is computed for each document in the corpus.
The idea is to derive the ranking of a document di in respect to a query q from the a pos-
teriori probability P(di|q). This is essentially the likelihood of the query being generated
by the respective language model.
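A minimal query-likelihood sketch may illustrate the idea (an assumption-laden simplification: a unigram model with Laplace smoothing is used here for brevity, whereas Ponte and Croft's original proposal employs a different smoothing scheme):

```python
from collections import Counter

def query_likelihood(doc_tokens, query_tokens, vocab_size, alpha=1.0):
    """Estimate P(q|d) under a unigram language model of the document,
    with Laplace (add-alpha) smoothing so unseen terms keep a small
    non-zero probability. Documents are ranked by this likelihood."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    p = 1.0
    for t in query_tokens:
        p *= (counts[t] + alpha) / (n + alpha * vocab_size)
    return p

d1 = "james joyce ulysses trieste".split()
d2 = "penguin classics catalogue".split()
vocab = len(set(d1) | set(d2))
q = "joyce trieste".split()

# The document more likely to "generate" the query ranks higher.
print(query_likelihood(d1, q, vocab) > query_likelihood(d2, q, vocab))  # True
```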
The ranking derived from the degree of relevance is governed by the internal doc-
ument model. It may reflect poorly the actual relevance of documents as perceived by
the user. Thus, effectiveness metrics are required to evaluate the performance of a DR
system.
2.1.2 Retrieval Effectiveness
Numerous metrics have been proposed to measure the performance of DR systems. The
most commonly used metrics are precision and recall. Assuming a document is either
relevant or irrelevant in respect to a query, precision is the fraction of relevant documents
in the set of retrieved documents. In contrast, recall is the ratio of the number of relevant
documents retrieved to the total number of relevant documents (including those that
were not retrieved). Both metrics are related and are most often examined in context
of each other. For example, it is trivial to achieve 100% recall by just returning all
documents for every query. However, the precision metric would immediately reveal
the deficiency of such an approach. Another commonly used metric is the F-score (or
F-measure) which is defined as the weighted harmonic mean of precision and recall.
Web search engines typically present search results in buckets of around ten documents.
Users, however, do not consider search results beyond the first few result pages.
In effect, a relevant but very low-ranked1 document is essentially useless from the user's
perspective. Therefore, the ranking is also considered in the performance evaluation by
only examining the first i search results.
Let D be the whole document corpus. A query is submitted to a given DR system.
D_retrieved ⊆ D is an ordered set of all retrieved documents, while D^i_retrieved are the
i top-ranked documents returned by the system. D_relevant ⊆ D is the set of all relevant
documents. The effectiveness metrics can be computed according to the equations (2.2).

precision(i) = |D_relevant ∩ D^i_retrieved| / |D^i_retrieved|

recall(i) = |D_relevant ∩ D^i_retrieved| / |D_relevant|

F-score(i) = 2 · precision(i) · recall(i) / (precision(i) + recall(i))    (2.2)
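The metrics of equations (2.2) translate directly into code (an illustrative sketch; the ranked result list and gold-standard set below are invented):

```python
def precision_recall_f(retrieved, relevant, i):
    """Precision, recall and F-score at cut-off i, following
    equations (2.2): `retrieved` is the ranked result list,
    `relevant` the set of all relevant documents in the corpus."""
    top_i = set(retrieved[:i])
    hits = len(top_i & relevant)
    precision = hits / len(top_i) if top_i else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

retrieved = ["d3", "d1", "d7", "d2", "d9"]   # ranked system output
relevant = {"d1", "d2", "d5"}                # gold standard
p, r, f = precision_recall_f(retrieved, relevant, 5)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.4 0.67 0.5
```

The example also shows the trade-off discussed above: returning more documents can only raise recall, while precision typically drops.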
In order to identify all relevant documents, the DR system first needs to be aware of
the existence and whereabouts of the individual web pages. The gathering of web pages
is performed by a web crawler, which is presented in the following section.
2.1.3 Web Crawler
Web IR systems have to gather web pages to build a document index. This non-trivial
task is performed by a web crawler, also known as spider or robot. Web crawlers
recursively follow links in web pages to build a document index. As the Internet is constantly
evolving, web sites need to be visited regularly to account for new or changed content.
Definition 2 (Spider [FOL09])
A program that automatically explores the World-Wide Web by retrieving a document and
recursively retrieving some or all the documents that are referenced in it.
The best known web crawlers are universal crawlers which operate on behalf of web
search engines collecting the data for the document index. According to Brin and Page,
the crawler is the most fragile component of a search engine because it has to interact
with millions of remote servers all beyond the control of the system [BP98]. Thus, a
crawler has to be very robust and handle a multitude of corner cases even if that might
affect only a single page.
Crawlers may impose a huge stress on the resources of the respective hosts if the
request rate is not limited, leading to denial of service attacks in the worst case. Fur-
thermore, crawlers should identify themselves and comply with the robots exclusion stan-
dard2.
1Depending on the user's web browsing behavior and motivation, he or she might wade through one
hundred search results, but more likely no more than ten search results will be considered by the user.
2The robot exclusion standard or protocol is a de facto standard described at http://www.robotstxt.org/wc/robots.html.
In addition to universal crawlers, there are focused and topical crawlers exploring
the Web based on user preferences. These try to find only pages relating to a category
of interest or being similar to a set of seed pages. Focused crawlers differ from universal
crawlers in their strategy of picking URLs that shall be visited.
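The recursive link-following of definition 2 can be condensed into a short breadth-first sketch (an illustration, not the thesis implementation: the injectable `fetch` function and the three-page fake web are assumptions made so the sketch runs without network access, and a real crawler would additionally honour robots.txt, identify itself, and rate-limit requests per host):

```python
from collections import deque

def crawl(seed, fetch, max_pages=10):
    """Visit pages breadth-first starting at `seed`, recursively
    following links, and return the list of visited URLs.
    `fetch(url)` returns (html, outgoing_links)."""
    frontier = deque([seed])
    visited = []
    seen = {seed}            # URLs already queued, to avoid cycles
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        _, links = fetch(url)
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Fake three-page web for demonstration purposes.
web = {
    "a": ("<html>A</html>", ["b", "c"]),
    "b": ("<html>B</html>", ["a"]),
    "c": ("<html>C</html>", []),
}
print(crawl("a", lambda u: web[u]))  # ['a', 'b', 'c']
```

A focused crawler, as described above, would differ only in how the frontier is ordered: instead of plain FIFO order, URLs judged closer to the topic of interest would be dequeued first.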
2.1.4 Summary
Important concepts of IR have been briefly presented. Information retrieval in terms
of document retrieval is just a first step in obtaining and processing knowledge in an
information system. The next step of information processing is the extraction of more
fine-grained information from the retrieved documents.
2.2 INFORMATION EXTRACTION
"Get your facts first, and then you can distort them as much as you please." Mark
Twain
Roughly speaking, information extraction (IE) aims to condense knowledge about a
specific domain of interest. Attributes of the domain's entities or facts are distilled from
one or more input documents. The goal of IE is enabling the information system to
reason based on the extracted data. For example, an IE system that collects facts on the
world's countries may extract attributes such as population, capital or natality [UTF08].
Definition 3 (Information Extraction [SAI01])
Rather than indicating which documents need to be read by a user, [Information Extraction]
extracts pieces of information that are salient to the user's needs.
IE produces structured data from unstructured and semi-structured documents.
Semi-structured data typically refers to tables and lists, which are characteristic for web
pages. Whether a document is perceived as structured or unstructured depends on the
research domain. Databases are typically regarded as structured data while free text
is commonly classified as unstructured. The classification, however, cannot be solely
based on the data format. It is quite possible to dump a whole unstructured document
into a single database record, or strictly format a text file as a sequence of key-value
tuples. Similarly, an HTML body may contain an unstructured stream of free text or a
fine-grained table. Nevertheless, in the IE community, HTML is commonly classified as
semi-structured data, while XML documents with available meta-data are considered
being structured [CKGS06]. The dividing line between semi-structured and structured
data is drawn between documents containing some kind of syntactic structuring ele-
ments (e.g. HTML tags) and semantic tags of the data.
While IE for unstructured documents like free text has been thoroughly investigated
during the last decades, as indicated by the success of the Message Understanding
Conferences [Gri97], IE for semi-structured documents has received a growing interest
from researchers during the last years. For the respective tasks, different techniques
are required. Traditional IE needs to extract knowledge from human language texts and
typically uses lexicons and grammars to achieve this goal. Web IE takes advantage of
the fact that web pages are often automatically generated from (structured) database
records. Because web pages are created from static templates, machine learning and
pattern recognition techniques can be applied to analyze the syntactic structure of the
documents. Web scraping or screen scraping are used synonymously for Web IE. The Jargon
File3 gives the following definition of screen scraping stressing the unintended usage of
the medium.
Definition 4 (Screen scraping)
The act of capturing data from a system or program by snooping the contents of some display
that is not actually intended for data transport or inspection by programs. [...] it often refers
to parsing the HTML in generated web pages with programs designed to mine out particular
patterns of content.
Chang et al. give an overview of contemporary Web information extraction sys-
tems and categorize those based on task difficulty, extraction technique and degree of
automation [CKGS06].
2.2.1 Data Model
In the following, a generic IE data model is described informally. The chosen model is
derived from the data model known from relational databases and is also referenced by
other IE researchers [AGM03, Liu06]. According to this model, the data is structured as
nested relations made up of basic types arranged in tuples and sets. A basic type B is
an atomic entity, typically a string in the context of web pages. The tuple type ⟨T1, T2, ..., Tn⟩
is an ordered collection of other types Ti. Tuples map to data records in a database
context. Set types {T} are constructed by multiple elements of the same type T, like a
list of equally typed tuples.
Let S be the schema of a book description. The data record (tuple) describing a book
might comprise the title, a set of authors, the publisher of the book and the number of
pages. Then the schema can be described as S = ⟨B_title, {B_name}_authors, B_publisher, B_pages⟩.
An instance of S is the value x = ⟨"Ulysses", {"James Joyce"}, "Penguin", 1040⟩.
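One way to mirror this nested-relation model in plain data structures is sketched below (an illustration only; the field names follow the book example, and representing sets as lists is a simplifying assumption of this sketch):

```python
# Basic types map to strings/numbers, the tuple type to an ordered
# record (here a dict with fixed keys), and the set type to a list of
# same-typed elements. This is the instance x of schema
# S = <B_title, {B_name}_authors, B_publisher, B_pages>.
book = {
    "title": "Ulysses",              # basic type B_title
    "authors": ["James Joyce"],      # set type {B_name}_authors
    "publisher": "Penguin",          # basic type B_publisher
    "pages": 1040,                   # basic type B_pages
}

# The nested set type can hold any number of equally typed elements.
print(book["title"], len(book["authors"]))  # Ulysses 1
```

As the text notes, many practical IE systems flatten such structures; the nested `authors` list is exactly the kind of field that often gets collapsed into a single string.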
A template-based semi-structured page is created from one or more data records
stored in a database and a template, as illustrated in figure 2.2. A
template maps instances of a certain schema to a web page. More formally, an encoded
web page P is created from a data record x and a template T via a template mapping
function λ. Thus, the page creation process can be modeled as P = λ(T, x).
The IE task is to extract x from P with T being unknown. If λ⁻¹_T is the extraction
function associated with template T, x′ = λ⁻¹_T(P) is performed by the extractor. The schema
of the extracted data rarely matches the model of the original data schema. Either the
IE system is not able to extract all data fields or only a subset of the data fields are
3The Jargon File is "a comprehensive compendium of hacker slang illuminating many aspects of hackish
tradition, folklore, and humor." [Jar03]
Figure 2.2: Template-driven web page creation from database records
required. Therefore, x′ is generally an incomplete approximation of the original data
record x. For example, the schema for the extracted data in the running example might
be S′ = ⟨B_title, B_authors, B_publisher⟩ with the nested data records for the authors collapsed
into a single data field and the page count being omitted. In practice, many IE systems
use simpler data models for the extraction targets than the one described. Particularly
nesting of set and tuple types is not supported by the majority of the available IE sys-
tems.
An example template for the running example is given in listing 2.1 using a pseudo
template language.
<html>
<body>
<h2>Books</h2>
<ul>
  <li>Title: <i>{title}</i>
  <li>Author: {authors}
  <li>Publisher: {publisher}
  <li>{pages} pages
</ul>
</body>
</html>

Listing 2.1: Template example
So far, the web page has been assumed to be a static document. However, techniques
such as Ajax allow operations to be performed asynchronously, for example the deferred
loading of additional content using XMLHttpRequest [Gar05]. This poses new challenges
for DR and IE systems if relevant information becomes available only after performing
a certain action, like clicking a link or button on the page. A potential way to
remedy this problem is to drive a full-fledged web browser with a JavaScript interpreter
and use a tool like Watir4 to store static snapshots of the dynamic page.
4Watir is an open-source library for automating web browsers: http://wtr.rubyforge.org/index.html
As has already been identified in this section, the goal of the Web IE system is to
extract data embedded in web pages created from a template. This task is performed by
a wrapper program which may be hand-crafted or automatically generated. Wrapper
generation techniques are discussed in the following section.
2.2.2 Wrapper Induction
According to a very general definition, a wrapper provides an interface to an entity and
allows it to be treated as if it were something else. In the Web IE context, a wrapper
allows a web page to be regarded as a database record. Consequently, the wrapper is
responsible for extracting one or more data records from web pages.
Early IE systems were programmed manually.5 A set of web documents is examined
and common patterns have to be identified by a human operator. Recurrent patterns
enable the programmer to write a wrapper for extracting the target data, either manually
or aided by pattern specification languages. The hand-crafted wrapper should then be
able to extract data from documents sharing the same template.
<html>
<body>
<h2>Books</h2>
<ul>
  <li>Title: <i>Ulysses</i>
  <li>Author: James Joyce
  <li>Publisher: Penguin
  <li>1040 pages
</ul>
</body>
</html>

Listing 2.2: Sample web page
Listing 2.2 shows a simple web page generated from the aforementioned template.
Assuming the extraction task is to extract the book's title, the programmer might write a
program that skips to the <i> tag and extracts the text that follows until the closing </i>
tag. Alternatively, regular expressions or XPath queries could be used. The different
variants to represent extraction rules are discussed on page 15.
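Both rule flavors can be sketched in Ruby; the snippet below uses REXML from the standard library on a well-formed variant of the sample page (closing tags added so the XML parser accepts it):

```ruby
require "rexml/document"

page = <<~HTML
  <html><body>
  <h2>Books</h2>
  <ul>
    <li>Title: <i>Ulysses</i></li>
    <li>Author: James Joyce</li>
    <li>Publisher: Penguin</li>
    <li>1040 pages</li>
  </ul>
  </body></html>
HTML

# Regular-expression rule: works on this page, but breaks as soon as
# the <i> tag carries attributes or appears inside a comment.
title_re = page[%r{<i>(\w+)</i>}, 1]

# XPath rule: addresses the title text node structurally.
doc = REXML::Document.new(page)
title_xp = REXML::XPath.first(doc, "//li/i/text()").to_s
```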
Manually programmed wrappers are prone to failures when templates change,
require knowledge of the employed technologies and are very labor-intensive. In
contrast to manually specifying extraction rules, wrapper induction systems derive these
from a set of training documents with various degrees of automation.
Regardless how the wrapper was generated, Web IE systems have to deal with the
problems of wrapper verification and wrapper repair. A wrapper relies on the extraction
targets to be encoded in a certain way. However, web pages are subject to change
and information providers may choose to replace their templates at any time. This
causes hardship for wrapper maintenance. The detection of whether the wrapper is
5Special-purpose IE tasks are often still conducted manually, e.g. extracting the links from a web search result page.
Figure 2.3: Different wrapper induction strategies [CKGS06]
suited to extract data from a presented page is called the wrapper verification problem.6
Adapting the wrapper to a changed template is called the wrapper repair problem. A way
to approach both problems is to learn and verify characteristic patterns of the target
data. In case of failure, the patterns can be used in attempting to adapt the wrapper to
the new template. However, both tasks are very difficult to solve and are still an active
research area [Liu06].
The goal of wrapper induction is to derive the encoding template from a collection
of encoded instances of the same type. Repeated patterns in HTML documents can be
detected with string or tree matching and alignment techniques. These will be discussed
in the next sections.
String Matching
String matching helps reveal to what extent two character strings resemble each
other. The Levenshtein distance is a commonly used measure of the similarity
of two strings [Lev65]. It is defined as the minimum number of operations to transform
one string into the other. These operations are inserting, deleting or replacing a single
character in the string. The edit distance can be computed using dynamic programming.
Let s1 and s2 be the input strings and n and m the respective character counts. The
table D of dimension (n + 1) × (m + 1) is initialized with D_{i,0} = i and D_{0,j} = j. The
remaining cells are computed using equation (2.3).

For all i ∈ [1..n], j ∈ [1..m]:

D_{i,j} = min( D_{i-1,j-1}      if s1[i] = s2[j]  (same character),
               D_{i-1,j-1} + 1  (replace),
               D_{i,j-1} + 1    (insert),
               D_{i-1,j} + 1    (delete) )                            (2.3)
6In fact, wrapper verification is also needed if the IE system may be confronted with ineligible pages,
i.e. pages that are created from different templates.
The final edit distance is retrieved from the bottom-right corner cell D_{n,m}. An alignment
path can be traced back through the matrix, illustrating the operations. The time
complexity of the algorithm is O(nm). Table 2.1 shows an example matrix for the
comparison of the character strings sheep and shepard, yielding a Levenshtein distance of 4.
For similarity computations, the edit distance can be normalized by dividing it by
the length of the longer string, max(n, m).
Table 2.1: Edit distance matrix of the strings "shepard" and "sheep"
s1 s h e p a r d
s2 0 1 2 3 4 5 6 7
s 1 0 1 2 3 4 5 6
h 2 1 0 1 2 3 4 5
e 3 2 1 0 1 2 3 4
e 4 3 2 1 1 2 3 4
p 5 4 3 2 1 2 3 4
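The recurrence of equation (2.3) translates directly into code; a minimal Ruby sketch (full matrix, no traceback of the alignment path):

```ruby
# Edit distance by dynamic programming; D has (n+1) x (m+1) cells.
def levenshtein(s1, s2)
  n, m = s1.length, s2.length
  d = Array.new(n + 1) { Array.new(m + 1, 0) }
  (0..n).each { |i| d[i][0] = i }
  (0..m).each { |j| d[0][j] = j }
  (1..n).each do |i|
    (1..m).each do |j|
      cost = s1[i - 1] == s2[j - 1] ? 0 : 1   # same character or replace
      d[i][j] = [d[i - 1][j - 1] + cost,
                 d[i][j - 1] + 1,              # insert
                 d[i - 1][j] + 1].min          # delete
    end
  end
  d[n][m]
end

levenshtein("shepard", "sheep")        # => 4, as in table 2.1
# Normalized similarity distance, max(n, m) = 7 here:
levenshtein("shepard", "sheep") / 7.0
```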
Tree Matching
String matching across non-trivial Web documents is a complex and expensive oper-
ation considering the average document length in terms of characters. There are no
pre-determined boundaries and the content and length of the data may differ across
multiple documents or records. The semi-structured nature of Web documents led to
the application of tree matching to conduct IE tasks. Tree matching compares the struc-
ture of two trees and computes a cost of pairing the vertices. In the context of Web IE,
the DOM-tree or parts thereof are commonly compared, using the element tags as the
vertex labels.
Tree matching computes a minimum-cost mapping for two ordered labeled trees.
According to the general definition, each node appears no more than once and the
order and hierarchical relations among nodes are preserved. Figure 2.4 on the facing
page illustrates such a mapping. Tai presented the first polynomial algorithm for computing
the edit distance based on dynamic programming [Tai79]. The algorithm has a
complexity of O(n₁n₂h₁h₂) in time and space, with n₁ and n₂ being the number of nodes
and h₁ and h₂ the heights of the respective trees.
Cost functions are assigned to the editing operations transforming one tree into
another, i.e. relabeling, deleting and inserting nodes. Relabeling is of special interest
as it lends itself to identifying recurrent patterns in similar structured documents. More
elaborate cost functions for the relabel operation may exploit syntactic (e.g. string edit
distance) or semantic (e.g. feature vector) similarities. Zigoris et al. propose using sup-
port vector machines to learn the parameters of the cost function for semantic matching.
The preliminary results, however, indicated no performance gain in comparison to
simpler cost functions [ZEZ06].
Figure 2.4: General tree mapping example [ZL05]
A more restrictive variant of tree matching was defined by Selkow in 1977 [Sel77].
According to Selkow's definition, insertion and deletion are limited to the leaf nodes and
node replacement is not supported. In effect, the aim of tree matching is to find the
maximum matching where every node pair has the same parent nodes. This definition
has been found to fit web documents better because structural (i.e. level-crossing)
changes are not generally applicable to DOM-trees [CAM01]. Simple tree matching (STM)
is an algorithm solving this problem in quadratic time [Yan91]. It is again based on
dynamic programming and shown in listing 2.3.
STM(A, B)
  if A.root ≠ B.root then
    return 0
  else
    m ← number of children of A
    n ← number of children of B
    M_{i,0} ← 0 for all i ∈ [0..m]
    M_{0,j} ← 0 for all j ∈ [0..n]
    for i = 1 to m do
      for j = 1 to n do
        M_{i,j} ← max(M_{i,j-1}, M_{i-1,j}, M_{i-1,j-1} + STM(A_i, B_j))
    return M_{m,n} + 1

Listing 2.3: Simple tree matching algorithm
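A direct Ruby transcription of the listing, with trees represented here as `[label, children]` pairs (a representation chosen for brevity; labels stand for element tags):

```ruby
# Simple tree matching: returns the size of the maximum matching
# between two ordered labeled trees A and B.
def stm(a, b)
  return 0 unless a[0] == b[0]   # differing roots match nothing
  a_children, b_children = a[1], b[1]
  m, n = a_children.length, b_children.length
  mat = Array.new(m + 1) { Array.new(n + 1, 0) }
  (1..m).each do |i|
    (1..n).each do |j|
      mat[i][j] = [mat[i][j - 1], mat[i - 1][j],
                   mat[i - 1][j - 1] + stm(a_children[i - 1], b_children[j - 1])].max
    end
  end
  mat[m][n] + 1                  # + 1 for the matched roots
end

t1 = ["ul", [["li", []], ["li", []], ["li", []]]]
t2 = ["ul", [["li", []], ["li", []]]]
stm(t1, t2)  # => 3 (the root plus two matched li children)
```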
Multiple Alignment
In order to identify patterns in case more than two strings or trees are involved, mul-
tiple sequence alignment (MSA) techniques can be applied. Multiple alignment has its
foundation in molecular biology where it is used to identify similarities of sequences
(e.g. proteins). Given a set of similar sequences, MSA tries to find an optimal align-
ment by inserting gaps into the sequences. Carrillo and Lipman presented an algorithm
based on multidimensional dynamic programming that yields optimal results but has
an exponential time complexity [CL88]. Hence, various heuristic methods have been
proposed amongst which the center star method has found its way into IE systems.
In this method, a center sequence c is selected from a set of sequences X minimizing
the pair-wise distance to the other sequences.
c = arg min_{x_c ∈ X} Σ_{x_i ∈ X} d(x_i, x_c)          (2.4)
Afterwards, the alignments with the remaining sequences are computed and gaps are
inserted into the center string where necessary. The time complexity of the center star
method is O(n²k²) for n sequences of length k. While of polynomial complexity, the
character sequence lengths of HTML pages still incur excessive runtimes in IE
systems.
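The center-selection step of equation (2.4) is easy to sketch in Ruby; a compact single-row Levenshtein serves as the distance d (the subsequent gap insertion into the center string is omitted):

```ruby
# Single-row variant of the Levenshtein edit distance.
def edit_distance(a, b)
  d = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev = d[0]
    d[0] = i
    b.each_char.with_index(1) do |cb, j|
      cur = [d[j] + 1, d[j - 1] + 1, prev + (ca == cb ? 0 : 1)].min
      prev = d[j]
      d[j] = cur
    end
  end
  d[b.length]
end

# Center star: pick the sequence minimizing the summed distance
# to all other sequences, as in equation (2.4).
def center(sequences)
  sequences.min_by { |c| sequences.sum { |s| edit_distance(s, c) } }
end

center(%w[sheep shepard sheet])  # => "sheep"
```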
Partial Tree Alignment
Partial tree alignment was specifically crafted to solve the multiple alignment problem
in an IE context [ZL05]. It aligns multiple trees by progressively growing a seed tree.
The latter is initialized to be the tree with the maximum number of nodes. This way
it likely aligns well with the other trees. The remaining trees are matched by linking
matching nodes and trying to insert nodes into the seed tree for which no match was
found.
Nodes are only inserted if a position can be uniquely determined. That is, if the
neighboring siblings in the source tree are matched with consecutive siblings in the
seed tree. Figure 2.5 illustrates growing such a seed tree Ts from three input trees.
Figure 2.5: Iterative partial tree alignment example [ZL05]
Extraction Rules
Once the extraction targets are identified, rules to mine the relevant information need
to be formalized and stored for future use. There are various possibilities ranging from
first-order logic rules over regular expressions to XPath and CSS selectors. Logic rules
are primarily used in free-text IE, where common tokens and characteristic delimiters
facilitating the other approaches are rarely available.
Regular expressions have been widely adopted for data mining from semi-structured
documents. In the example in listing 2.2 on page 10, the title of the book can be mined
with the regular expression <i>(\w+)</i>. In practice, however, regular expressions
are not very well suited to match data in HTML documents. Correctly matching all
possible incantations of a specific HTML tag with a regular expression is a daunting
task, especially due to the statefulness of the HTML syntax. For example, the given
expression will not work if the tag contains any attributes and will unintentionally
match occurrences of the tag in comments or strings. Therefore, interest has
recently shifted to query languages like XPath or CSS selectors, which are much more
suitable for extracting information from an HTML or XML document.
Especially the usage of the XPath language in Web information extraction has gained
importance with a growing number of libraries supporting this query mechanism. In a
nutshell, XPath queries provide means to address node-sets or individual nodes in the
DOM tree of an XML (or HTML) document. For instance, //li/i/text() addresses the
title phrase of the book in the running example while querying for //ul/li[1] returns
the node containing the whole book-title attribute. XPath queries are far more powerful
than the examples given above. This complexity, however, has caused hardship for
providing full support of the XPath standard in implementations and uncertainty
concerning the complexity of XPath queries in general. Gottlob et al. have shown that
large fragments of XPath are of LOGCFL7 complexity and thus can be massively
parallelized [GKP03]. A more elaborate treatise on XPath can be found in Essential XML
Quick Reference [SG01].
O'Keefe and Trotman present a number of query languages besides XPath and argue
that most available solutions are overly complicated [OT03]. On the one hand, the lack
of comprehensive support for the XPath 1.0 standard in many query libraries backs this
assumption. On the other hand, in Web IE the expressive power to select the relevant
parts of the available information with the utmost precision is a more favorable goal
than a simpler yet inferior solution. CSS selectors, for example, share similar concepts
with XPath queries but are not quite as powerful.
After foundational approaches and techniques have been covered, supervised, semi-
supervised and unsupervised IE system concepts are presented along with a few exem-
plary case studies.
7Logarithmically Reducible to Context-Free Languages
2.2.3 Supervised Information Extraction
Manually observing recurrent patterns in web pages is a rather cumbersome and error-prone
process which can be alleviated by automatically learning extraction rules from
labeled training documents. This approach is referred to as supervised IE. As depicted
in figure 2.3 on page 11, the user has to label relevant data with the help of a graphical
user interface (GUI). In the example of the book page, the user may mark "Ulysses" as the
title of the book and does so for a set of other pages. The IE system then tries to derive
rules from these examples and, depending on the IE system, may suggest additional
informative pages to be labeled by the user.
For example, Rapier is a supervised extraction system that uses a relational learning
algorithm [CM97]. It initializes the system with specific rules to extract the labeled
data and successively replaces those with more general rules. Syntactic and semantic
information is incorporated using a part-of-speech (POS) tagger. Extraction rules consist
of pre-filler, filler and post-filler patterns for each data field. These describe the context
and syntax of the extraction target. The respective patterns for extracting the publisher
name in the running example could be "<li>" and "Publisher:" as pre-filler
tokens and "<li>" as a post-filler token. Depending on the training data, the
filler pattern might specify that the publisher name consists of at most two words which
were labeled as nouns by the POS tagger.
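The pre-filler/filler/post-filler idea can be illustrated with a loose regex-based sketch (a hypothetical simplification; Rapier actually learns relational patterns over words, POS tags and semantic classes, not plain regular expressions):

```ruby
# A rule as pre-filler / filler / post-filler patterns.
Rule = Struct.new(:pre_filler, :filler, :post_filler)

publisher_rule = Rule.new(/Publisher:\s*/, /\w+/, /\s*<li>/)

page = "<li>Title: <i>Ulysses</i> <li>Author: James Joyce " \
       "<li>Publisher: Penguin <li>1040 pages"

# Assemble the rule into a single pattern: context before,
# captured filler, context after.
pre  = publisher_rule.pre_filler.source
fill = publisher_rule.filler.source
post = publisher_rule.post_filler.source
pattern = /#{pre}(#{fill})#{post}/

publisher = page[pattern, 1]  # => "Penguin"
```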
Other examples of supervised IE systems are SRV [Fre98], WIEN [KWD97], SoftMealy
[HD98], STALKER [MMK99] and DEByE [LRNdS02].
2.2.4 Semi-Supervised Information Extraction
Labeling training data in advance is a labor-intensive process limiting the scope of the IE
system. Instead of requiring labeled data, semi-supervised IE systems extract potentially
interesting data and let the user decide what shall be extracted. In other words, the user
provides feedback to the IE system which is incorporated into the wrapper generation
process.
In the running example, a semi-supervised system might recover title, author and
publisher as extractable data fields from a set of unlabeled book pages. The user
then selects which fields shall be extracted and how to integrate the information, e.g. by
labeling the titles as such in the extraction target tuple.
An example for a semi-supervised system is IEPAD [CL01]. Apart from extraction
target selection, semi-supervised IE systems are very similar to unsupervised IE systems.
2.2.5 Unsupervised Information Extraction
Automatic or unsupervised IE systems extract data from unlabeled training documents.
The core concept behind all unsupervised IE systems is to identify repetitive patterns in
the input data and to extract the data items embodied in the recurrent patterns.
Unsupervised IE systems can be subdivided into page-level extraction systems and
record-level extraction systems. The former extract data from a page-wide template,
while the latter assume multiple data records of the same type are available rendered
by a common template into one page. In case multiple records exist in a single web
page, it might be possible to derive extraction rules from a single web page, assuming
the individual data records can be told apart. The record-level extraction task can be
described as trying to extract various items from a list page (e.g. a product list from a
web shop). In contrast, page-level extraction tasks require multiple pages (e.g. product
detail pages) to discover patterns and learn extraction rules.
Evidently, record-level extraction systems can only operate on documents containing
multiple data records and require means to identify the data regions describing the
individual data records. The latter problem can be tackled with string or tree alignment
techniques. Examples for such systems are DEPTA [ZL05] and NET [LZ05].
Page-level extraction systems can treat the whole input page as a data region from
which the data record shall be extracted. However, multiple pages8 for wrapper induction
need to be fetched in advance. Thus, the problem of collecting training data is
shifted into the DR domain and is rarely addressed by IE researchers. Examples for
page-level extraction systems are RoadRunner [CMM01] and ExAlg [AGM03].
2.2.6 Case Studies
In the following, a selection of well-known IE systems are presented which try to solve
similar problems. One semi-supervised and three unsupervised IE systems are pre-
sented illustrating various techniques and the associated constraints to solve different
IE tasks.
RoadRunner
RoadRunner is one of the early unsupervised Web IE systems, presented in 2001 by
Crescenzi, Mecca and Merialdo [CMM01]. It compares multiple pages and generates
union-free9 regular expressions based on the identified similarities and differences.
RoadRunner initializes the wrapper with a random page of the input set and matches
the remaining pages using an algorithm called ACME matching. The wrapper is gener-alized for every encountered mismatch. Text string mismatches are interpreted as data
fields, tag mismatches are treated as indicators of optional items and iterators. In the
RoadRunner data model, individual data items must be separated by HTML tags but
tags must not occur as part of the data field. Figure 2.6 on the following page shows an
example of a wrapper generated from two input pages.
8At least two training pages are required for page-level wrapper induction. Depending on the IE system
and the template, however, ten or even more training pages may be necessary to successfully derive
extraction rules.
9A union-free regular expression does not contain disjunctions (e.g. (A|B)).
Figure 2.6: Wrapper induction example for RoadRunner [CMM01]
The runtime complexity is exponential in the input string length. Therefore, heuristics
were introduced to limit the exploration space.
ExAlg
Arasu and Garcia-Molina propose an IE system automatically deducing the template
from a set of template-generated pages [AGM03]. ExAlg has a hierarchically structured
data model and supports optional elements and disjunctions. A web page is modeled
as a list of tokens, in which a token might either be an HTML tag or a word from a text
node. ExAlg builds equivalence classes of the tokens found in the input documents.
Based on these sets of tokens, the underlying template is deduced.
Figure 2.7 on the next page shows four example pages where each template-token is
labeled with an index. Tokens with the same occurrence vector across all input docu-
ments build an equivalence class. The idea is that tokens emitted from the same template
constructor will likely occur with the same frequency. Furthermore, ExAlg can detect
tokens with multiple roles, e.g. the token Name in Book Name and Reviewer Name has
different semantics in either occurrence. It differentiates between roles based on the
occurrence-path10 and the spans of valid equivalence classes. For instance, an equivalence
class in the given example is {<b>, Reviewer, Name, Rating, Text, </b>}
with the occurrence vector ⟨1, 2, 1, 0⟩.
10The occurrence-path, as defined by Arasu and Garcia-Molina, has a close resemblance to an XPath query.
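The grouping step can be sketched by keying every token on its occurrence vector (input pages and tokenization invented for illustration):

```ruby
# Group tokens into equivalence classes: the key of each class is the
# vector of per-document occurrence counts shared by its tokens.
def equivalence_classes(docs)
  docs.flatten.uniq.group_by { |tok| docs.map { |d| d.count(tok) } }
end

docs = [
  %w[<html> Book Name Reviews <b> Reviewer Name </b> </html>],
  %w[<html> Book Name Reviews <b> Reviewer Name </b> <b> Reviewer Name </b> </html>]
]

classes = equivalence_classes(docs)
classes[[1, 1]]  # => ["<html>", "Book", "Reviews", "</html>"]
classes[[1, 2]]  # => ["<b>", "Reviewer", "</b>"]
classes[[2, 3]]  # => ["Name"] -- a token playing two roles
```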
ExAlg defines large and frequent equivalence classes (LFEQs) as classes containing
many tokens which occur in a large fraction of the input documents. The LFEQs are
hierarchically structured and the order of the tokens is preserved. The nesting is gov-
erned by the span formed by all tokens in the respective equivalence class. LFEQs are
passed to the analysis stage in which the template is deduced.
Figure 2.7: Input pages in ExAlg [AGM03]
Starting from the root LFEQ (the tokens occurring exactly once in all input documents),
ExAlg searches for non-empty positions between consecutive tokens and generates
type constructors for these locations. Nested LFEQs are recursively visited and
the types are constructed according to the data model. The generated template can then
be used to extract data from input pages. For the given example, the original schema
⟨B_Book, {⟨B_Reviewer, B_Score, B_Text⟩}⟩ can be recovered by analyzing the four input pages.
ExAlg has a sophisticated data model compared to other automatic IE systems.
Moreover, ExAlg operates on the token level, not on the tag level as many other
unsupervised extraction systems do, and thus has the chance of extracting attributes
embedded in text nodes without any markup. The effectiveness of the extraction tends
to improve with the number of input pages. However, experiments indicate that ExAlg
works well even for collections of under ten input documents, given that the occurrences
of the attributes to be extracted exceed the chosen threshold.
IEPAD
IEPAD, a semi-supervised IE system, was presented by Chang and Liu in 2001 [CL01]. It
is capable of extracting homogeneous data records from a set of unlabeled pages. IEPAD
generates wrappers by discovering repetitive patterns using multiple string alignment.
The input document is converted to a binary representation of the data. HTML tags
and text elements are mapped to a set of fixed length binary tokens. A PAT tree, which
is a binary suffix tree, is created from the binary representation. The PAT tree, in turn, is
used to find repetitive patterns by recording occurrence count and reference points for
each recurring pattern. To tolerate inexact matches, the center star algorithm is applied
to obtain generalized extraction patterns.
The candidate patterns and the occurrence metrics are presented to the user. Upon
selection of a pattern, a regular expression is created from the binary representation.
Thus, the wrapper can also operate on web pages without transforming those into the
binary representation.
DEPTA
DEPTA stands for Data Extraction based on Partial Tree Alignment and is an unsupervised
IE system [ZL05]. DEPTA extracts data records from list pages with an algorithm called
MDR, taking advantage of the tree structure of the HTML page. MDR was first pre-
sented by Liu et al. in 2003 [LGZ03].
The design of MDR is based on two observations about data records. The first observation
states that similar objects are likely located in a contiguous region and formatted
with almost identical or at least similar HTML tags. The second observation is that
similar data records are built by sub-trees of a common parent node.
The algorithm first builds the DOM-tree for the web page and stores the bounding
box for each element.11 Adjacent nodes that share the same parent are then compared by
computing the string edit distance of the tag strings. If the estimated similarity exceeds
a predefined threshold, the group of nodes is identified as a data region. To account for
data records that are spread over multiple sibling nodes, the concept of generalized nodes
was introduced. Generalized nodes encompass one or more sibling nodes. Figure 2.8
on the facing page shows an abstracted tag tree where nodes 5, 6 and 8, 9, 10 build two
data regions, as the respective nodes in each region are similar. The combined node-pairs
(14, 15), (16, 17) and (18, 19) are also similar to each other and each pair builds a
generalized node. Data records are derived from generalized nodes. However, there are
cases when such a node does not represent a single data record. DEPTA handles some
special cases to deal with these discontinuities in data records.
Finally, data fields are extracted from the alleged data records. After all tag-trees
belonging to the data record are assembled in a new tree, partial tree alignment is
performed to induce the structure of the data. The idea is to match the fields from all
data records to build a generalized representation of the data record.
11The visual information for each tag is supplied by a web browser.
Figure 2.8: Generalized nodes and data regions in DEPTA [ZL05]
MDR can handle non-contiguous data records and is capable of extracting data
records that span multiple sibling nodes. The assumption is made that HTML tags
are generated by the template and text nodes belong to the data to be extracted. Visual
cues are consulted to distinguish individual data records. However, the extraction is lim-
ited to flat data records. Support for nested data records (e.g. two data records sharing
data items from a common parent data record) was added in a successor system called
NET [LZ05]. In the latter system, a post-order traversal of the tag tree is performed to
identify data records at different levels. NET uses simple tree matching to compute the
tree similarity and aligns the trees whose similarity is above a chosen threshold.
2.2.7 Summary
This section introduced Web IE concepts and techniques and presented a few inter-
esting automatic IE systems from the literature. An information system consisting of
a document retrieval and an information extraction component is able to identify relevant
Web pages and extract salient data from the respective pages. However, to embed
the obtained information into an existing knowledge base, information integration
techniques are required.
2.3 INFORMATION INTEGRATION
"It is a very sad thing that nowadays there is so little useless information." (Oscar
Wilde)
After retrieving and extracting information from heterogeneous sources, the obtained
data needs to be related to existing data. The inherent challenges of information integra-
tion (II) originate in the structural and semantic heterogeneity of the various information
sources. Data can be laid out and stored in different ways depending on the chosen data
model, leading to structural heterogeneity. Semantic heterogeneity is concerned with the
content and meaning of the data.
Wache et al. state the problem of information integration and semantic interoperability
as follows [WVV+01]:

"In order to achieve semantic interoperability in a heterogeneous information system,
the meaning of the information that is interchanged has to be understood across
the systems. Semantic conflicts occur whenever two contexts do not use the same
interpretation of the information."
According to Pollock and Hodgson, semantic conflicts can be classified as naming
conflicts, scaling and unit conflicts, confounding conflicts or domain conflicts. Naming
conflicts occur in the presence of synonyms and homonyms, i.e. when multiple names
exist for the same entity or the same name denotes different entities. Different units and
currencies lead to scaling conflicts. Metrics may either be explicitly encoded in the data
or implicitly assumed. Confounding conflicts arise when a same-named entity is defined
differently by the various information providers. Finally, domain conflicts occur when
data is modeled with distinct domain-specific intentions, resulting in overlapping or
disjoint concepts [PH04].
Information integration can be approached with ontology-mapping techniques.
Ontologies are well suited to model hidden and implicit knowledge for different
domains. Wache et al. give a concise overview of ontology-based information inte-
gration techniques [WVV+01].
2.4 LEGAL CONSIDERATIONS
Retrieving, extracting and integrating information published by a third party may have
legal implications. The terms of service of the respective sites apply, which may prohibit
web scraping of their content. Although a few precedents exist, this is a grey area of
law, and rulings have differed depending on the jurisdiction and the case. Adhering to
the terms of use of a web site that is only visited by the IR/IE system is not realizable
unless the terms can be retrieved and understood by the crawler.
Legal advice should be sought before employing web scraping in a public or commercial
software system.
2.5 FEDSEEKO
Fedseeko is a federated search engine whose goal is to facilitate obtaining product
information from the Internet [WSS09]. It uses adapters to access diverse product
information providers such as online shopping malls, producer sites and third-party information
portals like forums or blogs. The information sources are accessed via web services if
such a possibility exists. For instance, the Amazon Product Advertising API12 provides
extensive vendor information through a web service. In case no such interface exists,
the information may be extracted using web scraping techniques. Figure 2.9 depicts the
12http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/
architecture of Fedseeko and its internal and external interfaces. The reference implementation
is based on Ruby on Rails.
Figure 2.9: Fedseeko architecture [WSS09]
2.5.1 Producer Information Integration
In the following section, some important aspects of the original producer information
extraction implementation will be outlined. As a first step, the manufacturer URL for a
given product is retrieved by a web search query. The first hit of a web search restricted
to the .com domain is considered to be the producer site and is the basis of downstream
product page searches. The product page is located via a phrase search on the
suspected producer site.
Fedseeko uses XPath queries to address the individual nodes associated with a prod-
uct attribute. The mining of XPath queries requires guidance: an example key/value
pair needs to be supplied, which is used to locate the proper product URL. Starting
from the suspected product page, the linked pages are walked and page contents are
matched via a similarity check against the key/value phrases. The search stops once a
page with the requested resemblance is found. Once a matching product page is found,
a Scrubyt13 extractor computes the XPath queries for the key, the value and the base
query, respectively. The identified XPath queries are associated with the producer, implying
a single producer-wide template.
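The idea of addressing attribute nodes with template-specific XPath queries can be sketched with Ruby's bundled REXML library. The markup and the queries below are hypothetical; Fedseeko itself relies on scRUBYt! extractors rather than plain REXML:

```ruby
require "rexml/document"

# Hypothetical product page fragment following a fixed template.
html = <<~HTML
  <table>
    <tr><th>Weight</th><td>42 kg</td></tr>
    <tr><th>Total Pixels</th><td>12.9 million</td></tr>
  </table>
HTML

doc = REXML::Document.new(html)

# One query pair per template: keys live in <th> cells, values in <td> cells.
keys   = REXML::XPath.match(doc, "//tr/th").map(&:text)
values = REXML::XPath.match(doc, "//tr/td").map(&:text)

attributes = keys.zip(values)
attributes.each { |key, value| puts "#{key}: #{value}" }
```

As long as a producer keeps the same template, such a stored query pair keeps yielding the correct attribute tuples for new product pages.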
Fedseeko uses mapping ontologies to relate producer information to available infor-
mation of similar products by other manufacturers.
The shortcomings of the existing producer information solution are first and fore-
most the required amount of user supervision. Supplying samples for each attribute
and producer is a labor-intensive process, especially considering the large variety of
13 scRUBYt! is a Ruby library designed to facilitate web scraping tasks. http://scrubyt.org
producers and the number of attributes associated with some products14. Furthermore,
the limitation of one template per producer is an oversimplification. Large
producers with a diverse product range may use slightly different templates for dif-
ferent product categories. A new approach towards producer information retrieval and
extraction, aiming to overcome the deficiencies of the existing implementation, will be
presented in this thesis.
2.6 SUMMARY
An overview was given covering the research areas information retrieval, information
extraction and information integration. The brief treatment of IR focused on effectiveness
metrics, while an in-depth introduction to Web IE was provided. Important IE tech-
niques have been presented and exemplary IE systems have been examined. Some of
the methods and techniques will be reused and referenced in the subsequent chapters.
II was covered only briefly for the sake of completeness but is otherwise outside the scope of
this thesis.15
Finally, the federated search engine Fedseeko has been introduced and its producer
information integration component was evaluated. During the course of this thesis, a
replacement of this component will be developed.
14 For instance, in the domain of digital cameras, more than one hundred attributes may be listed per product.
15 Related work conducted contemporaneously revamps the ontology mapping in Fedseeko.
3 REQUIREMENTS
The goal of the revised information extraction component is to minimize the effort as
well as the cost of obtaining and providing first-hand product information. Upon a
query for a certain product, the system shall extract all available product attributes from
the manufacturer's web site without requiring guidance or supervision. In contrast
to the existing IE system, web sites based on not yet encountered templates shall be
analyzed automatically, and extraction rules shall be inferred and stored for future requests.
A change of a known template requiring different extraction rules should be detected
and acted upon.
In this chapter, the information flow of the retrieval and extraction system is analyzed.
A functional and behavioral description is given. Finally, the validation criteria
for the software system will be briefly covered.
3.1 INFORMATION DESCRIPTION
In a nutshell, the information extraction system shall locate product pages on the Inter-
net and extract product attributes without any mandatory user interaction. As depicted
in figure 3.1 on the following page, the only input to the software system is a prod-
uct descriptor. This product descriptor or identifier may be manually entered or may
originate from vendor databases or other sources listing products.
The input is a tuple comprising a manufacturer name and a product identifier. The latter
can be decomposed into a list of tokens, where the tokens describe a specific product.
Based on this information, the manufacturer's product page is to be retrieved. An
example input is Apple Inc. and MacBook Pro.
The output is an ordered set of attribute tuples extracted from the product page asso-
ciated with a product. Each attribute tuple consists of a key and a value character string,
e.g. "Weight", "42 kg".
Figure 3.1: Overview of information flow
The extracted attributes may be saved in a database, be passed to a downstream
processor or be presented directly to the user. A product detail view in
Fedseeko presents the producer information alongside other related data like product
reviews. Furthermore, the extracted data is passed to an information integration system
performing ontology mapping. The latter task is carried out by a separate system which
will not be discussed herein.
The attributes to be extracted originate from product detail pages residing on the
respective manufacturer sites. Empirical observations regarding these pages will be pre-
sented in the next section.
3.1.1 Product Pages
The IE engine shall be able to extract product attributes from a large number of hetero-
geneous manufacturer pages. The following empirical observations describe character-
istics of typical product pages.
1. A product page with sufficient information often describes only a single product
but may contain data for different product variants.
2. A manufacturer may use more than one template for different product categories
or families.
3. There might be very few pages available with a common template.
4. Multiple description pages with different templates might exist for the same
product, e.g. a summary and a specification page.
These characteristics do not apply to all product domains. Throughout this work,
the focus is on those kinds of products for which a human operator could
easily tell product features apart by looking at the product page. Figure 3.2 on the
next page shows a product page of a Nikon digital camera, from which attributes like
"Total Pixels", "12.9 million" shall be extracted.
Figure 3.2: Product page example with the extraction targets being highlighted
3.2 FUNCTIONAL DESCRIPTION
The complete product information retrieval system can be decomposed into two major
components. One component is responsible for the identification of the manufacturer
site as well as the proper product page. The other component's task is to extract product
attributes from the aforementioned product page.
The document retrieval component locates and fetches the product page from the
manufacturer's web site. If multiple pages exist for a single product, the page with the
most syntactically structured content should be picked. For example, a specifications
page is better suited for Web IE than a free-text summary page.
The information extraction component extracts attribute tuples from a product page
of a specific template. Its job is to filter out irrelevant data and identify the useful bits
of information in a given document. Either new rules are derived for identifying the
extraction targets, or already stored rules are used to extract data from a page created
from a previously encountered template. Extraction from a page based upon a known
template is an on-line operation1. Therefore, it should deliver results within the time
frame given for the overall Fedseeko query to complete. In other words, if a query for
a Fedseeko product detail page should respond within fifteen seconds, the extraction's
execution time should not exceed this bound in the average case.
1 It shall be performed while the user of the system waits for a response to his request.
As it might not be possible to select the proper wrapper object to extract data from a
given document, a wrapper shall be able to detect ineligible input pages. In effect, the
wrapper verification problem must be solved inside the wrapper object.
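A wrapper that verifies its own applicability can be sketched as follows. This is an illustrative design assuming one key query and one value query per template, not the thesis implementation:

```ruby
require "rexml/document"

# A wrapper stores the XPath queries learned for one template and
# rejects pages on which those queries no longer yield anything.
class Wrapper
  def initialize(key_query, value_query)
    @key_query, @value_query = key_query, value_query
  end

  # Wrapper verification: the page is eligible only if the stored
  # queries still match at least one key/value node each.
  def eligible?(doc)
    !REXML::XPath.match(doc, @key_query).empty? &&
      !REXML::XPath.match(doc, @value_query).empty?
  end

  def extract(doc)
    return nil unless eligible?(doc)
    keys   = REXML::XPath.match(doc, @key_query).map(&:text)
    values = REXML::XPath.match(doc, @value_query).map(&:text)
    keys.zip(values)
  end
end

wrapper = Wrapper.new("//tr/th", "//tr/td")
page  = REXML::Document.new("<table><tr><th>Weight</th><td>42 kg</td></tr></table>")
other = REXML::Document.new("<p>no specification table here</p>")

p wrapper.extract(page)    # the template matches, tuples are extracted
p wrapper.extract(other)   # ineligible input page, nil is returned
```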
The wrapper induction component creates extraction rules for one or more pages sharing
a specific template. Wrapper induction only needs to be executed if a new template is
discovered or a known template has changed. Thus, the operation may be performed
off-line on a best-effort basis.
3.3 BEHAVIORAL DESCRIPTION
Most of the system's operations are invisible to the user. Upon requesting detailed
information for a given product, the system will retrieve the product page and extract
all product attributes from that page. No user input is required.
However, the system may not be able to retrieve the proper product page, may fail
to extract any information or may select bogus data. For these cases, the user may intervene
after the retrieval and extraction steps have been executed. The user shall be given
means to correct the estimated product page URL. Furthermore, extracted data may be
discarded, whereupon the extraction can be restarted. Should the automatic extraction
fail to deliver meaningful data, the user may provide hints to facilitate the extraction
process.
3.4 VALIDATION CRITERIA
The software system is evaluated according to a gold standard2. A control group of
one hundred products from twenty different domains is used to validate the proper
operation of the system as well as to measure the effectiveness of the retrieval and
extraction components. In order to spot the cause of extraction failures, the subsystems
are examined individually.
The automatic extraction of attributes shall work reliably in the majority of the test
cases. With additional information, it ought to be possible to successfully extract the
proper data from four out of five documents.
For each test product, the proper product URL is gathered manually and a reference
attribute is recorded. This manually gathered data is matched with the automatically
computed data during evaluation.
The document retrieval subsystem either succeeds or fails to locate a product page
suitable for information extraction. Therefore, the precision metric follows the
probabilistic interpretation and states the probability that the returned document is
relevant.
2 Wikipedia defines a gold standard test as a "diagnostic test or benchmark that is regarded as definitive" [Wik09]. Test results are interpreted in a way that no false-positive or false-negative results are included.
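Under this interpretation, the retrieval precision reduces to the fraction of test products for which the single returned URL matches the manually gathered reference URL. A sketch with made-up evaluation data:

```ruby
# Gold-standard evaluation sketch (illustrative products and URLs):
# the DR component returns one URL per product; precision(1) is the
# fraction of products for which that URL matches the reference URL.
gold     = { "Nikon D90"   => "/products/d90.html",
             "MacBook Pro" => "/macbookpro/specs.html" }
returned = { "Nikon D90"   => "/products/d90.html",
             "MacBook Pro" => "/macbookpro/index.html" }

hits = gold.count { |product, url| returned[product] == url }
precision_at_1 = hits.to_f / gold.size
puts precision_at_1   # 0.5
```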
3.5 SUMMARY
This chapter stated the goal of the software system and analyzed the requirements
from various perspectives. Based on the given problem analysis, a software system will
be developed. Its design, implementation and evaluation will be presented throughout
the subsequent chapters.
4 DESIGN
The system design is outlined in this chapter. A description of each component required
to solve the problem is provided as a processing narrative and in the context of the
architectural design.
4.1 DATA DESIGN
The input and output data is depicted in figure 4.1. The key components have been
identified as the product page locator, responsible for DR, and the components revolving
around the wrapper logic, responsible for Web IE. Both components and their design
constraints are described in this section.
[Diagram labels: Product ID, Product Page Locator, Manufacturer Web Site, Product Page, Wrapper Induction, Wrapper Database, Wrapper, Attributes]
Figure 4.1: Information flow during extraction
4.1.1 Retrieving Product Pages
The DR component must supply the downstream IE processor with a genuine product
page. In contrast to the more common DR systems in which a large set of documents
is returned, selecting the proper product page is a binary choice: either the right prod-
uct page is identified, or the IE component won't be able to extract relevant data. In
effect, the goal of the document retrieval subsystem is to optimize the precision for
the top-ranked candidate (i.e., according to the terminology introduced in section 2.1.2,
precision(1) shall be maximized).
In a full-fledged product page retrieval system, all manufacturer sites would have
to be indexed in advance in order to allow the retrieval of subordinate product pages.
However, this work puts the focus on the information extraction task, and only limited
resources are available. Hence, it was chosen not to build a dedicated document index
for product page retrieval from the World Wide Web. Instead, the results of existing
web search services are used and combined to pick the product page. The results of
multiple web search engines such as Google Search, MSN Search and Yahoo! Search
shall be aggregated to obtain maximum coverage of the World Wide Web and to benefit
from the well-established ranking algorithms used in the respective services.
Product page retrieval is laid out as a two-step process. In a first step, the producer
site is located and, in a second step, the product page is searched on the producer
site. In this manner, first-hand product information is not intermixed with third-party
information like web shop offers or product reviews. In case the proper producer site
cannot be located in the first step, the DR component should fall back to another
candidate. This is done if the product was not featured on the site.
Product Page Ranking
During product page retrieval on the producer site, the DR subsystem tries to pick the
proper page from the top-ranked set of candidates of multiple web search engines. Not
just using the single top-ranked candidate improves the chance that a relevant document
is among the set of retrieved documents. The rankings of the individual search engines
are combined using Borda ranking, known from social choice theory. In Borda ranking,
named after Jean-Charles de Borda who proposed it as an election method in 1770,
every voter announces an ordered list of preferred candidates. If there are n candidates,
the top-ranked candidate of each voter receives n points and each lower-ranked candi-
date receives a decremented score. Borda ranking and other search result combination
methods are discussed in Web Data Mining by Bing Liu [Liu06].
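The scheme just described can be sketched in a few lines of Ruby (illustrative URLs; candidates not ranked by an engine score zero from that engine, matching the tally shown in table 4.1):

```ruby
# Combine the ordered result lists of several search engines with Borda
# ranking: with n distinct candidates, a first place is worth n points,
# a second place n - 1, and so on; unranked candidates receive 0.
def borda_rank(rankings)
  candidates = rankings.flatten.uniq
  n = candidates.size
  scores = Hash.new(0)
  rankings.each do |ranking|
    ranking.each_with_index { |url, i| scores[url] += n - i }
  end
  candidates.sort_by { |url| -scores[url] }.map { |url| [url, scores[url]] }
end

engine_a = ["/news/shiny", "/products/detail", "/products/index"]
engine_b = ["/products/index", "/forum/42", "/products/detail"]

p borda_rank([engine_a, engine_b])
# [["/products/index", 6], ["/products/detail", 5], ["/news/shiny", 4], ["/forum/42", 3]]
```

Note how the relevant detail page can still be outscored by an index page, which is why additional ranking heuristics are needed on top of the combined score.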
Table 4.1 shows the search results of an artificial query. As indicated in the example,
a combined ranking may not suffice to select the proper document from a set of
candidates. Therefore, additional metrics are incorporated to refine the original ranking.
Figure 4.2 on the next page gives an overview of the approaches used to process the
candidate list. Some techniques try to identify a page that contains specification
information, and other methods scan for references to the searched product. The scores of
Table 4.1: Top four search results of two web search engines

Document                                 Relevant?   Rank A   Rank B   Borda Rank
/news/november/the_new_shiny_product     no          1        -        4 + 0 = 4
/products/detail.html?category=6&id=17   yes         2        4        3 + 1 = 4
/products/index.html?category=6          no          3        1        2 + 4 = 6
/forum/show.html?post=42                 no          4        3        1 + 2 = 3
/reviews/produ