8/9/2019 Syntactical Integration of Product Information From Semi-Structured Sources
Department of Computer Science, Institute for Systems Architecture, Chair of Computer Networks
Diplomarbeit
SYNTACTICAL INTEGRATION OF PRODUCT INFORMATION FROM
SEMI-STRUCTURED SOURCES
Ludwig Hähne
Mat.-Nr.: 2959267
Supervised by:
Dipl.-Medieninf. Maximilian Walther
Prof. Dr. rer. nat. habil. Dr. h. c. Alexander Schill
Submitted on July 16, 2009
ABSTRACT
This thesis presents a novel product information retrieval and extraction system. The
goal is to provide a solution which automatically locates the manufacturer's page of a
given product and extracts relevant product attributes. The document retrieval subsys-
tem exploits multiple web search services and uses various heuristics to improve the
ranking. The unsupervised extraction of product attributes is based on syntactic fea-
tures of the product pages. XPath queries are used to cluster and select genuine product
attributes from web documents. Three different extraction rule induction algorithms are
presented. One variant uses multiple training documents, another incorporates already
extracted data, and a supervised solution falls back on user-supplied examples. A web
crawler was developed which automatically retrieves pages sharing common underlying
page templates.
The implementation extends an experimental federated search engine developed at
the TU Dresden. The extracted product attributes are meant to spice up already available
data with first-hand information gathered from the respective manufacturer sites. The
system was evaluated according to a gold standard. Considering the low expenses in
terms of user guidance effort and execution time, the system exhibits good precision
and recall metrics.
CONFIRMATION
I confirm that I independently prepared the thesis and that I used only the references
and auxiliary means indicated in the thesis.
Dresden, July 16, 2009
CONTENTS
1 Introduction 1
2 State of the Art 3
2.1 Document Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Document Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Retrieval Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Web Crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Wrapper Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Supervised Information Extraction . . . . . . . . . . . . . . . . . . . 16
2.2.4 Semi-Supervised Information Extraction . . . . . . . . . . . . . . . 16
2.2.5 Unsupervised Information Extraction . . . . . . . . . . . . . . . . . 16
2.2.6 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Information Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Legal Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Fedseeko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Producer Information Integration . . . . . . . . . . . . . . . . . . . 23
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Requirements 25
3.1 Information Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Product Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Functional Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Behavioral Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Validation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Design 31
4.1 Data Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 Retrieving Product Pages . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.2 Information Extraction from Product Pages . . . . . . . . . . . . . 34
4.2 Architectural Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Fedseeko Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Implementation 43
5.1 Product Page Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1.1 Locating the Producer Site . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.2 Locating the Product Page . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.3 Crawling Related Product Pages . . . . . . . . . . . . . . . . . . . . 47
5.1.4 Locator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Information Extraction Prototype . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.1 Data Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.2 Phrase Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 Phrase Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.4 XPath Query Generalization . . . . . . . . . . . . . . . . . . . . . . 50
5.2.5 Wrapper Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Information Extraction Implementation . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Wrapper Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.2 Attribute Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.3 Selecting a Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.4 Architecture of the Web IE Subsystem . . . . . . . . . . . . . . . . . 57
5.4 Fedseeko Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6 Evaluation 61
6.1 Feature Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Effectiveness and Performance Evaluation . . . . . . . . . . . . . . . . . . 62
6.2.1 Test Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.2 Product Page Retrieval Effectiveness . . . . . . . . . . . . . . . . . . 63
6.2.3 Related Page Crawling Effectiveness . . . . . . . . . . . . . . . . . . 64
6.2.4 Information Extraction Effectiveness . . . . . . . . . . . . . . . . . . 66
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7 Conclusion 73
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A Glossary 75
LIST OF FIGURES
2.1 Interplay of document retrieval, information extraction and integration in
web data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Template-driven web page creation from database records . . . . . . . . . 9
2.3 Different wrapper induction strategies [CKGS06] . . . . . . . . . . . . . . 11
2.4 General tree mapping example [ZL05] . . . . . . . . . . . . . . . . . . . . . 13
2.5 Iterative partial tree alignment example [ZL05] . . . . . . . . . . . . . . . . 14
2.6 Wrapper induction example for RoadRunner [CMM01] . . . . . . . . . . . 18
2.7 Input pages in ExAlg [AGM03] . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8 Generalized nodes and data regions in DEPTA [ZL05] . . . . . . . . . . . 21
2.9 Fedseeko architecture [WSS09] . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Overview of information flow . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Product page example with the extraction targets being highlighted . . . 27
4.1 Information flow during extraction . . . . . . . . . . . . . . . . . . . . . 31
4.2 Selecting a product page from a set of candidates using multiple techniques 33
4.3 Navigating to a related product page (Nikon D90 to Nikon D3X) . . . . . 34
4.4 Examples of specification data embedded in different containers . . . . . 36
4.5 Clustering text nodes from multiple documents . . . . . . . . . . . . . . . 38
4.6 Source code of the two pages from figure 4.5 . . . . . . . . . . . . . . . . . 38
4.7 Architecture overview of the complete system . . . . . . . . . . . . . . . . 40
5.1 Ranking a set of candidate documents using multiple techniques . . . . . 44
5.2 Architecture of the DR subsystem . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Supervised retrieval and extraction . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Architecture of the IE subsystem . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Fedseeko product administration view . . . . . . . . . . . . . . . . . . . . 59
6.1 Word cloud visualizing the most common terms in key phrases . . . . . . 62
6.2 Effectiveness of locating the right producer sites and product pages . . . 63
6.3 Product page retrieval runtime performance distribution . . . . . . . . . . 65
6.4 Number of successful operations of each isolated component . . . . . . . 67
6.5 Correctness and completeness of extraction results . . . . . . . . . . . . . 68
6.6 Example of a nested template page . . . . . . . . . . . . . . . . . . . . . . . 68
6.7 Example of specification page for multiple products . . . . . . . . . . . . . 69
6.8 Information extraction runtime performance . . . . . . . . . . . . . . . . . 70
1 INTRODUCTION
The World Wide Web is a place where millions if not billions of products are marketed,
searched, sold, bought and reviewed. Potential customers have a multitude of different
sources at their disposal to facilitate a purchase decision. There are various product
review sites, web shops provide product descriptions, blogs gain popularity as informa-
tion resources, and there is the information published by the product's manufacturer. An
important factor is the reliability of the individual information sources. When it comes
to buying an expensive product, a customer probably prefers to resort to the most reli-
able source of information. However, it is getting increasingly difficult to find first-hand
product information via a simple web search. To reach potential customers, manufacturers
have to compete with many other information providers in order to receive attention
and a good search engine rank.
Nowadays, web search engines are the single point of contact interfacing to the exuberant
information in the World Wide Web. However, today's web search engines predominantly
only inform about the whereabouts of data and can still not answer complex
queries. It is very difficult to do better as long as the web content is not semantically
interwoven.
Not only Tim Berners-Lee believes the Semantic Web to be the future of the Internet
[BLHL01]. Instead of phrasing keyword queries and wading through search results to
find relevant information, the vision is letting the Semantic Web answer actual questions.
In the context of product information retrieval one might want to ask questions
like: How much power does the latest Siemens refrigerator consume compared to its
predecessor and the new flagship product of Penguin Electrics? As old as this vision
is, it still has a long way to go. Web developers are required to semantically describe
their data in languages that may seem too complex and lavish to pick up easily. Espe-
cially the lack of obvious short term benefits may impede the adoption of Semantic Web
technologies. It is not helpful either that a semantic query system needs a somewhat
complete knowledge base in the target domain to be valuable for a potential user. But
what if semantic data could be condensed out of existing web pages?
One idea is to bridge the gap between the "syntactic" Web and the Semantic Web
by automatically transferring information from traditional web pages into a semantic
context with the help of information extraction techniques. Acknowledged, information
extraction systems will not immediately provide the anticipated power of the Semantic
Web without further efforts. But these systems might help to facilitate the migration
process in some well-defined domains, one of which might be product information
extraction.
With an automatic product information extraction and integration system at hand, it
would be possible to find similar products based on all kinds of feature-related criteria.
It would also relieve the customer of retrieving the producer information for the inter-
esting products manually. Furthermore, such a system would be manufacturer- and
vendor-independent.
This work presents a novel approach towards automatic Web information extraction
and strives to become an enabling technology for product information integration.
A prototype implementation was developed and integrated into a federated search
engine, demonstrating the practical viability for product information integration and its
immanent challenges: locating product pages, automatically collecting training data for
pattern mining and identifying and extracting valuable product data.
The product page location component resorts to multiple web search services and
incorporates various heuristics to optimize the retrieval precision. The extraction
exploits structural characteristics of template-generated web pages. Extraction rules
are stored as XPath queries in the system. A low complexity clustering algorithm is
utilized to derive these extraction rules. Three algorithms are proposed, corresponding
to different degrees of automation.
Chapter two provides the theoretical background and discusses the state of the art in
Web information extraction and related fields of research. In chapter three the
requirements of the novel product information extraction system are analyzed. The
subsequent chapters deal with the design and implementation of the software system.
Chapter six dissects the advantages and drawbacks of the presented solution and eval-
uates the system according to a gold standard. Finally, a summary and an outlook are
given.
2 STATE OF THE ART
Integrating information from the World Wide Web into a local database relies on three
major components, as depicted in figure 2.1. In this chapter, important concepts of
document retrieval and information extraction are outlined and an overview of the state
of the art in each field is given. This work strongly focuses on information extraction
and thus presents a selection of existing information extraction systems. Information
integration is covered briefly for the sake of completeness. The chapter closes with the
presentation of Fedseeko, the system into which the new information extraction system
shall be integrated.
Figure 2.1: Interplay of document retrieval, information extraction and integration in
web data extraction
2.1 DOCUMENT RETRIEVAL
"Knowledge is of two kinds. We know a subject ourselves, or we know where we can
find information upon it." Samuel Johnson
Information retrieval (IR) is often only loosely defined. Moreover, in the context of
most retrieval systems information retrieval actually refers to document retrieval. In effect,
information retrieval shall be synonymous to document retrieval (DR) in this thesis.
Being less ambiguous, the latter term is preferred. Lancaster gives the following
definition of IR that also draws a dividing line separating related fields of research like
fact retrieval or question answering [Lan68]:
Definition 1 (Information Retrieval)
An information retrieval system does not inform (i.e. change the knowledge of) the user on the
subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts
of documents relating to his request.
Document retrieval aims to find relevant information from a large corpus of docu-
ments. Given a user query, traditional DR systems identify and rank documents in a
corporate or library network or on a single host (e.g. desktop search). In the context of
the Internet, DR is an important foundation of web search technologies with web pages
building the document corpus. Due to the vast amount of web content with trillions of
web pages, web search systems have different requirements than traditional DR systems.
User queries normally are lists of words. Based on a query, the DR system finds
relevant documents by matching the query tokens with the documents' contents. In
the simplest case, each word occurring in the query must also occur in the document.
Phrase queries are also a very common instrument in IR. In addition, the query may
contain Boolean operators or means to express that two tokens must occur near each
other. However, complex query constructs are rarely used in practice as those make the
DR task more difficult for the users.
In the following, DR document models, effectiveness metrics and web crawlers are
discussed.
2.1.1 Document Model
The document model specifies how the documents and queries are represented and
governs how the relevance of a document in respect to a query is computed.
A document can be modeled in many different ways. It is common to most mod-
els that documents and queries are treated as a "bag of words or terms" in which term
sequence and position are ignored [Liu06]. An important characteristic of document
models is whether and how term-interdependencies are modeled. In the simplest case,
each word is treated independently. According to Kuropka, the various approaches can
be divided into set-theoretic models (e.g. Boolean model), algebraic models (e.g. vector-
space model) and probabilistic models [Kur04]. The different models will be briefly
presented in the following.
In the Boolean model each term is only checked for its presence or absence in a
document. A query in a Boolean retrieval system can be given as a logical equation
combining terms with logic operators, e.g. "James Joyce" AND Trieste. A document
is relevant in respect to the query if the contained set of terms make the query logically
true. Boolean models have the disadvantage that no ranking can be derived from the
simple definition of the problem. Neither the term frequency is examined nor does the
model permit inexact matches.
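The mechanics of the Boolean model can be made concrete with a short sketch (an illustration, not part of the thesis; the whitespace tokenizer and the AND / AND NOT semantics are simplifying assumptions):

```python
def tokenize(text):
    """Lower-case a document and reduce it to its set of terms,
    ignoring term order and frequency (bag/set of words)."""
    return set(text.lower().split())

def matches(doc_terms, required, excluded=()):
    """A document satisfies the query if it contains every required
    term and none of the excluded ones (AND / AND NOT semantics)."""
    return (all(t in doc_terms for t in required)
            and not any(t in doc_terms for t in excluded))

docs = {
    "d1": "James Joyce wrote Ulysses while living in Trieste",
    "d2": "James Joyce was born in Dublin",
}

# Query: "james" AND "joyce" AND "trieste"
hits = [d for d, text in docs.items()
        if matches(tokenize(text), ["james", "joyce", "trieste"])]
print(hits)  # ['d1']
```

Note that the result is an unranked set: as described above, the model only decides relevance, it cannot order the hits.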
In the vector-space model a document is represented by an n-dimensional vector,
in which each dimension represents a distinct term of the vocabulary from the whole
document corpus. The weight of the term is computed from its occurrence characteristic
in the document. The query is also modeled as such a vector. Now the relevance of the
document in respect to the query can be computed as the cosine of the angle between
the two vectors, defined as the cosine similarity (see equation (2.1)).

cos θ = (d · q) / (‖d‖ ‖q‖)    (2.1)
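Equation (2.1) can be sketched directly in code (an illustrative sketch; the toy vocabulary and term weights below are invented, and real systems would typically use tf-idf weights over a large vocabulary):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query
    vector q: dot product divided by the product of the norms."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)

# Toy vocabulary ("joyce", "trieste", "dublin"); one dimension per term.
d = [2.0, 1.0, 0.0]   # document mentions "joyce" twice, "trieste" once
q = [1.0, 1.0, 0.0]   # query: "joyce trieste"
print(round(cosine_similarity(d, q), 3))  # 0.949
```

Because the cosine is normalized by the vector lengths, a long document is not favored over a short one that matches the query equally well.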
An example for a probabilistic approach are language models which were first pro-
posed for document retrieval by Ponte and Croft [PC98]. In a statistical language model
a probability distribution of the n-grams is computed for each document in the corpus.
The idea is to derive the ranking of a document di in respect to a query q from the a pos-
teriori probability P(di|q). This is essentially the likelihood of the query being generated
by the respective language model.
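A minimal query-likelihood sketch may illustrate the idea (an assumption-laden simplification: a unigram model with Laplace smoothing is used here for brevity, whereas Ponte and Croft's original proposal employs a different smoothing scheme):

```python
from collections import Counter

def query_likelihood(doc_tokens, query_tokens, vocab_size, alpha=1.0):
    """Estimate P(q|d) under a unigram language model of the document,
    with Laplace (add-alpha) smoothing so unseen terms keep a small
    non-zero probability. Documents are ranked by this likelihood."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    p = 1.0
    for t in query_tokens:
        p *= (counts[t] + alpha) / (n + alpha * vocab_size)
    return p

d1 = "james joyce ulysses trieste".split()
d2 = "penguin classics catalogue".split()
vocab = len(set(d1) | set(d2))
q = "joyce trieste".split()

# The document more likely to "generate" the query ranks higher.
print(query_likelihood(d1, q, vocab) > query_likelihood(d2, q, vocab))  # True
```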
The ranking derived from the degree of relevance is governed by the internal doc-
ument model. It may reflect poorly the actual relevance of documents as perceived by
the user. Thus, effectiveness metrics are required to evaluate the performance of a DR
system.
2.1.2 Retrieval Effectiveness
Numerous metrics have been proposed to measure the performance of DR systems. The
most commonly used metrics are precision and recall. Assuming a document is either
relevant or irrelevant in respect to a query, precision is the fraction of relevant documents
in the set of retrieved documents. In contrast, recall is the ratio of the number of relevant
documents retrieved to the total number of relevant documents (including those that
were not retrieved). Both metrics are related and are most often examined in context
of each other. For example, it is trivial to achieve 100% recall by just returning all
documents for every query. However, the precision metric would immediately reveal
the deficiency of such an approach. Another commonly used metric is the F-score (or
F-measure) which is defined as the weighted harmonic mean of precision and recall.
Web search engines typically present search results in buckets of around ten documents.
Users, however, do not consider search results beyond the first few result pages.
In effect, a relevant but very low-ranked1 document is essentially useless from the user's
perspective. Therefore, the ranking is also considered in the performance evaluation by
only examining the first i search results.
Let D be the whole document corpus. A query is submitted to a given DR system.
D_retrieved ⊆ D is an ordered set of all retrieved documents, while D^i_retrieved are the
i top-ranked documents returned by the system. D_relevant ⊆ D is the set of all relevant
documents. The effectiveness metrics can be computed according to the equations (2.2).

precision(i) = |D_relevant ∩ D^i_retrieved| / |D^i_retrieved|

recall(i) = |D_relevant ∩ D^i_retrieved| / |D_relevant|

F-score(i) = 2 · precision(i) · recall(i) / (precision(i) + recall(i))    (2.2)
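The metrics of equations (2.2) translate directly into code (an illustrative sketch; the ranked result list and gold-standard set below are invented):

```python
def precision_recall_f(retrieved, relevant, i):
    """Precision, recall and F-score at cut-off i, following
    equations (2.2): `retrieved` is the ranked result list,
    `relevant` the set of all relevant documents in the corpus."""
    top_i = set(retrieved[:i])
    hits = len(top_i & relevant)
    precision = hits / len(top_i) if top_i else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

retrieved = ["d3", "d1", "d7", "d2", "d9"]   # ranked system output
relevant = {"d1", "d2", "d5"}                # gold standard
p, r, f = precision_recall_f(retrieved, relevant, 5)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.4 0.67 0.5
```

The example also shows the trade-off discussed above: returning more documents can only raise recall, while precision typically drops.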
In order to identify all relevant documents, the DR system first needs to be aware of
the existence and whereabouts of the individual web pages. The gathering of web pages
is performed by a web crawler, which is presented in the following section.
2.1.3 Web Crawler
Web IR systems have to gather web pages to build a document index. This non-trivial
task is performed by a web crawler, also known as spider or robot. Web crawlers
recursively follow links in web pages to build a document index. As the Internet is constantly
evolving, web sites need to be visited regularly to account for new or changed content.
Definition 2 (Spider [FOL09])
A program that automatically explores the World-Wide Web by retrieving a document and
recursively retrieving some or all the documents that are referenced in it.
The best known web crawlers are universal crawlers which operate on behalf of web
search engines collecting the data for the document index. According to Brin and Page,
the crawler is the most fragile component of a search engine because it has to interact
with millions of remote servers all beyond the control of the system [BP98]. Thus, a
crawler has to be very robust and handle a multitude of corner cases even if that might
affect only a single page.
Crawlers may impose a huge stress on the resources of the respective hosts if the
request rate is not limited, leading to denial of service attacks in the worst case. Fur-
thermore, crawlers should identify themselves and comply with the robots exclusion stan-
dard2.
1Depending on the user's web browsing behavior and motivation, he or she might wade through one
hundred search results, but more likely no more than ten search results will be considered by the user.
2The robot exclusion standard or protocol is a de facto standard described at http://www.robotstxt.org/wc/robots.html.
In addition to universal crawlers, there are focused and topical crawlers exploring
the Web based on user preferences. These try to find only pages relating to a category
of interest or being similar to a set of seed pages. Focused crawlers differ from universal
crawlers in their strategy of picking URLs that shall be visited.
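The recursive link-following of definition 2 can be condensed into a short breadth-first sketch (an illustration, not the thesis implementation: the injectable `fetch` function and the three-page fake web are assumptions made so the sketch runs without network access, and a real crawler would additionally honour robots.txt, identify itself, and rate-limit requests per host):

```python
from collections import deque

def crawl(seed, fetch, max_pages=10):
    """Visit pages breadth-first starting at `seed`, recursively
    following links, and return the list of visited URLs.
    `fetch(url)` returns (html, outgoing_links)."""
    frontier = deque([seed])
    visited = []
    seen = {seed}            # URLs already queued, to avoid cycles
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        _, links = fetch(url)
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Fake three-page web for demonstration purposes.
web = {
    "a": ("<html>A</html>", ["b", "c"]),
    "b": ("<html>B</html>", ["a"]),
    "c": ("<html>C</html>", []),
}
print(crawl("a", lambda u: web[u]))  # ['a', 'b', 'c']
```

A focused crawler, as described above, would differ only in how the frontier is ordered: instead of plain FIFO order, URLs judged closer to the topic of interest would be dequeued first.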
2.1.4 Summary
Important concepts of IR have been briefly presented. Information retrieval in terms
of document retrieval is just a first step in obtaining and processing knowledge in an
information system. The next step of information processing is the extraction of more
fine-grained information from the retrieved documents.
2.2 INFORMATION EXTRACTION
"Get your facts first, and then you can distort them as much as you please." Mark
Twain
Roughly speaking, information extraction (IE) aims to condense knowledge about a
specific domain of interest. Attributes of the domain's entities or facts are distilled from
one or more input documents. The goal of IE is enabling the information system to
reason based on the extracted data. For example, an IE system that collects facts on the
world's countries may extract attributes such as population, capital or natality [UTF08].
Definition 3 (Information Extraction [SAI01])
Rather than indicating which documents need to be read by a user, [Information Extraction]
extracts pieces of information that are salient to the user's needs.
IE produces structured data from unstructured and semi-structured documents.
Semi-structured data typically refers to tables and lists, which are characteristic for web
pages. Whether a document is perceived as structured or unstructured depends on the
research domain. Databases are typically regarded as structured data while free text
is commonly classified as unstructured. The classification, however, cannot be solely
based on the data format. It is quite possible to dump a whole unstructured document
into a single database record, or strictly format a text file as a sequence of key-value
tuples. Similarly, an HTML body may contain an unstructured stream of free text or a
fine-grained table. Nevertheless, in the IE community, HTML is commonly classified as
semi-structured data, while XML documents with available meta-data are considered
being structured [CKGS06]. The dividing line between semi-structured and structured
data is drawn between documents containing some kind of syntactic structuring ele-
ments (e.g. HTML tags) and semantic tags of the data.
While IE for unstructured documents like free text has been thoroughly investigated
during the last decades, as indicated by the success of the Message Understanding
Conferences [Gri97], IE for semi-structured documents has received a growing interest
from researchers during the last years. For the respective tasks, different techniques
are required. Traditional IE needs to extract knowledge from human language texts and
typically uses lexicons and grammars to achieve this goal. Web IE takes advantage of
the fact that web pages are often automatically generated from (structured) database
records. Because web pages are created from static templates, machine learning and
pattern recognition techniques can be applied to analyze the syntactic structure of the
documents. Web scraping or screen scraping are used synonymously for Web IE. The Jargon
File3 gives the following definition of screen scraping stressing the unintended usage of
the medium.
Definition 4 (Screen scraping)
The act of capturing data from a system or program by snooping the contents of some display
that is not actually intended for data transport or inspection by programs. [...] it often refers
to parsing the HTML in generated web pages with programs designed to mine out particular
patterns of content.
Chang et al. give an overview of contemporary Web information extraction sys-
tems and categorize those based on task difficulty, extraction technique and degree of
automation [CKGS06].
2.2.1 Data Model
In the following, a generic IE data model is described informally. The chosen model is
derived from the data model known from relational databases and is also referenced by
other IE researchers [AGM03, Liu06]. According to this model, the data is structured as
nested relations made up of basic types arranged in tuples and sets. A basic type B is
an atomic entity, typically a string in the context of web pages. The tuple type ⟨T1, T2, ..., Tn⟩
is an ordered collection of other types Ti. Tuples map to data records in a database
context. Set types {T} are constructed by multiple elements of the same type T, like a
list of equally typed tuples.
Let S be the schema of a book description. The data record (tuple) describing a book
might comprise the title, a set of authors, the publisher of the book and the number of
pages. Then the schema can be described as S = ⟨B_title, {B_name}_authors, B_publisher, B_pages⟩.
An instance of S is the value x = ⟨"Ulysses", {"James Joyce"}, "Penguin", 1040⟩.
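One way to mirror this nested-relation model in plain data structures is sketched below (an illustration only; the field names follow the book example, and representing sets as lists is a simplifying assumption of this sketch):

```python
# Basic types map to strings/numbers, the tuple type to an ordered
# record (here a dict with fixed keys), and the set type to a list of
# same-typed elements. This is the instance x of schema
# S = <B_title, {B_name}_authors, B_publisher, B_pages>.
book = {
    "title": "Ulysses",              # basic type B_title
    "authors": ["James Joyce"],      # set type {B_name}_authors
    "publisher": "Penguin",          # basic type B_publisher
    "pages": 1040,                   # basic type B_pages
}

# The nested set type can hold any number of equally typed elements.
print(book["title"], len(book["authors"]))  # Ulysses 1
```

As the text notes, many practical IE systems flatten such structures; the nested `authors` list is exactly the kind of field that often gets collapsed into a single string.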
A template-based semi-structured page is created from one or more data records
stored in a database and a template, as illustrated in figure 2.2. A
template maps instances of a certain schema to a web page. More formally, an encoded
web page P is created from a data record x and a template T via a template mapping
function λ. Thus, the page creation process can be modeled as P = λ(T, x).
The IE task is to extract x from P with T being unknown. If λ⁻¹_T is the extraction
function associated with template T, x′ = λ⁻¹_T(P) is performed by the extractor. The schema
of the extracted data rarely matches the model of the original data schema. Either the
IE system is not able to extract all data fields or only a subset of the data fields are
3The Jargon File is "a comprehensive compendium of hacker slang illuminating many aspects of hackish
tradition, folklore, and humor." [Jar03]
Figure 2.2: Template-driven web page creation from database records
required. Therefore, x′ is generally an incomplete approximation of the original data
record x. For example, the schema for the extracted data in the running example might
be S′ = ⟨B_title, B_authors, B_publisher⟩ with the nested data records for the authors collapsed
into a single data field and the page count being omitted. In practice, many IE systems
use simpler data models for the extraction targets than the one described. Particularly
nesting of set and tuple types is not supported by the majority of the available IE sys-
tems.
An example template for the running example is given in listing 2.1 using a pseudo
template language.
<html>
<body>
<h2>Books</h2>
<ul>
  <li>Title: <i>{title}</i>
  <li>Author: {authors}
  <li>Publisher: {publisher}
  <li>{pages} pages
</ul>
</body>
</html>

Listing 2.1: Template example
So far, the web page has been assumed to be a static document. However, techniques
such as Ajax allow operations to be performed asynchronously, for example the deferred
loading of additional content using XMLHttpRequest [Gar05]. This poses new challenges
for DR and IE systems if relevant information becomes available only after performing
a certain action, like clicking a link or button on the page. A potential way to
remedy this problem is to drive a full-fledged web browser with a JavaScript interpreter
and use a tool like Watir4 to store static snapshots of the dynamic page.
4Watir is an open-source library for automating web browsers: http://wtr.rubyforge.org/index.html
As has already been identified in this section, the goal of the Web IE system is to
extract data embedded in web pages created from a template. This task is performed by
a wrapper program which may be hand-crafted or automatically generated. Wrapper
generation techniques are discussed in the following section.
2.2.2 Wrapper Induction
According to a very general definition, a wrapper provides an interface to an entity and
allows it to be treated as if it were something else. In the Web IE context, a wrapper
allows a web page to be regarded as a database record. Consequently, the wrapper is
responsible for extracting one or more data records from web pages.
Early IE systems were programmed manually.5 A set of web documents is examined
and common patterns have to be identified by a human operator. Recurrent patterns
enable the programmer to write a wrapper for extracting the target data, either manually
or aided by pattern specification languages. The hand-crafted wrapper should then be
able to extract data from documents sharing the same template.
<html>
<body>
<h2>Books</h2>
<ul>
  <li>Title: <i>Ulysses</i>
  <li>Author: James Joyce
  <li>Publisher: Penguin
  <li>1040 pages
</ul>
</body>
</html>

Listing 2.2: Sample web page
Listing 2.2 shows a simple web page generated from the aforementioned template.
Assuming the extraction task is to extract the book's title, the programmer might write a
program that skips to the <i> tag and extracts the text that follows until the closing </i>
tag. Alternatively, regular expressions or XPath queries could be used. The different
variants to represent extraction rules are discussed on page 15.
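Both rule flavors can be sketched in Ruby; the snippet below uses REXML from the standard library on a well-formed variant of the sample page (closing tags added so the XML parser accepts it):

```ruby
require "rexml/document"

page = <<~HTML
  <html><body>
  <h2>Books</h2>
  <ul>
    <li>Title: <i>Ulysses</i></li>
    <li>Author: James Joyce</li>
    <li>Publisher: Penguin</li>
    <li>1040 pages</li>
  </ul>
  </body></html>
HTML

# Regular-expression rule: works on this page, but breaks as soon as
# the <i> tag carries attributes or appears inside a comment.
title_re = page[%r{<i>(\w+)</i>}, 1]

# XPath rule: addresses the title text node structurally.
doc = REXML::Document.new(page)
title_xp = REXML::XPath.first(doc, "//li/i/text()").to_s
```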
Manually programmed wrappers are prone to failures when templates change,
require knowledge of the employed technologies and are very labor-intensive. In
contrast to manually specifying extraction rules, wrapper induction systems derive these
from a set of training documents with various degrees of automation.
Regardless how the wrapper was generated, Web IE systems have to deal with the
problems of wrapper verification and wrapper repair. A wrapper relies on the extraction
targets to be encoded in a certain way. However, web pages are subject to change
and information providers may choose to replace their templates at any time. This
causes hardship for wrapper maintenance. The detection of whether the wrapper is
5Special-purpose IE tasks are often still conducted manually, e.g. extracting the links from a web search result page.
Figure 2.3: Different wrapper induction strategies [CKGS06]
suited to extract data from a presented page is called the wrapper verification problem.6
Adapting the wrapper to a changed template is called the wrapper repair problem. A way
to approach both problems is to learn and verify characteristic patterns of the target
data. In case of failure, the patterns can be used in attempting to adapt the wrapper to
the new template. However, both tasks are very difficult to solve and are still an active
research area [Liu06].
The goal of wrapper induction is to derive the encoding template from a collection
of encoded instances of the same type. Repeated patterns in HTML documents can be
detected with string or tree matching and alignment techniques. These will be discussed
in the next sections.
String Matching
String matching helps reveal to what extent two character strings resemble each
other. The Levenshtein distance is a commonly used measure of the similarity
of two strings [Lev65]. It is defined as the minimum number of operations to transform
one string into the other. These operations are inserting, deleting or replacing a single
character in the string. The edit distance can be computed using dynamic programming.
Let s1 and s2 be the input strings and n and m the respective character counts. The
table D of dimension (n + 1) × (m + 1) is initialized with D_{i,0} = i and D_{0,j} = j. The
remaining cells are computed using equation (2.3).

For all i ∈ [1..n], j ∈ [1..m]:

D_{i,j} = min( D_{i-1,j-1}      if s1[i] = s2[j]  (same character),
               D_{i-1,j-1} + 1  (replace),
               D_{i,j-1} + 1    (insert),
               D_{i-1,j} + 1    (delete) )                            (2.3)
6In fact, wrapper verification is also needed if the IE system may be confronted with ineligible pages,
i.e. pages that are created from different templates.
The final edit distance is retrieved from the bottom-right corner cell D_{n,m}. An alignment
path can be traced back through the matrix, illustrating the operations. The time
complexity of the algorithm is O(nm). Table 2.1 shows an example matrix for the
comparison of the character strings sheep and shepard, yielding a Levenshtein distance of 4.
For similarity computations, the edit distance can be normalized by dividing it by
the length of the longer string, max(n, m).
Table 2.1: Edit distance matrix of the strings "shepard" and "sheep"
s1 s h e p a r d
s2 0 1 2 3 4 5 6 7
s 1 0 1 2 3 4 5 6
h 2 1 0 1 2 3 4 5
e 3 2 1 0 1 2 3 4
e 4 3 2 1 1 2 3 4
p 5 4 3 2 1 2 3 4
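The recurrence of equation (2.3) translates directly into code; a minimal Ruby sketch (full matrix, no traceback of the alignment path):

```ruby
# Edit distance by dynamic programming; D has (n+1) x (m+1) cells.
def levenshtein(s1, s2)
  n, m = s1.length, s2.length
  d = Array.new(n + 1) { Array.new(m + 1, 0) }
  (0..n).each { |i| d[i][0] = i }
  (0..m).each { |j| d[0][j] = j }
  (1..n).each do |i|
    (1..m).each do |j|
      cost = s1[i - 1] == s2[j - 1] ? 0 : 1   # same character or replace
      d[i][j] = [d[i - 1][j - 1] + cost,
                 d[i][j - 1] + 1,              # insert
                 d[i - 1][j] + 1].min          # delete
    end
  end
  d[n][m]
end

levenshtein("shepard", "sheep")        # => 4, as in table 2.1
# Normalized similarity distance, max(n, m) = 7 here:
levenshtein("shepard", "sheep") / 7.0
```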
Tree Matching
String matching across non-trivial Web documents is a complex and expensive oper-
ation considering the average document length in terms of characters. There are no
pre-determined boundaries and the content and length of the data may differ across
multiple documents or records. The semi-structured nature of Web documents led to
the application of tree matching to conduct IE tasks. Tree matching compares the struc-
ture of two trees and computes a cost of pairing the vertices. In the context of Web IE,
the DOM-tree or parts thereof are commonly compared, using the element tags as the
vertex labels.
Tree matching computes a minimum-cost mapping for two ordered labeled trees.
According to the general definition, each node appears no more than once and the
order and hierarchical relations among nodes are preserved. Figure 2.4 on the facing
page illustrates such a mapping. Tai presented the first polynomial algorithm for computing
the edit distance based on dynamic programming [Tai79]. The algorithm has a
complexity of O(n₁n₂h₁h₂) in time and space, with n₁ and n₂ being the number of nodes
and h₁ and h₂ the heights of the respective trees.
Cost functions are assigned to the editing operations transforming one tree into
another, i.e. relabeling, deleting and inserting nodes. Relabeling is of special interest
as it lends itself to identifying recurrent patterns in similar structured documents. More
elaborate cost functions for the relabel operation may exploit syntactic (e.g. string edit
distance) or semantic (e.g. feature vector) similarities. Zigoris et al. propose using sup-
port vector machines to learn the parameters of the cost function for semantic matching.
The preliminary results, however, indicated no performance gain in comparison to
simpler cost functions [ZEZ06].
Figure 2.4: General tree mapping example [ZL05]
A more restrictive variant of tree matching was defined by Selkow in 1977 [Sel77].
According to Selkow's definition, insertion and deletion are limited to the leaf nodes and
node replacement is not supported. In effect, the aim of tree matching is to find the
maximum matching where every node pair has the same parent nodes. This definition
has been found to fit web documents better because structural (i.e. level-crossing)
changes are not generally applicable to DOM-trees [CAM01]. Simple tree matching (STM)
is an algorithm solving this problem in quadratic time [Yan91]. It is again based on
dynamic programming and shown in listing 2.3.
STM(A, B)
  if A.root ≠ B.root then
    return 0
  else
    m ← number of children of A
    n ← number of children of B
    M_{i,0} ← 0 for all i ∈ [0..m]
    M_{0,j} ← 0 for all j ∈ [0..n]
    for i = 1 to m do
      for j = 1 to n do
        M_{i,j} ← max(M_{i,j-1}, M_{i-1,j}, M_{i-1,j-1} + STM(A_i, B_j))
    return M_{m,n} + 1

Listing 2.3: Simple tree matching algorithm
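A direct Ruby transcription of the listing, with trees represented here as `[label, children]` pairs (a representation chosen for brevity; labels stand for element tags):

```ruby
# Simple tree matching: returns the size of the maximum matching
# between two ordered labeled trees A and B.
def stm(a, b)
  return 0 unless a[0] == b[0]   # differing roots match nothing
  a_children, b_children = a[1], b[1]
  m, n = a_children.length, b_children.length
  mat = Array.new(m + 1) { Array.new(n + 1, 0) }
  (1..m).each do |i|
    (1..n).each do |j|
      mat[i][j] = [mat[i][j - 1], mat[i - 1][j],
                   mat[i - 1][j - 1] + stm(a_children[i - 1], b_children[j - 1])].max
    end
  end
  mat[m][n] + 1                  # + 1 for the matched roots
end

t1 = ["ul", [["li", []], ["li", []], ["li", []]]]
t2 = ["ul", [["li", []], ["li", []]]]
stm(t1, t2)  # => 3 (the root plus two matched li children)
```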
Multiple Alignment
In order to identify patterns in case more than two strings or trees are involved, mul-
tiple sequence alignment (MSA) techniques can be applied. Multiple alignment has its
foundation in molecular biology where it is used to identify similarities of sequences
(e.g. proteins). Given a set of similar sequences, MSA tries to find an optimal align-
ment by inserting gaps into the sequences. Carrillo and Lipman presented an algorithm
based on multidimensional dynamic programming that yields optimal results but has
an exponential time complexity [CL88]. Hence, various heuristic methods have been
proposed amongst which the center star method has found its way into IE systems.
In this method, a center sequence c is selected from a set of sequences X minimizing
the pair-wise distance to the other sequences.
c = arg min_{x_c ∈ X} Σ_{x_i ∈ X} d(x_i, x_c)          (2.4)
Afterwards, the alignments with the remaining sequences are computed and gaps are
inserted into the center string where necessary. The time complexity of the center star
method is O(n²k²) for n sequences of length k. While of polynomial complexity, the
character sequence lengths of HTML pages still incur excessive runtimes in IE
systems.
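The center-selection step of equation (2.4) is easy to sketch in Ruby; a compact single-row Levenshtein serves as the distance d (the subsequent gap insertion into the center string is omitted):

```ruby
# Single-row variant of the Levenshtein edit distance.
def edit_distance(a, b)
  d = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev = d[0]
    d[0] = i
    b.each_char.with_index(1) do |cb, j|
      cur = [d[j] + 1, d[j - 1] + 1, prev + (ca == cb ? 0 : 1)].min
      prev = d[j]
      d[j] = cur
    end
  end
  d[b.length]
end

# Center star: pick the sequence minimizing the summed distance
# to all other sequences, as in equation (2.4).
def center(sequences)
  sequences.min_by { |c| sequences.sum { |s| edit_distance(s, c) } }
end

center(%w[sheep shepard sheet])  # => "sheep"
```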
Partial Tree Alignment
Partial tree alignment was specifically crafted to solve the multiple alignment problem
in an IE context [ZL05]. It aligns multiple trees by progressively growing a seed tree.
The latter is initialized to be the tree with the maximum number of nodes. This way
it likely aligns well with the other trees. The remaining trees are matched by linking
matching nodes and trying to insert nodes into the seed tree for which no match was
found.
Nodes are only inserted if a position can be uniquely determined. That is, if the
neighboring siblings in the source tree are matched with consecutive siblings in the
seed tree. Figure 2.5 illustrates growing such a seed tree Ts from three input trees.
Figure 2.5: Iterative partial tree alignment example [ZL05]
Extraction Rules
Once the extraction targets are identified, rules to mine the relevant information need
to be formalized and stored for future use. There are various possibilities ranging from
first-order logic rules over regular expressions to XPath and CSS selectors. Logic rules
are primarily used in free-text IE, where common tokens and characteristic delimiters
facilitating the other approaches are rarely available.
Regular expressions have been widely adopted for data mining from semi-structured
documents. In the example in listing 2.2 on page 10, the title of the book can be mined
with the regular expression <i>(\w+)</i>. In practice, however, regular expressions
are not very well suited to match data in HTML documents. Correctly matching all
possible incantations of a specific HTML tag with a regular expression is a daunting
task, especially due to the statefulness of the HTML syntax. For example, the given
expression will not work if the tag contains any attributes and will unintentionally
match occurrences of the tag in comments or strings. Therefore, interest has
recently shifted to query languages like XPath or CSS selectors, which are much more
suitable for extracting information from an HTML or XML document.
Especially the usage of the XPath language in Web information extraction has gained
importance with a growing number of libraries supporting this query mechanism. In a
nutshell, XPath queries provide means to address node-sets or individual nodes in the
DOM tree of an XML (or HTML) document. For instance, //li/i/text() addresses the
title phrase of the book in the running example while querying for //ul/li[1] returns
the node containing the whole book-title attribute. XPath queries are far more powerful
than the examples given above. This complexity, however, has caused hardship for
providing full support of the XPath standard in implementations and uncertainty
concerning the complexity of XPath queries in general. Gottlob et al. have shown that
large fragments of XPath are of LOGCFL7 complexity and thus can be massively
parallelized [GKP03]. A more elaborate treatise on XPath can be found in Essential XML
Quick Reference [SG01].
O'Keefe and Trotman present a number of query languages besides XPath and argue
that most available solutions are overly complicated [OT03]. On the one hand, the lack
of comprehensive support for the XPath 1.0 standard in many query libraries backs this
assumption. On the other hand, in Web IE the expressive power to select the relevant
parts of the available information with the utmost precision is a more favorable goal
than a simpler yet inferior solution. CSS selectors, for example, share similar concepts
with XPath queries but are not quite as powerful.
After foundational approaches and techniques have been covered, supervised, semi-
supervised and unsupervised IE system concepts are presented along with a few exem-
plary case studies.
7Logarithmically Reducible to Context-Free Languages
2.2.3 Supervised Information Extraction
Manually observing recurrent patterns in web pages is a rather cumbersome and error-prone
process which can be alleviated by automatically learning extraction rules from
labeled training documents. This approach is referred to as supervised IE. As depicted
in figure 2.3 on page 11, the user has to label relevant data with the help of a graphical
user interface (GUI). In the example of the book page, the user may mark "Ulysses" as the
title of the book and does so for a set of other pages. The IE system then tries to derive
rules from these examples and, depending on the IE system, may suggest additional
informative pages to be labeled by the user.
For example, Rapier is a supervised extraction system that uses a relational learning
algorithm [CM97]. It initializes the system with specific rules to extract the labeled
data and successively replaces those with more general rules. Syntactic and semantic
information is incorporated using a part-of-speech (POS) tagger. Extraction rules consist
of pre-filler, filler and post-filler patterns for each data field. These describe the context
and syntax of the extraction target. The respective patterns for extracting the publisher
name in the running example could be "<li>" and "Publisher:" as pre-filler
tokens and "<li>" as a post-filler token. Depending on the training data, the
filler pattern might specify that the publisher name consists of at most two words which
were labeled as nouns by the POS tagger.
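The pre-filler/filler/post-filler idea can be illustrated with a loose regex-based sketch (a hypothetical simplification; Rapier actually learns relational patterns over words, POS tags and semantic classes, not plain regular expressions):

```ruby
# A rule as pre-filler / filler / post-filler patterns.
Rule = Struct.new(:pre_filler, :filler, :post_filler)

publisher_rule = Rule.new(/Publisher:\s*/, /\w+/, /\s*<li>/)

page = "<li>Title: <i>Ulysses</i> <li>Author: James Joyce " \
       "<li>Publisher: Penguin <li>1040 pages"

# Assemble the rule into a single pattern: context before,
# captured filler, context after.
pre  = publisher_rule.pre_filler.source
fill = publisher_rule.filler.source
post = publisher_rule.post_filler.source
pattern = /#{pre}(#{fill})#{post}/

publisher = page[pattern, 1]  # => "Penguin"
```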
Other examples of supervised IE systems are SRV [Fre98], WIEN [KWD97], SoftMealy
[HD98], STALKER [MMK99] and DEByE [LRNdS02].
2.2.4 Semi-Supervised Information Extraction
Labeling training data in advance is a labor-intensive process limiting the scope of the IE
system. Instead of requiring labeled data, semi-supervised IE systems extract potentially
interesting data and let the user decide what shall be extracted. In other words, the user
provides feedback to the IE system which is incorporated into the wrapper generation
process.
In the running example, a semi-supervised system might recover title, author and
publisher as extractable data fields from a set of unlabeled book pages. The user
then selects which fields shall be extracted and how to integrate the information, e.g. by
labeling the titles as such in the extraction target tuple.
An example for a semi-supervised system is IEPAD [CL01]. Apart from extraction
target selection, semi-supervised IE systems are very similar to unsupervised IE systems.
2.2.5 Unsupervised Information Extraction
Automatic or unsupervised IE systems extract data from unlabeled training documents.
The core concept behind all unsupervised IE systems is to identify repetitive patterns in
the input data and to extract the data items embodied in the recurrent patterns.
Unsupervised IE systems can be subdivided into page-level extraction systems and
record-level extraction systems. The former extract data from a page-wide template,
while the latter assume multiple data records of the same type are available rendered
by a common template into one page. In case multiple records exist in a single web
page, it might be possible to derive extraction rules from a single web page, assuming
the individual data records can be told apart. The record-level extraction task can be
described as trying to extract various items from a list page (e.g. a product list from a
web shop). In contrast, page-level extraction tasks require multiple pages (e.g. product
detail pages) to discover patterns and learn extraction rules.
Evidently, record-level extraction systems can only operate on documents containing
multiple data records and require means to identify the data regions describing the
individual data records. The latter problem can be tackled with string or tree alignment
techniques. Examples for such systems are DEPTA [ZL05] and NET [LZ05].
Page-level extraction systems can treat the whole input page as a data region from
which the data record shall be extracted. However, multiple pages8 for wrapper induction
need to be fetched in advance. Thus, the problem of collecting training data is
shifted into the DR domain and is rarely addressed by IE researchers. Examples for
page-level extraction systems are RoadRunner [CMM01] and ExAlg [AGM03].
2.2.6 Case Studies
In the following, a selection of well-known IE systems are presented which try to solve
similar problems. One semi-supervised and three unsupervised IE systems are pre-
sented illustrating various techniques and the associated constraints to solve different
IE tasks.
RoadRunner
RoadRunner is one of the early unsupervised Web IE systems, presented in 2001 by
Crescenzi, Mecca and Merialdo [CMM01]. It compares multiple pages and generates
union-free9 regular expressions based on the identified similarities and differences.
RoadRunner initializes the wrapper with a random page of the input set and matches
the remaining pages using an algorithm called ACME matching. The wrapper is gener-alized for every encountered mismatch. Text string mismatches are interpreted as data
fields, tag mismatches are treated as indicators of optional items and iterators. In the
RoadRunner data model, individual data items must be separated by HTML tags but
tags must not occur as part of the data field. Figure 2.6 on the following page shows an
example of a wrapper generated from two input pages.
8At least two training pages are required for page-level wrapper induction. Depending on the IE system
and the template, however, ten or even more training pages may be necessary to successfully derive
extraction rules.
9A union-free regular expression does not contain disjunctions (e.g. (A|B)).
Figure 2.6: Wrapper induction example for RoadRunner [CMM01]
The runtime complexity is exponential in the input string length. Therefore, heuristics
were introduced to limit the exploration space.
ExAlg
Arasu and Garcia-Molina propose an IE system automatically deducing the template
from a set of template-generated pages [AGM03]. ExAlg has a hierarchically structured
data model and supports optional elements and disjunctions. A web page is modeled
as a list of tokens, in which a token might either be an HTML tag or a word from a text
node. ExAlg builds equivalence classes of the tokens found in the input documents.
Based on these sets of tokens, the underlying template is deduced.
Figure 2.7 on the next page shows four example pages where each template-token is
labeled with an index. Tokens with the same occurrence vector across all input docu-
ments build an equivalence class. The idea is that tokens emitted from the same template
constructor will likely occur with the same frequency. Furthermore, ExAlg can detect
tokens with multiple roles, e.g. the token Name in Book Name and Reviewer Name has
different semantics in either occurrence. It differentiates between roles based on the
occurrence-path10 and the spans of valid equivalence classes. For instance, an equivalence
class in the given example is {<b>, Reviewer, Name, Rating, Text, </b>}
with the occurrence vector ⟨1, 2, 1, 0⟩.
10The occurrence-path, as defined by Arasu and Garcia-Molina, has a close resemblance to an XPath query.
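The grouping step can be sketched by keying every token on its occurrence vector (input pages and tokenization invented for illustration):

```ruby
# Group tokens into equivalence classes: the key of each class is the
# vector of per-document occurrence counts shared by its tokens.
def equivalence_classes(docs)
  docs.flatten.uniq.group_by { |tok| docs.map { |d| d.count(tok) } }
end

docs = [
  %w[<html> Book Name Reviews <b> Reviewer Name </b> </html>],
  %w[<html> Book Name Reviews <b> Reviewer Name </b> <b> Reviewer Name </b> </html>]
]

classes = equivalence_classes(docs)
classes[[1, 1]]  # => ["<html>", "Book", "Reviews", "</html>"]
classes[[1, 2]]  # => ["<b>", "Reviewer", "</b>"]
classes[[2, 3]]  # => ["Name"] -- a token playing two roles
```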
ExAlg defines large and frequent equivalence classes (LFEQs) as classes containing
many tokens which occur in a large fraction of the input documents. The LFEQs are
hierarchically structured and the order of the tokens is preserved. The nesting is gov-
erned by the span formed by all tokens in the respective equivalence class. LFEQs are
passed to the analysis stage in which the template is deduced.
Figure 2.7: Input pages in ExAlg [AGM03]
Starting from the root LFEQ (the tokens occurring exactly once in all input documents),
ExAlg searches for non-empty positions between consecutive tokens and generates
type constructors for these locations. Nested LFEQs are recursively visited and
the types are constructed according to the data model. The generated template can then
be used to extract data from input pages. For the given example, the original schema
⟨B_Book, {⟨B_Reviewer, B_Score, B_Text⟩}⟩ can be recovered by analyzing the four input pages.
ExAlg has a sophisticated data model compared to other automatic IE systems.
Moreover, ExAlg operates on the token level, not on the tag level as many other
unsupervised extraction systems do, and thus has the chance of extracting attributes
embedded in text nodes without any markup. The effectiveness of the extraction tends
to improve with the number of input pages. However, experiments indicate that ExAlg
works well even for collections of under ten input documents, given that the occurrences
of the attributes to be extracted exceed the chosen threshold.
IEPAD
IEPAD, a semi-supervised IE system, was presented by Chang and Liu in 2001 [CL01]. It
is capable of extracting homogeneous data records from a set of unlabeled pages. IEPAD
generates wrappers by discovering repetitive patterns using multiple string alignment.
The input document is converted to a binary representation of the data. HTML tags
and text elements are mapped to a set of fixed length binary tokens. A PAT tree, which
is a binary suffix tree, is created from the binary representation. The PAT tree, in turn, is
used to find repetitive patterns by recording occurrence count and reference points for
each recurring pattern. To tolerate inexact matches, the center star algorithm is applied
to obtain generalized extraction patterns.
The candidate patterns and the occurrence metrics are presented to the user. Upon
selection of a pattern, a regular expression is created from the binary representation.
Thus, the wrapper can also operate on web pages without transforming those into the
binary representation.
DEPTA
DEPTA stands for Data Extraction based on Partial Tree Alignment and is an unsupervised
IE system [ZL05]. DEPTA extracts data records from list pages with an algorithm called
MDR, taking advantage of the tree structure of the HTML page. MDR was first pre-
sented by Liu et al. in 2003 [LGZ03].
The design of MDR is based on two observations about data records. The first observation
states that similar objects are likely located in a contiguous region and formatted
with almost identical or at least similar HTML tags. The second observation is that
similar data records are built by sub-trees of a common parent node.
The algorithm first builds the DOM-tree for the web page and stores the bounding
box for each element.11 Adjacent nodes that share the same parent are then compared by
computing the string edit distance of the tag strings. If the estimated similarity exceeds
a predefined threshold, the group of nodes is identified as a data region. To account for
data records that are spread over multiple sibling nodes, the concept of generalized nodes
was introduced. Generalized nodes encompass one or more sibling nodes. Figure 2.8
on the facing page shows an abstracted tag tree where nodes 5, 6 and 8, 9, 10 build two
data regions, as the respective nodes in each region are similar. The combined node-pairs
(14, 15), (16, 17) and (18, 19) are also similar to each other and each pair builds a
generalized node. Data records are derived from generalized nodes. However, there are
cases when such a node does not represent a single data record. DEPTA handles some
special cases to deal with these discontinuities in data records.
Finally, data fields are extracted from the alleged data records. After all tag-trees
belonging to the data record are assembled in a new tree, partial tree alignment is
performed to induce the structure of the data. The idea is to match the fields from all
data records to build a generalized representation of the data record.
11The visual information for each tag is supplied by a web browser.
Figure 2.8: Generalized nodes and data regions in DEPTA [ZL05]
MDR can handle non-contiguous data records and is capable of extracting data
records that span multiple sibling nodes. The assumption is made that HTML tags
are generated by the template and text nodes belong to the data to be extracted. Visual
cues are consulted to distinguish individual data records. However, the extraction is lim-
ited to flat data records. Support for nested data records (e.g. two data records sharing
data items from a common parent data record) was added in a successor system called
NET [LZ05]. In the latter system, a post-order traversal of the tag tree is performed to
identify data records at different levels. NET uses simple tree matching to compute the
tree similarity and aligns the trees whose similarity is above a chosen threshold.
2.2.7 Summary
This section introduced Web IE concepts and techniques and presented a few inter-
esting automatic IE systems from the literature. An information system consisting of
a document retrieval and an information extraction component is able to identify relevant
Web pages and extract salient data from the respective pages. However, to embed
the obtained information into an existing knowledge base, information integration
techniques are required.
2.3 INFORMATION INTEGRATION
"It is a very sad thing that nowadays there is so little useless information." (Oscar
Wilde)
After retrieving and extracting information from heterogeneous sources, the obtained
data needs to be related to existing data. The inherent challenges of information integra-
tion (II) originate in the structural and semantic heterogeneity of the various information
sources. Data can be laid out and stored in different ways depending on the chosen data
model, leading to structural heterogeneity. Semantic heterogeneity is concerned with the
content and meaning of the data.
Wache et al. state the problem of information integration and semantic interoperability
as follows [WVV+01]:

"In order to achieve semantic interoperability in a heterogeneous information system,
the meaning of the information that is interchanged has to be understood across
the systems. Semantic conflicts occur whenever two contexts do not use the same
interpretation of the information."
According to Pollock and Hodgson, semantic conflicts can be classified as naming
conflicts, scaling and unit conflicts, confounding conflicts or domain conflicts. Naming
conflicts occur in the presence of synonyms and homonyms, i.e. when multiple names
exist for the same entity or the same name denotes different entities. Different units and
currencies lead to scaling conflicts. Metrics may either be explicitly encoded in the data
or implicitly assumed. Confounding conflicts arise when a same-named entity is defined
differently by the various information providers. Finally, domain conflicts occur when
data is modeled with distinct domain-specific intentions, resulting in overlapping or
disjoint concepts [PH04].
Information integration can be approached with ontology-mapping techniques.
Ontologies are well suited to model hidden and implicit knowledge for different
domains. Wache et al. give a concise overview of ontology-based information inte-
gration techniques [WVV+01].
2.4 LEGAL CONSIDERATIONS
Retrieving, extracting and integrating information published by a third party may have
legal implications. The terms of service of the respective sites apply, which may prohibit
web scraping of their content. Although a few precedents exist, this is a grey area of
law, and rulings have differed depending on the jurisdiction and the case. Adhering to
the terms of use of a web site that is only visited by the IR/IE system is not realizable
unless the terms can be retrieved and understood by the crawler.
Legal advice should be sought before employing web scraping in a public or commercial
software system.
2.5 FEDSEEKO
Fedseeko is a federated search engine whose goal is to facilitate obtaining product
information from the Internet [WSS09]. It uses adapters to access diverse product
information providers such as online shopping malls, producer sites and third-party information
portals like forums or blogs. The information sources are accessed via web services if
such a possibility exists. For instance, the Amazon Product Advertising API12 provides
extensive vendor information through a web service. In case no such interface exists,
the information may be extracted using web scraping techniques. Figure 2.9 depicts the
12http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/
architecture of Fedseeko and its internal and external interfaces. The reference implementation
is based on Ruby on Rails.
Figure 2.9: Fedseeko architecture [WSS09]
2.5.1 Producer Information Integration
In the following section, some important aspects of the original producer information
extraction implementation will be outlined. As a first step, the manufacturer URL for a
given product is retrieved by a web search query. The first hit of a web search restricted
to the .com domain is considered to be the producer site and is the basis of downstream
product page searches. The product page is located via a phrase search on the
suspected producer site.
Fedseeko uses XPath queries to address the individual nodes associated with a prod-
uct attribute. The mining of XPath queries requires guidance: an example key/value
pair needs to be supplied, which is used to locate the proper product URL. Starting
from the suspected product page, the linked pages are walked and page contents are
matched via a similarity check against the key/value phrases. The search stops once a
page with the requested resemblance is found. Once a matching product page is found,
a Scrubyt13 extractor computes the XPath queries for the key, the value and the base
query, respectively. The identified XPath queries are associated with the producer, implying
a single producer-wide template.
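The idea of addressing attribute nodes with template-specific XPath queries can be sketched with Ruby's bundled REXML library. The markup and the queries below are hypothetical; Fedseeko itself relies on scRUBYt! extractors rather than plain REXML:

```ruby
require "rexml/document"

# Hypothetical product page fragment following a fixed template.
html = <<~HTML
  <table>
    <tr><th>Weight</th><td>42 kg</td></tr>
    <tr><th>Total Pixels</th><td>12.9 million</td></tr>
  </table>
HTML

doc = REXML::Document.new(html)

# One query pair per template: keys live in <th> cells, values in <td> cells.
keys   = REXML::XPath.match(doc, "//tr/th").map(&:text)
values = REXML::XPath.match(doc, "//tr/td").map(&:text)

attributes = keys.zip(values)
attributes.each { |key, value| puts "#{key}: #{value}" }
```

As long as a producer keeps the same template, such a stored query pair keeps yielding the correct attribute tuples for new product pages.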
Fedseeko uses mapping ontologies to relate producer information to available infor-
mation of similar products by other manufacturers.
The shortcomings of the existing producer information solution are first and fore-
most the required amount of user supervision. Supplying samples for each attribute
and producer is a labor-intensive process, especially considering the large variety of
13 scRUBYt! is a Ruby library designed to facilitate web scraping tasks. http://scrubyt.org
producers and the number of attributes associated with some products14. Furthermore,
the limitation of one template per producer is an oversimplification. Large
producers with a diverse product range may use slightly different templates for dif-
ferent product categories. A new approach towards producer information retrieval and
extraction, aiming to overcome the deficiencies of the existing implementation, will be
presented in this thesis.
2.6 SUMMARY
An overview was given covering the research areas information retrieval, information
extraction and information integration. The brief treatment of IR focused on effectiveness
metrics, while an in-depth introduction to Web IE was provided. Important IE tech-
niques have been presented and exemplary IE systems have been examined. Some of
the methods and techniques will be reused and referenced in the subsequent chapters.
II was covered only briefly for the sake of completeness but is otherwise outside the scope of
this thesis.15
Finally, the federated search engine Fedseeko has been introduced and its producer
information integration component was evaluated. During the course of this thesis, a
replacement of this component will be developed.
14 For instance, in the domain of digital cameras, more than one hundred attributes may be listed per product.
15 Related work conducted contemporaneously revamps the ontology mapping in Fedseeko.
3 REQUIREMENTS
The goal of the revised information extraction component is to minimize the effort as
well as the cost of obtaining and providing first-hand product information. Upon a
query for a certain product, the system shall extract all available product attributes from
the manufacturer's web site without requiring guidance or supervision. In contrast
to the existing IE system, web sites based on not yet encountered templates shall be
analyzed automatically, and extraction rules shall be inferred and stored for future requests.
A change of a known template requiring different extraction rules should be detected
and acted upon.
In this chapter, the information flow of the retrieval and extraction system is analyzed.
A functional and behavioral description is given. Finally, the validation criteria
for the software system will be briefly covered.
3.1 INFORMATION DESCRIPTION
In a nutshell, the information extraction system shall locate product pages on the Inter-
net and extract product attributes without any mandatory user interaction. As depicted
in figure 3.1 on the following page, the only input to the software system is a prod-
uct descriptor. This product descriptor or identifier may be manually entered or may
originate from vendor databases or other sources listing products.
The input is a tuple comprising a manufacturer name and a product identifier. The latter
can be decomposed into a list of tokens, where the tokens describe a specific product.
Based on this information, the manufacturer's product page is to be retrieved. An
example input is Apple Inc. and MacBook Pro.
The output is an ordered set of attribute tuples extracted from the product page asso-
ciated with a product. Each attribute tuple consists of a key and a value character string,
e.g. "Weight", "42 kg".
Figure 3.1: Overview of information flow
The extracted attributes may be saved in a database, be passed to a downstream
processor or be presented directly to the user. A product detail view in
Fedseeko presents the producer information alongside other related data like product
reviews. Furthermore, the extracted data is passed to an information integration system
performing ontology mapping. The latter task is carried out by a separate system which
will not be discussed herein.
The attributes to be extracted originate from product detail pages residing on the
respective manufacturer sites. Empirical observations regarding these pages will be pre-
sented in the next section.
3.1.1 Product Pages
The IE engine shall be able to extract product attributes from a large number of hetero-
geneous manufacturer pages. The following empirical observations describe character-
istics of typical product pages.
1. A product page with sufficient information often describes only a single product
but may contain data for different product variants.
2. A manufacturer may use more than one template for different product categories
or families.
3. There might be very few pages available with a common template.
4. Multiple description pages with different templates might exist for the same
product, e.g. a summary and a specification page.
These characteristics do not apply to all product domains. Throughout this work,
the focus is on those kinds of products for which a human operator could
easily tell product features apart by looking at the product page. Figure 3.2 on the
next page shows a product page of a Nikon digital camera, from which attributes like
"Total Pixels", "12.9 million" shall be extracted.
Figure 3.2: Product page example with the extraction targets being highlighted
3.2 FUNCTIONAL DESCRIPTION
The complete product information retrieval system can be decomposed into two major
components. One component is responsible for the identification of the manufacturer
site as well as the proper product page. The other component's task is to extract product
attributes from the aforementioned product page.
The document retrieval component locates and fetches the product page from the
manufacturer's web site. If multiple pages exist for a single product, the page with the
most syntactically structured content should be picked. For example, a specifications
page is better suited for Web IE than a free-text summary page.
The information extraction component extracts attribute tuples from a product page
of a specific template. Its job is to filter out irrelevant data and identify the useful bits
of information in a given document. Either new rules are derived for identifying the
extraction targets, or already stored rules are used to extract data from a page created
from a previously encountered template. Extraction from a page based upon a known
template is an on-line operation1. Therefore, it should deliver results within the time
frame given for the overall Fedseeko query to complete. In other words, if a query for
a Fedseeko product detail page should respond within fifteen seconds, the extraction's
execution time should not exceed this bound in the average case.
1 It shall be performed while the user of the system waits for a response to his request.
As it might not be possible to select the proper wrapper object to extract data from a
given document, a wrapper shall be able to detect ineligible input pages. In effect, the
wrapper verification problem must be solved inside the wrapper object.
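A wrapper that verifies its own applicability can be sketched as follows. This is an illustrative design assuming one key query and one value query per template, not the thesis implementation:

```ruby
require "rexml/document"

# A wrapper stores the XPath queries learned for one template and
# rejects pages on which those queries no longer yield anything.
class Wrapper
  def initialize(key_query, value_query)
    @key_query, @value_query = key_query, value_query
  end

  # Wrapper verification: the page is eligible only if the stored
  # queries still match at least one key/value node each.
  def eligible?(doc)
    !REXML::XPath.match(doc, @key_query).empty? &&
      !REXML::XPath.match(doc, @value_query).empty?
  end

  def extract(doc)
    return nil unless eligible?(doc)
    keys   = REXML::XPath.match(doc, @key_query).map(&:text)
    values = REXML::XPath.match(doc, @value_query).map(&:text)
    keys.zip(values)
  end
end

wrapper = Wrapper.new("//tr/th", "//tr/td")
page  = REXML::Document.new("<table><tr><th>Weight</th><td>42 kg</td></tr></table>")
other = REXML::Document.new("<p>no specification table here</p>")

p wrapper.extract(page)    # the template matches, tuples are extracted
p wrapper.extract(other)   # ineligible input page, nil is returned
```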
The wrapper induction component creates extraction rules for one or more pages sharing
a specific template. Wrapper induction only needs to be executed if a new template is
discovered or a known template has changed. Thus, the operation may be performed
off-line on a best-effort basis.
3.3 BEHAVIORAL DESCRIPTION
Most of the system's operations are invisible to the user. Upon requesting detailed
information for a given product, the system will retrieve the product page and extract
all product attributes from that page. No user input is required.
However, the system may not be able to retrieve the proper product page, may fail
to extract any information or may select bogus data. For these cases, the user may intervene
after the retrieval and extraction steps have been executed. The user shall be given
means to correct the estimated product page URL. Furthermore, extracted data may be
discarded, whereupon the extraction can be restarted. Should the automatic extraction
fail to deliver meaningful data, the user may provide hints to facilitate the extraction
process.
3.4 VALIDATION CRITERIA
The software system is evaluated according to a gold standard2. A control group of
one hundred products from twenty different domains is used to validate the proper
operation of the system as well as to measure the effectiveness of the retrieval and
extraction components. In order to spot the cause of extraction failures, the subsystems
are examined individually.
The automatic extraction of attributes shall work reliably in the majority of the test
cases. With additional information, it ought to be possible to successfully extract the
proper data from four out of five documents.
For each test product, the proper product URL is gathered manually and a reference
attribute is recorded. This manually gathered data is matched with the automatically
computed data during evaluation.
The document retrieval subsystem either succeeds or fails to locate a product page
suitable for information extraction. Therefore, the precision metric follows the
probabilistic interpretation and states the probability that the returned document is
relevant.
2 Wikipedia defines a gold standard test as a "diagnostic test or benchmark that is regarded as definitive" [Wik09]. Test results are interpreted in a way that no false-positive or false-negative results are included.
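Under this interpretation, the retrieval precision reduces to the fraction of test products for which the single returned URL matches the manually gathered reference URL. A sketch with made-up evaluation data:

```ruby
# Gold-standard evaluation sketch (illustrative products and URLs):
# the DR component returns one URL per product; precision(1) is the
# fraction of products for which that URL matches the reference URL.
gold     = { "Nikon D90"   => "/products/d90.html",
             "MacBook Pro" => "/macbookpro/specs.html" }
returned = { "Nikon D90"   => "/products/d90.html",
             "MacBook Pro" => "/macbookpro/index.html" }

hits = gold.count { |product, url| returned[product] == url }
precision_at_1 = hits.to_f / gold.size
puts precision_at_1   # 0.5
```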
3.5 SUMMARY
This chapter stated the goal of the software system and analyzed the requirements
from various perspectives. Based on the given problem analysis, a software system will
be developed. Its design, implementation and evaluation will be presented throughout
the subsequent chapters.
4 DESIGN
The system design is outlined in this chapter. A description of each component required
to solve the problem is provided as a processing narrative and in the context of the
architectural design.
4.1 DATA DESIGN
The input and output data is depicted in figure 4.1. The key components have been
identified as the product page locator, responsible for DR, and the components revolving
around the wrapper logic, responsible for Web IE. Both components and their design
constraints are described in this section.
[Diagram labels: Product ID, Product Page Locator, Manufacturer Web Site, Product Page, Wrapper Induction, Wrapper Database, Wrapper, Attributes]
Figure 4.1: Information flow during extraction
4.1.1 Retrieving Product Pages
The DR component must supply the downstream IE processor with a genuine product
page. In contrast to the more common DR systems in which a large set of documents
is returned, selecting the proper product page is a binary choice: either the right prod-
uct page is identified, or the IE component won't be able to extract relevant data. In
effect, the goal of the document retrieval subsystem is to optimize the precision for
the top-ranked candidate (i.e., according to the terminology introduced in section 2.1.2,
precision(1) shall be maximized).
In a full-fledged product page retrieval system, all manufacturer sites would have
to be indexed in advance in order to allow the retrieval of subordinate product pages.
However, this work puts the focus on the information extraction task, and only limited
resources are available. Hence, it was chosen not to build a dedicated document index
for product page retrieval from the World Wide Web. Instead, the results of existing
web search services are used and combined to pick the product page. The results of
multiple web search engines such as Google Search, MSN Search and Yahoo! Search
shall be aggregated to obtain maximum coverage of the World Wide Web and to benefit
from the well-established ranking algorithms used in the respective services.
Product page retrieval is laid out as a two-step process. In a first step, the producer
site is located and, in a second step, the product page is searched on the producer
site. In this manner, first-hand product information is not intermixed with third-party
information like web shop offers or product reviews. In case the proper producer site
cannot be located in the first step, the DR component should fall back to another
candidate. This is done if the product was not featured on the site.
Product Page Ranking
During product page retrieval on the producer site, the DR subsystem tries to pick the
proper page from the top-ranked set of candidates of multiple web search engines. Not
just using the single top-ranked candidate improves the chance that a relevant document
is among the set of retrieved documents. The rankings of the individual search engines
are combined using Borda ranking, known from social choice theory. In Borda ranking,
named after Jean-Charles de Borda who proposed it as an election method in 1770,
every voter announces an ordered list of preferred candidates. If there are n candidates,
the top-ranked candidate of each voter receives n points and each lower-ranked candi-
date receives a decremented score. Borda ranking and other search result combination
methods are discussed in Web Data Mining by Bing Liu [Liu06].
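The scheme just described can be sketched in a few lines of Ruby (illustrative URLs; candidates not ranked by an engine score zero from that engine, matching the tally shown in table 4.1):

```ruby
# Combine the ordered result lists of several search engines with Borda
# ranking: with n distinct candidates, a first place is worth n points,
# a second place n - 1, and so on; unranked candidates receive 0.
def borda_rank(rankings)
  candidates = rankings.flatten.uniq
  n = candidates.size
  scores = Hash.new(0)
  rankings.each do |ranking|
    ranking.each_with_index { |url, i| scores[url] += n - i }
  end
  candidates.sort_by { |url| -scores[url] }.map { |url| [url, scores[url]] }
end

engine_a = ["/news/shiny", "/products/detail", "/products/index"]
engine_b = ["/products/index", "/forum/42", "/products/detail"]

p borda_rank([engine_a, engine_b])
# [["/products/index", 6], ["/products/detail", 5], ["/news/shiny", 4], ["/forum/42", 3]]
```

Note how the relevant detail page can still be outscored by an index page, which is why additional ranking heuristics are needed on top of the combined score.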
Table 4.1 shows the search results of an artificial query. As indicated in the example,
a combined ranking may not suffice to select the proper document from a set of
candidates. Therefore, additional metrics are incorporated to refine the original ranking.
Figure 4.2 on the next page gives an overview of the approaches used to process the
candidate list. Some techniques try to identify a page that contains specification
information, and other methods scan for references to the searched product. The scores of
Table 4.1: Top four search results of two web search engines

Document                                 Relevant?   Rank A   Rank B   Borda Rank
/news/november/the_new_shiny_product     no          1        -        4 + 0 = 4
/products/detail.html?category=6&id=17   yes         2        4        3 + 1 = 4
/products/index.html?category=6          no          3        1        2 + 4 = 6
/forum/show.html?post=42                 no          4        3        1 + 2 = 3
/reviews/produ