
Extracting Content from Online News Sites

Sigrid Lindholm

January 31, 2011

Master's Thesis in Computing Science, 30 ECTS credits
Supervisor at CS-UmU: Johanna Högberg
Examiner: Per Lindström

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN


Abstract

Society is producing more and more data with every year. The number of unique URLs indexed by Google recently surpassed the one-trillion mark. To fully benefit from this surge in data, we need efficient algorithms for searching and extracting information. A popular approach is to use the so-called vector space model (VSM), which organises documents according to the terms that they contain. This thesis contributes to an investigation of how adding syntactical information to the VSM affects search results. The thesis focuses on techniques for content extraction from online news sources, and describes the implementation and evaluation of a selection of these techniques. The extracted data is used to obtain test data for search evaluation. The implementation is generic and thus easily adapted to new data sources, and although the implementation lacks precision, its performance is sufficient for evaluating the syntax-based version of the VSM.


Chapter 1

Preface

CodeMill is a Umeå-based IT company that offers system development and resource consulting. The company also has a Research & Development division (R&D) that collaborates with academia to turn scientific results into commercial products. A prioritized project at R&D is the implementation of a syntax-based search engine for information retrieval. The engine is tailored for companies with a constant need for updated information, which is generally the case within the financial and ICT (Information and Communications Technology) sectors. Concrete applications are monitoring media coverage of a company's affairs, or keeping track of an entire field of business.

The search engine consists of two components: a frontend that extracts and parses online news articles, and a backend for indexing and searching the gathered documents. Information is transferred from the frontend to the backend through a shared database. The author of this thesis contributes with an implementation of the frontend written in Java, and with a survey and evaluation of current techniques for article extraction. The backend is described by Thomas Knutsson in [12].


Contents

1 Preface

2 Introduction and Problem Description
  2.1 The Vector Space Model
    2.1.1 Latent Semantic Analysis
    2.1.2 Random Indexing
    2.1.3 Improvements
  2.2 Parse Trees
  2.3 Motivation for Improvements
  2.4 Web Extraction
  2.5 Requirements
  2.6 Thesis Outline

3 Overview of Previous Work
  3.1 Wrapper Techniques in General
  3.2 Extracting Articles From News Sites
    3.2.1 Tree Edit Distance (2004)
    3.2.2 The Curious Negotiator (2006)
    3.2.3 Tag Sequence and Tree Based Hybrid Method (2006)
    3.2.4 Linguistic and Structural Features (2007)
    3.2.5 Visual and Perceptual
    3.2.6 A Generic Approach (2008)

4 Approach
  4.1 Web Content Syndication
  4.2 Extraction
  4.3 Parsing
  4.4 Database
  4.5 System Overview

5 Results

6 Conclusion
  6.1 Limitations
  6.2 Future work

7 Acknowledgements

References

List of Figures

2.1 A vector space indexed by terms
2.2 Possible parses of a pair of English sentences
3.1 Degree of structure of different types of documents
3.2 Mapping
3.3 Top down mapping
3.4 Restricted top down mapping
3.5 Node extraction patterns
4.1 DOM tree with path prefixes
4.2 Left and right numbering in the Nested Set Model
4.3 Outline of the complete information-retrieval system
5.1 A sample article from the Wall Street Journal
5.2 The continuation of the article in Figure 5.1
5.3 A sample article from USA Today
5.4 The continuation of the article in Figure 5.3

List of Tables

4.1 Corresponding html to Figure 4.1
4.2 Sample output from parser
5.1 Comparison of automatically and manually extracted text from a Wall Street Journal article
5.2 Comparison of automatically and manually extracted text from a USA Today article


Chapter 2

Introduction and Problem Description

Search algorithms are an important field of computer science, and will remain so in the years to come. There is more and more data available, and increasingly many people produce, analyse and manage information, both professionally and otherwise. Information and knowledge is in itself becoming a resource, and the need for good search methods is increasing.

In an attempt to evaluate a possible improvement to existing search methods, a small system will be built. The system will consist of a frontend, which extracts articles from websites and stores these articles in a database, and a backend, which indexes the data inserted into the database and provides an interface for users to make queries on the data.

Given a word or phrase, the indexed documents/articles with the greatest relevance should be returned. By relevance one does not mean the documents with the greatest lexical similarity, but rather those which are semantically similar. There are several motivations behind the use of semantic indexing, chief among them synonyms and homographs.

Synonyms are a great obstacle in lexical search. The descriptions that users give documents tend to vary to a large degree, and only documents which exactly match the search terms entered by a user are returned in a lexical search. Any documents which are of interest, but do not contain an exact match, are left out, giving a large number of false negatives. A further source of trouble is words with more than one meaning, homographs, which give rise to false positives. When given the word "desert", a lexical search algorithm will return pages relating to dry geographic areas as well as yummy recipes. [6]

2.1 The Vector Space Model

A common approach in semantic indexing is to use a vector space model. The base is a co-occurrence matrix, a sparse matrix where the rows represent unique words and the columns represent either words (a word-by-word matrix) or, as is more common, longer pieces of text such as documents (a words-by-document matrix). Each entry in a words-by-document matrix is the frequency of occurrence, the number of times that a certain word occurs in a document. [18]

Figure 2.1: A vector space indexed by terms

From the words-by-document matrix, each word (or term) is mapped onto an axis in an n-dimensional coordinate system, where n is the number of words, i.e. the number of rows in the co-occurrence matrix. Vectors are then created, with coordinates corresponding to the entries in the matrix. If a document contains the word w with an occurrence frequency of a, the value of the vector coordinate corresponding to w's axis is a, otherwise 0. This results in a vector for each document. A tiny example with only three terms is depicted in Figure 2.1, where each vector represents a document, and the coordinates of the vectors are given by the frequencies of the terms in the document.

All vector space models are based on the distributional hypothesis:

    Words with similar distributional properties have similar meanings.

Words with similar meaning do not necessarily occur with each other, but they do tend to occur with the same other words [19]. It is possible to find documents similar to a specific document by looking at the vector for the document and then calculating which vectors are close to it. [20]

It is possible to make queries using this model by inserting a vector into the vector space, created in the same fashion as the document vectors (with entries corresponding to the frequencies of the words in the query). Then one calculates which vectors are the query vector's closest neighbours; these vectors correspond to documents which are likely to be good matches to the query.
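To make the mechanics concrete, the following is a minimal Java sketch of this basic idea: documents and queries are turned into term-frequency vectors and compared with cosine similarity. It is illustrative only and not the implementation described in this thesis; the tokenisation, class names and example documents are invented.

import java.util.*;

// Minimal sketch of the basic vector space model: a text is reduced to a
// term-frequency vector, and similarity between two vectors is the cosine of
// the angle between them. Illustrative helper code, not the thesis system.
public class VsmSketch {

    // Term-frequency vector for one piece of text.
    static Map<String, Integer> vector(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) v.merge(w, 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int f : b.values()) nb += f * f;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> documents = List.of(
                "the desert is a dry area with very little rain",
                "a recipe for a chocolate dessert",
                "rain is rare in a dry desert climate");
        Map<String, Integer> query = vector("dry desert");
        for (String d : documents) {
            System.out.printf("%.3f  %s%n", cosine(query, vector(d)), d);
        }
    }
}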


2.1.1 Latent Semantic Analysis

The above description is a simplified one. The vector space for any real-life problem is very large, and a number of optimisation and reduction techniques are employed to make this model useful.

LSA (Latent Semantic Analysis) is one method that is based on the vector space model. Here the co-occurrence matrix is analysed using a technique called SVD (Singular Value Decomposition). Any terms or documents whose contribution is very small are removed from the matrix. The resulting vector space is a good approximation of the "real" one, and hopefully easier to compute.

2.1.2 Random Indexing

As in LSA, a vector space is constructed and the clustering of vectors indicates similarity, but the method employed is quite different.

Each word is represented by an index vector of limited length, randomly filled with 0, −1 and 1. A context vector is constructed for each document by adding the index vectors of all words occurring in the document. To make a query vector using Random Indexing, simply add the index vectors corresponding to the words in the query. Then find context vectors which are close to it in the vector space. [18]
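A sketch of how such index and context vectors could be built is given below. The dimensionality, sparsity and random seed are invented illustrative values, not values taken from the thesis or from [18].

import java.util.*;

// Sketch of Random Indexing: every word gets a fixed-length, sparse random
// index vector with a few +1/-1 entries; a document's context vector is the
// sum of the index vectors of its words, and a query vector is built the same
// way. The constants below are illustrative assumptions.
public class RandomIndexingSketch {
    static final int DIM = 300;       // length of the index vectors (assumption)
    static final int NON_ZERO = 10;   // roughly how many +/-1 entries per vector (assumption)
    static final Random RNG = new Random(42);
    static final Map<String, int[]> INDEX_VECTORS = new HashMap<>();

    // Index vector for a word: mostly zeros, a few randomly placed +1/-1.
    static int[] indexVector(String word) {
        return INDEX_VECTORS.computeIfAbsent(word, w -> {
            int[] v = new int[DIM];
            for (int i = 0; i < NON_ZERO; i++) {
                v[RNG.nextInt(DIM)] = RNG.nextBoolean() ? 1 : -1;
            }
            return v;
        });
    }

    // Context vector: the sum of the index vectors of all words in the text.
    static int[] contextVector(String text) {
        int[] ctx = new int[DIM];
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            int[] iv = indexVector(w);
            for (int i = 0; i < DIM; i++) ctx[i] += iv[i];
        }
        return ctx;
    }

    static double cosine(int[] a, int[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        int[] document = contextVector("the desert is a dry area with very little rain");
        int[] query = contextVector("dry desert");
        System.out.printf("similarity: %.3f%n", cosine(query, document));
    }
}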

2.1.3 Improvements

By using parse trees instead of words, and collections of parse trees or annotated texts instead of documents, in the above-mentioned methods, LSA and Random Indexing, some increase in search performance is expected.

2.2 Parse Trees

As described in [10], a language L consists of a set of strings, and can be described by a context-free grammar (CFG, or simply grammar). A grammar G = (V, T, P, S) has the following components:

T Terminals, the symbols that the strings of L are built from.

V Variables/nonterminals, each representing a language (a set of strings).

S Start symbol.

P Rules/Productions, recursive rules which define the sentences of a language.

Some small examples of parse trees are shown in Figure 2.2. A bit more formally, a parse tree (also known as a concrete syntax tree) is a structural representation of a valid string in a language L. All interior nodes of a parse tree are nonterminals from the grammar G; its leaves are either nonterminals, terminals, or, if the leaf is an only child, ε (the empty string) from G. Also, the children of any interior node must follow one of the productions in P: if the node's children are named X1, X2, ..., Xk (left to right), there must be a production in P such that A → X1 X2 ··· Xk, where A is the interior node.

Figure 2.2: Possible parses of a pair of English sentences

2.3 Motivation for Improvements

There are several motivations given in [9] for why parse trees might provide better results.

It is possible to draw conclusions about a term's semantics depending upon its location in the syntactical structure which is inherent in parse trees. Furthermore, a term's location in the structure may also provide information about how important a word is, providing a means to remove less important words, called syntactic filtering.

2.4 Web Extraction

This thesis covers only the frontend of this project, which is concerned with the extraction of articles from news sites. Extraction is a technique to find relevant parts of a document and store these in a structured way, connecting data values with their semantics. Here, the aim is to produce a sufficient amount of data for the indexing and data analysis of the backend. News articles were chosen as an appropriate data type due to the Penn Treebank, a corpus (a structured, large set of texts) of parse trees originating from articles in the Wall Street Journal (www.wsj.com). The Penn Treebank is a parsed corpus, where each sentence has been annotated as a parse tree. It is well known and often used in the research community, making comparisons of this work to previous ones easier and more credible.

The main target of extraction is therefore the Wall Street Journal, but preferably extraction should also work on other U.S. news sites; ideally it should be as generic as possible. (The restriction on the origin of the news is due to the fact that the Penn Treebank, of course, is in American English.)

2.5 Requirements

So, briefly, the task is to build a system which extracts articles, parses these and inserts them into a database. More specifically the requirements are the following:

– Fetch and extract articles, primarily from the Wall Street Journal, secondarily from other U.S. websites. Extraction should be as accurate as possible.

– Include the Charniak natural language parser in the project, and build an interface to it.

– Find a suitable database structure.

– Insert parsed trees into the database, while maintaining the syntactical structure.

– After each stage is completed, testing should be performed if appropriate.

– If time is available: research to find another, better natural language parser to generate parse trees from the documents found, and research the possibility of using web crawlers and which difficulties might arise from the use of these. If possible the implementation will be extended with the results of this research.

There were some technical requirements from the company which commissioned the work; most importantly, all programming must be done in Java, using the Eclipse IDE and SVN. It was also required that RSS be used to find pages to extract articles from.

2.6 Thesis Outline

The outline of this thesis is as follows: Chapter 2 describes the problem at hand, first the project as a whole, and then narrowing in on the frontend. A survey of web extraction methods in general, but with focus on the field of extracting articles from news sites, is given in Chapter 3. The approach used in this project is presented in Chapter 4, followed by the results of this labour in Chapter 5. The last chapter contains concluding remarks and gives pointers for future work.


Chapter 3

Overview of Previous Work

Web extraction is the retrieval of relevant parts of a document and the storing of these in a structured way, connecting data values with their semantics [11]. Websites tend to be dynamic and heterogeneous, which makes the task non-trivial. The most common approach to web extraction (henceforth: extraction) is the use of a wrapper, a small program which applies a set of scripts or rules to identify relevant data and output it in some structured fashion. Initially this was all done manually, but considerable research has been done, mainly within machine learning, to create semi- or fully automated induction (generation) of wrappers. Pages from the same site tend to share the same underlying structure, or template, whose content is often filled in from the backend database. Wrapper induction (WI) is about finding this hidden template. Each site has its own template and therefore requires a wrapper of its own.

There are several advantages of using WI:

1. If the underlying structure of a web page is changed, generating a new wrapper is easier than writing a new set of rules.

2. Creating rules manually demands domain knowledge. WI lessens the demands on the end user, who does not need the same computer and programming experience.

3. Sites have different underlying structures and must be processed differently. Creating wrappers manually is time consuming and therefore rarely feasible for larger data sets. Maintaining wrappers over time is particularly cumbersome.

Many recent tools are either fully or semi-automated. Typically, manual and semi-automated tools require user interaction, often through a GUI, where the user selects areas of interest and the tool attempts to learn which areas the user is interested in. Naturally, this is not a feasible method when a large number of sites are involved. Fully automated tools also require some kind of user interaction; this, however, rarely takes place in the extraction phase, but rather in the post-processing phase, during data labelling.


3.1 Wrapper Techniques in General

Various techniques have been used to extract data or generate wrappers, and different ways to classify wrappers have been presented. Among these is [3], which is mainly concerned with the degree of supervision from the user (not to be confused with supervised learning from AI). Supervision may be through different inputs, such as training sets, or through a GUI where the user chooses which parts of a page he or she finds interesting. A semi-supervised system requires less specific examples in its training set than a supervised system; alternatively it only needs some user response after extraction has taken place to evaluate the result. Unsupervised systems need very little interaction with the user during the extraction process, although some post-extraction filtering or labelling of data is often needed.

Laender et al. [13] instead divide wrapper generation techniques into the following six categories (all references to the degree of supervision in the comparison below are from [3]):

Languages for Wrapper Development. This is one of the first methods of wrapper generation. Special-purpose languages were proposed in addition to existing programming languages such as Perl or Java. These were the earliest attempts at web extraction systems, and were more or less rule based and required that users had considerable knowledge about computers and programming [3]. However, these languages, which include Minerva, TSIMMIS and Web-OQL, did not make a large lasting impact in the field. They were often based on simple rules in different formats. TSIMMIS is one of the very first web wrappers. Input was a long list of commands, stating where the data of interest was to be placed in a page, and which variables to instantiate with the data [3]. Minerva used grammar-based production rules, written in EBNF, that expressed a pattern to help locate the target data [3].

HTML-aware Tools. By examining the structure of html documents, these tools transform documents into trees, retaining the structure of the html document. The trees are then processed to generate wrappers. W4F, XWrap and RoadRunner are examples of tools from this category. W4F and XWrap are both considered by [3] to be constructed manually. RoadRunner on the other hand is far newer and one of few unsupervised systems. It attempts to find similar structures at the html tag level in several different pages and works best on data-dense sites (which tend to be dense in structure as well) [11].

NLP-based Tools. These tools include RAPIER, SRV and WHISK. They use natural language processing (NLP), e.g. filtering or lexical semantic tagging, to identify relevant parts of a document, and build rules on how to extract these parts. These tools only work for documents containing mainly plain text, such as different kinds of ads (job, rent, dating etc.). RAPIER and WHISK are both supervised WI systems.

Wrapper Induction Tools. These tools use the structure and formatting of documents. A number of example training sets must be supplied by the user, and are used to induce extraction rules. Systems belonging to this classification are WIEN, SoftMealy and STALKER, and they are all seen as supervised WI systems due to the large amount of input (training sets) required.

Modelling-based Tools. Given a structure of interest, formatted using a simple primitive such as a tuple or a list, a modelling-based tool will try to find the same structure in the document of interest. NoDoSE and DEByE are both user-supervised, modelling-based systems that provide a GUI where the user selects which regions are interesting.

Ontology-based. In contrast with all tools previously described, an ontology-based tool does not use the underlying structure/presentation, but instead examines the data. This is only applicable to specific domains and the ontology often must be constructed by a human well acquainted with the domain. Being a very abstract concept, this has not been researched much. Ontology-based tools are hard to create, demanding a very high level of expertise, and they are not very generic due to the domain constraint. However, they are easily adapted to pages within the same domain, and formatting changes will not affect their performance. Brigham Young University has developed the most mature tool in this field.

Some newer tools which are not included in [13] are for instance IEPAD, OLERA and Thresher (semi-supervised), and DeLa, EXALG and DEPTA (unsupervised). A semi-supervised system, as previously mentioned, requires less exact examples in the training set than a supervised system, or it only needs some user response after extraction has taken place to evaluate the result. IEPAD is able to find, given only a few examples, other similar repetitive patterns, provided these patterns are large enough; it is not able to find any single instances of target data. [3]

Unsupervised systems, such as RoadRunner or DEPTA, require very little interaction with the user before extraction; no training sets are needed. However, some post-extraction filtering or labelling of extracted data is often needed. Unfortunately they have the drawback of only working on pages which are data dense, since they require a certain amount of input to correctly identify patterns. [3]

One of the tools which has attracted the most attention recently is a supervised system called LiXto. A GUI is provided where the user interactively selects regions of interest from an integrated web browser (Mozilla), without dealing with the underlying html. The system stores the path in the html tree of the selected region, and notes its pattern. LiXto is then able to find similar patterns using a programming language called Elog, which is related to Datalog. The system is quite friendly to end users: neither Elog nor html is ever presented to the user, therefore no knowledge about these is needed.

Figure 3.1: Degree of structure of different types of documents (ranging from unstructured free text such as news articles and newsgroup postings, through semi-structured hand-written and cgi-generated html, to structured XML and databases; classifications by Soderland and by Elmasri & Navathe)

As Laender et al. point out in [13], there is a trade-off between a tool's degree of flexibility and its degree of automation. A tool with a high degree of automation tends to have parameters and heuristics tuned to a certain type of page. These must be changed if the tool is applied to another domain; there are different challenges to face when extracting data from a page selling hard drives compared to a page listing available jobs.

3.2 Extracting Articles From News Sites

The level of difficulty of a particular extraction task (in general, not necessarily web extraction) depends upon how structured a document is. The method of measuring the degree of structure varies between research domains, as seen in Figure 3.1 from [3], which depicts the relationship between difficulty and structure as viewed by a linguist (Soderland) or by database researchers (Elmasri and Navathe). Most challenging of all is unstructured text, such as news articles. This applies when one extracts data from within a news article. However, this project is concerned with the extraction of the news article itself from html, a semi-structured document type, which is a less overwhelming task. [3]

Nonetheless, news extraction is a specific domain in itself. Semantically, the content of the article is not important; the only thing that matters is that the article as a whole has been successfully identified. This is a much easier task than extracting, for instance, product information from different sites, where data must first be correctly extracted and then semantically labelled. If extracting information from sites selling race bikes, it is not enough to find the bike information, e.g. model number, colour, frame size and price; these data fields must also be correctly detected as precisely these.

However, there are also numerous difficulties. News extraction is less about separating one structure from a larger structure, or from many structures, and more about filtering content from unwanted clutter. The lack of structure becomes a problem. Fully automated wrapper generation is unsuitable for news page extraction, since news pages do not contain enough reoccurring block patterns. [24]

Just like on most other sites, there are many factors creating nuisances: every news site has its own design, which also changes frequently. Pages are covered with scripts, images and various other distracting elements. There are also some minor details complicating the process which are more specific to news sites. According to [15], web designers even try to make it harder to extract data from news sites, motivated by the need to complicate the function of ad-blockers. Sites might require users to log in before allowing them to read (full) articles. Papers may choose to divide their news articles into several separate pages.

There is not a large amount of work specifically aimed at this domain; the most relevant works are described briefly in the rest of this chapter. There are not many of them since, as claimed by Zhang et al. [22], wrapper techniques had, as of 2006, never been aimed at news extraction. (This, however, is not entirely true, as one of the works listed below is from 2004, but it was apparently overlooked.)

3.2.1 Tree Edit Distance (2004)

Reis et al. [17] conducted the best-known work in this field in 2004; it has since been the major work of reference. Their method is aimed at finding hidden templates, using structural similarities. In short, training-set pages from a site of interest are clustered based on similarity. A pattern with wildcards is then calculated to match all of the pages clustered together, and data is matched and labelled. During extraction, the current page is compared to the patterns of the clustered pages to find the most similar cluster/pattern. Similarity is based on tree edit distance, which is the cost of transforming one labelled ordered rooted tree into another. There are three permissible operations involved in transformations: node insertion, node deletion and node replacement. All operations have a cost, which may be unit cost.

An alternative way of describing the cost of transformations is through a mapping [17]:

Definition 3.2.1. Let Tx be a tree and let Tx[i] be the i-th vertex of tree Tx in a preorder walk of the tree. A mapping between a tree T1 of size n1 and a tree T2 of size n2 is a set M of ordered pairs (i, j), satisfying the following conditions for all (i1, j1), (i2, j2) ∈ M:

– i1 = i2 iff j1 = j2

– T1[i1] is on the left of T1[i2] iff T2[j1] is on the left of T2[j2].

– T1[i1] is an ancestor of T1[i2] iff T2[j1] is an ancestor of T2[j2].


Figure 3.2: Mapping

Figure 3.3: Top down mapping

An example of a mapping can be seen in Figure 3.2, taken from [17]. Any dotted lines going from T1 to T2 indicate that the vertex should be changed in T1. Any vertices in T1 which are not connected by lines to T2 should be deleted and, similarly, any vertices in T2 which are not connected to T1 should be inserted.

Calculating the edit distance is quite computationally expensive, and many simplified, restricted definitions exist, e.g. the top-down mapping defined below, which disallows removal and insertion in all positions but the leaves; see Figure 3.3 from [17] for an example.

Definition 3.2.2. A mapping M between a tree T1 and a tree T2 is said to be top-down only if for every pair (i1, i2) ∈ M there is also a pair (parent(i1), parent(i2)) ∈ M, where i1 and i2 are non-root nodes of T1 and T2 respectively.

This restricted definition is restricted further by the authors, by adding that replacements, too, may only take place in leaves, resulting in a restricted top-down mapping (RTDM). An example provided by [17] is depicted in Figure 3.4.

Definition 3.2.3. A top-down mapping M between a tree T1 and a tree T2 is said to be restricted top-down only if for every pair (i1, i2) ∈ M, such that T1[i1] ≠ T2[i2], there is no descendant of i1 or i2 in M, where i1 and i2 are non-root nodes of T1 and T2 respectively.

The algorithm locates all identical subtrees in trees T1 and T2 at the same level, and groups them into equivalence classes. For each class, a mapping is found between the trees.


Figure 3.4: Restricted top down mapping example

Figure 3.5: Node extraction patterns

To extract data from a web site one must use a training set, a large number of pages from the same site. Pages are compared using the RTDM algorithm and then clustered using a standard clustering technique. For each of the clusters a node extraction pattern (ne-pattern) is generated, a form of regular expression for trees; see Figure 3.5 for some examples given in [17].

Definition 3.2.4. Let a pair of sibling sub-trees be a pair of sub-trees rooted at sibling vertices. A node extraction pattern is a rooted ordered labelled tree that can contain special vertices called wildcards. Every wildcard must be a leaf in the tree, and each wildcard can be one of the following types.

– Single (·) A wildcard that captures one sub-tree and must be consumed.

– Plus (+) A wildcard that captures sibling sub-trees and must be consumed.

– Option (?) A wildcard that captures one sub-tree and may be discarded.

– Kleene (∗) A wildcard that captures sibling sub-trees and may be discarded.

If one wishes to extract a page, the tree representing this page is compared with the node extraction patterns generated for each of the clusters from the training data. Once a page is matched against a pattern, data is easily located. Data is labelled using a number of heuristics, including length of text.
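To give a feeling for how structural similarity between two labelled ordered trees can be scored during comparison and clustering, the sketch below shows a deliberately simplified, purely positional top-down matcher. It is not the RTDM algorithm of [17], which aligns sibling sequences with an edit-distance computation and supports wildcards; the node labels and example trees are invented.

import java.util.*;

// A deliberately simplified illustration of top-down tree comparison: two trees
// are matched from the root downwards and children are aligned purely by
// position. It only conveys the flavour of scoring structural similarity
// between two labelled ordered trees; it is NOT the RTDM algorithm.
public class TopDownMatchSketch {
    static class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(String label, TreeNode... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Number of nodes matched in a positional top-down alignment.
    static int match(TreeNode a, TreeNode b) {
        if (!a.label.equals(b.label)) return 0;   // roots differ: nothing below can match
        int score = 1;
        int n = Math.min(a.children.size(), b.children.size());
        for (int i = 0; i < n; i++) score += match(a.children.get(i), b.children.get(i));
        return score;
    }

    public static void main(String[] args) {
        TreeNode t1 = new TreeNode("html",
                new TreeNode("body", new TreeNode("div", new TreeNode("p")), new TreeNode("div")));
        TreeNode t2 = new TreeNode("html",
                new TreeNode("body", new TreeNode("div", new TreeNode("p")), new TreeNode("table")));
        System.out.println("matched nodes: " + match(t1, t2));   // prints 4
    }
}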


3.2.2 The Curious Negotiator (2006)

The Curious Negotiator is an agent system for negotiations described by [22]. The negotiated items are pieces of information, specifically news articles. As a part of the system, articles must be fetched, classified and stored for later use by negotiation agents. A data extraction agent added to the system performs the following three stages: 1) data extraction, 2) text filtering using a dynamic filter, 3) keyword validation.

The key ideas behind the extraction are that web sites are constructed from hidden or visible nested tables in combination with CSS (Cascading Style Sheets), and that a news article is the largest block of text on a web page.

Extracting news from a page thus becomes the task of identifying the largest portion of text in a table. This is done by inserting all html tags of a page into an array, removing all tags which are not a table (<table>...</table>) or within table tags. The array is then iterated and each text item is appended to a container which holds all text at its nesting depth. Once the array has been iterated, the container with the largest amount of text is returned.
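A rough Java sketch of that idea is shown below, expressed over a DOM tree rather than a tag array: every text node is credited to its nearest enclosing table, and the table that gathers the most text is returned. The example page and all class names are invented, and well-formed XHTML is assumed so that the standard XML parser can be used; real pages would first have to be cleaned.

import java.io.StringReader;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

// Sketch of the "largest text block in a table" idea: every text node is
// attributed to its nearest enclosing <table>, and the table that accumulates
// the most text is taken to hold the article.
public class LargestTableTextSketch {

    static Element largestTextTable(Document doc) {
        Map<Node, StringBuilder> textPerTable = new HashMap<>();
        collect(doc.getDocumentElement(), null, textPerTable);
        Element best = null;
        int bestLen = -1;
        for (Map.Entry<Node, StringBuilder> e : textPerTable.entrySet()) {
            int len = e.getValue().toString().trim().length();
            if (len > bestLen) { bestLen = len; best = (Element) e.getKey(); }
        }
        return best;
    }

    // Walk the tree, remembering the nearest enclosing table for each text node.
    static void collect(Node node, Node enclosingTable, Map<Node, StringBuilder> acc) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "table".equalsIgnoreCase(node.getNodeName())) {
            enclosingTable = node;
            acc.putIfAbsent(node, new StringBuilder());
        } else if (node.getNodeType() == Node.TEXT_NODE && enclosingTable != null) {
            acc.get(enclosingTable).append(node.getTextContent()).append(' ');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) collect(children.item(i), enclosingTable, acc);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<table><tr><td>Menu Home News Sports</td></tr></table>"
                + "<table><tr><td>This is the body of the article, by far the longest run of text on the page.</td></tr></table>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(largestTextTable(doc).getTextContent().trim());
    }
}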

A second page, preferably similar to the first one, is then fetched, and extraction is done in the same manner. The result is compared to the extracted body of text from the first page. Any identical sentences are considered static parts of the page and are then removed (filtered) from the end result.

To ensure that the URL used during extraction was a valid one, and that nothing else went wrong, the result is validated. Validation is performed using keywords that should reasonably occur in the text. The keywords used are words appearing in the title (except for stop words, i.e. words which do not add any real information, such as "she" or "the"). If they are found to a satisfactory degree the text is accepted as an article.

Naïve as this approach may seem, the authors claim the method is fully functional, although they do not compare it to any other.

3.2.3 Tag Sequence and Tree Based Hybrid Method (2006)

Li et al. view all extraction techniques as either tag sequence based or tree based, and propose in [14] a hybrid method using both. With a tag sequence based approach, one may use existing techniques such as languages with good support for regular expressions. Unfortunately the nesting structure of the document is hard to keep track of with this approach, whereas it is an inherent property of tree based approaches. These, however, do not have the same support for comparing similarity or pattern matching.

The html document is transformed into a novel representation format for web pages, what the authors call a TSReC (Tag Sequence with Region Code): a list whose entries are called tag sequences (TS).

TS = <N, RCb, RCe, RCp, RCl, C>


Tag sequences contain information about region beginning (RCb), region end (RCe), parent (RCp) and level (RCl). C is the content: inner html tags or nothing at all. Together these make it possible to treat the TSReC as a tree.
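As an illustration, one entry of such a list could be modelled as a plain Java class; the field names follow the notation above, but this is not the authors' implementation.

// Sketch of one TSReC entry: region begin/end, parent and level codes make it
// possible to treat the flat list of tag sequences as a tree.
public class TagSequence {
    final String name;      // N: the tag name
    final int regionBegin;  // RCb
    final int regionEnd;    // RCe
    final int parent;       // RCp: region code of the parent tag sequence
    final int level;        // RCl: nesting depth
    final String content;   // C: inner html/text, possibly empty

    TagSequence(String name, int regionBegin, int regionEnd, int parent, int level, String content) {
        this.name = name;
        this.regionBegin = regionBegin;
        this.regionEnd = regionEnd;
        this.parent = parent;
        this.level = level;
        this.content = content;
    }
}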

Any web page can be divided into three different kinds of areas:

Common parts: areas which are common to all pages of a site, such as the top part of a news site, which usually has items such as the paper name, the date and a small navigational menu with the sections of the paper.

Regular parts: for instance navigational fields, which occur on all pages but whose content may change.

Content parts: the target of extraction.

Another page from the same web site, as similar as possible to the current page, is compared to it. The common parts of both pages are first identified using sequence matching, and then the regular parts using tree matching techniques, leaving the content part as the final result.

The sequence matching is done by finding the string edit distance between the two pages, and the tag sequences of the common parts are removed from the sequence. String edit distance, i.e. the number of operations (taken from some predefined set) that are needed to convert one string into the other, is used since the common parts are not always completely identical; for instance a date or time may be included.
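In its simplest form, with unit costs and plain character strings, the string edit distance is the textbook Levenshtein dynamic program sketched below. Li et al. apply the same idea to sequences of tags rather than characters, and their exact cost model may differ; the example strings are invented.

// Standard Levenshtein (string edit) distance with unit costs, shown on plain
// strings; only the textbook dynamic-programming formulation.
public class EditDistanceSketch {
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;          // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;          // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + subst,          // replace (or keep)
                          Math.min(d[i - 1][j] + 1,                  // delete from a
                                   d[i][j - 1] + 1));                // insert into a
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Two near-identical "common parts" that differ only in the date.
        System.out.println(editDistance("Front page - May 3", "Front page - May 14")); // prints 2
    }
}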

Finding the regular parts is harder than finding the common parts, in particular when they are too flexible, too different. The solution is to create subtrees of all tag sequences which belong together, using the region codes in the tag sequences. Subtrees of one page are then compared to the subtrees of the other page.

The remaining tag sequences contain the targeted content part.

3.2.4 Linguistic and Structural Features (2007)

Ziegler et al. [24] suggest a system which identifies text blocks in a document and finds threshold values for different features (properties) of the blocks, in order to determine whether a block is part of the article or not.

The actual values of the thresholds are calculated using a stochastic non-linear optimisation method called Particle Swarm Optimisation, applied to a large number of pages. These threshold values are then used when extracting other pages.

Text Block Identification

First all the text blocks of a page must be isolated. An html page is converted to xhtml to allow the page to be parsed into a DOM tree. The tree is traversed and its nodes are either removed, pruned or left untouched. When pruning a node n, the entire subtree of n is deleted along with n, as opposed to when a node is removed and only n, or rather its value, is deleted from the tree. The removal of an element or subtree may also mean that the value of the node is replaced by whitespace, to ensure that whitespace in the text is preserved; this is the case for the <br> and <p> tags. Pure formatting tags, such as <big>, <span> or <i>, are simply discarded. Pruning removes any parts of the tree which never contain any real content, such as <textarea>, <img> or script elements. The tree is then traversed and the text blocks are identified.

Features

For each block, minimum or maximum thresholds are calculated for eight separate features (properties), which can be either linguistic or structural. There are four features of each kind:

– Structural features

1. Anchor ratio - the proportion of words which are links compared to the total number of words; tends to be low in running text.

2. Format tag ratio - plain text contains relatively many formatting tags.

3. List element ratio - list elements are not very common in text compared to its surroundings.

4. Text structuring ratio - headings, paragraphs and text block alignment tags are expected to be more frequent in text.

– Linguistic features

1. Average sentence length - text tends to have longer sentences than, for instance, link lists.

2. Number of sentences - often higher in plain text.

3. Character distribution - the ratio of alphanumeric characters compared to other characters; expected to be higher in text.

4. Stop-word ratio - should be high for continuous text.

Thresholds are found using a large set of pages, where the correct text blocks have been manually identified beforehand.
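As an illustration, a few of the linguistic features could be computed for a single text block roughly as in the sketch below. The feature definitions follow the descriptions above only loosely, the stop-word list is a tiny invented sample, and the threshold values themselves (found through Particle Swarm Optimisation in [24]) are not modelled.

import java.util.*;

// Sketch of computing a few of the linguistic features for one text block.
public class BlockFeaturesSketch {
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "of", "and", "is", "in", "to", "it");

    static double averageSentenceLength(String block) {
        String[] sentences = block.split("[.!?]+");
        if (sentences.length == 0) return 0;
        int words = block.trim().isEmpty() ? 0 : block.trim().split("\\s+").length;
        return (double) words / sentences.length;
    }

    static double stopWordRatio(String block) {
        int stops = 0, total = 0;
        for (String w : block.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            total++;
            if (STOP_WORDS.contains(w)) stops++;
        }
        return total == 0 ? 0 : (double) stops / total;
    }

    static double alphanumericRatio(String block) {
        int alnum = 0, total = 0;
        for (char c : block.toCharArray()) {
            if (Character.isWhitespace(c)) continue;
            total++;
            if (Character.isLetterOrDigit(c)) alnum++;
        }
        return total == 0 ? 0 : (double) alnum / total;
    }

    public static void main(String[] args) {
        String articleText = "The market fell sharply on Tuesday. Analysts blamed a weak earnings report.";
        String linkList = "Home | Sports | Weather | Log in";
        System.out.printf("article: avg sentence %.1f, stop words %.2f, alnum %.2f%n",
                averageSentenceLength(articleText), stopWordRatio(articleText), alphanumericRatio(articleText));
        System.out.printf("links:   avg sentence %.1f, stop words %.2f, alnum %.2f%n",
                averageSentenceLength(linkList), stopWordRatio(linkList), alphanumericRatio(linkList));
    }
}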

Extraction

Once the thresholds have been found, extraction of an html page can be done by parsing it into a DOM tree, which is then processed as described above to identify blocks. Which blocks belong to the article is determined by calculating the features for each block and then comparing them to the threshold values for each feature.


3.2.5 Visual and Perceptual

Compared to the previously described approaches to news content extraction, visual and perceptual methods are very different. In contrast to other methods, where the most common course of action is to process html either on a tag-level basis and/or as trees, often as a parsed DOM, visual and perceptual approaches are only concerned with how the html has been rendered, not with its structure.

Visual Consistency (2007)

As observed by Zheng et al. [23], humans are easily able to locate the actual content of a news page, regardless of whether it is written in a language they are familiar with or not. The reason is that papers, perhaps by convention, always give the area holding the news article content the same characteristics; the pages are visually consistent:

– Its area is relatively large compared to surrounding objects.

– There is a title at the top, with a contrasting size and/or font.

– It consists mainly of plain text, and a few other items such as pictures or diagrams.

– The area is fairly centred on the page.

Even if the hidden structure behind the page changes, these four characteristics remain.

Motivated by this observation, Zheng et al. suggest dividing a page into blocks corresponding to the visual parts that it is built from. These blocks are rendered from a pair of html tags or the text between them, and are given an ID. They are nested, and have a size and a position.

Inner blocks have at least one child block, leaf blocks none. Leaf blocks are labelled as one of Title, Content or Others.

The largest block is the entire page itself, everything within the <body> ... </body> tags. A page can be viewed as a visual tree made from blocks. The tree contains a parent–child relationship for blocks where one is on top of another.

The authors use an html parsing tool from Microsoft to obtain a visual tree. (It should perhaps be mentioned that two of the three authors work for Microsoft in China.) The tool is able to provide a lot of information about each block, including coordinates, width, height and formatting information, e.g. font size and whether text is bold or italic. Lastly some statistics are given, including the number of images, hyperlinks and paragraphs, and the length of the text, among others.

Machine learning is used to train the tool and induce a template-independent wrapper. Since the visual properties of any news page are quite stable even when the structure behind it is changed, there is no need to find any hidden template.

Blocks of the pages in a training set are manually labelled as inner or leaf, and further: a positive inner block is one which has at least one child labelled as Title or Content; otherwise it is called negative.


By using the values of the inner blocks and their labels, and the information about each block given by the html parsing tool, the system is trained to determine which blocks contain some news. A second learning iteration is performed on these blocks to more precisely locate the news and the title.

Perception Oriented (2008)

Another, quite similar approach, which also focuses on how a human is believed to go about finding news, is given by Chen et al. in [4]. Properties quite similar to the ones mentioned by [23] are attributed to the areas of a page which contain the actual news content, and which humans use to identify that content.

These properties are:

Functional property: the key function of this area is to provide information.

Space continuity: the contents are located continuously in space, separated only by non-informational areas, such as images, navigational or decorational areas.

Formatting continuity: all news areas should have approximately the same formatting.

The authors claim that their method mimics humans by first locating the area containing news and then locating the actual news content.

Each page is turned into a structural representation consisting of objects, called a Function-based Object Model (FOM). Each object represents a content part of a page. Objects have different functions, basic or specific, depending on what kind of content they represent and what the author is trying to convey with this particular piece of content [5]. There are four types: Information, Navigation, Interaction or Decoration Objects. Objects carrying more than one function are called mixed. There are also a number of objects specific to the domain of news extraction:

Block Object: separated from other objects by spaces.

Inline Objects: objects which are displayed one after another within a block object.

Text Information Object / Media Information Object: specific types of information object.

Leaf Block Object: a special kind of block object. Leaf block objects cannot contain any other kind of block objects, only inline objects.

Leaf Block Information Object (LBIO): a leaf block object whose main functionality is to supply information. If the media type of the object is text, it is a Text Leaf Block Information Object (TLBIO).


The authors use a five-step algorithm which is based on the following theorems (the writers call these axioms):

Theorem 3.2.1. News content of a news web page is presented as a set of TLBIOs in the page.

Theorem 3.2.2. A news TLBIO can only be contained in an information or mixed object.

Theorem 3.2.3. News TLBIOs of a news page are presented in one or more rectangular areas. Vertically, these rectangular areas are separated by media information objects and/or non-information objects.

Theorem 3.2.4. The major content format in a news area is similar to the formats used by the majority of objects inside all news areas.

The five stage algorithm is, in short, as follows:

1. In the first stage of the algorithm, the page is transformed into a tree structure of FOMs. We refer the reader to [5] for the details of this procedure.

2. Next all TLBIOs are detected. The tree is traversed top to bottom. Based on Theorems 3.2.1 and 3.2.2, all blocks which are information or mixed objects are added to a set of TLBIOs. The children of any composite block whose relative area compared to the rest of the page is large enough are also added to the set (to avoid missing an information object in a large navigational object).

3. Any areas in the set of TLBIOs which are close enough (Theorem 3.2.3) or have similar formatting (Theorem 3.2.4) are merged. If the resulting area is larger than a predetermined threshold value this step is performed again, but with a more conservative view on closeness.

4. The merged areas and the TLBIOs are examined and, based on position and formatting among other criteria, the system decides which parts are the actual news content.

5. In the final stage the title is extracted, using a few heuristics. Some of the factors considered include that titles tend to be short, close to the article and relatively large in font size.

3.2.6 A Generic Approach (2008)

Dong et al. [7] present a method which does not need to find a hidden template for each page, but instead uses a few heuristics based on the following observations about the DOM trees of articles on news web pages: news articles (including text, date and title) are generally located under a separate node; they are comprised of a number of paragraphs, located close to each other and often with other, unrelated material between them. Format-wise, they contain a lot of text and few links.

The authors use the following terminology to describe this more concisely:


Block Node: the ancestor of nodes which contribute to the structure of the page. Usually contains html tags such as <table>, <div>, <td>, <tr> or <body>.

Paragraph Node: represents news text; tags such as <p> or <br>.

Invalid Node: does not contribute to the content of the article, such as <script>, <form>, <input>, <select>, or a node whose children are empty.

Node's Semantic Value: the number of characters of the content below the node which are not included in a hyperlink.

The previous observations can now be put into these general rules: 1) news, including its text, is located below a block node; 2) text is below a paragraph node; 3) the block node containing the news will have the largest semantic value; 4) invalid nodes are irrelevant and should be deleted.

These rules are then applied during extraction in the following manner:

1. To remove invalid nodes, the page is transformed into XML.

2. Next the tree is traversed top to bottom, layer by layer, until a paragraph node is reached. In each step, the semantic value of each node is calculated and only the children of the node with the largest semantic value are evaluated next. The news article text is composed of the text nodes under the current block node.

3. From the current block and upwards until a <table> or <div>, the content of the nodes is matched against regular expressions in an attempt to find a date indication.
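A sketch of the semantic-value descent (rules 1–3 above) on a DOM tree is given below: the semantic value of a node is the number of characters below it that are not inside a hyperlink, and extraction repeatedly descends into the child with the largest value until a paragraph level is reached. Invalid-node removal and the date heuristic are omitted, and the example page and class names are invented.

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

// Sketch of the semantic-value descent described by Dong et al.; simplified,
// and not the authors' implementation.
public class SemanticValueSketch {

    // Characters of text below the node that are not inside an <a> element.
    static int semanticValue(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) return 0;
        if (node.getNodeType() == Node.TEXT_NODE) return node.getTextContent().trim().length();
        int sum = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) sum += semanticValue(children.item(i));
        return sum;
    }

    static boolean isParagraph(Node n) {
        return n.getNodeType() == Node.ELEMENT_NODE
                && (n.getNodeName().equalsIgnoreCase("p") || n.getNodeName().equalsIgnoreCase("br"));
    }

    // Descend into the child with the largest semantic value until a paragraph
    // node is found among the children; return the block reached.
    static Node findArticleBlock(Node node) {
        while (true) {
            Node bestChild = null;
            int bestValue = -1;
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node c = children.item(i);
                if (isParagraph(c)) return node;             // paragraph level reached
                if (c.getNodeType() != Node.ELEMENT_NODE) continue;
                int v = semanticValue(c);
                if (v > bestValue) { bestValue = v; bestChild = c; }
            }
            if (bestChild == null) return node;              // nothing left to descend into
            node = bestChild;
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div><a href='x'>Home</a> <a href='y'>Sports</a></div>"
                + "<div><p>First paragraph of the article text.</p><p>Second paragraph.</p></div>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(findArticleBlock(doc.getDocumentElement()).getTextContent().trim());
    }
}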


Chapter 4

Approach

There are a number of design choices to take into consideration. First, the major tasks of the system were identified as page fetching, extraction, parsing and database interaction. Motivated by the aim to keep the program modular, each of these major tasks, and other less central ones, is implemented in a separate part.

All four major parts are described in more detail below. Apart from these, some additional functionality is required, which is described briefly at the end of this chapter.

4.1 Web Content Syndication

Compiling information from different web-based sources and presenting it in a concise manner to a user is called web (content) syndication. As a service to its readers, a news provider, or even a simple blogger, may choose to provide automatic updates on subjects of the user's chosen interests by publishing what is called a feed.

To be able to receive a feed one must subscribe to it using a program called an aggregator or feed reader, which polls the site for updated feeds. Updates are sent to the reader through a web feed or syndication feed, an XML-formatted message containing the headline, a link to the page, and often also a short summary or sometimes even the actual content. There are a number of end-user syndication publishers and readers available, both standalone, web based and as extensions to most modern web browsers.

Instead of visiting a number of different sites to see if there have been any updates, a user can simply use the feed reader to get an overview and decide whether this is information he or she is interested in.

All this allows for an easy way to manage the information flow, relieving users of the need to constantly check their favourite sites for updates. At the same time it is easier to get an overview of incoming updates, giving the opportunity to easily determine whether this is information of interest, in which case the content at the link location can be investigated further. [21]

There are two major formats for syndication feeds, RSS [2] and Atom. The majority of feeds use RSS. The actual meaning behind this acronym depends on which version is being referred to; the current one, RSS 2.0, is said to be "Really Simple Syndication" [1]. A multitude of different versions of RSS exist, and there are numerous compatibility issues between them [16]. Atom is not as widespread, despite not having any compatibility issues and having a stricter view on document formatting to avoid the use of poorly structured feeds.

These techniques can be utilised for our purposes as well, as all major news sites provide feeds as a service to their readers. Instead of using spiders or crawlers, these feeds can be used in the data gathering.

In addition to existing RSS and Atom readers there are also various libraries that help programmers publish and subscribe to feeds directly, among others ROME (http://rome.dev.java.net), Informa (http://informa.sourceforge.net), Eddie (http://www.davidpashley.com/projects/eddie.html) and Universal Feed Parser (http://www.feedparser.org/). All support several versions of RSS and Atom, but no project except ROME has been updated in a long time, and Eddie is also unsuitable for this project from a licensing point of view.

ROME is free, recently updated and still maintained, and is released under the Apache License 2.0. It supports all versions of RSS and Atom and relieves the user from worrying about incompatible versions and other details. For these reasons ROME is used in the system to read RSS and Atom feeds.
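A sketch of fetching a feed with ROME and listing the entries whose links the extractor would then download is given below. It assumes the com.sun.syndication package names used by the ROME releases of the time (later versions moved to com.rometools.rome), and the feed URL is a placeholder.

import java.net.URL;
import java.util.List;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

// Sketch of reading a feed with ROME and printing the article links that the
// extractor would then fetch. Illustrative only, not the system's code.
public class FeedReaderSketch {
    public static void main(String[] args) throws Exception {
        String feedUrl = args.length > 0 ? args[0] : "http://example.com/rss.xml"; // placeholder URL
        SyndFeedInput input = new SyndFeedInput();
        SyndFeed feed = input.build(new XmlReader(new URL(feedUrl)));

        System.out.println("Feed: " + feed.getTitle());
        @SuppressWarnings("unchecked")
        List<SyndEntry> entries = feed.getEntries();      // untyped list in old ROME versions
        for (SyndEntry entry : entries) {
            // Title and link are what the extraction step needs; the link is
            // fetched and handed to the article extractor.
            System.out.println(entry.getTitle() + " -> " + entry.getLink());
        }
    }
}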

4.2 Extraction

There are many factors to consider when designing an extraction module. As argued in Section 3.2, there are quite varied approaches to the problem. However, as work was in an initial stage, the focus was on getting a simple prototype working, with less consideration of its actual performance. This was motivated by the desire to get a basic version of the entire system up and running at an early stage, partly to avoid delaying the development of the backend of the system.

Inarguably, the simplest of all the algorithms in Section 3.2 is the one given in the Curious Negotiator in Section 3.2.2. However, questions were raised as to how effective this algorithm really is. It was considered slightly too naïve, and some alterations, described below, were made.

The Curious Negotiator algorithm presented by Zhang and Simoff [22] is entirely tag oriented, and also makes the assumption that the page structure is completely constructed from tables.

The extraction implemented in this system is based on Zhang and Simoff's algorithm, but it is less focused on tables and is not tag oriented; instead the page is converted into a DOM tree.


Figure 4.1: DOM tree with path prefixes

In order for the conversion to work, the page must first be processed. Few web pages conform to set standards, and they must therefore be cleaned before conversion can be done. This is done by an open source library called HtmlCleaner (http://htmlcleaner.sourceforge.net/), which interprets html in a fashion similar to that of browsers. Among other things the library rearranges tags to produce more well-formatted html, and is finally able to return a DOM tree.

A very small, simplified example of a DOM tree is given in Figure 4.1; the corresponding html is provided in Table 4.1.

<html>

<head>

<title>A short document!</title>

</head>

<body>

<h1>My Hobby!</h1>

<p>I like watching TV <i><b>a lot</b></i>!!!</p>

<p>Some of my favourite shows are the Simpsons and the Big Bang theory.</p>

</body>

</html>

Table 4.1: Corresponding html to Figure 4.1

Instead of looking at tags, the DOM tree returned from HtmlCleaner is examined, and this examination is not limited to nodes below table nodes. Instead, all text nodes (with a few exceptions, most notably lists of links) are

5 Project homepage: http://htmlcleaner.sourceforge.net/


included in the process. All nodes are given a unique path name based on their location in the tree. These are the numbers close to each node in the tree in Figure 4.1; they are not part of the DOM tree, but a way of describing the path, or location, of each node. All text nodes which have a common path name prefix are appended after each other. The longest body of text with a common prefix is considered to be the article. This description is a simplified one: several alterations are made to the DOM tree prior to finding the longest common path name prefix. The alterations are simple heuristics which have been found, by observation, to improve results.
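The core of the heuristic can be summarised in a few lines of Java. The sketch below is not the system's actual implementation: it assigns a dotted path to every node, groups text nodes by the path of their parent element (one particular choice of common prefix), and returns the largest group. The real implementation adds further heuristics, such as skipping link lists and letting text below pure formatting tags like <i> and <b> inherit the parent's prefix.

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.util.HashMap;
import java.util.Map;

public class PrefixExtractor {

    // Walks the DOM tree, giving every node a dotted path such as "2.1.1",
    // and appends the value of each text node to the group of its parent path.
    private static void collect(Node node, String path, Map<String, StringBuilder> groups) {
        int index = 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String childPath = path.isEmpty() ? Integer.toString(index) : path + "." + index;
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue() != null) {
                StringBuilder group = groups.get(path);
                if (group == null) {
                    group = new StringBuilder();
                    groups.put(path, group);
                }
                group.append(child.getNodeValue().trim()).append(' ');
            } else {
                collect(child, childPath, groups);
            }
            index++;
        }
    }

    // Returns the largest body of text whose text nodes share a common path prefix.
    public static String extract(Document doc) {
        Map<String, StringBuilder> groups = new HashMap<String, StringBuilder>();
        collect(doc.getDocumentElement(), "", groups);
        String best = "";
        for (StringBuilder candidate : groups.values()) {
            if (candidate.length() > best.length()) {
                best = candidate.toString();
            }
        }
        return best;
    }
}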

To further improve results, [22] applies a dynamic filter, which is constructed by extracting the content from two pages on a site. Any common material is added to the filter.

The Curious Negotiator algorithm uses a web crawler to fetch pages from news sites, and to avoid including invalid pages, or pages not containing news, keyword validation is applied. Since this system uses RSS feeds from news sites, the risk of encountering invalid pages is considered small enough to completely ignore this stage.

To cope with pages of a less well structured character, with plenty of formatting such as <i>, <b>, or <blockquote> which might be used in a non-standard manner, any text node that occurs below a node representing a simple, purely text-formatting tag is given the same prefix as its parent node.

Although no dynamic filters are available, a user can add static filter sentences which will never be included in an extraction.

By providing the name of the HTML tag pointing to the next page in a multi-page article, the extractor can fetch all pages in the article and extract the article from all of them.

4.3 Parsing

The original requirement that some implementation of the Charniak parser must be used was changed, as there apparently is no Java implementation easily available. The Stanford NLP parser6 is considered an acceptable alternative, is available in Java, and is therefore used in place of the Charniak parser. Unfortunately, it is not the ideal choice when it comes to software licences, since the Stanford NLP parser is released under the GNU GPL v2, which does not allow incorporation into proprietary software. However, by building the system modularly, the parser can be replaced if needed, as long as the output of a new parser follows the standard Penn Treebank annotation style.
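A minimal sketch of how the Stanford parser can be invoked from Java is shown below. It assumes a recent LexicalizedParser API and the bundled English PCFG model; both the method names and the model path are assumptions that should be verified against the parser version actually used.

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class ParserSketch {
    public static void main(String[] args) {
        // Load the English PCFG grammar shipped with the Stanford parser
        // (model path is an assumption for this sketch).
        LexicalizedParser parser = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

        // Parse a single sentence and print the tree in Penn Treebank style,
        // which is the format the backend expects.
        Tree tree = parser.parse("Jeremy is afraid of mopeds.");
        tree.pennPrint();
    }
}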

Two small examples of output from the Stanford parser are provided in Table 4.2; the sentences are “Negative, I am a meat popsicle” and “Jeremy is afraid of mopeds”. A corresponding parse tree for the second sentence can be seen in Figure 2.2. (There is a small mismatch between them; they are, however, both

6 Available at http://nlp.stanford.edu/software/lex-parser.shtml


valid, but the result of two different parsers. The tree in Figure 2.2 is generated using the Stanford parser and the one in Table 4.2 using a Charniak parser.)

(ROOT
  (S
    (ADVP (RB Negative))
    (, ,)
    (NP (PRP I))
    (VP (VBP am)
      (NP (DT a) (NN meat) (NN popsicle)))
    (. .)))

(ROOT
  (S
    (NP (NNP Jeremy))
    (VP (VBZ is)
      (ADJP (JJ afraid)
        (PP (IN of)
          (NP (NNS mopeds)))))
    (. .)))

Table 4.2: Sample output from parser

4.4 Database

Each document must be stored in the database for later processing by the backend. The information that must be stored includes the URL of the document, the title of the article, all sentences from the article along with their corresponding parse trees, and information about each site/RSS feed. Parse trees are stored in a separate table, where each entry in the table is a node of a parse tree and includes a tree ID, a node ID, a word classification ID and the actual value (which is null for all inner nodes).

The simple approach to storing trees would be to use an adjacency list, where each node entry contains a reference to its parent. To find an entire tree, one recursively queries the database for each level from the root and downwards. The adjacency list model is quite slow and inefficient, and troublesome to work with for applications such as ours.

This work uses another model called the Nested Set Model. It is based on an idea found in [8] on how to preserve the hierarchical structure of trees. Each node entry in the database contains a left and a right value, numbered in a preorder tree traversal fashion. The left value of a node is the smallest number assigned within its subtree, and its right value is the largest. All leaves have a difference of one between their left and right values. A small example taken from [12] is depicted in Figure 4.2.
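The numbering itself is a single preorder walk over the tree. The sketch below is a hypothetical illustration (the TreeNode class and its field names are invented for the example, not taken from the system): it assigns left and right values to every node so that the rows can then be inserted into the parse-tree table.

import java.util.ArrayList;
import java.util.List;

public class NestedSetNumbering {

    // A minimal tree node used only for this illustration.
    static class TreeNode {
        String label;
        List<TreeNode> children = new ArrayList<TreeNode>();
        int left;
        int right;

        TreeNode(String label) {
            this.label = label;
        }
    }

    // Assigns nested-set values: each node's left is smaller and its right
    // larger than every value in its subtree; leaves get right = left + 1.
    static int number(TreeNode node, int counter) {
        node.left = counter++;
        for (TreeNode child : node.children) {
            counter = number(child, counter);
        }
        node.right = counter++;
        return counter;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode("S");
        root.children.add(new TreeNode("NP"));
        root.children.add(new TreeNode("VP"));

        number(root, 1);  // in this tiny tree the root gets left = 1, right = 6
        System.out.println(root.label + ": " + root.left + ", " + root.right);
    }
}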

In order to improve performance during backend queries, the tree depth was also added at a later stage in the development (which actually removed the need for a right value altogether).

The database runs on a MySQL server provided by CodeMill AB.

4.5 System Overview

In short, the components described above are the following.

Extractor. Extracts articles, provides the possibility to add static filters and fetches multi-page articles.


Figure 4.2: Left and right numbering in the Nested Set Model (a small example tree with nodes A to F, each annotated with its left and right values, 1 through 12).

FeedReader. Takes a URL pointing to a feed and extracts data from the feed, using a library called ROME.

Parser. Returns a string representation of a parse tree, using the Stanford NLP parser.

DatabaseInteraction. Converts any parse tree into a more appropriate format for insertion into the database. Performs all interaction with the database.

In addition to these, some other minor tasks must be carried out. These include managing information about each site, providing a common policy for handling strings, and supporting DOM processing. They are implemented in the following parts:

DOMparser. Auxiliary methods for examining DOM documents.

SiteManagement. Handles adding and removing sites.

TextProcessing. Ensures identical processing of text in different parts of the system. Detects sentences in the text using OpenNLP7, a natural language processing library (see the sketch below).
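As an illustration of the sentence detection step, the following sketch uses OpenNLP's trained sentence model. It assumes an OpenNLP 1.5-style API and a model file named en-sent.bin; both are assumptions that need to be matched to the OpenNLP version the system actually ships with.

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import java.io.FileInputStream;
import java.io.InputStream;

public class SentenceSplitter {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained sentence detection model (file name is an assumption).
        InputStream modelIn = new FileInputStream("en-sent.bin");
        SentenceModel model = new SentenceModel(modelIn);
        modelIn.close();

        SentenceDetectorME detector = new SentenceDetectorME(model);

        // Split extracted article text into sentences before parsing.
        String text = "I like watching TV a lot!!! Some of my favourite shows "
                + "are the Simpsons and the Big Bang Theory.";
        for (String sentence : detector.sentDetect(text)) {
            System.out.println(sentence);
        }
    }
}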

A simplified outline of the entire system is depicted in Figure 4.3, with the frontend in the left part of the figure, the backend to the right and the database in the middle.

The system has a small administration GUI which allows a user to add and remove RSS/Atom URLs, and add filters to any existing feeds.

7 Project homepage: http://opennlp.sourceforge.net/


Figure 4.3: Outline of the complete information-retrieval system (the frontend components SiteManagement, FeedReader, Extractor, DomParser, Parser and DatabaseInteraction on the left, the database in the middle, and the backend components Mapping, Indexing and Query processing, serving the user, on the right).


Chapter 5

Results

Testing the effectiveness of the implemented extraction method is a complicated matter. It is difficult to determine what to include in an article, since this is a matter of subjective opinion, as is noted by Ziegler et al. in [24]. Humans do not always agree on which text excerpts should be included in an article, so it is difficult to find a ground truth for evaluating automated extraction.

Furthermore, there is the question of how to perform comparisons. The results can for instance be compared document-, character- or sentence-wise. Evaluating results through document-to-document comparison is not a feasible approach, since it is too coarse-grained to be useful. Character comparisons will most likely give better results than sentence-wise comparisons, if one looks only at the percentage of correctly extracted text. However, not only the amount of text but also the text quality is of major importance. If sentences are erroneously detected, then the input to the parser, and thus also its output, will not be meaningful. For this reason, we decided to perform comparisons at a sentence-to-sentence level. The disadvantage of using sentence comparisons is that the outcome will depend upon the performance of the library used to detect sentences in free text.

The extraction system is thus evaluated as follows. First, articles from a number of pages are extracted by hand. The system employs a library called OpenNLP to perform sentence detection on all extracted text before invoking the parser sentence by sentence. During evaluation, all hand-extracted articles are divided into sentences using OpenNLP. The HTML pages from which the articles were extracted are fetched by the system. Extraction is then performed on the fetched pages, and all the extracted text is divided into sentences, again using the OpenNLP library. Finally, machine-extracted sentences are compared to the hand-extracted sentences. Testing is performed using a simple JUnit1

class.
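A sketch of what such a comparison class could compute is shown below; the class and method names are invented for the illustration, and it is a simplification of what the JUnit test does. It reports sentences common to both versions, sentences missing from the automatic extraction, and sentences erroneously included, which are the three quantities reported in the examples that follow.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ExtractionComparison {

    // Sentences found in both the manual and the automatic extraction.
    public static Set<String> common(List<String> manual, List<String> automatic) {
        Set<String> result = new LinkedHashSet<String>(manual);
        result.retainAll(new LinkedHashSet<String>(automatic));
        return result;
    }

    // Sentences present in the manual extraction but missed by the system.
    public static Set<String> missing(List<String> manual, List<String> automatic) {
        Set<String> result = new LinkedHashSet<String>(manual);
        result.removeAll(new LinkedHashSet<String>(automatic));
        return result;
    }

    // Sentences the system extracted although they are not in the manual version.
    public static Set<String> erroneous(List<String> manual, List<String> automatic) {
        Set<String> result = new LinkedHashSet<String>(automatic);
        result.removeAll(new LinkedHashSet<String>(manual));
        return result;
    }
}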

It is difficult to choose a good metric for measuring the system's performance. Simply counting the number of extracted sentences that are identical to those of

1 A framework for unit testing in Java. Available at http://www.junit.org/


the reference document is not reasonable. An exact sentence may be evaluated as missing from the extracted article compared to its hand-extracted counterpart, while a very similar sentence is detected instead.

Example 1. The Wall Street Journal

An example of an attempted extraction from the article2 in Figure 5.1 and Figure 5.2 detects 32 common (correctly extracted) sentences. Three sentences are labelled as missing from the extracted article and three sentences are mistakenly included, as seen in Table 5.1. On closer inspection, the errors do not seem to be too severe.

Sentences erroneously found by automatic extraction, which should not be included:
- By .
- WILLIAM TUCKER There isn’t much doubt that Congress and incoming President Barack Obama will try to impose some kind of limits on carbon emissions.
- Mr. Tucker is author of Terrestrial Energy: How Nuclear Power Will Lead the Green Revolution and End America’s Long Energy Odyssey, published in October by Bartleby Press. Please add your comments to the Opinion Journal forum.

Sentences present in the manually extracted text but missing from the automatic extraction:
- Wind and biofuel could become the next subprime mortgage fiasco.
- By WILLIAM TUCKER There isn’t much doubt that Congress and incoming President Barack Obama will try to impose some kind of limits on carbon emissions.
- Mr. Tucker is author of Terrestrial Energy: How Nuclear Power Will Lead the Green Revolution and End America’s Long Energy Odyssey, published in October by Bartleby Press.

Table 5.1: Comparison of automatically and manually extracted text from a Wall Street Journal article

The extraction does miss the subtitle “Wind and biofuel could become the next subprime mortgage fiasco.”, and it wrongly includes the invitation to add comments about the article. But the other two sentences are correctly detected as belonging to the text, although they are not correctly split into sentences.

Example 2. USA TODAY

The previous example demonstrated a fairly well executed extraction. The Wall Street Journal is in general rather kind to this extraction approach, but there are occasions where the result is not quite as successful.

Let us look at the result of extracting an article published in USA TODAY3. The extraction correctly finds 54 sentences. There are a number of sentences in the extracted article which are not present in the manually extracted article; see Table 5.2.

2 Carbon Limits, Yes; Energy Subsidies, No, http://online.wsj.com/article/SB123051123182738427.html
3 What’s the attraction? Look to society, biology, not ’logic’, http://www.usatoday.com/tech/science/2009-02-10-attraction_N.htm


Figure 5.1: A sample article from the Wall Street Journal

The last four sentences in the list of erroneously included sentences can easily be removed with a filter for this site (the same sentences occur on every page where comments are allowed); see Figure 5.4. The sentences “BRAIN SCANS: Honeymoon period doesn’t always end” and “BETTER LIFE: More on sexual health” cannot be excluded using filters. These are links in the middle of the page, see Figure 5.3, at the same level of the DOM tree as the article, inviting the reader to follow up on the subject by reading articles with similar content. The links are unique to this page and no site filter can be applied to solve this problem. Another common cause of error can also be seen here: often the initial sentence is prefixed with information such as the byline, news provider or newspaper name.


Figure 5.2: The continuation of the article in Figure 5.1

There are some situations where extraction fails. One such example is blog posts, where presumably only the largest post on the page will be returned. Extracting an article from a page that allows readers to comment may produce any comment which is longer than the article, or all comments (and perhaps even the article) concatenated into a massive blob of text, containing perhaps 400-500, or even more, sentences! Further, very short articles may be completely overlooked in favour of larger blocks of text on the page.

A small timing test was performed on 50 pages, measuring only the time for extraction, sentence detection and parsing; no page fetching, database management, etc., is included. The proportion of time spent parsing the extracted text is on average 0.96, thus by far dominating the total computation time.


Figure 5.3: A sample article from USA TODAY


Figure 5.4: The continuation of the article in Figure 5.3


Sentences erroneously found by automatic extraction, which should not be included:
- BRAIN SCANS: Honeymoon period doesn’t always end But this question of initial attraction is more than just about the physical.
- BETTER LIFE: More on sexual health Overall, experts say, men are visual beings and focus on the physical; woman view attractiveness as more than just looks.
- By Sharon Jayson, USA TODAY Before love or even the inkling of a relationship, there is attraction, that unexplained magnetism that draws people together, possibly leading to love.
- Don’t attack other readers personally, and keep your language decent. Guidelines: You share in the USA TODAY community, so please keep your comments smart and civil. Read more. Use the Report Abuse button to make a difference.

Sentences present in the manually extracted text but missing from the automatic extraction:
- But this question of initial attraction is more than just about the physical.
- Overall, experts say, men are visual beings and focus on the physical; woman view attractiveness as more than just looks.
- Before love or even the inkling of a relationship, there is attraction, that unexplained magnetism that draws people together, possibly leading to love.

Table 5.2: Comparison of automatically and manually extracted text from a USA TODAY article


Chapter 6

Conclusion

Despite its naïve appearance, the extraction algorithm performs quite well on ordinary news articles after the modifications described in Section 4.2. The surprisingly high degree of effectiveness on ordinary article pages is probably owing to the modifications made to the original algorithm, which required a lot of time-consuming tweaking and experimenting.

The main advantage of the extraction algorithm is that it is generic. No special wrappers need to be generated; the only site-specific information that must be stored about each site is filters and multi-page tags, if such are applied. Taken together, these features make it easy to adapt the system to monitor new sites.

However, the results are not always as exact with respect to precision as one might want, and the method is restricted to common newspaper articles without too much formatting. Sites allowing readers to comment are hard to extract from, since the result cannot be trusted. Blog pages are even harder. Failing to extract from pages with comments can, on one hand, be quite bad, since more and more sites appear to encourage their readers to provide feedback. On the other hand, this does not have to be a problem if the page is fetched at the time of publication, when the risk of lengthy comments is small.

The extraction method can be applied to many sites without modification, but the results will probably never be perfect. There may be other tools and methods which provide better precision, but these would certainly require more effort with respect to wrapper maintenance. Another shortcoming of this method is that any additional information which may be associated with an article but is not present in the article itself, such as captions or boxes of facts, is not included in the extraction. On the other hand, it is unlikely that any of the other methods would succeed any better.

So would any of the alternative methods have yielded better results? Of course, without actually implementing the methods one can only speculate. The author believes that the method using linguistic and structural features would fail in a similar manner when applied to blog posts and commented articles, as would visually and perceptually based methods, and also the generic approach


described by Dong et al. in [7]. These, and the modified approach used by this system, all essentially try to find parts of the page which have certain characteristics that can be found in news articles. Unfortunately, these characteristics are also inherent in comments. The problem with blog-like pages is that the text often is divided into separate parts, and finding all these parts is not considered in any of these algorithms. The tree-edit distance method is without doubt the method with the most solid theoretical base and the de facto standard of reference in this field, but it is also the oldest, and all research which mentions this work claims to surpass it. Nonetheless, it is possible that this method may solve the problem with commented articles better than the other ones. The strategy of the tag-sequence and tree based hybrid method would presumably not perform any better either. Although different pages from the same site may indicate a certain overall structure, one cannot be confident that there are no local deviations, and that all content at the expected location in the structure really belongs to the article. There are too many cases of dynamic parts occurring in, or close to, news articles which are not part of the actual article but only related to it, for instance references to similar articles, short explanations of concepts presented in the article, or even current stock exchange information about companies mentioned in the text.

The current trend is that pages are becoming more visually cluttered, and techniques are emerging which add dynamic content, e.g. AJAX, which make the underlying structure even harder to process. These are factors which may render visual and perceptual methods more interesting in the future, although these will most likely be quite domain specific.

As noted by Nørvåg in [15], the design and construction of news pages varies between countries; e.g. Norwegian newspaper editors use a URL-text-URL pattern more commonly than international papers. In some countries, such as Taiwan, pages are to a great extent constructed from tables (this was one of the main assumptions of the original extraction algorithm). For some reason the vast majority of the papers studied as part of this thesis are published by researchers working at Asian universities or companies. The reasons why research in this field appears to be more intense in this part of the world are unknown. Perhaps these countries have a different tradition when designing newspaper sites, making extraction harder. This, however, is merely a theory and has not been tested or examined.

6.1 Limitations

The largest limitations of this system are the lack of precision in extraction and the speed of parsing. In fact, the parser is extremely slow, but the author has no previous experience in the field, and cannot say whether this is due to the particular parser chosen for the implementation, or to the parsing paradigm per se.

The use of RSS is both a convenience and a limitation. By using such a library, new data can be easily detected, and it is rather safe to assume that


pages contain articles and nothing else, which is not the case when web crawlers are used. If, however, old data is of relevance as well, then this must be retrieved manually or left out. Furthermore, there are no guarantees that a given site will publish any feeds at all; if it does not, then a special-purpose web crawler is required to retrieve the data.

6.2 Future work

Project development took far longer than expected. Finding appropriate libraries to solve subproblems and improving the extraction algorithm were the largest issues. But just as much time was spent on small details like encoding problems, trying to deal with shortcomings in the libraries and project code, learning a new development environment, bug fixing, etc. Designing and performing tests turned out to be time consuming as well. Despite this, there are only some minor bug fixes left.

If this project is to be deployed live in any form, the possibility of replacing the parser must be investigated. If more precision is required, there is no use in spending more time on the extraction method currently used; instead, it should probably be replaced. At the time of writing there is no obvious method to take its place.


Chapter 7

Acknowledgements

I would like to give a huge thank-you to both my supervisors, Johanna Högberg and Rickard Lönneborg, for all their support and patience, and also to all the people at CodeMill AB (hi5! :), especially to my partner in crime, Thomas Knutsson, who made this work far more fun!

On a more personal note, I would also like to thank my friend N P for convincing me to do this thesis in the first place and for helping me in more than one way during its completion. Last of all, I am very grateful to my many other friends throughout these (slightly too many) years at Umeå University.


References

[1] RSS. Webpage/blog. http://en.wikipedia.org/wiki/RSS_(file_format), viewed on 2008-10-09.

[2] RSS Advisory Board. Webpage. http://www.rssboard.org/, viewed on 2008-10-08.

[3] Chang, C.-H., Kayed, M., Girgis, M. R., and Shaalan, K. F. A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18, 10 (2006), 1411–1428.

[4] Chen, J., and Xiao, K. Perception-oriented online news extraction. In JCDL '08: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (2008), pp. 363–366.

[5] Chen, J., and Zhang, H. Function-based object model towards website adaptation. In Proceedings of the 10th International World Wide Web Conference (2001), ACM Press, pp. 587–596.

[6] Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.

[7] Dong, Y., Li, Q., Yan, Z., and Ding, Y. A generic web news extraction approach. In International Conference on Information and Automation (2008), pp. 179–183.

[8] Hillyer, M. Managing hierarchical data in MySQL. Webpage. http://dev.mysql.com/tech-resources/articles/hierarchical-data.html, viewed on 2008-11-10.

[9] Högberg, J. Adding syntactical information to the vector space model. In Proc. Swedish Language Technology Conference 2008 (2008), Kungliga Tekniska Högskolan.

[10] Hopcroft, J. E., Motwani, R., and Ullman, J. D. Introduction to Automata Theory, Languages, and Computation (2nd Edition). Addison Wesley.


[11] Kaiser, K., and Miksch, S. Information extraction. A survey. Tech. Rep. Asgaard-TR-2005-6, Vienna University of Technology, Institute of Software Technology and Interactive Systems, 2005.

[12] Knutsson, T. Master thesis: Traveling the outer dimensions of vector space. Tech. rep., Department of Computing Science, Umeå University, 2009.

[13] Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., and Teixeira, J. S. A brief survey of web data extraction tools. SIGMOD Rec. 31, 2 (2002), 84–93.

[14] Li, Y., Meng, X., Li, Q., and Wang, L. Hybrid method for automated news content extraction from the web. In 7th International Conference on Web Information Systems Engineering, Wuhan, China, 2006 (2006), K. Aberer, Z. Peng, E. A. Rundensteiner, Y. Zhang, and X. Li, Eds., vol. 4255, pp. 327–338.

[15] Nørvåg, K., and Øyri, R. News item extraction for text mining in web newspapers. In International Workshop on Challenges in Web Information Retrieval and Integration, Proceedings (2005), pp. 195–204.

[16] Pilgrim, M. The myth of RSS compatibility. Webpage/blog, 02 Feb. 2004. http://diveintomark.org/archives/2004/02/04/incompatible-rss, viewed on 2008-10-08.

[17] Reis, D. C., Golgher, P. B., Silva, A. S., and Laender, A. F. Automatic web news extraction using tree edit distance. In Proceedings of the 13th International Conference on World Wide Web (2004), ACM, pp. 502–511.

[18] Sahlgren, M. An introduction to random indexing. In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (2005).

[19] Sahlgren, M. The Word-Space Model (chapter 2.3). PhD thesis, Stockholm University, 2006.

[20] Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing. Communications of the ACM 18, 11 (1975), 613–620.

[21] Shea, D. What is RSS/XML/Atom/syndication? Webpage/blog, 19 May 2004. http://diveintomark.org/archives/2004/02/04/incompatible-rss, viewed on 2008-10-09.

[22] Zhang, D., and Simoff, S. Informing the curious negotiator: Automatic news extraction from the Internet. In Data Mining: Theory, Methodology, Techniques, and Applications, vol. 3755 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, 2006, pp. 176–191.


[23] Zheng, S., Song, R., and Wen, J.-R. Template-independent news extraction based on visual consistency. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence: AAAI (2007).

[24] Ziegler, C.-N., and Skubacz, M. Content extraction from news pages using particle swarm optimization on linguistic and structural features. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence - WI 2007 (2007), pp. 242–249.