Web-page Classification through Summarization
D. Shen, *Z. Chen, **Q. Yang, *H.J. Zeng, *B.Y. Zhang, Y.H. Lu and *W.Y. Ma
Tsinghua University, *Microsoft Research Asia,
**Hong Kong University of Science and Technology
(SIGIR 2004)
Abstract
Web-page classification is much more difficult than pure-text classification due to the large variety of noisy information embedded in Web pages.
In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy.
Abstract
Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement over a pure-text-based classification algorithm.
We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about a 12% improvement over pure-text-based methods.
Introduction
With the rapid growth of the WWW, there is an increasing need to provide automated assistance to Web users for Web-page classification and categorization, as in the Yahoo! and LookSmart directories.
However, this need is difficult to meet without automated Web-page classification techniques, due to the labor-intensive nature of human editing.
Introduction
Web pages have their own underlying embedded structure in the HTML language and typically contain noisy content such as advertisement banners and navigation bars.
If pure-text classification is applied to these pages, much bias is incurred, making the classification algorithm likely to lose focus on the main topic and important content.
Thus, a critical issue is to design an intelligent preprocessing technique to extract the main topic of a Web page.
Introduction
In this paper, we show that using Web-page summarization techniques for preprocessing in Web-page classification is a viable and effective approach.
We further show that instead of using off-the-shelf summarization techniques designed for pure text, it is possible to design specialized summarization methods catering to Web-page structures.
Related Work
Recent work on Web-page summarization:
Ocelot (Berger et al., 2000) is a system for summarizing Web pages, using probabilistic models to generate the "gist" of a Web page.
Buyukkokten et al. (2001) introduce five methods for summarizing parts of Web pages on handheld devices.
Delort (2003) exploits the effect of context in Web-page summarization, where the context consists of information extracted from the content of all the documents linking to a page.
Related Work
Our work is also related to work on removing noise from Web pages:
Yi et al. (2003) propose an algorithm that introduces a tree structure, called the Style Tree, to capture the common presentation styles and the actual contents of pages.
Chen et al. (2000) extract the title and meta-data included in HTML tags to represent the semantic meaning of Web pages.
Web-Page Summarization
Adapted Luhn's Summarization Method
Every sentence is assigned a significance factor, and the sentences with the highest significance factors are selected to form the summary.
The significance factor of a sentence is computed as follows:
1. Build a "significant words pool" using word frequency.
2. Set a limit L for the distance at which any two significant words can be considered significantly related.
3. Find the portion of the sentence bracketed by significant words that are no more than L non-significant words apart.
4. Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion.
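The scoring steps above can be sketched as follows. This is a minimal illustration, not the paper's exact code: the significant-words pool is an assumed input, and the default gap limit of 4 (Luhn's commonly cited choice for L) is an assumption since the slide leaves L open.

```python
def luhn_significance(sentence_words, significant_words, max_gap=4):
    """Significance factor of one sentence, following steps 2-4 above.

    `max_gap` plays the role of the limit L (the default of 4 is an
    assumed value).  Returns (significant words in the best portion)^2
    divided by the total number of words in that portion.
    """
    # Step: locate significant words within the sentence.
    positions = [i for i, w in enumerate(sentence_words)
                 if w in significant_words]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for p in positions[1:]:
        if p - prev - 1 <= max_gap:
            # Still inside the bracketed portion.
            count += 1
        else:
            # Close the current portion and score it.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = p, 1
        prev = p
    best = max(best, count ** 2 / (prev - start + 1))
    return best
```

For example, in a five-word sentence with significant words at positions 1 and 3, the portion spans three words and contains two significant words, giving a factor of 2^2/3.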
Web-Page Summarization
To customize Luhn's algorithm for Web pages, we make one modification.
The category information of each page is already known in the training data, so significant-word selection can be performed per category: we build a significant-words pool for each category by selecting the words with high frequency after removing stop words.
For a test Web page, we do not have the category information, so we calculate the significance factor of each sentence against the significant-words pool of every category separately.
The significance factor of the target sentence is then averaged over all categories and referred to as S_luhn.
Web-Page Summarization
Latent Semantic Analysis (LSA)
LSA is applicable to summarization for two reasons:
LSA is capable of capturing and modeling the interrelationships among terms by semantically clustering terms and sentences.
LSA can capture the salient and recurring word-combination patterns in a document that describe a certain topic or concept.
LSA is based on the singular value decomposition A = U * Sigma * V^T, where A is the m*n term-by-sentence matrix, U is m*n, Sigma is the n*n diagonal matrix of singular values, and V is n*n.
Web-Page Summarization
Latent Semantic Analysis (LSA)
Each concept can be represented by one of the singular vectors, where the magnitude of the corresponding singular value indicates the importance of this pattern within the document.
Any sentence containing this word-combination pattern will be projected along this singular vector.
The sentence that best represents the pattern has the largest index value along this vector.
We denote this index value as S_lsa and select the sentences with the highest S_lsa to form the summary.
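The selection above can be sketched with NumPy's SVD. This is an illustration in the style of Gong and Liu's LSA summarizer; the toy term-by-sentence matrix and the pick-one-sentence-per-concept policy are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def lsa_summary(term_sentence_matrix, n_sentences=2):
    """Select sentence indices via LSA (sketch).

    A is the m*n term-by-sentence matrix; its SVD A = U @ diag(s) @ Vt
    yields right singular vectors whose entries (the index values S_lsa)
    show how strongly each sentence expresses the latent concept.
    """
    A = np.asarray(term_sentence_matrix, dtype=float)
    # Singular values come back sorted in decreasing order, so row k of
    # Vt corresponds to the k-th most important concept.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    # For each top concept, pick the sentence with the largest index
    # value along that singular vector (skipping already-chosen ones).
    for k in range(min(n_sentences, Vt.shape[0])):
        for idx in np.argsort(-np.abs(Vt[k])):
            if idx not in chosen:
                chosen.append(int(idx))
                break
    return sorted(chosen)
```

With a matrix whose first sentence column dominates the first concept and whose second column dominates the second, the function returns those two sentence indices.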
Web-Page Summarization
Content Body Identification by Page Layout Analysis
The structured nature of Web pages makes Web-page summarization different from pure-text summarization.
To utilize the structural information of Web pages, we employ a simplified version of the Function-Based Object Model (Chen et al., 2001), which partitions a page into Basic Objects (BOs) and Composite Objects (COs).
After detecting all BOs and COs in a Web page, we identify the category of each object according to heuristic rules: Information Object, Navigation Object, Interaction Object, Decoration Object, or Special Function Object.
Web-Page Summarization
Content Body Identification by Page Layout Analysis
To make use of these objects, we define the Content Body (CB) of a Web page, which consists of the main objects related to the topic of that page.
The algorithm for detecting the CB is as follows:
1. Consider each selected object as a single document and build a TF*IDF index for the object.
2. Calculate the similarity between any two objects using the cosine measure, and add a link between them if their similarity is greater than a threshold (0.1).
3. In the resulting graph, a core object is defined as the object having the most edges.
4. Extract the CB.
If a sentence is included in the CB, S_cb = 1; otherwise, S_cb = 0. All sentences with S_cb = 1 form the summary.
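Steps 1-3 above can be sketched as follows. This is a toy version under stated assumptions: each object is just a list of words, a plain TF * log(N/DF) weighting is used, and ties for the core object break toward the first object.

```python
import math
from collections import Counter

def find_core_object(objects, threshold=0.1):
    """Return the index of the core object (sketch of CB detection).

    `objects` is a list of word lists, one per page object.  Each object
    is treated as a document; TF*IDF vectors are compared with cosine
    similarity, objects above `threshold` (0.1 in the paper) are linked,
    and the object with the most links is the core object.
    """
    n = len(objects)
    tfs = [Counter(obj) for obj in objects]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    # TF*IDF vector per object (a simple TF * log(N/DF) weighting).
    vecs = [{w: c * math.log(n / df[w]) for w, c in tf.items()}
            for tf in tfs]

    def cosine(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Count graph edges: one per object pair above the threshold.
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    return max(range(n), key=lambda i: degree[i])
```

An object sharing vocabulary with several others accumulates the most edges and is returned as the core.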
Web-Page Summarization
Supervised Summarization
Each sentence S_i is represented by eight features:
f_i1 measures the position of sentence S_i within its paragraph.
f_i2 measures the length of S_i, i.e., the number of words in S_i.
f_i3 = Σ TF_w * SF_w, summed over the words w in S_i.
f_i4 is the similarity between S_i and the title.
f_i5 is the cosine similarity between S_i and all the text in the page.
f_i6 is the cosine similarity between S_i and the meta-data in the page.
f_i7 is the number of occurrences in S_i of words from a special word set.
f_i8 is the average font size of the words in S_i.
Web-Page Summarization
Supervised Summarization
We apply a Naïve Bayes classifier to train the summarizer; its output score for each sentence is denoted S_sup.
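A minimal sketch of such a naive-Bayes summarizer over discretized features follows. The binning of the eight features into discrete values and the add-one smoothing are assumptions; the slides do not spell out these details.

```python
import math
from collections import Counter, defaultdict

def train_nb_summarizer(rows, labels, smoothing=1.0):
    """Train a discrete naive-Bayes sentence scorer (sketch).

    `rows` holds binned feature tuples (the f_i1..f_i8 of a sentence,
    discretized); `labels` is 1 for positive (summary) sentences, 0
    otherwise.  Returns a function mapping a feature tuple to
    P(positive | features), usable as S_sup.
    """
    n_features = len(rows[0])
    class_count = Counter(labels)
    feat_count = defaultdict(Counter)          # (feature, label) -> value counts
    feat_values = [set() for _ in range(n_features)]
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feat_count[(j, y)][v] += 1
            feat_values[j].add(v)

    def log_score(row, y):
        # log P(y) + sum_j log P(f_j = v | y), with add-one smoothing.
        s = math.log(class_count[y] / len(labels))
        for j, v in enumerate(row):
            num = feat_count[(j, y)][v] + smoothing
            den = class_count[y] + smoothing * len(feat_values[j])
            s += math.log(num / den)
        return s

    def prob_positive(row):
        lp, ln = log_score(row, 1), log_score(row, 0)
        return 1.0 / (1.0 + math.exp(ln - lp))

    return prob_positive
```

Sentences whose feature pattern resembles the positive training examples score above 0.5; the rest score below.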
Web-Page Summarization
An Ensemble of Summarizers
By combining the four methods presented in the previous sections, we obtain a hybrid Web-page summarizer.
Each sentence receives the combined score S = S_luhn + S_lsa + S_cb + S_sup, and the sentences with the highest S are chosen for the summary.
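The combined score amounts to a simple weighted sum; a one-function sketch, with equal weights by default (the per-method weights w_i correspond to the weighting schemata examined later in the experiments):

```python
def ensemble_score(s_luhn, s_lsa, s_cb, s_sup, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hybrid sentence score S = w1*S_luhn + w2*S_lsa + w3*S_cb + w4*S_sup.

    With the default equal weights this reduces to the plain sum on the
    slide; the weighting-schema experiments vary each w_i.
    """
    w1, w2, w3, w4 = weights
    return w1 * s_luhn + w2 * s_lsa + w3 * s_cb + w4 * s_sup
```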
Experiments
Data Set
We use about 2 million Web pages crawled from the LookSmart Web directory.
Due to network-bandwidth limitations, we downloaded only about 500 thousand of the page descriptions that were manually created by human editors.
We randomly sampled 30% of the pages with descriptions for our experiments.
The extracted subset includes 153,019 pages, distributed among 64 categories.
Experiments
Data Set
To reduce the uncertainty of the data split, a 10-fold cross-validation procedure is applied.
Experiments
Classifiers
Naïve Bayes classifier (NB)
Support Vector Machine (SVM)
Experiments
Experiment Measures
We employ the standard measures to evaluate the performance of Web classification: precision, recall and the F1-measure.
To evaluate average performance across multiple categories, there are two conventional methods: micro-average and macro-average.
Micro-average gives equal weight to every document, while macro-average gives equal weight to every category, regardless of its frequency.
Only micro-average is used to evaluate the performance of the classifiers.
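A small illustration of micro-averaged F1: per-category counts are pooled before computing precision and recall, which is why every document weighs equally and frequent categories dominate. The (TP, FP, FN)-tuple input format is an assumption for the sketch.

```python
def micro_f1(per_category_counts):
    """Micro-averaged F1 over categories (sketch).

    `per_category_counts` is a list of (true_positives, false_positives,
    false_negatives) tuples, one per category.  Counts are summed across
    categories first, then a single precision/recall/F1 is computed.
    """
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    p = tp / (tp + fp) if tp + fp else 0.0   # pooled precision
    r = tp / (tp + fn) if tp + fn else 0.0   # pooled recall
    return 2 * p * r / (p + r) if p + r else 0.0
```

Macro-averaging would instead compute F1 per category and average the results, giving rare categories equal influence.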
Experiments
Experimental Results and Analysis
Baseline (full text): each Web page is represented as a bag of words, in which the weight of each word is its term frequency.
Human summary (description, title, meta-data): we extract the description of each Web page from the LookSmart Web site and consider it the "ideal" summary.
Experiments
Experimental Results and Analysis
Unsupervised summarization: Content Body, the adapted Luhn's algorithm (compression rate: 20%) and LSA (compression rate: 30%).
Supervised summarization: we define a sentence as positive if its similarity with the description is greater than a threshold (0.3), and all other sentences as negative.
Hybrid summarization: uses all the unsupervised and supervised summarization algorithms.
Experiments
Experimental Results and Analysis
Performance at Different Compression Rates
Experiments Experimental Results and Analysis
Effect of different weighting schemata Schema1: we assigned the weight of each summarization met
hod in proportion to the performance of each method (micro-F1)
Schema2: We increased the value of Wi (i=1,2,3,4) to 2 in schema2-5 respectively and kept others as one
Experiments
Case Studies
We randomly selected 100 Web pages that are correctly labeled by all our summarization-based approaches but wrongly labeled by the baseline (denoted set A), and 500 pages randomly from the testing pages (denoted set B).
The average size of the pages in A, 31.2 KB, is much larger than that of the pages in B, 10.9 KB.
The summarization techniques can help extract the useful information from such large pages.
Conclusions and Future Work
Several Web-page summarization algorithms are proposed for extracting the most relevant features from Web pages to improve the accuracy of Web classification.
Experimental results show that automatic summarization can achieve an improvement (about 12.9%) similar to the ideal-case accuracy achieved by using the summaries created by human editors.
We will investigate methods for multi-document summarization of hyperlinked Web pages to further boost the accuracy of Web classification.