Web-page Classification through Summarization
D. Shen, *Z. Chen, **Q. Yang, *H.J. Zeng, *B.Y. Zhang, Y.H. Lu and *W.Y. Ma
Tsinghua University, *Microsoft Research Asia,
**Hong Kong University of Science and Technology
(SIGIR 2004)
Abstract
Web-page classification is much more difficult than pure-text classification due to the large variety of noisy information embedded in Web pages.
In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy.
Abstract
Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement over a pure-text-based classification algorithm.
We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about a 12% improvement over pure-text-based methods.
Introduction
With the rapid growth of the WWW, there is an increasing need to provide automated assistance to Web users for Web-page classification and categorization, as in the Yahoo! and LookSmart directories.
However, this need is difficult to meet without automated Web-page classification techniques, due to the labor-intensive nature of human editing.
Introduction
Web pages have their own underlying embedded structure in the HTML language and typically contain noisy content such as advertisement banners and navigation bars.
If pure-text classification is applied to these pages, much bias is incurred, making the classification algorithm likely to lose focus on the main topic and important content.
Thus, a critical issue is to design an intelligent preprocessing technique to extract the main topic of a Web page.
Introduction
In this paper, we show that using Web-page summarization techniques for preprocessing in Web-page classification is a viable and effective approach.
We further show that instead of using off-the-shelf summarization techniques designed for pure text, it is possible to design specialized summarization methods catering to Web-page structures.
Related Work
Recent work on Web-page summarization:
Ocelot (Berger et al., 2000) is a system for summarizing Web pages, using probabilistic models to generate the "gist" of a Web page.
Buyukkokten et al. (2001) introduce five methods for summarizing parts of Web pages on handheld devices.
Delort (2003) exploits the effect of context in Web-page summarization, where the context consists of information extracted from the content of all the documents linking to a page.
Related Work
Our work is also related to work on removing noise from Web pages:
Yi et al. (2003) propose an algorithm that introduces a tree structure, called the Style Tree, to capture the common presentation styles and the actual contents of pages.
Chen et al. (2000) extract the title and meta-data included in HTML tags to represent the semantic meaning of Web pages.
Web-Page Summarization
Adapted Luhn's Summarization Method
Every sentence is assigned a significance factor, and the sentences with the highest significance factors are selected to form the summary.
The significance factor of a sentence is computed as follows:
1. Build a "significant words pool" using word frequency.
2. Set a limit L for the distance at which any two significant words can be considered significantly related.
3. Find the portion of the sentence bracketed by significant words that are no more than L non-significant words apart.
4. Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion.
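The scoring steps above can be sketched as follows. This is a minimal illustration, not the paper's exact code: the significant-words pool is an assumed input, and the default gap limit of 4 (Luhn's commonly cited choice for L) is an assumption since the slide leaves L open.

```python
def luhn_significance(sentence_words, significant_words, max_gap=4):
    """Significance factor of one sentence, following steps 2-4 above.

    `max_gap` plays the role of the limit L (the default of 4 is an
    assumed value).  Returns (significant words in the best portion)^2
    divided by the total number of words in that portion.
    """
    # Step: locate significant words within the sentence.
    positions = [i for i, w in enumerate(sentence_words)
                 if w in significant_words]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for p in positions[1:]:
        if p - prev - 1 <= max_gap:
            # Still inside the bracketed portion.
            count += 1
        else:
            # Close the current portion and score it.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = p, 1
        prev = p
    best = max(best, count ** 2 / (prev - start + 1))
    return best
```

For example, in a five-word sentence with significant words at positions 1 and 3, the portion spans three words and contains two significant words, giving a factor of 2^2/3.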
Web-Page Summarization
To customize Luhn's algorithm for Web pages, we make one modification.
The category information of each page is already known in the training data, so significant-word selection can be performed per category: we build a significant-words pool for each category by selecting the words with high frequency after removing stop words.
For a test Web page, we do not have the category information, so we calculate the significance factor of each sentence against the significant-words pool of every category separately.
The significance factor of the target sentence is then averaged over all categories and referred to as S_luhn.
Web-Page Summarization
Latent Semantic Analysis (LSA)
LSA is applicable to summarization for two reasons:
LSA is capable of capturing and modeling the interrelationships among terms by semantically clustering terms and sentences.
LSA can capture the salient and recurring word-combination patterns in a document that describe a certain topic or concept.
LSA is based on the singular value decomposition A = U * Sigma * V^T, where A is the m*n term-by-sentence matrix, U is m*n, Sigma is the n*n diagonal matrix of singular values, and V is n*n.
Web-Page Summarization
Latent Semantic Analysis (LSA)
Each concept can be represented by one of the singular vectors, where the magnitude of the corresponding singular value indicates the importance of this pattern within the document.
Any sentence containing this word-combination pattern will be projected along this singular vector.
The sentence that best represents the pattern has the largest index value along this vector.
We denote this index value as S_lsa and select the sentences with the highest S_lsa to form the summary.
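The selection above can be sketched with NumPy's SVD. This is an illustration in the style of Gong and Liu's LSA summarizer; the toy term-by-sentence matrix and the pick-one-sentence-per-concept policy are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def lsa_summary(term_sentence_matrix, n_sentences=2):
    """Select sentence indices via LSA (sketch).

    A is the m*n term-by-sentence matrix; its SVD A = U @ diag(s) @ Vt
    yields right singular vectors whose entries (the index values S_lsa)
    show how strongly each sentence expresses the latent concept.
    """
    A = np.asarray(term_sentence_matrix, dtype=float)
    # Singular values come back sorted in decreasing order, so row k of
    # Vt corresponds to the k-th most important concept.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    # For each top concept, pick the sentence with the largest index
    # value along that singular vector (skipping already-chosen ones).
    for k in range(min(n_sentences, Vt.shape[0])):
        for idx in np.argsort(-np.abs(Vt[k])):
            if idx not in chosen:
                chosen.append(int(idx))
                break
    return sorted(chosen)
```

With a matrix whose first sentence column dominates the first concept and whose second column dominates the second, the function returns those two sentence indices.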
Web-Page Summarization
Content Body Identification by Page Layout Analysis
The structured nature of Web pages makes Web-page summarization different from pure-text summarization.
To utilize the structural information of Web pages, we employ a simplified version of the Function-Based Object Model (Chen et al., 2001), which partitions a page into Basic Objects (BOs) and Composite Objects (COs).
After detecting all BOs and COs in a Web page, we identify the category of each object according to heuristic rules: Information Object, Navigation Object, Interaction Object, Decoration Object, or Special Function Object.
Web-Page Summarization
Content Body Identification by Page Layout Analysis
To make use of these objects, we define the Content Body (CB) of a Web page, which consists of the main objects related to the topic of that page.
The algorithm for detecting the CB is as follows:
1. Consider each selected object as a single document and build a TF*IDF index for the object.
2. Calculate the similarity between any two objects using the cosine measure, and add a link between them if their similarity is greater than a threshold (0.1).
3. In the resulting graph, a core object is defined as the object having the most edges.
4. Extract the CB.
If a sentence is included in the CB, S_cb = 1; otherwise, S_cb = 0. All sentences with S_cb = 1 form the summary.
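Steps 1-3 above can be sketched as follows. This is a toy version under stated assumptions: each object is just a list of words, a plain TF * log(N/DF) weighting is used, and ties for the core object break toward the first object.

```python
import math
from collections import Counter

def find_core_object(objects, threshold=0.1):
    """Return the index of the core object (sketch of CB detection).

    `objects` is a list of word lists, one per page object.  Each object
    is treated as a document; TF*IDF vectors are compared with cosine
    similarity, objects above `threshold` (0.1 in the paper) are linked,
    and the object with the most links is the core object.
    """
    n = len(objects)
    tfs = [Counter(obj) for obj in objects]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    # TF*IDF vector per object (a simple TF * log(N/DF) weighting).
    vecs = [{w: c * math.log(n / df[w]) for w, c in tf.items()}
            for tf in tfs]

    def cosine(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Count graph edges: one per object pair above the threshold.
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    return max(range(n), key=lambda i: degree[i])
```

An object sharing vocabulary with several others accumulates the most edges and is returned as the core.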
Web-Page Summarization
Supervised Summarization
Each sentence S_i is represented by eight features:
f_i1 measures the position of sentence S_i within its paragraph.
f_i2 measures the length of S_i, i.e., the number of words in S_i.
f_i3 = Σ TF_w * SF_w, summed over the words w in S_i.
f_i4 is the similarity between S_i and the title.
f_i5 is the cosine similarity between S_i and all the text in the page.
f_i6 is the cosine similarity between S_i and the meta-data in the page.
f_i7 is the number of occurrences in S_i of words from a special word set.
f_i8 is the average font size of the words in S_i.
Web-Page Summarization
Supervised Summarization
We apply a Naïve Bayes classifier to train the summarizer; its output score for each sentence is denoted S_sup.
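A minimal sketch of such a naive-Bayes summarizer over discretized features follows. The binning of the eight features into discrete values and the add-one smoothing are assumptions; the slides do not spell out these details.

```python
import math
from collections import Counter, defaultdict

def train_nb_summarizer(rows, labels, smoothing=1.0):
    """Train a discrete naive-Bayes sentence scorer (sketch).

    `rows` holds binned feature tuples (the f_i1..f_i8 of a sentence,
    discretized); `labels` is 1 for positive (summary) sentences, 0
    otherwise.  Returns a function mapping a feature tuple to
    P(positive | features), usable as S_sup.
    """
    n_features = len(rows[0])
    class_count = Counter(labels)
    feat_count = defaultdict(Counter)          # (feature, label) -> value counts
    feat_values = [set() for _ in range(n_features)]
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feat_count[(j, y)][v] += 1
            feat_values[j].add(v)

    def log_score(row, y):
        # log P(y) + sum_j log P(f_j = v | y), with add-one smoothing.
        s = math.log(class_count[y] / len(labels))
        for j, v in enumerate(row):
            num = feat_count[(j, y)][v] + smoothing
            den = class_count[y] + smoothing * len(feat_values[j])
            s += math.log(num / den)
        return s

    def prob_positive(row):
        lp, ln = log_score(row, 1), log_score(row, 0)
        return 1.0 / (1.0 + math.exp(ln - lp))

    return prob_positive
```

Sentences whose feature pattern resembles the positive training examples score above 0.5; the rest score below.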
Web-Page Summarization
An Ensemble of Summarizers
By combining the four methods presented in the previous sections, we obtain a hybrid Web-page summarizer.
Each sentence receives the combined score S = S_luhn + S_lsa + S_cb + S_sup, and the sentences with the highest S are chosen for the summary.
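The combined score amounts to a simple weighted sum; a one-function sketch, with equal weights by default (the per-method weights w_i correspond to the weighting schemata examined later in the experiments):

```python
def ensemble_score(s_luhn, s_lsa, s_cb, s_sup, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hybrid sentence score S = w1*S_luhn + w2*S_lsa + w3*S_cb + w4*S_sup.

    With the default equal weights this reduces to the plain sum on the
    slide; the weighting-schema experiments vary each w_i.
    """
    w1, w2, w3, w4 = weights
    return w1 * s_luhn + w2 * s_lsa + w3 * s_cb + w4 * s_sup
```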
Experiments
Data Set
We use about 2 million Web pages crawled from the LookSmart Web directory.
Due to network-bandwidth limitations, we downloaded only about 500 thousand of the page descriptions that were manually created by human editors.
We randomly sampled 30% of the pages with descriptions for our experiments.
The extracted subset includes 153,019 pages, distributed among 64 categories.
Experiments
Data Set
To reduce the uncertainty of the data split, a 10-fold cross-validation procedure is applied.
Experiments
Classifiers
Naïve Bayes classifier (NB)
Support Vector Machine (SVM)
Experiments
Experiment Measures
We employ the standard measures to evaluate the performance of Web classification: precision, recall and the F1-measure.
To evaluate average performance across multiple categories, there are two conventional methods: micro-average and macro-average.
Micro-average gives equal weight to every document, while macro-average gives equal weight to every category, regardless of its frequency.
Only micro-average is used to evaluate the performance of the classifiers.
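A small illustration of micro-averaged F1: per-category counts are pooled before computing precision and recall, which is why every document weighs equally and frequent categories dominate. The (TP, FP, FN)-tuple input format is an assumption for the sketch.

```python
def micro_f1(per_category_counts):
    """Micro-averaged F1 over categories (sketch).

    `per_category_counts` is a list of (true_positives, false_positives,
    false_negatives) tuples, one per category.  Counts are summed across
    categories first, then a single precision/recall/F1 is computed.
    """
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    p = tp / (tp + fp) if tp + fp else 0.0   # pooled precision
    r = tp / (tp + fn) if tp + fn else 0.0   # pooled recall
    return 2 * p * r / (p + r) if p + r else 0.0
```

Macro-averaging would instead compute F1 per category and average the results, giving rare categories equal influence.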
Experiments
Experimental Results and Analysis
Baseline (full text): each Web page is represented as a bag of words, in which the weight of each word is its term frequency.
Human summary (description, title, meta-data): we extract the description of each Web page from the LookSmart Web site and consider it the "ideal" summary.
Experiments
Experimental Results and Analysis
Unsupervised summarization: Content Body, the adapted Luhn's algorithm (compression rate: 20%) and LSA (compression rate: 30%).
Supervised summarization: we define a sentence as positive if its similarity with the description is greater than a threshold (0.3), and all other sentences as negative.
Hybrid summarization: uses all the unsupervised and supervised summarization algorithms.
Experiments
Experimental Results and Analysis
Performance at Different Compression Rates
Experiments Experimental Results and Analysis
Effect of different weighting schemata Schema1: we assigned the weight of each summarization met
hod in proportion to the performance of each method (micro-F1)
Schema2: We increased the value of Wi (i=1,2,3,4) to 2 in schema2-5 respectively and kept others as one
Experiments
Case Studies
We randomly selected 100 Web pages that are correctly labeled by all our summarization-based approaches but wrongly labeled by the baseline (denoted set A), and 500 pages randomly from the testing pages (denoted set B).
The average size of the pages in A, 31.2 KB, is much larger than that of the pages in B, 10.9 KB.
The summarization techniques can help extract the useful information from such large pages.
Conclusions and Future Work
Several Web-page summarization algorithms are proposed for extracting the most relevant features from Web pages to improve the accuracy of Web classification.
Experimental results show that automatic summarization can achieve an improvement (about 12.9%) similar to the ideal-case accuracy achieved by using the summaries created by human editors.
We will investigate methods for multi-document summarization of hyperlinked Web pages to further boost the accuracy of Web classification.