Mining Landmark Papers and Related Keyphrases
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (by Research)
in
Computer Science
by
Annu Tuli
200707002
Center for Data Engineering
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
(Deemed University)
Hyderabad, India
July 2011
Thesis Certificate
This is to certify that the thesis entitled “Mining Landmark Papers and Related Keyphrases” submitted by Annu Tuli to the International Institute of Information Technology, Hyderabad, for the award of the Degree of Master of Science (by Research) is a record of bona fide research work carried out by her under my supervision and guidance. The contents of this thesis have not been submitted to any other university or institute for the award of any degree or diploma.
Date Advisor:
Dr. Vikram Pudi
Associate Professor
IIIT Hyderabad
Copyright © Annu Tuli, 2011. All rights reserved.
The author hereby grants to IIIT-Hyderabad permission to reproduce and distribute
publicly paper and electronic copies of the thesis document in whole or in part.
To Parameshwara and my Parents
“The highest men are calm, silent, and unknown. They are the men who really know the
power of thought; they are sure that, even if they go into a cave and close the door and
simply think five true thoughts and then pass away, these five thoughts of theirs will live
through eternity. Indeed such thoughts will penetrate through the mountains, cross the
oceans, and travel through the world.”
- Swami Vivekananda
Acknowledgements
First and foremost, I would like to thank Dr. Vikram Pudi, who has had a tremendous influence on me. On top of being an expert, he has been a dear friend, a profound philosopher and a devoted guide, far beyond ordinary. He introduced me to research and closely
guided my first steps. He has this special property of always searching for the crux of any
matter. This, together with his sharpness, energy and amazing sense of humour makes him
a remarkable person and pleasure to work with. His feedback and encouragement greatly
helped me to keep my spirits up.
I am also thankful to all the faculty and staff members of CDE - Dr. Kamalakar Karlapalem and Dr. P. K. Reddy - for providing a wonderful research center. I also take this
opportunity to thank IIIT-Hyderabad for giving me an opportunity to see the world of
research so closely.
This acknowledgment would be incomplete without mention of my friends who made
my journey memorable. This includes - Amit, Hanuma Kumar, Rohit Paravastu, Aditya,
Prashant, Raghvendra, Pratibha, Saurabh, Padmini, Chetna, Bhanukiran, Velidi Padmini,
Srilakshmi and Lydia.
Above all, I would like to express my gratitude towards my grandmother, parents, sisters, brother-in-law and his family, who have had, are having, and will continue to have a tremendous influence on my development.
Synopsis
Text mining is a subfield of data mining that, in turn, is a component of a more general
category of Knowledge Management (KM). In the real world, knowledge is represented not
only by the structured data found in traditional databases, but in a wide variety of unstructured sources such as books, research papers, Word documents, letters, digital libraries,
e-mail messages, news feeds, Web pages, and so forth. Text mining is particularly relevant
today because of the enormous amount of knowledge that resides in unstructured collection
of text documents. It uncovers relationships in a text collection, and leverages the creativity
of the knowledge worker to explore these relationships and discover new knowledge.
In recent years, we have witnessed a tremendous growth in the volume of text documents
available on the internet, digital libraries, news resources and so on. With the dramatic
growth of text information, there is an increasing need for powerful text mining systems
that can automatically discover useful knowledge from text. Digital data has become one of the most important resources for information. But the fact that more data is available does not necessarily mean that it is being used in an efficient manner. It has been observed that no one is willing to, or capable of, manually browsing through large data collections. To satisfy a user's information need, a system should list a precise and small subset of the data
collection.
People often interact with these document collections and thus may be interested in meth-
ods to help them better use the documents or retrieve the useful knowledge. For retrieving
individual documents, search engines have already been very successful. Other methods
such as topic modeling can provide a coarse overview of the topics in a document collection.
While information retrieval and topic modeling methods have been widely applicable and useful, current methods could still be improved for drilling deeper into a corpus as a whole: understanding the structure and development of research areas in terms of topics, the relationships of topics with each other, and the originating documents of each topic.
In this thesis, we provide methods for a set of tasks that seek to identify the important or
new keyphrases and the corresponding first originating document, known as the Landmark Paper. These methods focus on supplying a fine-grained picture of the development of keyphrases over time, to help users grasp the keyphrase collection's development as a whole. We focus on the problem of finding novel keyphrases within the document collections, and their originating documents across multiple conferences with respect to the conference year. We also recognize that this alone is insufficient for new researchers who are exploring a research area: it is much more helpful to be able to see the related research areas based on proximity and importance of keyphrases and their originating documents.
We investigate a system where the user can enter a set of keyphrases; the system recommends the top-k keyphrases corresponding to the user's query using a knapsack approach, shows the relationships of these topics to each other, and finally outputs the landmark paper for each of those topics. This will essentially provide all the material required for the researcher to exhaustively understand the foundations of that research area.
This thesis explores text-based approaches for these tasks. For wide applicability, the methods use only document text. We evaluate our methods experimentally on actual research publications from the proceedings of different data mining and database conferences available on the Digital Bibliography Library Project (DBLP) website1. We have prepared a
cleaned-up dataset with the text proceedings to conduct this evaluation.
In addition to the above tasks, we have developed an approach that applies the knapsack-based solution discussed above in a different domain, i.e. recommending a set of items to customers in retail stores such that the profit of the store is maximized.
1http://www.informatik.uni-trier.de/~ley/db/conf/indexa.html
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Overview of Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background and Related Work 9
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 KEA : Keyphrase Extraction Algorithm . . . . . . . . . . . . . . . . 10
2.1.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Receiver Operator Characteristic (ROC) Curve and Space . . . . . . 14
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Topic Detection and Tracking . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 First Story Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 Hot Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.5 Temporal Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.6 Journal Article Topic Detection based on Semantic Features . . . . . 17
2.2.7 COA: Finding Novel Patents through Text Analysis . . . . . . . . . . 18
2.3 Differences from Landmark Paper Mining . . . . . . . . . . . . . . . . . . . . 18
3 Mining Landmark Papers 21
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Extracting Keyphrases . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Identifying Landmark Papers . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Experimental Results and Evaluation . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.2 Identifying Keyphrases and Originating Documents from Each Con-
ference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Number of Misclassifications . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.4 Identifying Keyphrases and Originating Documents from Multiple Con-
ferences Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Mining Topic-based Landmark Papers 40
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Steps of Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Mining Related Keyphrases . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 Keyphrase Evolutionary Graph . . . . . . . . . . . . . . . . . . . . . 44
4.2.3 Matching Queries and Keyphrases . . . . . . . . . . . . . . . . . . . . 45
4.2.4 Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.5 Evaluating Keyphrase Ranking (i.e. Keyrank) . . . . . . . . . . . . . 50
4.2.6 Identifying k - Nearest Neighbors of each Keyphrase . . . . . . . . . 51
4.2.7 Evaluating Keyscore of Keyphrases . . . . . . . . . . . . . . . . . . . 53
4.2.8 Recommending Keyphrases . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.9 Identify Originating Document . . . . . . . . . . . . . . . . . . . . . 54
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Case Study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.2 Case Study 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.3 Case Study 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Identifying Originating Documents . . . . . . . . . . . . . . . . . . . 63
4.3.5 Analysis using Google Similarity Distance Measure . . . . . . . . . . 79
4.3.6 Spearman’s Rank Correlation . . . . . . . . . . . . . . . . . . . . . . 82
4.3.7 t - test for testing the significance of an observed sample correlation
coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.8 Analysis of Results using Rank Correlation and t - test . . . . . . . . 85
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 ProMax: A Profit Maximizing Recommendation System for Market Bas-
kets 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 The ProMax Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.1 Clustering Customer Transactions and Identification of the Current
Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.2 Calculating Expected Profit . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.3 Recommending Items: Knapsack Approach . . . . . . . . . . . . . . . 98
5.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5.1 Performance of DualRank . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.2 Performance of ProMax . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6 Conclusion and Future Work 104
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Bibliography 107
List of Figures
1.1 Block Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Relationship between the set of relevant and retrieved documents. . . . . . . 12
3.1 Flow Diagram for Identifying Landmark Papers. . . . . . . . . . . . . . . . . 24
3.2 ROC Space: Comparison of Different Predicted Results. . . . . . . . . . . . 38
3.3 Different Prediction Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Contingency Table 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Contingency Table 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Contingency Table 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Contingency Table 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Contingency Table 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.9 Contingency Table 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.10 Contingency Table 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11 Contingency Table 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 An Example of Keyphrase Evolutionary Graph . . . . . . . . . . . . . . . . . 47
4.2 Keyphrase Evolutionary Graph relevant to the query term “decision trees”. . 56
4.3 Keyphrase Evolutionary Graph for query terms (“clustering uncertain data”, “data streams”). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Keyphrase Evolutionary Graph relevant to the query term “skyline query”. . 61
4.5 Keyphrase Evolutionary Graph relevant to the query term “target schema”. . 62
4.6 The Significance of Spearman’s Rank Correlation and degrees of freedom. . . 86
5.1 DualRank Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Comparisons of profits earned by the algorithms based on the number of items
selected. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 ProMax Performance on different datasets . . . . . . . . . . . . . . . . . . . 103
List of Tables
2.1 Sample Output of Keyphrase Extraction from Research Articles. . . . . . . . 11
2.2 A confusion matrix for positive and negative tuples. . . . . . . . . . . . . . . 13
3.1 Basic information of data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 List of keyphrases and corresponding originating documents in VLDB confer-
ence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 List of keyphrases and corresponding originating documents in SIGMOD con-
ference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 List of keyphrases and corresponding originating documents in ICDE conference. 32
3.5 List of keyphrases and corresponding originating documents in ICDM confer-
ence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 List of keyphrases and corresponding originating documents in KDD conference. 34
3.7 List of keyphrases and corresponding originating documents in PAKDD con-
ference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.8 Number of incorrectly classified keyphrases from each conference. . . . . . . . 36
3.9 List of global keyphrases and originating documents from multiple conferences. 37
4.1 Sample market-basket dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Documents and Set of Keyphrases . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Frequent Keyphrases and Support Count . . . . . . . . . . . . . . . . . . . . 44
4.4 Keyphrases and Keyranks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Keyphrases and Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Keyphrases and Keyscores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.7 List of keyphrases and their keyranks matching with query term “decision trees” 69
4.8 List of keyphrases and their distances corresponding to the query term “deci-
sion trees”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.9 List of keyphrases and their keyscores corresponding to the query term “de-
cision trees”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.10 List of keyphrases and their keyranks corresponding to the query terms (“clustering uncertain data”, “data streams”) . . . . . . . . . . . . . . . . . . . . . 70
4.11 List of keyphrases and their distances matching with query terms (“clustering uncertain data”, “data streams”). . . . . . . . . . . . . . . . . . . . . . . . . 71
4.12 List of keyphrases and their keyscores corresponding to the query terms (“clustering uncertain data”, “data streams”) . . . . . . . . . . . . . . . . . . . . . 71
4.13 List of keyphrases and their keyranks corresponding to the query term “skyline
query” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.14 List of keyphrases and their keyranks corresponding to the query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.15 List of keyphrases and their distances matching with query term “skyline query” 74
4.16 List of keyphrases and their distances matching with query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.17 List of keyphrases and their keyscores matching with query term “skyline query” 76
4.18 List of keyphrases and their keyscores matching with query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.19 List of keyphrases and corresponding originating documents . . . . . . . . . . 78
4.20 List of keyphrases and their NGD corresponding to the query term “decision
trees”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.21 List of keyphrases and their NGD corresponding to the query terms (“clustering uncertain data”, “data streams”). . . . . . . . . . . . . . . . . . . . . . . 82
4.22 List of keyphrases and their NGD corresponding to the query term “skyline
query”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.23 List of keyphrases and their NGD corresponding to the sub-query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.24 Spearman’s rank correlation results . . . . . . . . . . . . . . . . . . . . . . . 85
4.25 t - test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
List of Algorithms
1 Mining Landmark Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 FP-Tree Algorithm for finding frequent itemsets . . . . . . . . . . . . . . . . . 45
3 FP-Growth(T, α) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Keyrank(graph, dampingfactor=0.85, maxiterations=100, mindelta= 0.00001) 65
5 Dijkstra’s Shortest Path Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7 Mining Profitable Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Chapter 1
Introduction
The recent proliferation of the World Wide Web and the availability of inexpensive storage media have led to the accumulation of enormous amounts of data. Digital data has become one of the most important resources for information. But the fact that more data is available doesn't necessarily mean that it is being used in an efficient manner. It is the sheer amount of data that emphasizes the need for intelligent automatic access; no one is willing to, or capable of, manually browsing through large data collections. To satisfy a user's information need, a
system should list a precise and small subset of the data collection. The ultimate goal is to
help users find what they are looking for.
Text Data Mining (TDM) can be considered a field of its own, containing a number of
applications. It has also been known as text analysis, text mining or knowledge discovery
in text. In general, TDM applications are used to extract non-trivial and useful information
from large corpora of text data, which are available in unstructured or structured format.
Text mining applications require the use and application of many related fields such as
Information Retrieval, Machine Learning, Statistics, and Linguistics. There are various
applications of TDM, such as in bio-informatics, market research, consumer trend studies,
and scientific research [De05].
Information Retrieval (IR) and Information Extraction (IE) areas are associated with text
mining. IE has the goal of transforming a collection of documents into information that is
more readily digested and analyzed with the help of an IR system. IE extracts relevant facts
from the documents, while IR selects relevant documents. IE is a kind of pre-processing
stage in the text mining process, which is the step after the IR process and before the data
mining techniques are performed.
Today, the internet is going through a rapid phase of growth and development. With the growth of the internet, information contained in electronic documents is increasingly widespread, with the World Wide Web as its primary repository. The convenience of electronic documents has motivated their more efficient application in information management
and knowledge discovery [WHGL09].
A typical information retrieval problem is to locate relevant documents in a document
collection based on a user’s query, which is often some keywords describing an information
need, although it could also be an example relevant document. In such a search problem,
a user takes the initiative to “pull” the relevant information out from the collection; this is
most appropriate when a user has some ad hoc (i.e. short-term) information need, such as
finding information to buy a used car. When a user has a long-term information need (e.g. a
researcher’s interests), a retrieval system may also take the initiative to “push” any newly
arrived information item to the user if the item is judged as being relevant to the user’s
information need.
One goal of text mining is to provide automatic methods to help people grasp the key
ideas in ever-increasing document collections. People often interact with these document
collections and thus may be interested in methods to help them better “use” the documents.
For retrieving individual documents, search engines have already been very successful. Other
methods such as topic modeling can provide a coarse overview of the topics in a document
collection. While information retrieval and topic modeling methods have been widely appli-
cable and useful, current methods for drilling deeper to understand the idea structure and
development in a corpus as a whole could still be improved.
We provide methods for a set of tasks that seek to identify important or new keyphrases and the corresponding first originating documents from a corpus over time. These methods focus on supplying a fine-grained picture of the extracted keyphrases over time, to help users grasp the keyphrase collection's development and the originating documents with respect to various conferences as a whole. We focus on the problem of finding important keyphrases within the document collections, and the development of keyphrases from documents
through multiple conferences with respect to the conference year. However, this alone is insufficient for new researchers who are exploring research areas. It is much more helpful to be able to see the relevant research areas in terms of proximity and importance
of keyphrases. This will essentially provide all the material required for a researcher to
exhaustively understand the developments of research areas. This thesis explores text-based
approaches for these tasks. For wide applicability, the methods use only document text.
We evaluate our methods experimentally on research publications from the proceedings of different data mining and database conferences available on the Digital Bibliography Library Project (DBLP) site. We have prepared a cleaned-up dataset with the text proceedings to
conduct this evaluation.
1.1 Motivation
In many application domains, we encounter a stream of text, in which each text document has
some meaningful time stamp. For example, a collection of news articles about a topic and
research papers in a subject area can both be viewed as natural text streams with publication
dates as time stamps. In such text data streams, there often exist some interesting and
meaningful keywords. For example, an event covered in news articles generally has some
meaningful keywords consisting of themes (i.e. subtopics) characterizing the beginning,
progression, and impact of the event, among others. Similarly, in research papers, some
important and meaningful keywords may also exhibit similar patterns. For example, the
study of one topic specified by some keyphrases in one time period may have influenced or stimulated the study of another topic associated with the same keyphrases after that period. In all these cases, it would be very useful if we could discover and extract these important keyphrases and also automatically identify the first corresponding paper, to learn where the keyphrases originate with respect to the time stamp. Indeed, such research papers are not only useful by themselves, but would also facilitate organization and navigation of the information stream according to the underlying keywords. In addition, it is helpful for users to explore further if their query also returns other relevant, related information.
1.2 Problem Description
In this section, we discuss our problem in two parts:
1. We focus on the problem of finding a list of keyphrases and also identifying the first document in the corpus where each important keyphrase is introduced. We present the problem of Mining Landmark Papers (MLP). This problem requires simultaneously understanding which keyphrases/topics are new or important and which documents drive these keyphrases. The following definition formally captures the problem statement:
Definition 1 (Landmark Paper Mining) Given a collection of time-indexed documents C = {d1, d2, ..., dT}, where di refers to a document with time stamp i, and each document is a sequence of words from a vocabulary set V = {w1, w2, ..., w|V|}, the problem is to identify the first document that introduces each important keyphrase; such documents are known as landmark papers.
This can be broken into two sub-problems:
(a) Find the right keyphrases/topics in a collection of documents.
(b) Identify the originating documents of important keyphrases.
2. Mining topic-based landmark papers: For a given user query (in terms of keywords), find the keyphrases relevant to the query term, and recommend the top-k related keyphrases along with their originating documents, if they exist.
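Sub-problem (b) above reduces, at its simplest, to a scan over the time-indexed collection: for each important keyphrase, the landmark paper is the earliest document containing it. The following minimal sketch illustrates this idea; the document ids, years, and keyphrases are hypothetical, and in practice each document's keyphrase set would first come from a keyphrase extraction step:

```python
def find_landmark_papers(docs, important_keyphrases):
    """Map each important keyphrase to the id of the earliest document
    containing it. docs is a list of (timestamp, doc_id, keyphrase_set)."""
    landmarks = {}
    for timestamp, doc_id, keyphrases in sorted(docs):  # scan in time order
        for kp in keyphrases & important_keyphrases:
            landmarks.setdefault(kp, doc_id)  # keep only the first occurrence
    return landmarks

# Illustrative toy corpus (not real data from the thesis).
docs = [
    (2003, "paper-A", {"skyline query", "data streams"}),
    (2001, "paper-B", {"data streams"}),
    (2005, "paper-C", {"skyline query"}),
]
landmarks = find_landmark_papers(docs, {"skyline query", "data streams"})
# "data streams" originates in paper-B (2001), "skyline query" in paper-A (2003)
```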
In addition to the above approach, we also explore our problem in a different domain. We formalize a technique that utilizes a knapsack-based solution to recommend an optimal set of items to customers in a retail store, based on the contents of their market baskets, such that the overall profit of the store is maximized.
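The knapsack-based selection underlying both settings (top-k keyphrase recommendation and profitable item recommendation) can be sketched as a standard 0/1 knapsack: each candidate carries a value (e.g. a keyscore or an expected profit) and an integer cost, and we choose the subset maximizing total value within a budget. The item names and numbers below are illustrative, not values from the thesis:

```python
def knapsack_select(items, capacity):
    """0/1 knapsack by dynamic programming over integer costs.
    items: list of (name, cost, value); returns (best_value, chosen names)."""
    best = [(0.0, []) for _ in range(capacity + 1)]
    for name, cost, value in items:
        for cap in range(capacity, cost - 1, -1):  # backwards: each item used at most once
            cand = best[cap - cost][0] + value
            if cand > best[cap][0]:
                best[cap] = (cand, best[cap - cost][1] + [name])
    return best[capacity]

# Hypothetical candidates with (cost, value) pairs.
items = [("itemA", 2, 3.0), ("itemB", 3, 4.0), ("itemC", 4, 5.0), ("itemD", 5, 8.0)]
value, chosen = knapsack_select(items, 7)
# best subset within cost budget 7 is itemA + itemD, total value 11.0
```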
1.3 Scope
Mining Landmark papers is not only useful for the beginning researcher, but for anyone
keeping track of important developments in a particular area. This is important today
due to the large numbers of researchers and published research papers. Keeping track of
keyphrases, their related keyphrases and landmark papers is especially useful to keep track of key research developments not necessarily in the specific area of one's research, but in its
numerous related areas, which tend to be voluminous.
Consider, for example, that there are often hundreds of research papers published annually in
a research area. A researcher, especially a beginning researcher, often wants to understand
how the research topics in the literature have been evolving. For example, if a researcher
wants to know about data mining, both the historical milestones and the recent research
trends of data mining would be valuable for him/her. Identifying the origins of important
and new keyphrases will also make it much easier for the researcher to selectively choose an appropriate new field of research. The corresponding first document (i.e. landmark paper) for a keyphrase will also help researchers read only those papers matching their research interests.
1.4 Contribution of the Thesis
In this thesis, our work explores how document collections develop over time, specifically by detecting keyphrases from documents, looking at important keyphrases and their related keyphrases, and detecting in which documents these keyphrases originate. These (entirely) text-based methods can be used to detect new/important keyphrases that develop over time and the corresponding originating documents of those keyphrases with respect to the time-stamp. We address the problems of identifying the right/important keyphrases and the originating documents that introduce new keyphrases with large impact. Figure 1.1 shows the block diagram of the steps performed by our approach. In the next section, we briefly define the steps for keyphrase extraction and our approach to identifying landmark papers.
Figure 1.1: Block Diagram.
1.5 Overview of Proposed Approach
Keyphrases provide semantic metadata that summarize and characterize documents. To extract keyphrases from documents, we use the Keyphrase Extraction Algorithm1 (KEA) [WPF+99], an algorithm for automatically extracting keyphrases from text. It is a single-document summarizer, which employs a Bayesian supervised learning approach to build a model from training data, then applies the model to unseen documents for keyphrase extraction [WPF+99].
KEA is simple, effective, and publicly available. KEA's extraction algorithm has two stages:
1. Training: create a model for identifying keyphrases, using training documents for which the author-assigned keyphrases are known.
2. Extraction: choose keyphrases from a new document, using the above model.
Both stages choose a set of candidate phrases from their input documents, and then
calculate the values of certain attributes (called features) for each candidate.
In our experiments, we consider the full text of documents when extracting keyphrases. We set the minimum keyphrase length to 2 words and the maximum to 3 words, to avoid redundant keywords.
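The two standard features KEA computes for each candidate phrase are its TF×IDF score and the relative position of its first occurrence [WPF+99]. A simplified recomputation of both, as a sketch over a toy corpus rather than KEA's actual implementation, might look like:

```python
import math

def kea_features(phrase, doc, corpus):
    """Simplified version of KEA's two candidate-phrase features:
    TF*IDF and normalized first-occurrence position. phrase is a list of
    words; doc and each corpus document are token lists."""
    n = len(phrase)
    target = tuple(phrase)
    windows = [tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)]
    tf = windows.count(target) / max(len(windows), 1)
    df = sum(1 for d in corpus
             if target in {tuple(d[i:i + n]) for i in range(len(d) - n + 1)})
    idf = -math.log2(max(df, 1) / len(corpus))  # rarer phrases score higher
    first = windows.index(target) / max(len(windows), 1) if target in windows else 1.0
    return tf * idf, first

# Toy documents for illustration only.
doc = "decision trees for mining data streams with decision trees".split()
other = "skyline query processing over data streams".split()
tfidf, first = kea_features(["decision", "trees"], doc, [doc, other])
# tf = 2/8, idf = -log2(1/2) = 1, so tfidf = 0.25; first occurrence at position 0
```

KEA then feeds these features to a Naive Bayes model; the sketch above only shows how the raw feature values could be derived from text.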
Next, to identify originating documents from the corpus, we propose an approach that is easy, direct, and simple to understand. We prepared a database of well-tagged “Data Mining/
1http://www.nzdl.org/Kea/
6
Databases” research papers from DBLP (Digital Bibliography and Library Project) website2.
DBLP is a computer science bibliography website hosted at university of Trier in Germany.
We extract the information related to data mining and databases conferences like VLDB,
ICDM and SIGMOD etc. and store the information in our database to perform the experi-
ments. The information we extract includes the year of conference, author’s name, conference
name, paper title, and general paper topic if any and full-text of research papers. In our
approach, we consider time-stamp i.e. conference month and year as one of the important
parameter for sorting the resulting documents, and additional pruning step to refine our
results, we consider references section of their corresponding candidate landmark papers, as
the another important parameter.
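The sorting and pruning steps just described can be sketched as follows; the record fields (`title`, `timestamp`, `references_with_phrase`) are illustrative assumptions, not our actual database schema:

```python
def earliest_originating_doc(candidates):
    """candidates: list of dicts with 'title', 'timestamp' as a
    (year, month) tuple, and 'references_with_phrase' (True if some
    referenced paper already uses the keyphrase). Returns the title of
    the earliest candidate that survives the reference-based pruning."""
    # Sort by time-stamp so earlier papers are considered first.
    ordered = sorted(candidates, key=lambda d: d["timestamp"])
    for doc in ordered:
        # Pruning step: skip papers whose references already use the phrase,
        # since they cannot be the origin of that phrase.
        if not doc["references_with_phrase"]:
            return doc["title"]
    return None
```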
1.6 Applications
Text is the natural choice for formal exchange of information by common people, through
electronic mail, Internet chat, the World Wide Web, digital libraries, electronic publications,
and technical reports, to name a few. The wealth of information typically stored
in text (or document) databases distributed all over is enormous, and such databases are
growing exponentially with the current revolution in internet and information technology.
Automatic understanding of the content of textual data, and hence the extraction of relevant
knowledge from it, is a long-standing challenge in Artificial Intelligence. In order to aid the
mining of information from large collections of such text databases, special types of data
mining methodologies, known as "text mining", have been developed. Our ultimate goal is
to help the user identify relevant information and extract the information in which the user
might be interested in the future. Today, users do not want to spend a huge amount of time
searching for the relevant information of their choice. To address these aspects of users' needs, we
go one step further to identify recent research trends and the corresponding originating
research articles, helping the user grasp the key knowledge of the topics quickly. The
applications of our work span a wide variety of fields. It involves many domains
of text documents, such as news articles, web pages, research publications and journals,
digital libraries, blogs, email analysis, electronic publication of books, technical and business
documents, thesis and dissertation reports, patent data analysis and so on.
2http://www.informatik.uni-trier.de/~ley/db/conf/indexa.html
1.7 Organization of the Thesis
In addition to the problem definition of our work, we gave a brief introduction to the motivation
and scope of our problem. The remainder of this thesis is organized as follows:
• Chapter 2: Describes the background knowledge and relevant literature in the con-
text of this thesis. In this chapter we first discuss some background information,
then related approaches, and later we explain how our approach differs from other
existing approaches.
• Chapter 3: Presents the framework of MLP (Mining Landmark Papers). In this chapter,
we first describe the pre-processing step that is required for keyphrase extraction in
the context of research documents. Next, we define our methodology for identifying
landmark papers and present the experimental results and evaluation of our approach
on a text corpus.
• Chapter 4: Develops an approach to
– find related keyphrases based on the proximity and importance of keyphrases and
recommend a set of top-k keyphrases corresponding to the user's query, and
– identify the corresponding originating documents of those keyphrases.
We also discuss various case studies later in the chapter.
• Chapter 5: Uses the knapsack-based solution developed in Chapter 4 in a
different domain, i.e., to recommend a set of items to the customers in retail stores such
that the profit of the store is maximized. We evaluate our approach experimentally
and compare it against the existing approach.
• Chapter 6: Finally, this chapter concludes our work and presents possible directions
for future work.
Chapter 2
Background and Related Work
In this chapter, we first present some background and related information that is useful
for understanding the underlying idea and motivation of our research problem. Later we
discuss the related approaches and how our approach differs significantly from other existing
approaches.
2.1 Background
In this section, we first explain the notion of keyphrase extraction, and next we describe the
evaluation metrics that are used to evaluate the performance of our results. Keyphrases pro-
vide a simple way of describing a document, giving the reader some clues about its content.
In addition, keyphrases can help users get a feel for the content of a collection, provide sensi-
ble entry points into it, show how queries can be extended, facilitate document skimming by
visually emphasizing important phrases, and offer a powerful means of measuring document
similarity [GPW+99, Jon98, Wit03]. Many open-source tools are available to ex-
tract keyphrases from research articles. We have used the Keyphrase Extraction Algorithm,
i.e., KEA [WPF+99], for extracting keyphrases from our database. In the next section, we
explain KEA in detail.
2.1.1 KEA : Keyphrase Extraction Algorithm
KEA [WPF+99] is an algorithm for extracting keyphrases from a text corpus. It can be used
either for free indexing or for indexing with a controlled vocabulary. KEA1 is implemented in
JAVA and is platform independent. It is open-source software.
KEA identifies candidate keyphrases using lexical methods [Tur99], calculates feature values
for each candidate, and uses a machine-learning algorithm to predict which candidates are
good keyphrases. The machine learning scheme first builds a prediction model using training
documents with known keyphrases, and then uses the model to find keyphrases in new
documents. It uses the Naive Bayes machine learning algorithm for training and keyphrase
extraction.
In Table 2.1, we show example output of KEA on our dataset. Table 2.1 shows
the titles of 2 research articles and 2 sets of keyphrases for each article. The first set gives the
keyphrases assigned by the author; the other was determined automatically from the article's
full text by KEA. Phrases in common between the two sets are italicized. As seen from
Table 2.1, the automatically extracted keyphrases and the author-assigned keyphrases are
quite similar.
KEA’s extraction algorithm has two stages:
1. Training: Create a model for identifying keyphrases, using training documents where
the authors' keyphrases are known. KEA first needs to create a model that learns
the extraction strategy from manually indexed documents. In our database, we used
150 research articles as training documents and assigned their keyphrases manually. For
each training document, candidate phrases are identified and their feature values are
calculated. To reduce the size of the training set, KEA discards any phrase that occurs
only once in the document. Each phrase is then marked as a keyphrase or a non-
keyphrase, using the actual keyphrases for that document. This binary feature is used
by the machine learning scheme. The scheme then generates a model that predicts
the class using the values of the other two features. It uses the Naive Bayes technique
(e.g. [DP97]) because it is simple and yields good results. This scheme learns two sets
of numeric weights from the discretized feature values, one set applying to positive ("is
a keyphrase") examples and the other to negative ("is not a keyphrase") instances.

Table 2.1: Sample Output of Keyphrase Extraction from Research Articles.

Article 1: "A Bayesian Method for Guessing the Extreme Values in a Data Set"
    Author's assigned keyphrases: Bayesian; Bayesian; Query Processing; Data Management; Estimator
    KEA's generated keyphrases: Bayesian approach; Bayesian method; query processing; data management; Tree Traversal; method for guessing

Article 2: "Efficient Processing of Top-k Dominating Queries on Multi-Dim Data"
    Author's assigned keyphrases: Top-k dominating queries; Skyline Queries; Multi Dimensional Data; Ranking Functions; Sub-linear speed algorithm; Non-Indexed Data
    KEA's generated keyphrases: top-k dominating queries; skyline queries; multi-dimensional data; ranking function; non-indexed data
2. Extraction: To identify keyphrases in a new document, KEA determines candidate
phrases and feature values, and then applies the model built during the training phase.
This model determines the overall probability that each candidate is a keyphrase, and
a post-processing operation then selects the best set of keyphrases. Phrases with the
highest probabilities are selected into the final set of keyphrases. The user can specify
the number of keyphrases to be selected. In our experiments, we specify that the
number of keyphrases to be extracted from each document is 10.
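The final selection step can be sketched as follows, assuming the trained model has already assigned each candidate phrase a probability; the scoring dictionary here is a stand-in for KEA's Naive Bayes output, not KEA itself:

```python
def select_keyphrases(scored_candidates, num_keyphrases=10):
    """scored_candidates: dict mapping candidate phrase -> probability of
    being a keyphrase, as produced by the trained model. Returns the
    user-specified number of top-scoring phrases (10 in our experiments)."""
    ranked = sorted(scored_candidates.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [phrase for phrase, _ in ranked[:num_keyphrases]]
```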
2.1.2 Evaluation Metrics
To evaluate the performance of our text retrieval system we use standard measures such
as recall, precision and the F-score. Let the set of documents relevant to a query be
denoted as Relevant, and the set of documents retrieved be denoted as Retrieved. The set
of documents that are both relevant and retrieved is denoted as Relevant ∩ Retrieved, as
shown in the Venn diagram in Figure 2.1. There are two basic measures for assessing the
quality of text retrieval.
Figure 2.1: Relationship between the set of relevant and retrieved documents.
Definition 2 (Precision) This is the percentage of retrieved documents that are in fact
relevant to the query (i.e., "correct responses"). It is formally defined as:

Precision(P) = |Relevant ∩ Retrieved| / |Retrieved|    (2.1)
Definition 3 (Recall) This is the percentage of documents that are relevant to the query
and were, in fact, retrieved. It is formally defined as:

Recall(R) = |Relevant ∩ Retrieved| / |Relevant|    (2.2)
An information retrieval system often needs to trade off recall for precision or vice versa.
One commonly used trade-off measure is the F-score, which is defined as the harmonic mean of recall
and precision:

F-score = (2 × recall × precision) / (recall + precision)    (2.3)
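The set-based precision, recall and F-score definitions above translate directly into code; this is a minimal sketch:

```python
def precision_recall_fscore(relevant, retrieved):
    """relevant, retrieved: sets of document identifiers. Returns
    (precision, recall, F-score) per Eqs. (2.1)-(2.3)."""
    overlap = len(relevant & retrieved)          # |Relevant ∩ Retrieved|
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of recall and precision.
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore
```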
To check the accuracy of our system, we use the notion of a confusion matrix. The confusion
matrix is a useful tool for analyzing how well our classifier can recognize tuples of different
classes. In the context of the confusion matrix, we build our analogy in terms of retrieved and
relevant documents with respect to actual and predicted class labels. A confusion matrix
for two class labels is shown in Table 2.2.

Table 2.2: A confusion matrix for positive and negative tuples.

                   Predicted Results
Actual Results  |   tp   |   fp
                |   fn   |   tn
In terms of tp (true positive), tn (true negative), fp (false positive) and fn (false negative),
Recall and Precision are defined as:

Definition 4 (Precision) This is the probability that a (randomly selected) retrieved doc-
ument is relevant. It is formally defined as:

Precision(P) = tp / (tp + fp)    (2.4)

Definition 5 (Recall) This is the probability that a (randomly selected) relevant document is
retrieved in a search. It is formally defined as:

Recall(R) = tp / (tp + fn)    (2.5)
To quantify the number of correctly and incorrectly classified documents, we use the accuracy and error-rate
measures. The accuracy of a system is the percentage of test-set tuples that are correctly
classified, and the error rate identifies how many are misclassified. The accuracy and error rate are
given by:

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (2.6)

error-rate = 1 − Accuracy    (2.7)
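A minimal sketch of the count-based metrics of Eqs. (2.4)-(2.7), computed from the confusion-matrix entries tp, tn, fp and fn:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy and error rate from
    confusion-matrix counts. Assumes tp+fp > 0 and tp+fn > 0."""
    precision = tp / (tp + fp)                       # Eq. (2.4)
    recall = tp / (tp + fn)                          # Eq. (2.5)
    accuracy = (tp + tn) / (tp + tn + fp + fn)       # Eq. (2.6)
    error_rate = 1 - accuracy                        # Eq. (2.7)
    return precision, recall, accuracy, error_rate
```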
2.1.3 Receiver Operator Characteristic (ROC) Curve and Space
In a binary decision problem, a classifier labels examples as either positive or negative.
The decision made by the classifier can be represented in a structure known as a confusion
matrix or contingency table. Several evaluation metrics can be derived from the contingency table, as
discussed above. The confusion matrix can also be used to construct a point in ROC space. An
ROC curve shows how the number of correctly classified positive examples varies with the
number of incorrectly classified negative examples. In ROC space [DG06], one plots the False
Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis, depicting the
relative trade-offs between true positives and false positives. The FPR measures the fraction
of negative examples that are misclassified as positive. The TPR measures the fraction of
positive examples that are correctly labeled.
The best possible prediction method would yield a point in the upper left corner, at co-
ordinate (0,1) of the ROC space, representing 100 percent sensitivity (no false negatives) and
100 percent specificity (no false positives). The (0,1) point is also called perfect classification.
A completely random guess would give a point along the diagonal line, i.e., the line of no dis-
crimination from the bottom-left to the top-right corner. This diagonal line divides the ROC
space: points above the diagonal represent good classification results, and points below the
line represent poor classification.
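A small sketch of placing a classifier in ROC space from actual and predicted binary labels; the helper names here are illustrative:

```python
def roc_point(actual, predicted):
    """Compute the (FPR, TPR) coordinates of a classifier in ROC space
    from parallel lists of actual and predicted 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fpr = fp / (fp + tn) if fp + tn else 0.0   # negatives misclassified
    tpr = tp / (tp + fn) if tp + fn else 0.0   # positives correctly labeled
    return fpr, tpr

def above_diagonal(fpr, tpr):
    """Points above the no-discrimination diagonal (tpr > fpr) indicate
    better-than-random classification."""
    return tpr > fpr
```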
2.2 Related Work
In this section, we review in detail the related work that has already been published in the
literature.
2.2.1 Overview
For self-referential document collections such as research publications, emails, or news articles,
we would like to answer some basic questions: Does a document d introduce some important/new
keyphrases for the first time, in comparison with another document d′? Why is one document
more important than another? This information can then be put together to an-
swer more complicated questions such as the following: Which documents contain new
keyphrases introduced for the first time? These documents are the important ones and can be called
landmark papers, since they represent the essence of new keyphrases introduced for the first time.
Answering this fundamental question has many applications. On the web, methods such as
Hubs and Authorities [Kle99] and PageRank [PBMW99] have been used to find important
documents. There is a whole research community that analyzes research publications by
their citations to determine which have the most impact [Gar55, Gar72].
The number of electronic documents is growing faster than ever before [WHGL09]: in-
formation is generated faster than people can deal with it. In order to handle this problem,
many electronic periodical databases have proposed keyword search methods to decrease
many electronic periodical databases have proposed keyword search methods to decrease
the effort and time spent by users in searching the electronic documents of their interest.
However, the users still have to deal with a huge number of search results. How to provide
an efficient search, i.e., to present the search results in categories, has become an important
current research issue. If search results can be classified and shown by their topics, users can
find papers of interest quickly.
In the most popular form of search, the search criteria are keywords, or concepts that may
be contained in the electronic documents [Voo99]. However, the users still have to deal with
the overabundance of search results in some way. During the last decade, the question of
how best to filter the results of search engines has become an important issue.
2.2.2 Topic Detection and Tracking
Topic Detection is an experimental method for automatically organizing search results. It
can help users save time in identifying useful information in large-scale electronic docu-
ment collections. In [All02], a topic is defined to be a seminal event or activity, along with all directly
related events and activities. Many different data mining methods are employed to recognize
topics, for instance, the Naive Bayes classifier [LWY06], hierarchical clustering algorithms
(HCA) [HNP05, Ber02, Kan03], paragraph relationship maps [SM86], Formal Concept Anal-
ysis (FCA) [Wil81] and lexical chains [BE97, HH76]. These methods use the frequencies
of words to calculate the similarity between two documents. Therefore, their accuracy is
greatly hindered by the presence of synonyms.
Halliday and Hasan [HH76] proposed a semantics-based lexical chain method that can
be used to identify the central theme of a document. Based on the lexical chain method,
combined with the electronic WordNet database2, the proposed method clusters electronic
documents by semantic similarity and extracts the important topics for each cluster. Ulti-
mately, the method provides more user-friendly search results that are relevant to the topics.
The authors of [WHGL09] proposed a document topic detection method based on semantic features
in order to improve upon the traditional methods. The key contribution is to design a method
based on bibliographic structures and semantic properties to extract important words and
cluster the literature. It can be used to retrieve the topics and display the search results
clustered by topic. Expert users can easily acquire literature of interest and correctly find
information from the topic-cluster display.
2.2.3 First Story Detection
In [De05], Indro et al. presented First Story Detection (FSD), whose task requires identifying
those stories within a large set of data that discuss an event that has not already been
reported in earlier stories. In this FSD approach, the algorithm looks for keywords in a news
story and compares the story with earlier stories. FSD is defined as the process of finding all
stories within a corpus of text data that are the first stories describing a certain event [Con00].
An event is a topic that is described or reported in a number of stories. Examples can be
government elections, natural disasters, sports events, etc. The First Story Detection process
runs sequentially, looking at a time-stamped stream of stories and making its decisions
based on a comparison of key terms to previous stories. FSD is closely linked to the Topic
2http://wordnet.princeton.edu/
Detection task, a process that builds clusters of stories that discuss the same topic area or
event [NIS, ALJ00]. Comparably, FSD evaluates the corpus and finds the stories that
discuss a new event. FSD is a more specialized version of Topic Detection, because
in Topic Detection the system has to determine when a new topic is being discussed, and the
resulting stories will be the "first stories".
2.2.4 Hot Topics
Shewhart and Wasson [SW99] described a process that monitors newsfeeds for topics that re-
ceive unexpectedly high amounts of coverage (i.e., hot topics) on a given day. They performed
trend analysis in order to find hot topics, except that they used controlled vocabulary terms
rather than phrases extracted from text. The purpose of the study was to monitor newsfeeds
in order to identify when any topic from a predefined list of topics becomes a hot topic.
2.2.5 Temporal Text Mining
In 2005, Mei and Zhai [MZ05] discovered evolutionary theme patterns from text information
collected over time. Temporal Text Mining (TTM) has many applications in multiple do-
mains, such as summarizing events in news articles and revealing research trends in scientific
literature. The TTM task is to discover and summarize the evolutionary patterns of themes
in a text stream. They define this new text mining problem and present general probabilis-
tic methods for solving it through (1) discovering latent themes from text; (2)
constructing an evolution graph of themes; and (3) analyzing the life cycles of themes.
2.2.6 Journal Article Topic Detection based on Semantic Features
In 2009, Wang and others [WHGL09] described a document topic detection method based
on bibliographic structures (e.g., title, keywords and abstract) and semantic properties to
extract important words and cluster the scholarly literature. The approach can be used to
retrieve topics and display the search results organized by topic. Expert users can easily
acquire literature of interest and correctly find information from the topic-cluster display. In
order to exploit semantic features to detect topics, they constructed lexical chains from
a corpus [BE97, HH76]. They performed three main steps to build the system architecture.
First, the pre-processing model collects journal papers and processes their title, keyword and
abstract information to prepare for the lexical chain construction. Second, the document
representation model builds lexical chains; the last step is the semantic clustering
model. They propose a method to calculate the semantic similarity between semantic features.
After the semantic similarity calculation, the HCA (Hierarchical Clustering Algorithm) method
is used to cluster the documents. Within a cluster, topics are extracted from the documents
using the phrase frequency (PF) method. The key contribution
is the ability to extract topics by semantic features, taking into account the influence of
bibliographic structures, and to recommend clusters to the users.
2.2.7 COA: Finding Novel Patents through Text Analysis
In 2009, M. Al Hasan and others [HSGA09] built a patent ranking software, named COA
(Claim Originality Analysis), that rates a patent's value by measuring the recency
and the impact of the important phrases that appear in the "claims" section of the patent.
COA addresses the novelty and non-obviousness of a patent. It assesses the patent by eval-
uating the originality of its invention. It uses an information retrieval approach, where a
patent is considered valuable if the invention presented in the patent is novel and is
subsequently used or expanded by later patents. This knowledge is gleaned from the patent
text, specifically from the text composing the patent claims. From the "claims" section of a
patent, they first identify a set of phrases (single-word or multi-word) that retain the key ideas of
the patent. Then, for every phrase, they find the earliest patent that used that phrase. They
also track the usage of that phrase by later patents. Finally, they feed this information
into a ranking function to obtain a numeric value that denotes the value of that patent.
2.3 Differences from Landmark Paper Mining
We now show that existing related techniques, specifically first story detection, hot topic
mining, theme mining, journal article topic detection and COA, do not effectively handle
the landmark paper mining problem. Our approach is simpler and more direct. Our
requirements cannot be effectively reduced to first story detection, hot topic mining or
theme mining.
In first story detection (FSD) [De05], algorithms look for keywords in a news story and
compare the story with earlier stories. FSD is the process of finding all stories within a corpus
of text data that are the first stories describing a certain event [Con00]. The FSD process
runs sequentially, looking at a time-stamped stream of stories and making decisions based
on a comparison of key terms to previous stories. FSD is closely linked to the topic detection
task [NIS], a process that builds clusters of stories that discuss the same topic area or event.
Landmark paper mining differs significantly from FSD in the following ways. In FSD, a
new story is detected as being a first story if it exhibits a significant vocabulary shift from recent
stories. First, a vocabulary shift could occur even without the introduction of new key terms,
if the frequencies of existing key terms are significantly altered. Second, a document can be
flagged as a first story even when there is an earlier document with the same key terms
and frequencies. For example, even if there was an earthquake last year, the first story
describing a more recent earthquake will be detected as a first story.
In hot topic mining [SW99], a topic is known as hot when it receives an unusually high
amount of coverage in the news on a given day, because of the importance of the events
involving that topic. The authors used trend analysis to find hot topics, except that they
used controlled vocabulary terms rather than phrases extracted from text. Landmark
paper mining is a clearly different problem, as it seeks to mine interesting papers and
identify important and related keyphrases, instead of interesting topics.
The Temporal Text Mining (TTM) [MZ05] task is to discover, extract and summarize the
evolutionary patterns of themes in a text stream. In that paper, the authors identify when
a theme starts, reaches its peak, and then deteriorates, as well as which subsequent themes
it influences. A timeline-based theme structure is a very informative summary of an event,
which also facilitates navigation through themes.
Theme mining can be considered an approach to mine interesting papers that originate
themes. However, a new theme containing only existing keyphrases with altered frequencies does
not necessarily represent a new concept. In fact, this step (of determining themes)
is both unnecessary and insufficient for determining whether a paper originates a new concept.
In contrast, a new keyphrase almost certainly indicates the presence of a new concept.
In landmark paper mining, we therefore follow a simpler and more direct approach. We
identify papers that originate important keyphrases instead of themes (which can contain a
collection of keyphrases). Our approach is simpler because it avoids the notion of themes,
so there is no need to decide which collection of keyphrases forms a theme. By avoiding this
unnecessary step, our approach is more direct.
In [WHGL09], the authors' emphasis is on extracting important concepts from documents,
after which the documents are clustered by semantic similarity. The user's main goal is to find
only topics and documents by using the topic-cluster display. Landmark paper mining is
significantly different from this approach: we identify important keyphrases and their
related keyphrases, based on their proximity and importance, together with the corresponding
originating document of each keyphrase, i.e., the document where the keyphrase starts. Another
advantage of mining landmark papers is that the journal article topic detection work considers
only the bibliographic structure (e.g., title, keywords and abstract), while we consider the full
text of the research article as the input to our approach. Moreover, their clustering
approach can lead to much duplication of topics when there is a large number of clusters.
Finally, in COA [HSGA09], a patent is said to be novel if its ranking is
high. The ranking of the patent is based on the recency of its keyphrases. COA allows a
user to define a time-window; the keyphrases that first appeared in patents published
within the given time-window are considered. So, in COA, many of the keyphrases
used by the patent are recent within the given time-window, or some of the keyphrases
are used by the patent for the first time. However, the novelty of a patent does not depend entirely
on the recency of its keyphrases, since a keyphrase may already have been used by a previously
published patent class. Landmark paper mining differs significantly from this case. A document is
said to be a landmark if it introduces important keyphrases for the first time; in other words, we
identify the document that introduces important keyphrases for the first time. In addition, we
identify related keyphrases that are relevant to the user's given query, based on their
keyscore values.
Chapter 3
Mining Landmark Papers
Most of the research in the field of text mining, such as identifying hot top-
ics, first story detection (FSD), topic detection and tracking (TDT), and discovering
evolutionary theme patterns (i.e., temporal text mining), did not address the problem of
identifying important keyphrases from a text corpus of research documents, and further
did not identify the first document from which a keyphrase originates. In this chapter, we present
the MLP (Mining Landmark Papers) approach, to identify the important keyphrases and their
originating documents from text. The use of MLP at an initial stage by new researchers or
users will help them explore and choose their field of interest in a more structured and
effective manner, and help them understand how keyphrases emerge in
time-stamped documents.
3.1 Overview
Mining Landmark Papers (MLP) is concerned with discovering important keyphrases and the
corresponding first documents in text information collected over time. We consider the prob-
lem of analyzing the development of important keyphrases and identifying the first document where
a keyphrase was introduced in the collection. Text is quite noisy and there are
many documents, so we focus primarily on methods that help people grasp the most important
keyphrases and how they developed through the documents (after all, not many people have
time to try to understand everything).
In addition, people typically like to keep up to date with new and current keyphrases, or
to see how keyphrases developed over time. Thus our method focuses on the most
important or earliest documents where keyphrases first occur in the given time-stamp.
Most of the existing work related to these questions has focused on exploiting meta-
data like hyperlinks and citation information. Graph-based algorithms like HITS [Kle99],
PageRank [PBMW98], and their descendants [CC00] exploit information in the hyperlink
structure to find outstanding documents. These algorithms are based on citation analysis
from bibliometrics [MFH+03], which is used to detect related work and define impact [Gar03,
Gar]. In contrast to using citation data, we propose an approach that considers the whole text of
the document as input.
For the problem of discovering topics and trends in a collection of documents, there is an
abundance of work that has already been done. The topic detection and tracking (TDT)
evaluations [ACD+98, APL98] emphasized online new topic detection for news articles. In
short, other work has focused on burst detection, correlating real-world events, such as the
rise and fall of a topic's popularity, with single words from the documents [Kle02, SJ00].
Evolutionary theme patterns demonstrate the entire "life cycle" of a topic from a probabilistic
background [MZ05].
3.2 Problem Definition
Given a collection of time-stamped documents, we formulate and explore the following
questions:
1. What are the important/right keyphrases and how do these keyphrases develop over
time?
2. Which documents introduced these keyphrases, and which document is the originating
document of a keyphrase?
3. How do we identify the first document (d) in the collection having keyphrase (k), i.e., the
document in which k originates?
To answer all these questions for general document collections, we require that our algorithm
work entirely based on the text in the document collection. The following formal
problem statement captures the above requirements:

Definition 6 (Landmark Paper Mining) Given a collection of time-indexed documents,
C = {d1, d2, . . . , dT}, where di refers to a document with time-stamp i, and each document is a
sequence of words from a vocabulary set V = {w1, w2, . . . , w|V|}, the problem is to identify
the first document that introduces an important keyphrase for the first time; such documents are
known as landmark papers.

The term important in the above definition denotes keyphrases that are extracted
using standard techniques [WPF+99] based on, e.g., TF-IDF and the position of occurrence of the keyphrase
in the document.
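The problem statement can be sketched directly: scan the documents in time-stamp order and record, for each important keyphrase, the first document containing it. The phrase matching below is a simplified substring test, not our full extraction pipeline:

```python
def landmark_index(documents, important_keyphrases):
    """documents: list of (timestamp, doc_id, text) tuples;
    important_keyphrases: iterable of phrases. Returns a dict mapping
    each phrase to the doc_id of its originating (earliest) document."""
    origin = {}
    # Scan documents in time-stamp order so the first match per phrase
    # is the earliest one.
    for timestamp, doc_id, text in sorted(documents):
        for phrase in important_keyphrases:
            if phrase not in origin and phrase in text.lower():
                origin[phrase] = doc_id
    return origin
```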
To identify landmark papers from a text corpus, we first discuss the type of data we are
considering. We assume that the text corpus consists of documents where:
• the text of the document is accessible, and
• the documents are time-stamped.
Examples of such collections are emails, research articles, news articles, blogs, proceedings
of scientific conferences, scientific journals, etc. We use methods that leverage only the
text of the documents so that they can apply to many domains of data. Many domains have
no information about the structure besides what is expressed in the text of the documents
themselves. The advantage of using only the text is that unsupervised
methods based exclusively on text apply widely across many domains of documents. On the
other hand, link information between the documents also contains information that
could be useful; for example, citation data for research publications could be used in addition
to document text to find the originating documents and important keyphrases.
Our approach is simple, direct and differs significantly from other related approaches.
In addition to finding keyphrases from research articles, we also identify the originating
document of each keyphrase. In the next few sections, we explain the steps of our approach
in detail and present results that validate its effectiveness.
3.3 Methodology
The steps of MLP for identifying landmark papers from a text corpus are outlined in
Figure 3.1. We discuss them below.
Figure 3.1: Flow Diagram for Identifying Landmark Papers.
3.3.1 Extracting Keyphrases
To build a model and extract keyphrases, a pre-processing step chooses a set of candidate
phrases from the input documents and then calculates feature values for each candidate.
To choose candidate keyphrases, the input text is first cleaned: apostrophes and non-token
words are deleted, and punctuation marks, brackets and numbers are replaced by phrase
boundaries. Phrases are then identified, stemmed and case-folded. For each candidate
phrase, two features are calculated: TF-IDF, a measure of a phrase's frequency in a document
compared to its rarity in general use; and first occurrence, the distance into the document of
the phrase's first appearance. To perform these tasks we use KEA [WPF+99], an algorithm
for extracting keyphrases from a text corpus. A detailed description of KEA is given in
Section 2.1.1.
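To make the two features concrete, here is a minimal sketch of how they can be computed for one candidate phrase. This is not KEA itself; the function name and the simple token-list representation are our own illustrative choices.

```python
import math

def phrase_features(phrase, doc_tokens, num_docs, doc_freq):
    """Compute TF-IDF and first-occurrence features for a candidate phrase.

    phrase:     list of (stemmed, case-folded) tokens forming the phrase
    doc_tokens: list of tokens of the current document
    num_docs:   total number of documents in the corpus
    doc_freq:   number of corpus documents containing the phrase
    """
    n = len(phrase)
    # all token positions where the phrase occurs in this document
    positions = [i for i in range(len(doc_tokens) - n + 1)
                 if doc_tokens[i:i + n] == phrase]
    tf = len(positions) / max(1, len(doc_tokens))
    idf = math.log((1 + num_docs) / (1 + doc_freq))
    tfidf = tf * idf
    # first occurrence: relative distance into the document (0 = start)
    first_occ = positions[0] / len(doc_tokens) if positions else 1.0
    return tfidf, first_occ
```

For example, for the phrase ["data", "mining"] in a five-token document that begins with it, the first-occurrence feature is 0.0.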
3.3.2 Identifying Landmark Papers
After extracting keyphrases from the document collection, our next aim is to identify, for
each keyphrase, the chronologically first originating document in the corpus. To achieve this
we first identify the set of documents corresponding to each keyphrase. In our database,
we store the conference year, conference name, general title of the paper, paper title, and
first-author name. After identifying the set of documents corresponding to a keyphrase, we
sort them in increasing order of time stamp (e.g., conference year).
To prune unnecessary and meaningless keyphrases from our dataset, we set a threshold
to discard keyphrases that are infrequent across documents. Since our motive is to identify
important, meaningful keyphrases, we introduce a parameter called the minimum support
count (minsup, set to 3 in our experiments). We remove keyphrases that are not contained
in at least minsup documents: a keyphrase occurring in fewer than 3 documents is considered
unimportant. The minsup parameter ensures that the keyphrases considered are persistent
and thereby important. In the sorted list of documents, the first document for a keyphrase
is identified as a candidate landmark paper for that keyphrase.
These candidate landmark papers are further refined by ensuring that the significant
keyphrase does not appear in the references section of the corresponding paper. This
additional pruning step ensures that the keyphrases we identify are important and meaningful
for our corpus. The whole procedure is summarized in Algorithm 1.
With the rapidly increasing number of research articles published each year, one application
is to automatically identify a small set, for example 20 to 30 research publications, that
introduced the keyphrases (research fields) in the respective conference years. We present
such a method to identify the list of keyphrases and the first originating research publication
of each keyphrase among the documents published at conferences such as VLDB, SIGMOD
and PAKDD. People who read this limited subset of articles can hopefully get the gist
of the most important ideas and developments of the research community. Based on users'
Algorithm 1: Mining Landmark Papers
1: for each keyphrase k in the document collection: do
2: identify the set of documents in which k occurs
3: sort these documents in ascending order w.r.t. conference year
4: end for
5: for each keyphrase k: do
6: if number of documents containing k ≥ minsup then
7: keep k and its sorted list of documents
8: else
9: prune k
10: end if
11: end for
12: for each remaining keyphrase k: do
13: report the first document in its sorted list as a candidate landmark paper
14: end for
{Additional Pruning Step}
15: for each keyphrase k: do
16: if k occurs in the references section of its candidate landmark paper then
17: prune k
18: else
19: report k and the first document (landmark paper) from its sorted list
20: end if
21: end for
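Algorithm 1 can be sketched in Python as follows. This is a minimal illustrative sketch, not our production code; the document fields ('title', 'year', 'keyphrases', 'ref_keyphrases') are hypothetical names for the information described above.

```python
MINSUP = 3  # minimum support count used in our experiments

def mine_landmark_papers(docs):
    """docs: list of dicts with keys 'title', 'year', 'keyphrases'
    (keyphrases extracted from the body) and 'ref_keyphrases'
    (keyphrases occurring in the references section)."""
    # group documents by keyphrase
    by_phrase = {}
    for d in docs:
        for k in d["keyphrases"]:
            by_phrase.setdefault(k, []).append(d)
    landmarks = {}
    for k, ds in by_phrase.items():
        if len(ds) < MINSUP:              # prune infrequent keyphrases
            continue
        ds.sort(key=lambda d: d["year"])  # chronological order
        first = ds[0]                     # candidate landmark paper
        # refine: keyphrase must not appear in the candidate's references
        if k not in first["ref_keyphrases"]:
            landmarks[k] = first["title"]
    return landmarks
```

A keyphrase whose earliest paper cites the phrase in its references is dropped, since that paper cannot have introduced it.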
interests, they can read only those documents and get a good sense of later trends and
popular topics. As another example, when mining customer reviews, a few particularly
insightful reviews about the services a shopkeeper provides often stand out from the rest
and spark much discussion. By starting with the influential reviews, the shopkeeper can
save time by reading only the important comments first instead of skimming the whole
discussion.
3.4 Experimental Results and Evaluation
In this section, we present the performance model and experimental results of our approach.
3.4.1 Data Preparation
To experimentally evaluate our approach on a dataset of research papers, we prepared
a database of well-tagged “Data Mining and Databases” conferences from the DBLP website1.
DBLP (Digital Bibliography and Library Project) is a computer science bibliography website
hosted at the University of Trier in Germany. The website maintains information about
research papers in various fields of computer science and currently indexes over a million
of them. We conducted our experiments on a collection of research articles, in particular
the articles published in the proceedings of “Data Mining and Databases” conferences,
namely Knowledge Discovery and Data Mining (KDD), Very Large Data Bases (VLDB),
the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), the
International Conference on Management of Data (SIGMOD), the International Conference
on Data Engineering (ICDE) and the International Conference on Data Mining (ICDM),
between the years 2005 and 2009. The reason for choosing this dataset is twofold:
1. the research document collections fulfill the assumptions stated in the problem definition
(Section 3.2), and
2. we are familiar with the development of the scientific community, which allows us to
evaluate the performance of our results as informed insiders.
1http://www.informatik.uni-trier.de/~ley/db/conf/indexa.html
The research papers are classified by topic, conference, year of publication and author
name. We extracted the information from conferences such as SIGMOD, VLDB, KDD,
ICDE, PAKDD and ICDM and stored it in our database to perform the experiments. The
information we extracted includes the year of publication, author names, conference name,
paper title, general paper topic and the link from which the full-text PDF of each paper
can be downloaded. We converted the full-text PDF files into text files with the pdf2text
software on Linux. There are a total of 3781 documents, with approximately 100-150
documents per conference per year. 8 articles were excluded because they were not
recognizable by the pdf2text software. The basic statistics of the datasets are shown in
Table 3.1. We intentionally did not perform stemming or stop-word pruning in order to
test the robustness of our algorithm.
Table 3.1: Basic information of data sets.
Conference # of docs Time range
VLDB 735 2005-2009
SIGMOD 586 2005-2009
ICDM 672 2005-2009
KDD 601 2005-2009
PAKDD 551 2005-2009
ICDE 636 2005-2009
We wish to measure how well our algorithm performs on real data. All experiments were
conducted on a 1.60 GHz Intel PC with 8 GB main memory running Linux. All algorithms
are coded in Python.
The methodology described in Section 3.3 was applied to the research articles of the
VLDB, SIGMOD, KDD, ICDE, ICDM and PAKDD conferences, both all together and
each one separately. In the sections below, we show the generated list of keyphrases and
corresponding originating documents, first for each conference separately and then for all
the conferences together.
3.4.2 Identifying Keyphrases and Originating Documents from
Each Conference
In this section, we first discuss the set of experiments that analyze how our approach, MLP,
identifies the keyphrases and originating documents for each conference. Table 3.2 shows
a small set of the keyphrases and originating documents identified by our algorithm from
the VLDB conference. In Table 3.2, the first column gives the keyphrase, the second the
publication year, the third the paper title (i.e., the originating document) and the last the
first-author name. VLDB contributed 735 documents over the 5 years; out of these, our
approach identified 21 keyphrases whose originating documents are in the VLDB conference.
An originating document is the first document in our database in which the respective
keyphrase was introduced. For example, as shown in Table 3.2, the keyphrase “plan choices”
was introduced in the paper “Analyzing plan diagrams of database query optimizers” at
VLDB 2005. Our tool also shows (not in Table 3.2) that this keyphrase was again present
in 2007 and 2008, tracing the development of the keyphrase over the following years.
Similarly, the SIGMOD conference contributed 586 documents over the 5 years, from
which our algorithm identified 20 important keyphrases and their originating documents.
Table 3.3 shows a portion of the keyphrase list for SIGMOD.
Table 3.4 shows a small part of the results for the ICDE conference, where we identified
35 originating documents corresponding to keyphrases out of a total of 636 documents
published during the 5 years.
Tables 3.5, 3.6 and 3.7 show portions of the results generated by our algorithm from
the ICDM, KDD and PAKDD conferences. The numbers of extracted keyphrases and
originating documents from the 672 documents of ICDM, 601 documents of KDD and
551 documents of PAKDD are 23, 21 and 24, respectively.
Table 3.2: List of keyphrases and corresponding originating documents in VLDB conference.
Keyphrases Year Paper Title First Author Name
plan choices 2005 Analyzing plan diagrams of database Naveen Reddy
query optimizers
execution 2007 A genetic approach for random testing Hardik Bati
feedback of database systems
optimal plan 2005 Analyzing plan diagrams of database Naveen Reddy
query optimizers
fact table 2006 Cure for cubes: Cubing using a ROLAP Konstantinos
engine Morfonios
query execution 2005 Query execution assurance for outsourced Radu Sion
databases
HTML tables 2008 WebTables:exploring the power of tables Michael J. Cafarella
on the web
storage schemes 2005 ULoad: Choosing the right storage for Andrei Arion
your XML application
database schema 2006 Simple and realistic data generation Kenneth Houkjar
3.4.3 Number of Misclassifications
We want to show that our heuristic for identifying originating documents is effective. To
do this, we measure the following: for each (landmark paper L, keyphrase K) pair, we
apply our algorithm to a random sample of papers that does not contain the landmark
paper L. We then verify that the papers in the sample that contain keyphrase K are not
reported as landmark papers; if any of them is, it counts as an error. Table 3.8 shows the
number of errors for the keyphrases that have corresponding landmark papers in the
different conferences, i.e., the number of incorrectly classified keyphrases in each sample for
each conference.
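The verification loop above can be sketched as follows. Here `find_landmark` stands in for rerunning our mining heuristic on the sample; the function names and document representation are hypothetical.

```python
import random

def count_errors(pairs, corpus, find_landmark, sample_size=50, seed=0):
    """pairs: known (landmark_doc, keyphrase) pairs; corpus: all documents.
    For each pair, rerun the heuristic on a random sample that excludes
    the true landmark paper; any paper it then reports as the keyphrase's
    origin is a misclassification, since the real landmark is absent."""
    rng = random.Random(seed)
    errors = 0
    for landmark, phrase in pairs:
        pool = [d for d in corpus if d is not landmark]
        sample = rng.sample(pool, min(sample_size, len(pool)))
        if find_landmark(sample, phrase) is not None:
            errors += 1
    return errors
```

A returned count of zero for a conference means every sampled paper containing the keyphrase was correctly rejected as a landmark.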
Table 3.3: List of keyphrases and corresponding originating documents in SIGMOD confer-
ence.
Keyphrases Year Paper Title First Author Name
application 2005 IBM SOA on the edge Gennaro A. Cuom
servers
execution plans 2005 Proactive re-optimization with Rio Shivnath Babu
provenance 2006 Provenance management in curated Peter Buneman
information databases
enterprise 2005 Data and metadata management in service Vishal Sikka
application oriented architectures:some open challenges
application logic 2006 Automatic client-server partitioning of data- Nicholas Gerner
-driven web applications
data publishing 2005 Verifying completeness of relational query HweeHwa Pang
results in data publishing
query keywords 2007 BLINKS: ranked keyword searches on graphs Hao He
data 2005 Safe data sharing and data dissemination Luc Bouganim
dissemination on smart devices
For example, the first row of Table 3.8 shows that in sample 1, 19 out of 21 VLDB
keyphrases were correctly identified as not originating there and only 2 were misclassified;
17 out of 21 were classified correctly in sample 2; 21 out of 21 in samples 3 and 5; and
20 out of 21 in sample 4. The rows for the other conferences read analogously, each number
giving the count of incorrectly classified keyphrases out of the total identified by our
algorithm. Most entries in Table 3.8 are zero and none exceeds 4, which implies that our
heuristic for identifying originating documents performs well.
Table 3.4: List of keyphrases and corresponding originating documents in ICDE conference.
Keyphrases Year Paper Title First Author Name
XML streams 2006 Unifying the processing of XML streams Xin Zhou
and relational data streams
spatial objects 2005 Evaluation of spatio-temporal predicates Markus Schneider
on moving objects
query sequence 2007 Stream monitoring under the time Yasushi Sakurai
warping distance
stream data 2008 Approximate clustering on distributed Qi Zhang
data streams
search space 2006 Mining actionable patterns by role models Ke Wang
graph databases 2006 Searching substructures with super- Xifeng Yan
-imposed distance
XPath query 2005 Spatial range querying for gaussian-based Yoshiharu Ishikawa
imprecise query objects
web databases 2006 Answering imprecise queries over Ullas Nambiar
autonomous web databases
3.4.4 Identifying Keyphrases and Originating Documents from
Multiple Conferences Together
After evaluating the keyphrase and originating-document results for each conference
separately, we ran our algorithm on the whole dataset of 3781 research documents together;
261 keyphrases were identified as having originating documents in our corpus. Table 3.9
shows a small list of these keyphrases and their originating documents from the multiple
conferences together. We call these the global keyphrases, whereas the keyphrases identified
in each conference separately are the local keyphrases.
Table 3.5: List of keyphrases and corresponding originating documents in ICDM conference.
Keyphrases Year Paper Title First Author
gradient 2007 Training conditional random fields by periodic Han-Shen
descent step size adaption for large-scale text mining Huang
dynamic 2007 Efficient Algorithm for Mining significant Huahai He
programming substructures in graphs with quality guarantees
subspace 2006 A novel scalable algorithm for Jun Yan
learning supervised subspace learning
text data 2008 Text Cube: Computing IR measures for Cindy Xide
multidimensional text database analysis
latent dirichlet 2008 Collective Latent Dirichlet Allocation Zhiyong Shen
allocation
vector machine 2006 Cluster Based Core Vector Machine Asharaf S
relation 2005 Mining ontological knowledge from Xing Jiang
extraction domain-specific text documents
distance 2005 Alternate representation of distance matrices Keith Marsolo
matrix for characterization of protein structure
In Table 3.9, the originating document for, e.g., “clustering uncertain data” is “A framework
for clustering uncertain data streams”: this is the first document in our database in which
that keyphrase was introduced. As shown in Table 3.1, we took a total of 3781 documents
from the VLDB, SIGMOD, ICDM, ICDE, PAKDD and KDD conferences over 5 years, and
“A framework for clustering uncertain data streams” is the first document (landmark paper)
introducing the keyphrase, at ICDE 2008. The output of our algorithm is not limited to
this dataset and would become more accurate on larger datasets. Also, for the keyphrase
“quasi-identifier attributes”, we see that “Incognito: Efficient full-domain k-anonymity”
(SIGMOD 2005) is the originating document and that the keyphrase afterwards appears in
the SIGMOD and VLDB conferences during 2007 in our database. This implies that this
Table 3.6: List of keyphrases and corresponding originating documents in KDD conference.
Keyphrases Year Paper Title First Author Name
web pages 2008 Information extraction from wikipedia: Fei Wu
moving down the long tail
decision 2006 A general framework for accurate Wei Fan
tree and fast regression by data summari-
-zation in random decision trees
linear 2007 Scalable look-ahead linear David S.Vogel
Regression regression trees
latent diri- 2005 Modeling and predicting personal Xiaodan Song
-chlet allocation information behavior
density 2006 New EM derived from Kullback- Longin Jan
estimation Leibler divergence Latecki
tensor 2008 Heterogeneous data fusion for Jieping Ye
factorization alzheimer’s disease study
community 2006 Center-piece subgraphs: problem Hanghang Tong
detection definition and fast solutions
clustering 2005 A general model for clustering binary data Tao Li
model
keyphrase is not used after these conferences in our collection. It can therefore be helpful
for new researchers to pick this keyphrase as an important one and continue work in this
area. Also, some keyphrases occur only in recent years (e.g., 2008) and not before in our
database, giving the user or researcher the intuition that the keyphrase is important and
was introduced only very recently. Conversely, for the keyphrase “stream processing system”
we observe that it occurs very frequently throughout our whole database: it has roughly
equal influence over the entire time-stamp window and will likely recur in coming years, so
it is neither new nor especially informative for a new user. In addition to this, for the
Table 3.7: List of keyphrases and corresponding originating documents in PAKDD confer-
ence.
Keyphrases Year Paper Title First Author
information 2006 Generalized conditional entropy and a Dan A. Simovici
gain metric splitting criterion for decision trees
anomaly 2005 An anomaly detection method for Ryohei Fujimaki
detection spacecraft using relevance vector learning
regression 2006 improving on bagging with input smearing Eibe Frank
problems
incremental 2007 AttributeNets: An incremental learning Hu Wu
learning method for interpretable classification
rough set theory 2005 Using rough set in feature selection and Le Hoai Bac
reduction in face recognition problem
kernel function 2005 A Kernel function method in Clustering Ling Zhang
classification 2005 A privacy-preserving classification Weiping Ge
accuracy mining algorithm
Mixture model 2008 A mixture model for expert finding Jing Zhang
keyphrases that are new and important for a user, we also provide the corresponding first
document from which each originates. This helps a new researcher read the documents in
which the keyphrases were introduced and get an overview of them.
To evaluate the performance of our system, we took 8 random samples of 50 keyphrases
with their originating documents (as identified by our algorithm) and had domain experts
annotate whether each keyphrase really originated in the respective document. We drew a
contingency table for each set of keyphrases; the confusion matrix is a useful tool for
analyzing how well predicted results match actual results, and from it several evaluation
metrics such as precision, recall and accuracy can be derived, as discussed in Section 2.1.
The confusion matrices for the predicted class (i.e., landmark or not) versus the actual
class are shown in Figure 3.3, which gives the different prediction
Table 3.8: No of incorrectly classified keyphrases from each conference.
Conference
Name
# of
keyphrases
Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
VLDB 21 2 4 0 1 0
SIGMOD 20 2 3 0 0 0
PAKDD 24 2 0 1 2 3
ICDE 35 4 0 2 1 2
ICDM 23 0 0 2 2 1
KDD 21 0 0 2 1 3
results corresponding to the positive and negative instances. For each confusion matrix,
the true positive rate (TPR), false positive rate (FPR) and accuracy are calculated. To
plot an ROC space (see Section 2.1.3), only the true positive rate (TPR), or sensitivity,
and the false positive rate (FPR), or 1-specificity, are needed.
An ROC space is defined with FPR and TPR as the x and y axes respectively, depicting
the relative trade-off between true positives and false positives. Each prediction result,
i.e., each confusion matrix, is one point in the ROC space of Figure 3.2.
In Figure 3.2, 6 of the 8 sample-test points lie above the diagonal line, implying largely
correct classification, while the 2 points below the diagonal correspond to misclassification.
Accuracy is measured by the area under the ROC curve: an area of 1 represents a perfect
test, while an area of 0.5 represents a worthless test. The area under the ROC curve in
Figure 3.2 indicates good accuracy of the system.
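The quantities plotted in Figure 3.2 are derived from each confusion matrix as in the small helper below; the counts in the usage example are hypothetical, not the actual ones behind Figures 3.4-3.11.

```python
def roc_point(tp, fn, fp, tn):
    """Derive one ROC-space point and its accuracy from a confusion
    matrix of true/false positives and negatives."""
    tpr = tp / (tp + fn)                    # sensitivity
    fpr = fp / (fp + tn)                    # 1 - specificity
    acc = (tp + tn) / (tp + fn + fp + tn)
    return fpr, tpr, acc

# hypothetical counts for one annotated sample of 60 keyphrases
fpr, tpr, acc = roc_point(tp=25, fn=10, fp=9, tn=16)
better_than_random = tpr > fpr   # point lies above the ROC diagonal
```

A point with TPR greater than FPR sits above the diagonal of the ROC space, i.e., the prediction is better than random guessing.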
3.5 Summary
We built an approach that is simple, direct and easy to understand, and that helps
researchers identify a set of keyphrases and their corresponding originating documents from
an available text corpus. Using the notion of time-stamped documents, we identified lists
of keyphrases from multiple conferences along with the associated conference years. We
come up with
Table 3.9: List of global keyphrases and originating documents from multiple conferences.
Keyphrases Conference Paper Title First Author
Year
stream pro- ICDE Dynamic load management for distributed Yongluan
cessing system 2005 continuous query systems Zhou
communication SIGMOD Holistic aggregates in a networked world: Graham
network 2005 distributed tracking of approximate quantiles Cormode
query network ICDE High-Availability algorithms for Jeong-Hyon
2005 distributed stream processing Hwang
compression SIGMOD Integrating compression and execution in Daniel J.
technique 2006 column-oriented database systems Abadi
stream PAKDD An incremental data stream clustering Jing Gao
clustering 2005 algorithm based on dense units detection
web graph SIGMOD Graph summarization with bounded error Saket
2008 Navlakha
RNN query VLDB On computing top-t most influential spatial Tian Xia
2005 sites
kernel matrices KDD Learning the kernel matrix in discriminant Jieping Ye
2007 analysis quadratically constrained prog.
peculiar data ICDM The PDD framework for detecting Mahesh
2006 categories of peculiar data Shrestha
quasi-identifier SIGMOD Incognito: Efficient full-domain Kristen
attributes 2005 k-anonymity LeFevre
clustering ICDE A framework for clustering uncertain Charu C.
uncertain data 2008 data streams Aggarwal
a small list of keyphrases and their first documents out of the huge number of research
articles published annually. To evaluate the performance of our approach, we manually
checked our
Figure 3.2: ROC Space: Comparison of Different Predicted Results.
results and built the metric of incorrectly classified results with the help of colleagues
whom we consider experts in this field.
In this chapter, we identified keyphrases and their originating documents from an available
corpus, but we did not address related keyphrases. In the next chapter, we extend our
approach to test the importance of keyphrases and to identify the keyphrases, and
corresponding originating documents, that are relevant to a user query.
Figure 3.3: Different Prediction Results
Figure 3.4: TPR=0.71, FPR=0.37, ACC=0.68
Figure 3.5: TPR=0.79, FPR=0.69, ACC=0.64
Figure 3.6: TPR=0.68, FPR=0.48, ACC=0.60
Figure 3.7: TPR=0.46, FPR=0.67, ACC=0.43
Figure 3.8: TPR=0.52, FPR=0.65, ACC=0.44
Figure 3.9: TPR=0.63, FPR=0.40, ACC=0.62
Figure 3.10: TPR=0.70, FPR=0.35, ACC=0.68
Figure 3.11: TPR=0.33, FPR=0.77, ACC=0.26
Chapter 4
Mining Topic-based Landmark Papers
4.1 Overview
In the previous chapter (Chapter 3), we described the Mining Landmark Papers (MLP)
approach, which only generates a plain list of keyphrases and their originating documents.
However, this is insufficient for new researchers who are exploring a research area: it is
much more helpful to see the structure of the research area in terms of topics and their
relationships with each other. We investigate a system in which the user enters a set of
keyphrases and the system outputs all the topics related to those keyphrases, shows the
relationships of these topics to each other, and finally outputs the landmark papers for each
of those topics. This essentially provides all the material a researcher needs to thoroughly
understand the foundations of that research area. We formalize this problem and present
a method for solving it, involving the following steps:
1. Finding relevant keyphrases from the global list of keyphrases which are extracted from
the research document collections.
2. Constructing weighted graph of keyphrases.
3. Matching user given query with keyphrases.
4. Evaluating Keyrank of keyphrases.
5. Identifying k-nearest neighbors of each keyphrase.
6. Evaluating Keyscore of keyphrases.
7. Recommending the top-k keyphrases related to the user query.
8. Identifying originating document corresponding to each keyphrase.
In the next few sections, we explain the steps of our methodology in detail. Further, three
case studies in this context are shown in Section 4.3.
4.2 Steps of Methodology
4.2.1 Mining Related Keyphrases
Definition 7 (Related keyphrases) Given an area of research q (specified by keyphrases)
and a set of keyphrases {k1, k2, ..., kn}, our aim is to identify the top-k related keyphrases
that are relevant to the user's query. To find related keyphrases, our intuition is based on
two factors: (i) keyrank, and (ii) the distance1 between keyphrases (i.e., distance(k1, k2)).
A keyphrase k1 is said to be more related to the user's query q than another keyphrase k2 if
1. keyrank(k1) > keyrank(k2), and
2. distance(q, k1) < distance(q, k2).
Further, if keyrank(k1) = keyrank(k2) and distance(q, k1) < distance(q, k2), then k1 is the
more related keyphrase to the query point q; that is, among keyphrases with the same
keyrank, the one closer to the query point is considered more related. Likewise, if
distance(q, k1) = distance(q, k2) and keyrank(k1) > keyrank(k2), then the keyphrase with
the higher keyrank value, k1, is considered more related to the query point q.
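When the candidates need a total order rather than only the pairwise comparisons above, one consistent realization (our reading of Definition 7, with distance as the primary key and keyrank breaking ties) is a lexicographic sort:

```python
def rank_related(candidates, distance_to_query, keyrank):
    """Order keyphrases by (smaller distance to the query, higher keyrank).
    A keyphrase that dominates another on both factors, as in the
    definition, always sorts first."""
    return sorted(candidates,
                  key=lambda k: (distance_to_query[k], -keyrank[k]))

order = rank_related(
    ["a", "b", "c"],
    distance_to_query={"a": 0.2, "b": 0.1, "c": 0.1},
    keyrank={"a": 0.9, "b": 0.9, "c": 0.5},
)
```

Here "b" precedes "c" because they are equidistant from the query but "b" has the higher keyrank, and "a" comes last as the farthest candidate.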
In Section 3.3.2, we presented our method to identify the set of keyphrases in each document.
After extracting the keyphrases from the research documents, our aim is to mine the set
1Here, distance is not necessarily a metric and need not satisfy the metric-space conditions. It is called
a distance because it represents the proximity between two keyphrases in terms of their co-occurrence.
of frequent keyphrases. We use frequent itemset mining for this purpose. In this section,
we first recall some basic definitions and background on the frequent itemset mining
problem, and then represent the {document, keyphrase} mapping mathematically, in
analogy to the {tid, set of items} mapping.
Definition 8 (Itemset) An itemset is a set of items, where each item is an element drawn
from a finite set of possible items. An itemset containing k items is said to have length k
and is denoted a k-itemset.
Definition 9 (Frequent Itemset) A frequent itemset is a set of items which occur together
frequently in the dataset.
Mathematically, an itemset X is frequent if
support(X) ≥ minsup (Minimum Support)
where the support of an itemset X in a database D is the fraction of records of D that
contain X as a subset:
support(X) = count(X) / |D|
Here count(X) is the number of records of D that contain X as a subset, and the minimum
support (minsup) is a user-defined threshold indicating that itemsets whose support falls
below it are not interesting or relevant.
The input data for frequent itemset mining consists of records, each of which contains a set
of items: for example, customers buying sets of items from a supermarket, or users
submitting sets of words as queries to a search engine. The task is to find all itemsets that
occur more frequently than some user-specified threshold. Consider the example transactional
dataset given in Table 4.1.
For Table 4.1, the count of the itemset {tomato, potato} is 3, so its support is 3/5 = 0.60.
If the minimum support is 0.5, then the set of all frequent itemsets is {{tomato}, {potato},
{chilly}, {carrot}, {tomato, potato}, {tomato, chilly}, {potato, chilly}, {carrot, chilly}}.
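For a collection as small as Table 4.1, the supports can be verified by brute-force enumeration. The toy sketch below is for illustration only; FP-Growth, which we use on the real data, avoids scanning exponentially many candidates.

```python
from itertools import combinations

# Transactions of Table 4.1
transactions = [
    {"tomato", "potato", "onion"},
    {"tomato", "potato", "chilly"},
    {"tomato", "onion", "carrot", "chilly"},
    {"potato", "carrot", "chilly"},
    {"tomato", "potato", "carrot", "chilly"},
]

def frequent_itemsets(db, minsup=0.5):
    """Enumerate every candidate itemset and keep those whose support
    (fraction of transactions containing the itemset) reaches minsup."""
    items = sorted(set().union(*db))
    frequent = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            support = sum(set(cand) <= t for t in db) / len(db)
            if support >= minsup:
                frequent[cand] = support
    return frequent

fi = frequent_itemsets(transactions)
# support({tomato, potato}) = 3/5 = 0.60
```

The enumeration confirms, for instance, that {onion} is infrequent (support 2/5 < 0.5) and so is pruned.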
In Table 4.2, the documents Di correspond to the transaction ids and the sets of keyphrases
kj correspond to the lists of items of the market-basket dataset in Table 4.1. This forms a
transaction space in terms of a {document, keyphrase} space, as illustrated in
Table 4.1: Sample market-basket dataset.
Transaction ID List of Items
T1 tomato, potato, onion
T2 tomato, potato, chilly
T3 tomato, onion, carrot, chilly
T4 potato, carrot, chilly
T5 tomato, potato, carrot, chilly
Table 4.2. We apply the Frequent Pattern Growth (FP-Growth) algorithm [HPYM04] to
all the documents to mine the relevant (frequent) keyphrases. The pseudo-code for
constructing an FP-tree is shown in Algorithm 2 and for mining it with FP-Growth in
Algorithm 3.
Table 4.2: Documents and Set of Keyphrases
Document Keyphrases contained in the document
D1 k1, k2, k3, k4, . . .
D2 k2, k3, k4, k5, . . .
D3 k1, k3, k5, k6, . . .
D4 k4, k5, k6, k7, . . .
D5 k1, k2, k4, k8, . . .
In this {document, keyphrase} space, a frequent combination of keyphrases (i.e., a frequent
itemset) common to a group of documents implies that those documents are similar to each
other, and keyphrases that occur frequently together across transactions are relevant to one
another. An example set of relevant keyphrases with their respective support counts is
shown in Table 4.3.
Table 4.3: Frequent Keyphrases and Support Count
Frequent keyphrases support count
timeseries 37
queryprocessing 63
datastreams 23
streamprocessing 25
skylinequery 27
queryprocessing, skylinequery 4
datastreams, streamprocessing 8
queryprocessing, skylinequery, streamprocessing 3
4.2.2 Keyphrase Evolutionary Graph
A Keyphrase Evolutionary Graph is a weighted keyphrase graph G = (N, E) in which each
vertex v ∈ N is a keyphrase, and each edge e ∈ E is a transition between the keyphrase-
nodes. The weight on an edge indicates the similarity between the two keyphrase nodes. We
use support count as the weight as it is a measure of correlation between the nodes. Clearly,
each path in a keyphrase evolutionary graph represents a relationship between the nodes -
that these nodes occur together frequently in the database.
An example of a keyphrase evolutionary graph is shown in Figure 4.1, where each ver-
tex is a keyphrase identified from the documents. Each edge is a transition between
frequent keyphrases that are relevant to each other. The distance between two nodes indicates how
close the connected keyphrases are and how trustworthy the corresponding transition
is; a higher support count implies a shorter distance between the keyphrases and a more trustworthy
transition. For example, in Figure 4.1, with root node “datamining”2, the distance be-
tween (“datamining”, “decisiontrees”) is 0.1429, which is smaller than the distance between
“datamining” and any other node, implying that “decisiontrees” is the closest keyphrase.
2For implementation, we removed the spaces within each word-pair; for example, “datamining”
corresponds to “data mining”.
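The distance weighting described above can be sketched as follows. This is an illustrative sketch, not the thesis implementation: `build_keyphrase_graph` and the sample support counts are hypothetical names, and we assume (consistently with the 0.1429 value quoted for (“datamining”, “decisiontrees”)) that the distance on an edge is simply the reciprocal of its support count.

```python
from collections import defaultdict

def build_keyphrase_graph(frequent_pairs):
    """Build an undirected keyphrase evolutionary graph from frequent
    keyphrase pairs.  The edge weight is the support count; the derived
    distance is its inverse, so more frequent co-occurrence gives a
    shorter (more trustworthy) edge.  (Assumption: distance = 1/support.)"""
    graph = defaultdict(dict)
    for (k1, k2), support in frequent_pairs.items():
        distance = 1.0 / support
        graph[k1][k2] = distance
        graph[k2][k1] = distance
    return graph

# Hypothetical support counts for two edges of Figure 4.1:
pairs = {("datamining", "decisiontrees"): 7,
         ("datamining", "datastreams"): 4}
g = build_keyphrase_graph(pairs)
# "datamining" -> "decisiontrees" has distance 1/7 = 0.1429, as in Figure 4.1
```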
Algorithm 2: FP-Tree Algorithm for finding frequent itemsets
Input: DB, a transaction database;
min_sup, the minimum support count threshold.
Ensure: The complete set of frequent patterns.
{Pre-Processing}
1: Scan the transaction database DB once.
2: Collect FI, the set of single-element frequent items, and their support counts.
3: Discard all infrequent items.
{Create FP-tree}
4: Sort FI in decreasing order of support count as FList, the list of frequent items.
{InsertFP-Tree(FI, T)}
5: Create the root of an FP-tree, T, and label it as “null”.
6: for each transaction Trans in DB do
7: select the frequent items in Trans and sort them according to the order of FList;
8: starting from the root, follow a path p as long as the sequence of elements in p is a
prefix of the sorted items FI;
9: at the node n where this longest common prefix in T ends,
10: add a path p′ of nodes as descendants of n to hold the remaining elements of FI,
maintaining the linked lists from the header table;
11: the last node in path p+p′ represents the new transaction;
12: increment the counters of all nodes in p+p′.
13: end for
4.2.3 Matching Queries and Keyphrases
Retrieval of text - based information has traditionally been termed information retrieval (IR)
and has recently become a topic of great interest with the advent of text search engines on
the Internet. Text is considered to be composed of two fundamental units, namely document
and the term. A document can be a traditional document such as a book or journal paper,
but more generally is used as a name for any structured segment of text such as chapters,
sections, paragraphs, or even e-mail messages, Web pages, computer source code, and so
Algorithm 3: FP-Growth(T, α)
1: if T contains a single path p then
2: for each combination β of the nodes in p do
3: pattern = β ∪ α
4: support = min(support of the nodes in β)
5: end for
6: else
7: for each ai in the header of T do
8: pattern β = ai ∪ α with support = ai.support;
9: construct β’s conditional pattern base;
10: construct β’s conditional FP-tree T′;
11: if T′ ≠ φ then
12: FP-Growth(T′, β)
13: end if
14: end for
15: end if
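Algorithms 2 and 3 can be sketched together in Python. This is a simplified, hedged sketch rather than a faithful FP-tree implementation: it keeps the conditional pattern bases as plain lists of (prefix, count) pairs instead of a compressed prefix tree, but the recursion over conditional pattern bases mirrors Algorithm 3 and yields the same frequent itemsets. The function name `fpgrowth` is our own.

```python
from collections import defaultdict

def fpgrowth(transactions, min_sup):
    """Mine all itemsets with support >= min_sup.  Sketch of FP-Growth:
    recursion on conditional pattern bases, with lists standing in for
    the compressed FP-tree of Algorithm 2."""
    # Fix a global item order (decreasing support, like FList) so that
    # every itemset is generated exactly once.
    freq = defaultdict(int)
    for t in transactions:
        for item in set(t):
            freq[item] += 1
    db = [(tuple(sorted(set(t), key=lambda i: (-freq[i], i))), 1)
          for t in transactions]

    def mine(db, suffix, out):
        counts = defaultdict(int)
        for items, c in db:
            for item in items:
                counts[item] += c
        for item, sup in counts.items():
            if sup < min_sup:
                continue
            pattern = frozenset(suffix | {item})
            out[pattern] = sup
            # Conditional pattern base: for each transaction holding
            # `item`, keep the items preceding it in the fixed order.
            cond = [(items[:items.index(item)], c)
                    for items, c in db
                    if item in items and items.index(item) > 0]
            if cond:
                mine(cond, set(pattern), out)
        return out

    return mine(db, set(), {})

# The five transactions of Table 4.1:
db = [["tomato", "potato", "onion"],
      ["tomato", "potato", "chilly"],
      ["tomato", "onion", "carrot", "chilly"],
      ["potato", "carrot", "chilly"],
      ["tomato", "potato", "carrot", "chilly"]]
frequent = fpgrowth(db, min_sup=3)
# frequent[frozenset({"tomato", "potato"})] == 3
```

On the Table 4.1 data with min_sup = 3, this yields four frequent single items (tomato, potato, chilly, carrot) and four frequent pairs.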
forth. A term can be a word, word-pair, or phrase within a document, e.g., the word data
or word-pair data mining.
Traditionally in IR, text queries are specified as sets of terms. Although documents will
usually be much longer than queries, it is convenient to think of a single representation
language that we can use to represent both documents and queries. By representing both
in a unified manner, we can begin to think of directly computing distances between queries
and documents, thus providing a framework within which to directly implement simple text
retrieval algorithms.
In this section, we cover the different kinds of queries normally posed by the user to text
retrieval systems. The type of query the user might formulate is largely dependent on the
underlying information retrieval model. An important issue is that most query languages try
to use the content (i.e., the semantics) and the structure of the text (i.e., the text syntax)
to find relevant documents; when matching on syntax alone, the system may fail to find the relevant answers.
Figure 4.1: An Example of Keyphrase Evolutionary Graph
For this reason, a number of techniques meant to enhance the usefulness of queries exist.
Examples include the expansion of a word to the set of its synonyms, the use of a
thesaurus, and stemming to put together all the derivatives of the same word. Moreover, some
words which are very frequent and do not carry meaning (such as ‘the’), called stopwords,
are removed. We first show the queries that can be formulated with keyword-based query
languages. They are aimed at information retrieval, including simple words and phrases as
well as boolean operators which manipulate sets of documents. In the second part, we cover
pattern matching, which includes more complex queries.
Keyword-based Querying
A query is the formulation of a user information need. In its simplest form, a query is
composed of keywords and documents containing such keywords are searched for. Keyword-
based queries are popular because they are intuitive and easy to express. Thus, a query can
be simply a word, although it can in general be a more complex combination of operations
involving several words.
In the rest of this section we will refer to single-word and multiple-word queries as basic
queries. Patterns, which are covered in Section 4.2.4, are also considered basic queries.
Single-Word Queries
The most elementary query that can be formulated by a user is a word. Text documents are
assumed to be essentially long sequences of words. A more general view allows us to see the
text in this perspective and also to see the internal division of words into letters. In the
latter case, we can search for other types of patterns. The set of words retrieved by these
extended queries is considered as the keywords matching the query.
A word is normally defined in a rather simple way: the alphabet is split into ‘letters’ and
‘separators’, and a word is a sequence of letters surrounded by separators.
Phrase Queries
A phrase is a sequence of single-word queries, and an occurrence of the phrase is the
corresponding sequence of words. For instance, a user may search for the word ‘knowledge’
followed by the word ‘discovery’. In phrase queries it is normally understood that the separators
in the text need not be the same as those in the query (e.g., two spaces versus one space).
Different separators such as ‘,’, ‘and’, ‘&’, ‘or’ and ‘|’ can be specified by the user in a
query. In our system, we have implemented several possible combinations, for example:
1. if a user gives a query like (a, b), then we consider a and b as two separate
keyphrases.
2. if a query is given like (a and b), then we search for both a and b separately; ‘and’ is
also treated as a boolean operator, and all answers satisfying both a and b are selected.
3. if a query is like (a or b), then all keyphrases satisfying a or b are selected, with
duplicates eliminated.
4.2.4 Pattern Matching
In this section, we first discuss more specific query formulations, based on the concept of a
pattern, which allow the retrieval of pieces of text that have some property.
A pattern is a set of syntactic features that must occur in a text segment. Those segments
satisfying the pattern specifications are said to ‘match’ the pattern. We are interested in the set
of keyphrases which match a given search pattern. Patterns can range from very simple
(for example, words) to rather complex (such as regular expressions). The most commonly
used types of patterns are:
• Words A string (sequence of characters) which must be a word in the text (see Sec-
tion 4.2.3). This is the most basic pattern.
• Ranges A pair of strings which matches any word lying between them in lexicograph-
ical order. The alphabet is normally sorted, and this induces an order on the strings
called lexicographical order (this is indeed the order in which words in a dictio-
nary are listed). For instance, the range between the words ‘data’ and ‘drug’ will retrieve
strings such as ‘document’ and ‘dictionary’.
• Prefixes A string which must form the beginning of a text word. For example, given
the prefix ‘comput’ all the words such as ‘computer’, ‘computation’, ‘computing’, etc.
are retrieved.
• Substrings A string which can appear within a text word. For instance, given the sub-
string ‘tal’ all the documents containing words such as ‘talk’ and ‘metallic’ are
retrieved. This query can be restricted to find the substring inside words, or it can
go further and search for the substring anywhere in the text (in this case the query is not
restricted to be a sequence of letters but can contain word separators). For instance,
a search for ‘any flow’ will match in the phrase ‘...many flowers...’
• Allowing errors A word together with an error threshold. This search pattern re-
trieves all text words which are ‘similar’ to the given word. The concept of similarity
can be defined in many ways. The general idea is that the pattern or the text may
contain errors (coming from typing or spelling), and the query should try to retrieve the
given word as well as its likely erroneous variants. Although there are many
measures of similarity between words, the most generally accepted is the Levenshtein
distance, or simply edit distance. The query therefore specifies the maximum number
of errors allowed for a word to match the pattern (i.e., the maximum allowed edit
distance). For example, if a typing error splits ‘flower’ into ‘flo wer’, it could still be
found with one error, while in the restricted case of words it could not (since neither
‘flo’ nor ‘wer’ is at edit distance 1 from ‘flower’).
• Regular Expressions A regular expression is a rather general pattern built up from
simple strings (which are meant to be matched as substrings) and the following oper-
ators:
- union: if e1 and e2 are regular expressions, then (e1 | e2) matches what either e1 or e2
matches.
- concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are
formed by the occurrences of e1 immediately followed by those of e2 (therefore simple
strings can be thought of as concatenations of their individual letters).
- repetition: if e is a regular expression, then (e∗) matches sequences of zero or more
contiguous occurrences of e.
For example, consider a query like ‘pro(blem | tein)(s | Φ)(0 | 1 | 2)∗’ (where Φ denotes
the empty string). It will match words such as ‘problem02’ and ‘proteins’.
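The example query above can be checked directly with a regular expression engine. A minimal sketch using Python’s `re` module; the only translation needed is that the empty alternative Φ becomes an optional ‘s’:

```python
import re

# The query 'pro(blem | tein)(s | Φ)(0 | 1 | 2)*' in Python regex syntax;
# (s | Φ) is written as the optional group 's?'.
query = re.compile(r"pro(?:blem|tein)s?(?:0|1|2)*")

for word in ["problem02", "proteins", "protein", "problems1"]:
    assert query.fullmatch(word) is not None   # all four match
assert query.fullmatch("prot") is None         # incomplete stem: no match
```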
4.2.5 Evaluating Keyphrase Ranking (i.e. Keyrank)
In the previous steps, after finding all keyphrases matched with user’s query and corre-
sponding keyphrase evolutionary graph, our task is to identify the top-k keyphrases which
are relevant to the user’s query. As discussed in Section 4.2.1, for recommending the top k
related keyphrases, our approach is based on keyphrase ranking and nearest-neighbor
keyphrases.
In this section, we first discuss the keyphrase ranking scheme with the help of an example.
This way, the user can choose to see only the category of results of his interest. The core
idea is that if a keyphrase is relevant to many other keyphrases, then it is likely that
the keyphrase has ‘high impact’ with respect to the other keyphrases. Further, a keyphrase
may be considered even more influential if it is linked from a large number of ‘high impact’
keyphrases. The results of a search query can then be ranked based on their ‘impact’.
The above idea leads us to rank the keyphrases using the traditional PageRank algorithm [BP98].
PageRank is an iterative link-analysis algorithm that has been proved to converge [Hav99, ANTT02].
Developed by Larry Page and Sergey Brin and used by the Google Internet search engine,
it initially assigns an equal importance (or Keyrank) to all keyphrases in the set, with the
purpose of “measuring” their relative importance within the set. Then, in each iteration, it
refines this value using the following formula:
Keyrank(x) = ∑_y Keyrank(y) / |links(y)|    (4.1)
where y ranges over the set of keyphrases that link to x, and |links(y)| is the number of links
of keyphrase y. The pseudo-code for Keyrank is shown in Algorithm 4.
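Equation 4.1 can be sketched as a plain power iteration. This is an illustrative sketch (Algorithm 4 itself is not reproduced here); the function name `keyrank` and the toy adjacency lists are our own, and no damping factor is used, exactly as in Equation 4.1:

```python
def keyrank(graph, iterations=50):
    """Iterate Equation 4.1: in each round, every keyphrase y distributes
    its current Keyrank evenly over the keyphrases it links to.
    graph maps each keyphrase to the list of keyphrases it links to."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}   # equal initial importance
    for _ in range(iterations):
        new = {n: 0.0 for n in nodes}
        for y, links in graph.items():
            for x in links:
                new[x] += rank[y] / len(links)
        rank = new
    return rank

# Hypothetical undirected keyphrase graph as adjacency lists:
g = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
ranks = keyrank(g)
```

On an undirected graph this iteration converges to ranks proportional to node degree, so the best-connected keyphrase ends up with the highest Keyrank.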
Figure 4.1 shows a keyphrase connectivity graph. Using this graph, we have calculated the
keyphrase ranks. Table 4.4 shows the list of keyphrases and their respective keyranks.
4.2.6 Identifying k - Nearest Neighbors of each Keyphrase
After we have identified the keyranks of the keyphrases, another important step is to find the
nearest-neighbor keyphrases, i.e., the set of keyphrases that are closest to a query point. Let
us now make the notion of nearest-neighbor keyphrases more precise.
Definition 10 (k - Nearest Neighbor) Suppose we have a collection K of keyphrases.
For a nearest-neighbor query, we are given a query point q, and the goal is to determine the
nearest-neighbor set NN(q), defined as
NN(q) = {r ∈ K | ∀ p ∈ K : d(q, r) ≤ d(q, p)}
To find the k nearest neighbors of a given query point q, we use Dijkstra’s shortest-path
algorithm. Below, we first define the shortest path mathematically and then explain
Dijkstra’s algorithm in detail.
Definition 11 (Shortest Path) Given a graph G = (V, E, w), where V is a set of vertices,
E is a set of edges and w is a weight function that maps edges to real-valued weights, a path p
from a vertex u to a vertex v in this graph is a sequence of vertices (v0, v1, ..., vk) such that
u = v0, v = vk and (vi−1, vi) ∈ E. The weight w(p) of this path is the sum of the weights of
its edges: w(p) = w(v0, v1) + w(v1, v2) + ... + w(vk−1, vk). A shortest path from u
to v is a path with minimum weight among all paths from u to v.
Definition 12 (Dijkstra’s Algorithm) For a given source vertex (node) in the graph, the
algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and
every other vertex. For example, if the vertices of the graph represent keyphrases and edge
weight represent distances between pairs of keyphrases connected by a direct link, Dijkstra’s
algorithm can be used to find the shortest route between one keyphrase (i.e. query point or
source vertex) and all other keyphrases. The pseudo-code for Dijkstra’s algorithm has been
shown in Algorithm 5.
The keyphrase evolutionary graph of Figure 4.1 is one where each node is a keyphrase, an edge
between two nodes represents connectivity between them, and the weight associated with an edge
shows how frequently those keyphrases occur together in the database. For finding the
shortest distance between the root node and all other nodes, instead of using that weight
(i.e., the support count) directly, we reframe the weight as the inverse of count(K), where count(K) is
the number of times the keyphrase occurs in the transactions. Using these weights, we
have calculated the shortest distances between the nodes by using Algorithm 5.
Example: Considering the source vertex “datamining”, we have calculated the distance
between the source vertex and all other nodes. The distance of the source node from itself is 0.
Table 4.5 presents the keyphrases and their corresponding distances from the source vertex
(i.e., the query point).
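The shortest-distance computation above can be sketched with a standard heap-based Dijkstra. This is a minimal sketch (Algorithm 5 itself is not reproduced here); the toy graph fragment and its inverse-support distances are hypothetical values for illustration:

```python
import heapq

def dijkstra(graph, source):
    """Shortest distance from source to every reachable node.
    graph maps each node to {neighbour: edge_distance}; the distances
    (inverse support counts) are non-negative, as Dijkstra requires."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, skip
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd              # found a shorter path to v
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical fragment of a keyphrase graph with inverse-support distances:
g = {"datamining": {"decisiontrees": 0.1429, "datastreams": 0.25},
     "decisiontrees": {"datamining": 0.1429},
     "datastreams": {"datamining": 0.25, "streamprocessing": 0.125},
     "streamprocessing": {"datastreams": 0.125}}
dist = dijkstra(g, "datamining")
# dist["streamprocessing"] == 0.25 + 0.125 == 0.375
```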
4.2.7 Evaluating Keyscore of Keyphrases
As discussed in Section 4.2.1, for identifying the keyphrases which are related to the
user’s query, we have introduced the notion of “degree of importance” of a keyphrase. For
a given user’s query, there is a possibility that many keyphrases are similar to the query,
and the user is overwhelmed by the results. To avoid this, we assign a unique
“Keyscore” value to each keyphrase, defined as:

Keyscore = Keyrank / distance    (4.2)
The intuition is that:
• the Keyrank of the keyphrase should be high, and
• the distance (i.e., the shortest distance) between the keyphrase and the query point should be low.
Table 4.6 shows the keyphrases and their respective keyscores.
4.2.8 Recommending Keyphrases
Our main intention is to recommend k keyphrases to the user, where k is a constant chosen
by the user. The goal is to recommend the keyphrases having high keyscore values,
thereby ensuring that the recommended keyphrases have high keyphrase rank and minimum
shortest distance. We transform the current problem into the knapsack problem in the
following manner.
Definition 13 (Knapsack Problem) Given a set of n items, we are to select some number
of items to be carried in a knapsack. Each item has both a weight and a profit.
The objective is to choose the set of items that fits in the knapsack and maximizes the profit.
We reduce our problem to a knapsack problem using the following equivalences.
Given a set of keyphrases {k1, k2, ..., kn} and keyphrase scores {ks1, ks2, ..., ksn},
let wi be the weight of the ith item, ksi be the profit accrued when the ith item is carried
in the knapsack, and C be the capacity of the knapsack. Let xi be a variable whose value is
either zero or one; xi has the value one when the ith item is carried in the knapsack. Our
objective is to maximize

∑_{i=1}^{n} ksi xi    (4.3)

where ksi = kri / ksti, in which kri denotes the keyrank and ksti the shortest distance of the
ith keyphrase, subject to the constraint

∑_{i=1}^{n} wi xi ≤ C    (4.4)
• The items to carry in the knapsack correspond to the keyphrases matching the user’s
query.
• The item weight (in the knapsack problem) is set to 1 for all keyphrases in our problem.
• The weight limit (in the knapsack problem) is set to k, the number of keyphrases to
recommend to the user based on the query.
After identifying the keyscore of each keyphrase, we sort the keyphrases in descending
order of their scores and output the top k keyphrases related to the user’s
query.
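Because every item weight is 1 and the capacity is k, this 0/1 knapsack reduces exactly to picking the k keyphrases with the highest keyscore, so a sort suffices. A minimal sketch; the function name `recommend` is our own, and the sample keyrank and distance values are those quoted later for Case Study 1:

```python
def recommend(keyranks, distances, k):
    """Keyscore = Keyrank / shortest distance (Equation 4.2).  With unit
    item weights and knapsack capacity k, maximizing total profit means
    simply taking the k keyphrases of highest keyscore."""
    scores = {kp: keyranks[kp] / distances[kp]
              for kp in keyranks if distances[kp] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Values quoted in Case Study 1 (Tables 4.7 and 4.8):
keyranks = {"informationgain": 0.078212, "datamining": 0.076816,
            "classlabel": 0.071695}
distances = {"informationgain": 0.1667, "datamining": 0.2,
             "classlabel": 0.25}
top = recommend(keyranks, distances, 2)
# top == ["informationgain", "datamining"]
```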
4.2.9 Identify Originating Document
Finally, we identify the originating documents of the obtained related keyphrases. For finding
these originating documents, the methodology described in Section 3.3 is used.
4.3 Experiments
In this section, we measure how well our proposed method for finding related keyphrases
performs on real data. As it is hard to objectively define the notion of “relevance” of
keyphrases, and due to the absence of competing approaches, we demonstrate the effective-
ness of our system through three case studies. We also compare our results with Google
Scholar by using the Google similarity distance measure and show that they are “statistically
significant” by using Spearman’s rank correlation and the t-test at the 5% and 1% significance
levels.
4.3.1 Case Study 1
For a given user’s query specified by keyphrases, our aim is to first identify those keyphrases
which match the query, and then identify the top k most related keyphrases based on
their importance (i.e., keyscore), together with the corresponding originating documents
(i.e., landmark papers) of those keyphrases. To obtain this, we perform the sequence of steps
discussed in Section 4.2.
Keyphrase Graph
Consider that the user has given the query term as “decision trees”. First we identify the
set of keyterms which occur frequently together with the query term from the exhaustive
list of keyphrases. Using these, we construct the keyphrase evolutionary graph shown in
Figure 4.2. In Figure 4.2, only the relevant portion of the keyphrase graph has been shown
with support counts as weights of edges of the graph.
Keyphrase Ranking
Once we obtain a set of relevant keyphrases, we show the impact of each keyphrase in the
keyphrase graph of Figure 4.2. The impact of a keyphrase is calculated by using the KeyRank
formula described in Section 4.2.5. After finding the keyphrase ranks, the values are
normalized using the z-score defined as:

z-score = (x − min + δ) / (max − min)    (4.5)
where x is the calculated keyrank value of a keyphrase, and max and min are the maximum and
Figure 4.2: Keyphrase Evolutionary Graph relevant to the query term “decision trees”.
minimum values over the whole range of keyrank values. The term δ = min/10 is introduced
to decrease the influence of the max and min values. With the z-score formula, keyrank
values lie approximately in the range [0, 1]. Table 4.7 shows the list of keyphrases and their
corresponding keyranks in increasing order.
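The normalization of Equation 4.5 can be sketched as follows; the function name `normalize_keyranks` and the raw values are hypothetical, and we assume at least two distinct keyrank values so that max > min:

```python
def normalize_keyranks(keyranks):
    """Equation 4.5: z = (x - min + delta) / (max - min), delta = min/10.
    Maps raw keyrank values approximately into [0, 1]; assumes at least
    two distinct values so that max > min."""
    lo, hi = min(keyranks.values()), max(keyranks.values())
    delta = lo / 10.0
    return {k: (x - lo + delta) / (hi - lo) for k, x in keyranks.items()}

# Hypothetical raw keyrank values:
raw = {"a": 0.1, "b": 0.2, "c": 0.3}
z = normalize_keyranks(raw)
# z["a"] == (0.1 - 0.1 + 0.01) / 0.2 == 0.05
```

Note that because of the δ term, the largest value maps slightly above 1 (here 1.05), which is why the range is approximate.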
Nearest Neighbor Keyphrases
In order to identify the set of keyphrases closest to the query point, we have calcu-
lated the distance (i.e., the shortest distance) between the query point and all other keyphrases. For
calculating the distances between keyphrases, the keyphrase graph shown in Figure 4.2 and the
shortest-path algorithm described in Section 4.2.6 are used. Table 4.8 shows the keyphrases
and their distances from the source vertex in increasing order.
Keyscore and Recommending k - keyphrases
Finally, for recommending k keyphrases related to the query term, we assign a unique
value to each keyphrase to decide which keyphrase is more relevant in terms of its keyscore
value. The keyscore of each keyphrase is calculated using the formula described in
Section 4.2.7. The list of keyphrases and their respective keyscore values in increasing order
is shown in Table 4.9.
Analysis of Relevance Results
Table 4.9 shows the set of top k recommended keyphrases for the query “decision trees”, based
on their keyscore values in increasing order. In Section 4.2.7, we mentioned that the keyscore value
of a keyphrase is calculated using its Keyrank and its distance (i.e., shortest distance) from
the query point. Table 4.7 shows the list of keyphrases sorted by keyrank. Table 4.8 shows
the keyphrases and their distances from the query point, i.e., the list of keyphrases that are
close to the query point.
In Table 4.9, we notice that the first recommended relevant keyphrase is “informationgain”
with keyscore value 0.469178; it has keyrank value 0.078212 in Table 4.7, and its distance
from the query point is 0.1667 in Table 4.8. This implies that the keyphrase with a high
keyrank value and a low distance among all keyphrases is the first recommended related keyphrase
for the query point.
Further in Table 4.9, the keyphrases “datamining” and “classlabel”, with keyscore values
0.384080 and 0.286780, are recommended as related keyterms for the query. From Table 4.7,
we can notice that the keyphrase “datamining” has a higher keyrank value (0.076816)
than the keyphrase “classlabel” (0.071695).
From Table 4.8, we can see that the keyphrase “datamining” has a lower distance
(0.2) than the keyphrase “classlabel” (0.25). This implies that the keyrank of a
keyphrase dominates over the distance value here, and whichever has the higher keyrank
value is recommended first; so the keyphrase “datamining” is recommended before the
keyphrase “classlabel”.
Next, in Table 4.7, the keyphrases “streamingdata” and “streamprocessing” both have the
same keyrank value, 0.054004, but the keyphrase “streamingdata” is recommended after
the keyphrase “streamprocessing”, because it is farther from the query point than
“streamprocessing”, which is at distance 0.2. This implies that when there is a tie in
keyrank, the distance measure efficiently breaks the tie, and whichever keyphrase is nearest to the
query point is recommended first. Also, consider the keyphrases “regressionproblems” and
“neuralnetwork”: the keyphrase “regressionproblems” is recommended before the keyphrase
“neuralnetwork” because it has a higher keyscore value. Since the keyscore value is affected by the
keyranks and the distances, notice from Table 4.8 that both of these keyphrases are at the same
distance (0.3333) from the query point, whereas the keyrank value of “regressionproblems” is
higher than the keyrank value of “neuralnetwork” (0.052142). This shows that when there
is a tie in distance, the keyrank measure efficiently breaks it, and the keyphrase with the higher
keyrank value is recommended first.
Further, in Figure 4.2, we notice that keyphrases such as “associationrules”, “frequentitem-
set” and “clusteringalgorithms” are connected to the query point “decisiontrees” through the
node “datamining”. The keyphrases “streamingdata” and “streamprocessing” are connected
to “decisiontrees” through “datamining” and “datastreams”. In Table 4.9, these keyphrases
are recommended later because they are connected only indirectly to the query point. We can no-
tice in Table 4.7 that the keyphrases “datastreams”, “associationrules”, “frequentitemset” and
“clusteringalgorithms” all have the same keyrank value, 0.049348, but the keyphrase “datas-
treams” is at a lower distance (0.3929) from the query point than the remaining keyphrases.
As discussed above, when there is a tie in keyrank, the distance measure
efficiently breaks the tie and dominates the keyscore value, and whichever keyphrase has the lowest
distance is closest to the query point and is recommended before the others. So the keyphrases
“datastreams”, “associationrules”, “frequentitemset” and “clusteringalgorithms” have the
same keyrank value (0.049348), but, owing to their distances from the query point, they are
not close to it and are therefore recommended later in Table 4.9.
In the next few sections, we consider different case studies for a few other types of
queries that a user may pose, and discuss separately how our approach handles each
situation. The methodology followed in these case studies is similar to the previous case study
described in Section 4.2.
4.3.2 Case Study 2
Let us suppose the user entered a query combining two keyphrases (“clustering uncertain
data”, “data streams”). Our aim is to first identify those keyphrases which match
the user’s query, then identify the top k most related keyphrases based on their impor-
tance and proximity, and finally identify the corresponding originating documents (i.e., landmark
papers) of those keyphrases. To obtain this, the same steps discussed in Section 4.3.1 are
performed on the dataset. In this case, the user has given a query specified by two keyterms; to find
relevant keyphrases, we first match the similar terms related to this query.
In our keyphrase database, we found that these two terms occur frequently together
in the documents, so a single keyphrase evolutionary graph is obtained for the query.
Considering those frequent itemsets, and also the other related keyphrases which occur fre-
quently with either keyphrase, we obtain the keyphrase graph for the query terms. As
mentioned earlier, each keyphrase can connect with other keyphrases either directly or indi-
rectly, but the keyphrases which belong to the nearest-neighbor area and have high impact
are the most influential for the user’s query. Figure 4.3 shows the keyphrase graph of the keyphrases
which occur together in the documents for the query (“clustering uncertain data”, “data
streams”).
Figure 4.3: Keyphrase Evolutionary Graph for query terms (“clustering uncertain data”,
“data streams”).
After identifying the keyphrase graph, we have calculated the keyranks and the nearest-neigh-
bor keyphrases using the shortest-distance algorithm. Figure 4.3 shows the keyphrase graph
used for keyrank identification. Table 4.10 shows the keyphrases and their corresponding keyranks in
increasing order.
Table 4.11 shows the keyphrases and their distances from the query point.
Finally, Table 4.12 shows the final set of related keyphrases and their respective keyscores in
increasing order, together with their ranks.
Analysis of Relevance Results
For the user’s query (“clustering uncertain data”, “data streams”), Table 4.12 shows that
the most relevant keyphrase, considering both impact and proximity, is
“uncertaindatastreams” with keyscore value 0.297765; it also has the highest
keyrank value, 0.99245 [refer Table 4.10], and the lowest distance, 0.3333 [refer
Table 4.11], among all keyphrases. Next, the keyphrase “streamclustering” has the 2nd
highest keyscore value, 0.223361, and is recommended at the 2nd position. From Table 4.10, we
can notice that “streamclustering” has the 2nd highest keyrank value among the remaining
keyphrases, but from Table 4.11 we can see that this keyphrase does not have the lowest
distance among them. Despite this, it is recommended before the keyphrases
“uncertaindata” and “clusteringalgorithms”, which signifies that the keyrank measure
dominates the keyscore value and a high-impact keyphrase is recommended before a nearer
neighbor.
Furthermore, when two keyphrases have similar keyrank values for the query, the
distance measure efficiently resolves the situation with a dominating effect, and the
keyphrase with the smaller distance value is considered first. For
example, consider the keyphrases “streamingdata” and “streamprocessing” from Table 4.10: both
have the same keyrank value, 0.055589, but owing to its smaller distance value, 0.8012,
“streamingdata” is recommended before the keyphrase “streamprocessing”. On the other
hand, keyphrases which are equidistant from the query point are ordered by keyrank
value. For example, in Table 4.11, the keyphrases “uncertain-
data” and “clusteringalgorithms” are both at distance 0.3333 from the query point, but owing
to the higher keyrank value of “uncertaindata” in comparison to “clusteringalgorithms”, it is
recommended before the other keyphrase.
4.3.3 Case Study 3
Lastly, we consider the case where the user has given the query (“skyline query”, “target schema”). Here
the query is specified by two keyterms; to find relevant keyphrases based on the
keyphrases specified by the user, we first identify frequent keyterms which occur together and
match the similar terms related to this query. In our database, we did not find any direct
relationship between these two keyphrases, so these terms are treated as separate keyphrases,
and for each term a list of related keyphrases is identified based on impact and nearest
distance. For each sub-query term a separate keyphrase graph
has been drawn. Figure 4.4 shows “skyline query” and its graph of relevant keyphrases,
and Figure 4.5 shows the graph of all keyphrases which were found frequent in the
database with the sub-query term “target schema”.
Figure 4.4: Keyphrase Evolutionary Graph relevant to the query term “skyline query”.
For recommending the top k keyphrases related to the user’s query, the keyscore of each keyphrase
has been obtained using the keyranks and the nearest-neighbor distances as discussed in the
previous section. Tables 4.13 and 4.14 show the sets of keyphrases and their corresponding
keyranks relevant to the respective sub-query terms “skyline query” and “target
schema”.
Tables 4.15 and 4.16 show the keyphrases and their corresponding distances from the
keyphrases relevant to the sub-query terms “skyline query” and “target schema”, respectively.

Figure 4.5: Keyphrase Evolutionary Graph relevant to the query term “target schema”.
Finally, Tables 4.17 and 4.18 show the keyphrases and their keyscores with respect to the
sub-query terms “skyline query” and “target schema”. After calculating the keyscore values, we
recommend the set of top k keyphrases having the highest keyscore values.
Analysis of Relevance Results
For the query (“skyline query”, “target schema”), we first discuss the analysis for the sub-query
term “skyline query” and later for the sub-query term “target schema”. From Table 4.17, we
can see that for the sub-query term “skyline query”, the first recommended related keyphrase
is “skylinecomputation”, with the highest keyscore value, 0.766994, because it has the highest
keyrank value, 0.085213 [refer Table 4.13], and is the closest point to the query term
[refer Table 4.15]. Also, we can see that a keyphrase such as “skylinequeryprocessing” is
recommended earlier even though it is farthest from the query point, because of its high keyrank value
in comparison to “skylinepoints” and “skylineobjects”, which are nearest to the query point
but have low keyrank values.
Further, we can verify that when there is a tie in the keyranks of keyphrases, the keyphrase with the lower distance value is recommended first, as with the keyphrases “datapoints” and “movingobjects”. On the other hand, for keyphrases equidistant from the query point, the keyphrase with the higher keyrank value is recommended first, as verified by keyphrases such as “skylinequeryprocessing” and “skylinealgorithms”.
For the sub-query term “target schema”, from Table 4.18 the top recommended keyphrase is “sourceinstance”, with a keyscore value of 0.415500, a high keyrank value of 0.059375 [from Table 4.14], and a distance of 0.1429 [from Table 4.16]; this implies that the keyphrase with a high keyrank value at the least distance is always the top-priority keyphrase. Similarly, keyphrases such as “sourceschema”, “schemamapping” and “targetinstance”, which have high keyrank values and low distances from the query point in that order, are recommended one after another. Also, for tied keyrank values, the lower distance dominates the keyscore value and the closest keyphrase is recommended first, e.g. “schemamatching”, “schemaevolution” and “dataintegration”. Furthermore, the keyphrases “mappingcomposition” and “sourceandtarget” are both at an equal distance, i.e. 0.5333 [from Table 4.16], from the query point, but due to its higher keyrank value, i.e. 0.041667 [from Table 4.14], the keyphrase “mappingcomposition” is recommended first. This signifies that when keyphrases are at the same distance, the keyranks dominate the keyscore values and the higher-keyranked keyphrase is always recommended first.
4.3.4 Identifying Originating Documents
Once we obtain the set of related keyphrases relevant to the user’s query, we identify the originating documents of those keyphrases in the database. Table 4.19 shows a small set of those keyphrases and their originating documents, identified by the method discussed in Section 3.3. In Table 4.19, the first column shows the keyphrase, the second column the publication year and conference, the third column the paper title, i.e. the originating document, and the last column the first author of the document.
For the different case studies discussed earlier in Section 4.3, overall 46 related keyphrases were found relevant to the user’s queries. Out of these, we found originating documents for 15 keyphrases.
Algorithm 4: Keyrank(graph, dampingfactor=0.85, maxiterations=100, mindelta=0.00001)
Input: Graph of keyphrases
Output: Dictionary containing the KeyRank of every node
1: nodes = graph.nodes()
2: graphsize = len(nodes)
3: if graphsize == 0: then
4: return
5: end if
6: minvalue = (1.0-dampingfactor)/graphsize //value for nodes without inbound links
{Initialize the KeyRank dict with 1/N for all nodes}
7: keyrank = dict.fromkeys(nodes, 1.0/graphsize)
8: for i in range(maxiterations): do
9: diff = 0 //total difference compared to the last iteration
{Compute each node’s KeyRank based on inbound links}
10: for node in nodes: do
11: rank = minvalue
12: for referringkeyphrase in graph.incidents(node): do
13: rank += dampingfactor * keyrank[referringkeyphrase] / len(graph.neighbors(referringkeyphrase))
14: end for
15: diff += abs(keyrank[node] - rank)
16: keyrank[node] = rank
17: end for
{Stop if KeyRank has converged}
18: if diff < mindelta: then
19: break
20: end if
21: end for
22: return keyrank
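For concreteness, the iteration above can be written as runnable Python. The graph-library calls graph.incidents (in-links) and len(graph.neighbors(...)) (out-degree) are replaced by plain dictionaries, and the three-node example graph is hypothetical:

```python
# Minimal runnable sketch of the KeyRank iteration in Algorithm 4.
# "inlinks" stands in for graph.incidents(node) and "outdegree" for
# len(graph.neighbors(node)); the example graph is hypothetical.
def keyrank(inlinks, outdegree, dampingfactor=0.85, maxiterations=100,
            mindelta=0.00001):
    nodes = list(inlinks)
    graphsize = len(nodes)
    if graphsize == 0:
        return {}
    # value for nodes without inbound links
    minvalue = (1.0 - dampingfactor) / graphsize
    # initialize the KeyRank dict with 1/N for all nodes
    rank = dict.fromkeys(nodes, 1.0 / graphsize)
    for _ in range(maxiterations):
        diff = 0.0  # total difference compared to the last iteration
        for node in nodes:
            r = minvalue
            for ref in inlinks[node]:  # referring keyphrases
                r += dampingfactor * rank[ref] / outdegree[ref]
            diff += abs(rank[node] - r)
            rank[node] = r
        if diff < mindelta:  # KeyRank has converged
            break
    return rank

# Hypothetical keyphrase graph: a -> b, a -> c, b -> c, c -> a
inlinks = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
outdegree = {"a": 2, "b": 1, "c": 1}
ranks = keyrank(inlinks, outdegree)
```

As with PageRank, the scores of this example sum to roughly one, and the node with two in-links (“c”) outranks the node with one (“b”).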
Table 4.4: Keyphrases and Keyranks
Keyphrase Keyrank
regressionproblems 0.041667
streamprocessing 0.041667
streamingdata 0.052083
datastreams 0.055074
spectralclustering 0.055074
informationgain 0.055435
hierarchicalclustering 0.055435
clusteringalgorithms 0.067708
frequentitemset 0.069307
associationrules 0.107261
decisiontrees 0.110562
Algorithm 5: Dijkstra's Shortest Path Algorithm
1: def shortest-path(graph, source):
{Initialization}
2: dist = {source : 0} //distance from source to source
3: previous = {source : None}
4: q = graph.nodes() //q: set of all unvisited nodes in graph
5: while q: do
{Select the unvisited node u with the smallest known distance}
6: u = q[0]
7: for node in q[1:]: do
8: if (u not in dist) or (node in dist and dist[node] < dist[u]): then
9: u = node
10: end if
11: end for
12: q.remove(u)
{Process the remaining nodes reachable from u}
13: if (u in dist): then
14: for v in graph[u]: do
15: if v in q: then
16: alt = dist[u] + graph.edge-weight((u, v))
17: if (v not in dist) or (alt < dist[v]): then //Relax edge (u, v)
18: dist[v] = alt
19: previous[v] = u
20: end if
21: end if
22: end for
23: end if
24: end while
25: return previous, dist
26: end shortest-path
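The same procedure is runnable as plain Python when graph[u] is a dictionary mapping each neighbour v to the weight of edge (u, v), standing in for graph.edge-weight((u, v)); the three-node example graph is hypothetical:

```python
# Runnable sketch of Algorithm 5 on an adjacency-dict weighted graph.
# graph[u] maps each neighbour v of u to the weight of edge (u, v).
def shortest_path(graph, source):
    dist = {source: 0}         # best known distance from the source
    previous = {source: None}  # predecessor on the shortest path
    q = list(graph)            # unvisited nodes
    while q:
        # select the unvisited node u with the smallest known distance
        u = q[0]
        for node in q[1:]:
            if (u not in dist) or (node in dist and dist[node] < dist[u]):
                u = node
        q.remove(u)
        if u in dist:  # u is reachable from the source
            for v, w in graph[u].items():
                if v in q:
                    alt = dist[u] + w
                    if (v not in dist) or (alt < dist[v]):  # relax (u, v)
                        dist[v] = alt
                        previous[v] = u
    return previous, dist

# Hypothetical graph: a -> b (1), a -> c (4), b -> c (2)
g = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}}
previous, dist = shortest_path(g, "a")
```

The two-hop path a -> b -> c (total weight 3) correctly replaces the direct edge a -> c of weight 4.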
Table 4.5: Keyphrases and Distance
Keyphrase Distance
decisiontrees 0.142900
frequentitemset 0.166700
associationrules 0.200000
clusteringalgorithms 0.200000
informationgain 0.267900
datastreams 0.470000
hierarchicalclustering 0.450000
streamingdata 0.476200
spectralclustering 0.592900
regressionproblems 0.666600
streamprocessing 0.666600
Table 4.6: Keyphrases and Keyscores
Keyphrase Keyscore
regressionproblems 0.062506
streamprocessing 0.062506
streamingdata 0.0781323
datastreams 0.085552
spectralclustering 0.092889
hierarchicalclustering 0.123189
informationgain 0.205577
clusteringalgorithms 0.33854
frequentitemset 0.415759
associationrules 0.536305
decisiontrees 0.773702
Table 4.7: List of keyphrases and their keyranks matching with query term “decision trees”
Keyphrase Keyrank
frequentitemset 0.049348
clusteringalgorithms 0.049348
datastreams 0.049348
associationrules 0.049348
neuralnetwork 0.052142
streamprocessing 0.054004
streamingdata 0.054004
regressionproblems 0.065212
classlabel 0.071695
datamining 0.076816
informationgain 0.078212
Table 4.8: List of keyphrases and their distances corresponding to the query term “decision
trees”.
Keyphrase Distance
informationgain 0.166700
datamining 0.200000
classlabel 0.250000
regressionproblems 0.333300
neuralnetwork 0.333300
datastreams 0.392900
frequentitemset 0.416700
associationrules 0.450000
clusteringalgorithms 0.500000
streamprocessing 0.517900
streamingdata 0.726200
Table 4.9: List of keyphrases and their keyscores corresponding to the query term “decision
trees”.
Keyphrase Keyscore Rank
streamingdata 0.096327 11
clusteringalgorithms 0.097904 10
streamprocessing 0.104083 9
associationrules 0.108782 8
frequentitemset 0.117475 7
datastreams 0.124591 6
neuralnetwork 0.145142 5
regressionproblems 0.195656 4
classlabel 0.286780 3
datamining 0.384080 2
informationgain 0.469178 1
Table 4.10: List of keyphrases and their keyranks corresponding to the query term (“clustering uncertain data”, “data streams”)
Keyphrase Keyrank
streamprocessing 0.055589
streamingdata 0.055589
spectralclustering 0.058200
datamining 0.059028
hierarchicalclustering 0.059028
clusteringalgorithms 0.072132
uncertaindata 0.073423
streamclustering 0.080401
uncertaindatastreams 0.099245
Table 4.11: List of keyphrases and their distances matching with query term (“clustering uncertain data”, “data streams”).
Keyphrase Distance
uncertaindatastreams 0.333300
uncertaindata 0.333300
clusteringalgorithms 0.333300
streamclustering 0.360000
datamining 0.533300
hierarchicalclustering 0.533300
spectralclustering 0.676200
streamingdata 0.801200
streamprocessing 0.819100
Table 4.12: List of keyphrases and their keyscores corresponding to the query term (“clustering uncertain data”, “data streams”)
Keyphrase Keyscore Rank
streamprocessing 0.067866 9
streamingdata 0.069383 8
spectralclustering 0.086069 7
datamining 0.110684 6
hierarchicalclustering 0.110684 5
clusteringalgorithms 0.216418 4
uncertaindata 0.220290 3
streamclustering 0.223361 2
uncertaindatastreams 0.297765 1
Table 4.13: List of keyphrases and their keyranks corresponding to the query term “skyline
query”
Keyphrase Keyrank
sensornetworks 0.038462
datamanagement 0.042308
dataintegration 0.042308
datapoints 0.042308
movingobjects 0.042308
subspaceskyline 0.051122
skylineobjects 0.053212
skylinepoints 0.057692
skylinealgorithms 0.063462
queryprocessing 0.080409
skylinequeryprocessing 0.083894
skylinecomputation 0.085213
Table 4.14: List of keyphrases and their keyranks corresponding to the query term “target
schema”.
Keyphrase Keyrank
mappingsystem 0.028214
dataintegration 0.033333
schemamatching 0.033333
schemaevolution 0.033333
dataexchange 0.037210
schematree 0.039583
mappinggeneration 0.040365
sourceandtarget 0.040365
schemaelements 0.041667
mappingcomposition 0.041667
targetinstance 0.050000
schemamapping 0.056250
sourceschema 0.057031
sourceinstance 0.059375
Table 4.15: List of keyphrases and their distances matching with query term “skyline query”
Keyphrase Distance
skylinecomputation 0.111100
skylinepoints 0.200000
skylineobjects 0.250000
skylinequeryprocessing 0.250000
skylinealgorithms 0.250000
subspaceskyline 0.333300
datapoints 0.392900
movingobjects 0.500000
sensornetworks 0.533300
datamanagement 0.583300
dataintegration 0.583300
queryprocessing 0.583300
Table 4.16: List of keyphrases and their distances matching with query term “target schema”.
Keyphrase Distance
sourceinstance 0.142900
sourceschema 0.166700
schemamapping 0.200000
targetinstance 0.250000
mappinggeneration 0.325000
schemaelements 0.333300
dataintegration 0.533300
mappingcomposition 0.533300
sourceandtarget 0.533300
schematree 0.540000
schemamatching 0.592900
schemaevolution 0.650000
dataexchange 0.717900
mappingsystem 0.816700
Table 4.17: List of keyphrases and their keyscores matching with query term “skyline query”
Keyphrase Keyscore Rank
sensornetworks 0.072120 12
datamanagement 0.072532 11
dataintegration 0.072532 10
movingobjects 0.084615 9
datapoints 0.107681 8
queryprocessing 0.137851 7
subspaceskyline 0.153381 6
skylineobjects 0.212848 5
skylinealgorithms 0.253848 4
skylinepoints 0.288462 3
skylinequeryprocessing 0.335577 2
skylinecomputation 0.766994 1
Table 4.18: List of keyphrases and their keyscores matching with query term “target schema”.
Keyphrase Keyscore Rank
mappingsystem 0.034546 14
schemaevolution 0.051282 13
dataexchange 0.051832 12
schemamatching 0.056221 11
dataintegration 0.062504 10
schematree 0.073302 9
sourceandtarget 0.075688 8
mappingcomposition 0.078131 7
mappinggeneration 0.124199 6
schemaelements 0.125014 5
targetinstance 0.200000 4
schemamapping 0.281250 3
sourceschema 0.342117 2
sourceinstance 0.415500 1
Table 4.19: List of keyphrases and corresponding originating documents.
Keyphrase | Conference, Year | Paper Title | First Author
clustering uncertain data | ICDE 2008 | A Framework for Clustering Uncertain Data Streams | Charu C. Aggarwal
skyline query processing | ICDE 2007 | Efficient Skyline Query Processing on Peer-to-Peer Networks | Shiyuan Wang
spectral clustering | ICDM 2005 | Integrating Hidden Markov Models and Spectral Analysis for Sensory Time Series Clustering | Jie Yin
uncertain data streams | ICDE 2008 | A Framework for Clustering Uncertain Data Streams | Charu C. Aggarwal
subspace skyline | ICDE 2006 | SUBSKY: Efficient Computation of Skylines in Subspaces | Yufei Tao
4.3.5 Analysis using Google Similarity Distance Measure
To interpret the significance of our results, we use Google Scholar to show that the list of related keyphrases recommended by our approach is acceptable and relevant to the user’s query. To see the relevance of keyphrases to the user’s query term, we search for pairs of keyphrases such as (“query term” “keyphrase”) together, as well as single keyphrases such as (“query term”) or (“keyphrase”), and collect statistics showing how many relevant documents are produced by Google Scholar. In Google Scholar, under the advanced scholar search option, we have considered the following search options:
• find articles with exact phrase.
• return articles published between the years 2005 and 2009, as our dataset includes only articles published in these years.
• search articles only in the (Engineering, Computer Science, and Mathematics) subject
areas.
When the Google Scholar engine is used to search for word x, Google displays the number
of hits that word x has. The ratio of this number of hits to the total number of webpages
indexed by Google represents the probability that word x appears on a webpage. Cilibrasi
and Vitanyi [CV07] use this probability to extract the meaning of words from the world-
wide-web. If word y has a higher conditional probability to appear on a webpage, given
that word x also appears on the webpage, than it does by itself, then it can be concluded
that words x and y are related. Moreover, higher conditional probabilities imply a closer
relationship between the two words. Thus, word x provides some meaning to word y and
vice versa.
Cilibrasi and Vitanyi’s Normalized Google Distance (NGD) function measures how close
word x is to word y on a zero to infinity scale. A distance of zero indicates that the two
words are practically the same. Two independent words have a distance of one. A distance
of infinity occurs for two words that never appear together.
On the average, two random words should be independent of one another. Hence, two
random words should have an NGD of one. To evaluate the NGD among the different word
pairs, Cilibrasi and Vitanyi's formula is defined as:
NGD(x, y) = [max(log f(x), log f(y)) − log f(x, y)] / [log M − min(log f(x), log f(y))]    (4.6)
where f(x) and f(y) are the number of hits of words x and y, respectively, and M is the
total number of web pages that Google indexes.
To measure the relatedness of the keyphrases obtained by our approach, we use the Normalized Google Distance (NGD) measure. The method works by first calculating a distance matrix whose entries are the pairwise NGDs of the terms in the input list.
Consider the first case study, where the query term is “decision trees”. The final recommended list of keyphrases from Table 4.9 is considered, and the NGD for each pair of (query term, keyphrase) is calculated. At the time of the experiment, a Google Scholar search for “decision trees” returned 16,400 hits. The number of hits for the search term “information gain” was 7,530. Searching for pages where both (“decision trees”, “information gain”) occur gave 2,600 hits. As we did not know the total number of pages indexed by Google, we assumed a total of 1,000,000 indexed pages for this case. Using these numbers in the NGD formula (4.6), with M = 1,000,000, yields the following Normalized Google Distance between the terms “decision trees” and “information gain”:
NGD(decision trees, information gain) ≈ 0.3767
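The computation above can be reproduced in a few lines. The hit counts are those reported in the text, and M = 1,000,000 is the assumed index size; since Equation 4.6 is a ratio of log differences, the base of the logarithm cancels and the natural logarithm can be used:

```python
# Reproducing the NGD computation for ("decision trees", "information gain").
import math

def ngd(fx, fy, fxy, M):
    # Equation 4.6: (max(log f(x), log f(y)) - log f(x, y)) /
    #               (log M - min(log f(x), log f(y)))
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

d = ngd(fx=16400, fy=7530, fxy=2600, M=1_000_000)  # ~= 0.3767
```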
Similarly, the NGD is calculated for the other keyphrase pairs. Table 4.20 shows the list of keyphrases, their NGD distances from the query term “decision trees”, and the ranking of the results.
From Table 4.9 and Table 4.20, notice that the first recommended keyphrase is “informationgain” using both approaches. The ranks of the other keyphrases in Table 4.20 also correlate well with those in Table 4.9, but they are not exactly the same. This may be due to the fact that the dataset used by Google Scholar is huge in comparison with our dataset, in which we consider only a limited number of conferences from the data mining/databases domain. For some exceptional cases where the ranks differ significantly, it may be that these keyphrases occur very frequently in documents of other fields but rarely in our dataset.
Table 4.20: List of keyphrases and their NGD corresponding to the query term “decision
trees”.
Keyphrase NGD Rank
information gain 0.3767 1
data mining 0.6065 4
neural network 0.7165 8
regression problems 0.6233 5
class label 0.3924 2
association rules 0.5815 3
frequent itemset 0.6847 7
data streams 0.7508 9
stream processing 0.9328 11
streaming data 0.8072 10
clustering algorithms 0.6480 6
Similarly, for the other case studies the NGD measure is computed based on the query term and related keyphrases. For the query terms (“clustering uncertain data”, “data streams”) and (“skyline query”, “target schema”), the numbers of hits are first collected through Google Scholar, and then, using the NGD measure, the distance between the query term and each recommended keyphrase is calculated. For case study 2’s query term, the final list of recommended keyphrases is shown in Table 4.12 and is considered for evaluating the NGD measure. Table 4.21 shows the list of keyphrases, their evaluated NGDs and their rankings with respect to the query term (“clustering uncertain data”, “data streams”). In case study 3, the query term is evaluated using each sub-query term separately, so for each sub-query term a different list of related keyphrases and their keyscores is identified; thus Table 4.17 and Table 4.18 are used for identifying the NGD measure. Table 4.22 and Table 4.23 show the list of keyphrases, their evaluated NGDs and their rankings for the sub-query terms “skyline query” and “target schema” respectively.
Table 4.21: List of keyphrases and their NGD corresponding to the query term (“clustering uncertain data”, “data streams”).
Keyphrase NGD Rank
uncertaindatastreams 0.045032 1
uncertaindata 0.444231 3
clusteringalgorithms 0.676886 5
datamining 0.834248 9
spectralclustering 0.765654 8
hierarchicalclustering 0.744809 7
streamprocessing 0.693548 6
streamingdata 0.566321 4
streamclustering 0.360304 2
Typically, the results of our keyscore measure correlate very well with NGD. Note that this is the case even though the dataset (i.e. the research paper collection) and the ranking techniques are quite different between the two approaches. Specifically, while we have only research papers from 6 conferences, the Google Scholar collection is more exhaustive. Also, Google may use undocumented ranking techniques, such as utilizing the implicit feedback of users while browsing through search results. In spite of these differences, there is still a good correlation between the Google Scholar results and our approach, as can be seen in Table 4.9 and Table 4.20. Further, in the next section, to measure the degree of correlation between the two approaches, we use Spearman’s Rank Correlation coefficient. The results obtained previously are then interpreted statistically using the t-test, showing the statistical significance of the results at significance levels of 5% and 1%.
4.3.6 Spearman’s Rank Correlation
Spearman’s Rank Correlation is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other. It is a technique used to test the direction and strength of the relationship between two variables. In other words, it is a device to show whether any one set of numbers has an effect on another set of numbers. It is often denoted by ρ or rs. The Spearman rank correlation coefficient is defined by:
rs = 1 − (6 Σ di²) / (n(n² − 1))    (4.7)
where di = xi − yi is the difference in statistical ranks of corresponding variables and n is the total number of observations in the dataset, the sum running over i = 1, ..., n. The value of rs lies between -1 and +1; the closer rs is to +1 or -1, the stronger the likely correlation. A perfect positive correlation is +1 and a perfect negative correlation is -1. If rs = 0, there is no correlation between the variables.
Table 4.22: List of keyphrases and their NGD corresponding to the query term “skyline
query”.
Keyphrase NGD Rank
subspaceskyline 0.179653 6
skylinequeryprocessing 0.097559 3
skylinepoints 0.0952461 2
datapoints 0.684654 8
skylineobjects 0.177634 5
skylinealgorithms 0.158339 4
queryprocessing 0.522789 7
skylinecomputation 0.063742 1
dataintegration 0.809186 11
datamanagement 0.715312 9
movingobjects 0.717215 10
sensornetworks 0.918243 12
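Equation 4.7 can be sketched directly for rankings without ties; x and y hold the rank positions assigned to the same n items by the two methods, and the example rankings below are synthetic:

```python
# Spearman's rank correlation (Equation 4.7) for rankings without ties.
def spearman_rs(x, y):
    n = len(x)
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))  # sum of d_i^2
    return 1 - 6 * d2 / (n * (n * n - 1))

identical = spearman_rs([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])  # perfect +1
reversed_ = spearman_rs([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])  # perfect -1
```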
Table 4.23: List of keyphrases and their NGD corresponding to the sub-query term “target
schema”.
Keyphrases NGD Rank
sourceandtarget 0.474858 11
sourceinstance 0.083498 1
schemamapping 0.204707 3
dataexchange 0.552462 13
targetinstance 0.239534 5
schematree 0.338834 9
mappingsystem 0.625251 14
schemaelements 0.234208 4
schemamatching 0.278603 7
sourceschema 0.088522 2
mappinggeneration 0.271431 6
dataintegration 0.478366 12
mappingcomposition 0.321632 8
schemaevolution 0.339999 10
A further technique is now required to test the significance of the relationship. The calculated rs value must be looked up in the Spearman’s Rank significance table at ν = (n − 2) degrees of freedom (df), for either a two-tailed or a one-tailed test, at the 0.05 and 0.01 levels of significance.
Determine Significance
One approach to testing whether an observed value of rs is significantly different from zero (rs always satisfies 1 ≥ rs ≥ -1) is to calculate the probability that it would be greater than or equal to the observed rs, given the null hypothesis, by using a permutation test. An advantage of this approach is that it automatically takes into account the number of tied data values in the sample and the way they are treated in computing the rank correlation.
4.3.7 t - test for testing the significance of an observed sample
correlation coefficient
If rs is the observed correlation coefficient in a sample of n pairs of observations from a bivariate normal population, then Prof. Fisher proved that under the null hypothesis H0 : rs = 0, i.e. the correlation coefficient is 0, the statistic:
t = rs √((n − 2) / (1 − rs²))    (4.8)
follows Student’s t-distribution with ν = (n − 2) degrees of freedom (df).
If the value of t comes out to be significant, we reject H0 at the level of significance adopted and conclude that rs ≠ 0, i.e. rs indicates a significant correlation in the population.
If t comes out to be non-significant, then H0 may be accepted and we conclude that the variables may be regarded as uncorrelated in the population.
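Equation 4.8 is straightforward to compute. With rs = 0.6818 and n = 11 (the “decision trees” query from Table 4.24), it reproduces the calculated t of about 2.8 reported in Table 4.25:

```python
# t-statistic for an observed sample rank correlation (Equation 4.8).
import math

def t_statistic(rs, n):
    # t = rs * sqrt((n - 2) / (1 - rs^2)), with df = n - 2
    return rs * math.sqrt((n - 2) / (1 - rs * rs))

t = t_statistic(0.6818, 11)
# |t| is then compared against the tabulated two-tailed value t_{0.05, df}
```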
4.3.8 Analysis of Results using Rank Correlation and t - test
Table 4.24 shows the rank correlation results, calculated using the formula defined in Section 4.3.6, where xi and yi are the rank values assigned to each keyphrase based on its keyscore value in the keyscore tables and its NGD value in the NGD tables, respectively, for each query term. These results give the calculated rs and the tabulated critical values for each query term.
Table 4.24: Spearman’s rank correlation results
Query n df cal rs 0.05 (tab rs) 0.01 (tab rs)
decision trees 11 9 0.6818 0.700 0.833
(clustering uncertain 9 7 0.69 0.786 0.929
data, data streams)
skyline query 12 10 0.980769 0.648 0.794
target schema 14 12 0.907692 0.538 0.675
Figure 4.6 shows the significance graph for Spearman’s rank correlation with respect to the degrees of freedom. The calculated rs value is looked up in the Spearman rank significance graph as follows:
• If it is below the line marked 5%, then it is possible our result was the product of chance and we must accept the null hypothesis.
• If it is above the 1% line, we can be 99% confident in accepting the alternative hypothesis.
• If it is above the 5% line but below the 1% line, we can be 95% confident (i.e. statistically there is a 5% likelihood that the result occurred by chance).
Figure 4.6: The significance of Spearman’s Rank Correlation and degrees of freedom.
From Table 4.24, for df = 7 the value 0.69 gives a significance level of slightly less than 5%, as shown in Figure 4.6. This means that the probability of the relationship we have found being a chance event is about 5 in 100; we are 95% certain that our hypothesis is correct. For df = 9 the value is 0.6818, which lies between the 5% and 1% regions, implying that we are 95% confident in rejecting our null hypothesis. This signifies that in 95 cases out of 100 both approaches perform equally well. For df = 10 and 12, we are 99% sure of accepting the alternative hypothesis, which implies that our two approaches perform very closely. For 0.5 < rs < 1, we can say that there is a strong positive correlation between the variables. From Table 4.24, we can notice that all values of rs lie in the range 0.5 < rs < 1, signifying a strong positive correlation between the two approaches.
Further, to test the degree of acceptance or rejection of the two approaches for each case separately, the t-test is calculated. We take as the null hypothesis that there is no correlation, i.e. rs = 0, between the two approaches, and as the alternative hypothesis that rs ≠ 0. So our null (H0) vs. alternative (H1) hypothesis is defined as follows:
H0 : rs = 0,
H1 : rs ≠ 0
• If calculated t0.05,df < tabulated t0.05,df , then accept H0 and reject H1.
• If calculated t0.05,df > tabulated t0.05,df , then reject H0 and accept H1.
In addition, Table 4.25 shows the t-test results calculated using the rs values and the formula described in Section 4.3.7.
Table 4.25: t - test results
Query n df cal t 0.05 (tab t) Testing result
decision trees 11 9 2.8 2.2620 Accept H1
(clustering uncertain 9 7 2.1653 2.365 Accept H0
data, data streams)
skyline query 12 10 15.8773 2.228 Accept H1
target schema 14 12 7.4930 2.179 Accept H1
From Table 4.25, we can notice that for 3 out of 4 cases our null hypothesis of no correlation between the two approaches is rejected: there is a correlation between the results even though they are produced by different approaches, and they do not differ significantly. Only for the df = 7 case is the null hypothesis accepted, which implies that for this case there is no significant relationship between the approaches.
4.4 Summary
We presented a related-keyphrase identification method to recommend a set of keyphrases relevant to a given user’s query. We built our approach on the notions of impact (keyrank) and proximity (shortest distance) of the keyphrases. One final remark regarding keyphrase ranking: we obtain a keyscore value for each keyphrase using the keyrank and shortest-distance measures, and keyphrases are recommended from first to last in order of decreasing keyscore. The experimental results show that the proposed method performs well, and its comparison with the Google similarity measure is also shown. Furthermore, statistical tests are used to measure the degree of correlation between the two approaches at different levels of significance.
Chapter 5
ProMax: A Profit Maximizing
Recommendation System for Market
Baskets
In the previous chapter, we discussed an approach to identify related keyphrases based on the proximity and importance of keyphrases, and used a knapsack-based approach to recommend a set of top k keyphrases corresponding to the user’s query, along with their originating documents. In this chapter, we apply the knapsack-based solution in a different domain: recommending a set of items to customers in retail stores such that the profit of the store is maximized.
Most data mining research has focused on developing algorithms to discover statistically
significant patterns in large datasets. However, utilizing the resulting patterns in decision
support is more of an art, and the utility of such patterns is often questioned. We formalize
a technique that utilizes data mining concepts to recommend an optimal set of items to
customers in a retail store based on the contents of their market baskets. The recommended
set of items maximizes the expected profit of the store and is decided based on patterns
learnt from past transactions. In addition to the concepts of clustering and frequent itemsets, the proposed method also combines ideas from recommendation systems and the knapsack problem to decide on the items to recommend. We empirically compare our approach with existing
methods on both real and synthetic datasets and show that our method yields better profits
while being faster and simpler.
5.1 Introduction
The data mining research literature is replete with several algorithms - most of which are
elegant, efficient and effective. However, there are only a few formal studies concerning how
data mining can actually be beneficial for specific applications. A major obstacle in applying data mining is the gap between statistically significant pattern extraction and value-based decision making [WZH02]. It is often questioned how exactly one should make
use of the patterns extracted through data mining algorithms for making effective business
decisions, with the ultimate goal of yielding better profits for the business.
Similarly, studies about the retail market [Tom02] have received wide attention, although
only a few of them have seriously dealt with data mining. Ke Wang et al. [WZH02] first
presented a profit mining approach to reduce this gap in 2002, and recent investigations have shown an increasing interest in how to make decisions by utilizing association rules. We
focus on the problem of recommending products to customers in retail stores such that the
profit of the store is maximized.
Market basket databases contain historical data on prior customer choices where each
customer has selected a subset of items, a market basket, out of a large but finite set.
This data can be used to generate a dynamic recommendation of new items to a customer
who is in the process of making the item choices. Some retail outfits provide carts with
displays that provide product information and recommendations as the customer shops.
Remote shopping systems allow customers to compose shopping lists through personal digital
assistants (PDAs), with interactive recommendations of likely items. Internet commercial
sites often provide dynamic product choices as the customer adds items into the virtual
shopping cart, or market basket. Internet sites also display dynamically changing set of
links to related sites depending on the browsing pattern during a surfing session. Faced
with an enormous variety of options, customers surfing the web gravitate toward sites that
offer information tailored to their personal preferences. All these activities are characterized
by the progressive item choices being made by a customer, and the provider’s desire to
recommend items that are the most likely next choice of the customer [HNB01].
Recommender systems are rapidly becoming a core tool to accelerate cross-selling and
strengthen customer loyalty due to the prosperity of electronic commerce. Enterprises have
been developing new business portals and providing large amounts of product information to create more business opportunities and expand their markets. However, this results in an information overload problem, which becomes a burden on customers when making a purchase decision among a huge variety of products [CCH+07]. The past history of the
items in each market basket transaction is often available, although most of this data is
proprietary. The process of generating recommendations using this data has been called
collaborative filtering, and treated in the same way as the modeling of personal preferences
of movies or news articles. However, we feel that the market basket recommendation is
inherently a different problem due to the large number of categories of items and their
associated profits.
From the market angle, two important criteria [BSVW99] should be taken into account
during the process of profit mining: the items in retail shops should first meet basic sales requirements, and second, should bring higher profits. Therefore, how to meet these two
principles is the core problem of profit mining. The cross-selling effect of items [Tom02]
has been noticed by current retailers: the profit of an item is not only involved in the item
itself, but is also influenced by its related items. Some items fail to produce high profit by
themselves, but they might stimulate customers to buy other profitable items. Consequently,
the cross-selling factor which can be studied by the analysis of historical transactions should
be involved in the problem of item selection.
Searching for such a relation of items to support cross-selling has become an important
issue. Current approaches to study these relationships are based on association rules. How-
ever, association rules by themselves do not suggest how to maximize profit.
We present an algorithm that combines simple ideas from association rule mining, clus-
tering, recommendation systems, and the knapsack problem to recommend those items to
customers that maximize the expected profit. We tested our algorithm on two popular
datasets: one was generated using the data generator from the IBM Almaden Quest
research group [BSVW04] and the other was a retail market basket dataset available in the
FIMI repository (http://fimi.cs.helsinki.fi/data). Our experiments show that our algorithm
performs better than competing algorithms in terms of obtaining better profits, while at the
same time being faster and simpler.
5.2 Related Work
Many novel and important methods have been proposed to support profit mining. Brijs et al.
first proposed the PROFSET model [Tom02], which adopted a size-constrained 0-1
programming strategy and took advantage of the cross-selling effect of items to solve the
problem of item selection.
In 2002, Ke Wang et al. first presented the profit mining approach and several related
problems [WZH02, Zho09]. Ke Wang et al. proposed the HAP algorithm [WS02], which
extends the hub-authority web page ranking algorithm HITS [Kle99] to solve the problem
of item ranking while considering the influence of confidence and profit, but it still has several
drawbacks [wWcF03]. Raymond Wong et al. proposed the maximal profit problem of item
selection (MPIS) [wWcF03], which has the goal of mining a subset of items with
maximal profit, thereby addressing the above drawbacks. However, MPIS is difficult
to implement and solves an NP-hard problem even in the simplest situation. In other words,
although the MPIS algorithm can find the best solution, its time cost is too expensive to
be tolerated.
Recently, Raymond Wong et al. proposed a hill climbing method [WF04] to solve the
problem of Item Selection for Marketing (ISM) by considering the cross-selling effect. Ray-
mond Wong et al. [WFW05] also adopted genetic algorithms to generate locally optimal
profits to approximate the optimal profit of the item selection problem. In general, if the
minimum support is decreased, the number of association rules keeps increasing.
The DualRank algorithm [XJWZ05] uses a graph built from association rules; this graph
shrinks when the minimum support is increased, because the out-degrees of the items
decrease. This affects the matrix calculations, which in turn affects profit-based item
selection. Another problem with DualRank [XJWZ05] is that it is not
very efficient for sparse data sets, and it involves cumbersome calculations of eigenvalues and
eigenvectors which become intractable for large transactional data sets. To overcome
these limitations we have developed a new algorithm for market basket datasets.
5.3 Problem Definition
In this section, we focus on the problem of recommending products to customers in retail
stores such that the profit of the store is maximized. The following definition formally
captures the problem statement:
Definition 14 (Profit Mining) Given a transactional dataset D = {t1, t2, . . . , tm}, where
each transaction ti contains a subset of items (itemset) from the set I = {i1, i2, . . . , in},
each item ij having an associated profit pj, the problem is to select k additional items to
recommend for each transaction, with the goal of making more sales and maximizing the
total profit of the resulting transactions.
There are several challenges to this problem:
1. Model product purchase probabilities: We need to recommend products that
are more likely to be purchased. It is therefore important to have a clear understanding
of how to model the probability of purchase of each product.
2. Model product relationships: It is not sufficient to know the individual purchase
probabilities of different products. We need to also identify relationships between items
for cross-selling. The standard technique in recent approaches for this purpose has
been to use association rules.
3. Model customers: Even high-confidence association rules may not apply to a par-
ticular customer, whereas some low-confidence rules may apply. Thus, it is imperative
to model which category a customer belongs to and then study the rules
within that category. This is more likely to result in effective recommendations. The
standard technique in the recommendation system community for this purpose is to
cluster customers such that customers within each cluster share the same purchase
patterns.
4. Balance purchase probability and profit: A pure association rule based approach
will favour rules with high confidence so as to maximize the probability that the cus-
tomer will purchase the recommended item. For example, the rule {Perfume} →
{Lipstick} will be favoured because of higher confidence compared to a rule {Perfume}
→ {Diamond}. In contrast, a pure profit-based approach will favour the latter rule
hoping for higher profit. Neither necessarily maximizes the true profit. Indeed, items
of high profit often also have low supports, because fewer people buy expensive items.
5. Decide the number of products to recommend: An implicit constraint here
is that if we recommend too many items then the customer will be overwhelmed by
the choices and is likely to avoid choosing any item at all. On the other hand, if we
recommend too few items, then we may miss a successful sale of a product that has
not been brought to the attention of the customer. The correct number of products to
recommend depends on the attention span of customers and would vary depending on
the domain.
Clearly, recommending the right products for each market basket in a store is a challenging
problem, and the task of designing an elegant, simple and yet effective
algorithm seems very difficult upfront.
5.4 The ProMax Algorithm
As discussed in the previous section, on the face of it, the problem of recommending the right
products is replete with challenges. It therefore seems daunting to design a simple
yet effective algorithm for this task, so much so that designing an optimal algorithm that
guarantees to maximize the expected profit seems out of reach. Yet this is exactly what we
achieve. In this section we present ProMax, a Profit Maximizing recommendation system
for market baskets.
Our algorithm performs a clustering of the customer transactions as a preprocessing step.
For this purpose, it uses the clustering algorithm explained in [WXL99]. Then, at
recommendation time, the algorithm proceeds in the following steps:
1. Identify the cluster C to which the current record belongs.
2. Calculate the expected profit of each item in the cluster C.
3. Sort the items based on expected profit and recommend the top k items, where k is a
parameter to the algorithm.
First, Section 5.4.1 describes the clustering algorithm used in step 1. Then, Section 5.4.2
describes how the expected profit of items is computed.
5.4.1 Clustering Customer Transactions and Identification of the
Current Cluster
Since the probability of earning more profit is directly proportional to the purchase
probability of items, it is imperative to accurately estimate the purchase probability of
specific items by specific customers. A naive solution is to use the global support of items
as an estimate of their probability. But, as customers differ in their purchase patterns, the
global support of an item is an unreliable estimator of its likelihood of sale.
Therefore, a natural approach is to first cluster the transactions based on their purchase
patterns and then use the support of items within a cluster as more accurate estimates of
their purchase probabilities. The clustering criterion we use is based on the notion of large
items and was proposed in [WXL99].
For a given collection of transactions and a minimum support, our aim is to find a clustering
C such that cost(C) is minimized, i.e., to group similar transactions so that intra-cluster
similarity is high and inter-cluster similarity is low. cost(C) is computed from the
intra-cluster and inter-cluster similarities, which are measured using the notions of small
items and large items, respectively.
Large items are used to calculate the similarity measure used for clustering transactions.
The support of an item in a cluster Ci is the number of transactions in Ci that contain the
item. Let |S| denote the number of elements in set S. For a user-specified minimum support
θ (0 < θ ≤ 1), an item is large in cluster Ci if its support in Ci is at least θ × |Ci|; otherwise,
an item is small in Ci.
Intuitively, large items are popular items in a cluster and thus contribute to the similarity
of items within a cluster. In contrast, small items contribute to dissimilarity in a cluster. Let
Largei denote the set of large items in Ci, and Smalli denote the set of small items in Ci.
Consider a clustering C = {C1, C2, . . . , Ck}. The current record r is assigned to the cluster
whose cost is minimum. This can be defined mathematically as:

Cluster(r) = argmin_i [Cost(Ci)] (5.1)
The cost of C to be minimized depends on two components: the intra-cluster cost and
the inter-cluster cost:
Intra-cluster cost: This component is charged for the intra-cluster dissimilarity, mea-
sured by the total number of small items:
Intra(C) = |Small1 ∪ Small2 ∪ · · · ∪ Smallk| (5.2)
This component restrains the creation of loosely bound clusters that have too many small
items.
Inter-cluster cost: This component is charged for the inter-cluster similarity. Since large
items contribute to similarity in a cluster, each cluster should have as little overlapping of
large items as possible. This overlapping is measured as:
Inter(C) = (|Large1| + |Large2| + · · · + |Largek|) − |Large1 ∪ Large2 ∪ · · · ∪ Largek| (5.3)
In words, Inter(C) measures the duplication of large items across different clusters. This
component restrains the creation of similar clusters.
To put the two together, one can specify weights for their relative importance. The cost
function of the clustering C is then defined as:
Cost(C) = w × Intra(C) + Inter(C) (5.4)
A weight w > 1 gives more emphasis to the intra-cluster similarity, and a weight w < 1
gives more emphasis to the inter-cluster dissimilarity. By default w = 1.
The pseudocode of the clustering algorithm as described above is shown in Algorithm 6.
Algorithm 6: Clustering Algorithm
/*Allocation Phase*/
while not end of file do
read the next transaction < t,− >;
allocate t to an existing or a new cluster Ci to minimize Cost(C);
write < t,Ci >;
end while
/*Refinement Phase*/
repeat
not moved = true;
while not end of the file do
read the next transaction < t,Ci >;
move t to an existing non-singleton cluster Cj to minimize Cost(C);
if Ci ≠ Cj then
write < t,Cj >;
not moved = false;
eliminate any empty cluster;
end if
end while
until not moved ;
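A minimal sketch of the cost function (Equations 5.2-5.4) and of the allocation phase of Algorithm 6, assuming transactions are represented as item sets; all function and variable names here are illustrative, not taken from [WXL99]:

```python
from collections import Counter

def cluster_cost(clusters, theta, w=1.0):
    """Cost(C) = w * Intra(C) + Inter(C), per Equations 5.2-5.4."""
    larges, smalls = [], []
    for c in clusters:
        counts = Counter(item for t in c for item in t)
        threshold = theta * len(c)  # an item is large if its support >= theta * |Ci|
        large = {i for i, n in counts.items() if n >= threshold}
        larges.append(large)
        smalls.append(set(counts) - large)
    intra = len(set().union(*smalls))                                  # Eq. 5.2
    inter = sum(len(lg) for lg in larges) - len(set().union(*larges))  # Eq. 5.3
    return w * intra + inter                                           # Eq. 5.4

def allocate(transaction, clusters, theta, w=1.0):
    """Allocation phase: place the transaction in the existing or a new
    cluster that minimizes Cost(C); returns the chosen cluster index."""
    best_cost, best_idx = None, None
    for idx in range(len(clusters) + 1):  # the last index stands for a new cluster
        trial = [list(c) for c in clusters] + [[]]
        trial[idx].append(set(transaction))
        trial = [c for c in trial if c]   # drop the unused empty slot
        cost = cluster_cost(trial, theta, w)
        if best_cost is None or cost < best_cost:
            best_cost, best_idx = cost, idx
    if best_idx == len(clusters):
        clusters.append([set(transaction)])
    else:
        clusters[best_idx].append(set(transaction))
    return best_idx
```

Re-evaluating the full cost for every candidate placement keeps the sketch short but is inefficient; a practical implementation would update the two cost components incrementally.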
5.4.2 Calculating Expected Profit
Once clusters are computed, the probability of an item’s purchase can be found by first
identifying which cluster the current transaction falls into and then looking up the support
of the corresponding item in that cluster. That is, the item’s probability is estimated as its
support within a cluster, rather than its global support. As mentioned in Section 5.4.1, this
manner of computing the probability of items is far more accurate in terms of their likelihood
of purchase.
In this context, we can compute the expected profit of a given item i with the help of its
probability (intra-cluster frequency) f and profit p. The expected profit for each item can be
computed as:
Ei = fi × pi (5.5)
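Equation 5.5 can be evaluated for every item in the identified cluster as sketched below; the cluster representation and the profit table are illustrative assumptions:

```python
from collections import Counter

def expected_profits(cluster, profits):
    """E_i = f_i * p_i (Equation 5.5), where f_i is the support fraction
    of item i within the cluster and p_i is its profit."""
    counts = Counter(item for t in cluster for item in t)
    n = len(cluster)
    return {i: (c / n) * profits.get(i, 0.0) for i, c in counts.items()}
```

For example, in a cluster of two transactions where both customers bought lipstick and one also bought a diamond, the diamond's low frequency is weighed against its high profit.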
5.4.3 Recommending Items: Knapsack Approach
In our scenario, we can recommend k items to the customer, where k is a constant
chosen based on the domain. The goal is to maximize profit, so we must ensure that the
recommended items have both high purchase probability and high profit.
There are mainly two options:
1. Recommending items with high purchase probability: Consider the example of lipstick
and diamond. Suppose the price of a lipstick is $1 and the price of a diamond is $500, and
the purchase probability of lipstick is 0.03 and of diamond is 0.0001. If we consider
purchase probability alone, we will recommend lipstick, which would
not yield high profit.

2. Recommending items with high profit: In the same example, the diamond has much
higher profitability than the lipstick. If we consider only the profit, we will recommend
the diamond, which is not a good recommendation since its purchase probability is tiny.
If the goal is to maximize either purchase probability or profit separately as mentioned
above, we could directly use the knapsack approach [HS78]. We reduce the current problem
to the knapsack problem in the following manner.
Definition 15 (Knapsack Problem) Given a set of items, each with a weight and a value,
determine the number of each item to include in a collection so that the total weight is less
than a given limit and the total value is as large as possible.
We reduce profit mining to a knapsack problem using the following equivalences:
• Items to carry in a knapsack correspond to the items for sale in profit mining.
• Item weight (in knapsack problem) is set to 1 for all items in the store.
• Weight limit (knapsack problem) is set to k – the number of items to recommend (in
profit mining).
• Item value (in knapsack problem) is set to the expected profit of that item (which is
purchase probability × item profit).
The result is a greedy algorithm to recommend items: once the expected profit of items
is computed, we sort all items in decreasing order of expected profit and recommend
the k items with the highest expected profits. The knapsack reduction
guarantees that this greedy approach maximizes the overall expected profit.
The pseudo-code of the resulting algorithm as described above is shown in Algorithm 7.
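Because every item has unit weight, the knapsack reduction degenerates to this greedy sort. A minimal sketch, assuming a mapping from items to their expected profits (Equation 5.5) and that items already in the customer's basket are excluded:

```python
def recommend(expected, basket, k):
    """Return the k items with the highest expected profit that are not
    already in the customer's basket (unit-weight knapsack, greedy)."""
    candidates = [(item, e) for item, e in expected.items() if item not in basket]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in candidates[:k]]
```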
5.5 Experimental Evaluation
To evaluate the performance of the ProMax algorithm we ran a set of experiments on two
data sets: a real retail data set [BSVW99] available from the Frequent Itemset Mining
Implementations (FIMI) Repository (http://fimi.cs.helsinki.fi/data) and an IBM Synthetic
Dataset [BSVW04]. The choice
of these particular data sets is not restrictive since they are widely used standard benchmark
data sets, and the structure is typical of the most common application scenarios. In fact
we can apply this algorithm in all scenarios provided we have the past transactions and a
current customer transaction.
We compare the efficiency of the ProMax algorithm against the DualRank algorithm [XJWZ05],
because DualRank has already been shown to perform well when compared
with the HAP algorithm and the naive approach. DualRank [XJWZ05] is the
state of the art among item selection methods that use a customer-behaviour model
in the data mining literature, whereas the ProMax algorithm is based on a
customer-item-behaviour model.
Algorithm 7: Mining Profitable Items
I = items bought by the user (the current transaction)
/* Identify the cluster to which the user's transaction belongs */
for each cluster Ci do
  Cost(Ci) = w × Intra(Ci) + Inter(Ci)
end for
minid = id of the cluster with the minimum cost
/* Compute expected profits within the chosen cluster */
for each item i in cluster minid do
  f = frequency of item i in the cluster
  p = profit of item i
  Ei = f × p (expected profit)
end for
sort the items in descending order of expected profit
recommend the k items with the highest expected profit
All experiments were conducted on a 1.60 GHz Intel PC with 8 GB main memory running
Linux. Both algorithms were coded in C++. Profits of items were generated randomly,
since the way profits are assigned does not affect the nature of the results. To
perform the experiments, we took a set of four to five items representing the current customer
transaction as input and recommended the top five items to the user as output, based on
the past transaction history.
5.5.1 Performance of DualRank
We first considered the performance of DualRank on the synthetic dataset. For a
larger number of transactions, DualRank was not able to execute fast enough due to the
huge number of computations it has to perform. The performance of DualRank is shown
in Figure 5.1.
Figure 5.1: DualRank Performance
For a minimum support greater than 0.1, DualRank could not be executed. The reason
behind this is that, since there are very few frequent singleton items, there are no edges in
the item-graph created in DualRank.
We notice that the performance of DualRank deteriorates as the min-support value is
reduced. The reason is that for low minimum supports, the number of association rules
generated is very large, resulting in a very large matrix. This makes it difficult to perform
the intensive matrix operations that DualRank requires, such as the calculation of
eigenvalues and eigenvectors. DualRank's recommendations are independent of a particular
input transaction, but are globally determined by the overall profitability of individual items.
The ProMax algorithm was also evaluated under the same conditions as DualRank. The
performance of ProMax is shown in Figure 5.3. We notice that it performs better over
the entire range of min-support values that DualRank could run on. We notice that its
performance increases slightly when the min-support is very low. This is only coincidental;
there is no particular direct relationship between the min-support value and the profits
generated by ProMax, because the min-support can only affect the quality of
clustering by modifying the intra-cluster and inter-cluster costs. Both too low and too high
values of min-support could deteriorate the clustering quality. However, there is a damping
effect: when min-support is reduced, there are more small items, leading to a high
intra-cluster cost and a low inter-cluster cost. The increase in intra-cluster cost is often
compensated by the decrease in inter-cluster cost, thereby resulting in no net change in
cluster quality.
We also noticed that, keeping the min-support value constant and varying the
number of items selected, ProMax outperforms the DualRank algorithm, as shown in
Figure 5.2. DualRank always generates static recommendations, independent
of the customer's current transaction. Hence, until the database of previous transaction
history changes, DualRank always recommends the same items.
Figure 5.2: Comparisons of profits earned by the algorithms based on the number of items
selected.
5.5.2 Performance of ProMax
In this experiment, we evaluated the performance of ProMax on both the real and synthetic
datasets. The results are shown in the graph of Figure 5.3.
In this graph, the x-axis denotes the min-support, whereas the y-axis denotes the profit
generated by ProMax. We observe that the algorithm behaves in the same
manner across different datasets. For datasets which are not very sparse, the profit is
comparatively higher. This is because of the clustering approach, where the bulkiness of
the clusters increases.
Also, the clustering quality effect described in the previous experiment is clearly visible
in the real dataset curve. Notice that this curve has a peak at around min-support = 0.05,
when clustering quality is high, and deteriorates on both sides as the min-support is either
increased or decreased.

Figure 5.3: ProMax Performance on different datasets
5.6 Summary
In this chapter, we presented an algorithm that combines simple ideas from association rule
mining, clustering, recommendation systems, and the knapsack problem to recommend those
items to customers that maximize the expected profit. We tested our algorithm on two pop-
ular datasets and compared our algorithm with the previously existing DualRank algorithm.
Our experiments showed that our algorithm performs better than DualRank in terms of
obtaining better profits, while at the same time being faster and simpler.
Chapter 6
Conclusion and Future Work
This thesis has presented methods for helping people understand the development of key
research topics in terms of a generated list of keyphrases and their originating documents
in a collection of time-stamped text documents. In addition, we built an approach for
identifying a related set of keyphrases, and their corresponding originating documents, for
a given user's query. The modeling and evaluation led to the following conclusions and
ideas for future work.
6.1 Conclusions
1. We have proposed and evaluated a method for helping users understand the interactions
between keyphrases and their documents. These offline methods were based on
unsupervised models for document archives where documents accumulate over
time. We developed an approach that identifies a flat list of keyphrases to recommend
to the user, together with the first document, i.e., the originating document, of each of
these keyphrases, drawn from different data mining and database conferences. Our motive
is to help new users and researchers who cannot navigate through each and every topic
and research publication published in various conferences with respect to time-stamp (i.e.,
conference year). No one is willing to spend much time on irrelevant information. Users
who read this limited subset of articles can hopefully get the gist of the important key
ideas. Based on their interests, users can choose their own topics and read only those
papers in depth. Detailed experimental results are described in Section 3.4.
2. In the first approach, we did not consider any input from the user. We generated a list
of keyphrases and output the whole list to the users, who can later decide
their field based on their interests.

In the second approach, discussed in Chapter 4, for a given user's query we recommend
a set of top-k keyphrases based on the importance and proximity of keyphrases, together
with their corresponding originating documents. We developed our approach based on the
knapsack problem to recommend the top-k keyphrases to the user. We used the notion of
a keyphrase evolutionary graph and keyscores to identify the top keyphrases relevant to
the given query. The keyscore of each keyphrase is calculated using the impact of
keyphrases, i.e., their keyranks; we identify the k nearest neighbors of a given user query
and finally output the top-k keyphrases with the highest keyscore values, together with
their corresponding originating documents. In Section 4.3, experimental results are shown
and compared with the Google similarity distance measure.
3. In addition to the above approaches, we also explored our problem in a different domain.
We applied the knapsack-based solution discussed for task 2 (above) in a different domain,
namely recommending a set of items to customers in retail stores such that the profit of
the store is maximized. We developed an algorithm, ProMax, that combines simple
ideas from association rule mining, clustering, recommendation systems and the knapsack
problem.

To validate the performance of our algorithm, we used a real retail data set [BSVW99]
available from the Frequent Itemset Mining Implementations (FIMI) Repository
(http://fimi.cs.helsinki.fi/data) and an IBM Synthetic Dataset [BSVW04]. The experiments
discussed in Section 5.5 show that our algorithm performs better than the competing
algorithm, DualRank, in terms of obtaining more profit, while at the same time being
faster and simpler.
6.2 Future Work
The work discussed in this thesis can be extended further and hence has substantial room
for improvement. Improvements can be made in the following distinct areas.
1. We can improve the performance of the existing keyphrase extraction technique by
introducing other parameters to identify the novelty of keyphrases.
2. We have only considered a flat structure of keyphrases; it would be interesting to
explore a hierarchical structure of keyphrases, which could give us a picture of keyphrase
evolution at different resolutions.
3. To identify the originating document, we consider only the reference section, i.e.,
title, author name, conference year, etc. This work can be extended to check the
presence or absence of keyphrases inside the text of these references, which can increase
the accuracy of the prediction.
4. It would be much more helpful for users if we were able to identify the particular content
of a keyphrase inside the document, instead of requiring them to navigate through the
whole document. Users could then refer only to the content of a keyphrase for quick
reference and understanding of that keyphrase.
5. Our approach for identifying related keyphrases and recommending the top-k keyphrases
corresponding to a user's query is based on a keyscore ranking technique. Although it
performs well for a pioneering effort, it is far from perfect. We would like to improve this
approach by introducing other distance metrics and ranking techniques.
6. Our approach can be tested on larger datasets, and a system can be developed for a
large number of conferences, with user-friendly visualization and ample user interaction
options throughout, making it a very useful practical tool for studying research topics and
their document evolution. Such a system can be very useful for managing all kinds of text
stream data.
7. It would also be interesting to identify the set of papers that are responsible for
introducing a new keyphrase. Similarly, it would be interesting to design an algorithm to
cluster documents that can directly capture the splitting and merging of documents over
time and identify the main keyphrase associated with those documents, and vice versa.
8. Next, for the ProMax algorithm, we believe there are a few aspects open to improvement.
First, in the initial phase of our algorithm, the clustering can be done more efficiently
by appropriately identifying the parameters used in calculating the cost. Moreover, we
consider only the types of items and not their quantities, which can be addressed in
future work.
Publications
• “Mining Landmark Papers”, SIGAI Workshop on Emerging Research Trends in AI
(ERTAI 2010), Mumbai, India, April, 2010.
• “Extended Approach for Mining Landmark Papers from Text Corpus”, Grace Hopper
Celebration Conference India, Bangalore, India, December 7-9, 2010.
• “ProMax: A Profit Maximizing Recommendation System for Market Baskets”, SIGAI
Workshop on Emerging Research Trends in AI (ERTAI 2010), Mumbai, India, April,
2010.
Bibliography
[ACD+98] James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, and
Yiming Yang. Topic Detection and Tracking Pilot Study Final Report. In
Proceedings of the DARPA Broadcast News Transcription and Understanding
Workshop, pages 194–218, 1998.
[ALJ00] James Allan, Victor Lavrenko, and Hubert Jin. First story detection in TDT
is hard. In CIKM '00: Proceedings of the ninth international conference on
Information and knowledge management, pages 374–381, New York, NY, USA,
2000. ACM.
[All02] James Allan, editor. Topic detection and tracking: event-based information
organization. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[ANTT02] Arvind Arasu, Jasmine Novak, Andrew Tomkins, and John Tomlin. PageRank
computation and the structure of the Web: Experiments and algorithms. In
Proceedings of the Eleventh International World Wide, 2002.
[APL98] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection
and tracking. In SIGIR ’98: Proceedings of the 21st annual international ACM
SIGIR conference on Research and development in information retrieval, pages
37–45, New York, NY, USA, 1998. ACM.
[BE97] Regina Barzilay and Michael Elhadad. Using lexical chains for text summa-
rization. In Proceedings of the ACL Workshop on Intelligent Scalable Text
Summarization, pages 10–17, 1997.
[Ber02] Pavel Berkhin. Survey of clustering data mining techniques. Technical report,
2002.
[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual
web search engine. In Proceedings of the seventh international conference on
World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands,
The Netherlands, 1998. Elsevier Science Publishers B. V.
[BSVW99] Tom Brijs, Gilbert Swinnen, Koen Vanhoof, and Geert Wets. Using association
rules for product assortment decisions: a case study. In Proceedings of the
fifth ACM SIGKDD international conference on Knowledge discovery and data
mining, KDD ’99, pages 254–260, New York, NY, USA, 1999. ACM.
[BSVW04] Tom Brijs, Gilbert Swinnen, Koen Vanhoof, and Geert Wets. IBM synthetic
data generator. 2004.
[CC00] David Cohn and Huan Chang. Learning to probabilistically identify authori-
tative documents. In ICML ’00: Proceedings of the Seventeenth International
Conference on Machine Learning, pages 167–174, San Francisco, CA, USA, 2000.
Morgan Kaufmann Publishers Inc.
[CCH+07] Mu-Chen Chen, Long-Sheng Chen, Fei-Hao Hsu, Yuanjia Hsu, and Hsiao-Ying
Chou. HPRS: A profitability based recommender system. In Industrial
Engineering and Engineering Management, 2007 IEEE International Conference
on, pages 219–223, Dec. 2007.
[Con00] The Linguistic Data Consortium, editor. The Year 2000 Topic Detection and
Tracking TDT2000 Task Definition and Evaluation Plan. 2000.
[CV07] Rudi L. Cilibrasi and Paul M. B. Vitanyi. The Google similarity distance. IEEE
Trans. on Knowl. and Data Eng., 19:370–383, March 2007.
[De05] Indro De. Experiments in first story detection, 2005.
[DG06] Jesse Davis and Mark Goadrich. The relationship between precision-recall and
roc curves. In Proceedings of the 23rd international conference on Machine
learning, ICML ’06, pages 233–240, New York, NY, USA, 2006. ACM.
[DP97] Pedro Domingos and Michael Pazzani. On the optimality of the simple bayesian
classifier under zero-one loss. Mach. Learn., 29(2-3):103–130, 1997.
[Gar] E. Garfield. The impact factor.
[Gar55] Eugene Garfield. Citation indexes for science. a new dimension in documentation
through association of ideas. Science, 122:1123–1127, 1955.
[Gar72] Eugene Garfield. Citation analysis as a tool in journal evaluation: journals can be
ranked by frequency and impact of citations for science policy studies. Science,
178(4060):471–479, 1972.
[Gar03] E. Garfield. The Meaning of the Impact Factor. International Journal of
Clinical and Health Psychology, 3(2):363–369, 2003.
[GPW+99] Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe
Frank. Improving browsing in digital libraries with keyphrase indexes. Decis.
Support Syst., 27(1-2):81–104, 1999.
[Hav99] T. Haveliwala. Efficient computation of pagerank. Technical Report 1999-31,
Stanford InfoLab, 1999.
[HH76] M. A. K. Halliday and R. Hasan. Cohesion in English (English Language).
Longman Pub Group, 1976.
[HNB01] Se June Hong, Ramesh Natarajan, and Ilana Belitskaya. A new approach for
item choice recommendations. In Proceedings of the Third International Con-
ference on Data Warehousing and Knowledge Discovery, DaWaK ’01, pages
131–140, London, UK, 2001. Springer-Verlag.
[HNP05] Andreas Hotho, Andreas Nürnberger, and Gerhard Paaß. A brief survey of text
mining. LDV Forum - GLDV Journal for Computational Linguistics and Lan-
guage Technology, 2005.
[HPYM04] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns
without candidate generation: A frequent-pattern tree approach. Data Min.
Knowl. Discov., 8:53–87, January 2004.
[HS78] Ellis Horowitz and Sartaj Sahni. Fundamentals of Computer Algorithms. Com-
puter Science Press, 1978.
[HSGA09] Mohammad Al Hasan, W. Scott Spangler, Thomas Griffin, and Alfredo Alba.
Coa: finding novel patents through text analysis. In KDD ’09: Proceedings of
the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 1175–1184, New York, NY, USA, 2009. ACM.
[Jon98] Steve Jones. Link as You Type. Working Paper. Department of Computer
Science, University of Waikato, New Zealand, 1998.
[Kan03] M. Kantardzic, editor. Data Mining. Wiley-Interscience, Hoboken, 2003.
[Kle99] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM,
46(5):604–632, 1999.
[Kle02] Jon Kleinberg. Bursty and hierarchical structure in streams. In KDD ’02:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 91–101, New York, NY, USA, 2002. ACM.
[LWY06] Minsuk Lee, Weiqing Wang, and Hong Yu. Exploring supervised and unsuper-
vised methods to detect topics in biomedical text. BMC Bioinformatics, 7:140,
2006.
[MFH+03] Amy Mcgovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast,
Jennifer Neville, and David Jensen. Exploiting relational structure to un-
derstand publication patterns in high-energy physics. SIGKDD Explorations,
5:2003, 2003.
[MZ05] Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns
from text: an exploration of temporal text mining. In KDD ’05: Proceedings of
the eleventh ACM SIGKDD international conference on Knowledge discovery in
data mining, pages 198–207, New York, NY, USA, 2005. ACM.
[NIS] National Institute of Standards and Technology.
[PBMW98] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank
citation ranking: Bringing order to the web. Technical report, Stanford
Digital Library Technologies Project, 1998.
[PBMW99] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank
citation ranking: Bringing order to the web, 1999.
[SJ00] R. Swan and D. Jensen. TimeMines: Constructing timelines with statistical
models of word usage, 2000.
[SM86] Gerard Salton and Michael J. McGill. Introduction to Modern Information Re-
trieval. McGraw-Hill, Inc., New York, NY, USA, 1986.
[SW99] Mark Shewhart and Mark Wasson. Monitoring a newsfeed for hot topics. In
KDD ’99: Proceedings of the fifth ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 402–404, New York, NY, USA,
1999. ACM.
[Tom02] Tom Brijs. Retail market basket analysis: A quantitative modelling, 2002.
[Tur99] Peter Turney. Learning to Extract Keyphrases from Text. National Research
Council Canada, Institute for Information Technology, 1999.
[Voo99] Ellen M. Voorhees. Natural language processing and information retrieval. In In-
formation Extraction: Towards Scalable, Adaptable Systems, pages 32–48, Lon-
don, UK, 1999. Springer-Verlag.
[WF04] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. Ism: Item selection for mar-
keting with cross-selling considerations. In PAKDD, pages 431–440, 2004.
[WFW05] Raymond Chi-Wing Wong, Ada Wai-Chee Fu, and Ke Wang. Data mining for
inventory item selection with cross-selling considerations. Data Min. Knowl.
Discov., 11:81–112, July 2005.
[WHGL09] Hei-Chia Wang, Tian-Hsiang Huang, Jiunn-Liang Guo, and Shu-Chuan Li. Jour-
nal article topic detection based on semantic features. In IEA/AIE ’09: Proceed-
ings of the 22nd International Conference on Industrial, Engineering and Other
Applications of Applied Intelligent Systems, pages 644–652, Berlin, Heidelberg,
2009. Springer-Verlag.
[Wil81] Rudolf Wille. Restructuring lattice theory: An approach based on hierarchies
of concepts. Ordered Sets, Ivan Rival Ed., NATO Advanced Study Institute,
83:445–470, September 1981.
[Wit03] Ian H. Witten. Browsing around a digital library. In SODA ’03: Proceedings
of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages
99–99, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Math-
ematics.
[WPF+99] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G.
Nevill-Manning. KEA: Practical automatic keyphrase extraction. In DL ’99:
Proceedings of the fourth ACM conference on Digital libraries, pages 254–255,
New York, NY, USA, 1999. ACM.
[WS02] Ke Wang and Ming-Yen Thomas Su. Item selection by “hub-authority” profit
ranking. In Proceedings of the eighth ACM SIGKDD international conference
on Knowledge discovery and data mining, KDD ’02, pages 652–657, New York,
NY, USA, 2002. ACM.
[wWcF03] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. MPIS: Maximal-profit item
selection with cross-selling considerations. In IEEE International Conference
on Data Mining (ICDM), 2003.
[WXL99] Ke Wang, Chu Xu, and Bing Liu. Clustering transactions using large items. In
Proceedings of the eighth international conference on Information and knowledge
management, CIKM ’99, pages 483–490, New York, NY, USA, 1999. ACM.
[WZH02] Ke Wang, Senqiang Zhou, and Jiawei Han. Profit mining: From patterns to ac-
tions. In Proceedings of the 8th International Conference on Extending Database
Technology: Advances in Database Technology, EDBT ’02, pages 70–87, London,
UK, 2002. Springer-Verlag.
[XJWZ05] Xiujuan Xu, Lifeng Jia, Zhe Wang, and Chunguang Zhou. DualRank: A dual-
phase algorithm for optimal profit mining in retailing market. In ASIAN, pages
182–192, 2005.
[Zho09] Senqiang Zhou. Profit mining. In Encyclopedia of Data Warehousing and Min-
ing, pages 1598–1602. 2009.