Mining Landmark Papers and Related Keyphrases
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (by Research)
in
Computer Science
by
Annu Tuli
200707002
Center for Data Engineering
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
(Deemed University)
Hyderabad, India
July 2011
Thesis Certificate
This is to certify that the thesis entitled “Mining Landmark Papers and Related Keyphrases” submitted by Annu Tuli to the International Institute of Information Technology, Hyderabad, for the award of the Degree of Master of Science (by Research) is a record of bona fide research work carried out by her under my supervision and guidance. The contents of this thesis have not been submitted to any other university or institute for the award of any degree or diploma.
Date Advisor:
Dr. Vikram Pudi
Associate Professor
IIIT Hyderabad
Copyright © Annu Tuli, 2011. All rights reserved.
The author hereby grants to IIIT-Hyderabad permission to reproduce and distribute
publicly paper and electronic copies of the thesis document in whole or in part.
To Parameshwara and my Parents
“The highest men are calm, silent, and unknown. They are the men who really know the
power of thought; they are sure that, even if they go into a cave and close the door and
simply think five true thoughts and then pass away, these five thoughts of theirs will live
through eternity. Indeed such thoughts will penetrate through the mountains, cross the
oceans, and travel through the world.”
- Swami Vivekananda
Acknowledgements
First and foremost, I would like to thank Dr. Vikram Pudi, who has had a tremendous influence on me. On top of being an expert, he has been a dear friend, a profound philosopher and a devoted guide, far beyond ordinary. He introduced me to research and closely
guided my first steps. He has this special property of always searching for the crux of any
matter. This, together with his sharpness, energy and amazing sense of humour makes him
a remarkable person and pleasure to work with. His feedback and encouragement greatly
helped me to keep my spirits up.
I am also thankful to all the faculty and staff members of CDE - Dr. Kamalakar Karlapalem and Dr. P. K. Reddy - for providing a wonderful research center. I also take this
opportunity to thank IIIT-Hyderabad for giving me an opportunity to see the world of
research so closely.
This acknowledgment would be incomplete without mention of my friends who made
my journey memorable. This includes - Amit, Hanuma Kumar, Rohit Paravastu, Aditya,
Prashant, Raghvendra, Pratibha, Saurabh, Padmini, Chetna, Bhanukiran, Velidi Padmini,
Srilakshmi and Lydia.
Above all, I would like to express my gratitude towards my grandmother, parents, sisters, brother-in-law and his family, who have had, are having, and will continue to have a tremendous influence on my development.
Synopsis
Text mining is a subfield of data mining that, in turn, is a component of a more general
category of Knowledge Management (KM). In the real world, knowledge is represented not
only by the structured data found in traditional databases, but in a wide variety of unstructured sources such as books, research papers, Word documents, letters, digital libraries,
e-mail messages, news feeds, Web pages, and so forth. Text mining is particularly relevant
today because of the enormous amount of knowledge that resides in unstructured collection
of text documents. It uncovers relationships in a text collection, and leverages the creativity
of the knowledge worker to explore these relationships and discover new knowledge.
In recent years, we have witnessed a tremendous growth in the volume of text documents
available on the internet, digital libraries, news resources and so on. With the dramatic
growth of text information, there is an increasing need for powerful text mining systems
that can automatically discover useful knowledge from text. Digital data has become one of the most important resources for information. But the fact that more data is available does not necessarily mean that it is being used in an efficient manner. It has been observed that no one is willing to, or capable of, manually browsing through large data collections. To satisfy a user's information need, a system should list a precise and small subset of the data
collection.
People often interact with these document collections and thus may be interested in meth-
ods to help them better use the documents or retrieve the useful knowledge. For retrieving
individual documents, search engines have already been very successful. Other methods
such as topic modeling can provide a coarse overview of the topics in a document collection.
While information retrieval and topic modeling methods have been widely applicable and useful, current methods could still be improved for drilling deeper into a corpus as a whole: understanding the structure and development of research areas in terms of topics, the relationships of topics with each other, and the originating documents of each topic.
In this thesis, we provide methods for a set of tasks that seek to identify the important or
new keyphrases and the corresponding first originating document, known as the Landmark Paper. These methods focus on supplying a fine-grained picture of the development of keyphrases over time, to help users grasp the keyphrase collection's development as a whole. We focus on the problem of finding novel keyphrases within the document collections, and their originating documents across multiple conferences with respect to the conference year. We also recognize that this alone is insufficient for new researchers who are exploring a research area: it is much more helpful to be able to see the related research areas based on proximity and importance of keyphrases and their originating documents.
We investigate a system where the user can enter a set of keyphrases; the system recommends the top-k keyphrases corresponding to the user's query using a knapsack approach, shows the relationships of these topics to each other, and finally outputs the landmark paper for each of those topics. This will essentially provide all the material required for the researcher to exhaustively understand the foundations of that research area.
This thesis explores text-based approaches for these tasks. For wide applicability, the methods use only document text. We evaluate our methods experimentally on actual research publications from the proceedings of different data mining and database conferences available on the Digital Bibliography Library Project (DBLP) website1. We have prepared a
cleaned-up dataset with the text proceedings to conduct this evaluation.
In addition to the above tasks, we have developed an approach that applies the knapsack-based solution discussed above in a different domain, i.e. recommending a set of items to customers in retail stores such that the profit of the store is maximized.
1http://www.informatik.uni-trier.de/~ley/db/conf/indexa.html
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Overview of Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background and Related Work 9
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 KEA : Keyphrase Extraction Algorithm . . . . . . . . . . . . . . . . 10
2.1.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Receiver Operator Characteristic (ROC) Curve and Space . . . . . . 14
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Topic Detection and Tracking . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 First Story Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 Hot Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.5 Temporal Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.6 Journal Article Topic Detection based on Semantic Features . . . . . 17
2.2.7 COA: Finding Novel Patents through Text Analysis . . . . . . . . . . 18
2.3 Differences from Landmark Paper Mining . . . . . . . . . . . . . . . . . . . . 18
3 Mining Landmark Papers 21
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Extracting Keyphrases . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Identifying Landmark Papers . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Experimental Results and Evaluation . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.2 Identifying Keyphrases and Originating Documents from Each Con-
ference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Number of Misclassifications . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.4 Identifying Keyphrases and Originating Documents from Multiple Con-
ferences Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Mining Topic-based Landmark Papers 40
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Steps of Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Mining Related Keyphrases . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 Keyphrase Evolutionary Graph . . . . . . . . . . . . . . . . . . . . . 44
4.2.3 Matching Queries and Keyphrases . . . . . . . . . . . . . . . . . . . . 45
4.2.4 Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.5 Evaluating Keyphrase Ranking (i.e. Keyrank) . . . . . . . . . . . . . 50
4.2.6 Identifying k - Nearest Neighbors of each Keyphrase . . . . . . . . . 51
4.2.7 Evaluating Keyscore of Keyphrases . . . . . . . . . . . . . . . . . . . 53
4.2.8 Recommending Keyphrases . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.9 Identify Originating Document . . . . . . . . . . . . . . . . . . . . . 54
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Case Study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.2 Case Study 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.3 Case Study 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Identifying Originating Documents . . . . . . . . . . . . . . . . . . . 63
4.3.5 Analysis using Google Similarity Distance Measure . . . . . . . . . . 79
4.3.6 Spearman’s Rank Correlation . . . . . . . . . . . . . . . . . . . . . . 82
4.3.7 t - test for testing the significance of an observed sample correlation
coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.8 Analysis of Results using Rank Correlation and t - test . . . . . . . . 85
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 ProMax: A Profit Maximizing Recommendation System for Market Bas-
kets 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 The ProMax Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.1 Clustering Customer Transactions and Identification of the Current
Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.2 Calculating Expected Profit . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.3 Recommending Items: Knapsack Approach . . . . . . . . . . . . . . . 98
5.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5.1 Performance of DualRank . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.2 Performance of ProMax . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6 Conclusion and Future Work 104
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Bibliography 107
List of Figures
1.1 Block Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Relationship between the set of relevant and retrieved documents. . . . . . . 12
3.1 Flow Diagram for Identifying Landmark Papers. . . . . . . . . . . . . . . . . 24
3.2 ROC Space: Comparison of Different Predicted Results. . . . . . . . . . . . 38
3.3 Different Prediction Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Contingency Table 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Contingency Table 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Contingency Table 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Contingency Table 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Contingency Table 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.9 Contingency Table 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.10 Contingency Table 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11 Contingency Table 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 An Example of Keyphrase Evolutionary Graph . . . . . . . . . . . . . . . . . 47
4.2 Keyphrase Evolutionary Graph relevant to the query term “decision trees”. . 56
4.3 Keyphrase Evolutionary Graph for query terms (“clustering uncertain data”, “data streams”). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Keyphrase Evolutionary Graph relevant to the query term “skyline query”. . 61
4.5 Keyphrase Evolutionary Graph relevant to the query term “target schema”. . 62
4.6 The Significance of Spearman’s Rank Correlation and degrees of freedom. . . 86
5.1 DualRank Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Comparisons of profits earned by the algorithms based on the number of items
selected. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 ProMax Performance on different datasets . . . . . . . . . . . . . . . . . . . 103
List of Tables
2.1 Sample Output of Keyphrase Extraction from Research Articles. . . . . . . . 11
2.2 A confusion matrix for positive and negative tuples. . . . . . . . . . . . . . . 13
3.1 Basic information of data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 List of keyphrases and corresponding originating documents in VLDB confer-
ence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 List of keyphrases and corresponding originating documents in SIGMOD con-
ference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 List of keyphrases and corresponding originating documents in ICDE conference. 32
3.5 List of keyphrases and corresponding originating documents in ICDM confer-
ence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 List of keyphrases and corresponding originating documents in KDD conference. 34
3.7 List of keyphrases and corresponding originating documents in PAKDD con-
ference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.8 Number of incorrectly classified keyphrases from each conference. . . . . . . . 36
3.9 List of global keyphrases and originating documents from multiple conferences. 37
4.1 Sample market-basket dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Documents and Set of Keyphrases . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Frequent Keyphrases and Support Count . . . . . . . . . . . . . . . . . . . . 44
4.4 Keyphrases and Keyranks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Keyphrases and Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Keyphrases and Keyscores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.7 List of keyphrases and their keyranks matching with query term “decision trees” 69
4.8 List of keyphrases and their distances corresponding to the query term “deci-
sion trees”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.9 List of keyphrases and their keyscores corresponding to the query term “de-
cision trees”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.10 List of keyphrases and their keyranks corresponding to the query terms (“clustering uncertain data”, “data streams”) . . . . . . . . . . . . . . . . . . . . . 70
4.11 List of keyphrases and their distances matching with query terms (“clustering uncertain data”, “data streams”). . . . . . . . . . . . . . . . . . . . . . . . . 71
4.12 List of keyphrases and their keyscores corresponding to the query terms (“clustering uncertain data”, “data streams”) . . . . . . . . . . . . . . . . . . . . . 71
4.13 List of keyphrases and their keyranks corresponding to the query term “skyline
query” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.14 List of keyphrases and their keyranks corresponding to the query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.15 List of keyphrases and their distances matching with query term “skyline query” 74
4.16 List of keyphrases and their distances matching with query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.17 List of keyphrases and their keyscores matching with query term “skyline query” 76
4.18 List of keyphrases and their keyscores matching with query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.19 List of keyphrases and corresponding originating documents . . . . . . . . . . 78
4.20 List of keyphrases and their NGD corresponding to the query term “decision
trees”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.21 List of keyphrases and their NGD corresponding to the query terms (“clustering uncertain data”, “data streams”). . . . . . . . . . . . . . . . . . . . . . . 82
4.22 List of keyphrases and their NGD corresponding to the query term “skyline
query”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.23 List of keyphrases and their NGD corresponding to the sub-query term “target
schema”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.24 Spearman’s rank correlation results . . . . . . . . . . . . . . . . . . . . . . . 85
4.25 t - test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
List of Algorithms
1 Mining Landmark Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 FP-Tree Algorithm for finding frequent itemsets . . . . . . . . . . . . . . . . . 45
3 FP-Growth(T, α) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Keyrank(graph, dampingfactor=0.85, maxiterations=100, mindelta= 0.00001) 65
5 Dijkstra’s Shortest Path Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7 Mining Profitable Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Chapter 1
Introduction
The recent proliferation of the World Wide Web and the availability of inexpensive storage media have led to the accumulation of enormous amounts of data. Digital data has become one of the most important resources for information. But the fact that more data is available doesn't necessarily mean that it is being used in an efficient manner. It is the sheer amount of data that emphasizes the need for intelligent automatic access; no one is willing to, or capable of, manually browsing through large data collections. To satisfy a user's information need, a
system should list a precise and small subset of the data collection. The ultimate goal is to
help users find what they are looking for.
Text Data Mining (TDM) can be considered a field of its own, containing a number of
applications. It has also been known as text analysis, text mining or knowledge discovery
in text. In general, TDM applications are used to extract non-trivial and useful information
from large corpora of text data, which are available in unstructured or structured format.
Text mining applications require the use and application of many related fields such as
Information Retrieval, Machine Learning, Statistics, and Linguistics. There are various
applications of TDM, such as in bio-informatics, market research, consumer trend studies,
and scientific research [De05].
Information Retrieval (IR) and Information Extraction (IE) areas are associated with text
mining. IE has the goal of transforming a collection of documents into information that is
more readily digested and analyzed with the help of an IR system. IE extracts relevant facts
from the documents, while IR selects relevant documents. IE is a kind of pre-processing
stage in the text mining process, which is the step after the IR process and before the data
mining techniques are performed.
Today, the internet is going through a rapid phase of growth and development. With the growth of the internet, information contained in electronic documents is increasingly widespread, with the World Wide Web as its primary repository. The convenience of electronic documents has motivated their more efficient application in information management
and knowledge discovery [WHGL09].
A typical information retrieval problem is to locate relevant documents in a document
collection based on a user’s query, which is often some keywords describing an information
need, although it could also be an example relevant document. In such a search problem,
a user takes the initiative to “pull” the relevant information out from the collection; this is
most appropriate when a user has some ad hoc (i.e. short-term) information need, such as
finding information to buy a used car. When a user has a long-term information need (e.g. a
researcher’s interests), a retrieval system may also take the initiative to “push” any newly
arrived information item to the user if the item is judged as being relevant to the user’s
information need.
One goal of text mining is to provide automatic methods to help people grasp the key
ideas in ever-increasing document collections. People often interact with these document
collections and thus may be interested in methods to help them better “use” the documents.
For retrieving individual documents, search engines have already been very successful. Other
methods such as topic modeling can provide a coarse overview of the topics in a document
collection. While information retrieval and topic modeling methods have been widely appli-
cable and useful, current methods for drilling deeper to understand the idea structure and
development in a corpus as a whole could still be improved.
We provide methods for a set of tasks that seek to identify important or new keyphrases and the corresponding first originating documents from a corpus over time. These methods focus on supplying a fine-grained picture of the extracted keyphrases over time, to help users grasp the keyphrase collection's development and the originating documents with respect to various conferences as a whole. We focus on the problem of finding important keyphrases within the document collections, and the development of keyphrases from documents
through multiple conferences with respect to the conference year. However, this alone is insufficient for new researchers who are exploring research areas. It is much more helpful to be able to see the relevant research areas in terms of proximity and importance
of keyphrases. This will essentially provide all the material required for a researcher to
exhaustively understand the developments of research areas. This thesis explores text-based
approaches for these tasks. For wide applicability, the methods use only document text.
We evaluate our methods experimentally on research publications from the proceedings of different data mining and database conferences available on the Digital Bibliography Library Project (DBLP) site. We have prepared a cleaned-up dataset with the text proceedings to
conduct this evaluation.
1.1 Motivation
In many application domains, we encounter a stream of text, in which each text document has
some meaningful time stamp. For example, a collection of news articles about a topic and
research papers in a subject area can both be viewed as natural text streams with publication
dates as time stamps. In such text data streams, there often exist some interesting and
meaningful keywords. For example, an event covered in news articles generally has some
meaningful keywords consisting of themes (i.e. subtopics) characterizing the beginning,
progression, and impact of the event, among others. Similarly, in research papers, some
important and meaningful keywords may also exhibit similar patterns. For example, the
study of one topic specified by some keyphrases in one time period may have influenced or stimulated the study of another topic associated with the same keyphrases after that period. In all these cases, it would be very useful if we could discover and extract these important keyphrases and also automatically identify the first corresponding paper, to learn where the keyphrases originate with respect to the time stamp. Indeed, such research papers are not only useful by themselves, but would also facilitate organization and navigation of the information stream according to the underlying keywords. In addition, it is helpful for users to explore further if their query also returns other relevant, related information.
1.2 Problem Description
In this section, we discuss our problem in two parts:
1. We focus on the problem of finding a list of keyphrases and also identifying the first document in the corpus where each important keyphrase is introduced. We present the problem of Mining Landmark Papers (MLP). This problem requires simultaneously understanding which keyphrases/topics are new or important and which documents drive these keyphrases. The following definition formally captures the problem statement:
Definition 1 (Landmark Paper Mining) Given a collection of time-indexed documents C = {d1, d2, ..., dT}, where di refers to a document with time stamp i, and each document is a sequence of words from a vocabulary set V = {w1, w2, ..., w|V|}, the problem is to identify the first document that introduces each important keyphrase; such documents are known as landmark papers.
This can be broken into two sub-problems:
(a) Find the right keyphrases/topics in a collection of documents.
(b) Identify the originating documents of important keyphrases.
2. Mining topic-based landmark papers: For a given user query (in terms of keywords), find the keyphrases relevant to the query term, and recommend the top-k related keyphrases along with their originating documents, if they exist.
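Sub-problem (b) above reduces, at its simplest, to a scan over the time-indexed collection: for each important keyphrase, the landmark paper is the earliest document containing it. The following minimal sketch illustrates this idea; the document ids, years, and keyphrases are hypothetical, and in practice each document's keyphrase set would first come from a keyphrase extraction step:

```python
def find_landmark_papers(docs, important_keyphrases):
    """Map each important keyphrase to the id of the earliest document
    containing it. docs is a list of (timestamp, doc_id, keyphrase_set)."""
    landmarks = {}
    for timestamp, doc_id, keyphrases in sorted(docs):  # scan in time order
        for kp in keyphrases & important_keyphrases:
            landmarks.setdefault(kp, doc_id)  # keep only the first occurrence
    return landmarks

# Illustrative toy corpus (not real data from the thesis).
docs = [
    (2003, "paper-A", {"skyline query", "data streams"}),
    (2001, "paper-B", {"data streams"}),
    (2005, "paper-C", {"skyline query"}),
]
landmarks = find_landmark_papers(docs, {"skyline query", "data streams"})
# "data streams" originates in paper-B (2001), "skyline query" in paper-A (2003)
```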
In addition to the above approach, we also explore our problem in a different domain. We formalize a technique that utilizes a knapsack-based solution to recommend an optimal set of items to customers in a retail store, based on the contents of their market baskets, such that the overall profit of the store is maximized.
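The knapsack-based selection underlying both settings (top-k keyphrase recommendation and profitable item recommendation) can be sketched as a standard 0/1 knapsack: each candidate carries a value (e.g. a keyscore or an expected profit) and an integer cost, and we choose the subset maximizing total value within a budget. The item names and numbers below are illustrative, not values from the thesis:

```python
def knapsack_select(items, capacity):
    """0/1 knapsack by dynamic programming over integer costs.
    items: list of (name, cost, value); returns (best_value, chosen names)."""
    best = [(0.0, []) for _ in range(capacity + 1)]
    for name, cost, value in items:
        for cap in range(capacity, cost - 1, -1):  # backwards: each item used at most once
            cand = best[cap - cost][0] + value
            if cand > best[cap][0]:
                best[cap] = (cand, best[cap - cost][1] + [name])
    return best[capacity]

# Hypothetical candidates with (cost, value) pairs.
items = [("itemA", 2, 3.0), ("itemB", 3, 4.0), ("itemC", 4, 5.0), ("itemD", 5, 8.0)]
value, chosen = knapsack_select(items, 7)
# best subset within cost budget 7 is itemA + itemD, total value 11.0
```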
1.3 Scope
Mining Landmark papers is not only useful for the beginning researcher, but for anyone
keeping track of important developments in a particular area. This is important today
due to the large numbers of researchers and published research papers. Keeping track of
keyphrases, their related keyphrases and landmark papers is especially useful to keep track of key research developments not necessarily in the specific area of one's research, but in its
numerous related areas, which tend to be voluminous.
Consider, for example, that there are often hundreds of research papers published annually in
a research area. A researcher, especially a beginning researcher, often wants to understand
how the research topics in the literature have been evolving. For example, if a researcher
wants to know about data mining, both the historical milestones and the recent research
trends of data mining would be valuable for him/her. Identifying the origins of important
and new keyphrases will also make it much easier for the researcher to selectively choose an appropriate new field of research. The corresponding first document (i.e. landmark paper) for a keyphrase will also help researchers read only those papers matching their research interests.
1.4 Contribution of the Thesis
In this thesis, our work explores how document collections develop over time, specifically by detecting keyphrases from documents, looking at important keyphrases and their related keyphrases, and detecting in which documents these keyphrases originate. These (entirely) text-based methods can be used to detect new/important keyphrases that develop over time and the corresponding originating documents of those keyphrases with respect to the time-stamp. We address the problems of identifying the right/important keyphrases and the originating documents that introduce new keyphrases with large impact. Figure 1.1 shows the block diagram of the steps performed by our approach. In the next section, we briefly define the steps for keyphrase extraction and our approach to identifying landmark papers.
Figure 1.1: Block Diagram.
1.5 Overview of Proposed Approach
Keyphrases provide semantic metadata that summarize and characterize documents. To extract keyphrases from documents, we use the Keyphrase Extraction Algorithm1 (KEA) [WPF+99], an algorithm for automatically extracting keyphrases from text. It is a single-document summarizer, which employs a Bayesian supervised learning approach to build a model from training data, then applies the model to unseen documents for keyphrase extraction [WPF+99].
KEA is simple, effective, and publicly available. KEA's extraction algorithm has two stages:
1. Training: create a model for identifying keyphrases, using training documents for which the author-assigned keyphrases are known.
2. Extraction: choose keyphrases from a new document, using the above model.
Both stages choose a set of candidate phrases from their input documents, and then
calculate the values of certain attributes (called features) for each candidate.
In our experiments, we consider the full text of documents when extracting keyphrases. We set the minimum keyphrase length to 2 words and the maximum to 3 words, to avoid redundant keywords.
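The two standard features KEA computes for each candidate phrase are its TF×IDF score and the relative position of its first occurrence [WPF+99]. A simplified recomputation of both, as a sketch over a toy corpus rather than KEA's actual implementation, might look like:

```python
import math

def kea_features(phrase, doc, corpus):
    """Simplified version of KEA's two candidate-phrase features:
    TF*IDF and normalized first-occurrence position. phrase is a list of
    words; doc and each corpus document are token lists."""
    n = len(phrase)
    target = tuple(phrase)
    windows = [tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)]
    tf = windows.count(target) / max(len(windows), 1)
    df = sum(1 for d in corpus
             if target in {tuple(d[i:i + n]) for i in range(len(d) - n + 1)})
    idf = -math.log2(max(df, 1) / len(corpus))  # rarer phrases score higher
    first = windows.index(target) / max(len(windows), 1) if target in windows else 1.0
    return tf * idf, first

# Toy documents for illustration only.
doc = "decision trees for mining data streams with decision trees".split()
other = "skyline query processing over data streams".split()
tfidf, first = kea_features(["decision", "trees"], doc, [doc, other])
# tf = 2/8, idf = -log2(1/2) = 1, so tfidf = 0.25; first occurrence at position 0
```

KEA then feeds these features to a Naive Bayes model; the sketch above only shows how the raw feature values could be derived from text.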
Next, to identify originating documents from the corpus, we propose an approach that is easy, direct, and simple to understand. We prepared a database of well-tagged “Data Mining/
1http://www.nzdl.org/Kea/
6
Databases” research papers from DBLP (Digital Bibliography and Library Project) website2.
DBLP is a computer science bibliography website hosted at university of Trier in Germany.
We extract the information related to data mining and databases conferences like VLDB,
ICDM and SIGMOD etc. and store the information in our database to perform the experi-
ments. The information we extract includes the year of conference, author’s name, conference
name, paper title, and general paper topic if any and full-text of research papers. In our
approach, we consider time-stamp i.e. conference month and year as one of the important
parameter for sorting the resulting documents, and additional pruning step to refine our
results, we consider references section of their corresponding candidate landmark papers, as
the another important parameter.
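The sorting and pruning steps just described can be sketched as follows; the record fields (`title`, `timestamp`, `references_with_phrase`) are illustrative assumptions, not our actual database schema:

```python
def earliest_originating_doc(candidates):
    """candidates: list of dicts with 'title', 'timestamp' as a
    (year, month) tuple, and 'references_with_phrase' (True if some
    referenced paper already uses the keyphrase). Returns the title of
    the earliest candidate that survives the reference-based pruning."""
    # Sort by time-stamp so earlier papers are considered first.
    ordered = sorted(candidates, key=lambda d: d["timestamp"])
    for doc in ordered:
        # Pruning step: skip papers whose references already use the phrase,
        # since they cannot be the origin of that phrase.
        if not doc["references_with_phrase"]:
            return doc["title"]
    return None
```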
1.6 Applications
Text is the natural choice for formal exchange of information by common people, through
electronic mail, Internet chat, the World Wide Web, digital libraries, electronic publications,
and technical reports, to name a few. The wealth of information typically stored
in text (or document) databases distributed all over is enormous, and such databases are
growing exponentially with the current revolution in internet and information technology.
Automatic understanding of the content of textual data, and hence the extraction of relevant
knowledge from it, is a long-standing challenge in Artificial Intelligence. In order to aid the
mining of information from large collections of such text databases, special types of data
mining methodologies, known as "text mining", have been developed. Our ultimate goal is
to help the user identify relevant information and extract the information in which the user
might be interested in the future. Today, users do not want to spend a huge amount of time
searching for the relevant information of their choice. To address these aspects of users' needs, we
go one step further to identify recent research trends and the corresponding originating
research articles, helping the user grasp the key knowledge of the topics quickly. The
applications of our work span a wide variety of fields. It involves many domains
of text documents, such as news articles, web pages, research publications and journals,
digital libraries, blogs, email analysis, electronic publication of books, technical and business
documents, thesis and dissertation reports, patent data analysis and so on.
2http://www.informatik.uni-trier.de/~ley/db/conf/indexa.html
1.7 Organization of the Thesis
In addition to the problem definition of our work, we gave a brief introduction to the motivation
and scope of our problem. The remainder of this thesis is organized as follows:
• Chapter 2: Describes the background knowledge and relevant literature in the con-
text of this thesis. In this chapter we first discuss some background information,
then related approaches, and later we explain how our approach differs from other
existing approaches.
• Chapter 3: Presents the framework of MLP (Mining Landmark Papers). In this chapter,
we first describe the pre-processing step that is required for keyphrase extraction in
the context of research documents. Next, we define our methodology for identifying
landmark papers and present the experimental results and evaluation of our approach
on a text corpus.
• Chapter 4: Develops an approach to
– find related keyphrases based on the proximity and importance of keyphrases and
recommend a set of top-k keyphrases corresponding to the user's query, and
– identify the corresponding originating documents of those keyphrases.
We also discuss various case studies later in the chapter.
• Chapter 5: Uses the knapsack-based solution developed in Chapter 4 in a
different domain, i.e., to recommend a set of items to the customers in retail stores such
that the profit of the store is maximized. We evaluate our approach experimentally
and compare it against the existing approach.
• Chapter 6: Finally, this chapter concludes our work and presents possible directions
for future work.
Chapter 2
Background and Related Work
In this chapter, we first present some background and related information that is useful
for understanding the underlying idea and motivation of our research problem. Later we
discuss the related approaches and how our approach differs significantly from other existing
approaches.
2.1 Background
In this section, we first explain the notion of keyphrase extraction, and next we describe the
evaluation metrics that are used to evaluate the performance of our results. Keyphrases pro-
vide a simple way of describing a document, giving the reader some clues about its content.
In addition, keyphrases can help users get a feel for the content of a collection, provide sensi-
ble entry points into it, show how queries can be extended, facilitate document skimming by
visually emphasizing important phrases, and offer a powerful means of measuring document
similarity [GPW+99, Jon98, Wit03]. Many open-source tools are available to ex-
tract keyphrases from research articles. We have used the Keyphrase Extraction Algorithm,
i.e., KEA [WPF+99], for extracting keyphrases from our database. In the next section, we
explain KEA in detail.
2.1.1 KEA : Keyphrase Extraction Algorithm
KEA [WPF+99] is an algorithm for extracting keyphrases from a text corpus. It can be used
either for free indexing or for indexing with a controlled vocabulary. KEA1 is implemented in
JAVA and is platform independent. It is open-source software.
KEA identifies candidate keyphrases using lexical methods [Tur99], calculates feature values
for each candidate, and uses a machine-learning algorithm to predict which candidates are
good keyphrases. The machine learning scheme first builds a prediction model using training
documents with known keyphrases, and then uses the model to find keyphrases in new
documents. It uses the Naive Bayes machine learning algorithm for training and keyphrase
extraction.
In Table 2.1, we show example output of KEA on our dataset. Table 2.1 shows
the titles of 2 research articles and 2 sets of keyphrases for each article. The first set gives the
keyphrases assigned by the author; the other was determined automatically from the article's
full text by KEA. Phrases in common between the two sets are italicized. As seen from
Table 2.1, the automatically extracted keyphrases and the author-assigned keyphrases are
quite similar.
KEA’s extraction algorithm has two stages:
1. Training: Create a model for identifying keyphrases, using training documents where
the authors' keyphrases are known. KEA first needs to create a model that learns
the extraction strategy from manually indexed documents. In our database, we used
150 research articles as training documents and assigned their keyphrases manually. For
each training document, candidate phrases are identified and their feature values are
calculated. To reduce the size of the training set, KEA discards any phrase that occurs
only once in the document. Each phrase is then marked as a keyphrase or a non-
keyphrase, using the actual keyphrases for that document. This binary feature is used
by the machine learning scheme. The scheme then generates a model that predicts
the class using the values of the other two features. It uses the Naive Bayes technique
(e.g. [DP97]) because it is simple and yields good results. This scheme learns two sets
of numeric weights from the discretized feature values, one set applying to positive ("is
a keyphrase") examples and the other to negative ("is not a keyphrase") instances.

Table 2.1: Sample Output of Keyphrase Extraction from Research Articles.

Article 1: "A Bayesian Method for Guessing the Extreme Values in a Data Set"
    Author's assigned keyphrases: Bayesian; Bayesian; Query Processing; Data Management; Estimator
    KEA's generated keyphrases: Bayesian approach; Bayesian method; query processing; data management; Tree Traversal; method for guessing

Article 2: "Efficient Processing of Top-k Dominating Queries on Multi-Dim Data"
    Author's assigned keyphrases: Top-k dominating queries; Skyline Queries; Multi Dimensional Data; Ranking Functions; Sub-linear speed algorithm; Non-Indexed Data
    KEA's generated keyphrases: top-k dominating queries; skyline queries; multi-dimensional data; ranking function; non-indexed data
2. Extraction: To identify keyphrases in a new document, KEA determines candidate
phrases and feature values, and then applies the model built during the training phase.
This model determines the overall probability that each candidate is a keyphrase, and
a post-processing operation then selects the best set of keyphrases. Phrases with the
highest probabilities are selected into the final set of keyphrases. The user can specify
the number of keyphrases to be selected. In our experiments, we specify that the
number of keyphrases to be extracted from each document is 10.
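The final selection step can be sketched as follows, assuming the trained model has already assigned each candidate phrase a probability; the scoring dictionary here is a stand-in for KEA's Naive Bayes output, not KEA itself:

```python
def select_keyphrases(scored_candidates, num_keyphrases=10):
    """scored_candidates: dict mapping candidate phrase -> probability of
    being a keyphrase, as produced by the trained model. Returns the
    user-specified number of top-scoring phrases (10 in our experiments)."""
    ranked = sorted(scored_candidates.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [phrase for phrase, _ in ranked[:num_keyphrases]]
```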
2.1.2 Evaluation Metrics
To evaluate the performance of our text retrieval system we use standard measures such
as recall, precision and the F-score. Let the set of documents relevant to a query be
denoted as Relevant, and the set of documents retrieved be denoted as Retrieved. The set
of documents that are both relevant and retrieved is denoted as Relevant ∩ Retrieved, as
shown in the Venn diagram in Figure 2.1. There are two basic measures for assessing the
quality of text retrieval.
Figure 2.1: Relationship between the set of relevant and retrieved documents.
Definition 2 (Precision) This is the percentage of retrieved documents that are in fact
relevant to the query (i.e., "correct responses"). It is formally defined as:

Precision(P) = |Relevant ∩ Retrieved| / |Retrieved|    (2.1)
Definition 3 (Recall) This is the percentage of documents that are relevant to the query
and were, in fact, retrieved. It is formally defined as:

Recall(R) = |Relevant ∩ Retrieved| / |Relevant|    (2.2)
An information retrieval system often needs to trade off recall for precision or vice versa.
One commonly used trade-off measure is the F-score, which is defined as the harmonic mean of recall
and precision:

F-score = (2 × recall × precision) / (recall + precision)    (2.3)
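The set-based precision, recall and F-score definitions above translate directly into code; this is a minimal sketch:

```python
def precision_recall_fscore(relevant, retrieved):
    """relevant, retrieved: sets of document identifiers. Returns
    (precision, recall, F-score) per Eqs. (2.1)-(2.3)."""
    overlap = len(relevant & retrieved)          # |Relevant ∩ Retrieved|
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of recall and precision.
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore
```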
To check the accuracy of our system, we use the notion of a confusion matrix. The confusion
matrix is a useful tool for analyzing how well our classifier can recognize tuples of different
classes. In the context of the confusion matrix, we build our analogy in terms of retrieved and
relevant documents with respect to actual and predicted class labels. A confusion matrix
for two class labels is shown in Table 2.2.

Table 2.2: A confusion matrix for positive and negative tuples.

                   Predicted Results
Actual Results  |   tp   |   fp
                |   fn   |   tn
In terms of tp (true positive), tn (true negative), fp (false positive) and fn (false negative),
Recall and Precision are defined as:

Definition 4 (Precision) This is the probability that a (randomly selected) retrieved doc-
ument is relevant. It is formally defined as:

Precision(P) = tp / (tp + fp)    (2.4)

Definition 5 (Recall) This is the probability that a (randomly selected) relevant document is
retrieved in a search. It is formally defined as:

Recall(R) = tp / (tp + fn)    (2.5)
To quantify the number of correctly and incorrectly classified documents, we use the accuracy and error-rate
measures. The accuracy of a system is the percentage of test-set tuples that are correctly
classified, and the error rate identifies how many are misclassified. The accuracy and error rate are
given by:

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (2.6)

error-rate = 1 − Accuracy    (2.7)
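A minimal sketch of the count-based metrics of Eqs. (2.4)-(2.7), computed from the confusion-matrix entries tp, tn, fp and fn:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy and error rate from
    confusion-matrix counts. Assumes tp+fp > 0 and tp+fn > 0."""
    precision = tp / (tp + fp)                       # Eq. (2.4)
    recall = tp / (tp + fn)                          # Eq. (2.5)
    accuracy = (tp + tn) / (tp + tn + fp + fn)       # Eq. (2.6)
    error_rate = 1 - accuracy                        # Eq. (2.7)
    return precision, recall, accuracy, error_rate
```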
2.1.3 Receiver Operator Characteristic (ROC) Curve and Space
In a binary decision problem, a classifier labels examples as either positive or negative.
The decision made by the classifier can be represented in a structure known as a confusion
matrix or contingency table. Several evaluation metrics can be derived from the contingency table, as
discussed above. The confusion matrix can also be used to construct a point in ROC space. An
ROC curve shows how the number of correctly classified positive examples varies with the
number of incorrectly classified negative examples. In ROC space [DG06], one plots the False
Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis, depicting the
relative trade-offs between true positives and false positives. The FPR measures the fraction
of negative examples that are misclassified as positive. The TPR measures the fraction of
positive examples that are correctly labeled.
The best possible prediction method would yield a point in the upper left corner, at co-
ordinate (0,1) of the ROC space, representing 100 percent sensitivity (no false negatives) and
100 percent specificity (no false positives). The (0,1) point is also called perfect classification.
A completely random guess would give a point along the diagonal line, i.e., the line of no dis-
crimination from the bottom-left to the top-right corner. This diagonal line divides the ROC
space: points above the diagonal represent good classification results, and points below the
line represent poor classification.
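A small sketch of placing a classifier in ROC space from actual and predicted binary labels; the helper names here are illustrative:

```python
def roc_point(actual, predicted):
    """Compute the (FPR, TPR) coordinates of a classifier in ROC space
    from parallel lists of actual and predicted 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fpr = fp / (fp + tn) if fp + tn else 0.0   # negatives misclassified
    tpr = tp / (tp + fn) if tp + fn else 0.0   # positives correctly labeled
    return fpr, tpr

def above_diagonal(fpr, tpr):
    """Points above the no-discrimination diagonal (tpr > fpr) indicate
    better-than-random classification."""
    return tpr > fpr
```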
2.2 Related Work
In this section, we review in detail the related work that has already been published in the
literature.
2.2.1 Overview
For self-referential document collections such as research publications, emails, or news articles,
we would like to answer some basic questions: Does a document d introduce some important/new
keyphrases for the first time, in comparison with another document d′? Why is one document
more important than another? This information can then be put together to an-
swer more complicated questions such as the following: Which documents contain new
keyphrases introduced for the first time? These documents are the important ones and can be called
landmark papers, since they represent the essence of new keyphrases introduced for the first time.
Answering this fundamental question has many applications. On the web, methods such as
Hubs and Authorities [Kle99] and PageRank [PBMW99] have been used to find important
documents. There is a whole research community that analyzes research publications by
their citations to determine which have the most impact [Gar55, Gar72].
The number of electronic documents is growing faster than ever before [WHGL09]: in-
formation is generated faster than people can deal with it. In order to handle this problem,
many electronic periodical databases have proposed keyword search methods to decrease
many electronic periodical databases have proposed keyword search methods to decrease
the effort and time spent by users in searching the electronic documents of their interest.
However, the users still have to deal with a huge number of search results. How to provide
an efficient search, i.e., to present the search results in categories, has become an important
current research issue. If search results can be classified and shown by their topics, users can
find papers of interest quickly.
In the most popular form of search, the search criteria are keywords, or concepts that may
be contained in the electronic documents [Voo99]. However, the users still have to deal with
the overabundance of search results in some way. During the last decade, the question of
how best to filter the results of search engines has become an important issue.
2.2.2 Topic Detection and Tracking
Topic Detection is an experimental method for automatically organizing search results. It
can help users save time in identifying useful information in large-scale electronic docu-
ment collections. In [All02], a topic is defined to be a seminal event or activity, along with all directly
related events and activities. Many different data mining methods are employed to recognize
topics, for instance, the Naive Bayes classifier [LWY06], hierarchical clustering algorithms
(HCA) [HNP05, Ber02, Kan03], paragraph relationship maps [SM86], Formal Concept Anal-
ysis (FCA) [Wil81] and lexical chains [BE97, HH76]. These methods use the frequencies
of words to calculate the similarity between two documents. Therefore, their accuracy is
greatly hindered by the presence of synonyms.
Halliday and Hasan [HH76] proposed a semantics-based lexical chain method that can
be used to identify the central theme of a document. Based on the lexical chain method,
combined with the electronic WordNet database2, the proposed method clusters electronic
documents by semantic similarity and extracts the important topics for each cluster. Ulti-
mately, the method provides more user-friendly search results that are relevant to the topics.
The authors of [WHGL09] proposed a document topic detection method based on semantic features
in order to improve upon the traditional methods. The key contribution is to design a method
based on bibliographic structures and semantic properties to extract important words and
cluster the literature. It can be used to retrieve the topics and display the search results
clustered by topic. Expert users can easily acquire literature of interest and correctly find
information from the topic-cluster display.
2.2.3 First Story Detection
In [De05], Indro et al. presented First Story Detection (FSD), whose task requires identifying
those stories within a large set of data that discuss an event that has not already been
reported in earlier stories. In this FSD approach, the algorithm looks for keywords in a news
story and compares the story with earlier stories. FSD is defined as the process of finding all
stories within a corpus of text data that are the first stories describing a certain event [Con00].
An event is a topic that is described or reported in a number of stories. Examples can be
government elections, natural disasters, sports events, etc. The First Story Detection process
runs sequentially, looking at a time-stamped stream of stories and making its decisions
based on a comparison of key terms to previous stories. FSD is closely linked to the Topic
2http://wordnet.princeton.edu/
Detection task, a process that builds clusters of stories that discuss the same topic area or
event [NIS, ALJ00]. Comparably, FSD evaluates the corpus and finds the stories that
discuss a new event. FSD is a more specialized version of Topic Detection, because
in Topic Detection the system has to determine when a new topic is being discussed, and the
resulting stories will be the "first stories".
2.2.4 Hot Topics
Shewhart and Wasson [SW99] described a process that monitors newsfeeds for topics that re-
ceive unexpectedly high amounts of coverage (i.e., hot topics) on a given day. They performed
trend analysis in order to find hot topics, except that they used controlled vocabulary terms
rather than phrases extracted from text. The purpose of the study was to monitor newsfeeds
in order to identify when any topic from a predefined list of topics becomes a hot topic.
2.2.5 Temporal Text Mining
In 2005, Mei and Zhai [MZ05] discovered evolutionary theme patterns from text information
collected over time. Temporal Text Mining (TTM) has many applications in multiple do-
mains, such as summarizing events in news articles and revealing research trends in scientific
literature. The TTM task is to discover and summarize the evolutionary patterns of themes
in a text stream. They define this new text mining problem and present general probabilis-
tic methods for solving it through (1) discovering latent themes from text; (2)
constructing an evolution graph of themes; and (3) analyzing the life cycles of themes.
2.2.6 Journal Article Topic Detection based on Semantic Features
In 2009, Wang and others [WHGL09] described a document topic detection method based
on bibliographic structures (e.g., title, keywords and abstract) and semantic properties to
extract important words and cluster the scholarly literature. The approach can be used to
retrieve topics and display the search results organized by topic. Expert users can easily
acquire literature of interest and correctly find information from the topic-cluster display. In
order to exploit semantic features to detect topics, they constructed lexical chains from
a corpus [BE97, HH76]. They performed three main steps to build the system architecture.
First, the pre-processing model collects journal papers and processes their title, keyword and
abstract information to prepare for the lexical chain construction. Second, the document
representation model builds lexical chains; the last step is the semantic clustering
model. They propose a method to calculate the semantic similarity between semantic features.
After the semantic similarity calculation, the HCA (Hierarchical Clustering Algorithm) method
is used to cluster the documents. Within a cluster, topics are extracted from the documents
using the phrase frequency (PF) method. The key contribution
is the ability to extract topics by semantic features, taking into account the influence of
bibliographic structures, and to recommend clusters to the users.
2.2.7 COA: Finding Novel Patents through Text Analysis
In 2009, M. Al Hasan and others [HSGA09] built a patent ranking software, named COA
(Claim Originality Analysis), that rates a patent's value by measuring the recency
and the impact of the important phrases that appear in the "claims" section of the patent.
COA addresses the novelty and non-obviousness of a patent. It assesses the patent by eval-
uating the originality of its invention. It uses an information retrieval approach, where a
patent is considered valuable if the invention presented in the patent is novel and is
subsequently used or expanded by later patents. This knowledge is gleaned from the patent
text, specifically from the text composing the patent claims. From the "claims" section of a
patent, they first identify a set of phrases (single-word or multi-word) that retain the key ideas of
the patent. Then, for every phrase, they find the earliest patent that used that phrase. They
also track the usage of that phrase by later patents. Finally, they feed this information
into a ranking function to obtain a numeric value that denotes the value of that patent.
2.3 Differences from Landmark Paper Mining
We now show that existing related techniques, specifically first story detection, hot topic
mining, theme mining, journal article topic detection and COA, do not effectively handle
the landmark paper mining problem. Our approach is simpler and more direct. Our
requirements cannot be effectively reduced to first story detection, hot topic mining or
theme mining.
In first story detection (FSD) [De05], algorithms look for keywords in a news story and
compare the story with earlier stories. FSD is the process of finding all stories within a corpus
of text data that are the first stories describing a certain event [Con00]. The FSD process
runs sequentially, looking at a time-stamped stream of stories and making decisions based
on a comparison of key terms to previous stories. FSD is closely linked to the topic detection
task [NIS], a process that builds clusters of stories that discuss the same topic area or event.
Landmark paper mining differs significantly from FSD in the following ways. In FSD, a
new story is detected as being a first story if it exhibits a significant vocabulary shift from recent
stories. First, a vocabulary shift could occur even without the introduction of new key terms,
if the frequencies of existing key terms are significantly altered. Second, a document can be
flagged as a first story even when there is an earlier document with the same key terms
and frequencies. For example, even if there was an earthquake last year, the first story
describing a more recent earthquake will be detected as a first story.
In hot topic mining [SW99], a topic is known as hot when it receives an unusually high
amount of coverage in the news on a given day, because of the importance of the events
involving that topic. The authors used trend analysis to find hot topics, except that they
used controlled vocabulary terms rather than phrases extracted from text. Landmark
paper mining is a clearly different problem, as it seeks to mine interesting papers and
identify important and related keyphrases, instead of interesting topics.
The Temporal Text Mining (TTM) [MZ05] task is to discover, extract and summarize the
evolutionary patterns of themes in a text stream. In that paper, the authors identify when
a theme starts, reaches its peak, and then deteriorates, as well as which subsequent themes
it influences. A timeline-based theme structure is a very informative summary of an event,
which also facilitates navigation through themes.
Theme mining can be considered an approach to mine interesting papers that originate
themes. However, a new theme containing only existing keyphrases with altered frequencies does
not necessarily represent a new concept. In fact, this step (of determining themes)
is both unnecessary and insufficient for determining whether a paper originates a new concept.
In contrast, a new keyphrase almost certainly indicates the presence of a new concept.
In landmark paper mining, we therefore follow a simpler and more direct approach. We
identify papers that originate important keyphrases instead of themes (which can contain a
collection of keyphrases). Our approach is simpler because it avoids the notion of themes,
so there is no need to decide which collection of keyphrases forms a theme. By avoiding this
unnecessary step, our approach is more direct.
In [WHGL09], the authors' emphasis is on extracting important concepts from documents,
after which the documents are clustered by semantic similarity. The user's main goal is to find
only topics and documents by using the topic-cluster display. Landmark paper mining is
significantly different from this approach: we identify important keyphrases and their
related keyphrases, based on their proximity and importance, together with the corresponding
originating document of each keyphrase, i.e., the document where the keyphrase starts. Another
advantage of mining landmark papers is that the journal article topic detection work considers
only the bibliographic structure (e.g., title, keywords and abstract), while we consider the full
text of the research article as the input to our approach. Moreover, their clustering
approach can lead to much duplication of topics when there is a large number of clusters.
Finally, in COA [HSGA09], a patent is said to be novel if its ranking is
high. The ranking of the patent is based on the recency of its keyphrases. COA allows a
user to define a time-window; the keyphrases that first appeared in patents published
within the given time-window are considered. So, in COA, many of the keyphrases
used by the patent are recent within the given time-window, or some of the keyphrases
are used by the patent for the first time. However, the novelty of a patent does not depend entirely
on the recency of its keyphrases, since a keyphrase may already have been used by a previously
published patent class. Landmark paper mining differs significantly from this case. A document is
said to be a landmark if it introduces important keyphrases for the first time; in other words, we
identify the document that introduces important keyphrases for the first time. In addition, we
identify related keyphrases that are relevant to the user's given query, based on their
keyscore values.
Chapter 3
Mining Landmark Papers
Most of the research in the field of text mining, such as identifying hot top-
ics, first story detection (FSD), topic detection and tracking (TDT), and discovering
evolutionary theme patterns (i.e., temporal text mining), did not address the problem of
identifying important keyphrases from a text corpus of research documents, and further
did not identify the first document from which a keyphrase originates. In this chapter, we present
the MLP (Mining Landmark Papers) approach, to identify the important keyphrases and their
originating documents from text. The use of MLP at an initial stage by new researchers or
users will help them explore and choose their field of interest in a more structured and
effective manner, and help them understand how keyphrases emerge in
time-stamped documents.
3.1 Overview
Mining Landmark Papers (MLP) is concerned with discovering important keyphrases and the
corresponding first documents in text information collected over time. We consider the prob-
lem of analyzing the development of important keyphrases and identifying the first document where
a keyphrase was introduced in the collection. Text is quite noisy and there are
many documents, so we focus primarily on methods that help people grasp the most important
keyphrases and how they developed through the documents (after all, not many people have
time to try to understand everything).
In addition, people typically like to keep up to date with new and current keyphrases, or
to see how keyphrases developed over time. Thus our method focuses on the most
important or earliest documents where keyphrases first occur in the given time-stamp.
Most of the existing work related to these questions has focused on exploiting meta-
data like hyperlinks and citation information. Graph-based algorithms like HITS [Kle99],
PageRank [PBMW98], and their descendants [CC00] exploit information in the hyperlink
structure to find outstanding documents. These algorithms are based on citation analysis
from bibliometrics [MFH+03], which is used to detect related work and define impact [Gar03,
Gar]. In contrast to using citation data, we propose an approach that considers the whole text of
the document as input.
For the problem of discovering topics and trends in a collection of documents, there is an
abundance of work that has already been done. The topic detection and tracking (TDT)
evaluations [ACD+98, APL98] emphasized online new topic detection for news articles. In
short, other work has focused on burst detection, correlating real-world events, such as the
rise and fall of a topic's popularity, with single words from the documents [Kle02, SJ00].
Evolutionary theme patterns demonstrate the entire "life cycle" of a topic from a probabilistic
background [MZ05].
3.2 Problem Definition
Given a collection of time-stamped documents, we formulate and explore the following
questions:
1. What are the important/right keyphrases and how do these keyphrases develop over
time?
2. Which documents introduced these keyphrases, and which document is the originating
document of a keyphrase?
3. How do we identify the first document (d) in the collection having keyphrase (k), i.e., the
document in which k originates?
To answer all these questions for general document collections, we require that our algorithm
work entirely based on the text in the document collection. The following formal
problem statement captures the above requirements:

Definition 6 (Landmark Paper Mining) Given a collection of time-indexed documents,
C = {d1, d2, . . . , dT}, where di refers to a document with time-stamp i, and each document is a
sequence of words from a vocabulary set V = {w1, w2, . . . , w|V|}, the problem is to identify
the first document that introduces an important keyphrase for the first time; such documents are
known as landmark papers.

The term important in the above definition denotes keyphrases that are extracted
using standard techniques [WPF+99] based on, e.g., TF-IDF and the position of occurrence of the keyphrase
in the document.
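The problem statement can be sketched directly: scan the documents in time-stamp order and record, for each important keyphrase, the first document containing it. The phrase matching below is a simplified substring test, not our full extraction pipeline:

```python
def landmark_index(documents, important_keyphrases):
    """documents: list of (timestamp, doc_id, text) tuples;
    important_keyphrases: iterable of phrases. Returns a dict mapping
    each phrase to the doc_id of its originating (earliest) document."""
    origin = {}
    # Scan documents in time-stamp order so the first match per phrase
    # is the earliest one.
    for timestamp, doc_id, text in sorted(documents):
        for phrase in important_keyphrases:
            if phrase not in origin and phrase in text.lower():
                origin[phrase] = doc_id
    return origin
```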
To identify landmark papers from a text corpus, we first discuss the type of data we are
considering. We assume that the text corpus consists of documents where:
• the text of the document is accessible, and
• the documents are time-stamped.
Examples of such collections are emails, research articles, news articles, blogs, proceedings
of scientific conferences, scientific journals, etc. We use methods that leverage only the
text of the documents so that they can apply to many domains of data. Many domains have
no information about the structure besides what is expressed in the text of the documents
themselves. The advantage of using only the text is that unsupervised
methods based exclusively on text apply widely across many domains of documents. On the
other hand, link information between the documents also contains information that
could be useful; for example, citation data for research publications could be used in addition
to document text to find the originating documents and important keyphrases.
Our approach is simple, direct and differs significantly from other related approaches.
In addition to finding keyphrases from research articles, we also identify the originating
document of each keyphrase. In the next few sections, we explain the steps of our approach
in detail and present results that validate its effectiveness.
3.3 Methodology
The steps of MLP for identifying landmark papers from a text corpus are outlined in
Figure 3.1. We discuss them below.
Figure 3.1: Flow Diagram for Identifying Landmark Papers.
3.3.1 Extracting Keyphrases
To build a model and extract keyphrases, a pre-processing step chooses a set of candidate
phrases from the input documents and then calculates feature values for each candidate.
To choose candidate keyphrases, the input text is first cleaned: apostrophes and non-token
words are deleted, and punctuation marks, brackets and numbers are replaced by phrase
boundaries. Phrases are then identified, stemmed and case-folded. For each candidate
phrase, two features are calculated: TF-IDF, a measure of a phrase's frequency in a document
compared to its rarity in general use; and first occurrence, the distance into the document of
the phrase's first appearance. To perform these tasks we use KEA [WPF+99], an algorithm
for extracting keyphrases from a text corpus. A detailed description of KEA is given in
Section 2.1.1.
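To make the two features concrete, here is a minimal sketch of how they can be computed for one candidate phrase. This is not KEA itself; the function name and the simple token-list representation are our own illustrative choices.

```python
import math

def phrase_features(phrase, doc_tokens, num_docs, doc_freq):
    """Compute TF-IDF and first-occurrence features for a candidate phrase.

    phrase:     list of (stemmed, case-folded) tokens forming the phrase
    doc_tokens: list of tokens of the current document
    num_docs:   total number of documents in the corpus
    doc_freq:   number of corpus documents containing the phrase
    """
    n = len(phrase)
    # all token positions where the phrase occurs in this document
    positions = [i for i in range(len(doc_tokens) - n + 1)
                 if doc_tokens[i:i + n] == phrase]
    tf = len(positions) / max(1, len(doc_tokens))
    idf = math.log((1 + num_docs) / (1 + doc_freq))
    tfidf = tf * idf
    # first occurrence: relative distance into the document (0 = start)
    first_occ = positions[0] / len(doc_tokens) if positions else 1.0
    return tfidf, first_occ
```

For example, for the phrase ["data", "mining"] in a five-token document that begins with it, the first-occurrence feature is 0.0.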
3.3.2 Identifying Landmark Papers
After extracting keyphrases from the document collection, our next aim is to identify, for
each keyphrase, the chronologically first originating document in the corpus. To achieve this
we first identify the set of documents corresponding to each keyphrase. In our database,
we store the conference year, conference name, general title of the paper, paper title, and
first-author name. After identifying the set of documents corresponding to a keyphrase, we
sort them in increasing order of time stamp (e.g., conference year).
To prune unnecessary and meaningless keyphrases from our dataset, we set a threshold
to discard keyphrases that are infrequent across documents. Since our motive is to identify
important, meaningful keyphrases, we introduce a parameter called the minimum support
count (minsup, set to 3 in our experiments). We remove keyphrases that are not contained
in at least minsup documents: a keyphrase occurring in fewer than 3 documents is considered
unimportant. The minsup parameter ensures that the keyphrases considered are persistent
and thereby important. In the sorted list of documents, the first document for a keyphrase
is identified as a candidate landmark paper for that keyphrase.
These candidate landmark papers are further refined by ensuring that the significant
keyphrase does not appear in the references section of the corresponding paper. This
additional pruning step ensures that the keyphrases we identify are important and meaningful
for our corpus. The whole procedure is summarized in Algorithm 1.
With the rapidly increasing number of research articles published each year, one application
is to automatically identify a small set, for example 20 to 30 research publications, that
introduced the keyphrases (research fields) in the respective conference years. We present
such a method to identify the list of keyphrases and the first originating research publication
of each keyphrase among the documents published at conferences such as VLDB, SIGMOD
and PAKDD. People who read this limited subset of articles can hopefully get the gist
of the most important ideas and developments of the research community. Based on users'
Algorithm 1: Mining Landmark Papers
1: for each keyphrase k in the document collection: do
2: identify the set of documents in which k occurs
3: sort these documents in ascending order w.r.t. conference year
4: end for
5: for each keyphrase k: do
6: if number of documents containing k ≥ minsup then
7: keep k and its sorted list of documents
8: else
9: prune k
10: end if
11: end for
12: for each remaining keyphrase k: do
13: report the first document in its sorted list as a candidate landmark paper
14: end for
{Additional Pruning Step}
15: for each keyphrase k: do
16: if k occurs in the references section of its candidate landmark paper then
17: prune k
18: else
19: report k and the first document (landmark paper) from its sorted list
20: end if
21: end for
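Algorithm 1 can be sketched in Python as follows. This is a minimal illustrative sketch, not our production code; the document fields ('title', 'year', 'keyphrases', 'ref_keyphrases') are hypothetical names for the information described above.

```python
MINSUP = 3  # minimum support count used in our experiments

def mine_landmark_papers(docs):
    """docs: list of dicts with keys 'title', 'year', 'keyphrases'
    (keyphrases extracted from the body) and 'ref_keyphrases'
    (keyphrases occurring in the references section)."""
    # group documents by keyphrase
    by_phrase = {}
    for d in docs:
        for k in d["keyphrases"]:
            by_phrase.setdefault(k, []).append(d)
    landmarks = {}
    for k, ds in by_phrase.items():
        if len(ds) < MINSUP:              # prune infrequent keyphrases
            continue
        ds.sort(key=lambda d: d["year"])  # chronological order
        first = ds[0]                     # candidate landmark paper
        # refine: keyphrase must not appear in the candidate's references
        if k not in first["ref_keyphrases"]:
            landmarks[k] = first["title"]
    return landmarks
```

A keyphrase whose earliest paper cites the phrase in its references is dropped, since that paper cannot have introduced it.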
interests, they can read only those documents and get a good sense of later trends and
popular topics. As another example, when mining customer reviews, a few particularly
insightful reviews about the services a shopkeeper provides often stand out from the rest
and spark much discussion. By starting with the influential reviews, the shopkeeper can
save time by reading only the important comments first instead of skimming the whole
discussion.
3.4 Experimental Results and Evaluation
In this section, we present the performance model and experimental results of our approach.
3.4.1 Data Preparation
To experimentally evaluate our approach on a dataset of research papers, we prepared
a database of well-tagged “Data Mining and Databases” conferences from the DBLP website1.
DBLP (Digital Bibliography and Library Project) is a computer science bibliography website
hosted at the University of Trier in Germany. The website maintains information about
research papers in various fields of computer science and currently indexes over a million
of them. We conducted our experiments on a collection of research articles, in particular
the articles published in the proceedings of “Data Mining and Databases” conferences,
namely Knowledge Discovery and Data Mining (KDD), Very Large Data Bases (VLDB),
the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), the
International Conference on Management of Data (SIGMOD), the International Conference
on Data Engineering (ICDE) and the International Conference on Data Mining (ICDM),
between the years 2005 and 2009. The reason for choosing this dataset is twofold:
1. the research document collections fulfill the assumptions stated in the problem definition
(Section 3.2), and
2. we are familiar with the development of the scientific community, which allows us to
evaluate the performance of our results as informed insiders.
1http://www.informatik.uni-trier.de/~ley/db/conf/indexa.html
The research papers are classified by topic, conference, year of publication and author
name. We extracted the information from conferences such as SIGMOD, VLDB, KDD,
ICDE, PAKDD and ICDM and stored it in our database to perform the experiments. The
information we extracted includes the year of publication, author names, conference name,
paper title, general paper topic and the link from which the full-text PDF of each paper
can be downloaded. We converted the full-text PDF files into text files with the pdf2text
software on Linux. There are a total of 3781 documents, with approximately 100-150
documents per conference per year. 8 articles were excluded because they were not
recognizable by the pdf2text software. The basic statistics of the datasets are shown in
Table 3.1. We intentionally did not perform stemming or stop-word pruning in order to
test the robustness of our algorithm.
Table 3.1: Basic information of data sets.
Conference # of docs Time range
VLDB 735 2005-2009
SIGMOD 586 2005-2009
ICDM 672 2005-2009
KDD 601 2005-2009
PAKDD 551 2005-2009
ICDE 636 2005-2009
We wish to measure how well our algorithm performs on real data. All experiments were
conducted on a 1.60 GHz Intel PC with 8 GB main memory running Linux. All algorithms
are coded in Python.
The methodology described in Section 3.3 was applied to the research articles of the
VLDB, SIGMOD, KDD, ICDE, ICDM and PAKDD conferences, both all together and
each one separately. In the sections below, we show the generated list of keyphrases and
corresponding originating documents, first for each conference separately and then for all
the conferences together.
3.4.2 Identifying Keyphrases and Originating Documents from
Each Conference
In this section, we first discuss the set of experiments that analyze how our approach, MLP,
identifies the keyphrases and originating documents for each conference. Table 3.2 shows
a small set of the keyphrases and originating documents identified by our algorithm from
the VLDB conference. In Table 3.2, the first column gives the keyphrase, the second the
publication year, the third the paper title (i.e., the originating document) and the last the
first-author name. VLDB contributed 735 documents over the 5 years; out of these, our
approach identified 21 keyphrases whose originating documents are in the VLDB conference.
An originating document is the first document in our database in which the respective
keyphrase was introduced. For example, as shown in Table 3.2, the keyphrase “plan choices”
was introduced in the paper “Analyzing plan diagrams of database query optimizers” at
VLDB 2005. Our tool also shows (not in Table 3.2) that this keyphrase was again present
in 2007 and 2008, tracing the development of the keyphrase over the following years.
Similarly, the SIGMOD conference contributed 586 documents over the 5 years, from
which our algorithm identified 20 important keyphrases and their originating documents.
Table 3.3 shows a portion of the keyphrase list for SIGMOD.
Table 3.4 shows a small part of the results for the ICDE conference, where we identified
35 originating documents corresponding to keyphrases out of a total of 636 documents
published during the 5 years.
Tables 3.5, 3.6 and 3.7 show portions of the results generated by our algorithm from
the ICDM, KDD and PAKDD conferences. The numbers of extracted keyphrases and
originating documents from the 672 documents of ICDM, 601 documents of KDD and
551 documents of PAKDD are 23, 21 and 24, respectively.
Table 3.2: List of keyphrases and corresponding originating documents in VLDB conference.
Keyphrases Year Paper Title First Author Name
plan choices 2005 Analyzing plan diagrams of database Naveen Reddy
query optimizers
execution 2007 A genetic approach for random testing Hardik Bati
feedback of database systems
optimal plan 2005 Analyzing plan diagrams of database Naveen Reddy
query optimizers
fact table 2006 Cure for cubes: Cubing using a ROLAP Konstantinos
engine Morfonios
query execution 2005 Query execution assurance for outsourced Radu Sion
databases
HTML tables 2008 WebTables:exploring the power of tables Michael J. Cafarella
on the web
storage schemes 2005 ULoad: Choosing the right storage for Andrei Arion
your XML application
database schema 2006 Simple and realistic data generation Kenneth Houkjar
3.4.3 Number of Misclassifications
We want to show that our heuristic for identifying originating documents is effective. To
do this, we measure the following: for each (landmark paper L, keyphrase K) pair, we
apply our algorithm to a random sample of papers that does not contain the landmark
paper L. We then verify that the papers in the sample that contain keyphrase K are not
reported as landmark papers; if any of them is, it counts as an error. Table 3.8 shows the
number of errors for the keyphrases that have corresponding landmark papers in the
different conferences, i.e., the number of incorrectly classified keyphrases in each sample for
each conference.
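The verification loop above can be sketched as follows. Here `find_landmark` stands in for rerunning our mining heuristic on the sample; the function names and document representation are hypothetical.

```python
import random

def count_errors(pairs, corpus, find_landmark, sample_size=50, seed=0):
    """pairs: known (landmark_doc, keyphrase) pairs; corpus: all documents.
    For each pair, rerun the heuristic on a random sample that excludes
    the true landmark paper; any paper it then reports as the keyphrase's
    origin is a misclassification, since the real landmark is absent."""
    rng = random.Random(seed)
    errors = 0
    for landmark, phrase in pairs:
        pool = [d for d in corpus if d is not landmark]
        sample = rng.sample(pool, min(sample_size, len(pool)))
        if find_landmark(sample, phrase) is not None:
            errors += 1
    return errors
```

A returned count of zero for a conference means every sampled paper containing the keyphrase was correctly rejected as a landmark.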
Table 3.3: List of keyphrases and corresponding originating documents in SIGMOD confer-
ence.
Keyphrases Year Paper Title First Author Name
application 2005 IBM SOA on the edge Gennaro A. Cuom
servers
execution plans 2005 Proactive re-optimization with Rio Shivnath Babu
provenance 2006 Provenance management in curated Peter Buneman
information databases
enterprise 2005 Data and metadata management in service Vishal Sikka
application oriented architectures:some open challenges
application logic 2006 Automatic client-server partitioning of data- Nicholas Gerner
-driven web applications
data publishing 2005 Verifying completeness of relational query HweeHwa Pang
results in data publishing
query keywords 2007 BLINKS: ranked keyword searches on graphs Hao He
data 2005 Safe data sharing and data dissemination Luc Bouganim
dissemination on smart devices
For example, the first row of Table 3.8 shows that in sample 1, 19 out of 21 VLDB
keyphrases were correctly identified as not originating there and only 2 were misclassified;
17 out of 21 were classified correctly in sample 2; 21 out of 21 in samples 3 and 5; and
20 out of 21 in sample 4. The rows for the other conferences read analogously, each number
giving the count of incorrectly classified keyphrases out of the total identified by our
algorithm. Most entries in Table 3.8 are zero and none exceeds 4, which implies that our
heuristic for identifying originating documents performs well.
Table 3.4: List of keyphrases and corresponding originating documents in ICDE conference.
Keyphrases Year Paper Title First Author Name
XML streams 2006 Unifying the processing of XML streams Xin Zhou
and relational data streams
spatial objects 2005 Evaluation of spatio-temporal predicates Markus Schneider
on moving objects
query sequence 2007 Stream monitoring under the time Yasushi Sakurai
warping distance
stream data 2008 Approximate clustering on distributed Qi Zhang
data streams
search space 2006 Mining actionable patterns by role models Ke Wang
graph databases 2006 Searching substructures with super- Xifeng Yan
-imposed distance
XPath query 2005 Spatial range querying for gaussian-based Yoshiharu Ishikawa
imprecise query objects
web databases 2006 Answering imprecise queries over Ullas Nambiar
autonomous web databases
3.4.4 Identifying Keyphrases and Originating Documents from
Multiple Conferences Together
After evaluating the keyphrase and originating-document results for each conference
separately, we ran our algorithm on the whole dataset of 3781 research documents together;
261 keyphrases were identified as having originating documents in our corpus. Table 3.9
shows a small list of these keyphrases and their originating documents from the multiple
conferences together. We call these the global keyphrases, whereas the keyphrases identified
in each conference separately are the local keyphrases.
Table 3.5: List of keyphrases and corresponding originating documents in ICDM conference.
Keyphrases Year Paper Title First Author
gradient 2007 Training conditional random fields by periodic Han-Shen
descent step size adaption for large-scale text mining Huang
dynamic 2007 Efficient Algorithm for Mining significant Huahai He
programming substructures in graphs with quality guarantees
subspace 2006 A novel scalable algorithm for Jun Yan
learning supervised subspace learning
text data 2008 Text Cube: Computing IR measures for Cindy Xide
multidimensional text database analysis
latent dirichlet 2008 Collective Latent Dirichlet Allocation Zhiyong Shen
allocation
vector machine 2006 Cluster Based Core Vector Machine Asharaf S
relation 2005 Mining ontological knowledge from Xing Jiang
extraction domain-specific text documents
distance 2005 Alternate representation of distance matrices Keith Marsolo
matrix for characterization of protein structure
In Table 3.9, the originating document for, e.g., “clustering uncertain data” is “A framework
for clustering uncertain data streams”: this is the first document in our database in which
that keyphrase was introduced. As shown in Table 3.1, we took a total of 3781 documents
from the VLDB, SIGMOD, ICDM, ICDE, PAKDD and KDD conferences over 5 years, and
“A framework for clustering uncertain data streams” is the first document (landmark paper)
introducing the keyphrase, at ICDE 2008. The output of our algorithm is not limited to
this dataset and would become more accurate on larger datasets. Also, for the keyphrase
“quasi-identifier attributes”, we see that “Incognito: Efficient full-domain k-anonymity”
(SIGMOD 2005) is the originating document and that the keyphrase afterwards appears in
the SIGMOD and VLDB conferences during 2007 in our database. This implies that this
Table 3.6: List of keyphrases and corresponding originating documents in KDD conference.
Keyphrases Year Paper Title First Author Name
web pages 2008 Information extraction from wikipedia: Fei Wu
moving down the long tail
decision 2006 A general framework for accurate Wei Fan
tree and fast regression by data summari-
-zation in random decision trees
linear 2007 Scalable look-ahead linear David S.Vogel
Regression regression trees
latent diri- 2005 Modeling and predicting personal Xiaodan Song
-chlet allocation information behavior
density 2006 New EM derived from Kullback- Longin Jan
estimation Leibler divergence Latecki
tensor 2008 Heterogeneous data fusion for Jieping Ye
factorization alzheimer’s disease study
community 2006 Center-piece subgraphs: problem Hanghang Tong
detection definition and fast solutions
clustering 2005 A general model for clustering binary data Tao Li
model
keyphrase is not used after these conferences in our collection. It can therefore be helpful
for new researchers to pick this keyphrase as an important one and continue work in this
area. Also, some keyphrases occur only in recent years (e.g., 2008) and not before in our
database, giving the user or researcher the intuition that the keyphrase is important and
was introduced only very recently. Conversely, for the keyphrase “stream processing system”
we observe that it occurs very frequently throughout our whole database: it has roughly
equal influence over the entire time-stamp window and will likely recur in coming years, so
it is neither new nor especially informative for a new user. In addition to this, for the
Table 3.7: List of keyphrases and corresponding originating documents in PAKDD confer-
ence.
Keyphrases Year Paper Title First Author
information 2006 Generalized conditional entropy and a Dan A. Simovici
gain metric splitting criterion for decision trees
anomaly 2005 An anomaly detection method for Ryohei Fujimaki
detection spacecraft using relevance vector learning
regression 2006 improving on bagging with input smearing Eibe Frank
problems
incremental 2007 AttributeNets: An incremental learning Hu Wu
learning method for interpretable classification
rough set theory 2005 Using rough set in feature selection and Le Hoai Bac
reduction in face recognition problem
kernel function 2005 A Kernel function method in Clustering Ling Zhang
classification 2005 A privacy-preserving classification Weiping Ge
accuracy mining algorithm
Mixture model 2008 A mixture model for expert finding Jing Zhang
keyphrases that are new and important for a user, we also provide the corresponding first
document from which each originates. This helps a new researcher read the documents in
which the keyphrases were introduced and get an overview of them.
To evaluate the performance of our system, we took 8 random samples of 50 keyphrases
with their originating documents (as identified by our algorithm) and had domain experts
annotate whether each keyphrase really originated in the respective document. We drew a
contingency table for each set of keyphrases; the confusion matrix is a useful tool for
analyzing how well predicted results match actual results, and from it several evaluation
metrics such as precision, recall and accuracy can be derived, as discussed in Section 2.1.
The confusion matrices for the predicted class (i.e., landmark or not) versus the actual
class are shown in Figure 3.3, which gives the different prediction
Table 3.8: No of incorrectly classified keyphrases from each conference.
Conference
Name
# of
keyphrases
Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
VLDB 21 2 4 0 1 0
SIGMOD 20 2 3 0 0 0
PAKDD 24 2 0 1 2 3
ICDE 35 4 0 2 1 2
ICDM 23 0 0 2 2 1
KDD 21 0 0 2 1 3
results corresponding to the positive and negative instances. For each confusion matrix,
the true positive rate (TPR), false positive rate (FPR) and accuracy are calculated. To
plot an ROC space (see Section 2.1.3), only the true positive rate (TPR), or sensitivity,
and the false positive rate (FPR), or 1-specificity, are needed.
An ROC space is defined with FPR and TPR as the x and y axes respectively, depicting
the relative trade-off between true positives and false positives. Each prediction result,
i.e., each confusion matrix, is one point in the ROC space of Figure 3.2.
In Figure 3.2, 6 of the 8 sample-test points lie above the diagonal line, implying largely
correct classification, while the 2 points below the diagonal correspond to misclassification.
Accuracy is measured by the area under the ROC curve: an area of 1 represents a perfect
test, while an area of 0.5 represents a worthless test. The area under the ROC curve in
Figure 3.2 indicates good accuracy of the system.
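The quantities plotted in Figure 3.2 are derived from each confusion matrix as in the small helper below; the counts in the usage example are hypothetical, not the actual ones behind Figures 3.4-3.11.

```python
def roc_point(tp, fn, fp, tn):
    """Derive one ROC-space point and its accuracy from a confusion
    matrix of true/false positives and negatives."""
    tpr = tp / (tp + fn)                    # sensitivity
    fpr = fp / (fp + tn)                    # 1 - specificity
    acc = (tp + tn) / (tp + fn + fp + tn)
    return fpr, tpr, acc

# hypothetical counts for one annotated sample of 60 keyphrases
fpr, tpr, acc = roc_point(tp=25, fn=10, fp=9, tn=16)
better_than_random = tpr > fpr   # point lies above the ROC diagonal
```

A point with TPR greater than FPR sits above the diagonal of the ROC space, i.e., the prediction is better than random guessing.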
3.5 Summary
We built an approach that is simple, direct and easy to understand, and that helps
researchers identify a set of keyphrases and their corresponding originating documents from
an available text corpus. Using the notion of time-stamped documents, we identified lists
of keyphrases from multiple conferences along with the associated conference years. We
come up with
Table 3.9: List of global keyphrases and originating documents from multiple conferences.
Keyphrases Conference Paper Title First Author
Year
stream pro- ICDE Dynamic load management for distributed Yongluan
cessing system 2005 continuous query systems Zhou
communication SIGMOD Holistic aggregates in a networked world: Graham
network 2005 distributed tracking of approximate quantiles Cormode
query network ICDE High-Availability algorithms for Jeong-Hyon
2005 distributed stream processing Hwang
compression SIGMOD Integrating compression and execution in Daniel J.
technique 2006 column-oriented database systems Abadi
stream PAKDD An incremental data stream clustering Jing Gao
clustering 2005 algorithm based on dense units detection
web graph SIGMOD Graph summarization with bounded error Saket
2008 Navlakha
RNN query VLDB On computing top-t most influential spatial Tian Xia
2005 sites
kernel matrices KDD Learning the kernel matrix in discriminant Jieping Ye
2007 analysis quadratically constrained prog.
peculiar data ICDM The PDD framework for detecting Mahesh
2006 categories of peculiar data Shrestha
quasi-identifier SIGMOD Incognito: Efficient full-domain Kristen
attributes 2005 k-anonymity LeFevre
clustering ICDE A framework for clustering uncertain Charu C.
uncertain data 2008 data streams Aggarwal
a small list of keyphrases and their first documents out of the huge number of research
articles published annually. To evaluate the performance of our approach, we manually
checked our
Figure 3.2: ROC Space: Comparison of Different Predicted Results.
results and built the metric of incorrectly classified results with the help of colleagues
whom we consider experts in this field.
In this chapter, we identified keyphrases and their originating documents from an available
corpus, but we did not address related keyphrases. In the next chapter, we extend our
approach to test the importance of keyphrases and to identify the keyphrases, and
corresponding originating documents, that are relevant to a user query.
Figure 3.3: Different Prediction Results
Figure 3.4: TPR=0.71, FPR=0.37, ACC=0.68
Figure 3.5: TPR=0.79, FPR=0.69, ACC=0.64
Figure 3.6: TPR=0.68, FPR=0.48, ACC=0.60
Figure 3.7: TPR=0.46, FPR=0.67, ACC=0.43
Figure 3.8: TPR=0.52, FPR=0.65, ACC=0.44
Figure 3.9: TPR=0.63, FPR=0.40, ACC=0.62
Figure 3.10: TPR=0.70, FPR=0.35, ACC=0.68
Figure 3.11: TPR=0.33, FPR=0.77, ACC=0.26
Chapter 4
Mining Topic-based Landmark Papers
4.1 Overview
In the previous chapter (Chapter 3), we described the Mining Landmark Papers (MLP)
approach, which only generates a plain list of keyphrases and their originating documents.
However, this is insufficient for new researchers who are exploring a research area: it is
much more helpful to see the structure of the research area in terms of topics and their
relationships with each other. We investigate a system in which the user enters a set of
keyphrases and the system outputs all the topics related to those keyphrases, shows the
relationships of these topics to each other, and finally outputs the landmark papers for each
of those topics. This essentially provides all the material a researcher needs to thoroughly
understand the foundations of that research area. We formalize this problem and present
a method for solving it, involving the following steps:
1. Finding relevant keyphrases from the global list of keyphrases which are extracted from
the research document collections.
2. Constructing weighted graph of keyphrases.
3. Matching user given query with keyphrases.
4. Evaluating Keyrank of keyphrases.
5. Identifying k-nearest neighbors of each keyphrase.
6. Evaluating Keyscore of keyphrases.
7. Recommending the top-k keyphrases related to the user query.
8. Identifying originating document corresponding to each keyphrase.
In the next few sections, we explain the steps of our methodology in detail. Further, three
case studies in this context are shown in Section 4.3.
4.2 Steps of Methodology
4.2.1 Mining Related Keyphrases
Definition 7 (Related keyphrases) Given an area of research q (specified by keyphrases)
and a set of keyphrases {k1, k2, ..., kn}, our aim is to identify the top-k related keyphrases
that are relevant to the user's query. To find related keyphrases, our intuition is based on
two factors: (i) keyrank, and (ii) the distance1 between keyphrases (i.e., distance(k1, k2)).
A keyphrase k1 is said to be more related to the user's query q than another keyphrase k2 if
1. keyrank(k1) > keyrank(k2), and
2. distance(q, k1) < distance(q, k2).
Further, if keyrank(k1) = keyrank(k2) and distance(q, k1) < distance(q, k2), then k1 is the
more related keyphrase to the query point q; that is, among keyphrases with the same
keyrank, the one closer to the query point is considered more related. Likewise, if
distance(q, k1) = distance(q, k2) and keyrank(k1) > keyrank(k2), then the keyphrase with
the higher keyrank value, k1, is considered more related to the query point q.
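When the candidates need a total order rather than only the pairwise comparisons above, one consistent realization (our reading of Definition 7, with distance as the primary key and keyrank breaking ties) is a lexicographic sort:

```python
def rank_related(candidates, distance_to_query, keyrank):
    """Order keyphrases by (smaller distance to the query, higher keyrank).
    A keyphrase that dominates another on both factors, as in the
    definition, always sorts first."""
    return sorted(candidates,
                  key=lambda k: (distance_to_query[k], -keyrank[k]))

order = rank_related(
    ["a", "b", "c"],
    distance_to_query={"a": 0.2, "b": 0.1, "c": 0.1},
    keyrank={"a": 0.9, "b": 0.9, "c": 0.5},
)
```

Here "b" precedes "c" because they are equidistant from the query but "b" has the higher keyrank, and "a" comes last as the farthest candidate.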
In Section 3.3.2, we presented our method to identify the set of keyphrases in each document.
After extracting the keyphrases from the research documents, our aim is to mine the set
1Here, distance is not necessarily a metric and need not satisfy the metric-space conditions. It is called
a distance because it represents the proximity between two keyphrases in terms of their co-occurrence.
of frequent keyphrases. We use frequent itemset mining for this purpose. In this section,
we first recall some basic definitions and background on the frequent itemset mining
problem, and then represent the {document, keyphrase} mapping mathematically, in
analogy to the {tid, set of items} mapping.
Definition 8 (Itemset) An itemset is a set of items, where each item is an element drawn
from a finite set of possible items. An itemset containing k items is said to have length k
and is denoted a k-itemset.
Definition 9 (Frequent Itemset) A frequent itemset is a set of items which occur together
frequently in the dataset.
Mathematically, an itemset X is frequent if
support(X) ≥ minsup (Minimum Support)
where the support of an itemset X in a database D is the fraction of records of D that
contain X as a subset:
support(X) = count(X) / |D|
Here count(X) is the number of records of D that contain X as a subset, and the minimum
support (minsup) is a user-defined threshold indicating that itemsets whose support falls
below it are not interesting or relevant.
The input data for frequent itemset mining consists of records, each of which contains a set
of items: for example, customers buying sets of items from a supermarket, or users
submitting sets of words as queries to a search engine. The task is to find all itemsets that
occur more frequently than some user-specified threshold. Consider the example transactional
dataset given in Table 4.1.
For Table 4.1, the count of the itemset {tomato, potato} is 3, so its support is 3/5 = 0.60.
If the minimum support is 0.5, then the set of all frequent itemsets is {{tomato}, {potato},
{chilly}, {carrot}, {tomato, potato}, {tomato, chilly}, {potato, chilly}, {carrot, chilly}}.
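For a collection as small as Table 4.1, the supports can be verified by brute-force enumeration. The toy sketch below is for illustration only; FP-Growth, which we use on the real data, avoids scanning exponentially many candidates.

```python
from itertools import combinations

# Transactions of Table 4.1
transactions = [
    {"tomato", "potato", "onion"},
    {"tomato", "potato", "chilly"},
    {"tomato", "onion", "carrot", "chilly"},
    {"potato", "carrot", "chilly"},
    {"tomato", "potato", "carrot", "chilly"},
]

def frequent_itemsets(db, minsup=0.5):
    """Enumerate every candidate itemset and keep those whose support
    (fraction of transactions containing the itemset) reaches minsup."""
    items = sorted(set().union(*db))
    frequent = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            support = sum(set(cand) <= t for t in db) / len(db)
            if support >= minsup:
                frequent[cand] = support
    return frequent

fi = frequent_itemsets(transactions)
# support({tomato, potato}) = 3/5 = 0.60
```

The enumeration confirms, for instance, that {onion} is infrequent (support 2/5 < 0.5) and so is pruned.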
In Table 4.2, the documents Di correspond to the transaction ids and the sets of keyphrases
kj correspond to the lists of items of the market-basket dataset in Table 4.1. This forms a
transaction space in terms of a {document, keyphrase} space, as illustrated in
Table 4.1: Sample market-basket dataset.
Transaction ID List of Items
T1 tomato, potato, onion
T2 tomato, potato, chilly
T3 tomato, onion, carrot, chilly
T4 potato, carrot, chilly
T5 tomato, potato, carrot, chilly
Table 4.2. We apply the Frequent Pattern Growth (FP-Growth) algorithm [HPYM04] to
all the documents to mine the relevant (frequent) keyphrases. The pseudo-code for
constructing an FP-tree is shown in Algorithm 2 and for mining it with FP-Growth in
Algorithm 3.
Table 4.2: Documents and Set of Keyphrases
Document Keyphrases contained in the document
D1 k1, k2, k3, k4, . . .
D2 k2, k3, k4, k5, . . .
D3 k1, k3, k5, k6, . . .
D4 k4, k5, k6, k7, . . .
D5 k1, k2, k4, k8, . . .
In this {document, keyphrase} space, a frequent combination of keyphrases (i.e., a frequent
itemset) common to a group of documents implies that those documents are similar to each
other, and keyphrases that occur frequently together across transactions are relevant to one
another. An example set of relevant keyphrases with their respective support counts is
shown in Table 4.3.
Table 4.3: Frequent Keyphrases and Support Count
Frequent keyphrases support count
timeseries 37
queryprocessing 63
datastreams 23
streamprocessing 25
skylinequery 27
queryprocessing, skylinequery 4
datastreams, streamprocessing 8
queryprocessing, skylinequery, streamprocessing 3
4.2.2 Keyphrase Evolutionary Graph
A Keyphrase Evolutionary Graph is a weighted keyphrase graph G = (N, E) in which each
vertex v ∈ N is a keyphrase, and each edge e ∈ E is a transition between the keyphrase-
nodes. The weight on an edge indicates the similarity between the two keyphrase nodes. We
use support count as the weight as it is a measure of correlation between the nodes. Clearly,
each path in a keyphrase evolutionary graph represents a relationship between the nodes -
that these nodes occur together frequently in the database.
An example of a keyphrase evolutionary graph is shown in Figure 4.1, where each ver-
tex is a keyphrase identified from the documents. Each edge is a transition between
frequent keyphrases that are relevant to each other. The distance between two nodes indicates how
close the connected keyphrases are and how trustworthy the corresponding transition
is; a higher support count implies a shorter distance between the keyphrases and a more trustworthy
transition. For example, in Figure 4.1, with root node “datamining”2, the distance be-
tween (“datamining”, “decisiontrees”) is 0.1429, which is smaller than the distance between
“datamining” and any other node, implying that “decisiontrees” is the closest keyphrase.
2For implementation, we removed the spaces within each word-pair; for example, “datamining”
corresponds to “data mining”.
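The distance weighting described above can be sketched as follows. This is an illustrative sketch, not the thesis implementation: `build_keyphrase_graph` and the sample support counts are hypothetical names, and we assume (consistently with the 0.1429 value quoted for (“datamining”, “decisiontrees”)) that the distance on an edge is simply the reciprocal of its support count.

```python
from collections import defaultdict

def build_keyphrase_graph(frequent_pairs):
    """Build an undirected keyphrase evolutionary graph from frequent
    keyphrase pairs.  The edge weight is the support count; the derived
    distance is its inverse, so more frequent co-occurrence gives a
    shorter (more trustworthy) edge.  (Assumption: distance = 1/support.)"""
    graph = defaultdict(dict)
    for (k1, k2), support in frequent_pairs.items():
        distance = 1.0 / support
        graph[k1][k2] = distance
        graph[k2][k1] = distance
    return graph

# Hypothetical support counts for two edges of Figure 4.1:
pairs = {("datamining", "decisiontrees"): 7,
         ("datamining", "datastreams"): 4}
g = build_keyphrase_graph(pairs)
# "datamining" -> "decisiontrees" has distance 1/7 = 0.1429, as in Figure 4.1
```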
Algorithm 2: FP-Tree Algorithm for finding frequent itemsets
Input: DB, a transaction database;
min_sup, the minimum support count threshold.
Ensure: The complete set of frequent patterns.
{Pre-Processing}
1: Scan the transaction database DB once.
2: Collect FI, the set of single-element frequent items, and their support counts.
3: Discard all infrequent items.
{Create FP-tree}
4: Sort FI in decreasing order of support count as FList, the list of frequent items.
{InsertFP-Tree(FI, T)}
5: Create the root of an FP-tree, T, and label it as “null”.
6: for each transaction Trans in DB do
7: select the frequent items in Trans and sort them according to the order of FList;
8: starting from the root, follow a path p as long as the sequence of elements in p is a
prefix of the sorted items FI;
9: at the node n where this longest common prefix in T ends,
10: add a path p′ of nodes as descendants of n to hold the remaining elements of FI,
maintaining the linked lists from the header table;
11: the last node in path p+p′ represents the new transaction;
12: increment the counters of all nodes in p+p′.
13: end for
4.2.3 Matching Queries and Keyphrases
Retrieval of text - based information has traditionally been termed information retrieval (IR)
and has recently become a topic of great interest with the advent of text search engines on
the Internet. Text is considered to be composed of two fundamental units, namely document
and the term. A document can be a traditional document such as a book or journal paper,
but more generally is used as a name for any structured segment of text such as chapters,
sections, paragraphs, or even e-mail messages, Web pages, computer source code, and so
Algorithm 3: FP-Growth(T, α)
1: if T contains a single path p then
2: for each combination β of the nodes in p do
3: pattern = β ∪ α
4: support = min(support of the nodes in β)
5: end for
6: else
7: for each ai in the header of T do
8: pattern β = ai ∪ α with support = ai.support;
9: construct β’s conditional pattern base;
10: construct β’s conditional FP-tree T′;
11: if T′ ≠ φ then
12: FP-Growth(T′, β)
13: end if
14: end for
15: end if
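Algorithms 2 and 3 can be sketched together in Python. This is a simplified, hedged sketch rather than a faithful FP-tree implementation: it keeps the conditional pattern bases as plain lists of (prefix, count) pairs instead of a compressed prefix tree, but the recursion over conditional pattern bases mirrors Algorithm 3 and yields the same frequent itemsets. The function name `fpgrowth` is our own.

```python
from collections import defaultdict

def fpgrowth(transactions, min_sup):
    """Mine all itemsets with support >= min_sup.  Sketch of FP-Growth:
    recursion on conditional pattern bases, with lists standing in for
    the compressed FP-tree of Algorithm 2."""
    # Fix a global item order (decreasing support, like FList) so that
    # every itemset is generated exactly once.
    freq = defaultdict(int)
    for t in transactions:
        for item in set(t):
            freq[item] += 1
    db = [(tuple(sorted(set(t), key=lambda i: (-freq[i], i))), 1)
          for t in transactions]

    def mine(db, suffix, out):
        counts = defaultdict(int)
        for items, c in db:
            for item in items:
                counts[item] += c
        for item, sup in counts.items():
            if sup < min_sup:
                continue
            pattern = frozenset(suffix | {item})
            out[pattern] = sup
            # Conditional pattern base: for each transaction holding
            # `item`, keep the items preceding it in the fixed order.
            cond = [(items[:items.index(item)], c)
                    for items, c in db
                    if item in items and items.index(item) > 0]
            if cond:
                mine(cond, set(pattern), out)
        return out

    return mine(db, set(), {})

# The five transactions of Table 4.1:
db = [["tomato", "potato", "onion"],
      ["tomato", "potato", "chilly"],
      ["tomato", "onion", "carrot", "chilly"],
      ["potato", "carrot", "chilly"],
      ["tomato", "potato", "carrot", "chilly"]]
frequent = fpgrowth(db, min_sup=3)
# frequent[frozenset({"tomato", "potato"})] == 3
```

On the Table 4.1 data with min_sup = 3, this yields four frequent single items (tomato, potato, chilly, carrot) and four frequent pairs.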
forth. A term can be a word, word-pair, or phrase within a document, e.g., the word data
or word-pair data mining.
Traditionally in IR, text queries are specified as sets of terms. Although documents will
usually be much longer than queries, it is convenient to think of a single representation
language that we can use to represent both documents and queries. By representing both
in a unified manner, we can begin to think of directly computing distances between queries
and documents, thus providing a framework within which to directly implement simple text
retrieval algorithms.
In this section, we cover the different kinds of queries normally posed by the user to text
retrieval systems. The type of query the user might formulate is largely dependent on the
underlying information retrieval model. An important issue is that most query languages try
to use the content (i.e., the semantics) and the structure of the text (i.e., the text syntax)
to find relevant documents; when matching on syntax alone, the system may fail to find the relevant answers.
Figure 4.1: An Example of Keyphrase Evolutionary Graph
For this reason, a number of techniques meant to enhance the usefulness of queries exist.
Examples include the expansion of a word to the set of its synonyms, the use of a
thesaurus, and stemming to put together all the derivatives of the same word. Moreover, some
words which are very frequent and do not carry meaning (such as ‘the’), called stopwords,
are removed. We first show the queries that can be formulated with keyword-based query
languages. They are aimed at information retrieval, including simple words and phrases as
well as boolean operators which manipulate sets of documents. In the second part, we cover
pattern matching, which includes more complex queries.
Keyword-based Querying
A query is the formulation of a user information need. In its simplest form, a query is
composed of keywords and documents containing such keywords are searched for. Keyword-
based queries are popular because they are intuitive and easy to express. Thus, a query can
be simply a word, although it can in general be a more complex combination of operations
involving several words.
In the rest of this section we will refer to single-word and multiple-word queries as basic
queries. Patterns, which are covered in Section 4.2.4, are also considered basic queries.
Single-Word Queries
The most elementary query that can be formulated by a user is a word. Text documents are
assumed to be essentially long sequences of words. A more general view allows us to see the
text in this perspective and also to see the internal division of words into letters. In the
latter case, we can search for other types of patterns. The set of words retrieved by these
extended queries is considered as the keywords matching the query.
A word is normally defined in a rather simple way: the alphabet is split into ‘letters’ and
‘separators’, and a word is a sequence of letters surrounded by separators.
Phrase Queries
A phrase is a sequence of single-word queries, and an occurrence of the phrase is the
corresponding sequence of words. For instance, a user may search for the word ‘knowledge’
followed by the word ‘discovery’. In phrase queries it is normally understood that the separators
in the text need not be the same as those in the query (e.g., two spaces versus one space).
Different separators such as ‘,’, ‘and’, ‘&’, ‘or’ and ‘|’ can be specified by the user in a
query. In our system, we have implemented several possible combinations, for example:
1. if a user gives a query like (a, b), then we consider a and b as two separate
keyphrases.
2. if a query is given like (a and b), then we search for both a and b separately; ‘and’ is
also treated as a boolean operator, and all answers satisfying both a and b are selected.
3. if a query is like (a or b), then all keyphrases satisfying a or b are selected, with
duplicates eliminated.
4.2.4 Pattern Matching
In this section, we first discuss more specific query formulations, based on the concept of a
pattern, which allow the retrieval of pieces of text that have some property.
A pattern is a set of syntactic features that must occur in a text segment. Those segments
satisfying the pattern specifications are said to ‘match’ the pattern. We are interested in the set
of keyphrases which match a given search pattern. Patterns can range from very simple
(for example, words) to rather complex (such as regular expressions). The most commonly
used types of patterns are:
• Words A string (sequence of characters) which must be a word in the text (see Sec-
tion 4.2.3). This is the most basic pattern.
• Ranges A pair of strings which matches any word lying between them in lexicograph-
ical order. The alphabet is normally sorted, and this induces an order on the strings
called lexicographical order (this is indeed the order in which words in a dictio-
nary are listed). For instance, the range between the words ‘data’ and ‘drug’ will retrieve
strings such as ‘document’ and ‘dictionary’.
• Prefixes A string which must form the beginning of a text word. For example, given
the prefix ‘comput’ all the words such as ‘computer’, ‘computation’, ‘computing’, etc.
are retrieved.
• Substrings A string which can appear within a text word. For instance, given the sub-
string ‘tal’ all the documents containing words such as ‘talk’ and ‘metallic’ are
retrieved. This query can be restricted to find the substring inside words, or it can
go further and search for the substring anywhere in the text (in this case the query is not
restricted to be a sequence of letters but can contain word separators). For instance,
a search for ‘any flow’ will match in the phrase ‘...many flowers...’
• Allowing errors A word together with an error threshold. This search pattern re-
trieves all text words which are ‘similar’ to the given word. The concept of similarity
can be defined in many ways. The general idea is that the pattern or the text may
contain errors (coming from typing or spelling), and the query should try to retrieve the
given word as well as its likely erroneous variants. Although there are many
measures of similarity between words, the most generally accepted is the Levenshtein
distance, or simply edit distance. The query therefore specifies the maximum number
of errors allowed for a word to match the pattern (i.e., the maximum allowed edit
distance). For example, if a typing error splits ‘flower’ into ‘flo wer’, it could still be
found with one error, while in the restricted case of words it could not (since neither
‘flo’ nor ‘wer’ is at edit distance 1 from ‘flower’).
• Regular Expressions A regular expression is a rather general pattern built up from
simple strings (which are meant to be matched as substrings) and the following oper-
ators:
- union: if e1 and e2 are regular expressions, then (e1 | e2) matches what either e1 or e2
matches.
- concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are
formed by the occurrences of e1 immediately followed by those of e2 (therefore simple
strings can be thought of as concatenations of their individual letters).
- repetition: if e is a regular expression, then (e∗) matches sequences of zero or more
contiguous occurrences of e.
For example, consider a query like ‘pro(blem | tein)(s | Φ)(0 | 1 | 2)∗’ (where Φ denotes
the empty string). It will match words such as ‘problem02’ and ‘proteins’.
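The example query above can be checked directly with a regular expression engine. A minimal sketch using Python’s `re` module; the only translation needed is that the empty alternative Φ becomes an optional ‘s’:

```python
import re

# The query 'pro(blem | tein)(s | Φ)(0 | 1 | 2)*' in Python regex syntax;
# (s | Φ) is written as the optional group 's?'.
query = re.compile(r"pro(?:blem|tein)s?(?:0|1|2)*")

for word in ["problem02", "proteins", "protein", "problems1"]:
    assert query.fullmatch(word) is not None   # all four match
assert query.fullmatch("prot") is None         # incomplete stem: no match
```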
4.2.5 Evaluating Keyphrase Ranking (i.e. Keyrank)
In the previous steps, after finding all keyphrases matched with user’s query and corre-
sponding keyphrase evolutionary graph, our task is to identify the top-k keyphrases which
are relevant to the user’s query. As discussed in Section 4.2.1, for recommending the top k
related keyphrases, our approach is based on keyphrase ranking and nearest-neighbor
keyphrases.
In this section, we first discuss the keyphrase ranking scheme with the help of an example.
This way, the user can choose to see only the category of results of his interest. The core
idea is that if a keyphrase is relevant to many other keyphrases, then it is likely that
the keyphrase has ‘high impact’ with respect to the other keyphrases. Further, a keyphrase
may be considered even more influential if it is linked from a large number of ‘high impact’
keyphrases. The results of a search query can then be ranked based on their ‘impact’.
The above idea leads us to rank the keyphrases using the traditional PageRank algorithm [BP98].
PageRank is an iterative link-analysis algorithm that has been proved to converge [Hav99, ANTT02].
Developed by Larry Page and Sergey Brin and used by the Google Internet search engine,
it initially assigns an equal importance (or Keyrank) to all keyphrases in the set, with the
purpose of “measuring” their relative importance within the set. Then, in each iteration, it
refines this value using the following formula:
Keyrank(x) = ∑_y Keyrank(y) / |links(y)|    (4.1)
where y ranges over the set of keyphrases that link to x, and |links(y)| is the number of links
of keyphrase y. The pseudo-code for Keyrank is shown in Algorithm 4.
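Equation 4.1 can be sketched as a plain power iteration. This is an illustrative sketch (Algorithm 4 itself is not reproduced here); the function name `keyrank` and the toy adjacency lists are our own, and no damping factor is used, exactly as in Equation 4.1:

```python
def keyrank(graph, iterations=50):
    """Iterate Equation 4.1: in each round, every keyphrase y distributes
    its current Keyrank evenly over the keyphrases it links to.
    graph maps each keyphrase to the list of keyphrases it links to."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}   # equal initial importance
    for _ in range(iterations):
        new = {n: 0.0 for n in nodes}
        for y, links in graph.items():
            for x in links:
                new[x] += rank[y] / len(links)
        rank = new
    return rank

# Hypothetical undirected keyphrase graph as adjacency lists:
g = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
ranks = keyrank(g)
```

On an undirected graph this iteration converges to ranks proportional to node degree, so the best-connected keyphrase ends up with the highest Keyrank.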
Figure 4.1 shows a keyphrase connectivity graph. Using this graph, we have calculated the
keyphrase ranks. Table 4.4 shows the list of keyphrases and their respective keyranks.
4.2.6 Identifying k - Nearest Neighbors of each Keyphrase
After we have identified the keyranks of the keyphrases, another important step is to find the
nearest-neighbor keyphrases, i.e., the set of keyphrases that are closest to a query point. Let
us now make the notion of nearest-neighbor keyphrases more precise.
Definition 10 (k - Nearest Neighbor) Suppose we have a collection K of keyphrases.
For a nearest-neighbor query, we are given a query point q, and the goal is to determine the
nearest-neighbor set NN(q), defined as
NN(q) = {r ∈ K | ∀ p ∈ K : d(q, r) ≤ d(q, p)}
To find the k nearest neighbors of a given query point q, we use Dijkstra’s shortest-path
algorithm. Below, we first define the shortest path mathematically and then explain
Dijkstra’s algorithm in detail.
Definition 11 (Shortest Path) Given a graph G = (V, E, w), where V is a set of vertices,
E is a set of edges and w is a weight function that maps edges to real-valued weights, a path p
from a vertex u to a vertex v in this graph is a sequence of vertices (v0, v1, ..., vk) such that
u = v0, v = vk and (vi−1, vi) ∈ E. The weight w(p) of this path is the sum of the weights of
its edges: w(p) = w(v0, v1) + w(v1, v2) + ... + w(vk−1, vk). A shortest path from u
to v is a path with minimum weight among all paths from u to v.
Definition 12 (Dijkstra’s Algorithm) For a given source vertex (node) in the graph, the
algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and
every other vertex. For example, if the vertices of the graph represent keyphrases and edge
weight represent distances between pairs of keyphrases connected by a direct link, Dijkstra’s
algorithm can be used to find the shortest route between one keyphrase (i.e. query point or
source vertex) and all other keyphrases. The pseudo-code for Dijkstra’s algorithm has been
shown in Algorithm 5.
The keyphrase evolutionary graph of Figure 4.1 is one where each node is a keyphrase, an edge
between two nodes represents connectivity between them, and the weight associated with an edge
shows how frequently those keyphrases occur together in the database. For finding the
shortest distance between the root node and all other nodes, instead of using that weight
(i.e., the support count) directly, we reframe the weight as the inverse of count(K), where count(K) is
the number of times the keyphrase occurs in the transactions. Using these weights, we
have calculated the shortest distances between the nodes by using Algorithm 5.
Example: Considering the source vertex “datamining”, we have calculated the distance
between the source vertex and all other nodes. The distance of the source node from itself is 0.
Table 4.5 presents the keyphrases and their corresponding distances from the source vertex
(i.e., the query point).
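The shortest-distance computation above can be sketched with a standard heap-based Dijkstra. This is a minimal sketch (Algorithm 5 itself is not reproduced here); the toy graph fragment and its inverse-support distances are hypothetical values for illustration:

```python
import heapq

def dijkstra(graph, source):
    """Shortest distance from source to every reachable node.
    graph maps each node to {neighbour: edge_distance}; the distances
    (inverse support counts) are non-negative, as Dijkstra requires."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, skip
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd              # found a shorter path to v
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical fragment of a keyphrase graph with inverse-support distances:
g = {"datamining": {"decisiontrees": 0.1429, "datastreams": 0.25},
     "decisiontrees": {"datamining": 0.1429},
     "datastreams": {"datamining": 0.25, "streamprocessing": 0.125},
     "streamprocessing": {"datastreams": 0.125}}
dist = dijkstra(g, "datamining")
# dist["streamprocessing"] == 0.25 + 0.125 == 0.375
```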
4.2.7 Evaluating Keyscore of Keyphrases
As discussed in Section 4.2.1, for identifying the keyphrases which are related to the
user’s query, we have introduced the notion of “degree of importance” of a keyphrase. For
a given user’s query, there is a possibility that many keyphrases are similar to the query,
and the user is overwhelmed by the results. To avoid this, we assign a unique
“Keyscore” value to each keyphrase, defined as:

Keyscore = Keyrank / distance    (4.2)
The intuition is that:
• the Keyrank of the keyphrase should be high, and
• the distance (i.e., the shortest distance) between the keyphrase and the query point should be low.
Table 4.6 shows the keyphrases and their respective keyscores.
4.2.8 Recommending Keyphrases
Our main intention is to recommend k keyphrases to the user, where k is a constant chosen
by the user. The goal is to recommend the keyphrases having high keyscore values,
thereby ensuring that the recommended keyphrases have high keyphrase rank and minimum
shortest distance. We transform the current problem into the knapsack problem in the
following manner.
Definition 13 (Knapsack Problem) Given a set of n items, we are to select some number
of items to be carried in a knapsack. Each item has both a weight and a profit.
The objective is to choose the set of items that fits in the knapsack and maximizes the profit.
We reduce our problem to a knapsack problem using the following equivalences.
Given a set of keyphrases {k1, k2, ..., kn} and keyphrase scores {ks1, ks2, ..., ksn},
let wi be the weight of the ith item, ksi be the profit accrued when the ith item is carried
in the knapsack, and C be the capacity of the knapsack. Let xi be a variable whose value is
either zero or one; xi has the value one when the ith item is carried in the knapsack. Our
objective is to maximize

∑_{i=1}^{n} ksi xi    (4.3)

where ksi = kri / ksti, in which kri denotes the keyrank and ksti the shortest distance of the
ith keyphrase, subject to the constraint

∑_{i=1}^{n} wi xi ≤ C    (4.4)
• The items to carry in the knapsack correspond to the keyphrases matching the user’s
query.
• The item weight (in the knapsack problem) is set to 1 for all keyphrases in our problem.
• The weight limit (in the knapsack problem) is set to k, the number of keyphrases to
recommend to the user based on the query.
After identifying the keyscore of each keyphrase, we sort the keyphrases in descending
order of their scores and output the top k keyphrases related to the user’s
query.
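Because every item weight is 1 and the capacity is k, this 0/1 knapsack reduces exactly to picking the k keyphrases with the highest keyscore, so a sort suffices. A minimal sketch; the function name `recommend` is our own, and the sample keyrank and distance values are those quoted later for Case Study 1:

```python
def recommend(keyranks, distances, k):
    """Keyscore = Keyrank / shortest distance (Equation 4.2).  With unit
    item weights and knapsack capacity k, maximizing total profit means
    simply taking the k keyphrases of highest keyscore."""
    scores = {kp: keyranks[kp] / distances[kp]
              for kp in keyranks if distances[kp] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Values quoted in Case Study 1 (Tables 4.7 and 4.8):
keyranks = {"informationgain": 0.078212, "datamining": 0.076816,
            "classlabel": 0.071695}
distances = {"informationgain": 0.1667, "datamining": 0.2,
             "classlabel": 0.25}
top = recommend(keyranks, distances, 2)
# top == ["informationgain", "datamining"]
```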
4.2.9 Identify Originating Document
Finally, we identify the originating documents of the obtained related keyphrases. For finding
these originating documents, the methodology described in Section 3.3 is used.
4.3 Experiments
In this section, we measure how well our proposed method for finding related keyphrases
performs on real data. As it is hard to objectively define the notion of “relevance” of
keyphrases, and due to the absence of competing approaches, we demonstrate the effective-
ness of our system through three case studies. We also compare our results with Google
Scholar by using the Google similarity distance measure and show that they are “statistically
significant” by using Spearman’s rank correlation and the t-test at the 5% and 1% significance
levels.
4.3.1 Case Study 1
For a given user’s query specified by keyphrases, our aim is to first identify those keyphrases
which match the query, and then identify the top k most related keyphrases based on
their importance (i.e., keyscore), together with the corresponding originating documents
(i.e., landmark papers) of those keyphrases. To obtain this, we perform the sequence of steps
discussed in Section 4.2.
Keyphrase Graph
Consider that the user has given the query term as “decision trees”. First we identify the
set of keyterms which occur frequently together with the query term from the exhaustive
list of keyphrases. Using these, we construct the keyphrase evolutionary graph shown in
Figure 4.2. In Figure 4.2, only the relevant portion of the keyphrase graph has been shown
with support counts as weights of edges of the graph.
Keyphrase Ranking
Once we obtain a set of relevant keyphrases, we show the impact of each keyphrase in the
keyphrase graph of Figure 4.2. The impact of a keyphrase is calculated by using the KeyRank
formula described in Section 4.2.5. After finding the keyphrase ranks, the values are
normalized using the z-score defined as:

z-score = (x − min + δ) / (max − min)    (4.5)
where x is the calculated keyrank value of a keyphrase, and max and min are the maximum and
Figure 4.2: Keyphrase Evolutionary Graph relevant to the query term “decision trees”.
minimum values over the whole range of keyrank values. The term δ = min/10 is introduced
to decrease the influence of the max and min values. With the z-score formula, keyrank
values lie approximately in the range [0, 1]. Table 4.7 shows the list of keyphrases and their
corresponding keyranks in increasing order.
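The normalization of Equation 4.5 can be sketched as follows; the function name `normalize_keyranks` and the raw values are hypothetical, and we assume at least two distinct keyrank values so that max > min:

```python
def normalize_keyranks(keyranks):
    """Equation 4.5: z = (x - min + delta) / (max - min), delta = min/10.
    Maps raw keyrank values approximately into [0, 1]; assumes at least
    two distinct values so that max > min."""
    lo, hi = min(keyranks.values()), max(keyranks.values())
    delta = lo / 10.0
    return {k: (x - lo + delta) / (hi - lo) for k, x in keyranks.items()}

# Hypothetical raw keyrank values:
raw = {"a": 0.1, "b": 0.2, "c": 0.3}
z = normalize_keyranks(raw)
# z["a"] == (0.1 - 0.1 + 0.01) / 0.2 == 0.05
```

Note that because of the δ term, the largest value maps slightly above 1 (here 1.05), which is why the range is approximate.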
Nearest Neighbor Keyphrases
In order to identify the set of keyphrases closest to the query point, we have calcu-
lated the distance (i.e., the shortest distance) between the query point and all other keyphrases. For
calculating the distances between keyphrases, the keyphrase graph shown in Figure 4.2 and the
shortest-path algorithm described in Section 4.2.6 are used. Table 4.8 shows the keyphrases
and their distances from the source vertex in increasing order.
Keyscore and Recommending k - keyphrases
Finally, for recommending k keyphrases related to the query term, we assign a unique
value to each keyphrase to decide which keyphrase is more relevant in terms of its keyscore
value. The keyscore of each keyphrase is calculated using the formula described in
Section 4.2.7. The list of keyphrases and their respective keyscore values in increasing order
is shown in Table 4.9.
Analysis of Relevance Results
Table 4.9 shows the set of top k recommended keyphrases for the query “decision trees”, based
on their keyscore values in increasing order. In Section 4.2.7, we mentioned that the keyscore value
of a keyphrase is calculated using its Keyrank and its distance (i.e., shortest distance) from
the query point. Table 4.7 shows the list of keyphrases sorted by keyrank. Table 4.8 shows
the keyphrases and their distances from the query point, i.e., the list of keyphrases that are
close to the query point.
In Table 4.9, we notice that the first recommended relevant keyphrase is “informationgain”
with keyscore value 0.469178; it has keyrank value 0.078212 in Table 4.7, and its distance
from the query point is 0.1667 in Table 4.8. This implies that the keyphrase with a high
keyrank value and a low distance among all keyphrases is the first recommended related keyphrase
for the query point.
Further in Table 4.9, the keyphrases “datamining” and “classlabel”, with keyscore values
0.384080 and 0.286780, are recommended as related keyterms for the query. From Table 4.7,
we can notice that the keyphrase “datamining” has a higher keyrank value (0.076816)
than the keyphrase “classlabel” (0.071695).
From Table 4.8, we can see that the keyphrase “datamining” has a lower distance
(0.2) than the keyphrase “classlabel” (0.25). This implies that the keyrank of a
keyphrase dominates over the distance value here, and whichever has the higher keyrank
value is recommended first; so the keyphrase “datamining” is recommended before the
keyphrase “classlabel”.
Next, in Table 4.7, the keyphrases “streamingdata” and “streamprocessing” both have the
same keyrank value, 0.054004, but the keyphrase “streamingdata” is recommended after
the keyphrase “streamprocessing”, because it is farther from the query point than
“streamprocessing”, which is at distance 0.2. This implies that when there is a tie in
keyrank, the distance measure efficiently breaks the tie, and whichever keyphrase is nearest to the
query point is recommended first. Also, consider the keyphrases “regressionproblems” and
“neuralnetwork”: the keyphrase “regressionproblems” is recommended before the keyphrase
“neuralnetwork” because it has a higher keyscore value. Since the keyscore value is affected by the
keyranks and the distances, notice from Table 4.8 that both of these keyphrases are at the same
distance (0.3333) from the query point, whereas the keyrank value of “regressionproblems” is
higher than the keyrank value of “neuralnetwork” (0.052142). This shows that when there
is a tie in distance, the keyrank measure efficiently breaks it, and the keyphrase with the higher
keyrank value is recommended first.
Further, in Figure 4.2, we notice that keyphrases such as “associationrules”, “frequentitem-
set” and “clusteringalgorithms” are connected to the query point “decisiontrees” through the
node “datamining”. The keyphrases “streamingdata” and “streamprocessing” are connected
to “decisiontrees” through “datamining” and “datastreams”. In Table 4.9, these keyphrases
are recommended later because they are connected only indirectly to the query point. We can no-
tice in Table 4.7 that the keyphrases “datastreams”, “associationrules”, “frequentitemset” and
“clusteringalgorithms” all have the same keyrank value, 0.049348, but the keyphrase “datas-
treams” is at a lower distance (0.3929) from the query point than the remaining keyphrases.
As discussed above, when there is a tie in keyrank, the distance measure
efficiently breaks the tie and dominates the keyscore value, and whichever keyphrase has the lowest
distance is closest to the query point and is recommended before the others. So the keyphrases
“datastreams”, “associationrules”, “frequentitemset” and “clusteringalgorithms” have the
same keyrank value (0.049348), but, owing to their distances from the query point, they are
not close to it and are therefore recommended later in Table 4.9.
In the next few sections, we consider different case studies for a few other types of
queries that a user may pose, and discuss separately how our approach handles each
situation. The methodology followed in these case studies is similar to the previous case study
described in Section 4.2.
4.3.2 Case Study 2
Let us suppose the user entered a query combining two keyphrases (“clustering uncertain
data”, “data streams”). Our aim is to first identify those keyphrases which match
the user’s query, then identify the top k most related keyphrases based on their impor-
tance and proximity, and finally identify the corresponding originating documents (i.e., landmark
papers) of those keyphrases. To obtain this, the same steps discussed in Section 4.3.1 are
performed on the dataset. In this case, the user has given a query specified by two keyterms; to find
relevant keyphrases, we first match the similar terms related to this query.
In our keyphrase database, we found that these two terms occur frequently together
in the documents, so a single keyphrase evolutionary graph is obtained for the query.
Considering those frequent itemsets, and also the other related keyphrases which occur fre-
quently with either keyphrase, we obtain the keyphrase graph for the query terms. As
mentioned earlier, each keyphrase can connect with other keyphrases either directly or indi-
rectly, but the keyphrases which belong to the nearest-neighbor area and have high impact
are the most influential for the user’s query. Figure 4.3 shows the keyphrase graph of the keyphrases
which occur together in the documents for the query (“clustering uncertain data”, “data
streams”).
Figure 4.3: Keyphrase Evolutionary Graph for query terms (“clustering uncertain data”,
“data streams”).
After identifying the keyphrase graph, we have calculated the keyranks and the nearest-neigh-
bor keyphrases using the shortest-distance algorithm. Figure 4.3 shows the keyphrase graph
used for keyrank identification. Table 4.10 shows the keyphrases and their corresponding keyranks in
increasing order.
Table 4.11 shows the keyphrases and their distances from the query point.
Finally, Table 4.12 shows the final set of related keyphrases and their respective keyscores in
increasing order, together with their ranks.
Analysis of Relevance Results
For the user’s query (“clustering uncertain data”, “data streams”), Table 4.12 shows that
the most relevant keyphrase, considering both impact and proximity, is
“uncertaindatastreams” with keyscore value 0.297765; it also has the highest
keyrank value, 0.99245 [refer Table 4.10], and the lowest distance, 0.3333 [refer
Table 4.11], among all keyphrases. Next, the keyphrase “streamclustering” has the 2nd
highest keyscore value, 0.223361, and is recommended at the 2nd position. From Table 4.10, we
can notice that “streamclustering” has the 2nd highest keyrank value among the remaining
keyphrases, but from Table 4.11 we can see that this keyphrase does not have the lowest
distance among them. Despite this, it is recommended before the keyphrases
“uncertaindata” and “clusteringalgorithms”, which signifies that the keyrank measure
dominates the keyscore value and a high-impact keyphrase is recommended before a nearer
neighbor.
Furthermore, when two keyphrases have similar keyrank values for the query, the
distance measure efficiently resolves the situation with a dominating effect, and the
keyphrase with the smaller distance value is considered first. For
example, consider the keyphrases “streamingdata” and “streamprocessing” from Table 4.10: both
have the same keyrank value, 0.055589, but owing to its smaller distance value, 0.8012,
“streamingdata” is recommended before the keyphrase “streamprocessing”. On the other
hand, keyphrases which are equidistant from the query point are ordered by keyrank
value. For example, in Table 4.11, the keyphrases “uncertain-
data” and “clusteringalgorithms” are both at distance 0.3333 from the query point, but owing
to the higher keyrank value of “uncertaindata” in comparison to “clusteringalgorithms”, it is
recommended before the other keyphrase.
4.3.3 Case Study 3
Lastly, we consider the case where the user has given the query (“skyline query”, “target schema”). Here
the query is specified by two keyterms; to find relevant keyphrases based on the
keyphrases specified by the user, we first identify frequent keyterms which occur together and
match the similar terms related to this query. In our database, we did not find any direct
relationship between these two keyphrases, so these terms are treated as separate keyphrases,
and for each term a list of related keyphrases is identified based on impact and nearest
distance. For each sub-query term a separate keyphrase graph
has been drawn. Figure 4.4 shows “skyline query” and its graph of relevant keyphrases,
and Figure 4.5 shows the graph of all keyphrases which were found frequent in the
database with the sub-query term “target schema”.
Figure 4.4: Keyphrase Evolutionary Graph relevant to the query term “skyline query”.
For recommending the top k keyphrases related to the user’s query, the keyscore of each keyphrase
has been obtained using the keyranks and the nearest-neighbor distances as discussed in the
previous section. Tables 4.13 and 4.14 show the sets of keyphrases and their corresponding
keyranks relevant to the respective sub-query terms “skyline query” and “target
schema”.
Tables 4.15 and 4.16 show the keyphrases and their corresponding distances from the
keyphrases relevant to the sub-query terms “skyline query” and “target schema”, respectively.

Figure 4.5: Keyphrase Evolutionary Graph relevant to the query term “target schema”.
Finally, Tables 4.17 and 4.18 show the keyphrases and their keyscores with respect to the
sub-query terms “skyline query” and “target schema”. After calculating the keyscore values, we
recommend the set of top k keyphrases having the highest keyscore values.
Analysis of Relevance Results
For the query (“skyline query”, “target schema”), we first discuss the analysis for the sub-query
term “skyline query” and later for the sub-query term “target schema”. From Table 4.17, we
can see that for the sub-query term “skyline query”, the first recommended related keyphrase
is “skylinecomputation”, with the highest keyscore value, 0.766994, because it has the highest
keyrank value, 0.085213 [refer Table 4.13], and is the closest point to the query term
[refer Table 4.15]. Also, we can see that a keyphrase such as “skylinequeryprocessing” is
recommended earlier even though it is farthest from the query point, because of its high keyrank value
in comparison to “skylinepoints” and “skylineobjects”, which are nearest to the query point
but have low keyrank values.
Further, we can verify that when there is a tie in the keyranks of keyphrases, the keyphrase with the lower distance value is recommended first, as with the keyphrases “datapoints” and “movingobjects”. On the other hand, for keyphrases equidistant from the query point, the keyphrase with the higher keyrank value is recommended first, as verified by keyphrases such as “skylinequeryprocessing” and “skylinealgorithms”.
For the sub-query term “target schema”, from Table 4.18 the top recommended keyphrase is “sourceinstance”, with a keyscore value of 0.415500, a high keyrank value of 0.059375 [from Table 4.14], and a distance of 0.1429 [from Table 4.16]; this implies that the keyphrase with a high keyrank value at the least distance is always the top-priority keyphrase. Similarly, keyphrases such as “sourceschema”, “schemamapping” and “targetinstance”, which have high keyrank values and low distances from the query point in that order, are recommended one after another. Also, for tied keyrank values, the lower distance dominates the keyscore value and the closest keyphrase is recommended first, e.g. “schemamatching”, “schemaevolution” and “dataintegration”. Furthermore, the keyphrases “mappingcomposition” and “sourceandtarget” are both at an equal distance, i.e. 0.5333 [from Table 4.16], from the query point, but due to its higher keyrank value, i.e. 0.041667 [from Table 4.14], the keyphrase “mappingcomposition” is recommended first. This signifies that when keyphrases are at the same distance, the keyranks dominate the keyscore values and the higher-keyranked keyphrase is always recommended first.
4.3.4 Identifying Originating Documents
Once we obtain the set of related keyphrases relevant to the user’s query, we identify the originating documents of those keyphrases in the database. Table 4.19 shows a small set of those keyphrases and their originating documents, identified by the method discussed in Section 3.3. In Table 4.19, the first column shows the keyphrase, the second column the publication year and conference, the third column the paper title, i.e. the originating document, and the last column the first author of the document.
For the different case studies discussed earlier in Section 4.3, overall 46 related keyphrases were found relevant to the user’s queries. Out of these, we found originating documents for 15 keyphrases.
Algorithm 4: Keyrank(graph, dampingfactor=0.85, maxiterations=100, mindelta=0.00001)
Input: Graph of keyphrases
Output: Dictionary containing the KeyRank of every node
1: nodes = graph.nodes()
2: graphsize = len(nodes)
3: if graphsize == 0: then
4: return
5: end if
6: minvalue = (1.0-dampingfactor)/graphsize //value for nodes without inbound links
{Initialize the KeyRank dict with 1/N for all nodes}
7: keyrank = dict.fromkeys(nodes, 1.0/graphsize)
8: for i in range(maxiterations): do
9: diff = 0 //total difference compared to the last iteration
{Compute each node’s KeyRank based on inbound links}
10: for node in nodes: do
11: rank = minvalue
12: for referringkeyphrase in graph.incidents(node): do
13: rank += dampingfactor * keyrank[referringkeyphrase] / len(graph.neighbors(referringkeyphrase))
14: end for
15: diff += abs(keyrank[node] - rank)
16: keyrank[node] = rank
17: end for
{Stop if KeyRank has converged}
18: if diff < mindelta: then
19: break
20: end if
21: end for
22: return keyrank
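For concreteness, the iteration above can be written as runnable Python. The graph-library calls graph.incidents (in-links) and len(graph.neighbors(...)) (out-degree) are replaced by plain dictionaries, and the three-node example graph is hypothetical:

```python
# Minimal runnable sketch of the KeyRank iteration in Algorithm 4.
# "inlinks" stands in for graph.incidents(node) and "outdegree" for
# len(graph.neighbors(node)); the example graph is hypothetical.
def keyrank(inlinks, outdegree, dampingfactor=0.85, maxiterations=100,
            mindelta=0.00001):
    nodes = list(inlinks)
    graphsize = len(nodes)
    if graphsize == 0:
        return {}
    # value for nodes without inbound links
    minvalue = (1.0 - dampingfactor) / graphsize
    # initialize the KeyRank dict with 1/N for all nodes
    rank = dict.fromkeys(nodes, 1.0 / graphsize)
    for _ in range(maxiterations):
        diff = 0.0  # total difference compared to the last iteration
        for node in nodes:
            r = minvalue
            for ref in inlinks[node]:  # referring keyphrases
                r += dampingfactor * rank[ref] / outdegree[ref]
            diff += abs(rank[node] - r)
            rank[node] = r
        if diff < mindelta:  # KeyRank has converged
            break
    return rank

# Hypothetical keyphrase graph: a -> b, a -> c, b -> c, c -> a
inlinks = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
outdegree = {"a": 2, "b": 1, "c": 1}
ranks = keyrank(inlinks, outdegree)
```

As with PageRank, the scores of this example sum to roughly one, and the node with two in-links (“c”) outranks the node with one (“b”).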
Table 4.4: Keyphrases and Keyranks
Keyphrase Keyrank
regressionproblems 0.041667
streamprocessing 0.041667
streamingdata 0.052083
datastreams 0.055074
spectralclustering 0.055074
informationgain 0.055435
hierarchicalclustering 0.055435
clusteringalgorithms 0.067708
frequentitemset 0.069307
associationrules 0.107261
decisiontrees 0.110562
Algorithm 5: Dijkstra's Shortest Path Algorithm
1: def shortest-path(graph, source):
{Initialization}
2: dist = {source : 0} //distance from source to source
3: previous = {source : None}
4: q = graph.nodes() //q: set of all unvisited nodes in graph
5: while q: do
{Select the unvisited node u with the smallest known distance}
6: u = q[0]
7: for node in q[1:]: do
8: if (u not in dist) or (node in dist and dist[node] < dist[u]): then
9: u = node
10: end if
11: end for
12: q.remove(u)
{Process the remaining nodes reachable from u}
13: if (u in dist): then
14: for v in graph[u]: do
15: if v in q: then
16: alt = dist[u] + graph.edge-weight((u, v))
17: if (v not in dist) or (alt < dist[v]): then //Relax edge (u, v)
18: dist[v] = alt
19: previous[v] = u
20: end if
21: end if
22: end for
23: end if
24: end while
25: return previous, dist
26: end shortest-path
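The same procedure is runnable as plain Python when graph[u] is a dictionary mapping each neighbour v to the weight of edge (u, v), standing in for graph.edge-weight((u, v)); the three-node example graph is hypothetical:

```python
# Runnable sketch of Algorithm 5 on an adjacency-dict weighted graph.
# graph[u] maps each neighbour v of u to the weight of edge (u, v).
def shortest_path(graph, source):
    dist = {source: 0}         # best known distance from the source
    previous = {source: None}  # predecessor on the shortest path
    q = list(graph)            # unvisited nodes
    while q:
        # select the unvisited node u with the smallest known distance
        u = q[0]
        for node in q[1:]:
            if (u not in dist) or (node in dist and dist[node] < dist[u]):
                u = node
        q.remove(u)
        if u in dist:  # u is reachable from the source
            for v, w in graph[u].items():
                if v in q:
                    alt = dist[u] + w
                    if (v not in dist) or (alt < dist[v]):  # relax (u, v)
                        dist[v] = alt
                        previous[v] = u
    return previous, dist

# Hypothetical graph: a -> b (1), a -> c (4), b -> c (2)
g = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}}
previous, dist = shortest_path(g, "a")
```

The two-hop path a -> b -> c (total weight 3) correctly replaces the direct edge a -> c of weight 4.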
Table 4.5: Keyphrases and Distance
Keyphrase Distance
decisiontrees 0.142900
frequentitemset 0.166700
associationrules 0.200000
clusteringalgorithms 0.200000
informationgain 0.267900
datastreams 0.470000
hierarchicalclustering 0.450000
streamingdata 0.476200
spectralclustering 0.592900
regressionproblems 0.666600
streamprocessing 0.666600
Table 4.6: Keyphrases and Keyscores
Keyphrase Keyscore
regressionproblems 0.062506
streamprocessing 0.062506
streamingdata 0.0781323
datastreams 0.085552
spectralclustering 0.092889
hierarchicalclustering 0.123189
informationgain 0.205577
clusteringalgorithms 0.33854
frequentitemset 0.415759
associationrules 0.536305
decisiontrees 0.773702
Table 4.7: List of keyphrases and their keyranks matching with query term “decision trees”
Keyphrase Keyrank
frequentitemset 0.049348
clusteringalgorithms 0.049348
datastreams 0.049348
associationrules 0.049348
neuralnetwork 0.052142
streamprocessing 0.054004
streamingdata 0.054004
regressionproblems 0.065212
classlabel 0.071695
datamining 0.076816
informationgain 0.078212
Table 4.8: List of keyphrases and their distances corresponding to the query term “decision
trees”.
Keyphrase Distance
informationgain 0.166700
datamining 0.200000
classlabel 0.250000
regressionproblems 0.333300
neuralnetwork 0.333300
datastreams 0.392900
frequentitemset 0.416700
associationrules 0.450000
clusteringalgorithms 0.500000
streamprocessing 0.517900
streamingdata 0.726200
Table 4.9: List of keyphrases and their keyscores corresponding to the query term “decision
trees”.
Keyphrase Keyscore Rank
streamingdata 0.096327 11
clusteringalgorithms 0.097904 10
streamprocessing 0.104083 9
associationrules 0.108782 8
frequentitemset 0.117475 7
datastreams 0.124591 6
neuralnetwork 0.145142 5
regressionproblems 0.195656 4
classlabel 0.286780 3
datamining 0.384080 2
informationgain 0.469178 1
Table 4.10: List of keyphrases and their keyranks corresponding to the query term (“clustering uncertain data”, “data streams”)
Keyphrase Keyrank
streamprocessing 0.055589
streamingdata 0.055589
spectralclustering 0.058200
datamining 0.059028
hierarchicalclustering 0.059028
clusteringalgorithms 0.072132
uncertaindata 0.073423
streamclustering 0.080401
uncertaindatastreams 0.099245
Table 4.11: List of keyphrases and their distances matching with query term (“clustering uncertain data”, “data streams”).
Keyphrase Distance
uncertaindatastreams 0.333300
uncertaindata 0.333300
clusteringalgorithms 0.333300
streamclustering 0.360000
datamining 0.533300
hierarchicalclustering 0.533300
spectralclustering 0.676200
streamingdata 0.801200
streamprocessing 0.819100
Table 4.12: List of keyphrases and their keyscores corresponding to the query term (“clustering uncertain data”, “data streams”)
Keyphrase Keyscore Rank
streamprocessing 0.067866 9
streamingdata 0.069383 8
spectralclustering 0.086069 7
datamining 0.110684 6
hierarchicalclustering 0.110684 5
clusteringalgorithms 0.216418 4
uncertaindata 0.220290 3
streamclustering 0.223361 2
uncertaindatastreams 0.297765 1
Table 4.13: List of keyphrases and their keyranks corresponding to the query term “skyline
query”
Keyphrase Keyrank
sensornetworks 0.038462
datamanagement 0.042308
dataintegration 0.042308
datapoints 0.042308
movingobjects 0.042308
subspaceskyline 0.051122
skylineobjects 0.053212
skylinepoints 0.057692
skylinealgorithms 0.063462
queryprocessing 0.080409
skylinequeryprocessing 0.083894
skylinecomputation 0.085213
Table 4.14: List of keyphrases and their keyranks corresponding to the query term “target
schema”.
Keyphrase Keyrank
mappingsystem 0.028214
dataintegration 0.033333
schemamatching 0.033333
schemaevolution 0.033333
dataexchange 0.037210
schematree 0.039583
mappinggeneration 0.040365
sourceandtarget 0.040365
schemaelements 0.041667
mappingcomposition 0.041667
targetinstance 0.050000
schemamapping 0.056250
sourceschema 0.057031
sourceinstance 0.059375
Table 4.15: List of keyphrases and their distances matching with query term “skyline query”
Keyphrase Distance
skylinecomputation 0.111100
skylinepoints 0.200000
skylineobjects 0.250000
skylinequeryprocessing 0.250000
skylinealgorithms 0.250000
subspaceskyline 0.333300
datapoints 0.392900
movingobjects 0.500000
sensornetworks 0.533300
datamanagement 0.583300
dataintegration 0.583300
queryprocessing 0.583300
Table 4.16: List of keyphrases and their distances matching with query term “target schema”.
Keyphrase Distance
sourceinstance 0.142900
sourceschema 0.166700
schemamapping 0.200000
targetinstance 0.250000
mappinggeneration 0.325000
schemaelements 0.333300
dataintegration 0.533300
mappingcomposition 0.533300
sourceandtarget 0.533300
schematree 0.540000
schemamatching 0.592900
schemaevolution 0.650000
dataexchange 0.717900
mappingsystem 0.816700
Table 4.17: List of keyphrases and their keyscores matching with query term “skyline query”
Keyphrase Keyscore Rank
sensornetworks 0.072120 12
datamanagement 0.072532 11
dataintegration 0.072532 10
movingobjects 0.084615 9
datapoints 0.107681 8
queryprocessing 0.137851 7
subspaceskyline 0.153381 6
skylineobjects 0.212848 5
skylinealgorithms 0.253848 4
skylinepoints 0.288462 3
skylinequeryprocessing 0.335577 2
skylinecomputation 0.766994 1
Table 4.18: List of keyphrases and their keyscores matching with query term “target schema”.
Keyphrase Keyscore Rank
mappingsystem 0.034546 14
schemaevolution 0.051282 13
dataexchange 0.051832 12
schemamatching 0.056221 11
dataintegration 0.062504 10
schematree 0.073302 9
sourceandtarget 0.075688 8
mappingcomposition 0.078131 7
mappinggeneration 0.124199 6
schemaelements 0.125014 5
targetinstance 0.200000 4
schemamapping 0.281250 3
sourceschema 0.342117 2
sourceinstance 0.415500 1
Table 4.19: List of keyphrases and corresponding originating documents.
Keyphrase | Conference, Year | Paper Title | First Author
clustering uncertain data | ICDE 2008 | A Framework for Clustering Uncertain Data Streams | Charu C. Aggarwal
skyline query processing | ICDE 2007 | Efficient Skyline Query Processing on Peer-to-Peer Networks | Shiyuan Wang
spectral clustering | ICDM 2005 | Integrating Hidden Markov Models and Spectral Analysis for Sensory Time Series Clustering | Jie Yin
uncertain data streams | ICDE 2008 | A Framework for Clustering Uncertain Data Streams | Charu C. Aggarwal
subspace skyline | ICDE 2006 | SUBSKY: Efficient Computation of Skylines in Subspaces | Yufei Tao
4.3.5 Analysis using Google Similarity Distance Measure
To interpret the significance of our results, we use Google Scholar to show that the list of related keyphrases recommended by our approach is acceptable and relevant to the user’s query. To see the relevance of keyphrases to the user’s query term, we search for pairs of keyphrases such as (“query term” “keyphrase”) together, as well as single keyphrases such as (“query term”) or (“keyphrase”), and collect statistics showing how many relevant documents are produced by Google Scholar. In Google Scholar, under the advanced scholar search option, we have considered the following search options:
• find articles with exact phrase.
• return articles published between the years 2005 and 2009, as our dataset includes only articles published in these years.
• search articles only in the (Engineering, Computer Science, and Mathematics) subject
areas.
When the Google Scholar engine is used to search for word x, Google displays the number
of hits that word x has. The ratio of this number of hits to the total number of webpages
indexed by Google represents the probability that word x appears on a webpage. Cilibrasi
and Vitanyi [CV07] use this probability to extract the meaning of words from the world-
wide-web. If word y has a higher conditional probability to appear on a webpage, given
that word x also appears on the webpage, than it does by itself, then it can be concluded
that words x and y are related. Moreover, higher conditional probabilities imply a closer
relationship between the two words. Thus, word x provides some meaning to word y and
vice versa.
Cilibrasi and Vitanyi’s Normalized Google Distance (NGD) function measures how close
word x is to word y on a zero to infinity scale. A distance of zero indicates that the two
words are practically the same. Two independent words have a distance of one. A distance
of infinity occurs for two words that never appear together.
On the average, two random words should be independent of one another. Hence, two
random words should have an NGD of one. To evaluate the NGD among the different word
pairs, Cilibrasi and Vitanyi's formula is defined as:
NGD(x, y) = [max(log f(x), log f(y)) − log f(x, y)] / [log M − min(log f(x), log f(y))]    (4.6)
where f(x) and f(y) are the number of hits of words x and y, respectively, and M is the
total number of web pages that Google indexes.
To measure the relatedness of the keyphrases obtained by our approach, we use the Normalized Google Distance (NGD) measure. The method works by first calculating a distance matrix whose entries are the pairwise NGDs of the terms in the input list.
Consider the first case study, where the query term is “decision trees”. The final recommended list of keyphrases from Table 4.9 is considered, and the NGD for each pair of (query term, keyphrase) is calculated. At the time of the experiment, a Google Scholar search for “decision trees” returned 16,400 hits. The number of hits for the search term “information gain” was 7,530. Searching for pages where both (“decision trees”, “information gain”) occur gave 2,600 hits. As we did not know the total number of pages indexed by Google, we assumed a total of 1,000,000 indexed pages for this case. Using these numbers in the NGD formula (4.6), with M = 1,000,000, yields the following Normalized Google Distance between the terms “decision trees” and “information gain”:
NGD(decision trees, information gain) ≈ 0.3767
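The computation above can be reproduced in a few lines. The hit counts are those reported in the text, and M = 1,000,000 is the assumed index size; since Equation 4.6 is a ratio of log differences, the base of the logarithm cancels and the natural logarithm can be used:

```python
# Reproducing the NGD computation for ("decision trees", "information gain").
import math

def ngd(fx, fy, fxy, M):
    # Equation 4.6: (max(log f(x), log f(y)) - log f(x, y)) /
    #               (log M - min(log f(x), log f(y)))
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

d = ngd(fx=16400, fy=7530, fxy=2600, M=1_000_000)  # ~= 0.3767
```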
Similarly, the NGD is calculated for the other keyphrase pairs. Table 4.20 shows the list of keyphrases, their NGD distances from the query term “decision trees”, and the ranking of the results.
From Table 4.9 and Table 4.20, notice that the first recommended keyphrase is “informationgain” using both approaches. The ranks of the other keyphrases in Table 4.20 also correlate well with those in Table 4.9, but they are not exactly the same. This may be due to the fact that the dataset used by Google Scholar is huge in comparison with our dataset, in which we consider only a limited number of conferences from the data mining/databases domain. For some exceptional cases where the ranks differ significantly, it may be that these keyphrases occur very frequently in documents of other fields but rarely in our dataset.
Table 4.20: List of keyphrases and their NGD corresponding to the query term “decision
trees”.
Keyphrase NGD Rank
information gain 0.3767 1
data mining 0.6065 4
neural network 0.7165 8
regression problems 0.6233 5
class label 0.3924 2
association rules 0.5815 3
frequent itemset 0.6847 7
data streams 0.7508 9
stream processing 0.9328 11
streaming data 0.8072 10
clustering algorithms 0.6480 6
Similarly, for the other case studies the NGD measure is computed based on the query term and related keyphrases. For the query terms (“clustering uncertain data”, “data streams”) and (“skyline query”, “target schema”), the numbers of hits are first collected through Google Scholar, and then, using the NGD measure, the distance between the query term and each recommended keyphrase is calculated. For case study 2’s query term, the final list of recommended keyphrases is shown in Table 4.12 and is considered for evaluating the NGD measure. Table 4.21 shows the list of keyphrases, their evaluated NGDs and their rankings with respect to the query term (“clustering uncertain data”, “data streams”). In case study 3, the query term is evaluated using each sub-query term separately, so for each sub-query term a different list of related keyphrases and their keyscores is identified; thus Table 4.17 and Table 4.18 are used for identifying the NGD measure. Table 4.22 and Table 4.23 show the list of keyphrases, their evaluated NGDs and their rankings for the sub-query terms “skyline query” and “target schema” respectively.
Table 4.21: List of keyphrases and their NGD corresponding to the query term (“clustering uncertain data”, “data streams”).
Keyphrase NGD Rank
uncertaindatastreams 0.045032 1
uncertaindata 0.444231 3
clusteringalgorithms 0.676886 5
datamining 0.834248 9
spectralclustering 0.765654 8
hierarchicalclustering 0.744809 7
streamprocessing 0.693548 6
streamingdata 0.566321 4
streamclustering 0.360304 2
Typically, the results of our keyscore measure correlate very well with NGD. Note that this is the case even though the dataset (i.e. the research paper collection) and the ranking techniques are quite different between the two approaches. Specifically, while we have only research papers from 6 conferences, the Google Scholar collection is more exhaustive. Also, Google may use undocumented ranking techniques, such as utilizing the implicit feedback of users while browsing through search results. In spite of these differences, there is still a good correlation between the Google Scholar results and our approach, as can be seen in Table 4.9 and Table 4.20. Further, in the next section, to measure the degree of correlation between the two approaches, we use Spearman’s Rank Correlation coefficient. The results obtained previously are then interpreted statistically using the t-test, showing the statistical significance of the results at significance levels of 5% and 1%.
4.3.6 Spearman’s Rank Correlation
Spearman’s Rank Correlation is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other. It is a technique used to test the direction and strength of the relationship between two variables. In other words, it is a device to show whether any one set of numbers has an effect on another set of numbers. It is often denoted by ρ or rs. The Spearman rank correlation coefficient is defined by:
rs = 1 − (6 Σ di²) / (n(n² − 1))    (4.7)
where di = xi − yi is the difference in statistical ranks of corresponding variables and n is the total number of observations in the dataset, the sum running over i = 1, ..., n. The value of rs lies between -1 and +1; the closer rs is to +1 or -1, the stronger the likely correlation. A perfect positive correlation is +1 and a perfect negative correlation is -1. If rs = 0, there is no correlation between the variables.
Table 4.22: List of keyphrases and their NGD corresponding to the query term “skyline
query”.
Keyphrase NGD Rank
subspaceskyline 0.179653 6
skylinequeryprocessing 0.097559 3
skylinepoints 0.0952461 2
datapoints 0.684654 8
skylineobjects 0.177634 5
skylinealgorithms 0.158339 4
queryprocessing 0.522789 7
skylinecomputation 0.063742 1
dataintegration 0.809186 11
datamanagement 0.715312 9
movingobjects 0.717215 10
sensornetworks 0.918243 12
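Equation 4.7 can be sketched directly for rankings without ties; x and y hold the rank positions assigned to the same n items by the two methods, and the example rankings below are synthetic:

```python
# Spearman's rank correlation (Equation 4.7) for rankings without ties.
def spearman_rs(x, y):
    n = len(x)
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))  # sum of d_i^2
    return 1 - 6 * d2 / (n * (n * n - 1))

identical = spearman_rs([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])  # perfect +1
reversed_ = spearman_rs([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])  # perfect -1
```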
Table 4.23: List of keyphrases and their NGD corresponding to the sub-query term “target
schema”.
Keyphrases NGD Rank
sourceandtarget 0.474858 11
sourceinstance 0.083498 1
schemamapping 0.204707 3
dataexchange 0.552462 13
targetinstance 0.239534 5
schematree 0.338834 9
mappingsystem 0.625251 14
schemaelements 0.234208 4
schemamatching 0.278603 7
sourceschema 0.088522 2
mappinggeneration 0.271431 6
dataintegration 0.478366 12
mappingcomposition 0.321632 8
schemaevolution 0.339999 10
A further technique is now required to test the significance of the relationship. The calculated rs value must be looked up in the Spearman’s Rank significance table at ν = (n − 2) degrees of freedom (df), for either a two-tailed or a one-tailed test, at the 0.05 and 0.01 levels of significance.
Determine Significance
One approach to testing whether an observed value of rs is significantly different from zero (rs always satisfies 1 ≥ rs ≥ -1) is to calculate the probability that it would be greater than or equal to the observed rs, given the null hypothesis, by using a permutation test. An advantage of this approach is that it automatically takes into account the number of tied data values in the sample and the way they are treated in computing the rank correlation.
4.3.7 t - test for testing the significance of an observed sample
correlation coefficient
If rs is the observed correlation coefficient in a sample of n pairs of observations from a bivariate normal population, then Prof. Fisher proved that under the null hypothesis H0 : rs = 0, i.e. the correlation coefficient is 0, the statistic:
t = rs √((n − 2) / (1 − rs²))    (4.8)
follows Student’s t-distribution with ν = (n − 2) degrees of freedom (df).
If the value of t comes out to be significant, we reject H0 at the level of significance adopted and conclude that rs ≠ 0, i.e. rs indicates a significant correlation in the population.
If t comes out to be non-significant, then H0 may be accepted and we conclude that the variables may be regarded as uncorrelated in the population.
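Equation 4.8 is straightforward to compute. With rs = 0.6818 and n = 11 (the “decision trees” query from Table 4.24), it reproduces the calculated t of about 2.8 reported in Table 4.25:

```python
# t-statistic for an observed sample rank correlation (Equation 4.8).
import math

def t_statistic(rs, n):
    # t = rs * sqrt((n - 2) / (1 - rs^2)), with df = n - 2
    return rs * math.sqrt((n - 2) / (1 - rs * rs))

t = t_statistic(0.6818, 11)
# |t| is then compared against the tabulated two-tailed value t_{0.05, df}
```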
4.3.8 Analysis of Results using Rank Correlation and t - test
Table 4.24 shows the rank correlation results, calculated using the formula defined in Section 4.3.6, where xi and yi are the rank values assigned to each keyphrase based on its keyscore value in the keyscore tables and its NGD value in the NGD tables, respectively, for each query term. These results give the calculated rs and the tabulated critical values for each query term.
Table 4.24: Spearman’s rank correlation results
Query n df cal rs 0.05 (tab rs) 0.01 (tab rs)
decision trees 11 9 0.6818 0.700 0.833
(clustering uncertain 9 7 0.69 0.786 0.929
data, data streams)
skyline query 12 10 0.980769 0.648 0.794
target schema 14 12 0.907692 0.538 0.675
Figure 4.6 shows the significance graph for Spearman’s rank correlation with respect to the degrees of freedom. The calculated rs value is looked up in the Spearman rank significance graph as follows:
• If it is below the line marked 5%, then it is possible our result was the product of chance and we must accept the null hypothesis.
• If it is above the 1% line, we can be 99% confident in accepting the alternative hypothesis.
• If it is above the 5% line but below the 1% line, we can be 95% confident (i.e. statistically there is a 5% likelihood that the result occurred by chance).
Figure 4.6: The significance of Spearman’s Rank Correlation and degrees of freedom.
From Table 4.24, for df = 7 the value 0.69 gives a significance level of slightly less than 5%, as shown in Figure 4.6. This means that the probability of the relationship we have found being a chance event is about 5 in 100; we are 95% certain that our hypothesis is correct. For df = 9 the value is 0.6818, which lies between the 5% and 1% regions, implying that we are 95% confident in rejecting our null hypothesis. This signifies that in 95 cases out of 100 both approaches perform equally well. For df = 10 and 12, we are 99% sure of accepting the alternative hypothesis, which implies that our two approaches perform very closely. For 0.5 < rs < 1, we can say that there is a strong positive correlation between the variables. From Table 4.24, we can notice that all values of rs lie in the range 0.5 < rs < 1, signifying a strong positive correlation between the two approaches.
Further, to test the degree of acceptance or rejection of the two approaches for each case separately, the t-test is calculated. We take as the null hypothesis that there is no correlation, i.e. rs = 0, between the two approaches, and as the alternative hypothesis that rs ≠ 0. So our null (H0) vs. alternative (H1) hypothesis is defined as follows:
H0 : rs = 0,
H1 : rs ≠ 0
• If calculated t0.05,df < tabulated t0.05,df , then accept H0 and reject H1.
• If calculated t0.05,df > tabulated t0.05,df , then reject H0 and accept H1.
In addition, Table 4.25 shows the t-test results calculated using the rs values and the formula described in Section 4.3.7.
Table 4.25: t - test results
Query n df cal t 0.05 (tab t) Testing result
decision trees 11 9 2.8 2.2620 Accept H1
(clustering uncertain 9 7 2.1653 2.365 Accept H0
data, data streams)
skyline query 12 10 15.8773 2.228 Accept H1
target schema 14 12 7.4930 2.179 Accept H1
From Table 4.25, we can notice that for 3 out of 4 cases our null hypothesis of no correlation between the two approaches is rejected: there is a correlation between the results even though they are produced by different approaches, and they do not differ significantly. Only for the df = 7 case is the null hypothesis accepted, which implies that for this case there is no significant relationship between the approaches.
4.4 Summary
We presented a related-keyphrase identification method to recommend a set of keyphrases relevant to a given user’s query. We built our approach on the notions of impact (keyrank) and proximity (shortest distance) of the keyphrases. One final remark regarding keyphrase ranking: we obtain a keyscore value for each keyphrase using the keyrank and shortest-distance measures, and keyphrases are recommended from first to last in order of decreasing keyscore. The experimental results show that the proposed method performs well, and its comparison with the Google similarity measure is also shown. Furthermore, statistical tests are used to measure the degree of correlation between the two approaches at different levels of significance.
Chapter 5
ProMax: A Profit Maximizing
Recommendation System for Market
Baskets
In the previous chapter, we discussed an approach to identify related keyphrases based on the proximity and importance of keyphrases, and used a knapsack-based approach to recommend a set of top k keyphrases corresponding to the user’s query, along with their originating documents. In this chapter, we apply the knapsack-based solution in a different domain: recommending a set of items to customers in retail stores such that the profit of the store is maximized.
Most data mining research has focused on developing algorithms to discover statistically
significant patterns in large datasets. However, utilizing the resulting patterns in decision
support is more of an art, and the utility of such patterns is often questioned. We formalize
a technique that utilizes data mining concepts to recommend an optimal set of items to
customers in a retail store based on the contents of their market baskets. The recommended
set of items maximizes the expected profit of the store and is decided based on patterns
learnt from past transactions. In addition to the concepts of clustering and frequent itemsets, the proposed method also combines ideas from recommendation systems and the knapsack problem to decide on the items to recommend. We empirically compare our approach with existing
methods on both real and synthetic datasets and show that our method yields better profits
while being faster and simpler.
5.1 Introduction
The data mining research literature is replete with several algorithms - most of which are
elegant, efficient and effective. However, there are only a few formal studies concerning how
data mining can actually be beneficial for specific applications. A major obstacle in applying data mining is the gap between statistically significant pattern extraction and value-based decision making [WZH02]. It is often questioned how exactly one should make
use of the patterns extracted through data mining algorithms for making effective business
decisions, with the ultimate goal of yielding better profits for the business.
Similarly, studies about the retail market [Tom02] have received wide attention, although
only a few of them have seriously dealt with data mining. Ke Wang et al. [WZH02] first
presented a profit mining approach to reduce this gap in 2002, and recent investigations have shown an increasing interest in how to make decisions by utilizing association rules. We
focus on the problem of recommending products to customers in retail stores such that the
profit of the store is maximized.
Market basket databases contain historical data on prior customer choices where each
customer has selected a subset of items, a market basket, out of a large but finite set.
This data can be used to generate a dynamic recommendation of new items to a customer
who is in the process of making the item choices. Some retail outfits provide carts with
displays that provide product information and recommendations as the customer shops.
Remote shopping systems allow customers to compose shopping lists through personal digital
assistants (PDAs), with interactive recommendations of likely items. Internet commercial
sites often provide dynamic product choices as the customer adds items into the virtual
shopping cart, or market basket. Internet sites also display dynamically changing set of
links to related sites depending on the browsing pattern during a surfing session. Faced
with an enormous variety of options, customers surfing the web gravitate toward sites that
offer information tailored to their personal preferences. All these activities are characterized
by the progressive item choices being made by a customer, and the provider’s desire to
recommend items that are the most likely next choice of the customer [HNB01].
Recommender systems are rapidly becoming a core tool to accelerate cross-selling and
strengthen customer loyalty due to the prosperity of electronic commerce. Enterprises have
been developing new business portals and providing large amounts of product information to create more business opportunities and expand their markets. However, this results in an information overload problem, which becomes a burden on customers when making a purchase decision among a huge variety of products [CCH+07]. The past history of the
items in each market basket transaction is often available, although most of this data is
proprietary. The process of generating recommendations using this data has been called
collaborative filtering, and treated in the same way as the modeling of personal preferences
of movies or news articles. However, we feel that the market basket recommendation is
inherently a different problem due to the large number of categories of items and their
associated profits.
From the market angle, two important criteria [BSVW99] should be taken into account
during the process of profit mining: the items in retail shops should first meet basic sales requirements, and second, should bring higher profits. Therefore, how to meet these two
principles is the core problem of profit mining. The cross-selling effect of items [Tom02]
has been noticed by current retailers: the profit of an item is not only involved in the item
itself, but is also influenced by its related items. Some items fail to produce high profit by
themselves, but they might stimulate customers to buy other profitable items. Consequently,
the cross-selling factor which can be studied by the analysis of historical transactions should
be involved in the problem of item selection.
Searching for such a relation of items to support cross-selling has become an important
issue. Current approaches to study these relationships are based on association rules. How-
ever, association rules by themselves do not suggest how to maximize profit.
We present an algorithm that combines simple ideas from association rule mining, clus-
tering, recommendation systems, and the knapsack problem to recommend those items to
customers that maximize the expected profit. We tested our algorithm on two popular
datasets: one was generated using the data generator from the IBM Almaden Quest
research group [BSVW04] and the other was a retail market basket dataset available in the
FIMI repository (http://fimi.cs.helsinki.fi/data). Our experiments show that our algorithm
performs better than competing algorithms in terms of obtaining better profits, while at the
same time being faster and simpler.
5.2 Related Work
Many novel and important methods have been proposed to support profit mining. Brijs et al.
first proposed the PROFSET model [Tom02], which adopted a size-constrained 0-1
programming strategy and took advantage of the cross-selling effect of items to solve the
problem of item selection.
In 2002, Ke Wang et al. first presented the profit mining approach and several related
problems [WZH02, Zho09]. Ke Wang et al. proposed the HAP algorithm [WS02], which
extends the hub-authority web page ranking algorithm HITS [Kle99] to solve the problem
of item ranking while considering the influence of confidence and profit, but it still has several
drawbacks [wWcF03]. Raymond Wong et al. proposed the maximal profit problem of item
selection (MPIS) [wWcF03], which has the goal of mining a subset of items with
maximal profit, thereby addressing the above drawbacks. However, MPIS is difficult
to implement and solves an NP-hard problem even in the simplest situation. In other words,
although the MPIS algorithm can find the best solution, its time cost is too expensive to
be tolerated.
Recently, Raymond Wong et al. proposed a hill climbing method [WF04] to solve the
problem of Item Selection for Marketing (ISM) by considering the cross-selling effect. Ray-
mond Wong et al. [WFW05] also adopted genetic algorithms to generate locally optimal
profits to approximate the optimal profit of the item selection problem. In general, if the
minimum support is decreased, the number of association rules keeps increasing.
The DualRank algorithm [XJWZ05] uses a graph built from association rules; this graph
shrinks when the minimum support is increased, because the out-degrees of the items
decrease. This affects the matrix calculations, which in turn affects profit-based item
selection. Another problem with DualRank [XJWZ05] is that it is not
very efficient for sparse data sets, and it involves cumbersome calculations of eigenvalues and
eigenvectors which become intractable for large transactional data sets. To overcome
these limitations we have developed a new algorithm for market basket datasets.
5.3 Problem Definition
In this section, we focus on the problem of recommending products to customers in retail
stores such that the profit of the store is maximized. The following definition formally
captures the problem statement:
Definition 14 (Profit Mining) Given a transactional dataset D = {t1, t2, . . . , tm}, where
each transaction ti contains a subset of items (itemset) from the set I = {i1, i2, . . . , in},
each item ij having an associated profit pj, the problem is to select k additional items to
recommend for each transaction, with the goal of making more sales and maximizing the
total profit of the resulting transactions.
There are several challenges to this problem:
1. Model product purchase probabilities: We need to recommend products that
are more likely to be purchased. It is therefore important to have a clear understanding
of how to model the probability of purchase of each product.
2. Model product relationships: It is not sufficient to know the individual purchase
probabilities of different products. We need to also identify relationships between items
for cross-selling. The standard technique in recent approaches for this purpose has
been to use association rules.
3. Model customers: Even high-confidence association rules may not apply to a par-
ticular customer, whereas some low-confidence rules may apply. Thus, it is imperative
to model which category a customer belongs to and then study the rules
within that category. This is more likely to result in effective recommendations. The
standard technique in the recommendation system community for this purpose is to
cluster customers such that customers within each cluster share the same purchase
patterns.
4. Balance purchase probability and profit: A pure association rule based approach
will favour rules with high confidence so as to maximize the probability that the cus-
tomer will purchase the recommended item. For example, the rule {Perfume} →
{Lipstick} will be favoured because of higher confidence compared to a rule {Perfume}
→ {Diamond}. In contrast, a pure profit-based approach will favour the latter rule
hoping for higher profit. Neither necessarily maximizes the true profit. Indeed, items
of high profit often also have low supports, because fewer people buy expensive items.
5. Decide the number of products to recommend: An implicit constraint here
is that if we recommend too many items then the customer will be overwhelmed by
the choices and is likely to avoid choosing any item at all. On the other hand, if we
recommend too few items, then we may miss a successful sale of a product that has
not been brought to the attention of the customer. The correct number of products to
recommend depends on the attention span of customers and would vary depending on
the domain.
Clearly, recommending the right products for each market basket in a store is a challenging
problem, and the task of designing an elegant, simple and yet effective
algorithm seems very difficult upfront.
5.4 The ProMax Algorithm
As discussed in the previous section, on the face of it, the problem of recommending the right
products is replete with challenges. It therefore seems daunting to design a simple
yet effective algorithm for this task, so much so that designing an optimal algorithm that
guarantees to maximize the expected profit seems out of reach. Yet this is exactly what we
achieve. In this section we present ProMax, a Profit Maximizing recommendation system
for market baskets.
Our algorithm performs a clustering of the customer transactions as a preprocessing step.
For this purpose, it uses the clustering algorithm explained in [WXL99]. Then, at
recommendation time, the algorithm proceeds in the following steps:
1. Identify the cluster C to which the current record belongs.
2. Calculate the expected profit of each item in the cluster C.
3. Sort the items based on expected profit and recommend the top k items, where k is a
parameter to the algorithm.
First, Section 5.4.1 describes the clustering algorithm used in step 1. Then, Section 5.4.2
describes how the expected profit of items is computed.
5.4.1 Clustering Customer Transactions and Identification of the
Current Cluster
Since the probability of earning more profit is directly proportional to the purchase
probability of items, it is imperative to accurately estimate the purchase probability of
specific items by specific customers. A naive solution is to use the global support of items
as an estimate of their probability. But, as customers differ in their purchase patterns, the
global support of an item is an unreliable estimator of its likelihood of sale.
Therefore, a natural approach is to first cluster the transactions based on their purchase
patterns and then use the support of items within a cluster as more accurate estimates of
their purchase probabilities. The clustering criterion we use is based on the notion of large
items and was proposed in [WXL99].
For a given collection of transactions and a minimum support, our aim is to find a clustering
C such that cost(C) is minimized, i.e., to group similar transactions so that intra-cluster
similarity is high and inter-cluster similarity is low. cost(C) is computed from the
intra-cluster and inter-cluster similarities, which are measured using the notions of small
items and large items, respectively.
Large items are used to calculate the similarity measure used for clustering transactions.
The support of an item in a cluster Ci is the number of transactions in Ci that contain the
item. Let |S| denote the number of elements in set S. For a user-specified minimum support
θ (0 < θ ≤ 1), an item is large in cluster Ci if its support in Ci is at least θ × |Ci|; otherwise,
an item is small in Ci.
Intuitively, large items are popular items in a cluster and thus contribute to the similarity
of items within a cluster. In contrast, small items contribute to dissimilarity in a cluster. Let
Largei denote the set of large items in Ci, and Smalli denote the set of small items in Ci.
Consider a clustering C = {C1, C2, . . . , Ck}. The current record r is assigned to the cluster
whose cost is minimum. This can be defined mathematically as:

Cluster(r) = argmin_i [Cost(Ci)] (5.1)
The cost of C to be minimized depends on two components: the intra-cluster cost and
the inter-cluster cost:
Intra-cluster cost: This component is charged for the intra-cluster dissimilarity, mea-
sured by the total number of small items:
Intra(C) = |Small1 ∪ Small2 ∪ · · · ∪ Smallk| (5.2)
This component restrains the creation of loosely bound clusters that have too many small
items.
Inter-cluster cost: This component is charged for the inter-cluster similarity. Since large
items contribute to similarity in a cluster, each cluster should have as little overlapping of
large items as possible. This overlapping is measured as:
Inter(C) = (|Large1| + |Large2| + · · · + |Largek|) − |Large1 ∪ Large2 ∪ · · · ∪ Largek| (5.3)
In words, Inter(C) measures the duplication of large items across different clusters. This
component restrains the creation of similar clusters.
To put the two together, one can specify weights for their relative importance. The cost
function of the clustering C is then defined as:
Cost(C) = w × Intra(C) + Inter(C) (5.4)
A weight w > 1 gives more emphasis to the intra-cluster similarity, and a weight w < 1
gives more emphasis to the inter-cluster dissimilarity. By default w = 1.
The pseudocode of the clustering algorithm as described above is shown in Algorithm 6.
Algorithm 6: Clustering Algorithm
/*Allocation Phase*/
while not end of file do
read the next transaction < t,− >;
allocate t to an existing or a new cluster Ci to minimize Cost(C);
write < t,Ci >;
end while
/*Refinement Phase*/
repeat
not moved = true;
while not end of the file do
read the next transaction < t,Ci >;
move t to an existing non-singleton cluster Cj to minimize Cost(C);
if Ci ≠ Cj then
write < t,Cj >;
not moved = false;
eliminate any empty cluster;
end if
end while
until not moved ;
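A minimal sketch of the cost function (Equations 5.2-5.4) and of the allocation phase of Algorithm 6, assuming transactions are represented as item sets; all function and variable names here are illustrative, not taken from [WXL99]:

```python
from collections import Counter

def cluster_cost(clusters, theta, w=1.0):
    """Cost(C) = w * Intra(C) + Inter(C), per Equations 5.2-5.4."""
    larges, smalls = [], []
    for c in clusters:
        counts = Counter(item for t in c for item in t)
        threshold = theta * len(c)  # an item is large if its support >= theta * |Ci|
        large = {i for i, n in counts.items() if n >= threshold}
        larges.append(large)
        smalls.append(set(counts) - large)
    intra = len(set().union(*smalls))                                  # Eq. 5.2
    inter = sum(len(lg) for lg in larges) - len(set().union(*larges))  # Eq. 5.3
    return w * intra + inter                                           # Eq. 5.4

def allocate(transaction, clusters, theta, w=1.0):
    """Allocation phase: place the transaction in the existing or a new
    cluster that minimizes Cost(C); returns the chosen cluster index."""
    best_cost, best_idx = None, None
    for idx in range(len(clusters) + 1):  # the last index stands for a new cluster
        trial = [list(c) for c in clusters] + [[]]
        trial[idx].append(set(transaction))
        trial = [c for c in trial if c]   # drop the unused empty slot
        cost = cluster_cost(trial, theta, w)
        if best_cost is None or cost < best_cost:
            best_cost, best_idx = cost, idx
    if best_idx == len(clusters):
        clusters.append([set(transaction)])
    else:
        clusters[best_idx].append(set(transaction))
    return best_idx
```

Re-evaluating the full cost for every candidate placement keeps the sketch short but is inefficient; a practical implementation would update the two cost components incrementally.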
5.4.2 Calculating Expected Profit
Once clusters are computed, the probability of an item’s purchase can be found by first
identifying which cluster the current transaction falls into and then looking up the support
of the corresponding item in that cluster. That is, the item’s probability is estimated as its
support within a cluster, rather than its global support. As mentioned in Section 5.4.1, this
manner of computing the probability of items is far more accurate in terms of their likelihood
of purchase.
In this context, we can compute the expected profit of a given item i with the help of its
probability (intra-cluster frequency) f and profit p. The expected profit for each item can be
computed as:
Ei = fi × pi (5.5)
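Equation 5.5 can be evaluated for every item in the identified cluster as sketched below; the cluster representation and the profit table are illustrative assumptions:

```python
from collections import Counter

def expected_profits(cluster, profits):
    """E_i = f_i * p_i (Equation 5.5), where f_i is the support fraction
    of item i within the cluster and p_i is its profit."""
    counts = Counter(item for t in cluster for item in t)
    n = len(cluster)
    return {i: (c / n) * profits.get(i, 0.0) for i, c in counts.items()}
```

For example, in a cluster of two transactions where both customers bought lipstick and one also bought a diamond, the diamond's low frequency is weighed against its high profit.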
5.4.3 Recommending Items: Knapsack Approach
In our scenario, we can recommend k items to the customer, where k is a constant
chosen based on the domain. The goal is to maximize profit, so we must ensure that the
recommended items have both high purchase probability and high profit.
There are mainly two options:
1. Recommending items with high purchase probability: Consider the example of lipstick
and diamond. Suppose the price of a lipstick is $1 and the price of a diamond is $500, and
the purchase probability of lipstick is 0.03 and of diamond is 0.0001. If we consider
purchase probability alone, we will recommend lipstick, which would
not yield high profit.

2. Recommending items with high profit: In the same example, the diamond has much
higher profitability than the lipstick. If we consider only the profit, we will recommend
the diamond, which is not a good recommendation since its purchase probability is tiny.
If the goal is to maximize either purchase probability or profit separately as mentioned
above, we could directly use the knapsack approach [HS78]. We reduce the current problem
to the knapsack problem in the following manner.
Definition 15 (Knapsack Problem) Given a set of items, each with a weight and a value,
determine the number of each item to include in a collection so that the total weight is less
than a given limit and the total value is as large as possible.
We reduce profit mining to a knapsack problem using the following equivalences:
• Items to carry in a knapsack correspond to the items for sale in profit mining.
• Item weight (in knapsack problem) is set to 1 for all items in the store.
• Weight limit (knapsack problem) is set to k – the number of items to recommend (in
profit mining).
• Item value (in knapsack problem) is set to the expected profit of that item (which is
purchase probability × item profit).
The result is a greedy algorithm to recommend items: once the expected profit of items
is computed, we sort all items in decreasing order of expected profit and recommend
the k items with the highest expected profits. The knapsack reduction
guarantees that this greedy approach maximizes the overall expected profit.
The pseudo-code of the resulting algorithm as described above is shown in Algorithm 7.
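Because every item has unit weight, the knapsack reduction degenerates to this greedy sort. A minimal sketch, assuming a mapping from items to their expected profits (Equation 5.5) and that items already in the customer's basket are excluded:

```python
def recommend(expected, basket, k):
    """Return the k items with the highest expected profit that are not
    already in the customer's basket (unit-weight knapsack, greedy)."""
    candidates = [(item, e) for item, e in expected.items() if item not in basket]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in candidates[:k]]
```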
5.5 Experimental Evaluation
To evaluate the performance of the ProMax algorithm we ran a set of experiments on two
data sets: a real retail data set [BSVW99] available from the Frequent Itemset Mining
Implementations (FIMI) Repository (http://fimi.cs.helsinki.fi/data) and an IBM Synthetic
Dataset [BSVW04]. The choice
of these particular data sets is not restrictive since they are widely used standard benchmark
data sets, and the structure is typical of the most common application scenarios. In fact
we can apply this algorithm in all scenarios provided we have the past transactions and a
current customer transaction.
We compare the efficiency of the ProMax algorithm against the DualRank algorithm [XJWZ05],
because DualRank has already been shown to perform well when compared
with the HAP algorithm and the naive approach. DualRank [XJWZ05] is the
state of the art among item selection methods that use a customer-behaviour model
in the data mining literature, whereas the ProMax algorithm is based on a
customer-item-behaviour model.
Algorithm 7: Mining Profitable Items
I = items bought by the user (the current transaction)
/* Identify the cluster to which the user's transaction belongs */
for each cluster Ci do
  Cost(Ci) = w × Intra(Ci) + Inter(Ci)
end for
minid = id of the cluster with the minimum cost
/* Compute expected profits within the chosen cluster */
for each item i in cluster minid do
  f = frequency of item i in the cluster
  p = profit of item i
  Ei = f × p (expected profit)
end for
sort the items in descending order of expected profit
recommend the k items with the highest expected profit
All experiments were conducted on a 1.60 GHz Intel PC with 8 GB main memory running
Linux. Both algorithms were coded in C++. Profits of items were generated randomly,
since the way profits are assigned does not affect the nature of the results. To
perform the experiments, we took a set of four to five items representing the current customer
transaction as input and recommended the top five items to the user as output, based on
the past transaction history.
5.5.1 Performance of DualRank
We first considered the performance of DualRank on the synthetic dataset. For a
larger number of transactions, DualRank was not able to execute fast enough due to the
huge number of computations it has to perform. The performance of DualRank is shown
in Figure 5.1.
Figure 5.1: DualRank Performance
For a minimum support greater than 0.1, DualRank could not be executed. The reason
behind this is that, since there are very few frequent singleton items, there are no edges in
the item-graph created in DualRank.
We notice that the performance of DualRank deteriorates as the min-support value is
reduced. The reason is that for low minimum supports, the number of association rules
generated is very large, resulting in a very large matrix. This makes it difficult to perform
the intensive matrix operations that DualRank requires, such as the calculation of
eigenvalues and eigenvectors. DualRank's recommendations are independent of a particular
input transaction, but are globally determined by the overall profitability of individual items.
The ProMax algorithm was also evaluated under the same conditions as DualRank. The
performance of ProMax is shown in Figure 5.3. We notice that it performs better over
the entire range of min-support values that DualRank could run on. We notice that its
performance increases slightly when the min-support is very low. This is only coincidental;
there is no particular direct relationship between the min-support value and the profits
generated by ProMax, because the min-support can only affect the quality of
clustering by modifying the intra-cluster and inter-cluster costs. Both too low and too high
values of min-support could deteriorate the clustering quality. However, there is a damping
effect: when min-support is reduced, there are more small items, leading to a high
intra-cluster cost and a low inter-cluster cost. The increase in intra-cluster cost is often
compensated by the decrease in inter-cluster cost, thereby resulting in no net change in
cluster quality.
We also noticed that, keeping the min-support value constant and varying the
number of items selected, ProMax outperforms the DualRank algorithm, as shown in
Figure 5.2. DualRank always generates static recommendations, independent
of the customer's current transaction. Hence, until the database of previous transaction
history changes, DualRank always recommends the same items.
Figure 5.2: Comparisons of profits earned by the algorithms based on the number of items
selected.
5.5.2 Performance of ProMax
In this experiment, we evaluated the performance of ProMax on both the real and synthetic
datasets. The results are shown in the graph of Figure 5.3.
In this graph, the x-axis denotes the min-support, whereas the y-axis denotes the profit
generated by ProMax. We observe that the algorithm behaves in the same
manner across different datasets. For datasets which are not very sparse, the profit is
comparatively higher. This is because of the clustering approach, where the bulkiness of
the clusters increases.
Also, the clustering quality effect described in the previous experiment is clearly visible
in the real dataset curve. Notice that this curve has a peak at around min-support = 0.05,
when clustering quality is high, and deteriorates on both sides as the min-support is either
increased or decreased.

Figure 5.3: ProMax Performance on different datasets
5.6 Summary
In this chapter, we presented an algorithm that combines simple ideas from association rule
mining, clustering, recommendation systems, and the knapsack problem to recommend those
items to customers that maximize the expected profit. We tested our algorithm on two pop-
ular datasets and compared our algorithm with the previously existing DualRank algorithm.
Our experiments showed that our algorithm performs better than DualRank in terms of
obtaining better profits, while at the same time being faster and simpler.
Chapter 6
Conclusion and Future Work
This thesis has presented methods for helping people understand the development of key
research topics in terms of a generated list of keyphrases and their originating documents
in a collection of time-stamped text documents. In addition, we built an approach for
identifying a related set of keyphrases, and their corresponding originating documents, for
a given user's query. The modeling and evaluation led to the following conclusions and
ideas for future work.
6.1 Conclusions
1. We have proposed and evaluated a method for helping users understand the interactions
between keyphrases and their documents. These offline methods were based on
unsupervised models for document archives where documents accumulate over
time. We developed an approach that identifies a flat list of keyphrases to recommend
to the user, together with the first document, i.e., the originating document, of each of
these keyphrases, drawn from different data mining and database conferences. Our motive
is to help new users and researchers who cannot navigate through each and every topic
and research publication published in various conferences with respect to time-stamp (i.e.,
conference year). No one is willing to spend much time on irrelevant information. Users
who read this limited subset of articles can hopefully get the gist of the important key
ideas. Based on their interests, users can choose their own topics and read only those
papers in depth. Detailed experimental results are described in Section 3.4.
2. In the first approach, we did not consider any input from the user. We generated a list
of keyphrases and output the whole list to the users, who can later decide
their field based on their interests.

In the second approach, discussed in Chapter 4, for a given user's query we recommend
a set of top-k keyphrases based on the importance and proximity of keyphrases, together
with their corresponding originating documents. We developed our approach based on the
knapsack problem to recommend the top-k keyphrases to the user. We used the notion of
a keyphrase evolutionary graph and keyscores to identify the top keyphrases relevant to
the given query. The keyscore of each keyphrase is calculated using the impact of
keyphrases, i.e., their keyranks; we identify the k nearest neighbors of a given user query
and finally output the top-k keyphrases with the highest keyscore values, together with
their corresponding originating documents. In Section 4.3, experimental results are shown
and compared with the Google similarity distance measure.
3. In addition to the above approaches, we also explored our problem in a different domain.
We applied the knapsack-based solution discussed for task 2 (above) in a different domain,
namely recommending a set of items to customers in retail stores such that the profit of
the store is maximized. We developed an algorithm, ProMax, that combines simple
ideas from association rule mining, clustering, recommendation systems and the knapsack
problem.

To validate the performance of our algorithm, we used a real retail data set [BSVW99]
available from the Frequent Itemset Mining Implementations (FIMI) Repository
(http://fimi.cs.helsinki.fi/data) and an IBM Synthetic Dataset [BSVW04]. The experiments
discussed in Section 5.5 show that our algorithm performs better than the competing
algorithm, DualRank, in terms of obtaining more profit, while at the same time being
faster and simpler.
6.2 Future Work
The work discussed in this thesis can be extended further and hence has substantial room
for improvement. Improvements can be made in the following distinct areas.
1. We can improve the performance of the existing keyphrase extraction technique by
introducing other parameters to identify the novelty of keyphrases.
2. We have only considered a flat structure of keyphrases; it would be interesting to
explore a hierarchical structure of keyphrases, which could give us a picture of keyphrase
evolution at different resolutions.
3. To identify the originating document, we consider only the reference section, i.e.,
title, author name, conference year, etc. This work can be extended to check the
presence or absence of keyphrases inside the text of these references, which can increase
the accuracy of the prediction.
4. It would be much more helpful for users if we were able to identify the particular content
of a keyphrase inside the document, instead of requiring them to navigate through the
whole document. Users could then refer only to the content of a keyphrase for quick
reference and understanding of that keyphrase.
5. Our approach for identifying related keyphrases and recommending the top-k keyphrases
corresponding to a user's query is based on a keyscore ranking technique. Although it
performs well for a pioneering effort, it is far from perfect. We would like to improve this
approach by introducing other distance metrics and ranking techniques.
6. Our approach can be tested on larger datasets, and a system can be developed for a
large number of conferences, with user-friendly visualization and ample user interaction
options throughout, making it a very useful practical tool for studying research topics and
their document evolution. Such a system can be very useful for managing all kinds of text
stream data.
7. It would also be interesting to identify the set of papers that are responsible for
introducing a new keyphrase. Similarly, it would be interesting to design an algorithm to
cluster documents that can directly capture the splitting and merging of documents over
time and identify the main keyphrase associated with those documents, and vice versa.
8. Next, for the ProMax algorithm, we believe there are a few aspects open to improvement.
First, in the initial phase of our algorithm, the clustering can be done more efficiently
by appropriately identifying the parameters used in calculating the cost. Moreover, we
consider only the types of items and not their quantities, which can be addressed in
future work.
Publications
• “Mining Landmark Papers”, SIGAI Workshop on Emerging Research Trends in AI
(ERTAI 2010), Mumbai, India, April, 2010.
• “Extended Approach for Mining Landmark Papers from Text Corpus”, Grace Hopper
Celebration Conference India, Bangalore, India, December 7-9, 2010.
• “ProMax: A Profit Maximizing Recommendation System for Market Baskets”, SIGAI
Workshop on Emerging Research Trends in AI (ERTAI 2010), Mumbai, India, April,
2010.
Bibliography
[ACD+98] James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, and
Yiming Yang. Topic Detection and Tracking Pilot Study Final Report. In
Proceedings of the DARPA Broadcast News Transcription and Understanding
Workshop, pages 194–218, 1998.
[ALJ00] James Allan, Victor Lavrenko, and Hubert Jin. First story detection in TDT
is hard. In CIKM '00: Proceedings of the ninth international conference on
Information and knowledge management, pages 374–381, New York, NY, USA,
2000. ACM.
[All02] James Allan, editor. Topic detection and tracking: event-based information
organization. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[ANTT02] Arvind Arasu, Jasmine Novak, Andrew Tomkins, and John Tomlin. PageRank
computation and the structure of the Web: Experiments and algorithms. In
Proceedings of the Eleventh International World Wide, 2002.
[APL98] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection
and tracking. In SIGIR ’98: Proceedings of the 21st annual international ACM
SIGIR conference on Research and development in information retrieval, pages
37–45, New York, NY, USA, 1998. ACM.
[BE97] Regina Barzilay and Michael Elhadad. Using lexical chains for text summa-
rization. In Proceedings of the ACL Workshop on Intelligent Scalable Text
Summarization, pages 10–17, 1997.
[Ber02] Pavel Berkhin. Survey of clustering data mining techniques. Technical report,
2002.
[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual
web search engine. In Proceedings of the seventh international conference on
World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands,
The Netherlands, 1998. Elsevier Science Publishers B. V.
[BSVW99] Tom Brijs, Gilbert Swinnen, Koen Vanhoof, and Geert Wets. Using association
rules for product assortment decisions: a case study. In Proceedings of the
fifth ACM SIGKDD international conference on Knowledge discovery and data
mining, KDD ’99, pages 254–260, New York, NY, USA, 1999. ACM.
[BSVW04] Tom Brijs, Gilbert Swinnen, Koen Vanhoof, and Geert Wets. IBM synthetic
data generator. 2004.
[CC00] David Cohn and Huan Chang. Learning to probabilistically identify authori-
tative documents. In ICML ’00: Proceedings of the Seventeenth International
Conference on Machine Learning, pages 167–174, San Francisco, CA, USA, 2000.
Morgan Kaufmann Publishers Inc.
[CCH+07] Mu-Chen Chen, Long-Sheng Chen, Fei-Hao Hsu, Yuanjia Hsu, and Hsiao-Ying
Chou. HPRS: A profitability based recommender system. In Industrial
Engineering and Engineering Management, 2007 IEEE International Conference
on, pages 219–223, Dec. 2007.
[Con00] The Linguistic Data Consortium, editor. The Year 2000 Topic Detection and
Tracking TDT2000 Task Definition and Evaluation Plan. 2000.
[CV07] Rudi L. Cilibrasi and Paul M. B. Vitanyi. The Google similarity distance. IEEE
Trans. on Knowl. and Data Eng., 19:370–383, March 2007.
[De05] Indro De. Experiments in first story detection, 2005.
[DG06] Jesse Davis and Mark Goadrich. The relationship between precision-recall and
roc curves. In Proceedings of the 23rd international conference on Machine
learning, ICML ’06, pages 233–240, New York, NY, USA, 2006. ACM.
[DP97] Pedro Domingos and Michael Pazzani. On the optimality of the simple bayesian
classifier under zero-one loss. Mach. Learn., 29(2-3):103–130, 1997.
[Gar] E. Garfield. The impact factor.
[Gar55] Eugene Garfield. Citation indexes for science. a new dimension in documentation
through association of ideas. Science, 122:1123–1127, 1955.
[Gar72] Eugene Garfield. Citation analysis as a tool in journal evaluation: journals can be
ranked by frequency and impact of citations for science policy studies. Science,
178(4060):471–479, 1972.
[Gar03] E. Garfield. The Meaning of the Impact Factor. International Journal of
Clinical and Health Psychology, 3(2):363–369, 2003.
[GPW+99] Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe
Frank. Improving browsing in digital libraries with keyphrase indexes. Decis.
Support Syst., 27(1-2):81–104, 1999.
[Hav99] T. Haveliwala. Efficient computation of pagerank. Technical Report 1999-31,
Stanford InfoLab, 1999.
[HH76] M. A. K. Halliday and R. Hasan. Cohesion in English (English Language).
Longman Pub Group, 1976.
[HNB01] Se June Hong, Ramesh Natarajan, and Ilana Belitskaya. A new approach for
item choice recommendations. In Proceedings of the Third International Con-
ference on Data Warehousing and Knowledge Discovery, DaWaK ’01, pages
131–140, London, UK, 2001. Springer-Verlag.
[HNP05] Andreas Hotho, Andreas Nürnberger, and Gerhard Paaß. A brief survey of text
mining. LDV Forum - GLDV Journal for Computational Linguistics and Lan-
guage Technology, 2005.
[HPYM04] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns
without candidate generation: A frequent-pattern tree approach. Data Min.
Knowl. Discov., 8:53–87, January 2004.
[HS78] Ellis Horowitz and Sartaj Sahni. Fundamentals of Computer Algorithms. Com-
puter Science Press, 1978.
[HSGA09] Mohammad Al Hasan, W. Scott Spangler, Thomas Griffin, and Alfredo Alba.
Coa: finding novel patents through text analysis. In KDD ’09: Proceedings of
the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 1175–1184, New York, NY, USA, 2009. ACM.
[Jon98] Steve Jones. Link as You Type. Working Paper. Department of Computer
Science, University of Waikato, New Zealand, 1998.
[Kan03] M. Kantardzic, editor. Data Mining. Wiley-Interscience, Hoboken, 2003.
[Kle99] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM,
46(5):604–632, 1999.
[Kle02] Jon Kleinberg. Bursty and hierarchical structure in streams. In KDD ’02:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 91–101, New York, NY, USA, 2002. ACM.
[LWY06] Minsuk Lee, Weiqing Wang, and Hong Yu. Exploring supervised and unsuper-
vised methods to detect topics in biomedical text. BMC Bioinformatics, 7:140,
2006.
[MFH+03] Amy Mcgovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast,
Jennifer Neville, and David Jensen. Exploiting relational structure to un-
derstand publication patterns in high-energy physics. SIGKDD Explorations,
5:2003, 2003.
[MZ05] Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns
from text: an exploration of temporal text mining. In KDD ’05: Proceedings of
the eleventh ACM SIGKDD international conference on Knowledge discovery in
data mining, pages 198–207, New York, NY, USA, 2005. ACM.
[NIS] National Institute of Standards and Technology.
[PBMW98] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank
citation ranking: Bringing order to the web. Technical report, Stanford
Digital Library Technologies Project, 1998.
[PBMW99] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank
citation ranking: Bringing order to the web, 1999.
[SJ00] R. Swan and D. Jensen. TimeMines: Constructing timelines with statistical
models of word usage, 2000.
[SM86] Gerard Salton and Michael J. McGill. Introduction to Modern Information Re-
trieval. McGraw-Hill, Inc., New York, NY, USA, 1986.
[SW99] Mark Shewhart and Mark Wasson. Monitoring a newsfeed for hot topics. In
KDD ’99: Proceedings of the fifth ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 402–404, New York, NY, USA,
1999. ACM.
[Tom02] Tom Brijs. Retail market basket analysis: A quantitative modelling, 2002.
[Tur99] Peter Turney. Learning to Extract Keyphrases from Text. National Research
Council Canada, Institute for Information Technology, 1999.
[Voo99] Ellen M. Voorhees. Natural language processing and information retrieval. In In-
formation Extraction: Towards Scalable, Adaptable Systems, pages 32–48, Lon-
don, UK, 1999. Springer-Verlag.
[WF04] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. Ism: Item selection for mar-
keting with cross-selling considerations. In PAKDD, pages 431–440, 2004.
[WFW05] Raymond Chi-Wing Wong, Ada Wai-Chee Fu, and Ke Wang. Data mining for
inventory item selection with cross-selling considerations. Data Min. Knowl.
Discov., 11:81–112, July 2005.
[WHGL09] Hei-Chia Wang, Tian-Hsiang Huang, Jiunn-Liang Guo, and Shu-Chuan Li. Jour-
nal article topic detection based on semantic features. In IEA/AIE ’09: Proceed-
ings of the 22nd International Conference on Industrial, Engineering and Other
Applications of Applied Intelligent Systems, pages 644–652, Berlin, Heidelberg,
2009. Springer-Verlag.
[Wil81] Rudolf Wille. Restructuring lattice theory: An approach based on hierarchies
of concepts. Ordered Sets, Ivan Rival Ed., NATO Advanced Study Institute,
83:445–470, September 1981.
[Wit03] Ian H. Witten. Browsing around a digital library. In SODA ’03: Proceedings
of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages
99–99, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Math-
ematics.
[WPF+99] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G.
Nevill-Manning. KEA: Practical automatic keyphrase extraction. In DL ’99:
Proceedings of the fourth ACM conference on Digital libraries, pages 254–255,
New York, NY, USA, 1999. ACM.
[WS02] Ke Wang and Ming-Yen Thomas Su. Item selection by “hub-authority” profit
ranking. In Proceedings of the eighth ACM SIGKDD international conference
on Knowledge discovery and data mining, KDD ’02, pages 652–657, New York,
NY, USA, 2002. ACM.
[wWcF03] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. MPIS: Maximal-profit item
selection with cross-selling considerations. In IEEE International Conference
on Data Mining (ICDM), 2003.
[WXL99] Ke Wang, Chu Xu, and Bing Liu. Clustering transactions using large items. In
Proceedings of the eighth international conference on Information and knowledge
management, CIKM ’99, pages 483–490, New York, NY, USA, 1999. ACM.
[WZH02] Ke Wang, Senqiang Zhou, and Jiawei Han. Profit mining: From patterns to ac-
tions. In Proceedings of the 8th International Conference on Extending Database
Technology: Advances in Database Technology, EDBT ’02, pages 70–87, London,
UK, 2002. Springer-Verlag.
[XJWZ05] Xiujuan Xu, Lifeng Jia, Zhe Wang, and Chunguang Zhou. DualRank: A dual-
phase algorithm for optimal profit mining in retailing market. In ASIAN, pages
182–192, 2005.
[Zho09] Senqiang Zhou. Profit mining. In Encyclopedia of Data Warehousing and Min-
ing, pages 1598–1602. 2009.