Clustering Search Results with Carrot2
Stanisław Osiński (1), Dawid Weiss (2)
(1) Poznan Supercomputing and Networking Center, ul. Noskowskiego 10, 61-704, Poznan, Poland
(2) Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 3A, 60-965, Poznań, Poland
22 January 2007
About the authors
Dawid Weiss
• entertainment experience
• MSc in Software Engineering
• industrial experience
• currently in academia
• PhD in Information Retrieval
Stanisław Osiński
• MSc in Design of IT Systems
• industrial/research experience
• employed in a research institute
• PhD in progress. . .
1 Introduction to Search Results Clustering
2 Carrot2 Framework
3 Lingo clustering algorithm
4 Summary
Introduction to Search Results Clustering

Ranked lists are not perfect
Search results clustering is one of many methods that can be used to improve user experience while searching collections of text documents, web pages for example. To illustrate the problems with conventional ranked list presentation, let’s imagine a user wants to find web documents about “apache”. Obviously, this is a very general query, which can lead to. . .
. . . large numbers of references being returned, the majority of which will be about the Apache Web Server.
A more patient user, a user who is determined enough to look at results at rank 100, should be able to reach some scattered results about the Apache Helicopter or Apache Indians. As you can see, one problem with ranked lists is that sometimes users must go through many irrelevant documents in order to get to the ones they want.
Search Results Clustering can help
So how about an interface that groups the search results into separate semantic topics, such as the Apache Web Server, Apache Indians, Apache Helicopter and so on? With such groups, the user will immediately get an overview of what is in the results and should be able to navigate to the interesting documents with less effort.

This kind of interface to search results can be implemented by applying a document clustering algorithm to the results returned by the search engine. This is commonly called Search Results Clustering.
Search Results Clustering is an interesting problem
Search Results Clustering has a few interesting characteristics, and one of them is the fact that it is based only on the fragments of documents returned by the search engine (document snippets). This is the only input an algorithm has; the full text of the documents is not available.
Document snippets returned by search engines are usually very short and noisy, so we can get broken sentences or useless symbols, numbers or dates in the input.
• semantic clusters
• meaningful cluster labels
• small input
In order to be helpful for the users, search results clustering must put results that deal with the same topic into one group. This is the primary requirement for all document clustering algorithms.

But in search results clustering, the labels of clusters are also very important. We must accurately and concisely describe the contents of the cluster, so that the user can quickly decide if the cluster is interesting or not. This aspect of document clustering is sometimes neglected.

Finally, because the total size of the input in search results clustering is small (e.g. 200 snippets), we can afford some more complex processing, which can possibly let us achieve better results.
. . . and that’s why we created Carrot2.
Carrot2 Framework
Carrot2 is about search results clustering
Carrot2 is a. . .
• framework for experimenting with processing and presentation of search results
• framework for building real-world, production-quality applications
• BSD-licensed open source project
Carrot2 targets researchers, developers and end-users
Carrot2 is based on processing pipelines
Carrot2 offers ready-to-use components: input
Fetching search results from:
• Google (API)
• Yahoo (API)
• MSN (API)
• Open Search
• Lucene
• ODP Project
Carrot2 offers ready-to-use components: clustering
5 search results clustering algorithms:
• Lingo (Stanisław Osiński)
• STC (Oren Zamir, Oren Etzioni)
• Rough-KMeans (Ngo Chi Lang)
• HAOG (Karol Gołembniak, Irmina Masłowska)
• FuzzyAnts (Steven Schockaert)
Carrot2 offers ready-to-use components: other
Other utilities:
• language tokenizers, stemmers and stop word lists
• very fast matrix computations library
• desktop browser application for tuning and rapid experiments
The desktop application allows detailed tuning of each algorithm. In the query panel we have options for:
• input/algorithm selection,
• number of search results to fetch,
• default algorithm configuration settings.
After a query is performed, a result tab appears on screen allowing:
• benchmarking,
• visualization,
• on-line modification of algorithm parameters, reflected in the clusters panel.
The on-line demo is a playground for users, but also a demonstration of the technology, which is really used by quite a number of people.
Carrot2 has a number of real-world applications
Commercial spin-off: Carrot Search s.c.
• A different, improved clustering algorithm – Lingo3G
• Consulting and support for the open source project
• Text-mining consultancy
Lingo3G introduces many improvements
• hierarchical results,
• a number of customization options,
• much faster and more robust,
• better cluster labels.
Lingo clustering algorithm
Lingo is designed specifically for search results clustering

• semantic clusters
• meaningful cluster labels
• small input
The primary assumption we made when working on Lingo was that it should be an algorithm specifically designed to handle search results clustering. Therefore our main focus was the quality of cluster labels. We were also aware that, due to the small size of the input, we could afford more complex processing.
Cluster description has a priority

[Diagram: classic clustering: snippets → cluster → clusters → describe → results]
Having in mind the requirement for high quality of cluster labels, we experimented with reversing the normal clustering order and giving the cluster description a priority.

In the classic clustering scheme, in which the algorithm starts with finding document groups and then tries to label these groups, we can have situations where the algorithm knows that certain documents should be clustered together, but at the same time is unable to explain to the user what these documents have in common.
[Diagram: classic clustering (snippets → cluster → clusters → describe → results) versus description-comes-first clustering (snippets → find labels → labels → add snippets → results)]
We can try to avoid these problems by starting with finding a set of meaningful and diverse cluster labels and then assigning documents to these labels to form proper clusters. We called this general clustering procedure “description comes first clustering” and implemented it in a search results clustering algorithm called LINGO.
Phrases are good label candidates

Apache HTTP Server, Apache HTTP, Server HTTP, Apache Server, Apache Web Server, Web Server, Apache Software Foundation, Software Foundation, Apache County, Apache Junction, Apache Indians, Native Americans, Apache Tomcat, Apache Ant, Apache Cocoon, XML, Apache Incubator, Apache Geronimo, . . . and 300 more. . .
So how do we go about finding good cluster labels? One of the first approaches to search results clustering, called Suffix Tree Clustering, would group documents according to the common phrases they shared. Frequent phrases are very often collocations (such as Web Server or Apache County), which increases their descriptive power. But how do we select the best and most diverse set of cluster labels? We’ve got quite a lot of label candidates. . .
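The idea of collecting frequent phrases as label candidates can be sketched in a few lines of Python. This is a hypothetical simplification, not Carrot2’s actual implementation (which uses more efficient suffix-based structures and handles stop words and stemming); the function and variable names are ours:

```python
from collections import Counter

def phrase_candidates(snippets, max_len=3, min_count=2):
    """Count word n-grams (up to max_len words) across all snippets.
    N-grams that recur often enough become cluster label candidates."""
    counts = Counter()
    for text in snippets:
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # keep only phrases frequent enough to be meaningful
    return {p: c for p, c in counts.items() if c >= min_count}

snippets = [
    "Apache HTTP Server is the most popular web server",
    "download the Apache HTTP Server",
    "Apache helicopter photos",
]
candidates = phrase_candidates(snippets)
```

On these three toy snippets the recurring phrase “apache http server” survives the frequency cut-off, which is exactly the kind of collocation we want as a label candidate.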
Approximate matrix factorizations can find labels

[Figure: the term-document matrix A, with terms as rows and documents as columns; documents and label candidates visualised as points in the space of terms X and Y]
We can do that using the Vector Space Model and matrix factorizations. To build the Vector Space Model we need to create a so-called term-document matrix: a matrix containing the frequencies of all terms across all input documents. If we had just two terms – term X and term Y – we could visualise the Vector Space Model as a plane with two axes corresponding to the terms and points on that plane corresponding to the actual documents.
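Building a term-document matrix can be sketched as follows. This is a minimal illustration under our own naming (a real implementation would also apply stop-word filtering, stemming and tf-idf weighting):

```python
def term_document_matrix(snippets):
    """Rows correspond to terms, columns to documents; each cell
    holds the frequency of a term in a document."""
    docs = [s.lower().split() for s in snippets]
    terms = sorted({w for d in docs for w in d})
    matrix = [[d.count(t) for d in docs] for t in terms]
    return terms, matrix

# 'apache' occurs in both documents, 'helicopter' only in the second
terms, A = term_document_matrix(["apache http server", "apache helicopter"])
```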
[Figure: approximate factorization A ≈ U × V, where U contains base vectors of the low-dimensional space and V the corresponding coefficients; base vectors (“concepts”) obtained by SVD of the term-document matrix, visualised in term space]
The task of an approximate matrix factorization is to break a matrix into a product of, usually, two matrices in such a way that the product is as close to the original matrix as possible and has a much lower rank. The left-hand matrix of the product can be thought of as a set of base vectors of the new low-dimensional space, while the other matrix contains the corresponding coefficients that enable us to reconstruct the original matrix.

In the context of our simplified graphical example, base vectors show the general directions or trends in the input collection.
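As a toy illustration of the idea, the single dominant base vector of a term-document matrix can be found with plain power iteration on A·Aᵀ. This is only a conceptual sketch, not the factorization Lingo actually uses (SVD or NMF produces many base vectors at once, with their coefficients):

```python
import math

def dominant_direction(A, iters=100):
    """Power iteration on A*A^T: returns (approximately) the dominant
    left singular vector of the term-document matrix A, i.e. the
    strongest 'direction' or trend among the documents."""
    m, n = len(A), len(A[0])
    v = [1.0] * m
    for _ in range(iters):
        # w = A * (A^T * v)
        atv = [sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
        w = [sum(A[i][j] * atv[j] for j in range(n)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# term X dominates this tiny collection, so the base vector points along X
base = dominant_direction([[2.0, 0.0, 1.0],
                           [0.0, 1.0, 0.0]])
```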
[Figure: cluster label candidates and SVD base vectors expressed in the same vector space; for each base vector, the closest label candidate is chosen as a cluster label]
Please notice that both frequent phrases and base vectors are expressed in the same space as the input documents (think of the phrases as tiny documents). With this assumption we can use, for example, cosine distance to find the best matching phrase for each base vector. In this way, each base vector will lead to selecting one cluster label.
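Assuming the phrases have been turned into vectors in the same term space, matching a base vector to its closest phrase might look like this (a minimal sketch; the names are ours, not Lingo’s API):

```python
import math

def cosine(a, b):
    """Cosine similarity of two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_label(base_vector, phrase_vectors):
    """Pick the phrase whose term vector lies closest (by cosine
    similarity) to the given base vector."""
    return max(phrase_vectors,
               key=lambda p: cosine(base_vector, phrase_vectors[p]))

# a base vector pointing along term X picks the phrase about term X
label = best_label([0.9, 0.1], {"term x phrase": [1.0, 0.0],
                                "term y phrase": [0.0, 1.0]})
```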
Cosine distance can find documents

[Figure: cluster content discovery: documents similar enough to a cluster label are assigned to its cluster; the remaining documents stay unassigned]
To form proper clusters, we can again use cosine similarity and assign to each label those documents whose similarity to that label is larger than some threshold.
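A minimal sketch of this assignment step, assuming documents and labels are vectors in the same term space; the “Other Topics” group for unmatched documents and the 0.5 threshold are illustrative choices, not Lingo’s exact parameters:

```python
import math

def cosine(a, b):
    """Cosine similarity of two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_documents(label_vectors, doc_vectors, threshold=0.5):
    """Assign each document to every label it is similar enough to;
    documents matching no label fall into an 'Other Topics' group."""
    clusters = {label: [] for label in label_vectors}
    clusters["Other Topics"] = []
    for doc_id, dv in doc_vectors.items():
        matched = False
        for label, lv in label_vectors.items():
            if cosine(dv, lv) >= threshold:
                clusters[label].append(doc_id)
                matched = True
        if not matched:
            clusters["Other Topics"].append(doc_id)
    return clusters

clusters = assign_documents({"http server": [1.0, 0.0]},
                            {"doc0": [0.9, 0.1], "doc1": [0.0, 1.0]})
```

Note that a document may land in several clusters, which is natural for search results: one snippet can genuinely belong to more than one topic.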
Giving priority to labels pays off

[Figure: clusters generated by Lingo, STC and Rough K-Means for the query “tiger”]
Here we show how 100 search results obtained from Yahoo! for the query “tiger” were clustered by the Lingo (with SVD decomposition), STC and Rough K-Means algorithms. As you can see, Lingo did not manage to avoid generating useless labels (such as “sign” or “use”), but it did manage to highlight some tiger-related topics that the remaining algorithms did not discover (helicopter, Java, security tool).
Summary
• Exploit the potential of existing ontologies?
• Investigate support for more languages.
• Investigate more data sources.
References

Osiński, S. and Weiss, D. (2005). A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, 20(3):48–54.

Osiński, S., Stefanowski, J., and Weiss, D. (2004). Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition. In Proceedings of the International Intelligent Information Processing and Web Mining Conference, Zakopane, Poland, Advances in Soft Computing, pages 359–368. Springer.

Weiss, D. (2006). Descriptive Clustering as a Method for Exploring Text Collections. PhD thesis, Poznan University of Technology, Poznań, Poland.
Carrot2 links
On-line demo: http://www.carrot2.org
Open source project: http://project.carrot2.org
SourceForge (repository etc.): http://sourceforge.net/projects/carrot2
Carrot Search: http://www.carrot-search.com