59
Clustering Search Results with Carrot 2 Stanislaw Osiński 1 Dawid Weiss 2 Poznan Supercomputing and Networking Center, ul. Noskowskiego 10, 61-704, Poznan, Poland [email protected] Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 3A, 60-965 Poznań, Poland [email protected] January 2007

Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Clustering Search Results with Carrot2

Stanisław Osiński1 Dawid Weiss2

Poznan Supercomputing and Networking Center,ul. Noskowskiego 10, 61-704, Poznan, Poland

[email protected]

Institute of Computing Science, Poznan University of Technology,ul. Piotrowo 3A, 60-965 Poznań, Poland

[email protected]

January 2007

Page 2: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

About the authors

Dawid Weiss

entertainment experienceMSc in Software Engineeringindustrial experiencecurrently in academiaPhD in Information Retrieval

Stanisław Osiński

MSc in Design of ITSystemsindustrial/ researchexperienceemployed in a researchinstitutePhD in progress. . .

Page 3: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

1 Introduction to Search Results Clustering

2 Carrot2 Framework

3 Lingo clustering algorithm

4 Summary

Page 4: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Ranked lists are not perfect

Page 5: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Ranked lists are not perfect

2007

-01-

22Clustering Search Results. . .

Introduction to Search Results Clustering

Ranked lists are not perfect

Search results clustering is one of many methods that can be used toimprove user experience while searching collections of textdocuments, web pages for example. To illustrate the problems withconventional ranked list presentation, let’s imagine a user wants to findweb documents about “apache”. Obviously, this is a very general query,which can lead to. . .

Page 6: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Ranked lists are not perfect

?HTTP Server

Page 7: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Ranked lists are not perfect

?HTTP Server

2007

-01-

22Clustering Search Results. . .

Introduction to Search Results Clustering

Ranked lists are not perfect

. . . large numbers of references being returned, the majority of which willbe about the Apache Web Server.

Page 8: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Ranked lists are not perfect

?HTTP Server

Indians

Helicopter

Page 9: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Ranked lists are not perfect

?HTTP Server

Indians

Helicopter

2007

-01-

22Clustering Search Results. . .

Introduction to Search Results Clustering

Ranked lists are not perfect

A more patient user, a user who is determined enough to look at resultsat rank 100, should be able to reach some scattered results about theApache Helicopter or Apache Indians. As you can see, one problem withranked lists is that sometimes users must go through manyirrelevant documents in order to get to the ones they want.

Page 10: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Search Results Clustering can help

!HTTP ServerIndiansHelicopter (other topics)

Page 11: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Search Results Clustering can help

!HTTP ServerIndiansHelicopter (other topics)

2007

-01-

22Clustering Search Results. . .

Introduction to Search Results Clustering

Search Results Clustering can help

So how about an interface that groups the search results intoseparate semantic topics, such as the Apache Web Server, ApacheIndians, Apache Helicopter and so on? With such groups, the user willimmediately get an overview of what is in the results and should be ableto navigate to the interesting documents with less effort.

This kind of interface to search results can be implemented by applying adocument clustering algorithm to the results returned by the searchengine. This is something that is commonly called Search ResultsClustering.

Page 12: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Search Results Clustering is an interesting problem

Page 13: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Search Results Clustering is an interesting problem

2007

-01-

22Clustering Search Results. . .

Introduction to Search Results Clustering

Search Results Clustering is an interesting problem

Search Results Clustering has a few interesting characteristics and one ofthem is the fact that it is based only on the fragments of documentsreturned by the search engine (document snippets). This is the onlyinput an algorithm has, full text of documents is not available.

Page 14: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Search Results Clustering is an interesting problem

Page 15: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Search Results Clustering is an interesting problem

2007

-01-

22Clustering Search Results. . .

Introduction to Search Results Clustering

Search Results Clustering is an interesting problem

Document snippets returned by search engines are usually very short andnoisy. So we can get broken sentences or useless symbols, numbers ordates in the input.

Page 16: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Search Results Clustering is an interesting problem

Semantic clustersMeaningful cluster labelsSmall input

Page 17: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Search Results Clustering is an interesting problem

Semantic clustersMeaningful cluster labelsSmall input

2007

-01-

22Clustering Search Results. . .

Introduction to Search Results Clustering

Search Results Clustering is an interesting problem

In order to be helpful for the users, search results clustering must putresults that deal with the same topic into one group. This is the primaryrequirement for all document clustering algorithms.

But in search results clustering very important are also the labels ofclusters. We must accurately and concisely describe the contents ofthe cluster, so that the user can quickly decide if the cluster isinteresting or not. This aspect of document clustering is sometimesneglected.

Finally, because the total size of input in search results clustering is small(e.g. 200 snippets), we can afford some more complex processing,which can possibly let us achieve better results.

Page 18: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Search Results Clustering is an interesting problem

. . . and that’s why we created

Page 19: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

1 Introduction to Search Results Clustering

2 Carrot2 Framework

3 Lingo clustering algorithm

4 Summary

Page 20: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 is about search results clustering

Carrot2 is a...

framework forexperimenting withprocessing and presentationof search resultsframework for buildingreal-worldproduction-qualityapplicationsBSD-licensed open sourceproject

Page 21: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 targets researchers, developers and end-users

Page 22: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 is based on processing pipelines

Page 23: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 offers ready-to-use components: input

Fetching search results from:

Google (API) Yahoo (API) MSN (API)

Open Search Lucene ODP Project

Page 24: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 offers ready-to-use components: clustering

5 search results clustering algorithms:

Lingo(Stanisław Osiński)STC(Oren Zamir, Oren Etzioni)Rough-KMeans(Ngo Chi Lang)HAOG(Karol Gołembniak, Irmina Masłowska)FuzzyAnts(Steven Schockaert)

Page 25: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 offers ready-to-use components: other

Other utilities:language tokenizers, stemmers and stop word listsvery fast matrix computations librarydesktop browser application for tuning and rapid experiments

Page 26: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,
Page 27: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,
Page 28: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

2007

-01-

22Clustering Search Results. . .

Carrot2 Framework

The desktop application allows detailed tuning of each algorithm. In thequery panel we have options for:

• input/ algorithm selection,

• number of search results to fetch,

• default algorithm configuration settings.

After a query is performed, a result tab appears on screen allowing:

• benchmarking,

• visualization,

• on-line modification of algorithm parameters, reflected in theclusters panel.

Page 29: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

http://www.carrot2.org

Page 30: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

2007

-01-

22Clustering Search Results. . .

Carrot2 Framework

The on-line demo is a playground for users, but also a demonstration ofthe technology really used by quite a number of people.

Page 31: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

http://www.google.com/analytics

Page 32: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

http://www.google.com/analytics

Page 33: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 has a number of real-world applications

Page 34: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Commercial spin-off: Carrot Search s.c.

A different, improvedclustering algorithm –Lingo3GConsulting and support forthe open source projectText-mining consultancy

Page 35: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Lingo3G introduces many improvements

hierarchical results,a number of customizationoptions,much faster and robust,better cluster labels.

Page 36: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

1 Introduction to Search Results Clustering

2 Carrot2 Framework

3 Lingo clustering algorithm

4 Summary

Page 37: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Lingo is designed specifically for search results clustering

Semantic clustersMeaningful cluster labelsSmall input

Page 38: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Lingo is designed specifically for search results clustering

Semantic clustersMeaningful cluster labelsSmall input

2007

-01-

22Clustering Search Results. . .

Lingo clustering algorithm

Lingo is designed specifically for search results clustering

The primary assumption we made when working Lingo was that it shouldbe an algorithm specifically designed to handle search results clustering.Therefore our main focus was the quality of cluster label. We also wereaware that, due to the small size of input, we could afford more complexprocessing.

Page 39: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Cluster description has a priority

Snippets

Clusters

Results

cluster

describe

Classic clustering

Page 40: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Cluster description has a priority

Snippets

Clusters

Results

cluster

describe

Classic clustering20

07-0

1-22

Clustering Search Results. . .Lingo clustering algorithm

Cluster description has a priority

Having in mind the requirement for high quality of cluster labels, weexperimented with reversing the normal clustering order and givingthe cluster description a priority.

In the classic clustering scheme, in which the algorithm starts withfinding document groups and then tries to label these groups, we canhave situations where the algorithm knows that certain documents shouldbe clustered together, but at the same time the algorithm is unable toexplain to the user what these documents have in common.

Page 41: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Cluster description has a priority

Snippets

Clusters

Results

cluster

describe

Classic clustering

Snippets

Labels

Results

find labels

add snippets

Description comesfirst clustering

Page 42: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Cluster description has a priority

Snippets

Clusters

Results

cluster

describe

Classic clustering

Snippets

Labels

Results

find labels

add snippets

Description comesfirst clustering

2007

-01-

22Clustering Search Results. . .

Lingo clustering algorithm

Cluster description has a priority

We can try to avoid these problems by starting with finding a set ofmeaningful and diverse cluster labels and then assigning documents tothese labels to form proper clusters. This kind of general clusteringprocedure we called “description comes first clustering” andimplemented in a search results clustering algorithm called LINGO.

Page 43: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Phrases are good label candidates

Apache HTTP Server

Apache HTTP

Server HTTP

Apache Server

Apache Web Server

Web Server

Apache Software Foundation

Software Foundation Apache County

Apache JunctionApache Indians

Native Americans

Apache Tomcat

Apache AntApache Cocoon

XML

Apache Incubator

Apache Geronimo

... and 300 more...

Page 44: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Phrases are good label candidates

2007

-01-

22Clustering Search Results. . .

Lingo clustering algorithm

Phrases are good label candidates

So how do we go about finding good cluster labels? One of the firstapproaches to search results clustering called Suffix Tree Clustering wouldgroup documents according to the common phrase they shared.Frequent phrases are very often collocations (such as Web Server orApache County), which increases their descriptive power. But how do weselect the best and most diverse set of cluster labels? We’ve got quite alot of label candidates. . .

Page 45: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Approximate matrix factorizations can find labels

term 1term 2term 3term 4term 5term 6

doc

1

doc

2

∗do

c3

∗do

c4

∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗

= A

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

Page 46: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Approximate matrix factorizations can find labels

term 1term 2term 3term 4term 5term 6

doc

1

doc

2

doc

3

doc

4

∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗

= A

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

2007

-01-

22Clustering Search Results. . .

Lingo clustering algorithm

Approximate matrix factorizations can find labels

We can do that using Vector Space Model and matrix factorizations.To build the Vector Space Model we need to create a so calledterm-document matrix: a matrix containing frequencies of all termsacross all input documents. If we had just two terms – term X and Y –we could visualise the Vector Space Model as a plane with two axescorresponding to the terms and points on that plane corresponding to theactual documents.

Page 47: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Approximate matrix factorizations can find labels

A =

∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗

'

∗ ∗∗ ∗∗ ∗∗ ∗∗ ∗∗ ∗

×[∗ ∗ ∗ ∗∗ ∗ ∗ ∗

]

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

base vectors

coefficients

Page 48: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Approximate matrix factorizations can find labels

A =

∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗

'

∗ ∗∗ ∗∗ ∗∗ ∗∗ ∗∗ ∗

×[∗ ∗ ∗ ∗∗ ∗ ∗ ∗

]

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

2007

-01-

22Clustering Search Results. . .

Lingo clustering algorithm

Approximate matrix factorizations can find labels

The task of an approximate matrix factorization is to break a matrixinto a product of usually two matrices in such a way that the product isas close to the original matrix as possible and has much lower rank. Theleft-hand matrix of the product can be thought of as a set of base vectorsof the new low-dimensional space, while the other matrix contains thecorresponding coefficients that enable us to reconstruct the originalmatrix.

In the context of our simplified graphical example, base vectors show thegeneral directions or trends in the input collection.

Page 49: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Approximate matrix factorizations can find labels

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

Page 50: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Approximate matrix factorizations can find labels

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

2007

-01-

22Clustering Search Results. . .

Lingo clustering algorithm

Approximate matrix factorizations can find labels

Please notice that both frequent phrases and base vectors are expressedin the same space as the input documents (think of the phrases as tinydocuments). With this assumption we can use e.g. cosine distance tofind the best matching phrase for each base vector. In this way, eachbase vector will lead to selecting one cluster label.

Page 51: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Cosine distance can find documents

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

Page 52: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Cosine distance can find documents

- label candidate

Term X

Term

Y

- a document

Documents in terms' space

Term X

Term

Y

- base vectors

Visual representation of base vectors(„concepts”) acquired by SVD decomposition of the term-document matrix

Term X

Term

Y

Cluster label candidates expressed inthe vector space of the documents

Term X

Term

Y

Choosing a cluster's label

Term X

Term

Y

Cluster content discovery

- assigned

- unassigned

2007

-01-

22Clustering Search Results. . .

Lingo clustering algorithm

Cosine distance can find documents

To form proper clusters, we can again use cosine similarity and assign toeach label those documents whose similarity to that label is larger thansome threshold.

Page 53: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Giving priority to labels pays off

Lingo STC Rough K-Means

Page 54: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

Giving priority to labels pays off

Lingo STC Rough K-Means20

07-0

1-22

Clustering Search Results. . .Lingo clustering algorithm

Giving priority to labels pays off

Here we show how 100 search results obtained from Yahoo! for the query“tiger” were clustered by Lingo (with SVD decomposition), STC andRough K-Means algorithms. As you can see Lingo did not manage toavoid generating useless labels (such as “sign” or “use”), but it tohighlight some tiger-related topics that the remaining algorithms did notdiscover (helicopter, java, security tool).

Page 55: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

1 Introduction to Search Results Clustering

2 Carrot2 Framework

3 Lingo clustering algorithm

4 Summary

Page 56: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Summary

Exploit the potential of existing ontologies?Investigate support for more languages.Investigate more data sources.

Page 57: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

References

Osinski, S. and Weiss, D. (2005). A Concept-DrivenAlgorithm for Clustering Search Results. IEEE IntelligentSystems, 20(3):48–54Osiński, S., Stefanowski, J., and Weiss, D. (2004). Lingo:Search Results Clustering Algorithm Based on Singular ValueDecomposition. In Proceedings of the International IntelligentInformation Processing and Web Mining Conference,Zakopane, Poland, Advances in Soft Computing, pages359–368. SpringerWeiss, D. (2006). Descriptive Clustering as a Method forExploring Text Collections. PhD thesis, Poznan University ofTechnology, Poznań, Poland

Page 58: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,

SRC Carrot2 Lingo Summary

Carrot2 links

On-line demo:http://www.carrot2.org

Open source project:http://project.carrot2.org

SourceForge (repository etc.):http://sourceforge.net/projects/carrot2

Carrot Search:http://www.carrot-search.com

Page 59: Clustering Search Results with Carrot2project.carrot2.org/publications/carrot2-dresden-2007.pdf · 2019-09-23 · separate semantic topics, such as the Apache Web Server, Apache Indians,