Clustering Search Results with Carrot2
Stanisław Osiński (1), Dawid Weiss (2)
(1) Poznan Supercomputing and Networking Center, ul. Noskowskiego 10, 61-704, Poznan, Poland
(2) Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 3A, 60-965, Poznań, Poland
22 January 2007
About the authors
Dawid Weiss
• entertainment experience
• MSc in Software Engineering
• industrial experience
• currently in academia
• PhD in Information Retrieval
Stanisław Osiński
• MSc in Design of IT Systems
• industrial/research experience
• employed in a research institute
• PhD in progress. . .
1 Introduction to Search Results Clustering
2 Carrot2 Framework
3 Lingo clustering algorithm
4 Summary
Introduction to Search Results Clustering

Ranked lists are not perfect
Search results clustering is one of many methods that can be used to improve user experience while searching collections of text documents, web pages for example. To illustrate the problems with conventional ranked list presentation, let’s imagine a user wants to find web documents about “apache”. Obviously, this is a very general query, which can lead to. . .
. . . large numbers of references being returned, the majority of which will be about the Apache Web Server.
A more patient user, a user who is determined enough to look at results at rank 100, should be able to reach some scattered results about the Apache Helicopter or Apache Indians. As you can see, one problem with ranked lists is that sometimes users must go through many irrelevant documents in order to get to the ones they want.
Search Results Clustering can help
So how about an interface that groups the search results into separate semantic topics, such as the Apache Web Server, Apache Indians, Apache Helicopter and so on? With such groups, the user will immediately get an overview of what is in the results and should be able to navigate to the interesting documents with less effort.

This kind of interface to search results can be implemented by applying a document clustering algorithm to the results returned by the search engine. This is commonly called Search Results Clustering.
Search Results Clustering is an interesting problem
Search Results Clustering has a few interesting characteristics, and one of them is the fact that it is based only on the fragments of documents returned by the search engine (document snippets). This is the only input an algorithm has; the full text of the documents is not available.
Document snippets returned by search engines are usually very short and noisy, so we can get broken sentences or useless symbols, numbers or dates in the input.
• semantic clusters
• meaningful cluster labels
• small input
In order to be helpful for the users, search results clustering must put results that deal with the same topic into one group. This is the primary requirement for all document clustering algorithms.

But in search results clustering, the labels of clusters are also very important. We must accurately and concisely describe the contents of the cluster, so that the user can quickly decide if the cluster is interesting or not. This aspect of document clustering is sometimes neglected.

Finally, because the total size of the input in search results clustering is small (e.g. 200 snippets), we can afford some more complex processing, which can possibly let us achieve better results.
. . . and that’s why we created Carrot2.
Carrot2 Framework
Carrot2 is about search results clustering
Carrot2 is a. . .
• framework for experimenting with processing and presentation of search results
• framework for building real-world, production-quality applications
• BSD-licensed open source project
Carrot2 targets researchers, developers and end-users
Carrot2 is based on processing pipelines
Carrot2 offers ready-to-use components: input
Fetching search results from:
• Google (API)
• Yahoo (API)
• MSN (API)
• Open Search
• Lucene
• ODP Project
Carrot2 offers ready-to-use components: clustering
5 search results clustering algorithms:
• Lingo (Stanisław Osiński)
• STC (Oren Zamir, Oren Etzioni)
• Rough-KMeans (Ngo Chi Lang)
• HAOG (Karol Gołembniak, Irmina Masłowska)
• FuzzyAnts (Steven Schockaert)
Carrot2 offers ready-to-use components: other
Other utilities:
• language tokenizers, stemmers and stop word lists
• very fast matrix computations library
• desktop browser application for tuning and rapid experiments
The desktop application allows detailed tuning of each algorithm. In the query panel we have options for:
• input/algorithm selection,
• number of search results to fetch,
• default algorithm configuration settings.
After a query is performed, a result tab appears on screen allowing:
• benchmarking,
• visualization,
• on-line modification of algorithm parameters, reflected in the clusters panel.
The on-line demo is a playground for users, but also a demonstration of the technology, which is really used by quite a number of people.
Carrot2 has a number of real-world applications
Commercial spin-off: Carrot Search s.c.
• A different, improved clustering algorithm – Lingo3G
• Consulting and support for the open source project
• Text-mining consultancy
Lingo3G introduces many improvements
• hierarchical results,
• a number of customization options,
• much faster and more robust,
• better cluster labels.
Lingo clustering algorithm
Lingo is designed specifically for search results clustering

• semantic clusters
• meaningful cluster labels
• small input
The primary assumption we made when working on Lingo was that it should be an algorithm specifically designed to handle search results clustering. Therefore our main focus was the quality of cluster labels. We were also aware that, due to the small size of the input, we could afford more complex processing.
Cluster description has a priority

[Diagram: classic clustering: snippets → cluster → clusters → describe → results]
Having in mind the requirement for high quality of cluster labels, we experimented with reversing the normal clustering order and giving the cluster description a priority.

In the classic clustering scheme, in which the algorithm starts with finding document groups and then tries to label these groups, we can have situations where the algorithm knows that certain documents should be clustered together, but at the same time is unable to explain to the user what these documents have in common.
[Diagram: classic clustering (snippets → cluster → clusters → describe → results) versus description-comes-first clustering (snippets → find labels → labels → add snippets → results)]
We can try to avoid these problems by starting with finding a set of meaningful and diverse cluster labels and then assigning documents to these labels to form proper clusters. We called this general clustering procedure “description comes first clustering” and implemented it in a search results clustering algorithm called LINGO.
Phrases are good label candidates

Apache HTTP Server, Apache HTTP, Server HTTP, Apache Server, Apache Web Server, Web Server, Apache Software Foundation, Software Foundation, Apache County, Apache Junction, Apache Indians, Native Americans, Apache Tomcat, Apache Ant, Apache Cocoon, XML, Apache Incubator, Apache Geronimo, . . . and 300 more. . .
So how do we go about finding good cluster labels? One of the first approaches to search results clustering, called Suffix Tree Clustering, would group documents according to the common phrases they shared. Frequent phrases are very often collocations (such as Web Server or Apache County), which increases their descriptive power. But how do we select the best and most diverse set of cluster labels? We’ve got quite a lot of label candidates. . .
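The idea of collecting frequent phrases as label candidates can be sketched in a few lines of Python. This is a hypothetical simplification, not Carrot2’s actual implementation (which uses more efficient suffix-based structures and handles stop words and stemming); the function and variable names are ours:

```python
from collections import Counter

def phrase_candidates(snippets, max_len=3, min_count=2):
    """Count word n-grams (up to max_len words) across all snippets.
    N-grams that recur often enough become cluster label candidates."""
    counts = Counter()
    for text in snippets:
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # keep only phrases frequent enough to be meaningful
    return {p: c for p, c in counts.items() if c >= min_count}

snippets = [
    "Apache HTTP Server is the most popular web server",
    "download the Apache HTTP Server",
    "Apache helicopter photos",
]
candidates = phrase_candidates(snippets)
```

On these three toy snippets the recurring phrase “apache http server” survives the frequency cut-off, which is exactly the kind of collocation we want as a label candidate.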
Approximate matrix factorizations can find labels

[Figure: the term-document matrix A, with terms as rows and documents as columns; documents and label candidates visualised as points in the space of terms X and Y]
We can do that using the Vector Space Model and matrix factorizations. To build the Vector Space Model we need to create a so-called term-document matrix: a matrix containing the frequencies of all terms across all input documents. If we had just two terms – term X and term Y – we could visualise the Vector Space Model as a plane with two axes corresponding to the terms and points on that plane corresponding to the actual documents.
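Building a term-document matrix can be sketched as follows. This is a minimal illustration under our own naming (a real implementation would also apply stop-word filtering, stemming and tf-idf weighting):

```python
def term_document_matrix(snippets):
    """Rows correspond to terms, columns to documents; each cell
    holds the frequency of a term in a document."""
    docs = [s.lower().split() for s in snippets]
    terms = sorted({w for d in docs for w in d})
    matrix = [[d.count(t) for d in docs] for t in terms]
    return terms, matrix

# 'apache' occurs in both documents, 'helicopter' only in the second
terms, A = term_document_matrix(["apache http server", "apache helicopter"])
```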
[Figure: approximate factorization A ≈ U × V, where U contains base vectors of the low-dimensional space and V the corresponding coefficients; base vectors (“concepts”) obtained by SVD of the term-document matrix, visualised in term space]
The task of an approximate matrix factorization is to break a matrix into a product of, usually, two matrices in such a way that the product is as close to the original matrix as possible and has a much lower rank. The left-hand matrix of the product can be thought of as a set of base vectors of the new low-dimensional space, while the other matrix contains the corresponding coefficients that enable us to reconstruct the original matrix.

In the context of our simplified graphical example, base vectors show the general directions or trends in the input collection.
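As a toy illustration of the idea, the single dominant base vector of a term-document matrix can be found with plain power iteration on A·Aᵀ. This is only a conceptual sketch, not the factorization Lingo actually uses (SVD or NMF produces many base vectors at once, with their coefficients):

```python
import math

def dominant_direction(A, iters=100):
    """Power iteration on A*A^T: returns (approximately) the dominant
    left singular vector of the term-document matrix A, i.e. the
    strongest 'direction' or trend among the documents."""
    m, n = len(A), len(A[0])
    v = [1.0] * m
    for _ in range(iters):
        # w = A * (A^T * v)
        atv = [sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
        w = [sum(A[i][j] * atv[j] for j in range(n)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# term X dominates this tiny collection, so the base vector points along X
base = dominant_direction([[2.0, 0.0, 1.0],
                           [0.0, 1.0, 0.0]])
```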
[Figure: cluster label candidates and SVD base vectors expressed in the same vector space; for each base vector, the closest label candidate is chosen as a cluster label]
Please notice that both frequent phrases and base vectors are expressed in the same space as the input documents (think of the phrases as tiny documents). With this assumption we can use, for example, cosine distance to find the best matching phrase for each base vector. In this way, each base vector will lead to selecting one cluster label.
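Assuming the phrases have been turned into vectors in the same term space, matching a base vector to its closest phrase might look like this (a minimal sketch; the names are ours, not Lingo’s API):

```python
import math

def cosine(a, b):
    """Cosine similarity of two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_label(base_vector, phrase_vectors):
    """Pick the phrase whose term vector lies closest (by cosine
    similarity) to the given base vector."""
    return max(phrase_vectors,
               key=lambda p: cosine(base_vector, phrase_vectors[p]))

# a base vector pointing along term X picks the phrase about term X
label = best_label([0.9, 0.1], {"term x phrase": [1.0, 0.0],
                                "term y phrase": [0.0, 1.0]})
```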
Cosine distance can find documents

[Figure: cluster content discovery: documents similar enough to a cluster label are assigned to its cluster; the remaining documents stay unassigned]
To form proper clusters, we can again use cosine similarity and assign to each label those documents whose similarity to that label is larger than some threshold.
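A minimal sketch of this assignment step, assuming documents and labels are vectors in the same term space; the “Other Topics” group for unmatched documents and the 0.5 threshold are illustrative choices, not Lingo’s exact parameters:

```python
import math

def cosine(a, b):
    """Cosine similarity of two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_documents(label_vectors, doc_vectors, threshold=0.5):
    """Assign each document to every label it is similar enough to;
    documents matching no label fall into an 'Other Topics' group."""
    clusters = {label: [] for label in label_vectors}
    clusters["Other Topics"] = []
    for doc_id, dv in doc_vectors.items():
        matched = False
        for label, lv in label_vectors.items():
            if cosine(dv, lv) >= threshold:
                clusters[label].append(doc_id)
                matched = True
        if not matched:
            clusters["Other Topics"].append(doc_id)
    return clusters

clusters = assign_documents({"http server": [1.0, 0.0]},
                            {"doc0": [0.9, 0.1], "doc1": [0.0, 1.0]})
```

Note that a document may land in several clusters, which is natural for search results: one snippet can genuinely belong to more than one topic.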
Giving priority to labels pays off

[Figure: clusters generated by Lingo, STC and Rough K-Means for the query “tiger”]
Here we show how 100 search results obtained from Yahoo! for the query “tiger” were clustered by the Lingo (with SVD decomposition), STC and Rough K-Means algorithms. As you can see, Lingo did not manage to avoid generating useless labels (such as “sign” or “use”), but it did manage to highlight some tiger-related topics that the remaining algorithms did not discover (helicopter, Java, security tool).
Summary
• Exploit the potential of existing ontologies?
• Investigate support for more languages.
• Investigate more data sources.
References

Osiński, S. and Weiss, D. (2005). A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, 20(3):48–54.

Osiński, S., Stefanowski, J., and Weiss, D. (2004). Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition. In Proceedings of the International Intelligent Information Processing and Web Mining Conference, Zakopane, Poland, Advances in Soft Computing, pages 359–368. Springer.

Weiss, D. (2006). Descriptive Clustering as a Method for Exploring Text Collections. PhD thesis, Poznan University of Technology, Poznań, Poland.
Carrot2 links
On-line demo: http://www.carrot2.org
Open source project: http://project.carrot2.org
SourceForge (repository etc.): http://sourceforge.net/projects/carrot2
Carrot Search: http://www.carrot-search.com