Cluster based fact finders
Manish Gupta, Yizhou Sun, Jiawei Han
Feb 10, 2011
Slide 2
Why perform cluster based fact finding?
Books: Goldstone Books is a highly trustworthy provider, but it is not the best for history books.
Search: Google/Yahoo/Bing are good search engines, but I would prefer Monster for jobs or 101apartments for apartments.
News: CNN, CBS, or Google news are best for news, but I prefer Slashdot or Techcrunch for technical news, ESPN or cricinfo for sports news, and Aljazeera for Middle East news.
Providers excel in their fields of focus.
Slide 3
Our Contributions
Formally define the problem of cluster based fact finding.
An algorithm that performs trust analysis and clustering of objects iteratively.
Comparison of our algorithm using different fact finders on multiple datasets, showing better accuracy and interesting clusters.
Analysis of clustering based fact finders using a synthetic dataset.
Slide 4
Related work
Yin et al. [TKDE 2008]: Truth finder
Dong et al. [PVLDB 2009]: Time varying truth, copycat detection
Pasternack et al. [COLING 2010]: Multiple fact finders and effect of priors
Sun et al. [EDBT 2009]: Alternate ranking-clustering framework (RankClus)
Gupta et al. [WWW 2011]: Trust Analysis with Clustering
Work in agent-based systems (trust of agents in each other based on past mutual interactions, etc.)
Slide 5
The iterative fact finder model
Three components of the model:
Trustworthiness of providers (sources)
Confidence (belief) of facts (claims)
Implications between facts
Slide 6
Basic Fact Finder Algorithm
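As a rough illustration of the iterative model (provider trust and fact confidence updated in turn), a minimal sketch follows. It uses plain sum-based updates and ignores implications between facts. The claim layout, the function names `basic_fact_finder` and `best_facts`, the fixed iteration count, and the rescaling by the maximum (added only to keep the numbers bounded) are assumptions made for this example, not the algorithm from the talk.

```python
from collections import defaultdict


def basic_fact_finder(claims, iterations=20):
    """claims: iterable of (provider, obj, fact) triples (hypothetical layout).

    Returns (trust, confidence): trust[provider] and confidence[(obj, fact)].
    """
    claims = set(claims)
    trust = {p: 1.0 for p, _, _ in claims}
    conf = {}
    for _ in range(iterations):
        # Confidence of a fact: sum of the trust of the providers asserting it.
        raw_conf = defaultdict(float)
        for p, o, f in claims:
            raw_conf[(o, f)] += trust[p]
        top = max(raw_conf.values())
        conf = {k: v / top for k, v in raw_conf.items()}  # rescale (assumption)
        # Trust of a provider: sum of the confidence of the facts it asserts.
        raw_trust = defaultdict(float)
        for p, o, f in claims:
            raw_trust[p] += conf[(o, f)]
        top = max(raw_trust.values())
        trust = {p: v / top for p, v in raw_trust.items()}
    return trust, conf


def best_facts(conf):
    """Pick the highest-confidence fact for each object."""
    best = {}
    for (o, f), c in conf.items():
        if o not in best or c > best[o][1]:
            best[o] = (f, c)
    return {o: f for o, (f, _) in best.items()}
```

Note that the resulting trust score is global: one number per provider, shared by all objects.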
Slide 7
Intuitive example
Slide 8
Drawbacks of basic fact finders
No object-specific trust ranking is generated; only a global trustworthiness ranking of providers is computed.
The confidence ranking of facts for an object is influenced by the trustworthiness of providers who are not so good for this object or for objects related to it.
Slide 9
Our hypothesis
Objects can be clustered based on provider trustworthiness profiles, t_o(p), personalized to the particular object.
Restricting the flow of trust information across objects, using clusters, can improve the ranking accuracy of facts and providers.
Alternating clustering and trust analysis iteratively can provide high quality trust-based clusters and can improve the accuracy of the trust ranking of providers and the confidence ranking of facts.
Slide 10
Clustering before Trust Analysis
Drawbacks:
Does not use the information about the providers related to objects in other clusters.
This method needs some input clustering. Clusters are fixed and depend on a particular dimension.
In many cases, such a clustering is not available, or the desired trustworthiness based clustering may not follow any natural clustering of the objects along just a single dimension.
Slide 11
Clustering in provider trustworthiness space
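To make the idea concrete, the sketch below clusters objects by their object-conditional provider trust vectors t_o, using a plain k-means style loop with cosine similarity. The choice of k-means, cosine similarity, the function names, and the fixed number of rounds are assumptions for this example and not necessarily the method used in the talk.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_trust_profiles(profiles, centroids, iterations=10):
    """profiles: {obj: trust vector over providers}; centroids: initial vectors.

    Returns (assignment, centroids) after a fixed number of k-means style rounds.
    """
    centroids = [list(c) for c in centroids]
    assignment = {}
    for _ in range(iterations):
        # Assign each object to the centroid with the most similar trust profile.
        assignment = {
            o: max(range(len(centroids)), key=lambda i: cosine(t, centroids[i]))
            for o, t in profiles.items()
        }
        # Recompute each centroid as the mean profile of its members.
        for i in range(len(centroids)):
            members = [profiles[o] for o, c in assignment.items() if c == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Each vector has one entry per provider, so two objects land in the same cluster when the same providers are trustworthy for both.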
Slide 12
Basic Cluster Based Fact Finder
Drawbacks: There is no trustworthiness information sharing between objects in BCFF2. Every iteration in Algorithm 3 simply re-computes the trustworthiness of providers based on implications between various facts about the same object.
Slide 13
Clustering with Trust Analysis
Slide 14
Smoothing
Three kinds of providers:
those that give correct information about each object
those that give wrong information for each of the objects
those that are correct for some objects and wrong for others
Our cluster based algorithms would intuitively work better for the third case.
If the vectors are quite close to each other, clustering is not really effective, so we smooth using the global scores.
s_C is the cluster based score and s_G is the global score. The smoothing weight is set to the average inter-cluster similarity.
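One plausible reading of this smoothing, shown as a sketch, is a convex combination of the cluster based score s_C and the global score s_G, weighted by the average inter-cluster similarity, so that nearly identical clusters fall back on the global score. The combination form, the use of cosine similarity between cluster centroids, and all function names are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_inter_cluster_similarity(centroids):
    """Mean pairwise cosine similarity between cluster centroids."""
    pairs = list(combinations(centroids, 2))
    if not pairs:
        return 1.0  # a single cluster: rely entirely on the global score (assumption)
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def smooth(s_cluster, s_global, weight):
    """Blend the two scores; a weight near 1 favors the global score (assumed form)."""
    return weight * s_global + (1.0 - weight) * s_cluster
```

With well separated clusters the weight is close to 0 and the cluster based score dominates; with overlapping clusters the weight approaches 1 and the result approaches the global score.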
Slide 15
Datasets
Books (Yin et al.): 24819 author listings for 1265 books provided by 894 online book stores.
Ground truth: obtained manually from scanned book covers.
Accuracy and implication values are computed as the match between the best author list and the golden list.
Slide 16
Datasets
Wikipedia Biography Infobox dataset (Pasternack et al)
Accuracy for date measured as
Accuracy of strings: using Edit distance (if >75% else 0)
Population dataset (Pasternack et al)
34422 Population claims by 1361 contributors about 30K cities.
Golden truth using US Census data.
Accuracy measured as
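The edit-distance rule for string attributes in the Wikipedia Biography Infobox dataset can be read as: score a predicted string by its edit-distance similarity to the ground truth, and count it only when the similarity exceeds 75%. The sketch below follows that reading. Normalizing by the length of the longer string and the function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def string_accuracy(predicted, truth, threshold=0.75):
    """Similarity in [0, 1] if it exceeds the threshold, else 0 (assumed reading)."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0  # two empty strings match exactly
    similarity = 1.0 - edit_distance(predicted, truth) / longest
    return similarity if similarity > threshold else 0.0
```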
Slide 17
Analysis of clustering profiles
Slide 18
Accuracy results
Slide 19
Synthetic dataset
60 objects, 21 providers, 3 clusters
Each object has 4-5 different facts
Providers and objects are assigned to clusters
A provider provides a fact for an object within its cluster with a probability of 0.8
For a set of dicey objects (for which the most frequent fact is the true fact), prolific providers from other clusters provide a false fact with total freq = 1 + max freq
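A generator along these lines could look like the sketch below. The sizes and the 0.8 claim probability come from the slide; everything else is an assumption made for illustration: the round-robin cluster assignment, fact 0 being the true fact, the probability `p_true` that an in-cluster claim is correct, the number of dicey objects `n_dicey`, and measuring "prolific" by the number of claims a provider has made.

```python
import random
from collections import Counter


def make_synthetic(n_objects=60, n_providers=21, n_clusters=3,
                   p_claim=0.8, p_true=0.7, n_dicey=10, seed=0):
    """Return (claims, obj_cluster, prov_cluster, dicey); fact 0 is always true.

    p_true, n_dicey and seed are assumed knobs, not values from the slides.
    """
    rng = random.Random(seed)
    obj_cluster = {o: o % n_clusters for o in range(n_objects)}
    prov_cluster = {p: p % n_clusters for p in range(n_providers)}
    n_facts = {o: rng.choice([4, 5]) for o in range(n_objects)}
    claims = {}  # (provider, object) -> fact id
    for o in range(n_objects):
        for p in range(n_providers):
            # Providers initially claim facts only for objects in their own cluster.
            if prov_cluster[p] == obj_cluster[o] and rng.random() < p_claim:
                is_true = rng.random() < p_true
                claims[(p, o)] = 0 if is_true else rng.randrange(1, n_facts[o])

    def fact_counts(o):
        return Counter(f for (_, obj), f in claims.items() if obj == o)

    # Candidate dicey objects: the true fact is (one of) the most frequent.
    candidates = []
    for o in range(n_objects):
        c = fact_counts(o)
        if c and c[0] == max(c.values()):
            candidates.append(o)
    dicey = rng.sample(candidates, min(n_dicey, len(candidates)))

    # Prolific providers from other clusters push false fact 1 to 1 + max freq.
    activity = Counter(p for p, _ in claims)
    for o in dicey:
        c = fact_counts(o)
        needed = c[0] + 1 - c[1]
        outsiders = sorted((p for p in range(n_providers)
                            if prov_cluster[p] != obj_cluster[o]),
                           key=lambda p: -activity[p])
        for p in outsiders[:needed]:
            claims[(p, o)] = 1
    return claims, obj_cluster, prov_cluster, dicey
```

For each dicey object, simple voting now picks the false fact, while the providers inside the object's own cluster still mostly support the true one.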
Slide 20
Improvement in accuracy
Parameters:
max support for the true fact of dicey objects
number of dicey objects
original strength of the providers
Gains are larger when there are more dicey objects and the best fact for them is not supported by many providers within their cluster.
Slide 21
Comparison of various cluster based fact finders
Sums performs better. Sums has no normalization of any kind and hence has the best chance of improving.
Slide 22
Conclusion
We identified the problem of cluster based fact finding.
We proposed algorithms for trust analysis using cluster based methods.
We showed using four datasets that our algorithms perform better than traditional fact finders and generate interesting clusters.
In the future, we plan to use the network information within objects to influence the clustering of objects.
Slide 23
Acknowledgements
Xiaoxin Yin for the basic code base and the books and movies datasets
Jeff Pasternack for the Wikipedia datasets
Dr. Dan Roth for interesting discussions
Vinod Vydiswaran for reviewing a preliminary version of the work
NSF (IIS-09-05215) and ARL-NSCTA (W911NF-09-2-0053) for funding
Slide 24
References
Slide 25
Thanks!
Slide 26
Variants of clustering with trust analysis
This version of ACFF gives more importance to trust analysis and tries to organize the clusters around the results of trust analysis.
Drawback: cluster conditional trust computations are used both to re-compute object conditional trust vectors and as centroids for clustering those vectors. This may bias the algorithm heavily towards changes in trust analysis.
Slide 27
Variants of clustering with trust analysis
Use the object conditional trustworthiness vectors computed initially using BCFF2 and avoid re-computing them after the cluster conditional trust analysis iterations.
Iterative trust analysis is done with the sole purpose of improving the cluster centroids. The representation of each object is kept fixed.
Intuition: cluster centroids would organize themselves as far away from each other as possible in the trust space and hence lead to distinct clusters.
Slide 28
Variants of clustering with trust analysis
Perform clustering in a secondary, richer space.
The ith element of vector V is computed as the cosine similarity between the object conditional trust vector t_o and the ith cluster conditional trust vector t_{c_i}.
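As a sketch of this secondary representation, the function below maps an object conditional trust vector to a vector V whose ith entry is its cosine similarity to the ith cluster conditional trust vector. The function names are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def secondary_representation(t_o, cluster_trust_vectors):
    """V[i] = cosine(t_o, t_ci): one entry per cluster, not one per provider."""
    return [cosine(t_o, t_c) for t_c in cluster_trust_vectors]
```

The length of V equals the number of clusters, so objects are compared by how they relate to every cluster at once rather than by raw per-provider trust.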