Cluster based fact finders Manish Gupta, Yizhou Sun, Jiawei Han Feb 10, 2011

Why perform cluster based fact finding?
Books: Goldstone Books is a highly trustworthy provider, but it is not the best for history books.
Search: Google, Yahoo and Bing are good search engines, but I would prefer Monster for jobs or 101apartments for apartments.
News: CNN, CBS or Google News are best for news, but I prefer Slashdot or TechCrunch for technical news, ESPN or Cricinfo for sports news, and Al Jazeera for Middle East news.
Providers excel in their fields of focus.

Our Contributions
Formally define the problem of cluster based fact finding.
An algorithm that performs trust analysis and clustering of objects iteratively.
Comparison of our algorithm using different fact finders on multiple datasets, showing better accuracy and interesting clusters.
Analysis of clustering based fact finders using a synthetic dataset.

Related work
Yin et al. [TKDE 2008]: Truth finder
Dong et al. [PVLDB 2009]: Time-varying truth, copycat detection
Pasternack et al. [COLING 2010]: Multiple fact finders and the effect of priors
Sun et al. [EDBT 2009]: Alternating ranking-clustering framework (RankClus)
Gupta et al. [WWW 2011]: Trust analysis with clustering
Work in agent-based systems (trust of agents in each other based on past mutual interactions, etc.)

The iterative fact finder model
Three components of the model:
Trustworthiness of providers (sources)
Confidence (belief) of facts (claims)
Implications between facts

Basic Fact Finder Algorithm
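A minimal sketch of a basic iterative fact finder, assuming a Sums-style update (the variant named later in this deck): the confidence of a fact is the sum of the trust of the providers asserting it, and the trust of a provider is the sum of the confidence of its facts. The function name `sums_fact_finder`, the `(provider, fact)` input format, and the rescaling by the maximum (used here only to keep the numbers bounded) are assumptions for illustration. Implications between facts are left out.

```python
from collections import defaultdict


def sums_fact_finder(claims, iterations=20):
    """claims: iterable of (provider, fact) pairs (hypothetical input format).

    Returns (trust, confidence) dictionaries after a fixed number of rounds.
    """
    facts_of = defaultdict(set)      # provider -> facts it asserts
    providers_of = defaultdict(set)  # fact -> providers asserting it
    for provider, fact in claims:
        facts_of[provider].add(fact)
        providers_of[fact].add(provider)

    trust = {p: 1.0 for p in facts_of}
    confidence = {f: 1.0 for f in providers_of}
    for _ in range(iterations):
        # Confidence of a fact: sum of the trust of its providers.
        confidence = {f: sum(trust[p] for p in ps) for f, ps in providers_of.items()}
        top = max(confidence.values())
        confidence = {f: c / top for f, c in confidence.items()}
        # Trust of a provider: sum of the confidence of its facts.
        trust = {p: sum(confidence[f] for f in fs) for p, fs in facts_of.items()}
        top = max(trust.values())
        trust = {p: t / top for p, t in trust.items()}
    return trust, confidence
```

Here a fact can be any hashable value, for example an `(object, value)` tuple, so that competing facts about the same object can be compared by confidence afterwards.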

Intuitive example

Drawbacks of basic fact finders
No object-specific trust ranking is generated; only a global trustworthiness ranking of providers is computed.
The confidence ranking of facts for an object is influenced by the trustworthiness of providers who are not so good for this object or for objects related to it.

Our hypothesis
Objects can be clustered based on provider trustworthiness profiles, t_o(p), personalized to the particular object.
Restricting the flow of trust information across objects, using clusters, can improve the ranking accuracy of facts and providers.
Iteratively alternating clustering and trust analysis can provide high quality trust-based clusters and can improve the accuracy of the trust ranking of providers and the confidence ranking of facts.

Clustering before Trust Analysis
Drawbacks:
Does not use the information about the providers related to objects in other clusters.
Needs some input clustering; the clusters are fixed and depend on a particular dimension.
In many cases such a clustering is not available, or the desired trustworthiness based clustering may not follow any natural clustering of the objects along just a single dimension.

Clustering in provider trustworthiness space
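As a rough illustration of clustering in provider trustworthiness space, the sketch below treats each object as a vector of per-provider trust values and groups objects with a k-means-style loop under cosine similarity. The choice of k-means, the names `cosine` and `cluster_objects`, and the caller-supplied initial centroids are assumptions made for this example, not the deck's stated procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_objects(trust_vectors, initial_centroids, iterations=10):
    """trust_vectors: {object: [trust of provider 0, provider 1, ...] for that object}.

    Returns {object: cluster index}.
    """
    centroids = [list(c) for c in initial_centroids]  # do not mutate the caller's lists
    assignment = {}
    for _ in range(iterations):
        # Assign every object to the most similar centroid.
        assignment = {
            obj: max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
            for obj, vec in trust_vectors.items()
        }
        # Move each centroid to the mean of its members (empty clusters stay put).
        for i in range(len(centroids)):
            members = [trust_vectors[o] for o, c in assignment.items() if c == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

With three providers, objects whose trust mass sits on the first provider end up in a different cluster from objects whose trust mass sits on the other two.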

Basic Cluster Based Fact Finder
Drawbacks:
There is no sharing of trustworthiness information between objects in BCFF2.
Every iteration in Algorithm 3 simply recomputes the trustworthiness of providers based on implications between the various facts about the same object.

Clustering with Trust Analysis

Smoothing
Three kinds of providers:
those that provide correct information about each object
those that provide wrong information for each of the objects
those that are correct for some objects and wrong for others
Our cluster based algorithms would intuitively work better for the third case.
If the cluster trust vectors are quite close to each other, clustering is not really effective, hence we smooth using the global scores. s_C is the cluster based score and s_G is the global score. The smoothing weight is set to the average inter-cluster similarity.
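The smoothing formula itself did not survive in the slide text, so the following is only one plausible reading: a linear blend s = (1 - w) * s_C + w * s_G, where the weight w is the average pairwise cosine similarity between the cluster score vectors. The blend form, the direction of the weight, and the names `cosine` and `smooth` are all assumptions for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def smooth(cluster_scores, global_scores):
    """cluster_scores: one provider-score vector (s_C) per cluster.
    global_scores: the global provider-score vector (s_G).

    Returns (smoothed cluster vectors, weight). The weight is the average
    inter-cluster similarity: similar clusters lean on the global scores.
    """
    pairs = list(combinations(cluster_scores, 2))
    weight = sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
    smoothed = [
        [(1 - weight) * c + weight * g for c, g in zip(vec, global_scores)]
        for vec in cluster_scores
    ]
    return smoothed, weight
```

Under this reading, fully distinct clusters (similarity 0) keep their own scores, while near-identical clusters fall back almost entirely to the global scores.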

Datasets
Books (Yin et al.)
24819 author listings for 1265 books, provided by 894 online book stores.
Ground truth: collected manually from scanned book covers.
Accuracy and implication values are computed as the match between the best author list and the golden list.

Datasets
Wikipedia Biography Infobox dataset (Pasternack et al.)
Accuracy for dates measured as [formula not preserved]
Accuracy of strings: using edit distance (if >75%, else 0)
Population dataset (Pasternack et al.)
34422 population claims by 1361 contributors about 30K cities.
Golden truth from US Census data.
Accuracy measured as [formula not preserved]
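The string accuracy rule above can be sketched as follows, assuming that the 75% threshold applies to a normalized similarity of 1 minus the Levenshtein distance divided by the longer string's length. That normalization, the strict comparison, and the function names are assumptions, since the exact formula is not preserved here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = cur
    return prev[-1]


def string_accuracy(predicted, truth, threshold=0.75):
    """Similarity in [0, 1] if it exceeds the threshold, else 0 (assumed rule)."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0  # two empty strings match exactly
    similarity = 1.0 - edit_distance(predicted, truth) / longest
    return similarity if similarity > threshold else 0.0
```

A near miss such as "Jon Smith" against "John Smith" keeps most of its credit, while a value that differs in more than a quarter of its characters scores zero.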

Analysis of clustering profiles

Accuracy results

Synthetic dataset
60 objects, 21 providers, 3 clusters.
Each object has 4-5 different facts.
Providers and objects are assigned to clusters.
A provider provides a fact for an object within its cluster with a probability of 0.8.
For a set of dicey objects (for which the most frequent fact is the true fact), prolific providers from other clusters provide a false fact with total frequency = 1 + max frequency.
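A sketch of how the base part of such a dataset could be generated is shown below. Only the sizes and the 0.8 claim probability come from the slide; the round-robin cluster assignment, the `accuracy` parameter that decides whether a provider states the true fact, and all names are hypothetical. The injection of false facts for dicey objects is not modeled.

```python
import random


def make_synthetic(num_objects=60, num_providers=21, num_clusters=3,
                   claim_probability=0.8, accuracy=0.7, seed=0):
    """Generate in-cluster claims only; `accuracy` is an assumed knob."""
    rng = random.Random(seed)
    # Round-robin assignment of objects and providers to clusters (assumption).
    object_cluster = {f"o{i}": i % num_clusters for i in range(num_objects)}
    provider_cluster = {f"p{j}": j % num_clusters for j in range(num_providers)}
    # Each object gets 4 or 5 candidate facts; the first one is treated as true.
    facts = {o: [f"{o}_f{k}" for k in range(rng.choice([4, 5]))]
             for o in object_cluster}
    claims = []
    for p, pc in provider_cluster.items():
        for o, oc in object_cluster.items():
            # A provider makes a claim only about objects in its own cluster.
            if pc == oc and rng.random() < claim_probability:
                true_fact, *false_facts = facts[o]
                fact = true_fact if rng.random() < accuracy else rng.choice(false_facts)
                claims.append((p, o, fact))
    return object_cluster, provider_cluster, facts, claims
```

Fixing `seed` keeps a generated dataset reproducible across runs, which helps when comparing fact finders on the same claims.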

Improvement in accuracy
Parameters:
max support for the true fact of dicey objects
number of dicey objects
original strength of the providers
Gains are larger when there are more dicey objects and the best fact for them is not supported by many providers within their cluster.

Comparison of various cluster based fact finders
Sums performs better. Sums has no kind of normalization and hence has the best chances of improving.

Conclusion
We identified the problem of cluster based fact finding.
We proposed algorithms for trust analysis using cluster based methods.
We showed using four datasets that our algorithms perform better than traditional fact finders and generate interesting clusters.
In the future, we plan to use the network information within objects to influence the clustering of objects.

Acknowledgements
Xiaoxin Yin for the basic code base and the books and movies datasets
Jeff Pasternack for the Wikipedia datasets
Dr. Dan Roth for interesting discussions
Vinod Vydiswaran for reviewing a preliminary version of the work
NSF (IIS-09-05215) and ARL-NSCTA (W911NF-09-2-0053) for funding

References

Thanks!

Variants of clustering with trust analysis
This version of ACFF gives more importance to trust analysis and tries to organize the clusters around the results of trust analysis.
Drawback: cluster conditional trust computations are used both to recompute the object conditional trust vectors and as centroids for clustering those vectors. This may bias the algorithm heavily towards changes in trust analysis.

Variants of clustering with trust analysis
Use the object conditional trustworthiness vectors computed initially using BCFF2 and avoid recomputing them after the cluster conditional trust analysis iterations.
Iterative trust analysis is done with the sole purpose of improving the cluster centroids. The representation of each of the objects is kept fixed.
Intuition: cluster centroids would organize themselves as far away from each other as possible in the trust space and hence lead to distinct clusters.

Variants of clustering with trust analysis
Perform clustering in a secondary, richer space.
The i-th element of the vector V is computed as the cosine similarity between the object conditional trust vector t_o and the i-th cluster conditional trust vector t_ci.
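The secondary representation described above can be sketched directly: each object is mapped to a vector with one cosine similarity per cluster. The function names and the plain-list vector format are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def secondary_vector(object_trust, cluster_trusts):
    """V[i] = cosine similarity between the object conditional trust vector
    and the i-th cluster conditional trust vector."""
    return [cosine(object_trust, cluster_trust) for cluster_trust in cluster_trusts]
```

The resulting vector has one entry per cluster regardless of how many providers there are, so objects are then clustered in a space whose dimension is the number of clusters.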
