
A Distributed Ranking Algorithm for the iTrust Information Search and Retrieval System

Presented by Boyang Peng

Research conducted in collaboration with Y. T. Chuang, I. Michel Lombera, L. E. Moser, and P. M. Melliar-Smith

Supported in part by NSF Grant CNS 10-16193

Overview

1) Introduction
2) Overview of iTrust
3) Distributed Ranking System
4) Trustworthiness
5) Evaluation
6) Conclusion and Future Work

WEBIST 2013 iTrust Boyang Peng 2

Introduction

• What is iTrust?


Purpose

1. Distribution of metadata (by the source of information)

2. Distribution of requests (by the requester of information)

3. Request encounters metadata

4. Request matched
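Steps 1 to 4 can be illustrated with a small simulation. This is a minimal sketch, not the iTrust implementation: it assumes that metadata and requests are each sent to a uniformly random subset of nodes (the slides do not say how recipients are chosen), and the names `distribute` and `find_matches`, the network size, and the message counts are all hypothetical.

```python
import random

def distribute(node_ids, count, rng):
    """Pick `count` distinct nodes to receive a message (assumed: uniform random choice)."""
    return set(rng.sample(node_ids, count))

def find_matches(metadata_holders, request_receivers):
    """A match occurs at any node that holds the metadata and also receives the request."""
    return metadata_holders & request_receivers

rng = random.Random(42)        # fixed seed so the run is repeatable
nodes = list(range(100))       # hypothetical network of 100 nodes

# 1. The source distributes metadata to some nodes.
holders = distribute(nodes, 20, rng)
# 2. The requester distributes its request to some nodes.
receivers = distribute(nodes, 20, rng)
# 3-4. The request encounters metadata wherever the two sets overlap.
matches = find_matches(holders, receivers)
print(f"{len(matches)} node(s) matched the request")
```

Under this assumption, raising either message count makes an overlap, and therefore a match, more likely.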

Distributed Ranking System

• Why is ranking needed in iTrust?
  - Centralized search engines have this functionality
  - Filters out trivial and irrelevant files
  - Increases both the fidelity and the quality of the results

• How is the ranking done?
  - What metrics the ranking algorithm uses and what the ranking formula is
  - What information the ranking algorithm needs and how to retrieve that information


• Indexing is performed at the source nodes
  - Generate a term-frequency table for an uploaded document

• Ranking is performed at the requesting node
  - Ensures fidelity of results
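The indexing step at a source node can be sketched as follows. This is an illustration under stated assumptions: the tokenization rule (lowercase, runs of letters and digits) and the name `build_freq_table` are hypothetical, and the slides do not say whether iTrust applies stemming or stop-word removal.

```python
import re
from collections import Counter

def build_freq_table(text):
    """Return a term -> count mapping for one document (assumed tokenization)."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(terms))

freq_table = build_freq_table("Trust the network; the network ranks documents.")
print(freq_table)
```

The requesting node would then receive tables of this shape rather than the documents themselves.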

5. Retrieve term-frequency table: the requester of information sends Request FreqTable(d) to the source of information

Ranking Algorithm

where

• norm(d) is the normalization factor for document d, computed as:  

• number_of_common_terms for a document d is |s∩c|, where s is the set of all terms in the freqTable(d) and c is the set of common terms

• number_of_uncommon_terms for a document d is |freqTable(d)| - number_of_common_terms
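The two counts defined above translate directly into set operations. A minimal sketch: the function names are hypothetical, and the set of common terms used here is a made-up example, since the slide does not list the actual common terms.

```python
def count_common_terms(freq_table, common_terms):
    """|s ∩ c|: terms of freqTable(d) that are also in the set of common terms."""
    return len(set(freq_table) & set(common_terms))

def count_uncommon_terms(freq_table, common_terms):
    """|freqTable(d)| - number_of_common_terms."""
    return len(freq_table) - count_common_terms(freq_table, common_terms)

freq_table = {"the": 12, "of": 7, "ranking": 4, "itrust": 3}
common = {"the", "of", "and"}  # hypothetical set of common terms

print(count_common_terms(freq_table, common))    # 2
print(count_uncommon_terms(freq_table, common))  # 2
```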


• tf(t,d) is the term-frequency factor for term t in document d, computed as:

• freq(t,d) is the frequency of occurrence of term t in freqTable(d)

• avg(freq(d)) is the average frequency of the terms contained in freqTable(d)
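The two inputs to tf(t,d) can be computed as below. The tf formula itself is not reproduced here; `relative_frequency` is only an assumed ratio of the two inputs, used to show that multiplying every frequency in a table by the same constant leaves such a ratio unchanged. Function names are hypothetical.

```python
def average_frequency(freq_table):
    """avg(freq(d)): mean of the frequencies stored in freqTable(d)."""
    return sum(freq_table.values()) / len(freq_table)

def relative_frequency(term, freq_table):
    """freq(t,d) divided by avg(freq(d)); an ASSUMED ratio, not the slide's tf formula."""
    return freq_table.get(term, 0) / average_frequency(freq_table)

table = {"ranking": 6, "itrust": 3, "node": 3}
inflated = {term: 10 * count for term, count in table.items()}

print(average_frequency(table))                 # 4.0
print(relative_frequency("ranking", table))     # 1.5
print(relative_frequency("ranking", inflated))  # 1.5
```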


• idf(t) is the inverse document frequency factor for term t, computed as:

• numDocs is the total number of documents being ranked

• docFreq(t) is the number of documents that contain the term t
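A sketch of how these quantities could be combined at the requesting node. The definitions of numDocs and docFreq(t) follow the text above; everything else is an assumption made for illustration: the idf form log(1 + numDocs / docFreq(t)), the tf form freq(t,d) / avg(freq(d)), and a score that sums tf × idf over the query terms with norm(d) omitted. None of these should be read as the formulas used in iTrust.

```python
import math

def doc_freq(term, freq_tables):
    """docFreq(t): number of documents whose table contains the term."""
    return sum(1 for table in freq_tables.values() if term in table)

def idf(term, freq_tables):
    """ASSUMED form: log(1 + numDocs / docFreq(t)); 0 if no document has the term."""
    df = doc_freq(term, freq_tables)
    return math.log(1 + len(freq_tables) / df) if df else 0.0

def tf(term, table):
    """ASSUMED form: freq(t,d) / avg(freq(d))."""
    return table.get(term, 0) / (sum(table.values()) / len(table))

def score(query_terms, table, freq_tables):
    """ASSUMED combination: sum of tf * idf over the query terms (norm(d) omitted)."""
    return sum(tf(t, table) * idf(t, freq_tables) for t in query_terms)

freq_tables = {
    "doc_a": {"ranking": 6, "node": 2},
    "doc_b": {"node": 5, "trust": 5},
    "doc_c": {"search": 4, "trust": 4},
}
query = ["ranking", "node"]
ranked = sorted(freq_tables,
                key=lambda d: score(query, freq_tables[d], freq_tables),
                reverse=True)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```

Note that the requesting node needs only the term-frequency tables of the documents being ranked, which matches the definition of numDocs above.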

Trustworthiness

• Potential scammers
  - Falsifying information
    · Attack: distribute a term-frequency table containing every single word in the language
    · Countermeasure: set a limit on the size of the term-frequency table of a document
  - Exaggerating information
    · A malicious node can exaggerate the information about a document to achieve a higher ranking
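The size limit can be enforced with a simple check at the requesting node. A minimal sketch: the limit value of 200 and the names `MAX_TABLE_SIZE` and `accept_freq_table` are assumptions, since the slide says only that a limit is set.

```python
MAX_TABLE_SIZE = 200  # ASSUMED limit; the slide does not give a value

def accept_freq_table(freq_table, max_size=MAX_TABLE_SIZE):
    """Return True if the table is small enough to be used for ranking."""
    return len(freq_table) <= max_size

honest = {f"term{i}": 1 for i in range(150)}
stuffed = {f"word{i}": 1 for i in range(50_000)}  # stands in for "every word in the language"

print(accept_freq_table(honest))   # True
print(accept_freq_table(stuffed))  # False
```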

Figure: The percentage of the time that a document is ranked last as a function of the number of keywords in the query. The size of the term-frequency table for all documents is 200.

Figure: The mean score over 1000 rankings of a document as a function of the number of keywords in the query. The lines Document x 5 and Document x 10 correspond to the frequencies in the term-frequency table of a document multiplied by a factor of 5 and 10, respectively.

Evaluation

• Because iTrust is a distributed and probabilistic system, for reproducibility of results, we evaluate the effectiveness of the ranking system by simulation, separate from the iTrust system implementation

• As the number of keywords in the query increases, the accuracy of the results increases
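A toy harness shows the kind of measurement such a simulation makes. This is a sketch under stated assumptions, not the authors' simulator: the two documents are invented, `overlap_score` is a stand-in scoring function, and queries are drawn at random from the target document's own terms.

```python
import random

def overlap_score(query, table):
    """Stand-in scoring function (ASSUMED): total frequency of the query terms."""
    return sum(table.get(term, 0) for term in query)

def fraction_ranked_first(target, freq_tables, num_keywords, trials, rng):
    """Fraction of trials in which `target` alone has the top score.

    Each trial draws `num_keywords` distinct query terms from the target's table.
    """
    vocabulary = sorted(freq_tables[target])
    wins = 0
    for _ in range(trials):
        query = rng.sample(vocabulary, num_keywords)
        scores = {doc: overlap_score(query, table)
                  for doc, table in freq_tables.items()}
        best = max(scores.values())
        if scores[target] == best and list(scores.values()).count(best) == 1:
            wins += 1
    return wins / trials

toy_tables = {  # invented documents that share two terms
    "d1": {"a": 1, "b": 1, "c": 1, "d": 1},
    "d2": {"a": 1, "b": 1, "x": 1, "y": 1},
}
rng = random.Random(7)  # fixed seed so the run is repeatable
for k in (1, 2, 3):
    print(k, fraction_ranked_first("d1", toy_tables, k, 1000, rng))
```

With three keywords, at least one of them must be unique to d1, so d1 always ranks first on its own; with fewer keywords the query may consist only of shared terms and produce a tie.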

Document Similarity

Figure: The mean percentage of the time, over 1000 rankings, that a set of documents (Document Set 1 at the left and Document Set 2 at the right) is ranked in the top four, as a function of the number of keywords in the query.

Ranking Stability

How big should the term-frequency tables be?

Conclusion

• We have presented a distributed ranking system for iTrust that:
  - Is effective in ranking documents by their relevance to the queries that the user has input
  - Exhibits stability in ranking documents
  - Counters scamming by malicious nodes

Related Work

• Danzig, P. B., Ahn, J., Noll, J., & Obraczka, K. (1991, September). Distributed indexing: a scalable mechanism for distributed information retrieval. In Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 220-229). ACM.

• Kalogeraki, V., Gunopulos, D., & Zeinalipour-Yazti, D. (2002). A local search mechanism for peer-to-peer networks. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 300-307).

• Callan, J. P., Lu, Z., & Croft, W. B. (1995, July). Searching distributed collections with inference networks. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 21-28). ACM.

• Xu, J., & Croft, W. B. (1999, August). Cluster-based language models for distributed retrieval. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 254-261). ACM.

Future Work

• Ranking that also takes into account the reputation of the source node or the document or both

• Evaluating the distributed ranking algorithm on a larger and more varied set of documents

• Additional schemes to prevent malicious nodes from gaining an unfair advantage

Questions? Comments?

• Our iTrust Website: http://itrust.ece.ucsb.edu

• Contact information: Boyang Peng, [email protected]

• Personal website: http://www.cs.ucsb.edu/~jpeng/

• Our project is supported by NSF: CNS 10-16193
