Link Distribution on Wikipedia [0422] KwangHee Park

Page 1: Link Distribution on Wikipedia

Link Distribution on Wikipedia

[0422] KwangHee Park

Page 2: Link Distribution on Wikipedia

Table of contents
  • Introduction
  • Similarity between documents
  • Error cases
  • Modifying the word bag
  • Conclusion

Page 3: Link Distribution on Wikipedia

Introduction: why focus on links

When someone creates a new Wikipedia article, they usually start by simply linking to a source in another language or to similar and related articles. After that, the article is written up by other contributors.

Assumption: the link terms in a Wikipedia article are the key terms that represent the specific characteristics of that article.

Page 4: Link Distribution on Wikipedia

Introduction: the problem we want to solve

To analyze the latent distribution of a set of target documents by topic modeling.

Page 5: Link Distribution on Wikipedia

Topic modeling: our approach
  • Target: document = Wikipedia article; terms = linked terms in the document
  • Modeling method: LDA
  • Modeling tool: LingPipe API
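To make the approach above concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, where each document is simply its list of link terms. It does not use the LingPipe API named on the slide; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the iteration count are illustrative assumptions, not the settings used in this work.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (hypothetical helper, not LingPipe).

    docs: list of documents, each a list of link terms.
    Returns one smoothed topic distribution (list of floats) per document.
    """
    rng = random.Random(seed)
    vocab_size = len({term for doc in docs for term in doc})
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic k
    topic_term = [defaultdict(int) for _ in range(num_topics)]   # count of term w in topic k
    topic_total = [0] * num_topics                               # all tokens in topic k
    assign = []

    # Random initial topic assignment for every link-term occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for term in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_term[k][term] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_term[k][term] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_term[j][term] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][term] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    return [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
```

Calling `lda_gibbs([["Cancer", "Tumor"], ["Seoul", "Busan"]], num_topics=2)` returns two topic distributions, one per article, each summing to 1.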

Page 6: Link Distribution on Wikipedia

Advantages of linked terms
  • No extra preprocessing is needed: boundary detection, stopword removal, word stemming
  • They carry more semantics: each term is tied to a document. Ex) "cancer" as a term links to "Cancer" as a document
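As an illustration of why linked terms need little preprocessing, the sketch below pulls link targets straight out of wiki markup with one regular expression. The function name `link_terms`, the skipped namespace prefixes, and the sample text are assumptions made for this example; the slides do not say how the links were actually extracted.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; group 1 is the target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Assumed list of non-article namespaces to ignore.
SKIPPED_PREFIXES = ("File:", "Image:", "Category:")


def link_terms(wikitext):
    """Return a Counter of link targets found in the wiki markup."""
    terms = Counter()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if not target or target.startswith(SKIPPED_PREFIXES):
            continue
        # Wikipedia treats the first letter of a title as case-insensitive.
        terms[target[0].upper() + target[1:]] += 1
    return terms


sample = "[[Cancer]] is a [[disease|group of diseases]]; see [[Tumor#Types]]."
print(link_terms(sample))
```

No tokenizer, stopword list, or stemmer is involved: the link boundary is the term boundary, and the link target already names a document.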

Page 7: Link Distribution on Wikipedia

Preliminary problem
  • How well do the link terms in a document represent the specific characteristics of that document?

Link evaluation
  • Calculate the similarity between documents

Page 8: Link Distribution on Wikipedia

Link evaluation: similarity-based evaluation
  • Calculate the similarity between documents: Sim_d{doc1, doc2}
  • Calculate the similarity between terms: Sim_t{term1, term2}
  • Compare the two similarities

Page 9: Link Distribution on Wikipedia

Similarity between documents (Sim_d)
  • The similarity between documents is significantly affected by the input term set
  • Data set: 1536 documents (disease domain: 208; settlement domain: 1328)
  • p, q = topic distributions of the two documents, compared with the Kullback-Leibler divergence
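For reference, the Kullback-Leibler divergence between two topic distributions p and q is KL(p || q) = sum_k p_k * log(p_k / q_k). The sketch below computes it in plain Python. The `eps` floor for zero entries and the symmetrized variant are assumptions added for this example, since the slides do not say how zeros or the asymmetry of KL were handled.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    Entries of q are floored at eps (an assumed smoothing choice) so that
    a zero in q does not cause a division by zero.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of topics")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); an assumed symmetrization."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


doc1 = [0.5, 0.5]
doc2 = [0.9, 0.1]
print(kl_divergence(doc1, doc2))  # about 0.511
print(symmetric_kl(doc1, doc2))
```

A smaller divergence means the two documents have more similar topic distributions; a value of 0 means they are identical.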

Page 10: Link Distribution on Wikipedia

Example: a reasonable result

Page 11: Link Distribution on Wikipedia

Example: a poor result

Page 12: Link Distribution on Wikipedia

Error analysis: length problem (overestimated topic proportions)
  • If a document contains only a few link terms, the topic proportions of that document tend to be overestimated
  • Ex) 1950년 (1950), 1960년 (1960), 파푸아 뉴기니 (Papua New Guinea), 식인풍습 (cannibalism)
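One way to see the length problem is through the smoothed estimate of a document's topic proportions, theta_k = (n_k + alpha) / (N + K * alpha), where n_k is the number of link terms assigned to topic k, N is the document length, and K is the number of topics. The sketch below is only an illustration under assumed counts and an assumed `alpha`; it is not taken from the experiments in these slides.

```python
def topic_proportions(counts, alpha=0.1):
    """Smoothed topic proportions from per-topic link-term counts."""
    total = sum(counts)
    num_topics = len(counts)
    return [(n + alpha) / (total + num_topics * alpha) for n in counts]


# A document with only four link terms, all assigned to the same topic.
short_doc = topic_proportions([4, 0, 0, 0, 0])
# A longer document whose 100 link terms are spread over five topics.
long_doc = topic_proportions([40, 20, 20, 10, 10])

print(max(short_doc))  # about 0.91
print(max(long_doc))   # about 0.40
```

With so few terms, a handful of assignments decides almost the whole distribution, so the short document looks far more strongly tied to one topic than the evidence supports.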

Page 13: Link Distribution on Wikipedia

Error analysis
  • Some documents' link terms do not describe the document itself. Ex) dates, countries, etc.

Page 14: Link Distribution on Wikipedia

Demo website
  • Disease domain: http://semanticweb.kaist.ac.kr/research/tmodel/
  • Settlement domain: http://semanticweb.kaist.ac.kr/research/tmodel/sindex.php
  • Disease + settlement domain: http://semanticweb.kaist.ac.kr/research/tmodel/dsindex.php

Page 15: Link Distribution on Wikipedia

Modify word bag
  • Include non-link terms
  • Exclude noise terms
  • Use weighted scores for duplicated terms
  • Include incoming links
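The four modifications above could be combined roughly as follows. This is a sketch under stated assumptions: the function name `modified_word_bag`, its parameters, and the `1 + log(count)` weighting for duplicated terms are hypothetical choices for illustration, since the slides do not specify the weighting scheme or the noise list.

```python
import math
from collections import Counter


def modified_word_bag(outgoing_links, incoming_links=(), non_link_terms=(),
                      noise_terms=frozenset()):
    """Build a weighted word bag for one article.

    outgoing_links: link terms found in the article (duplicates allowed).
    incoming_links: titles of articles that link to this article.
    non_link_terms: extra plain-text terms to include.
    noise_terms:    terms to drop, e.g. dates or country names.
    """
    counts = Counter()
    for term in (*outgoing_links, *incoming_links, *non_link_terms):
        if term not in noise_terms:
            counts[term] += 1
    # Assumed weighting: repeated terms count more, but sublinearly.
    return {term: 1.0 + math.log(n) for term, n in counts.items()}


bag = modified_word_bag(
    ["Cancer", "Cancer", "2010", "Tumor"],
    incoming_links=["Oncology"],
    non_link_terms=["chemotherapy"],
    noise_terms={"2010"},
)
print(bag)
```

Dropping the date term addresses the noise problem from the error analysis, while the incoming links and non-link terms lengthen short documents.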

Page 16: Link Distribution on Wikipedia

Conclusion
  • Topic modeling with the link distribution in Wikipedia
  • We need to measure how well the link distribution represents each article's characteristics
  • After that, analyze the topic distribution in a variety of ways
  • We expect the topic distribution to be useful in many applications

Page 17: Link Distribution on Wikipedia

Thank you