
Set Expansion on Named Entities

Page 1: Set Expansion on Named Entities

SET EXPANSION ON NAMED ENTITIES

GROUP# 38

IRE PROJECT# 3

Mentor: AISHWARYA RAJARAM

ANKIT CHOUDHARY(201206570)

LOVLEAN ARORA(201305590)

SAKSHAM MAHESHWARI(201130184)

AMAN JAIN(201101132)

What is Set Expansion?

● In simple terms, set expansion means determining the set to which the given named entities belong

● For example, Sachin Tendulkar and Rahul Dravid both belong to the set of Indian cricket team players

● Definition:

Set expansion refers to expanding a given partial set of objects into a more complete set. In set expansion, the user issues a query consisting of a small number of seeds x1, x2, ..., xk (we assume at least three valid seeds are given), where each xi is a member of some target set S. The answer to the query is a listing of other probable elements of S.
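
A minimal sketch of this definition in Python, assuming a tiny hand-made collection of candidate sets. The `KNOWN_SETS` data and the `expand_seeds` function are hypothetical and only illustrate the idea; they are not the project's code. The sketch picks the set that contains the most seeds and lists its remaining members.

```python
# Toy illustration only: KNOWN_SETS is made-up data, not the project's index.
KNOWN_SETS = {
    "Indian cricketers": {"Sachin Tendulkar", "Rahul Dravid",
                          "Sourav Ganguly", "VVS Laxman"},
    "Diseases": {"Malaria", "Cholera", "Typhoid"},
}


def expand_seeds(seeds, known_sets=KNOWN_SETS):
    """Return (set name, other members) for the set sharing the most seeds."""
    seeds = set(seeds)
    # Pick the candidate set with the largest overlap with the seeds.
    best_name = max(known_sets, key=lambda name: len(known_sets[name] & seeds))
    if not known_sets[best_name] & seeds:
        return None, []  # no known set contains any seed
    # "Other probable elements": members of the set that are not seeds.
    return best_name, sorted(known_sets[best_name] - seeds)


name, others = expand_seeds(["Sachin Tendulkar", "Rahul Dravid"])
print(name, others)
```

A real system has no ready-made table of sets; it has to infer the target set from evidence such as shared categories, which is what the indexing and searching stages are for.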

Why Set Expansion?

● With the rapid growth in data and service providers, users' needs have shifted from detailed queries to simple ones

● Users now want the desired results quickly, using only a few words as the query

● If a user wants a list of Indian cricket players, they can simply pass Sachin Tendulkar and Rahul Dravid as input and get back a list of cricketers from the set expansion technique

● Ex:

– Input : {Sachin Tendulkar, Rahul Dravid}

– Output : {Sourav Ganguly, VVS Laxman, Sachin Tendulkar, Rahul Dravid}

Related Work

● Our system works on Wikipedia data, and Wikipedia currently has no such feature

● Wikipedia provides only title-based search

● Google Sets:

– Google Sets has been used for a number of purposes in the research community, including deriving features for named-entity recognition and evaluating question answering systems.

– Shortcomings: Google Sets is a proprietary method that may be changed at any time

● SEAL (Set Expander for Any Language): Exploits the semi-structured nature of web pages to find the seeds and the wrappers around them. The wrappers are then used to find other related entities

● Other systems, such as Boo!Wa!, which uses Web wrapper technologies to extract and rank entities iteratively, are also in this race

Approach

● Our entire work is divided into two parts:

1. Indexing

2. Searching

● In addition, we apply external tools such as a POS (Part of Speech) tagger to the final retrieved document names, to refine our results and restrict them to named entities

Indexing

● For index preparation, we apply tokenization, stop word removal, stemming, and diacritics normalization

● We focus on the following fields provided by the Wiki data to get our results

– Titles, Categories, Infobox, Body Text, External References (listed in decreasing order of weight)

– We build primary and secondary indexes on them
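
The preprocessing and field weighting above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's indexer. The numeric `FIELD_WEIGHTS`, the `STOP_WORDS` list, and the crude `stem` function are assumptions (the slides give only the order of the field weights and do not name a stemmer), and the primary/secondary index layout is not modelled.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed values: the slides give only the ORDER of the field weights,
# not the numbers, and do not list the stop words that were used.
FIELD_WEIGHTS = {"title": 5.0, "category": 4.0, "infobox": 3.0,
                 "body": 2.0, "external": 1.0}
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the"}


def strip_diacritics(text):
    """Fold accented characters to their base letters (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Very crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Normalize diacritics, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", strip_diacritics(text).lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Map term -> {doc_id: field-weighted term frequency}."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in analyze(text):
                index[term][doc_id] += FIELD_WEIGHTS[field]
    return index


# Made-up miniature documents, only to exercise the functions.
docs = {
    "Lagaan": {"title": "Lagaan", "category": "Indian films",
               "body": "A film about cricket"},
    "Pelé": {"title": "Pelé", "category": "Brazilian footballers",
             "body": "Pelé played football"},
}
index = build_index(docs)
print(dict(index["film"]))
print(dict(index["pele"]))
```

Because every posting carries a field-weighted score, a term that appears in a title or category counts for more than the same term in the body text.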

Searching

● Document Fetcher

– Retrieves the top 10 relevant documents for each seed

● Attribute Classifier

– Crawls each document by Category, Infobox/Taxobox, and introductory text

● Ranker

– Ranks the set of documents corresponding to the attributes, with the highest weight given to Category, followed by Infobox and then Text

● POS Tagger

– Retrieving only the named entities that belong to the set thus obtained
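
The attribute-based ranking step can be sketched roughly as below. Everything here is an assumption made for illustration: the numeric `ATTRIBUTE_WEIGHTS` (the slides state only that Category outweighs Infobox, which outweighs Text), the hand-made `DOCS` attribute data, and the `rank_candidates` function. The document fetcher and the POS tagger filter are left out.

```python
from collections import defaultdict

# Assumed weights: the slides only say Category > Infobox > Text.
ATTRIBUTE_WEIGHTS = {"category": 3.0, "infobox": 2.0, "text": 1.0}

# Hypothetical attribute data standing in for what the attribute
# classifier would extract from each Wikipedia page.
DOCS = {
    "Lagaan": {"category": {"hindi films", "aamir khan films"},
               "infobox": {"director", "starring"}, "text": {"film"}},
    "Talaash": {"category": {"hindi films", "aamir khan films"},
                "infobox": {"director", "starring"}, "text": {"film"}},
    "3 Idiots": {"category": {"hindi films", "aamir khan films"},
                 "infobox": {"director", "starring"}, "text": {"film"}},
    "Sholay": {"category": {"hindi films"},
               "infobox": {"director", "starring"}, "text": {"film"}},
    "Malaria": {"category": {"diseases"},
                "infobox": {"symptoms"}, "text": {"disease"}},
}


def rank_candidates(seeds, docs=DOCS, top_k=10):
    """Score each non-seed document by the attributes it shares with the seeds."""
    scores = defaultdict(float)
    for seed in seeds:
        for name, attrs in docs.items():
            if name in seeds:
                continue  # seeds themselves are not candidates
            for kind, weight in ATTRIBUTE_WEIGHTS.items():
                shared = attrs[kind] & docs[seed][kind]
                scores[name] += weight * len(shared)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, score in ranked[:top_k] if score > 0]


print(rank_candidates(["Lagaan", "Talaash"]))
```

In this toy data, a candidate that shares both categories with the seeds ranks above one that shares only a single category, and unrelated pages score zero and are dropped.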

Complete Architecture

Tests

● Input: lagaan, talaash

● Results: 3 idiots, sarfarosh, champion (2003 film), p.k. (film), afsana pyaar ka, delhi belly (film), dhoom 3, jo jeeta wohi sikandar, nation awakes, ghajini (2008 film)

● Input: arvind kejriwal, narendra modi

● Results: aung san suu kyi, barry simon (politician), edmund stoiber, george orwell, cindy hill (politician), friedrich hayek, gordon brown, mohandas karamchand gandhi, mikhail gorbachev, jimmy carter

Applications

● Set Expansion on Wiki data itself

● General Knowledge

– For example, if you want a list of diseases but know only a few, such as malaria and cholera, you can give those as input and get a variety of diseases in the results

● Comparisons between named entities

● Search result suggestions on Wiki, as on Google

● and many more to come...

Conclusion

● In this project, our task was to expand a set of named entities, and we think we have been quite successful at it

● Yes, it's possible that our results may not be up to the mark in some cases, but they cover the most general results as expected

● There is a lot of scope for future work on this project, and we are planning to get our hands dirty

● We can try to handle the whole Wiki data more efficiently, and God knows, maybe our tool will be used by billions :)

References

● http://en.wikipedia.org/wiki/Collaborative_filtering

● https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/icdm2007.pdf

● http://aclweb.org/anthology//P/P09/P091050.pdf

● http://su2010-projekt.googlecode.com/svn-history/r115/trunk/literatura/melville2002content.pdf

● http://www.cs.uic.edu/~lzhang3/paper/set_expansion.pdf

Thank You ☺