

Page 1: Presentation dual inversion-index

“BEYOND PAGES: SUPPORTING EFFICIENT, SCALABLE ENTITY SEARCH WITH DUAL INVERSION INDEX”

Tao Cheng, Kevin Chen-Chuan Chang
University of Illinois at Urbana-Champaign

Presented By:

Mahesh Gupta

CSE 6339 Web Search Mining & Integration – Paper Presentation

Page 2: Presentation dual inversion-index

WHAT IS THIS PAPER ALL ABOUT?

Entity-based search is the next big step forward and a significant departure from traditional keyword-based search. From a computational point of view, entity search at the scale of the World Wide Web poses unique challenges. This paper identifies these computational challenges and introduces a solution based on the “Dual-Inversion Index” technique.

Page 3: Presentation dual inversion-index

ENTITY SEARCH

Suppose we are interested in finding the location of Cowboy Stadium. Our system expects the query to be a combination of keywords and entities:

Cowboy Stadium #Location

Task to do here: Context Matching

Page 4: Presentation dual inversion-index

IS CONTEXT MATCHING ALONE ENOUGH?

... Cowboy Stadium located in Arlington Texas ...

... Cowboy Stadium located in United States cost $1.3 billion in complete construction ...

... Cowboy Stadium located in North Texas is the fourth largest national football stadium in the country by seating capacity ...

... Cowboy Stadium is 20 Miles drive from Hilton hotel located in Stemmons Freeway, Dallas Texas ...

Context matching returns several different candidate locations, so clearly we also need some scoring mechanism.

Page 5: Presentation dual inversion-index

HENCE….

Suppose we are interested in finding the location of Cowboy Stadium.

Cowboy Stadium #Location

Task to do here: Context Matching + Global Aggregation
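To make the two tasks concrete, here is a minimal Python sketch of the global aggregation step. It assumes context matching has already produced one candidate per matching page; the document ids and the local scores are made-up values for illustration and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical context matches for the query "Cowboy Stadium #Location":
# (document id, matched location instance, local score of that match).
# Document ids and scores are invented for this sketch.
matches = [
    ("D6", "Arlington Texas", 0.9),
    ("D9", "United States", 0.4),
    ("D97", "North Texas", 0.6),
    ("D120", "Arlington Texas", 0.8),
    ("D131", "Dallas Texas", 0.2),
]

def aggregate(matches):
    """Sum the local scores of every candidate instance across all documents."""
    totals = defaultdict(float)
    for _doc, instance, local_score in matches:
        totals[instance] += local_score
    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = aggregate(matches)
print(ranking[0][0])
```

A candidate that is supported by several pages ends up ahead of one that is mentioned only once, which is the point of aggregating globally instead of ranking pages.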

Page 6: Presentation dual inversion-index

COMPUTATIONAL CHALLENGES

One approach is to first run a keyword search using the keywords in the query and then run entity search on the documents it returns. This is much like a sequential scan and is too slow to scale to the World Wide Web.

If someone suggests restricting this to the top-k pages, what would be an effective value of k for the web? Cutting off at k pages also affects global aggregation, since evidence in the discarded pages is never counted.

So we need an effective mechanism for this, and that is an index.

Page 7: Presentation dual inversion-index

WHAT IS AN INDEX HERE?

Indexing is a pre-processing task for the application, and the indexes are then used to answer user queries quickly.

For example, an index list can look something like this:

Keyword    (Document, position)
Cowboy     (D10,12); (D12,34); (D46,257); ...
Stadium    (D10,13); (D34,134); (D146,357); ...
...        ...
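A keyword index like the one above can be built in a single pass over the documents. The sketch below assumes whitespace tokenization and lowercasing, and uses an invented two-document corpus; a real system would use a proper tokenizer.

```python
from collections import defaultdict

def build_keyword_index(docs):
    """Map each lowercased token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
    return dict(index)

# Toy corpus; the document ids and texts are invented for this sketch.
docs = {
    "D1": "Cowboy Stadium located in Arlington Texas",
    "D2": "the stadium cost $1.3 billion",
}
index = build_keyword_index(docs)
print(index["stadium"])
```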

Page 8: Presentation dual inversion-index

INTRODUCING DUAL-INVERSION INDEX

Two types of index mechanism are proposed by the authors for entity search:

1. Document-Inverted Index
2. Entity-Inverted Index

Let's discuss them one by one.

Page 9: Presentation dual inversion-index

DOCUMENT INVERTED INDEX

Process each document, identify the keywords and entities in it, and then create a list for each keyword holding the document ID and the position in the document. Basically, this index is a keyword/entity-to-document mapping.

Mathematically, for keyword ‘k’ and document ‘doc’:

D(k): k -> {(doc, pos) | doc.content(pos) = k}

So the lists will look something like:

D(cowboy):   D2,12   D6,17   D9,34    D9,357   D97,45
D(stadium):  D6,18   D9,35   D56,55   D64,5    D97,46

Page 10: Presentation dual inversion-index

DOCUMENT INVERTED INDEX CONTINUED

The mapping from an entity to documents is slightly different from the keyword-to-document mapping in the index, because an entity can have a different instance value at each occurrence in a document.

So mathematically:

D(E): E -> {(doc, pos, e) | doc.content(pos) = e, e ∈ E}

In list view:

D(Location):
D6,23,'Arlington TX'
D9,45,'United States'
D97,50,'North Texas'
...
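As a rough illustration of D(E), the sketch below groups entity occurrences by entity type. It assumes an entity extractor has already run and produced (doc id, position, entity type, instance) tuples; the extractor itself and the "Price" row are hypothetical and only there to show that each type gets its own list.

```python
from collections import defaultdict

# Hypothetical extractor output: (doc id, position, entity type, instance).
annotations = [
    ("D6", 23, "Location", "Arlington TX"),
    ("D9", 45, "Location", "United States"),
    ("D97", 50, "Location", "North Texas"),
    ("D9", 60, "Price", "$1.3 billion"),  # invented extra row
]

def build_entity_index(annotations):
    """D(E): map each entity type E to postings (doc, pos, instance)."""
    index = defaultdict(list)
    for doc_id, pos, entity_type, instance in annotations:
        index[entity_type].append((doc_id, pos, instance))
    return dict(index)

entity_index = build_entity_index(annotations)
for posting in entity_index["Location"]:
    print(posting)
```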

Page 11: Presentation dual inversion-index

DI-INDEX -> IS IT EFFICIENT NOW?

If we treat the list of each keyword and entity in the index as a relation, then we can write the equivalent SQL query as:

SELECT l.entity, SUM(lscore) AS score
FROM Cowboy c, Stadium s, Location l
WHERE c.dId = s.dId AND s.dId = l.dId
-------
GROUP BY l.entity
HAVING score > threshold;

Issue here: the cost of the complex join. In fact, as the number of keywords and entities in the query increases, the join becomes more complex.

So we still need improvement.
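The join-then-aggregate shape of this query can be mimicked in plain Python over the posting lists from the earlier slides. This is only a sketch: the slide does not define lscore, so the distance-based local score used here is an assumption, and only the document-id join conditions shown in the SQL are applied.

```python
from collections import defaultdict
from itertools import product

cowboy = [("D2", 12), ("D6", 17), ("D9", 34), ("D9", 357), ("D97", 45)]
stadium = [("D6", 18), ("D9", 35), ("D56", 55), ("D64", 5), ("D97", 46)]
location = [("D6", 23, "Arlington TX"), ("D9", 45, "United States"),
            ("D97", 50, "North Texas")]

def by_doc(postings):
    """Group postings by document id (the join key)."""
    groups = defaultdict(list)
    for posting in postings:
        groups[posting[0]].append(posting[1:])
    return groups

def join_and_aggregate(cowboy, stadium, location, threshold=0.0):
    c_docs, s_docs, l_docs = by_doc(cowboy), by_doc(stadium), by_doc(location)
    scores = defaultdict(float)
    # Equi-join on document id: c.dId = s.dId AND s.dId = l.dId.
    for doc in c_docs.keys() & s_docs.keys() & l_docs.keys():
        for (_c_pos,), (s_pos,), (l_pos, instance) in product(
                c_docs[doc], s_docs[doc], l_docs[doc]):
            # Assumed local score: closer keyword/entity pairs count more.
            scores[instance] += 1.0 / (1 + abs(l_pos - s_pos))
    # GROUP BY instance HAVING score > threshold.
    return {e: s for e, s in scores.items() if s > threshold}

result = join_and_aggregate(cowboy, stadium, location)
print(sorted(result))
```

Every extra keyword or entity adds another list to the product, which is why the join cost grows with the query.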

Page 12: Presentation dual inversion-index

DI-INDEX -> DATA PARTITIONING

Partition the document space into equal-sized partitions. For example:

100 docs -> 10 partitions -> 10 docs in each partition

Each partition has a list of the keywords and entities it supports. Find the partitions that support yours, and then you have to perform the join only among the documents in those partitions.

So the entry for an entity in the index should now have a partition number instead of the instance value (why?)

D(Location):
D6,23,P8   D9,45,P86   D97,50,P8   ...
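The partition lookup can be sketched as follows. This assumes documents are identified by the numeric part of their id and assigned to partitions by integer division; the function names are hypothetical, and the partition numbers it produces are not meant to match the P8/P86 values on the slide.

```python
def partition_of(doc_number, docs_per_partition=10):
    """Equal-size partitioning: documents 0-9 -> P0, 10-19 -> P1, and so on."""
    return doc_number // docs_per_partition

def build_partition_directory(postings_by_term, docs_per_partition=10):
    """For each partition, record which keywords/entity types occur in it."""
    directory = {}
    for term, postings in postings_by_term.items():
        for doc_number, _pos in postings:
            p = partition_of(doc_number, docs_per_partition)
            directory.setdefault(p, set()).add(term)
    return directory

def partitions_supporting(directory, query_terms):
    """Only partitions containing every query term need to run the join."""
    wanted = set(query_terms)
    return sorted(p for p, terms in directory.items() if wanted <= terms)

postings_by_term = {
    "cowboy": [(2, 12), (6, 17), (9, 34), (97, 45)],
    "stadium": [(6, 18), (9, 35), (56, 55), (64, 5), (97, 46)],
    "#Location": [(6, 23), (9, 45), (97, 50)],
}
directory = build_partition_directory(postings_by_term)
print(partitions_supporting(directory, ["cowboy", "stadium", "#Location"]))
```

Partitions that lack any one of the query terms are skipped entirely, so the join runs over a much smaller set of documents.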

Page 13: Presentation dual inversion-index

ENTITY-INVERTED INDEX

In contrast to the DI-Index, here we map each keyword to entities while building the index. We store not only each keyword's position in the document but also the position of the nearby entity in whose context the keyword occurred.

Mathematically:

E(k): k -> {(o(doc, epos, entity), pos) | o.context[pos] = k, entity ∈ E}

which translates to: k appears at position ‘pos’ in the context of the entity occurrence o(doc, epos, entity).

Page 14: Presentation dual inversion-index

EI-INDEX CONTINUED

Hence the layout of the index lists in this case will look something like:

E(cowboy):
((D6,23,P8),17)
((D9,45,P86),34)
((D97,50,P8),45)
...

E(stadium):
((D23,23,P8),18)
((D9,45,P86),35)
((D97,50,P8),46)
...

(Notice that there is no list for Location, because it is an entity.)

Here P is the partition number, explained in the next slide; the concept is analogous to DI-Index partitioning.
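A minimal sketch of how such E(k) lists could be built for one entity type is given below. It follows the formula on the previous slide and stores the entity instance in each posting rather than a partition number. The fixed context window of five tokens, the single-token entity, and the function name are all assumptions made for this illustration.

```python
from collections import defaultdict

def build_entity_inverted_index(docs, occurrences, window=5):
    """E(k): map keyword k to ((doc, epos, instance), pos) postings.

    docs: {doc_id: list of tokens}
    occurrences: (doc_id, epos, instance) occurrences of ONE entity type
    window: assumed context size, in tokens, on each side of the entity
    """
    index = defaultdict(list)
    for doc_id, epos, instance in occurrences:
        tokens = docs[doc_id]
        start = max(0, epos - window)
        end = min(len(tokens), epos + window + 1)
        for pos in range(start, end):
            if pos == epos:
                continue  # the entity occurrence itself is not a keyword
            index[tokens[pos].lower()].append(((doc_id, epos, instance), pos))
    return dict(index)

# Invented one-document corpus with a Location occurrence at token 5.
docs = {"D6": "Cowboy Stadium is located in Arlington".split()}
occurrences = [("D6", 5, "Arlington")]
ei = build_entity_inverted_index(docs, occurrences)
print(ei["cowboy"], ei["stadium"])
```

Because context matching happens here, at index time, a query only has to intersect the keyword lists on the entity occurrence.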

Page 15: Presentation dual inversion-index

EI-INDEX PARTITIONING

Here we partition on the basis of entities. Divide the entity space into equal-sized partitions. For example:

10 entities -> 10 partition nodes -> 1 entity in each partition

Each entity node will have the list of keywords found in the context of this entity.

It is faster than the DI-Index because the context matching task has already been performed during index construction.
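The following sketch shows what query processing on a single entity node might look like. It assumes one node per entity type, postings that carry the entity instance, and a simple count of matching occurrences as the local aggregate; the sample postings (including document D120) are invented and none of this is taken from the paper's implementation.

```python
from collections import Counter

# Assumed layout: one node per entity type, each holding the E-inverted
# lists for keywords seen in that entity type's contexts.
nodes = {
    "Location": {
        "cowboy": [(("D6", 23, "Arlington TX"), 17),
                   (("D9", 45, "United States"), 34),
                   (("D97", 50, "North Texas"), 45),
                   (("D120", 8, "Arlington TX"), 2)],
        "stadium": [(("D6", 23, "Arlington TX"), 18),
                    (("D9", 45, "United States"), 35),
                    (("D97", 50, "North Texas"), 46),
                    (("D120", 8, "Arlington TX"), 3)],
    },
    "Price": {"cost": [(("D9", 60, "$1.3 billion"), 58)]},
}

def query_node(entity_type, keywords):
    """Join the keyword lists on the entity occurrence, then aggregate locally."""
    if not keywords:
        return Counter()
    node = nodes[entity_type]
    occurrence_sets = [{occ for occ, _pos in node.get(k, [])} for k in keywords]
    matched = set.intersection(*occurrence_sets)
    # Local aggregation: count matching occurrences per instance.
    return Counter(instance for _doc, _epos, instance in matched)

counts = query_node("Location", ["cowboy", "stadium"])
print(counts.most_common(1))
```

Only the node that owns the queried entity type does any work, and it can aggregate its own results without shipping postings to a central place.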


Page 16: Presentation dual inversion-index

COMPARISON

              D-Inverted                 E-Inverted
Join          Fast (why?)                Faster (why?)
Aggregation   Central (why?)             Distributed (why?)
Space         Minimal overhead (why?)    Large (why?)

Page 17: Presentation dual inversion-index

CAN BOTH INDEXES CO-EXIST? (DUAL-INVERSION INDEX)

The answer is yes. It is advisable to create the E-Inverted index for entities that are queried more often and take less space, because it is faster, whereas the D-Inverted index should be used for entities that are queried less often but take a large amount of space.

This balances the space and time performance of the application.
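One simple way to act on this advice is a greedy choice under a space budget, sketched below. This heuristic is an assumption made for illustration and is not claimed to be the authors' selection method; the query frequencies, sizes, and budget are invented numbers.

```python
def choose_index(entity_stats, space_budget):
    """Greedy sketch: E-invert the entity types with the best frequency/space ratio.

    entity_stats: {entity_type: (query_frequency, e_inverted_size)}, sizes > 0
    Returns {entity_type: "E-Inverted" or "D-Inverted"}.
    """
    ranked = sorted(entity_stats.items(),
                    key=lambda item: item[1][0] / item[1][1],
                    reverse=True)
    plan, used = {}, 0
    for entity_type, (_freq, size) in ranked:
        if used + size <= space_budget:
            plan[entity_type] = "E-Inverted"
            used += size
        else:
            plan[entity_type] = "D-Inverted"
    return plan

# Invented statistics: (queries per day, E-inverted size in arbitrary units).
stats = {"Location": (900, 40), "Price": (300, 30), "Date": (50, 80)}
plan = choose_index(stats, space_budget=80)
print(plan)
```

Frequently queried, compact entity types get the faster E-Inverted lists, while the rarely queried, bulky ones fall back to D-Inverted.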


Page 18: Presentation dual inversion-index

SUMMARY

D-Inverted maps keywords and entities to documents. It gives good performance using partitioning and takes minimal space.

E-Inverted maps each keyword to the document and to the context of the entity in which it was found. It is faster than D-Inverted but requires a large amount of space to store.

Both can co-exist in a system to balance performance.


Page 19: Presentation dual inversion-index

Thank You
