Presentation dual inversion-index

“BEYOND PAGES: SUPPORTING EFFICIENT, SCALABLE ENTITY SEARCH WITH DUAL INVERSION INDEX”

Tao Cheng, Kevin Chen-Chuan Chang

University of Illinois at Urbana-Champaign

Presented By:

Mahesh Gupta

CSE 6339 Web Search Mining & Integration – Paper Presentation

WHAT IS THIS PAPER ALL ABOUT?

Entity-based search is the next big step forward and a significant departure from traditional keyword-based search. From a computational point of view, entity search at the scale of the World Wide Web also poses unique challenges. This paper identifies these computational challenges and introduces a solution based on the “Dual-Inversion Index” technique.


ENTITY SEARCH

Suppose we are interested in finding the location of Cowboy Stadium. Our system expects the query as a combination of keywords and entities.

Task to do here: Context Matching


Cowboy Stadium #Location

IS CONTEXT MATCHING ALONE ENOUGH?

… Cowboy Stadium located in Arlington Texas …

… Cowboy Stadium located in United States cost $1.3 billion in complete construction …

… Cowboy Stadium located in North Texas is the fourth largest national football stadium in the country by seating capacity …

… Cowboy Stadium is 20 Miles drive from Hilton hotel located in Stemmons Freeway, Dallas Texas …

So clearly we also need some scoring mechanism.

HENCE….

Suppose we are interested in finding the location of Cowboy Stadium.

Task to do here: Context Matching + Global Aggregation

Cowboy Stadium #Location
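The two tasks can be illustrated with a minimal Python sketch. It is not the paper's method: it assumes a toy gazetteer (`LOCATIONS`) in place of a real #Location extractor, a fixed token window as the "context", and a plain match count as the score. The fifth snippet is a hypothetical addition so that one location is seen more than once.

```python
from collections import Counter

# The four snippets from the earlier slide, plus one hypothetical extra snippet.
SNIPPETS = [
    "Cowboy Stadium located in Arlington Texas",
    "Cowboy Stadium located in United States cost $1.3 billion in complete construction",
    "Cowboy Stadium located in North Texas is the fourth largest national football stadium",
    "Cowboy Stadium is 20 Miles drive from Hilton hotel located in Stemmons Freeway, Dallas Texas",
    "Tickets for Cowboy Stadium in Arlington Texas are on sale",  # hypothetical
]

# Hypothetical gazetteer standing in for a real #Location entity extractor.
LOCATIONS = ["Arlington Texas", "United States", "North Texas", "Dallas Texas"]


def context_match(text, keywords, window=6):
    """Yield each location that starts within `window` tokens after the keyword phrase."""
    tokens = text.replace(",", "").lower().split()
    phrase = [k.lower() for k in keywords]
    for i in range(len(tokens) - len(phrase) + 1):
        if tokens[i:i + len(phrase)] != phrase:
            continue
        end = i + len(phrase)
        for loc in LOCATIONS:
            loc_tokens = loc.lower().split()
            for j in range(end, min(end + window, len(tokens))):
                if tokens[j:j + len(loc_tokens)] == loc_tokens:
                    yield loc


def entity_search(snippets, keywords):
    """Global aggregation: add up the local matches of each entity over all snippets."""
    scores = Counter()
    for text in snippets:
        for loc in context_match(text, keywords):
            scores[loc] += 1
    return scores


scores = entity_search(SNIPPETS, ["Cowboy", "Stadium"])
```

In this sketch "Dallas Texas" never scores, because it lies outside the window around the keywords (it is the hotel's location), while "Arlington Texas" collects evidence from two snippets.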

COMPUTATIONAL CHALLENGES

An approach that first runs a keyword search using the keywords in the query and then performs entity search on the resulting documents is much like a sequential scan, and it is too slow to scale to the World Wide Web.

If someone suggests a top-k pages approach here, what would be an effective value of k for the web? Restricting the search to k pages would also affect global aggregation.

So we need an effective mechanism for this.

And that is: an index.

WHAT IS AN INDEX HERE?

Indexing is a pre-processing task for the application here, and the indexes are used to answer users' queries fast.

For example, an index list can look something like the one below:

Keyword    (Document, position)
Cowboy     (D10,12); (D12,34); (D46,257); …
Stadium    (D10,13); (D34,134); (D146,357); …
…          …
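A list of this shape can be built in one pass over the documents. The following is a minimal sketch, assuming whitespace tokenization, lowercased keywords, and a hypothetical two-document corpus with integer document IDs.

```python
from collections import defaultdict


def build_keyword_index(docs):
    """Map each lowercased keyword to a list of (doc_id, position) postings, sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, token in enumerate(docs[doc_id].lower().split()):
            index[token].append((doc_id, pos))
    return dict(index)


# Hypothetical corpus, for illustration only.
docs = {
    10: "visit Cowboy Stadium today",
    12: "a cowboy rode past the stadium",
}
index = build_keyword_index(docs)
```

Because documents are visited in ID order, every posting list comes out sorted, which is what later joins between lists rely on.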

INTRODUCING DUAL-INVERSION INDEX

The authors propose two types of index mechanism for entity search:

1. Document-Inverted Index

2. Entity-Inverted Index

Let's discuss them one by one.

DOCUMENT INVERTED INDEX

Process each document, identify the keywords and entities in it, and then create a list for each keyword holding the document ID and the position in the document. Basically, this index is a keyword/entity-to-document mapping.

Mathematically, for keyword ‘k’ and document ‘doc’: D(k): k → {(doc, pos) | doc.content(pos) = k}

So the lists will look something like:

D(cowboy):  D2,12   D6,17   D9,34   D9,357   D97,45

D(stadium): D6,18   D9,35   D56,55   D64,5   D97,46

DOCUMENT INVERTED INDEX (CONTINUED)

The mapping from an entity to documents is slightly different from the keyword-to-document mapping in the index, because an entity can have different instance values in the documents.

So mathematically: D(E): E → {(doc, pos, e) | doc.content(pos) = e; e ∈ E}

In list view:

D(Location):

D6,23,’Arlington TX’

D9,45,’United States’

D97,50,’North Texas’

…
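As a minimal sketch of how such an entity list could be assembled, the code below groups the output of an entity annotator by entity type. The annotator itself is assumed to exist; `ANNOTATIONS` is a hypothetical stand-in for its output, using the three entries from the list above.

```python
from collections import defaultdict

# Hypothetical annotator output: (doc_id, position, entity_type, instance), in arbitrary order.
ANNOTATIONS = [
    (97, 50, "Location", "North Texas"),
    (6, 23, "Location", "Arlington TX"),
    (9, 45, "Location", "United States"),
]


def build_entity_lists(annotations):
    """Build D(E): entity type -> list of (doc_id, pos, instance), sorted by document."""
    lists = defaultdict(list)
    for doc_id, pos, etype, instance in annotations:
        lists[etype].append((doc_id, pos, instance))
    for postings in lists.values():
        postings.sort()
    return dict(lists)


d_index = build_entity_lists(ANNOTATIONS)
```

Unlike a keyword posting, each entry carries the instance value, since the same entity type matches many different strings.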

DI-INDEX -> IS IT EFFICIENT NOW?

If we treat the list of each keyword and entity in the index as a relation, then we can write the equivalent SQL query as:

    SELECT l.entity, SUM(lscore) AS score
    FROM Cowboy c, Stadium s, Location l
    WHERE c.dId = s.dId AND s.dId = l.dId
      -- ...
    GROUP BY l.entity
    HAVING SUM(lscore) > threshold;

Issue here: the cost of the complex join. As the number of keywords and entities in the query increases, the join becomes more complex.

So we still need improvement.
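The same join can be sketched directly over the posting lists from the earlier slides. The position conditions are elided in the query above, so the ones used here are assumptions: "stadium" must directly follow "cowboy", the location must start within `window` tokens after it, and every match contributes a local score of 1.0.

```python
from collections import defaultdict

# Posting lists taken from the earlier slides.
D_COWBOY = [(2, 12), (6, 17), (9, 34), (9, 357), (97, 45)]
D_STADIUM = [(6, 18), (9, 35), (56, 55), (64, 5), (97, 46)]
D_LOCATION = [(6, 23, "Arlington TX"), (9, 45, "United States"), (97, 50, "North Texas")]


def by_doc(postings):
    """Group postings by document ID, keeping the remaining fields as a tuple."""
    groups = defaultdict(list)
    for doc_id, *rest in postings:
        groups[doc_id].append(tuple(rest))
    return groups


def join_and_aggregate(cowboy, stadium, location, window=10):
    """Join the three lists on document ID, then sum local scores per entity instance."""
    c, s, l = by_doc(cowboy), by_doc(stadium), by_doc(location)
    scores = defaultdict(float)
    for doc_id in sorted(c.keys() & s.keys() & l.keys()):  # c.dId = s.dId = l.dId
        for (cpos,) in c[doc_id]:
            for (spos,) in s[doc_id]:
                if spos != cpos + 1:  # assumed condition: the phrase "cowboy stadium"
                    continue
                for lpos, instance in l[doc_id]:
                    if 0 < lpos - spos <= window:  # assumed proximity condition
                        scores[instance] += 1.0  # assumed local score
    return dict(scores)


result = join_and_aggregate(D_COWBOY, D_STADIUM, D_LOCATION)
narrow = join_and_aggregate(D_COWBOY, D_STADIUM, D_LOCATION, window=5)
```

Each extra keyword or entity in the query adds another nested loop per document, which is the join cost the slide refers to.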

DI-INDEX -> DATA PARTITIONING

Partition the document space into equal-sized parts. For example:

100 docs -> 10 partitions -> 10 docs in each partition

Each partition has a list of the keywords and entities it supports. Find the partitions that support your query, and then you only have to perform the join among the documents of those partitions.

So the entry for an entity in the index should now hold the partition number instead of the instance value (why?).

D(Location):

D6,23,P8 D9,45,P86 D97,50,P8 …
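Partition pruning can be sketched as follows. The range-based assignment in `partition_of` is an assumption made for illustration and does not reproduce the partition numbers (P8, P86) shown in the list above; the posting lists are the ones from the earlier slides.

```python
from collections import defaultdict


def partition_of(doc_id, docs_per_partition=10):
    """Hypothetical range partitioning: documents 0-9 -> P0, 10-19 -> P1, and so on."""
    return doc_id // docs_per_partition


def build_partition_terms(lists, docs_per_partition=10):
    """For each partition, record which keywords and entity types occur in it."""
    terms = defaultdict(set)
    for term, postings in lists.items():
        for posting in postings:
            terms[partition_of(posting[0], docs_per_partition)].add(term)
    return terms


def candidate_partitions(terms, query_terms):
    """Only partitions that contain every query term need to run the join."""
    needed = set(query_terms)
    return sorted(p for p, present in terms.items() if needed <= present)


LISTS = {
    "cowboy": [(2, 12), (6, 17), (9, 34), (9, 357), (97, 45)],
    "stadium": [(6, 18), (9, 35), (56, 55), (64, 5), (97, 46)],
    "#Location": [(6, 23), (9, 45), (97, 50)],
}
terms = build_partition_terms(LISTS)
parts = candidate_partitions(terms, ["cowboy", "stadium", "#Location"])
```

With this data only partitions 0 and 9 qualify; partitions 5 and 6 contain "stadium" alone and are skipped without any join work.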

ENTITY-INVERTED INDEX

In contrast to the DI-Index, here we map each keyword to the entity while building the index. We store not only each keyword's position in the document but also the position of the nearby entity in whose context the keyword occurred.

Mathematically: E(k): k → {(o(doc, epos, entity), pos) | o.context[pos] = k; entity ∈ E}

This translates to: k appears at position ‘pos’ in the context of the entity occurrence o(doc, epos, entity).

EI-INDEX (CONTINUED)

Hence the layout of the index lists in this case will look something like:

E(cowboy):

((D6,23,P8),17)

((D9,45,P86),34)

((D97,50,P8),45)

…

E(Stadium):

((D23,23,P8),18)

((D9,45,P86),35)

((D97,50,P8),46)

…

(Notice that there is no list for Location, because it is an entity.)

Here P is the partition number, explained in the next slide; the concept is analogous to DI-Index partitioning.
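A minimal sketch of building and querying such lists is given below. It makes several assumptions that are not in the slides: the context of an entity occurrence is a fixed window of tokens around it, each entity instance is a single token, and the instance value is stored in the entry where the lists above show a partition number. The two documents are hypothetical.

```python
from collections import defaultdict


def build_entity_inverted(docs, entity_occurrences, window=10):
    """Build E(k): keyword -> [((doc, epos, instance), pos)] for keywords near an entity."""
    index = defaultdict(list)
    for doc_id, epos, instance in entity_occurrences:
        tokens = docs[doc_id].lower().split()
        lo, hi = max(0, epos - window), min(len(tokens), epos + window + 1)
        for pos in range(lo, hi):
            if pos != epos:  # the entity token itself gets no keyword list
                index[tokens[pos]].append(((doc_id, epos, instance), pos))
    return dict(index)


def query(index, keywords):
    """Return the entity occurrences that have every keyword in their context."""
    sets = [{occ for occ, _ in index.get(k.lower(), [])} for k in keywords]
    return set.intersection(*sets) if sets else set()


# Hypothetical documents and annotator output (doc_id, entity position, instance).
docs = {
    6: "cowboy stadium is located in arlington",
    7: "the hotel is in dallas near a stadium",
}
occurrences = [(6, 5, "arlington"), (7, 4, "dallas")]

ei_index = build_entity_inverted(docs, occurrences)
hits = query(ei_index, ["Cowboy", "Stadium"])
```

Context matching happens once, at build time: the query only intersects lists on the shared entity occurrence, with no document-level join.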

EI-INDEX PARTITIONING

Here we partition on the basis of entities. Divide the entity space into equal-sized parts. So, for example:

10 entities -> 10 partition nodes -> 1 entity in each partition

Each entity node will have the list of keywords found in the context of this entity.

It is faster than the DI-Index because the task of context matching has already been performed during index construction.

COMPARISON

              D-Inverted                 E-Inverted
Join          Fast (why?)                Faster (why?)
Aggregation   Central (why?)             Distributed (why?)
Space         Minimal overhead (why?)    Large (why?)

CAN BOTH INDEXES CO-EXIST? (DUAL-INVERSION INDEX)

The answer is yes. It is advisable to create E-Inverted for entities that are queried more often and take less space, because it is faster, whereas D-Inverted should be created for entities that are queried less often but take large space.

This balances the space and time performance of the application.
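One way to picture this trade-off is a greedy selection under a space budget. This is only a hypothetical heuristic and is not taken from the paper: it ranks entity types by query frequency per unit of E-Inverted space and E-inverts as many as fit, leaving the rest to D-Inverted. All names and numbers are made up for illustration.

```python
def choose_e_inverted(stats, space_budget):
    """Greedy heuristic: E-invert entity types with the best frequency-per-space ratio.

    stats maps entity_type -> (query_frequency, e_inverted_size), with size > 0.
    Returns (e_inverted_types, d_inverted_types).
    """
    ranked = sorted(stats, key=lambda e: stats[e][0] / stats[e][1], reverse=True)
    e_types, used = [], 0
    for etype in ranked:
        size = stats[etype][1]
        if used + size <= space_budget:
            e_types.append(etype)
            used += size
    d_types = [e for e in stats if e not in e_types]
    return e_types, d_types


# Hypothetical workload statistics: (queries per day, index size in GB).
STATS = {"#Location": (900, 40), "#Phone": (500, 10), "#Person": (100, 80)}
e_types, d_types = choose_e_inverted(STATS, space_budget=60)
```

Here the frequently queried, compact types end up E-inverted, while the rarely queried, large one stays D-inverted.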


SUMMARY

D-Inverted maps keywords and entities to documents. It gives good performance using partitioning and takes minimal space.

E-Inverted maps each keyword to the document and to the context of the entity under which it was found. It is faster than D-Inverted but requires large space to store.

Both can co-exist in a system to balance performance.


Thank You
