39
En##es, Graphs, and Crowdsourcing for be7er Web Search Gianluca Demar#ni eXascale Infolab University of Fribourg, Switzerland

Entities, Graphs, and Crowdsourcing for better Web Search

Embed Size (px)

DESCRIPTION

Invited talk at the Università della Svizzera Italiana, Lugano, Switzerland in June 2013

Citation preview

Page 1: Entities, Graphs, and Crowdsourcing for better Web Search

En##es,  Graphs,  and  Crowdsourcing  for  be7er  Web  Search  

Gianluca  Demar#ni  eXascale  Infolab  

University  of  Fribourg,  Switzerland  

Page 2: Entities, Graphs, and Crowdsourcing for better Web Search

Gianluca  Demar#ni  

•  M.Sc.  at  University  of  Udine,  Italy  •  Ph.D.  at  University  of  Hannover,  Germany  

–  En#ty  Retrieval  •  Worked  for  UC  Berkeley  (on  Crowdsourcing),  Yahoo!  Research  

(Spain),  L3S  Research  Center  (Germany)  

•  Post-­‐doc  at  the  eXascale  Infolab,  Uni  Fribourg,  Switzerland.  •  Lecturer  for  Social  Compu,ng  in  Fribourg  

•  Tutorial  on  En#ty  Search  at  ECIR  2012,  on  Crowdsourcing  at  ESWC  2013  and  ISWC  2013  

•  Research  Interests  –  Informa#on  Retrieval,  Seman#c  Web,  Crowdsourcing  

2  

[email protected]

Gianluca  Demar#ni  

Page 3: Entities, Graphs, and Crowdsourcing for better Web Search

Gianluca  Demar#ni   3  

Page 4: Entities, Graphs, and Crowdsourcing for better Web Search

Gianluca  Demar#ni   4  

Page 5: Entities, Graphs, and Crowdsourcing for better Web Search

Web  of  Data  

•  Freebase  – Acquired  by  Google  in  July  2010.  –  Knowledge  Graph  launched  in  May  2012.  

•  Schema.org  – Driven  by  major  search  engine  companies  – Machine-­‐readable  annota#ons  of  Web  pages  

•  Linked  Open  Data  –  31  billion  triples,  Sept.  2011  

Gianluca  Demar#ni   5  

Page 6: Entities, Graphs, and Crowdsourcing for better Web Search

Linked  Open  Data  

Z.  Kaoudi  and  I.  Manolescu,  ICDE  seminar  2013     6  

Page 7: Entities, Graphs, and Crowdsourcing for better Web Search

I  will  talk  about  

•  En#ty  Linking/Disambigua#on  – On  the  Web  using  crowdsourcing  – For  scien#fic  literature  using  graphs  

•  Ad-­‐hoc  Object  Retrieval  (En#ty  Ranking)  – Using  IR  and  graphs  

•  Crowdsourced  Query  Understanding  

Gianluca  Demar#ni   7  

Page 8: Entities, Graphs, and Crowdsourcing for better Web Search

Disclaimer  

•  No  efficiency  evalua#on  – Approaches  not  distributed  – But  designed  to  scale  out  

•  No  user  studies  – Goal:  Obtain  high  quality  data  – Only  TREC-­‐like  evalua#on  on  effec#veness  

Gianluca  Demar#ni   8  

Page 9: Entities, Graphs, and Crowdsourcing for better Web Search

En#ty  Linking/Disambigua#on  

Page 10: Entities, Graphs, and Crowdsourcing for better Web Search

Gianluca  Demar#ni   10  

h7p://dbpedia.org/resource/Facebook  

h7p://dbpedia.org/resource/Instagram  

jase:Instagram  owl:sameAs  

Google  

Android  

<p>Facebook  is  not  wai#ng  for  its  ini#al  public  offering  to  make  its  first  big  purchase.</p><p>In  its  largest  acquisi#on  to  date,  the  social  network  has  purchased  Instagram,  the  popular  photo-­‐sharing  applica#on,  for  about  $1  billion  in  cash  and  stock,  the  company  said  Monday.</p>  

<p><span  about="h7p://dbpedia.org/resource/Facebook"><cite  property=”rdfs:label">Facebook</cite>  is  not  wai#ng  for  its  ini#al  public  offering  to  make  its  first  big  purchase.</span></p><p><span  about="h7p://dbpedia.org/resource/Instagram">In  its  largest  acquisi#on  to  date,  the  social  network  has  purchased  <cite  property=”rdfs:label">Instagram</cite>  ,  the  popular  photo-­‐sharing  applica#on,  for  about  $1  billion  in  cash  and  stock,  the  company  said  Monday.</span></p>  

RDFa  enrichment  

HTML:  

Page 11: Entities, Graphs, and Crowdsourcing for better Web Search

Crowdsourcing  

•  Exploit  human  intelligence  to  solve  – Tasks  simple  for  humans,  complex  for  machines  – With  a  large  number  of  humans  (the  Crowd)  

– Small  problems:  micro-­‐tasks  (Amazon  MTurk)  

•  Examples  – Wikipedia,  Image  tagging  

•  Incen#ves  – Financial,  fun,  visibility  

Gianluca  Demar#ni   11  

Page 12: Entities, Graphs, and Crowdsourcing for better Web Search

ZenCrowd  

•  Combine  both  algorithmic  and  manual  linking  •  Automate  manual  linking  via  crowdsourcing  

•  Dynamically  assess  human  workers  with  a  probabilis#c  reasoning  framework  

12  

Crowd  

Algorithms  Machines  

Gianluca  Demar#ni  

Page 13: Entities, Graphs, and Crowdsourcing for better Web Search

ZenCrowd  Architecture  

Micro Matching

Tasks

HTMLPages

HTML+ RDFaPages

LOD Open Data Cloud

CrowdsourcingPlatform

ZenCrowdEntity

Extractors

LOD Index Get Entity

Input Output

Probabilistic Network

Decision Engine

Micr

o-Ta

sk M

anag

er

Workers Decisions

AlgorithmicMatchers

Gianluca  Demar#ni   13  

Gianluca  Demar#ni,  Djellel  Eddine  Difallah,  and  Philippe  Cudré-­‐Mauroux.  ZenCrowd:  Leveraging  Probabilis#c  Reasoning  and  Crowdsourcing  Techniques  for  Large-­‐Scale  En#ty  Linking.  In:  21st  Interna#onal  Conference  on  World  Wide  Web  (WWW  2012).  

Page 14: Entities, Graphs, and Crowdsourcing for better Web Search

Algorithmic  Matching  

•  Inverted  index  over  LOD  en##es  – DBPedia,  Freebase,  Geonames,  NYT  

•  TF-­‐IDF  (IR  ranking  func#on)  •  Top  ranked  URIs  linked  to  en##es  in  docs  

•  Threshold  on  the  ranking  func#on  or  top  N  

Gianluca  Demar#ni   14  

Page 15: Entities, Graphs, and Crowdsourcing for better Web Search

En#ty  Factor  Graphs  

•  Graph  components  – Workers,  links,  clicks  – Prior  probabili#es  – Link  Factors  – Constraints  

•  Probabilis#c  Inference  – Select  all  links  with  posterior  prob  >τ  

w1 w2

l1 l2

pw1( ) pw2( )

lf1( ) lf2( )

pl1( ) pl2( )

l3

lf3( )

pl3( )

c11 c22c12c21 c13 c23

u2-3( )sa1-2( )

2  workers,  6  clicks,  3  candidate  links  

Link  priors  

Worker  priors  

Observed  variables  

Link  factors  

SameAs  constraints  

Dataset  Unicity  constraints  

Gianluca  Demar#ni   15  

Page 16: Entities, Graphs, and Crowdsourcing for better Web Search

En#ty  Factor  Graphs  

•  Training  phase  –  Ini#alize  worker  priors  – with  k  matches  on  known  answers  

•  Upda#ng  worker  Priors  – Use  link  decision  as  new  observa#ons  – Compute  new  worker  probabili#es  

•  Iden#fy  (and  discard)  unreliable  workers  

Gianluca  Demar#ni   16  

Page 17: Entities, Graphs, and Crowdsourcing for better Web Search

Experimental  Evalua#on  

•  Datasets  –  25  news  ar#cles  from  

•  CNN.com  (Global  news)  

•  NYTimes.com  (Global  news)  

•  Washington-­‐post.com  (US  local  news)  

•  Timesofindia.india#mes.com  (India  news)  

•  Swissinfo.com  (Switzerland  local  news)  

–  40M  en##es  (Freebase,  DBPedia,  Geonames,  NYT)  

Gianluca  Demar#ni   17  

Page 18: Entities, Graphs, and Crowdsourcing for better Web Search

Worker  Selec#on  

Gianluca  Demar#ni   18  

Top$US$Worker$

0$

0.5$

1$

0$ 250$ 500$

Worker&P

recision

&

Number&of&Tasks&

US$Workers$

IN$Workers$

0.6$0.62$0.64$0.66$0.68$0.7$0.72$0.74$0.76$0.78$0.8$

1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$Precision)

Top)K)workers)

Page 19: Entities, Graphs, and Crowdsourcing for better Web Search

Lessons  Learnt  

•  Crowdsourcing  +  Prob  reasoning  works!  •  But  

– Different  worker  communi#es  perform  differently  – Many  low  quality  workers  

– Comple#on  #me  may  vary  (based  on  reward)  

•  Need  to  find  the  right  workers  for  your  task  (see  WWW13  paper)  

Gianluca  Demar#ni   19  

Page 20: Entities, Graphs, and Crowdsourcing for better Web Search

ZenCrowd  Summary  

•  ZenCrowd:  Probabilis#c  reasoning  over  automa#c  and  crowdsourcing  methods  for  en#ty  linking  

•  Standard  crowdsourcing  improves  6%  over  automa#c  •  4%  -­‐  35%  improvement  over  standard  crowdsourcing  •  14%  average  improvement  over  automa#c  approaches  

•  On-­‐going  work:  –  Also  used  for  instance  matching  across  datasets  –  3-­‐way  blocking  with  the  crowd  

h7p://exascale.info/zencrowd/  

Gianluca  Demar#ni   20  

Page 21: Entities, Graphs, and Crowdsourcing for better Web Search

En#ty  Disambigua#on  in  Scien#fic  Literature  

•  Using  a  background  concept  graph  

Roman  Prokofyev,  Gianluca  Demar#ni,  Philippe  Cudré-­‐Mauroux,  Alexey  Boyarsky,  and  Oleg  Ruchayskiy.  Ontology-­‐Based  Word  Sense  Disambigua#on  in  the  Scien#fic  Domain.  In:  35th  European  Conference  on  Informa#on  Retrieval  (ECIR  2013).  

Gianluca  Demar#ni   21  

h7p://scienceWISE.info/  

Page 22: Entities, Graphs, and Crowdsourcing for better Web Search

En#ty  Ranking  

Page 23: Entities, Graphs, and Crowdsourcing for better Web Search

Ad-­‐hoc  Object  Retrieval  

•  Once  en##es  have  been  iden#fied…  •  We  want  to  rank  them  as  answer  to  a  query  

•  AOR  – Given  the  descrip#on  of  an  en#ty  – give  me  back  its  iden#fier  

–  Input:  query  q,  data  graph  G  – Output:  ranked  list  of  URIs  from  G  

Gianluca  Demar#ni   23  

Page 24: Entities, Graphs, and Crowdsourcing for better Web Search

An  Hybrid  Approach  to  AOR  

Alberto  Tonon,  Gianluca  Demar#ni,  and  Philippe  Cudré-­‐Mauroux.  Combining  Inverted  Indices  and  Structured  Search  for  Ad-­‐hoc  Object  Retrieval.  In:  35th  Annual  ACM  SIGIR  Conference  (SIGIR  2012).  

LOD Cloud

index()User

Query Annotation and Expansion

Inverted Index

RDF Store

Ranking FunctionsRanking

FunctionsRanking Functions

query()

Entity SearchKeyword Query

intermediatetop-k resultsGraph-Enriched

Results

Graph Traversals(queries on object

properties)

Neighborhoods(queries on datatype

properties)

Structured Inverted Index

WordNet

3rd partysearch engines

Final Ranking Function

Pseudo-Relevance Feedback

Gianluca  Demar#ni   24  

Page 25: Entities, Graphs, and Crowdsourcing for better Web Search

AOR  Evalua#on  

•  1.3  billions  RDF  triples  from  LOD  cloud  •  92  and  50  queries  •  Crowdsourced  relevance  judgments  

•  semsearch.yahoo.com  

Gianluca  Demar#ni   25  

Page 26: Entities, Graphs, and Crowdsourcing for better Web Search

Evalua#on  Results  

Gianluca  Demar#ni   26  

Page 27: Entities, Graphs, and Crowdsourcing for better Web Search

Summary  

•  AOR  =  “Given  the  descrip,on  of  an  en,ty,  give  me  back  its  iden,fier”    

•  Combining  classic  IR  techniques  +  structured  database  storing  graph  data    

•  Significantly  be7er  results  (up  to  +25%  MAP  over  BM25  baseline).    

•  Overhead  caused  from  the  graph  traversal  part  is  limited    

Gianluca  Demar#ni   27  

h7p://exascale.info/AOR/  

Page 28: Entities, Graphs, and Crowdsourcing for better Web Search

CrowdQ:  Crowdsourced  Query  Understanding  

Page 29: Entities, Graphs, and Crowdsourcing for better Web Search

birthdate  of  mayor  of  capital  city  of  france  

Gianluca  Demar#ni   29  

Page 30: Entities, Graphs, and Crowdsourcing for better Web Search

capital  city  of  france  

Gianluca  Demar#ni   30  

Page 31: Entities, Graphs, and Crowdsourcing for better Web Search

mayor  of  paris  

Gianluca  Demar#ni   31  

Page 32: Entities, Graphs, and Crowdsourcing for better Web Search

birthdate  of  Bertrand  Delanoë  

Gianluca  Demar#ni   32  

Page 33: Entities, Graphs, and Crowdsourcing for better Web Search

Mo#va#on  

•  Web  Search  Engines  can  answer  simple  factual  queries  directly  on  the  result  page  

•  Users  with  complex  informa#on  needs  are  oyen  unsa#sfied  

•  Purely  automa#c  techniques  are  not  enough  

•  We  want  to  solve  it  with  Crowdsourcing!  

Gianluca  Demar#ni   33  

Page 34: Entities, Graphs, and Crowdsourcing for better Web Search

CrowdQ  

•  CrowdQ  is  the  first  system  that  uses  crowdsourcing  to  – Understand  the  intended  meaning  

– Build  a  structured  query  template  – Answer  the  query  over  Linked  Open  Data  

Gianluca  Demar#ni   34  

Gianluca  Demar#ni,  Beth  Trushkowsky,  Tim  Kraska,  and  Michael  Franklin.  CrowdQ:  Crowdsourced  Query  Understanding.  In:  6th  Biennial  Conference  on  Innova#ve  Data  Systems  Research  (CIDR  2013).  

Page 35: Entities, Graphs, and Crowdsourcing for better Web Search

35  

Page 36: Entities, Graphs, and Crowdsourcing for better Web Search

User

Keyword QueryOn#line'Complex'Query

ProcessingComplex

query classifier

CrowdsourcingPlatform

Vetrical selection,

Unstructured Search, ...

POS + NER tagging

Query Template Index

Crowd Manager

N

Y

Queries Templ +Answer Types

StructuredLOD Search

Result Joiner

Template Generation

SERP

t1t2t3

Off#line'Complex'QueryDecomposition

Structured Query

Query Logquery

N

Answ

erCo

mpo

sitio

n

LOD Open Data Cloud

Match with existingquery templates

CrowdQ  Architecture  

36  

Off-­‐line:  query  template  genera#on  with  the  help  of  the  crowd  On-­‐line:  query  template  matching  using  NLP  and  search  over  open  data  

Page 37: Entities, Graphs, and Crowdsourcing for better Web Search

Hybrid  Human-­‐Machine  Pipeline  

Gianluca  Demar#ni   37  

Q=  birthdate  of  actors  of  forrest  gump  

Query  annota#on   Noun   Noun   Named  en#ty  

Verifica#on  

En#ty  Rela#ons  

Is  forrest  gump  this  en#ty  in  the  query?  

Which  is  the  rela#on  between:  actors  and  forrest  gump   starring  

Schema  element   Starring                          <dbpedia-­‐owl:starring>    

Verifica#on   Is  the  rela#on  between:  Indiana  Jones  –  Harrison  Ford  Back  to  the  Future  –  Michael  J.  Fox  of  the  same  type  as  Forrest  Gump  -­‐  actors        

Page 38: Entities, Graphs, and Crowdsourcing for better Web Search

Structured  query  genera#on  

SELECT  ?y  ?x  WHERE  {  ?y  <dbpedia-­‐owl:birthdate>  ?x  .  

     ?z  <dbpedia-­‐owl:starring>  ?y  .  

     ?z  <rdfs:label>  ‘Forrest  Gump’  }  

Gianluca  Demar#ni   38  

Results  from  BTC09:  

Q=  birthdate  of  actors  of  forrest  gump  MOVIE  

MOVIE  

Page 39: Entities, Graphs, and Crowdsourcing for better Web Search

Conclusions  

•  Structured  Data  make  Web  Search  be7er  •  Exploit  the  best  out  of  structured  and  unstructured  data  (Hybrid  AOR)  

•  Crowd  can  help  in  understanding  seman#cs  

•  Hybrid  human-­‐machine  systems  (ZenCrowd)  

•  Exploit  Human  Intelligence  at  Scale  (CrowdQ)  

gianlucademartini.net [email protected] Gianluca  Demar#ni   39