
Page 1: 2011 Crowdsourcing Search Evaluation

Crowdsourcing search relevance evaluation at eBay

Brian Johnson, September 28, 2011

 

Page 2: 2011 Crowdsourcing Search Evaluation

Agenda  

• Why
• What
• How
• Cost
• Quality
• Measurement

Page 3: 2011 Crowdsourcing Search Evaluation

Why Ask Real Humans
• They're our customers
  – Sometimes asking is the best way to find out what you want to know
  – Provide ground truth for automated metrics
• Provide data for
  – Experimental Evaluation
    • complements A/B testing, surveys
  – Query Diagnosis
  – Judged Test Corpus
    • Machine Learning
    • Offline evaluation
  – Production Quality Control

Page 4: 2011 Crowdsourcing Search Evaluation

Why Crowdsourcing
• Fast
  – 1-3 days
• Low Cost
  – pennies per judgment
• High Quality
  – Multiple workers
  – Worker evaluation (test questions & inter-worker agreement)
• Flexible
  – Ask anything

Page 5: 2011 Crowdsourcing Search Evaluation

Judgment Volume by Day

Page 6: 2011 Crowdsourcing Search Evaluation

Cost  

Judgments     Cost
1             $0.01
10            $0.10
100           $1.00
1,000         $10.00
10,000        $100.00
100,000       $1,000.00
1,000,000     $10,000.00
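At these rates a study budget is easy to sanity-check before launch. Below is a minimal sketch in Python; the per-judgment price follows the table above, while the redundancy factor (workers asked per query-item pair) is an assumed illustration, not an eBay figure.

# Rough cost estimate for a judging run. price_per_judgment matches the
# table above; workers_per_pair is an assumed redundancy factor.
def estimated_cost(query_item_pairs, workers_per_pair=5, price_per_judgment=0.01):
    return query_item_pairs * workers_per_pair * price_per_judgment

print(estimated_cost(10_000))  # 10,000 pairs x 5 workers x $0.01 = 500.0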

Page 7: 2011 Crowdsourcing Search Evaluation

Who are these workers
• Crowdflower
  – Mechanical Turk
  – Gambit/Facebook
  – TrialPay
  – SamaSource
• LiveOps
• CloudCrowd
  – Facebook

Page 8: 2011 Crowdsourcing Search Evaluation

What Can We Evaluate
• Search Ranking
  – Query > Item
• Item/Image Similarity
  – Item > Item
• Merchandising
  – Query > Item
  – Category > Item
  – Item > Item
• Product Tagging
  – Item > Product
• Category Recommendations
  – Item (Title) > Category

Page 9: 2011 Crowdsourcing Search Evaluation

Crowdsourced Search Relevance Evaluation
• What are we measuring
  – Relevance
• What are we not measuring
  – Value
  – Purchase metrics
  – Revenue

Page 10: 2011 Crowdsourcing Search Evaluation

Industry Standard Sample

• As in the original DCG formulation, we'll be using a four-point scale for relevance assessment:
  – Irrelevant document (0)
  – Marginally relevant document (1)
  – Fairly relevant document (2)
  – Highly relevant document (3)

http://www.sigir.org/forum/2008D/papers/2008d_sigirforum_alonso.pdf

Page 11: 2011 Crowdsourcing Search Evaluation

eBay Search Relevance Crowdsourcing

Page 12: 2011 Crowdsourcing Search Evaluation

Great Match

Page 13: 2011 Crowdsourcing Search Evaluation

Good Match

Page 14: 2011 Crowdsourcing Search Evaluation

Not Matching

Page 15: 2011 Crowdsourcing Search Evaluation

Quality

• Testing
  – Train/test workers before they start
  – Mix test questions into the work mix
  – Discard data from unreliable workers
• Redundancy
  – Cost is low > Ask multiple workers
  – Monitor inter-worker agreement (see the sketch below)
  – Have trusted workers monitor new workers
  – Track worker "feedback" over time
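A minimal sketch of the two checks above (gold test questions and inter-worker agreement), assuming judgments arrive as (worker, unit, label) tuples; the 0.8 accuracy cutoff and the plain observed-agreement measure are illustrative choices, not eBay's production rules.

from collections import defaultdict
from itertools import combinations

def worker_gold_accuracy(judgments, gold):
    # Fraction of each worker's answers on gold test questions that match
    # the answer key. judgments: (worker, unit, label); gold: unit -> label.
    hits, total = defaultdict(int), defaultdict(int)
    for worker, unit, label in judgments:
        if unit in gold:
            total[worker] += 1
            hits[worker] += (label == gold[unit])
    return {w: hits[w] / total[w] for w in total}

def trusted_workers(judgments, gold, min_accuracy=0.8):
    # Discard data from unreliable workers: keep only those at or above
    # an (assumed) accuracy threshold on the gold questions.
    return {w for w, a in worker_gold_accuracy(judgments, gold).items()
            if a >= min_accuracy}

def observed_agreement(judgments):
    # Share of worker pairs giving the same label to the same unit; a
    # chance-corrected statistic (e.g. Fleiss' kappa) is the usual next step.
    by_unit = defaultdict(list)
    for _, unit, label in judgments:
        by_unit[unit].append(label)
    agree = pairs = 0
    for labels in by_unit.values():
        for a, b in combinations(labels, 2):
            pairs += 1
            agree += (a == b)
    return agree / pairs if pairs else 0.0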

Page 16: 2011 Crowdsourcing Search Evaluation

eBay @ SIGIR ’10
Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution

John Le, Andy Edmonds, Vaughn Hester, Lukas Biewald

The use of crowdsourcing platforms like Amazon Mechanical Turk for evaluating the relevance of search results has become an effective strategy that yields results quickly and inexpensively. One approach to ensure quality of worker judgments is to include an initial training period and subsequent sporadic insertion of predefined gold standard data (training data). Workers are notified or rejected when they err on the training data, and trust and quality ratings are adjusted accordingly. In this paper, we assess how this type of dynamic learning environment can affect the workers' results in a search relevance evaluation task completed on Amazon Mechanical Turk. Specifically, we show how the distribution of training set answers impacts training of workers and aggregate quality of worker results. We conclude that in a relevance categorization task, a uniform distribution of labels across training data labels produces optimal peaks in 1) individual worker precision and 2) majority voting aggregate result accuracy.

SIGIR ’10, July 19-23, 2010, Geneva, Switzerland

Page 17: 2011 Crowdsourcing Search Evaluation

Metrics

• There are standard industry metrics
• Designed to measure value to the end user
• Older metrics
  – Precision & recall (binary relevance, no notion of position; sketched below)
• Current metrics
  – Cumulative Gain (overall value of results on a non-binary relevance scale)
  – Discounted (adjusted for position value)
  – Normalized (common 0-1 scale)
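For contrast with the graded, position-aware metrics on the next slides, here is a minimal sketch of the older binary measures; the retrieved/relevant item IDs are hypothetical.

def precision_recall(retrieved, relevant):
    # Binary relevance, no notion of position: an item is simply in or
    # out of the relevant set, wherever it ranks.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(["a", "b", "c", "d"], ["b", "d", "e"]))  # (0.5, 0.666...)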

Page 18: 2011 Crowdsourcing Search Evaluation

Judgment Scale Granularity

Binary          Web Search    SIGIR                  3 Point           4 Point
                Offensive                            -1  Spam          -2  Spam
                                                     -1  Off Topic     -2  Off Topic
0  Irrelevant   Off Topic     Irrelevant              0  Not Matching  -1  Not Matching
                Relevant      Marginally Relevant
1  Relevant     Useful        Fairly Relevant         1  Matching       1  Good Match
                Vital         Highly Relevant                           2  Great Match
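The deck doesn't show how these labels turn into the 0-1 judgment values used in the DCG worked example two slides on. A plausible sketch, assuming eBay's graded labels are mapped onto a 0-1 gain and averaged across the multiple workers who judged each query-item pair (both the gain values and the averaging are assumptions):

# Assumed gain mapping for the eBay 4-point labels; the exact numbers
# are illustrative, not taken from the deck.
GAIN = {"Great Match": 1.0, "Good Match": 0.5,
        "Not Matching": 0.0, "Off Topic": 0.0, "Spam": 0.0}

def judgment(worker_labels):
    # One query-item pair, several workers: average the gains.
    return sum(GAIN[label] for label in worker_labels) / len(worker_labels)

print(judgment(["Great Match", "Good Match", "Great Match"]))  # ~0.83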

Page 19: 2011 Crowdsourcing Search Evaluation

Rank Discount

[Chart: rank discount d = 1/r^constant plotted for ranks 1-10, on a 0.00-1.00 scale.]
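A minimal sketch of the curve on this slide. The slide gives the form d = 1/r^constant but not the constant itself; 0.9 below is an assumed value, chosen because it roughly reproduces the discount column in the worked example on the next slide.

# Positional discount: rank 1 keeps full weight, later ranks count less.
C = 0.9  # assumed exponent; the deck does not state the production value
for rank in range(1, 11):
    print(rank, round(1.0 / rank ** C, 2))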

Page 20: 2011 Crowdsourcing Search Evaluation

Cumulative Gain Metrics

Column definitions (c is the rank discount exponent):
  r       rank
  j       human judgment (0-1)
  cg      cumulative gain: cg(n) = cg(n-1) + j
  d       rank discount: d = 1/r^c
  dcg     discounted cumulative gain: dcg(n) = dcg(n-1) + j*d
  io      ideal rank order, observed: sort(j) descending
  idcgo   ideal DCG, observed: idcgo(n) = idcgo(n-1) + io*d
  ndcgo   normalized discounted cumulative gain, observed: dcg(n) / idcgo(n)
  it      ideal rank order, theoretical: 1
  idcgt   ideal DCG, theoretical: idcgt(n) = idcgt(n-1) + it*d
  ndcgt   normalized discounted cumulative gain, theoretical: dcg(n) / idcgt(n)

  r    j     cg     d      dcg    io     idcgo  ndcgo  it     idcgt  ndcgt
  1    1.0   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
  2    1.0   2.00   0.53   1.53   1.00   1.53   1.00   1.00   1.53   1.00
  3    0.8   2.80   0.37   1.83   1.00   1.90   0.96   1.00   1.90   0.96
  4    0.0   2.80   0.28   1.83   1.00   2.18   0.84   1.00   2.18   0.84
  5    1.0   3.80   0.23   2.06   0.80   2.37   0.87   1.00   2.41   0.85
  6    0.2   4.00   0.20   2.10   0.50   2.47   0.85   1.00   2.61   0.80
  7    0.2   4.20   0.17   2.13   0.20   2.50   0.85   1.00   2.78   0.77
  8    0.5   4.70   0.15   2.21   0.20   2.53   0.87   1.00   2.93   0.75
  9    1.0   5.70   0.14   2.34   0.00   2.53   0.93   1.00   3.07   0.76
  10   0.0   5.70   0.12   2.34   0.00   2.53   0.93   1.00   3.19   0.73
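A minimal sketch reproducing the table's calculation, assuming the judgment values are used directly as gains and the discount is 1/r^c; the exponent c is not stated in the deck, so it is left as a parameter here (0.9 lands close to the discount column above).

def rank_discount(rank, c):
    # d = 1/r^c, as on the Rank Discount slide.
    return 1.0 / (rank ** c)

def dcg(judgments, c):
    # Discounted cumulative gain over the whole list.
    return sum(j * rank_discount(r, c) for r, j in enumerate(judgments, start=1))

def ndcg(judgments, c=0.9, ideal=None):
    # ideal=None       -> "observed ideal": re-sort the observed judgments (ndcgo)
    # ideal=[1.0, ...] -> "theoretical ideal": perfect judgment at every rank (ndcgt)
    if ideal is None:
        ideal = sorted(judgments, reverse=True)
    denominator = dcg(ideal, c)
    return dcg(judgments, c) / denominator if denominator else 0.0

judged = [1.0, 1.0, 0.8, 0.0, 1.0, 0.2, 0.2, 0.5, 1.0, 0.0]  # the j column above
print(round(ndcg(judged), 2))                             # observed-ideal NDCG, ~0.93
print(round(ndcg(judged, ideal=[1.0] * len(judged)), 2))  # theoretical-ideal NDCG, ~0.73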

Page 21: 2011 Crowdsourcing Search Evaluation

Continuous Production Evaluation

• Daily query sampling/scraping to facilitate ongoing monitoring, QA, triage, and post-hoc business analysis (see the sketch below)

[Chart: NDCG over time, broken out by site, category, query, …]
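A minimal sketch of the reporting this enables, assuming each judged query ends up as a record with a date, site, category, and NDCG score; the record shape and field names are hypothetical.

from collections import defaultdict
from statistics import mean

def ndcg_trend(records, by=("date", "site")):
    # Average NDCG per (date, site) -- or any other breakdown -- so a drop
    # in one site or category shows up as a dip in the daily time series.
    buckets = defaultdict(list)
    for rec in records:
        buckets[tuple(rec[k] for k in by)].append(rec["ndcg"])
    return {key: mean(scores) for key, scores in buckets.items()}

sample = [{"date": "2011-09-28", "site": "US", "category": "Shoes", "ndcg": 0.85},
          {"date": "2011-09-28", "site": "US", "category": "Phones", "ndcg": 0.79}]
print(ndcg_trend(sample))  # average NDCG per (date, site), here ~0.82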

Page 22: 2011 Crowdsourcing Search Evaluation

Human Judgment > Query List

Page 23: 2011 Crowdsourcing Search Evaluation

Best Match Variant Comparison

Page 24: 2011 Crowdsourcing Search Evaluation

Best Match Variant Comparison

Page 25: 2011 Crowdsourcing Search Evaluation

Measuring a Ranked List

Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf

Page 26: 2011 Crowdsourcing Search Evaluation

Ranking Evaluation

http://research.microsoft.com/en-us/um/people/kevynct/files/ECIR-2010-ML-Tutorial-FinalToPrint.pdf

Page 27: 2011 Crowdsourcing Search Evaluation

NDCG - Example

Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf

Page 28: 2011 Crowdsourcing Search Evaluation

Open Questions

• Discrete vs. continuous relevance scale
• # of workers
• Distribution of test questions
• Generation of test questions
• Qualification (demographics, interests, region)
• Dynamic worker assignment based on qualification
• Mobile workers (untapped pool)

Page 29: 2011 Crowdsourcing Search Evaluation

References

• Discounted Cumulative Gain
  – http://en.wikipedia.org/wiki/Discounted_cumulative_gain
• http://crowdflower.com/
• http://www.cloudcrowd.com/
• http://www.trialpay.com