Crowdsourcing search relevance evaluation at eBay
Brian Johnson, September 28, 2011
Agenda
• Why
• What
• How
• Cost
• Quality
• Measurement
Why Ask Real Humans
• They're our customers
  – Sometimes asking is the best way to find out what you want to know
  – Provide ground truth for automated metrics
• Provide data for
  – Experimental Evaluation
    • complements A/B testing, surveys
  – Query Diagnosis
  – Judged Test Corpus
    • Machine Learning
    • Offline evaluation
  – Production Quality Control
Why Crowdsourcing
• Fast
  – 1-3 days
• Low Cost
  – pennies per judgment
• High Quality
  – Multiple workers
  – Worker evaluation (test questions & inter-worker agreement)
• Flexible
  – Ask anything
Judgment Volume by Day
Cost
Judgments    Cost
1            $0.01
10           $0.10
100          $1.00
1,000        $10.00
10,000       $100.00
100,000      $1,000.00
1,000,000    $10,000.00
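At a flat per-judgment price, the table above is linear in volume. A minimal sketch; the $0.01 default matches the table, while the redundancy factor is an illustrative addition (multiple workers per item multiply the bill):

```python
def judgment_cost(judgments, price=0.01, workers_per_item=1):
    """Total cost at a flat per-judgment price.

    workers_per_item is an illustrative redundancy factor: asking several
    workers the same question multiplies the bill.
    """
    return judgments * workers_per_item * price

print(f"${judgment_cost(1_000_000):,.2f}")  # $10,000.00
```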
Who are these workers
• Crowdflower
  – Mechanical Turk
  – Gambit/Facebook
  – TrialPay
  – SamaSource
• LiveOps
• CloudCrowd
  – Facebook
What Can We Evaluate
• Search Ranking
  – Query > Item
• Item/Image Similarity
  – Item > Item
• Merchandising
  – Query > Item
  – Category > Item
  – Item > Item
• Product Tagging
  – Item > Product
• Category Recommendations
  – Item (Title) > Category
Crowdsourced Search Relevance Evaluation
• What are we measuring
  – Relevance
• What are we not measuring
  – Value
  – Purchase metrics
  – Revenue
Industry Standard Sample
• As in the original DCG formulation, we'll be using a four-point scale for relevance assessment:
  – Irrelevant document (0)
  – Marginally relevant document (1)
  – Fairly relevant document (2)
  – Highly relevant document (3)
http://www.sigir.org/forum/2008D/papers/2008d_sigirforum_alonso.pdf
eBay Search Relevance Crowdsourcing
Great Match
Good Match
Not Matching
Quality
• Testing
  – Train/test workers before they start
  – Mix test questions into the work mix
  – Discard data from unreliable workers
• Redundancy
  – Cost is low > ask multiple workers
  – Monitor inter-worker agreement
  – Have trusted workers monitor new workers
  – Track worker "feedback" over time
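Redundant judgments plus agreement monitoring can be sketched in a few lines. The labels and the five-worker example below are hypothetical, and percent agreement with the modal label is just one crude reliability signal:

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one item's redundant worker labels by majority vote."""
    return Counter(labels).most_common(1)[0][0]

def agreement(labels):
    """Share of workers who agree with the modal label (a crude quality signal)."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical labels from five workers for one query-item pair
labels = ["Great Match", "Great Match", "Good Match", "Great Match", "Not Matching"]
print(majority_vote(labels), agreement(labels))  # Great Match 0.6
```

Items whose agreement falls below some threshold can be re-routed to trusted workers, mirroring the monitoring bullet above.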
eBay @ SIGIR '10
Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution
John Le, Andy Edmonds, Vaughn Hester, Lukas Biewald
The use of crowdsourcing platforms like Amazon Mechanical Turk for evaluating the relevance of search results has become an effective strategy that yields results quickly and inexpensively. One approach to ensure quality of worker judgments is to include an initial training period and subsequent sporadic insertion of predefined gold standard data (training data). Workers are notified or rejected when they err on the training data, and trust and quality ratings are adjusted accordingly. In this paper, we assess how this type of dynamic learning environment can affect the workers' results in a search relevance evaluation task completed on Amazon Mechanical Turk. Specifically, we show how the distribution of training set answers impacts training of workers and aggregate quality of worker results. We conclude that in a relevance categorization task, a uniform distribution of labels across training data labels produces optimal peaks in 1) individual worker precision and 2) majority voting aggregate result accuracy.
SIGIR '10, July 19-23, 2010, Geneva, Switzerland
Metrics
• There are standard industry metrics
• Designed to measure value to the end user
• Older metrics
  – Precision & recall (binary relevance, no notion of position)
• Current metrics
  – Cumulative Gain (overall value of results on a non-binary relevance scale)
  – Discounted (adjusted for position value)
  – Normalized (common 0-1 scale)
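The "no notion of position" point can be seen in one line of Python. A minimal sketch with illustrative binary judgment lists: precision at k is identical no matter where in the top k the relevant results sit:

```python
def precision_at_k(binary_rels, k):
    """P@k: fraction of the top k results judged relevant (binary scale)."""
    return sum(binary_rels[:k]) / k

# Position-blindness: a relevant result at rank 1 or rank 3 scores the same
print(precision_at_k([1, 0, 1], 3) == precision_at_k([0, 1, 1], 3))  # True
```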
Judgment Scale Granularity
Scales compared (Binary / Web Search / SigIR / 3 Point / 4 Point), grouped from worst to best grade:
• Offensive (-1), Spam (-2 / -1)
• Off Topic (-2 / 0), Irrelevant (0), Not Matching (-1 / 0)
• Relevant, Marginally Relevant (1), Useful, Fairly Relevant (1), Matching (1), Good Match
• Vital, Highly Relevant (2), Great Match
Rank Discount
• d = 1/r^constant
[chart: discount d vs. rank, for ranks 1-10 on a 0.00-1.00 scale]
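The discount curve is a one-liner. A sketch; the deck leaves the constant unspecified, so the exponent below is an illustrative choice:

```python
def rank_discount(r, c=1.0):
    """Positional discount d = 1 / r**c; larger c devalues deep ranks faster."""
    return 1.0 / r ** c

# Discounts for the top 10 ranks with the illustrative c = 1.0
print([round(rank_discount(r), 2) for r in range(1, 11)])
```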
Cumulative Gain Metrics

Column definitions (n = current rank):
• r: rank
• j: human judgment (0-1)
• cg: cumulative gain, cg(n) = cg(n-1) + j
• d: rank discount, d = 1/r^c
• dcg: discounted cumulative gain, dcg(n) = dcg(n-1) + j·d
• io: observed ideal rank order, sort(j)
• idcgo: observed ideal DCG, idcgo(n) = idcgo(n-1) + io·d
• ndcgo: observed normalized DCG, dcg(n) / idcgo(n)
• it: theoretical ideal judgment, 1
• idcgt: theoretical ideal DCG, idcgt(n) = idcgt(n-1) + it·d
• ndcgt: theoretical normalized DCG, dcg(n) / idcgt(n)

r   j    cg    d     dcg   io    idcgo  ndcgo  it    idcgt  ndcgt
1   1.0  1.00  1.00  1.00  1.00  1.00   1.00   1.00  1.00   1.00
2   1.0  2.00  0.53  1.53  1.00  1.53   1.00   1.00  1.53   1.00
3   0.8  2.80  0.37  1.83  1.00  1.90   0.96   1.00  1.90   0.96
4   0.0  2.80  0.28  1.83  1.00  2.18   0.84   1.00  2.18   0.84
5   1.0  3.80  0.23  2.06  0.80  2.37   0.87   1.00  2.41   0.85
6   0.2  4.00  0.20  2.10  0.50  2.47   0.85   1.00  2.61   0.80
7   0.2  4.20  0.17  2.13  0.20  2.50   0.85   1.00  2.78   0.77
8   0.5  4.70  0.15  2.21  0.20  2.53   0.87   1.00  2.93   0.75
9   1.0  5.70  0.14  2.34  0.00  2.53   0.93   1.00  3.07   0.76
10  0.0  5.70  0.12  2.34  0.00  2.53   0.93   1.00  3.19   0.73
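Those recurrences fit in a few lines of Python. A sketch assuming the 1/r^c discount, with an illustrative c = 1.0; the deck does not state its constant, so the numbers here will not exactly match the table's d column:

```python
def dcg(judgments, c=1.0):
    """Discounted cumulative gain with a 1/r**c positional discount."""
    return sum(j / r ** c for r, j in enumerate(judgments, start=1))

def ndcg(judgments, c=1.0, theoretical=False):
    """Normalize by the observed ideal (same judgments, best-first) or the
    theoretical ideal (a perfect all-1.0 result list)."""
    ideal = [1.0] * len(judgments) if theoretical else sorted(judgments, reverse=True)
    return dcg(judgments, c) / dcg(ideal, c)

# The human-judgment column from the table
j = [1.0, 1.0, 0.8, 0.0, 1.0, 0.2, 0.2, 0.5, 1.0, 0.0]
observed, theoretical = ndcg(j), ndcg(j, theoretical=True)
```

The theoretical normalization is always the harsher of the two, since a perfect all-1.0 list is at least as good as the best reordering of the observed judgments.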
Continuous Production Evaluation
• Daily query sampling/scraping to facilitate ongoing monitoring, QA, triage, and post-hoc business analysis
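The daily sampling step can be sketched minimally, assuming a simple random sample; the deck does not specify the sampling scheme, and the function name, batch size, and seed are illustrative:

```python
import random

def daily_query_sample(query_log, n=500, seed=None):
    """Draw a simple random sample of the day's queries to send out for judgment.

    Illustrative assumption: uniform sampling without replacement; a production
    scheme might stratify by site, category, or query frequency instead.
    """
    rng = random.Random(seed)
    return rng.sample(query_log, min(n, len(query_log)))

todays_batch = daily_query_sample(["ipod nano", "gucci bag", "vintage camera"], n=2, seed=7)
```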
[chart: NDCG over time, by site, category, query …]
Human Judgment > Query List
Best Match Variant Comparison
Measuring a Ranked List
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented in The 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
Ranking Evaluation
http://research.microsoft.com/en-us/um/people/kevynct/files/ECIR-2010-ML-Tutorial-FinalToPrint.pdf
NDCG – Example
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented in The 1st IEEE International Conference on Social Computing (SocialCom'09), 2009. http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
Open Questions
• Discrete vs. continuous relevance scale
• # of workers
• Distribution of test questions
• Generation of test questions
• Qualification (demographics, interests, region)
• Dynamic worker assignment based on qualification
• Mobile workers (untapped pool)
References
• Discounted Cumulative Gain
  – http://en.wikipedia.org/wiki/Discounted_cumulative_gain
• http://crowdflower.com/
• http://www.cloudcrowd.com/
• http://www.trialpay.com