© 2009 Amazon.com, Inc. or its Affiliates. Amazon Mechanical Turk Requester Meetup (Panos Ipeirotis – New York University)

New York Mechanical Turk Meetup

Page 1: New York Mechanical Turk Meetup

Amazon Mechanical Turk Requester Meetup (Panos Ipeirotis – New York University)

Page 2: New York Mechanical Turk Meetup

Panos Ipeirotis - Introduction

New York University, Stern School of Business

“A Computer Scientist in a Business School”

http://behind-the-enemy-lines.blogspot.com/

Email: [email protected]

Page 3: New York Mechanical Turk Meetup

Example: Build an Adult Web Site Classifier

Need a large number of hand-labeled sites

Get people to look at sites and classify them as: G (general), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics

– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr

Page 4: New York Mechanical Turk Meetup

Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)

Page 5: New York Mechanical Turk Meetup

Improve Data Quality through Repeated Labeling

– Get multiple, redundant labels using multiple workers
– Pick the correct label based on majority vote
– Probability of correctness increases with number of workers
– Probability of correctness increases with quality of workers

1 worker: 70% correct

11 workers: 93% correct
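The effect of redundancy can be sketched with a simple binomial model. The assumptions here are not from the slides: labels are binary, workers err independently, and every worker has the same accuracy. Under those assumptions, 11 workers at 70% accuracy give a majority-vote accuracy of roughly 92%, in the same range as the 93% quoted above.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n workers is correct.

    Assumes binary labels, independent workers, each correct with
    probability p, and an odd n so that ties cannot occur.
    """
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

print(round(majority_vote_accuracy(0.7, 1), 3))   # 0.7
print(round(majority_vote_accuracy(0.7, 11), 3))  # 0.922
```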

Page 6: New York Mechanical Turk Meetup

But Majority Voting is Expensive

Single Vote Statistics

– MTurk: 2500 websites/hr, cost: $12/hr
– Undergrad: 200 websites/hr, cost: $15/hr

11-vote Statistics

– MTurk: 227 websites/hr, cost: $12/hr
– Undergrad: 200 websites/hr, cost: $15/hr
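The arithmetic behind these numbers is easy to check. The sketch below assumes that the hourly cost stays fixed and that the 2500 labels per hour are simply divided among the votes spent on each site. The function name is illustrative.

```python
def per_site_stats(labels_per_hour: float, cost_per_hour: float, votes_per_site: int = 1):
    """Return (sites labeled per hour, cost per labeled site)."""
    sites_per_hour = labels_per_hour / votes_per_site
    cost_per_site = cost_per_hour / sites_per_hour
    return sites_per_hour, cost_per_site

for name, rate, cost, votes in [
    ("MTurk, single vote", 2500, 12, 1),
    ("MTurk, 11 votes", 2500, 12, 11),
    ("Undergrad", 200, 15, 1),
]:
    sites, dollars = per_site_stats(rate, cost, votes)
    print(f"{name}: {sites:.0f} sites/hr, ${dollars:.4f} per site")
```

With 11 votes per site, the MTurk throughput drops to about 227 sites per hour, which is close to the undergrad's 200.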

Page 7: New York Mechanical Turk Meetup

Using redundant votes, we can infer worker quality

Look at our spammer friend ATAMRO447HWJQ together with 9 other workers

Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…

We can compute error rates for each worker

Error rates for ATAMRO447HWJQ

– P[X → X] = 9.847%, P[X → G] = 90.153%
– P[G → X] = 0.053%, P[G → G] = 99.947%
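One simple way to get such error rates from redundant votes is to treat the majority label of each site as a stand-in for the true label and count how each worker's answers compare with it. This is a simplified sketch, not necessarily the estimation method used in the talk. The worker IDs, site IDs, and votes below are made up for illustration.

```python
from collections import Counter, defaultdict

def majority_labels(votes):
    """votes: list of (worker, item, label). Returns item -> most common label."""
    by_item = defaultdict(list)
    for _worker, item, label in votes:
        by_item[item].append(label)
    return {item: Counter(labels).most_common(1)[0][0] for item, labels in by_item.items()}

def worker_error_rates(votes, worker):
    """Estimate P[true -> assigned] for one worker, using majority labels as truth."""
    truth = majority_labels(votes)
    counts = defaultdict(Counter)
    for w, item, label in votes:
        if w == worker:
            counts[truth[item]][label] += 1
    return {
        true: {assigned: n / sum(row.values()) for assigned, n in row.items()}
        for true, row in counts.items()
    }

# Hypothetical toy data: two careful workers and one who always answers G.
votes = [
    ("w1", "s1", "X"), ("w2", "s1", "X"), ("spam", "s1", "G"),
    ("w1", "s2", "X"), ("w2", "s2", "X"), ("spam", "s2", "G"),
    ("w1", "s3", "G"), ("w2", "s3", "G"), ("spam", "s3", "G"),
]
print(worker_error_rates(votes, "spam"))
```

On this toy data the always-G worker comes out with P[X → G] = 100% and P[G → G] = 100%, the same pattern as the spammer above.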

Page 8: New York Mechanical Turk Meetup

Rejecting spammers and Benefits

Random answers error rate = 50%

Average error rate for ATAMRO447HWJQ: 45.2%

– P[X → X] = 9.847%, P[X → G] = 90.153%
– P[G → X] = 0.053%, P[G → G] = 99.947%

Action: REJECT and BLOCK

Results:

– Over time you block all spammers
– Spammers learn to avoid your HITs
– You can decrease redundancy, as quality of workers is higher

Page 9: New York Mechanical Turk Meetup

After rejecting spammers, quality goes up

– Spam keeps quality down
– Without spam, workers are of higher quality
– Need less redundancy for same quality
– Same quality of results for lower cost

With spam, 1 worker: 70% correct

With spam, 11 workers: 93% correct

Without spam, 1 worker: 80% correct

Without spam, 5 workers: 94% correct
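The trade-off between worker quality and redundancy can be explored with a small search for the smallest odd panel that reaches a target accuracy. This is a sketch that assumes binary labels and independent workers of equal accuracy. The function names and the `max_n` cap are illustrative.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent workers is correct."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

def workers_needed(p: float, target: float, max_n: int = 99):
    """Smallest odd number of workers whose majority vote reaches `target`."""
    for n in range(1, max_n + 1, 2):
        if majority_vote_accuracy(p, n) >= target:
            return n
    return None  # target not reachable within max_n workers

print(workers_needed(0.8, 0.94))  # 5
print(workers_needed(0.7, 0.94))  # needs more than 11 workers
```

Under this model, raising per-worker accuracy from 70% to 80% cuts the panel needed for about 94% accuracy by more than half, which is the cost saving the slide points to.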

Page 10: New York Mechanical Turk Meetup

Correcting biases

Classifying sites as G, PG, R, X

Sometimes workers are careful but biased

Error rates for worker ATLJIK76YH1TF (P denotes PG)

– P[G → G] = 20.0%, P[G → P] = 80.0%, P[G → R] = 0.0%, P[G → X] = 0.0%
– P[P → G] = 0.0%, P[P → P] = 0.0%, P[P → R] = 100.0%, P[P → X] = 0.0%
– P[R → G] = 0.0%, P[R → P] = 0.0%, P[R → R] = 100.0%, P[R → X] = 0.0%
– P[X → G] = 0.0%, P[X → P] = 0.0%, P[X → R] = 0.0%, P[X → X] = 100.0%

Classifies G → P and P → R

Average error rate for ATLJIK76YH1TF: 45.0%

Is ATLJIK76YH1TF a spammer?
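The 45.0% figure can be reproduced by averaging, over the four true classes, the probability that the worker assigns a wrong label. Equal weighting of the classes is an assumption here, but it matches the number on the slide.

```python
def average_error_rate(confusion):
    """Mean over true classes of P[wrong label], with equal class weights."""
    errors = [1.0 - row[true] for true, row in confusion.items()]
    return sum(errors) / len(errors)

# Error rates for ATLJIK76YH1TF, as confusion[true][assigned].
biased_worker = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
print(f"{average_error_rate(biased_worker):.1%}")  # 45.0%
```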

Page 11: New York Mechanical Turk Meetup

Correcting biases

For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error rate (technical details omitted)

Non-recoverable error rate for ATLJIK76YH1TF: 9%

Error rates for worker ATLJIK76YH1TF

– P[G → G] = 20.0%, P[G → P] = 80.0%, P[G → R] = 0.0%, P[G → X] = 0.0%
– P[P → G] = 0.0%, P[P → P] = 0.0%, P[P → R] = 100.0%, P[P → X] = 0.0%
– P[R → G] = 0.0%, P[R → P] = 0.0%, P[R → R] = 100.0%, P[R → X] = 0.0%
– P[X → G] = 0.0%, P[X → P] = 0.0%, P[X → R] = 0.0%, P[X → X] = 100.0%
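To see why most of this worker's errors are recoverable, Bayes' rule can be applied to the error rates to ask which true labels are plausible given what the worker reported. The talk omits how the 9% is computed, and this sketch does not try to reproduce that number. The uniform prior over classes is an assumption, and the function name is illustrative.

```python
def posterior_given_report(confusion, prior, reported):
    """P(true label | worker reported `reported`), via Bayes' rule.

    confusion[true][assigned] holds the worker's error rates.
    """
    joint = {true: prior[true] * confusion[true].get(reported, 0.0) for true in confusion}
    total = sum(joint.values())
    if total == 0:
        return None  # the worker never produces this label under the model
    return {true: value / total for true, value in joint.items()}

biased_worker = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
uniform = {label: 0.25 for label in biased_worker}  # assumed prior

print(posterior_given_report(biased_worker, uniform, "P"))
print(posterior_given_report(biased_worker, uniform, "R"))
```

Under these assumptions, a reported P always means the site is really G, so that mistake can be undone. A reported R is ambiguous between P and R, and that is where the remaining error comes from.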

Page 12: New York Mechanical Turk Meetup

Too much theory?

Open source implementation available at: http://code.google.com/p/get-another-label/

Input:

– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)

Output:

– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
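To illustrate why misclassification costs are part of the input, here is a minimal sketch of cost-sensitive label selection: given a probability for each true label, pick the label with the lowest expected cost. This is not the tool's API. The function name, the cost values, and the probabilities are hypothetical.

```python
def min_cost_label(posterior, cost):
    """Pick the label with the lowest expected cost.

    posterior[true] is the probability of each true label;
    cost[true][assigned] is the penalty for assigning `assigned`.
    """
    labels = list(cost)

    def expected_cost(assigned):
        return sum(posterior[true] * cost[true][assigned] for true in labels)

    return min(labels, key=expected_cost)

# Hypothetical costs: calling a porn site "general" is 10x worse than the reverse.
asymmetric = {"X": {"X": 0, "G": 10}, "G": {"X": 1, "G": 0}}
symmetric = {"X": {"X": 0, "G": 1}, "G": {"X": 1, "G": 0}}
posterior = {"X": 0.3, "G": 0.7}

print(min_cost_label(posterior, symmetric))   # G
print(min_cost_label(posterior, asymmetric))  # X
```

With equal costs the more likely label G wins. Once X → G is made costlier, the safer label X is chosen even though it is less likely.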

Alpha version, more improvements to come! Suggestions and collaborations welcomed!