
Mechanical Turk Under the Hood

Marc Schwarz, Ph.D.

UXPA Conference 2016

Seattle, WA

© Copyright 2016 Marc Schwarz
All Rights Reserved
Graphics Copyright Their Respective Owners
Utilized for Educational Purposes Under Fair Usage Laws

Disclaimer

• I have previously worked as a contractor at Amazon on multiple occasions
  • This deck is in accordance with my various NDAs
• The intent of this talk is to share my experiences & nuances with:
  1. Working with the Mechanical Turk front-end
  2. Managing and working with Turkers
  3. Managing live studies & reviewing their status
     a. Some thoughts on activities well-suited for the Mechanical Turk environment
  4. Using Mechanical Turk as a supplemental tool for managing aspects of traditional usability studies
• This deck is certainly not the “Gospel According to Marc”
• The talk is not intended to be either a recommendation or endorsement for Mechanical Turk or third parties who provide Mechanical Turk vendor services
• Also, not intended to be an “Intro to Using” or “How To Plan & Run Your Study” talk
  • Lots of these already exist

2

My Background

• Have an MS in Computer Science, with a concentration in AI Natural Language Computing

• Transitioned from “Classical AI” to complete a Ph.D. with a concentration in Cognitive Psychology & Instructional Systems Design
  • Implicitly, this is “Applied AI”
• Worked as a Tactical User Researcher for about 8 years
  • Completed about 70 UX lab studies, and 35+ Mechanical Turk studies

• Quick Poll: How many people consider euthanasia a particularly contentious topic? Show of hands, please.

3

The Difficulty With Disambiguation – AI is Hard

Originally used in a Benny Hill sketch and “borrowed” by SNL

4

Mechanical Turk – Brief Introduction to AAI

Amazon Mechanical Turk Login “Splash” Page

5

Mechanical Turk – The Basics: What it is

• Amazon’s “High Concept” premise is that Mechanical Turk is an internet-based source for “Artificial Artificial-Intelligence”
  • Provide a marketplace where Requesters can post tasks that Workers can perform
• Specifically, tasks a computer can’t handle
  • Example: When I show you any legal English string, please tell me the number of syllables it has
• Requesters – Post “Human Intelligence Tasks” (AKA HITs) to be completed for a denoted payment consideration upon completion
• Workers – Select HITs to perform and are paid upon submission and Requester’s review & approval of completed work
  • Workers are informally known as “Turkers”

6

Process & Mechanisms

• Front End – Mechanical Turk “Proper”
  • Requesters: Create & Manage Tasks; Manage & Pay Workers; Manage Payment Account
  • Workers: Select Tasks; Submit Verification Numbers; Manage Payment Receivables
• Back End – Link to external website where the “work” is completed
  • For Requesters:
    • Link to a web site created from scratch, or in conjunction with some third party service provider
    • Important to Note: Mechanical Turk is NOT an end-to-end turn-key solution
    • You the User Researcher must provide the activity back-end
  • For Workers: The back end is the place to
    • Either perform work or upload completed work
    • Retrieve Task Verification Numbers (so that they are paid)

7

1. Working With the Mechanical Turk Front-End

Root-Level Page for Creating a New Project – Clicking Create Project Button will display screen on Slide 10

8

Within the Mechanical Turk “Front End” UI

• Requesters Can
  • Create a New Study Project Description
  • Manage Worker Logs
  • Manage Live Studies and Review Their Status
• Important to Note: The Front End Handles the Bookkeeping
  • Actual study materials/protocol is completely external to the front end
  • Your Project Description contains a link to the external Project Proper
  • Upon completing the project, the worker is provided with a verification number
  • The worker inputs that number into a form on the HIT Screen within the Worker UI front end
  • The verification number is the underlying link between the front end Description and the back end Project Proper

9

Creating a New Study Project Description

• Requesters Create a New Study Project by:
  • Entering Study “Properties”
  • Setting the “Design Layout”
  • Previewing the “Back End” Link

The Enter Properties Screen – Above the Fold

10

Creating New Study Description – Properties

• Entering Study Properties Entails
  • Providing a Title and Description of the Human Intelligence Task
    • Important:
      1. Lead with a Verb and clearly delineate the type of activity and the related topic/subject
      2. Pretend your Editor is Jethro Bodine or Britney Spears, and write for a 6th Grade Education
    • Example: Answer a brief survey about listening to music on a mobile device
  • Determining the N; Time on Task; Gratuity Amount
    • Strong Suggestion: Determine your Target N and add 15-20% to yield an Actual N
    • Retain the first Target-N usable submissions for your analysis
    • From personal experience, I typically had to throw away 15%+ of responses because they were unusable
    • Examples: Rating everything a “5” (even the inverted questions); Selecting a corner or blank space
  • Setting Worker Requirements – Some Recommendations
    • Set Geo-Location to the country where your primary target segment(s) reside (e.g., USA)
    • Do NOT require workers to be “Masters” (Requiring Masters is the default, so be sure to uncheck box)
    • Set Visibility to either Private or Hidden (so that only qualified workers can preview HIT details)
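The Target-N arithmetic and the "everything rated 5" check above can be sketched in a few lines of Python. This is only an illustration: the function names, the `ratings` field, and the whole-percent buffer are assumptions made for the example, not anything Mechanical Turk provides.

```python
def actual_n(target_n: int, buffer_pct: int = 20) -> int:
    """Submissions to request: Target N plus a 15-20% cushion, rounded up."""
    extra = (target_n * buffer_pct + 99) // 100  # integer ceiling of the cushion
    return target_n + extra


def is_straightlined(ratings: list[int]) -> bool:
    """True when every rating is identical (e.g. all 5s, inverted items included)."""
    return len(ratings) > 1 and len(set(ratings)) == 1


def retain_first_usable(submissions: list[dict], target_n: int) -> list[dict]:
    """Keep the first target_n submissions, in arrival order, that pass the check.

    Each submission is assumed to be a dict with a 'ratings' list (hypothetical shape).
    """
    usable = [s for s in submissions if not is_straightlined(s["ratings"])]
    return usable[:target_n]
```

With a Target N of 20 and a 20% cushion, `actual_n(20, 20)` asks for 24 submissions; the retained set is then cut back to the first 20 usable ones.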

11

Creating New Study Description – Design Layout

• Setting the “Design Layout” entails
  • Authoring Task Instructions
    • Again, channel your inner 6th grader
    • Be sure to clearly delineate any Restrictions or Qualifications – but use gentle language
    • Examples:
      • First Time Participants Only – If you’ve already taken one of our surveys, we might not accept and pay you for your subsequent submission
      • Please take our Qualification Survey if you would like to work on this task
  • Providing the link to the external Project Proper
    • Important: DO NOT DISPLAY THE URL as the link
    • Click on the Source Button in the UI, and edit so the link displays a brief descriptor, but not the URL
    • Displaying the URL is an open invitation to be gamed – cheats will go directly to your link
  • Including a form to input the Survey Code
    • Note: the Survey Code is the only proof that the Worker actually completed the task
    • If a worker leaves the form blank, always send a polite note about the importance of including the Survey Code, and never approve the submission for payment, to prevent gaming

12

Edit Project Screen – Design Layout

The Design Layout Screen

13

Creating New Study Description – Preview

• Previewing your project entails
  • Checking that the link to the Project Proper is live and typo free
    • Strongly Recommend: Doing a stakeholder cog walk prior to going live
    • “Masters granted” is a default qualification unless you deselect it in the Enter Properties tab
  • Checking that the Instructions section to the project is well written & typo free
    • Recommend that you have 2-3 colleagues proof & review Instructions section

14

The Preview Screen

Summary SWOT of Mechanical Turk

• Strengths

• Weaknesses

• Opportunities

• Threats

15

Mechanical Turk – Strengths

• Aside from some fundamental Right to Privacy restrictions and guards to protect workers from known internet scams, your backend link can go practically anywhere on the web, and your task in theory is limited only by your imagination

• It is possible to assign piece-work at very low costs

• Provides a good forum to get “quick & dirty” data on the cheap

16

Mechanical Turk – Weaknesses

• Workers Might Not
  • Represent your Targeted User Segments
  • Be able to financially afford your product
    • Possibility of a false positive/negative data submission because worker is non-customer
  • Have any intrinsic interest in your product or service
    • Might only be interested in doing what they need to earn your payment
• Workers Might
  • Make choices and provide opinions based on what they think you want to hear
  • Simply go through the motions, and provide derivative or random feedback
  • Skip steps and rush through activities without any conscientious thought
  • Intentionally provide disinformation, just to do it

17

Mechanical Turk – Opportunities

• Provides a venue for
  • Testing early stage concepts quickly & inexpensively
  • Conducting research about your competition “anonymously”
  • Generating “Big Data” on the fly
    • Example: Social Media companies can insert links within content to garner feedback
  • Non-technical Requesters to obtain data via third party vendor provided tools
  • Off-loading time intensive tasks such as phone screening (more on this later)

18

Mechanical Turk – Threats

• Using MT successfully requires more than a modicum of vigilance
• Garbage In Garbage Out – Requesters have to guard against:
  • “Gamed” or otherwise “Bad” data
    • Many workers will rush and skip steps if they think they can get away with it
  • Publicity
    • From personal experience, publicity attracts a lower-quality worker
  • Getting any kind of reputation
    • It’s almost always better to quietly pay for data that will be thrown out than to risk negative feedback
  • False Positives & Satisfiers
    • A “Satisfier” is a worker who tells you what he/she thinks you want to hear
    • Example: “That’s awesome! I’d absolutely pay $300 for a narrated 3D video of my cat’s autopsy”
• Nothing is explicitly provided to prevent Turkers from multiple participation
  • Creating Screening guards is not intuitive (will be discussed in a later slide)

19

2. Managing and Working With Turkers

Photo Source: http://www.wired.com/2008/12/anonymity-for-sale-on-mechanical-turk/

20

Typical Turker Characteristics

• It’s all about the Franklins

• Not uncommon for Turkers to be stay-at-home, students, or care-givers

• For most Turkers, MT provides a source for self-paced supplemental income

• Generally, Turker discussion forums tend to focus on HITs that are• Paying good rates

• Easy to finish

• From requesters that pay promptly

• Note: I’ve never seen a thread call-out HITs that were “fun” or “stimulating”

• Important to remember that these individuals are literally working for nickels and dimes

21

Turker Attitudes & Motivation

Source: http://www.slideshare.net/lirani/agency-and-exploitation-in-amazon-mechanical-turk

22

Sampling of Sites Where Turkers Congregate

• MTURKGRIND
  • http://www.mturkgrind.com/
• TurkerNation
  • www.turkernation.com/
• Reddit: Mechanical Turk Blog
  • https://www.reddit.com/r/mturk
• Reddit: HITs Worth Turking For
  • https://www.reddit.com/r/HITsWorthTurkingFor/
  • Note: Reddit and their Blogs typically pay contributors for entries that receive heavy traffic and heavy vote-up ranking promotions
  • It’s possible to earn more for sharing a task link than actually doing the task
• Turkopticon
  • https://turkopticon.ucsd.edu/

23

A Word About Turker Discussion Forums

• Rule of Thumb: ALL PUBLICITY IS BAD
  • “Positive” Publicity Encourages Gamers & Satisfiers
  • “Negative” Publicity can be Highly Alienating (can scare away “good” workers)
    • Never “Block” a Worker
    • Never Deny a Payment
• What to do when you get “junk” data
  • Pay the Bad Worker(s) – Try to practice “just pay and put to sleep”
    • Paid workers don’t Bitch & Moan on Public Forums
  • Classify worker as a “BW = 1” (or whatever masked coding you prefer)
  • If needed, run a “filler study” and take the first X viable/quality responses that get you to your intended Target-N
    • Example: If your Target N = 20 and you’ve tossed 5 submissions, you could then run a filler study where you would retain the first 5 viable responses, which would then put your N back at 20. Alternatively, bump up your N by 20% to begin with, and just retain the first Target-N responses.
• As mentioned, expect to toss 15% to 20% of your data for every HIT

24

INSTRUCTIONS: Please click on a page feature and explain why you selected it.

Example of a Bad Response

Source: Synthesized from www.Disney.com/

25

Unusable Response on Next Slide

Explain why you selected your choice:

“It was interesting.”

Sample Bad Data Response

26

Explanation: This is Bad Data because the worker selected “dead air” outside of the target area, and provided an explanation that was devoid of any contextual value.
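A rough first-pass screen for the two problems named in the explanation above (a click in "dead air" and a contentless justification) might look like the following sketch. The `Box` type, the pixel coordinates, and the five-word minimum are all assumptions chosen for illustration; flagged responses would still need a human look before being classified as bad data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle for one clickable page feature (hypothetical layout data)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def flag_response(click_xy: tuple[int, int], explanation: str,
                  targets: list[Box], min_words: int = 5) -> list[str]:
    """Return the reasons a response looks unusable; an empty list means no flags."""
    reasons = []
    x, y = click_xy
    if not any(box.contains(x, y) for box in targets):
        reasons.append("click outside every target area")
    if len(explanation.split()) < min_words:
        reasons.append("explanation too short to be informative")
    return reasons
```

The sample response on this slide (a click outside the target area plus "It was interesting.") would trip both flags under these assumed thresholds.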

User Clicked Here

3. Managing Live Studies & Reviewing Status

Source: http://journal.code4lib.org/articles/6004

27

Managing & Reviewing Studies – Potential Issues

• Payment Rate is a VERY contentious topic
  • Some Turkers have tried organizing and lobbying for “fair market turking”
• Important to have a Screening tool to guard against repeated participation
  • Workers will try to double-dip if they are allowed to
  • Important to be very methodical with keeping Screener Categories up to date
• There is a constant need to check for bogus data
• Time factors related to when you launch a study can affect your data

28

Some Study Launch Considerations

• Launching studies to run over the weekend or holidays is risky
  • Weekend Warriors are different – data tends to be all over the place compared to studies that launch on a Tuesday Morning
  • Launching late Friday afternoon will attract weekend warriors
  • Studies set to run during legal holidays will also attract weekenders
• Launching on a Monday can be risky for the same reason
  • Experience has shown that “Monday Turkers” were both more impulsive and more likely to be satisfiers
• The time a study is launched can also affect the data distribution
  • Launching early morning on West Coast is different than an early morning launch on the East Coast
• Generally, the “best time” to launch is from 9AM to 10AM for a targeted Geo-Loc
  • Best days are Tue, Wed, Thurs (early Friday is okay, but you want the HIT finished by 4PM)
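The launch guidance above can be written down as a small pre-flight check. This is a sketch of the rule of thumb only: the function name is made up, both timestamps are assumed to already be in the target Geo-Loc's local time, and holidays are not modeled.

```python
from datetime import datetime, time

GOOD_DAYS = {1, 2, 3}   # Tuesday, Wednesday, Thursday (Monday == 0)
FRIDAY = 4
WINDOW_OPEN, WINDOW_CLOSE = time(9, 0), time(10, 0)
FRIDAY_CUTOFF = time(16, 0)


def is_recommended_launch(local_start: datetime, local_end: datetime) -> bool:
    """Check a planned launch against the 9-10AM, Tue-Thu (early Friday) rule of thumb."""
    if not WINDOW_OPEN <= local_start.time() <= WINDOW_CLOSE:
        return False
    day = local_start.weekday()
    if day in GOOD_DAYS:
        return True
    if day == FRIDAY:
        # Friday is acceptable only if the HIT is expected to wrap up by 4PM the same day.
        return local_end.date() == local_start.date() and local_end.time() <= FRIDAY_CUTOFF
    return False  # Monday and weekend launches are treated as risky
```

For example, a Tuesday 9:30AM launch passes, while a Monday 9:30AM launch or a Friday launch expected to run past 4PM does not.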

29

Suggested Pay Rates

• Objective: Pay well enough to encourage participation, but not so well that you create a publicity stampede

• Rule of Thumb: 3 to 5 cents per click, or short activity (e.g., Steps 2 & 3)
• Example: 15 Cent HIT with opportunity for 5 Cent Bonus
  • Step 1. Click the Diagram You Preferred.
  • Step 2. In 3 to 15 sentences please explain why you made your selection.
  • Step 3. Please provide your level of experience or familiarity with <Context>.
  • Bonus. Paid on a case-by-case basis for high quality, thoughtful submissions
• Some approaches to setting rates
  • Billing by estimated time-on-task
  • Billing by number of interactions
  • Utilizing Bonuses to reward quality work

30

Some Approaches to Setting Pay Rates

• Billing by Estimated Time
  • $6.00/Hour = $0.10/minute (i.e., $1.00/10 Minutes – Baseline of “fair turking”)
  • $15.00/Hour = $0.25/minute (Minimum Wage in Seattle; $2.50/10 Minutes)
• Billing by Interaction
  • $6.00/Hour = 2 mills/click (i.e., $0.0016/click, rounded-up)
  • $15.00/Hour = $0.025/click
• Using Bonuses (Token In-Exchange for Preferred Service)
  • Reward for Quality Work (especially if it goes above & beyond)
  • Don’t have to deny an initial, smaller payment to workers who skip steps or “phone it in”
  • Well suited for a more involved “try it out in a store and tell us your thoughts” task
  • Crucial that expectations for bonus are clearly delineated
• Modulo 10 Clustering – Try to group your workers into sets of 10
  • Amazon’s 20% fee doubles when N > 10 (current pricing model)
  • E.g., Bumped-Up N = 25 done as 3 studies where N = 8, 8, 9 (run serially; do paired t-tests)
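The time-based rates and the batching idea above reduce to simple arithmetic, sketched here in whole cents to avoid rounding surprises. The function names are invented for the example, and the fee percentages and the N > 10 cutoff are taken from this slide and exposed as parameters; Amazon's published pricing should be checked before budgeting with them.

```python
def reward_cents(hourly_cents: int, minutes: int) -> int:
    """Reward for a task of the estimated length, rounded up to a whole cent."""
    return (hourly_cents * minutes + 59) // 60


def fee_cents(reward_each: int, n: int, base_pct: int = 20,
              surcharge_pct: int = 20, max_without_surcharge: int = 10) -> int:
    """Requester fee for one batch; the surcharge applies once n exceeds the cutoff."""
    pct = base_pct if n <= max_without_surcharge else base_pct + surcharge_pct
    return (reward_each * n * pct + 99) // 100


def split_batches(n: int, max_size: int = 10) -> list[int]:
    """Split n workers into the fewest batches of at most max_size, sized as evenly as possible."""
    if n <= 0:
        return []
    count = (n + max_size - 1) // max_size
    base, extra = divmod(n, count)
    return [base] * (count - extra) + [base + 1] * extra
```

Under these assumptions, `split_batches(25)` gives `[8, 8, 9]`, matching the example on the slide, and at 15 cents per HIT the three batches cost 75 cents in fees versus 150 cents for a single batch of 25.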

31

Managing Turker Payments

• Try to approve payments the same day that your activity is completed
• Always review your data set before doing any payment approvals
• Also note that Study/Activity Data is accessed externally from the MT Front-End
• Every data set from a worker will have a corresponding survey code
  • Survey Codes are visible in the MT Front-End within a Batch Detail Listing
  • Accessed via clicking the Results Button for a given Study Batch
• Not uncommon to have data entries with missing or bogus survey codes
  • Only retain data from workers who have survey codes that match the MT Batch Listing
  • Only pay workers explicitly denoted in the MT Batch Listing
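The matching step described above, where back-end data is kept only if its survey code also appears in the MT Batch Listing, can be sketched as a small reconciliation function. The data shapes (a worker-to-code mapping exported from the batch results, and back-end rows carrying a `survey_code` field) are assumptions made for this example.

```python
def reconcile(batch_codes: dict[str, str],
              study_rows: list[dict]) -> tuple[list[dict], set[str]]:
    """Match back-end study rows to the codes workers typed into the HIT form.

    batch_codes: worker id -> survey code as entered in the MT front end (assumed export).
    study_rows:  rows from the external back end, each with a 'survey_code' key (assumed).
    Returns (rows to retain for analysis, worker ids whose codes matched).
    """
    code_to_worker = {code.strip(): worker
                      for worker, code in batch_codes.items() if code.strip()}
    retained, matched_workers = [], set()
    for row in study_rows:
        worker = code_to_worker.get(row.get("survey_code", "").strip())
        if worker is not None:
            retained.append(row)
            matched_workers.add(worker)
    return retained, matched_workers
```

Rows with blank or unknown codes fall out of the retained set, and workers who left the form blank never appear in the matched set.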

Batch Detail List

32

Screening Worker Candidates

• MT lets you create Categories for managing worker access to HITs
  • To create a Category, click the Qualification Types link within the Manage tab
• Potential Uses
  • Screen for prior participation within a given study type
  • Blacklisting for gaming behavior
  • Profile-Qualifying workers
    • Example: Want individuals proficient in Halo
    • Administer 20-27 question pretest on gaming (Don’t share scoring criteria)
    • Intersperse questions about Halo that only an advanced/proficient player would know
    • People who score 85%+ on the Halo questions would qualify
  • Denote MVPs
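The Halo screener above scores only the interspersed target questions, not the whole pretest. A minimal sketch of that scoring rule follows; the question ids, the answer-key format, and the function name are hypothetical.

```python
def qualifies(answers: dict[str, str], answer_key: dict[str, str],
              target_ids: set[str], cutoff_pct: int = 85) -> bool:
    """Score only the target (e.g. Halo) questions hidden inside a longer pretest."""
    scored = [q for q in target_ids if q in answer_key]
    if not scored:
        return False
    correct = sum(1 for q in scored if answers.get(q) == answer_key[q])
    # Integer comparison avoids floating-point percentages.
    return correct * 100 >= cutoff_pct * len(scored)
```

With seven target questions, six correct answers (about 86%) clear the 85% cutoff, while five correct do not.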

33

Screening Candidates for Prior Participation

• Within the Manage tab
  • Create a worker qualification column and label it “Opted-In <Study Type>”
• If a Worker has submitted a HIT for a given Study Type
  • Then manually set the corresponding qualification column value to “1”
• When you run a HIT of <Study Type> require Opted-In equals 0 to qualify
  • Workers with prior participation will be unable to access HITs of that Study Type
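The bookkeeping behind this opt-in guard can be modeled as a tiny ledger, shown below. This is a local sketch of the logic only and does not call any Mechanical Turk API; the class and method names are made up, and treating a worker with no recorded value as 0 is an assumption of the sketch.

```python
class OptInLedger:
    """Local model of an 'Opted-In <Study Type>' qualification column."""

    def __init__(self) -> None:
        self._values: dict[tuple[str, str], int] = {}

    def record_submission(self, worker_id: str, study_type: str) -> None:
        """Set the worker's Opted-In value to 1 once they submit a HIT of this type."""
        self._values[(worker_id, study_type)] = 1

    def may_access(self, worker_id: str, study_type: str) -> bool:
        """A worker qualifies only while their Opted-In value is 0 (missing counts as 0 here)."""
        return self._values.get((worker_id, study_type), 0) == 0
```

Each study type keeps its own flag, so participating in one study type does not lock a worker out of another.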

34

Managing HIT Access via Qualifications

• This approach for screening-out prior participants can also be used for factors such as:
  • Pre-Qualifying Workers (If Profile-Qualified <Study Type> = 1 Then Allow Access)
    • Workers would complete a Screener Activity (like the Halo example on Slide 33)
    • If they are successful, then set Profile-Qualified <Study Type> = 1 for those workers
  • Denoting MVPs (Qualification MVP = 1)
  • Blacklisting (If BL = 1 Then Deny Access)
    • Since qualifications are public facing (to the individual worker), use a masked term
    • Explicitly blocking someone might cause negative publicity about you

35

Summary: Some Approaches for Mitigating Risk

• Carefully crafting task instructions
  • Use 6th Grade English
  • Unchecking Require Masters rating
  • Conduct an internal walkthrough to validate content & typo check
• Controlling Access to Your HITs
  • Setting up “Opting-In” qualification categories
  • Setting HIT visibility to either Private or Hidden
• Some Things to Watch for When Reviewing Turker Submissions
  • “Phoning it in”
  • “Satisfiers”
  • Workers skipping steps

36

3a. Some Thoughts on Activities well-suited for the Mechanical Turk Environment

Source: http://kernelmag.dailydot.com/features/report/4732/my-gruelling-day-as-an-amazon-mechanical-turk/

37

Activities potentially well-suited for MT

• Surveys

• First Impressions Feedback

• Supplemental Feedback to delivered content (Twitter does this)

• Script-Directed In-Store “Ethnography” (inherently risky)
  • Go to a brick-and-mortar (B&M) store; do activities & take notes; upload notes & take this survey
• Card Sorting and Affinity Diagramming (via HTML5 or JavaScript)
  • Each card would have a drop-down menu of numbers (# of cards/groups)
  • User would set/reset the drop-down for each card, and then click submit
  • For stack ranking, include a guard so a given drop-down number could only be selected once
• Reviewing Tutorial/Documentation Effectiveness (Pre/Post Treatment)
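The stack-ranking guard mentioned in the card-sorting bullet above amounts to checking that the submitted drop-down values form a permutation of 1..N. The talk suggests doing this in the page itself; the sketch below shows the same check in Python, as it might run server-side on submission, with a made-up function name and input shape.

```python
def validate_stack_rank(selection: dict[str, int]) -> list[str]:
    """Check that each rank 1..N is used exactly once across N cards.

    selection maps a card label to the number chosen in its drop-down (assumed shape).
    Returns a list of problems; an empty list means the ranking is valid.
    """
    n = len(selection)
    problems = []
    seen: dict[int, str] = {}
    for card, rank in selection.items():
        if not 1 <= rank <= n:
            problems.append(f"{card}: rank {rank} is outside 1..{n}")
        elif rank in seen:
            problems.append(f"{card}: rank {rank} already used by {seen[rank]}")
        else:
            seen[rank] = card
    return problems
```

A submission is accepted only when the list comes back empty; otherwise the messages can be shown next to the offending cards.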

38

4. Using Mechanical Turk as a Supplemental Tool

Source: Synthesized from a free-use image and utilized in a Participant Recruiting Flyer

39

Advanced Topics

• Using Mechanical Turk for supplemental and off-loaded tasks
  • Recruiting – Participant-Directed Screeners
    • As an alternative to conducting phone screeners to profile-qualify candidates to participate in a UX study, the candidates can be directed to an online survey within MT
    • MT Terms of Usage allow you to send specific people to a HIT
    • Important: Set HIT visibility to hidden so only your recruits see the HIT
  • Working with Remote Participants
    • Participant-Directed protocols (recommend tightly choreographed script)
      • Example: Flash-based forward-chaining script where data was submitted in Excel form to a ColdFusion-based server (data then retrieved via FTP)
    • Facilitator-Directed protocols with remotely accessed questionnaire packets
      • Example: Focus Group with a combination of local and remote participants
      • You don’t want the remotes to physically possess your protocol script
      • Implement survey elements of protocol as a SharePoint site, use MT to control/restrict access
  • Managing Gratuities
    • Importance of Modulo 10 clustering for optimal budgeting
    • 20% “standard” fee becomes a 40% fee if N > 10

40

Using MT for Managing “Regular” UX Gratuities

• Not all companies have a Usability Central facility like Microsoft
  • Side Effect: Principal Investigator is responsible for ensuring that Participant W-9 and 1099 forms are accurately completed and properly submitted to the IRS
• Case Example
  • At one of my assignments, my manager used an external vendor to provide lab space; schedule participants; and distribute gratuities.
  • It was cheaper to pay the vendor than the Researcher to do the scheduling & phone screening
  • More time & cost efficient for Manager to off-load tax form verification to vendor
  • If something goes awry, the vendor deals with the Fed
• Approach
  • Create a Category
  • Set HIT visibility to Hidden
  • Each participant is given a Group ID, Unique ID, and Pass-Phrase as identifiers
  • Participants would be grouped across multiple HITs, modulo 10 (as needed)
  • The HIT would involve inputting the identifier info
  • Requester would carefully verify & validate each submission, and approve payment
  • Amazon would collect a 20% fee for its troubles

Disclaimer: Neither the Presenter nor UXPA will be held responsible if you choose to use this method

41

Questions?

42

THANK YOU !

Statue of Jimi Hendrix at 1604 Broadway, Capitol Hill District, Seattle

43

Survey Request

• Here is the link to submit a review of this presentation:
  • http://www.uxpa2016.org/sessionsurvey?sessionid=197

44