Crowd Algorithms
Hector Garcia-Molina, Stephen Guo, Aditya Parameswaran, Hyunjung Park,
Alkis Polyzotis, Petros Venetis, Jennifer Widom
Stanford and UC Santa Cruz
Scoop — The Stanford – Santa Cruz Project for Cooperative Computing with Algorithms, Data, and People
The Goal
Design Fundamental Algorithms for Human Computation
Latency
Cost
Uncertainty
• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?
The Problems
Crowd-Sort / Max: Difficult!
Crowd-GraphSearch: Difficult!
Crowd-Categorize: Difficult!
Crowd-Filter: Difficult!
Dimensions: Latency, Cost, Uncertainty
Progress!
[VLDB 2011]
The focus of this talk.
Summaries of the rest
Filters
Dataset of Items → Predicate 1, Predicate 2, …, Predicate k → Filtered Dataset
Example predicates:
• Is this image that of Bytes Café?
• Is the image blurry?
• Does it show people’s faces?
Given:
- Error Probability (FP/FN) & Selectivity for each predicate
- Desired Overall Error Probability
To: Compose a filtering strategy
- Minimize Overall Cost (# of questions)
• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?
Single Filter
Surprisingly difficult! Need to meet an overall error threshold
- Say, up to 10% of my images may be wrongly filtered
Minimize overall expected number of questions
Boils down to the following:
- Take one item
- Ask some questions
  • Results in a certain number of (Y, N) answers for the given item
- Do I stop (if so, what do I return), or do I continue asking?
Dataset of Items → Predicate 1 → Filtered Dataset
Hasn’t this been done before?
Solutions from statistics guarantee the same error per item
- Important in contexts like:
  • Automobile testing
  • Diagnosis
We’re worried about aggregate error over all items: a uniquely data-oriented problem
- I don’t care if every image is perfect as long as the overall error is met.
- As we will see, this results in $$$ savings
Strategies
Grid with axes YES Answers and NO Answers; start at the origin, with no questions asked.
Example points:
• YES = 5, NO = 6: Return “Passed”
• YES = 3, NO = 7: Return “Failed”
• YES = 3, NO = 5: Continue
Reformulated Task:
For each point in the grid: Return Pass/Fail/Cont.
Equivalently,
Find the best shape and color it!
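One way to make the grid view concrete is to score a given strategy by walking the grid once. The sketch below is illustrative only and is not the method from the talk: it assumes each answer is independent, that an item truly satisfies the predicate with probability `selectivity`, and that `fp` and `fn` are the per-answer false-positive and false-negative rates. The names `evaluate` and `majority_of_three` are hypothetical.

```python
from typing import Callable, Dict, Tuple

# A strategy maps a grid point (yes, no) to "pass", "fail", or "continue".
Strategy = Callable[[int, int], str]


def evaluate(strategy: Strategy, selectivity: float, fp: float, fn: float,
             max_questions: int) -> Tuple[float, float]:
    """Return (expected questions per item, expected error per item).

    Assumed model (not from the slides): answers are independent, wrong
    with probability `fn` on a true item and `fp` on a false item.
    """
    # reach[(yes, no)] = (P(reach point and item true), P(reach point and item false))
    reach: Dict[Tuple[int, int], Tuple[float, float]] = {
        (0, 0): (selectivity, 1.0 - selectivity)
    }
    cost = 0.0
    error = 0.0
    # Visit points in order of questions asked, so parents come before children.
    for total in range(max_questions + 1):
        for yes in range(total + 1):
            no = total - yes
            p_true, p_false = reach.get((yes, no), (0.0, 0.0))
            if p_true + p_false == 0.0:
                continue  # this point is never reached
            decision = strategy(yes, no)
            if decision == "continue":
                if total == max_questions:
                    raise ValueError("strategy does not stop within max_questions")
                # Push probability mass to the YES neighbor and the NO neighbor.
                for point, t, f in (
                    ((yes + 1, no), p_true * (1.0 - fn), p_false * fp),
                    ((yes, no + 1), p_true * fn, p_false * (1.0 - fp)),
                ):
                    old_t, old_f = reach.get(point, (0.0, 0.0))
                    reach[point] = (old_t + t, old_f + f)
            else:
                cost += total * (p_true + p_false)
                # Passing a false item or failing a true item counts as an error.
                error += p_false if decision == "pass" else p_true
    return cost, error


def majority_of_three(yes: int, no: int) -> str:
    """Ask exactly three questions, then return the majority answer."""
    if yes + no < 3:
        return "continue"
    return "pass" if yes > no else "fail"


if __name__ == "__main__":
    print(evaluate(majority_of_three, selectivity=0.5, fp=0.1, fn=0.1,
                   max_questions=3))
```

Because items are treated as independent draws, the expected error per item is also the expected fraction of wrongly filtered items over the whole dataset, which is the aggregate quantity the slides care about.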
Common Strategies
Always ask X questions, return most likely answer
- The triangle shape
If you get X YES, return “Pass”; if you get Y NO, return “Fail”; else keep asking.
- Rectangular shape
Ask until |#YES - #NO| > X, or at most Y questions
- Chopped-off rectangle
- Anhai’s work on MOBS
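The three shapes above can be written as small decision functions over grid points. This is a sketch under stated assumptions: "most likely answer" is approximated by a simple majority, ties resolve to "fail", and the function names (`fixed_count`, `threshold`, `margin`, `render`) are hypothetical. The `render` helper draws rows as YES counts and columns as NO counts, which is also an assumed orientation.

```python
from typing import Callable

Strategy = Callable[[int, int], str]  # (yes, no) -> "pass" | "fail" | "continue"


def fixed_count(x: int) -> Strategy:
    """Triangle: always ask x questions, then return the majority answer."""
    def decide(yes: int, no: int) -> str:
        if yes + no < x:
            return "continue"
        return "pass" if yes > no else "fail"  # tie -> "fail" (assumption)
    return decide


def threshold(x: int, y: int) -> Strategy:
    """Rectangle: x YES answers -> pass, y NO answers -> fail, else continue."""
    def decide(yes: int, no: int) -> str:
        if yes >= x:
            return "pass"
        if no >= y:
            return "fail"
        return "continue"
    return decide


def margin(x: int, y: int) -> Strategy:
    """Chopped-off rectangle: stop once |yes - no| > x or y questions were asked."""
    def decide(yes: int, no: int) -> str:
        if abs(yes - no) > x or yes + no >= y:
            return "pass" if yes > no else "fail"  # tie -> "fail" (assumption)
        return "continue"
    return decide


SYMBOL = {"pass": "P", "fail": "F", "continue": "."}


def render(decide: Strategy, size: int) -> str:
    """Draw the colored grid: rows are YES counts (top = most), columns are NO counts."""
    rows = []
    for yes in range(size, -1, -1):
        rows.append("".join(SYMBOL[decide(yes, no)] for no in range(size + 1)))
    return "\n".join(rows)


if __name__ == "__main__":
    for name, strategy in (("triangle", fixed_count(5)),
                           ("rectangle", threshold(3, 3)),
                           ("chopped-off rectangle", margin(2, 6))):
        print(name)
        print(render(strategy, 6))
        print()
```

Note that `render` labels every point in the square, including points a strategy would never reach because it stops earlier; the shape is the boundary between the "." region and the P/F regions.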
Summary of Results
A characterization of which “shapes” are optimal
An optimal PTIME “probabilistic” approach
- LP leveraging the inherent DP structure
- Optimal: Strategy with minimum overall cost
  • for given parameters and requirements
- Probabilistic: Probability of “Pass” / “Fail” / “Continue”
Empirical Results
Evaluation on 10,000 synthetic scenarios
Tested:
- Optimal, Brute Force, Statistical, 5 Heuristic Algorithms
Optimal Probabilistic issues fewer questions overall
- 15% savings on average compared to brute force
  • 32% savings when optimal wins
- 22% savings on average compared to the statistics approach
  • 49% savings when optimal wins
Translates to $$$ for many items!
[Figure: Generate Parameters; Other Algorithms; Brute Force; Deterministic Optimal; Probabilistic; COST1 >> COST2 >> COST3]
Crowd-Max/Sort
The problem(s):
- Find the strategy for sorting n items
  • Given: Probability of error for a comparison
  • Given: Desired threshold on error, #questions, #rounds
Sorting automatically given evidence
- NP-Hard even for a simple probability of error model
- Related work in the areas of voting theory and economics
Which r questions do we ask next?
Example strategies (spectrum: Decreasing Parallelism, More Accuracy):
• Ask all pairs a total of 2k/n times
• Tournament, with k repetitions at each level
• One question in each round
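The tournament option can be sketched as follows. This is an illustrative simulation, not the authors' algorithm: it assumes each comparison is wrong independently with a fixed probability, that the k repetitions are combined by majority vote, and that all questions at one level are asked in parallel and so count as one round. `noisy_comparator` and `tournament_max` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence, Tuple

Ask = Callable[[int, int], bool]  # ask(a, b): does the worker say a > b?


def noisy_comparator(error_prob: float, rng: random.Random) -> Ask:
    """Simulated worker: answers 'is a > b?' wrongly with probability error_prob."""
    def ask(a: int, b: int) -> bool:
        truth = a > b
        return truth if rng.random() >= error_prob else not truth
    return ask


def tournament_max(items: Sequence[int], k: int, ask: Ask) -> Tuple[int, int, int]:
    """Return (winner, rounds, questions) for a single-elimination tournament.

    Each pair is compared k times and the majority wins; an odd item out
    gets a bye to the next level.
    """
    if not items:
        raise ValueError("items must be non-empty")
    survivors: List[int] = list(items)
    rounds = 0
    questions = 0
    while len(survivors) > 1:
        next_level: List[int] = []
        for i in range(0, len(survivors) - 1, 2):
            a, b = survivors[i], survivors[i + 1]
            votes_for_a = sum(ask(a, b) for _ in range(k))
            questions += k
            next_level.append(a if 2 * votes_for_a > k else b)
        if len(survivors) % 2 == 1:
            next_level.append(survivors[-1])  # bye
        survivors = next_level
        rounds += 1  # one level = one parallel round (assumption)
    return survivors[0], rounds, questions


if __name__ == "__main__":
    rng = random.Random(0)
    print(tournament_max([3, 9, 1, 7, 5], k=3, ask=noisy_comparator(0.2, rng)))
```

Raising k spends more questions per level to reduce the chance that a noisy vote eliminates the true maximum, while the number of rounds stays logarithmic in the number of items.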
Crowd-GraphSearch
Image Categorization Example
Taxonomy nodes (figure): vehicle; car; nissan, honda, toyota; maxima, sentra
To attach: image of a honda car
• Is image one of vehicle? YES!
• Is image one of toyota? NO!
• Is image one of honda? YES!
target node = intended category
“Is the image one of X?” = “Is the target node reachable from X?”
Find the target node by asking the minimum number of search questions.
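To illustrate the reachability question, here is a naive top-down baseline. It is not the algorithm from the talk and does not minimize the number of questions; it only shows how "Is the image one of X?" maps to a reachability test. The edge structure of `TAXONOMY` is an assumption inferred from the node labels on the slide, and the crowd is simulated by a noise-free oracle.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed edges; the slide only lists the node labels.
TAXONOMY: Dict[str, List[str]] = {
    "vehicle": ["car"],
    "car": ["nissan", "honda", "toyota"],
    "nissan": ["maxima", "sentra"],
}


def reachable(graph: Dict[str, List[str]], start: str, target: str) -> bool:
    """True if target can be reached from start by following edges."""
    stack = [start]
    seen = set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False


def top_down_search(graph: Dict[str, List[str]], root: str,
                    ask: Callable[[str], bool]) -> Tuple[Optional[str], int]:
    """Descend from the root, asking 'is the target reachable from X?'.

    Returns (node found, number of questions asked). Naive baseline only.
    """
    questions = 1
    if not ask(root):
        return None, questions
    current = root
    while True:
        for child in graph.get(current, []):
            questions += 1
            if ask(child):
                current = child
                break
        else:
            # No child answered YES, so the current node is the target.
            return current, questions


if __name__ == "__main__":
    oracle = lambda node: reachable(TAXONOMY, node, "honda")  # simulated crowd
    print(top_down_search(TAXONOMY, "vehicle", oracle))
```

In this baseline the question count depends on the order in which children are tried, which hints at why choosing the questions carefully matters.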
Crowd-Categorize
k buckets, n items
Categorize every item, overall error < threshold
For k = 1, same as the filters problem
Two versions:
- Discrete
  • Independent (like in the filters case)
  • Dependent buckets (e.g., colors, GraphSearch)
- Continuous (e.g., age)