Crowd Algorithms Hector Garcia-Molina, Stephen Guo, Aditya Parameswaran, Hyunjung Park, Alkis...

Crowd Algorithms

Hector Garcia-Molina, Stephen Guo, Aditya Parameswaran, Hyunjung Park,

Alkis Polyzotis, Petros Venetis, Jennifer Widom

Stanford and UC Santa Cruz

Scoop — The Stanford – Santa Cruz Project for Cooperative Computing with Algorithms, Data, and People

The Goal

Design Fundamental Algorithms for Human Computation

Latency

Uncertainty

• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?

The Problems

Sort / Max

GraphSearch

Categorize

Filter

Crowd-

Latency

Uncertainty

: Difficult!

Progress!

[VLDB 2011]

The focus of this talk.

Summaries of the rest

Filters

Dataset of Items

Predicate 1

Predicate 2……

Predicate k

Is this an image of Bytes Café?

Is the image blurry?

Does it show people’s faces?

Filtered Dataset

Given:
- Error probability (FP/FN) and selectivity for each predicate
- Desired overall error probability

To: Compose a filtering strategy
- Minimize overall cost (# of questions)

• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?

Single Filter

Surprisingly difficult! Need to meet an overall error threshold
- Say, up to 10% of my images may be wrongly filtered

Minimize overall expected number of questions

Boils down to the following:
- Take one item
- Ask some questions
  • Results in a certain number of (Y, N) answers for the given item
- Do I stop (if so, what do I return), or do I continue asking?

Dataset of Items Predicate 1

Filtered Dataset

Hasn’t this been done before?

Solutions from statistics guarantee the same error per item
- Important in contexts like:
  • Automobile testing
  • Diagnosis

We’re worried about aggregate error over all items: a uniquely data-oriented problem
- I don’t care if every image is perfect as long as the overall error threshold is met.
- As we will see, this results in $$$ savings

Strategies

YES = 5, NO = 6: Return “Passed”

YES Answers

NO Answers

YES = 3, NO = 7: Return “Failed”

YES = 3, NO = 5: Continue

Reformulated Task:

For each point in the grid: return Pass/Fail/Continue.

Equivalently,

Find the best shape and color it!

Start here, with no questions
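The grid view above can be made concrete with a small sketch. It assumes a simple error model that the slides only imply: each item truly passes with probability `s` (the selectivity), a worker says YES on a failing item with probability `fp`, and says NO on a passing item with probability `fn`. The function name `evaluate` and the example strategy `majority_of_three` are hypothetical, chosen for illustration. Given any Pass/Fail/Continue coloring of the grid, the sketch walks the grid from (0, 0) and returns the expected number of questions and the expected error.

```python
PASS, FAIL, CONT = "Pass", "Fail", "Continue"


def evaluate(strategy, s, fp, fn, max_questions):
    """Return (expected_cost, expected_error) for a grid strategy.

    strategy(yes, no) returns PASS, FAIL, or CONT for a grid point.
    s  = assumed probability that an item truly passes (selectivity)
    fp = assumed P(worker says YES | item truly fails)
    fn = assumed P(worker says NO  | item truly passes)
    """
    # reach[(yes, no)] = (P(reach point and item passes),
    #                     P(reach point and item fails))
    reach = {(0, 0): (s, 1.0 - s)}
    cost = error = 0.0
    for total in range(max_questions + 1):
        for yes in range(total + 1):
            no = total - yes
            if (yes, no) not in reach:
                continue
            p_pass, p_fail = reach[(yes, no)]
            decision = strategy(yes, no)
            if decision == CONT:
                if total == max_questions:
                    raise ValueError("strategy must stop within max_questions")
                # One more answer: YES moves right, NO moves up.
                for nxt, w_pass, w_fail in (
                    ((yes + 1, no), 1.0 - fn, fp),
                    ((yes, no + 1), fn, 1.0 - fp),
                ):
                    a, b = reach.get(nxt, (0.0, 0.0))
                    reach[nxt] = (a + p_pass * w_pass, b + p_fail * w_fail)
                continue
            # Stopping point: pay for the questions asked so far.
            cost += total * (p_pass + p_fail)
            # Error mass: failing items returned as Pass, or vice versa.
            error += p_fail if decision == PASS else p_pass
    return cost, error


def majority_of_three(yes, no):
    """Example strategy: always ask 3 questions, return the majority."""
    if yes + no < 3:
        return CONT
    return PASS if yes > no else FAIL


cost, error = evaluate(majority_of_three, s=0.5, fp=0.1, fn=0.1, max_questions=3)
print(cost, error)  # 3 questions per item, error close to 0.028
```

Under this model, "find the best shape and color it" means searching over such strategies for the one with the lowest cost whose error stays under the threshold.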

Common Strategies

Always ask X questions, return most likely answer
- The triangle shape

If you get X YES, return “Pass” or Y NO, return “Fail”, else keep asking.
- Rectangular shape

Ask until |#YES - #NO| > X, or at most Y questions
- Chopped off rectangle
- Anhai’s work on MOBS
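The three shapes can be written as small decision functions. This is a sketch only: the function names are made up for illustration, "most likely answer" is approximated by a simple majority (reasonable when error rates are symmetric and the prior is even, which is an assumption), and ties are resolved as "Fail". The helper `shape` draws the reachable part of the grid with YES answers on the horizontal axis and NO answers on the vertical axis, which is also an assumed orientation.

```python
PASS, FAIL, CONT = "Pass", "Fail", "Continue"


def fixed_count(x):
    """Always ask x questions, then return the majority (triangle)."""
    def decide(yes, no):
        if yes + no < x:
            return CONT
        return PASS if yes > no else FAIL
    return decide


def first_to(x_yes, y_no):
    """Stop at x_yes YES answers (Pass) or y_no NO answers (Fail) (rectangle)."""
    def decide(yes, no):
        if yes >= x_yes:
            return PASS
        if no >= y_no:
            return FAIL
        return CONT
    return decide


def lead_or_cap(x, y):
    """Ask until |#YES - #NO| > x, or until y questions have been asked."""
    def decide(yes, no):
        if abs(yes - no) > x or yes + no >= y:
            return PASS if yes > no else FAIL
        return CONT
    return decide


def shape(decide, max_questions):
    """Rows of P/F/C for reachable grid points, '.' for unreachable ones."""
    reachable = {(0, 0)}
    grid = {}
    for total in range(max_questions + 1):
        for yes in range(total + 1):
            no = total - yes
            if (yes, no) not in reachable:
                continue
            decision = decide(yes, no)
            grid[(yes, no)] = decision[0]
            if decision == CONT:
                reachable.add((yes + 1, no))
                reachable.add((yes, no + 1))
    # Top row = most NO answers; left column = zero YES answers.
    return [
        "".join(grid.get((yes, no), ".") for yes in range(max_questions + 1))
        for no in range(max_questions, -1, -1)
    ]


for row in shape(first_to(2, 2), 3):
    print(row)
```

Printing `shape` for each of the three functions shows the triangle, the rectangle, and the chopped-off rectangle directly.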

Summary of Results

A characterization of which “shapes” are optimal

An optimal PTIME “probabilistic” approach
- LP leveraging the inherent DP structure
- Optimal: Strategy with minimum overall cost
  • for given parameters and requirements
- Probabilistic: Each grid point gets a probability of “Pass”, “Fail”, and “Continue”
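One way to see why a linear program is enough is sketched below. This is a reconstruction under assumed notation, not necessarily the authors' exact formulation: $s$ is the selectivity, $e_0$ the probability of a YES on a failing item, $e_1$ the probability of a NO on a passing item, $m$ the maximum number of questions, $\tau$ the error threshold, and $(x, y)$ a grid point with $x$ NO and $y$ YES answers. The key observation is that every path to $(x, y)$ has the same probability, so the unknowns can be the path-weighted mass $P$, $F$, $C$ that passes, fails, or continues at each point.

```latex
\begin{align*}
g_1(x,y) &= s\,(1-e_1)^{y}\,e_1^{x}, \qquad
g_0(x,y) = (1-s)\,e_0^{y}\,(1-e_0)^{x} \\[4pt]
\text{minimize}\quad
  & \sum_{x+y \le m} (x+y)\,\bigl(g_0(x,y)+g_1(x,y)\bigr)\,\bigl(P(x,y)+F(x,y)\bigr) \\
\text{subject to}\quad
  & \sum_{x+y \le m} \bigl(g_0(x,y)\,P(x,y) + g_1(x,y)\,F(x,y)\bigr) \le \tau \\
  & P(0,0)+F(0,0)+C(0,0) = 1 \\
  & P(x,y)+F(x,y)+C(x,y) = C(x-1,y)+C(x,y-1), \quad (x,y) \ne (0,0) \\
  & C(x,y) = 0 \ \text{ for } x+y = m, \qquad C(-1,\cdot) = C(\cdot,-1) = 0 \\
  & P,\,F,\,C \ge 0
\end{align*}
```

The decision probabilities at a reachable point are then recovered as $P/(P+F+C)$, $F/(P+F+C)$, and $C/(P+F+C)$. The flow-conservation constraints mirror the dynamic-programming recurrence over the grid.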

Empirical Results

Evaluation on 10,000 synthetic scenarios

Tested:
- Optimal, Brute Force, Statistical, 5 Heuristic Algorithms

Optimal Probabilistic issues fewer questions overall
- 15% savings on average compared to brute force
  • 32% savings when optimal wins
- 22% savings on average compared to the statistics approach
  • 49% savings when optimal wins

Translates to $$$ for many items!

Generate Parameters

Other Algorithms

Brute Force

Deterministic Optimal

Probabilistic

COST1 >>

Crowd-Max/Sort

The problem(s):
- Find the strategy for sorting n items
  • Given: Probability of error for a comparison
  • Given: Desired threshold on error, #questions, #rounds

Sorting automatically given evidence
- NP-Hard even for a simple probability of error model
- Related work in the area of voting theory, economics

Which r questions do we ask next?

Ask all pairs a total of 2k/n times

Tournament, with k repetitions at each level

One question in each round

Decreasing Parallelism

More Accuracy
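The middle option, a tournament with k repetitions at each level, can be simulated in a few lines. This sketch assumes distinct items, a single fixed error probability per comparison, and a simulated worker (`noisy_compare`, a hypothetical stand-in for a real crowd question). Each match is decided by majority over k answers; with an even k, ties go to the second item, so an odd k is the natural choice.

```python
import random


def noisy_compare(a, b, error_prob, rng):
    """Simulated worker: returns the larger item, but errs with error_prob."""
    better, worse = (a, b) if a > b else (b, a)
    return worse if rng.random() < error_prob else better


def tournament_max(items, k, error_prob, rng):
    """Single-elimination tournament; each match asks k questions.

    Returns (winner, total_questions, rounds).
    """
    current = list(items)
    questions = 0
    rounds = 0
    while len(current) > 1:
        winners = []
        # All matches of one level can be posted to the crowd in parallel.
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            votes_a = sum(
                noisy_compare(a, b, error_prob, rng) == a for _ in range(k)
            )
            questions += k
            winners.append(a if 2 * votes_a > k else b)
        if len(current) % 2 == 1:
            winners.append(current[-1])  # odd item out gets a bye
        current = winners
        rounds += 1
    return current[0], questions, rounds


rng = random.Random(0)
print(tournament_max(list(range(8)), k=3, error_prob=0.2, rng=rng))
```

For n items this uses (n - 1) * k questions in about log2(n) rounds, which sits between asking all pairs at once (one round, many questions) and asking one question per round (many rounds).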

Crowd-GraphSearch

Image Categorization Example

vehicle

nissan honda toyota

maxima

sentra

To attach: image of a honda car

Is image one of vehicle? YES!

Is image one of toyota? NO!

Is image one of honda?

target node = intended category

Is the image one of X? = Is the target node reachable from X?

Find the target node by asking minimum number of search questions.
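A naive baseline for this search is a top-down walk: ask about each child of the current node, descend on the first YES, and stop when no child says YES. The sketch below assumes error-free answers, a tree-shaped hierarchy, and a target known to lie somewhere under the root; the placement of maxima and sentra under nissan is assumed from the slide's example. It is not the minimum-question strategy the slide asks for, only a reference point.

```python
def make_oracle(tree, target):
    """Simulated crowd answer to: is the target reachable from this node?"""
    def reachable(node):
        return node == target or any(reachable(c) for c in tree.get(node, []))
    return reachable


def find_target(tree, root, ask):
    """Top-down search. Returns (target_node, list_of_nodes_asked_about)."""
    node = root
    asked = []
    while True:
        for child in tree.get(node, []):
            asked.append(child)
            if ask(child):
                node = child  # the target lies in this subtree
                break
        else:
            # No child contains the target, so the current node is the target.
            return node, asked


# Hierarchy from the image categorization example (structure assumed).
tree = {
    "vehicle": ["nissan", "honda", "toyota"],
    "nissan": ["maxima", "sentra"],
}

found, asked = find_target(tree, "vehicle", make_oracle(tree, "honda"))
print(found, asked)  # honda ['nissan', 'honda']
```

The number of questions here depends heavily on the order in which children are tried, which is exactly what a smarter question-selection strategy would optimize.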

Crowd-Categorize

k buckets, n items

Categorize every item, overall error < threshold

For k = 1, same as filters problem

Two versions:
- Discrete
  • Independent (like in the filters case)
  • Dependent buckets (e.g., colors, GraphSearch)
- Continuous (e.g., age)

…….Dataset of Items

Questions?
