Making Sense at Scale with Algorithms, Machines & People Tim Kraska EECS, Computer Science UC Berkeley

Making Sense at Scale with Algorithms, Machines & Peoplecs.brown.edu › people › tkraska › talks › 2011-07-17_AMP.pdf · 2011-07-29 · •#Master failover using ZooKeeper"


Page 1

Making Sense at Scale with Algorithms, Machines & People

Tim Kraska
EECS, Computer Science
UC Berkeley

Page 2

Agenda

• Issues/Opportunities of Big Data
• AMPLab Background
• AMP Technologies
  – Algorithms for Machine Learning
  – Machines for Cloud Computing
  – People for Crowd Sourcing
• Project Status and Wrap Up

Page 3

Big Data is Massive…

• Facebook:
  – 130 TB/day: user logs
  – 200-400 TB/day: 83 million pictures
• Google: > 25 PB/day processed data
• Data generated by LHC: 1 PB/sec
• Gene sequencing:
  – Sequence 1 human cell costs Illumina $1k
  – Whole genome: $100,000 → 1.5 GB – 30 TB (raw)
  – Sequence 1 cell for every infant by 2015?
  – 10 trillion cells / human body!
• Total data created in 2010: 1 ZettaByte (1,000,000 PB)/year
  – ~60% increase every year

Page 4

… Growing …

• More and more devices
• Connected people
• Cheaper and cheaper storage: +50%/year

Page 5

… and Dirty

• Diverse
• Variety of sources
• Uncurated
• No schema
• Inconsistent semantics
• Inconsistent syntax

Page 6

Queries have issues too

• Diverse
• Time-sensitive
• Opportunistic
• Exploratory
• Multi-hypotheses Pitfall

Page 7

“Big Data”: Working Definition

When the normal application of current technology doesn’t enable users to obtain timely and cost-effective answers of sufficient quality to data-driven questions

Page 8

A Necessary Synergy

1. Improve scale, efficiency, and quality of machine learning and analytics to increase value from Big Data (Algorithms)
2. Use cloud computing to get value from Big Data and enhance datacenter infrastructure to cut costs of Big Data management (Machines)
3. Leverage human activity and intelligence to obtain data and extract value from Big Data cases that are hard for algorithms (People)

Page 9

Algorithms, Machines, People

• Today’s solutions:

[Figure: diagram relating Algorithms, Machines, and People, with today’s solutions "search" and "Watson/IBM" positioned on it.]

Page 10

The AMPLab

[Figure: the same Algorithms / Machines / People diagram with "search" and "Watson/IBM".]

Making sense at scale by integrating Algorithms, Machines, and People

Page 11

AMPLab: What is it?

A five-year research collaboration to develop a new generation of data analysis methods, tools, and infrastructure for making sense of data at scale

(Started Feb 2011)

Page 12

AMP Faculty and Sponsors

Berkeley Faculty
• Alex Bayen (mobile sensing platforms)
• Armando Fox (systems)
• Michael Franklin (databases), Director
• Michael Jordan (machine learning), Co-Director
• Anthony Joseph (security & privacy)
• Randy Katz (systems)
• David Patterson (systems)
• Ion Stoica (systems), Co-Director
• Scott Shenker (networking)

Sponsors: [Figure: sponsor logos]

Page 13

AMPLab Application Partners

• Mobile Millennium Project
  – Alex Bayen, Civil and Environmental Engineering, UC Berkeley
• Microsimulation of urban development
  – Paul Waddell, College of Environmental Design, UC Berkeley
• Crowd based opinion formation
  – Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley
• Personalized Sequencing
  – Taylor Sittler, UCSF

Page 14

AMP Research Themes

• Algorithms
  – Scale up machine learning
  – Error bars on everything
• Machines
  – Datacenter as a computer
• People
  – People in all phases of data analysis
• AMP
  – Put it all together

Page 15

Berkeley Data Analytics System

[Figure: layered architecture. Components: Crowd Mgt, Computational Resource Mgt, Data Collector, Scalable Storage Engine, Higher Query Languages, Processing Frameworks, Machine Learning / Analytics Libraries, Data Source Selector, Result Control Center, Visualization. Cross-cutting concerns: Debugging, Quality Control, Privacy Control.]

Roles: Infra. Builder, Algo/Tools, Data Collector, Data Analyst, Privacy Mgr

Page 16

Algorithms: Error Management

• Immediate results/continuous improvement
• Calibrate answer: provide error bars
• Breakthrough – “Big Data” Bootstrap

Error bars on every answer!

[Figure: estimate vs. # of data points, with error bars narrowing around the true answer.]
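The idea of attaching an error bar to an estimate can be sketched with the classic bootstrap. This is a minimal illustration only: it uses the plain textbook bootstrap, not the "Big Data" Bootstrap the slide refers to, and the function name, sample sizes, and synthetic data are assumptions made for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_error_bar(sample, estimator, n_resamples=100, seed=0):
    """Return (estimate, standard error) using the plain bootstrap.

    The standard error is the standard deviation of the estimator over
    resamples drawn with replacement from the original sample.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    center = mean(estimates)
    variance = sum((e - center) ** 2 for e in estimates) / (len(estimates) - 1)
    return estimator(sample), variance ** 0.5


# Synthetic data (an assumption for the example): true mean is 10.0.
data_rng = random.Random(42)
population = [data_rng.gauss(10.0, 2.0) for _ in range(2500)]

est_small, err_small = bootstrap_error_bar(population[:100], mean)
est_large, err_large = bootstrap_error_bar(population, mean)
print(f"100 points:  {est_small:.2f} +/- {err_small:.2f}")
print(f"2500 points: {est_large:.2f} +/- {err_large:.2f}")
```

As in the slide's plot, the error bar shrinks as more data points are used, so a system can return an early answer and keep refining it.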

Page 17

Algorithms: Error Management

• Given any problem, data and a time budget
  – Automatically pick the best algorithm
  – Actively learn and adapt strategy

[Figure: estimate vs. time, annotated "pick algo 1", "error too high", "pick algo 2", converging toward the true answer.]

Page 18

Algos: Black Swans vs. Red Herrings

What about new data sources and bias?

[Figure: number of hypotheses plotted against two axes: more data per existing features (rows) and more data as new features (columns).]

Page 19

Machines

Goal: “The datacenter as a computer”, but…
  – Special purpose clusters, e.g., Hadoop cluster
  – Highly variable performance
  – Hard to program
  – Hard to debug

Page 20

Machines: Problem

• Rapid innovation in cluster computing frameworks
  – Hadoop, Pregel, Dryad, Rails, Spark, Pig, MPI2, …
• No single framework optimal for all applications
• Want to run multiple frameworks in a single cluster
  » …to maximize utilization
  » …to share data between frameworks

Page 21

Machines: A Solution

• Mesos: a resource sharing layer supporting diverse frameworks
  – Fine-grained sharing: Improves utilization, latency, and data locality
  – Resource offers: Simple, scalable application-controlled scheduling mechanism

[Figure: Hadoop and Pregel each running on their own pair of nodes, versus Hadoop, Pregel, … sharing four nodes through a Mesos layer.]

B. Hindman, et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011.
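The resource-offer mechanism described above can be sketched as follows. This is a toy model under stated assumptions, not the Mesos API: the `Offer` and `Framework` classes, the CPU-only resource model, and the fixed offer order are all hypothetical. The point it illustrates is that the sharing layer only offers resources, while each framework decides for itself (for example, based on data locality) whether to accept.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    node: str
    cpus: int


@dataclass
class Framework:
    """Hypothetical framework scheduler; not a real Mesos class."""
    name: str
    cpus_per_task: int
    pending_tasks: int
    preferred_nodes: set = field(default_factory=set)  # empty = no preference

    def consider(self, offer):
        """Return how many tasks to launch on this offer (0 declines it)."""
        if self.pending_tasks == 0:
            return 0
        if self.preferred_nodes and offer.node not in self.preferred_nodes:
            return 0  # decline: the framework's data is not on this node
        n = min(self.pending_tasks, offer.cpus // self.cpus_per_task)
        self.pending_tasks -= n
        return n


def run_offer_round(free_cpus, frameworks):
    """Offer each node's free CPUs to the frameworks in order."""
    launched = []
    for node in sorted(free_cpus):
        for fw in frameworks:
            if free_cpus[node] == 0:
                break
            n = fw.consider(Offer(node, free_cpus[node]))
            if n:
                free_cpus[node] -= n * fw.cpus_per_task
                launched.append((fw.name, node, n))
    return launched


free = {"node1": 4, "node2": 4}
hadoop = Framework("hadoop", cpus_per_task=1, pending_tasks=3,
                   preferred_nodes={"node2"})
mpi = Framework("mpi", cpus_per_task=2, pending_tasks=3)
launched = run_offer_round(free, [hadoop, mpi])
print(launched)
```

In this run the Hadoop framework declines the offer for node1 and waits for node2, where its data is assumed to live, so the MPI framework takes node1 instead.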

Page 22

Mesos – Cluster Operating System

Efficiently shares resources among diverse parallel applications

[Figure: Mesos architecture: a Mesos master with Hadoop, Dryad, and MPI schedulers; Mesos slaves hosting Hadoop, Dryad, and MPI executors, each running tasks.]

[Figure: share of cluster (0% to 100%) over time (1 to 331 s) for MPI, Hadoop, and Spark.]

Page 23

Mesos: Implementation Status

• 20,000 lines of C++
• Master failover using ZooKeeper
• Frameworks ported: Hadoop, MPI, Torque
• New specialized frameworks: Spark
• Open source in Apache Incubator
• In use at Berkeley, UCSF, Twitter and elsewhere

Page 24

Datacenter Programming Framework: Spark

• In-memory cluster computing framework for applications that reuse working sets of data
  – Iterative algorithms: machine learning, graph processing, optimization
  – Interactive data mining: order of magnitude faster than disk-based tools
• Key idea: “resilient distributed datasets” that can automatically be rebuilt on failure
  – mechanism based on “lineage”

M. Zaharia, et al., Spark: Cluster Computing with Working Sets, 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2010), June 2010.
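The lineage idea can be illustrated with a small single-process sketch. It is not Spark code: the `LineageDataset` class and its methods are hypothetical names chosen for the example. Each dataset remembers the function that derives its partitions from its parent, so a lost in-memory partition can be recomputed instead of being restored from a replica.

```python
class LineageDataset:
    """Toy dataset that records how each partition is derived (its lineage)."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # function: partition index -> list of items
        self._cache = {}         # partitions currently held in memory
        self.computations = 0    # how many times a partition was (re)built

    @classmethod
    def from_source(cls, partitions):
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn):
        parent = self
        return LineageDataset(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
        )

    def partition(self, i):
        if i not in self._cache:
            # Cache miss (first use or simulated failure): rebuild via lineage.
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = LineageDataset.from_source([[1, 2], [3, 4]])
squares = source.map(lambda x: x * x)
first = squares.collect()
squares.lose_partition(1)
second = squares.collect()
print(first, second, squares.computations)
```

Only the lost partition is rebuilt on the second `collect()`; the partition that survived is served from memory.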

Page 25

Spark Data Flow vs. Map Reduce

[Figure: Spark: a master sends "param" to four slaves holding R1–R4 and aggregates their results, repeating with the updated "param". MapReduce / Dryad: each iteration runs a new set of map tasks (Map 1–4, then Map 5–8) followed by a reduce that aggregates and produces the next "param".]

Page 26

Scala Language Integration

Serial Version

val data = readData(...)
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- data) {
    // per-point contribution to the logistic-loss gradient
    val s = (1/(1+exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += s * p.x
  }
  w -= gradient
}
println("Final w: " + w)

Spark Version

// parse the input once and keep it cached for reuse across iterations
val data = spark.hdfsTextFile(...)
                .map(readPoint _).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  for (p <- data) {
    val s = (1/(1+exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += s * p.x
  }
  w -= gradient.value
}
println("Final w: " + w)
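For readers who want to run the serial algorithm without a Scala or Spark setup, here is a rough Python equivalent. It is a sketch, not a translation of the authors' code: the `step` parameter (the slide uses an implicit step of 1), the tiny hand-made data set, and the helper names are assumptions. Labels are +1 or -1, as the update rule in the slide implies.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train(points, dims, iterations, step=0.1, seed=0):
    """Batch gradient descent for logistic regression with labels in {-1, +1}."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
    for _ in range(iterations):
        gradient = [0.0] * dims
        for x, y in points:
            # Same per-point term as in the slide's Scala code.
            s = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
            for j in range(dims):
                gradient[j] += s * x[j]
        w = [wj - step * gj for wj, gj in zip(w, gradient)]
    return w


# Each point is ((bias term, feature), label); the label is the feature's sign.
points = [
    ((1.0, -2.0), -1),
    ((1.0, -1.0), -1),
    ((1.0, 1.0), 1),
    ((1.0, 2.0), 1),
]
w = train(points, dims=2, iterations=200)
print("Final w:", w)
```

Every iteration rescans all points, which is why keeping the data set in memory, as the Spark version does with `cache()`, pays off for this kind of algorithm.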

Page 27

Logistic Regression Performance

[Figure: running time (s, 0 to 4500) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 127 s / iteration. Spark: first iteration 174 s, further iterations 6 s.]

Page 28

People

• Make people an integrated part of the system!
  – Leverage human activity
  – Leverage human intelligence (crowdsourcing):
    • Curate and clean dirty data
    • Answer imprecise questions
    • Test and improve algorithms
• Challenge
  – Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)

[Figure: people supply data and activity to "Machines + Algorithms", which send them Questions and receive Answers.]

Page 29

CrowdDB – A First Step

[Figure: the Algorithms / Machines / People diagram with "search", "Watson/IBM", and now "CrowdDB" positioned on it.]

Page 30

Problem: DB-hard Queries

Company_Name             Address                    Market Cap
Google                   Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines  Armonk, NY                 $203Bn
Microsoft                Redmond, WA                $206Bn

SELECT Market_Cap FROM Companies WHERE Company_Name = "IBM"

Number of Rows: 0

Problem: Entity Resolution
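The empty result is easy to reproduce with any relational engine. The sketch below uses Python's built-in `sqlite3` module with the three rows from the slide; the table and column types are assumptions made for the example. An exact-match predicate cannot know that "IBM" and "Intl. Business Machines" name the same entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Companies (Company_Name TEXT, Address TEXT, Market_Cap TEXT)"
)
conn.executemany(
    "INSERT INTO Companies VALUES (?, ?, ?)",
    [
        ("Google", "Googleplex, Mtn. View CA", "$170Bn"),
        ("Intl. Business Machines", "Armonk, NY", "$203Bn"),
        ("Microsoft", "Redmond, WA", "$206Bn"),
    ],
)


def market_cap(name):
    """Exact-match lookup, as in the slide's query."""
    query = "SELECT Market_Cap FROM Companies WHERE Company_Name = ?"
    return conn.execute(query, (name,)).fetchall()


ibm_rows = market_cap("IBM")
full_name_rows = market_cap("Intl. Business Machines")
print(len(ibm_rows), full_name_rows)
```

The data is in the table, but only the query that happens to spell the name the same way as the stored row finds it.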

Page 31

DB-hard Queries

Company_Name             Address                    Market Cap
Google                   Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines  Armonk, NY                 $203Bn
Microsoft                Redmond, WA                $206Bn

SELECT Market_Cap FROM Companies WHERE Company_Name = "Apple"

Number of Rows: 0

Problem: Closed World Assumption

Page 32

DB-hard Queries

SELECT Top_1 Image FROM Pictures WHERE Topic = "Business Success" ORDER BY Relevance

Number of Rows: 0

Problem: Subjective Comparison

Page 33

Easy Queries

Company_Name             Address                    Market Cap
Google                   Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines  Armonk, NY                 $203Bn
Microsoft                Redmond, WA                $206Bn

SELECT Market_Cap FROM Companies WHERE Company_Name = "IBM"

$203Bn
Number of Rows: 1

Page 34

Pretty Easy Queries

Company_Name             Address                    Market Cap
Google                   Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines  Armonk, NY                 $203Bn
Microsoft                Redmond, WA                $206Bn

SELECT Market_Cap FROM Companies WHERE Company_Name = "The Cool Software Company"

$xxxBn
Number of Rows: 1

Page 35

Microtasking Marketplaces

• Current leader: Amazon Mechanical Turk
• Requestors place Human Intelligence Tasks (HITs)
  – Requestors approve jobs and payment
  – API-based: “createHit()”, “getAssignments()”, “approveAssignments()”
  – Other parameters: # of replicas, expiration, user interface, …
• Workers (a.k.a. “turkers”) choose jobs, do them, get paid
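The requester-side workflow named in the slide (create a HIT, collect assignments, approve and pay) can be sketched with an in-memory stand-in. `MockMarketplace` and all of its method signatures are hypothetical and exist only for this illustration; they mirror the three call names on the slide but are not the Amazon Mechanical Turk API.

```python
import itertools


class MockMarketplace:
    """In-memory stand-in for a microtask marketplace (illustration only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._hits = {}

    def create_hit(self, question, reward_cents, replicas):
        """Post a task; `replicas` is how many workers may answer it."""
        hit_id = next(self._ids)
        self._hits[hit_id] = {
            "question": question,
            "reward_cents": reward_cents,
            "replicas": replicas,
            "assignments": [],
        }
        return hit_id

    def submit(self, hit_id, worker, answer):
        """Worker side: returns False once all replicas are taken."""
        hit = self._hits[hit_id]
        if len(hit["assignments"]) >= hit["replicas"]:
            return False
        hit["assignments"].append(
            {"worker": worker, "answer": answer, "approved": False}
        )
        return True

    def get_assignments(self, hit_id):
        return list(self._hits[hit_id]["assignments"])

    def approve_assignments(self, hit_id):
        """Approve all unapproved assignments; return the amount paid in cents."""
        hit = self._hits[hit_id]
        paid = 0
        for assignment in hit["assignments"]:
            if not assignment["approved"]:
                assignment["approved"] = True
                paid += hit["reward_cents"]
        return paid


market = MockMarketplace()
hit = market.create_hit("Phone number of the CS department?",
                        reward_cents=1, replicas=3)
accepted = [market.submit(hit, f"worker{i}", "555-0100") for i in range(4)]
answers = [a["answer"] for a in market.get_assignments(hit)]
paid = market.approve_assignments(hit)
paid_again = market.approve_assignments(hit)
print(accepted, answers, paid, paid_again)
```

The number of replicas bounds both the cost of a HIT and how many independent answers are available for quality control.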

"

Page 36

Idea: CrowdDB

Use the crowd to answer DB-hard queries!

Where to use the crowd:
• Find missing data
• Make subjective comparisons
• Recognize patterns

But not:
• Anything the computer already does well

[Figure: CrowdDB architecture. CrowdSQL queries go in and results come out; components: Parser, Optimizer, Executor, Statistics, MetaData, Files Access Methods, Disk 1, Disk 2, UI Creation, Form Editor, UI Template Manager, HIT Manager, Turker Relationship Manager.]

M. Franklin, D. Kossmann, T. Kraska, S. Madden, S. Ramesh, R. Xin. CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011

Page 37

CrowdSQL

DDL Extensions:

Crowdsourced columns

CREATE TABLE company (
  name STRING PRIMARY KEY,
  hq_address CROWD STRING);

Crowdsourced tables

CREATE CROWD TABLE department (
  university STRING,
  department STRING,
  phone_no STRING)
  PRIMARY KEY (university, department);

DML Extensions:

CrowdEqual:

SELECT *
FROM companies
WHERE Name ~= "Big Blue"

CROWDORDER operators (currently UDFs):

SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Page 38

User Interface Generation

• A clear UI is key to response time and answer quality.
• Can leverage the SQL Schema to auto-generate UI (e.g., Oracle Forms, etc.)

Page 39

Query Optimization and Execution

CREATE CROWD TABLE department (
  name STRING PRIMARY KEY,
  phone_no STRING);

CREATE CROWD TABLE professor (
  name STRING PRIMARY KEY,
  e-mail STRING,
  dep STRING REF department(name));

SELECT *
FROM PROFESSOR p, DEPARTMENT d
WHERE d.name = p.dep
AND p.name = "Michael J. Carey"

[Figure: (b) Logical plan before optimization: selection name="Carey" over the join of Professor and Department on p.dep = d.name. (c) Logical plan after optimization: selection name="Carey" applied to Professor before the join on p.dep = d.name. (d) Physical plan: MTProbe(Professor) with name=Carey feeding MTJoin(Dep) on p.dep = d.name against Department. Generated forms: "Please fill out the missing professor data" (Name: Carey, E-Mail, Department; Submit) and "Please fill out the missing department data" (Department: CS, Phone; Submit).]

Page 40

Dealing with the Open-World

[Figure: query plans joining Professor and Department on p.dep = d.name, annotated with cardinalities.]

• SELECT * FROM PROFESSOR p, DEPARTMENT d WHERE p.dep = d.name
• SELECT * FROM PROFESSOR p, DEPARTMENT d WHERE p.dep = d.name LIMIT 0, 10
  – plan with Stop After (10); cardinalities shown: 10, 10, 10*1
• SELECT * FROM PROFESSOR p, DEPARTMENT d WHERE p.dep = d.name AND p.name = "Carey"
  – plan with selection name="Carey"; cardinalities shown: 1, 1, 1*1

Only allowed with iterative query improvement

Page 41

Subjective Comparisons

MTFunction
• Implements the CROWDEQUAL and CROWDORDER comparison
• Takes some description and a type (equal, order) parameter
• Quality control again based on majority vote
• Ordering can be further optimized (e.g., three-way comparisons vs. two-way comparisons)

[Figure: two generated HIT forms. "Are the following entities the same? IBM == Big Blue" with Yes / No buttons; "Which picture visualizes better "Golden Gate Bridge"" with a Submit button.]
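Majority-vote quality control over replicated assignments can be sketched in a few lines. The function name and the tie handling (returning `None` so the caller can request more assignments) are assumptions for the example; the slides only state that quality control is based on majority vote.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, or None on an empty input or a tie."""
    if not answers:
        return None
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no majority among the workers
    return ranked[0][0]


# Three workers answer "Are the following entities the same? IBM == Big Blue".
same_entity = majority_vote(["Yes", "Yes", "No"])
undecided = majority_vote(["Yes", "No"])
print(same_entity, undecided)
```

Using an odd number of assignments per HIT avoids ties for yes/no questions such as the entity comparison above.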

Page 42

Does it Work?: Picture ordering

Query:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Data-Size: 30 subject areas, with 8 pictures each
Batching: 4 orderings per HIT
Replication: 3 Assignments per HIT
Price: 1 cent per HIT

(turker-votes, turker-ranking, expert-ranking)

[Figure: HIT form "Which picture visualizes better "Golden Gate Bridge"" with a Submit button.]

Page 43

Can we build a “Crowd Optimizer”?

Select * From Restaurant Where city = …

Page 44

Price vs. Response Time

[Figure: percentage of HITs that have at least one assignment completed (0% to 100%) vs. time (0 to 60 mins), one curve per reward: $0.01, $0.02, $0.03, $0.04.]

5 Assignments, 100 HITs

Page 45

Turker Affinity and Errors

[Figure: number of assignments submitted (0 to 250) per turker (1 to 81), showing total HITs submitted and incorrect assignments completed.]

5 Assignments

Page 46

Can we build a “Crowd Optimizer”?

Select * From Restaurant Where city = …

Turker comments:
• I would do work for this requester again.
• This guy should be shunned.
• I advise not clicking on his “information about restaurants” hits.
• Hmm... I smell lab rat material.
• be very wary of doing any work for this requester…

Page 47

Processor Relations?

AMPLab
HIT Group » Simple straight-forward HITs, find the address and phone number for a given business in a given city. All HITs completed were approved. Pay was decent for amount of time required, when compared to other available HITs. But not when looked at from an hourly wage perspective. I would do work for this requester again.
posted by…
fair: 5 / 5   fast: 5 / 5   pay: 4 / 5   comm: 0 / 5

Tim Klas Kraska
HIT Group » I recently did 299 HITs for this requester.… Of the 299 HITs I completed, 11 of them were rejected without any reason being given. Prior to this I only had 14 rejections, a .2% rejection rate. I currently have 8522 submitted HITs, with a .3% rejection rate after the rejections from this requester (25 total rejections). I have attempted to contact the requester and will update if I receive a response. Until then be very wary of doing any work for this requester, as it appears that they are rejecting about 1 in every 27 HITs being submitted.
posted by …

Page 48

Crowd as specialized CPUs

                         Cloud                              Crowd
Cost                     ($0.02 - $2.10) / hour             $4.8 / hour
Pay Model                pay-as-you-go                      pay-as-you-go
Investment               none                               training
Response Time            Varied: msec                       Varied: min, hours, days
Capabilities & Features  good (number crunching), bad (AI)  good (AI), bad (number crunching)
Programming              formal                             natural language GUI
Availability             virtually unlimited                ???
Affinity                 virtualized                        personal
Reliability              Faulty but not malicious           Faulty and possibly malicious
Legal                    Privacy                            Privacy++, taxes, benefits, ..

Page 49

Future: Crowdsourcing → DB++?

• Cost Model for the Crowd
  – Latency (mins) vs. Cost ($) vs. Quality (%error)
• Adaptive Query Optimization
  – How to monitor crowd during query execution?
  – How to adapt the query plan?
• Caching / Materializing Crowd Results
  – E.g., maintaining the cached values
• Complexity Theory
  – Three-way comparisons vs. Two-way comparisons
• Privacy: Public vs. Private crowds
  – Flip-side of affinity of turkers
• Meta-crowds: Crowds help crowds (e.g. UI)

Page 50

Future: DB → Crowdsourcing++?

Crowd-Hard Problems:
• Programming Language: GUI
• Many, many knobs to turn
• Changing platform behavior
• Quality Control
• Learning effects
• Community Management
• …

The DB-Approach can help?
• Data independence, cost-based optimization, schema management…

Page 51

The AMPLab

[Figure: the Algorithms / Machines / People diagram with the application areas Sequencing, Microsimulation, and Mobile Millennium.]

Make sense at scale by holistically integrating Algorithms, Machines, and People

Page 52

Summary - AMPLab

• Goal: Tame Big Data Problem
  – Balance quality, cost and time to solve a given problem
• The computing challenge of the decade
• Widespread applicability across applications
• To address, we must holistically integrate Algorithms, Machines, and People

We thank our sponsors: [Figure: sponsor logos]