Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Making Sense at Scale with!Algorithms, Machines & People"
Tim Kraska"EECS, Computer Science"
UC Berkeley"
Agenda"
•# Issues/Opportunities of Big Data"•# AMPLab Background"•# AMP Technologies"
Algorithms for Machine Learning"Machines for Cloud Computing"People for Crowd Sourcing"
•# Project Status and Wrap Up""2
Big Data is Massive…"•# Facebook: "
–# 130TB/day: user logs"–# 200-400TB/day: 83 million pictures"
"•# Google: > 25 PB/day processed data"
•# Data generated by LHC: 1PB/sec"
•# Gene sequencing: "–# Sequence 1 human cell costs Illumina $1k "–# Whole genome: $100,000 ! 1.5 GB – 30 TB (raw)"–# Sequence 1 cell for every infant by 2015?"–# 10 trillion cells / human body!"
•# Total data created in 2010: 1.ZettaByte (1,000,000 PB)/year"–# ~60% increase every year"3
… Growing …"•# More and more devices"
"•# Connected people"
•# Cheaper and cheaper storage: +50%/year"
4
… and Dirty"
•# Diverse"•# Variety of sources!•# Uncurated "•# No schema "•# Inconsistent semantics"•# Inconsistent syntax"
" 5
Queries have issues too"
•# Diverse"•# Time-sensitive"•# Opportunistic"•# Exploratory"•# Multi-hypotheses Pitfall"
6
“Big Data”: Working Definition"
When the normal application of current technology doesn’t enable users to obtain and answers of sufficient to data-driven questions"
"
7
A Necessary Synergy"1.# Improve scale, efficiency, and quality of
machine learning and analytics to increase value from Big Data(Algorithms)"
"2.# Use cloud computing to get value from Big
Data and enhance datacenter infrastructure to cut costs of Big Data management (Machines)"
3.# Leverage human activity and intelligence to obtain data and extract value from Big Data cases that are hard for algorithms !(People)"
8
Algorithms, Machines, People"•# Today’s solutions:"
Algorithms
Machines
People
search
Watson/IBM
9
The AMPLab"
search
Watson/IBM
Machines
People
Algorithms
!"#$%&'()%()'"*'(+",)'-.'$%*)&/"0%&'1,&2/$*34(5'!"+3$%)(5'"%6'7)28,)'
10
AMPLab: What is it?"
A Five-Year research collaboration to develop a new generation of data analysis
methods, tools, and infrastructure for making sense of data at scale"
(Started Feb 2011)"
11
AMP Faculty and Sponsors"
12
Berkeley Faculty"•# Alex Bayen (mobile sensing platforms)"•# Armando Fox (systems)"•# Michael Franklin (databases) Director"•# Michael Jordan (machine learning) Co-Director"•# Anthony Joseph (security & privacy)"•# Randy Katz (systems)"•# David Patterson (systems)"•# Ion Stoica (systems) Co-Director"•# Scott Shenker (networking)"
Sponsors:"
AMPLab Application Partners"•# Mobile Millennium Project"
–# Alex Bayen, Civil and Environment Engineering, UC Berkeley"
•# Microsimulation of urban development"–# Paul Waddell, College of
Environment Design, UC Berkeley"•# Crowd based opinion formation"
–# Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley"
•# Personalized Sequencing"–# Taylor Sittler, UCSF!"
13
AMP Research Themes"•# Algorithms"
–#Scale up machine learning"–#Error bars on everything"
•# Machines"–#Datacenter as a computer"
•# People"–#People in all phases of data analysis"
•# AMP"–#Put it all together"
14
Deb
uggi
ng
Crowd Mgt Computational Resource Mgt
Data Collector
Scalable Storage Engine
Higher Query Languages
Processing Frameworks
!"+3$%)'9)"/%$%&':'1%",.0+('9$-/"/$)('
Qua
lity
Con
trol
Data Source Selector
Result Control Center Visualization
Priv
acy
Con
trol
Berkeley Data Analytics System"
15
Roles: Infra. Builder Algo/Tools Data Collector Data Analyst Privacy Mgr
Algorithms: Error Management"•# Immediate results/continuous improvement"•# Calibrate answer: provide error bars"•# Breakthrough – “Big Data” Bootstrap"
;//2/'-"/('2%')<)/.'"%(=)/>'
Estim
ate"
# of data points
true answer
16
Algorithms: Error Management"•# Given any problem, data and a time budget"
–#Automatically pick the best algorithm"–#Actively learn and adapt strategy"
Estim
ate"
time
pick algo 1 pick algo 2 error too high
true answer
17
Algos: Black Swans vs. Red Herrings"What about new data sources and bias?"
Number of Hypotheses
More Data Per Existing Features (rows)
More Data as New Features
(columns) 18
Machines"Goal: “The datacenter as a computer”, but…"
–#Special purpose clusters, e.g., Hadoop cluster"–#Highly variable performance"–#Hard to program"–#Hard to debug"
19
=!?
Machines: Problem"
•#Rapid innovation in cluster computing frameworks"
•# Hadoop, Pregel, Dryad, Rails, Spark, Pig, MPI2, …"
•#No single framework optimal for all applications"•#Want to run multiple frameworks in a single cluster"»#…to maximize utilization!»#…to share data between frameworks"
20
Machines: A Solution"•#Mesos: a resource sharing layer supporting diverse frameworks"
–# Fine-grained sharing: Improves utilization, latency, and data locality"–# Resource offers: Simple, scalable application-controlled scheduling mechanism"
!"#$#%
&$'"% &$'"% &$'"% &$'"%
()'$$*% +,"-".%!"
&$'"% &$'"%
()'$$*%
&$'"% &$'"%
+,"-".%!"
B. Hindman, et al, Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011. 21
Mesos – Cluster Operating System"
Efficiently shares resources among diverse parallel
applications"
22
!)(2('(,"<)'
!)(2('4"(*)/'
?/."6'(+3)6@,)/'
!)(2('(,"<)'
A"6228')B)+@*2/'
*"(#'
!)(2('(,"<)'
?/."6')B)+@*2/'
*"(#'
!7C'(+3)6@,)/'
!7C')B)+@*2/'
*"(#'
A"6228'(+3)6@,)/'
?/."6')B)+@*2/'
*"(#'
!7C')B)+@*2/'
*"(#'
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
100%
1 31 61 91 121 151 181 211 241 271 301 331
Shar
e of
Clu
ster
Time (s)
MPI
Hadoop
Spark
Mesos: Implementation Status"
•# 20,000 lines of C++"•# Master failover using ZooKeeper"•# Frameworks ported: Hadoop, MPI, Torque"•# New specialized frameworks: Spark"•# Open source in Apache Incubator"•# In use at Berkeley, UCSF, Twitter and elsewhere"
23
Datacenter Programming Framework: Spark"
•# In-memory cluster computing framework for applications that reuse working sets of data"–# Iterative algorithms: machine learning, graph
processing, optimization"–# Interactive data mining: order of magnitude faster than
disk-based tools"
•# Key idea: “resilient distributed datasets” that can automatically be rebuilt on failure"–# mechanism based on “lineage”"
"
24
M. Zaharia, et al, Spark: Cluster Computing with Working Sets, 2nd USENIX Workshop on Hot Topics in Cloud Computing (Hot Cloud 2010), June2010.
Spark Data Flow vs. Map Reduce"
Slave"Slave"Slave"Slave"
Master"
R1! R2! R3! R4!
aggregate!
param!
Master"
aggregate!
param!
Map 4"Map 3"Map 2"Map 1"
Reduce"
aggregate!
Map 8"Map 7"Map 6"Map 5"
Reduce"
param!
!! !! !!!Spark! MapReduce / Dryad!
25
Scala Language Integration"
val data = readData(...) var w = Vector.random(D) for (i <- 1 to ITERATIONS) { var gradient = Vector.zeros(D) for (p <- data) { val s = (1/(1+exp(-p.y* (w dot p.x)))-1) * p.y gradient += s * p.x } w -= gradient } println("Final w: " + w)
"
val data = spark.hdfsTextFile(...) .map(readPoint _).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { var gradient = spark.accumulator( Vector.zeros(D)) for (p <- data) { val s = (1/(1+exp(-p.y* (w dot p.x)))-1) * p.y gradient += s * p.x } w -= gradient.value } println("Final w: " + w)
"
Serial Version! Spark Version!
26
Logistic Regression Performance"
127 s / iteration"
first iteration 174 s"further iterations 6 s"0"
500"1000"1500"2000"2500"3000"3500"4000"4500"
1" 5" 10" 20" 30"
Run
ning
Tim
e (s
)!
Number of Iterations!
Hadoop"Spark"
27
People"•# Make people an integrated part of
the system!"–# Leverage human activity"–# Leverage human intelligence
(crowdsourcing):"•# Curate and clean dirty data"•# Answer imprecise questions"•# Test and improve algorithms"
"•# Challenge"
–# Inconsistent answer quality in all dimensions (e.g., type of question, time, cost) "
"28
!"+3$%)('D'1,&2/$*34('
data
, ac
tivity
Que
stio
ns A
nswers
CrowdDB – A First Step"
29
search
Watson/IBM
People
Algorithms
Crowd DB
Machines
Problem: DB-hard Queries"
30
SELECT Market_Cap From Companies Where Company_Name = “IBM”
Number of Rows: 0
Problem: EEnnttiittyy RReessoolluuttiioonn
!"#$%&'()%#*+ ,--.*//+ 0%.1*2+!%$+
E22&,)' E22&,)8,)B5'!*%F'G$)='H1' IJKLM%'
C%*,F'M@($%)(('!"+3$%)(' 1/42%#5'NO' IPLQM%'
!$+/2(2R' S)642%65'T1' IPLUM%'
DB-hard Queries"
31
SELECT Market_Cap From Companies Where Company_Name = “Apple”
Number of Rows: 0
Problem: CClloosseedd WWoorrlldd AAssssuummppttiioonn
!"#$%&'()%#*+ ,--.*//+ 0%.1*2+!%$+
E22&,)' E22&,)8,)B5'!*%F'G$)='H1' IJKLM%'
C%*,F'M@($%)(('!"+3$%)(' 1/42%#5'NO' IPLQM%'
!$+/2(2R' S)642%65'T1' IPLUM%'
DB-hard Queries"
32
SELECT Top_1 Image From Pictures Where Topic = “Business Success” Order By Relevance
Number of Rows: 0
Problem: SSuubbjjeeccttiivvee CCoommppaarriissoonn
Easy Queries"
33
SELECT Market_Cap From Companies Where Company_Name = “IBM”
$203Bn Number of Rows: 1
!"#$%&'()%#*+ ,--.*//+ 0%.1*2+!%$+
E22&,)' E22&,)8,)B5'!*%F'G$)='H1' IJKLM%'
C%*,F'M@($%)(('!"+3$%)(' 1/42%#5'NO' IPLQM%'
!$+/2(2R' S)642%65'T1' IPLUM%'
Pretty Easy Queries"
34
SELECT Market_Cap From Companies Where Company_Name =
“The Cool Software Company”
$xxxBn Number of Rows: 1
!"#$%&'()%#*+ ,--.*//+ 0%.1*2+!%$+
E22&,)' E22&,)8,)B5'!*%F'G$)='H1' IJKLM%'
C%*,F'M@($%)(('!"+3$%)(' 1/42%#5'NO' IPLQM%'
!$+/2(2R' S)642%65'T1' IPLUM%'
Microtasking Marketplaces"
"# Requestors approve jobs and payment"
"# API-based: “createHit()”, “getAssignments()”, “approveAssignments()”"
"# Other parameters: #of replicas, expiration, User Interface,…!
35
•# Current leader: Amazon Mechanical Turk"•# Requestors place Human Intelligence Tasks
(HITs) "
•# Workers (a.k.a. “turkers”) choose jobs, do them, get paid"
"
Idea: CrowdDB "
36
Use the crowd to answer DB-hard queries!"Where to use the crowd:"•# Find missing data!•# Make subjective
comparisons!•# Recognize patterns"
But not:"•# Anything the computer
already does well " Disk 2
Disk 1
Parser
Optimizer
Stat
istics
CrowdSQL Results
Executor
Files Access Methods
UI Template Manager
Form Editor
UI Creation
HIT Manager
Met
aDat
a
Turker Relationship Manager
M. Franklin, D. Kossmann, T. Kraska, S. Madden, S. Ramesh, R. Xin. CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
CrowdSQL"
SELECT *
FROM companies
WHERE Name ~~== “Big Blue”"
37
CREATE CCRROOWWDD TABLE department ( university STRING, department STRING, phone_no STRING) PRIMARY KEY (university, department);
CREATE TABLE company ( name STRING PRIMARY KEY, hq_address CCRROOWWDD STRING);
304+562*&/7"&/8++
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CCRROOWWDDOORRDDEERR(p, "Which pic shows better %subject");
334+562*&/7"&/V
!"#$%#"%&"'()*+,-(+.'/01++*2-34'5%6.78'!+(9:&;1,38'''
!+(9:.(1+0*:'0(31<2.''' !+(9:.(1+0*:'-,=3*.'
User Interface Generation"•# A clear UI is key to response time and
answer quality."•# Can leverage the SQL Schema to auto-
generate UI (e.g., Oracle Forms, etc.)"
38
Query Optimization and Execution"
39
CREATE CCRROOWWDD TABLE department ( name STRING PRIMARY KEY phone_no STRING); CREATE CCRROOWWDD TABLE professor ( name STRING PRIMARY KEY e-mail STRING dep STRING REF department(name) );
SELECT * "FROM PROFESSOR p, DEPARTMENT d"WHERE d.name = p.dep"AND p.name =“Michael J. Carey”"
Professor Department
!! name="Carey"
p.dep=d.name
Professor
Department
!
!name= "Carey"
p.dep=d.name
Please fill out the missing professor data
Submit
Carey
Name
Please fill out the missing department data
Submit
CS
Phone
Department
MTJoin(Dep)
p.dep = d.name
MTProbe(Professor)
name=Carey
(b) Logical plan before optimization
(c) Logical plan after optimization
(d) Physical plan
Department
Dealing with the Open-World"
40
Professor
Department
!!name=
"Carey"
p.dep=d.name
Professor Department
! p.dep=d.name
Professor Department
! p.dep=d.name
Stop After (10)
W;9;HX'Y''ZS[!'7S[Z;WW[S'85'''''?;71SX!;NX'6'TA;S;'8F6)8'\'6F%"4)' Professor Department
! p.dep=d.name
W;9;HX'Y''ZS[!'7S[Z;WW[S'85'''''?;71SX!;NX'6'TA;S;'8F6)8'\'6F%"4)'9C!CX'L5'JL''
10*1
!
!
!
10
10 10
! !
W;9;HX'Y''ZS[!'7S[Z;WW[S'85'''''?;71SX!;NX'6'TA;S;'8F6)8'\'6F%"4)'1N?'8F%"4)'\]H"/).^'' Professor Department
!! name="Carey"
p.dep=d.name! !
!
!
1
1
1*1
1
[%,.'",,2=)6'=$*3'$*)/"0<)'_@)/.'$48/2<)4)%*'
!
Subjective Comparisons"
41
09:;&<="&++•# $48,)4)%*('*3)'HS[T?;`a19'"%6'HS[T?[S?;S'+248"/$(2%''
•# X"#)('(24)'6)(+/$802%'"%6'"'*.8)'b)_@",5'2/6)/c'8"/"4)*)/'
•# `@",$*.'+2%*/2,'"&"$%'-"()6'2%'4"d2/$*.'<2*)'•# [/6)/$%&'+"%'-)'e@/*3)/'2804$f)6'b)F&F5'X3/))g=".'+248"/$($2%('<(F'X=2g=".'+248"/$(2%(c'
Are the following entities the same?
IBM == Big Blue
Yes No
Which picture visualizes better "Golden Gate Bridge"
Submit
Does it Work?: Picture ordering"
42
QQuueerryy:: SELECT p FROM picture WHERE subject = "Golden Gate Bridge" ORDER BY CCRROOWWDDOORRDDEERR(p, "Which pic shows better %subject");
Data-Size: 30 subject areas, with 8 pictures each Batching: 4 orderings per HIT Replication: 3 Assignments per HIT Price: 1 cent per HIT
(turker-votes, turker-ranking, expert-ranking)
Which picture visualizes better "Golden Gate Bridge"
Submit
Can we build a “Crowd Optimizer”?"
43
W),)+*'Y'Z/24'S)(*"@/"%*'T3)/)'+$*.'\'h'
Price vs. Response Time"
44
Li'
JLi'
PLi'
QLi'
jLi'
kLi'
ULi'
KLi'
lLi'
mLi'
JLLi'
L' JL' PL' QL' jL' kL' UL'
>*.<*&
2%?*
+"@+A
B9/+2
C%2+C
%D*+%2
+E*%/
2+"&
*+%/
/7?&
#*&
2+<"#
$E*2
*-+
97#*+F#7&/G+
ILFLJ''
ILFLP''
ILFLQ''
ILFLj''
5 Assignments, 100 HITs
Turker Affinity and Errors"
45
L'
kL'
JLL'
JkL'
PLL'
PkL'
J' PJ' jJ' UJ' lJ'
);#
H*.+"
@+%//
7?&#
*&2/+/;
H#7I
*-+
9;.1*.+
X2*",'ACX('(@-4$n)6'oC%+2//)+*'"(($&%4)%*('+248,)*)6o'
5 Assignments
Can we build a “Crowd Optimizer”?"
46
W),)+*'Y'Z/24'S)(*"@/"%*'T3)/)'+$*.'\'h'
C'=2@,6'62'=2/#'e2/'*3$('/)_@)(*)/'"&"$%F'
X3$('&@.'(32@,6'-)'(3@%%)6F'
C'"6<$()'%2*'+,$+#$%&'2%'3$(']$%e2/4"02%'"-2@*'/)(*"@/"%*(^'3$*(F'
A44FFF'C'(4),,',"-'/"*'4"*)/$",F''
-)'<)/.'="/.'2e'62$%&'"%.'=2/#'e2/'*3$('/)_@)(*)/h'
Processor Relations?"AMPLab"HIT Group » Simple straight-forward HITs, find the address and phone number for a given business in a given city. All HITs completed were approved. Pay was decent for amount of time required, when compared to other available HITs. But not when looked at from an hourly wage perspective. I would do work for this requester again. $$$$posted by… "fair:5$/$5$$ fast:5$/$5$$ pay:4$/$5$$$comm:0$/$5"$$ $ $ $ $ $ $ $ $ $ $ $ $ "Tim Klas Kraska"
HIT Group » "I recently did 299 HITs for this requester.… Of the 299 HITs I completed, 11 of them were rejected without any reason being given. Prior to this I only had 14 rejections, a .2% rejection rate. I currently have 8522 submitted HITs, with a .3% rejection rate after the rejections from this requester (25 total rejections). I have attempted to contact the requester and will update if I receive a response. Until then be very wary of doing any work for this requester, as it appears that they are rejecting about 1 in every 27 HITs being submitted. $ $ posted by …"
" $$ $ "
47
Crowd as specialized CPUs"
!E";-+ !."J-+!"/2+ bILFLP'g'IPFJLc:'32@/' IjFl':'32@/'
>%'+0"-*E+ 8".g"(g.2@g&2' 8".g"(g.2@g&2'
B&D*/2#*&2+ %2%)' '*/"$%$%&'
K*/$"&/*+97#*+ G"/$)6V'4()+' G"/$)6V'4$%5'32@/(5'6".('
!%$%H7E7=*/+L+:*%2;.*/+
&226'b%@4-)/'+/@%+3$%&c''p'-"6'b1Cc'
&226'b1Cc'p''-"6b%@4-)/'+/@%+3$%&c'
>."?.%##7&?+ e2/4",' %"*@/",',"%&@"&)'EaC'
,D%7E%H7E72'+ <$/*@",,.'@%,$4$*)6' qqq'
,M&72'+ <$/*@",$f)6' 8)/(2%",'
K*E7%H7E72'+ Z"@,*.'-@*'%2*'4",$+$2@(' Z"@,*.'"%6'82(($-,.'4",$+$2@('4*?%E+ 7/$<"+.' 7/$<"+.DD5'*"B)(5'-)%)r*(5FF'
48
Future: Crowdsourcing ! DB++?"
49
•# H2(*'!26),'e2/'*3)'H/2=6''–# 9"*)%+.'b4$%(c'<(F'H2(*'bIc'<(F'`@",$*.'bi)//2/c'
•# 16"80<)'`@)/.'[804$f"02%'–# A2='*2'42%$*2/'+/2=6'6@/$%&'_@)/.')B)+@02%q''–# A2='*2'"6"8*'*3)'_@)/.'8,"%q'
•# H"+3$%&':'!"*)/$",$f$%&'H/2=6'S)(@,*(''–# ;F&F5'4"$%*"$%$%&'*3)'+"+3)6'<",@)('
•# H248,)B$*.'X3)2/.''–# X3/))g=".'+248"/$($2%('<(F'X=2g=".'+248"/$(2%('
•# 7/$<"+.V'7@-,$+'<(F'7/$<"*)'+/2=6(''–# Z,$8g($6)'2e'"s%$*.'2e'*@/#)/('
•# !)*"g+/2=6(V'H/2=6('3),8'+/2=6('b)F&F'aCc'
H/2=6gA"/6'7/2-,)4(V'•# 7/2&/"44$%&'9"%&@"&)V'EaC'•# !"%.5'4"%.'#%2-('*2'*@/%'•# H3"%&$%&'8,"t2/4'-)3"<$2/'•# `@",$*.'H2%*/2,'•# 9)"/%$%&')u)+*(''•# H244@%$*.+!"%"&)4)%*'•# h'
Future: DB ! Crowdsourcing++?"
X3)'?Mg188/2"+3'+"%'3),8q'•# ?"*"'$%6)8)%6)%+)5'+2(*g-"()6'2804$f"02%5'(+3)4"'4"%"&)4)%*h'
50
Sequencing
Microsimulation Mobile Millennium
The AMPLab"
Machines
People
Algorithms
!"#)'()%()'"*'(+",)'-.'32,$(0+",,.'$%*)&/"0%&'1,&2/$*34(5'!"+3$%)(5'"%6'7)28,)'
51
Summary - AMPLab"•# Goal: Tame Big Data Problem"
Balance quality, cost and time to solve a given problem"•# The computing challenge of the decade "•# Widespread applicability across applications"•# To address, we must Holistically integrate "
!Algorithms, Machines, and People" "
We thank our sponsors:"
""
"52