AMPLab: Algorithms, Machines and People

Michael Franklin
UC Berkeley

Kick-off Meeting
December 8, 2010

Asilomar, CA

Agenda

•  About the retreat
•  AMPLab Background
   – Context
   – People
•  Project Motivation and Goals
•  Research Thrusts
•  Data Management Foci

Retreat Agenda - Highlights

•  Today: AMP Overviews and Applications*
•  Discussion topic groups during Dinner*
•  Open Mic session – Techie and other*
•  Thurs: Machine Learning, Systems, and DB*
•  Lunch with Discussion*
•  <sun comes out here> Activity Break*
•  Industry talks – Founding Sponsors*
•  Report outs; Dinner/Poster Session*
•  Fri: Industry talks – Cloud and Crowd*
•  Group Photo*
•  Industry Feedback***

* = Your Participation Needed

Who’s Here?

•  Amazon •  Cisco •  Cloudera •  Crowdflower •  eBay •  Ericsson •  Facebook

•  Google •  HP •  Huawei •  IBM •  Intel •  Microsoft •  NEC

•  NetApp •  O'Reilly •  Oracle •  SAP •  Twitter •  VMware •  Yahoo!

Our Retreat Goals

•  Outline our directions and project goals
•  Introduce our ideas as they currently stand
•  Introduce you to the team and vice versa
•  Get your ideas, guidance, feedback, directions
•  Kick off a great 5-year collaboration
•  Position AMPLab to be the leading academic center for “Big Data” research

Computing as a Commodity

Continuous Improvement of Client Devices

Ubiquitous Connectivity

Algorithms, Machines & People

Adaptive/Active Machine Learning and Analytics

Cloud Computing

CrowdSourcing

Massive and Diverse Data

AMPLab: What is it?

A five-year research collaboration to develop a new generation of data analysis methods, tools and infrastructure for making sense at scale.

“Big Data”: Working Definition

When the normal application of current technology doesn’t enable users to obtain answers to their data-driven questions.

The Scalability Dilemma

•  State-of-the-art Machine Learning techniques do not scale to large data sets.

•  Data Analytics frameworks can’t handle lots of incomplete, heterogeneous, dirty data.

•  Processing architectures struggle with increasing diversity of programming models and job types.

•  Adding people to a late project makes it later.

Exactly Opposite of what we Expect and Need

The AMP Team

•  Principal Investigators (*co-directors)
   –  Alex Bayen (sensing platforms)
   –  Armando Fox (systems)
   –  Michael Franklin* (databases)
   –  Michael Jordan* (machine learning)
   –  Anthony Joseph (security & privacy)
   –  Randy Katz (systems)
   –  David Patterson (systems)
   –  Ion Stoica* (systems)
   –  Scott Shenker (networking)
•  PostDocs and Visiting Researchers
   –  Ali Ghodsi (KTH), Tim Kraska (ETH Zurich), Justin Ma (UCSD), Purna Sarkar (CMU), Elaine Shi (CMU/PARC)
•  Lots o’ Great Students
   –  Present and Future

Background: RAD Lab

Enable 1 person to develop, deploy, operate a next-generation Internet application at scale

Five-year collaborative effort – Thru Feb 2011
•  See upcoming “end of project” event and demo

Initial Technical Bet:
•  Machine Learning for large-scale self-managing systems

Wave Caught: Cloud Computing

Multi-area faculty, postdocs, & students
•  Systems, Networks, DB, Security, Statistical Machine Learning all in a single, open, collaborative space

Industrial Sponsorship and intensive interaction
•  Including bi-annual retreats

RAD, AMP, WTF?

•  Similar cast of characters and expertise
•  Leveraging RADLab structure, organizational approach and collaborative space
•  Will start with much of the RADLab-developed software stack
•  Focus on big data analytics and analytics-intensive applications
•  “Systems for ML” rather than “ML for Systems”
•  Close collaboration with application partners
•  Focus on Human Element throughout data lifecycle
   –  Elvis is everywhere

AMP Themes

•  Big Data
   •  Scale -> Diversity
   •  Scale -> You never see “all the data”
   •  Scale -> Randomness looks like non-randomness
•  Reliable answers from unreliable data
•  Cloud, Warehouse-Scale & Ubiquitous Computing
•  People involved throughout the whole data lifecycle
   •  Both a part of the problem and a part of the solution
•  Continuous answer improvement/Pay-as-you-go
•  Smart, flexible, fair resource allocation and use

Initial Use Cases/Applications

Crowdsourced: Sensing, Analysis, Policy, Journalism

Urban Micro-Simulation (next session)

AMPLab: Making Sense at Scale

•  A holistic view of the stack.
•  Strong Industrial involvement
•  A five-year plan
•  Research underway now
•  Public launch of lab in Feb 2011
•  Funding & commitments received to date:

Data Viz Collaboration, HCI

Text analytics Machine Learning and Stats

Database, OLAP, MapReduce Security and Privacy

MPP, Data Centers, Networks

Multi-Core Parallelism

AMP: Technical Thrusts

•  Machine Learning and Analytics (tomorrow AM)
   –  Error Bars on all Answers
   –  Active learning and continuous/adaptive answer improvement
•  Infrastructure (next talk)
   –  Mesos cloud OS and analytics frameworks
•  Data Management
   –  Pay-as-you-go integration and structure
   –  ML/Analytics workload/workflow support
•  Hybrid Crowd/Cloud Systems
   –  “Human Tolerant Computing”
   –  Incentive structures, HCI aspects

Architectural View (strawman)

[Figure: strawman architecture diagram. Component labels: Crowd UI; Crowd Resource Mgt; OpenFlow/NOX; Mesos; Inform. Integr.; Scalable Storage Engine; Spark; Confidence Insightful Query Language (CIQL); Privacy Control; Debugging; Machine Learning; Result Control Center; Collaborative Visualization]

Scaling Machine Learning

•  Want a systematic methodology that automatically selects an operating point along the spectrum
•  Given a fixed computational budget, performance should improve monotonically as data accrue
   –  immediate results with continual improvement
•  Smart sampling/dropping of data
•  Error bars on all answers

[Figure: spectrum labeled “Algorithmic Complexity”, with “Complex Alg, Less Data” at one end and “Simple Algorithms, Massive Data” at the other; tick labels as extracted: n3, n log n, n log n]
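
The bullets on “immediate results with continual improvement” and “error bars on all answers” can be illustrated with a small sketch. This is not the AMPLab method: it assumes the answer is a simple mean over a stream of values and uses a normal-approximation 95% interval. The names `RunningEstimate`, `observe`, `error_bar` and `answers_over_time` are hypothetical.

```python
import math
import random


class RunningEstimate:
    """Running mean with a normal-approximation confidence interval.

    Uses Welford's online update, so each new value costs O(1).
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def observe(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def error_bar(self, z=1.96):
        """Half-width of an approximate 95% interval; infinite until n >= 2."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return z * math.sqrt(variance / self.n)


def answers_over_time(stream, checkpoints):
    """Yield (n, mean, half_width) each time n reaches a checkpoint."""
    est = RunningEstimate()
    marks = set(checkpoints)
    for x in stream:
        est.observe(x)
        if est.n in marks:
            yield est.n, est.mean, est.error_bar()


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic data, for illustration only
    data = (rng.gauss(50.0, 10.0) for _ in range(10_000))
    for n, mean, half in answers_over_time(data, [10, 100, 1_000, 10_000]):
        print(f"n={n:>6}  answer={mean:6.2f} +/- {half:.2f}")
```

An answer is available after the first few values, and the half-width shrinks roughly in proportion to 1/sqrt(n) as more data is seen, which is one simple reading of pay-as-you-go answer quality.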

Reliable Answers

•  Probabilistic Databases
•  Schema Matching
•  Judicious use of User Input
•  Approximate Query Answering
•  Uncertainty Management
•  Data Model Learning
•  Provenance and Annotation
•  Structured + Unstructured Search

Data Management for Interactive Web Applications

[Figure: repeated App/DB boxes above SCADS Storage. Other labels: Scads Storage Handler, Mesos, VM monitor, Director]

•  Stateless
•  Easy to scale
•  DB Library

•  PIQL – Performance Insightful Query Language
•  Consistency Rationing – Consistency à la carte
•  AVRO Schema - Physical/Logical data independence
•  Independent of Key/Value store

•  Stateful but still easy to scale (no sharding required)
•  Simple Query Interface
•  Reduced consistency
•  Predictable performance
•  Easy to price
•  High availability (even across data centers)

AMP Lab Big Data Analytics

[Figure: many repeated App/DB boxes]

Moving Code to Data

Integrating the Crowd

Machine Learning Access Patterns

Functionality?

Optimize the storage for machine learning
•  Intermediate results as first class citizens
•  Adaptive replication levels and layout (e.g. Fractured Mirrors)
•  …

Spark as a first step

// "test" names a stored collection; the closure passed to map is the code moved to the data
val custs = cluster.get("test")
val result = custs.map(
  a => a.title == "PostDoc" && a.incrSalary(10) > 60 )

Participatory Culture - Direct

Participatory Culture – “Indirect”

John Murrell: GM SV 9/17/09 …every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.

Hybrid Computing – A First Step

[Figure: system architecture diagram. Component labels: Parser; Optimizer; Executor; Statistics; MetaData; Files Access Methods; Disk 1; Disk 2; Query Results; UI Creation; Form Editor; Form Collection; HIT Management; People DB]

CrowdDB

CrowdSQL

•  DDL Extensions: Crowdsourced columns, tables and referential integrity.

   CREATE TABLE company (
     name STRING,
     hq_address CROWD STRING);

   CREATE CROWD TABLE department (
     name STRING PRIMARY KEY,
     phone_no STRING);

•  DML Extensions: CROWDEQUAL and CROWDORDER operators (currently UDFs).
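
One way to read the CROWD annotation in the DDL above is that values in such a column may be unknown when a row is inserted and can later be requested from the crowd. The sketch below is only an illustration of that reading, not CrowdDB's implementation. The names `Column`, `Table` and `missing_crowd_values` are hypothetical, as are the example company rows.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    name: str
    crowd: bool = False  # True for a column declared with CROWD


@dataclass
class Table:
    name: str
    columns: list
    rows: list = field(default_factory=list)  # each row is a dict

    def insert(self, **values):
        # Columns that are not supplied are stored as None (unknown).
        self.rows.append({c.name: values.get(c.name) for c in self.columns})


def missing_crowd_values(table):
    """Return (row_index, column_name) pairs that could be sent to the crowd."""
    return [
        (i, c.name)
        for i, row in enumerate(table.rows)
        for c in table.columns
        if c.crowd and row[c.name] is None
    ]


# Mirrors: CREATE TABLE company (name STRING, hq_address CROWD STRING);
company = Table("company", [Column("name"), Column("hq_address", crowd=True)])
company.insert(name="Example Corp")                        # address unknown
company.insert(name="Sample Inc", hq_address="1 Main St")  # address known

print(missing_crowd_values(company))
```

Only the first row is reported, because its `hq_address` is both declared CROWD and still unknown.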

SELECT *
FROM professor p, department d
WHERE p.dep = d.name AND p.name = "Carey"

[Figure: (a) PeopleSQL query; (b) logical plan before optimization and (c) logical plan after optimization, both built from Professor, Department, the join ⋈ on p.dep=d.name and the selection σ name="Carey"; (d) physical plan with MTProbe(Professor) on name=Carey and MTJoin(Dep) on p.dep = d.name. Generated worker forms: “Please fill out the missing professor data” (fields Name: Carey, E-Mail; Submit) and “Please fill out the missing department data” (fields Department: CS, Phone; Submit). Other label as extracted: GoodEnough]

Leveraging DB Technology

•  Use Schema to drive Turker UI generation
•  Create a cost model of crowd operators and plug them into the optimizer.
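
Both bullets can be made concrete with a short sketch. It assumes a row is described by a list of column names plus the values already known, and it uses a deliberately naive cost model (tasks times workers per task times a flat price). The functions `render_form` and `crowd_cost`, the field names, and the default replication and price are all hypothetical, chosen for illustration only.

```python
from html import escape


def render_form(table_name, schema, known):
    """Build an HTML form for one row from its schema.

    schema: list of column names; known: dict of values already stored.
    Known values are shown read-only, unknown ones become empty inputs.
    """
    lines = [f"<form><p>Please fill out the missing {escape(table_name)} data</p>"]
    for col in schema:
        value = known.get(col)
        if value is None:
            field_html = f'<input name="{escape(col)}">'
        else:
            field_html = (
                f'<input name="{escape(col)}" value="{escape(str(value))}" readonly>'
            )
        lines.append(f"<label>{escape(col)} {field_html}</label>")
    lines.append('<button type="submit">Submit</button></form>')
    return "\n".join(lines)


def crowd_cost(num_tasks, assignments_per_task=3, price_per_assignment=0.02):
    """Toy cost model: each task is answered by several workers at a flat price."""
    return num_tasks * assignments_per_task * price_per_assignment


form = render_form("professor", ["name", "email"], {"name": "Carey"})
print(form)
print(crowd_cost(num_tasks=1))
```

An optimizer could compare such cost estimates for alternative plans, for example preferring a plan that issues fewer crowd tasks; how CrowdDB actually weighs cost, latency and quality is not specified on the slide.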

Picture query

Select the best picture of the Golden Gate Bridge
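
A query like this one is a natural fit for a crowd ordering operator such as CROWDORDER. The slides do not say how the operator is implemented, so the sketch below simply assumes pairwise comparisons decided by majority vote. The functions `majority`, `crowd_order` and `simulated_workers`, the file names and the quality scores are invented for the example.

```python
from collections import Counter
from itertools import combinations


def majority(votes):
    """Return the most common answer among the workers' votes."""
    return Counter(votes).most_common(1)[0][0]


def crowd_order(items, ask_workers):
    """Rank items by the number of pairwise comparisons each one wins.

    ask_workers(a, b) returns a list of votes, each equal to a or b.
    """
    wins = Counter({item: 0 for item in items})
    for a, b in combinations(items, 2):
        wins[majority(ask_workers(a, b))] += 1
    # Sort by wins; ties keep the original order so the result is deterministic.
    return sorted(items, key=lambda it: (-wins[it], items.index(it)))


# Simulated crowd: three workers per pair; the third always picks the first option.
quality = {"fog.jpg": 1, "sunset.jpg": 3, "traffic.jpg": 2}


def simulated_workers(a, b):
    better = a if quality[a] >= quality[b] else b
    return [better, better, a]


ranking = crowd_order(list(quality), simulated_workers)
print(ranking[0])
```

Comparing every pair costs a number of tasks quadratic in the number of pictures, which is the kind of trade-off a cost model for crowd operators would have to capture.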

Integrating Clouds and Crowds

Data Acquisition
   Interactive Cloud: Transactional systems, Data entry
   Analytic Cloud: … + Sensors (physical & software)
   People Cloud: … + Web 2.0

Computation
   Interactive Cloud: Get and Put
   Analytic Cloud: Map Reduce, Parallel DBMS, Stream Processing
   People Cloud: … + Collaborative Structures (e.g., Mechanical Turk, Intelligence Markets)

Data Model
   Interactive Cloud: Records
   Analytic Cloud: … + Numbers, Media
   People Cloud: … + Text, Media, Natural Language

Response Time
   Interactive Cloud: Seconds
   Analytic Cloud: … + Min/Hours/Days
   People Cloud: + Continuous

Summary

•  AMPLab will investigate the confluence of Algorithms, Machines and People to solve “Big Data” analysis problems.
•  Huge research issues across many domains.
•  The goal of this meeting is to get the process started.
•  Our approach depends on close interaction with our industrial partners.
•  We look forward to your input, advice and collaboration.

Proposed Discussion Topics

1) Using the Cloud for Big Data analytics and Machine Learning

2) Pay-as-you-go Processing - incremental answer improvement and data integration

3) Data Center Operating System

4) Debugging Big Data systems

5) Crowdsourcing Opportunities and Challenges

6) User Interface, Interaction and Data Visualization

7) Mobility and Devices - how do they impact our agenda?

8) Privacy Issues - What are they and how to address them?

9) Metrics - How does AMPLab measure success?

10) Approximate answers and answer confidence

11) Killer Apps - What other apps should AMP focus on?

12) What questions can we answer? Not answer? Can we classify questions in terms of difficulty?

13) How to define “Big Data” and should we use some other term?
