AMPLab: Algorithms, Machines and People
Michael Franklin, UC Berkeley
Kick-off Meeting, December 8, 2010
Asilomar, CA
Agenda
• About the retreat
• AMPLab Background
– Context
– People
• Project Motivation and Goals
• Research Thrusts
• Data Management Foci
Retreat Agenda - Highlights
• Today: AMP Overviews and Applications*
• Discussion topic groups during Dinner*
• Open Mic session – Techie and other*
• Thurs: Machine Learning, Systems, and DB*
• Lunch with Discussion*
• <sun comes out here> Activity Break*
• Industry talks – Founding Sponsors*
• Report outs; Dinner/Poster Session*
• Fri: Industry talks – Cloud and Crowd*
• Group Photo*
• Industry Feedback***
* = Your Participation Needed
Who’s Here?
• Amazon • Cisco • Cloudera • Crowdflower • eBay • Ericsson • Facebook
• Google • HP • Huawei • IBM • Intel • Microsoft • NEC
• NetApp • O'Reilly • Oracle • SAP • Twitter • VMware • Yahoo!
Our Retreat Goals
• Outline our directions and project goals
• Introduce our ideas as they currently stand
• Introduce you to the team and vice versa
• Get your ideas, guidance, feedback, directions
• Kick off a great 5-year collaboration
• Position AMPLab to be the leading academic center for “Big Data” research
Computing as a Commodity
Continuous Improvement of Client Devices
Ubiquitous Connectivity
Algorithms, Machines & People
Adaptive/Active Machine Learning and Analytics
Cloud Computing
CrowdSourcing
Massive and Diverse Data
AMPLab: What is it?
A five-year research collaboration to develop a new generation of data analysis methods, tools and infrastructure for making sense at scale.
“Big Data”: Working Definition
When the normal application of current technology doesn’t enable users to obtain answers to their data-driven questions.
The Scalability Dilemma
• State-of-the-art Machine Learning techniques do not scale to large data sets.
• Data Analytics frameworks can’t handle lots of incomplete, heterogeneous, dirty data.
• Processing architectures struggle with increasing diversity of programming models and job types.
• Adding people to a late project makes it later.
Exactly Opposite of what we Expect and Need
The AMP Team
• Principal Investigators (*co-directors)
– Alex Bayen (sensing platforms)
– Armando Fox (systems)
– Michael Franklin* (databases)
– Michael Jordan* (machine learning)
– Anthony Joseph (security & privacy)
– Randy Katz (systems)
– David Patterson (systems)
– Ion Stoica* (systems)
– Scott Shenker (networking)
• PostDocs and Visiting Researchers
– Ali Ghodsi (KTH), Tim Kraska (ETH Zurich), Justin Ma (UCSD), Purna Sarkar (CMU), Elaine Shi (CMU/PARC)
• Lots o’ Great Students
– Present and Future
Background: RAD Lab
Enable 1 person to develop, deploy, operate a next-generation Internet application at scale
Five-year collaborative effort – Thru Feb 2011
• See upcoming “end of project” event and demo
Initial Technical Bet:
• Machine Learning for large-scale self-managing systems
Wave Caught: Cloud Computing
Multi-area faculty, postdocs, & students
• Systems, Networks, DB, Security, Statistical Machine Learning all in a single, open, collaborative space
Industrial Sponsorship and intensive interaction
• Including bi-annual retreats
RAD, AMP, WTF?
• Similar cast of characters and expertise
• Leveraging RADLab structure, organizational approach and collaborative space
• Will start with much of the RADLab-developed software stack
• Focus on big data analytics and analytics-intensive applications
• “Systems for ML” rather than “ML for Systems”
• Close collaboration with application partners
• Focus on Human Element throughout data lifecycle
– Elvis is everywhere
AMP Themes
• Big Data
• Scale -> Diversity
• Scale -> You never see “all the data”
• Scale -> Randomness looks like non-randomness
• Reliable answers from unreliable data
• Cloud, Warehouse-Scale & Ubiquitous Computing
• People involved throughout the whole data lifecycle
• Both a part of the problem and a part of the solution
• Continuous answer improvement/Pay-as-you-go
• Smart, flexible, fair resource allocation and use
Initial Use Cases/Applications
Crowdsourced: Sensing, Analysis, Policy, Journalism
Urban Micro-Simulation (next session)
AMPLab: Making Sense at Scale
• A holistic view of the stack.
• Strong Industrial involvement
• A five-year plan
• Research underway now
• Public launch of lab in Feb 2011
• Funding & commitments received to date:
[Figure: layers of the stack: Data Viz; Collaboration, HCI; Text analytics; Machine Learning and Stats; Database, OLAP, MapReduce; Security and Privacy; MPP, Data Centers, Networks; Multi-Core Parallelism]
AMP: Technical Thrusts
• Machine Learning and Analytics (tomorrow AM)
– Error Bars on all Answers
– Active learning and continuous/adaptive answer improvement
• Infrastructure (next talk)
– Mesos cloud OS and analytics frameworks
• Data Management
– Pay-as-you-go integration and structure
– ML/Analytics workload/workflow support
• Hybrid Crowd/Cloud Systems
– “Human Tolerant Computing”
– Incentive structures, HCI aspects
Architectural View (strawman)
[Figure: strawman architecture. Labeled components: Crowd UI; Crowd Resource Mgt; OpenFlow/NOX; Mesos; Inform. Integr.; Scalable Storage Engine; Spark; Confidence Insightful Query Language (CIQL); Privacy Control; Debugging; Machine Learning; Result Control Center; Collaborative Visualization]
Scaling Machine Learning
• Want a systematic methodology that automatically selects an operating point along the spectrum
• Given a fixed computational budget, performance should improve monotonically as data accrue
– immediate results with continual improvement
• Smart sampling/dropping of data
• Error bars on all answers
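One way to picture "error bars on all answers" together with "immediate results with continual improvement" is an aggregate estimated from a growing random sample. The sketch below is illustrative only and is not AMPLab code: the function name `estimate_mean`, the synthetic population, and the normal-approximation 95% interval are all assumptions chosen for brevity.

```python
import math
import random
import statistics


def estimate_mean(sample):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Assumes a simple random sample and a normal approximation (z = 1.96).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, 1.96 * std_err


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic data standing in for a large table (hypothetical values).
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Answer early from a small sample, then refine as more data is read.
for n in (100, 1_000, 10_000):
    mean, half_width = estimate_mean(rng.sample(population, n))
    print(f"n={n:>6}: {mean:.2f} +/- {half_width:.2f}")
```

The half-width shrinks roughly with the square root of the sample size, so a caller can stop sampling as soon as the error bar is tight enough for the question at hand.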
[Figure: spectrum of algorithmic complexity, from complex algorithms on less data to simple algorithms on massive data; complexity labels as extracted: n^3, n log n, n log n]
Reliable Answers
• Probabilistic Databases
• Schema Matching
• Judicious use of User Input
• Approximate Query Answering
• Uncertainty Management
• Data Model Learning
• Provenance and Annotation
• Structured + Unstructured Search
Data Management for Interactive Web Applications
[Figure: a tier of App/DB nodes above SCADS Storage; other labeled components: SCADS Storage Handler, Mesos, VM monitor, Director]
• Stateless
• Easy to scale
• DB Library
• PIQL – Performance Insightful Query Language
• Consistency Rationing – Consistency à la carte
• AVRO Schema - Physical/Logical data independence
• Independent of Key/Value store
• Stateful but still easy to scale (no sharding required)
• Simple Query Interface
• Reduced consistency
• Predictable performance
• Easy to price
• High availability (even across data centers)
AMP Lab Big Data Analytics
[Figure: a large grid of App/DB nodes]
Moving Code to Data
Integrating the Crowd
Machine Learning Access Patterns
Functionality?
Optimize the storage for machine learning
• Intermediate results as first class citizens
• Adaptive replication levels and layout (e.g. Fractured Mirrors)
• …
Spark as a first step
// Moving code to data: the closure passed to map is meant to run where the records are stored
val custs = cluster.get("test")
val result = custs.map(a => a.title == "PostDoc" && a.incrSalary(10) > 60)
Participatory Culture - Direct
Participatory Culture – “Indirect”
John Murrell: GM SV 9/17/09 …every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.
Hybrid Computing – A First Step
[Figure: architecture diagram. Labeled components: Parser; Optimizer; Executor; Statistics; MetaData; Files Access Methods; Disk 1; Disk 2; Query Results; UI Creation; Form Editor; Form Collection; HIT Management; People DB]
CrowdDB
CrowdSQL
• DDL Extensions: Crowdsourced columns, tables and referential integrity.
CREATE TABLE company (
  name STRING,
  hq_address CROWD STRING);
CREATE CROWD TABLE department (
  name STRING PRIMARY KEY,
  phone_no STRING);
• DML Extensions: CROWDEQUAL and CROWDORDER operators (currently UDFs).
SELECT * FROM professor p, department d
WHERE p.dep = d.name AND p.name = "Carey"
[Figure: four panels. (a) PeopleSQL query. (b) Logical plan before optimization: Professor ⋈ Department on p.dep=d.name, with σ name="Carey". (c) Logical plan after optimization: the same join and selection, reordered. (d) Physical plan: MTJoin(Dep) on p.dep = d.name over MTProbe(Professor) with name=Carey. The crowd operators show worker forms: "Please fill out the missing professor data" (Name: Carey) and "Please fill out the missing department data" (Department: CS, Phone), each with a Submit button. Another label reads GoodEnough.]
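The benefit of reordering a crowd-backed plan can be sketched by counting how many crowd requests each ordering would issue. This is a toy model under stated assumptions, not CrowdDB's optimizer: it assumes the optimization moves the selection on the professor's name below the join, and the sample rows, the `ask_crowd_for_department` stand-in, and the one-request-per-probed-department cost are all hypothetical.

```python
# Hypothetical professor rows; only "Carey" appears in the slides.
professors = [
    {"name": "Carey", "dep": "CS"},
    {"name": "Smith", "dep": "EE"},
    {"name": "Jones", "dep": "Math"},
]

crowd_requests = []  # log of simulated crowd tasks


def ask_crowd_for_department(dep_name):
    """Hypothetical stand-in for a crowd task that fills in a department row."""
    crowd_requests.append(dep_name)
    return {"name": dep_name, "phone_no": None}  # a worker would supply the phone


def join_then_select(profs, target):
    # Join every professor with its department first, then filter by name.
    joined = [(p, ask_crowd_for_department(p["dep"])) for p in profs]
    return [(p, d) for p, d in joined if p["name"] == target]


def select_then_join(profs, target):
    # Filter first, so only matching professors trigger crowd tasks.
    selected = [p for p in profs if p["name"] == target]
    return [(p, ask_crowd_for_department(p["dep"])) for p in selected]


crowd_requests.clear()
before = join_then_select(professors, "Carey")
cost_before = len(crowd_requests)

crowd_requests.clear()
after = select_then_join(professors, "Carey")
cost_after = len(crowd_requests)

print(cost_before, cost_after)
```

Both orderings return the same rows, but the second asks the crowd about one department instead of three, which is the kind of difference a cost model for crowd operators would need to capture.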
Leveraging DB Technology
• Use Schema to drive Turker UI generation
• Create a cost model of crowd operators and plug them into the optimizer.
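As a rough illustration of schema-driven UI generation, the sketch below derives an HTML form from a table description, showing known values read-only and leaving missing crowdsourced columns as empty inputs. The schema dictionary layout, the `render_form` helper, and the sample value are hypothetical; the slides do not show how CrowdDB actually generates its forms.

```python
from html import escape

# Hypothetical schema description mirroring the `company` table; the
# "crowd" flag marks columns whose values workers are asked to supply.
COMPANY_SCHEMA = {
    "table": "company",
    "columns": [
        {"name": "name", "type": "STRING", "crowd": False},
        {"name": "hq_address", "type": "STRING", "crowd": True},
    ],
}


def render_form(schema, known_values):
    """Build a simple HTML form asking workers for the missing crowd columns."""
    table = escape(schema["table"])
    lines = [f"<form><p>Please fill out the missing {table} data</p>"]
    for col in schema["columns"]:
        name = escape(col["name"])
        value = known_values.get(col["name"])
        if col["crowd"] and value is None:
            # Missing crowdsourced value: editable input for the worker.
            lines.append(f'<label>{name} <input name="{name}"></label>')
        else:
            # Known value: shown read-only to give the worker context.
            shown = "" if value is None else escape(str(value), quote=True)
            lines.append(
                f'<label>{name} <input name="{name}" value="{shown}" readonly></label>'
            )
    lines.append('<input type="submit" value="Submit"></form>')
    return "\n".join(lines)


form_html = render_form(COMPANY_SCHEMA, {"name": "IBM"})
print(form_html)
```

Because the form is derived from the schema, adding a crowdsourced column to a table would change the worker UI without any hand-written form code.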
Picture query
Select the best picture of the Golden Gate Bridge
Integrating Clouds and Crowds
Columns: Interactive Cloud | Analytic Cloud | People Cloud
• Data Acquisition: Transactional systems, Data entry | … + Sensors (physical & software) | … + Web 2.0
• Computation: Get and Put | Map Reduce, Parallel DBMS, Stream Processing | … + Collaborative Structures (e.g., Mechanical Turk, Intelligence Markets)
• Data Model: Records | … + Numbers, Media | … + Text, Media, Natural Language
• Response Time: Seconds | … + Min/Hours/Days + Continuous | all
Summary
• AMPLab will investigate the confluence of Algorithms, Machines and People to solve “Big Data” analysis problems.
• Huge research issues across many domains.
• The goal of this meeting is to get the process started.
• Our approach depends on close interaction with our industrial partners.
• We look forward to your input, advice and collaboration.
Proposed Discussion Topics
1) Using the Cloud for Big Data analytics and Machine Learning
2) Pay-as-you-go Processing - incremental answer improvement and data integration
3) Data Center Operating System
4) Debugging Big Data systems
5) Crowdsourcing Opportunities and Challenges
6) User Interface, Interaction and Data Visualization
7) Mobility and Devices - how do they impact our agenda?
8) Privacy Issues - What are they and how to address them?
9) Metrics - How does AMPLab measure success?
10) Approximate answers and answer confidence
11) Killer Apps - What other apps should AMP focus on?
12) What questions can we answer? Not answer? Can we classify questions in terms of difficulty?
13) How to define “Big Data” and should we use some other term?