1 Berkeley RAD Lab: Robust, Adaptive, Distributed Systems Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica November 2005

1

Berkeley RAD Lab: Robust, Adaptive, Distributed Systems

Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
November 2005

2

RAD Lab

The 5-year Vision: A single person can go from vision to a next-generation IT service (“the Fortune 1 million”)
  E.g., over a long holiday weekend in 1995, Pierre Omidyar created eBay v1.0

The Vehicle: Interdisciplinary Center creates core technical competency to demo 10X to 100X
  Researchers are leaders in machine learning, networking, and systems
  Industrial Participants: leading companies in HW, systems SW, and online services
  Called “RAD Lab” for Reliable, Adaptable, Distributed systems

3

RAD Lab

The Science: Both shorter-term and longer-term solutions
  Develop using primitives to enable functions (MapReduce) and services (Craigslist)
  Assess/debug using deterministic replay and finding new metrics
  Deploy using “Internet-in-a-Box” via FPGAs under failure/slowdown workloads
  Operate using Statistical Learning Theory-friendly, Control Theory-friendly software architectures and visualization tools

[Figure labels: Cap; Dado (the section of a pedestal between cap and base); Base]

Added Value to Industrial Participants:
  Working with leading people and companies from different industries on long-range, pre-competitive technology
  Training of dozens of future leaders of IT in multiple disciplines, and their recruitment by industrial participants
  Working with researchers with a successful track record of rapid transfer of new technology

4

Steps vs. Process

Process: Support DADO Evolution, 1 group

Steps: Traditional, Static Handoff Model, N groups

[Figure: two diagrams, one per model, each showing the stages Develop, Assess, Deploy, Operate]

5

DADO - Develop

Create abstractions, primitives, & toolkit for large-scale systems that make it easy to invent/deploy functions (e.g., MapReduce)
  For example, Distributed Hash Tables (OpenDHT)
  Already setting the trend for IETF standards

Application

Higher Functions (MapReduce)

Middleware (J2EE)

Libraries

Compilers/Debuggers

Operating System

Virtual Machine

Hardware
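
As a rough illustration of how a higher function such as MapReduce hides the plumbing behind two user-supplied primitives, here is a minimal single-process sketch in Python. The names `map_reduce`, `word_map`, and `count_reduce` are hypothetical and chosen only for this example; nothing here reflects the RAD Lab toolkit or any production MapReduce implementation, which would shard the map and reduce phases across many machines.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group values by key (the "shuffle"), then reduce, in one process."""
    groups = defaultdict(list)
    for record in records:
        # map_fn emits zero or more (key, value) pairs per input record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce_fn collapses all values that share a key into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_map(line):
    """Hypothetical map primitive: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_reduce(word, counts):
    """Hypothetical reduce primitive: sum the per-word counts."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["the site is up", "the site is slow"]
    print(map_reduce(lines, word_map, count_reduce))
```

The point of the sketch is the division of labor: the service author writes only `word_map` and `count_reduce`, while grouping (and, in a real system, distribution and failure handling) lives in the shared layer.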

6

DADO - Assess

“We improve what we can measure”
  Inspect box visibility into networks, usually data poor
  Servers data rich; data often discarded

Statistical and Machine Learning (SML) to the rescue. It works well when:
  You have lots of raw data
  You have reason to believe the raw data is related to some high-level effect you’re interested in
  You don’t have a model of what that relationship is

Note: SML advances fast analysis

7

DADO - Deploy

Re-engineer RAMP to act like a 1000+ node distributed system under realistic failure and slowdown workloads
  RAMP emulates data center & wide area systems as well as MPP
  Collect and apply failure data from the real world
  RAMP vs. Clusters: larger scale, easier to develop/debug, flexible HW/SW configuration, inexpensive so no need to share

Explore via repeatable experiments as we vary parameters and configurations, vs. observations on a single (aging) cluster that is often idiosyncratic
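
To make the idea of a repeatable experiment under failure and slowdown workloads concrete, the following is a small software-only sketch in Python. It is not RAMP and does not model FPGAs; the function `run_experiment` and its parameters (`fail_prob`, `slow_prob`, `seed`, `steps`) are assumptions invented for illustration. The only property it demonstrates is that a seeded fault schedule can be replayed exactly while one parameter is varied.

```python
import random


def run_experiment(num_nodes, fail_prob, slow_prob, seed, steps=100):
    """Simulate per-step node faults; the same seed replays the same schedule.

    All parameters are hypothetical: fail_prob and slow_prob are the per-node,
    per-step probabilities of a failure or a slowdown.
    """
    rng = random.Random(seed)  # private generator, so runs do not interfere
    failures = 0
    slowdowns = 0
    for _ in range(steps):
        for _ in range(num_nodes):
            draw = rng.random()
            if draw < fail_prob:
                failures += 1
            elif draw < fail_prob + slow_prob:
                slowdowns += 1
    return {"failures": failures, "slowdowns": slowdowns}


if __name__ == "__main__":
    # Vary one parameter while holding the seed fixed.
    for fail_prob in (0.001, 0.01):
        print(fail_prob, run_experiment(1000, fail_prob, 0.05, seed=42))
```

Because the generator is seeded, rerunning with the same arguments gives identical counts, which is the contrast the slide draws with one-off observations on a single shared cluster.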

8

DADO - Operate

• Idea: when a site misbehaves, users notice and change their behavior; use this as a “failure detector”

• Approach: combine visualization with Statistical and Machine Learning analysis so operators see anomalies too

• Experiment: does the distribution of hits to various pages match the “historical” distribution?
  Each minute, compare hit counts of top N pages to hit counts over the last 6 hours using Bayesian networks and the χ² test, on real Ebates data

To learn more, see “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005, by Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson.
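
The per-minute comparison in the experiment can be sketched with a plain Pearson chi-square statistic. This is only an illustration of the χ² part, written in Python under stated assumptions: the function names, the page names, and the hit counts are made up, the Bayesian network part is not shown, and this is not the implementation from the cited paper. The threshold is supplied by the caller; 5.991 is the standard 5% critical value for 2 degrees of freedom (3 pages).

```python
def chi_square_statistic(current_hits, historical_hits):
    """Pearson chi-square of this minute's hits against historical proportions.

    Both arguments map page name -> hit count. Only pages present in
    historical_hits (the "top N" pages) are compared.
    """
    pages = list(historical_hits)
    total_now = sum(current_hits.get(page, 0) for page in pages)
    total_hist = sum(historical_hits.values())
    statistic = 0.0
    for page in pages:
        # Expected hits this minute if traffic followed the historical mix.
        expected = total_now * historical_hits[page] / total_hist
        if expected > 0:
            observed = current_hits.get(page, 0)
            statistic += (observed - expected) ** 2 / expected
    return statistic


def is_anomalous(current_hits, historical_hits, threshold):
    """Flag the minute when the statistic exceeds a caller-chosen threshold."""
    return chi_square_statistic(current_hits, historical_hits) > threshold


if __name__ == "__main__":
    last_six_hours = {"home": 6000, "account": 3000, "search": 1000}  # made-up data
    this_minute = {"home": 90, "account": 0, "search": 10}  # account page gets no hits
    print(chi_square_statistic(this_minute, last_six_hours))
    print(is_anomalous(this_minute, last_six_hours, threshold=5.991))
```

In this made-up minute the account page receives no hits, so the observed mix departs sharply from the six-hour mix and the minute is flagged; a minute whose hits match the historical proportions yields a statistic of zero.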

9

Novel Visualization

[Figure: anomaly score over time, annotated: 11:07am start of anomalies; account page problem; 11:33am – 11:56am site crash]

“I see and understand”: Winning operator trust

10

Founding the RAD Lab; Start 12/1
  Looking for 3 to 5 founding companies to fund 5 years @ cost of $0.5M / year
  25 grad students + 15 undergrads + 6 faculty + 2 staff
  Founding companies: Google, Microsoft, Sun Microsystems

RADS Consortium model
  Preference to founding partner technology in prototypes
  Designate employees to act as consultants
  Head start for participants on research results
  Putting IP in Public Domain so partners can use it & not be sued

Press release of founding RAD Lab partners December 1?
Mid project review after 3 years by founding partners

11

RAD Lab Opportunity: New Research Model

Chance to Partner with the Top University in Computer Systems on the “Next Great Thing”
  National Academy of Engineering mentions Berkeley in 7 of 19 $1B+ industries that came from IT research
  NAE mentions Berkeley 7 times, Stanford 5 times, MIT 5, CMU 3
  Timesharing (SDS 940), Client-Server Computing (BSD Unix), Graphics, Entertainment, Internet, LANs, Workstations, GUI, VLSI Design (Spice) [ECAD $5B?/yr], RISC [$10B?/yr], Relational DB (Ingres/Postgres) [RDB $15B?/yr], Parallel DB, Data Mining, Parallel Computing, RAID [$15B?/yr], Portable Communication (BWRC), WWW, Speech Recognition, Broadband

Berkeley is one of the top suppliers of systems students to industry and academia
  US News & World Report ranking of CS Systems universities: 1 Berkeley, 2 CMU, 2 MIT, 4 Stanford

12

RAD Lab: Interdisciplinary Center for Reliable, Adaptive, Distributed Systems

Capability (Desired): 1 person can invent & run the next-gen IT service

Develop using primitives to enable functions and services
Assess using deterministic replay and statistical and machine learning (SML)
Deploy via “Internet-in-a-Box” FPGAs
Operate SML-friendly, Control Theory-friendly architectures and operator-centric visualization and analysis tools

Base Technology: Server Hardware, System Software, Middleware, Networking

• Working with different industries on long-range, pre-competitive technology
• Training of dozens of future leaders of IT, plus their recruitment
• Working with researchers with track records of successful technology transfer

13

Backup Slides

14

References

To learn more, see:

• “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, and David Patterson. Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005.

• “Microreboot -- A Technique for Cheap Recovery,” George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Proc. 6th Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, CA, Dec. 2004.

• “Path-Based Failure and Evolution Management,” Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. Proc. 1st USENIX/ACM Symp. on Networked Systems Design and Implementation (NSDI '04), San Francisco, CA, March 2004.

• “Scalable Statistical Bug Isolation,” Ben Liblit, M. Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. PLDI, 2005.

15

Sustaining Innovation/Training Engine in 21st Century

Replicate research centers based primarily on industrial funding to expand IT market and to train next generation of IT leaders
  Berkeley Wireless Research Center (BWRC): 50 grad students, 30 undergrads @ $5M per year
  Stanford Network Research Center (SNRC): 50 grad students @ $5M per year
  MIT Tparty $4M per year (100% $ from Quanta)

Industry largely funds
  N companies, where N is 5?

Exciting, long term technical vision
  Demonstrated by prototype(s)

16

State of Research Funding Today

Most industry research shorter term

DARPA exiting long-term (exp.) IT research
  ’03-’05 BAAs IPTO: 9 AI, 2 classified, 1 SW radio, 1 sensor net, 1 reliability; all have 12 to 18 month “go/no go” milestones
  Academic led funding reduced 50% (so far) 2001 to 2004
  Faculty ≈ consultants in consortia led by defense contractor, get grants ≈ support 1-2 students (~ NSF funding level)

NSF swamped with proposals, conservative
  2000 to 6500 proposals in 5 years
  IT has lowest acceptance rate at NSF (between 8% and 16%)
  “Ambitious proposal” is a negative review
  Even if get NSF funding, proposal reduced to stretch NSF $
    e.g., got 3 x 1/3 faculty, 6 grad students, 0 staff, 3 years

(To learn more, see www.cra.org/research)

17

RAD Lab Timeline

2005 Launch RAD Lab 12/1
2006 Collect workloads, Internet in a Box
2007 SLT/CT distributed architectures, Iboxes, annotative layer, class testing
2008 Development toolkit 1.0, tuple space, class testing; Mid Project Review
2009 RAD Lab software suite 1.0, class testing
2010 End of Project Party
