1
Berkeley RAD Lab: Robust, Adaptive, Distributed Systems
Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
November 2005
2
RAD Lab
The 5-year Vision: Single person can go from vision to a next-generation IT service (“the Fortune 1 million”)
• E.g., over long holiday weekend in 1995, Pierre Omidyar created eBay v1.0
The Vehicle: Interdisciplinary Center creates core technical competency to demo 10X to 100X
• Researchers are leaders in machine learning, networking, and systems
• Industrial Participants: leading companies in HW, systems SW, and online services
• Called “RAD Lab” for Reliable, Adaptable, Distributed systems
3
RAD Lab
The Science: Both shorter-term and longer-term solutions
• Develop using primitives to enable functions (MapReduce) and services (Craigslist)
• Assess/debug using deterministic replay and finding new metrics
• Deploy using “Internet-in-a-Box” via FPGAs under failure/slowdown workloads
• Operate using Statistical Learning Theory-friendly, Control Theory-friendly software architectures and visualization tools
Cap:
Dado: (The section of a pedestal between cap and base)
Base:
Added Value to Industrial Participants:
• Working with leading people and companies from different industries on long-range, pre-competitive technology
• Training of dozens of future leaders of IT in multiple disciplines, and their recruitment by industrial participants
• Working with researchers with successful track record of rapid transfer of new technology
4
Steps vs. Process
Process: Support DADO Evolution, 1 group
Steps: Traditional, Static Handoff Model, N groups
[Figure: two diagrams, one per model, each with the stages Develop, Assess, Deploy, Operate]
5
DADO - Develop
• Create abstractions, primitives, & toolkit for large scale systems that make it easy to invent/deploy functions (e.g., MapReduce)
• For example, Distributed Hash Tables (OpenDHT)
• Already setting the trend for IETF standards
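To make the distributed hash table primitive concrete, here is a minimal in-process sketch of the put/get idea. It is not OpenDHT's actual API or implementation: the class name `ToyDHT`, the node names, and the choice of SHA-1 ring positions with successor-node placement are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a position on the hash ring (SHA-1 digest as an integer)."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)


class ToyDHT:
    """Hypothetical in-process stand-in for a DHT: each key lives on its successor node."""

    def __init__(self, node_names):
        # Sort nodes by their ring position so lookups can use binary search.
        self._ring = sorted((ring_position(n), n) for n in node_names)
        self._positions = [pos for pos, _ in self._ring]
        self._stores = {n: {} for n in node_names}

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's ring position, wrapping around."""
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]

    def put(self, key: str, value) -> None:
        self._stores[self.owner(key)][key] = value

    def get(self, key: str):
        return self._stores[self.owner(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("service:catalog", "10.0.0.7")
```

The caller only sees `put` and `get`; which node holds a key is decided by hashing, which is the property that lets an application treat the table as a single shared primitive.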
Application
Higher Functions (MapReduce)
Middleware (J2EE)
Libraries
Compilers/Debuggers
Operating System
Virtual Machine
Hardware
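As an illustration of a "higher function" in the stack above, the following is a single-process sketch of the MapReduce programming model. It shows only the map, group-by-key, and reduce steps; the function names are hypothetical, and nothing here reflects the distributed scheduling or fault tolerance of a real MapReduce system.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    """Mapper: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def total(_word, counts):
    """Reducer: sum the counts emitted for one word."""
    return sum(counts)


result = map_reduce(
    ["develop assess deploy operate", "deploy operate"],
    count_words,
    total,
)
```

The developer writes only `count_words` and `total`; the grouping step in between is the part a toolkit would supply.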
6
DADO - Assess
“We improve what we can measure”
• Inspect box visibility into networks, usually data poor
• Servers data rich; data often discarded
Statistical and Machine Learning (SML) to the rescue. It works well when
• You have lots of raw data
• You have reason to believe the raw data is related to some high-level effect you’re interested in
• You don’t have a model of what that relationship is
Note: SML advances fast analysis
7
DADO - Deploy
• Re-engineer RAMP to act like 1000+ node distributed system under realistic failure and slowdown workloads
• RAMP emulates data center & wide area systems as well as MPP
• Collect and apply failure data from real world
• RAMP vs. Clusters: Larger scale, easier to develop/debug, flexible HW/SW configuration, inexpensive so no need to share
• Explore via repeatable experiments as vary parameters, configurations vs. observations on single (aging) cluster that is often idiosyncratic
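The point about repeatable experiments can be sketched in software, independent of RAMP or FPGAs: if the failure and slowdown workload is drawn from a seeded generator, the same seed reproduces the same fault schedule while other parameters are varied. The function name, the per-step probabilities, and the event format below are assumptions for illustration; the slide calls for real-world failure data, which this sketch does not use.

```python
import random


def fault_schedule(num_nodes, steps, fail_prob, slow_prob, seed):
    """Return (step, node, fault) events drawn from a seeded random generator.

    fail_prob and slow_prob are illustrative per-node, per-step probabilities.
    """
    rng = random.Random(seed)  # private generator, so the schedule depends only on the seed
    events = []
    for step in range(steps):
        for node in range(num_nodes):
            draw = rng.random()
            if draw < fail_prob:
                events.append((step, node, "fail"))
            elif draw < fail_prob + slow_prob:
                events.append((step, node, "slow"))
    return events


# Two runs with the same seed see the same faults; only the seed or parameters change it.
run_a = fault_schedule(num_nodes=1000, steps=10, fail_prob=0.001, slow_prob=0.01, seed=42)
run_b = fault_schedule(num_nodes=1000, steps=10, fail_prob=0.001, slow_prob=0.01, seed=42)
```

Replaying an identical schedule against two configurations isolates the effect of the configuration change, which is hard to do with observations from a single shared cluster.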
8
DADO - Operate
• Idea: when site misbehaves, users notice, and change their behavior; use as “failure detector”
• Approach: combine visualization with Statistical and Machine Learning analysis so operators see anomalies too
• Experiment: does distribution of hits to various pages match the “historical” distribution? Each minute, compare hit counts of top N pages to hit counts over last 6 hours using Bayesian networks and χ² test, real Ebates data
To learn more, see “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” in Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005, by Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson.
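The per-minute comparison in the experiment can be sketched as a Pearson chi-square statistic of the current minute's hit counts against the historical page mix. This covers only the chi-square half of the approach (not the Bayesian networks), and the page names, counts, and threshold are made up for illustration; they are not Ebates data or the authors' implementation.

```python
def chi_square_statistic(current_hits, historical_hits):
    """Pearson chi-square of current per-page hits against the historical page mix."""
    pages = sorted(historical_hits)
    hist_total = sum(historical_hits[p] for p in pages)
    cur_total = sum(current_hits.get(p, 0) for p in pages)
    if hist_total == 0 or cur_total == 0:
        return 0.0
    stat = 0.0
    for p in pages:
        # Hits this page would get now if traffic followed the historical proportions.
        expected = cur_total * historical_hits[p] / hist_total
        if expected > 0:
            observed = current_hits.get(p, 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def is_anomalous(current_hits, historical_hits, threshold):
    """Flag the minute when the statistic exceeds a caller-chosen threshold."""
    return chi_square_statistic(current_hits, historical_hits) > threshold


# Hypothetical counts: last 6 hours vs. one minute in which the account page gets no hits.
history = {"home": 6000, "search": 3000, "account": 1000}
normal_minute = {"home": 60, "search": 30, "account": 10}
broken_minute = {"home": 60, "search": 30, "account": 0}

# 9.21 is the chi-square critical value for 2 degrees of freedom at the 0.01 level.
flag_normal = is_anomalous(normal_minute, history, threshold=9.21)
flag_broken = is_anomalous(broken_minute, history, threshold=9.21)
```

With three pages there are two degrees of freedom, so the threshold would need to change with N; a production detector would also have to account for time-of-day effects in the historical window.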
9
Novel Visualization
“I see and understand”: Winning operator trust
[Figure: anomaly score over time for an account page problem; start of anomalies at 11:07am, site crash 11:33am – 11:56am]
10
Founding the RAD Lab; Start 12/1
• Looking for 3 to 5 founding companies to fund 5 years @ cost of $0.5M / year
• 25 grad students + 15 undergrads + 6 faculty + 2 staff
• Founding companies: Google, Microsoft, Sun Microsystems
RADS Consortium model
• Preference to founding partner technology in prototypes
• Designate employees to act as consultants
• Head start for participants on research results
• Putting IP in Public Domain so partners can use it & not be sued
Press release of founding RAD Lab partners December 1?
Mid project review after 3 years by founding partners
11
RAD Lab Opportunity: New Research Model
Chance to Partner with the Top University in Computer Systems on the “Next Great Thing”
• National Academy of Engineering mentions Berkeley in 7 of 19 $1B+ industries that came from IT research
• NAE mentions Berkeley 7 times, Stanford 5 times, MIT 5, CMU 3
• Timesharing (SDS 940), Client-Server Computing (BSD Unix), Graphics, Entertainment, Internet, LANs, Workstations, GUI, VLSI Design (Spice) [ECAD $5B?/yr], RISC [$10B?/yr], Relational DB (Ingres/Postgres) [RDB $15B?/yr], Parallel DB, Data Mining, Parallel Computing, RAID [$15B?/yr], Portable Communication (BWRC), WWW, Speech Recognition, Broadband
• Berkeley one of the top suppliers of systems students to industry and academia
• US News & World Report ranking of CS Systems universities: 1 Berkeley, 2 CMU, 2 MIT, 4 Stanford
12
RAD Lab: Interdisciplinary Center for Reliable, Adaptive, Distributed Systems
• Working with different industries on long-range, pre-competitive technology
• Training of dozens of future leaders of IT, plus their recruitment
• Working with researchers with track records of successful technology transfer
Capability (Desired): 1 person can invent & run the next-gen IT service
Develop using primitives to enable functions and services
Assess using deterministic replay and statistical and machine learning (SML)
Deploy via “Internet-in-a-Box” FPGAs
Operate SML-friendly, Control Theory-friendly architectures and operator-centric visualization and analysis tools
Base Technology: Server Hardware, System Software, Middleware, Networking
13
Backup Slides
14
References
To learn more, see
• “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson. In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005.
• “Microreboot -- A Technique for Cheap Recovery,” George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Proc. 6th Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, CA, Dec. 2004.
• “Path-Based Failure and Evolution Management,” Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. In Proc. 1st USENIX/ACM Symp. on Networked Systems Design and Implementation (NSDI '04), San Francisco, CA, March 2004.
• “Scalable Statistical Bug Isolation,” Ben Liblit, M. Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan, PLDI, 2005.
15
Sustaining Innovation/Training Engine in 21st Century
• Replicate research centers based primarily on industrial funding to expand IT market and to train next generation of IT leaders
  • Berkeley Wireless Research Center (BWRC): 50 grad students, 30 undergrads @ $5M per year
  • Stanford Network Research Center (SNRC): 50 grad students @ $5M per year
  • MIT Tparty $4M per year (100% $ from Quanta)
• Industry largely funds
  • N companies, where N is 5?
• Exciting, long term technical vision
  • Demonstrated by prototype(s)
16
State of Research Funding Today
• Most industry research shorter term
• DARPA exiting long-term (exp.) IT research
  • ’03-’05 BAAs IPTO: 9 AI, 2 classified, 1 SW radio, 1 sensor net, 1 reliability, all have 12 to 18 month “go/no go” milestones
  • Academic led funding reduced 50% (so far) 2001 to 2004
  • Faculty ≈ consultants in consortia led by defense contractor, get grants ≈ support 1-2 students (~ NSF funding level)
• NSF swamped with proposals, conservative
  • 2000 to 6500 proposals in 5 years
  • IT has lowest acceptance rate at NSF (between 8% and 16%)
  • “Ambitious proposal” is a negative review
  • Even if get NSF funding, proposal reduced to stretch NSF $, e.g., got 3 x 1/3 faculty, 6 grad students, 0 staff, 3 years
(To learn more, see www.cra.org/research)
17
RAD Lab Timeline
2005 Launch RAD Lab 12/1
2006 Collect workloads, Internet in a Box
2007 SLT/CT distributed architectures, Iboxes, annotative layer, class testing
2008 Development toolkit 1.0, tuple space, class testing; Mid Project Review
2009 RAD Lab software suite 1.0, class testing
2010 End of Project Party