Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager
Ralph H. Castain, Ph.D., Cisco Systems, Inc.
Outline
• Overview
• Key pieces: OpenRTE, uPNP
• ORCM Architecture: fault behavior
• Future directions
System Software Requirements
1) Turn on once with remote access thereafter
2) Non-Stop == max 20 events/day lasting < 200ms each
3) Hitless SW Upgrades and Downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field Patchable
6) Beta Test New Features in situ
7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance and software installations
Our Approach
• Distributed redundancy: no master; multiple copies of everything running in tracking mode
  - Parallel, seeing identical input
  - Multiple ways of selecting the leader
• Utilize a component architecture: multiple ways to do something => a framework!
  - Create an initial working base
  - Encourage experimentation
Methodology
• Exploit open source software
  - Reduce development time
  - Encourage outside participation
  - Cross-fertilize with the HPC community
• Write a new cluster manager (ORCM)
  - Exploit new capabilities
  - Potential dual use for HPC clusters
  - Encourage outside contributions
Open Source ≠ Free
Pro
• Widespread exposure: ORTE runs on thousands of systems around the world, surfacing problems so they can be addressed
• Community support: others can help solve problems; expanded access to tools (e.g., debuggers)
• Energy: other ideas and methods

Con
• Your timeline ≠ my timeline: no penalty for late contributions; academic contributors have other priorities
• Compromise is a required art: code must be designed to support multiple approaches; nobody wins all the time; adds time to implementation
Outline
• Overview
• Key pieces: OpenRTE, uPNP
• ORCM Architecture: fault behavior
• Future directions
A Convergence of Ideas
[Diagram: PACX-MPI (HLRS), LAM/MPI (IU), LA-MPI (LANL), and FT-MPI (U of TN) converge into Open MPI; fault detection (LANL, industry), grid computing (many), autonomous computing (many), FDDP (semiconductor mfg. industry), and robustness work (CSU) converge into resilient computing systems and OpenRTE; the effort grew out of a 3-day workshop.]
Program Objective
*Cell = one or more computers sharing a common launch environment/point
Participants
Developers
• DOE/NNSA*: Los Alamos Nat Lab, Sandia Nat Lab, Oak Ridge Nat Lab
• Universities: Indiana University, Univ of Tennessee, Univ of Houston, HLRS (Stuttgart)

Support
• Industry: Cisco, Oracle, IBM, Microsoft*, Apple*, multiple interconnect vendors
• Open source teams: OFED, autotools, Mercurial

*Providing funding
Reliance on Components
• Formalized interfaces specify a "black box" implementation
• Different implementations are available at run time
• Can compose different systems on the fly (see the sketch below)

[Diagram: a caller invokes Interface 1, Interface 2, and Interface 3.]
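Note: the following is a minimal C sketch of the pattern this slide describes, with a caller selecting among interchangeable implementations of one formal interface at run time. All names (module_t, framework_select, the "alpha"/"beta" modules) are hypothetical illustrations, not the actual ORTE/ORCM API.

/* Hypothetical sketch: one formal interface, multiple run-time-selectable
 * implementations. Names are illustrative, not the real ORTE/ORCM API. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;                    /* component name used for selection */
    int (*do_work)(const char *input);   /* the "black box" interface */
} module_t;

static int impl_a(const char *in) { printf("alpha handled %s\n", in); return 0; }
static int impl_b(const char *in) { printf("beta handled %s\n", in);  return 0; }

static module_t modules[] = {
    { "alpha", impl_a },
    { "beta",  impl_b },
};

/* The caller composes the system on the fly by picking a module by name,
 * e.g. from a configuration file or environment variable. */
static module_t *framework_select(const char *wanted)
{
    for (size_t i = 0; i < sizeof(modules) / sizeof(modules[0]); i++) {
        if (0 == strcmp(modules[i].name, wanted)) return &modules[i];
    }
    return NULL;
}

int main(void)
{
    module_t *m = framework_select("beta");
    return m ? m->do_work("request-1") : 1;
}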
OpenRTE and Components
• Components are shared libraries
  - Central set of components in the installation tree
  - Users can also have components under $HOME
• Can add / remove components after install (see the loading sketch below)
  - No need to recompile / re-link apps
  - Download / install new components
  - Develop new components safely
• Update "on the fly"
  - Add or update components while running
  - Frameworks "pause" during the update
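As an illustration of components delivered as shared libraries that can be added after install, here is a hedged sketch using POSIX dlopen(); the entry-point name "component_register" and the paths are assumptions, not the actual OpenRTE convention.

/* Hypothetical sketch of loading a component shared library at run time.
 * The "component_register" symbol and the search path are assumptions. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*register_fn_t)(void);

static int load_component(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (NULL == handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }
    /* Look up the component's registration entry point. */
    register_fn_t reg = (register_fn_t) dlsym(handle, "component_register");
    if (NULL == reg) {
        fprintf(stderr, "missing entry point: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }
    return reg();   /* the component adds itself to the framework's list */
}

int main(void)
{
    /* Components might live in the install tree or under $HOME. */
    return load_component("/opt/orcm/lib/components/example_component.so");
}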
Component Benefits
• Stable, production-quality environment for third-party researchers
  - Can experiment inside the system without rebuilding everything else
  - Small learning curve (learn a few components, not the entire implementation)
  - Allows wide use and experience before exposing work
• Vendors can quickly roll out support for new platforms
  - Write only the components you want/need to change
  - Protect intellectual property
ORTE: Resiliency*
• Fault: an event that hinders the correct operation of a process
  - May not actually be a "failure" of a component, but can cause system-level failure or degrade performance below the specified level
  - Effect may be immediate or some time in the future; faults are usually rare, so there may not be many data examples
• Fault prediction: estimate the probability of an incipient fault within some future time period
• Fault tolerance (reactive, static): the ability to recover from a fault
• Robustness (metric): how much the system can absorb without catastrophic consequences
• Resilience (proactive, dynamic): dynamically configure the system to minimize the impact of potential faults
*standalone presentation
Key Frameworks
Error Manager (Errmgr)
• Receives all process state updates (sensor, waitpid), including predictions
• Determines the response strategy: restart locally, restart globally, or abort
• Executes recovery, accounting for fault groups to avoid repeated failover (see the sketch below)

Sensor
• Monitors software and hardware state of health: sentinel file size, modification & access times, memory footprint, temperature, heartbeat, ECC errors
• Predicts incipient faults: trend and fingerprint methods, with AI-based algorithms coming
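The error manager's decision path can be pictured with a small hedged sketch; the struct fields, thresholds, and enum values below are illustrative assumptions, not the real Errmgr interface.

/* Illustrative errmgr decision sketch: choose local restart, global restart,
 * or abort from a process state update. Names and thresholds are assumed. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { RESTART_LOCAL, RESTART_GLOBAL, ABORT_JOB } response_t;

typedef struct {
    int  local_restarts;    /* restarts already attempted on this node */
    int  total_restarts;    /* restarts attempted anywhere */
    bool node_healthy;      /* latest sensor verdict for the node */
} proc_state_t;

#define MAX_LOCAL_RESTARTS  3
#define MAX_TOTAL_RESTARTS 10

static response_t errmgr_decide(const proc_state_t *p)
{
    if (p->total_restarts >= MAX_TOTAL_RESTARTS) {
        return ABORT_JOB;              /* recovery is not converging */
    }
    if (!p->node_healthy || p->local_restarts >= MAX_LOCAL_RESTARTS) {
        return RESTART_GLOBAL;         /* relocate to a different fault group */
    }
    return RESTART_LOCAL;              /* cheapest recovery: respawn in place */
}

int main(void)
{
    proc_state_t p = { .local_restarts = 3, .total_restarts = 4,
                       .node_healthy = true };
    printf("response = %d\n", errmgr_decide(&p));   /* RESTART_GLOBAL */
    return 0;
}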
Outline
• Overview
• Key pieces: OpenRTE, uPNP
• ORCM Architecture: fault behavior
• Future directions
Universal PNP
• Widely adopted standard
• ORCM uses only a part: PNP discovery via an announcement on a standard multicast channel (see the sketch below)
  - Announcement includes the application id and contact info
  - All applications respond
  - The wireup "storm" limits scalability; various algorithms reduce the storm
• Each application is assigned its own "channel"
  - Carries all output from members of that application
  - Input sent to that application is given to all members
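A minimal sketch of a PNP-style announcement on a well-known multicast channel follows; the group address, port, and message format are assumptions rather than the actual ORCM wire protocol.

/* Hypothetical PNP-style announcement: send application id and contact info
 * on an assumed "system" multicast channel. Address/port/format are assumed. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in grp = { 0 };
    grp.sin_family = AF_INET;
    grp.sin_port   = htons(5005);                   /* assumed channel port */
    inet_pton(AF_INET, "239.255.0.1", &grp.sin_addr);

    /* Every listener that hears this responds with its own contact info,
     * which is what produces the wireup "storm" at scale. */
    const char *announce = "APP=worker;ID=42;CONTACT=10.0.0.7:7777";
    if (sendto(fd, announce, strlen(announce), 0,
               (struct sockaddr *) &grp, sizeof(grp)) < 0) {
        perror("sendto");
    }
    close(fd);
    return 0;
}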
Outline
• Overview
• Key pieces: OpenRTE, uPNP
• ORCM Architecture: fault behavior
• Future directions
ORCM DVM
• One daemon (orcmd) per node
  - Started at node boot or launched by a tool
  - Locally spawns and monitors processes and system health sensors
  - Small footprint (≤ 1 MB)
• Each daemon tracks the existence of the others
  - PNP wireup
  - Knows where all processes are located

[Diagram: orcmd daemons communicate over a predefined "system" multicast channel.]
Parallel DVMs
• Allows concurrent development and testing in the production environment, and sharing of development resources
• A unique identifier (the ORTE jobid) maintains separation between orcmd's
  - Each application belongs to its respective DVM
  - No cross-DVM communication is allowed (see the sketch below)
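The separation rule can be summarized in a few lines; the header fields and jobid values below are illustrative assumptions, not ORCM's actual message format.

/* Hypothetical guard keeping parallel DVMs separate: a message whose jobid
 * does not match the local DVM is dropped. Field names are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t jobid;   /* identifies the DVM the sender belongs to */
    uint32_t vpid;    /* rank of the sending daemon within that DVM */
} msg_header_t;

static const uint32_t my_jobid = 100;

static bool accept_message(const msg_header_t *hdr)
{
    return hdr->jobid == my_jobid;   /* no cross-DVM communication allowed */
}

int main(void)
{
    msg_header_t foreign = { .jobid = 200, .vpid = 3 };
    printf("accepted: %d\n", accept_message(&foreign));   /* prints 0 */
    return 0;
}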
Configuration Mgmt
[Diagram: three orcmd daemons; the daemon with the lowest vpid opens the cfgi framework, connects/subscribes to a configuration source (confd daemon, tool, or file via orcm-start), and receives the configuration.]
Configuration Mgmt (cont.)
[Diagram: the same flow continues; the daemon receiving the configuration updates any missing config info and assumes "leader" duties.]
Application Launch
[Diagram: a configuration change (number of procs, location) arrives through the cfgi framework; the leader daemon builds a launch message and sends it to all daemons over the predefined "system" multicast channel.]
Resilient Mapper
• Fault groups
  - Nodes with a common failure mode
  - A node can belong to multiple fault groups
  - Defined in the system file
• Map instances across fault groups to minimize the probability of cascading failures (see the sketch below)
  - One instance per fault group
  - Pick the lightest-loaded node in the group
  - Randomly map any extras
• Next-generation algorithms: failure-mode probability => fault-group selection
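A hedged sketch of the mapping policy just described: one replica per fault group, choosing the lightest-loaded node in each group. The data structures and node names are hypothetical, not ORCM's internal representation.

/* Illustrative fault-group mapping: one replica per group, lightest-loaded
 * node within the group. Structures and names are assumed, not ORCM's. */
#include <stdio.h>

typedef struct {
    const char *name;
    int load;           /* processes already mapped to this node */
    int fault_group;    /* group of nodes sharing a failure mode */
} node_t;

static node_t *lightest_in_group(node_t *nodes, int n, int group)
{
    node_t *best = NULL;
    for (int i = 0; i < n; i++) {
        if (nodes[i].fault_group == group &&
            (NULL == best || nodes[i].load < best->load)) {
            best = &nodes[i];
        }
    }
    return best;
}

int main(void)
{
    node_t nodes[] = {
        { "n01", 2, 0 }, { "n02", 1, 0 },    /* fault group 0 */
        { "n03", 0, 1 }, { "n04", 3, 1 },    /* fault group 1 */
    };
    int num_groups = 2, num_replicas = 2;

    /* Spreading replicas across fault groups keeps a single failure mode
     * from taking out every copy of the application at once. */
    for (int r = 0; r < num_replicas; r++) {
        node_t *target = lightest_in_group(nodes, 4, r % num_groups);
        if (target) {
            printf("replica %d -> %s\n", r, target->name);
            target->load++;
        }
    }
    return 0;
}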
Multiple Replicas
• Multiple copies of each executable
  - Run on separate fault groups
  - Asynchronous and independent
• Shared PNP channel
  - Input: received by all
  - Output: broadcast to all, received by those who registered for it
  - Leader is determined by the receiver
Leader Selection
• Two forms of leader selection: internal to the ORCM DVM, and external facing
• Internal: a framework with an app-specific module; the policy may be configuration specified, lowest rank, first contact, or none (see the sketch below)
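Two of the internal policies named above ("lowest rank" and "first contact") are easy to picture with a hedged sketch; the replica struct and policy names here are illustrative only, not the ORCM leader framework API.

/* Illustrative leader-selection policies, swappable like framework modules.
 * The replica struct and policy names are assumptions, not the ORCM API. */
#include <stdio.h>

typedef struct { int rank; int arrival_order; } replica_t;

/* Policy: lowest rank wins. */
static int leader_lowest_rank(const replica_t *r, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (r[i].rank < r[best].rank) best = i;
    return best;
}

/* Policy: the first replica we heard from wins. */
static int leader_first_contact(const replica_t *r, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (r[i].arrival_order < r[best].arrival_order) best = i;
    return best;
}

int main(void)
{
    replica_t reps[] = { { 3, 1 }, { 1, 2 }, { 2, 0 } };
    /* The active policy would be chosen per configuration at run time. */
    int (*select)(const replica_t *, int) = leader_lowest_rank;
    printf("lowest rank -> index %d\n", select(reps, 3));     /* 1 */
    select = leader_first_contact;
    printf("first contact -> index %d\n", select(reps, 3));   /* 2 */
    return 0;
}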
External Connections
orcm-connector
• Input: broadcast on the respective PNP channel
• Output: determines the "leader" that supplies output to the rest of the world, using any leader method in the framework
Testing in Production
[Diagram: orcm-logger captures output through the logger framework, with modules for database, file, syslog, and console destinations.]
Software Maintenance
• On-the-fly module activation
  - The configuration manager can select new modules to load, reload, or activate
  - Change priorities of active modules
• Full replacement, when more than a module needs updating
  - Start the replacement version
  - The configuration manager switches the "leader"
  - Stop the old version
Detecting Failures
• Application failures: detected by the local daemon
  - Monitors for self-induced problems (memory and CPU usage); orders termination if limits are exceeded or trending toward being exceeded
  - Detects unexpected failures via waitpid
• Hardware failures
  - Local hardware sensors continuously report status, read by the local daemon
  - Projects potential failure modes to pre-order relocation of processes or node shutdown
  - Detected by the DVM when a daemon misses heartbeats (see the sketch below)
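Both detection paths above can be sketched briefly; the thresholds, heartbeat period, and function names are assumptions rather than ORCM's actual values.

/* Illustrative detection sketch: waitpid() catches an unexpected child exit,
 * and a missed-heartbeat window flags a dead daemon. Values are assumed. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define HEARTBEAT_PERIOD_S 1
#define MAX_MISSED_BEATS   3

/* Path 1: the local daemon reaps its children and notices abnormal exits. */
static bool child_failed(pid_t pid)
{
    int status = 0;
    pid_t res = waitpid(pid, &status, WNOHANG);
    return res == pid && (WIFSIGNALED(status) ||
                          (WIFEXITED(status) && WEXITSTATUS(status) != 0));
}

/* Path 2: the DVM declares a daemon down after it misses heartbeats. */
static bool daemon_down(time_t last_heartbeat)
{
    return (time(NULL) - last_heartbeat) > MAX_MISSED_BEATS * HEARTBEAT_PERIOD_S;
}

int main(void)
{
    pid_t pid = fork();
    if (0 == pid) { _exit(7); }          /* child exits with a failure code */
    sleep(1);                            /* let the child terminate */
    printf("child failed: %d\n", child_failed(pid));               /* 1 */
    printf("daemon down:  %d\n", daemon_down(time(NULL) - 10));    /* 1 */
    return 0;
}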
Application Failure
• Local daemon
  - Detects (or predicts) the failure
  - Locally restarts up to the specified max number of local restarts
  - Uses the resilient mapper to determine relocation
  - Sends a launch message to all daemons
• Replacement app
  - Announces itself on the application's public address channel
  - Receives responses and registers its own inputs
  - Begins operation
• Connected applications select a new "leader" based on the current module
Node Failure
[Diagram: when a node fails, the next-higher orcmd becomes the leader; it opens/inits the cfgi framework, updates any missing config info, marks the failed node as "down", relocates the application processes from the failed node, and attempts a restart. Connected apps fail over their leader per the active leader module.]
Node Replacement/Addition
• Auto-boot of the local daemon on power-up: the daemon announces itself to the DVM, and all DVM members add the node to the available resources
• Reboot/restart: relocate the original procs back, up to some max number of times (a smarter algorithm is needed here); leadership remains unaffected to avoid "bounce"
• Processes map to the new resource as start/restart demands
  - Future: rebalance the existing load when a node becomes available
Outline
• Overview
• Key pieces: OpenRTE, uPNP
• ORCM Architecture: fault behavior
• Future directions
System Software Requirements
1) Turn on once with remote access thereafter
2) Non-Stop == max 20 events/day lasting < 200ms each
3) Hitless SW Upgrades and Downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field Patchable
6) Beta Test New Features in situ
7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance and software installations
Callouts mapping the requirements above to ORCM capabilities: ~5 ms recovery; start a new app triplet and kill the old one; the new app triplet registers for production input; boot-level startup; start/stop triplets with leader selection.
Still A Ways To Go
• Security
  - Who can order ORCM to launch/stop apps?
  - Who can "log" output from which apps?
  - What is the network extent of communications?
• Communications
  - Message size and fragmentation support
  - Speed of the underlying transport
  - Truly reliable multicast
  - Asynchronous messaging
Still A Ways To Go
• Transfer of state
  - How does a restarted application replica regain the state of its prior existence?
  - How do we re-sync state across replicas so outputs track?
• Deterministic outputs
  - Same output from replicas tracking the same inputs; assumes deterministic algorithms
  - Can we support non-deterministic algorithms, e.g., random channel selection to balance loads, or decisions based on instantaneous traffic sampling?
Still A Ways To Go
• Enhanced algorithms: mapping, leader selection
• Fault prediction: implementation and algorithms, expanded sensors
• Replication vs. rapid restart: if we can restart in a few milliseconds, do we really need replication?
Concluding Remarks
http://www.open-mpi.org
http://www.open-mpi.org/projects/orcm