Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Self-Management for Large Scale DistributedSystems
Ahmad Al-Shishtawy ([email protected])
Advisors:Vladimir Vlassov ([email protected])
Seif Haridi ([email protected])
KTH Royal Institute of TechnologyStockholm, Sweden
The 5th EuroSys Doctoral Workshop (EuroDW 2011)April 10, 2011
IntroductionNiche Platform
Robust Management ElementsFuture Work
Outline
1 Introduction
2 Niche Platform
3 Robust Management Elements
4 Future Work
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 2/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Outline
1 Introduction
2 Niche Platform
3 Robust Management Elements
4 Future Work
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 3/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
ProblemAll computing systems need to be managed
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
ProblemAll computing systems need to be managed
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
ProblemComputing systems are getting more and more complex
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
ProblemComplexity means higher administration overheads
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
ProblemComplexity poses a barrier on further development
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
SolutionThe Autonomic Computing initiative by IBM
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
SolutionSelf-Management: Systems capable of managing themselves
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
SolutionUse Autonomic Managers
Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Dealing with Complexity
Open Question
How to achieve Self-Management?
Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge Monitor
Analyze Plan
Execute
Autonomic Manager
Knowledge
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
The Autonomic Computing Architecture
Managed Resource
Touchpoint (Sensors &Actuators)Autonomic Manager
MonitorAnalyzePlanExecute
Knowledge Source
Communication
Manager Interface
Monitor
Analyze Plan
Execute
Touch Point
Autonomic Manager
Managed Resource
Knowledge
Managed Resource
Touch Point
Manager
Interface
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 5/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
The Goal
Large-scale distributed systems
Complex and require self-management
May run on unreliable resourcesMajor sources of complexity:
Scale (resources, events, users, . . . )Dynamism (resource churn, load changes, . . . )
GoalA platform (concepts, abstractions, algorithms. . . ) that facilitatesdevelopment of self-managing applications in large-scale and/ordynamic distributed environment.
A methodology that help us to achieve self-management.
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 6/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
The Goal
Large-scale distributed systems
Complex and require self-management
May run on unreliable resourcesMajor sources of complexity:
Scale (resources, events, users, . . . )Dynamism (resource churn, load changes, . . . )
GoalA platform (concepts, abstractions, algorithms. . . ) that facilitatesdevelopment of self-managing applications in large-scale and/ordynamic distributed environment.
A methodology that help us to achieve self-management.
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 6/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
The ProblemAutonomic ComputingThe Goal
Research Plan
Self-Management in large-scale distributed systems. Consists of fourmain parts:
Part 1: Touchpoints and feedback loops in distributed systems
Part 2: Robust Management
Part 3: Improve management logic
Part 4: Integrate previous parts in a self-managing system.
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 7/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Outline
1 Introduction
2 Niche Platform
3 Robust Management Elements
4 Future Work
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 8/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Niche
Niche is a Distributed Component ManagementSystem
Niche implements the Autonomic ComputingArchitecture for large-scale distributed environment
Niche leverages Structured Overlay Networks forcommunication and for provisioning of basic services(DHT, Publish/Subscribe, Groups, etc.)
Monitor
Analyze Plan
Execute
Touch Point
Autonomic Manager
Managed Resource
Knowledge
Managed Resource
Touch Point
Manager
Interface
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 9/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Management Part
Management ElementsWatchersAggregatorsManagersExecutors
Communicate through events
Publish/Subscribe
Autonomic Managers (controlloops) built as network of MEs
Sensors and Actuators forcomponents and groups
Actuation APIFunctional Part
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Management Part
Management ElementsWatchersAggregatorsManagersExecutors
Communicate through events
Publish/Subscribe
Autonomic Managers (controlloops) built as network of MEs
Sensors and Actuators forcomponents and groups
Actuation APIFunctional Part
Management Part
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Management Part
Management ElementsWatchersAggregatorsManagersExecutors
Communicate through events
Publish/Subscribe
Autonomic Managers (controlloops) built as network of MEs
Sensors and Actuators forcomponents and groups
Actuation APIFunctional Part
Management Part
Watcher
Aggreg.
Manager
ExecutorExecutorWatcher Watcher
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Management Part
Management ElementsWatchersAggregatorsManagersExecutors
Communicate through events
Publish/Subscribe
Autonomic Managers (controlloops) built as network of MEs
Sensors and Actuators forcomponents and groups
Actuation APIFunctional Part
Management Part
Watcher
Aggreg.
Manager
ExecutorExecutorWatcher Watcher
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Management Part
Management ElementsWatchersAggregatorsManagersExecutors
Communicate through events
Publish/Subscribe
Autonomic Managers (controlloops) built as network of MEs
Sensors and Actuators forcomponents and groups
Actuation APIFunctional Part
Management Part
Watcher
Aggreg.
Manager
ExecutorExecutorWatcher Watcher
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Management Part
Management ElementsWatchersAggregatorsManagersExecutors
Communicate through events
Publish/Subscribe
Autonomic Managers (controlloops) built as network of MEs
Sensors and Actuators forcomponents and groups
Actuation APIFunctional Part
Management Part
Watcher
Aggreg.
Manager
ExecutorExecutorWatcher Watcher
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Runtime Environment
Containers that hostcomponents and MEs
Use a Structured OverlayNetwork for communication
Provide overlay services
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 11/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Runtime Environment
Containers that hostcomponents and MEs
Use a Structured OverlayNetwork for communication
Provide overlay services
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 11/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Runtime Environment
Containers that hostcomponents and MEs
Use a Structured OverlayNetwork for communication
Provide overlay services
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 11/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Dealing with Resource Churn
How to deal with failures?MEs heal the functional partHow to heal failed MEs?
Programmatically in themanagement logicTransparently by theplatform
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 12/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Dealing with Resource Churn
How to deal with failures?MEs heal the functional partHow to heal failed MEs?
Programmatically in themanagement logicTransparently by theplatform
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 12/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Niche OverviewManagement PartRuntime Environment
Dealing with Resource Churn
How to deal with failures?MEs heal the functional partHow to heal failed MEs?
Programmatically in themanagement logicTransparently by theplatform
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
??!!
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 12/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Outline
1 Introduction
2 Niche Platform
3 Robust Management Elements
4 Future Work
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 13/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Robust Management Elements
A Robust Management Element (RME):
is replicated to ensure fault-tolerance
tolerates continuous churn by automatically restoring failedreplicas on other nodes
maintains its state consistent among replicas
provides its service with minimal disruption in spite of resourcechurn (high availability)
is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Robust Management Elements
A Robust Management Element (RME):
is replicated to ensure fault-tolerance
tolerates continuous churn by automatically restoring failedreplicas on other nodes
maintains its state consistent among replicas
provides its service with minimal disruption in spite of resourcechurn (high availability)
is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Robust Management Elements
A Robust Management Element (RME):
is replicated to ensure fault-tolerance
tolerates continuous churn by automatically restoring failedreplicas on other nodes
maintains its state consistent among replicas
provides its service with minimal disruption in spite of resourcechurn (high availability)
is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Robust Management Elements
A Robust Management Element (RME):
is replicated to ensure fault-tolerance
tolerates continuous churn by automatically restoring failedreplicas on other nodes
maintains its state consistent among replicas
provides its service with minimal disruption in spite of resourcechurn (high availability)
is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Robust Management Elements
A Robust Management Element (RME):
is replicated to ensure fault-tolerance
tolerates continuous churn by automatically restoring failedreplicas on other nodes
maintains its state consistent among replicas
provides its service with minimal disruption in spite of resourcechurn (high availability)
is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Robust Management Elements
A Robust Management Element (RME):
is replicated to ensure fault-tolerance
tolerates continuous churn by automatically restoring failedreplicas on other nodes
maintains its state consistent among replicas
provides its service with minimal disruption in spite of resourcechurn (high availability)
is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Solution Outline
Replicated state machine
An algorithm to reconfigure the replicated state machine. (Weused the SMART algorithm)
Our decentralized algorithm to automate reconfiguration
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 15/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Solution Outline
Replicated state machine
An algorithm to reconfigure the replicated state machine. (Weused the SMART algorithm)
Our decentralized algorithm to automate reconfiguration
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 15/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Solution Outline
Replicated state machine
An algorithm to reconfigure the replicated state machine. (Weused the SMART algorithm)
Our decentralized algorithm to automate reconfiguration
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 15/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E
Admin
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E
Admin
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E
Admin
ConfigurationRepository
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E
Admin
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E
Admin
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E
Admin
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E X Y
Admin
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E X Y
Admin
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E X Y
Admin
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
{A,X,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART
A B C D E X Y
Admin
ConfigurationRepository
{A,X,C,D,E}
Configuration {A,X,C,D,E}
{A,X,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
Any node can create a RSM. Select ID and replication degree
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
Any node can create a RSM. Select ID and replication degree
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
RSM ID = 10, f=4, N=32
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
The node uses symmetric replication scheme to calculate replica IDs
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
The node uses lookups to find responsible nodes . . .
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Responsible Node IDs = 14, 20, 29, 7
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
. . . and gets direct references to them
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Responsible Node IDs = 14, 20, 29, 7
RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Responsible Node IDs = 14, 20, 29, 7
Configuration = D, F,I ,B
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
The set of direct references forms the configuration
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Responsible Node IDs = 14, 20, 29, 7
RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Responsible Node IDs = 14, 20, 29, 7
Configuration = D, F,I ,B
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
The node sends a Create message to the configuration
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
SM10 r1
SM10 r2
SM10 r3
SM10 r4
RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Responsible Node IDs = 14, 20, 29, 7
RSM ID = 10, f=4, N=32
Replica IDs = 10, 18, 26, 2
Responsible Node IDs = 14, 20, 29, 7
Configuration = D, F,I ,B
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Creating a Replicated State Machine (RSM)
Now replicas communicate directly using the configuration
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
{A,X,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
{A,X,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
{A,X,C,D,E} {A,B,C,Y,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{?,?,?,?,?}
Configuration {?,?,?,?,?}
{A,X,C,D,E} {A,B,C,Y,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,Y,E}
Configuration {A,B,C,Y,E}
{A,X,C,D,E} {A,B,C,Y,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}
{ ,X, , , }
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}Proposed Changes { ,X, , , }
{ ,X, , , }
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}Proposed Changes { ,X, , , }
{ ,X, , , } { , , ,Y, }
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}Proposed Changes { ,X, , Y, }
{ ,X, , , } { , , ,Y, }
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,B,C,D,E}
Configuration {A,B,C,D,E}Proposed Changes { ,X, , Y, }
New Configuration{A,X,C,Y,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
SMART with Multiple Admins
A B C D E X Y
Admin Admin2
ConfigurationRepository
{A,X,C,Y,E}
Configuration {A,X,C,Y,E}
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Handling Churn
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Handling Churn
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
F
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Handling Churn
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
SM10 r2
F
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Handling Churn
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
SM10 r2
SM10 r2 = G
F
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Handling Churn
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
SM10 r2
SM10 r2 = G
G1 2 3 4
F
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Handling Churn
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
G1 2 3 4
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Handling Churn
0 12
3
4
5
6
7
8
9
10
11
12
13
14151617
1819
20
21
22
23
24
25
26
27
28
29
3031
A
B
C
DE
G
H
I
G1 2 3 4
Configuration_2 D G I B1 2 3 4
SM10 r1
SM10 r2
SM10 r3
SM10 r4
Configuration_1 D F I B1 2 3 4
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Evaluation
Built a prototype implementation of RME
Simulation-based performance evaluation
Focused on the effect of the churn rate and replication degree onrequest critical path and failure recovery
Used the King latency dataset
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 20/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation
Request latency for a single client
0
1000
2000
3000
4000
5000
6000
7000
8000
0 1000 2000 3000 4000 5000 6000 7000 8000
Re
qu
est
La
ten
cy (
ms)
Request Number
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 21/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Outline
1 Introduction
2 Niche Platform
3 Robust Management Elements
4 Future Work
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 22/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Improve Management Logic
Apply control theory to distributed systems
Distributed optimization
Reinforcement Learning
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 23/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Self-Management in Cloud Applications
Study elastic services in the Cloud
Develop self-management techniques for Cloud applications
Integrate all pieces into an elastic storage system
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 24/25
IntroductionNiche Platform
Robust Management ElementsFuture Work
Thank you for careful listening :-)
Questions?
Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 25/25