86
Self-Management for Large Scale Distributed Systems Ahmad Al-Shishtawy ([email protected]) Advisors: Vladimir Vlassov ([email protected]) Seif Haridi ([email protected]) KTH Royal Institute of Technology Stockholm, Sweden The 5th EuroSys Doctoral Workshop (EuroDW 2011) April 10, 2011

Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

Self-Management for Large Scale DistributedSystems

Ahmad Al-Shishtawy ([email protected])

Advisors:Vladimir Vlassov ([email protected])

Seif Haridi ([email protected])

KTH Royal Institute of TechnologyStockholm, Sweden

The 5th EuroSys Doctoral Workshop (EuroDW 2011)April 10, 2011

Page 2: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Outline

1 Introduction

2 Niche Platform

3 Robust Management Elements

4 Future Work

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 2/25

Page 3: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Outline

1 Introduction

2 Niche Platform

3 Robust Management Elements

4 Future Work

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 3/25

Page 4: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

ProblemAll computing systems need to be managed

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 5: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

ProblemAll computing systems need to be managed

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 6: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

ProblemComputing systems are getting more and more complex

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 7: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

ProblemComplexity means higher administration overheads

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 8: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

ProblemComplexity poses a barrier on further development

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 9: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

SolutionThe Autonomic Computing initiative by IBM

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 10: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

SolutionSelf-Management: Systems capable of managing themselves

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 11: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

SolutionUse Autonomic Managers

Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 12: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Dealing with Complexity

Open Question

How to achieve Self-Management?

Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge Monitor

Analyze Plan

Execute

Autonomic Manager

Knowledge

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 4/25

Page 13: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

The Autonomic Computing Architecture

Managed Resource

Touchpoint (Sensors &Actuators)Autonomic Manager

MonitorAnalyzePlanExecute

Knowledge Source

Communication

Manager Interface

Monitor

Analyze Plan

Execute

Touch Point

Autonomic Manager

Managed Resource

Knowledge

Managed Resource

Touch Point

Manager

Interface

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 5/25

Page 14: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

The Goal

Large-scale distributed systems

Complex and require self-management

May run on unreliable resourcesMajor sources of complexity:

Scale (resources, events, users, . . . )Dynamism (resource churn, load changes, . . . )

GoalA platform (concepts, abstractions, algorithms. . . ) that facilitatesdevelopment of self-managing applications in large-scale and/ordynamic distributed environment.

A methodology that help us to achieve self-management.

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 6/25

Page 15: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

The Goal

Large-scale distributed systems

Complex and require self-management

May run on unreliable resourcesMajor sources of complexity:

Scale (resources, events, users, . . . )Dynamism (resource churn, load changes, . . . )

GoalA platform (concepts, abstractions, algorithms. . . ) that facilitatesdevelopment of self-managing applications in large-scale and/ordynamic distributed environment.

A methodology that help us to achieve self-management.

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 6/25

Page 16: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

The ProblemAutonomic ComputingThe Goal

Research Plan

Self-Management in large-scale distributed systems. Consists of fourmain parts:

Part 1: Touchpoints and feedback loops in distributed systems

Part 2: Robust Management

Part 3: Improve management logic

Part 4: Integrate previous parts in a self-managing system.

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 7/25

Page 17: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Outline

1 Introduction

2 Niche Platform

3 Robust Management Elements

4 Future Work

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 8/25

Page 18: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Niche

Niche is a Distributed Component ManagementSystem

Niche implements the Autonomic ComputingArchitecture for large-scale distributed environment

Niche leverages Structured Overlay Networks forcommunication and for provisioning of basic services(DHT, Publish/Subscribe, Groups, etc.)

Monitor

Analyze Plan

Execute

Touch Point

Autonomic Manager

Managed Resource

Knowledge

Managed Resource

Touch Point

Manager

Interface

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 9/25

Page 19: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Management Part

Management ElementsWatchersAggregatorsManagersExecutors

Communicate through events

Publish/Subscribe

Autonomic Managers (controlloops) built as network of MEs

Sensors and Actuators forcomponents and groups

Actuation APIFunctional Part

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25

Page 20: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Management Part

Management ElementsWatchersAggregatorsManagersExecutors

Communicate through events

Publish/Subscribe

Autonomic Managers (controlloops) built as network of MEs

Sensors and Actuators forcomponents and groups

Actuation APIFunctional Part

Management Part

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25

Page 21: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Management Part

Management ElementsWatchersAggregatorsManagersExecutors

Communicate through events

Publish/Subscribe

Autonomic Managers (controlloops) built as network of MEs

Sensors and Actuators forcomponents and groups

Actuation APIFunctional Part

Management Part

Watcher

Aggreg.

Manager

ExecutorExecutorWatcher Watcher

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25

Page 22: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Management Part

Management ElementsWatchersAggregatorsManagersExecutors

Communicate through events

Publish/Subscribe

Autonomic Managers (controlloops) built as network of MEs

Sensors and Actuators forcomponents and groups

Actuation APIFunctional Part

Management Part

Watcher

Aggreg.

Manager

ExecutorExecutorWatcher Watcher

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25

Page 23: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Management Part

Management ElementsWatchersAggregatorsManagersExecutors

Communicate through events

Publish/Subscribe

Autonomic Managers (controlloops) built as network of MEs

Sensors and Actuators forcomponents and groups

Actuation APIFunctional Part

Management Part

Watcher

Aggreg.

Manager

ExecutorExecutorWatcher Watcher

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25

Page 24: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Management Part

Management ElementsWatchersAggregatorsManagersExecutors

Communicate through events

Publish/Subscribe

Autonomic Managers (controlloops) built as network of MEs

Sensors and Actuators forcomponents and groups

Actuation APIFunctional Part

Management Part

Watcher

Aggreg.

Manager

ExecutorExecutorWatcher Watcher

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 10/25

Page 25: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Runtime Environment

Containers that hostcomponents and MEs

Use a Structured OverlayNetwork for communication

Provide overlay services

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 11/25

Page 26: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Runtime Environment

Containers that hostcomponents and MEs

Use a Structured OverlayNetwork for communication

Provide overlay services

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 11/25

Page 27: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Runtime Environment

Containers that hostcomponents and MEs

Use a Structured OverlayNetwork for communication

Provide overlay services

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 11/25

Page 28: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Dealing with Resource Churn

How to deal with failures?MEs heal the functional partHow to heal failed MEs?

Programmatically in themanagement logicTransparently by theplatform

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 12/25

Page 29: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Dealing with Resource Churn

How to deal with failures?MEs heal the functional partHow to heal failed MEs?

Programmatically in themanagement logicTransparently by theplatform

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 12/25

Page 30: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Niche OverviewManagement PartRuntime Environment

Dealing with Resource Churn

How to deal with failures?MEs heal the functional partHow to heal failed MEs?

Programmatically in themanagement logicTransparently by theplatform

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

??!!

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 12/25

Page 31: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Outline

1 Introduction

2 Niche Platform

3 Robust Management Elements

4 Future Work

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 13/25

Page 32: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Robust Management Elements

A Robust Management Element (RME):

is replicated to ensure fault-tolerance

tolerates continuous churn by automatically restoring failedreplicas on other nodes

maintains its state consistent among replicas

provides its service with minimal disruption in spite of resourcechurn (high availability)

is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25

Page 33: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Robust Management Elements

A Robust Management Element (RME):

is replicated to ensure fault-tolerance

tolerates continuous churn by automatically restoring failedreplicas on other nodes

maintains its state consistent among replicas

provides its service with minimal disruption in spite of resourcechurn (high availability)

is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25

Page 34: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Robust Management Elements

A Robust Management Element (RME):

is replicated to ensure fault-tolerance

tolerates continuous churn by automatically restoring failedreplicas on other nodes

maintains its state consistent among replicas

provides its service with minimal disruption in spite of resourcechurn (high availability)

is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25

Page 35: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Robust Management Elements

A Robust Management Element (RME):

is replicated to ensure fault-tolerance

tolerates continuous churn by automatically restoring failedreplicas on other nodes

maintains its state consistent among replicas

provides its service with minimal disruption in spite of resourcechurn (high availability)

is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25

Page 36: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Robust Management Elements

A Robust Management Element (RME):

is replicated to ensure fault-tolerance

tolerates continuous churn by automatically restoring failedreplicas on other nodes

maintains its state consistent among replicas

provides its service with minimal disruption in spite of resourcechurn (high availability)

is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25

Page 37: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Robust Management Elements

A Robust Management Element (RME):

is replicated to ensure fault-tolerance

tolerates continuous churn by automatically restoring failedreplicas on other nodes

maintains its state consistent among replicas

provides its service with minimal disruption in spite of resourcechurn (high availability)

is location transparent, i.e., RME clients communicate with itregardless of current location of its replicas

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 14/25

Page 38: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Solution Outline

Replicated state machine

An algorithm to reconfigure the replicated state machine. (Weused the SMART algorithm)

Our decentralized algorithm to automate reconfiguration

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 15/25

Page 39: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Solution Outline

Replicated state machine

An algorithm to reconfigure the replicated state machine. (Weused the SMART algorithm)

Our decentralized algorithm to automate reconfiguration

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 15/25

Page 40: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Solution Outline

Replicated state machine

An algorithm to reconfigure the replicated state machine. (Weused the SMART algorithm)

Our decentralized algorithm to automate reconfiguration

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 15/25

Page 41: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E

Admin

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 42: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E

Admin

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 43: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E

Admin

ConfigurationRepository

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 44: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E

Admin

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 45: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E

Admin

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 46: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E

Admin

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 47: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E X Y

Admin

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 48: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E X Y

Admin

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 49: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E X Y

Admin

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

{A,X,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 50: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART

A B C D E X Y

Admin

ConfigurationRepository

{A,X,C,D,E}

Configuration {A,X,C,D,E}

{A,X,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 16/25

Page 51: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

Any node can create a RSM. Select ID and replication degree

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 52: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

Any node can create a RSM. Select ID and replication degree

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

RSM ID = 10, f=4, N=32

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 53: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

The node uses symmetric replication scheme to calculate replica IDs

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 54: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

The node uses lookups to find responsible nodes . . .

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Responsible Node IDs = 14, 20, 29, 7

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 55: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

. . . and gets direct references to them

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Responsible Node IDs = 14, 20, 29, 7

RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Responsible Node IDs = 14, 20, 29, 7

Configuration = D, F,I ,B

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 56: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

The set of direct references forms the configuration

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Responsible Node IDs = 14, 20, 29, 7

RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Responsible Node IDs = 14, 20, 29, 7

Configuration = D, F,I ,B

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 57: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

The node sends a Create message to the configuration

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

SM10 r1

SM10 r2

SM10 r3

SM10 r4

RSM ID = 10, f=4, N=32RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Responsible Node IDs = 14, 20, 29, 7

RSM ID = 10, f=4, N=32

Replica IDs = 10, 18, 26, 2

Responsible Node IDs = 14, 20, 29, 7

Configuration = D, F,I ,B

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 58: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Creating a Replicated State Machine (RSM)

Now replicas communicate directly using the configuration

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 17/25

Page 59: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 60: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 61: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 62: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

{A,X,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 63: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

{A,X,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 64: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

{A,X,C,D,E} {A,B,C,Y,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 65: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{?,?,?,?,?}

Configuration {?,?,?,?,?}

{A,X,C,D,E} {A,B,C,Y,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 66: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,Y,E}

Configuration {A,B,C,Y,E}

{A,X,C,D,E} {A,B,C,Y,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 67: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 68: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}

{ ,X, , , }

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 69: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}Proposed Changes { ,X, , , }

{ ,X, , , }

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 70: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}Proposed Changes { ,X, , , }

{ ,X, , , } { , , ,Y, }

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 71: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}Proposed Changes { ,X, , Y, }

{ ,X, , , } { , , ,Y, }

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 72: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,B,C,D,E}

Configuration {A,B,C,D,E}Proposed Changes { ,X, , Y, }

New Configuration{A,X,C,Y,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 73: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

SMART with Multiple Admins

A B C D E X Y

Admin Admin2

ConfigurationRepository

{A,X,C,Y,E}

Configuration {A,X,C,Y,E}

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 18/25

Page 74: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Handling Churn

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25

Page 75: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Handling Churn

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

F

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25

Page 76: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Handling Churn

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

SM10 r2

F

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25

Page 77: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Handling Churn

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

SM10 r2

SM10 r2 = G

F

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25

Page 78: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Handling Churn

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

SM10 r2

SM10 r2 = G

G1 2 3 4

F

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25

Page 79: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Handling Churn

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

G1 2 3 4

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25

Page 80: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Handling Churn

0 12

3

4

5

6

7

8

9

10

11

12

13

14151617

1819

20

21

22

23

24

25

26

27

28

29

3031

A

B

C

DE

G

H

I

G1 2 3 4

Configuration_2 D G I B1 2 3 4

SM10 r1

SM10 r2

SM10 r3

SM10 r4

Configuration_1 D F I B1 2 3 4

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 19/25

Page 81: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Evaluation

Built a prototype implementation of RME

Simulation-based performance evaluation

Focused on the effect of the churn rate and replication degree onrequest critical path and failure recovery

Used the King latency dataset

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 20/25

Page 82: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Solution OutlineThe SMART Reconfiguration AlgorithmAutomatic ReconfigurationEvaluation

Request latency for a single client

0

1000

2000

3000

4000

5000

6000

7000

8000

0 1000 2000 3000 4000 5000 6000 7000 8000

Re

qu

est

La

ten

cy (

ms)

Request Number

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 21/25

Page 83: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Outline

1 Introduction

2 Niche Platform

3 Robust Management Elements

4 Future Work

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 22/25

Page 84: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Improve Management Logic

Apply control theory to distributed systems

Distributed optimization

Reinforcement Learning

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 23/25

Page 85: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Self-Management in Cloud Applications

Study elastic services in the Cloud

Develop self-management techniques for Cloud applications

Integrate all pieces into an elastic storage system

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 24/25

Page 86: Self-Management for Large Scale Distributed Systems · Introduction Niche Platform Robust Management Elements Future Work The Problem Autonomic Computing The Goal Outline 1 Introduction

IntroductionNiche Platform

Robust Management ElementsFuture Work

Thank you for careful listening :-)

Questions?

Self-Management for Large Scale Distributed Systems (A. Al-Shishtawy) 25/25