Framework for replica selection in fault-tolerant distributed systemscsse.usc.edu/TECHRPTS/2007/usc-csse-2007-702/usc-… · · 2008-03-05Framework for replica selection in fault-tolerant

Framework for repl ica select ion in

fault-tolerant distributed systems

Daniel Popescu

Computer Science DepartmentUniversity of Southern California

Los Angeles, CA 90089-0781{dpopescu}@usc.edu

Abstract. This paper describes my term project, which I developed in the course CS 589 Software Engineering for Embedded Systems. The term project should be a design and an implementation of a novel application or development tool that exploits one or more existing approaches to software engineering in the context of embedded systems, demon-strates a novel idea in this domain, or overcomes a known significant challenge posed by embedded systems. In my project I examined how to select replica components in fault-tolerant systems to increase the overall reliability of the system considering the addi-tional costs of the deployed replica components. As a result, I developed a framework for different replica selection algorithms and evaluated five selec-tions strategies for this framework.

1 Introduction

Software plays more and more of an integrated part in our everyday environment. Al-most every electronic device in a household contains complex software. Since software is also integrated in critical devices, systems are desired, which can still operate after a severe fault occurred. Distributed and mobile systems especially have more points of possible failure than desktop applications. Therefore, these systems have a higher like-lihood of failing. In a desktop system, when the hardware completely fails, the whole software system is now unavailable. In a distributed system, which consists of many hardware notes, a failure of a hardware node does not mean that the whole system fails. It can still operate, if software services are independent or if replica of critical software components exists. If the replication of components is used as a fault-tolerance strategy, the following problem arises. If a hardware note fails and we have replicated certain software components, did we replicate the right component? To be sure that we have replicated the right component, we could replicate every single software component. However, this is not feasible because additional software components consume memo-ry, process power and bandwidth among others. Therefore, a trade-off decision about criticality, reliability and costs of software components is needed to find the right set of replica.

Finding a good trade-off is a difficult problem, because many decision variables ex-ist. An analyst has to decide which components to replicate considering the importance

2.

of use cases, the resource consumption of software components, the reliability of the components, the computer hardware and the network. After selecting the components the analyst has to decide where to deploy these components again considering the above mentioned dimensions.

To address this replica selection problem I transformed it into an optimization prob-lem [4]. In an optimization problem the objective is to find the best possible solution satisfying all given constraints. Optimization or mathematical programming is a well-studied domain in applied mathematics and operations research. Therefore, different methods and algorithms have been developed to solve such problems using strategies such as non-linear programming, greedy strategies or genetic algorithms.

Since different algorithms can be used to solve optimization problems and since no canonical set of design constraints for replica exists, I developed a framework providing extension ports for additional constraints and different algorithms allowing customized instantiation of the replica model. This framework can be seen as the base implementa-tion for the development of different later models.

The class project required not only designing a novel approach for the embedded system domain, but also an implementation. This paper also describes the architecture and the usage of the implemented tool. The tool is an Eclipse plugin extending the at USC developed deployment tool DeSi [4]. Additionally, it is able to export data to the parallel coordinates visualisation tool Parvis [2] for further analyses.

The remainder of this paper is structured as follows. Section 2 describes the devel-oped framework model, an instantiation of the model with some constraints and the de-veloped algorithms to solve the optimization problem. Section 3 describes the devel-oped tool. Section 4 shows an evaluation of the developed approach. Section 5 con-cludes this paper and shows possible future research ideas.

2 Framework

Based on the framework model of Malek et al. [4], the developed replication framework model enables the modelling of distributes system. It allows to define customized con-straints and parameters allowing instantiations of the model to solve own defined trade-off scenarios. Furthermore, since multiple strategies exist to solve optimization prob-lems, the framework provides different algorithms to find solutions.

2.1 Framework modelThe basic entities of the framework model include hosts, components, links and

services. These entities seem appropriate to model component deployments in distrib-uted systems. In detail, a model consists of

• A set of hosts, H, which represents the hardware nodes of a system. Possible at-tributes of hardware nodes are the capacity of memory, process power or energy consumption.

• A set of components, C, where a possible attribute could be the size of the com-ponent. Each component is deployed on one of the above defined hosts and each component has a set of replica R, which also includes the original component.

3.

• A set of services, S, which describe the different use cases that the whole system offers and can perform. A service is composed of the interaction of components in a system. Since multiple services could use an e.g. encryption component, a component can appear in multiple services.

The model contains three different types of links: physical networks links, logical links and service logical links. Therefore, the model contains three different sets of links.

• The set of physical network links, PL, which connect two hosts to each other showing that components on these host could interact with each other. Physical network links have properties like bandwidth among others.

• The set of logical links, LL. Logical links show that two components are able to interact with each other, because they know e.g. how to invoke methods in each other.

• The set of service logical links, SLL. A service logical link connects two compo-nents, which are part in the same service. The service logical link shows that these two components interact in a use case of the system. It can have properties such as exchanged average data size and average data exchange frequency. Two com-ponents can only have a service logical link, if they also have a logical link be-tween each other.

To set up the global optimization function we need some more attributes for the de-fined basic entities.

• Reliability. Different entities can fail in a distributed system. The host can break down, a physical network link can fail or a software component reaches a certain failure state. Therefore, each host, each physical link and each component in the model has a reliability value, which describes the likelihood that this entity will not fail during the operation.

• Service Criticality. A use case of a system has different priorities for a user. For example, in a car a working brake system has a higher criticality in the overall system than an entertainment system. Therefore, my model considers service crit-icality.

After defining all required elements, the designed global optimization function can be derived. It is based on the original deployment, the criticality of services and the re-liability of components.

α Reliability Sj Ci,( ) Criticality Sj( )α 1if Ci Sj∈( )=

α 0if Ci Sj∉( )=

⋅ ⋅

Sj 1=

S

∑Ci 1=

C

∑

4.

Creating replica components for certain services increases the reliability of a com-ponent. Therefore, the reliability function computes the reliability of the component based on the amount of component instances. Each replica of a component is an addi-tional instance of the original component.

The reliability definition above is based on the assumption that the failing events of each replica are independent. This should be the case, since all replica are on different hosts and therefore should have different causing events for host and link failures. There can never be the same components deployed on the same host, because a constraint of the framework prohibits this double deployment.

The probability of failure of a component is dependent on the host on which it is placed and on the links which it uses to serve a certain service. If a host fails ( ) the component also fails, if a host does not fail ( ) then the probability of failure depends on the link reliability and the component reliability. Therefore, through application of probability theory (i.e. conditional probability and the law of total prob-ability) the following formula for the component reliability can be derived.

For this formula I assume that a component and link cannot exist without running host. If the host fails, then each attached network link and each software component, which was deployed on it, fails also.

As an example, if a host fails 30% of the time, one link fails 20% of the time and the software component does not fail at all, then we arrive at the following formula.

In this example the probability that the component fails is 34% considering host and link failure.

2.2 Framework InstantiationFor this framework, an active replication strategy is assumed although it can be adapted to other replication strategies as well. In an active replication all replica are running at the same time. The above defined global optimization function does not include any constraints. This is a wrong assumption, because hosts have memory and processing constraints and the original component forwards each received event to each replica. Therefore, each additional replica consumes resources of the system. The framework itself does not require any constraints. However, different constraints can be defined.

Reliability Sj Ci,( )

1 FailProbability Sj Replica Rk Ci,( ),( )

k

ReplicaSet Ci( )

∏–

=

P H( )P H( ) 1 P H( )–=

FailProbability Ci( ) P H( ) P H( ) P L H( )⋅ P H( ) P L H( ) P Ci L H,( )⋅ ⋅+ +=

FailProbability Ci( ) P H( ) P H( ) P L H( )⋅ P H( ) P L H( ) P Ci L H,( )⋅ ⋅+ +0.3 0.7 0.2 0.7 0.8 0⋅ ⋅+⋅+ 0.34

== =

5.

To make a reasonable trade-off decision, each instantiation of a model should be instan-tiated with some constraints.

Memory consumption is an example of a constraint for the model. Each host has only a limited space available to install software components. Each software component has a size. This constraint ensures that the optimization algorithm will not generate too many replica components. At most it will generate as many components as will fit on the hosts. Using all remaining space for replica components is also not optimal, since no space for maintenance tasks would exist. Realistic boundaries for the memory usage are required. However, how does the algorithm know what are good boundaries for memory usage?

Deciding on a trade-off requires intelligence, understanding and domain knowl-edge. An optimization algorithm does not know what memory usage is. Therefore, we need to add some human input to the model. Since each important parameter of the sys-tem is modeled, an engineer can examine how much of each resource is used and how much of each resource is available. Consequently, he can set these constraints for a cer-tain system before running the optimization algorithm.

Bandwidth consumption is another example of a constraint. Each replica is placed on a different host. Whenever a data event is sent to the component, the data event needs also to be forwarded to all replica components. This can generate a lot of overhead traf-fic. A human engineer knows how much overhead traffic he wants to allow providing the input for the optimization algorithm. Using the engineer’s input and considering the available bandwidth of each physical network link, we have input for a realistic con-straint for the optimization algorithm.

2.3 Framework algorithmsSeveral algorithms with different run-time behavior and quality of the results exist to solve optimization problems. Since optimization problems often are so complex that they require an algorithm with exponential runtime, algorithms can be chosen, which are based on heuristics finding solution that are approximations of the optimum.

2.3.1 Exhaustive SearchThe first algorithm is an exhaustive search. This algorithm always finds the optimal so-lution by computing the global function value for every possible replication. Therefore, if the algorithms find a solution, we can be sure that it has found the best replication strategy.

What are the dimensions of this algorithm? Each component can be replicated mul-tiple times and be put on each host. The only constraint is that one host can only run one instance of a component. Therefore, the exhaustive search tries out configura-tions. This runtime complexity causes the algorithm to be inapplicable for already small problems. Therefore, the algorithm’s primary function is serving as a benchmark for other algorithms.

2.3.2 Greedy SearchGreedy algorithms are iterative algorithms, which find a better solution in each iteration step. They operate by selecting the best possible solution in each step. Therefore, greedy

2 C H⋅

6.

algorithms have a much better runtime compared to an exhaustive search enabling the optimization of large distributed systems. Greedy algorithms only find the optimal so-lution if the search space does not have local maxima, which is rarely the case. The so-lution that a greedy algorithm computes is therefore often only an approximation of the optimal results.

In this framework the greedy algorithm replicates in every step a component on a host, which improves the optimization function the most. Therefore, it computes in very step possibilities. In the worst case each component gets replicated on each host. Since in every step only one component gets replicated and only one replica can be installed on each host, we need at most iterations. Therefore, if we assume no constraints such as memory consumption or bandwidth consumption, the greedy al-gorithm has the runtime . This low runtime complexity enables the use of the greedy strategy also for large distributed systems.

2.4 Practical discussionTo compute an optimal optimization, the framework requires quantifiable values.

Values such as component size or the available memory of the host can be easily meas-ured in the implemented system. However, values such as reliability or average data transmission cannot be exactly specified a priori to the runtime. These values need to be obtained using expert knowledge gained from earlier systems or through dynamic simulation [1].

3 Tool Support

An implementation of the above described conceptual model is another part of the de-liveries. This section describes the implementation in detail.

3.1 Architecture

3.1.1 Foundation of the toolThe developed tool was developed as a plug-in for Eclipse. The Eclipse framework is an open source platform-independent software framework. It is mainly used as an IDE, however it was developed to be a general platform for rich clients. Its basic version con-sists of the Rich Client Platform, which provides the basic functionality for extensions, a framework for views and editors, and plug-ins for the Java programming language. It integrates OSGi [5] as a component framework and provides extension ports for addi-tional components.

Besides requiring the basic Eclipse framework, the replication plug-in also requires and extends the tool DeSi. The DeSi tool, an Eclipse plug-in itself, is a visual deploy-ment exploration environment that supports specification, manipulation, visualization, and (re)estimation of deployment architectures for large-scale, highly distributed sys-tems. DeSi exports an API for modifying its deployment model, which can be used to define new system parameters, since the underlying deployment framework for DeSi is similar to the framework of the paper. The basic data model and graphical user interface of DeSi could be reused for this project.

C H⋅

C H⋅

O C 2 H 2⋅( )

7.

3.1.2 The extension mechanism of the replication plug-inThe replication plug-in was designed to provide two distinct extension ports for future change cases. It provides an API to plug in new optimization algorithms and to plug in new constraints for the algorithms.

To develop a new algorithm, a new subclass of an abstract base class has to be im-plemented. The new developed algorithm is automatically integrated in the program, re-ceiving all required data and being executable through the graphical user interface of the replication plug-in.

Adding a new constraint is similarly easy. After implementing the general con-straint interfaces, the constraints are automatically integrated in the graphical user in-terface, enabling user input about allowed cost increase. Additionally, the constraint in-terface of each algorithm has to be implemented. If the constraint interface of an algo-rithm is implemented, the constraint is considered automatically in the run of this algorithm.

3.1.3 Parvis - The analysis COTS componentTo enhance visualization of the results, the replication plug-in exports its results in

a readable format for the parallel coordinate visualization tool Parvis.

Parvis uses the visualization technique called parallel coordinates. Parallel Coordi-nates are a way to visualize multi-dimensional data. In this visualization technique each dimension is represented as a parallel axes with equal distance to each other. Each value of an n-dimensional data point is marked on the parallel axes and connected through one line. Therefore, one data point can be traced over the different dimensions. To increase the readability of the visualization, the user can highlight since data points.

The replication tool exports its data using five dimensions. Each component is a five-dimensional data point. The dimensions are the original reliability value, the im-proved reliability value, the number of component instances (replica + the original com-ponent), the number of services, where the component appears, and the average critical-ity of these services.

8.

3.2 Usage Description

3.2.1 The main views of the replication toolThis section shows an example execution of the tool. The replication plugin can be in-voked through a menu item in Eclipse.

The screenshot above shows the deployment of an example embedded system. 12 components are deployed on four hosts. Each host is connected to each other host, which the black connecting lines indicate. If two components are connected, they are able to communicate. The properties of each entity (hosts, links and components) can be edited in the view below.

9.

As described above components together form services, which represent the use cases of the system for the user. Services and their component interaction can be mod-eled in the service view below.

3.2.2 The replication wizardAfter the system is modeled as desired, the algorithms can be invoked on the model us-ing a graphical wizard.

The screenshot above shows the first page of the wizard. As the first step the opti-mization strategies can be chosen. In the current version five strategies are implement-ed.

10.

In the second and also the final page of the wizard, each constraint is displayed showing resources used in the whole system and the total resources, which are available in that dimension. The user can enter how much cost increases for each constraint he tolerates. In the screenshot above a cost increase for 75% is entered. Since 40.596 units of memory are already in use, 175% would correspond to 71.043 units of memory. The algorithm ensures that the solution does not consume more resources as allowed. If the specified allowed increase exceeds the available resources, the available resources are the boundary for the algorithm.

After the algorithm computed the best replica, it adds replica component to the model. These component are highlighted by a grey bar on the bottom of the component. All properties can be examined in the same way as for the original components.

In addition, a replica service is created in the service view showing which compo-nents are connected to which replica components. In this view it can be analyzed how much data is transferred from each component to its replica. The replication data trans-fer is inferred from the data exchange from each component of its services.

11.

As an additional step, the effectiveness of the chosen algorithm can be analyzed us-ing the integrated COTS tool Parvis.

In the parallel coordinate result graph, a user can see in a overview how the relia-bility of each component improved. Furthermore, he can read other parameters as serv-ice criticality. Therefore, the results of the algorithms can be visual validated.

4 Evaluation

For the evaluation of the system I developed three comparison heuristics as algorithm extension for the replication tool.

Since the domain is complex and often the variables are unattainable, human engi-neers can use heuristics for deciding on a replica. The first heuristic replicates each component of the system on each host maximizing reliability. This strategy provides the maximum possible reliability in the system while being very costly. However, in some systems costs are not critical. Therefore, it is still a valid strategy. The second strategy is to replicate each component once. Therefore, if a component fails, there should be always one component, which can be used instead. This approach is less expensive than the first heuristic. In unreliable environments, where components fail more often, this strategy might be insufficient. Replicating every component twice, could still be inex-pensive, while providing higher reliability. These more easily comprehendable selec-tion strategies are compared against the exhaustive search and the greedy algorithm. Note that all three heuristics do not consider any constraints while creating replica com-ponents.

To evaluate the algorithms, I ran each described algorithm on three randomly gen-erated distributed systems with parameters in the following data ranges. The units of at-tributes such as component size are not specified, since many different units are possi-ble and this information is not essential for the algorithms.

Attribute RangeComponent Sizes 1..5Service Criticality 1..3

12.

The algorithms are compared in the dimension of overall bandwidth usage increase, overall memory cost increase and the increase in the global optimization function value. Additionally, the test shows the reliability value of the two components with the lowest reliability after the optimization ( and ), because these two components are the weakest points in the whole system. They are the most likely to fail.

All three experiment runs have four hosts, four services and twelve components. Even for this small configuration, the exhaustive search is already not any more feasi-ble.

The result of the three experiments can be seen in the following tables. The first col-umn Base shows the values before a replication strategy was applied. For Greedy (50%)the greedy algorithm was executed, whereas both memory and bandwidth constraint al-lowed only a cost increase of 50%. For Greedy (100%) the allowed cost increase was specified as 100% and for Greedy (150%) it was 150%.

All three experiments were too complex for the exhaustive search. Therefore, exe-cuting the exhaustive search was not feasible. Instead the Replicate All strategy was used as a benchmark. In this strategy each component is replicated on each host.

Event Data Frequency 1..10Event Data Size 1..10Host Reliability 0.9..1Component Reliability 0.7..1

Table 1: Experiment 1

Name Bandwidth Increase

Memory Increase Optimization

Base 40.6 255.13 16.17 0.743 0.761

Greedy (50%) 56.59 366.02 21.59 0.838 0.853

Greedy (100%) 78.38 451.14 29.50 0.951 0.956

Greedy (150%) 97.43 565.91 30.76 0.974 0.988

Replicate All 162.38 871.07 30.97 0.997 0.998

Replicate each once 81.19 460.44 30.26 0.934 0.954

Replicate each twice 121.79 665.76 30.86 0.990 0.992




Base 36.3 429.93 21.42 0.760 0.773

Cw1 Cw2

Cw1 Cw2

Cw1 Cw2

13.

In general, it can be observed that all algorithms showed a similar behavior in all three experiments.

The first result is that a cost increase of 50% increases the overall reliability, but seems to be in general too little compared to the maximum possible reliability of the Replicate All strategy. This can especially be seen in the overall optimization value of each experiment and in the small changes in the reliability of the most unreliable com-ponent.

The Greedy (100%) strategy produces already better results, which are closer to the best possible solution. Replicating each component once has slightly better results than the Greedy (100%) strategy while being only slightly more expensive. A similar trend can be observed with the Greedy (150%) strategy compared to strategy, which repli-cates each component twice. Both strategies compute almost the same results, but the

Greedy (50%) 48.68 660.67 26.94 0.824 0.826

Greedy (100%) 63.82 898.14 31.90 0.824 0.826

Greedy (150%) 76.36 1073.28 35.40 0.946 0.955

Replicate All 145.2 1970.28 36.97 0.997 0.998






Base 32.25 426.42 28.78 0.774 0.775

Greedy (50%) 47.43 622.15 37.21 0.774 0.784

Greedy (100%) 64.39 866.15 43.21 0.946 0.957

Greedy (150%) 79.69 1113.92 44.69 0.990 0.995

Replicate All 129.02 1901.73 44.98 0.998 0.998





Memory Increase Optimization Cw1 Cw2

Cw1 Cw2

14.

Greedy (150%) performs slightly worse while having a slightly better resource con-sumption. Both strategies reach values, which are close to the best possible solution.

How are these results transferable to the replica selection in fault-tolerant systems? If a moderate reliability is sufficient or the hardware resources are constrained, the Greedy (100%) or the strategy, which replicates each component once, is suitable. If a high reliability is required replicating every component twice or using the Greedy (150%) strategy can be used. Both strategies require approximately three times the re-sources as the original deployment.

It is interesting that a simple heuristic, such as replicating every component once, produces almost the same results as the more sophisticated greedy search. Since it might be difficult to obtain all reliability values and other parameters of the model, the simple heuristic can be utilized in many fault-tolerant distributed systems. If, however, more uneven constraints boundaries exist (e.g. 65% allowed cost increase) the greedy algo-rithm should be selected.

5 Future

The developed framework helped to explore the problem space of the replication do-main. Several possible future questions could still be answered.

• Since the greedy algorithm performs only slightly better than the simple heuristic, would a better solution provided by a non-linear programming solver create better results?

• How do the developed algorithms perform when they are bounded by new addi-tional constraints and how could other resource constraints be modeled?

• The set of randomly created experiments helped to understand the different strat-egies better. As a next step the algorithms could be applied in real systems.

6 Contribution

In conclusion this term project made several contributions. • Design and implementation of a novel approach for Component Replica Selec-

tion.• A framework plus tool implementation to facilitate complex trade-off decisions

between reliability and replica overhead costs based on service criticality.• An architecture that provides extension ports for additional constraints and algo-

rithms• Evaluation of different replication selection strategies• Design and COTS component integration of parallel coordinates technique to vis-

ualize the results and the fault-tolerance improvements of the analyzed system.

15.

7 References

[1] Edwards, G. et al. Scenario-Driven Dynamic Analysis of Distributed Architectures, USC-CSE-2006-617

[2] H. Hauser, F. Ledermann, and H. Doleisch. Angular brushing for extended parallel coor-dinates. In Proc. of the IEEE Symposium on Information Visualization, pages 127--130, 2002

[3] Inselberg, A. and Dimsdale, B. 1987. Parallel coordinates for visualizing multi-dimension-al geometry. In CG international '87 on Computer Graphics 1987 (Karuizawa, Japan). T. L. Kunii, Ed. Springer-Verlag New York, New York, NY, 25-44.

[4] Malek, S. et al., A User Centric Approach for Improving A Distributed Software System's Deployment Architecture, USC-CSE-2006-602

[5] OSGi Alliance, OSGi Service Platform, Release 3, Mar 2003.

Documents

Framework for replica selection in fault-tolerant distributed systemscsse.usc.edu/TECHRPTS/2007/usc-csse-2007-702/usc-… · · 2008-03-05Framework for replica selection in fault-tolerant