

Enhanced N-Version Programming and Recovery Block Techniques for Web Service Systems

Kuan-Li Peng1, Chin-Yu Huang1, Pin-Heng Wang2, Chao-Jung Hsu3

1Department of Computer Science

National Tsing Hua University Hsinchu, Taiwan

2System Development Department III

Alpha Networks Inc. Hsinchu Taiwan

3Medical Department Altek Corporation Hsinchu, Taiwan

ABSTRACT

In recent years, web services (WS’s) have been widely used to

support interoperable machine-to-machine interaction over a

network. In order to ensure a reliable WS system, a number of

fault tolerance designs have been proposed. It is known that

network connection and hardware devices may fail. In addition,

the acceptance test (AT) as well as the decision mechanism (DM),

which are common in fault tolerance designs, could also fail

unexpectedly. Such uncertainties may affect the reliability of a

WS-based system but have not yet been carefully considered in

reliability modeling. Therefore, we propose extended NVP

(ENVP) and extended RB (ERB) for the reliability analysis.

Various operations of ENVP and ERB are discussed, and a

simulation procedure is implemented to evaluate the system

reliability and the failure probability of fault-tolerant WS-based

systems. The experimental results show a high degree of

correlation between the numbers of AT’s and the reliability

improvements. The proposed fault tolerance designs could

improve the system reliability, and the simulation procedure

could also help in exploring appropriate configurations of fault

tolerance designs for practitioners.

Categories and Subject Descriptors

C.4 [Performance of Systems]: design studies, fault tolerance,

modeling techniques, reliability, availability, and serviceability;

H.3.5 [Online Information Systems]: Web-based services.

General Terms

Management, Design, Reliability, Experimentation.

Keywords

Acceptance Test (AT), Decision Mechanism (DM), Fault

Tolerance Design, Reliability Assessment, Web Service (WS).

NOTATIONS

n: Number of functionally identical WS’s.

m: Number of AT’s.

$WS_i$: The i-th functionally identical WS. Specifically, $WS_1$ is the primary WS (PWS) and $WS_2$ to $WS_n$ are the alternative WS’s (AWS).

$AT_r$: The r-th AT.

$H_{mid}$: Host machine of the middleware.

$H_i$: Host machine of $WS_i$.

$P_i$: Provider of $WS_i$.

$N_i$: Network connected to $WS_i$.

$N_C$: Network connected to a client.

$r_{comp}$: Reliability of the component.

$f_{comp}$: Failure probability of the component, $f_{comp} = 1 - r_{comp}$.

1. INTRODUCTION

A Web service (WS) is generally defined as a software system designed to support interoperable machine-to-machine interaction over a network [1]. In the past, different web-based techniques such as XML, simple object access protocol (SOAP), universal description discovery and integration (UDDI), and web services description language (WSDL) have been widely used to realize operations of the web service architecture [2]. By using WS-based techniques, users on different platforms could access different kinds of services easily. Software developers could also combine various WS’s into systems in order to achieve agile development. With more and more publicly available WS-based systems, how to implement a reliable WS platform has become an important issue [3-6].

On the other hand, it is hard to remove all faults during software operational phases. Software practitioners instead consider fault tolerance (FT) designs with redundancy or backups of core components to build highly reliable systems. N-version programming (NVP) and recovery block (RB) are two popular fault tolerance mechanisms [3, 5]. NVP utilizes functionally equivalent software components (versions) to enable software fault tolerance [4], and RB utilizes different representations of input data to tolerate design faults [6]. Other types of fault tolerance techniques, such as N-self-checking programming and distributed recovery block, were proposed in previous studies [7-16].

However, the reliability of web-based systems is hard to evaluate [8-10]. Berman et al. investigated several backup mechanisms and cost assessments to enhance the accuracy of software reliability analysis [6-8]. Communications through undependable networks between Web services make reliability assurance even more challenging. Specifically, host overload, software failures, hardware problems, and network congestion could all make a WS request fail. To better assess the reliability of WS-based fault tolerance systems, the above factors need to be considered [8, 9]. In addition, the factors of network connection and hardware devices have not been carefully considered in the reliability analysis of fault-tolerant WS-based systems. It is also noted that the acceptance test (AT) and the decision mechanism (DM) are usually assumed error free, but in fact these components may still fail and become a reliability bottleneck of a WS system [17]. In this paper, we propose extended NVP (ENVP) and extended RB (ERB) for WS-based fault tolerance systems. The corresponding reliability models are developed to analyze the proposed systems, and the experiments are implemented by a simulation procedure. Finally, reliability improvement suggestions for various fault tolerance configurations are discussed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

InnoSWDev'14, November 16-22, 2014, Hong Kong, China

Copyright 2014 ACM 978-1-4503-3226-2/14/11... $15.00

http://dx.doi.org/10.1145/2666581.2666587

2. RELATED WORKS

In the following, two fault tolerance techniques, NVP and RB, are

surveyed.

2.1 N-Version Programming (NVP)

NVP was first proposed by Elmendorf [2, 11] and further developed by Avizienis and Chen [9, 12]. NVP utilizes design diversity by incorporating functionally identical versions of a program. Different versions of a program could be executed concurrently, and a decision mechanism (DM) examines their outputs to determine a consensus result [4]. NVP enhances the dependability of the software system under the assumption that two or more versions of a program have a low probability of producing similar erroneous results simultaneously.
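The decision step described above can be sketched as a simple majority voter. This is only a minimal illustration, assuming that version outputs are compared by exact equality; the function name `majority_vote` and the use of `None` to mark a version that returned nothing are assumptions made here, not details from the surveyed work.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the output agreed on by a strict majority of versions, else None.

    A None entry stands for a version that failed to return a result.
    """
    n = len(outputs)
    votes = Counter(o for o in outputs if o is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    # A consensus needs more than half of all n versions, not just of the survivors.
    return value if count > n // 2 else None


consensus = majority_vote([42, 42, 7])   # two of three versions agree
```

With three versions, one erroneous or missing output is masked; two disagreeing or missing outputs leave no consensus, which is why the assumption of rarely coinciding errors matters.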

In the past few decades, researchers have focused on NVP to improve the overall quality and independence of the diverse developments, such as N-self-checking programming and N-version executive [9]. Lyu et al. applied an NVP design paradigm and achieved a significant reliability improvement [18, 19]. Additionally, a number of reliability models were proposed for evaluating the NVP designs. The stochastic reliability and safety models were separately developed by Tomek, Dugan, Mainini, and Rendell in [9]. It was noticed that common cause failures could occur in different versions of a program. Dai et al. took account of correlated failures and further modified the NVP reliability model [13]. After that, Teng and Pham developed NVP-based software reliability growth models which considered the error-introduction rate and the error-removal efficiency in the testing and debugging phases [20]. These models can be used to predict the system reliability and to assist the testing strategies.

2.2 Recovery Block (RB)

RB was first proposed by Horning et al. [21] and implemented by Randell [22], and it is commonly used for improving system reliability [2]. The primary version of a program is executed first, followed by an acceptance test (AT) to examine the correctness of results. If a failure is detected, the system is restored to a previous checkpoint state and executes the next alternative version if there is one. In real-time applications, a distinctive AT named a watchdog timer is also combined with the system [23].
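The control flow described above (run the primary, test the result, roll back, try the next alternate) can be sketched as follows. This is a minimal illustration rather than the scheme of [21, 22]: the function name `recovery_block`, the dictionary used as program state, and the treatment of an exception as a rejected result are all assumptions made for the example.

```python
import copy
from typing import Any, Callable, Dict, Sequence


def recovery_block(
    state: Dict[str, Any],
    versions: Sequence[Callable[[Dict[str, Any]], Any]],
    acceptance_test: Callable[[Any], bool],
) -> Any:
    """Try the primary version first, then each alternate, until the AT accepts."""
    checkpoint = copy.deepcopy(state)  # saved state to roll back to
    for version in versions:
        try:
            result = version(state)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # a crash is treated like a rejected result
        # Restore the checkpoint before running the next alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all versions were rejected by the acceptance test")


def faulty_primary(s: Dict[str, Any]) -> int:
    s["x"] = -1          # corrupts the state and returns a bad value
    return -1


def alternate(s: Dict[str, Any]) -> int:
    return s["x"] * 2


state = {"x": 4}
result = recovery_block(state, [faulty_primary, alternate], lambda r: r >= 0)
```

The alternate sees the restored state rather than the state corrupted by the primary, which is the purpose of the checkpoint.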

Zhou et al. showed that the reliability of a Web-service system

alone would not reflect the actual dependability users perceived

[8], and they categorized the failures of web-service systems into

four levels: service level, host level, provider level, and network

level. A number of reliability models were consequently

developed for Web-service systems. For example, Peng and

Huang [17] proposed a reliability framework for Service-Oriented

Architecture systems considering the reliabilities of services and

network conditions. Notably, the AT is also a possible source of

errors and could even be the reliability bottleneck of the RB [17].

Therefore, Elfawal et al. [14] proposed a fault-tolerant AT to

improve the service quality.

3. FAULT-TOLERANT RELIABILITY MODELING

In this section, we analyze the reliability of WS systems by

considering a number of fault-tolerant scenarios.

3.1 Extended NVP (ENVP)

Fig. 1 illustrates the message flow of ENVP with 3 AT’s. In the

ENVP scheme, client’s requests are forwarded to n functionally

identical WS’s through the middleware. The middleware serves

as the coordinator between the service users (client) and the

service providers (services). With the introduction of AT, after

receiving the outputs of the WS’s, the middleware could first

check the correctness of the outputs before passing them to the

DM, thus reducing the negative effects of common cause failures

as mentioned in section 2. Since the AT might still not be failure-

free, m redundant AT’s are used in case the originally selected AT

fails.

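The reliability of an ENVP is formulated as follows:

$$r_{ENVP} = r_{NVP} \left(1 - \prod_{r=1}^{m} f_{AT_r}\right) r_{DM}\, r_{H_{mid}}\, r_{N_C}, \quad m \ge 1, \tag{1}$$

where

$$r_{NVP} = \sum_{j=(n+1)/2}^{n} P(j), \quad n = 3, 5, 7, \tag{2}$$

and P(j) is the probability that exactly j functionally identical WS’s are executed successfully. Specifically for n=3,

$$\begin{aligned} r_{NVP} &= P(2) + P(3) \\ &= r_{\text{sub-}1}\, r_{\text{sub-}2}\, f_{\text{sub-}3} + r_{\text{sub-}1}\, f_{\text{sub-}2}\, r_{\text{sub-}3} + f_{\text{sub-}1}\, r_{\text{sub-}2}\, r_{\text{sub-}3} + r_{\text{sub-}1}\, r_{\text{sub-}2}\, r_{\text{sub-}3}, \end{aligned} \tag{3}$$

where $r_{\text{sub-}i}$ is the probability that $WS_i$ produces a correct result and the result is transmitted to the middleware successfully, formulated by

$$r_{\text{sub-}i} = r_{WS_i}\, r_{H_i}\, r_{P_i}\, r_{N_i}, \quad 1 \le i \le n, \tag{4}$$

and

$$f_{\text{sub-}i} = 1 - r_{\text{sub-}i}. \tag{5}$$

Based upon (1), the failure-free probability during the whole ENVP operation is

$$P_{ENVP(FF)} = r_{DM}\, r_{H_{mid}}\, r_{N_C} \prod_{i=1}^{n} \left( r_{WS_i}\, r_{H_i}\, r_{P_i}\, r_{N_i} \sum_{r=1}^{m} p^{i}_{AT_r}\, r_{AT_r} \right), \tag{6}$$

where $p^{i}_{AT_r}$ is the probability that the r-th AT is firstly chosen for $WS_i$. For the case of successful ENVP operation with some masked failures of the components (such as the case illustrated by Fig. 2), the probability can be obtained by

$$P_{ENVP(SO)} = r_{ENVP} - P_{ENVP(FF)}. \tag{7}$$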


ENVP will have incorrect results or throw an exception with one

of the following conditions:

(i) The majority of WS’s fail.

(ii) The DM or all AT’s fail.

(iii) The middleware or the network connections fail.

Its probability is obtained by

$$P_{ENVP(FO)} = 1 - r_{ENVP}. \tag{8}$$
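The ENVP formulas (1)-(8) can be evaluated numerically with a short script. The sketch below is one possible reading of those equations, not the authors' implementation: the function names, the argument `p_first` (the first-choice probabilities of the AT's for each WS), and all reliability values are hypothetical and chosen only for illustration. Each entry of `r_sub` is taken to be the per-WS product of Eq. (4).

```python
from itertools import combinations
from math import prod
from typing import Sequence


def r_nvp(r_sub: Sequence[float]) -> float:
    """Eq. (2): probability that a majority of the n WS's succeed (n odd)."""
    n = len(r_sub)
    total = 0.0
    for j in range((n + 1) // 2, n + 1):
        # P(j): sum over every way of choosing which j WS's succeed.
        for ok in combinations(range(n), j):
            total += prod(r if i in ok else 1.0 - r for i, r in enumerate(r_sub))
    return total


def r_envp(r_sub, r_at, r_dm, r_h_mid, r_n_c) -> float:
    """Eq. (1): majority of WS's, at least one working AT, DM, and middleware path."""
    return r_nvp(r_sub) * (1.0 - prod(1.0 - r for r in r_at)) * r_dm * r_h_mid * r_n_c


def p_envp_ff(r_sub, r_at, p_first, r_dm, r_h_mid, r_n_c) -> float:
    """Eq. (6): p_first[i][r] is the probability that AT r is chosen first for WS i."""
    per_ws = (
        r_sub[i] * sum(p * r for p, r in zip(p_first[i], r_at))
        for i in range(len(r_sub))
    )
    return r_dm * r_h_mid * r_n_c * prod(per_ws)


# Hypothetical component reliabilities, for illustration only.
r_sub = [0.9, 0.9, 0.9]       # Eq. (4) products, one per WS
r_at = [0.9, 0.9]             # two AT's
p_first = [[0.5, 0.5]] * 3    # each AT equally likely to be chosen first
r = r_envp(r_sub, r_at, 1.0, 1.0, 1.0)
ff = p_envp_ff(r_sub, r_at, p_first, 1.0, 1.0, 1.0)
so = r - ff                   # Eq. (7): success with masked failures
fo = 1.0 - r                  # Eq. (8): incorrect or exceptional operation
```

With these made-up inputs, the majority term is 0.972 and the AT term is 0.99, so most of the remaining unreliability comes from the WS's rather than from the AT's.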

3.2 Extended RB (ERB)

Fig. 3 illustrates the message flow of ERB with 3 AT’s. In the ERB scheme, the client’s requests are first forwarded to the primary WS ($WS_1$). After receiving the output of the WS, the middleware validates it with an AT. When the current output of $WS_i$ is not accepted by the AT, the checkpoint is restored and the next alternative WS ($WS_{i+1}$) is executed. There are m redundant AT’s. For each validation request, an AT is selected randomly, and the remaining AT’s serve as backups in case the selected AT fails.

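The reliability of an ERB is formulated as follows:

$$r_{ERB} = \left[1 - \prod_{i=1}^{n} \left(1 - r_{\text{sub-}i}\right)\right] r_{H_{mid}}\, r_{N_C}, \tag{9}$$

where

$$r_{\text{sub-}i} = r_{WS_i}\, r_{H_i}\, r_{P_i}\, r_{N_i} \left(1 - \prod_{r=1}^{m} f_{AT_r}\right). \tag{10}$$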

Fig. 1. ENVP message flow. Data passing is depicted by arrows. Circle-ended connections are failure operations. 1’s are requests from client. 2’s are outputs of WS’s. 3’s are responses to client.

Fig. 2. ENVP successful failure recovery.

Based upon (9), the failure-free probability during the whole ERB operation is

$$P_{ERB(FF)} = r_{WS_1}\, r_{H_1}\, r_{P_1}\, r_{N_1}\, r_{H_{mid}}\, r_{N_C} \sum_{r=1}^{m} p^{1}_{AT_r}\, r_{AT_r}. \tag{11}$$

The probability of successful ERB with some masked failures of

the components (such as the case illustrated by Fig. 4) is obtained

by

$$P_{ERB(SO)} = r_{ERB} - P_{ERB(FF)}. \tag{12}$$

Finally, the probability of incorrect or exceptional ERB

operations is

$$P_{ERB(FO)} = 1 - r_{ERB}. \tag{13}$$
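The ERB formulas (9)-(13) can be evaluated in the same way. The following is a sketch under stated assumptions, not the authors' code: the function names and every numeric value are hypothetical, and each entry of `r_path` stands for the product of the WS, host, provider, and network reliabilities of one WS.

```python
from math import prod
from typing import Sequence


def r_erb(r_path: Sequence[float], r_at: Sequence[float],
          r_h_mid: float, r_n_c: float) -> float:
    """Eqs. (9)-(10): at least one WS succeeds and is validated by a working AT."""
    at_ok = 1.0 - prod(1.0 - r for r in r_at)          # some AT works
    r_sub = [r * at_ok for r in r_path]                # Eq. (10)
    return (1.0 - prod(1.0 - r for r in r_sub)) * r_h_mid * r_n_c


def p_erb_ff(r_path_1: float, r_at: Sequence[float], p_first: Sequence[float],
             r_h_mid: float, r_n_c: float) -> float:
    """Eq. (11): the primary WS and the first-chosen AT both work."""
    return r_path_1 * r_h_mid * r_n_c * sum(p * r for p, r in zip(p_first, r_at))


# Hypothetical component reliabilities, for illustration only.
r_path = [0.9, 0.9]      # primary WS and one alternate
r_at = [0.9, 0.9]        # two AT's
r = r_erb(r_path, r_at, 1.0, 1.0)
ff = p_erb_ff(r_path[0], r_at, [0.5, 0.5], 1.0, 1.0)
so = r - ff              # Eq. (12)
fo = 1.0 - r             # Eq. (13)
```

In this made-up configuration the failure-free probability depends only on the primary WS and the first-chosen AT, so adding alternates raises the overall reliability through the recovered cases alone.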

4. SIMULATION AND ANALYSIS

In this section, a simulation procedure is used to evaluate various operations of ENVP and ERB, and the system reliability of ENVP and ERB is analyzed. We follow simulation methods similar to those in [24] and assume that the failure process can be described by a non-homogeneous Poisson process (NHPP). A well-known NHPP model, the Goel-Okumoto (GO) model, is selected to obtain the failure rates of AT’s and DM’s [25].
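For reference, the GO model is usually written in the following standard form. The symbols a, b, and t are the conventional ones from the reliability literature and are not taken from this paper, which does not list the parameter values it used.

```latex
% Goel-Okumoto NHPP model (standard textbook form)
% a > 0: expected total number of failures eventually observed
% b > 0: failure detection rate per fault
m(t) = a\left(1 - e^{-bt}\right), \qquad
\lambda(t) = \frac{dm(t)}{dt} = a\,b\,e^{-bt}
```

Here m(t) is the expected cumulative number of failures by time t and \lambda(t) is the failure intensity. How the fitted intensity is mapped onto the fixed per-component failure rates used in the simulation is not specified in this section.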

4.1 Simulation Procedure of ENVP

Fig. 11 illustrates the procedure to simulate ENVP operations.

The complete simulation loop is highlighted with color, the logics

related to multiple AT’s are colored red, and the DM parts are

colored brown in the figure.

The simulation initially receives the reliabilities or failure rates of each component within the system in step 1. These values are fixed in the operational phase and could be provided by applying the GO model from the middleware directly. During the simulation, a failure-free operation will be recorded when the invocations of all the replica services ($WS_1$-$WS_n$) successfully pass through the components along the WS transmission path in steps 3-6 and a correct result (R1) is returned. In addition to the correct results, failures detected in steps 3-4 and 6-10 will also cause WS failures (R2) or incorrect results (R3).

Fig. 3. ERB message flow.

Fig. 4. ERB successful failure recovery.

For each AT, a constant failure rate λ is compared with a random

variable from the interval (0, 1] in step 6. The AT fails when the

random variable is lower than λ, which forces the next AT to be

invoked if there is one in step 10. Tests for WS’s are similar

except that the comparisons are made with the failure rates of the

WS’s.
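The comparison rule above (a component fails when a draw from (0, 1] is lower than its failure rate) can be turned into a small Monte Carlo sketch. The code below is a simplified illustration and does not reproduce the flowchart of Fig. 11: it lumps each WS path into a single failure probability, uses one pool of AT's per request as in Eq. (1), and does not model incorrect results (R3). The function names and all failure rates are assumptions.

```python
import random
from typing import Dict, Sequence


def fails(failure_rate: float, rng: random.Random) -> bool:
    """Draw from (0, 1]; the component fails when the draw is below its rate."""
    return (1.0 - rng.random()) < failure_rate


def simulate_envp(f_sub: Sequence[float], f_at: Sequence[float], f_dm: float,
                  f_h_mid: float, f_n_c: float,
                  runs: int = 100_000, seed: int = 1) -> Dict[str, float]:
    rng = random.Random(seed)
    counts = {"failure_free": 0, "recovered": 0, "failed": 0}
    n = len(f_sub)
    for _ in range(runs):
        any_failure = False

        # One draw per WS path (service, host, provider, and network lumped together).
        ws_ok = [not fails(f, rng) for f in f_sub]
        any_failure |= not all(ws_ok)

        # Pick an AT at random; fall back to the remaining AT's if it fails.
        at_ok = False
        for r in rng.sample(range(len(f_at)), len(f_at)):
            if not fails(f_at[r], rng):
                at_ok = True
                break
            any_failure = True

        shared_ok = not (fails(f_dm, rng) or fails(f_h_mid, rng) or fails(f_n_c, rng))
        majority_ok = sum(ws_ok) > n // 2

        if not (majority_ok and at_ok and shared_ok):
            counts["failed"] += 1
        elif any_failure:
            counts["recovered"] += 1      # success with masked failures
        else:
            counts["failure_free"] += 1
    return {k: v / runs for k, v in counts.items()}


# Hypothetical failure rates, for illustration only.
result = simulate_envp([0.1, 0.1, 0.1], [0.1, 0.1], 0.01, 0.01, 0.01, runs=20_000)
```

Under these simplifications, the sum of the `failure_free` and `recovered` fractions should approach the value of Eq. (1) as the number of runs grows.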

4.2 Simulation Procedure of ERB

Fig. 12 illustrates the procedure to simulate ERB operations. The

simulation loop is highlighted with color, the logics of normal RB

are colored red, and the mechanisms of multiple AT’s are colored

brown in the figure.

In the beginning, simulation receives the reliabilities or failure

rates of the components within the system in step 1. During the

simulation, a failure-free operation will be recorded by

successfully passing all components along the transmission path

in step 3, the primary WS in steps 4 and 5, and the first chosen

AT in steps 6 and 7. When a failure occurs in steps 3, 5, or 7, the

ERB will successfully recover and still execute successfully as

long as alternative WS’s (step 10) or AT’s (step 12) are available

and the chosen AT does not falsely accept an incorrect result (step

9). ERB executes incorrectly (R3) when an incorrect result is not

identified by either the middleware in step 8 or the selected AT in

step 9. ERB may also fail (R2) when no correct results are

received and identified and no spare WS’s or AT’s are available.
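A comparable Monte Carlo sketch for ERB is given below. It is a simplified illustration consistent with Eqs. (9)-(11) rather than a reproduction of Fig. 12: each WS path is lumped into one failure probability, every WS attempt draws its AT's afresh, and false acceptance of an incorrect result (R3) is not modeled. The function names and failure rates are assumptions.

```python
import random
from typing import Dict, Sequence


def fails(failure_rate: float, rng: random.Random) -> bool:
    """Draw from (0, 1]; the component fails when the draw is below its rate."""
    return (1.0 - rng.random()) < failure_rate


def simulate_erb(f_path: Sequence[float], f_at: Sequence[float],
                 f_h_mid: float, f_n_c: float,
                 runs: int = 100_000, seed: int = 1) -> Dict[str, float]:
    rng = random.Random(seed)
    counts = {"failure_free": 0, "recovered": 0, "failed": 0}
    for _ in range(runs):
        # The middleware host and the client network are needed by every request.
        if fails(f_h_mid, rng) or fails(f_n_c, rng):
            counts["failed"] += 1
            continue
        any_failure = False
        accepted = False
        for f_ws in f_path:                     # primary first, then alternates
            if fails(f_ws, rng):
                any_failure = True              # roll back and try the next WS
                continue
            # Validate with a randomly chosen AT, then with the backups.
            for r in rng.sample(range(len(f_at)), len(f_at)):
                if not fails(f_at[r], rng):
                    accepted = True
                    break
                any_failure = True
            if accepted:
                break
        if not accepted:
            counts["failed"] += 1
        elif any_failure:
            counts["recovered"] += 1            # success with masked failures
        else:
            counts["failure_free"] += 1
    return {k: v / runs for k, v in counts.items()}


# Hypothetical failure rates, for illustration only.
result = simulate_erb([0.1, 0.1], [0.1, 0.1], 0.01, 0.01, runs=20_000)
```

As in the tables that follow, the `failure_free` fraction in this sketch depends only on the primary WS and the first-chosen AT, while extra WS's and AT's show up in the `recovered` fraction.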

4.3 Experimental Results

The experimental results are used to analyze ENVP and ERB operations. In order to investigate the reliability improvement of adding AT’s, the coefficient of determination $R^2$ is calculated to analyze the degree of association between the system reliability of ENVP/ERB and the number of WS’s (#VP) or AT’s (#AT) [26]. The correlation coefficient R between two random variables X and Y is defined as

$$R = \frac{\hat{S}_{XY}}{\hat{S}_X\, \hat{S}_Y}, \tag{14}$$

where

$$\hat{S}_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}, \quad \hat{S}_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}, \quad \hat{S}_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}, \tag{15}$$

and R ranges from -1 to 1. The coefficient of determination $R^2$ [26] is defined as the square of the correlation coefficient, taking on values between 0 and 1, and is used to explain how much of the variation of the system reliability is accounted for by the numbers of WS’s or AT’s in use.
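The computation in (14)-(15) can be checked with a few lines of code. The sketch below applies it to the "1VP" row of Table 1, taking the number of AT's as X and the measured reliability as Y; the function name `correlation` is an assumption made for the example.

```python
from math import sqrt
from typing import Sequence


def correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample correlation coefficient R as in Eqs. (14)-(15)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


num_ats = [1, 2, 3, 4]
reliability_1vp = [0.89193, 0.92584, 0.93588, 0.93876]   # "1VP" row of Table 1
r2 = correlation(num_ats, reliability_1vp) ** 2
print(round(r2, 5))   # 0.81647, the R^2 entry reported for "1VP" in Table 1
```

Because $R^2$ here measures only the linear association with the count of AT's, a row whose reliability rises sharply from 1 AT to 2 AT's and then flattens yields a value below 1 even though reliability never decreases.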

In ENVP and ERB simulations, the test runs 100,000 times.

Besides, the number of WS’s for ENVP is denoted as “1VP,”

“3VP,” and “5VP.” The number of WS’s for ERB is denoted as

“1RB,” “2RB,” “3RB,” and “4RB,” respectively. The number of

AT’s increases from “1AT” to “4AT” in ENVP and ERB

operations.

The results of various ENVP operations are shown in Tables 1-5 and Figs. 5-7. From Table 1, the $R^2$ values reveal a high degree of correlation between the measured system reliabilities and the numbers of AT’s or VP’s in ENVP operations. Moreover, from Tables 2-3 and Fig. 5, the system reliability can be gradually improved by increasing the numbers of AT’s and VP’s.

Further, ENVP operations become more reliable from “1AT” to “4AT.” Thus, the proposed fault-tolerant designs can help improve the system reliability of ENVP. If there is only 1 AT, increasing the number of VP’s beyond 3 has little impact on the system reliability of ENVP (0.91951 for “3VP” versus 0.91803 for “5VP” in Table 1). In addition, the system reliability improves only slowly once the number of AT’s reaches 2 or more. Therefore, for the sake of cost-effectiveness, “3VP” and a small number of AT’s may be considered.

Table 1. Reliability and Correlation of Various ENVP Configurations

#VP \ #AT   1AT       2AT       3AT       4AT       R^2
1VP         0.89193   0.92584   0.93588   0.93876   0.81647
3VP         0.91951   0.97461   0.98153   0.98256   0.70089
5VP         0.91803   0.97953   0.98467   0.98502   0.66546
R^2         0.70760   0.81810   0.79806   0.78976

Table 2. Probability of ENVP Failure-Free Operations

#VP \ #AT   1AT       2AT       3AT       4AT
1VP         0.89193   0.92584   0.93588   0.93876
3VP         0.73887   0.80125   0.82568   0.83484
5VP         0.61372   0.68756   0.71483   0.73143

Table 3. Probability of ENVP Successful Failure Recoveries

#VP \ #AT   1AT       2AT       3AT       4AT
1VP         0.00000   0.00000   0.00000   0.00000
3VP         0.18064   0.17336   0.15585   0.14772
5VP         0.30431   0.29197   0.26984   0.25359

Table 4. Probability of ENVP Incorrect Results

#VP \ #AT   1AT       2AT       3AT       4AT
1VP         0.00002   0.00000   0.00001   0.00000
3VP         0.00000   0.00000   0.00000   0.00000
5VP         0.00000   0.00000   0.00000   0.00000

Table 5. Probability of ENVP Failures (Exceptional Results)

#VP \ #AT   1AT       2AT       3AT       4AT
1VP         0.10805   0.07416   0.06411   0.06124
3VP         0.08049   0.02539   0.01847   0.01744
5VP         0.08197   0.02047   0.01533   0.01498


Tables 4 and 5 show the probabilities of incorrect and exceptional ENVP operations. The failure probability is greatly reduced as the numbers of AT’s and VP’s increase: in Table 5, it drops from 0.10805 (“1VP,” “1AT”) to 0.01498 (“5VP,” “4AT”). This is because the more WS’s we use, the higher the chances are that the correct outputs can be obtained. Similarly, when the number of AT’s increases, more reliable ENVP operations can be expected. It is also noted from Table 4 that the probabilities of incorrect results are negligible except in the “1VP” cases, where no DM’s are actually used.

The results of various ERB operations are shown in Tables 6-10 and Figs. 8-10. From Table 6, the $R^2$ values reveal a high degree of correlation between the measured system reliabilities and the numbers of AT’s or RB’s. Furthermore, from Tables 7-8 it can be seen that the system reliability can be gradually improved by increasing the numbers of AT’s and RB’s. However, there are only small reliability improvements when the number of AT’s exceeds 2 (for “3RB,” 0.97824 with 2 AT’s versus 0.98691 with 4 AT’s in Table 6). Therefore, “2AT” may be a better strategy under the constraints of cost-effectiveness.

Further, we can see that the proposed ERB can help improve the system reliability as AT’s and RB’s increase. Tables 9-10 display the probabilities of incorrect and exceptional ERB operations. It can be seen that the probabilities of incorrect results are close to 0 and ERB failures are significantly reduced as the numbers of AT’s or RB’s increase.

Fig. 5. ENVP system reliabilities.

Fig. 6. ENVP operations (2AT cases).

Fig. 7. ENVP operations (3VP cases).

Table 6. Reliability and Correlation of Various ERB Operations

#RB \ #AT   1AT       2AT       3AT       4AT       R^2
1RB         0.86431   0.89845   0.91078   0.91093   0.79783
2RB         0.91880   0.96846   0.97682   0.98034   0.75670
3RB         0.92254   0.97824   0.98536   0.98691   0.70830
4RB         0.92373   0.98242   0.98809   0.98811   0.67015
R^2         0.66713   0.73530   0.71799   0.68094

Table 7. Probability of ERB Failure-Free Operations

#RB \ #AT   1AT       2AT       3AT       4AT
1RB         0.86431   0.87390   0.87599   0.86962
2RB         0.86422   0.87570   0.87198   0.86792
3RB         0.86347   0.87295   0.87252   0.86819
4RB         0.86427   0.87354   0.87379   0.86899

Table 8. Probability of ERB Successful Failure Recoveries

#RB \ #AT   1AT       2AT       3AT       4AT
1RB         0.00000   0.02455   0.03479   0.04131
2RB         0.05458   0.09276   0.10484   0.11242
3RB         0.05907   0.10529   0.11284   0.11872
4RB         0.05946   0.10888   0.11430   0.11912

Table 9. Probability of ERB Incorrect Results

#RB \ #AT   1AT       2AT       3AT       4AT
1RB         0.00015   0.00026   0.00023   0.00024
2RB         0.00031   0.00019   0.00018   0.00020
3RB         0.00027   0.00023   0.00021   0.00014
4RB         0.00028   0.00017   0.00028   0.00030

Table 10. Probability of ERB Failures (Exceptional Results)

#RB \ #AT   1AT       2AT       3AT       4AT
1RB         0.13554   0.10129   0.08899   0.08883
2RB         0.08089   0.03135   0.02300   0.01946
3RB         0.07719   0.02153   0.01443   0.01295
4RB         0.07599   0.01741   0.01163   0.01159


5. CONCLUSIONS

This paper presents the reliability analysis of ENVP and ERB operations. ENVP and ERB enhance the reliability of fault tolerance designs by adding redundancy logic to protect the vulnerable AT’s, and the proposed reliability models are integrated into ENVP and ERB operations. Extensive experimental results show that the combination of “3VP” and “2AT” may be considered in the ENVP designs to balance the reliability enhancements and the extra costs of redundant AT’s. Similarly, “2RB” with “2AT” may be a good choice for ERB designs. The correlation analysis illustrates that there could be a high degree of correlation between the numbers of AT’s and the reliability improvements. Thus, our proposed fault-tolerant designs can help improve the system reliability of ENVP and ERB operations. Under cost-effectiveness constraints, the proposed simulation procedure may help software practitioners explore appropriate configurations of fault tolerance designs.

6. ACKNOWLEDGMENTS

The work described in this paper was supported by the Ministry of Science and Technology, Taiwan, under Grants NSC 101-2221-E-007-034-MY2, NSC 101-2220-E-007-005, and MOST 103-2220-E-007-022.

7. REFERENCES

[1] D. Booth, H. Haas, F. McCabe, E. Newcomer, M.

Champion, C. Ferris, and D. Orchard. Web Services

Architecture. W3C Working Group Note, 2004.

[2] L. L. Pullum. Software Fault Tolerance Techniques and

Implementation, Artech House Publishers, 2001.

[3] K. Goševa-Popstojanova and A. Grnarov. Performability

and Reliability Modeling of N Version Fault Tolerant

Software in Real Time Systems. In Proceedings of the

23rd EUROMICRO Conference, pages 532-539, Budapest,

Hungary, 1997.

[4] K. Goševa-Popstojanova and A. Grnarov. N-Version Programming with Majority Voting Decision: Dependability Modeling and Evaluation. Microprocessing and Microprogramming, 38(1-5):811-818, 1993.

[5] A. Armoush, F. Salewski, and S. Kowalewski. Recovery

Block with Backup Voting: A New Pattern with Extended

Representation for Safety Critical Embedded Systems. In

Proceedings of the 11th International Conference on

Information Technology, ICIT 2008, pages 232-237,

Bhubaneswar, India, 2008.

[6] O. Berman and U.D. Kumar. Optimization Models for

Recovery Block Schemes. European Journal of

Operational Research, 115(2):368-379, 1999.

[7] J. B. Dugan and M. R. Lyu. System Reliability Analysis of

an N-Version Programming Application. IEEE Trans.

Reliability, 43(4):513-519, 1994.

[8] B. Zhou, K. Yin, S. Zhang, H. Jiang, and A. J. Kavs. A

Tree-Based Reliability Model for Composite Web Service

with Common-Cause Failures. In Proceedings of the 5th

International Conference on Advances in Grid and

Pervasive Computing, pages 418-429, Hualien, Taiwan,

2010.

[9] M. R. Lyu. Software Fault Tolerance, John Wiley & Sons

Ltd., 1995.

[10] C. J. Hsu and C. Y. Huang. Reliability analysis using

weighted combinational models for web-based software. In

Proceedings of the 18th international conference on World

wide web, WWW 2009, pages 1131-1132, Madrid, Spain,

2009.

[11] W. R. Elmendorf. Fault-Tolerant Programming. In The 2nd

Annual International Symposium on Fault Tolerant

Computing, FTCS-2, pages 79-83, 1972.

Fig. 8. ERB System reliabilities.

Fig. 9. ERB operations (2AT cases).

Fig. 10. ERB operations (3RB cases).


[12] A. Avizienis. On the Implementation of N-Version

Programming for Software Fault-Tolerance During

Execution. IEEE International Computer Software and

Applications Conference, COMPSAC 1977, pages 149-155,

1977.

[13] Y. S. Dai, M. Xie, K. L. Poh, and S. H. Ng. A Model for

Correlated Failures in N-Version Programming. IIE

Transactions, 36(12):1183-1192, 2004.

[14] H. E. Mansour and T. Dillon. Dependability and Rollback

Recovery for Composite Web Services. IEEE Trans.

Services Computing, 4(4):328-339, 2011.

[15] Z. Zheng and M. R. Lyu. A Distributed Replication

Strategy Evaluation and Selection Framework for Fault

Tolerant Web Services. In Proceedings of the 6th IEEE

International Conference on Web Services, ICWS 2008,

pages 145-152, Beijing, China, 2008.

[16] N. Milanovic. Contract-Based Web Service Composition

Framework with Correctness Guarantees. In Proceedings

of the 2nd International Symposium on Service Availability,

pages 52-67, Berlin, Germany, 2005.

[17] K. L. Peng and C. Y. Huang. Reliability Evaluation of

Service-Oriented Architecture Systems Considering Fault-

Tolerance Designs. Journal of Applied Mathematics, 2014.

DOI= http://dx.doi.org/10.1155/2014/160608.

[18] M. R. Lyu and Y. T. He. Improving the N-Version

Programming Process Through the Evolution of a Design

Paradigm. IEEE Trans. Reliability, 42(2):179-189, 1993.

[19] M. R. Lyu, J. Chen, and A. Avižienis. Experience in

Metrics and Measurements for N-Version Programming.

International Journal of Reliability, Quality and Safety

Engineering, 1(1):41-62, 1994.

[20] X. Teng and H. Pham. A Software-Reliability Growth

Model for N-Version Programming Systems. IEEE Trans.

Reliability, 51(3):311-321, 2002.

[21] J. J. Horning, H. C. Lauer, P. M. Melliar-Smith, and B.

Randell. A Program Structure for Error Detection and

Recovery. Lecture Notes in Computer Science, 61:171-187,

1974.

[22] B. Randell. System Structure for Software Fault Tolerance.

IEEE Trans. on Software Engineering, SE-1(2):220-232,

1975.

[23] H. Hecht. Fault Tolerant Software for Real-Time

Applications. ACM Computing Surveys, 8(4):391-407,

1976.

[24] S. S. Gokhale and M. R. Lyu. A Simulation Approach to

Structure-Based Software Reliability Analysis. IEEE Trans.

Software Engineering, 31(8):643-656, 2005.

[25] A. L. Goel and K. Okumoto. Time-Dependent Error-

Detection Rate Model for Software Reliability and Other

Performance Measures. IEEE Trans. Reliability,

R-28(3):206-211, 1979.

[26] G. Keller. Statistics for Management and Economics, 8th

edition, South-Western College, 2008.


Appendix

Fig. 11. Simulation procedure of ENVP. (Flowchart. Input: comp. failure rate estimates. Output: sim. report.)

Steps:

1. Global Init.
2. Local Init.
3. Test comps. along trans. path
4. Test WS i
5. Randomly sel. a spare AT
6. Test AT
7. Fault detected by middleware?
8. Fault detected by AT?
9. Falsely rej. by AT?
10. More alt. AT's?
11. More WS's?
12. Test DM
13. No. correct > No. incorrect?
14. More runs?
15. Generate sim. results

Recorded outcomes:

R1. Correct result recorded
R2. Request to WS i fails
R3. Incorrect result recorded
S1. Succ. exec recorded
S2. Fail. exec recorded

Branch labels: Some fail / All pass (step 3); Succ. / Fails (test steps); Yes / No (decision steps).
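The procedure in Fig. 11 can be read as a Monte Carlo loop. The Python sketch below is one plausible reading of the flowchart and not the authors' implementation. The branch ordering, the choice to keep a result unchecked when no AT is operable, and all parameter names (`r_path`, `r_ws`, `p_mid_detect`, `r_at`, `p_detect`, `p_false_reject`, `r_dm`) are assumptions made for illustration.

```python
import random


def envp_run(rng, ws, ats, r_dm):
    """Simulate one ENVP request; return True if the run counts as a success (S1).

    ws:  list of dicts with hypothetical keys r_path, r_ws, p_mid_detect
    ats: list of dicts with hypothetical keys r_at, p_detect, p_false_reject
    """
    correct = incorrect = 0
    for w in ws:  # step 11: loop over all WS's
        # Step 3: components along the path to this WS (host, network, provider).
        if rng.random() >= w["r_path"]:
            continue  # R2: request to WS_i fails
        ws_ok = rng.random() < w["r_ws"]  # step 4: test WS_i
        # Step 7: a faulty result may already be caught by the middleware.
        if not ws_ok and rng.random() < w["p_mid_detect"]:
            continue  # R2
        accepted = True  # assumption: with no operable AT the result is kept unchecked
        # Steps 5, 6, 10: try spare ATs in random order until one operates.
        for at in rng.sample(ats, len(ats)):
            if rng.random() >= at["r_at"]:
                continue  # this AT failed to operate; try another spare AT
            if ws_ok:
                accepted = rng.random() >= at["p_false_reject"]  # step 9
            else:
                accepted = rng.random() >= at["p_detect"]  # step 8
            break
        if accepted:
            if ws_ok:
                correct += 1  # R1: correct result recorded
            else:
                incorrect += 1  # R3: incorrect result recorded
    # Step 12: the DM itself may fail; step 13: compare the two tallies.
    return rng.random() < r_dm and correct > incorrect


def simulate_envp(ws, ats, r_dm, runs=10_000, seed=0):
    """Estimate system reliability as the fraction of successful runs (steps 14, 15)."""
    rng = random.Random(seed)
    successes = sum(envp_run(rng, ws, ats, r_dm) for _ in range(runs))
    return successes / runs
```

Calling `simulate_envp` with three WS entries and a varying number of AT entries gives a rough way to explore how the AT count affects the estimated reliability under these assumptions.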


Fig. 12. Simulation procedure of ERB. (Flowchart. Input: comp. failure rate estimates. Output: sim. report.)

Steps:

1. Global Init.
2. Local Init.
3. Test comps. along trans. path
4. Sel. next alt. WS i
5. Test WS i
6. Randomly sel. a spare AT
7. Test AT
8. Fault detected by middleware?
9. Fault detected by AT?
10. More alt. WS's?
11. Falsely rej. by AT?
12. More alt. AT's?
13. More runs?
14. Generate sim. results

Recorded outcomes:

R1. Succ. exec. recorded
R2. Sys. fails
R3. Incorrect exec. recorded

Branch labels: Some fail / All pass (step 3); Succ. / Fails (test steps); Yes / No (decision steps).
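The ERB procedure in Fig. 12 differs from ENVP in that the WS's are tried one after another and the first accepted result ends the run. The Python sketch below is one plausible reading of the flowchart, not the authors' implementation. The per-WS path test, the choice to accept a result when no AT is operable, and all parameter names (`r_path`, `r_ws`, `p_mid_detect`, `r_at`, `p_detect`, `p_false_reject`) are assumptions made for illustration.

```python
import random
from collections import Counter


def erb_run(rng, ws, ats):
    """Simulate one ERB request; return "R1", "R2" or "R3" as labeled in Fig. 12."""
    for w in ws:  # step 4: primary WS first, then each alternative in turn
        # Path components for this WS (step 3 in the figure; per-WS here by assumption).
        if rng.random() >= w["r_path"]:
            continue  # step 10: fall back to the next alternative WS
        ws_ok = rng.random() < w["r_ws"]  # step 5: test WS_i
        # Step 8: the middleware may catch a faulty result by itself.
        if not ws_ok and rng.random() < w["p_mid_detect"]:
            continue
        rejected = False  # assumption: with no operable AT the result is accepted
        for at in rng.sample(ats, len(ats)):  # steps 6, 7, 12
            if rng.random() >= at["r_at"]:
                continue  # this AT failed to operate; try another spare AT
            # Step 9 (true detection) or step 11 (false rejection).
            p_reject = at["p_false_reject"] if ws_ok else at["p_detect"]
            rejected = rng.random() < p_reject
            break
        if not rejected:
            return "R1" if ws_ok else "R3"
    return "R2"  # every WS was unreachable or rejected: the system fails


def simulate_erb(ws, ats, runs=10_000, seed=0):
    """Return the fraction of runs ending in each outcome (steps 13, 14)."""
    rng = random.Random(seed)
    tally = Counter(erb_run(rng, ws, ats) for _ in range(runs))
    return {k: tally[k] / runs for k in ("R1", "R2", "R3")}
```

Under this reading, the `"R1"` share is the estimated system reliability, and `"R2"` plus `"R3"` is the estimated failure probability, split into detected failures and undetected incorrect results.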