

Fault Injection-based Assessment of Partial Fault Tolerance in Stream Processing Applications

Gabriela Jacques-Silva†♠, Buğra Gedik♠, Henrique Andrade♠∗, Kun-Lung Wu♠, Ravishankar K. Iyer†

†Coordinated Science Laboratory, University of Illinois at Urbana-Champaign

1308 W. Main St., Urbana, IL
[email protected]

♠Thomas J. Watson Research Center, IBM Research

19 Skyline Dr., Hawthorne, NY
{g.jacques,bgedik,klwu}@us.ibm.com

ABSTRACT

This paper describes an experimental methodology used to evaluate the effectiveness of partial fault tolerance (PFT) techniques in data stream processing applications. Without a clear understanding of the impact of faults on the quality of the application output, applying PFT techniques in practice is not viable. We assess the impact of PFT by injecting faults into a synthetic financial engineering application running on top of IBM's stream processing middleware, System S. The application output quality degradation is evaluated via an application-specific output score function. In addition, we propose four metrics that are aimed at assessing the impact of faults in different stream operators of the application flow graph with respect to predictability and availability. These metrics help the developer decide where in the application to place redundant resources. We show that PFT is indeed viable, which opens the way for considerably reducing resource consumption when compared to fully consistent replicas.

Categories and Subject Descriptors

D.2.5 [Software Engineering]: Testing and Debugging—Error handling and recovery; H.2.4 [Database Management]: Systems—Distributed databases

General Terms

Reliability, experimentation

1. INTRODUCTION

Stream processing applications continuously analyze heterogeneous incoming data and perform a variety of computations over generally incomplete information [5]. High availability is a key requirement of stream processing systems [25], since they process continuous live data. Typically,

∗Currently employed by Goldman Sachs. Email: [email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DEBS'11, July 11–15, 2011, New York, New York, USA.
Copyright 2011 ACM 978-1-4503-0423-8/11/07 ...$10.00.

streaming applications are designed by assembling stream operators as data flow graphs, which can be distributed over a set of nodes to achieve high performance and scalability [12]. A fault in a computing node or in a stream operator can result in massive data loss due to the typically high data rates of incoming streams.

Although many fault tolerance techniques for stream computing [13, 16, 19] provide guarantees that no data is lost and that no inconsistency exists (e.g., duplicate delivery of the same data item, which we refer to as data duplication), these methods usually cause significant degradation in performance. Aiming at reducing such performance overhead, partial fault tolerance (PFT) techniques [4, 17, 21, 23, 30] have been proposed. These techniques assume that data loss and data duplicates are acceptable under faulty conditions. The rationale is that many streaming applications tolerate data imprecision by design, and, as a result, can still achieve correctness without using stringent fault tolerance methods.

While all of the above techniques require a careful evaluation to assess the fault tolerance achieved and the resulting performance degradation, this is especially true for methods that provide PFT. Hence, the use of PFT is not viable without a clear understanding of the impact of faults on the application output.

This paper describes a fault injection-based [14] experimental methodology to evaluate the effectiveness of PFT techniques in streaming applications. Our first goal is to assess the impact of PFT on the output of an application. We use a fault injector to mimic the effect of a fault in the application when a specific PFT mechanism is in place. To the best of our knowledge, we are the first to describe a thorough fault injection experiment that evaluates PFT techniques in a streaming application. Our second goal is to characterize how each stream operator used by the application behaves under failures, so the developer can decide if the PFT technique in place is adequate for the target operator. We characterize each operator by calculating four evaluation metrics over the application output generated during the injection trials. In addition, the metrics can be used to identify which operators are most critical to the application output quality. Prioritizing the most critical operators when protecting an application with PFT methods leads to lower resource utilization while maintaining output quality.

Analyzing the impact of faults in streaming applications has many challenges that are not addressed by traditional fault injection methodologies [8, 14, 20, 29]. Streaming applications can tolerate approximate results and be non-deterministic. In addition, these applications produce results continually, requiring a careful analysis of the output to estimate the fault impact. We address these issues by (i) defining an output score function (OSF) to measure the application output quality and compare it with the output under faults, and (ii) using the OSF over limited sections of the output to compute our proposed evaluation metrics.

Our fault injection experiment target is a simplified financial engineering application running on top of IBM's stream processing middleware, System S [1, 12]. We considered a bursty data loss fault model, which emulates the crash of a stream operator. Our results show the following: (i) the impact on the application output quality varies widely for faults in different stream operators, demonstrating that operator sensitivity to faults is an important differentiator in deploying PFT; (ii) the tested application provided some surprising results; specifically, one stream operator with high selectivity turned out to be the least critical in terms of quality degradation when subjected to data loss. This indicates that PFT can be a powerful technique to maintain the accuracy of the results and preserve computing resources by replicating only parts of the application.

2. PARTIAL FAULT TOLERANCE

Several researchers [4, 15, 17, 21, 22, 23, 30] have described PFT techniques that are applicable to stream processing applications. While many leverage the partial and often strategic employment of fault tolerance techniques to lower the performance loss, no technique can guarantee perfect application output under faulty conditions. Different PFT mechanisms have different effects on the input/output stream of the failed operator and, as a result, on the application output. Hence there is a need to conduct experiments to understand the behavior of streaming applications under faults.

One particular technique [17] is based on a stream operator checkpoint mechanism that leverages code generation to automatically provide specialized state serialization methods depending on the stream operator type. When an operator fails, its upstream operators do not buffer outgoing tuples¹ unless they are required to produce a semantically correct result after a fault. Checkpointing [17] results in bursty tuple loss on the operator input stream.

Another example of PFT is to employ free running replicas, as proposed by Murty and Welsh [21]. This technique does not enforce determinism among the stream operator replicas, resulting in different effects on the operator output stream, such as tuple reordering, duplication, loss, and value divergence.

S4 [22], the open-source distributed stream processing middleware from Yahoo!, provides fault tolerance by restarting an operator from scratch once a failure is detected. The state of the operator is rebuilt by processing new incoming stream data items. This approach is similar to the strategy defined in [15] as gap recovery. Both techniques lead to bursty tuple loss on the occurrence of failures.

To evaluate how applications behave under PFT techniques, it is critical to understand the effect of faults on the input/output streams of the operator. Note that we are not concerned with the specific mechanism used by the stream

¹A tuple is the basic unit of data in a stream.

processing middleware to detect a fault in a stream operator (e.g., heartbeats) and to restart the operator. Nevertheless, these detection and recovery times are important to determine the duration of the fault effect.

3. EVALUATION METHODOLOGY

The experimental methodology uses fault injection to characterize the output quality of a streaming application in the presence of faults while a specific PFT mechanism is in use. We assume that a fault detector and a fault tolerance mechanism are already in place and have been validated. We also assume that a stream operator fails by crashing in a fail-stop manner (i.e., without producing corrupted results) [6]. The selected fault model is broad because operator failures could be due to several distinct and indirect causes: a node failure (e.g., operating system kernel crash), a transient software fault (e.g., race condition), or a transient hardware fault (e.g., a memory bit flip that causes a process crash).

Characterizing the error behavior of streaming applications presents many challenges. Traditional fault injection methodologies evaluate the behavior of an application by checking its output after the injection of a fault [20, 29]. A deterministic application can be checked for correct behavior under faults by comparing the output of the faulty run to the output of the fault-free run (also called the golden run) [8]. In streaming applications, checking the correct behavior cannot be done by a simple bit-by-bit comparison of the faulty and fault-free outputs. Such applications are often non-deterministic², and are typically able to tolerate imprecise results.

To understand how a fault affects the application output, we compute an application-specific output score function (OSF). The OSF is calculated over a set of tuples of the output of the application, and is applied to both the faulty and the golden output. Figure 1 shows two samples of streaming application outputs [1, 28]. Figure 1(a) shows a sample output of our target financial application described in Section 6.1. The output contains the ticker symbol of a company in the stock market and the projected financial gain obtained by buying a stock at a specific time. In this case, the OSF is a summation of the financial gain (i.e., 101.10). Figure 1(b) shows a sample output of a chip manufacturing application [28]. The output contains the predicted wafer yield of a manufacturing process, and a prediction error, indicating the extent to which the prediction deviates from the ground truth. In this case, the OSF is the prediction error average (i.e., 0.06). The result of the defined OSF over the application output data must have little variation (i.e., within error thresholds) when considering different non-faulty execution trials of a non-deterministic application.
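As an illustration, OSFs matching the two sample outputs in Figure 1 could be sketched as below. The tuple field names are our own; the paper does not prescribe a programming interface.

```python
# Hypothetical OSF sketches for the two sample outputs in Figure 1.
# Each OSF maps a set of output tuples to a single score.

def osf_financial(tuples):
    """Sum of projected financial gains (Figure 1(a))."""
    return sum(t["gain"] for t in tuples)

def osf_wafer(tuples):
    """Average prediction error (Figure 1(b))."""
    return sum(t["error"] for t in tuples) / len(tuples)

fin = [{"ticker": "IBM", "gain": 41.66},
       {"ticker": "BAC", "gain": 30.56},
       {"ticker": "TWX", "gain": 28.88}]
wafer = [{"yield": 95, "error": 0.06},
         {"yield": 98, "error": 0.02},
         {"yield": 92, "error": 0.10}]

print(round(osf_financial(fin), 2))  # 101.1
print(round(osf_wafer(wafer), 2))    # 0.06
```

The scores reproduce the values quoted in the text (101.10 and 0.06) for the Figure 1 data.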

To assess the quality of the output under faults, we define a quality score (QS), which is the ratio of the average OSF calculated over the faulty output (i.e., the application output produced in the presence of an operator fault) to the average OSF calculated over the golden output. This ratio estimates the fractional deviation between the faulty result and the correct result. The OSF average is obtained by executing the application multiple times with the same configuration (e.g., same injected fault). The average accounts

²A source of non-determinism is the multiplexing of multiple streams, where the relative order of tuple arrival at an operator is arbitrary.

Page 3: Fault Injection-based Assessment of Partial Fault Tolerance in

(a)  Ticker   Financial gain
     IBM      41.66
     BAC      30.56
     TWX      28.88

(b)  Predicted     Prediction
     wafer yield   error
     95            0.06
     98            0.02
     92            0.10

Figure 1: Examples of output for a financial (a) and a chip manufacturing (b) application. The OSF is applied over the application result (e.g., financial gain or prediction error values).

for the stochastic deviations on output caused by the application non-determinism. The average OSF and the QS are the basis for the proposed evaluation metrics. In addition, our metrics consider that stream operators can fail at different execution times, and that the time to detect and repair such a failure can vary. Figure 2 illustrates a possible failure scenario. Different execution times are represented in our methodology by the injection of faults into operators at different stream offsets from the beginning of the input stream. Different detection and repair times are represented by the injection of faults with different outage durations. We do not make assumptions about the operator failure probability distribution.
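The quality score computation described above can be sketched as a ratio of trial averages (a minimal illustration with hypothetical scores, not code from the paper):

```python
def quality_score(faulty_osf_scores, golden_osf_scores):
    """QS = mean OSF over faulty trials / mean OSF over golden trials.
    A QS near 1.0 means the faulty output is close to the fault-free output."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(faulty_osf_scores) / mean(golden_osf_scores)

# Example: OSF values from three faulty trials vs. three golden trials.
print(quality_score([92.0, 95.0, 89.0], [101.0, 100.0, 102.0]))  # ~0.911
```

Averaging over several trials before taking the ratio is what absorbs the run-to-run variation of a non-deterministic application.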

Figure 2: Timeline of a stream operator failure. The failure is detected after an average detection latency, and the detector triggers a repair procedure. The outage duration comprises the time to detect the failure and the repair time.

Our evaluation metrics characterize each operator in the following terms. (i) The outage duration impact, defined as the correlation coefficient between the outage duration and the QS computed over the part of the output stream affected by the fault. If the correlation coefficient is high, there is a direct quality improvement from applying techniques with lower recovery times for the target operator. (ii) The data dependency, defined as the standard deviation and analysis of variance (ANOVA) test of the QS obtained under faults injected at different stream offsets. A high standard deviation and a rejected ANOVA test indicate that the impact of the fault on the quality is dependent on the specific data affected by the fault. (iii) The recovery time, defined as the P percentile of the QS observations over time that fall outside an error threshold. A high value for this metric indicates that the application takes a long time to stabilize and to start producing correct results again. (iv) The quality impact, defined as the sum of squares of the difference between the faulty QS and the golden QS evaluated over time. A high sum indicates that answers produced by the application under faults are far away from the correct result. More details on the computation of each metric can be found in Section 5.2.
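The two availability-oriented metrics, (iii) and (iv), could be computed over a time series of local QS observations roughly as follows. This is our own sketch of one plausible reading of the definitions; thresholds and percentile handling are assumptions, not the paper's implementation.

```python
import math

def recovery_time(local_qs, golden_qs=1.0, threshold=0.05, p=95):
    """Stream position by which p% of the out-of-threshold local QS
    observations have occurred. Larger means slower stabilization."""
    outside = [i for i, q in enumerate(local_qs)
               if abs(q - golden_qs) > threshold]
    if not outside:
        return 0
    k = max(0, math.ceil(p / 100 * len(outside)) - 1)
    return outside[k]

def quality_impact(local_qs, golden_qs=1.0):
    """Sum of squared differences between faulty and golden local QS."""
    return sum((q - golden_qs) ** 2 for q in local_qs)

# Local QS per output window: degraded right after the fault, then recovering.
series = [0.4, 0.6, 0.9, 0.99, 1.0, 1.0]
print(recovery_time(series))   # 2
print(quality_impact(series))
```

Here the first three windows fall outside the 5% threshold, so the recovery time points at window index 2; the quality impact accumulates the squared deviations of every window.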

Our experimental methodology comprises the following steps:

1. Choose a fault model according to the PFT technique in place – the fault model is selected in correspondence to the recovery technique used to restore the operator upon a fault. In our experiments, we consider a bursty tuple loss model.

2. Optimize fault injection target operators – conditioned on the chosen fault model, only certain operators must be selected and subjected to fault injection. For example, operators that are not sensitive to the selected fault model do not need to be selected as an experimentation target.

3. Use the actual expected data rate from stream sources to realistically model the effect of a fault – for example, knowledge of the real data input rate allows quantifying how much data is dropped when a fault occurs.

4. Inject faults at different stream offsets and with distinct outage durations – injecting faults at different stream offsets mimics random fault arrival times during the operator execution. Different outage durations mimic variations in the detection and recovery times.

5. Evaluate the experimentation results – based on the OSF, each operator is characterized using the proposed metrics. These metrics can quantify the relative sensitivity of the operators to faults, which serves as a basis for comparing different fault tolerance techniques.
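The steps above amount to a campaign that sweeps injection parameters and averages repeated trials. A minimal driver sketch (all names hypothetical; `run_trial` stands in for executing the instrumented application once):

```python
from itertools import product

# Hypothetical experiment driver: every (target operator, stream offset,
# outage duration) combination is run several times and scored with the
# application-specific OSF; the average absorbs non-determinism.
def run_campaign(targets, offsets, outages, run_trial, osf, trials=3):
    results = {}
    for op, off, dur in product(targets, offsets, outages):
        scores = [osf(run_trial(op, off, dur)) for _ in range(trials)]
        results[(op, off, dur)] = sum(scores) / len(scores)  # average OSF
    return results

# Stub trial: pretend the output only depends on the outage duration.
def run_trial(op, offset, outage):
    return [outage]

res = run_campaign(["OP1", "OP2"], [100, 200], [10, 20], run_trial, sum)
print(len(res))  # 8 configurations: 2 operators x 2 offsets x 2 outages
```

Each entry of `res` would then feed the evaluation metrics of step 5.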

4. FAULT INJECTION FRAMEWORK

To assess the impact of PFT, we built a fault injection framework to emulate the effect of these techniques on the input/output streams of a target stream operator. Currently, the framework supports a bursty tuple loss fault model, but it can be extended to include other fault models (e.g., tuple duplication). Bursty tuple loss can emulate the following situations: (i) a stateless operator crashes and restarts, and no in-flight tuples are saved; (ii) a stateful operator crashes and restores its state from a checkpoint upon restart, and no in-flight tuples are saved. The checkpoint preserves the operator state immediately before the occurrence of the fault; (iii) a stateful/stateless operator crashes and performs a failover to a replica. The operator has only one input stream and the backup replica is operating at approximately the same pace as the primary replica. In addition, injecting bursty tuple loss in the source operator can emulate faults affecting the real stream source (e.g., sensors) and data drops due to bursty data arrival and limited input buffer size. Emulating operator crashes by manipulating their input/output streams allows us to better control the fault injection experiment. Dropping approximately the same set of tuples in each experiment trial helps us reduce the variation in the OSF result across different experiment trials.

4.1 Emulating Faulty Behavior

Our fault injection framework is designed to work seamlessly with Spade (Stream Processing Application Declarative Engine) [12], the declarative stream processing language that is part of System S. Spade offers language extensibility by allowing the implementation of User-defined Built-in Operators (UBOPs). When a developer identifies a general-purpose operation, he can describe it as a new type of stream operator, effectively extending the Spade language.

Our framework uses UBOPs to extend the language with stream operators that mimic the faulty behavior of an operator when using a specific PFT technique. Figure 3 shows how the framework operates. First, it receives as input a Spade application, a target operator, a fault model, and its injection parameters. Based on the target operator and the fault model, the framework modifies the Spade program to embed a fault injection operator (FIOP) in specific positions in the data flow graph. For example, to emulate tuple loss at the input ports of an operator, all the operators connected to these input ports are re-routed to send their output streams to the FIOP. The FIOP is connected to the target operator. Based on the new flow graph, the framework generates multiple Spade programs, each of them with a FIOP configured with a different fault injection parameter. After the compilation, the application is ready for fault injection runs.

Figure 3: The fault injection framework changes the original application dataflow graph to include a fault injection operator (FIOP). The modified application is re-compiled and then executed with the fault injection parameters.

Figure 4 depicts how the injection occurs at runtime. The figure shows the injection of the bursty tuple loss fault model into the input port of operator OP2. In this example, OP1 sends tuples containing a stock symbol and a price to OP2. After the graph pre-processing, OP1 connects to the FIOP, which connects to OP2. The FIOP is placed right before the target operator and receives the following parameters: (i) the outage duration, specified in terms of the number of tuples to be dropped, and (ii) the stream offset, specified in terms of the number of tuples processed by the operator up until the fault. In Figure 4, the FIOP triggers a fault after it processes the stock symbol IBM at price USD 123.24. The duration of the fault is 2 tuples, leading the FIOP to drop the tuples with stock symbols YHOO and GOOG. After the FIOP drops the specified number of tuples, its operation goes back to normal, i.e., forwarding tuples received from OP1 to OP2.

Figure 4: The FIOP emulates a bursty tuple loss model by dropping tuples received after a specified stream offset and during a configured outage duration.
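The FIOP's drop behavior in the Figure 4 scenario can be sketched as a simple pass-through filter (our own illustration, not the framework's UBOP code):

```python
def fiop(tuples, stream_offset, outage_duration):
    """Forward tuples, dropping `outage_duration` of them once
    `stream_offset` tuples have passed (bursty tuple loss)."""
    for i, t in enumerate(tuples):
        if stream_offset <= i < stream_offset + outage_duration:
            continue  # tuple lost during the emulated outage
        yield t

stream = [("IBM", 122.74), ("IBM", 123.24), ("YHOO", 23.18), ("GOOG", 537.99)]
# Fault triggers after 2 tuples; the outage drops the next 2 (YHOO, GOOG),
# after which forwarding resumes, matching the Figure 4 example.
print(list(fiop(stream, stream_offset=2, outage_duration=2)))
```

Because the operator itself is never killed, its internal state is untouched, exactly as described in the surrounding text.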

As described earlier in this section, our framework does not actually crash and restart an operator during the injection. Even though the operator continues to run, it does not send or receive any new tuples for the time corresponding to the fault detection and recovery. As a result, its internal state (if any) remains unchanged.

4.2 Placing Injection Operators

To understand how the application behaves under faults, in the worst case we may need to inject faults into all operators. However, streaming applications can have a substantially large number of operators. To reduce the number of required fault injection targets when evaluating a target application, the framework pre-analyzes the data flow graph. It selects as injection targets only those operators in which an injected fault results in a behavior that is different from that of a fault injected into another operator in the graph. An operator not chosen as a fault injection target is assumed to have the same behavior as its upstream operator.

For the bursty tuple loss fault model, the inspection starts by selecting all source operators as injection targets. Injecting faults into the source mimics a fault affecting the stream feed originating outside the stream processing middleware (e.g., the raw sensor data feed), or the source operator itself.

From each source operator, the analysis continues to all downstream operators by doing a breadth-first traversal, until the sink is reached. A bursty tuple loss operator is placed in the data flow graph immediately before a chosen target operator. The framework selects an operator as a target if its position in the data flow graph meets one of the following properties:

1. An upstream operator produces more than one output stream – a common pattern in streaming applications is for one operator to have its outputs consumed by more than one operator downstream. As illustrated by Figure 5(a), both OP2 and OP3 consume the stream produced by OP1. If OP1 fails, part of its input stream is lost, affecting both OP2 and OP3. If OP2 fails, OP1 can continue to send data to OP3, but all data sent to OP2 while it is offline is lost. These two different scenarios can impact the application output in different ways. Therefore, both scenarios must be emulated when evaluating the application behavior under faults.

2. The operator consumes more than one input stream – stream operators can consume outputs produced by more than one upstream operator. One such example is the join operator, which correlates events coming from two different streams. This is shown in Figure 5(b), where OP1 and OP2 send data to OP3. If OP1 fails, OP3 stops receiving data from one of its input ports, but it continues to process data coming from OP2. If OP3 fails, data sent by both OP1 and OP2 are lost. Since these two scenarios represent two different error modes, both of them should be emulated during the fault injection experiments.

3. The upstream operator is stateful – a stream operator can be either stateful or stateless. For example, an operator that filters a stream based on the attributes of the current tuple does not keep any state related to previously processed tuples. Figure 5(c) shows a data flow graph where a stateless operator OP1 sends data to a stateful operator OP2, which sends data to OP3. If OP1 fails, it loses input data from its upstream operator while it is offline. As a result, OP2 also does not receive input data during the time OP1 is offline and does not update its internal state. If OP2 fails, the behavior is analogous to a fault in OP1. OP2 loses its input data and does not update its internal state while it is recovering. However, the error behavior changes when OP3 fails. OP3 loses its input data, but OP2 still updates its internal state. Once OP3 is back up, OP2 is ready to send up-to-date information and does not spend any time rebuilding its internal state. These scenarios have different impacts on the application output, and both must be evaluated.

Figure 5: Placement rules for bursty tuple loss FIOPs consider the number of inputs and outputs of each operator, and whether the operator is stateless or stateful.
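The traversal and the three placement rules can be sketched as follows. The graph encoding and function names are our own; the paper describes the rules but not an implementation.

```python
from collections import deque

def select_targets(graph, sources, stateful):
    """Pick fault injection targets via breadth-first traversal.
    graph: dict op -> list of downstream ops; stateful: set of stateful ops."""
    preds = {}
    for op, outs in graph.items():
        for d in outs:
            preds.setdefault(d, []).append(op)
    targets, seen = set(sources), set(sources)  # all sources are targets
    q = deque(sources)
    while q:
        op = q.popleft()
        for d in graph.get(op, []):
            fan_out = len(graph.get(op, [])) > 1  # rule 1: upstream has >1 output
            fan_in = len(preds.get(d, [])) > 1    # rule 2: d consumes >1 stream
            if fan_out or fan_in or op in stateful:  # rule 3: upstream stateful
                targets.add(d)
            if d not in seen:
                seen.add(d)
                q.append(d)
    return targets

# Figure 5(c)-style chain: OP1 (stateless) -> OP2 (stateful) -> OP3.
# OP2 is skipped (a fault there behaves like one in OP1); OP3 is kept.
print(sorted(select_targets({"OP1": ["OP2"], "OP2": ["OP3"]},
                            ["OP1"], {"OP2"})))  # ['OP1', 'OP3']
```

On the Figure 5(c) chain this reproduces the text's reasoning: OP1 and OP2 share an error mode, while a fault in OP3 is distinct because OP2 keeps updating its state.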

5. EVALUATING THE FAULT INJECTION OUTCOME

As mentioned in Section 3, we evaluate a fault injection outcome based on an application-specific output score function (OSF). The OSF characterizes how the application performs, and it can be computed independently of which stream operator failed. The next sections describe how we handle the continuous output of streaming applications and how we use the OSF to compute each of our metrics.

5.1 Handling Continuous Output

Stream processing applications typically produce output results continuously. If the output is not carefully analyzed, variations in the output due to the application non-determinism can be confused with the effects of a fault. This can lead to an overestimation of the fault effect. We minimize this problem by limiting which segments of the continuous output stream are analyzed for estimating the faulty outcome. For example, results produced before the fault occurrence are ignored when computing the metric. In addition, results produced after the application has started to produce output within error thresholds when compared to the golden run are also ignored.

Continuous output can also mask the effects of faults. For example, an application can manifest a fault by missing x alerts and misdetecting y alerts. When globally applying an OSF that considers the total number of detected alerts, the misdetected alerts compensate for the missed ones. This can erroneously lead the developer to think that the fault had low impact on the application output. We minimize this problem by computing the OSF over local sections of the output stream rather than once over the complete output set. The last two metrics described in Section 5.2 use a local OSF computation.
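To make the masking effect concrete, the following sketch contrasts a single global QS with per-section local scores. The alert counts are hypothetical illustrations, not data from our experiments:

```python
def global_qs(faulty, golden):
    """Single QS over the whole output: total faulty OSF / total golden OSF."""
    return sum(faulty) / sum(golden)

def local_qs(faulty, golden):
    """One QS per output-stream section."""
    return [f / g for f, g in zip(faulty, golden)]

# Hypothetical alerts per section: the faulty run misses 5 alerts in the
# first section and misdetects 5 extra alerts in the second.
golden = [10, 10, 10, 10]
faulty = [5, 15, 10, 10]

print(global_qs(faulty, golden))  # 1.0 -- the fault is completely masked
print(local_qs(faulty, golden))   # [0.5, 1.5, 1.0, 1.0] -- the fault is visible
```

The misdetections exactly cancel the misses in the global total, while the local computation exposes both faulty sections.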

5.2 Evaluation Metrics

Table 1 summarizes our evaluation metrics. The first two metrics indicate the predictability of the stream operator behavior under faults. An operator that does not have predictable behavior under faults is not a good target for applying the PFT mechanism under test, because if such an operator fails in the field, the application outcome is unknown. The last two metrics are related to system availability, and allow the assessment of which operators are more critical for the application to preserve output quality under faults.

Metric | Operator characteristic | Definition
Coq | Outage duration impact | Correlation coefficient between outage duration and quality score
Doq | Data dependency | Standard deviation and analysis of variance test of the quality score for different stream offsets
Rlq | Recovery time | P percentile of local quality scores outside a threshold value
Ilq | Quality impact | Sum of squared errors of local quality score

Table 1: Summary of evaluation metrics

Outage duration impact

By computing the correlation coefficient between outage duration and quality score (Coq) we can assess the impact of the outage duration (Figure 2). If the QS and the outage duration are highly correlated (i.e., the correlation coefficient is close to 1 or -1), the developer can use off-the-shelf curve fitting methods to find a function that describes the quality loss in relation to a certain outage. We can then feed this function with outage parameters extracted from real failures in the field and evaluate the risk (in terms of quality degradation) of using the evaluated PFT technique. If such behavior poses an unacceptable risk to the application, this operator should be protected against faults.

The Coq metric can be computed in the following way. A fault injection experiment for a single operator injects faults at m different stream offsets using n different outage durations. Each stream offset is referred to as SOi, where i ∈ [1..m], and each outage duration is referred to as ODj, where j ∈ [1..n]. For each SOi and ODj, there are p repetitions, where each one generates an output stream with only a single section affected by the injected fault. This section is estimated based on SOi and the maximum ODj value. The OSF for the affected section of the stream is referred to as FOi,j,k, where k ∈ [1..p].

The average output score function OSFi,j for each ODj and a particular SOi is computed as

OSF_{i,j} = ( Σ_{k=1}^{p} FO_{i,j,k} ) / p.    (1)

The OSF for the golden run is calculated over the section of the output stream affected by the fault with the maximum ODj value³. The golden run is executed q times, where each execution generates one GOi,l, where l ∈ [1..q]. The quality score is referred to as QSi,j and is computed as

QS_{i,j} = OSF_{i,j} / ( ( Σ_{l=1}^{q} GO_{i,l} ) / q ).    (2)
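The averaging in Equations 1 and 2 can be sketched as follows. The FO and GO values below are made-up numbers for illustration, not data from our experiments:

```python
def quality_score(fo_values, go_values):
    """QS for one (stream offset, outage duration) pair.

    fo_values: OSF of the fault-affected section in each of the p
               faulty repetitions (FO_{i,j,k} in Equation 1).
    go_values: OSF of the same section in each of the q golden runs
               (GO_{i,l} in Equation 2).
    """
    osf = sum(fo_values) / len(fo_values)      # Equation 1
    golden = sum(go_values) / len(go_values)   # denominator of Equation 2
    return osf / golden                        # Equation 2

# Hypothetical OSF values: the faulty runs underestimate the golden output.
qs = quality_score([80.0, 78.0, 82.0], [100.0, 98.0, 102.0])  # 0.8
```

A QS below 1 indicates underestimation of the golden output, and a QS above 1 indicates overestimation.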

After this step, a particular SOi has n ODj values associated with it and their corresponding QSi,j results. With these two sets of data, we compute Spearman's rank correlation coefficient, which assesses whether two sets of values have a monotonic relationship [24]. This step results in associating a correlation coefficient CCi with each SOi. Correlation coefficients are bounded by [-1..1].

The operator Coq is then calculated as

Coq = ( Σ_{i=1}^{m} CC_i ) / m.    (3)
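A minimal pure-Python sketch of this computation, assuming no tied values in either set (a library routine such as SciPy's spearmanr would normally handle ties by averaging ranks):

```python
def spearman(x, y):
    """Spearman's rank correlation coefficient, assuming no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # With distinct values, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def coq(cc_per_offset):
    """Coq (Equation 3): mean of the per-offset correlation coefficients."""
    return sum(cc_per_offset) / len(cc_per_offset)

# Hypothetical experiment: at one stream offset, the QS decreases
# monotonically as the outage duration grows, giving CC_1 = -1.0.
outage_durations = [0.5, 1, 2, 4, 8, 16]
qs_at_offset_1 = [0.99, 0.97, 0.93, 0.85, 0.70, 0.40]
cc_1 = spearman(outage_durations, qs_at_offset_1)  # -1.0
```

The six outage durations match the ones used in our experiments (Section 6.2); the QS values are invented for the example.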

Data dependency

The Doq metric is the standard deviation (σq) and the analysis of variance test (Aq) of the quality score for different stream offsets. This metric evaluates how the same fault (i.e., the same fault model and outage duration) affects the output quality when injected at different stream offsets (Figure 2). A high variability in the application output quality under the same fault indicates data dependency, i.e., the impact on the output depends on the data being affected by the fault. An operator with a high σq and a rejected ANOVA test [27] is not a good candidate for PFT, since the result of a fault in the field is highly unpredictable. An operator with a low σq and an accepted ANOVA test indicates that the fault has a similar impact on output quality, independent of where the fault was injected.

³The index j is omitted in all formulas using a single fixed value of ODj.

To compute Doq, we first calculate σq, similarly to the Coq metric. The difference is that we choose the same fixed ODj value for each SOi, instead of considering all ODj values. As before, we compute the QSi for each SOi.

The σq is calculated with the standard deviation formula, using the QSi of each stream offset SOi as data samples.

The analysis of variance Aq is a one-way ANOVA hypothesis test. This test assesses whether there is a statistically significant difference between the observed means of different groups, where each group is obtained under a distinct condition [27]. For the Doq metric, the test decides whether changes in the stream offset of the injected fault affect the fault's impact on the application QS. If H0 (the null hypothesis) is accepted, the target operator is not data dependent. A rejected H0 indicates the opposite. Equation 4 shows the parameters for invoking the ANOVA test, which returns an accept or reject value.

Aq = ANOVA( (QS_{1,1}, ..., QS_{1,k}); ...; (QS_{i,1}, ..., QS_{i,k}) ),    (4)

where QS_{i,k} = FO_{i,k} / ( ( Σ_{l=1}^{q} GO_{i,l} ) / q ). Doq is the tuple (σq, Aq).
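Since the accept/reject decision is a standard one-way ANOVA, it can be sketched as computing the F statistic over the per-offset QS groups and comparing it against the critical value for the chosen α. The QS groups below are hypothetical, and in practice the critical value would come from an F-distribution table or a statistics library:

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA, with one group of QS values per
    stream offset (Equation 4). H0 is rejected (the operator is data
    dependent) when F exceeds the critical value for the chosen alpha."""
    k = len(groups)                              # number of groups
    n = sum(len(g) for g in groups)              # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group variability, weighted by group size.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability around each group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical QS groups: well-separated means give a large F (reject H0),
# indicating that the fault's impact depends on the injected stream offset.
f_stat = one_way_anova_f([[0.0, 0.1], [1.0, 1.1]])  # ~200, far above critical
```

When the group means coincide, ss_between collapses to (nearly) zero and the test accepts H0, matching the "not data dependent" case described above.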

Recovery time

The Rlq metric is the P percentile of quality scores outside a threshold value. This metric estimates how long it takes for the application to recover and to start producing normal outputs after the occurrence of a fault. The larger the value of this metric, the larger is the impact of an operator failure on the application availability. This metric assesses the deviation of the application output quality locally, i.e., by computing the OSF over different intervals of the output stream (e.g., all tuples produced during a 1-second interval). For this metric, an OSF data point is considered normal when the difference between the faulty OSF and the golden OSF is less than a certain threshold (e.g., the faulty OSF is less than 3% away from the golden OSF). Any difference greater than the threshold is considered to be an erroneous output. Our metric considers the coverage of P% of the erroneous output, as it can provide enough accuracy in evaluating the recovery time of the application. In our experiments, we considered a P of 90%.

To compute Rlq, we choose the same single outage duration ODj for all stream offsets SOi. Each experiment trial k generates one output stream, which is divided into s sections. For each section, we compute the local OSF, referred to as LOi,k,t, where t ∈ [1..s]. The average of LOi,k,t over each experiment trial is referred to as LOi,t, and is computed similarly to Equation 1. A similar procedure is followed for each of the q trials of the golden run. The OSF for each section of the golden output stream is referred to as GLOi,l,t. GLOi,t refers to the average of GLOi,l,t over each trial, and is calculated similarly to Equation 1.

In the next step, we build an error array based on LOi,t and GLOi,t, with t starting at Sbegin, where Sbegin is the first section of the output stream produced by the golden run after the fault injection point. Each position of the array is referred to as EQi,u, where u ∈ [1..s − Sbegin], and is computed as

EQ_{i,u} = | LO_{i,t} − GLO_{i,t} | / GLO_{i,t}.    (5)

For each position u in the error array, we compute the number of error values that are greater than the established threshold up to and including the uth error value EQi,u. This is denoted by NEi,u and is represented formally as

NE_{i,u} = Σ_{v=1}^{u} 1[ EQ_{i,v} > threshold ].    (6)

Then, we compute the index R^lq_i where P% of the erroneous QS observations fall. That is:

R^lq_i = min u  s.t.  NE_{i,u} ≥ P · NE_{i, s−Sbegin}.    (7)

The final step is to obtain the maximum index over all stream offsets SOi, that is, Rlq = max_i R^lq_i. Picking the maximum allows the assessment of the risk by considering the worst case manifested during experimentation.
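Equations 5–7 translate into a short routine. The sketch below computes R^lq_i for one stream offset from per-section local scores; the input arrays are hypothetical:

```python
def recovery_index(local_qs, golden_qs, threshold=0.03, p=0.9):
    """R^lq_i for one stream offset (Equations 5-7).

    local_qs / golden_qs: per-section local OSF values (LO_{i,t} and
    GLO_{i,t}), starting at S_begin, the first section after the fault
    injection point.
    """
    # Equation 5: relative error of each output-stream section.
    eq = [abs(lo - glo) / glo for lo, glo in zip(local_qs, golden_qs)]
    # NE at the last section: total number of above-threshold errors.
    total_errors = sum(1 for e in eq if e > threshold)
    ne = 0
    for u, e in enumerate(eq, start=1):
        ne += 1 if e > threshold else 0          # Equation 6
        if ne >= p * total_errors:               # Equation 7
            return u
    return len(eq)

# Hypothetical sections: the output is lost for two sections, recovers
# through a degraded third section, then stays within the 3% threshold.
r_lq_i = recovery_index([0.0, 0.0, 0.5, 1.0, 1.01, 1.0], [1.0] * 6)  # 3
```

Here three sections exceed the threshold, and 90% of them have occurred by section 3, so R^lq_i = 3.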

Figure 6 shows an example of the Rlq metric. The line with a circle marker shows the QS values of the faulty run in relation to the golden run (square marker) for each section of the output stream. The dashed line shows the allowed error threshold. The Rlq metric covers 90% (P) of the data points that lie outside the threshold values (S13) after the fault is injected, showing an approximation of how long the application output takes to stabilize after a fault.

Figure 6: The Rlq metric considers the number of local QS observations that fall outside a specified error threshold when compared to the golden run.

Quality impact

The Ilq metric is the sum of squared errors (SSE) of the local quality score, which allows us to compare the fault impact of different operators on the application output quality. Similar to the Rlq, we consider local OSF values that are outside of a threshold tolerance. The magnitude of the fault impact is obtained by summing the squares of all local errors throughout the application execution after the injection, up to the chosen P percentile of the Rlq metric.

The computation of this metric is similar to the Rlq computation. Instead of applying Equation 6, we calculate the SSE of a single SOi (referred to as I^lq_i) as

I^lq_i = Σ_{v=1}^{R^lq_i} (EQ_{i,v})² · 1[ EQ_{i,v} > threshold ].    (8)

Ilq is then max_i I^lq_i.
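Given the same error array used for R^lq, Equation 8 is a thresholded sum of squares; a sketch (with a hypothetical error array):

```python
def quality_impact(eq_errors, r_lq, threshold=0.03):
    """I^lq_i (Equation 8): sum of squared local errors that exceed the
    threshold, over the first R^lq_i positions of the error array."""
    return sum(e ** 2 for e in eq_errors[:r_lq] if e > threshold)

# Hypothetical error array: the third error is within the threshold and
# the fourth lies beyond R^lq_i = 3, so neither contributes.
ilq_i = quality_impact([1.0, 0.5, 0.01, 0.2], r_lq=3)  # 1.0**2 + 0.5**2 = 1.25
```

Squaring emphasizes large local deviations, so two operators with the same recovery time can still differ substantially in I^lq.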

6. EXPERIMENTAL EVALUATION

Our experiment target is a simplified application from the financial engineering domain called Bargain Discovery [1]. In our experiments, we assume the following: (i) a stream operator crash is always detected, (ii) all stateful operators of the application are being checkpointed [17], and (iii) an operator restores its state to the state immediately before the fault occurrence.

6.1 Target Application

The target application processes stock trades and quotes, and outputs information on all stocks for which there is a potential money gain in buying the stock at a given time. Figure 7 shows the Bargain Discovery flow graph.

Figure 7: Stream operator graph of the Bargain Discovery application. One path processes only stock trades, while the other path processes only stock quotes. The streams are later correlated to decide if a given stock quote has a low price compared to its price history.

The application has one source operator, which reads trade and quote events from a file⁴. Each entry in the file corresponds to a real event from the stock market. Each entry (tuple) contains a ticker, which is the symbol that identifies a company on an exchange. The type indicates the action taken by an investor. A trade action means an investor bought a certain number (volume) of shares at a specific price. A quote indicates an investor wants to sell a number (ask size) of stocks at a certain price (ask price).

The processing logic starts with the TradeQuote operator, which reduces the size of the tuple. Two different operators consume the output stream of TradeQuote, generating two branches in the application flow graph.

The first branch of the flow graph starts with TradeFilter, which filters all tuples of type trade. The Aggregator consumes all trades and sums up the total volume and total price for the 5 most recent trades of a given stock symbol. The operator generates a new sum every time a new trade of the corresponding symbol is processed on the input stream. The VWAP operator processes the Aggregator output stream and generates a tuple with the moving average price of a given stock symbol. The second branch of the flow graph has only the QuoteFilter operator, which outputs only tuples of type quote.

The processing logic finishes with the BargainIndex, which correlates the output streams of VWAP and QuoteFilter. For every incoming quote tuple, it checks the most recent moving average stock price for the given ticker symbol. The operator estimates the potential money gain by multiplying the ask size by the difference between the moving average and the ask price. All outputs produced by the BargainIndex are stored into a file by the Sink operator.

⁴In a real application deployment, data arrives as a continuous stream.

For the purpose of fault injection, we modified the application input file by adding a primary key to each entry of the file. This key follows a strict ascending order. The application propagates this key until the Sink. With such a key, we can precisely identify segments of the faulty output stream and match them with the equivalent segment of the golden run's output stream. This allows an accurate comparison of the application OSF with and without faults.
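The per-ticker logic described in this section can be sketched as a single-process stand-in for the SPADE flow graph. This is an illustrative simplification, not the actual System S implementation; in particular, the volume-weighted form of the moving average is our reading of the Aggregator/VWAP pair:

```python
from collections import defaultdict, deque

class BargainDiscovery:
    """Simplified sketch of the Bargain Discovery per-ticker logic."""

    def __init__(self, window=5):
        # Aggregator: the 5 most recent (price, volume) trades per ticker.
        self.trades = defaultdict(lambda: deque(maxlen=window))

    def on_trade(self, ticker, price, volume):
        self.trades[ticker].append((price, volume))

    def vwap(self, ticker):
        """VWAP: moving average price over the recent-trade window."""
        window = self.trades[ticker]
        total_volume = sum(v for _, v in window)
        if total_volume == 0:
            return None
        return sum(p * v for p, v in window) / total_volume

    def on_quote(self, ticker, ask_price, ask_size):
        """BargainIndex: potential gain = ask size * (VWAP - ask price),
        reported only when the quote is below the moving average."""
        average = self.vwap(ticker)
        if average is not None and average > ask_price:
            return ask_size * (average - ask_price)
        return 0.0

app = BargainDiscovery()
app.on_trade("IBM", 100.0, 10)
app.on_trade("IBM", 102.0, 10)
gain = app.on_quote("IBM", 100.0, 5)  # 5 * (101.0 - 100.0) = 5.0
```

The sketch also makes the fault behaviors discussed later concrete: losing trade tuples freezes the per-ticker window, so subsequent quotes are correlated against a stale moving average.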

6.2 Experiment Parameters

The input stream used in our experiments consists of real market trade and quote transactions from December 2005. We limited the number of processed trades and quotes to 5 million events. This dataset has the following characteristics: (i) the average event rate is 500 tuples/second, with a peak rate of 2200 tuples/second; and (ii) quote transactions account for 80% of the input stream events.

For this experiment, we chose 6 different outage durations. They are specified in seconds and have the following values: 0.5, 1, 2, 4, 8, and 16. The value 0.5 second was estimated by measuring how long it takes the System S runtime to detect a crashed process and restore it to its normal operation. The value 16 seconds is the time that the System S runtime takes to detect that a node has failed and to migrate a stream operator to a different machine. As described in Section 4, the FIOP that emulates the bursty tuple loss fault model expects as a parameter the outage duration specified in terms of the number of tuples. Each of the outage duration values was converted to the number of tuples that would be lost due to the fault using both the average and peak input rates we observed in our dataset. The average and peak input rates are converted according to the processing rates of each operator. We used System S built-in instrumentation features [11] to obtain the processing rates of each operator.
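The conversion from an outage duration to a tuple-count parameter for the bursty tuple loss FIOP is a simple rate multiplication. The rates below are the dataset's observed input rates; in practice, each operator's own processing rate would be used instead:

```python
def outage_in_tuples(outage_seconds, tuples_per_second):
    """Convert an outage duration into the number of tuples that would be
    lost at a given input rate."""
    return int(outage_seconds * tuples_per_second)

# With the dataset's average rate (500 tuples/s) and peak rate
# (2200 tuples/s), the longest 16-second outage corresponds to:
avg_loss = outage_in_tuples(16, 500)    # 8000 tuples
peak_loss = outage_in_tuples(16, 2200)  # 35200 tuples
```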

The chosen stream offset trigger values are the following: 0.5, 1.5, 2.5, 3.5, and 4.5 million tuples. Similar to the outage duration, we approximate the offset trigger based on the number of tuples processed by each operator.

Due to the application non-determinism, the golden run was executed 300 times. Each outage duration and stream offset combination was executed 5 times, totaling 300 fault injections per operator. The target operators for this application are Source, TradeFilter, QuoteFilter, VWAP, and BargainIndex. They are highlighted in Figure 7 and were chosen based on the optimization criteria described in Section 4.2. All experiments ran on a single node with the Linux operating system, 4 Intel Xeon 3GHz processors, and 8GB of RAM. The experiments ran on a single node to reduce the effects of non-determinism on the application output.

Output Score Function. We defined the OSF of the application as the total sum of the financial gain. This application can misbehave in two ways: (i) underestimating the OSF (QS below 1), i.e., the application fails to indicate opportunities for buying profitable stocks; and (ii) overestimating the OSF (QS above 1), i.e., the application estimates that certain stocks are more profitable than they are in reality. This can lead a trader to make wrong decisions.

Figure 8: Quality score for fault injection trials with stream offset 1.5 million tuples and different outage durations for the operators Source, QuoteFilter, and TradeFilter. Figure 8(a) shows that the QS underestimations overcome the overestimations when the Source operator is offline for a long period of time. Figure 8(b) shows that the longer the outage duration, the more the QS decreases. Figure 8(c) indicates that the longer the outage of the TradeFilter operator, the more the application overestimates the trading financial gain.

6.3 Results

Outage duration impact. The Coq metric assesses whether the outage duration caused by a fault in an operator directly impacts the application output quality. As described in Section 5.2, this metric considers only the section of the output stream that is directly affected by the fault. In Bargain Discovery, we estimate the affected output stream by identifying all trades that were lost due to the injected fault and all trades that were correlated with a moving average that was miscalculated due to the injected fault. For example, when the fault injection target is the Source operator, the downstream operators lose both trade and quote tuples. In this case, the affected segment of the output stream considers both the dropped quotes and the quotes correlated with miscalculated moving averages. However, when the fault injection target is the QuoteFilter, the affected segment of the output stream consists only of the dropped quotes. Note that this analysis is application-specific, and should be customized for each application if a precise estimate of the affected output stream is desired.

Table 2 shows the Coq result for all target operators. Note that the QS result is highly correlated with the outage duration when faults are injected into TradeFilter, QuoteFilter, and VWAP.

Operator | Coq | Doq | Rlq | Ilq
Source | -0.38 | (0.21, R) | 340 | 21.23
TradeFilter | 0.97 | (0.76, R) | 340 | 48.70
QuoteFilter | -1.00 | (0.00, A) | 6 | 6.00
VWAP | 0.99 | (0.14, R) | 73 | 30.87
BargainIndex | -0.69 | (0.08, A) | 43 | 7.89

Table 2: Operator metric results for the Bargain Discovery application

Figure 8 illustrates how the quality score (QS) varies under different outages for three operators. The x axis is the outage duration and the y axis is the QS result. The function f(x) is the result of least-squares fitting of a linear function.

For the Source (Figure 8(a)), there is an OSF overestimation when the outage duration is small (less than 5000 tuples), and an OSF underestimation when the outage duration increases (greater than 15000 tuples). When a fault affects the Source, both trades and quotes are lost. When the data loss is small, not as many quotes are dropped, resulting in many correlations with a miscalculated moving average. When the data loss is large, the financial loss due to not correlating quotes for a long time is greater than the overestimation due to miscalculated averages.

Figure 8(b) shows the high correlation between the QS and the outage duration for the QuoteFilter operator. Note that the QS is 0.00 at the maximum injected outage duration. The QS is 0.00 because the Coq metric considers only the affected output stream under the maximum outage duration for its computation. When we inject a fault into QuoteFilter with the maximum outage duration, the BargainIndex does not perform correlations with any quote tuple, and, as a result, the application does not produce any output during the outage period.

Figure 8(c) shows the QS values for the TradeFilter operator. When this operator fails, the operator that maintains the most recent trades stops adjusting the moving averages based on new trade values. This can lead to the evaluation of a non-profitable stock as profitable (case 1) and vice versa (case 2). In our dataset, the magnitude of case 1 was always greater than that of case 2. This indicates that the financial loss from buying non-profitable stocks is greater than the financial loss incurred because purchase opportunities were missed.

Data dependency. Table 2 shows the Doq metric results. A stands for an accepted Aq, and R stands for a rejected Aq. The operator with the greatest σq is TradeFilter, and the one with the lowest σq is QuoteFilter. The only operators with an accepted ANOVA test (i.e., low QS variability under faults) with α = 0.05 are QuoteFilter and BargainIndex. Figure 9 shows the QS result (y axis) for all injected stream offsets (x axis) under the maximum outage duration. The QS for QuoteFilter is 0.00 independently of which stream offset the fault is injected into. This is because BargainIndex cannot do any stream correlation when QuoteFilter fails. For TradeFilter, the lowest QS is 1.34 and the greatest is 3.35. This represents a considerable variation, which indicates that the effect of a fault on the QS depends to a great degree on what data the outage affects.

Figure 9: Quality score for different stream offsets under maximum outage duration. The impact of a failure in the TradeFilter operator on the application QS is highly data dependent, while the impact of a failure in QuoteFilter is independent of the data affected by the fault.

Recovery time. The Rlq estimates how long the application takes to produce output below an error threshold once an operator fails. For the target application, we consider a threshold of 3% away from the OSF of the golden run. We computed the local OSF values considering the primary key of the resulting tuple. All tuples with key values falling into an interval of 5000 units are grouped into one stream section (e.g., tuples with keys between 5000 and 10000). This approximates one local observation every 2.2 seconds when the input stream is producing tuples at its peak rate. Table 2 shows the Rlq for all target operators for an outage duration of 16 seconds under peak rate. Figures 10 and 11 show the Rlq value as vectors. The displayed value is the Rlq for the injected stream offset, and not the maximum value among all injected stream offsets.

The operators with the highest recovery time are Source and TradeFilter. These two operators have the same Rlq value because during the injection trials the exact same set of tuples was dropped with respect to the Aggregator. Additionally, they have the highest Rlq because they both affect the state of the Aggregator, which maintains the history of the recent trades. Once new tuples are processed, the Aggregator updates its internal state, producing moving average estimations with fresh data. As seen in Figures 10 and 11, the QS result stabilizes as more tuples are processed.

The VWAP and BargainIndex operators (Table 2) have smaller Rlq values. When they fail, the state they affect downstream is quicker to rebuild in comparison to the Source and the TradeFilter. Once BargainIndex recovers from its checkpoint, its internal state contains outdated moving averages. However, it immediately starts receiving correct moving average values, allowing correct correlations with new incoming quotes. QuoteFilter has a small recovery time because it does not affect the state of operators downstream.

Quality impact. The Ilq evaluates the magnitude of the impact on the application output when an operator fails. The outage duration, error threshold, and interval of sections of the output stream are the same ones used by the Rlq metric. Table 2 shows the Ilq values for an outage duration of 16 seconds under peak rate. Our results reveal that a fault in TradeFilter affects the application output the most, while a fault in QuoteFilter has the lowest impact.

Figure 10 shows the QS result (y axis) for every section of the output stream (x axis) for 3 target operators. The x axis starts at section 300, which corresponds to the injection stream offset after 1.5 million tuples have been processed. Figure 10(a) shows the QS for the Source operator. After the fault injection, there are no tuples present in the output stream, leading to a 100% underestimation of the OSF when compared to the golden run. Once the operator resumes sending tuples, the application overestimates its results by up to 59%. Figure 10(b) shows the QS observations for the QuoteFilter operator. The QuoteFilter has a low Ilq because it only affects the output during the outage period. When faults are injected into the VWAP operator (Figure 10(c)), the application produces high overestimates (up to 113% greater than the golden run OSF). This is because a fault in VWAP affects the BargainIndex state. As a result, the application continues to correlate new quotes with outdated moving average values. Note that when VWAP fails, the history of recent trades maintained by the Aggregator is kept up-to-date. As a result, once VWAP recovers, it can immediately send up-to-date values downstream.

Figure 11 shows the QS for the TradeFilter when subjected to faults at the stream offset after 3.5 million tuples have been processed and with different outage durations. Our experiments show that as the outage duration increases, the peak OSF overestimate for a certain stream offset and the Ilq increase. When the outage lasts 2 seconds (Figure 11(a)), the maximum overestimate is 7% and the Ilq is 0.25. For an outage of 8 seconds (Figure 11(b)), the maximum overestimate is 72% and the Ilq is 12.85. The peak overestimate for an outage duration of 16 seconds (Figure 11(c)) is 144% and the Ilq is 48.70. Our results show that even though the TradeFilter and Source lose the same set of tuples under an injected fault with the same outage duration, the QS deviation is higher when the fault is injected into the TradeFilter. When TradeFilter fails, the Source continues to send new data, and the BargainIndex continues to correlate quotes with an obsolete moving average value. When the Source fails, the application stops the correlation of any data during the outage duration, resulting in errors of lower magnitude.

6.4 Discussion

Our results show the following regarding the Bargain Discovery application:

1. The influence of an operator's output stream on the state of the downstream flow graph determines the criticality of the operator – the total state size (in bytes) of the operators downstream of a failed operator determines how long the application takes to fully rebuild its state and for how long the application produces erroneous results. As a result, operators with greater influence on the downstream state are more critical for maintaining the application output quality. For example, the TradeFilter is the most critical operator with respect to bursty tuple loss, both in terms of quality impact and recovery time, making it a top priority for protection against bursty tuple loss. Even though the TradeFilter operator is a stateless filter, it directly affects the stateful Aggregator and BargainIndex downstream. Another example is QuoteFilter, which is the least critical operator with respect to bursty tuple loss. This operator has a low impact on application output quality, a short recovery time, and very predictable behavior under faults (Aq = accept and σq = 0.00). Although BargainIndex is stateful and consumes data from QuoteFilter, the BargainIndex does not keep internal state related to QuoteFilter's output stream.

2. Checkpointing is not adequate to protect operators when faults have a long outage duration – our results show that checkpointing [17] provides good protection against faults with a short outage duration (e.g., TradeFilter and Aggregator in Figure 11(b)), but is not enough for faults with a long recovery time.

Figure 10: Quality score for each section of the output stream with stream offset 1.5 million tuples and outage duration of 16 seconds under peak rate. The vector shows the value of the Rlq metric for each operator. Figure 10(a) shows that the Source operator has a high Rlq value, while Figure 10(b) shows that the application quickly recovers from a failure in the QuoteFilter operator. Figure 10(c) shows that a VWAP failure has a short Rlq when compared to the Source, but it generates QS observations with higher values.

Figure 11: Local QS observations of the output stream with stream offset of 3.5 million tuples and different outage durations under peak rate for the TradeFilter operator. The Ilq and Rlq values increase as the outage duration increases.

3. Position in the flow graph is not an adequate heuristic for deciding operator criticality – although other research efforts [4, 21] suggest that the position in the flow graph can be used to deploy PFT, our study indicates that the position in the flow graph and the type of operator alone are not adequate heuristics to characterize operator criticality. Although QuoteFilter and TradeFilter have a similar position in the flow graph and the same operator type, they have very distinct behavior under faults.

4. The proposed metrics can be used to reconfigure the application fault tolerance and to observe a measurable improvement in application output quality – based on our experimental results, we can improve the fault tolerance of the application by applying, for example, a technique with a lower recovery time, such as high availability groups [17]. This technique maintains active replicas of operator groups of the application flow graph. Once an operator in the active group fails, the backup group becomes active. The failover time from one replica to the other is at most 2 seconds in System S. Our experiments show that by replicating a group of critical operators, such as the TradeFilter, Aggregator, and VWAP (shown in Figure 12), we can see an improvement in the output quality of the application under faults. When either TradeFilter, Aggregator, or VWAP fails, the Rlq is 2 and the Ilq is 2. This is a significant improvement when compared to the previous values (Ilq of 48.70 and Rlq of 340). Under faults in the Source and the TradeQuote, the new Ilq is 1.05 and the Rlq is 39, in contrast to an Ilq of 21.23 and an Rlq of 340 obtained in our previous tests. For other operators, such as the QuoteFilter, the metrics show that a simple restart is enough to maintain good application output quality. This has a big positive impact on resource utilization for our target application, given that QuoteFilter has an input selectivity of 80%, as described in Section 6.2.

[Figure 12 diagram: the Bargain Discovery flow graph (Source, TradeQuote, QuoteFilter, TradeFilter, Aggregator, VWAP, BargainIndex, Sink), annotated with a high availability group replicating TradeFilter, Aggregator, and VWAP, and with checkpoint and restart annotations for the remaining operators.]

Figure 12: Fault tolerance reconfiguration of Bargain Discovery considers that operators with high metric values should be replicated, resulting in improved output quality on the occurrence of failures.

An important factor in deriving our conclusions was the OSF definition, which closely follows the semantics of our application. Our experimental methodology, together with a well-defined OSF, enabled us to make informed decisions with respect to fault tolerance, since we can evaluate the cost associated with the applied fault tolerance technique and the benefits such protection yields. Our results with the Bargain Discovery application have also demonstrated robustness to different choices of OSF. We tested both average financial gain and number of produced tuples as OSFs, and the relative criticality of operators in terms of Ilq and Rlq was similar to that obtained with the total sum of financial gain.
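One way to quantify the agreement between criticality rankings under two OSF choices is a rank correlation, e.g., Spearman's coefficient [24]. The sketch below is illustrative: the function is a textbook no-ties implementation, and the Ilq values are placeholders rather than measurements from the paper.

```python
# Hedged sketch: check that operator criticality rankings agree across two
# OSF choices via Spearman's rank correlation (no-ties formula).
def spearman_rho(xs, ys):
    """Spearman's rank correlation for two equal-length lists without ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

ilq_gain   = [48.7, 21.2, 3.1, 0.5]  # illustrative Ilq under financial-gain OSF
ilq_tuples = [40.0, 25.0, 2.0, 0.4]  # illustrative Ilq under tuple-count OSF
rho = spearman_rho(ilq_gain, ilq_tuples)  # close to 1.0 => similar rankings
```

A coefficient near 1 indicates that the two OSFs rank the operators by criticality in essentially the same order, which is the robustness property observed in our experiments.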

7. EXPERIMENTAL METHODOLOGY LIMITATIONS

Injecting faults into a streaming application and evaluating its results has many challenges, mainly imposed by the possible non-determinism of the application and its tolerance to approximate results. In this paper, we described an experimental methodology we used to evaluate the impact of PFT techniques on the output of a financial engineering application. We envision that our experimental methodology can be further generalized to be applicable to other streaming applications. In this section, we discuss some of the limitations of our current approach which should be studied in order to generalize the proposed methodology.

1. Application-specific quality metric - our basis for comparing a faulty and a golden run is an application-specific OSF. The OSF approximates the behavior of the application output and copes with the non-determinism of the application output. For some applications, it may be hard to define an OSF. As a result, generic OSFs (e.g., number of tuples) can be investigated as an alternative for such cases.
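A generic, tuple-count-based OSF could take the following shape. This is a sketch of one plausible formulation, not the paper's definition: the function name and the symmetric penalty (duplicates under faults can push the raw ratio above 1) are our assumptions.

```python
# Illustrative generic OSF: score a faulty run against a golden run purely by
# the number of produced tuples, when an application-specific score function
# is hard to define. Names and the penalty scheme are assumptions.
def tuple_count_osf(golden_tuples, faulty_tuples):
    """Quality score in [0, 1]: 1.0 means the faulty run produced as many
    tuples as the golden run; both loss (ratio < 1) and duplication
    (ratio > 1) reduce the score symmetrically."""
    if not golden_tuples:
        return 1.0
    ratio = len(faulty_tuples) / len(golden_tuples)
    return max(0.0, 1.0 - abs(1.0 - ratio))

score = tuple_count_osf(list(range(1000)), list(range(950)))  # 5% loss -> 0.95
```

Such a metric ignores tuple contents, so it captures data loss and duplication but not value-level imprecision; that trade-off is exactly why an application-specific OSF is preferable when one can be defined.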

2. Concurrent failures - our experiments assumed that operators fail independently. However, depending on the physical deployment of the application, multiple operators may fail together (e.g., if they reside in the same process [12] or run on the same node). We plan to investigate the emulation of such failures by considering the co-located operators as a single operator with a more complex internal logic.

3. Number of required injection trials for applications with a large number of operators - our methodology injects faults into each operator considering different outage durations and stream offsets. In addition, if the application is non-deterministic, we execute the same injection configuration multiple times. This can result in a high number of experimental trials if the application has a large number of operators. In Section 4.2, we describe an optimization step based on the properties of the operator graph to help reduce the number of target operators. Considering co-located operators as a single operator can also reduce the number of required experiments.
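The size of the campaign grows multiplicatively with the injection parameters, which the following back-of-the-envelope sketch makes concrete. All counts here are illustrative, not the configuration used in our experiments.

```python
# Back-of-the-envelope cost of the campaign: one injection per
# (operator, outage duration, stream offset) combination, repeated when the
# application is non-deterministic. All counts are illustrative.
def total_trials(num_operators, outage_durations, stream_offsets, repetitions):
    return num_operators * outage_durations * stream_offsets * repetitions

# e.g., 50 operators, 3 outage durations, 5 stream offsets, 10 repetitions:
full = total_trials(50, 3, 5, 10)     # 7500 trials
# Pruning equivalent targets and merging co-located operators, say to 12:
reduced = total_trials(12, 3, 5, 10)  # 1800 trials
```

This is why reducing the set of target operators (via graph properties or co-location) pays off linearly in the total number of experiments.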

8. RELATED WORK

Many fault tolerance techniques for streaming applications consider that no data can be dropped or duplicated [13, 16, 19], which depends upon the implementation of expensive buffer management and consistency protocols. These techniques do not evaluate the application output quality when faults occur, since they assume that the application produces the same output despite the occurrence of faults. Balazinska et al. [3] propose to produce tentative (lower precision) results during the occurrence of faults. The evaluation of this technique was based on the number of tentative tuples produced during faults, but there was no evaluation with respect to their impact on the output quality.

Previous literature on PFT techniques [4, 17, 21, 23, 30] does not describe how to systematically evaluate the impact of faults on the application output quality. Bansal et al. [4] assume that the importance of each component in a streaming application can be described as a linear combination of the importance of the inputs it is consuming. Our experiments show that this is not the case for our target application. Our earlier work [17] relies on the application designer to specify which components need to be fault-tolerant, while in this work we inject faults to understand the application behavior under faults. Murty and Welsh [21] and Pietzuch [23] do not evaluate their proposed fault tolerance methods with respect to effects on the quality of an application's results. Zhu et al. [30] assess the output quality of their proposed fault tolerance method in terms of a Sum of Squared Errors. We propose three other evaluation metrics.

There is vast research on evaluating fault-tolerant systems with fault injection methods [8, 14, 29]. Streaming applications have unique characteristics, such as non-determinism and continuous data processing despite the occurrence of faults. This brings additional challenges to the evaluation methodology, which cannot be addressed with the techniques described in the literature.

Model-based dependability evaluation techniques [18] are complementary to fault injection experimentation. One use of model-based approaches is to understand the impact of faults on the application availability under a given failure distribution. Fault injection assumes that a fault that can happen in a real installation has occurred in the system, independent of its frequency.

Our work is related to research on load shedding in streaming applications, where the stream processing middleware can drop data once it detects that the system is operating over its capacity. Tatbul et al. [26] study the problem of using application semantics to drop tuples via a loss-tolerance graph. This graph can only have stream operators of specific types. Babcock et al. [2] propose an accuracy metric similar to our QS. We propose four different metrics which are based on the QS. Previous work [9] considers the operator type to establish a specific quality metric. Our methodology uses a quality metric that is application-specific, but that is independent of the types of operators used by the application.

Condie et al. [7] evaluate the accuracy of applications running on top of a MapReduce implementation that supports continuous queries. The characteristics of their target applications are distinct from ours. Although their target application is implemented as a continuous query, it has a final exact and deterministic answer (e.g., number of page views). Our scenario is more complex, since our application is non-deterministic and every tuple produced is a true independent outcome of the application, not a value that gets adjusted over time.

Fiscato et al. [10] propose a model for streaming applications with quality metrics based on the importance of a tuple. The importance of tuples produced by an operator depends on the importance of its input tuples. Tuples coming from the same base stream are assumed to be equally important. In our experiments, this does not hold true. Specifically, although the QuoteFilter and TradeFilter output streams are derived from the same input stream, they show very different behavior under faults.

9. CONCLUSION

Partial fault tolerance techniques aim at decreasing their impact on the performance of streaming applications by allowing the application to lose and duplicate data when faults occur. In this paper, we described a fault injection methodology used to evaluate the impact of using PFT techniques on the output of a streaming application. Our results show that PFT can lead to better resource consumption, and that fault injection can be used to learn how to selectively deploy fault tolerance techniques in the application processing graph. In addition, our experiments indicate that some heuristics proposed by previous research [4, 21] on partial fault tolerance may not be the best option for maintaining application output quality under faults.

As future work, we plan to evaluate other streaming applications to further generalize our experimental methodology and evaluation metrics. We believe there is a need for more fault injection experiments on stream processing applications, so that our insights and the body of knowledge with respect to the behavior of streaming applications under faults can be broadened.

10. REFERENCES

[1] H. Andrade, B. Gedik, K.-L. Wu, and P. Yu. Scale-up strategies for processing high-rate data streams in System S. In ICDE, 2009.
[2] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In ICDE, 2004.
[3] M. Balazinska, H. Balakrishnan, S. R. Madden, and M. Stonebraker. Fault-tolerance in the Borealis distributed stream processing system. ACM Transactions on Database Systems, 33(1):1–44, 2008.
[4] N. Bansal, R. Bhagwan, N. Jain, Y. Park, D. Turaga, and C. Venkatramani. Towards optimal resource allocation in partial-fault tolerant applications. In INFOCOM, 2008.
[5] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: a new class of data management applications. In VLDB, 2002.
[6] S. Chandra and P. Chen. How fail-stop are faulty programs? In FTCS, 1998.
[7] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. In NSDI, 2010.
[8] D. Costa, T. Rilho, and H. Madeira. Joint evaluation of performance and robustness of a COTS DBMS through fault-injection. In DSN, 2000.
[9] A. Das, J. Gehrke, and M. Riedewald. Approximate join processing over data streams. In SIGMOD, 2003.
[10] M. Fiscato, Q. H. Vu, and P. Pietzuch. A quality-centric data model for distributed stream management systems. In QDB, 2009.
[11] B. Gedik, H. Andrade, and K.-L. Wu. A code generation approach to optimizing high-performance distributed data stream processing. In CIKM, 2009.
[12] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo. SPADE: the System S declarative stream processing engine. In SIGMOD, 2008.
[13] Y. Gu, Z. Zhang, F. Ye, H. Yang, M. Kim, H. Lei, and Z. Liu. An empirical study of high availability in stream processing systems. In Middleware, 2009.
[14] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer. Fault injection techniques and tools. IEEE Computer, 30(4):75–82, 1997.
[15] J.-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. In ICDE, 2005.
[16] J.-H. Hwang, U. Cetintemel, and S. B. Zdonik. Fast and highly-available stream processing over wide area networks. In ICDE, 2008.
[17] G. Jacques-Silva, B. Gedik, H. Andrade, and K.-L. Wu. Language level checkpointing support for stream processing applications. In DSN, 2009.
[18] G. Jacques-Silva, Z. Kalbarczyk, B. Gedik, H. Andrade, K.-L. Wu, and R. K. Iyer. Modeling stream processing applications for dependability evaluation. In DSN, 2011.
[19] Y. Kwon, M. Balazinska, and A. Greenberg. Fault-tolerant stream processing using a distributed, replicated file system. In VLDB, 2008.
[20] H. Madeira, D. Costa, and M. Vieira. On the emulation of software faults by software fault injection. In DSN, 2000.
[21] R. N. Murty and M. Welsh. Towards a dependable architecture for Internet-scale sensing. In HOTDEP, 2006.
[22] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In KDCloud, 2010.
[23] P. Pietzuch. Challenges in dependable Internet-scale stream processing. In WDDDM, 2008.
[24] C. E. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 14:72–101, 1904.
[25] M. Stonebraker, U. Cetintemel, and S. Zdonik. The 8 requirements of real-time stream processing. ACM SIGMOD Record, 34(4):42–47, 2005.
[26] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In VLDB, 2003.
[27] K. S. Trivedi. Probability and Statistics with Reliability, Queuing and Computer Science Applications. John Wiley and Sons Ltd., Chichester, UK, 2002.
[28] D. S. Turaga, O. Verscheure, J. Wong, L. Amini, G. Yocum, E. Begle, and B. Pfeifer. Online FDC control limit tuning with yield prediction using incremental decision tree learning. In Sematech AEC/APC, 2007.
[29] K. Whisnant, R. K. Iyer, Z. Kalbarczyk, P. H. Jones III, D. A. Rennels, and R. R. Some. The effects of an ARMOR-based SIFT environment on the performance and dependability of user applications. IEEE Transactions on Software Engineering, 30(4):257–277, 2004.
[30] Q. Zhu, L. Chen, and G. Agrawal. Supporting fault-tolerance in streaming Grid applications. In IPDPS, 2008.