
Experimental Evaluation of N-tier Systems:

Observation and Analysis of Multi-Bottlenecks

Simon Malkowski∗, Markus Hedwig†, and Calton Pu∗

∗Center for Experimental Research in Computer Systems
Georgia Institute of Technology, Atlanta, GA 30332-0765, USA
{simon.malkowski, calton}@cc.gatech.edu

†Chair of Information Systems Research
Albert-Ludwigs-University of Freiburg, 79098 Freiburg, Germany

[email protected]

Abstract

In many areas such as e-commerce, mission-critical N-tier applications have grown increasingly complex. They are characterized by non-stationary workloads (e.g., peak load several times the sustained load) and complex dependencies among the component servers. We have studied N-tier applications through a large number of experiments using the RUBiS and RUBBoS benchmarks. We apply statistical methods such as kernel density estimation, adaptive filtering, and change detection through multiple-model hypothesis tests to analyze more than 200GB of recorded data. Beyond the usual single-bottlenecks, we have observed more intricate bottleneck phenomena. For instance, in several configurations all system components show average resource utilization significantly below saturation, but overall throughput is limited despite addition of more resources. More concretely, our analysis shows experimental evidence of multi-bottleneck cases with low average resource utilization where several resources saturate alternately, indicating a clear lack of independence in their utilization. Our data corroborates the increasing awareness of the need for more sophisticated analytical performance models to describe N-tier applications that do not rely on independent resource utilization assumptions. We also present a preliminary taxonomy of multi-bottlenecks found in our experimentally observed data.

1. Introduction

In modern data centers, enterprise-class N-tier systems with web servers, application servers, and database servers are growing in economic importance, infrastructure footprint, and application complexity. Traditional performance analysis methods are challenged by this growth due to bottleneck phenomena that so far have been considered rare and unusual. In this paper, we show experimentally such “unusual” phenomena and how to detect and analyze them. Our data from system configurations reveal bottleneck cases with different kinds of partially saturated resources, which can be grouped according to their resource usage dependence and saturation frequency into oscillatory bottlenecks, concurrent bottlenecks, and simultaneous bottlenecks. In order to distinguish these cases from single-bottlenecks, which have traditionally been recognized as the predominant saturation phenomenon, we use the umbrella term multi-bottlenecks.

Multi-bottlenecks have been previously studied in system theory [1]–[3] and system analysis [4], [5]. However, classical computer performance analysis [6], [7], which started with stable workloads on mainframes, is traditionally restricted to detecting single-bottlenecks. Typical assumptions of analytical models (e.g., mean value analysis in queuing theory) are the independence of arriving tasks and service times. Partially motivated by the growth of N-tier systems, some recent performance models have moved beyond these assumptions. Consequently, the importance of bursty workload conditions [8], [9] and of non-stationary workloads in general [10] has been recognized. Nevertheless, average utilization values remain the method of choice in top-down N-tier system analysis [11]–[14]. When mentioned at all [9], [11], phenomena such as multi-bottlenecks in N-tier applications have been described as a “challenging case”.

In this paper we analyze experimental results with multi-bottlenecks. We show that these bottlenecks may occur while system components show average resource utilization significantly below saturation. Figure 1 exemplifies this case by contrasting the throughput and response time of a saturated RUBBoS experiment with the corresponding candidate-bottleneck resources. While there is an obvious correlation between system throughput and database CPU utilization, the question why this correlated resource does not seem to saturate remains unanswered. Because none of the resources show full resource utilization levels, standard approaches to bottleneck detection in N-tier systems (e.g., [15]) are not able to diagnose them. However, instead of attributing the obvious performance limitation to some hidden (i.e., unmonitored) resource, we show—using our methodology—that the observed throughput limitation is due to an oscillatory bottleneck in the database where CPU and disk saturate alternately (see Section 6).

The main contribution of this paper is threefold. First, our experimental evaluation of N-tier application benchmarks (i.e., RUBiS and RUBBoS) documents the presence of presumably unusual multi-bottlenecks. Second, we introduce a simple classification of multi-bottlenecks that enables the detection of the observed phenomena. Third, using statistical methods such as kernel density estimation, adaptive filtering, and change detection through multiple-model hypothesis tests, we show how to detect multi-bottlenecks with an efficient top-down approach, even if average resource utilization is significantly below saturation.

The remainder of this paper is structured as follows. In Section 2 we introduce the simple classification of multi-bottlenecks that is used in our analysis. In Section 3 we outline the necessary methods for statistical interpretation of the measurement data. Section 4 presents the experimental setup and infrastructure. Section 5 shows a scenario with seven concurrent bottlenecks. In Section 6 we present two different oscillatory bottleneck cases. Related work is summarized in Section 7, and Section 8 concludes the paper.

2. Simple Classification of Multi-bottlenecks

The common understanding of a system bottleneck (or bottleneck for short) can intuitively be derived from its literal meaning as the key limiting factor for achieving higher system throughput. Hence, an improvement to the throughput of the bottleneck resource results in the highest possible system throughput improvement [6], [16]. Classical queuing theory defines the set of bottlenecks in a queuing system as follows: B is the set of all resources i that reach full utilization U_i when N, the number of jobs in the system, tends to infinity under stable class-mix conditions [1].

B = { i | lim_{N→∞} U_i(N) = 1 }    (1)
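For closed product-form queuing networks, a standard consequence of this definition is that the asymptotic bottleneck set of Equation (1) reduces to the resources with maximal total service demand (visit count times service time). The following minimal sketch computes that set; the resource names and demand values are purely illustrative.

```python
# Sketch: classical bottleneck identification in a closed product-form
# queuing network. As N -> infinity, U_i(N) -> D_i / max_j D_j, so the
# bottleneck set B of Equation (1) is the set of resources with maximal
# total service demand. Demand values below are hypothetical.
def bottleneck_set(demands, tol=1e-9):
    """demands: dict resource -> total service demand (visits * service time)."""
    d_max = max(demands.values())
    return {r for r, d in demands.items() if abs(d - d_max) <= tol}

demands = {"web_cpu": 0.004, "app_cpu": 0.011, "db_cpu": 0.018, "db_disk": 0.018}
print(bottleneck_set(demands))  # db_cpu and db_disk share the maximal demand
```

Note that this classical rule already hints at the paper's point: when two resources tie for the maximal demand, reasoning about a single saturated resource breaks down.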

Due to their simplicity and intuitiveness, definitions similar to Equation (1) have usually provided the foundation for reasoning about bottleneck behavior in computer system performance analysis. But despite their popularity, such formulations are based on assumptions that do not necessarily hold in practice. In other words, we show experimentally that Equation (1) does not allow correct identification of bottlenecks in the case of empirical N-tier application monitoring data. Our data suggest that resources in N-tier applications cannot generally be assumed to exhibit independent utilization. In fact, bottlenecks may be comprised of more than one physical resource. Similarly, the dimension of time has to be taken into account in N-tier systems, which may be subject to very strong workload fluctuations. Therefore, the assumption of stable class-mix conditions has to be relaxed

[Figure 2. Simple multi-bottleneck classification. The x-axis shows resource saturation frequency (partially vs. fully saturated); the y-axis shows resource usage dependence (dependent vs. independent). Quadrants: dependent/partially saturated = oscillatory bottlenecks; dependent/fully saturated = not observed; independent/partially saturated = concurrent bottlenecks; independent/fully saturated = simultaneous bottlenecks.]

as well. Finally, an infinite number of users is not technically feasible; hence, the assumption of full resource utilization has to be adapted to actual monitoring conditions.

Since our data show that the aforementioned assumptions may be too rigid, the need arises to classify the observed phenomena in an alternate way. But before doing so through a formal set of definitions in the later part of this section, we first establish the key differences between our taxonomy and common approaches to N-tier bottleneck detection. Owing to queuing theory, the term bottleneck is often used synonymously with single-bottleneck. In a single-bottleneck case, the saturated resource typically exhibits a near-linearly load-dependent average resource utilization that saturates at a value of one past a certain workload. Consequently, such bottlenecks are straightforward to detect. However, the characteristics of bottlenecks may change significantly if more than one bottleneck resource is present in the system, which is the case for many real N-tier applications with heterogeneous workloads. Therefore, we distinguish the phenomena discussed in this paper from single-bottlenecks by using the umbrella term multi-bottlenecks.

Because system resources may be causally dependent in their individual usage patterns, multi-bottlenecks introduce the classification dimension of resource usage dependence. Additionally, greater care has to be taken in classifying bottleneck resources according to their resource saturation frequency. We distinguish between resources that saturate for the entire observation period (i.e., fully saturated) and resources that saturate for only certain parts of the period (i.e., partially saturated). Note that previous efforts in this area have typically omitted the notions of dependence and saturation frequency in their analysis (e.g., [15]).

Figure 2 summarizes the classification that forms the basis of our multi-bottleneck detection. The x-axis and y-axis represent resource saturation frequency and resource usage dependence, respectively. During our extensive empirical evaluation, we have only observed three of the four possible

[Figure 1. 1/2/1/1L RUBBoS experiment with read/write workload. (a) Average throughput [ops/s] and response time [s] versus workload (1,000–13,000 users). (b) Average utilization [%] of candidate-bottleneck resources (DB CPU, DB disk, CM CPU, WEB CPU, APP1 CPU) versus workload.]

combinatoric cases, which suggests that if resources exhibit full saturation, their resource utilization is not dependent on any other saturated resource. Under this assumption, multiple fully saturated resources imply resource usage independence, which is classified as the simultaneous bottleneck case. A well-known example of systems with resources that are resource usage independent are embarrassingly parallel applications. Concurrent bottlenecks happen if multiple resources saturate independently of each other, but only for a fraction of the observation period. As the load increases, a concurrent saturation pattern may transform into simultaneous saturation. The most notable characteristic of usage-independent resources is that each resource's saturation may be evaluated separately. On the contrary, the detection of oscillatory bottlenecks is more challenging because multiple resources form a combined bottleneck, which can only be analyzed and resolved in union. Oscillatory bottlenecks consist of partially saturated resources and are caused by resource usage dependencies (e.g., load-balancing policies) that are an inherent characteristic of complex N-tier systems. Such dependencies result in a usage alternation between the saturated resources (i.e., there is no overlapped saturation), which is observable as interleaved saturation patterns. The latter may cause particularly low average saturation for every single resource, and the observable saturation frequency can become difficult to interpret. Therefore, it is inevitable to take resource usage dependence into account in multi-bottleneck detection. We find that once resources are grouped according to their usage dependence, it is possible to compute their combined frequency and derive a representative measure for the bottleneck magnitude.

Given these findings, the following definitions are sufficient for a simple multi-bottleneck classification. The resulting generic schema may be parameterized in a concrete setting, such as presented in this paper, and utilized to detect bottlenecks in N-tier application monitoring data. In the following, we augment the definitions with our concrete parameter choices, leaving sensitivity analysis and parameterization comparisons as an interesting topic for future research.

Critical saturation. System resource i ∈ I, where I is a set of system resources, is critically saturated for an observation interval t of length λ > 0 iff its utilization exceeded a saturation threshold α ∈ [0, 1] for this interval, whereby α is referred to as the critical saturation threshold.

We found that parameterizing the tuple (λ, α) with 1 second and 95%, respectively, balances workload fluctuation, technical feasibility, and standard statistical significance well in our data. Set I consists of all monitored resources, which are all CPUs, disks, and network links in our testbed.

Resource saturation frequency. The resource saturation frequency f_R^i (or frequency for short) of resource i ∈ I is the number of intervals with critical saturation divided by the total number of intervals in observation period Λ.

We adopted the system observation period Λ equal to the experiment runtime of 8 minutes. Therefore, the resource saturation frequency f_R^i indicates how often resource demand exceeded the capacity of resource i during runtime.

Fully saturated resource. A fully saturated resource i ∈ I has maximal saturation frequency; i.e., f_R^i ≈ 1.

We found that f_R^i ≥ 0.95 constitutes a convenient detection rule for full saturation of resource i.

Partially saturated resource. A partially saturated resource i ∈ I is not fully saturated, but has a non-negligible saturation frequency; i.e., f_R^i ∈ (0, 1).

We found that 0.01 < f_R^i < 0.95 constitutes a convenient detection rule for partial saturation of resource i.
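The saturation-frequency rules above can be sketched directly. The following minimal example assumes a utilization trace sampled at one-second intervals (λ = 1 s) and uses the paper's parameter choices (α = 0.95, cutoffs 0.01 and 0.95); the trace values are illustrative.

```python
# Sketch of critical saturation (per-interval), saturation frequency f_R,
# and the fully/partially saturated detection rules defined above.
ALPHA = 0.95  # critical saturation threshold (95%)

def saturation_states(utilization):
    """Binary critical-saturation state B_t for each 1-second interval."""
    return [1 if u > ALPHA else 0 for u in utilization]

def saturation_frequency(utilization):
    """f_R: fraction of intervals in the observation period that are saturated."""
    states = saturation_states(utilization)
    return sum(states) / len(states)

def classify(f_r):
    if f_r >= 0.95:
        return "fully saturated"
    if 0.01 < f_r < 0.95:
        return "partially saturated"
    return "not a bottleneck resource"

trace = [0.99, 0.97, 0.40, 0.98, 0.30, 0.96]  # illustrative utilization values
f = saturation_frequency(trace)
print(f, classify(f))
```

With four of six intervals above the threshold, f_R ≈ 0.67 and the resource is classified as partially saturated.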

Bottleneck resource. A bottleneck resource b ∈ I is either a fully saturated resource or a partially saturated resource.

Resource usage dependence. A bottleneck resource b_j ∈ I is resource usage dependent (or dependent for short) if there exists another bottleneck resource b_k ∈ I such that their binary critical saturation states B_j^t ∈ {0, 1} and B_k^t ∈ {0, 1} have a mutually exclusive relationship for any given interval t; i.e., Prob(B_j^t = 1 ∧ B_k^t = 1) ≤ ε with 0 < ε ≪ 1 for all t in Λ.

We found that this definition implies an intuitive rule in a top-down approach to bottleneck detection. Because of the mutual exclusion property, dependent bottleneck resources do not exhibit overlap of critical saturation intervals among each other. Even at very high workloads, such overlap has a statistically insignificant probability; e.g., Prob(overlap) ≤ 5%.

Resource usage independence. A bottleneck resource b_j ∈ I is resource usage independent (or independent for short) if there exists no other bottleneck resource b_k ∈ I such that their binary critical saturation states B_j^t ∈ {0, 1} and B_k^t ∈ {0, 1} have a mutually exclusive relationship for any given interval t; i.e., Prob(B_j^t = 1 ∧ B_k^t = 1) > ε with 0 < ε ≪ 1 for all t in Λ.

In contrast to dependent bottleneck resources, independent bottleneck resources show a clearly increasing overlap of critical saturation intervals with growing workload; e.g., Prob(overlap) > 5%.
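The dependence test above reduces to an empirical overlap probability over the binary saturation states. A minimal sketch, using the paper's rule of thumb ε = 5% and illustrative state traces:

```python
# Sketch of the resource usage dependence test: two bottleneck resources
# are usage dependent if their critical saturation states are (almost)
# mutually exclusive, i.e. the empirical overlap probability stays below
# a small epsilon (rule of thumb from the text: Prob(overlap) <= 5%).
EPS = 0.05

def overlap_probability(states_j, states_k):
    """Empirical Prob(B_j^t = 1 and B_k^t = 1) over the observation period."""
    both = sum(1 for bj, bk in zip(states_j, states_k) if bj and bk)
    return both / len(states_j)

def usage_dependent(states_j, states_k):
    return overlap_probability(states_j, states_k) <= EPS

# Interleaved saturation (oscillatory pattern): never saturated together.
cpu  = [1, 0, 1, 0, 1, 0, 1, 0]
disk = [0, 1, 0, 1, 0, 1, 0, 1]
print(usage_dependent(cpu, disk))  # True: the overlap probability is 0
```

The same function, applied to states that frequently overlap, returns False, which is the independent (concurrent or simultaneous) case.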

Single-bottleneck. A single-bottleneck β ∈ I is either a fully saturated bottleneck resource or a partially saturated bottleneck resource iff there exists only one bottleneck resource in the system.

Oscillatory bottleneck. An oscillatory bottleneck β ⊆ I is a set of partially saturated, resource usage dependent bottleneck resources iff there exists more than one bottleneck resource in the system.

In this paper, we focus on the maximal set of resources that form an oscillatory bottleneck. The study of subsets of resources in such oscillations is a subject of future research.

Concurrent bottleneck. A concurrent bottleneck β ∈ I is a partially saturated, resource usage independent bottleneck resource iff there exists more than one bottleneck resource in the system.

Simultaneous bottleneck. A simultaneous bottleneck β ∈ I is a fully saturated, resource usage independent bottleneck resource iff there exists more than one bottleneck resource in the system.

Multi-bottleneck. A multi-bottleneck is either an oscillatory bottleneck, a concurrent bottleneck, or a simultaneous bottleneck.

Bottleneck. A bottleneck is either a single-bottleneck or a multi-bottleneck.

Bottleneck saturation frequency. The bottleneck saturation frequency f_B^β (or frequency for short) of bottleneck β is the number of intervals where at least one of the bottleneck resources in β is critically saturated, divided by the total number of intervals in observation period Λ.

The bottleneck saturation frequency indicates how often resource demand exceeded the capacity of any of the resources that are analyzed in union due to their strong usage dependence. Hence, this measure allows the assessment of how often a system under examination suffered from a particular performance-limiting phenomenon.

Primary bottleneck. The primary system bottleneck is the bottleneck with the highest bottleneck saturation frequency.

Note that there can be more than one primary bottleneck and that any single-bottleneck or simultaneous bottleneck is also a primary bottleneck by definition.
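The last two definitions combine naturally: the frequency of a bottleneck is the union of its member resources' saturation intervals, and the primary bottleneck is the argmax over all bottlenecks. A minimal sketch with illustrative binary state traces:

```python
# Sketch: bottleneck saturation frequency f_B as the union of the member
# resources' critical-saturation intervals, and primary-bottleneck
# selection as the argmax over all detected bottlenecks.
def bottleneck_frequency(member_states):
    """f_B: fraction of intervals where at least one member resource saturates."""
    n = len(member_states[0])
    return sum(1 for t in range(n) if any(s[t] for s in member_states)) / n

def primary_bottlenecks(bottlenecks):
    """bottlenecks: dict name -> list of member binary state traces."""
    freqs = {name: bottleneck_frequency(m) for name, m in bottlenecks.items()}
    top = max(freqs.values())
    return [name for name, f in freqs.items() if f == top], freqs

oscillatory = [[1, 0, 1, 0], [0, 1, 0, 1]]  # e.g., DB CPU and DB disk alternate
single      = [[0, 1, 0, 0]]                # some independent resource
names, freqs = primary_bottlenecks({"db_oscillatory": oscillatory,
                                    "other": single})
print(names, freqs)  # the oscillatory group has combined frequency 1.0
```

This illustrates why oscillatory bottlenecks must be evaluated in union: each member saturates only 50% of the time, yet the combined frequency is 1.0.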

3. Statistical Data Interpretation

In the following we provide a brief overview of the most important statistical concepts that formed the basis of our data analysis. Since real systems are typically subject to variable request characteristics, it is necessary to analyze the distributions of resource utilization values to infer actual saturation characteristics. However, histograms are often poor estimates of unknown density functions [17]; therefore, we chose estimation through kernel densities [18] instead. Given a finite sample X_1, . . . , X_n from a univariate distribution, the unknown density function g can be estimated from the observed data using kernel regression. Although the literature offers various kernel functions, the symmetric Gaussian density is a popular choice in density estimation. Given a value x, the kernel function k, and a smoothness parameter h (i.e., “bandwidth”), the density estimator g_c takes the following form.

g_c = (1/n) Σ_{t=1}^{n} k_h( (x − X_t) / h )    (2)

There are different approaches to calculating bandwidth h. We choose a common strategy (i.e., AMISE estimation [18]), which simplifies to estimation with the sample standard deviation estimate σ.

h_0 = σ ∛4 / ∛n    (3)

An example of a resulting three-dimensional density graph is shown in Figure 4(a).
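A Gaussian kernel density estimate in the spirit of Equation (2) can be sketched in a few lines. Note that the bandwidth below is a standard Silverman-type rule of thumb, which may differ from the paper's exact AMISE parameterization; the utilization sample is illustrative.

```python
# Minimal Gaussian kernel density estimate: an average of Gaussian
# kernels centered at the observations. Bandwidth uses Silverman's
# rule of thumb (an assumption; the paper uses an AMISE-based rule).
import math

def gaussian_kde(sample, x, h=None):
    n = len(sample)
    if h is None:
        mean = sum(sample) / n
        sigma = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
        h = 1.06 * sigma * n ** (-1 / 5)  # Silverman's rule of thumb
    return sum(math.exp(-0.5 * ((x - xt) / h) ** 2) /
               (h * math.sqrt(2 * math.pi)) for xt in sample) / n

# Bimodal utilization sample: mostly idle or critically saturated.
utilization = [0.22, 0.25, 0.24, 0.91, 0.95, 0.93]
print(gaussian_kde(utilization, 0.93) > gaussian_kde(utilization, 0.55))  # True
```

The comparison at the end shows why densities are preferable to averages here: the mean utilization (about 0.58) falls in a low-density valley between the two modes.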

In the analysis of multi-bottlenecks, it is further necessary to assess the dependence relationship of resources in excess of their relative probabilities. In order to reduce the complexity of this analysis in an efficient top-down approach, it is possible to segment the monitoring data into piecewise constant functions. This method provides intuitive aggregation and reduces the data size significantly compared to bottom-up analysis. Such a segmenting problem is also known


(a) Software setup
    Web server:          Apache 2.0.54
    Application server:  Apache Tomcat 5.5.17, JOnAS 4.6.6
    Cluster middleware:  C-JDBC 2.0.2
    Database server:     MySQL 5.0.51a
    Operating system:    Redhat FC4 (Kernel 2.6.12)
    System monitor:      Systat 7.0.2

(b) Hardware node setup
    Normal:    Processor: Xeon 3GHz 64-bit; Memory: 2GB; Network: 6 x 1Gbps; Disk: 2 x 146GB 10,000rpm
    Low-cost:  Processor: PIII 600MHz 32-bit; Memory: 256MB; Network: 5 x 100Mbps; Disk: 13GB 7,200rpm

(c) Sample topology (1/2/1/2L)
    [Diagram not recoverable: one web server, two application servers, one cluster middleware node, and two low-cost database servers.]

Table 1. Details of the experimental setup on the Emulab cluster.

as an adaptive filtering problem that detects abrupt changes in streams of noisy data and has been previously studied in signal processing. Therefore, we perform an automated multiple-model hypothesis test on a set of signal estimation models, each based on different assumptions, to divide the measured time series into segments according to piecewise constant mean and variance models [19]. Given the time index t, the noise measure e, the parameter vector θ, and the noise variance γ, each signal y can be expressed as follows.

y(t) = θ(t) + e(t)    (4)

E[e(t)²] = γ(t)    (5)

In our case y(t) in (4) is identical to X_t in (2). The set of final change times is determined algorithmically [19]. This approach is directly interpretable as a characterization of each stable interval of the piecewise constant resource utilization functions. We characterize each utilization interval with respect to being statistically distinguishable from critical saturation (see Section 2). Such a classification is performed for each bottleneck resource metric, and the results can be summarized in a simple graph for the entire system (e.g., Figure 4(d)).
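The full multiple-model hypothesis test of [19] is beyond the scope of a short example, but the underlying idea of segmenting a noisy trace into piecewise constant intervals can be sketched with a simple two-window mean-shift detector. The window size and threshold below are arbitrary illustrative choices, not the paper's parameters.

```python
# Simplified change detection: declare a change point when the means of
# two adjacent sliding windows differ by more than a threshold, then
# characterize each stable interval by its mean (piecewise constant model).
# This is a rough stand-in for the multiple-model hypothesis tests in [19].
def segment(signal, window=4, threshold=0.2):
    changes = [0]
    for t in range(window, len(signal) - window + 1):
        left = sum(signal[t - window:t]) / window
        right = sum(signal[t:t + window]) / window
        # The gap guard (2 * window) suppresses repeated detections of
        # the same change while the windows straddle it.
        if abs(left - right) > threshold and t - changes[-1] >= 2 * window:
            changes.append(t)
    changes.append(len(signal))
    return [(a, b, sum(signal[a:b]) / (b - a))
            for a, b in zip(changes, changes[1:])]

trace = [0.2] * 8 + [0.97] * 8  # idle, then critically saturated
for start, end, mean in segment(trace):
    print(start, end, round(mean, 2))
```

Each resulting (start, end, mean) triple is one stable interval, which can then be tested for being statistically distinguishable from critical saturation as described above.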

4. Experimental Setup

Among N-tier application benchmarks, RUBBoS and RUBiS have been used in numerous research efforts due to their real production system significance. In our experiments, each run consists of an 8-minute ramp-up, a 12-minute run period, and a 30-second ramp-down. Performance measurements (e.g., CPU or network utilization) are taken during the run using Linux account logging utilities (i.e., Sysstat) with one-second intervals.

RUBBoS [20] is an N-tier e-commerce system modeled on bulletin board news sites similar to Slashdot. The benchmark can be implemented as 3-tier (web server, application server, and database server) or 4-tier (with the addition of cluster middleware such as C-JDBC) systems. The benchmark places high load on the database tier. The workload consists of 24 different interactions (involving all tiers) such as register user, view story, and post comments. The benchmark includes two kinds of workloads: browse-only and read/write interaction mixes.

RUBiS [21] is an N-tier system benchmark modeled on online auction sites such as eBay. The benchmark can be implemented as a 3-tier, 4-tier, or 5-tier system. We have chosen a configuration consisting of web server, web container, EJB container, cluster middleware, and database server. Web and EJB containers are deployed together on the same physical nodes. The benchmark usually places a high load on application servers. The workload consists of 26 interactions such as register user, sell item, and place bid. RUBiS includes two kinds of workloads: browse-only and read/write interaction mixes.

The experiments used in this paper were run in the Emulab testbed [22] with various types of servers. Table 1(b) contains a summary of the hardware used in our experiments. Normal and low-cost nodes were connected over 1,000 Mbps and 100 Mbps links, respectively. The experiments were carried out by allocating a dedicated physical node to each server. In the initial setting all components were normal nodes. As an alternative, database servers were also hosted on low-cost machines. We use a four-digit notation #W/#A/#C/#D to denote the number of web servers, application servers, cluster middleware nodes, and database servers. The server node type is either normal or low-cost (“L”). If a specification is omitted, it can be assumed that the default node type (i.e., normal) has been used. A sample topology of an experiment with one web server, two application servers, one cluster middleware node, and two low-cost database servers (i.e., 1/2/1/2L) is shown in Table 1(c).
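The #W/#A/#C/#D topology notation described above is easy to parse mechanically; a minimal sketch (the tier names in the returned dict are our own labels, not from the paper):

```python
# Sketch: parse the #W/#A/#C/#D topology notation (web servers,
# application servers, cluster middleware nodes, database servers),
# where each count may carry an "L" suffix for low-cost nodes.
def parse_topology(spec):
    tiers = ["web", "app", "middleware", "db"]
    result = {}
    for tier, part in zip(tiers, spec.split("/")):
        low_cost = part.endswith("L")
        count = int(part[:-1] if low_cost else part)
        result[tier] = (count, "low-cost" if low_cost else "normal")
    return result

print(parse_topology("1/2/1/2L"))
# {'web': (1, 'normal'), 'app': (2, 'normal'),
#  'middleware': (1, 'normal'), 'db': (2, 'low-cost')}
```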

The presented dataset is part of an ongoing effort, for which we have run a very high number of experiments over a wide range of configurations and workloads. A typical RUBiS or RUBBoS experimentation cycle requires thousands of lines of code that need to be managed for each experiment. The experimental data output are system metric data points (i.e., network, disk, and CPU utilization)

[Figure 3. Average throughput [ops/s] and response time [s] versus workload (500–1,300 users) of a 1/6/1/1L RUBiS experiment with browse-only workload and throttled database network bandwidth (2 Mbps).]

in addition to higher-level monitoring data (e.g., response times and throughput). Although the scripts contain a high degree of similarity, the differences among them are subtle and important due to the dependencies among the varying parameters. Maintaining these scripts by hand is a notoriously expensive and error-prone process.

To enable experimentation at this scale, we employed an experimental infrastructure created for the Elba project [23] to automate system configuration management, particularly in the context of N-tier system staging. The Elba approach [12] divides each automated staging iteration into steps such as converting policies into resource assignments [24], automated code generation [25], benchmark execution, and analysis of results.

5. Concurrent Bottlenecks

In this section we show a RUBiS benchmark experiment that exhibits concurrent saturation of the application server CPUs and the database network link. More concretely, these data are obtained with a 1/6/1/1L RUBiS configuration using the browse-only interaction mix. Traffic shaping (implemented by the packet scheduler modules in the Linux NET3 kernel) has been used to throttle the network bandwidth between the application servers and the database server. Packets are added to a scheduler queue tree and shaped corresponding to a 2 Mbps bandwidth limit using the drop-tail queuing discipline. The system workload ranges between 500 and 1,400 users in steps of 100.

Figure 3 summarizes the overall system performance, which is observable in the clients. The throughput increases near-linearly up to the performance knee at around 800 concurrent users, where the rate saturates at around 140 interactions per second. Similarly, the average response time exceeds six seconds past a workload of 800 users. Evidently, there is at least one bottleneck in the system that needs to be identified through a detailed analysis.

Figure 4 shows the summary of the bottleneck analysis for this scenario. The bottlenecks in the system are best illustrated by studying their resource utilization as a function of workload. The three density graphs in Figures 4(a), 4(b), and 4(c) show the increasing demand peak (z-axis) on the resource utilization (x-axis) as the workload grew from 500 to 1,400 users (y-axis). More specifically, the three density graphs contrast the changes in the resource consumption profiles of bottleneck resources (i.e., application server CPUs and database network) and non-saturated resources (i.e., database CPU). Note that the application server CPU graph is representative of all application servers. While the first graph (Figure 4(a)) shows the continuous saturation of the network bandwidth at 250 KBps (consistent with the link shaping parameter of 2 Mbps), the application server CPU utilization (Figure 4(b)) transitions from a normal to a multimodal saturated curve shape with a dominating mode at the upper capacity limit. This multimodality, which causes average utilization values to be misleadingly low, is the result of an unstable resource consumption profile due to non-stationary request sequences. For higher workloads the application server CPUs vary between being critically utilized and being underutilized. Overall, the seven saturated system resources limit the number of tasks going into the database server. Therefore, its CPU remains unsaturated, and its utilization density retains its normal shape despite growing intensity throughout the entire workload span (Figure 4(c)).

In order to precisely characterize whether the uncovered bottleneck resources are concurrent or oscillatory bottlenecks, we estimated the saturation frequencies of each resource and visualized them in Figure 4(d). Although there are fourteen different combinations, the concurrent character of the seven bottlenecks is apparent. There is a clear linear decrease in the probability of non-overlapping bottleneck states, and a corresponding linear increase in overlapping bottleneck states, as the workload grows to 1,400 users. For workloads higher than 1,000, the probability of states without critical saturation of at least one resource is statistically insignificant. Both the frequency of the network bottleneck and the frequencies of the CPU bottlenecks are solely load-dependent, which implies saturation-independence among the resources. Because none of the resources is fully saturated, we can conclude that the system suffers from seven concurrent bottlenecks in the six application server CPUs and the database network link.
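The saturation-state frequencies behind a plot like Figure 4(d) can be estimated with a simple counting scheme. The sketch below uses hypothetical utilization samples and a 90% saturation threshold (both assumptions, not our instrumentation): each monitoring interval is mapped to a boolean saturation vector, and state probabilities are relative frequencies of the observed vectors.

```python
import numpy as np
from collections import Counter

def state_probabilities(util, thresholds):
    """Map each monitoring interval to a boolean saturation state and
    return the relative frequency of every observed state."""
    saturated = util > np.asarray(thresholds)  # shape: (intervals, resources)
    counts = Counter(tuple(bool(x) for x in row) for row in saturated)
    return {state: c / len(util) for state, c in counts.items()}

# Hypothetical 5-interval trace for two resources (e.g., an application
# server CPU and the database network link).
util = np.array([[96., 40.], [97., 92.], [50., 95.], [98., 96.], [45., 50.]])
probs = state_probabilities(util, thresholds=[90., 90.])
print(probs[(True, True)])  # fraction of intervals with overlapping saturation
```

For concurrent bottlenecks, the probability mass shifts toward states with many `True` entries as the workload grows; for oscillatory bottlenecks, the all-`True` states stay near zero.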

6. Oscillatory Bottlenecks

A common assumption in computer system performance analysis is that low average resource utilization implies absence of saturation. However, in the following we show that a standard RUBBoS read/write mix workload may cause request sequences in the database that result in alternating partial resource saturation with particularly low average resource utilization. While Section 6.1 details a


(a) DB network outgoing rate density.
(b) App1 CPU utilization density.
(c) DB CPU utilization density.
(d) Probability distribution (i.e., frequency of resource saturation states) among the fourteen different bottleneck resource saturation states.

Figure 4. Detailed bottleneck analysis of a 1/6/1/1L RUBiS experiment with browse-only workload and throttled database network bandwidth (2 Mbps).

scenario with an oscillatory bottleneck within a single node, Section 6.2 shows an example of an oscillatory bottleneck with eight resources distributed in the database tier.

6.1. Within-Node Dependences

As previously shown in Figure 1(a), the 1/2/1/1L RUBBoS scenario under read/write workload exhibits a clear overload pattern past workloads of 3,000 users. Nevertheless, the average utilization values for candidate bottleneck resources (Figure 1(b)) remain far from saturation. Although the comparison of these two figures reveals a strong correlation between system throughput and database CPU utilization, the CPU utilization seems to saturate at an average of less than 80 percent.

In order to diagnose the system bottlenecks, we turn to Figure 5. The analysis of Figures 5(a) and 5(b) yields the explanation of the previously observed low utilization values. In fact, the two examined densities (database CPU and disk) are both bimodal with two distinct concentration sectors for the entire workload span. Both utilization metrics seem to vary between low and high values during the experiment time. A sample examination of the two time series for a short interval of runtime for 12,000 users (Figure 5(c)) seems to support this impression. However, it is important to note that such an examination is not scalable and requires human intuition to derive insights into the actual system behavior. Therefore, we automatically summarize the metric

Figure 6. Average throughput and response time of a 1/1/1/8L RUBBoS experiment with read/write workload.

data in Figure 5(d), which reveals two bottleneck resources with four saturation states for the entire workload span. Clearly, overlapping saturation is extremely rare. The ratio of time fractions for the two non-overlapping bottleneck states seems to be relatively stable, which further strengthens the assumption of high dependence between the two bottleneck resources. Consequently, the reason for the system saturation lies in an oscillatory bottleneck with interleaved saturation of database CPU and disk. A thorough log analysis reveals that if many long queries are bottlenecked at the database disk, the CPU remains idle since it is waiting for I/O, and no new requests are admitted to the node. This causes the observed exclusive saturation pattern.
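The distinction between interleaved and overlapping saturation can be captured in a simple classification rule. The indicator series and the 5% overlap threshold below are illustrative assumptions, not values taken from our experiments: if both resources saturate part of the time but almost never together, the pair behaves as an oscillatory bottleneck.

```python
import numpy as np

def classify_bottleneck(sat_a, sat_b, overlap_eps=0.05):
    # sat_a, sat_b: boolean saturation indicators per monitoring interval.
    frac_a, frac_b = sat_a.mean(), sat_b.mean()
    overlap = (sat_a & sat_b).mean()
    if frac_a > 0 and frac_b > 0 and overlap < overlap_eps:
        return "oscillatory"  # interleaved, mutually exclusive saturation
    if overlap >= overlap_eps:
        return "concurrent"   # resources frequently saturate together
    return "single/none"

# Hypothetical CPU/disk indicators that alternate strictly, mimicking the
# interleaved CPU/disk saturation described for the 1/2/1/1L scenario.
cpu = np.array([True, False, True, False, True, False])
disk = ~cpu
print(classify_bottleneck(cpu, disk))
```

Note that both resources in this sketch show only 50% saturation individually, so an average-utilization check would flag neither of them.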


(a) DB CPU utilization density.
(b) DB disk utilization density.
(c) Sample time series of database resource utilization (DB CPU and DB disk I/O bandwidth) for 12,000 users.
(d) Probability distribution (i.e., frequency of resource saturation states) among the four different bottleneck resource saturation states.

Figure 5. Detailed bottleneck analysis of a 1/2/1/1L RUBBoS experiment with read/write workload.

6.2. Between-Node Dependences

The performance metrics of a 1/1/1/8L RUBBoS deployment under standard read/write workload for 1,000 to 13,000 concurrent user sessions, shown in Figure 6, reveal system saturation past a workload of 4,000 users. The analysis necessary to infer the underlying bottlenecks is summarized in Figure 7. Figure 7(a) shows that the CPU utilization in the database does not reach critical levels during the experiment time. Consequently, this resource can be disregarded as a bottleneck candidate. The inspection of the density graph for the disk utilization in the first database (representative of all eight databases) reveals slightly elevated density values at the high tail of the right-skewed density (see Figure 7(b)). Unlike the previous examples, these values do not seem to explain the overall performance deterioration at first sight. The elevation levels are constant, and their overall probability remains significant but small throughout the entire experiment. In other words, the resource exhibits saturation behavior that is very infrequent during the runtime. This could suggest a strongly oscillating bottleneck with a large number of bottleneck resources. Such an interpretation is further supported by Figure 7(c), which shows the density of the maximal disk utilization among all eight databases. Past a workload of 3,000 users, the maximal value is always higher than 50 percent, and a dominant peak has appeared at the high percentiles. This means that the system exhibits a high utilization in at least one of the database disks at all times for higher workloads.

Nonetheless, the question remains whether the bottleneck resources saturate dependently. While a sample manual evaluation of the resource utilization time series (see Figure 7(d)) does not yield clear results due to the magnitude of the variability, the automated partitioning used to generate Figure 7(e) reveals the oscillatory character of this bottleneck. The probability plots are strongly dominated by two system states (i.e., single database disk bottleneck or no bottleneck at all). There is virtually no overlap between the saturation of the eight resources. The rarely observed overlap can be attributed to the noise that is introduced into the data by each resource and by the stochastic aggregation method. Therefore, we can conclude that the performance limitation in this scenario is caused by an oscillatory bottleneck with eight saturation-dependent, partially saturated database disks. A detailed log analysis shows that because of the workload-distributing capabilities of C-JDBC and the relative infrequency of overly long queries, the bottleneck is strongly distributed among the eight resources. The interleaved saturation is caused by the default C-JDBC replication policy, which waits for the completion of all concurrent write-requests before committing. Writes are always immediately sent to all databases (i.e., multi-master replication), thus the entire system is bottlenecked if one database node saturates.
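The distributed character of this bottleneck can be mimicked with synthetic data (the trace below is invented for illustration, not taken from our experiments): when exactly one of eight disks saturates in each interval, every individual disk shows a low saturation fraction even though the system is bottlenecked in virtually every interval.

```python
import numpy as np

rng = np.random.default_rng(1)
n_intervals, n_dbs = 1000, 8

# Synthetic trace: in each monitoring interval one randomly chosen database
# disk saturates (90-100%) while the remaining seven stay lightly loaded.
hot = rng.integers(0, n_dbs, n_intervals)
util = rng.uniform(5.0, 30.0, (n_intervals, n_dbs))
util[np.arange(n_intervals), hot] = rng.uniform(90.0, 100.0, n_intervals)

saturated = util > 85.0
per_disk_fraction = saturated.mean(axis=0)          # each disk: rarely saturated
bottlenecked = (saturated.sum(axis=1) >= 1).mean()  # system: always bottlenecked
print(per_disk_fraction.max(), bottlenecked)
```

Averaging per-disk utilization here would suggest eight healthy resources; only the per-interval maximum (as in Figure 7(c)) exposes the migrating saturation.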


(a) DB1 CPU utilization density.
(b) DB1 disk utilization density.
(c) Maximal DB utilization density among all eight DBs.
(d) Sample time series of disk utilization in all DBs with 12,000 users.
(e) Probability distribution (i.e., frequency of resource saturation states) among the seven different bottleneck resource saturation states.

Figure 7. Detailed bottleneck analysis of a 1/1/1/8L RUBBoS experiment with read/write workload.

7. Related Work

Traditional performance analysis in computer systems presumes models based on expert knowledge, employs standard statistical methods, and parameterizes them based on a certain experimentation design [6], [7]. Queuing models have been common practice in many research efforts dealing with performance prediction [26], [27]. Although these approaches have been applied very successfully, they suffer from their rigid assumptions when handling the evolution of large applications. The availability of extensive instrumentation data profiles [26] or constant mean inter-arrival times of requests [27] does not hold in general since actual parameters vary widely in real applications. The characteristic load non-stationarity in N-tier systems has been exploited for performance prediction by Stewart et al. [10], who also explain how their anomaly detection can be used to invoke a bottleneck detection process such as the one presented in this paper.

Statistically induced models have been recently used to remove human intervention from the loop [28]–[30], and extensive experiments have been conducted to compare different bottleneck detection methodologies [31]. Nevertheless, all these algorithmic approaches explicitly correlate high-level application performance with low-level system behavior. In this paper, we solely use performance degradation patterns as a trigger for our analysis, similarly to

real system administrators [32]. In practice, many computer manuals include a performance tuning section, which typically relies on specialized "rules of thumb". Additionally, commercial tools (e.g., HP OpenView or IBM Tivoli) offer the possibility of inspecting an abundant variety of metric data graphically, without clear aggregation and analysis frameworks. Some discussion of a two-dimensional bottleneck characterization in RUBiS and RUBBoS has been previously provided by Amza et al. [13]. However, this approach solely relies on manual human diagnosis and only targets stable bottleneck characterization with small-scale experimentation.

In contrast to our work, previous bottleneck detection research in computer systems has built upon an extremely detailed understanding of the systems (e.g., an invasively instrumented central system [33]) or an analysis confined to a small resource subset (e.g., network traffic or software configurations [5]). Methods for bottleneck detection and analysis have also been discussed in the literature on simulation in areas such as industrial production [4]. The latter work emphasizes a technique for dealing with shifting bottleneck behavior based on resource utilization. The notion of ordering bottlenecks by severity with the help of resource demand distributions has been introduced by Luthi [3]. He provides proof that approximating distributions with histograms yields a higher precision than conventional methods. Unlike our observation-based work, both these


approaches remain very generic without inference of domain-specific phenomena.

8. Conclusion

For mission-critical applications such as e-commerce, N-tier systems have grown in complexity with non-stationary workloads and inter-task dependencies created by requests that are passed between the various servers. These complicating factors create multi-bottlenecks such as oscillatory bottlenecks (where inter-task dependencies cause the bottleneck to migrate among several resources) and concurrent bottlenecks (where multiple bottlenecks arise among several resources). Multi-bottlenecks are non-trivial to analyze, since they may escape typical assumptions made in classic performance analysis such as stable workloads and independence among tasks.

In this paper, we describe an experimental study of multi-bottlenecks using a large dataset of N-tier application benchmark data. We used techniques and tools developed for automated system management (e.g., code generation for experimental scripts) to collect more than 200GB of measurement data on the N-tier application benchmarks RUBiS and RUBBoS. Using statistical techniques such as kernel density estimation, we show that multi-bottleneck phenomena arise naturally in sufficiently complex N-tier systems. For instance, in 72.2% of our RUBBoS experiments with low-cost database nodes (e.g., Section 6), we were able to identify oscillatory bottlenecks, and in 83.7% of our RUBiS experiments, we found either concurrent or simultaneous bottlenecks (e.g., Section 5). Furthermore, it is non-trivial to reveal multi-bottlenecks since they often happen when no single-bottleneck is visible (i.e., average resource utilization well below saturation for all resources).

Acknowledgment

This research has been partially funded by National Science Foundation grants ENG/EEC-0335622, CISE/CNS-0646430, CISE/CNS-0716484, AFOSR grant FA9550-06-1-0201, NIH grant U54 RR 024380-01, IBM, Hewlett-Packard, Wipro Technologies, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

References

[1] G. Balbo and G. Serazzi, "Asymptotic analysis of multiclass closed queueing networks: multiple bottlenecks," in Perform. Eval. '97.

[2] G. Casale and G. Serazzi, "Bottlenecks identification in multiclass queueing networks using convex polytopes," in MASCOTS '04.

[3] J. Luthi, "Interval matrices for the bottleneck analysis of queueing network models with histogram-based parameters," in IPDS '98.

[4] C. Roser, M. Nakano, et al., "Shifting bottleneck detection," in WSC '02.

[5] F. Ricciato, F. Vacirca, et al., "Diagnosis of capacity bottlenecks via passive monitoring in 3G networks: An empirical analysis," in Comput. Netw. '07.

[6] R. Jain, The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. New York, NY, USA: John Wiley & Sons, Inc., 1991.

[7] D. J. Lilja, Measuring Computer Performance - A Practitioner's Guide. New York, NY, USA: Cambridge University Press, 2000.

[8] N. Mi, "Performance impacts of autocorrelated flows in multi-tiered systems," in Perform. Eval. Rev. '07.

[9] G. Casale, N. Mi, et al., "How to parameterize models with bursty workloads," in HotMetrics '08.

[10] C. Stewart, T. Kelly, et al., "Exploiting nonstationarity for performance prediction," in SIGOPS Oper. Syst. Rev. '07.

[11] Q. Zhang, L. Cherkasova, et al., "A regression-based analytic model for dynamic resource provisioning of multi-tier applications," in ICAC '07.

[12] C. Pu, A. Sahai, et al., "An observation-based approach to performance characterization of distributed n-tier applications," in IISWC '07.

[13] C. Amza, E. Cecchet, et al., "Bottleneck characterization of dynamic web site benchmarks," in IBM CAS '02.

[14] E. Cecchet, A. Chanda, et al., "Performance comparison of middleware architectures for generating dynamic web content," in Middleware '03.

[15] M. Litoiu, "A performance analysis method for autonomic computing systems," in ACM Trans. Auton. Adapt. Syst. '07.

[16] Y. Wang, Q. Zhao, et al., "Bottlenecks in production networks: An overview," in JSSSE '05.

[17] C. Alexopoulos, "Statistical analysis of simulation output: state of the art," in WSC '07.

[18] B. E. Hansen, "Bandwidth selection for nonparametric distribution estimation," www.ssc.wisc.edu/~bhansen/papers/wp.htm, May 2004.

[19] F. Gustafsson, Adaptive Filtering and Change Detection. John Wiley & Sons, Inc., 2001.

[20] "RUBBoS," jmob.objectweb.org/rubbos.html.

[21] "RUBiS," rubis.objectweb.org.

[22] "Emulab - Network Emulation Testbed," www.emulab.net.

[23] "The Elba project," www.cc.gatech.edu/systems/projects/Elba.

[24] A. Sahai, S. Singhal, et al., "Automated generation of resource configurations through policies," in Policy '04.

[25] G. Jung, C. Pu, et al., "Mulini: an automated staging framework for QoS of distributed multi-tier applications," in WRASQ '07.

[26] C. Stewart and K. Shen, "Performance modeling and system management for multi-component online services," in NSDI '05.

[27] B. Urgaonkar, G. Pacifici, et al., "An analytical model for multi-tier internet services and its applications," in Perform. Eval. Rev. '05.

[28] M. K. Aguilera, J. C. Mogul, et al., "Performance debugging for distributed systems of black boxes," in SOSP '03.

[29] I. Cohen, M. Goldszmidt, et al., "Correlating instrumentation data to system states: a building block for automated diagnosis and control," in OSDI '04.

[30] C. Huang, I. Cohen, et al., "Achieving scalable automated diagnosis of distributed systems performance problems," HP Labs, Tech. Rep., 2007.

[31] S. Malkowski, M. Hedwig, et al., "Bottleneck detection using statistical intervention analysis," in DSOM '07.

[32] P. Bodík, A. Fox, et al., "Advanced tools for operators at amazon.com," in HotAC '06.

[33] R. Blake and J. S. Breese, "Automatic bottleneck detection," ftp://ftp.research.microsoft.com/pub/tr/tr-95-10.ps, 1995.