
Software Aging in the Eucalyptus Cloud Computing Infrastructure: Characterization and Rejuvenation

JEAN ARAUJO, RUBENS MATOS, VANDI ALVES, and PAULO MACIEL, Federal University of Pernambuco
F. VIEIRA DE SOUZA, Federal University of Piauí
RIVALINO MATIAS Jr., Federal University of Uberlândia
KISHOR S. TRIVEDI, Duke University

The need for high reliability, availability and performance has significantly increased in modern applications that handle rapidly growing demands while providing uninterruptible services. Cloud computing systems fundamentally provide access to large pools of data and computational resources. Eucalyptus is a software framework widely used to implement private clouds and hybrid-style Infrastructure as a Service. It implements the Amazon Web Service (AWS) API, allowing interoperability with other AWS-based services. This article investigates the software aging effects in the Eucalyptus framework, considering workloads composed of intensive requests for remote storage attachment and virtual machine instantiations. We found problems that may be harmful to system dependability and performance, specifically RAM and swap space exhaustion, as well as excessive CPU utilization by the virtual machines. We also present an approach that applies time series analysis to schedule rejuvenation, so as to reduce the downtime by predicting the proper moment to perform the rejuvenation. We experimentally evaluate our approach using a Eucalyptus test bed. The results show that our approach achieves higher availability when compared to a threshold-triggered rejuvenation method based on continuous monitoring of resource utilization.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Performance attributes, reliability, availability and serviceability; D.4.8 [Operating Systems]: Performance—Measurements

General Terms: Measurement, Performance

Additional Key Words and Phrases: Software aging and rejuvenation; cloud computing; dependability and performance analysis; memory leak

ACM Reference Format:
Araujo, J., Matos, R., Alves, V., Maciel, P., Vieira de Souza, F., Matias Jr., R., and Trivedi, K. S. 2014. Software aging in the Eucalyptus cloud computing infrastructure: Characterization and rejuvenation. ACM J. Emerg. Technol. Comput. Syst. 10, 1, Article 11 (January 2014), 22 pages.
DOI: http://dx.doi.org/10.1145/2539122

1. INTRODUCTION

The deployment of cloud-based architectures has grown in recent years, mainly because they constitute a scalable, cost-effective and robust service platform [Peng et al. 2009; McKinley et al. 2006]. Such features are made possible by the integration of various software components that enable reservation of and access to remote computational resources by means of standard interfaces and protocols, strongly based on web services [Eucalyptus 2011]. Virtualization is an essential requirement to build a typical cloud-computing infrastructure [Armbrust et al. 2009]. Cloud-oriented data centers enable the success of massive user-centric applications, such as social networks, that have experienced a rapid increase in the number of concurrent accesses. These benefits are important, mainly to small enterprises, because they enable rapid provisioning of different levels of resource allocation while guaranteeing the performance and availability levels needed by present-day massive user-centric systems. Although performance, availability and reliability are major requirements for cloud-oriented infrastructures, an aspect usually neglected by many service providers is the effect of the software aging phenomenon [Grottke et al. 2008], which has been verified to play an important role in the reliability and performance degradation of many software systems [Grottke et al. 2008; Matias and Freitas Filho 2006; Bao et al. 2005]. While essential to elastic computing, the usage of virtual machines and remote storage volumes requires memory- and disk-intensive operations, mainly during virtual machine allocation, reconfiguration and destruction. Such operations may exhaust hardware and operating system resources [Araujo et al. 2011a] in the presence of software aging due to software faults or poor system design [Grottke et al. 2008]. The software aging effects in cloud computing environments were investigated in Araujo et al. [2011a, 2011b]. These papers demonstrate the occurrence of aging effects in a Eucalyptus-based infrastructure due to the accumulation of memory leaks. In Araujo et al. [2011c] and Matos et al. [2011], rejuvenation strategies were proposed for reducing the downtime caused by the aging effects in the Eucalyptus framework.

This research was supported in part by the NASA Office of Safety and Mission Assurance (OSMA) Software Assurance Research Program (SARP) under a JPL subcontract #1440119.
Authors’ addresses: J. Araujo, R. Matos, V. Alves, and P. Maciel, Informatics Center, Federal University of Pernambuco; F. Vieira de Souza, Statistics and Informatics Department, Federal University of Piauí; R. Matias Jr., School of Computer Science, Federal University of Uberlândia; K. S. Trivedi, Department of Electrical and Computer Engineering, Duke University.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2014 ACM 1550-4832/2014/01-ART11 $15.00
DOI: http://dx.doi.org/10.1145/2539122

ACM Journal on Emerging Technologies in Computing Systems, Vol. 10, No. 1, Article 11, Pub. date: January 2014.

In this article, we extend the study presented in Araujo et al. [2011c], including an evaluation of other aging effects that were not described there for the Eucalyptus cloud computing environment. We focus on memory and CPU utilization during consecutive attachments of remote block storage volumes. Such operations are essential to provide flexible allocation of virtual machines with minimum dependency on local storage devices or complex data replication mechanisms. In addition to the trend analysis of software aging-related data proposed in Araujo et al. [2011c], we analyze the aging effects on the Eucalyptus elastic block storage (EBS). EBS is a technology that provides flexible allocation of remote storage volumes to the virtual machines running in a cloud environment.

The remaining parts of this article are organized as follows. In Section 2, we present the fundamental concepts of the main topics discussed in this article. Section 3 presents related work, especially regarding cloud computing and software aging. Section 4 explains the test bed environment used in our experiments, including the definition of the adopted workloads. Section 5 describes the experimental studies, divided into three parts: the first experiment is performed using an EBS-based workload, whereas the second and third experiments are performed with a workload based on the virtual machine lifecycle. Section 6 summarizes our conclusions and discusses possible topics for future research.

2. BACKGROUND

The investigation of software aging in cloud computing requires a multidisciplinary approach at the intersection of several different but related topics. This section highlights the main concepts that provide the basis for this work.

2.1. Software Aging and Rejuvenation

Software aging can be defined as a growing degradation of the software’s internal state during its operational life [Grottke et al. 2008]. The causes of software aging have been identified as the accumulated effects of software fault activations [Avizienis et al. 2004] during the system runtime [Huang et al. 1995]. Aging in a software system, as in human beings, is a cumulative process. The accumulating effects of successive error occurrences directly influence the aging-related failure manifestation. These faults gradually lead the system towards an erroneous state [Huang et al. 1995]. This gradual shift, caused by the accumulation of aging effects, is the fundamental nature of the software aging phenomenon. It is important to highlight that a system fails due to the cumulative consequences of aging effects over time. For example, considering a specific load demand, an application server may fail due to unavailable physical memory, which may be caused by cumulative memory leaks that in turn may be due to a missing memory deallocation. In this case, the aging-related fault is a defect in the code that causes memory leaks; the memory leak is the observed effect of an aging-related fault. The aging factors [Grottke et al. 2008] are the input patterns that exercise the code region where the aging-related faults may reside. The aging-related effect is usually observable only after a long run of the system. The time to aging-related failure (TTARF) is an important metric for reliability and availability studies of systems suffering from aging [Bao et al. 2005]. Previous studies (e.g., Matias and Freitas Filho [2006], Bao et al. [2005], and Matias Jr. et al. [2010]) on the aging-related failure phenomenon show that the TTARF probability distribution is strongly influenced by the intensity with which the system is exposed to aging factors, such as the system workload.
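The dependence of the time to failure on workload intensity can be illustrated with a deliberately simplified, deterministic calculation. The sketch below assumes a constant leak per request and a constant request rate; the function name and all numbers are hypothetical and are not taken from the experiments in this article, where the TTARF is treated as a random variable.

```python
def time_to_exhaustion(free_bytes, leak_per_request, requests_per_second):
    """Seconds until a constant per-request leak consumes free_bytes.

    Toy model only: real aging is stochastic and workload-dependent.
    """
    if leak_per_request <= 0 or requests_per_second <= 0:
        raise ValueError("leak and request rate must be positive")
    return free_bytes / (leak_per_request * requests_per_second)


# Hypothetical figures: 2 GiB of free memory, 4 KiB leaked per request.
FREE_BYTES = 2 * 1024**3
LEAK_BYTES = 4 * 1024

light_load = time_to_exhaustion(FREE_BYTES, LEAK_BYTES, requests_per_second=1)
heavy_load = time_to_exhaustion(FREE_BYTES, LEAK_BYTES, requests_per_second=10)
```

Under these assumptions, a tenfold increase in the request rate shortens the time to exhaustion by the same factor, which is the intuition behind treating the workload as an aging factor.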

Due to the cumulative property of the software aging phenomenon, it occurs more intensively in continuously running systems that are executed over a long period of time, such as cloud-computing framework software components. In long-running executions, a system suffering from software aging exhibits an increasing failure rate due to the aging effect accumulation caused by successive errors, which degrades the integrity of the system's internal state. Problems such as data inconsistency, numerical errors, and exhaustion of operating system resources are examples of software aging consequences [Grottke et al. 2008]. Since the notion of software aging was introduced in Huang et al. [1995], many studies have been conducted in order to characterize and understand this important phenomenon. Monitoring the aging effects is essential to any aging characterization study. Many previous studies have implemented aging monitoring at different system levels; however, to the best of our knowledge, aging effects in a cloud computing environment have not been sufficiently explored. Once the aging effects are detected, mitigation mechanisms may be applied in order to reduce their impact on the applications or the operating system. The search for software aging mitigation approaches resulted in the so-called software rejuvenation techniques [Huang et al. 1995; Matias and Freitas Filho 2006; Vaidyanathan and Trivedi 2005]. Since the aging effects are typically caused by hard-to-track (residual) software faults, rejuvenation techniques aim to reduce the aging effects during the software runtime until the aging causes (e.g., a software bug) are fixed definitively. Examples of rejuvenation approaches are software restart and system reboot. In the former, the aged application process is killed and then a new process is created as a substitute. Replacing an aged process by a new one removes the aging effects accumulated during the replaced process’s runtime. Other approaches focus on different system levels; for example, Kourai and Chiba [2007] present a rejuvenation technique for virtualized environments.

A common problem during software rejuvenation is the downtime overhead caused by the restart or reboot actions, since the application or operating system is unavailable during the execution of these rejuvenation mechanisms. Matias and Freitas Filho [2006] proposed a zero-downtime rejuvenation technique for the Apache web server, which was used by Matos et al. [2011] for addressing the characteristics of some aging effects observed in the Eucalyptus cloud computing environment.


2.2. Dependability in Cloud Computing

Cloud computing provides access to computers and their functionality via the Internet or a local area network [Eucalyptus 2011]. The US National Institute of Standards and Technology (NIST) [NIST 2011] defines cloud computing as follows: “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”. Cloud types (including public, private, and hybrid) refer to the nature of access and control with respect to the use and provisioning of virtual and physical resources. The most common cloud service styles are referred to by the acronyms IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service) [Sun Microsystems 2009]. Numerous advances in software architecture have helped to promote the adoption of cloud computing. These advances support the goal of efficient application development while helping applications to be elastic and to scale gracefully and automatically [Sun Microsystems 2009]. Cloud computing is seen by some as an important forward-looking model for the distribution of and access to computing resources because it offers the following potential advantages:

—Scalability. Applications designed for cloud computing need to scale dynamically with workload demands so that performance and compliance with service level agreements remain on target [Eucalyptus 2011; Sun Microsystems 2009].

—Security. Applications need to provide access only to authorized and authenticated users, and the users need to be able to trust that their data is secure [Sun Microsystems 2009].

—Availability. Regardless of the application being provided, users of cloud applications expect them to be up and running every minute of every day [Sun Microsystems 2009].

—Reliability and Fault-Tolerance. Reliability means that applications do not fail and, most importantly, that they do not lose data [Sun Microsystems 2009]; that is, it is the ability of an application to perform and maintain its functions even under unexpected circumstances.

Many of the desirable features of a cloud system are related to the concept of dependability. There is no unique definition of dependability. By one widely adopted definition, it is the ability of a system to deliver the required services that can justifiably be trusted [Avizienis et al. 2004]. It is also defined as the property that prevents a system from failing in an unexpected or catastrophic way.

Dependability is also related to attributes such as availability and reliability. Availability is the ability of a system to perform its slated function at a specific instant of time [Trivedi et al. 2009; Xie et al. 2004; Musa 1998]. Dependability is a very important property for a cloud system, as it should provide services with high availability, high stability, high fault tolerance, and dynamical extensibility.
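As a rough numerical illustration of availability, the sketch below uses the standard steady-state ratio MTTF / (MTTF + MTTR), which is a long-run average rather than the instantaneous measure defined above. The function names and the figures are hypothetical and are not results from this article.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(availability, hours_per_year=8760):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - availability) * hours_per_year


# Hypothetical system: fails on average every 999 h, takes 1 h to restore.
availability = steady_state_availability(mttf_hours=999, mttr_hours=1)
downtime = annual_downtime_hours(availability)
```

With these assumed figures the availability is 0.999 ("three nines"), which corresponds to roughly 8.76 hours of downtime per year; shortening the outage caused by each rejuvenation action reduces the MTTR term directly.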

Because cloud computing is a large-scale distributed computing paradigm, and its applications are accessible anywhere and anytime, dependability in cloud systems becomes more important and yet more difficult to achieve [Sun et al. 2010]. The software aging effects in cloud systems may affect the performance of communication among the cloud components, as well as their dependability. The degradation of communication performance, in turn, may have an impact on the dependability of the system. Therefore, the presence of different and complex software layers in cloud systems raises the need for appropriate and effective monitoring of aging effects, as well as the need for proper rejuvenation mechanisms in order to assure the dependability aspects previously mentioned.


2.3. Time Series

A time series can be represented by a set of observations of a variable arranged sequentially over time [Kedem and Fokianos 2002]. The series values form a stochastic process, that is, a collection of random variables X(t), for each t ∈ T, where T is the index set. A probability distribution associated with the random variables is another component of a stochastic process. In most situations, the variable t represents time, but it can also represent another physical quantity, for example, space. The main applications of time series are description, explanation, process control and prediction.

Time series analysis enables one to build models that explain the behavior of the observed variable. The types of time series analysis may be divided into frequency-domain [Bloomfield 2000; Kedem and Fokianos 2002] and time-domain methods [Akaike 1969; Box and Jenkins 1970; Chatfield 1996].

Note that a model is the probabilistic description of a time series. The modeler must decide how to use the chosen model according to his or her goals. Many forecasting models are based on the method of “least squares”, which provides the basis for all related theoretical studies. Usually they are classified according to the number of parameters involved.

Regression models are among the most used for description and prediction of time series. This article adopts four models, namely the linear model, the quadratic model, the exponential growth model and the model known as Pearl-Reed logistic, which are briefly described as follows, based on the predicted value Yt of E[X(t)]:

—Linear Trend Model (LTM). This is the default model used in the analysis of trends. Its equation is given by $Y_t = \beta_0 + \beta_1 t + e_t$, where $\beta_0$ is known as the y-intercept, $\beta_1$ represents the average rate of growth per unit time, and $e_t$ is the error of fit between the model and the real series [Montgomery et al. 2008].

—Quadratic Trend Model (QTM). This model takes into account a smooth curvature in the data. Its representation is given by $Y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + e_t$, where the coefficients have similar meanings as in the previous case [Montgomery et al. 2008].

—Growth Curve Model (GCM). This is the model of trend growth or decay in exponential form. Its representation is given by $Y_t = \beta_0 \cdot \beta_1^{t} + e_t$.

—S-Curve Trend Model (SCTM). This model fits the Pearl-Reed logistic. It is usually used in time series that follow the shape of the S-curve. Its representation is given by $Y_t = 10^{a} / (\beta_0 + \beta_1 \beta_2^{t}) + e_t$.
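To make the trend models above concrete, the following minimal sketch fits two of them (LTM and GCM) in plain Python. It is an illustration under stated assumptions, not the tooling used in this article: the function names and sample data are hypothetical, the GCM is fitted by ordinary least squares on the logarithm of the series (which minimizes error in log space, not on the original scale), and the QTM and SCTM are omitted for brevity. The last helper shows how a fitted linear trend could be extrapolated to estimate when a monitored resource reaches a limit.

```python
import math


def fit_linear(ts, ys):
    """Least-squares fit of Y_t = b0 + b1 * t; returns (b0, b1)."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    return b0, b1


def fit_growth(ts, ys):
    """Fit Y_t = b0 * b1**t via a linear fit on log(Y_t); requires ys > 0."""
    log_b0, log_b1 = fit_linear(ts, [math.log(y) for y in ys])
    return math.exp(log_b0), math.exp(log_b1)


def time_to_reach(limit, b0, b1):
    """Time t at which the linear trend b0 + b1 * t reaches `limit`."""
    if b1 <= 0:
        raise ValueError("trend is not increasing; limit is never reached")
    return (limit - b0) / b1


# Hypothetical memory-usage samples (MB), one per time unit.
ts = [1, 2, 3, 4, 5]
linear_series = [105.0, 110.0, 115.0, 120.0, 125.0]      # 100 + 5 * t
growth_series = [110.0, 121.0, 133.1, 146.41, 161.051]   # 100 * 1.1**t

b0, b1 = fit_linear(ts, linear_series)
g0, g1 = fit_growth(ts, growth_series)
t_limit = time_to_reach(4096.0, b0, b1)  # when usage would hit 4096 MB
```

In a rejuvenation setting, an estimate such as `t_limit` could be used to plan the restart before the resource is exhausted, provided the chosen model actually fits the monitored series well.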

Error measures [Schwarz 1978] are adopted for choosing the model that best fits the monitored environment. MAPE, MAD, and MSD are the error measures adopted in this article:

—MAPE (Mean Absolute Percentage Error) represents the accuracy of the values of the time series expressed in percentage. This estimator is given by:

$$\mathrm{MAPE} = \frac{\sum_{t=1}^{n} \left| (Y_t - \hat{Y}_t) / Y_t \right|}{n} \times 100,$$

where $Y_t$ is the actual value observed at time $t$ ($Y_t \neq 0$), $\hat{Y}_t$ is the model calculated value and $n$ is the number of observations.

—MAD (Mean Absolute Deviation) represents the accuracy of the calculated values of the time series. It is expressed in the same unit as the data. MAD is an indicator of the error size and is given by the statistic:

$$\mathrm{MAD} = \frac{\sum_{t=1}^{n} \left| Y_t - \hat{Y}_t \right|}{n},$$

where $Y_t$, $t$, $\hat{Y}_t$ and $n$ have the same meanings as in MAPE.

—MSD (Mean Squared Deviation) is more sensitive to larger deviations than the MAD index. Its expression is given by

$$\mathrm{MSD} = \frac{\sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2}{n},$$

where $Y_t$, $t$, $\hat{Y}_t$ and $n$ have the same meanings as in the previous indices.

Fig. 1. Example of Eucalyptus-based environment.
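The three error measures translate directly into code. The sketch below is a plain-Python transcription of the MAPE, MAD and MSD definitions; the function names and the sample values are hypothetical and serve only to illustrate the calculation.

```python
def mape(actual, fitted):
    """Mean Absolute Percentage Error; every actual value must be non-zero."""
    n = len(actual)
    return sum(abs((y - f) / y) for y, f in zip(actual, fitted)) / n * 100


def mad(actual, fitted):
    """Mean Absolute Deviation, in the same unit as the data."""
    n = len(actual)
    return sum(abs(y - f) for y, f in zip(actual, fitted)) / n


def msd(actual, fitted):
    """Mean Squared Deviation; penalizes large deviations more than MAD."""
    n = len(actual)
    return sum((y - f) ** 2 for y, f in zip(actual, fitted)) / n


# Hypothetical observed values and the values produced by some fitted model.
actual = [100.0, 200.0, 400.0]
fitted = [110.0, 190.0, 400.0]

errors = {
    "MAPE": mape(actual, fitted),
    "MAD": mad(actual, fitted),
    "MSD": msd(actual, fitted),
}
```

When several candidate trend models are fitted to the same series, the one with the smallest values of these measures is the natural choice; how to break ties when the measures disagree is left open here.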

2.4. Eucalyptus Framework: An Overview

Eucalyptus is a software framework that implements scalable IaaS-style private and hybrid clouds [Eucalyptus 2010]. It was created for cloud computing research and it is interface-compatible with the commercial service Amazon EC2 (Elastic Compute Cloud) [Jones 2008; Eucalyptus 2011]. The API compatibility enables one to run the same application on Amazon and Eucalyptus environments without modification.

In general, the Eucalyptus cloud computing platform uses the virtualization capabilities (hypervisor) of the underlying computer system to enable flexible allocation of computing resources decoupled from specific hardware [Eucalyptus 2010]. There are five high-level components in the Eucalyptus architecture, each one with its own web service interface: Cloud Controller (CLC), Cluster Controller (CC), Node Controller (NC), Storage Controller (SC), and Walrus [Eucalyptus 2010]. Figure 1 shows an example of a Eucalyptus-based cloud computing environment, considering two clusters (A and B). Each cluster has one Cluster Controller, one Storage Controller, and various Node Controllers. The components in each cluster communicate with the Cloud Controller and Walrus in order to service the user requests. A user is able to employ EC2 tools as an interface to the Cloud Controller, or S3 (Amazon's Simple Storage Service) tools to access Walrus. A brief description of each component follows.

The Cloud Controller (CLC) is the front-end to the entire cloud infrastructure. The CLC is responsible for exposing and managing the underlying virtualized resources (servers, network, and storage) via the Amazon EC2 API [Sun Microsystems 2009]. This component uses web service interfaces to receive the requests of client tools on one side and to interact with the rest of the Eucalyptus components on the other side.

The Cluster Controller (CC) usually executes on a cluster front-end machine [Eucalyptus 2010, 2009], or on any machine that has network connectivity to both the nodes running Node Controllers (NC) and the machine running the Cloud Controller. CCs gather information about a set of VMs and schedule VM execution on specific NCs.

The Cluster Controller has three primary functions: schedule incoming requests to create VM instances at specific NCs, control the instance virtual network overlay, and gather/report information about a set of NCs [Eucalyptus 2009].

A Node Controller (NC) runs on each node and controls the life cycle of VM instances running on the node. Hence, the NC must interact with the operating system and the hypervisor running on the target node. A Cluster Controller (CC) manages the actions of NCs. NCs control the execution, inspection, and termination of VM instances on the node where they run, and fetch and clean up local copies of VM images. NCs query and control the system software on their nodes in response to queries and control requests from the Cluster Controller [Eucalyptus 2010]. An NC makes queries to discover the node's physical resources (number of CPU cores, amount of memory, available disk space) as well as to learn about the state of VM instances on the node [Eucalyptus 2009; Johnson et al. 2010].

The Storage Controller (SC) provides persistent block storage for use by the VMs. It implements block-accessed network storage, similar to that provided by Amazon Elastic Block Storage (EBS) [Amazon 2011a], and it is capable of interfacing with various storage systems (e.g., NFS, iSCSI). An elastic block storage is a block device that can be attached to a virtual machine but sends disk traffic across the locally attached network to a remote storage location. An EBS volume cannot be shared across VM instances [Johnson et al. 2010].

Walrus is a file-based data storage service, which is interface-compatible with Amazon's Simple Storage Service (S3) [Eucalyptus 2009]. Walrus implements a REST interface (through HTTP), sometimes called the “Query” interface, as well as SOAP interfaces that are compatible with S3 [Eucalyptus 2009; Johnson et al. 2010]. Users that have access to Eucalyptus can use Walrus to stream data into/out of the cloud as well as from VM instances that they have started on the nodes. Additionally, Walrus acts as a storage service for VM images. Root filesystem as well as OS kernel and ramdisk images used to instantiate VMs on the nodes can be uploaded to Walrus and accessed from nodes.
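The relationship among the five components can be summarized as a small data model. The sketch below is purely illustrative: the class names, field names and host names are hypothetical and do not correspond to any Eucalyptus API. It only mirrors the structure described above, with one Cloud Controller and one Walrus per cloud, and one Cluster Controller, one Storage Controller and several Node Controllers per cluster.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """One cluster: a CC, an SC, and the hosts running Node Controllers."""
    name: str
    cluster_controller: str
    storage_controller: str
    node_controllers: List[str] = field(default_factory=list)


@dataclass
class EucalyptusCloud:
    """Cloud-level view: a single CLC and Walrus serving every cluster."""
    cloud_controller: str
    walrus: str
    clusters: List[Cluster] = field(default_factory=list)

    def all_nodes(self) -> List[str]:
        """Hosts running a Node Controller, across all clusters."""
        return [nc for c in self.clusters for nc in c.node_controllers]


# Hypothetical two-cluster layout in the spirit of the Figure 1 example.
cloud = EucalyptusCloud(
    cloud_controller="clc-host",
    walrus="walrus-host",
    clusters=[
        Cluster("A", "cc-a", "sc-a", ["nc-a1", "nc-a2"]),
        Cluster("B", "cc-b", "sc-b", ["nc-b1", "nc-b2"]),
    ],
)
nodes = cloud.all_nodes()
```

Nothing prevents several roles from sharing one physical host in such a model, since each role is only a host name.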

3. RELATED PAPERS

The characteristics, architectures and applications of several popular cloud computing platforms are analyzed and discussed in Peng et al. [2009], which aims to clarify the differences among the investigated platforms. The authors conclude that although each cloud computing platform has its own strengths, there are many unsolved issues in all of them. Such issues include the continuous or high availability mechanisms of cluster failover in cloud environments, consistency guarantees, synchronization across different clusters, interoperation, standardization, and security.

In Cordeiro et al. [2010], a comparative analysis of three popular cloud computing solutions (Xen Cloud Platform, Eucalyptus, and OpenNebula) is presented. The paper also describes illustrative examples of the use of each platform, and it proposes that by understanding some of the main differences between them, one may decide where and when each solution may be more appropriate for use.

Iosup et al. [2011] investigate the performance of cloud computing services for scientific computing workloads, quantifying the presence in real scientific computing workloads of Many-Task Computing (MTC) users, that is, users who employ loosely coupled applications comprising many tasks to achieve their scientific goals. That study was followed by an empirical evaluation of the performance of four commercial cloud computing services. Last, trace-driven simulation was used to compare the performance characteristics and cost models of clouds and other scientific computing platforms, for general and MTC-based scientific computing workloads. The results indicate that the current clouds need an order of magnitude in performance improvement to be useful to the scientific community, and show which improvements should be considered first to address this discrepancy between offer and demand.

Mihailescu et al. [2011] proposes improvements for the resilience of cloud appli-cations to infrastructure anomalies, by means of OX, a runtime system that usesapplication-level availability constraints and application topologies discovered on thefly. This system allows application owners to specify groups of highly available virtualmachines. To discover application topologies, OX monitors network traffic among vir-tual machines transparently, and based on this information dynamically implementsVM placement optimizations to enforce application availability constraints and reduceand/or alleviate application exposure to network communication anomalies, such astraffic bottlenecks.

A technique for providing high availability in virtualized environments, called Remus, is presented in Cully et al. [2008]. It is an extension to the Xen hypervisor, which works by continually live-migrating a VM from the primary host to the backup. Such an approach prevents outages due to hardware failures and unusual software bugs, but it cannot avoid or fix problems commonly caused by software aging. In fact, a continuous software replication mechanism may copy bad aspects of system state, such as memory leaks or fragmentation, resulting from a faulty application.

Considering software aging in application domains other than cloud computing, Matias and Filho [2010] present a study in which they explored the Linux OS kernel using instrumentation techniques to measure software aging effects. Carrozza et al. [2010] propose a practical approach to detect aging phenomena caused by memory leaks in distributed objects in an Off-The-Shelf middleware that is commonly used to develop critical applications. The approach, which is validated on a real-world case study from the Air Traffic Control domain, defines algorithms and support tools for data filtering and for trading off experimentation time and statistical accuracy of aging trend estimates.

Machida et al. [2010] present an availability analysis of virtualized servers, focusing on aging and rejuvenation of virtual machine monitors (VMMs or hypervisors), which are important components of every cloud computing infrastructure. Machida et al. [2010] used stochastic reward nets to analyze the events of failure, repair, and preventive maintenance. Those analytical models helped to find an optimal combination of time intervals to perform the VM and VMM rejuvenation, aiming to achieve high service availability and minimal loss of transactions. Similar works on rejuvenation in virtualized systems are found in Paing and Thein [2012] and Rezaei and Sharifi [2010].

4. TESTBED ENVIRONMENT

We built a test bed composed of six machines (2.66-GHz Core 2 Quad processors, 4-GB RAM, 500-GB SATA hard disk). Three experimental studies were carried out in this infrastructure. For experiment #1, five of the six physical machines ran the Ubuntu Server 10.04 (Linux kernel 2.6.38-8) and the Eucalyptus System version 2.0.2. One machine, used as a client for the cloud, ran the Ubuntu Desktop 11.04 (Linux kernel 2.6.38-8 x86-64). The operating system running in the virtual machines was a customized version of Ubuntu Server Linux 9.04 that runs an HTTP server. For experiments #2 and #3, the operating system used was the Ubuntu Server 10.04 (kernel 2.6.35-24) with the Eucalyptus System version 1.6.1. The cloud environment under test was fully based on the Eucalyptus framework and the KVM [KVM 2012] hypervisor. Figure 2 shows the components of our test bed. The Cloud Controller, Cluster Controller, Storage Controller and Walrus were installed on the same machine (host 1 in our environment), and the VMs were instantiated on four physical machines (hosts 2, 3, 4 and 5) so that each of them ran a Node Controller. Each host has the capacity to run at most four VMs. A single machine (client host) was used to monitor the entire environment and also to perform requests to the Cloud Controller, acting as a client for the cloud infrastructure implemented on our test bed. All nodes were connected to a private local area network, by means of a dedicated switch.

Fig. 2. Components of the testbed environment.

The first experimental study considered the management of elastic block storage volumes assigned to the virtual machines. The second and third ones aimed at the effects of the life cycle of the virtual machines, so they stressed the instantiation of virtual machines in the nodes of the cloud. Each study used a specific workload that was designed to accelerate the corresponding aging effects.

4.1. Description of the Workload #1: Management of EBS Volumes

For the first experiment, the environment was monitored for 300 hours. The duration of the experiment was defined based on previous works [Araujo et al. 2011b, 2011c] and empirical observation of the time elapsed until the occurrence of aging symptoms, considering the adopted workload that could accelerate possible faults and the manifestation of related aging effects. The workload exercised the Eucalyptus features that manage remote storage volumes assigned to the virtual machines. The Eucalyptus commands euca-attach-volume and euca-detach-volume were used for this purpose. The workload generation was implemented by a set of scripts that started 10 VMs and repeatedly attached and detached 1-gigabyte remote volumes to and from the VMs. The use of elastic block storage enabled failover mechanisms such as the reboot of VMs in a different physical machine. Therefore, when a host failed, data and application were kept in a consistent state.

Figure 3 represents the workload used in the first experiment. There are 50 storage volumes (Volume_1, ..., Volume_50) available in the test environment, and 10 virtual machines (VM_1, ..., VM_10). At the beginning of the experiment, each VM has one volume assigned to it, so Volume_i is attached to VM_i, where i = 1 to 10. Every 30 seconds, the script detaches the current volumes from all VMs, waits 10 seconds and attaches new volumes from the next range, Volume_11 to Volume_20. When the current volumes to be detached are in the last range, Volume_41 to Volume_50, the assignment returns to the initial range, Volume_1 to Volume_10. This workload script executes these operations for 300 hours, while measurement scripts collect data at 1-minute intervals in each physical machine of the test bed. We collect memory usage for the Eucalyptus-related processes by means of the Linux /proc pseudo-filesystem [Canonical 2011], which exposes much information about the running processes and the overall system. The system programs mpstat and free [Blum 2008] have also been used to gather measures such as CPU utilization, swap memory utilization and the number of zombie processes. The processes responsible for the cloud controller, node controllers and virtual machines were also monitored. For each of these processes, the CPU usage and the resident and virtual memory utilizations were recorded.

Fig. 3. EBS management workload.
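The rotation of volumes in workload #1 can be sketched in a few lines. The following Python fragment is only an illustration under our reading of the description: it computes which volumes are attached in a given cycle and does not invoke any Eucalyptus command. The function name volumes_for_cycle and its parameters are hypothetical and do not come from the original scripts.

```python
from typing import List


def volumes_for_cycle(cycle: int, n_vms: int = 10, n_volumes: int = 50) -> List[int]:
    """Return the volume numbers attached to VM_1..VM_n in a given cycle.

    Cycle 0 is the initial assignment (Volume_1..Volume_10). Each later
    cycle moves to the next block of n_vms volumes and wraps around after
    the last block, as in the workload description.
    """
    blocks = n_volumes // n_vms            # 5 blocks of 10 volumes
    first = (cycle % blocks) * n_vms + 1   # first volume of the current block
    return list(range(first, first + n_vms))


if __name__ == "__main__":
    for cycle in range(6):
        vols = volumes_for_cycle(cycle)
        print(f"cycle {cycle}: Volume_{vols[0]} .. Volume_{vols[-1]}")
```

In this sketch, VM_i receives the i-th volume of the returned list, so cycle 5 reproduces the initial assignment.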

4.2. Description of the Workload #2: Management of VM Lifecycle

For the study on VM lifecycle management, we changed the previously mentioned testbed infrastructure. Now, three nodes have a 32-bit (i386) version of the Ubuntu Server Linux OS, whereas one has the 64-bit (amd64) version of the same OS platform, which allowed us to capture possible different aging effects related to the system architecture. The environment was monitored for 30 days for experiment #2 and 72 hours for experiment #3. As in experiment #1, the duration of these experiments was based on empirical observations of the time elapsed until the manifestation of some aging symptoms, considering the workload that we adopted to accelerate the lifecycle of the virtual machines. Such a lifecycle is composed of four states: Pending, Running, Shutting down, and Terminated, as shown in Figure 4. Scripts are used to start, reboot and kill the VMs in a short time period. Such operations are essential to this kind of environment, because they enable quick scaling of capacity, both up and down, as the computing requirements change (the so-called elastic computing pattern) [Amazon 2011b]. Cloud-based applications adapt themselves to increases in demand by instantiating new virtual machines, and save resources by terminating underused VMs when the load of requests is low. VM reboots are also essential to high-availability mechanisms that automatically restart VMs on other physical servers when a server failure is detected. Our workload was implemented by means of shell script functions that performed the operations just mentioned:

—Instantiate Function. This function instantiates 8 VMs in a cluster. The VMs are instances of an Ubuntu Server running an HTTP server.

—Kill Function. This function finds out which VM instances are running in the cloud and kills all of them.

—Reboot Function. Much like the previous function, it also finds all the existing VM instances, but instead of killing them, this function executes their reboot.


Fig. 4. VM lifecycle management workload.

Every two minutes, the script checked whether more than two hours had passed since the beginning of the last initialization. If so, all VMs were killed; otherwise, all VMs were rebooted. We chose these intervals empirically (two minutes between reboot operations and two hours between executions of the kill function) because they yield a number of executions within the monitoring period that keeps the Eucalyptus infrastructure under constant stress.
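The decision taken at each two-minute check can be sketched as a small simulation. The code below is a hypothetical Python rendering, not the original shell scripts; in particular, it assumes that the VMs are instantiated again immediately after a kill, which the text implies through the workload cycle but does not state explicitly.

```python
CHECK_INTERVAL_S = 2 * 60        # the script wakes up every two minutes
KILL_INTERVAL_S = 2 * 60 * 60    # VMs older than two hours are killed


def simulate(duration_s: int):
    """Return the (time_s, operation) pairs issued during duration_s."""
    ops = [(0, "instantiate")]
    last_init_s = 0
    t = CHECK_INTERVAL_S
    while t <= duration_s:
        if t - last_init_s > KILL_INTERVAL_S:
            ops.append((t, "kill"))
            # Assumption: a new set of VMs is started right after the kill.
            ops.append((t, "instantiate"))
            last_init_s = t
        else:
            ops.append((t, "reboot"))
        t += CHECK_INTERVAL_S
    return ops


if __name__ == "__main__":
    ops = simulate(3 * 60 * 60)
    kills = [t for t, op in ops if op == "kill"]
    reboots = [t for t, op in ops if op == "reboot"]
    print(f"kills at {kills}, {len(reboots)} reboots in three hours")
```

Under these assumptions, most checks result in a reboot and only about one check in sixty results in a kill, which matches the intent of stressing the reboot path continuously.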

The environment was monitored for about two hours without workload. Then, the control script instantiated all VMs and followed the workload cycle previously described. The monitoring period without any workload was included to allow a comparison between the data obtained with and without workload.

5. EXPERIMENTAL RESULTS

In this section, we present the results of the three experiments performed in the Eucalyptus environment. Experiment #1 is related to the management of EBS volumes; experiment #2 is based on the usage of system-wide resources during virtual machine lifecycle management; and experiment #3 shows the results related to the degradation of application-specific resources during virtual machine lifecycle management.

5.1. Experiment #1: Management of EBS Volumes

The data collected in this experiment show some important aging effects on the elastic block storage management of Eucalyptus. Specifically, the analysis of resource utilization in one of the cluster nodes (host 2) indicates aging symptoms in some components of this cloud infrastructure. Figure 5 shows that the virtual memory consumed by the Eucalyptus Node Controller process (apache2/var/run/eucalyptus/httpd-nc.conf) increases linearly during the entire experiment. We modeled this increase through a linear regression. The result is the equation Y_t = 356842 + 2.32 * t, where Y_t is the predicted amount of memory usage at time t. The MAPE for this linear model is 0.00099, which confirms the goodness of fit for the obtained model. Figure 6 shows the plot of real and fitted values, along with the three error indices. The mean absolute deviation (MAD) is 368 KB, a small value if we consider that only increases in the order of megabytes would deserve attention. Table I shows the predicted values, considering that this Eucalyptus infrastructure would receive the workload for a longer period of time, expressed in months. A single node controller process is expected to reach about 1.5 GB of virtual memory in a period of one year. This behavior may have important consequences for the performance of virtual machines that run on this node. It is important to highlight that Eucalyptus is the software infrastructure for the management and execution of VMs and it should not consume, to a large extent, the resources supposed to be provisioned for the VMs.

Fig. 5. Virtual memory utilization by the node controller process.

Fig. 6. Trend analysis for the virtual memory of the node controller.

Table I. Prediction for Virtual Memory Values

Time (month)    Predicted (KB)
 2                557,290
 4                757,738
 6                958,186
 8              1,158,634
10              1,359,082
12              1,559,530
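The values in Table I can be reproduced from the fitted line. The sketch below assumes that t is measured in minutes (the sampling interval of the measurement scripts) and that one month corresponds to 30 days; both are inferences from the numbers in the table rather than statements made in the text, and the function name is hypothetical.

```python
INTERCEPT_KB = 356842             # intercept of the fitted line, in KB
SLOPE_KB_PER_MIN = 2.32           # growth per minute, in KB
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a month of 30 days


def predicted_virtual_memory_kb(months: float) -> float:
    """Evaluate Y_t = 356842 + 2.32 * t with t expressed in minutes."""
    t_min = months * MINUTES_PER_MONTH
    return INTERCEPT_KB + SLOPE_KB_PER_MIN * t_min


if __name__ == "__main__":
    for months in (2, 4, 6, 8, 10, 12):
        kb = round(predicted_virtual_memory_kb(months))
        print(f"{months:2d} months: {kb:,} KB")
```

With these assumptions, the 12-month prediction is about 1.5 GB, in line with the figure quoted in the text.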

The utilization of resident memory, presented in Figure 7, also has a trend of growth, but it drops at around 10,000 minutes, because the Linux memory management starts transferring part of the data from RAM to the swap area on the hard disk. This phenomenon is confirmed by checking the swap utilization, in Figure 8, which starts increasing at the same time. Swap space in Linux is commonly used when the amount of available physical memory (RAM) is low. If the system needs more memory resources and the RAM is full, inactive pages in memory are moved to the swap space. While swap space can help systems with a small amount of RAM, it should not be considered a replacement for physical memory. Swap space is located on hard drives, which have a much slower access time than physical memory. Memory latency is measured in nanoseconds, while disk latency is measured in milliseconds, so accessing the disk can be tens of thousands of times slower than accessing physical memory. In such situations the system is struggling to find free memory and keep applications running at the same time. A system administrator may add more physical RAM to the system, but it is better to mitigate the aging issues or to fix the leak through a software update.

Fig. 7. Resident memory utilization by the node controller process.

Fig. 8. Swap utilization in host 2.

In Figure 8, we can also see that the used swap space reaches about 450 MB, whereas the resident memory of the node controller decreases by only about 100 MB. The remaining amount of memory comes from the virtual machines allocated in that host. Figure 9 shows the resident memory utilization for both virtual machines (KVM processes) that were running in host 2. It is important to emphasize that we measure the memory used by the KVM processes responsible for each VM at the host machine, and not inside the guest operating system running in the VM. A growth tendency similar to the one detected in the node controller process was observed, as well as a drop at the mark of about 10,000 minutes. Notice that each virtual machine has a more significant memory utilization growth than the node controller. They reach almost 256 MB, which is the total size of RAM configured for each VM in our environment. Therefore, each VM was near the maximum usage of physical memory, reinforcing the harmfulness of this aging phenomenon due to its possible performance consequences. The aging effects in the virtual machines also appear as increasing CPU utilization over time. Figure 10 shows that CPU utilization in virtual machines reaches almost 90%, confirming that the system performance degrades as the workload requests are processed over time. This can be considered one of the most critical results observed in this experiment, because such a high CPU utilization can make the system take too long to respond and even cause failures in the execution of new requests [Witkon 2007; Sousa et al. 2009]. These results reveal that the system must be carefully evaluated and tuned, possibly developing and applying patches to Eucalyptus as a preventive measure to accommodate further workload demands.

Fig. 9. Resident memory utilization for VMs at host 2.

Fig. 10. CPU utilization for VMs at host 2.

Figure 11 shows another very important effect of system degradation, which is the number of unsuccessful attachment/detachment requests during the experiment. Since 10 attachments and 10 detachments are done in each workload cycle, the maximum number of errors reported is 20 for each graph point. The increase in the number of errors begins at a point close to the beginning of swap usage growth in the node, indicating some relationship between these aging symptoms. The errors returned in the workload requests are due to attempts to attach a remote volume to a device that is already in use by the VM. This indicates that the detachment time increases as the experiment goes on. This tendency becomes critical when the 10-second interval between detachments and new attachments is no longer sufficient.

Fig. 11. Errors in workload requests for volumes attachment/detachment.

5.2. Experiment #2: Virtual Machines Instantiation Workload - System Wide Resources

We analyzed the utilization of hardware and software resources in a scenario where the adopted workload performed operations related to instantiation of virtual machines. The main goal of this case study was to verify the existence of software aging symptoms in the scenario of VM lifecycle management. Some indicators of software aging had already been observed, but there was a need for confirmation and characterization of this phenomenon. Therefore, the experiment was performed during a 30-day period. The extent of the experiment was based on empirical observations of the time elapsed until the manifestation of possible aging symptoms, considering the workload that was adopted to stress the system.

The CPU utilization and swap space usage, for the cloud controller machine (host 1) and node controller 64-bit machine (host 4), are among the most representative results found in this scenario. Other results are found in Araujo et al. [2011a] and Matos et al. [2011].

Figure 12 shows the CPU utilization results of the cloud controller machine. During almost the entire experiment, the CPU usage does not exceed 5%. However, some major growth spurts following a nearly linear pattern can be observed throughout the experiment. We observe that such peaks of resource usage increase over time, which may be a sign of progressive performance degradation.

There was also a considerable growth in swap memory use in the cloud controller machine and in the node controller machine that runs the 64-bit OS. In Figure 13, we can see that this growth has come close to 14 MB and 3.5 MB, respectively. The growth is constant, without drops, since the host continued responding to the requests of VM instantiation throughout the experiment. However, even though the service did not stop, this behavior deserves attention because the swap space is a limited resource, and over a longer period this growth may lead to resource depletion and then to a system crash. It may also cause performance issues, similar to the case observed in the previous experimental study.


Fig. 12. CPU utilization in the cloud controller machine (Host 1).

Fig. 13. Swap memory used.

5.3. Experiment #3: Virtual Machines Instantiation Workload - Application Specific Resources

Based on the experimental study #2 and previous works [Araujo et al. 2011a], we noticed some software aging symptoms in the Eucalyptus node controller, related to virtual machine instantiations. The repeated operations (instantiation, reboot and termination) revealed an ever-increasing usage of virtual memory, which disrupted the VM processes on the node controller in 32-bit operating systems, and the node controller, in turn, stopped responding to VM instantiation commands. This behavior constitutes an aging phenomenon of the Eucalyptus framework related to memory leaks. A manual restart of the Eucalyptus node controller service made the virtual memory usage fall to less than 110 MB. After the service is restarted (by means of manual intervention), the same pattern is repeated. As can be seen in Figure 14, the process’s virtual memory grew to about 3064 MB and again the node controller was not able to service the requests for virtual machine instantiation. This section further explains the rejuvenation method that was adopted for mitigating the harmful consequences of this aging phenomenon.


Fig. 14. Virtual memory used in the NC process at Host3 [Araujo et al. 2011a].

Fig. 15. Classification of rejuvenation strategies.

Adopted Rejuvenation Method. We propose an automated method for triggering a rejuvenation mechanism in private cloud environments that are backed by the Eucalyptus cloud-computing framework. Our method is based on the rejuvenation mechanism originally presented in Matias and Freitas Filho [2006], which sends a software signal (SIGUSR1) to the apache master process, so that all idle apache slave processes are terminated and new ones are created to replace them. This action cleans up the accumulated memory leak effects inside the replaced apache processes, and has a small impact on the service, since the master process is still able to wait for new connections while the slave processes’ rejuvenation occurs. Creating a new process to replace the old one causes a downtime of around 5 seconds, due to the loading of Eucalyptus configurations during process startup, as observed in previous experiments. In a production environment, the interval between process restarts should be as large as possible, in order to reduce the summed effects of small downtimes during a large runtime period. One approach to determine that maximum interval is based on high-frequency monitoring of process memory usage. At the exact moment when a memory limit is reached, the rejuvenation is triggered. Since a small sampling interval may affect system performance, we believe that a 1-minute interval is the minimum that avoids interference with the system. Although this approach can provide good results, it has a problem: the node controller process may reach its memory limit between two monitoring epochs. An additional downtime is introduced in this way, which we will describe as monitor-caused downtime.
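The monitor-caused downtime can be illustrated with a small calculation. The sketch below assumes, for illustration only, that the process becomes unavailable at the instant the memory limit is crossed and that the monitor samples at fixed multiples of the sampling interval. The helper detection_delay_s is hypothetical and is not part of the monitoring scripts used in the experiments.

```python
import math


def detection_delay_s(crossing_time_s: float, sampling_interval_s: float = 60.0) -> float:
    """Time between the limit being crossed and the next monitoring epoch.

    Samples are assumed to be taken at 0, T, 2T, ... seconds, where T is
    the sampling interval. The returned delay lies in the range [0, T).
    """
    next_sample_s = math.ceil(crossing_time_s / sampling_interval_s) * sampling_interval_s
    return next_sample_s - crossing_time_s


if __name__ == "__main__":
    for crossing in (120.0, 125.0, 179.0):
        delay = detection_delay_s(crossing)
        print(f"limit crossed at {crossing:.0f} s, detected {delay:.0f} s later")
```

Under this simplified model the extra downtime per rejuvenation is bounded by the sampling interval, which is why shortening the interval alone trades monitoring overhead for detection latency.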

Our proposed triggering mechanism aims to remove this additional downtime while keeping the interval between process restarts as large as possible. The prediction of when the critical memory utilization (CMU) will be reached is used for this purpose. Therefore, considering the classification of rejuvenation strategies presented in Figure 15, our approach has characteristics of two categories, since it is a threshold-based rejuvenation but it is aided by predictions. Time-series fitting enables us to perform a trend analysis and therefore state, with an acceptable error, the time remaining until the process reaches the CMU. This information makes it possible to schedule the rejuvenation for a given time (T_rej), which takes into account a safe limit (T_safe) to complete the rejuvenation process before the CMU is reached. The safe limit should encompass the time spent during the rejuvenation and the time relative to the time-series prediction error. Therefore, we can state that T_rej = T_CMU − T_safe = T_CMU − (T_restart + T_PredError).

Fig. 16. Forecasting and threshold-based hybrid rejuvenation approach.
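The scheduling rule can be written as a one-line function. The Python sketch below is a minimal illustration; the function and parameter names are ours, and the sample values (a predicted T_CMU of 809 minutes, a 5-second restart, and a 5-minute margin for prediction error) are the ones the article reports for its rejuvenation experiment.

```python
def rejuvenation_time_min(t_cmu_min: float, t_restart_min: float, t_pred_error_min: float) -> float:
    """Return T_rej = T_CMU - (T_restart + T_PredError), all in minutes."""
    t_safe_min = t_restart_min + t_pred_error_min
    return t_cmu_min - t_safe_min


if __name__ == "__main__":
    # 5 seconds of restart time expressed in minutes, plus a 5-minute margin.
    t_rej = rejuvenation_time_min(809.0, 5.0 / 60.0, 5.0)
    print(f"rejuvenation scheduled {t_rej:.2f} minutes after the start")
```

Keeping T_restart and T_PredError as separate arguments makes it easy to widen only the prediction margin when the observed forecasting error grows.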

As can be seen in Figure 16, the trend analysis is started only after the monitoring script detects that the node controller process has grown over a time series computation starting point, TSCSP, which in our case is 80% of the critical memory utilization. This starting point was adopted to avoid unnecessary interference on the system due to the periodic computation of time series fitting. When a limit of 95% of CMU is reached, the last prediction generated by the trend analysis is used to schedule the rejuvenation action, that is, the last computed T_CMU will be used to assess T_rej, and the system will be prepared so that the rejuvenation occurs gracefully at time T_rej. Once this time series computation final point, TSCFP, is reached, there is no benefit in continuing to compute new estimates, and it would be a risk to postpone the scheduling of the rejuvenation action. Note that the values adopted for TSCFP and TSCSP are specific to our environment, and therefore may vary if this strategy is instantiated for other kinds of systems.
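The two thresholds partition the monitored memory range into three phases. The following Python sketch shows one possible way to encode that decision; the enumeration, the function name, and the exact comparison operators at the boundaries are assumptions made for illustration, while the 80% and 95% fractions are the values adopted in this environment.

```python
from enum import Enum


class Phase(Enum):
    MONITOR_ONLY = "monitor only"          # below TSCSP: no fitting needed
    FORECASTING = "fit time series"        # between TSCSP and TSCFP
    SCHEDULE = "schedule rejuvenation"     # TSCFP reached: use last T_CMU


def current_phase(memory_kb: float, cmu_kb: float,
                  start_fraction: float = 0.80,
                  final_fraction: float = 0.95) -> Phase:
    """Classify a memory sample against the two thresholds."""
    if memory_kb >= final_fraction * cmu_kb:
        return Phase.SCHEDULE
    if memory_kb > start_fraction * cmu_kb:
        return Phase.FORECASTING
    return Phase.MONITOR_ONLY


if __name__ == "__main__":
    cmu_kb = 1000.0  # illustrative CMU, not a measured value
    for sample in (500.0, 850.0, 950.0):
        print(sample, current_phase(sample, cmu_kb).value)
```

A monitoring loop built on this function would refit the trend on every sample while in the forecasting phase and stop refitting once the scheduling phase is entered.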

Result Analysis. We carried out an experimental study to verify the effectiveness of our proposed rejuvenation method on the described Eucalyptus cloud computing environment. We focused on the rejuvenation of the Node Controller process, since it has shown the major aging effects among all monitored components. This experiment was executed for 72 hours because it was already known, from experiment #2, that such a time interval was enough to obtain the expected aging results.

It is noteworthy that the use of time series to reduce the system downtime during the execution of a rejuvenation action may also be applied to other computing environments. The time series itself is not involved directly in the system, but indicates the appropriate time at which an action should be taken. We used data collected in preliminary experiments to find out which kind of time series provides a better fit for the growth of virtual memory usage in Node Controller processes.

We used four models (LTM, QTM, GCM, and SCTM) for trend analysis. A summary of the fitting results and their errors is shown in Table II, where Y_t is the predicted value of the memory consumption at time t. It can be seen from Table II that the values of the indices MAPE, MAD, and MSD are smaller for the LTM and QTM models, so the choice must be made between these two models. Although the MAPE values are the same for these two models, the other index values of the QTM model are smaller than those of the LTM model. So the QTM model was chosen as the best fit for the trend analysis of virtual memory utilization in Eucalyptus node controllers, and therefore it was used for our rejuvenation scheduling.
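The selection argument can be expressed compactly. The sketch below takes the accuracy indices reported in Table II and orders the models by MAPE first, then MAD, then MSD; this lexicographic ordering is our formalization of the reasoning in the text, not a procedure stated by the authors.

```python
# Accuracy indices from Table II: model -> (MAPE in %, MAD, MSD).
TABLE_II = {
    "LTM": (1, 900, 1472447),
    "QTM": (1, 872, 1343698),
    "GCM": (6, 6014, 52619294),
    "SCTM": (2, 1259, 3449812),
}


def best_model(indices: dict) -> str:
    """Pick the model with the smallest (MAPE, MAD, MSD) tuple.

    Python compares tuples element by element, so MAD breaks ties in
    MAPE and MSD breaks ties in MAD.
    """
    return min(indices, key=indices.get)


if __name__ == "__main__":
    ranking = sorted(TABLE_II, key=TABLE_II.get)
    print("ranking:", ranking)
    print("selected:", best_model(TABLE_II))
```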


Table II. Summary of the Accuracy Indices for Each Model (NC Virtual Memory)

Model   Y_t                                            MAPE   MAD    MSD
LTM     44157.1 + 2.85t                                1%     900    1472447
QTM     43354.8 + 2.95830t − 0.000002t^2               1%     872    1343698
GCM     53013.6 × (1.00003^t)                          6%     6014   52619294
SCTM    (10^6)/(4.70218 + 16.8036 × (0.999942^t))      2%     1259   3449812

Fig. 17. Quadratic trend analysis of virtual memory.

Table III. Comparison of Experiments

                   Threshold-based rejuvenation   Proposed rejuvenation method
Availability       0.999584                       0.999922
Number of nines    3.38                           4.11
Downtime           108 seconds                    20 seconds

The actual experimental study for verifying the rejuvenation method was performed in two parts. First, the cloud environment was stressed with the workload described in Section 4.2, and the rejuvenation action was triggered only when the monitoring script detected that the critical limit was reached. Next, the same workload was used, but the rejuvenation was scheduled based on the time-series predictions.

Figure 17 shows the trend analysis for the growth of virtual memory utilization, fit by a quadratic function Y_t = 94429 + 3825.3t − 0.0686t^2. Such an analysis provided a value of 809 minutes for T_CMU, that is, the predicted time to reach the 3-GB limit. Knowing the beginning time of the experiment, we scheduled the rejuvenation for T_rej = 809 − (5/60 + 5), in minutes, counting from the beginning of the experiment. After rejuvenation, the memory usage is reduced, and another trend analysis is carried out when the 80% limit is reached again.
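The predicted time to reach the limit follows from solving the fitted quadratic for t. The sketch below is a hedged reconstruction: it assumes that Y_t is in KB, that t is in minutes, and that the 3-GB limit corresponds to 3 × 1024 × 1024 KB. None of these units is stated explicitly for this equation, but together they reproduce the reported value of about 809 minutes. The function name is hypothetical.

```python
import math


def time_to_reach_kb(limit_kb: float, a: float = 94429.0,
                     b: float = 3825.3, c: float = -0.0686) -> float:
    """Smallest positive t with a + b*t + c*t**2 == limit_kb."""
    # Rearranged as c*t^2 + b*t + (a - limit) = 0 and solved in closed form.
    disc = b * b - 4.0 * c * (a - limit_kb)
    if disc < 0:
        raise ValueError("the fitted curve never reaches the limit")
    roots = [(-b + sign * math.sqrt(disc)) / (2.0 * c) for sign in (1.0, -1.0)]
    positive = [r for r in roots if r > 0]
    if not positive:
        raise ValueError("the limit is only reached at negative times")
    return min(positive)


if __name__ == "__main__":
    limit_kb = 3 * 1024 * 1024  # assumption: 3 GB expressed in KB
    print(f"T_CMU is about {time_to_reach_kb(limit_kb):.1f} minutes")
```

Because the quadratic coefficient is negative, the curve has a second, much later crossing on its descending branch; taking the smallest positive root selects the first time the limit is reached.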

The results show that the proposed rejuvenation triggering method brings system availability to a higher level, when compared to the threshold-based rejuvenation approach. The number of nines increases from 3.38 to 4.11 (see Table III). Over one year, such a difference means a decrease from 218 minutes to 40 minutes of downtime, that is, the unavailable time was reduced by about 80%. Such an enhancement in the system availability avoids the loss of requests for instantiation of virtual machines, or any other similar users’ requests.
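The number of nines and the yearly downtime can be derived from the availability values in Table III. The sketch below uses the common definition of the number of nines as the negative base-10 logarithm of the unavailability, and assumes a year of 365 days; both are our assumptions, chosen because they reproduce the reported figures.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def number_of_nines(availability: float) -> float:
    """Return -log10(1 - A), the usual 'number of nines' measure."""
    return -math.log10(1.0 - availability)


def annual_downtime_min(availability: float) -> float:
    """Expected downtime per year, in minutes, at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, a in (("threshold-based", 0.999584), ("proposed", 0.999922)):
        print(f"{label}: {number_of_nines(a):.2f} nines, "
              f"{annual_downtime_min(a):.1f} min/year")
```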

Table IV presents the absolute percentage error between the predictions and the actual values for the virtual memory utilization in this experiment. The error varies over the range from two to ten hours, but it is below 10% at most analyzed points, which means that our approach achieved an acceptable accuracy in its predictions, and it may be enhanced by using the last prediction errors to adjust the related threshold. The best approximations are obtained at four and eight hours of experiment, which are the points where the “fit” line intercepts the “actual” line in Figure 17. The worst errors in Table IV correspond to the regions where the lines in Figure 17 have the largest distances, showing that the 12.57% error at 120 minutes is close to the upper bound of prediction errors in this experiment. Such a maximum error in the predictions provided by the time series enables the use of our approach in other environments that have similar aging characteristics.

Table IV. Comparison of Virtual Memory Predictions and Actual Values

Time (min)   Predicted (KB)   Actual (KB)   Error (%)
120           608181           695624       12.57%
240          1069185          1064728        0.42%
360          1530189          1433776        6.72%
480          1991193          1977272        0.70%
600          2452197          2601880        5.75%
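The error column of Table IV can be recomputed from the predicted and actual values. The sketch below takes the absolute difference relative to the actual value, which is the convention that reproduces the reported percentages; the variable and function names are ours.

```python
# Rows of Table IV: (time in minutes, predicted KB, actual KB).
TABLE_IV = [
    (120, 608181, 695624),
    (240, 1069185, 1064728),
    (360, 1530189, 1433776),
    (480, 1991193, 1977272),
    (600, 2452197, 2601880),
]


def abs_pct_error(predicted: float, actual: float) -> float:
    """Absolute percentage error of a prediction, relative to the actual value."""
    return abs(actual - predicted) / actual * 100.0


if __name__ == "__main__":
    for t_min, predicted, actual in TABLE_IV:
        err = abs_pct_error(predicted, actual)
        print(f"{t_min:3d} min: {err:5.2f}%")
```

Four of the five points fall below 10%, and the two smallest errors occur at 240 and 480 minutes, consistent with the discussion of Figure 17.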

6. FINAL REMARKS

This article investigates software aging effects in the Eucalyptus-based cloud computing infrastructure. In addition to the detection of different aging effects in Eucalyptus, a rejuvenation method is proposed for mitigating the identified aging effects. We found indicators of memory leaks in Eucalyptus processes related to the handling of elastic block storage. Such problems may be harmful to the dependability of Eucalyptus, or of any cloud application running in its environment. Performance degradation due to RAM exhaustion and subsequent use of swap space is detected and discussed. The high CPU utilization by the virtual machines also highlighted possible faults related to the management of EBS volumes, supported by the guest operating system or even by the KVM hypervisor. Memory leaks in VM management, especially on instantiations, caused system crashes that blocked the creation of new VMs. In terms of aging mitigation, the proposed rejuvenation method used multiple thresholds and time-series forecasting to reduce the virtual memory utilization before the system reaches a critical point. The experimental results show that our approach offers a reduced downtime when compared to a threshold-based method. Note that the proposed method is not tied to the characteristics of our experimental cloud computing environment, so it can be adapted to handle aging issues in practically any other software system.

ACKNOWLEDGMENTS

We would like to thank the following Brazilian agencies for research support: CNPq, FACEPE, FAPEMIG, and CAPES. We also give our thanks to the MoDCS Research Group.

REFERENCES

AKAIKE, H. 1969. Fitting autoregressive models for prediction. Ann. Institute Stat. Math. 21, 1, 243–247.

AMAZON. 2011a. Amazon Elastic Block Store (EBS). Amazon.com, Inc. Available in: http://aws.amazon.com/ebs.

AMAZON. 2011b. Amazon elastic compute cloud - ec2. Amazon.com, Inc.

ARAUJO, J., MATOS JUNIOR, R., MACIEL, P., AND MATIAS, R. 2011a. Software aging issues on the eucalyptus cloud computing infrastructure. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC’11). Anchorage.


ARAUJO, J., MATOS JUNIOR, R., MACIEL, P., MATIAS, R., AND BEICKER, I. 2011b. Experimental evaluation of software aging effects on the eucalyptus cloud computing infrastructure. In Proceedings of the ACM/IFIP/USENIX International Middleware Conference (Middleware’11). Lisbon.

ARAUJO, J., MATOS JUNIOR, R., MACIEL, P., VIEIRA, F., MATIAS, R., AND TRIVEDI, K. S. 2011c. Software rejuvenation in eucalyptus cloud computing infrastructure: A method based on time series forecasting and multiple thresholds. In Proceedings of the 3rd International Workshop on Software Aging and Rejuvenation (WoSAR’11) in conjunction with the 22nd Annual International Symposium on Software Reliability Engineering (ISSRE’11). Hiroshima.

ARMBRUST, M., FOX, A., GRIFFITH, R., JOSEPH, A. D., KATZ, R., KONWINSKI, A., LEE, G., PATTERSON, D., RABKIN, A., STOICA, I., AND ZAHARIA, M. 2009. Above the clouds: A Berkeley view of cloud computing. Tech. Rep. UCB/EECS-2009-28, UC Berkeley Reliable Adaptive Distributed Systems Laboratory. Feb.

AVIZIENIS, A., LAPRIE, J., RANDELL, B., AND LANDWEHR, C. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Depend. Secure Comput. 1, 11–33.

BAO, Y., SUN, X., AND TRIVEDI, K. S. 2005. A workload-based analysis of software aging and rejuvenation. IEEE Trans. Reliab. 54, 541–548.

BLOOMFIELD, P. 2000. Fourier Analysis of Time Series: An Introduction. Wiley Series in Probability and Statistics.

BLUM, R. 2008. Linux Command Line and Shell Scripting Bible. Wiley.

BOX, G. AND JENKINS, G. 1970. Time Series Analysis. Holden-Day series in time series analysis. Holden-Day, San Francisco, CA.

CANONICAL. 2011. Manual pages about using a GNU/Linux system. Canonical Ltd. Available in: http://manpages.ubuntu.com/manpages/hardy/man5/proc.5.html.

CARROZZA, G., COTRONEO, D., NATELLA, R., PECCHIA, A., AND RUSSO, S. 2010. Memory leak analysis of mission-critical middleware. J. Syst. Softw. 83, 1556–1567.

CHATFIELD, C. 1996. The Analysis of Time Series: An Introduction 5th Ed. Chapman & Hall/CRC, New York.

CORDEIRO, T., DAMALIO, D., PEREIRA, N., ENDO, P., PALHARES, A., GONCALVES, G., SADOK, D., KELNER, J., MELANDER, B., SOUZA, V., AND MANGS, J.-E. 2010. Open source cloud computing platforms. In Proceedings of the 9th International Conference on Grid and Cloud Computing (GCC’2010) (Jiangsu). 1–5.

CULLY, B., LEFEBVRE, G., MEYER, D., FEELEY, M., HUTCHINSON, N., AND WARFIELD, A. 2008. Remus: High availability via asynchronous virtual machine replication. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (San Francisco). 161–174.

EUCALYPTUS. 2009. Eucalyptus Open-Source Cloud Computing Infrastructure - An Overview. Eucalyptus Systems, Inc., 130 Castilian Drive, Goleta, CA 93117 USA.

EUCALYPTUS. 2010. Cloud Computing and Open Source: IT Climatology is Born. Eucalyptus Systems, Inc., 130 Castilian Drive, Goleta, CA 93117 USA.

EUCALYPTUS. 2011. Eucalyptus - the open source cloud platform. Eucalyptus Systems, Inc. Available in: http://open.eucalyptus.com/.

GROTTKE, M., MATIAS, R., AND TRIVEDI, K. 2008. The fundamentals of software aging. In Proceedings of the 1st International Workshop on Software Aging and Rejuvenation (WoSAR), in conjunction with the 19th IEEE International Symposium on Software Reliability Engineering (Seattle).

HUANG, Y., KINTALA, C., KOLETTIS, N., AND FULTON, N. D. 1995. Software rejuvenation: Analysis, module and applications. In Proceedings of the 25th Symposium on Fault Tolerant Computing (FTCS-25) (Pasadena). 381–390.

IOSUP, A., OSTERMANN, S., YIGITBASI, N., PRODAN, R., FAHRINGER, T., AND EPEMA, D. 2011. Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Trans. Paral. Distrib. Syst. (TPDS), Special Issue on Many-Task Computing 22, 931–945.

JOHNSON, D., MURARI, K., RAJU, M., RB, S., AND GIRIKUMAR, Y. 2010. Eucalyptus Beginner’s Guide UEC Ed. For Ubuntu Server 10.04 - Lucid Lynx, v1.0.

JONES, M. T. 2008. Cloud computing with Linux - cloud computing platforms and applications. IBM Corporation, 12.

KEDEM, B. AND FOKIANOS, K. 2002. Regression Models for Time Series Analysis. Wiley.

KOURAI, K. AND CHIBA, S. 2007. A fast rejuvenation technique for server consolidation with virtual machines. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07) (Washington). 245–255.

KVM. 2012. Kernel based virtual machine. Project Home Page. Available in: http://www.linux-kvm.org.

MACHIDA, F., KIM, D. S., AND TRIVEDI, K. 2010. Modeling and analysis of software rejuvenation in a server virtualized system. In Proceedings of the 2010 IEEE 2nd International Workshop on Software Aging and Rejuvenation (WoSAR). 1–6.


MATIAS, R. AND FILHO, P. J. F. 2010. Measuring software aging effects through OS kernel instrumentation. In Proceedings of the 2nd International Workshop on Software Aging and Rejuvenation (WoSAR), in conjunction with the 21st IEEE International Symposium on Software Reliability Engineering (ISSRE’10) (San Jose).

MATIAS, R. AND FREITAS FILHO, P. J. 2006. An experimental study on software aging and rejuvenation in web servers. In Proceedings of the 30th Annual International Computer Software and Applications Conference (COMPSAC’06) (Chicago).

MATIAS JR., R., BARBETTA, P. A., TRIVEDI, K. S., AND FILHO, P. J. F. 2010. Accelerated degradation tests applied to software aging experiments. IEEE Trans. Reliab. 59, 1, 102–114.

MATOS JR., R., ARAUJO, J., MACIEL, P., VIEIRA, F., MATIAS, R., AND TRIVEDI, K. S. 2011. Software rejuvenation in Eucalyptus cloud computing infrastructure: A hybrid method based on multiple thresholds and time series prediction. Int. Trans. Syst. Sci. Appl. 7, 295–303.

MCKINLEY, P. K., SAMIMI, F. A., SHAPIRO, J. K., AND TANG, C. 2006. Service clouds: A distributed infrastructure for composing autonomic communication services. In Proceedings of the 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing (DASC’06) (Indianapolis, IN). 341–348.

MIHAILESCU, M., RODRIGUEZ, A., AND AMZA, C. 2011. Enhancing application robustness in infrastructure-as-a-service clouds. In Proceedings of the 1st International Workshop on Dependability of Clouds, Data Centers and Virtual Computing Environments (DCDV 2011) in conjunction with the 41st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’11) (Hong Kong).

MONTGOMERY, D. C., JENNINGS, C. L., AND KULAHCI, M. 2008. Introduction to Time Series Analysis and Forecasting. Wiley series in probability and statistics.

MUSA, J. D. 1998. Software Reliability Engineering: More Reliable Software, Faster Development and Testing 2 Ed. McGraw-Hill, New York, NY.

NIST. 2011. NIST. National Institute of Standards and Technology, Information Technology Laboratory, U.S. Department of Commerce. Available in: http://csrc.nist.gov.

PAING, A. M. M. AND THEIN, N. L. 2012. Stochastic reward nets model for time based software rejuvenation in virtualized environment. Int. J. Comput. Sci. Telecommuni. 3, 1, 1–10.

PENG, J., ZHANG, X., LEI, Z., ZHANG, B., ZHANG, W., AND LI, Q. 2009. Comparison of several cloud computing platforms. In Proceedings of the 2nd International Symposium on Information Science and Engineering (ISISE) (Shanghai). IEEE Press, 23–27.

REZAEI, A. AND SHARIFI, M. 2010. Rejuvenating high available virtualized systems. In Proceedings of the International Conference on Availability, Reliability, and Security, 2010 (ARES’10). 289–294.

SCHWARZ, G. 1978. Estimating the dimension of a model. Ann. Stati.

SOUSA, E., MACIEL, P. R. M., ARAJO, C., ALVES, G., AND CHICOUT, F. 2009. Performance modeling for evaluation and planning of electronic funds transfer systems. In Proceedings of ISCC’09. 73–76.

SUN, D., CHANG, G., GUO, Q., WANG, C., AND WANG, X. 2010. A dependability model to enhance security of cloud environment using system-level virtualization techniques. In Proceedings of the 1st International Conference on Pervasive Computing, Signal Processing and Applications. 6.

SUN MICROSYSTEMS. 2009. Introduction to Cloud Computing Architecture 1 Ed. Sun Microsystems, Inc.

TRIVEDI, K. S., KIM, D. S., ROY, A., AND MEDHI, D. 2009. Dependability and security models. In Proceedings of the 7th International Workshop on the Design of Reliable Communication Networks (DRCN’09).

VAIDYANATHAN, K. AND TRIVEDI, K. S. 2005. A comprehensive model for software rejuvenation. IEEE Trans. Depend. Secure Comput. 2, 124–137.

WITKON, E. 2007. Using Load Testing to meet Your SLA. RadView Software. RadView Executive White Paper.

XIE, M., DAI, Y.-S., AND POH, K.-L. 2004. Computing System Reliability: Models and Analysis. Kluwer Academic Publishers.

Received April 2012; revised September 2012; accepted November 2012
