
Data-driven Performance Prediction

and Resource Allocation for Cloud Services

RERNGVIT YANGGRATOKE

PhD Thesis
Stockholm, Sweden, 2016


TRITA-EE 2016:020
ISSN 1653-5146
ISBN 978-91-7595-876-7
KTH, School of Electrical Engineering

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Doctor of Philosophy on May 3, 2016, in hall F3, KTH.

© Rerngvit Yanggratoke, February 2016

Printed by: Universitetsservice US AB


Abstract

Cloud services, which provide online entertainment, enterprise resource management, trip planning, tax filing, etc., are becoming essential for consumers, businesses, and governments. The key functionalities of such services are provided by backend systems in data centers. This thesis focuses on three fundamental problems related to resource allocation, system dimensioning, and performance predictions for backend systems. We address these problems using data-driven approaches: triggering dynamic allocation by changes in the environment, obtaining configuration parameters from measurements, and learning from observations.

The first problem relates to resource allocation for large clouds with potentially hundreds of thousands of machines and services. To address this problem, we have developed and evaluated a scalable and generic protocol for resource allocation. Scalability is achieved through the use of a gossip algorithm. The protocol is generic in the sense that it can be instantiated for different management objectives through the choice of objective functions. It jointly allocates CPU, memory, and network resources to services hosted by the cloud. We prove convergence properties of the protocol. Extensive simulation studies suggest that the quality of the allocation is independent of the system size, up to a hundred thousand machines, for the management objectives considered.

The second problem focuses on performance modeling of a distributed key-value store, and we study specifically the Spotify backend for streaming music. Understanding the performance of the Spotify storage system is essential for achieving low-latency playback. We developed analytical models for system capacity under different data allocation policies and for response time distribution, taking into account that the Spotify system is lightly loaded. We evaluated the models by comparing model predictions with measurements from our lab testbed and from the Spotify operational environment. We found the prediction error to be below 12% for all investigated scenarios.

The third problem relates to real-time predictions of service metrics, which we address through statistical learning. Service metrics are learned from observing device and network statistics, which makes this method service agnostic. We performed experiments on a server cluster running video streaming and key-value store services. We showed that feature set reduction significantly improves the prediction accuracy, while simultaneously reducing model computation time. Finally, we designed and implemented a real-time analytics engine, which processes streams of device statistics and service metrics from testbed sensors and produces model predictions through online learning.


Sammanfattning (Summary in Swedish)

Cloud services that offer entertainment, enterprise systems, trip planning, tax services, etc. are becoming essential for consumers as well as companies and government agencies. The key functions of such services are provided by systems in data centers. This thesis focuses on three fundamental problems related to resource allocation, system dimensioning, and prediction of the performance of backend systems. We address these with data-driven methods: dynamic allocation triggered by changes in the environment, configuration parameters estimated from measurements, and learning from observations.

The first problem relates to resource allocation for large clouds with potentially several hundred thousand machines and services. To solve this problem, we have developed and evaluated a scalable and generic protocol for resource allocation. Scalability is achieved by using a gossip algorithm. The protocol is generic in the sense that the choice of objective function allows the protocol to be instantiated for different purposes. Processing power, memory, and network resources are allocated jointly to the services running in the cloud. We prove the convergence properties of the protocol. Extensive simulation studies suggest that the quality of the allocation is independent of the system size, up to a hundred thousand machines, for the purposes studied.

The second problem focuses on modeling the performance of a distributed key-value store, and we study Spotify's backend system for streaming music. Understanding the performance of Spotify's storage system is essential for achieving playback with low latency. We have developed analytical models for the capacity of the system under different data allocation policies and for the response time distribution. These models also take into account that Spotify's system is lightly loaded. We evaluate the models by comparing the model predictions with measurements from our testbed and measurements from Spotify's operational environment. For all investigated scenarios, we found that the prediction error was less than 12%.

The third problem relates to real-time prediction of service metrics, and we address it with statistical learning. By observing device and network statistics we learn the service metrics, which means that this method works without any prior information about the service in question. We perform experiments on a server cluster running video streaming and key-value store services. We show that reducing the set of features noticeably improves the prediction accuracy, while at the same time reducing the computation time for the model. Finally, we design and implement a real-time analytics engine that processes streams of device statistics and service metrics from sensors in the testbed and produces model predictions through online learning.


Acknowledgements

First of all, let me express my deepest gratitude to my advisor, Prof. Rolf Stadler, for his continuous support and valuable guidance during the long journey of this thesis. This journey has been a significant challenge in my life. More often than not, I reach a dead end, run out of possible ideas, and am utterly mystified about what to do. During such difficult times, he directs me to the light at the end of the tunnel. Without him, I would not have been able to complete this journey. To me, he is an advisor as well as a lifelong friend. I also would like to thank Prof. Gunnar Karlsson for giving me the opportunity to be a member of LCN.

Many individuals have contributed to the results presented in this thesis. Without them, the thesis would not have materialized in this shape. In this regard, I would like to thank Fetahi Wuhib, Gunnar Kreitz, Mikael Goldmann, Viktoria Fodor, Jawwad Ahmed, John Ardelius, Christofer Flinta, Andreas Johnsson, and Daniel Gillblad for their valuable inputs and contributions on all aspects of the work.

I would like to thank all colleagues here at LCN for maintaining a motivating and friendly environment for research. Especially, I am grateful to Misbah Uddin, my office mate, for his general help and for discussions of interesting topics inside and outside of research. I also would like to thank Anna Ohlsson, Connie Linell, and Ingela Nelson for helping with general issues.

I am grateful to my family in Thailand for their continuous help and support throughout this period. I also would like to thank my wife for making me feel at home away from home. Last but not least, I would like to thank friends in Stockholm for activities, hangouts, and parties that kept me alive during this solitary period.

Rerngvit Yanggratoke
Stockholm, 2016


Table of Contents

I Introduction

1 Introduction
  1.1 Background and Motivation
  1.2 Problem and Approach
  1.3 The Contribution of this Thesis
  1.4 Publications

2 Related Research
  2.1 Resource Allocation for Cloud Environments
  2.2 Performance Modeling of a Distributed Key-value Store
  2.3 Analytics-based Prediction of Service Metrics

3 Summary of Original Work

4 Open Problems for Future Research

5 List of Publications in the Context of this Thesis

Bibliography

6 Gossip-based Resource Allocation for Green Computing in Large Clouds
  6.1 Introduction
  6.2 System Architecture
  6.3 Modeling Resource Allocation and our Generic Solution
  6.4 The Problem and our Solution
  6.5 Evaluation through Simulation
  6.6 Related work
  6.7 Discussion and conclusion
  References

7 Allocating Compute and Network Resources under Management Objectives in Large-Scale Clouds
  7.1 Introduction
  7.2 Modeling Resource Allocation under Management Objectives
  7.3 An Architecture for Resource Allocation and Management
  7.4 Placement under Management Objectives
  7.5 Evaluation through Simulation
  7.6 Related Work
  7.7 Conclusion and Future Work
  References

8 On the Performance of the Spotify Backend
  8.1 Introduction
  8.2 The Spotify backend storage architecture
  8.3 Predicting response time
  8.4 Estimating the capacity of a storage cluster for different object allocation policies
  8.5 Related work
  8.6 Conclusion
  References

9 Predicting Real-time Service-level Metrics from Device Statistics
  9.1 Introduction
  9.2 Problem setting
  9.3 Background: Statistical learning
  9.4 Device statistics and service-level metrics
  9.5 Testbed and experimentation
  9.6 Evaluation of the prediction models
  9.7 Related work
  9.8 Discussion
  References

10 A Service-agnostic Method for Predicting Service Metrics in Real-time
  10.1 Introduction
  10.2 Problem setting
  10.3 Statistical learning methods used in this work
  10.4 Device statistics and service-level metrics
  10.5 Testbed and prototype
  10.6 Model computation and evaluation
  10.7 Related work
  10.8 Discussion
  References


Part I

Introduction


Chapter 1

Introduction

1.1 Background and Motivation

Cloud services, e.g., electronic-mail, social-network, and music-on-demand services, are well integrated into the fabric of our daily life. These services connect people around the world, enable convenient access to information, and are vital to innovation for industry and society [1].

A cloud service typically consists of frontend systems, for instance, an electronic-mail client in a mobile phone, and backend systems, for example, a storage machine for electronic mails. A backend system normally runs on one or several machines located in a data center, which is a facility for hosting computing systems and related equipment.

A data center normally contains a large number of machines connected via a network. The machines are mounted on racks. Racks are organized into clusters. Data centers vary in size; a data center occupies a single room, one floor of a building, or an entire building. Figure 1.1 illustrates a view from inside a data center.

This thesis focuses on three fundamental problems related to the management of cloud services and data centers. They center around (1) resource allocation in large clouds, (2) performance modeling of a distributed key-value store, and (3) real-time prediction of service metrics.

The first problem is a fundamental problem in cloud computing. Cloud computing can be understood as the “use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet)” [3]. With the term cloud environment we mean the physical and software infrastructure that supports cloud computing.

A cloud environment normally involves three stakeholders. The first stakeholder is the cloud service provider, who owns and administrates a data center infrastructure. Examples of cloud service providers include the business divisions that run Amazon Web Services [4], the Google Cloud Platform [5], and Windows Azure [6]. The second stakeholder is the cloud client, who owns and operates a cloud service by leasing a part of the virtualized infrastructure from the cloud provider. Examples of such cloud clients are companies like Dropbox [7] and Netflix [8]. The cloud client provides a cloud service over the Internet to the third stakeholder, which we call the end user. An end user accesses the cloud service through the frontend systems mentioned earlier.

Figure 1.1: A view from inside a data center [2].

A backend system is often composed of multiple applications. For example, a web backend system, such as LAMP (Linux, Apache, MySQL and PHP), is composed of web and database applications. An application is executed on a single machine and requires resources, such as CPU, memory, and storage.

A cloud service provider performs resource management in a data center according to a management objective. Examples of such objectives are the balanced load objective, which states that machines should experience the same load, and the energy efficiency objective, which states that the power consumption of all machines should be minimized.

A key problem for the cloud service provider is how to select the machine for executing an application, such that the placement satisfies, at the same time, the management objective of the provider and the resource demands of all applications in the cloud. This problem is also referred to as the application placement problem. Figure 1.2 illustrates the problem in a data center environment, with two sample objectives for the case of the CPU resources. On the left side, applications, which are represented by boxes $a_1, a_2, \ldots, a_A$, are placed on machines, which are shown as bins $S_1, S_2, \ldots, S_N$; the height of an application box represents the CPU demand of the application, while the height of a machine bin represents the CPU capacity of the machine. On the right side, for the balanced load objective, applications are placed in such a way that the bins are filled to the same height, while for the energy efficiency objective, applications are placed in such a way that the minimum number of bins is used. Servers with no placed applications (empty bins) are shut down to achieve energy efficiency.

Figure 1.2: Application placement in a data center environment.

In order to compute the placement decisions, states of applications and machines are required to be monitored. Traditionally, the solution to the application placement problem has been computed in a centralized fashion, i.e., on a single machine. This is technically feasible for small data centers where the number of applications and physical machines is below ten thousand.

Some recently built data centers, however, are very large [9–11]; with over 300,000 square feet in size, they often contain hundreds of thousands of machines and applications. In such a data center, the number of states to be monitored and the number of placement decisions to be computed are very large, and the application placement problem cannot be solved in a timely fashion through a centralized computation. Therefore, a more scalable solution is needed.

Note that we address the application placement problem in such a way that our solution applies not only to interactive services, but also to batch services, e.g., data analytics. Batch services generally have different requirements than interactive services, for example, in that batch services prioritize high service throughput rather than low request latency.

The second problem we address in this thesis is a key problem in performance management of storage systems. Such systems store persistent data for cloud services. An important type of storage systems is the distributed key-value store. A distributed key-value store runs on a number of storage machines and has two key operations, namely get and put. The get operation reads a value from the store by specifying a key, while the put operation writes a specified value to the store for a given key. In this work, we use the term object for a value kept in the store. Examples of distributed key-value stores that are used in production systems today are Cassandra [12], Dynamo [13], Riak [14], and MongoDB [15].
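As a minimal illustration of the get/put interface described above, the sketch below reduces a key-value store to an in-memory dictionary. It is an illustration only; production stores such as Cassandra or the Spotify backend distribute the objects over many machines.

```python
class KeyValueStore:
    """Minimal in-memory key-value store illustrating the get/put interface (sketch)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        # Write the object (value) under the given key.
        self._objects[key] = value

    def get(self, key):
        # Read the object stored under the given key; None if the key is absent.
        return self._objects.get(key)


store = KeyValueStore()
store.put("track:42:chunk:0", b"...audio bytes...")
print(store.get("track:42:chunk:0"))
```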

The distributed key-value store we focus on in our work is a system built by Spotify.


Figure 1.3: The Spotify storage architecture.

Spotify offers an on-demand music streaming service [16], which currently has more than 75 million active users and 20 million subscribers [17]. The core functionality of the backend system of this music service is provided by the Spotify storage system. Its architecture is captured in Figure 1.3. Spotify has several backend sites, which follow the layout in the figure.

The Spotify storage system is two-tiered. When a user invokes the service, the client sends a request for an object, a part of a song, to an Access Point (AP), which routes the request to a Production Storage server. (In this thesis, we use the terms server and machine interchangeably.) If the requested object is stored in the Production Storage server, a response is returned immediately. Otherwise, the request is forwarded over the Internet to the Master Storage system (which is based upon a third-party storage service), and the retrieved object is subsequently cached in a Production Storage server.
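The two-tier lookup can be sketched as follows. The class and function names (ProductionServer, MasterStorage, handle_get, route) are hypothetical, and the hash-based routing merely stands in for Spotify's own allocation and routing policies.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ProductionServer:
    cache: dict = field(default_factory=dict)   # objects cached on this server

@dataclass
class MasterStorage:
    objects: dict = field(default_factory=dict)
    def fetch(self, key):
        # Retrieve the object from the (third-party) master storage.
        return self.objects[key]

def route(key, servers):
    # Placeholder routing at the AP; the real system follows its own policy.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

def handle_get(key, servers, master):
    """Two-tier lookup: try production storage first, fall back to master storage."""
    server = route(key, servers)
    if key in server.cache:
        return server.cache[key]     # hit: the response is returned immediately
    obj = master.fetch(key)          # miss: the request is forwarded to master storage
    server.cache[key] = obj          # the retrieved object is cached for later requests
    return obj


servers = [ProductionServer() for _ in range(3)]
master = MasterStorage(objects={"track:1:chunk:0": b"audio"})
print(handle_get("track:1:chunk:0", servers, master))
```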

For most storage systems, it is impractical to replicate all objects across storage machines. Therefore, each object is stored on, or allocated to, a subset of machines. We refer to a strategy that allocates a set of objects to a set of machines as an object allocation policy.

A quality of service objective indicates a performance goal of a service that is linked to the user experience. Examples of such an objective for a storage system are an upper bound on latency and a lower bound on throughput. A quality of service objective is important for real-time services, such as on-demand music streaming services and interactive games. For instance, achieving low latency is key to the Spotify service; the response time of pressing the play button on the client interface should feel immediate to the user. (In this thesis, we use the terms latency and response time interchangeably.)

Performance modeling includes the development of an analytical framework or tool to establish a relationship between (1) a performance metric of a system and (2) the external load and system parameters, such as machine characteristics and configurations. Given such a model, we can predict performance metrics for specific external load and system parameters.

Performance modeling of a distributed key-value store is essential for achieving the quality of service objective for cloud services. Applying a reasonably accurate model, a planner of the Spotify storage system can properly provision resources such that service requests comply with the quality of service objective, e.g., the requirement that the playback latency is sufficiently low.

The third problem we address in this thesis is real-time prediction of service metrics. Backend systems of cloud services generally run on multiple physical or virtual machines interconnected through a data-center network. The thesis work considers two services: video-on-demand streaming and distributed key-value store. Video-on-demand streaming is a cloud service for distributing video content from providers to end users. Video-on-demand means that each end user can select which particular video to view; the distribution is streaming in the sense that the video content is presented as soon as sufficient data is received by the client. Well-known service providers for video-on-demand streaming include Youtube [18] and Netflix [8]. (We have introduced the distributed key-value store earlier.)

A service metric relates to the user experience of a cloud service. For instance, for video streaming applications, popular service metrics are the video frame rate and the audio buffer rate, while for a distributed key-value store, a key metric is the latency of key-value-store operations.

Accurate prediction of such service metrics in real-time is important. Such a capability is a key building block for service assurance, which aims at providing a high-quality user experience. In particular, a service provider, equipped with this capability, can react to deteriorating service quality by reallocating resources.

Accurate prediction of service metrics in real-time is difficult. This is because the services are built out of large and complex software systems that run on general-purpose platforms and operating systems, which do not provide real-time guarantees. Traditional works approach the problem by modeling various layers of hardware and software in order to come up with an accurate prediction model. Such an approach requires significant effort by domain and modeling experts. Furthermore, a developed model is applicable only for the specific service under investigation, i.e., it is service specific.

An alternative approach, which we pursue in this work, is based upon statistical learning, whereby the behaviour of the target system is learned from observations. In such a case, a large amount of observational data is needed, but no detailed knowledge about the backend systems and their interactions is required. Our method is service agnostic in the sense that it takes as input operating-system and network statistics instead of service-specific metrics.

1.2 Problem and Approach

In the context of network management, functions are often broadly categorized into five areas: fault management, configuration management, accounting management, performance management, and security management, which are known as FCAPS [19]. This thesis considers three key problems in the area of performance management for cluster-based services. We present the problems and approaches in the chronological order of the thesis work: (1) resource allocation in large clouds, (2) performance modeling of a distributed key-value store, and (3) real-time prediction of service metrics.

All three problems are related to service quality. While the first problem focuses on allocating resources, the second and the third focus on developing models for predicting performance metrics. The predictions are data-driven in the sense that we estimate model parameters from measurements. All three pieces of our work are essential to assuring and maintaining quality of service. The methods we apply include distributed and adaptive algorithms, probabilistic and stochastic modeling techniques, and statistical learning methods.

1. Resource allocation in large clouds

Problem

We restrict our discussion to a single data center that contains the cloud infrastructure. We model the data center as a system with a set of applications $A$ and a set of machines $N$. As stated in Section 1.1, an application is part of a backend system and runs on a single machine.

Figure 1.2 illustrates the problem for the case of considering a single resource, CPU, for the applications. We subsequently formulate the problem in general for different resource types, e.g., CPU, memory, and network-access bandwidth. We use $\theta_a$ to represent the generic resource required by application $a \in A$ for a specific resource type $\theta$. We represent the generic capacity of machine $n$ by $\Theta_n$. We use $P_n$ to represent the set of applications placed on machine $n$. An application $a \in A$ on machine $n$ is allocated the generic resource $\theta_{a,n}$. The allocation follows a local resource allocation policy. An example of such an allocation policy is one that allocates resources according to application demands, i.e., $\theta_{a,n} = \theta_a$.

We consider a dynamic setting, whereby (1) the demand of an application $a$ changes over time, i.e., $\theta_a = \theta_a(t)$, (2) applications are dynamically added to and removed from the cloud environment, and (3) machines are dynamically added to and removed from the cloud environment.

We consider two constraints for application placement, and the constraints apply for all resource types. First, the capacity constraint states that the aggregate of the allocated resources on machine $n$ cannot exceed the capacity of machine $n$, i.e., $\sum_{a \in P_n} \theta_{a,n} \leq \Theta_n$. Second, the resource demand constraint states that the resources allocated to application $a$ must be equal or larger than the application demand, i.e., $\theta_{a,n} \geq \theta_a$. Additional constraints, such as colocation or anti-colocation constraints, can be considered using our approach, although we have not done so in this work.

In this thesis, we study the application placement problem, i.e., the problem of placing a set of applications $A$ onto a set of machines $N$ that execute those applications. A feasible solution is an allocation that satisfies (1) the capacity constraints for the resource allocation on all machines and (2) the resource demand constraints of all applications.
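A feasibility check can be written down directly from these two constraints. The sketch below assumes a single resource type and uses hypothetical dictionaries for demands, capacities, allocations, and the placement; it is an illustration, not part of the protocol itself.

```python
def is_feasible(demand, capacity, allocation, placement):
    """Check the capacity and demand constraints (sketch, single resource type).

    demand[a]     : resource demand theta_a of application a
    capacity[n]   : resource capacity Theta_n of machine n
    allocation[a] : resource theta_{a,n} allocated to a on its machine
    placement[n]  : set P_n of applications placed on machine n
    """
    for n, apps in placement.items():
        # Capacity constraint: sum of allocations on machine n must not exceed Theta_n.
        if sum(allocation[a] for a in apps) > capacity[n]:
            return False
    for a, alloc in allocation.items():
        # Demand constraint: each application receives at least its demand theta_a.
        if alloc < demand[a]:
            return False
    return True


demand = {"a1": 2.0, "a2": 1.0}
capacity = {"n1": 4.0}
allocation = {"a1": 2.0, "a2": 1.0}
placement = {"n1": {"a1", "a2"}}
print(is_feasible(demand, capacity, allocation, placement))  # True
```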

We formulate the application placement problem as an optimization problem. Among the feasible solutions, an optimal solution maximizes or minimizes an objective function, which measures the quality of the solution according to the management objective. An objective function is computed from local state variables of all machines in the cloud. A sample objective function for a balanced load objective is the variance of the CPU utilization across all machines in the cloud. (The CPU utilization of machine $n$ is $\sum_{a \in P_n} \theta_{a,n} / \Theta_n$.) Minimizing this objective function is akin to balancing the load across machines. Finding an optimal solution for the placement problem is equivalent to solving a variant of the multiple knapsack problem and is NP-hard [20].
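The sample objective function for the balanced load objective is a direct transcription of the variance of CPU utilization; a sketch with hypothetical data structures follows.

```python
def balanced_load_objective(placement, allocation, capacity):
    """Variance of CPU utilization across machines (sketch of the sample objective).

    The utilization of machine n is sum_{a in P_n} theta_{a,n} / Theta_n;
    minimizing the variance balances the load across machines.
    """
    utilizations = [
        sum(allocation[a] for a in apps) / capacity[n]
        for n, apps in placement.items()
    ]
    mean = sum(utilizations) / len(utilizations)
    return sum((u - mean) ** 2 for u in utilizations) / len(utilizations)


placement = {"n1": {"a1"}, "n2": {"a2", "a3"}}
allocation = {"a1": 2.0, "a2": 1.0, "a3": 1.0}
capacity = {"n1": 4.0, "n2": 4.0}
print(balanced_load_objective(placement, allocation, capacity))  # 0.0, perfectly balanced
```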

In a data center, a resource allocation system (1) monitors the states of applications and machines, (2) continuously computes the optimal solution to the application placement problem, and (3) executes the placement decisions on the cloud infrastructure. We identify design goals of a system that effectively and efficiently performs resource allocation in a large-scale cloud environment through a set of properties as follows:

Efficient: The resource allocation system consumes a small amount of cloud resources, e.g., at most a single CPU core from each machine.

Scalable: The system scales both in the number of machines and the number of applications.

Adaptive: The system dynamically computes a new solution to the placement problem in response to a change in demand, application churn, or a change in system configuration.

Generic: The system can compute the solution to the application placement problem for a class of management objectives, not just for a single objective.

Feasible: It is possible to build a system prototype with today’s technology.

This thesis addresses the problem of engineering a resource allocation system that satisfies the design goals above. The problem is hard, since the above-stated goals must be simultaneously met.

Approach

The core of our solution to achieve the above goals is a distributed and adaptive algorithm in the form of a gossip protocol. (In this thesis, we use the terms algorithm and protocol interchangeably.) A gossip protocol is a round-based protocol that relies on pairwise interactions between nodes to accomplish a global objective, whereby each node executes the same code, and the size of the exchanged message is limited [21].

The choice of using a gossip protocol is motivated by the simple interaction pattern of a gossip protocol and its inherent scalability, which stems from the fact that each node has a limited view of the system. This property allows the protocol to scale, in our case, to at least 100,000 machines and applications.

The gossip protocol forms a key part of our distributed middleware that runs on all machines of the cloud. The protocol uses a peer sampling service, which provides for each node a set of peers for gossip interactions. Figure 1.4 illustrates a sample execution of a round of the gossip protocol for the balanced load objective. The protocol is executed as a sequence of rounds and terminates when the balanced load objective is achieved to a sufficient degree.

Figure 1.4: A sample execution of a round of a gossip protocol. (a) An arrow represents a communication initiated from a source to a destination node. (b) Gossip interaction between two nodes for the balanced load objective.
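A highly simplified sketch of such a round for the balanced load objective is shown below. It equalizes abstract load values between sampled peers rather than moving actual applications as GRMP does, so it only illustrates the round-based, pairwise interaction pattern driven by a peer sampling service.

```python
import random

def gossip_round(loads, peers_of):
    """One round of a load-balancing gossip interaction (illustrative sketch only).

    Each node contacts one sampled peer and the pair equalizes its load value;
    in the real protocol this corresponds to moving application demand
    between the two machines.
    """
    for node in loads:
        peer = random.choice(peers_of[node])          # peer sampling service
        average = (loads[node] + loads[peer]) / 2.0   # equalize the pair's load
        loads[node] = loads[peer] = average
    return loads


loads = {"n1": 8.0, "n2": 2.0, "n3": 5.0, "n4": 1.0}
peers_of = {n: [p for p in loads if p != n] for n in loads}
for _ in range(10):
    loads = gossip_round(loads, peers_of)
print(loads)  # the loads converge towards the global average (4.0)
```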

Our evaluation of the protocol includes two complementary parts. The first part is a theoretical analysis of key properties of the protocol, including convergence, in order to show that the key metrics of the protocol converge, and correctness, in order to demonstrate that the converged metrics solve the application placement problem. Second, we perform a simulation study of the protocol for realistic scenarios. This work includes implementing and customizing simulators that reproduce key functionalities of the system, for example, resource consumption and gossip interactions. Such a study is important, since it allows us to evaluate the behavior and performance of the protocol for a large-scale system.

2. Performance Modeling of a Distributed Key-value Store

Problem

A distributed key-value store is based on a cluster of interconnected storage machines, which we also refer to as a storage cluster. The cluster stores a set of objects, each of which has a key and a value. An object is accessed through a get request with the object’s key, and the cluster retrieves the value for the key.

We focus on the distributed key-value store, which is part of the backend of the Spotify music service (see Figure 1.3). A key metric of a distributed key-value store is the response time of a get request, which is the duration measured from the time when a request arrives at the storage cluster to the time when the request leaves the cluster.

We consider two object allocation policies in this thesis. The first policy, which we call the random policy, allocates objects uniformly at random to machines in the storage cluster. The second policy, which we call the popularity-aware policy, allocates an object to a machine depending on how frequently the object is requested, i.e., its request rate. The allocation is performed with the objective that each machine experiences the same aggregate request rate.
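The two policies can be sketched as follows. The greedy heuristic used here for the popularity-aware policy is an assumption for illustration; it is not the exact procedure analyzed in Chapter 8.

```python
import random

def random_policy(objects, machines):
    """Random policy: each object is placed uniformly at random (sketch)."""
    return {obj: random.choice(machines) for obj in objects}

def popularity_aware_policy(request_rates, machines):
    """Popularity-aware policy (sketch): greedily place the most popular objects
    first so that the aggregate request rate per machine ends up roughly equal."""
    load = {m: 0.0 for m in machines}
    allocation = {}
    for obj, rate in sorted(request_rates.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)   # machine with the lowest aggregate rate so far
        allocation[obj] = target
        load[target] += rate
    return allocation


rates = {"song_a": 100.0, "song_b": 60.0, "song_c": 30.0, "song_d": 10.0}
machines = ["s1", "s2"]
print(popularity_aware_policy(rates, machines))
```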

The capacity of a machine is the maximum request rate that it can process without violating a given quality of service objective; we refer to the capacity of a storage cluster as the maximum request rate to the cluster, such that each machine serves a rate below its capacity.

A performance model of a system can be developed for different system levels, for example, for the device level, the operating system level, or the application level. Generally, an accurate model for a higher level of the system is more complex than a model for a lower level, because it must account for the interactions among components in the lower levels. For example, an application-level model must take into account the performance characteristics, functionalities, and configurations of the devices in the system and their operating systems.

This thesis addresses the problem of developing and evaluating application-level performance models for a distributed key-value store. We work with application-level models, since they express the overall performance characteristics of a system as seen by the users of the system. We first develop a model for predicting the response time distribution of a distributed key-value store for a lightly loaded system. Second, we model the capacity of a storage cluster for different object allocation policies.

We identify desirable properties of a performance model for a distributed key-value store as follows:

Accurate: The model should be sufficiently accurate to predict a target metric with an acceptable error for operational loads.

Obtainable: It should be possible with reasonable effort to estimate the model parameters from a real system.

Simple: The model should be of low complexity and analyzable with reasonable effort.

Efficient: The model equations should be solvable in real-time.

The main difficulty of developing a performance model for a distributed key-value store is the complexity of the system, which makes it hard to identify a simple model that is computable with realistic overhead and is sufficiently accurate.

Approach

In order to obtain a tractable performance model, we simplify the Spotify storage architecture shown in Figure 1.3 and arrive at the model shown in Figure 1.5. First, we omit Master Storage in the simplified model and thus assume that all objects are stored in Production Storage, since more than 99% of the requests to the Spotify storage system are served from Production Storage. Second, we model the functionality of all APs of a site as a single component. We assume that the AP selects a storage server uniformly at random to forward an incoming request, which approximates the statistical behavior of the system under the Spotify object allocation and routing policies. Further, we neglect the queuing delays at the AP and the network delays between the AP and the storage servers, because they are small compared to the response times at the storage servers.


Figure 1.5: Simplified architecture as a basis for the performance model

For modeling the performance of the Spotify storage system (which is a distributed key-value store), we apply basic probabilistic and stochastic modeling techniques; we use Poisson processes, queuing models, and balls-and-bins models.

We use Poisson processes to model the arrival times of get requests. A queuing model is employed to model the response time of a get request for the storage system, and a balls-and-bins model is applied to model the capacity of the storage system.
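A small Monte Carlo sketch along these lines is shown below. It assumes Poisson arrivals, uniform random routing by the AP, and exponential service times at FIFO servers, which is an illustrative simplification rather than the exact analytical model of Chapter 8.

```python
import random

def simulate_response_times(arrival_rate, service_rate, n_servers, n_requests, seed=1):
    """Simulate get-request response times for a simplified storage cluster (sketch)."""
    rng = random.Random(seed)
    t = 0.0
    server_free_at = [0.0] * n_servers          # time at which each server becomes idle
    response_times = []
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate)      # Poisson arrival process
        s = rng.randrange(n_servers)            # AP picks a server uniformly at random
        start = max(t, server_free_at[s])       # wait if the chosen server is busy
        service = rng.expovariate(service_rate)
        server_free_at[s] = start + service
        response_times.append(server_free_at[s] - t)
    return response_times


rt = simulate_response_times(arrival_rate=50.0, service_rate=20.0,
                             n_servers=6, n_requests=10000)
print(sum(rt) / len(rt))   # mean response time under light load
```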

We evaluate the models by comparing model predictions with measurements from two different environments. The first environment is our lab testbed, over which we have full control. This allows us to experiment with a full range of model parameters and refine the model if needed. For example, we can perform tests to find the range of possible loads for which the model is sufficiently accurate. After we have confidence in the developed model, we perform the evaluation in the second environment, the Spotify operational environment, which gives us very limited control.

3. Real-time Prediction of Service Metrics

Problem

Figure 1.6 outlines our setup for real-time prediction of service metrics. We consider a system containing a cluster of server machines that is connected to a client machine over a network. The cluster operates services for the client, e.g., a video-on-demand (VOD) service $S_1$ or a key-value store (KV) service $S_2$. We consider the cluster statistics $X_D$, the network statistics $X_N$ between the server cluster and the client machine, and how these statistics relate to service metrics $Y$ (from either of the services) on the client side.

Figure 1.6: Setup for real-time prediction of service metrics.

Device statistics $X_{D_i}$ of server $i = 1, \ldots, s$ refer to metrics on the operating-system level, e.g., CPU and memory utilization of a server machine. The device statistics $X_D$ of the cluster is the concatenation of the server statistics, i.e., $X_D = [X_{D_1}, X_{D_2}, \ldots, X_{D_s}]$. The network statistics $X_N$ refers to end-to-end metrics between the server cluster and the client machine, e.g., network round-trip time and packet loss rate. We write $X = [X_D, X_N]$ to represent the device and network statistics. The statistics $Y$ on the client side refer to service-level metrics, for example, video frame rate, audio buffer rate, and average read latency. The metrics $X$ and $Y$ evolve over time, influenced by the external load, operating system dynamics, etc. Assuming a global clock that can be read on both the client and the servers, we model the evolution of the metrics $X$ and $Y$ as time series $\{X_t\}_t$, $\{Y_t\}_t$, and $\{(X_t, Y_t)\}_t$.

Our objective is to predict the service-level metric $Y_t$ at time $t$ on the client, based on knowing the cluster and network metrics $X_t$. Using statistical learning, we formulate the problem as a regression problem. In particular, we want to find a learning model $M: X_t \mapsto \hat{Y}_t$, such that $\hat{Y}_t$ closely approximates $Y_t$ for a given $X_t$. In addition, we want to design a subsystem that runs on the same platform as the system in Figure 1.6, continuously collects the feature vector $X$, and produces model predictions $\hat{Y}$ in real-time.
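The regression formulation can be made concrete with the following sketch. The array shapes and the randomly generated placeholder data are assumptions for illustration, and any regression method could stand in for the linear model used here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-sample statistics: X_D concatenates device statistics of the
# s servers, X_N holds end-to-end network statistics, Y is the client-side metric.
X_D = np.random.rand(1000, 3 * 12)   # e.g., 3 servers x 12 OS-level counters per sample
X_N = np.random.rand(1000, 2)        # e.g., round-trip time and packet loss rate
X = np.hstack([X_D, X_N])            # feature vector X_t = [X_D, X_N]
Y = np.random.rand(1000)             # e.g., video frame rate measured on the client

model = LinearRegression().fit(X, Y)  # learn a model M: X_t -> Y_hat_t
Y_hat = model.predict(X)              # predictions for given feature vectors
```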

Our solution should have the following properties. It should be:

Accurate: The model $M$ should be sufficiently accurate to predict a target service metric $Y$.

Service-agnostic: The approach should be generic and applicable for a range of cluster-based services.

Efficient: The system should be able to continuously collect the feature vector in real-time. The model computation should have low complexity, so that $\hat{Y}$ can be computed in real-time.

The main difficulty of developing the learning model comes from the potentially very large feature space, which can include statistics from a large number of devices. For the system work, the main challenge is to achieve operation in real-time and at low overhead.

Approach

Our approach to prediction is based on statistical learning. We use low-level device and network statistics as features in order to develop a prediction method that is service agnostic. We set up an experimental testbed for creating statistics, for instance, in the form of traces, in order to develop and validate learning models. We configure the testbed to run two different services: video-on-demand (VOD) or key-value store (KV) services. Further, we develop load generators for these services, sensors to capture device statistics, and a subsystem to collect measurements.

In order to achieve real-time prediction of service metrics, we progress in three steps. Each step serves as a milestone for the following step. The first step involves batch learning using traces from the testbed, which allows us to obtain a baseline for the accuracies of different learning models. For batch learning, we consider a set of samples $\{(X_1, Y_1), \ldots, (X_m, Y_m)\}$ taken from testbed measurements. Assuming each sample $(X_t, Y_t)$ is drawn uniformly at random from a joint distribution $(X, Y)$, we use concepts and methods from statistical learning to identify a learning model $M$. The model is evaluated using the validation-set approach. We experiment with a variety of regression methods including least-square linear regression, ridge regression, lasso regression, regression tree, and random forest [22].
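A sketch of this batch-learning step with the validation-set approach, using random placeholder data in place of the testbed traces and a random forest as one of the candidate regressors, could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical trace: each row is one sample (X_t, Y_t) collected on the testbed.
X = np.random.rand(5000, 40)
Y = np.random.rand(5000)

# Validation-set approach: hold out part of the trace for evaluating the model.
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, Y_train)            # batch learning on the training part
Y_hat = model.predict(X_val)           # predictions on the held-out validation set
print("validation MAE:", mean_absolute_error(Y_val, Y_hat))
```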

The second step involves online learning on the same traces, since this method is suitable for real-time adaptation of learning models. We consider a time series of samples $\{(X_1, Y_1), (X_2, Y_2), \ldots\}$ from testbed measurements. We apply online methods that process this series sequentially and produce a sequence of models $M_1, M_2, \ldots$. The sequence is evaluated using the interleaved test-then-train approach [23].
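A sketch of interleaved test-then-train on a stream of samples, again with placeholder data and a generic online regressor rather than the exact methods used in the thesis, is shown below:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical stream of samples (X_t, Y_t) arriving in time order.
stream = [(np.random.rand(40), np.random.rand()) for _ in range(2000)]

model = SGDRegressor()
errors = []
for t, (x, y) in enumerate(stream):
    x = x.reshape(1, -1)
    if t > 0:
        y_hat = model.predict(x)[0]   # test: predict before the label is used
        errors.append(abs(y_hat - y))
    model.partial_fit(x, [y])         # then train: update the model with the new sample
print("interleaved test-then-train MAE:", sum(errors) / len(errors))
```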

Finally, the third step involves model prediction and evaluation in real-time on the testbed. In this case, samples are collected from live statistics instead of traces, and online learning models are computed by the real-time analytics engine.

To achieve high prediction accuracy and real-time operation, we identify a method to automatically reduce the dimensionality of the feature space. Specifically, we use forward-stepwise selection, which is a heuristic method to search a feature space.
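Forward-stepwise selection can be sketched as the following greedy loop, which adds one feature at a time based on cross-validated accuracy. This is an illustrative implementation under assumed scoring choices, not the exact procedure used in Chapters 9 and 10.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, Y, max_features):
    """Greedy forward-stepwise feature selection (sketch): repeatedly add the
    feature whose inclusion yields the best cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        scores = {
            j: cross_val_score(LinearRegression(), X[:, selected + [j]], Y, cv=3).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


X, Y = np.random.rand(300, 20), np.random.rand(300)
print(forward_stepwise(X, Y, max_features=5))   # indices of the selected features
```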

1.3 The Contribution of this Thesis

We believe our thesis makes significant contributions towards engineering solutions for resource allocation in a large-scale cloud environment, performance modeling for a distributed key-value store, and a service-agnostic method for predicting service metrics in real-time. This thesis work includes formal modeling and analysis, simulation study, prototype development, and testbed experiments.

1. A generic protocol for resource allocation in large clouds

We developed a scalable and generic protocol for resource allocation in large clouds, which we call the Generic Resource Management Protocol (GRMP). This protocol is generic in the sense that it can be instantiated for different management objectives. The instantiation is achieved by specifying an objective function, which is specific to a management objective and is computed from local state variables. We provided protocol instantiations for four management objectives, namely, balanced load, energy efficiency, fairness, and service differentiation (see Chapters 6 and 7).

The resource allocation system, the key component of which is GRMP, achieves the design goals stated above. First, the system is efficient, because the protocol executes with small overhead, i.e., using at most a single CPU core from each machine. Also, the protocol limits the cost of reconfiguration, for example, by attempting to minimize the number of applications that need to be reconfigured for a new placement. Second, the protocol is scalable, since the resources consumed by each machine are independent of the system size. Third, the protocol is adaptive, because it is executed periodically or in response to an event, such as the arrival of an application, the termination of an application, or the change of application demand. The protocol takes as input the current placement and produces a new placement according to the management objective. Fourth, the protocol is generic, since it can be instantiated for different management objectives. Finally, the protocol has been implemented on an Openstack cloud management testbed, and the performance measurements on this testbed demonstrate the effectiveness of GRMP [24].

2. Performance models of a distributed key-value store

We have developed and evaluated models for predicting the performance of a distributed key-value store. First, we developed a model for predicting the response time distribution of get requests. The model takes as input the external load, the system size, and model parameters obtained from the physical servers through measurements. Second, we developed a model for predicting the capacity of a storage cluster for two different object allocation policies. The model takes as input the object allocation policy (a random or a popularity-aware policy), the number of machines in the cluster, the number of objects, and the parameters of the object popularity distribution (see Chapter 8).

Both models satisfy the design goals stated above. First, they are accurate in the sense that the prediction errors, i.e., the differences between the model predictions and the measurements from the real systems, are at most 11%. Second, they are obtainable, because the model parameters can be estimated quickly by running a monitoring tool on the storage machines. Third, they are simple in the sense that the model equations are in closed form and can be interpreted in a straightforward way. Finally, they are efficient in the sense that solutions to the model equations can be computed quickly and with low overhead (within a second on a laptop computer).

Surprisingly, we find that such simple models can capture performance characteristics of a complex system. We explain this result with the fact that the storage systems we model are dimensioned in such a way that access to memory/storage is the only potential bottleneck, while other resources, such as CPUs and the network, are lightly loaded. In other words, our model is accurate and thus applicable for a lightly loaded system; however, the model captures well the normal operating range of the Spotify storage system.

3. A solution for real-time prediction of service metrics

We have designed, developed, and evaluated a solution for predicting real-time service metrics for cluster-based services. The two key components of the solution are a learning method and a real-time analytics engine (see Chapters 9 and 10).

The learning method is service-agnostic in the sense that it takes as input operating-system-level and network-level statistics and, therefore, does not require instrumenting the service on the cluster. We evaluated the models with our method for a range of scenarios. The first set of scenarios considers two different services and different load patterns on the testbed. The second set of scenarios includes running a single service or both services simultaneously on the testbed. The last set of scenarios includes predicting service metrics end-to-end, taking network statistics into account.

The real-time analytics engine processes streams of low-level statistics and client-side metrics and produces model predictions through online learning. This engine is a key building block for an automated service-assurance system, and it has proved to be a powerful tool for model evaluation and demonstration purposes.

The solution satisfies the design goals stated above. First, the models produced by our method are accurate in the sense that the prediction error, i.e., the difference between the model predictions and the measurements from the real systems, is below 15% for most of the experiments. Second, the solution is service agnostic, since it does not require instrumenting the service on the cluster. Third, our solution predicts service metrics on a testbed in real-time, since collecting the feature vector and producing model predictions occur within a subsecond.

1.4 Publications

The results of this research have been documented in nine papers. Six of them have been published in peer-reviewed conferences of the network management research community, namely, the IEEE/IFIP International Conference on Network and Service Management (CNSM), the IFIP/IEEE Symposium on Integrated Network and Service Management (IM), and the IEEE/IFIP Network Operations and Management Symposium (NOMS). In particular, the conference publications are in CNSM 2011, CNSM 2012, IM 2015, and CNSM 2015. Two of them have been published in the Journal of Network and Systems Management (JNSM). One of them has been submitted to the journal for publication and is currently under review. All papers are listed in Chapter 5.

One of the publications, titled “Predicting response times for the Spotify backend”, won the Best Paper Award at CNSM 2012.

The prototype implementation of the real-time analytics engine has been demonstrated at a network management conference (IFIP/IEEE IM) in Ottawa, Canada, in May 2015. We have made the traces from this research publicly available [25, 26].


Chapter 2

Related Research

The work presented in this thesis focuses on three research areas: resource allocation for cloud environments, performance modeling for a distributed key-value store, and analytics-based prediction of service metrics.

2.1 Resource Allocation for Cloud Environments

In this section, we relate our work to recent and current research in the field of resource allocation for cloud environments. We specifically focus on the application placement problem. We refer to a scheduler as a set of machines that compute the solution to the application placement problem. The scheduler is a key part of a resource allocation system. We limit our discussion to key aspects of the scheduler, and a more complete survey is available in [27].

Centralized vs. distributed scheduler

Existing works propose either a centralized or a distributed scheduler. A centralized scheduler uses a single machine for computing the solution to the application placement problem; most existing works employ this scheduler, for example, [28–48]. The main advantage of a centralized scheduler is simplicity, because the scheduler is aware of all resources and applications in the system. The disadvantage for this scheduler is its limited scalability, since the scheduler has to monitor states of all applications and machines to compute the placement solutions. As a result, experiences in data centers show that this scheduler is technically feasible for clusters up to some ten thousand machines.

A distributed scheduler uses a set of machines to jointly compute the solution to the application placement problem. Works that propose a distributed scheduler include [24, 49–54]. Chapters 6 and 7 present our study of such a distributed scheduler. The main advantage of a distributed scheduler is its scalability, because several machines work together to compute the solution to the application placement problem, and the resources required for computation on each machine are limited. However, the disadvantage of this scheduler is that the states of the computation are distributed across machines, which makes this scheduler difficult to design, analyze, implement, and debug.


Initial vs. dynamic placement

Related works can be categorized based on whether they solve an initial or dynamic placement problem. The initial placement problem is a version of the application placement problem, which assumes that no applications have been placed yet. Existing works that solve the initial placement problem include [29, 38, 40, 44, 46, 48, 55].

The dynamic placement problem is a version of the application placement problem that assumes that some applications have already been placed. A solution to this problem therefore generally also minimizes the cost of reconfiguration from the existing placement, e.g., the number of applications that need to be reconfigured. Works that solve the dynamic placement problem include [24, 28, 45, 47, 51, 53, 54, 56–60]. Our investigation in Chapters 6 and 7 falls into this category.

Solutions to the initial and dynamic placement problems complement each other. The solution to the initial placement problem is important for initializing a resource allocation system, i.e., when the system is started and no applications have been placed. The solution to the dynamic placement problem is crucial for handling application churn, a change in application demand, and a change in system resources. In practice, both solutions are required to implement a functional resource allocation system.

Network-aware resource allocation

Related research can be classified based on whether the work considers network resources in addressing the application placement problem; we refer to such work as network-aware resource allocation. Such allocation is crucial for applications that have interconnected and communicating modules, i.e., components of the application. Examples of such applications include MapReduce [61], Dryad [62], and LAMP (Linux, Apache, MySQL, and PHP).

A general approach to network-aware resource allocation is, first, to represent an application as a connected graph, whereby a vertex represents a module and an edge connecting two modules represents the networking demand between them. Second, the infrastructure is also represented by a connected graph, whereby a vertex represents a machine and an edge represents a networking link connecting machines. The application placement problem is then stated as a graph embedding problem [63], where each application graph is to be embedded onto the infrastructure graph. Works on network-aware resource allocation include [36, 37, 40, 48, 59, 64–68].
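
As an illustration of this representation (and not the formulation used in Chapter 7), the following Python sketch encodes a hypothetical application graph and infrastructure graph with plain data structures. The module names, machines, and resource figures are invented and serve only to show how vertex and edge attributes carry compute and network demands.

# Minimal sketch of the graph representation used in network-aware placement.
# All names and numbers below are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # vertex name -> attributes
    edges: dict = field(default_factory=dict)   # (u, v) -> attributes

    def add_node(self, name, **attrs):
        self.nodes[name] = attrs

    def add_edge(self, u, v, **attrs):
        self.edges[(u, v)] = attrs

# Application graph: modules and the networking demand between them.
app = Graph()
app.add_node("web", cpu=2, mem=4)
app.add_node("db", cpu=4, mem=8)
app.add_edge("web", "db", bandwidth=100)        # demanded Mbit/s

# Infrastructure graph: machines and the capacity of the links between them.
infra = Graph()
infra.add_node("m1", cpu=16, mem=64)
infra.add_node("m2", cpu=16, mem=64)
infra.add_edge("m1", "m2", bandwidth=1000)      # link capacity in Mbit/s

# A placement (one candidate embedding) maps application vertices to
# infrastructure vertices; a feasible embedding must respect both the vertex
# (CPU, memory) and the edge (bandwidth) capacities.
placement = {"web": "m1", "db": "m2"}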

Although we do not represent a data center by a connected graph, our work in Chapter 7 describes a solution for network-aware resource allocation. Assuming a full bisection bandwidth network, an emerging networking technology for a single data center, we solve the problem of placing applications that have interconnected and communicating modules.

Energy-aware resource allocation

Related research can be classified based on whether the work adopts a technique that considers the power consumption of machines within a data center. We refer to such work as energy-aware resource allocation. Examples of such techniques are server consolidation and adjusting the power consumption of the CPUs on a machine.


Server consolidation means packing all applications onto a small number of machines, such that the aggregate power consumption is minimized. Works that employ server consolidation include [28–34, 45, 47, 49, 54, 60]. Our investigation in Chapters 6 and 7 includes a protocol that uses this technique.

When a machine is lightly loaded, techniques for adjusting the power consumption of its CPUs, such as dynamic voltage scaling (DVS) or dynamic frequency scaling (DFS), can reduce the energy consumption of the machine. Hence, a resource allocation system can estimate the aggregate demand of the applications placed on a machine and tune the power consumption of the CPUs according to that demand. Works that employ such techniques include [69–72].

2.2 Performance Modeling of a Distributed Key-value Store

In this section, we relate our work to recent and current research with similar objectives in the field of performance modeling of a distributed key-value store.

A distributed key-value store

An important type of storage system is the distributed key-value store. A distributed key-value store runs on a number of storage machines and has two key operations: get and put. The get operation reads a value from the store for a specified key, while the put operation writes a specified value to the store for a given key. In this work, we use the term object for a value kept in the store, and the distributed key-value store we focus on is the Spotify storage system.

The development and evaluation of distributed key-value stores is an active research area. In contrast to Spotify's storage system design, which is hierarchical, many advanced key-value storage systems in operation today are based on a peer-to-peer architecture. Among them are Voldemort [73], Amazon Dynamo [74], Cassandra [12], Riak [75], HyperDex [76], and Scalaris [77]. Most of these stores use some form of consistent hashing to allocate objects to machines. The differences in the designs of these systems are motivated by their respective operational requirements, and they relate to the number of objects to be hosted, the size of the objects, the rate of updates, the number of clients, and the expected scaling of the load.
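
To make the idea of consistent hashing concrete, the sketch below routes get and put operations to storage machines by hashing keys onto a ring. It is a simplified, single-process illustration (one point per machine, no replication) and not the mechanism of any particular system cited above; the machine names are placeholders.

# Simplified consistent hashing for a key-value store; real systems typically
# place many virtual nodes per machine on the ring.
import bisect
import hashlib

def ring_position(key: str) -> int:
    # Map a string onto a point on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashStore:
    def __init__(self, machines):
        self.ring = sorted((ring_position(m), m) for m in machines)
        self.data = {m: {} for m in machines}   # per-machine object storage

    def machine_for(self, key):
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]                # first machine clockwise

    def put(self, key, value):
        self.data[self.machine_for(key)][key] = value

    def get(self, key):
        return self.data[self.machine_for(key)].get(key)

store = ConsistentHashStore(["m1", "m2", "m3"])
store.put("track:42", b"audio-bytes")
print(store.get("track:42"), store.machine_for("track:42"))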

Performance modeling of a storage device

A distributed key-value store includes several storage machines, and a storage machine consists of one or more storage devices that store persistent data. Hence, the performance of the store depends on the performance of its storage machines, which in turn depends on the performance of the storage devices on each machine.

Performance modeling of a storage device is a technique that expresses the relationship between a performance metric of interest, the external load, and the system parameters of a storage device. The related research explores performance models for the major types of storage devices, including magnetic and solid-state drives.


A magnetic drive, or “spinning disk”, is a traditional form of storage device that has rotating discs with magnetic heads. The advantages of this device are its storage capacity and endurance. Examples of works that develop performance models for such a drive are [78–81].

A solid-state drive (SSD), or “flash memory”, is an important type of storage device that has no moving components. The advantage of this device is its low access latency; however, it has low endurance, i.e., it supports only a limited number of writes before the device becomes unusable. Examples of works that develop performance models for such a drive are [82–84].

The model explained in Chapter 8 is not for an individual device but rather for the distributed key-value store. The model captures the performance aspects of an entire system, which consists of multiple storage machines, each of which includes multiple storage devices.

White-box vs. black-box performance models for a storage system

A performance model of a storage system can be classified as either white-box or black-box. A white-box model assumes internal knowledge of the storage system; for example, the storage system can be represented as a queuing system. A white-box model has the advantage that it contains a functional description of the system. Its disadvantage is that its applicability is limited to systems for which the model assumptions hold. Related research that applies a white-box model to study the performance of a storage system includes [79, 80, 82, 83, 85–87]. Chapter 8 presents performance models that are white-box.
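
As a simple illustration of a white-box model (and not the specific models of Chapter 8), a single storage server can be represented as an M/M/1 queue with Poisson request arrivals at rate $\lambda$ and exponentially distributed service times with rate $\mu$. Under these assumptions, standard queuing results express the mean response time and the response time distribution directly in terms of the model parameters:

\bar{T} = \frac{1}{\mu - \lambda}, \qquad \Pr[T > t] = e^{-(\mu - \lambda) t}, \qquad 0 \le \lambda < \mu.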

On the other hand, a black-box model, which is often based on a machine learning technique, does not assume knowledge about the internal workings of the storage system. Hence, such a model is applicable to a wider range of systems than a white-box model; nonetheless, it has the disadvantage that it requires training data from the real system, which may be costly to obtain. Related research that applies a black-box model includes [78, 84, 88, 89]. Our research in Chapters 9 and 10 is in this area.

Performance control of a distributed key-value store

An operator of a distributed key-value store typically has a performance goal, which states that a performance metric of interest must be within a desired range; for example, the average response time of a request must be below a certain threshold. Performance control of a storage system employs a control system to ensure that such a goal is achieved.

Such a control system is often modeled as a feedback controller, which adjusts a control signal based on how the system reacts to the external load. Examples of control signals include a signal to increase the number of storage machines when the average response time is above a threshold and a signal to decrease the number of storage machines when the average response time is below another threshold.
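
A minimal sketch of such a threshold-based feedback loop is shown below. The thresholds, the measurement function, and the scaling actions are placeholders, not the controllers proposed in the cited works; the three functions passed to the loop are assumed to be provided by the storage system's management interface.

# Hypothetical threshold-based scaling loop for a storage cluster.
import time

UPPER_MS = 50.0   # scale out above this average response time (assumed SLO)
LOWER_MS = 10.0   # scale in below this average response time

def control_loop(measure_avg_response_time, add_storage_machine,
                 remove_storage_machine, period_s=30):
    while True:
        avg_ms = measure_avg_response_time()
        if avg_ms > UPPER_MS:
            add_storage_machine()       # control signal: grow the cluster
        elif avg_ms < LOWER_MS:
            remove_storage_machine()    # control signal: shrink the cluster
        time.sleep(period_s)            # actuate once per control period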

Works that propose a performance control system for a distributed key-value store include [90–94]. Our study in Chapter 8 differs in that it focuses on dimensioning and metrics prediction, not on control.


Object placement policy for a distributed key-value store

We refer to a storage cluster as a cluster of storage machines in the distributed key-value store and to an object allocation policy as the allocation of a set of objects to a storage cluster. Chapter 8 considers two types of object allocation policies: a random policy and a popularity-aware policy.

A random policy allocates an object uniformly at random to a storage machine in the cluster. Such a policy has the advantage that it is easy to implement, because it can be realized with functions that approximate random assignment, e.g., hash functions, and the allocation does not need to be adapted over time. Nonetheless, the policy has the disadvantage that storage machines experience different request rates. Related research that applies a random policy includes [74, 85].

A popularity-aware policy allocates an object to a storage machine depending on how frequently the object is requested, i.e., depending on the request rate for this object. The allocation is performed in such a way that each server experiences the same (or almost the same) aggregate request rate, which is the advantage of this policy. However, the policy has the disadvantage that it is complex to implement, since it requires a (large) routing table that must be maintained, and the object allocation needs to be adapted over time as the popularity of objects changes. Related research that applies a popularity-aware policy includes [95–98].
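
The two policies can be contrasted with a short sketch. The machine names and request rates below are made up, and the popularity-aware variant is just a greedy balancing heuristic used for illustration, not the algorithm of the cited works or of Chapter 8.

import hashlib

machines = ["m1", "m2", "m3"]

def random_policy(obj_id: str) -> str:
    # Hash-based assignment approximating uniform random allocation;
    # no routing table is needed and the mapping never changes.
    digest = int(hashlib.md5(obj_id.encode()).hexdigest(), 16)
    return machines[digest % len(machines)]

def popularity_aware_policy(request_rates: dict) -> dict:
    # Greedy heuristic: place the most popular objects first, always on the
    # machine with the lowest aggregate request rate so far. The resulting
    # routing table must be stored and updated as popularity changes.
    load = {m: 0.0 for m in machines}
    routing_table = {}
    for obj, rate in sorted(request_rates.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)
        routing_table[obj] = target
        load[target] += rate
    return routing_table

rates = {"obj1": 100.0, "obj2": 60.0, "obj3": 30.0, "obj4": 10.0}
print({o: random_policy(o) for o in rates})
print(popularity_aware_policy(rates))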

Chapter 8 describes the system dimensioning and configuration parameters for which either a popularity-aware or a random policy provides better performance. We show that the key parameters are the number of objects and the number of servers in the system.

2.3 Analytics-based Prediction of Service Metrics

In this section, we relate our work to recent and current research in the field of analytics-based prediction of service metrics.

Service-specific vs. service-agnostic prediction

Existing works can be categorized based on whether their prediction methods are service-specific or service-agnostic.

Many papers propose a method that is targeted towards a specific service and service-specific metrics. Most of these works consider a small feature set, often fewer than 15 features, which is manually constructed by domain experts. Studies that adopt this approach include [99–113]. For instance, the authors of [105] describe a method for predicting quality-of-service metrics for an IPTV service, while the authors of [102] propose a method that dynamically allocates run-time resources to MapReduce tasks.

In contrast, service-agnostic prediction methods are targeted towards several services and service metrics simultaneously. Works in this category generally start from a large feature space and apply feature reduction techniques to reduce the dimensionality of the feature space. A low-dimensional feature space makes a prediction model less prone to overfitting and is easier to interpret than a high-dimensional feature space. In addition, a low-dimensional feature space requires fewer samples and therefore fewer computational resources for training the model. For service-agnostic prediction, feature selection is part of the learning method. Example works in this category include [112, 114, 115]. Since our research focuses on generic analytics support for cloud infrastructure, our work described in Chapters 9 and 10 belongs to this category.

There is a tradeoff between service-specific and service-agnostic prediction methods. Service-specific methods are simpler, but they do not generalize. In contrast, service-agnostic methods can be used for several services, at the cost of higher complexity and more computational resources.
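
To make the feature-reduction step concrete, the sketch below selects the k features most correlated with the target metric before fitting a linear model. The data are synthetic and the correlation-based criterion is only one of many possible choices, not the method used in Chapters 9 and 10.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))            # 200 candidate device statistics
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(scale=0.1, size=500)

def select_top_k(X, y, k=10):
    # Rank features by absolute Pearson correlation with the target metric.
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:k]

selected = select_top_k(X, y, k=10)
X_red = X[:, selected]                     # low-dimensional feature space

# Fit an ordinary least-squares model on the reduced features.
beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(y)), X_red], y, rcond=None)
print(selected[:5], beta[:3])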

Regression vs. classification problem

Using the framework of statistical learning, existing works formulate the problem of predicting service metrics as either a regression problem or a classification problem.

The solution to a regression problem is a continuous function that produces the expected value of a service metric from a given feature vector. Most related works in this category model metric prediction as a regression problem, for example, [99–110, 112, 113, 116]. Chapters 9 and 10 formulate the prediction problem as a regression problem.

In the case where a service metric has only a few possible states, i.e., its domain is discrete, service-metric prediction is modeled as a classification problem. An example of such a metric is a value being below (or above) a predefined threshold (which can indicate a violation of or compliance with a service-level objective). Sample works that apply classification include [112, 114, 115, 117–119]. For instance, [114] presents a method for classifying the compliance state of a service-level objective based on the response times of a three-tier web application. Another example is [118], which describes a method for classifying the time range of a query execution time for a data warehouse system.

In the context of predicting service metrics, a regression problem is more flexible than a classification problem, since it enables service-metric prediction for a range of service-level objectives.
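
The following sketch, built on synthetic numbers, illustrates the two formulations for the same metric: the regression target is the metric value itself, while the classification target is obtained by thresholding the metric against a hypothetical service-level objective.

import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 5))          # e.g., device statistics
response_time = 20 + 5 * features[:, 0] + rng.normal(scale=2, size=1000)

# Regression formulation: predict the metric value directly.
y_regression = response_time

# Classification formulation: predict compliance with an SLO threshold
# (here an illustrative 25 ms bound on response time).
SLO_MS = 25.0
y_classification = (response_time <= SLO_MS).astype(int)   # 1 = SLO met

print(y_regression[:3], y_classification[:3])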

Batch vs. online learning

Research related to analytics-based prediction of service metrics is generally based either on batch learning or on online learning.

Batch learning is the standard approach in statistical learning. It is typically applied to find an accurate model that fits a given dataset. The approach considers a set of training samples drawn from a joint distribution of features and response variables. In the case of batch learning, the learned models do not change over time. Recent research that follows this approach includes [104, 106–108, 110, 112–114, 119]. For instance, the authors of [107, 108, 113] present methods for predicting service metrics of multi-tier web applications by training models on measurement data. We adopt this approach in Chapters 9 and 10.

Online learning is a statistical learning approach that is commonly followed when the system evolves over time. In this case, training samples arrive sequentially as a time series. This series is processed sequentially and results in a sequence of models. Works that are based on online learning include [111, 112, 120, 121]. For example, the authors of [111, 121] propose online learning methods to forecast service metrics for web service selection. We apply online learning in Chapter 10.


In this thesis, we use both batch learning and online learning methods. We apply batch learning on traces from our testbed to obtain a baseline for model accuracy. We apply online learning as a basis for implementing real-time prediction, since samples become available in a sequential fashion as time evolves.
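
As an illustration of the online setting, the sketch below maintains a linear model with a stochastic gradient update applied to each new sample as it arrives. The learning rate and the synthetic stream are arbitrary assumptions, and the sketch is not the learning algorithm evaluated in Chapter 10.

import numpy as np

rng = np.random.default_rng(2)
d = 5                                   # number of features per sample
w = np.zeros(d + 1)                     # model weights, incl. intercept
LEARNING_RATE = 0.01

def online_update(w, x, y):
    # One stochastic gradient step for squared loss on a single sample.
    xb = np.append(1.0, x)              # prepend the intercept term
    error = xb @ w - y
    return w - LEARNING_RATE * error * xb

# Samples arrive one at a time, e.g., from a stream of device statistics.
true_w = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 3.0])
for _ in range(5000):
    x = rng.normal(size=d)
    y = np.append(1.0, x) @ true_w + rng.normal(scale=0.1)
    w = online_update(w, x, y)          # the model is updated, then reused

print(np.round(w, 2))                   # approaches true_w over the stream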

Device, network, and application statistics

Existing works can be grouped based on the type of features (or predictor variables) that are used as input for training models: device statistics, network statistics, and application (or service) statistics. Our research in Chapters 9 and 10 uses device and network statistics.

Device statistics are measurements from the devices or operating systems of server machines, e.g., CPU utilization, memory utilization, or the number of active TCP sockets. Using these statistics is important for building prediction models that capture the usage of device resources, system events, etc. Related research that considers device statistics includes [114, 115, 119]. For instance, the authors of [114] propose models that predict violations of service-level objectives for an enterprise web application.

Network statistics are computed from local metrics of network devices or from measurements along network paths. Examples of such features are available bandwidth and network delay. Such statistics are essential for creating prediction models that capture network properties and behavior. Sample works that utilize network statistics are [104–106]. The authors of [104] present an approach to predicting service metrics for IPTV streaming from a set of network statistics that includes network delay, packet loss, and jitter.

Application statistics are service-specific measurements. Examples include response time, queue length, buffer utilization, and the waiting time of service requests. Publications that make use of application statistics include [107–112]. For example, the authors of [109] propose a method for predicting service metrics of web applications for a new user, based on the experiences of other users.


Chapter 3

Summary of Original Work

The results of this research are documented in nine papers, out of which six have been published in conferences, two have been published in journals, and one has been submitted for journal publication. The complete list is presented in Chapter 5. Five papers are included in the text of this thesis.

Paper A: Gossip-based Resource Allocation for Green Computing in Large Clouds

We address the problem of resource allocation in a large-scale cloud environment, which we formalize as that of dynamically optimizing a cloud configuration for green computing objectives under CPU and memory constraints. We propose a generic gossip protocol for resource allocation, which can be instantiated for specific objectives. We develop an instantiation of this generic protocol that aims at minimizing power consumption through server consolidation, while satisfying a changing load pattern. This protocol, called GRMP-Q, provides an efficient heuristic solution that performs well in most cases; in special cases it is optimal. Under overload, the protocol gives a fair allocation of CPU resources to clients. Simulation results suggest that key performance metrics do not change with increasing system size, making the resource allocation process scalable to well above 100,000 servers. Generally, the effectiveness of the protocol in achieving its objective increases with increasing memory capacity in the servers.

This paper appears in this thesis as Chapter 6. It is also published as:

R. Yanggratoke, F. Wuhib and R. Stadler, “Gossip-based resource allocation for green computing in large clouds,” In Proc. 7th International Conference on Network and Service Management (CNSM), Paris, France, October 24-28, 2011.

My contribution to the work has been protocol design, implementation, and evaluation. In addition to fruitful discussions on all aspects of this work, Fetahi Wuhib from Ericsson Research helped with formalizing and improving the protocol.


Paper B: Allocating Compute and Network Resources under Management Objectives in Large-Scale Clouds

We consider the problem of jointly allocating compute and network resources in a large Infrastructure-as-a-Service (IaaS) cloud. We formulate the problem of optimally allocating resources to virtual data centers (VDCs) for four well-known management objectives: balanced load, energy efficiency, fair allocation, and service differentiation. Then, we outline an architecture for resource allocation, which centers around a set of cooperating controllers, each solving a problem related to the chosen management objective. We illustrate how a global management objective is mapped onto objectives that govern the execution of these controllers. For a key controller, the Dynamic Placement Controller, we give a detailed distributed design, which is based on a gossip protocol that can switch between management objectives. The design is applicable to a broad class of management objectives, which we characterize through a property of the objective function. We evaluate, through simulation, the dynamic placement of VDCs for a large cloud under changing load and VDC churn. Simulation results show that this controller is effective and highly scalable, up to 100,000 nodes, for the management objectives considered.

This paper appears in this thesis as Chapter 7. It is also published as:

F. Wuhib, R. Yanggratoke, and R. Stadler, “Allocating Compute and Network Resources under Management Objectives in Large-Scale Clouds,” Journal of Network and Systems Management (JNSM), Vol. 23, No. 1, pp. 111-136, January 2015.

Fetahi Wuhib from Ericsson Research formalized and designed the protocol. My contribution to the work has been the implementation and evaluation of the protocol.

Paper C: On the Performance of the Spotify Backend

We model and evaluate the performance of a distributed key-value storage system that is part of the Spotify backend. Spotify is an on-demand music streaming service, offering low-latency access to a library of over 20 million tracks and currently serving over 20 million users. We first present a simplified model of the Spotify storage architecture, in order to make its analysis feasible. We then introduce an analytical model for the distribution of the response time, a key metric in the Spotify service. We parameterize and validate the model using measurements from two different testbed configurations and from the operational Spotify infrastructure. We find that the model is accurate, with measurements within 11% of predictions, within the range of normal load patterns. In addition, we model the capacity of the Spotify storage system under different object allocation policies and find that measurements on our testbed are within 9% of the model predictions. The model helps us justify the object allocation policy adopted for the Spotify storage system.

This paper appears in this thesis as Chapter 8. It is also published as:

R. Yanggratoke, G. Kreitz, M. Goldmann, R. Stadler and V. Fodor, “On the performance of the Spotify backend,” Journal of Network and Systems Management (JNSM), Vol. 23, No. 1, pp. 210-237, January 2015.

My contribution to the work has been problem formulation, design and implementation of analytical models, testbed setup and configuration, experimental design, data collection, and model evaluation. In addition to fruitful discussions on all aspects of this work, Gunnar Kreitz and Mikael Goldmann from Spotify helped with describing the Spotify backend architecture. Viktoria Fodor from KTH provided guidance in developing and verifying the analytical models.

Paper D: Predicting Real-time Service-level Metrics from Device Statistics

While real-time service assurance is critical for emerging telecom cloud services, understanding and predicting performance metrics for such services is hard. In this paper, we pursue an approach based upon statistical learning whereby the behavior of the target system is learned from observations. We use methods that learn from device statistics and predict metrics for services running on these devices. Specifically, we collect statistics from the Linux kernel of a server machine and predict client-side metrics for a video-streaming service (VLC). The fact that we collect thousands of kernel variables, while omitting service instrumentation, makes our approach service-independent and unique. While our current lab configuration is simple, our results, gained through extensive experimentation, prove the feasibility of accurately predicting client-side metrics, such as video frame rates and RTP packet rates, often within 10-15% error (NMAE), also under high computational load and across traces from different scenarios.

This paper appears in this thesis as Chapter 9. It is also published as:

R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, R. Stadler, “Predicting real-time service-level metrics from device statistics,” In Proc. 14th IFIP/IEEE International Symposium on Integrated Network Management (IM), Ottawa, Canada, 11-15 May, 2015.

My contribution to the work has been problem formulation, formalizing the learning problem, testbed setup and implementation, experimental design, data collection, and evaluation of the learning models. In addition to fruitful discussions on all aspects of this work and providing some text for the paper, Andreas Johnsson developed the sensor code, John Ardelius performed the literature review, and Jawwad Ahmed and Christofer Flinta helped with defining experiments and interpreting results.

Paper E: A Service-agnostic Method for Predicting Service Metrics in Real-time

We predict performance metrics of cloud services using statistical learning, whereby the behavior of a system is learned from observations. Specifically, we collect device and network statistics from a cloud testbed and apply regression methods to predict, in real-time, client-side service metrics for video streaming and key-value store services. Our method is service-agnostic in the sense that it takes as input operating-system and network statistics instead of service-specific metrics. We show that feature set reduction significantly improves the prediction accuracy in our case, while simultaneously reducing model computation time. We find that the prediction accuracy decreases when, instead of a single service, both services run on the same testbed simultaneously, or when the network quality on the path between the server cluster and the client deteriorates. Finally, we discuss the design and implementation of a real-time analytics engine, which processes streams of device statistics and service metrics from testbed sensors and produces model predictions through online learning.

This paper appears in this thesis as Chapter 10. The paper has been submitted for publication in the Journal of Network and Systems Management (JNSM). My contribution to the work has been problem formulation, formalizing the learning problem, testbed setup and implementation, experimental design, data collection, and evaluation of the learning models. In addition to fruitful discussions on all aspects of this work, the other authors helped with defining experiments and interpreting results.


Chapter 4

Open Problems for Future Research

Based on the work presented in this thesis, we have identified open research questions in resource allocation for large clouds, performance modeling of a distributed key-value store, and real-time prediction of service metrics that require further investigation.

Resource allocation in large clouds

• Centralized vs. decentralized resource allocation: A fundamental investigation is needed into centralized vs. distributed resource allocation schemes, with respect to system size, management objectives, service-level objectives, resource efficiency, management overhead, and robustness. This work includes formal modeling, as well as measurements from real systems, which are used to populate the model parameters.

• Solving the resource allocation problem in a distributed fashion using a gossip protocol: A thorough investigation is needed into how the choice of the management objective (i.e., the objective function) and specific load patterns affect the effectiveness and the convergence speed of the gossip protocol presented in this thesis. One aspect of this work will be investigating to what extent the effectiveness of the protocol can be improved in the case of high load and overload. Second, the work in this thesis relies on a specific network model whose application is feasible only for a single data center. An important question is how to extend our work to cover an alternative network model that is suitable for resource management across data centers and for telecom clouds.

Performance modeling of a distributed key-value store

• Black-box models for performance predictions: In this work, we adopt probabilistic and stochastic models for performance modeling of the Spotify backend. An alternative approach is to apply machine learning techniques. Such techniques enable performance predictions for a wider range of storage systems, because they do not assume knowledge about their internal workings. Part of this investigation should be a comparative study of the two different approaches.


• Online performance management using analytical modeling: In this thesis, we apply our models to offline system dimensioning. An interesting future direction is to apply our models in the context of performance management during the operation of the system. This includes developing a subsystem that continuously estimates model parameters at runtime, taking into account that the resources of the storage servers may be shared with other tasks. The results from this thesis can be a basis for engineering a feedback controller that aims at achieving a performance goal under changing load.

Analytics-based prediction of service metrics

• Prediction in large systems: In Chapters 9 and 10, we focus on training learning models in a centralized fashion for a small system. Device or network statistics are collected at a single location before the training process starts. A fundamental problem is how our basic approach can scale to a large system that includes, for instance, 100,000 or more devices. There are at least two ways to address this problem. First, the computation of the learning models can be performed on a dedicated (distributed) infrastructure, and many investigations are pursuing this direction, e.g., [122, 123]. Second, the computation can be performed on the service infrastructure, close to the sources of the device statistics. In this case, we need to address the problem of distributed learning with feature sharding [124, 125], whereby the features of a single training sample are distributed across devices. This direction has so far received less attention, but it fits our approach to metrics prediction, because the features are inherently distributed. A thorough investigation is needed to study and compare both directions, in order to achieve accurate prediction in large systems.

• Analytics-based performance management: In Chapters 9 and 10, we limit ourselves to the prediction problem, i.e., identifying prediction models for real-time service metrics, but we do not consider actions to achieve performance management goals. We propose research into how to perform service assurance based on predictions of service metrics. One of the difficulties is that the space of possible actions is very large. Such actions can include relocating applications, changing scheduling policies, adding physical resources, and upgrading or reinitializing software components. Methods have to be developed to identify corrective actions that minimize the cost associated with customer satisfaction and operational expenses.

• Prediction of end-to-end service metrics: Chapter 10 presents an early result that takes network-related features into account for end-to-end prediction. The method in Chapter 10 requires end-to-end measurements, and it does not scale well to a large number of clients at different locations. An alternative approach, which does not require end-to-end measurements, is to extend the feature space to include local statistics from network devices, such as routers and switches. An issue with this approach is how to deal with large networks. For a specific client, statistics from a limited number of network devices will be relevant for end-to-end predictions, and these features are location-specific. A careful study investigating, comparing, and possibly extending these two approaches is warranted.


• Forecasting of service metrics: Chapters 9 and 10 focus on estimating service metrics for the current time, based on current and past statistics (either device or network statistics). The presented method is suitable for reactive control actions, but not for proactive control. Proactive control is essential for guaranteeing service quality, but it requires forecasting of service metrics, i.e., estimating service metrics for future times. The traditional approach to metrics forecasting is based on the well-understood field of time series analysis. In the domain of cloud services, there are important scenarios where considering only the history of service metrics is not sufficient to predict future metrics. Consider the case where the cloud and network infrastructure is simultaneously shared by multiple independent services. Then, a particular service metric depends on the resource consumption of all services. In such a case, it is very difficult to capture the evolution of service metrics with a time series model. Therefore, we propose work investigating a new forecasting method that is based on the results of this thesis. For instance, one can expand the feature space with historical device and network statistics and estimate the target metric for a future point in time.
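
As a minimal sketch of this last idea, the code below builds lagged feature vectors from a synthetic stream of statistics and uses the metric a few steps ahead as the regression target. The window length, forecast horizon, and data are invented for illustration and do not represent the method proposed above.

import numpy as np

rng = np.random.default_rng(3)
T, d = 2000, 4                        # time steps and statistics per step
stats = rng.normal(size=(T, d))       # device/network statistics over time
# Toy metric that depends on a statistic observed five steps earlier, so that
# a forecast three steps ahead is learnable from the recent history.
metric = np.concatenate([np.zeros(5), 2.0 * stats[:-5, 0]])
metric = metric + 0.1 * rng.normal(size=T)

WINDOW, HORIZON = 5, 3                # lags used and forecast horizon

X, y = [], []
for t in range(WINDOW - 1, T - HORIZON):
    # Features: the statistics of the last WINDOW time steps, flattened ...
    X.append(stats[t - WINDOW + 1 : t + 1].ravel())
    # ... target: the service metric HORIZON steps into the future.
    y.append(metric[t + HORIZON])
X, y = np.array(X), np.array(y)

# Any regression method can be trained on (X, y); here, ordinary least squares.
beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(y)), X], y, rcond=None)
print(X.shape, beta.shape)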


Chapter 5

List of Publications in the Context of this Thesis

1. R. Yanggratoke, F. Wuhib and R. Stadler, “Gossip-based resource allocation for green computing in large clouds,” In Proc. 7th International Conference on Network and Service Management (CNSM), Paris, France, October 24-28, 2011.

2. R. Yanggratoke, G. Kreitz, M. Goldmann and R. Stadler, “Predicting response times for the Spotify backend,” In Proc. 8th International Conference on Network and Service Management (CNSM), Las Vegas, NV, USA, October 22-26, 2012. This publication received the Best Paper Award from the conference.

3. F. Wuhib, R. Yanggratoke, and R. Stadler, “Allocating Compute and Network Resources under Management Objectives in Large-Scale Clouds,” Journal of Network and Systems Management (JNSM), Vol. 23, No. 1, pp. 111-136, January 2015.

4. R. Yanggratoke, G. Kreitz, M. Goldmann, R. Stadler and V. Fodor, “On the performance of the Spotify backend,” Journal of Network and Systems Management (JNSM), Vol. 23, No. 1, pp. 210-237, January 2015.

5. R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, R. Stadler, “Predicting real-time service-level metrics from device statistics,” In Proc. 14th IFIP/IEEE International Symposium on Integrated Network Management (IM), Ottawa, Canada, 11-15 May, 2015.

6. R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, R. Stadler, “A platform for predicting real-time service-level metrics from device statistics,” In Proc. 14th IFIP/IEEE International Symposium on Integrated Network Management (IM), Ottawa, Canada, 11-15 May, 2015.

7. R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, R. Stadler, “Predicting service metrics for cluster-based services using real-time analytics,” In Proc. 11th International Conference on Network and Service Management (CNSM), Barcelona, Spain, 9-13 November, 2015.


8. R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, R. Stadler, “A service-agnostic method for predicting service metrics in real-time,” submitted to Journal of Network and Systems Management (JNSM).

9. J. Ahmed, A. Johnsson, R. Yanggratoke, J. Ardelius, C. Flinta, R. Stadler, “Predicting SLA Conformance for Cluster-Based Services Using Distributed Analytics,” accepted to IFIP/IEEE Network Operations and Management Symposium (NOMS), 2016.

Technical reports

1. R. Yanggratoke, F. Wuhib, and R. Stadler (2011). Gossip-based resource allocation for green computing in large clouds (long version).

2. R. Yanggratoke, G. Kreitz, M. Goldmann, and R. Stadler, “A Performance Model of the Spotify backend,” In Proc. of Swedish National Computer Networking Workshop (SNCNW), 2012.

3. R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, R. Stadler (2015). Predicting Real-time Service-level Metrics from Device Statistics. KTH Royal Institute of Technology.

4. J. Ahmed, A. Johnsson, R. Yanggratoke, J. Ardelius, C. Flinta, and R. Stadler (2015). Predicting SLA Violations in Real Time using Online Machine Learning. arXiv preprint arXiv:1509.01386.

Public datasets

1. R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, and R. Stadler (2014). Linux kernel statistics from a video server and service metrics from a video client. Distributed by Machine learning data set repository [MLData.org]. http://mldata.org/repository/data/viewslug/realm-im2015-vod-traces

2. R. Yanggratoke and R. Stadler (2015). Linux kernel statistics from a video-streaming cluster and service metrics from a video client. Distributed by Machine learning data set repository [MLData.org]. http://mldata.org/repository/data/viewslug/realm-cnsm2015-vod-traces


Bibliography

[1] EU. 10 Key Recommendations - Vision and Needs, Impacts and Instruments. EU, 2011. url: http://cordis.europa.eu/fp7/ict/istag/documents/istag_key_recommendations_beyond_2013_full.pdf.

[2] 123net. Wikimedia. https://commons.wikimedia.org/wiki/File:123Net_Data_Center_(DC2).jpg License: Creative Commons Attribution-ShareAlike 3.0 Unported. 2011.

[3] Peter Mell and Tim Grance. The NIST Definition of Cloud Computing. Tech. rep. NIST, 2009. url: http://www.csrc.nist.gov/groups/SNS/cloud-computing/.

[4] Amazon. Amazon Web Services. http://aws.amazon.com/.

[5] Google. Google Cloud Platform. https://cloud.google.com.

[6] Microsoft. Windows Azure. http://www.windowsazure.com/.

[7] Dropbox. Dropbox. http://www.dropbox.com/.

[8] Netflix. Netflix. http://netflix.com.

[9] Katie Fehrenbacher. The story behind how Apple's iCloud data center got built. http://gigaom.com/2012/07/12/the-story-behind-how-apples-icloud-data-center-got-built/. 2012.

[10] Sarah Silbert. Facebook building $1.5 billion data center in Altoona, Iowa. http://www.engadget.com/2013/04/22/facebook-building-1-5-billion-data-center-in-altoona-iowa/. 2013.

[11] Eric Lai. Microsoft’s $500M Iowa data center to use shipping containers. http://www.computerworld.com/s/article/9113202/Microsoft_s_500M_Iowa_data_center_to_use_shipping_containers.

[12] Avinash Lakshman and Prashant Malik. “Cassandra: a decentralized structuredstorage system”. In: SIGOPS Oper. Syst. Rev. 44.2 (Apr. 2010), pp. 35–40. issn:0163-5980. doi: 10.1145/1773912.1773922. url: http://doi.acm.org/10.1145/1773912.1773922.

[13] Giuseppe DeCandia et al. “Dynamo: amazon’s highly available key-value store”.In: SIGOPS Oper. Syst. Rev. 41.6 (Oct. 2007), pp. 205–220. issn: 0163-5980. doi:10.1145/1323293.1294281. url: http://doi.acm.org/10.1145/1323293.1294281.


[14] Rusty Klophaus. “Riak Core: building distributed applications without sharedstate”. In: ACM SIGPLAN Commercial Users of Functional Programming. CUFP’10. Baltimore, Maryland: ACM, 2010, 14:1–14:1. isbn: 978-1-4503-0516-7. doi:10.1145/1900160.1900176. url: http://doi.acm.org/10.1145/1900160.1900176.

[15] MongoDB. https://www.mongodb.com/.

[16] G. Kreitz and F. Niemela. “Spotify – Large Scale, Low Latency, P2P Music-on-Demand Streaming”. In: Peer-to-Peer Computing (P2P), 2010 IEEE Tenth International Conference on. 2010, pp. 1–10. doi: 10.1109/P2P.2010.5569963.

[17] The Spotify Team. 20 Million Reasons to Say Thanks. https://news.spotify.com/us/2015/06/10/20-million-reasons-to-say-thanks/. 2015.

[18] Youtube. Youtube. http://youtube.com.

[19] ITU. ITU-T M.3400. ITU, 2000.

[20] H. Shachnai and T. Tamir. “On Two Class-Constrained Versions of the MultipleKnapsack Problem”. English. In: Algorithmica 29.3 (2001), pp. 442–467. issn:0178-4617. doi: 10.1007/s004530010057. url: http://dx.doi.org/10.1007/s004530010057.

[21] Ken Birman. “The promise, and limitations, of gossip protocols”. In: SIGOPS Oper. Syst. Rev. 41.5 (Oct. 2007), pp. 8–13. issn: 0163-5980. doi: 10.1145/1317379.1317382. url: http://doi.acm.org/10.1145/1317379.1317382.

[22] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2014.

[23] João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues. “Issues in evaluation of stream learning algorithms”. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2009, pp. 329–338.

[24] F. Wuhib, R. Stadler, and H. Lindgren. “Dynamic resource allocation with management objectives: Implementation for an OpenStack cloud”. In: Network and Service Management (CNSM), 2012 8th International Conference and 2012 Workshop on Systems Virtualization Management (SVM). 2012, pp. 309–315.

[25] Rerngvit Yanggratoke et al. Linux kernel statistics from a video server and service metrics from a video client. Distributed by Machine learning data set repository [MLData.org]. http://mldata.org/repository/data/viewslug/realm-im2015-vod-traces. 2014.

[26] Rerngvit Yanggratoke and Rolf Stadler. Linux kernel statistics from a video-streaming cluster and service metrics from a video client. Distributed by Machine learning data set repository [MLData.org]. http://mldata.org/repository/data/viewslug/realm-cnsm2015-vod-traces/. 2015.

[27] Brendan Jennings and Rolf Stadler. “Resource Management in Clouds: Survey and Research Challenges”. English. In: Journal of Network and Systems Management 23.3 (2015), pp. 567–619. issn: 1064-7570. doi: 10.1007/s10922-014-9307-7. url: http://dx.doi.org/10.1007/s10922-014-9307-7.


[28] D. Gmach et al. “An integrated approach to resource pool management: Policies, efficiency and quality metrics”. In: Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on. 2008, pp. 326–335. doi: 10.1109/DSN.2008.4630101.

[29] Akshat Verma et al. “Server workload analysis for power minimization using consolidation”. In: USENIX’09. San Diego, California: USENIX Association, 2009, pp. 28–28.

[30] M. Cardosa, M.R. Korupolu, and A. Singh. “Shares and utilities based power consolidation in virtualized server environments”. In: IM 2009. 2009, pp. 327–334. doi: 10.1109/INM.2009.5188832.

[31] B. Speitkamp and M. Bichler. “A Mathematical Programming Approach for Server Consolidation Problems in Virtualized Data Centers”. In: IEEE TSC 3.4 (2010), pp. 266–278. issn: 1939-1374. doi: 10.1109/TSC.2010.25.

[32] C. Subramanian, A. Vasan, and A. Sivasubramaniam. “Reducing data center power with server consolidation: Approximation and evaluation”. In: HiPC 2010. 2010, pp. 1–10. doi: 10.1109/HIPC.2010.5713161.

[33] Vinicius Petrucci, Orlando Loques, and Daniel Mosse. “Dynamic optimization of power and performance for virtualized server clusters”. In: ACM SAC 2010. Sierre, Switzerland, 2010, pp. 263–264. isbn: 978-1-60558-639-7.

[34] Niraj Tolia et al. “Unified Thermal and Power Management in Server Enclosures”. In: ASME Conference Proceedings 2009.43604 (2009), pp. 721–730. doi: 10.1115/InterPACK2009-89075.

[35] V. Shrivastava et al. “Application-aware virtual machine migration in data centers”. In: INFOCOM, 2011 Proceedings IEEE. 2011, pp. 66–70. doi: 10.1109/INFCOM.2011.5935247.

[36] Chuanxiong Guo et al. “SecondNet: a data center network virtualization architecture with bandwidth guarantees”. In: Proceedings of the 6th International Conference. Co-NEXT ’10. Philadelphia, Pennsylvania: ACM, 2010, 15:1–15:12. isbn: 978-1-4503-0448-1. doi: 10.1145/1921168.1921188.

[37] Hitesh Ballani et al. “Towards predictable datacenter networks”. In: Proceedings of the ACM SIGCOMM 2011 conference. SIGCOMM ’11. Toronto, Ontario, Canada: ACM, 2011, pp. 242–253. isbn: 978-1-4503-0797-0. doi: 10.1145/2018436.2018465. url: http://doi.acm.org/10.1145/2018436.2018465.

[38] Meng Wang, Xiaoqiao Meng, and Li Zhang. “Consolidating virtual machines with dynamic bandwidth demand in data centers”. In: INFOCOM, 2011 Proceedings IEEE. 2011, pp. 71–75. doi: 10.1109/INFCOM.2011.5935254.

[39] D. Breitgand and A. Epstein. “Improving consolidation of virtual machines with risk-aware bandwidth oversubscription in compute clouds”. In: INFOCOM, 2012 Proceedings IEEE. 2012, pp. 2861–2865. doi: 10.1109/INFCOM.2012.6195716.

[40] Xiaoqiao Meng, V. Pappas, and Li Zhang. “Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement”. In: INFOCOM, 2010 Proceedings IEEE. 2010, pp. 1–9. doi: 10.1109/INFCOM.2010.5461930.


[41] Gunho Lee et al. “Topology-aware resource allocation for data-intensive workloads”. In: SIGCOMM Comput. Commun. Rev. 41.1 (2011), pp. 120–124. issn: 0146-4833. doi: 10.1145/1925861.1925881.

[42] J.W. Jiang et al. “Joint VM placement and routing for data center traffic engineering”. In: INFOCOM, 2012 Proceedings IEEE. 2012, pp. 2876–2880. doi: 10.1109/INFCOM.2012.6195719.

[43] O. Biran et al. “A Stable Network-Aware VM Placement for Cloud Systems”. In: Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on. 2012, pp. 498–506. doi: 10.1109/CCGrid.2012.119.

[44] D. Jayasinghe et al. “Improving Performance and Availability of Services Hosted on IaaS Clouds with Structural Constraint-Aware Virtual Machine Placement”. In: Services Computing (SCC), 2011 IEEE International Conference on. 2011, pp. 72–79. doi: 10.1109/SCC.2011.28.

[45] M. Tighe and M. Bauer. “Integrating cloud application autoscaling with dynamic VM allocation”. In: Network Operations and Management Symposium (NOMS), 2014 IEEE. 2014, pp. 1–9. doi: 10.1109/NOMS.2014.6838239.

[46] HwaMin Lee, Young-Sik Jeong, and HaengJin Jang. “Performance analysis based resource allocation for green cloud computing”. English. In: The Journal of Supercomputing 69.3 (2014), pp. 1013–1026. issn: 0920-8542. doi: 10.1007/s11227-013-1020-x. url: http://dx.doi.org/10.1007/s11227-013-1020-x.

[47] Guruh Fajar Shidik, Khabib Mustofa, et al. “Evaluation of Selection Policy with Various Virtual Machine Instances in Dynamic VM Consolidation for Energy Efficient at Cloud Data Centers”. In: Journal of Networks 10.7 (2015), pp. 397–406.

[48] J. Soares et al. “Resource allocation in the network operator’s cloud: A virtualization approach”. In: Computers and Communications (ISCC), 2012 IEEE Symposium on. 2012, pp. 000800–000805. doi: 10.1109/ISCC.2012.6249399.

[49] Gueyoung Jung et al. “Mistral: Dynamically Managing Power, Performance, and Adaptation Cost in Cloud Infrastructures”. In: ICDCS 2010. 2010, pp. 62–73. doi: 10.1109/ICDCS.2010.88.

[50] Deepal Jayasinghe et al. “Improving Performance and Availability of Services Hosted on IaaS Clouds with Structural Constraint-Aware Virtual Machine Placement”. In: Proceedings of the 2011 IEEE International Conference on Services Computing. SCC ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 72–79. isbn: 978-0-7695-4462-5. doi: 10.1109/SCC.2011.28. url: http://dx.doi.org/10.1109/SCC.2011.28.

[51] Y.O. Yazir et al. “Dynamic Resource Allocation in Computing Clouds Using Distributed Multiple Criteria Decision Analysis”. In: IEEE International Conference on Cloud Computing. 2010, pp. 91–98. doi: 10.1109/CLOUD.2010.66.


[52] Eugen Feller, Louis Rilling, and Christine Morin. “Snooze: A Scalable and Autonomic Virtual Machine Management Framework for Private Clouds”. In: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012). CCGRID ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 482–489. isbn: 978-0-7695-4691-9. doi: 10.1109/CCGrid.2012.71. url: http://dx.doi.org/10.1109/CCGrid.2012.71.

[53] F. Wuhib, R. Stadler, and M. Spreitzer. “A Gossip Protocol for Dynamic Resource Management in Large Cloud Environments”. In: Network and Service Management, IEEE Transactions on 9.2 (2012), pp. 213–225. issn: 1932-4537. doi: 10.1109/TNSM.2012.031512.110176.

[54] M. Tighe et al. “A distributed approach to dynamic VM management”. In: Network and Service Management (CNSM), 2013 9th International Conference on. 2013, pp. 166–170. doi: 10.1109/CNSM.2013.6727830.

[55] G. Koslovski et al. “Locating Virtual Infrastructures: Users and InP perspectives”. In: Integrated Network Management (IM), 2011 IFIP/IEEE International Symposium on. 2011, pp. 153–160. doi: 10.1109/INM.2011.5990686.

[56] Xiangliang Zhang et al. “Virtual machine migration in an over-committed cloud”. In: Network Operations and Management Symposium (NOMS), 2012 IEEE. 2012, pp. 196–203. doi: 10.1109/NOMS.2012.6211899.

[57] Hong Xu and Baochun Li. “Egalitarian stable matching for VM migration in cloud computing”. In: Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on. 2011, pp. 631–636. doi: 10.1109/INFCOMW.2011.5928889.

[58] Jing Xu and Jose Fortes. “A multi-objective approach to virtual machine management in datacenters”. In: Proceedings of the 8th ACM international conference on Autonomic computing. ICAC ’11. Karlsruhe, Germany: ACM, 2011, pp. 225–234. isbn: 978-1-4503-0607-2. doi: 10.1145/1998582.1998636. url: http://doi.acm.org/10.1145/1998582.1998636.

[59] M.F. Zhani et al. “VDC Planner: Dynamic migration-aware Virtual Data Center embedding for clouds”. In: Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on. 2013, pp. 18–25.

[60] Qi Zhang et al. “Harmony: Dynamic Heterogeneity-Aware Resource Provisioning in the Cloud”. In: Distributed Computing Systems (ICDCS), 2013 IEEE 33rd International Conference on. 2013, pp. 510–519. doi: 10.1109/ICDCS.2013.28.

[61] MapReduce. http://www.mapreduce.org. 2012.

[62] Dryad. http://research.microsoft.com/en-us/projects/dryad/. 2012.

[63] D. Archdeacon. “The Complexity of the Graph Embedding Problem”. English. In: Topics in Combinatorics and Graph Theory. Ed. by Rainer Bodendiek and Rudolf Henn. Physica-Verlag HD, 1990, pp. 59–64. isbn: 978-3-642-46910-7. doi: 10.1007/978-3-642-46908-4_6. url: http://dx.doi.org/10.1007/978-3-642-46908-4_6.


[64] A. Amokrane et al. “Greenslater: On Satisfying Green SLAs in Distributed Clouds”. In: Network and Service Management, IEEE Transactions on 12.3 (2015), pp. 363–376. issn: 1932-4537. doi: 10.1109/TNSM.2015.2440423.

[65] Qi Zhang et al. “Venice: Reliable virtual data center embedding in clouds”. In: INFOCOM, 2014 Proceedings IEEE. 2014, pp. 289–297. doi: 10.1109/INFOCOM.2014.6847950.

[66] Wai-Leong Yeow, Cedric Westphal, and Ulas Kozat. “Designing and Embedding Reliable Virtual Infrastructures”. In: Proceedings of the Second ACM SIGCOMM Workshop on Virtualized Infrastructure Systems and Architectures. VISA ’10. New Delhi, India: ACM, 2010, pp. 33–40. isbn: 978-1-4503-0199-2. doi: 10.1145/1851399.1851406. url: http://doi.acm.org/10.1145/1851399.1851406.

[67] M.G. Rabbani et al. “On tackling virtual data center embedding problem”. In: Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on. 2013, pp. 177–184.

[68] A. Amokrane et al. “Greenhead: Virtual Data Center Embedding across Distributed Infrastructures”. In: Cloud Computing, IEEE Transactions on 1.1 (2013), pp. 36–49. issn: 2168-7161. doi: 10.1109/TCC.2013.5.

[69] Yiyu Chen et al. “Managing server energy and operational costs in hosting centers”. In: SIGMETRICS Perform. Eval. Rev. 33.1 (June 2005), pp. 303–314. issn: 0163-5999. doi: 10.1145/1071690.1064253. url: http://doi.acm.org/10.1145/1071690.1064253.

[70] Anshul Gandhi et al. “Optimal power allocation in server farms”. In: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems. SIGMETRICS ’09. Seattle, WA, USA: ACM, 2009, pp. 157–168. isbn: 978-1-60558-511-6. doi: 10.1145/1555349.1555368. url: http://doi.acm.org/10.1145/1555349.1555368.

[71] Tibor Horvath and Kevin Skadron. “Multi-mode energy management for multi-tier server clusters”. In: Proceedings of the 17th international conference on Parallel architectures and compilation techniques. PACT ’08. Toronto, Ontario, Canada: ACM, 2008, pp. 270–279. isbn: 978-1-60558-282-5. doi: 10.1145/1454115.1454153. url: http://doi.acm.org/10.1145/1454115.1454153.

[72] Minghong Lin et al. “Dynamic right-sizing for power-proportional data centers”. In: INFOCOM, 2011 Proceedings IEEE. 2011, pp. 1098–1106. doi: 10.1109/INFCOM.2011.5934885.

[73] Voldemort. Voldemort. http://www.project-voldemort.com.

[74] Giuseppe DeCandia et al. “Dynamo: amazon’s highly available key-value store”. In: SIGOPS Oper. Syst. Rev. 41.6 (Oct. 2007), pp. 205–220. issn: 0163-5980.

[75] Riak. Riak. http://basho.com/riak/.


[76] Robert Escriva, Bernard Wong, and Emin Gün Sirer. “HyperDex: a distributed, searchable key-value store”. In: Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication. SIGCOMM ’12. Helsinki, Finland: ACM, 2012, pp. 25–36. isbn: 978-1-4503-1419-0. doi: 10.1145/2342356.2342360. url: http://doi.acm.org/10.1145/2342356.2342360.

[77] Thorsten Schütt, Florian Schintke, and Alexander Reinefeld. “Scalaris: reliable transactional p2p key/value store”. In: Proceedings of the 7th ACM SIGPLAN workshop on ERLANG. ERLANG ’08. Victoria, BC, Canada: ACM, 2008, pp. 41–48. isbn: 978-1-60558-065-4.

[78] J.D. Garcia et al. “Using Black-Box Modeling Techniques for Modern Disk Drives Service Time Simulation”. In: Simulation Symposium, 2008. ANSS 2008. 41st Annual. 2008, pp. 139–145.

[79] A.S. Lebrecht, N.J. Dingle, and W.J. Knottenbelt. “A Performance Model of Zoned Disk Drives with I/O Request Reordering”. In: Quantitative Evaluation of Systems, 2009. QEST ’09. Sixth International Conference on the. 2009, pp. 97–106. doi: 10.1109/QEST.2009.31.

[80] Field Cady, Yi Zhuang, and Mor Harchol-Balter. “A Stochastic Analysis of Hard Disk Drives”. In: International Journal of Stochastic Analysis 2011 (2011), pp. 1–21. issn: 2090-3332. doi: 10.1155/2011/390548.

[81] J.W. Branch, Yixin Diao, and L. Shwartz. “A framework for predicting service delivery efforts using IT infrastructure-to-incident correlation”. In: Network Operations and Management Symposium (NOMS), 2014 IEEE. 2014, pp. 1–8. doi: 10.1109/NOMS.2014.6838266.

[82] Simona Boboila and Peter Desnoyers. “Performance models of flash-based solid-state drives for real workloads”. In: Proceedings of the 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies. MSST ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 1–6. isbn: 978-1-4577-0427-7. doi: 10.1109/MSST.2011.5937227. url: http://dx.doi.org/10.1109/MSST.2011.5937227.

[83] Peter Desnoyers. “Analytic modeling of SSD write performance”. In: Proceedings of the 5th Annual International Systems and Storage Conference. SYSTOR ’12. Haifa, Israel: ACM, 2012, 12:1–12:10. isbn: 978-1-4503-1448-0. doi: 10.1145/2367589.2367603. url: http://doi.acm.org/10.1145/2367589.2367603.

[84] Shan Li and H.H. Huang. “Black-Box Performance Modeling for Solid-State Drives”. In: Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), 2010 IEEE International Symposium on. 2010, pp. 391–393. doi: 10.1109/MASCOTS.2010.48.

[85] E. Thereska et al. “Informed data distribution selection in a self-predicting storage system”. In: Autonomic Computing, 2006. ICAC ’06. IEEE International Conference on. 2006, pp. 187–198. doi: 10.1109/ICAC.2006.1662398.

[86] Kaiqi Xiong and H. Perros. “Service Performance and Analysis in Cloud Computing”. In: IEEE Congress on Services. 2009, pp. 693–700.

[87] H. Khazaei, J. Misic, and V.B. Misic. “Performance Analysis of Cloud Computing Centers Using M/G/m/m+r Queuing Systems”. In: IEEE Transactions on Parallel and Distributed Systems 23.5 (2012), pp. 936–943.

[88] Ajay Gulati et al. “Pesto: online storage performance management in virtualized datacenters”. In: Proceedings of the 2nd ACM Symposium on Cloud Computing. SOCC ’11. Cascais, Portugal: ACM, 2011, 19:1–19:14. isbn: 978-1-4503-0976-9.

[89] Stephan Kraft et al. “IO performance prediction in consolidated virtualized environments”. In: SIGSOFT Softw. Eng. Notes 36.5 (Sept. 2011), pp. 295–306. issn: 0163-5948.

[90] Siba Mohammad, Eike Schallehn, and Gunter Saake. “A Self-tuning Framework for Cloud Storage Clusters”. English. In: Advances in Databases and Information Systems. Ed. by Tadeusz Morzy, Patrick Valduriez, and Ladjel Bellatreche. Vol. 9282. Lecture Notes in Computer Science. Springer International Publishing, 2015, pp. 351–364. isbn: 978-3-319-23134-1. doi: 10.1007/978-3-319-23135-8_24. url: http://dx.doi.org/10.1007/978-3-319-23135-8_24.

[91] Beth Trushkowsky et al. “The SCADS director: scaling a distributed storage system under stringent performance requirements”. In: Proceedings of the 9th USENIX conference on File and storage technologies. FAST ’11. San Jose, California: USENIX Association, 2011, pp. 12–12. isbn: 978-1-931971-82-9. url: http://dl.acm.org/citation.cfm?id=1960475.1960487.

[92] Ahmad Al-Shishtawy and Vladimir Vlassov. ElastMan: Autonomic Elasticity Manager for Cloud-Based Key-Value Stores. Tech. rep. 12:01. QC 20120831. KTH, Software and Computer Systems, SCS, 2012, p. 14.

[93] Markus Klems et al. “The Yahoo!: cloud datastore load balancer”. In: Proceedings of the fourth international workshop on Cloud data management. CloudDB ’12. Maui, Hawaii, USA: ACM, 2012, pp. 33–40. isbn: 978-1-4503-1708-5. doi: 10.1145/2390021.2390028. url: http://doi.acm.org/10.1145/2390021.2390028.

[94] Federico Piccinini. “Dynamic load balancing based on latency prediction”. MA thesis. KTH, School of Information and Communication Technology (ICT), 2013, p. 121.

[95] Dario Vieira, Cesar Melo, and Yacine Ghamri-Doudane. “Performance Evaluation of an Object Management Policy Approach for P2P Networks”. In: International Journal of Digital Multimedia Broadcasting 2012 (2012). doi: 10.1155/2012/189325. url: http://www.hindawi.com/journals/ijdmb/2012/189325/cta/.

[96] T. Fujimoto et al. “Video-Popularity-Based Caching Scheme for P2P Video-on-Demand Streaming”. In: Advanced Information Networking and Applications (AINA), 2011 IEEE International Conference on. 2011, pp. 748–755. doi: 10.1109/AINA.2011.103.

[97] S.A. Chellouche et al. “Home-Box-assisted content delivery network for Internet Video-on-Demand services”. In: Computers and Communications (ISCC), 2012 IEEE Symposium on. 2012, pp. 000544–000550. doi: 10.1109/ISCC.2012.6249353.

[98] Yipeng Zhou, T.Z.J. Fu, and Dah Ming Chiu. “Division-of-labor between server and P2P for streaming VoD”. In: Quality of Service (IWQoS), 2012 IEEE 20th International Workshop on. 2012, pp. 1–9. doi: 10.1109/IWQoS.2012.6245979.

[99] Andrea Matsunaga and Jose AB Fortes. “On the use of machine learning to predict the time and resources consumed by applications”. In: Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE Computer Society. 2010, pp. 495–504.

[100] Sajib Kundu et al. “Modeling virtualized applications using machine learning techniques”. In: ACM SIGPLAN Notices. Vol. 47. ACM. 2012, pp. 3–14.

[101] Helmut Hlavacs and Thomas Treutner. “Predicting web service levels during VM live migrations”. In: Systems and Virtualization Management (SVM), 2011 5th International DMTF Academic Alliance Workshop on. IEEE. 2011, pp. 1–10.

[102] Zhihong Liu et al. “DREAMS: Dynamic Resource Allocation for MapReduce with Data Skew”. In: Integrated Network Management (IM 2015), 2015 IFIP/IEEE International Symposium on. 2015.

[103] Peter Bodík et al. “Statistical machine learning makes automatic control practical for internet datacenters”. In: Proceedings of the 2009 conference on Hot topics in cloud computing. 2009, pp. 12–12.

[104] Sidath Handurukande et al. “Magneto approach to QoS monitoring”. In: Integrated Network Management (IM), 2011 IFIP/IEEE International Symposium on. IEEE. 2011, pp. 209–216.

[105] Han Hee Song et al. “Q-score: Proactive service quality assessment in a large IPTV system”. In: Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference. ACM. 2011, pp. 195–208.

[106] M. Mirza et al. “A Machine Learning Approach to TCP Throughput Prediction”. In: Networking, IEEE/ACM Transactions on 18.4 (2010), pp. 1026–1039. issn: 1063-6692. doi: 10.1109/TNET.2009.2037812.

[107] Yilei Zhang, Zibin Zheng, and M.R. Lyu. “Exploring Latent Features for Memory-Based QoS Prediction in Cloud Computing”. In: Reliable Distributed Systems (SRDS), 2011 30th IEEE Symposium on. 2011, pp. 1–10. doi: 10.1109/SRDS.2011.10.

[108] Yilei Zhang, Zibin Zheng, and M.R. Lyu. “WSPred: A Time-Aware Personalized QoS Prediction Framework for Web Services”. In: Software Reliability Engineering (ISSRE), 2011 IEEE 22nd International Symposium on. 2011, pp. 210–219. doi: 10.1109/ISSRE.2011.17.

[109] Zibin Zheng et al. “Collaborative Web Service QoS Prediction via Neighborhood Integrated Matrix Factorization”. In: Services Computing, IEEE Transactions on 6.3 (2013), pp. 289–299. issn: 1939-1374. doi: 10.1109/TSC.2011.59.

[110] Zibin Zheng et al. “QoS Ranking Prediction for Cloud Services”. In: Parallel and Distributed Systems, IEEE Transactions on 24.6 (2013), pp. 1213–1222. issn: 1045-9219. doi: 10.1109/TPDS.2012.285.

[111] Ayman Amin, A. Colman, and Lars Grunske. “An Approach to Forecasting QoS Attributes of Web Services Based on ARIMA and GARCH Models”. In: Web Services (ICWS), 2012 IEEE 19th International Conference on. 2012, pp. 74–81. doi: 10.1109/ICWS.2012.37.

[112] Rob Powers, Moises Goldszmidt, and Ira Cohen. “Short Term Performance Forecasting in Enterprise Systems”. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. KDD ’05. Chicago, Illinois, USA: ACM, 2005, pp. 801–807. isbn: 1-59593-135-X. doi: 10.1145/1081870.1081976. url: http://doi.acm.org/10.1145/1081870.1081976.

[113] Chengyuan Yu and Linpeng Huang. “A Web service QoS prediction approach based on time- and location-aware collaborative filtering”. In: Service Oriented Computing and Applications (2014), pp. 1–15. issn: 1863-2394. doi: 10.1007/s11761-014-0168-4. url: http://dx.doi.org/10.1007/s11761-014-0168-4.

[114] Ira Cohen et al. “Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control”. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6. OSDI ’04. San Francisco, CA: USENIX Association, 2004, pp. 16–16. url: http://dl.acm.org/citation.cfm?id=1251254.1251270.

[115] S. Zhang et al. “Ensembles of models for automated diagnosis of system performance problems”. In: Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on. 2005, pp. 644–653. doi: 10.1109/DSN.2005.44.

[116] Philipp Leitner et al. “Runtime Prediction of Service Level Agreement Violations for Composite Services”. In: Proceedings of the 2009 International Conference on Service-oriented Computing. ICSOC/ServiceWave ’09. Stockholm, Sweden: Springer-Verlag, 2009, pp. 176–186. isbn: 3-642-16131-6, 978-3-642-16131-5. url: http://dl.acm.org/citation.cfm?id=1926618.1926639.

[117] Bing Tang and Mingdong Tang. “Bayesian Model-Based Prediction of Service Level Agreement Violations for Cloud Services”. In: Theoretical Aspects of Software Engineering Conference (TASE), 2014. 2014, pp. 170–176. doi: 10.1109/TASE.2014.34.

[118] C. Gupta, A. Mehta, and U. Dayal. “PQR: Predicting Query Execution Times for Autonomous Workload Management”. In: Autonomic Computing, 2008. ICAC ’08. International Conference on. 2008, pp. 13–22. doi: 10.1109/ICAC.2008.12.

[119] Jia Rao and Cheng-Zhong Xu. “CoSL: A coordinated statistical learning approach to measuring the capacity of multi-tier websites”. In: Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. 2008, pp. 1–12. doi: 10.1109/IPDPS.2008.4536232.

[120] Liangzhao Zeng et al. “Service-Oriented Computing – ICSOC 2008: 6th International Conference, Sydney, Australia, December 1-5, 2008. Proceedings”. In: ed. by Athman Bouguettaya, Ingolf Krueger, and Tiziana Margaria. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. Chap. Event-Driven Quality of Service Prediction, pp. 147–161. isbn: 978-3-540-89652-4. doi: 10.1007/978-3-540-89652-4_14. url: http://dx.doi.org/10.1007/978-3-540-89652-4_14.

[121] M. Godse, U. Bellur, and R. Sonar. “Automating QoS Based Service Selection”. In: Web Services (ICWS), 2010 IEEE International Conference on. 2010, pp. 534–541. doi: 10.1109/ICWS.2010.58.

[122] Matei Zaharia et al. “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI ’12. San Jose, CA: USENIX Association, 2012, pp. 2–2. url: http://dl.acm.org/citation.cfm?id=2228298.2228301.

[123] Ron Bekkerman, Mikhail Bilenko, and John Langford. “Scaling Up Machine Learning: Parallel and Distributed Approaches”. In: Proceedings of the 17th ACM SIGKDD International Conference Tutorials. KDD ’11 Tutorials. San Diego, California: ACM, 2011, 4:1–4:1. isbn: 978-1-4503-1201-1. doi: 10.1145/2107736.2107740. url: http://doi.acm.org/10.1145/2107736.2107740.

[124] Martin Zinkevich, John Langford, and Alex J Smola. “Slow learners are fast”. In:Advances in Neural Information Processing Systems. 2009, pp. 2331–2339.

[125] S.E. Yuksel, J.N. Wilson, and P.D. Gader. “Twenty Years of Mixture of Experts”. In: Neural Networks and Learning Systems, IEEE Transactions on 23.8 (2012), pp. 1177–1193. issn: 2162-237X. doi: 10.1109/TNNLS.2012.2200299.
