
cloud computing
IEEE

The Essentials Issue
May/June 2012

eScience  2
Identifying Risk  14
Google App Engine  8

PROTOTYPE
A digital magazine in support of the IEEE Cloud Computing Initiative


cloud computing
IEEE

Cloud Computing Initiative Steering Committee
Steve Diamond, chair
Nim Cheung
Kathy Grise
Michael Lightner
Mary Lynne Neilsen
Sorel Reisman
Jon Rokne

IEEE Computer Society Staff
Angela Burgess, Executive Director
Evan Butterfield, Director, Products & Services
Lars Jentsch, Manager, Editorial Services
Steve Woods, Manager, New Media & Production
Kathy Clark-Fisher, Manager, Products & Acquisitions
Monette Velasco and Jennie Zhu, Design & Production

May/June 2012

The Essentials Issue
Innovations in virtualization and distributed computing, as well as a poor economy and increased access to high-speed Internet, have accelerated business interest in cloud computing. We explore the risks, security and privacy concerns, and the latest in working with the Google App Engine.

1 Guest Editor's Introduction
Jon Rokne

2 Science in the Cloud: Accelerating Discovery in the 21st Century
Joseph L. Hellerstein, Kai J. Kohlhoff, and David E. Konerding

6 Cloud Computing: A Records and Information Management Perspective
Kirsten Ferguson-Boucher

8 Evaluating High-Performance Computing on Google App Engine
Radu Prodan, Michael Sperk, and Simon Ostermann

14 Understanding Cloud Computing Vulnerabilities
Bernd Grobauer, Tobias Walloschek, and Elmar Stöcker



Guest Editor's Introduction

An Essential Initiative
Jon Rokne, University of Calgary

Cloud computing is transforming information technology. As information and processes migrate to the cloud, it is transforming not only where computing is done but, fundamentally, how it is done. As more of the corporate and academic worlds invest in this technology, it will also drastically change IT professionals' working environment.

Cloud computing solves many problems of conventional computing, including handling peak loads, installing software updates, and using excess computing cycles. However, the new technology has also created new challenges, such as data security, data ownership, and transborder data storage.

The IEEE has realized that cloud computing is poised to be the dominant form of computing in the future and that it will be necessary to develop standards, conferences, publications, educational material, and general awareness information for cloud computing. Because of this, the New Initiative Committee of the IEEE has funded an IEEE Cloud Computing Initiative (CCI). CCI coordinates cloud-related activities for IEEE and has tracks for all of the identified aspects of cloud computing.

The CCI Publications Track is tasked with developing a slate of cloud computing-related publications. The CCI provides seed funding for the publications developed by the CCI Publications Track, and it has already developed a mature proposal for an IEEE Transactions on Cloud Computing, sponsored by five IEEE societies. This transactions is slated to commence publishing in-depth research papers in cloud computing in 2013, following approval by the Board of IEEE at the June 2012 meeting.

The second publishing initiative is to develop a cloud computing magazine. In preparation, the IEEE Computer Society publications team has created this supplement on behalf of the Cloud Computing Publishing Track. This supplement contains previously published articles that have recently appeared in several magazines. The aim of IEEE Cloud Computing magazine is to provide a focused home for cloud-related articles. The magazine will be technically cosponsored by several IEEE societies.

The CCI Publications Track would like to have broad representation from IEEE societies with interests in cloud computing. If anyone wishes to participate in the ongoing discussion of the publications initiatives, please contact Jon Rokne at [email protected].

Jon Rokne is the CCI Publications Track chair. He is a professor and former head of the Computer Science department at the University of Calgary and the past vice president of publications for IEEE.



Science in the Cloud: Accelerating Discovery in the 21st Century
Joseph L. Hellerstein, Kai J. Kohlhoff, and David E. Konerding, Google

Scientific discovery is transitioning from a focus on data collection to an emphasis on analysis and prediction using large-scale computation. With appropriate software support, scientists can do these computations with unused cycles in commercial clouds. Moving science into the cloud will promote data sharing and collaborations that will accelerate scientific discovery.

Recent trends in science have made computational capabilities an essential part of scientific discovery. This combination of science and computing is often referred to as enhanced scientific discovery, or eScience. The collection of essays in The Fourth Paradigm describes how science has evolved from being experimentally driven to being collaborative and analysis-focused.1

eScience has been integral to high-energy physics for several decades due to the volume and complexity of data such experiments produce. In the 1990s, the computational demands of sequencing the human genome made eScience central to biology. More recently, eScience has become essential for neuroscientists in modeling brain circuits and for astronomers in simulating cosmological phenomena.

Biology provides an excellent example of how eScience contributes to scientific discovery. Much modern biological research is about relating DNA sequences (genotypes) to observable characteristics (phenotypes) — as when researchers look for variations in DNA that promote cancer. The human genome has approximately 3 billion pairs of nucleotides, the elements that encode information in DNA. These base pairs encode common human characteristics, benign individual variations, and potential disease-causing variants. It turns out that individual variation is usually much more common than are disease-causing variants. So, understanding how the genome contributes to disease is much more complicated than looking at the difference between genomes. Instead, this analysis often requires detailed models of DNA-mediated chemical pathways to identify disease processes. The human genome's size and the complexity of modeling disease processes typically require large-scale computations and massive storage capacity.2

A common pattern in eScience is to explore many possibilities in parallel. Computational biologists can align millions of DNA "reads" (produced by a DNA sequencer) to a reference genome by aligning each one in parallel. Neuroscientists can evaluate a large number of parameters in parallel to find good models of brain activity. And astronomers can analyze different regions of the sky in parallel to search for supernovae.

That a high degree of parallelism can advance science has been a starting point for many efforts. For example, Folding@Home3 is a distributed computing project that enables scientists to understand the biochemical basis of several diseases. At Google, the Exacycle project provides massive parallelism for doing science in the cloud.

Harvesting Cycles for Science
Often, scientific discovery is enhanced by employing large-scale computation to assess if a theory is consistent with experimental results. Frequently, these computations (or jobs) are structured as a large number of independently executing tasks. This job structure is called embarrassingly parallel.

The Exacycle project aims to find unused resources in the Google cloud to run embarrassingly parallel jobs at a very large scale. We do this by creating a system that's both a simplification and a generalization of MapReduce. Exacycle simplifies MapReduce in that all Exacycle tasks are essentially mappers. This simplification enables more efficient resource management.
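To make the mapper-only job structure concrete, here is a small, generic Java sketch of an embarrassingly parallel job. It uses a plain thread pool rather than Exacycle's machinery (which isn't public), and the per-task computation is a placeholder.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each task is fully independent (no communication with other tasks), so the
// job is "embarrassingly parallel" and every task behaves like a mapper.
public class MapperOnlyJob {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    List<Future<Long>> results = new ArrayList<>();
    for (int task = 0; task < 1000; task++) {
      final int input = task;
      results.add(pool.submit(() -> process(input)));  // submit one independent task
    }
    long total = 0;
    for (Future<Long> result : results) {
      total += result.get();  // aggregation happens only after all mappers finish
    }
    pool.shutdown();
    System.out.println("aggregate = " + total);
  }

  // Placeholder for the per-task computation (for example, one simulation segment).
  static long process(int input) {
    return (long) input * input;
  }
}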


Exacycle generalizes MapReduce by providing automation that monitors resources across the Google cloud and assigns tasks to compute clusters based on resource availability and job requirements. This provides massive scaling for embarrassingly parallel jobs.

Google is very efficient at using computing resources, but resource utilization still varies depending on time of day, day of the week, and season. For example, Web users most frequently use search engines during the day, and search providers typically direct traffic to data centers close to users to reduce latency. This leads to low data center utilization during the data center's local nighttime.

Still, low resource utilization doesn't necessarily enable more tasks to run in the cloud. Many tasks require considerable memory, or moderate amounts of memory and CPU in combination. Such tasks can run in the cloud only if at least one machine satisfies all the task's resource requirements. One way to quantify whether tasks can run is to determine if suitably sized "slots" are available. For example, recent measurements of the Google cloud indicate that it has 13 times more slots for tasks requiring only one core and 4 Gbytes of RAM than there are slots for tasks requiring four cores and 32 Gbytes of RAM. In general, finding slots for tasks that require fewer resources is much easier. For this reason, an Exacycle task typically consumes about one core and 1 Gbyte of memory for no more than an hour.

For cloud computing to be efficient, it must adapt quickly to changes in resource demands. In particular, higher-priority work can preempt Exacycle tasks, which must then be rerun. However, Exacycle throughput is excellent because in practice preemption is rare. This is due in part to the manner in which Exacycle locates resources that run tasks.

As Figure 1 shows, Exacycle is structured into multiple layers. At the top is the Daimyo global scheduler, which assigns tasks to clusters. The second layer is the Honcho cluster scheduler, which assigns tasks to machines. The Peasant machine manager in the third layer encapsulates tasks, and the bottom layer caches task results. The Honchos and Peasants cache information but are otherwise stateless. This simplifies failure handling.

Figure 1. Exacycle system architecture. Daimyo assigns tasks to clusters, Honcho assigns tasks to machines, Peasant encapsulates tasks, and the bottom layer caches task results.

Exacycle implements the same communication interfaces between adjacent layers. Communication from an upper to a lower layer requires the upper layer to cut data into pieces that it then passes on to the lower layer. Typically, this communication provides data to tasks within the same job. Communication from a lower layer to an upper layer involves bundling data to produce aggregations. These interlayer interfaces are scalable and robust, with minimal requirements for managing distributed state.

The primary mechanism Exacycle uses to scale is eliminating nearly all intercluster networking and machine-level disk I/O. An Exacycle task typically can't move more than 5 Gbytes of data into or out of the machine on which the task executes. Exacycle reduces network usage by managing data movement on tasks' behalf. Typically, the thousands to millions of tasks in an Exacycle job share some of their input files. Exacycle uses this knowledge of shared input files to coschedule tasks in the same cluster. This strategy improves throughput by exploiting the high network bandwidths between machines within the same cluster. Furthermore, Exacycle uses caching so that remote data are copied into a cluster only once.

When Exacycle assigns a task to a machine, a timeout and retry hierarchy handles failures. This combination of timeouts and retries addresses most systemic errors. Because tasks have unique identifiers, the Exacycle retry logic assumes that two tasks with the same identifier compute the same results.

For the most part, Exacycle doesn't employ durable cluster- or machine-level storage owing to its engineering costs and performance penalties. Instead, Exacycle optimistically keeps nearly all state in RAM. Robustness comes from having a single authoritative store and spreading state across many machines. If a machine fails, Exacycle moves tasks from the failed machine to another machine. If a machine running a Honcho cluster-level scheduler fails, the Honcho is restarted on another machine and uses discovery services to recover cached state.

The Exacycle project began two years ago. The system has been running eScience applications in production for roughly a year and has had continuous, intensive use over the past six months. Recently, Google donated 1 billion core hours to scientific discovery through the Exacycle Visiting Faculty Grant Program (http://research.google.com/university/exacycle_program.html).


To support this, Exacycle consumes approximately 2.7 million CPU hours per day, and often much more. As of early February, visiting faculty had completed 58 million tasks.

Visiting faculty are addressing various scientific problems that can benefit from large-scale computation:

■ The enzyme science project seeks to discover how bacteria develop resistance to antibiotics, a growing problem for public health.
■ The molecular docking project seeks to advance drug discovery by using massive computation to identify small molecules that bind to one or more of the huge set of proteins that catalyze reactions in cells. The potential here is to greatly accelerate the design of drugs that interfere with disease pathways.
■ The computational astronomy project plays an integral role in the design of the 3,200-megapixel Large Synoptic Survey Telescope. As one example, the project is doing large-scale simulations to determine how to correct for atmospheric distortions of light.
■ The molecular modeling project is expanding the understanding of computational methods for simulating macromolecular processes. The first application is to determine how molecules enter and leave the cell nucleus through a channel known as the nuclear pore complex.

Figure 2. Trajectory durations for G protein-coupled receptors (GPCRs). These receptors are critical to many drugs’ effectiveness because of their role in communicating signals across cell membranes. The upper x-axis shows the trajectory duration, whereas the lower x-axis shows the core hours required to compute trajectory durations. Computing one millisecond of trajectory data requires millions of core days on a modern desktop computer. Exacycle can do these computations in a few days.

Figure 2 annotations: salt bridge formation, ligand binding, and major conformational change along the nanosecond-to-millisecond time axis; GPCR structure labels include the extracellular side, intracellular side, ligand binding site, G protein binding site, and cell membrane.


Google has selected these projects based on their potential to produce scientific results of major importance. One measure of impact will be publishing in top journals such as Science and Nature.

To better illustrate the potential for science in the cloud, we next look at one problem in detail.

Simulating Molecular Dynamics
Exacycle has undertaken a project that relates to a class of molecules called G protein-coupled receptors. GPCRs are critical to many drug therapies. Indeed, about a third of pharmaceuticals target GPCRs. Despite this, scientists still don't fully understand the molecular basis of GPCR action.

A bit of science is needed to appreciate the computational problem that Exacycle is addressing. GPCRs are critical to transmembrane signaling, an important part of many disease pathways. Scientists know that GPCRs embed in cell membranes to provide communication between extracellular signals and intracellular processes. This communication occurs when certain molecules bind to sites on GPCRs that are accessible from outside the membrane. However, scientists don't fully understand the sequence of changes that then lead to communication across the cell membrane.

To gain a better understanding of GPCR activity, Exacycle is doing large-scale simulations of GPCR molecular dynamics. This is a challenging undertaking because of the detail required to obtain scientific insight. In particular, biomolecules at body temperature undergo continuous fluctuations in the locations of atoms and the 3D shape of molecules (referred to as their conformation). Many changes occur at a time scale of femtoseconds to nanoseconds (10⁻¹⁵ to 10⁻⁹ seconds). However, most chemical processes of interest occur at a time scale of microseconds to milliseconds. The term trajectory refers to a sequence of motions of a set of atoms under study over time. Figure 2 depicts the insights possible with trajectories of different durations. Understanding GPCR actions requires simulations that generate data over milliseconds.



Exacycle simulates the trajectories of approximately 58,000 atoms—the number of atoms in a typical GPCR system, including the cell membrane and water molecules. It does so at femtosecond precision over trillions of time steps by computing trajectories using embarrassingly parallel jobs.

The GPCR data analysis pipeline uses trajectories in two ways. The first is to construct models of GPCR behavior. For example, researchers can use trajectories to create a Markov model with states in which protein structures are described according to their 3D structure and kinetic energy. Second, researchers analyze trajectories for changes that are important for activating signaling across the cell membrane.

It takes approximately one core day to simulate half a nanosecond of a single trajectory on a modern desktop. So, obtaining scientific insight requires millions of core days to generate a millisecond of trajectory data. Clearly, massive computational resources are required.

Exacycle provides these resources to compute trajectories in parallel. However, some thought is required to use Exacycle effectively. For GPCR trajectories, the challenge is that it takes millions of core hours to compute an interesting trajectory, but an Exacycle task typically executes for no more than one core hour. So, Exacycle constructs trajectories by executing a series of tasks. This requires passing partially computed trajectories from one task to the next in a way that maintains high throughput.

The approach for computing trajectories has several parts. A driver script generates tens of thousands of tasks and submits them to Exacycle. The script also monitors task states and registers events such as task completions or failures. To maintain high throughput, this script then propagates partial trajectories that tasks compute to other tasks. Exacycle provides mechanisms for monitoring task executions and supporting the investigation and resolution of task and system failures.
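Exacycle's driver interfaces aren't published, so the following is only a schematic Java sketch of the chaining pattern described above; the Scheduler type and its submit/poll methods are invented for illustration. Each trajectory is built from many short tasks, every finished segment seeds the next task, and failed tasks are resubmitted with their original input.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrajectoryDriver {
  // Hypothetical client-side view of the scheduler; not an Exacycle API.
  interface Scheduler {
    String submit(byte[] partialTrajectory);   // returns a task id
    Map<String, byte[]> pollCompleted();       // finished task id -> output segment
    List<String> pollFailed();                 // task ids that must be resubmitted
  }

  public static void drive(Scheduler scheduler, List<byte[]> seeds, int segmentsPerTrajectory)
      throws InterruptedException {
    Map<String, Integer> segmentsLeft = new HashMap<>();  // task id -> segments still to run
    Map<String, byte[]> inputOf = new HashMap<>();        // task id -> input, kept for retries

    for (byte[] seed : seeds) {                           // start each trajectory with one short task
      String id = scheduler.submit(seed);
      segmentsLeft.put(id, segmentsPerTrajectory - 1);
      inputOf.put(id, seed);
    }

    while (!segmentsLeft.isEmpty()) {
      for (Map.Entry<String, byte[]> done : scheduler.pollCompleted().entrySet()) {
        Integer left = segmentsLeft.remove(done.getKey());
        inputOf.remove(done.getKey());
        if (left != null && left > 0) {                   // chain: feed the partial trajectory onward
          String next = scheduler.submit(done.getValue());
          segmentsLeft.put(next, left - 1);
          inputOf.put(next, done.getValue());
        }
      }
      for (String failed : scheduler.pollFailed()) {      // resubmit a failed segment with the same input
        Integer left = segmentsLeft.remove(failed);
        byte[] input = inputOf.remove(failed);
        if (left != null && input != null) {
          String retry = scheduler.submit(input);
          segmentsLeft.put(retry, left);
          inputOf.put(retry, input);
        }
      }
      Thread.sleep(1_000);                                // poll periodically rather than spinning
    }
  }
}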

Thus far, Exacycle has computed more than 150,000 trajectories with durations totaling more than 4 milliseconds. At peak, Exacycle simulates approximately 80 microseconds of trajectory data per day. This corresponds to roughly 600 teraflops.

Exacycle has produced hundreds of terabytes of trajectory data. Analyzing these data presents a huge challenge. One approach is to use MapReduce to calculate summary statistics of trajectories and then place the results into a relational database of trajectory tables. Scientists have obtained considerable insight by doing SQL queries against the trajectory tables. However, this requires the database to have the scalability of technologies such as Dremel,4 which provides interactive response times for ad hoc queries on tables with billions of rows.

Amazon, Microsoft, Google, and others offer capabilities for running science applications in the cloud. The appeal of these services is that scientists don't need to buy expensive, dedicated clusters. Instead, they pay a modest rent for on-demand access to large quantities of cloud computing resources. Although doing science in the cloud has appeal, it could have hidden costs. For example, scientists might have to recode applications to exploit cloud functionality or add new code if some features aren't present in the cloud.

Science in the cloud offers much more than a compute infrastructure. A recent trend is that scientific contributions require that researchers make large datasets publicly available. Some examples are the Allen Institute's Brain Atlas and the US National Center for Biotechnology Information (NCBI) genome database. Both are repositories that researchers widely use to do computation-intensive analysis of data that others have collected. Hosting these datasets in public clouds is much easier than requiring individual scientists (or even universities) to build their own data-hosting systems.

Much more is on the way in this arena. Using the cloud for computation and data storage will facilitate scientists sharing both data and computational tools. Indeed, substantial efforts are already under way, such as Sage Bionetworks' idea of a "data commons" (http://sagebase.org/research/Synapse1.php). Sharing data and code will let scientists more rapidly build on their peers' results. Longer term, the big appeal of science in the cloud is promoting planetary-scale collaborations that power scientific discovery in the 21st century.

References
1. The Fourth Paradigm: Data-Intensive Scientific Discovery, T. Hey, S. Tansley, and K. Tolle, eds., Microsoft Research, 2009.
2. M. Schatz, B. Langmead, and S. Salzberg, "Cloud Computing and the DNA Data Race," Nature Biotechnology, vol. 28, no. 7, 2010, pp. 691–693.
3. M. Shirts and V. Pande, "Screen Savers of the World Unite!" Science, vol. 290, no. 5498, 2000, p. 1903.
4. S. Melnik et al., "Dremel: Interactive Analysis of Web-Scale Datasets," Proc. Conf. Very Large Databases, VLDB Endowment, vol. 3, 2010, pp. 330–339.

Joseph L. Hellerstein is at Google, where he manages the Big Science Project, which addresses cloud computing for scientific discovery. He has a PhD in computer science from the University of California, Los Angeles. Hellerstein is a fellow of IEEE. Contact him at [email protected].

Kai J. Kohlhoff is a research scientist at Google, where he works on cloud computing and eScience. He has a PhD in structural bioinformatics from the University of Cambridge, UK. Contact him at [email protected].

David E. Konerding is a software engineer at Google, where he works on cloud infrastructure and scientific computing. He has a PhD in biophysics from the University of California, Berkeley. Contact him at [email protected].

This article will appear in IEEE Internet Computing, July/August 2012.


Cloud Computing: A Records and Information Management Perspective
Kirsten Ferguson-Boucher, Aberystwyth University, Wales

Ultimately, making decisions about which cloud service/deployment model to select, and what to take into consideration when making that initial decision, requires the organization to consider whether loss of control will significantly affect the security of mission-critical information.

For many records and information management (RIM) professionals, cloud computing resembles a traditional hosting service: information storage or applications are outsourced to a third-party provider and accessed by the organization through a network connection. However, the information, applications, and processing power in a cloud infrastructure are distributed across many servers and stored along with other customers' information, separated only by logical isolation mechanisms. This presents both new RIM challenges and benefits.

RIM professionals are specifically concerned with information as a core business asset. Records are a subset of organizational information that is often required to provide evidence of organizational activities and transactions. They require protection in the same way as every other asset. Decision-making processes take into consideration the wider context of organizational strategy and form part of a complex structure of assessments regarding information value, alignment, and assurance. All of these operate within an overarching performance and risk framework.

Cloud Computing: A Brief Introduction
Cloud computing is the ability to access a pool of computing resources owned and maintained by a third party via the Internet. It isn't a new technology but a new way of delivering computing resources based on long-existing technologies such as server virtualization. The "cloud" is composed of hardware, storage, networks, interfaces, and services that provide the means through which users access the infrastructure, computing power, applications, and services on demand and independent of location. Cloud computing usually involves the transfer, storage, and processing of information on the provider's infrastructure, which is outside the customer's control.

The National Institute of Standards and Technology (NIST) defines it as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction" (http://csrc.nist.gov/publications/PubsSPs.html#800-145/SP800-145.pdf). As Figure 1 shows, the NIST-defined model highlights five essential characteristics that reflect a service's flexibility and the control that users have over it. NIST also distinguishes among three delivery models (software as a service [SaaS], platform as a service [PaaS], and infrastructure as a service [IaaS]) and four deployment models (public, private, hybrid, and community clouds).

Delivery Models
As a general rule, the customer doesn't control the underlying cloud infrastructure in any delivery model. SaaS is software offered by a third-party provider, usually on demand via the Internet and configurable remotely. PaaS also allows customers to develop new applications using APIs deployed and configurable remotely; in this case, the customer has control over the deployed applications and, possibly, configuration settings for the hosting environment. In the IaaS provision, virtual machines and other abstracted hardware and operating systems are made available. The customer, therefore, has control over operating systems, storage, and deployed applications.

Deployment Models
There are, essentially, three deployment models: private, community, and public, with a fourth "combined" option. Private clouds are operated solely for an organization; community clouds are shared by several organizations and are designed to support a specific community. In public clouds, the infrastructure is made publicly available but is owned by an organization selling cloud services.



Resources are offsite and shared among all customers in a multitenancy model. Hybrid clouds are a composition of two or more clouds that remain unique entities but are bound together by standardized or proprietary technology to enable data and application portability.

Is the Cloud Right for You?
Making the decision to move to the cloud is a complex one and depends very much on your organizational context. Let's examine some generic benefits and challenges. An obvious benefit is a reduction in capital expenditure—heavy investment in new hardware and software is no longer required for often underutilized functions, such as storage and processing. Organizations can tap into readily available computing resources on demand, with large datacenters often using virtualization technologies (the abstraction of computing resources from the underlying hardware) to enable scalable and flexible service provision. Applications, storage, servers, and networks are allocated flexibly in a multitenancy environment to achieve maximum computing and storage capacities.

From a provider perspective, utilization of shared resources results in higher efficiency and the ability to offer cloud computing services at low costs; customers likewise benefit from cheaper access. But make no mistake—the cloud still involves costs to an organization as it tries to integrate new services with existing legacy processes. Table 1 summarizes these and some of the other general pros and cons of cloud provision.

There are, however, very specific considerations that relate to the ability of the organization to manage its information and ensure that records are available for current and future organizational use. In particular, the cloud offers specific benefits for RIM: improved business processes, facilitation of location-independent collaboration, and access to resources and information at any time. However, some aspects of cloud computing can have a negative impact on RIM as well:

■ compliance and e-discovery;
■ integrity and confidentiality;
■ service availability and reliability;
■ service portability and interoperability;
■ information retrieval and destruction; and
■ loss of governance, integration, and management.

The sidebar "Ten Questions to Ask When Outsourcing to the Cloud" offers some guidance about what service might be best for a particular organization's context.

Managing Information Assets in the Cloud
Organizations are still responsible for their information even if it's stored elsewhere (in this case, in the cloud). ISO 15489 (the international standard for records management) defines records as being authentic, reliable, and usable and possessing integrity. How does the move to the cloud affect these characteristics? Information governance and assurance require policies and procedures for maintaining the above and will need amending to incorporate the changing environment. There must be a clear understanding of who's responsible for what and how policies and procedures will be implemented. Issues such as metadata application, encryption strategies, and shorter-term preservation requirements, as well as permanent retention or destruction strategies, must also be considered. Particular reference to data protection, privacy legislation, freedom of information, and environmental regulations requires organizations to know where their information is stored (in what jurisdictions) and how it can be accessed within given time frames. Will the move to the cloud restrict the organization's ability to comply?

Litigation also requires consideration: being able to identify relevant information, retrieve it, and supply it to courts in a timely manner can be difficult if the organization hasn't thought about how this would be achieved prior to an incident. Contracts need to be negotiated with these considerations in mind, with clauses built in about data destruction or how information can be returned to the organization, as well as how the provider manages it.

Operating in the Cloud
Use of information in the cloud typically precludes the use of encryption because it would adversely affect data processing, indexing, and searching. If the service uses encryption, the customer would need to know if this happens automatically and how the encryption keys are created, held, and used across single and multiple sites to be able to confirm that information is authentic. Business continuity can be affected by system failure, so information about continuity, monitoring, priority, and recovery procedures would give organizations a better picture of the risk of system failure to their activities.

Ultimately, making decisions about which cloud service/deployment model to select, and what sort of things to take into consideration when making that initial decision, requires the organization to consider whether loss of control will significantly affect the security of mission-critical information. In particular, identifying risk and assessing the organization's risk appetite is a critical factor in making decisions about moving to the cloud. The business must be clear about the type of information it's willing to store, how sensitive that information is, and whether its loss or compromise would affect the compliance environment in which the organization operates.

Acknowledgments
More information about the research undertaken by Aberystwyth University in conjunction with the Archives and Records Association of UK and Ireland, which underpins this article, can be found at www.archives.org.uk/ara-in-action/best-practice-guidelines.html.

Kirsten Ferguson-Boucher lectures in records management; information governance; law, compliance, and ethics; and information assurance at Aberystwyth University, Wales. Her research interests include the convergence between related disciplines and how organizations in all sectors can reach acceptable levels of information governance and assurance across the spectrum of technologies. Contact her at [email protected].

This article originally appeared in IEEE Security & Privacy, November/December 2011; http://doi.ieeecomputersociety.org/10.1109/MSP.2011.159.


Evaluating High-Performance Computing on Google App Engine
Radu Prodan, Michael Sperk, and Simon Ostermann, University of Innsbruck

Google App Engine offers relatively low resource-provisioning overhead and an inexpensive pricing model for jobs shorter than one hour.

Funding agencies and institutions must purchase and provision expensive parallel computing hardware to support high-performance computing (HPC) simulations. In many cases, the physical hosting costs, as well as the operation, maintenance, and depreciation costs, exceed the acquisition price, making the overall investment nontransparent and unprofitable.

Through a new business model of renting resources only in exact amounts for precise durations, cloud computing promises to be a cheaper alternative to parallel computers and more reliable than grids. Nevertheless, it remains dominated by commercial and industrial applications; its suitability for parallel computing remains largely unexplored.

Until now, research on scientific cloud computing has concentrated almost exclusively on infrastructure as a service (IaaS)—infrastructures on which you can easily deploy legacy applications and benchmarks encapsulated in virtual machines. We present an approach to evaluating a cloud platform for HPC that's based on platform as a service (PaaS): Google App Engine (GAE).1 On top of GAE, we built a simple parallel computing framework that supports development of computationally intensive HPC algorithms and applications. The underlying Google infrastructure transparently schedules and executes the applications and produces detailed profiling information for performance and cost analysis. GAE supports development of scalable Web applications for smaller companies—those that can't afford to overprovision a large infrastructure that can handle large traffic peaks at all times.

Google App Engine
GAE hosts Web applications on Google's large-scale server infrastructure. It has three main components: scalable services, a runtime environment, and a data store.

GAE's front-end service handles HTTP requests and maps them to the appropriate application servers. Application servers start, initialize, and reuse application instances for incoming requests. During traffic peaks, GAE automatically allocates additional resources to start new instances. The number of new instances for an application and the distribution of requests depend on traffic and resource use patterns. So, GAE performs load balancing and cache management automatically.

Each application instance executes in a sandbox (a runtime environment abstracted from the underlying operating system). This prevents applications from performing malicious operations and enables GAE to optimize CPU and memory utilization for multiple applications on the same physical machine. Sandboxing also imposes various programmer restrictions:

■ Applications have no access to the underlying hardware and only limited access to network facilities.
■ Java applications can use only a subset of the standard library functionality.
■ Applications can't use threads.
■ A request has a maximum of 30 seconds to respond to the client.

GAE applications use resources such as CPU time, I/O bandwidth, and the number of requests within certain quotas associated with each resource type. The CPU time is, in fuzzy terms, equivalent to the number of CPU cycles that a 1.2-GHz Intel x86 processor can perform in the same amount of time. Information on resource usage can be obtained through the GAE application administration Web interface.

Finally, the data store lets developers persist data beyond individual requests. The data store can't be shared across different slave applications.

A Parallel Computing Framework
To support the development of parallel applications with GAE, we designed a Java-based generic framework (see Figure 1).



Implementing a new application in our framework requires specialization of three abstract interfaces (classes): JobFactory, WorkJob, and Result (see Figure 2).

The master application is a Java program that implements JobFactory on the user's local machine. JobFactory manages the algorithm's logic and parallelization into several WorkJobs. WorkJob is an abstract class implemented as part of each slave application—in particular, the run() method, which executes the actual computational job. Each slave application deploys as a separate GAE application and, therefore, has a distinct URI. The slave applications provide a simple HTTP interface and accept either data requests or computational job requests.

Requests
The HTTP message header stores the type of request.

A job request contains one WorkJob that's submitted to a slave application and executed. If multiple requests are submitted to the same slave application, GAE automatically starts and manages multiple instances to handle the current load; the programmer doesn't have control over the instances. (One slave application is, in theory, sufficient; however, our framework can distribute jobs among multiple slave applications to solve larger problems.)

A data request transfers data shared by all jobs to the persistent data store (indicated by the useSharedData method). It uses multiple parallel HTTP requests to respect GAE's maximum HTTP payload size of 1 Mbyte and to improve bandwidth utilization. The fetchSharedData method retrieves shared data from the data store as needed.
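The chunking itself isn't shown in the article; a minimal sketch of the idea (splitting shared data into pieces that each fit under the 1-Mbyte payload limit so they can be sent as separate, parallel data requests) could look like this:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PayloadChunker {
  private static final int MAX_PAYLOAD = 1_000_000;  // stay under the 1-Mbyte request limit

  // Split shared data into chunks that each fit into one HTTP data request.
  public static List<byte[]> chunk(byte[] sharedData) {
    List<byte[]> chunks = new ArrayList<>();
    for (int offset = 0; offset < sharedData.length; offset += MAX_PAYLOAD) {
      int end = Math.min(offset + MAX_PAYLOAD, sharedData.length);
      chunks.add(Arrays.copyOfRange(sharedData, offset, end));
    }
    return chunks;  // each chunk can be sent as a separate, parallel data request
  }
}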

In a clear request, the slave application deletes the entire data store contents. Clear requests typically occur after a run, whether it’s successful or failed.

A ping request returns instantly and determines whether a slave application is still online. If the slave is offline, the master reclaims the job and submits it to another instance.

WorkJob Execution
Mapping WorkJobs to resources follows a dynamic work pool approach that's suitable for slaves running as black boxes on sandboxed resources with unpredictable execution times. Each slave application has an associated job manager in the context of the master application. It requests WorkJobs from the global pool, submits them to its slave instances for computation (GAE automatically decides which instance is used), and sends back partial results.

We associate a queue size with every slave to indicate the number of parallel jobs it can simultaneously handle. The size should correspond to the number of processing cores available underneath. Finding the optimal size at a certain time is difficult for two reasons. First, GAE doesn't publish its hardware information; second, an application might share the hardware with other competing applications. So, we approximate the queue size at a certain time by conducting a warm-up training phase before each experiment.

Figure 1. Our parallel computing framework architecture. The boxes labeled I denote multiple slave instances. The master application is responsible for generating and distributing the work among parallel slaves implemented as GAE Web applications and responsible for the actual computation.


Figure 2. The Java code for our parallel computing framework interface. JobFactory instantiates the master application, WorkJob instantiates the slave, and Result represents the final outcome of a slave computation.

public interface JobFactory {
  public WorkJob getWorkJob();
  public int remainingJobs();
  public void submitResult(Result result);
  public Result getEndResult();
  public boolean useSharedData();
  public Serializable getSharedData();
}

public abstract class WorkJob implements Serializable {
  private int id;
  public int getId();
  public void setId(int id);
  public Result run();
  public void fetchSharedData();
}

public abstract class Result implements Serializable {
  private int id;
  private long cpuTime;
  public long getCPUTime();
  public void setCPUTime(long cpuTime);
  public int getId();
  public void setId(int id);
}
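As a usage illustration only (the class names and the toy computation below are ours, and the Figure 2 signatures are shown in outline form), a slave-side WorkJob subclass and its matching Result might be written like this:

// Schematic WorkJob subclass: sums the squares of the integers in [from, to).
public class SquareSumJob extends WorkJob {
  private final long from, to;          // the slice of work assigned to this job

  public SquareSumJob(long from, long to) {
    this.from = from;
    this.to = to;
  }

  public Result run() {                 // executed inside a GAE slave instance
    long start = System.currentTimeMillis();
    long sum = 0;
    for (long i = from; i < to; i++) {
      sum += i * i;
    }
    SumResult result = new SumResult(sum);
    result.setId(getId());              // the Result carries the same id as its WorkJob
    result.setCPUTime(System.currentTimeMillis() - start);
    return result;
  }
}

// Matching Result subclass carrying the partial value back to the master.
class SumResult extends Result {
  private final long value;

  public SumResult(long value) {
    this.value = value;
  }

  public long getValue() {
    return value;
  }
}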



The slave application serializes the WorkJobs' results and wraps them in an HTTP response, which the master collects and assembles. A Result has the same unique identifier as the WorkJob. The calculationTime field stores the effective computation time spent in run() for performance evaluation.

Failures
A GAE environment can have three types of failure: an exceeded quota, offline slave applications, or loss of connectivity. To cope with such failures, the master implements a simple fault-tolerance mechanism that resubmits failed WorkJobs to the corresponding slaves using an exponential back-off timeout that depends on the failure type.
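The article doesn't list the retry code; a minimal sketch of an exponential back-off resubmission loop (the SlaveClient type, method names, and constants are illustrative, not taken from the framework) is:

// Illustrative retry helper: waits 2^attempt times a base delay between
// resubmissions, up to a maximum number of attempts.
public class BackoffRetry {
  // Stand-in for the master's HTTP client to one slave application.
  interface SlaveClient {
    Result execute(WorkJob job) throws Exception;
  }

  public static Result resubmitWithBackoff(WorkJob job, SlaveClient slave,
                                           long baseDelayMs, int maxAttempts)
      throws InterruptedException {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return slave.execute(job);                   // hypothetical HTTP job request
      } catch (Exception failure) {
        long delay = baseDelayMs * (1L << attempt);  // exponential back-off
        Thread.sleep(delay);
      }
    }
    throw new IllegalStateException("job " + job.getId() + " failed after retries");
  }
}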

Benchmarks
We began our evaluation of GAE with a set of benchmarks to provide important information for scheduling parallel applications onto its resources. To help users understand the price of moving from a local parallel computer to a remote cloud with sandboxed resources, we deployed a GAE development server on Karwendel, a local machine with 16 Gbytes of memory and four 2.2-GHz dual-core Opteron processors. Instead of spawning additional sandboxed instances, the development server managed parallel requests in separate threads.

Resource Provisioning
Resource-provisioning overhead is the time between issuing an HTTP request and receiving the HTTP response. Various factors beyond the underlying TCP network influence the overhead (for example, load balancing to assign a request to an application server, which includes the initialization of an instance if none exists).

To measure the overhead, we sent HTTP ping requests with payloads between 0 and 2.7 Mbytes in 300-Kbyte steps, repeated 50 times for each size, and took the average. The overhead didn't increase linearly with the payload (see Figure 3) because TCP achieved higher bandwidth for larger payloads. We measured overhead in seconds; IaaS-based infrastructures, such as Amazon Elastic Compute Cloud (EC2), exhibit latencies measured in minutes.2

Just-in-Time Compilation
A Java virtual machine's just-in-time (JIT) compilation converts frequently used parts of byte code to native machine code, notably improving performance. To observe JIT compilation effects, we implemented a simple Fibonacci number generator. We submitted it to GAE 50 times in sequence with a delay of one second, always using the same problem size. We set up the slave application with no instances running and measured the effective computation time in the run() of each WorkJob. As we described earlier, GAE spawns instances of an application depending on its recent load (the more requests, the more instances). To mark and track instances, we used a Singleton class that contained a randomly initialized static identifier field.
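The article doesn't reproduce that class; a sketch of the kind of instance-marking singleton it describes (the class and field names are ours) can be as small as:

// Initialized once per sandboxed JVM instance, so all requests handled by the
// same GAE instance report the same identifier.
public final class InstanceMarker {
  private static final long INSTANCE_ID = new java.util.Random().nextLong();

  private InstanceMarker() {}

  public static long id() {
    return INSTANCE_ID;
  }
}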

Figure 4 shows that seven instances handled the 50 requests. Moreover, the first two requests in each instance took considerably longer than the rest. After JIT compilation, the code executed over three times faster.

Monte Carlo Simulations
One way to approximate π is through a simple Monte Carlo simulation that inscribes a circle into a square, generates p uniformly distributed random points in the square, and counts the m points that lie in the circle. So, we can approximate π = 4 · m/p. We ran this algorithm on GAE.
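For reference, the Monte Carlo kernel itself is only a few lines of plain Java. The sketch below is our own rendering of the method just described, not the authors' GAE code, and it uses the equivalent quarter-circle formulation:

import java.util.Random;

public class MonteCarloPi {
  // Sample p points uniformly in the unit square and count the m that fall
  // inside the quarter circle of radius 1; by symmetry, pi is estimated as 4 * m / p.
  public static double estimate(long p, long seed) {
    Random random = new Random(seed);
    long m = 0;
    for (long i = 0; i < p; i++) {
      double x = random.nextDouble();
      double y = random.nextDouble();
      if (x * x + y * y <= 1.0) {
        m++;
      }
    }
    return 4.0 * m / p;
  }

  public static void main(String[] args) {
    // 220 million points matches the sequential problem size used in the experiments below.
    System.out.println(estimate(220_000_000L, 42L));
  }
}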

Obtaining consistent measurements from GAE is difficult for two reasons. First, the programmer has no control over the slave instances. Second, two identical consecutive requests to the same Web application could execute on completely different hardware in different locations. To minimize the bias, we repeated all experiments 10 times, eliminated outliers, and averaged all runs.

Figure 3. Resource-provisioning overhead didn’t increase linearly with the payload because TCP achieved higher bandwidth for larger payloads.


Figure 4. Computation time and the mapping of requests to instances. The first two requests in each instance took considerably longer than the rest. After just-in-time compilation, the code executed almost four times faster.




Running the simulations. We conducted a warm-up phase for each application to determine the queue size and eliminate JIT compilation's effects. We executed the π calculation algorithm first sequentially and then with an increasing number of parallel jobs by generating a corresponding number of WorkJobs in the JobFactory work pool. We chose a problem of 220 million random points, which produced a sequential execution time slightly below the 30-second limit.

For each experiment, we measured and analyzed two metrics. The first was computation time, which represented the average execution time of run(). The second was the average overhead, which represented the difference between the total execution time and the computation time (especially due to request latencies).

Results. Figure 5 shows that serial execution on GAE was about two times slower than on Karwendel, owing to a slower random-number-generation routine in GAE's standard math library.3 On Karwendel, transferring jobs and results incurred almost no overhead, owing to the fast local network between the master and the slaves. So, the average computation time and total execution time were almost identical until eight parallel jobs (Karwendel has eight cores). Until that point, almost linear speedup occurred. Using more than eight parallel jobs generated a load imbalance that deteriorated speedup because two jobs had to share one physical core.

GAE exhibited a constant data transfer and total overhead of approximately 700 milliseconds in both cases, which explains its lower speedup. The random background load on GAE servers or on the Internet network caused the slight irregularities in execution time for different machine sizes.

This classic scalability analysis method didn't favor GAE because the 30-second limit let us execute only relatively small problems (in which Amdahl's law limits scalability). To eliminate this barrier and evaluate GAE's potential for computing larger problems, we used Gustafson's law4 to increase the problem size proportionally to the machine size. We observed the impact on the execution time (which should stay constant for an ideal speedup). We distributed the jobs to 10 GAE slave applications instead of one to gain sufficient quotas (in minutes).
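For reference (the article cites Gustafson's law but doesn't restate it): with serial fraction α and N parallel workers, the standard scaled speedup is

S(N) = α + (1 − α) · N,

so keeping the per-worker problem size fixed should keep the wall-clock time roughly constant as N grows, which is the behavior the following experiment checks.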

In this case, we started with an initial problem of 180 million random points to avoid exceeding the 30-second limit. (For a larger number of jobs, GAE can't provide more resources and starts denying connections.) Again, Karwendel had a constant execution time until eight parallel jobs (see Figure 6), demonstrating our framework's good scalability.

Starting with nine parallel jobs, the execution time steadily increased proportionally to the problem size. GAE showed similarly good scalability until 10 parallel jobs. Starting additional parallel jobs slightly increased the execution time. The overhead of aborted requests (owing to quotas being reached) caused most irregularities.

For more than 17 parallel jobs, GAE had a lower execution time than Karwendel owing to Google’s larger hardware infrastructure.

Cost Analysis
Although we conducted all our experiments within the free daily quotas that Google offered, it was still important to estimate cost to understand the price of executing our applications in real life. So, alongside the π approximation, we implemented three algorithms with different computation and communication complexity (see Table 1):

■ matrix multiplication, based on row-wise distribution of the first matrix and full broadcast of the second;
■ Mandelbrot set generation, based on the escape time algorithm; and
■ rank sort, based on each array element's separate rank computation. This could potentially outperform other faster sequential algorithms.

Figure 5. Results for calculating π on (a) Google App Engine (GAE) and (b) Karwendel, the local machine. Serial execution on GAE was about two times slower than on Karwendel, owing to a slower random-number-generation routine in GAE's standard math library.3


Figure 6. Scalability results for GAE and Karwendel for proportionally increasing machine and problem sizes. Karwendel had a constant execution time until eight parallel jobs, demonstrating our framework’s good scalability.



We ran the experiments 100 times in sequence for each problem size and analyzed the cost of the three most limiting resources: CPU time, incoming data, and outgoing data, which we obtained through the Google application administration interface. We used the Google prices as of 10 January 2011: US$0.12 per outgoing Gbyte, $0.10 per incoming Gbyte, and $0.10 per CPU hour. We didn't analyze the data store quota because the overall CPU hours include its usage.

As we expected, π approximation was the most computationally intensive and had almost no data-transfer cost. Surprisingly, rank sort consumed little bandwidth compared to CPU time, even though the full unsorted array had to transfer to the slaves and the rank of each element had to transfer back to the master. The Mandelbrot set generator was clearly dominated by the amount of image data that must transfer to the master. For π approximation, we generally could sample approximately 1.29 · 10⁹ random points for US$1 because the algorithm has linear computational effort. For the other algorithms, a precise estimation is more difficult because resource consumption doesn't increase linearly with the problem size. Nevertheless, we can use the resource complexity listed in Table 1 to roughly approximate the cost to execute new problem sizes.
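As a check on how these prices combine (the arithmetic here is ours, but the inputs are Table 1's), the matrix multiplication row decomposes as 0.85 Gbytes outgoing × $0.12 + 0.75 Gbytes incoming × $0.10 + 1.15 CPU hours × $0.10 = $0.102 + $0.075 + $0.115 ≈ $0.29, matching the listed GAE cost of $0.292; the π approximation is simply 1.7 CPU hours × $0.10 = $0.17.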

Finally, we estimated the cost to run the same experiments on the Amazon EC2 infrastructure using EC2's m1.small instances, which have a computational performance of one EC2 compute unit. This is equivalent to a 1.2-GHz Xeon or Opteron processor, which is similar to GAE and enables a direct comparison. We packaged the implemented algorithms into Xen-based virtual machines deployed and booted on m1.small instances. Table 1 shows that the computation costs were lower for GAE, owing mostly to the cycle-based payments as opposed to EC2's hourly billing intervals.

Google recently announced a change in its pricing model that will replace CPU cycles with a new instance-hours unit.

Related Work in Cloud Performance

Analysis of four commercial infrastructure-as-a-service-based clouds for scientific computing showed that cloud performance is lower than that of traditional scientific computing.1 However, the analysis indicated that cloud computing might be a viable alternative for scientists who need resources instantly and temporarily.

Alexandru Iosup and his colleagues examined the long-term performance variability of Google App Engine (GAE) and Amazon Elastic Compute Cloud (EC2).2 The results showed yearly and daily patterns, as well as periods of stable performance. The researchers concluded that GAE’s and EC2’s performance varied among different large-scale applications.

Christian Vecchiola and his colleagues analyzed different cloud providers from the perspective of high-performance computing applications, emphasizing the Aneka platform-as-a-service (PaaS) framework.3 Aneka requires a third-party deployment cloud platform and doesn’t support GAE.

Windows Azure is a PaaS provider comparable to GAE but better suited for scientific problems. Jie Li and colleagues compared its performance to that of a desktop computer but performed no cost analysis.4

MapReduce frameworks offer a different approach to cloud computation.5,6 MapReduce is an orthogonal application class5 that targets large-data processing.7 It's less suited for computationally intensive parallel algorithms8—for example, those operating on small datasets. Furthermore, it doesn't support the implementation of more complex applications, such as recursive and nonlinear problems or scientific workflows.

References
1. A. Iosup et al., "Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing," IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 6, 2011, pp. 931–945.
2. A. Iosup, N. Yigitbasi, and D. Epema, "On the Performance Variability of Production Cloud Services," Proc. 11th IEEE/ACM Int'l Symp. Cluster, Cloud, and Grid Computing (CCGrid 11), IEEE CS, pp. 104–113.
3. C. Vecchiola, S. Pandey, and R. Buyya, "High-Performance Cloud Computing: A View of Scientific Applications," Proc. 10th Int'l Symp. Pervasive Systems, Algorithms, and Networks (ISPAN 09), IEEE CS, 2009, pp. 4–16.
4. J. Li et al., "eScience in the Cloud: A Modis Satellite Data Reprojection and Reduction Pipeline in the Windows Azure Platform," Proc. 2010 Int'l Symp. Parallel & Distributed Processing (IPDPS 10), IEEE CS, 2010, pp. 1–10.
5. C. Bunch, B. Drawert, and M. Norman, "MapScale: A Cloud Environment for Scientific Computing," tech. report, Computer Science Dept., Univ. of California, Santa Barbara, 2009; www.cs.ucsb.edu/~cgb/papers/mapscale.pdf.
6. J. Qiu et al., "Hybrid Cloud and Cluster Computing Paradigms for Life Science Applications," BMC Bioinformatics, vol. 11, supplement 12, 2010, S3; www.biomedcentral.com/content/pdf/1471-2105-11-s12-s3.pdf.
7. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.
8. J. Ekanayake and G. Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," Cloud Computing and Software Services: Theory and Techniques, S.A. Ahson and M. Ilyas, eds., CRC Press, 2010.


Google recently announced a change in its pricing model that will replace CPU cycles with a new instance-hours unit. The unit is equivalent to one application instance running for one hour and will cost $0.08. In addition, Google will charge $9 a month for every application. The new model will primarily hurt Web applications that trigger additional instances upon sparse request peaks and afterward remain idle.

Table 1 gives a rough cost estimation assuming 15 parallel tasks and an instance utilization of 80 percent for useful computation. The results demonstrate that the new pricing model favors CPU-intensive applications that try to fully utilize all available instances. In addition, we can expect free resources to last longer with the new pricing model.
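The following sketch illustrates the style of rough estimation described above under the stated assumptions (15 parallel instances, 80 percent useful utilization, $0.08 per instance-hour, $9 per application per month). It ignores free quotas, scheduler behavior, and data-transfer charges, so it does not reproduce Table 1's totals; the CPU-hour input is arbitrary.

# Rough-estimate sketch of the announced instance-hours pricing model.
# Assumptions follow the article's estimation; results are indicative only.
PRICE_PER_INSTANCE_HOUR = 0.08
MONTHLY_APP_FEE = 9.0
UTILIZATION = 0.80          # fraction of an instance-hour spent on useful work
PARALLEL_INSTANCES = 15

def instance_hours_for(cpu_hours: float) -> float:
    """Instance-hours needed to deliver `cpu_hours` of useful CPU time."""
    return cpu_hours / UTILIZATION

def per_run_cost(cpu_hours: float) -> float:
    """Pay-per-use part of the cost for a single experiment run."""
    return instance_hours_for(cpu_hours) * PRICE_PER_INSTANCE_HOUR

def wall_clock_hours(cpu_hours: float) -> float:
    """Approximate turnaround time when spreading work over all instances."""
    return instance_hours_for(cpu_hours) / PARALLEL_INSTANCES

# Example: an algorithm consuming 2.0 CPU hours of useful work.
print(round(per_run_cost(2.0), 3))       # 0.2 USD of instance-hours
print(round(wall_clock_hours(2.0), 3))   # about 10 minutes of wall-clock time
print("plus a fixed", MONTHLY_APP_FEE, "USD per application per month")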

We plan to investigate the suitability of new application classes such as scientific workflow applications to be implemented on top of our generic framework and run on GAE with improved performance. For a look at other research on cloud computing performance, see the "Related Work in Cloud Performance" sidebar.

Acknowledgments

Austrian Science Fund project TRP 72-N23 and the Standortagentur Tirol project RainCloud funded this research.

References
1. D. Sanderson, Programming Google App Engine, O'Reilly Media, 2009.
2. A. Iosup et al., "Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing," IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 6, 2011, pp. 931–945.
3. M. Sperk, "Scientific Computing in the Cloud with Google App Engine," master's thesis, Faculty of Mathematics, Computer Science, and Physics, Univ. of Innsbruck, 2011; http://dps.uibk.ac.at/~radu/sperk.pdf.
4. J.L. Gustafson, "Reevaluating Amdahl's Law," Comm. ACM, vol. 31, no. 5, 1988, pp. 532–533.

Radu Prodan is an associate professor at the University of Innsbruck's Institute of Computer Science. His research interests include programming methods, compiler technology, performance analysis, and scheduling for parallel and distributed systems. Prodan has a PhD in technical sciences from the Vienna University of Technology. Contact him at [email protected].

Michael Sperk is a PhD student at the University of Innsbruck. His research interests include distributed and parallel computing. Sperk has an MSc in computer science from the University of Innsbruck. Contact him at [email protected].

Simon Ostermann is a PhD student at the University of Innsbruck's Institute of Computer Science. His research interests include resource management and scheduling for grid and cloud computing. Ostermann has an MSc in computer science from the University of Innsbruck. Contact him at [email protected].

This article originally appeared in IEEE Software, March/April 2012; http://doi.ieeecomputersociety.org/10.1109/MS.2011.131.

Table 1. Resource consumption and the estimated cost for four algorithms.

Algorithm             | Problem size (points) | Outgoing data: Gbytes (complexity) | Incoming data: Gbytes (complexity) | CPU time: hrs. (complexity) | Cost (US$): Google App Engine (GAE) | New GAE | Amazon Elastic Compute Cloud
π approximation       | 220,000,000           | 0 (O(1))                           | 0 (O(1))                           | 1.7 (O(n))                  | 0.170                               | 0.078   | 0.190
Matrix multiplication | 1,500 × 1,500         | 0.85 (O(n²))                       | 0.75 (O(n²))                       | 1.15 (O(n²))                | 0.292                               | 0.203   | 0.440
Mandelbrot set        | 3,200 × 3,200         | 0.95 (O(n²))                       | 0 (O(1))                           | 0.15 (O(n²))                | 0.129                               | 0.066   | 0.440
Rank sort             | 70,000                | 0.02 (O(n²))                       | 0.01 (O(n))                        | 1.16 (O(n²))                | 0.119                               | 0.120   | 0.245



Understanding Cloud Computing Vulnerabilities

Bernd Grobauer, Tobias Walloschek, and Elmar Stöcker, Siemens

Discussions about cloud computing security often fail to distinguish general issues from cloud-specific issues. To clarify the discussions regarding vulnerabilities, the authors define indicators based on sound definitions of risk factors and cloud computing.

Each day, a fresh news item, blog entry, or other publication warns us about cloud computing's security risks and threats; in most cases, security is cited as the most substantial roadblock for cloud computing uptake. But this discourse about cloud computing security issues makes it difficult to formulate a well-founded assessment of the actual security impact for two key reasons. First, in many of these discussions about risk, basic vocabulary terms—including risk, threat, and vulnerability—are often used interchangeably, without regard to their respective definitions. Second, not every issue raised is specific to cloud computing.

To achieve a well-founded understanding of the "delta" that cloud computing adds with respect to security issues, we must analyze how cloud computing influences established security issues. A key factor here is security vulnerabilities: cloud computing makes certain well-understood vulnerabilities more significant as well as adds new ones to the mix. Before we take a closer look at cloud-specific vulnerabilities, however, we must first establish what a "vulnerability" really is.

Vulnerability: An Overview

Vulnerability is a prominent factor of risk. ISO 27005 defines risk as "the potential that a given threat will exploit vulnerabilities of an asset or group of assets and thereby cause harm to the organization," measuring it in terms of both the likelihood of an event and its consequence.1 The Open Group's risk taxonomy (www.opengroup.org/onlinepubs/9699919899/toc.pdf) offers a useful overview of risk factors (see Figure 1).

The Open Group's taxonomy uses the same two top-level risk factors as ISO 27005: the likelihood of a harmful event (here, loss event frequency) and its consequence (here, probable loss magnitude).1 The probable loss magnitude's subfactors (on the right in Figure 1) influence a harmful event's ultimate cost. The loss event frequency subfactors (on the left) are a bit more complicated. A loss event occurs when a threat agent (such as a hacker) successfully exploits a vulnerability. The frequency with which this happens depends on two factors:

■ The frequency with which threat agents try to exploit a vulnerability. This frequency is determined by both the agents’ motivation (What can they gain with an attack? How much effort does it take? What is the risk for the attackers?) and how much access (“contact”) the agents have to the attack targets.

■ The difference between the threat agents’ attack capabilities and the system’s strength to resist the attack.

This second factor brings us toward a useful definition of vulnerability.

Defining Vulnerability

According to the Open Group's risk taxonomy,

Vulnerability is the probability that an asset will be unable to resist the actions of a threat agent. Vulnerability exists when there is a difference between the force being applied by the threat agent, and an object’s ability to resist that force.

So, vulnerability must always be described in terms of resistance to a certain type of attack. To provide a real-world example, a car's inability to protect its driver against injury when hit frontally by a truck driving 60 mph is a vulnerability; the resistance of the car's crumple zone is simply too weak compared to the truck's force. Against the "attack" of a biker, or even a small car driving at a more moderate speed, the car's resistance strength is perfectly adequate.
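To make this decomposition concrete, here is an illustrative simplification (not part of the Open Group taxonomy itself) that treats vulnerability as the probability that a threat agent's applied force exceeds an asset's resistance strength, estimated by sampling. The distributions are invented to mirror the truck and biker example.

# Illustrative simplification: vulnerability as the probability that a threat
# agent's capability exceeds an asset's control strength (Monte Carlo estimate).
import random

def estimate_vulnerability(threat_capability, control_strength, trials=100_000):
    """Fraction of simulated attempts in which the applied force wins.
    Both arguments are zero-argument callables returning a sampled strength."""
    wins = sum(threat_capability() > control_strength() for _ in range(trials))
    return wins / trials

# Truck vs. crumple zone: the applied force almost always exceeds resistance.
truck = lambda: random.gauss(100, 10)
car_crumple_zone = lambda: random.gauss(40, 5)
print(estimate_vulnerability(truck, car_crumple_zone))   # close to 1.0

# Biker vs. the same car: resistance is adequate, vulnerability is near zero.
biker = lambda: random.gauss(10, 3)
print(estimate_vulnerability(biker, car_crumple_zone))   # close to 0.0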


We can also describe computer vulnerability—that is, security-related bugs that you close with vendor-provided patches—as a weakening or removal of a certain resistance strength. A buffer-overflow vulnerability, for example, weakens the system's resistance to arbitrary code execution. Whether attackers can exploit this vulnerability depends on their capabilities.

Vulnerabilities and Cloud Risk

We'll now examine how cloud computing influences the risk factors in Figure 1, starting with the right-hand side of the risk factor tree.

From a cloud customer perspective, the right-hand side dealing with probable magnitude of future loss isn't changed at all by cloud computing: the consequences and ultimate cost of, say, a confidentiality breach are exactly the same regardless of whether the data breach occurred within a cloud or a conventional IT infrastructure. For a cloud service provider, things look somewhat different: because systems that were previously separated now share the same infrastructure, a loss event could entail a considerably larger impact. But this fact is easily grasped and incorporated into a risk assessment: no conceptual work for adapting impact analysis to cloud computing seems necessary.

So, we must search for changes on Figure 1's left-hand side—the loss event frequency. Cloud computing could change the probability of a harmful event's occurrence. As we show later, cloud computing causes significant changes in the vulnerability factor. Of course, moving to a cloud infrastructure might change the attackers' access level and motivation, as well as the effort and risk—a fact that must be considered as future work. But, for supporting a cloud-specific risk assessment, it seems most profitable to start by examining the exact nature of cloud-specific vulnerabilities.

Cloud Computing

Is there such a thing as a "cloud-specific" vulnerability? If so, certain factors in cloud computing's nature must make a vulnerability cloud-specific.

Essentially, cloud computing combines known technologies (such as virtualization) in ingenious ways to provide IT services “from the conveyor belt” using economies of scale. We’ll now look closer at what the core technologies are and which characteristics of their use in cloud computing are essential.

Core Cloud Computing Technologies

Cloud computing builds heavily on capabilities available through several core technologies:

■ Web applications and services. Software as a service (SaaS) and platform as a service (PaaS) are unthinkable without Web application and Web services technologies: SaaS offerings are typically implemented as Web applications, while PaaS offerings provide development and runtime environments for Web applications and services. For infrastructure as a service (IaaS) offerings, administrators typically implement associated services and APIs, such as the management access for customers, using Web application/service technologies.

■ Virtualization. IaaS offerings have virtualization techniques at their very heart; because PaaS and SaaS services are usually built on top of a supporting IaaS infrastructure, the importance of virtualization also extends to these service models. In the future, we expect virtualization to develop from virtualized servers toward computational resources that can be used more readily for executing SaaS services.

■ Cryptography. Many cloud computing security requirements are solvable only by using cryptographic techniques.

As cloud computing develops, the list of core technologies is likely to expand.

Figure 1. Factors contributing to risk according to the Open Group’s risk taxonomy. Risk corresponds to the product of loss event frequency (left) and probable loss magnitude (right). Vulnerabilities influence the loss event frequency.


Essential Characteristics

In its description of essential cloud characteristics,2 the US National Institute of Standards and Technology (NIST) captures well what it means to provide IT services from the conveyor belt using economies of scale:

■ On-demand self-service. Users can order and manage services without human interaction with the service provider, using, for example, a Web portal and management interface. Provisioning and de-provisioning of services and associated resources occur automatically at the provider.

■ Ubiquitous network access. Cloud services are accessed via the network (usually the Internet), using standard mechanisms and protocols.

■ Resource pooling. Computing resources used to provide the cloud service are realized using a homogeneous infrastructure that's shared between all service users.

■ Rapid elasticity. Resources can be scaled up and down rapidly and elastically.

■ Measured service. Resource/service usage is constantly metered, supporting optimization of resource usage, usage reporting to the customer, and pay-as-you-go business models.

NIST's definition framework for cloud computing with its list of essential characteristics has by now evolved into the de facto standard for defining cloud computing.

Cloud-Specific Vulnerabilities

Based on the abstract view of cloud computing we presented earlier, we can now move toward a definition of what constitutes a cloud-specific vulnerability. A vulnerability is cloud specific if it

■ is intrinsic to or prevalent in a core cloud computing technology,

■ has its root cause in one of NIST’s essential cloud characteristics,

■ is caused when cloud innovations make tried-and-tested security controls difficult or impossible to implement, or

■ is prevalent in established state-of-the-art cloud offerings.

We now examine each of these four indicators.

Core-Technology Vulnerabilities

Cloud computing's core technologies—Web applications and services, virtualization, and cryptography—have vulnerabilities that are either intrinsic to the technology or prevalent in the technology's state-of-the-art implementations. Three examples of such vulnerabilities are virtual machine escape, session riding and hijacking, and insecure or obsolete cryptography.

First, the possibility that an attacker might successfully escape from a virtualized environment lies in virtualization's very nature. Hence, we must consider this vulnerability as intrinsic to virtualization and highly relevant to cloud computing.

Second, Web application technologies must overcome the problem that, by design, the HTTP protocol is a stateless protocol, whereas Web applications require some notion of session state. Many techniques implement session handling and—as any security professional knowledgeable in Web application security will testify—many session handling implementations are vulnerable to session riding and session hijacking. Whether session riding/hijacking vulnerabilities are intrinsic to Web application technologies or are "only" prevalent in many current implementations is arguable; in any case, such vulnerabilities are certainly relevant for cloud computing.

Finally, cryptanalysis advances can render any cryptographic mechanism or algorithm insecure as novel methods of breaking them are discovered. It's even more common to find crucial flaws in cryptographic algorithm implementations, which can turn strong encryption into weak encryption (or sometimes no encryption at all). Because broad uptake of cloud computing is unthinkable without the use of cryptography to protect data confidentiality and integrity in the cloud, insecure or obsolete cryptography vulnerabilities are highly relevant for cloud computing.

Essential Cloud Characteristic Vulnerabilities

As we noted earlier, NIST describes five essential cloud characteristics: on-demand self-service, ubiquitous network access, resource pooling, rapid elasticity, and measured service. Following are examples of vulnerabilities with root causes in one or more of these characteristics:

■ Unauthorized access to management interface. The cloud characteristic on-demand self-service requires a management interface that's accessible to cloud service users. Unauthorized access to the management interface is therefore an especially relevant vulnerability for cloud systems: the probability that unauthorized access could occur is much higher than for traditional systems where the management functionality is accessible only to a few administrators.

■ Internet protocol vulnerabilities. The cloud characteristic ubiquitous network access means that cloud services are accessed via network using standard protocols. In most cases, this network is the Internet, which must be considered untrusted. Internet protocol vulnerabilities—such as vulnerabilities that allow man-in-the-middle attacks—are therefore relevant for cloud computing.

■ Data recovery vulnerability. The cloud characteristics of pooling and elasticity entail that resources allocated to one user will be reallocated to a different user at a later time. For memory or storage resources, it might therefore be possible to recover data written by a previous user.

■ Metering and billing evasion. The cloud characteristic of measured service means that any cloud service has a metering capability at an abstraction level appropriate to the service type (such as storage, processing, and active user accounts). Metering data is used to optimize service delivery as well as billing. Relevant vulnerabilities include metering and billing data manipulation and billing evasion.

Thus, we can leverage NIST’s well-founded definition of cloud computing in reasoning about cloud computing issues.

Defects in Known Security Controls

Vulnerabilities in standard security controls must be considered cloud specific if cloud innovations directly cause the difficulties in implementing the controls. Such vulnerabilities are also known as control challenges.

Here, we treat three examples of such control challenges. First, virtualized networks offer insufficient network-based controls. Given the nature of cloud services, the administrative access to IaaS network infrastructure and the ability to tailor network infrastructure are typically limited; hence, standard controls such as IP-based network zoning can't be applied. Also, standard techniques such as network-based vulnerability scanning are usually forbidden by IaaS providers because, for example, friendly scans can't be distinguished from attacker activity. Finally, technologies such as virtualization mean that network traffic occurs on both real and virtual networks, such as when two virtual machine environments (VMEs) hosted on the same server communicate. Such issues constitute a control challenge because tried and tested network-level security controls might not work in a given cloud environment.

The second challenge is in poor key management procedures. As noted in a recent European Network and Information Security Agency study,3 cloud computing infrastructures require management and storage of many different kinds of keys. Because virtual machines don't have a fixed hardware infrastructure and cloud-based content is often geographically distributed, it's more difficult to apply standard controls—such as hardware security module (HSM) storage—to keys on cloud infrastructures.

Finally, security metrics aren't adapted to cloud infrastructures. Currently, there are no standardized cloud-specific security metrics that cloud customers can use to monitor the security status of their cloud resources. Until such standard security metrics are developed and implemented, controls for security assessment, audit, and accountability are more difficult and costly, and might even be impossible to employ.

Prevalent Vulnerabilities in State-of-the-Art Cloud Offerings

Although cloud computing is relatively young, there are already myriad cloud offerings on the market. Hence, we can complement the three cloud-specific vulnerability indicators presented earlier with a fourth, empirical indicator: if a vulnerability is prevalent in state-of-the-art cloud offerings, it must be regarded as cloud-specific. Examples of such vulnerabilities include injection vulnerabilities and weak authentication schemes.

Injection vulnerabilities are exploited by manipulating service or application inputs so that parts of them are interpreted and executed against the programmer's intentions (a minimal sketch follows the list below). Examples of injection vulnerabilities include

■ SQL injection, in which the input contains SQL code that’s erroneously executed in the database back end;

■ command injection, in which the input contains commands that are erroneously executed via the OS; and

■ cross-site scripting, in which the input contains JavaScript code that’s erroneously executed by a victim’s browser.
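As a self-contained illustration of the first case (not drawn from the article), the following sketch uses Python's built-in sqlite3 module to show how spliced-in input becomes executable SQL, and how a parameterized query treats the same input strictly as data. Table and column names are hypothetical.

# Minimal SQL-injection illustration using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t'), ('bob', 'hunter2')")

user_input = "alice' OR '1'='1"   # attacker-controlled value

# Vulnerable: the input is concatenated into the SQL text, so the OR clause
# becomes part of the query and every row is returned.
vulnerable = conn.execute(
    "SELECT name, secret FROM users WHERE name = '" + user_input + "'").fetchall()

# Safer: a parameterized query passes the input as data, not as SQL.
parameterized = conn.execute(
    "SELECT name, secret FROM users WHERE name = ?", (user_input,)).fetchall()

print(vulnerable)      # leaks all rows
print(parameterized)   # empty, because no user has that literal name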

In addition, many widely used authentication mechanisms are weak. For example, usernames and passwords for authentication are weak due to

■ insecure user behavior (choosing weak passwords, reusing passwords, and so on), and

■ inherent limitations of one-factor authentication mechanisms.

Also, the authentication mechanisms' implementation might have weaknesses and allow, for example, credential interception and replay. The majority of Web applications in current state-of-the-art cloud services employ usernames and passwords as authentication mechanism.
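The article does not prescribe a remedy here, but one common way past the inherent limits of one-factor authentication is to add a second factor. Purely as an illustration, the sketch below computes a standard time-based one-time password (TOTP, RFC 6238) using only the Python standard library; the base32 secret is a placeholder value.

# Illustrative TOTP (RFC 6238) computation as a second authentication factor.
import base64, hashlib, hmac, struct, time

def totp(secret_b32, t=None, step=30, digits=6):
    """Return the current time-based one-time password for a shared secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if t is None else t) // step)
    msg = struct.pack(">Q", counter)                      # 8-byte big-endian counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                            # dynamic truncation
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))   # placeholder secret; prints a 6-digit code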

Architectural Components and Vulnerabilities

Cloud service models are commonly divided into SaaS, PaaS, and IaaS, and each model influences the vulnerabilities exhibited by a given cloud infrastructure. It's helpful to add more structure to the service model stacks: Figure 2 shows a cloud reference architecture that makes the most important security-relevant cloud components explicit and provides an abstract overview of cloud computing for security issue analysis.

The reference architecture is based on work carried out at the University of California, Los Angeles, and IBM.4 It inherits the layered approach in that layers can encompass one or more service components. Here, we use "service" in the broad sense of providing something that might be both material (such as shelter, power, and hardware) and immaterial (such as a runtime environment). For two layers, the cloud software environment and the cloud software infrastructure, the model makes the layers' three main service components—computation, storage, and communication—explicit. Top layer services also can be implemented on layers further down the stack, in effect skipping intermediate layers. For example, a cloud Web application can be implemented and operated in the traditional way—that is, running on top of a standard OS without using dedicated cloud software infrastructure and environment components. Layering and compositionality imply that the transition from providing some service or function in-house to sourcing the service or function can take place between any of the model's layers.

In addition to the original model, we've identified supporting functions relevant to services in several layers and added them to the model as vertical spans over several horizontal layers.

Our cloud reference architecture has three main parts:

■ Supporting (IT) infrastructure. These are facilities and services common to any IT service, cloud or otherwise. We include them in the architecture because we want to provide the complete picture; a full treatment of IT security must account for a cloud service's non-cloud-specific components.

■ Cloud-specific infrastructure. These components constitute the heart of a cloud service; cloud-specific vulnerabilities and corresponding controls are typically mapped to these components.

■ Cloud service consumer. Again, we include the cloud service customer in the reference architecture because it's relevant to an all-encompassing security treatment.

Also, we make explicit the network that separates the cloud service consumer from the cloud infrastructure; the fact that access to cloud resources is carried out via a (usually untrusted) network is one of cloud computing's main characteristics.

Using the cloud reference architecture's structure, we can now run through the architecture's components and give examples of each component's cloud-specific vulnerabilities.


Cloud Software Infrastructure and Environment

The cloud software infrastructure layer provides an abstraction level for basic IT resources that are offered as services to higher layers: computational resources (usually VMEs), storage, and (network) communication. These services can be used individually, as is typically the case with storage services, but they're often bundled such that servers are delivered with certain network connectivity and (often) access to storage. This bundle, with or without storage, is usually referred to as IaaS.

The cloud software environment layer provides services at the application platform level:

■ a development and runtime environment for services and applications written in one or more supported languages;

■ storage services (a database interface rather than file share); and

■ communication infrastructure, such as Microsoft’s Azure service bus.

Vulnerabilities in both the infrastructure and environment layers are usually specific to one of the three resource types provided by these two layers. However, cross-tenant access vulnerabilities are relevant for all three resource types. The virtual machine escape vulnerability we described earlier is a prime example. We used it to demonstrate a vulnerability that's intrinsic to the core virtualization technology, but it can also be seen as having its root cause in the essential characteristic of resource pooling: whenever resources are pooled, unauthorized access across resources becomes an issue. Hence, for PaaS, where the technology to separate different tenants (and tenant services) isn't necessarily based on virtualization (although that will be increasingly true), cross-tenant access vulnerabilities play an important role as well. Similarly, cloud storage is prone to cross-tenant storage access, and cloud communication—in the form of virtual networking—is prone to cross-tenant network access.

Computational Resources

A highly relevant set of computational resource vulnerabilities concerns how virtual machine images are handled: the only feasible way of providing nearly identical server images—thus providing on-demand service for virtual servers—is by cloning template images.

Vulnerable virtual machine template images cause OS or application vulnerabilities to spread over many systems. An attacker might be able to analyze configuration, patch level, and code in detail using administrative rights by renting a virtual server as a service customer and thereby gaining knowledge helpful in attacking other customers' images. A related problem is that an image can be taken from an untrustworthy source, a new phenomenon brought on especially by the emerging marketplace of virtual images for IaaS services. In this case, an image might, for example, have been manipulated so as to provide back-door access for an attacker.

Data leakage by virtual machine replication is a vulnerability that's also rooted in the use of cloning for providing on-demand service. Cloning leads to data leakage problems regarding machine secrets: certain elements of an OS—such as host keys and cryptographic salt values—are meant to be private to a single host. Cloning can violate this privacy assumption. Again, the emerging marketplace for virtual machine images, as in Amazon EC2, leads to a related problem: users can provide template images for other users by turning a running image into a template. Depending on how the image was used before creating a template from it, it could contain data that the user doesn't wish to make public.

There are also control challenges here, including those related to cryptography use. Cryptographic vulnerabilities due to weak random number generation might exist if the abstraction layer between the hardware and OS kernel introduced by virtualization is problematic for generating random numbers within a VME. Such generation requires an entropy source on the hardware level. Virtualization might have flawed mechanisms for tapping that entropy source, or having several VMEs on the same host might exhaust the available entropy, leading to weak random number generation. As we noted earlier, this abstraction layer also complicates the use of advanced security controls, such as hardware security modules, possibly leading to poor key management procedures.
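One mitigation commonly applied against both problems, offered here as an illustrative sketch rather than a recommendation from the article, is to avoid baking machine secrets into the template image and instead generate them at first boot, drawing randomness from the operating system's CSPRNG rather than a seedable userspace PRNG. The file paths and key size below are hypothetical.

# Illustrative first-boot personalization of a cloned image.
import os
import secrets

MARKER = "/var/lib/myapp/secrets-generated"       # hypothetical path
SECRET_FILE = "/var/lib/myapp/instance-secret"    # hypothetical path

def ensure_instance_secret() -> None:
    """On first boot of a cloned image, create a fresh machine-unique secret."""
    if os.path.exists(MARKER):
        return                                    # already personalized
    os.makedirs(os.path.dirname(SECRET_FILE), exist_ok=True)
    with open(SECRET_FILE, "wb") as f:
        f.write(secrets.token_bytes(32))          # 256 bits from the OS CSPRNG
    os.chmod(SECRET_FILE, 0o600)
    open(MARKER, "w").close()

if __name__ == "__main__":
    ensure_instance_secret()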

Figure 2. The cloud reference architecture. We map cloud-specific vulnerabilities to components of this reference architecture, which gives us an overview of which vulnerabilities might be relevant for a given cloud service.

Storage

In addition to data recovery vulnerability due to resource pooling and elasticity, there's a related control challenge in media sanitization, which is often hard or impossible to implement in a cloud context. For example, data destruction policies applicable at the end of a life cycle that require physical disk destruction can't be carried out if a disk is still being used by another tenant.

Because cryptography is frequently used to overcome storage-related vulnerabilities, this core technology’s vulnerabilities—insecure or obsolete cryptography and poor key management—play a special role for cloud storage.

Communication

The most prominent example of a cloud communications service is the networking provided for VMEs in an IaaS environment. Because of resource pooling, several customers are likely to share certain network infrastructure components: vulnerabilities of shared network infrastructure components, such as vulnerabilities in a DNS server, Dynamic Host Configuration Protocol, and IP protocol vulnerabilities, might enable network-based cross-tenant attacks in an IaaS infrastructure.

Virtualized networking also presents a control challenge: again, in cloud services, the administrative access to IaaS network infrastructure and the possibility for tailoring network infrastructure are usually limited. Also, using technologies such as virtualization leads to a situation where network traffic occurs not only on "real" networks but also within virtualized networks (such as for communication between two VMEs hosted on the same server); most implementations of virtual networking offer limited possibilities for integrating network-based security. All in all, this constitutes a control challenge of insufficient network-based controls because tried-and-tested network-level security controls might not work in a given cloud environment.

Cloud Web Applications

A Web application uses browser technology as the front end for user interaction. With the increased uptake of browser-based computing technologies such as JavaScript, Java, Flash, and Silverlight, a Web cloud application falls into two parts:

■ an application component operated somewhere in the cloud, and

■ a browser component running within the user’s browser.

In the future, developers will increasingly use technologies such as Google Gears to permit offline usage of a Web application's browser component for use cases that don't require constant access to remote data. We've already described two typical vulnerabilities for Web application technologies: session riding and hijacking vulnerabilities and injection vulnerabilities.

Other Web-application-specific vulnerabilities concern the browser's front-end component. Among them are client-side data manipulation vulnerabilities, in which users attack Web applications by manipulating data sent from their application component to the server's application component. In other words, the input received by the server component isn't the "expected" input sent by the client-side component, but altered or completely user-generated input. Furthermore, Web applications also rely on browser mechanisms for isolating third-party content embedded in the application (such as advertisements, mashup components, and so on). Browser isolation vulnerabilities might thus allow third-party content to manipulate the Web application.
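A minimal sketch of the standard defense against client-side data manipulation, added here for illustration and not taken from the article, is for the server to recompute or validate anything the browser component could have altered instead of trusting it. The catalog, field names, and prices below are hypothetical.

# The server never trusts client-supplied values; it recomputes them itself.
CATALOG = {"sku-1": 19.99, "sku-2": 49.00}   # hypothetical server-side price list

def checkout(order: dict) -> float:
    """Compute the amount to charge from server-side data only.

    `order` is untrusted client input such as {"sku-1": 2, "total": 0.01};
    the client-supplied "total" is ignored entirely."""
    total = 0.0
    for sku, quantity in order.items():
        if sku == "total":
            continue                          # ignore the client's own arithmetic
        if sku not in CATALOG:
            raise ValueError(f"unknown item: {sku}")
        if not isinstance(quantity, int) or not (1 <= quantity <= 100):
            raise ValueError(f"implausible quantity for {sku}")
        total += CATALOG[sku] * quantity
    return round(total, 2)

print(checkout({"sku-1": 2, "total": 0.01}))  # 39.98, regardless of the fake total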

Services and APIs

It might seem obvious that all layers of the cloud infrastructure offer services, but for examining cloud infrastructure security, it's worthwhile to explicitly think about all of the infrastructure's service and application programming interfaces. Most services are likely Web services, which share many vulnerabilities with Web applications. Indeed, the Web application layer might be realized completely by one or more Web services such that the application URL would only give the user a browser component. Thus the supporting services and API functions share many vulnerabilities with the Web applications layer.

Management Access

NIST's definition of cloud computing states that one of cloud services' central characteristics is that they can be rapidly provisioned and released with minimal management effort or service provider interaction. Consequently, a common element of each cloud service is a management interface—which leads directly to the vulnerability concerning unauthorized access to the management interface. Furthermore, because management access is often realized using a Web application or service, it often shares the vulnerabilities of the Web application layer and services/API component.

Identity, Authentication, Authorization, and Auditing Mechanisms

All cloud services (and each cloud service's management interface) require mechanisms for identity management, authentication, authorization, and auditing (IAAA). To a certain extent, parts of these mechanisms might be factored out as a stand-alone IAAA service to be used by other services. Two IAAA elements that must be part of each service implementation are execution of adequate authorization checks (which, of course, use authentication and/or authorization information received from an IAA service) and cloud infrastructure auditing.

Most vulnerabilities associated with the IAAA component must be regarded as cloud-specific because they're prevalent in state-of-the-art cloud offerings. Earlier, we gave the example of weak user authentication mechanisms; other examples include

■ Denial of service by account lockout. One often-used security control—especially for authentication with username and password—is to lock out accounts that have received several unsuccessful authentication attempts in quick succession. Attackers can use such attempts to launch DoS attacks against a user.

■ Weak credential-reset mechanisms. When cloud computing providers manage user credentials themselves rather than using federated authentication, they must provide a mechanism for resetting credentials in the case of forgotten or lost credentials. In the past, password-recovery mechanisms have proven particularly weak.

■ Insufficient or faulty authorization checks. State-of-the-art Web application and service cloud offerings are often vulnerable to insufficient or faulty authorization checks that can make unauthorized information or actions available to users. Missing authorization checks, for example, are the root cause of URL-guessing attacks. In such attacks, users modify URLs to display information of other user accounts (a minimal sketch of the missing check follows this list).

■ Coarse authorization control. Cloud services' management interfaces are particularly prone to offering authorization control models that are too coarse. Thus, standard security measures, such as duty separation, can't be implemented because it's impossible to provide users with only those privileges they strictly require to carry out their work.

■ Insufficient logging and monitoring possibilities. Currently, no standards or mechanisms exist to give cloud customers logging and monitoring facilities within cloud resources. This gives rise to an acute problem: log files record all tenant events and can't easily be pruned for a single tenant. Also, the provider's security monitoring is often hampered by insufficient monitoring capabilities. Until we develop and implement usable logging and monitoring standards and facilities, it's difficult—if not impossible—to implement security controls that require logging and monitoring.
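The sketch referenced in the bullet on insufficient or faulty authorization checks follows; it is an illustration with hypothetical data, not code from the article. It shows the per-request ownership check whose absence makes URL-guessing attacks possible: the server verifies, on every request, that the authenticated user may see the resource named in the URL.

# Per-request authorization check against URL guessing (hypothetical data).
DOCUMENTS = {
    "inv-1001": {"owner": "alice", "body": "Invoice for Alice"},
    "inv-1002": {"owner": "bob",   "body": "Invoice for Bob"},
}

class Forbidden(Exception):
    pass

def get_document(authenticated_user: str, doc_id: str) -> str:
    doc = DOCUMENTS.get(doc_id)
    if doc is None:
        raise KeyError(doc_id)
    # Ownership is verified on every request, not only when a link is generated.
    if doc["owner"] != authenticated_user:
        raise Forbidden(f"{authenticated_user} may not read {doc_id}")
    return doc["body"]

print(get_document("alice", "inv-1001"))       # allowed
try:
    get_document("alice", "inv-1002")          # guessed identifier of another account
except Forbidden as e:
    print("blocked:", e)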

Of all these IAAA vulnerabilities, in the experience of cloud service providers, currently, authentication issues are the primary vulnerability that puts user data in cloud services at risk.5

Provider

Vulnerabilities that are relevant for all cloud computing components typically concern the provider—or rather users' inability to control cloud infrastructure as they do their own infrastructure. Among the control challenges are insufficient security audit possibilities, and the fact that certification schemes and security metrics aren't adapted to cloud computing. Further, standard security controls regarding audit, certification, and continuous security monitoring can't be implemented effectively.

Cloud computing is in constant development; as the field matures, additional cloud-specific vulnerabilities certainly will emerge, while others will become less of an issue. Using a precise definition of what constitutes a vulnerability from the Open Group's risk taxonomy and the four indicators of cloud-specific vulnerabilities we identify here offers a precision and clarity level often lacking in current discourse about cloud computing security.

Control challenges typically highlight situations in which otherwise successful security controls are ineffective in a cloud setting. Thus, these challenges are of special interest for further cloud computing security research. Indeed, many current efforts—such as the development of security metrics and certification schemes, and the move toward full-featured virtualized network components—directly address control challenges by enabling the use of such tried-and-tested controls for cloud computing.

References
1. ISO/IEC 27005:2007, Information Technology—Security Techniques—Information Security Risk Management, Int'l Org. Standardization, 2007.
2. P. Mell and T. Grance, "Effectively and Securely Using the Cloud Computing Paradigm (v0.25)," presentation, US Nat'l Inst. Standards and Technology, 2009; http://csrc.nist.gov/groups/SNS/cloud-computing.
3. European Network and Information Security Agency (ENISA), Cloud Computing: Benefits, Risks and Recommendations for Information Security, Nov. 2009; www.enisa.europa.eu/act/rm/files/deliverables/cloud-computing-risk-assessment/at_download/fullReport.
4. L. Youseff, M. Butrico, and D. Da Silva, "Towards a Unified Ontology of Cloud Computing," Proc. Grid Computing Environments Workshop (GCE), IEEE Press, 2008; doi: 10.1109/GCE.2008.4738443.
5. E. Grosse, "Security at Scale," invited talk, ACM Cloud Security Workshop (CCSW), 2010; http://wn.com/2010_Google_Faculty_Summit_Security_at_Scale.

Bernd Grobauer is a senior consultant in information security and leads the Siemens Computer Emergency Response Team's (CERT's) research activities in incident detection and handling, malware defense, and cloud computing security. Grobauer has a PhD in computer science from Aarhus University, Denmark. He's on the membership advisory committee of the International Information Integrity Institute. Contact him at [email protected].

Tobias Walloschek is a senior management consultant at Siemens IT Solutions and Services GmbH. His research interests are cloud computing security and business adoption strategies. Walloschek has a bachelor's degree in business administration from the University of Applied Sciences in Ingolstadt, Germany. He is a Certified Information Systems Security Professional. Contact him at [email protected].

Elmar Stöcker is a manager at Siemens IT Solutions and Services GmbH, where he's responsible for the portfolio strategy and governance of the professional services portfolio; he also leads the cloud computing security and PaaS activities. Stöcker has a master's degree in computer science from RWTH Aachen, Germany. Contact him at [email protected].

This article originally appeared in IEEE Security & Privacy, March/April 2011; http://doi.ieeecomputersociety.org/10.1109/MSP.2010.115.


IEEE CloudCom 2012: 4th IEEE International Conference on Cloud Computing Technology and Science
3–6 December 2012, Taipei, Taiwan
Register today: http://2012.cloudcom.org/

The "Cloud" is a natural evolution of distributed computing and the widespread adoption of virtualization and Service Oriented Architecture. In cloud computing, IT-related capabilities and resources are provided as services, via the Internet and on demand, accessible without requiring detailed knowledge of the underlying technology. IEEE CloudCom 2012, steered by the Cloud Computing Association, brings together researchers working on cloud computing and related technologies.


Switch from print at computer.org/digitalcomputer


More value, more content, more resources

The new multi-faceted Computer offers exclusive video and web extras that you can access only through this advanced digital version. Dive deeper into the latest technical developments with a magazine that is:

Searchable—Quickly find the latest information in your fields of interest. Access the digital archives, and save what’s most relevant to you.

Linked—Click on table of contents links and instantly go to the articles you want to read first. Article links go to additional references to deepen your new discoveries.

Engaging—Absorb concepts as they come to life through related audio, video, animation, and more. Email authors directly. Even apply for jobs through convenient ad links.

Mobile—Read the issue anytime, anywhere, at your convenience—on your laptop, iPad, or other mobile device.

Now available in advanced digital format.