
Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management

Tevfik Kosar
State University of New York at Buffalo (SUNY), USA

Information Science Reference


Detailed Table of Contents

Preface xiii

Section 1

New Paradigms in Data Intensive Computing

Chapter 1

Data-Aware Distributed Computing 1

Esma Yildirim, State University of New York at Buffalo (SUNY), USA

Mehmet Balman, Lawrence Berkeley National Laboratory, USA

Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become "data-aware." In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities with the other layers in distributed computing, such as workflow planning; and further optimization of data movement tasks via dynamic tuning of underlying protocol transfer parameters.
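
To make the paradigm concrete, the sketch below shows data placement represented as a first-class, retryable job rather than a side effect of computation. This is a minimal illustration under assumed names (TransferJob, ComputeJob, the scheduling loop); it is not the chapter's actual scheduler.

    # Minimal sketch of a data-aware scheduler: data placement is a
    # first-class, retryable job, not a side effect of computation.
    # All class and function names here are illustrative assumptions.
    from collections import deque

    class TransferJob:
        def __init__(self, src, dst, max_retries=3):
            self.src, self.dst, self.max_retries = src, dst, max_retries

        def run(self):
            print(f"transferring {self.src} -> {self.dst}")  # real code would move data

    class ComputeJob:
        def __init__(self, name, needs):
            self.name, self.needs = name, needs  # transfers that must finish first

        def run(self):
            print(f"running {self.name}")

    def schedule(jobs):
        done = set()
        queue = deque(jobs)
        while queue:
            job = queue.popleft()
            if isinstance(job, ComputeJob) and any(t not in done for t in job.needs):
                queue.append(job)           # input not staged yet; requeue
                continue
            for attempt in range(getattr(job, "max_retries", 1)):
                try:
                    job.run()
                    break                   # success: stop retrying
                except OSError:
                    continue                # error recovery: retry the transfer
            done.add(job)

    stage = TransferJob("gridftp://remote/data.dat", "/scratch/data.dat")
    schedule([ComputeJob("analysis", needs=[stage]), stage])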


Chapter 2

Towards Data Intensive Many-Task Computing 28

Ioan Raicu, Illinois Institute of Technology, USA & Argonne National Laboratory, USA

Ian Foster, University of Chicago, USA & Argonne National Laboratory, USA

Yong Zhao, University of Electronic Science and Technology of China, China

Alex Szalay, Johns Hopkins University, USA

Philip Little, University of Notre Dame, USA

Christopher M. Moretti, University of Notre Dame, USA

Amitabh Chaudhary, University of Notre Dame, USA

Douglas Thain, University of Notre Dame, USA

Many-task computing aims to bridge the gap between two computing paradigms: high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
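
As a rough illustration of the data diffusion idea, the simplified sketch below dispatches tasks to workers whose caches already hold the input and evicts with a plain LRU policy. The names and the LRU choice are assumptions for the demo; the chapter's scheduling heuristics and competitive eviction policy are more sophisticated.

    # Simplified sketch of data diffusion: send each task to a worker
    # that already caches its input file, and evict cached files with a
    # plain LRU policy. Illustrative only.
    from collections import OrderedDict

    class Worker:
        def __init__(self, name, cache_slots):
            self.name = name
            self.cache = OrderedDict()     # file -> True, kept in LRU order
            self.cache_slots = cache_slots

        def fetch(self, filename):
            if filename in self.cache:
                self.cache.move_to_end(filename)   # cache hit: refresh LRU order
                return "hit"
            if len(self.cache) >= self.cache_slots:
                self.cache.popitem(last=False)     # evict least recently used
            self.cache[filename] = True            # pull file from shared storage
            return "miss"

    def dispatch(task_file, workers):
        # Prefer a worker that already caches the file (data locality);
        # otherwise fall back to some default (here simply the first).
        for w in workers:
            if task_file in w.cache:
                return w
        return workers[0]

    workers = [Worker("w0", 2), Worker("w1", 2)]
    for f in ["a", "b", "a", "c", "a"]:
        w = dispatch(f, workers)
        print(f, "->", w.name, w.fetch(f))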

Chapter 3

Micro-Services: A Service-Oriented Paradigm for Scalable, Distributed Data Management 74

Arcot Rajasekar, University of North Carolina at Chapel Hill, USA

Mike Wan, University of California at San Diego, USA

Reagan Moore, University of North Carolina at Chapel Hill, USA

Wayne Schroeder, University of California at San Diego, USA

Service-oriented architectures (SOA) enable orchestration of loosely coupled and interoperable functional software units to develop and execute complex but agile applications. Data management on a distributed data grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors define server-side functions, called micro-services, which are orchestrated into conditional workflows for achieving large-scale data management specific to collections of data. Micro-services communicate with each other using parameter exchange, in-memory data structures, a database-based persistent information store, and a network messaging system that uses a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. They discuss the micro-service oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for implementing policy-based data management systems.
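
The chaining-with-recovery idea can be sketched as follows: each micro-service is paired with a recovery micro-service, and a failure rolls back the steps already executed, in reverse order. All names and behavior here are illustrative assumptions, not the chapter's rule engine or rule language.

    # Sketch of a rule engine chaining micro-services, each paired with
    # a recovery micro-service that undoes its effect if a later step
    # fails. Hypothetical names; not a real rule engine.
    def check_quota(ctx):    ctx["checked"] = True
    def copy_object(ctx):    ctx["copied"] = True
    def update_catalog(ctx): raise RuntimeError("catalog unreachable")

    def undo_copy(ctx):      ctx["copied"] = False   # recovery micro-service
    def noop(ctx):           pass

    WORKFLOW = [  # (micro-service, recovery micro-service)
        (check_quota, noop),
        (copy_object, undo_copy),
        (update_catalog, noop),
    ]

    def execute(workflow, ctx):
        done = []
        for service, recovery in workflow:
            try:
                service(ctx)
                done.append(recovery)
            except RuntimeError:
                for rec in reversed(done):   # transactional rollback
                    rec(ctx)
                return False
        return True

    ctx = {}
    print("committed:", execute(WORKFLOW, ctx), ctx)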


Section 2

Distributed Storage

Chapter 4

Distributed Storage Systems for Data Intensive Computing 95

Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, USA

Ali R. Butt, Virginia Polytechnic Institute and State University, USA

Xiaosong Ma, North Carolina State University, USA

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and is not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate the I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing the challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

Chapter 5

Metadata Management in PetaShare Distributed Storage Network 118

Ismail Akturk, Bilkent University, Turkey

Xinqi Wang, Louisiana State University, USA

Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation's education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides lightweight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high-level cross-domain metadata schema to provide a structured, systematic view of the multiple science domains supported by PetaShare.
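
A toy sketch of asynchronous multi-master replication follows: every site accepts writes locally and ships them to its peers in the background, using a timestamp for last-writer-wins reconciliation. The site names, timestamping, and conflict rule are assumptions for illustration; PetaShare's metadata system is far more complete.

    # Toy sketch of asynchronous multi-master metadata replication.
    # Each site accepts writes locally and propagates them to peers
    # later; conflicts resolve by last writer wins. Illustrative only.
    import itertools

    clock = itertools.count()  # stand-in for a global timestamp source

    class MetadataSite:
        def __init__(self, name):
            self.name, self.store, self.outbox, self.peers = name, {}, [], []

        def put(self, key, value):
            stamp = next(clock)
            self.store[key] = (stamp, value)    # local write succeeds immediately
            self.outbox.append((key, stamp, value))

        def replicate(self):
            # Asynchronous step: push queued writes to all peer masters.
            for key, stamp, value in self.outbox:
                for peer in self.peers:
                    if key not in peer.store or peer.store[key][0] < stamp:
                        peer.store[key] = (stamp, value)   # last writer wins
            self.outbox.clear()

    a, b = MetadataSite("site-a"), MetadataSite("site-b")
    a.peers, b.peers = [b], [a]
    a.put("/petashare/run42", "size=2TB")
    b.put("/petashare/run42", "size=3TB")   # concurrent write at another master
    a.replicate(); b.replicate()
    print(a.store, b.store)                 # both converge to the later write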


Chapter 6

Data Intensive Computing with Clustered Chirp Servers 140

Douglas Thain, University of Notre Dame, USA

Michael Albrecht, University of Notre Dame, USA

Hoang Bui, University of Notre Dame, USA

Peter Bui, University of Notre Dame, USA

Rory Carmichael, University of Notre Dame, USA

Scott Emrich, University of Notre Dame, USA

Patrick Flynn, University of Notre Dame, USA

Over the last few decades, computing performance, memory capacity, and disk storage have all increased by many orders of magnitude. However, I/O performance has not increased at nearly the same pace: a disk arm movement is still measured in milliseconds, and disk I/O throughput is still measured in megabytes per second. If one wishes to build computer systems that can store and process petabytes of data, they must have large numbers of disks and the corresponding I/O paths and memory capacity to support the desired data rate. A cost-efficient way to accomplish this is by clustering large numbers of commodity machines together. This chapter presents Chirp as a building block for clustered data intensive scientific computing. Chirp was originally designed as a lightweight file server for grid computing and was used as a "personal" file server. The authors explore building systems with very high I/O capacity using commodity storage devices by tying together multiple Chirp servers. Several real-life applications, such as the GRAND Data Analysis Grid, the Biometrics Research Grid, and the Biocompute Facility, use Chirp as their fundamental building block, but provide different services and interfaces appropriate to their target communities.
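
Why tying servers together multiplies I/O capacity can be sketched with file striping: blocks placed round-robin across several servers let a large read proceed from many disks at once. This is a generic illustration, not Chirp's actual protocol or data layout.

    # Rough sketch of aggregating I/O capacity by striping a file's
    # blocks round-robin across commodity file servers, so a large read
    # is served by many disks in parallel. Not Chirp's protocol.
    BLOCK_SIZE = 4            # tiny for the demo; real systems use MBs

    class FileServer:
        def __init__(self, name):
            self.name, self.blocks = name, {}

    def stripe_write(servers, path, data):
        for i in range(0, len(data), BLOCK_SIZE):
            server = servers[(i // BLOCK_SIZE) % len(servers)]  # round-robin
            server.blocks[(path, i // BLOCK_SIZE)] = data[i:i + BLOCK_SIZE]

    def stripe_read(servers, path):
        chunks, idx = [], 0
        while True:
            server = servers[idx % len(servers)]
            block = server.blocks.get((path, idx))
            if block is None:
                return b"".join(chunks)
            chunks.append(block)   # in a real cluster these reads run in parallel
            idx += 1

    cluster = [FileServer(f"server{i}") for i in range(3)]
    stripe_write(cluster, "/genome.dat", b"ACGTACGTACGTACGT")
    print(stripe_read(cluster, "/genome.dat"))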

Section 3

Data & Workflow Management

Chapter 7

A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows 156

Suraj Pandey, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia

Rajkumar Buyya, The University of Melbourne, Australia

This chapter presents a comprehensive survey of algorithms, techniques, and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated execution. This chapter focuses on the type of workflows that have tasks processing huge amounts of data, usually ranging from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources for optimizing various objectives: minimizing the total makespan of the workflow, minimizing the cost and usage of network bandwidth, minimizing the cost of computation and storage, meeting the deadline of the application, and so forth. This chapter lists and describes the techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding the working of Grid systems, providing insights on performance optimization of scientific applications dealing with data-intensive workloads.
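
One of the listed objectives, minimizing the total makespan, can be made concrete with a small calculation: the makespan of a workflow is the latest finish time over all tasks, where each task starts only after its predecessors finish. The task names and runtimes below are made up; this generic sketch is not tied to any surveyed system.

    # Generic sketch: the makespan of a workflow DAG given per-task
    # runtimes, assuming each task starts as soon as its predecessors
    # finish (unlimited resources). Task names and times are made up.
    import functools

    runtime = {"stage_in": 5, "align": 20, "sort": 8, "stage_out": 4}
    deps = {"stage_in": [], "align": ["stage_in"],
            "sort": ["align"], "stage_out": ["sort"]}

    @functools.lru_cache(maxsize=None)
    def finish_time(task):
        # A task starts once all of its predecessors have finished.
        start = max((finish_time(d) for d in deps[task]), default=0)
        return start + runtime[task]

    makespan = max(finish_time(t) for t in runtime)
    print("makespan:", makespan)   # 5 + 20 + 8 + 4 = 37 time units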


Chapter 8

Data Management in Scientific Workflows 177

Ewa Deelman, University of Southern California, USA

Ann Chervenak, University of Southern California, USA

Scientific applications such as those in astronomy, earthquake science, gravitational-wave physics, and others have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes a workflow lifecycle as consisting of a workflow generation phase, where the analysis is defined; a workflow planning phase, where the resources needed for execution are selected; a workflow execution phase, where the actual computations take place; and a result, metadata, and provenance storing phase. The authors discuss the issues related to data management at each step of the workflow lifecycle. They describe challenge problems and illustrate them in the context of real-life applications. They discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.

Chapter 9

Replica Management in Data Intensive Distributed Science Applications 188

Ann L. Chervenak, University of Southern California, USA

Robert Schuler, University of Southern California, USA

Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss data replication solutions that meet the challenges associated with increasingly large data sets and the requirement to run data analysis at geographically distributed sites.
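
The core catalog idea can be sketched in a few lines: a replica catalog maps a logical file name to the physical locations holding copies, and a client picks one to read from. This is a minimal illustration with made-up URLs, not the API of any real catalog service.

    # Minimal sketch of a replica catalog: a logical file name maps to
    # the set of physical locations holding replicas. Illustrative
    # only; not a real catalog service's API.
    replica_catalog = {}   # logical name -> set of physical URLs

    def register(lfn, pfn):
        replica_catalog.setdefault(lfn, set()).add(pfn)

    def lookup(lfn):
        return sorted(replica_catalog.get(lfn, set()))

    def unregister(lfn, pfn):
        replica_catalog.get(lfn, set()).discard(pfn)

    register("lfn://experiment/run1.root", "gsiftp://site-a.example/data/run1.root")
    register("lfn://experiment/run1.root", "gsiftp://site-b.example/data/run1.root")
    print(lookup("lfn://experiment/run1.root"))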


Section 4

Data Discovery & Visualization

Chapter 10

Data Intensive Computing for Bioinformatics 207

Judy Qiu, Indiana University - Bloomington, USA

Jaliya Ekanayake, Indiana University - Bloomington, USA

Thilina Gunarathne, Indiana University - Bloomington, USA

Jong Youl Choi, Indiana University - Bloomington, USA

Seung-Hee Bae, Indiana University - Bloomington, USA

Yang Ruan, Indiana University - Bloomington, USA

Saliya Ekanayake, Indiana University - Bloomington, USA

Stephen Wu, Indiana University - Bloomington, USA

Scott Beason, Computer Sciences Corporation, USA

Geoffrey Fox, Indiana University - Bloomington, USA

Mina Rho, Indiana University - Bloomington, USA

Haixu Tang, Indiana University - Bloomington, USA

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in the Life Sciences. Recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing, leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.
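
The Iterative MapReduce pattern can be illustrated with a toy k-means loop: each iteration is one map/reduce round whose reduced output seeds the next round. This single-process sketch with made-up data only shows the control flow; the runtimes the authors have in mind distribute both steps and keep static data cached across iterations.

    # Toy sketch of Iterative MapReduce: map assigns points to the
    # nearest centroid, reduce averages each cluster, and the reduced
    # output feeds the next iteration. Single process; illustrative.
    points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
    centroids = [0.0, 10.0]

    for iteration in range(10):
        # Map: emit (nearest_centroid_index, point) pairs.
        groups = {}
        for p in points:
            k = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups.setdefault(k, []).append(p)
        # Reduce: new centroid = mean of the points assigned to it.
        new_centroids = [sum(g) / len(g) for k, g in sorted(groups.items())]
        if new_centroids == centroids:
            break                       # converged: fixed point reached
        centroids = new_centroids

    print("centroids:", centroids)      # roughly [1.0, 8.07]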

Chapter 11

Visualization of Large-Scale Distributed Data 242

Jason Leigh, University of Illinois at Chicago, USA

Andrew Johnson, University of Illinois at Chicago, USA

Luc Renambot, University of Illinois at Chicago, USA

Venkatram Vishwanath, University of Illinois at Chicago, USA & Argonne National Laboratory, USA

Tom Peterka, Argonne National Laboratory, USA

Nicholas Schwarz, Northwestern University, USA

An effective visualization is best achieved through the creation of a proper representation of data and the interactive manipulation and querying of the visualization. Large-scale data visualization is particularly challenging because the size of the data is several orders of magnitude larger than what can be managed on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure, such as the National LambdaRail, Internet2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historic motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain, through case studies, approaches for scaling them by leveraging distributed computing. Lastly, they describe advanced display technologies that are now considered the "lenses" for examining large-scale data.
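
The pipeline the chapter dissects can be sketched as a chain of stages, each of which may run on a different machine. The stage functions and toy data below are schematic assumptions, not any specific system's components.

    # Schematic sketch of a distributed visualization pipeline: filter
    # the raw data, map it to drawable primitives, render, then
    # composite the per-node results. All names are made up.
    def filter_stage(raw):
        return [v for v in raw if v > 0.5]             # e.g. isovalue selection

    def map_stage(values):
        return [("point", v) for v in values]          # data -> geometric primitives

    def render_stage(primitives):
        return f"image({len(primitives)} primitives)"  # stand-in for rasterization

    def composite_stage(partial_images):
        return " + ".join(partial_images)              # merge per-node renderings

    raw = [0.1, 0.7, 0.9, 0.4, 0.6]
    # In a distributed setting, each half might be rendered on a different node.
    halves = [raw[:3], raw[3:]]
    partials = [render_stage(map_stage(filter_stage(h))) for h in halves]
    print(composite_stage(partials))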

Chapter 12

On-Demand Visualization on Scalable Shared Infrastructure 275

Huadong Liu, University of Tennessee, USA

Jinzhu Gao, University of the Pacific, USA

Jian Huang, University of Tennessee, USA

Micah Beck, University of Tennessee, USA

Terry Moore, University of Tennessee, USA

The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which serves computational scientists who want adequate visualization services provided to them on demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems; they are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting-edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that achieve fault-tolerance, load-balancing, optimal use of resources, and flow-control using system-level back-off, while still enforcing deadline-driven (i.e., time-critical) visualization.
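
Deadline-driven scheduling with back-off can be sketched as earliest-deadline-first service that sheds a request outright when its deadline can no longer be met, rather than degrading every user. The request tuples and costs below are made-up assumptions, not the authors' scheduler.

    # Minimal sketch of deadline-driven (earliest-deadline-first)
    # scheduling with back-off: reject a request whose deadline cannot
    # be met instead of letting it degrade the whole system.
    import heapq

    def schedule(requests, now=0):
        # requests: (deadline, name, cost); EDF = pop smallest deadline.
        heap = list(requests)
        heapq.heapify(heap)
        completed, rejected = [], []
        while heap:
            deadline, name, cost = heapq.heappop(heap)
            if now + cost > deadline:
                rejected.append(name)   # back-off: shed unmeetable work
                continue
            now += cost                 # run the request to completion
            completed.append(name)
        return completed, rejected

    reqs = [(10, "viewA", 4), (6, "viewB", 3), (7, "viewC", 5)]
    print(schedule(reqs))               # viewB and viewA finish; viewC is shed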

Compilation of References 291

About the Contributors 319

Index 331