Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management
Tevfik Kosar
State University of New York at Buffalo (SUNY), USA
Information Science Reference
Detailed Table of Contents
Preface xiii
Section 1
New Paradigms in Data Intensive Computing
Chapter 1
Data-Aware Distributed Computing 1
Esma Yildirim, State University of New York at Buffalo (SUNY), USA
Mehmet Balman, Lawrence Berkeley National Laboratory, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA
With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become "data-aware." In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities into the other layers in distributed computing, such as workflow planning; and further optimization of data movement tasks via dynamic tuning of underlying protocol transfer parameters.
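The abstract above describes data placement as a first-class, schedulable job rather than a side effect of computation. A minimal sketch of that idea follows; the class, job names, and method signatures are illustrative inventions, not the chapter's actual scheduler API:

```python
from collections import deque

class DataAwareScheduler:
    """Toy scheduler in which data-transfer jobs are queued and
    dispatched like compute jobs, so a transfer can be ordered
    ahead of the computation that depends on it."""

    def __init__(self):
        self.queue = deque()
        self.log = []

    def submit(self, job_id, kind, depends_on=None):
        # kind is "transfer" or "compute"; a compute job may declare
        # a dependency on a transfer job having finished first.
        self.queue.append({"id": job_id, "kind": kind, "depends_on": depends_on})

    def run(self):
        # Assumes every declared dependency is eventually submitted.
        done = set()
        while self.queue:
            job = self.queue.popleft()
            dep = job["depends_on"]
            if dep is not None and dep not in done:
                self.queue.append(job)  # dependency unmet: requeue
                continue
            self.log.append((job["kind"], job["id"]))
            done.add(job["id"])
        return self.log

sched = DataAwareScheduler()
sched.submit("analyze", "compute", depends_on="stage_in")
sched.submit("stage_in", "transfer")
order = sched.run()  # the transfer runs first despite being queued second
```

A real data-aware scheduler would add planning, resource reservation, and error recovery around this core queueing loop, as the chapter describes.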
Chapter 2
Towards Data Intensive Many-Task Computing 28
Ioan Raicu, Illinois Institute of Technology, USA & Argonne National Laboratory, USA
Ian Foster, University of Chicago, USA & Argonne National Laboratory, USA
Yong Zhao, University of Electronic Science and Technology of China, China
Alex Szalay, Johns Hopkins University, USA
Philip Little, University of Notre Dame, USA
Christopher M. Moretti, University of Notre Dame, USA
Amitabh Chaudhary, University of Notre Dame, USA
Douglas Thain, University of Notre Dame, USA
Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, under both static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
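The chapter's competitive online caching eviction policy is its own contribution; as a stand-in, the general shape of such a policy can be illustrated with classic least-recently-used (LRU) eviction, a standard online policy with a known competitive ratio:

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used eviction over a fixed-capacity cache of
    data objects, as a node in a data-diffusion system might keep."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.evictions = []

    def access(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # refresh recency on a hit
            return True
        if len(self.items) >= self.capacity:
            victim, _ = self.items.popitem(last=False)  # evict oldest
            self.evictions.append(victim)
        self.items[key] = True
        return False  # cache miss

cache = LRUCache(2)
hits = [cache.access(k) for k in ["a", "b", "a", "c", "b"]]
# hits: miss, miss, hit, miss (evicts "b"), miss (evicts "a")
```

In data diffusion, such per-node caches cooperate with the scheduler so that tasks are sent to the nodes already holding their input data.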
Chapter 3
Micro-Services: A Service-Oriented Paradigm for Scalable, Distributed Data Management 74
Arcot Rajasekar, University of North Carolina at Chapel Hill, USA
Mike Wan, University of California at San Diego, USA
Reagan Moore, University of North Carolina at Chapel Hill, USA
Wayne Schroeder, University of California at San Diego, USA
Service-oriented architectures (SOA) enable orchestration of loosely-coupled and interoperable functional software units to develop and execute complex but agile applications. Data management on a distributed data grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors define server-side functions, called micro-services, which are orchestrated into conditional workflows for achieving large-scale data management specific to collections of data. Micro-services communicate with each other using parameter exchange, in-memory data structures, a database-based persistent information store, and a network messaging system that uses a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. The authors discuss the micro-service-oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for implementing policy-based data management systems.
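The pattern of micro-services chained into a conditional workflow, with a recovery micro-service restoring transactional behaviour on failure, can be sketched as follows; all function names and the rule format are invented for illustration, and the chapter's actual rule engine uses its own rule language:

```python
def checksum(ctx):
    # Stand-in micro-service: record a fingerprint of the data.
    ctx["checksum"] = f"len:{len(ctx['data'])}"
    return True

def replicate(ctx):
    # Stand-in micro-service: register two replica sites.
    ctx["replicas"] = ["siteA", "siteB"]
    return True

def quarantine(ctx):
    # Recovery micro-service: runs only if a step in the chain fails.
    ctx["quarantined"] = True
    return True

def run_rule(steps, recovery, ctx):
    """Toy rule engine: execute micro-services in order; on the first
    failure, invoke the recovery micro-service and stop."""
    for step in steps:
        if not step(ctx):
            recovery(ctx)
            return False
    return True

ctx = {"data": "astronomy-image-bytes"}
ok = run_rule([checksum, replicate], quarantine, ctx)
```

The shared `ctx` dictionary plays the role of the parameter-exchange and in-memory structures through which real micro-services communicate.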
Section 2
Distributed Storage
Chapter 4
Distributed Storage Systems for Data Intensive Computing 95
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, USA
Ali R. Butt, Virginia Polytechnic Institute and State University, USA
Xiaosong Ma, North Carolina State University, USA
In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and is not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate the I/O bandwidth bottleneck, mask failures, and improve data availability. The authors present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing the challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.
Chapter 5
Metadata Management in PetaShare Distributed Storage Network 118
Ismail Akturk, Bilkent University, Turkey
Xinqi Wang, Louisiana State University, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA
The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation's education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high-level cross-domain metadata schema to provide a structured, systematic view of multiple science domains supported by PetaShare.
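An asynchronously replicated multi-master metadata system of the kind described above can be sketched in miniature. The conflict-resolution policy shown here, last-writer-wins by timestamp, is an illustrative assumption, not necessarily PetaShare's actual mechanism:

```python
class MetadataServer:
    """Toy multi-master metadata node: writes are accepted locally
    and pushed to peers asynchronously; on conflict, the write with
    the newer timestamp wins."""

    def __init__(self, name):
        self.name = name
        self.store = {}    # key -> (timestamp, value)
        self.outbox = []   # updates not yet replicated to peers

    def put(self, key, value, ts):
        self.store[key] = (ts, value)
        self.outbox.append((key, ts, value))

    def sync_to(self, peer):
        # Asynchronous replication: apply queued updates on the peer,
        # keeping whichever write carries the newer timestamp.
        for key, ts, value in self.outbox:
            current = peer.store.get(key)
            if current is None or current[0] < ts:
                peer.store[key] = (ts, value)
        self.outbox.clear()

a, b = MetadataServer("A"), MetadataServer("B")
a.put("/petashare/file1", "siteA", ts=1)
b.put("/petashare/file1", "siteB", ts=2)
a.sync_to(b)  # B keeps its own newer write
b.sync_to(a)  # A adopts B's newer write
```

Because every master accepts writes locally and reconciles later, the system stays available even when sites are temporarily disconnected, which is the reliability property the chapter emphasizes.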
Chapter 6
Data Intensive Computing with Clustered Chirp Servers 140
Douglas Thain, University of Notre Dame, USA
Michael Albrecht, University of Notre Dame, USA
Hoang Bui, University of Notre Dame, USA
Peter Bui, University of Notre Dame, USA
Rory Carmichael, University of Notre Dame, USA
Scott Emrich, University of Notre Dame, USA
Patrick Flynn, University of Notre Dame, USA
Over the last few decades, computing performance, memory capacity, and disk storage have all increased by many orders of magnitude. However, I/O performance has not increased at nearly the same pace: a disk arm movement is still measured in milliseconds, and disk I/O throughput is still measured in megabytes per second. If one wishes to build computer systems that can store and process petabytes of data, they must have large numbers of disks and the corresponding I/O paths and memory capacity to support the desired data rate. A cost-efficient way to accomplish this is by clustering large numbers of commodity machines together. This chapter presents Chirp as a building block for clustered data intensive scientific computing. Chirp was originally designed as a lightweight file server for grid computing and was used as a "personal" file server. The authors explore building systems with very high I/O capacity using commodity storage devices by tying together multiple Chirp servers. Several real-life applications, such as the GRAND Data Analysis Grid, the Biometrics Research Grid, and the Biocompute Facility, use Chirp as their fundamental building block, but provide different services and interfaces appropriate to their target communities.
Section 3
Data & Workflow Management
Chapter 7
A Survey of Scheduling and Management Techniques for Data-Intensive
Application Workflows 156
Suraj Pandey, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
Rajkumar Buyya, The University of Melbourne, Australia
This chapter presents a comprehensive survey of algorithms, techniques, and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated execution. This chapter focuses on the type of workflows that have tasks processing huge amounts of data, usually ranging from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources to optimize various objectives: minimize the total makespan of the workflow, minimize the cost and usage of network bandwidth, minimize the cost of computation and storage, meet the deadline of the application, and so forth. This chapter lists and describes the techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding the workings of Grid systems, and it provides insights into the performance optimization of scientific applications dealing with data-intensive workloads.
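One of the scheduling objectives listed above, minimizing total makespan, is classically approached with greedy list scheduling. The sketch below shows the longest-processing-time (LPT) heuristic for independent tasks; it is a generic textbook heuristic, not the algorithm of any specific system surveyed in the chapter:

```python
import heapq

def greedy_makespan(task_runtimes, n_resources):
    """LPT list scheduling: sort tasks longest-first and assign each
    to the resource that becomes free earliest. Returns the makespan
    (finish time of the last resource). A heuristic, not optimal."""
    free_at = [0.0] * n_resources
    heapq.heapify(free_at)
    for runtime in sorted(task_runtimes, reverse=True):
        earliest = heapq.heappop(free_at)   # least-loaded resource
        heapq.heappush(free_at, earliest + runtime)
    return max(free_at)

makespan = greedy_makespan([4, 3, 3, 2, 2], n_resources=2)
# Greedy placement yields makespan 8 here (optimal would be 7),
# illustrating why workflow schedulers layer smarter heuristics on top.
```

Real workflow schedulers extend this idea with task dependencies, data-transfer costs, and the monetary and deadline objectives the chapter enumerates.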
Chapter 8
Data Management in Scientific Workflows 177
Ewa Deelman, University of Southern California, USA
Ann Chervenak, University of Southern California, USA
Scientific applications such as those in astronomy, earthquake science, gravitational-wave physics, and others have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes a workflow lifecycle as consisting of a workflow generation phase, where the analysis is defined; a workflow planning phase, where resources needed for execution are selected; a workflow execution phase, where the actual computations take place; and a result, metadata, and provenance storing phase. The authors discuss the issues related to data management at each step of the workflow lifecycle. They describe challenge problems and illustrate them in the context of real-life applications. They discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.
Chapter 9
Replica Management in Data Intensive Distributed Science Applications 188
Ann L. Chervenak, University of Southern California, USA
Robert Schuler, University of Southern California, USA
Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss data replication solutions that meet the challenges associated with increasingly large data sets and the requirement to run data analysis at geographically distributed sites.
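The replica catalogs that early strategies focused on map each logical file name to the set of physical locations holding a copy. A minimal sketch of that core abstraction follows; the API and names are an illustrative simplification, not a specific system's interface:

```python
class ReplicaCatalog:
    """Toy replica catalog: logical file name (LFN) -> set of
    physical file names (PFNs) where replicas live."""

    def __init__(self):
        self.mapping = {}

    def register(self, lfn, pfn):
        self.mapping.setdefault(lfn, set()).add(pfn)

    def unregister(self, lfn, pfn):
        self.mapping.get(lfn, set()).discard(pfn)

    def lookup(self, lfn):
        # Sorted for deterministic output; a real catalog might rank
        # replicas by proximity or load instead.
        return sorted(self.mapping.get(lfn, set()))

cat = ReplicaCatalog()
cat.register("lfn://cms/run42.dat", "gsiftp://siteA/store/run42.dat")
cat.register("lfn://cms/run42.dat", "gsiftp://siteB/data/run42.dat")
locations = cat.lookup("lfn://cms/run42.dat")
```

The policy-driven schemes the chapter describes sit above such a catalog, deciding when and where new replicas should be created or retired.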
Section 4
Data Discovery & Visualization
Chapter 10
Data Intensive Computing for Bioinformatics 207
Judy Qiu, Indiana University - Bloomington, USA
Jaliya Ekanayake, Indiana University - Bloomington, USA
Thilina Gunarathne, Indiana University - Bloomington, USA
Jong Youl Choi, Indiana University - Bloomington, USA
Seung-Hee Bae, Indiana University - Bloomington, USA
Yang Ruan, Indiana University - Bloomington, USA
Saliya Ekanayake, Indiana University - Bloomington, USA
Stephen Wu, Indiana University - Bloomington, USA
Scott Beason, Computer Sciences Corporation, USA
Geoffrey Fox, Indiana University - Bloomington, USA
Mina Rho, Indiana University - Bloomington, USA
Haixu Tang, Indiana University - Bloomington, USA
Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes, including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in the Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing, leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest-size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.
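As a concrete illustration of the MapReduce style applied to sequence data, a toy k-mer counting job can be written as explicit map and reduce phases. This is a generic illustration of the programming model, not code from the chapter:

```python
from collections import defaultdict

def map_kmers(sequence, k=3):
    # Map phase: emit (k-mer, 1) pairs from one input sequence.
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]

def reduce_counts(pairs):
    # Reduce phase: sum the emitted counts per key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

sequences = ["GATTACA", "ATTAC"]
pairs = [pair for seq in sequences for pair in map_kmers(seq)]
kmer_counts = reduce_counts(pairs)
# e.g. "ATT" appears once in each sequence, so kmer_counts["ATT"] == 2
```

In a real MapReduce runtime the map calls run in parallel across many machines and a shuffle stage groups the pairs by key before reduction; Iterative MapReduce, as the chapter suggests, additionally keeps data resident across repeated map/reduce rounds.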
Chapter 11
Visualization of Large-Scale Distributed Data 242
Jason Leigh, University of Illinois at Chicago, USA
Andrew Johnson, University of Illinois at Chicago, USA
Luc Renambot, University of Illinois at Chicago, USA
Venkatram Vishwanath, University of Illinois at Chicago, USA & Argonne National Laboratory, USA
Tom Peterka, Argonne National Laboratory, USA
Nicholas Schwarz, Northwestern University, USA
An effective visualization is best achieved through the creation of a proper representation of data and the interactive manipulation and querying of the visualization. Large-scale data visualization is particularly challenging because the size of the data is several orders of magnitude larger than what can be managed on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure such as the National LambdaRail, Internet-2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historic motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain, through case studies, approaches for scaling them by leveraging distributed computing. Lastly, they describe advanced display technologies that are now considered the "lenses" for examining large-scale data.
Chapter 12
On-Demand Visualization on Scalable Shared Infrastructure 275
Huadong Liu, University of Tennessee, USA
Jinzhu Gao, University of The Pacific, USA
Man Huang, University of Tennessee, USA
Micah Beck, University of Tennessee, USA
Terry Moore, University of Tennessee, USA
The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which serves computational scientists who want adequate visualization services provided to them on demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems. They are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting-edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each user focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that achieve fault tolerance, load balancing, optimal use of resources, and flow control using system-level back-off, while still enforcing deadline-driven (i.e., time-critical) visualization.
Compilation of References 291
About the Contributors 319
Index 331