Scalable Collaborative Caching and Storage Platform for
Data Analytics
by
Timur Malgazhdarov
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2018 by Timur Malgazhdarov
Abstract
Scalable Collaborative Caching and Storage Platform for Data Analytics
Timur Malgazhdarov
Master of Applied Science
Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2018
The emerging Big Data ecosystem has brought about a dramatic proliferation of paradigms
for analytics. In the race for the best performance, each new engine enforces tight
coupling of analytics execution with caching and storage functionalities. This one-for-all
approach has led either to oversimplifications, where traditional functionality was dropped,
or to additional configuration options that created more confusion about optimal settings. We
avoid user confusion by following an integrated multi-service approach where we assign
responsibilities to decoupled services. In our solution, called Gluon, we build a
collaborative cache tier that connects state-of-the-art analytics engines with a variety of storage
systems. We use both open-source and proprietary technologies to implement our archi-
tecture. We show that Gluon caching can achieve 2.5x-3x speedup when compared to
uncustomized Spark caching while displaying higher resource utilization efficiency. Fi-
nally, we show how Gluon can integrate traditional storage back-ends without significant
performance loss when compared to vanilla analytics setups.
Acknowledgements
I would like to thank my supervisor, Professor Cristiana Amza, for her knowledge,
guidance and support. It was my privilege and honor to work under Professor Amza’s
supervision.
I would also like to thank my examination committee members: Professor Eyal de
Lara, Professor Ashvin Goel, and Professor Ashish Khisti for their valuable comments
and feedback. I am truly grateful to my colleagues and lab mates: Dr. Stelios Sotiriadis,
Seyed Ali Jokar, and Arnamoy Bhattacharyya for their knowledge, help, and support.
Last but not least, I would like to thank my family, especially my mother Nurgul
Yessetova for her understanding, love, and support.
Contents
Acknowledgements

Contents

1 Introduction

2 Background
2.1 Analytics Engines
2.1.1 Hadoop MapReduce (HMR)
2.1.2 Spark
2.1.3 Specialized Graph Processing
2.2 Resource Managers
2.3 Storage platforms
2.3.1 Network-attached storage
2.3.2 Storage Area Networks
2.3.3 Distributed systems with direct-attached storage
2.4 Distributed Cache
2.4.1 Alluxio
2.5 Common Solutions
2.5.1 Vanilla Hadoop Solution
2.5.2 Vanilla Spark Solution
2.6 Conclusion

3 Thesis Idea and Design
3.1 Thesis Idea
3.2 Usability Issues
3.2.1 Case Study: Spark
3.2.2 HDFS
3.3 Proposed Design
3.3.1 Collaborative caching layer
3.3.2 Service Decoupling and Modularity
3.3.3 Consolidated Storage Layer
3.4 Summary

4 Implementation
4.1 Components
4.1.1 Alluxio
4.1.2 Server SAN
4.1.3 GFS2
4.1.4 YARN
4.2 Control and Data Flow
4.3 Connecting storage component
4.3.1 Server SAN to filesystem connection
4.4 Connecting GFS2 with Analytics Engines
4.5 Cache integration
4.5.1 GFS2 to Alluxio connection
4.6 Spark integration
4.7 Additional optimizations
4.7.1 Asynchronous Delete
4.7.2 File consistency checker
4.8 Summary

5 Evaluation
5.1 Environment Setup
5.1.1 Benchmarks
5.2 Comparative evaluation using Spark
5.2.1 Spark count
5.2.2 Logistic Regression
5.2.3 PageRank
5.2.4 Gluon job statistics
5.2.5 Discussion
5.3 Comparative evaluation using Hadoop MapReduce
5.3.1 DFSIO
5.3.2 Terasort
5.3.3 PageRank
5.3.4 Discussion
5.4 Graph Processing Framework - Hama
5.5 Conclusion

6 Related Work
6.1 Caching in Analytics
6.2 HPC and shared storage integrations
6.3 Full-stack integrations
6.4 Conclusion

7 Future Work and Final Remarks

Bibliography
Chapter 1
Introduction
Several data analytics paradigms have been recently proposed in order to accommodate
the growing needs of Big Data. Each new paradigm brought with it specialization for
a particular need of data analytics workloads. At the same time, each such specialization
had as a side effect a significant departure from existing data processing paradigms.
From a usability perspective, this trend makes it increasingly difficult to analyse the
trade-offs of existing offerings and determine the appropriate platform support, including
interfaces, environments, settings and configurations for both functionality and optimal
performance. In other words, as many different paradigms have proliferated to facilitate
various data management needs, they have made usability and platform management
and integration itself a growing concern.
For example, the initial MapReduce offerings, such as Apache Hadoop[8], came with
a departure from traditional approaches to data processing. Relational data access typ-
ically used SQL-based interfaces to data maintained by consolidated storage back-ends.
Newer data analytics systems, such as Hadoop, not only introduced a new Java-based
data processing language; they also required that data reside in a distributed fashion,
on compute nodes, which formed a separate data silo for data analytics. Spark[61] came
later with yet another data processing language, Scala, and also with an even more pronounced
decoupling from persistent data storage concerns. Both paradigms imply that
input data and intermediate data are stored in a distributed fashion on new commodity
distributed file systems such as HDFS[52]. Moreover, both Apache Hadoop and Spark
had their own data caching techniques, whose only commonality was the reliance on
data locality and distributed file system principles.
On the other hand, Apache Hama[50] and Giraph[22] have been recently introduced
for better support of graph-based data analytics as compared with Apache Hadoop and
Spark. The BSP[56] data processing paradigm, which they proposed, strays from the
data locality principle used in all former data analytics paradigms. This makes typical
performance enhancements for distributed data analytics, such as network traffic
avoidance and effective caching, difficult or impossible.
In this work, we propose a scalable, unified, caching and storage platform for data
analytics, called Gluon. Our unified platform provides performance, robustness and ease
of use for any data analytics paradigm currently in use with little or no modifications.
Gluon comes with two essential services for integration of platform support for all types
of data analytics.
First, our Gluon caching layer supports global collaborative caching across the
memories of all participating compute (and storage) nodes. Second, Gluon supports
full integration of the collaborative caching service with traditional consolidated storage
back-end services.
With Gluon we emphasize the principle of data locality for in-memory data on any
compute node. At the same time, we take full advantage of fast remote memory access
when opportunities for memory availability in collaborating nodes exist. Such opportu-
nities may be present due to a variety of reasons. For example, compute nodes may be
temporarily idle due to imperfect load balancing, such as that created by fault-induced
stragglers or skewed workloads. Furthermore, unused memory may be available on back-end
storage nodes, which can be leveraged by compute nodes.
Whenever data would be normally evicted from the local in-memory cache on any
compute node, Gluon has the capability to push the data to be evicted to a remote node.
Conversely, Gluon fetches remote in-memory data on-demand from collaborative nodes
upon subsequent local access. We currently opt for disjoint caching of data items in the
collaborative in-memory cache; therefore, upon a remote fetch, the data item is discarded
locally after use.
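The eviction and fetch paths described above can be sketched in a toy model. The names and data structures below are ours, purely illustrative of the policy, not Gluon's actual code: each node keeps a local LRU cache, pushes its eviction victim to a peer with free capacity, and serves a later local miss by fetching the block from the peer and removing the remote copy, so at most one in-memory copy of each block exists.

```python
# Toy model of collaborative, disjoint caching (hypothetical names).
from collections import OrderedDict

class Node:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity          # max blocks held in local memory
        self.cache = OrderedDict()        # block_id -> data, in LRU order
        self.peers = []                   # other nodes in the cluster

    def has_room(self):
        return len(self.cache) < self.capacity

    def put(self, block_id, data):
        if block_id not in self.cache and not self.has_room():
            victim_id, victim_data = self.cache.popitem(last=False)
            for peer in self.peers:       # push the victim to an idle peer
                if peer.has_room():
                    peer.cache[victim_id] = victim_data
                    break                 # if no peer has room, drop it
        self.cache[block_id] = data
        self.cache.move_to_end(block_id)

    def get(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        for peer in self.peers:           # remote fetch on a local miss
            if block_id in peer.cache:
                return peer.cache.pop(block_id)  # disjoint: discarded after use
        return None                       # fall back to the storage tier
```

In the real system, blocks travel over the network and the fallback path reads from the consolidated storage tier rather than returning nothing.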
Next, Gluon brings together the benefits of large scale, on-demand in-memory caching
on one hand, and traditional, highly robust, on-disk data redundancy and archival
schemes on the other hand. Specifically, the global in-memory collaborative cache space
could be on the order of terabytes in total size for a cluster of compute and storage
nodes. However, when the total available cache space is close to exhaustion, we
have the option to proactively start writing out dirty blocks of cache to persistent storage.
If the need to swap out to disk arises, such blocks can subsequently simply be
discarded from the cache instead of being synchronously written out to disk. Asynchronous
disk writes to back-end storage can also effectively support a periodic, transparent
checkpointing service for data analytics objects. Any data item can be checkpointed to
stable back-end storage with RAID-level redundancy by writing it asynchronously, e.g.,
periodically and transparently, with no impact on the ongoing computation.
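The proactive write-back idea can be sketched as a background flusher (names are ours, purely illustrative): dirty blocks are queued and written to back-end storage off the critical path, so a flushed block can later be evicted by simply discarding it rather than by a synchronous disk write.

```python
# Sketch of asynchronous checkpointing: checkpoint() returns immediately,
# a background thread performs the actual write-back.
import threading, queue

class Checkpointer:
    def __init__(self, storage):
        self.storage = storage            # stands in for the back-end tier
        self.dirty = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def _drain(self):
        while True:
            item = self.dirty.get()
            if item is None:              # shutdown sentinel
                break
            block_id, data = item
            self.storage[block_id] = data # write-back to persistent storage

    def checkpoint(self, block_id, data):
        self.dirty.put((block_id, data))  # does not block the computation

    def close(self):
        self.dirty.put(None)
        self.worker.join()
```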
Finally, as mentioned, the seamless integration between caching and consolidated
storage in Gluon means that any updates for any files stored on back-end storage can
be integrated in a new data analytics pass transparently, automatically, on-demand.
This avoids the cumbersome data manipulations that separate on-disk data silos normally
bring about, e.g., for data analytics systems based on HDFS. For example, incremental
additions to log files that were previously processed by the data analytics framework
would normally need to be copied into the separate analytics data silos, possibly by
hand. In contrast, with Gluon, any data block from back-end storage can be brought
into any compute node’s cache, on-demand, at any time.
Figure 1.1 shows the proposed architecture of Gluon. The disk and storage man-
agement is fully outsourced to the consolidated storage layer. Replication, journaling,
data compression and other techniques are performed by the specialized storage software
installed on servers A, B and C. Data is asynchronously checkpointed from the cache
layer to the storage layer. The cache layer nodes share the same view of the storage files
or objects. Data is fetched from the storage layer on-demand. The cache layer manages
a memory pool for all applications running on top of the cache. Cache nodes collaborate
with each other and move data from busy nodes to more idle ones, thus fully utilizing
memory in the pool. Overall, Gluon disaggregates analytics engine components into
specialized, optimally-managed components.

Figure 1.1: Proposed architecture.
We implement our Gluon unified caching layer from a variety of (mostly) open-source
software components: YARN[57], Spark[61], Alluxio[35] (previously
Tachyon), GFS2[34], and the proprietary Huawei FusionStorage. However, our
architecture is modular: many of the existing components can be replaced, and similar
components can be interconnected for easy plug and play.
Gluon is based on a RAM disk and hence currently offers a file API: it can be placed
under any application that supports the Hadoop FileSystem API. The storage tier of
the platform is Server SAN software called FusionStorage. It consolidates all available
disks on the cluster into a storage pool. We designate the Server SAN nodes that
consolidate the disks as SAN worker nodes. From the storage pool we create a large volume
and attach it to the cluster nodes such that all nodes share this volume as one block
device. We designate the nodes that have the volume attached as storage client nodes.
Note that SAN workers and clients reside on different nodes. On top of the Server SAN
we install Global File System 2. GFS2 is installed on the storage client nodes. GFS2 is
a clustered file system that allows for synchronized access to a shared block device. In
our case the shared block device is the SAN volume from the SAN storage pool. The
cache tier is based on open-source Alluxio[35]. Alluxio is an in-memory cache that can
interact with YARN applications (MapReduce, Spark, etc.). We modify Alluxio such that
it has a shared view of GFS2 and can cache files from it, placing them inside
the in-memory cache. We extend Alluxio to support asynchronous writes to GFS2, and we
also extend Spark to allow for a seamless connection to Alluxio.
We show that our Gluon caching layer can be readily used by a variety of data
analytics packages with little or no modification. In our evaluation, we have used and
empirically tested Gluon in conjunction with Spark and Hadoop MapReduce (HMR). We
use real-world data and applications to test Spark and Gluon default configurations. We
also look at the PageRank algorithm and utilize Spark’s GraphX library, which is hardcoded
to cache graph data into Spark memory. We show that for cache-intensive workloads
Gluon outperforms Spark by 2.5x - 3x. We also show that Gluon has the same perfor-
mance as Spark with optimal configuration or over-provisioned RAM sizes. Moreover, we
show Gluon vs. HDFS comparisons in HMR workloads: Terasort and PageRank. Gluon
achieves up to 1.85x speedup in reads of re-used data. In addition, we demonstrate how
manual ingestion of data affects overall performance. Finally, Gluon expedites iterative
HMR jobs by more than 30%.
For Spark, HMR and many other analytics engines, Gluon provides the ease of use,
functionality and opportunities for transparent performance boosting that each of the
current schemes is missing.
For Spark, we found that memory management is actually very brittle. The user
needs to manually specify the appropriate memory allocation to Spark; otherwise
Spark jobs risk crashing. Moreover, out of the available
memory allocation as specified by the user, Spark always has a boundary for the memory
to be used for Spark computation versus the memory to be used for storage caching.
The newly proposed dynamic partitioner cannot solve the usability problem either, because users
still need to choose the memory fraction that can be reclaimed for storage space. With
global collaborative cache management left to Gluon, both the user’s memory concern
and the potential memory waste due to data skews are readily alleviated; moreover, we show
performance boosts for the Spark jobs whenever remote memory availability can be lever-
aged. Finally, in Spark, if a node crashes, then the data objects on that node need to be
recomputed either from scratch or from a user-inserted checkpoint. Gluon adds flexibility
by transparently performing asynchronous checkpointing of objects to stable back-end
storage with no observable overheads for the application.
For Hadoop MapReduce, we found that inter-job data exchange is tightly coupled with
HDFS. Reducers always write to HDFS and the next set of mappers has to read from
disk. Gluon expedites this exchange and asynchronously checkpoints inter-job data in
case the next job fails. Hence, it is best suited for iterative jobs, chains of jobs and high data
re-use jobs.
The next Chapter provides a brief review of popular analytics engines, storage and
cache solutions, and current vanilla deployments. Then, Chapter 3 reviews case studies
of issues that affect analytics in production systems and proposes the new design. In Chapter 4,
we reveal implementation details and go deep into the technicalities of the platform we
developed. We also discuss benchmarks for our evaluation and the stress tests we per-
formed to understand implementation bottlenecks. Chapter 5 introduces the deployment
specifics, configurations and evaluation methodology, followed by result analysis and
discussion. Related work comes next, along with a final chapter that concludes this thesis and
introduces possible directions for future work.
Chapter 2
Background
In this Chapter, we will discuss various analytics engines and storage platforms. In the
first section we will review basic mechanisms behind most popular analytics engines like
Apache MapReduce [19] and Apache Spark[61].
Over the last decade the popularity of Hadoop-related engines has been increasing.
Today, analytics engines are fragmented across different areas of data processing.
For instance, Apache Hama[50] is targeted at large-scale graph processing algorithms.
Another example is Hive[55] that converts SQL queries into a chain of MapReduce jobs.
On top of these engines, the Big Data world has introduced resource managers that are
taking over job scheduling responsibilities from engine-specific schedulers. YARN[57] and
Mesos[23] are the most popular resource managers that can support a variety of applica-
tions including Hadoop MapReduce, Spark and Tez[48].
In the second section, we will cover popular storage platforms that are used in cloud
and enterprise settings. Network-attached storage and storage area networks are still heavily
utilized by enterprises and cloud providers. Object stores such as S3[11] became
extremely popular due to the rise of cloud computing.
We will then recap how existing analytics engines are being deployed. Reviewing
the advantages and disadvantages of different solutions helps us discover various aspects
that play a crucial role when running analytics workloads in production environments.
There have been a number of attempts to consolidate analytics with large-scale storage
services. The majority of proposed architectures either introduced new usability issues or
lacked satisfactory performance. Moreover, the emergence of YARN and Mesos also set
new rules for job scheduling that affected previous integration techniques. Finally, a novel
in-memory cache, Alluxio, opened new frontiers for consolidation mechanisms.
2.1 Analytics Engines
2.1.1 Hadoop MapReduce (HMR)
The core of computing in existing data analytics systems is the algorithm called MapRe-
duce [17]. It is an execution strategy used for processing large data sets. MapReduce
spawns multiple workers in parallel on commodity machines that usually host data being
processed.
The algorithm has two phases: a map phase and a reduce phase (Figure 2.1).

Figure 2.1: A MapReduce example.

During the map phase, task executors or mappers work independently of each other on local input
data. Mappers extract input values from files and typically generate key-value pairs.
After the map phase, intermediate data gets sorted by keys and split into partitions.
These partitions are then shuffled across computing machines, and then the reduce phase
starts execution. Intermediate data will always be stored on local disk if it cannot fit
into memory. This process is called “spilling”. It is important to understand where
intermediate values are being stored. For example, some big data architectures may store
key-value partitions on remote disk storage, and then access them again to shuffle across
the network. This can hurt performance through disk and network bottlenecks.
Reducers start to execute once partitions become available to them; their job is to
reduce the number of keys by performing operations such as aggregate, filter, search etc.
The final results are then stored in the underlying file system, e.g., HDFS.
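The two phases can be illustrated with a self-contained word-count sketch (our own toy code, not HMR's): mappers emit key-value pairs, the pairs are grouped by key in a stand-in for the sort/shuffle step, and reducers aggregate the values for each key.

```python
# Minimal word-count MapReduce: map -> sort/group (shuffle) -> reduce.
from itertools import groupby

def map_phase(document):
    # Each mapper emits a (word, 1) pair per word in its local input.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Sorting by key plays the role of the shuffle; each group is one
    # reducer's input, aggregated here by summation.
    pairs.sort(key=lambda kv: kv[0])
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

def mapreduce(documents):
    intermediate = []
    for doc in documents:              # mappers would run in parallel
        intermediate.extend(map_phase(doc))
    return reduce_phase(intermediate)
```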
The Hadoop MapReduce (HMR)[8] algorithm is one implementation of the MapReduce
paradigm. HMR is designed to work in a bundle with the Hadoop Distributed File System
(HDFS). Typically, HDFS nodes are co-located with nodes that run HMR tasks. The two
systems have to work together to achieve data locality such that an HMR task does not
fetch data from a remote node.
HMR tasks typically interact with HDFS on the initial data load that occurs during the map
phase and the final data write in the reduce phase (the data write can happen in a map phase if there
is no reduce phase). After the map phase, each output record is assigned a partition
id, i.e., the reducer that will process the record. After partition assignment, intermediate
records are collected in a circular memory buffer of each map task. If they occupy more
than 80% of the buffer, they are “spilled” to a local disk. Before the spill, records of
each map task are sorted by partitions and later by keys. All the spills from map tasks of
one particular node get merged into one large file where records are sorted by partitions.
Then records get transferred to their related reducers.
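The partitioning and spill logic described above can be sketched as follows. Names and details are illustrative, not HMR's exact implementation; only the 80% threshold and the sort-by-partition-then-key order mirror the description.

```python
# Sketch of map-side partitioning and spilling (illustrative names).
def partition_for(key, num_reducers):
    # Assigns each record to a target reducer.
    return hash(key) % num_reducers

def collect(records, num_reducers, buffer_size):
    buffer, spills = [], []
    for key, value in records:
        buffer.append((partition_for(key, num_reducers), key, value))
        if len(buffer) > 0.8 * buffer_size:   # spill past 80% occupancy
            spills.append(sorted(buffer))     # by partition, then by key
            buffer = []
    if buffer:
        spills.append(sorted(buffer))         # final spill
    return spills   # per-node spills are later merged into one sorted file
```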
HMR tasks access HDFS through the FileSystem API, an abstract Java class that
defines a set of functions that need to be implemented, including open(), create(),
mkdirs(), getFileStatus(), etc. Implementing the FileSystem API in order to access another
file system allows for organized development of plug-ins.
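The plug-in idea can be shown in miniature. The sketch below is a Python stand-in for the Java FileSystem abstract class, with deliberately simplified signatures: any backend that implements the same small set of operations can be dropped under an engine unchanged.

```python
# Toy plug-in interface modeled on the FileSystem abstraction
# (simplified; real implementations return streams and rich metadata).
from abc import ABC, abstractmethod

class FileSystem(ABC):
    @abstractmethod
    def open(self, path): ...
    @abstractmethod
    def create(self, path, data): ...
    @abstractmethod
    def mkdirs(self, path): ...
    @abstractmethod
    def get_file_status(self, path): ...

class InMemoryFS(FileSystem):
    """A toy backend; a real plug-in would talk to GFS2, S3, etc."""
    def __init__(self):
        self.files, self.dirs = {}, set()

    def open(self, path):
        return self.files[path]

    def create(self, path, data):
        self.files[path] = data

    def mkdirs(self, path):
        self.dirs.add(path)
        return True

    def get_file_status(self, path):
        return {"path": path, "length": len(self.files[path])}
```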
2.1.2 Spark
Like HMR, Spark also has the MapReduce algorithm at its core. However, Spark does not
rely on a rigid map-then-reduce format but rather on a more general directed acyclic
graph (DAG) of operators. Figure 2.2 shows a DAG that describes an application.
This approach makes it possible to avoid writing to disk after each reduce phase and to pass the
computation result down the execution pipeline.

Figure 2.2: Spark application DAG example.

In a way, Spark[61] targets iterative jobs or chains of HMR jobs that are typically bound by I/O bottlenecks. A typical Spark
application has a driver program and many task programs that execute the same code.
Task programs run inside Spark Executors that are just JVMs with pre-defined heap size
and number of cores. Spark processes data in terms of Resilient Distributed Datasets,
or RDDs; these datasets represent data at a particular stage of the application. An RDD is
divided into partitions and distributed across Spark task programs.
Spark defines two types of operations: transformations and actions. Transformations
are lazy operations and thus are computed only when triggered by a following action.
Narrow transformations like map(), filter() and flatMap() do not require a data
shuffle, while wide transformations such as reduceByKey(), groupByKey() and join()
require shuffling and synchronization of data. Actions, such as count(), collect() and
the save operations, trigger execution of the recorded transformations.
Spark allows for in-memory caching. Spark’s Resilient Distributed Dataset (RDD)[60]
can be cached into the memory of an Executor at any point during computation. This
means that future steps of the computation that require the same dataset do not need
to recompute it.
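The interaction between lazy evaluation and caching can be illustrated with a toy dataset class (our own sketch, not Spark's API): transformations only record a lineage of functions, an action forces the whole chain, and without caching every action recomputes it from the original data.

```python
# Toy lazy dataset illustrating lineage, actions and caching.
class LazyDataset:
    def __init__(self, data, lineage=()):
        self.data, self.lineage = data, lineage
        self.cached = None
        self.compute_count = 0            # how often the chain actually ran

    def map(self, fn):                    # transformation: nothing runs yet
        return LazyDataset(self.data, self.lineage + (fn,))

    def cache(self):                      # materialize once, reuse afterwards
        self.cached = self._compute()
        return self

    def _compute(self):
        self.compute_count += 1
        out = self.data
        for fn in self.lineage:           # replay the recorded lineage
            out = [fn(x) for x in out]
        return out

    def collect(self):                    # action: forces evaluation
        return self.cached if self.cached is not None else self._compute()
```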
Despite many advantages of Spark, users need to understand execution mechanisms
of the framework in detail. Moreover, tight coupling of caching may result in interference
with computation memory thus making cache behavior a user problem.
In Spark, every application is equipped with a local memory cache that an application
can use throughout the program execution. The advantage of Spark cache is that data
can be saved in the heap of the executor JVM, thus accessing data from within the
same JVM is very fast. However, the same heap space is used for computation, so
careless use of heap memory can lead to major performance degradation. Therefore,
Spark users need to thoroughly understand how their application data can impact JVM
heap size. This includes the size of the data partition to be re-used, the size of each executor
heap space, Java object serialization and its implication on data size, and more. Hence,
the tight coupling of cache and compute layers in Spark creates usability issues that can
easily lead to misuse of the cache, which in turn leads to a performance drop.
2.1.3 Specialized Graph Processing
Another batch processing engine that has started to compete with MapReduce recently
is Bulk Synchronous Parallel (BSP)[56]. BSP’s key idea lies in message passing. Through
communication between workers BSP can achieve a high level of synchronization. Never-
theless, BSP has its challenges. For example, how do we identify message passing routes,
i.e., which worker should be a sender or a receiver? The good news is that in graph
algorithms we do not need to consider this issue, since the graph structure itself tells us how
communication routes are defined. In principle, all the communication is done through
passing messages to a node’s neighbours and vice versa. Therefore, BSP is a perfect match
for dependency-rich data structures like trees and graphs.
Apache Hama is a BSP framework that is a part of Apache Hadoop ecosystem[50].
Hama was inspired by Google’s BSP-based Pregel[38]. Hama spawns parallel workers
(typically a worker per CPU core); each worker processes incoming messages and prepares a set of
outgoing messages. All messages are routed in a synchronization step after all workers
have finished preparing their outgoing messages. After the synchronization step,
messages are sent out to workers. The time period from processing messages to the end of
synchronization is called a superstep. The job is considered done when there are
no workers that need to send messages, i.e., all outgoing message queues are empty.
Unlike MapReduce, Hama does not store intermediate results. Workers keep
messages in their respective queues, and queues are stored in Java heaps. Therefore, the only
two interactions Hama has with cold storage are the initial data load to workers and the
final data save from workers after the job is complete.
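The superstep loop described above can be sketched as follows (an illustrative model, not Hama's API): every worker processes its inbox and prepares outgoing messages, a synchronization step then routes all messages, and the job ends when all outgoing queues are empty.

```python
# Minimal BSP driver: compute phase, then a synchronization barrier
# that routes messages; one iteration of the loop is a superstep.
def bsp_run(workers, inboxes):
    """workers: {id: fn(inbox) -> list of (dest_id, message)}."""
    supersteps = 0
    while any(inboxes.values()):
        outgoing = []
        for wid, compute in workers.items():      # local compute phase
            outgoing.extend(compute(inboxes[wid]))
            inboxes[wid] = []
        for dest, msg in outgoing:                # synchronization step
            inboxes[dest].append(msg)
        supersteps += 1
    return supersteps
```

For example, forwarding a token along a three-worker chain takes three supersteps: two to pass it along and one final step in which the last worker consumes it.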
2.2 Resource Managers
Currently, there are two major players in resource management for data analytics
engines: YARN[57] and Mesos[23]. YARN is the more popular and older framework.
It allows for fair resource negotiation across a variety of analytics applications. The
center of YARN[57] is the ResourceManager, which is the main authority responsible
for distributing cluster resources among all applications in the system. Each node in
the cluster has a NodeManager that monitors node resources and application activity.
NodeManagers also launch containers for applications. A container is just a definition
of memory and CPU limits per application. The latest Hadoop versions heavily rely
on a capacity scheduler within YARN. This scheduler launches applications based on
their resource requirements (CPU and memory requirements). Each application has an
ApplicationMaster that negotiates resources from the ResourceManager and works with
NodeManagers to execute tasks.
When we talk about YARN, it is paramount to understand how YARN’s default
scheduler works. By default, YARN relies on its capacity scheduler, which assigns jobs based on
the available resources in the cluster. For example, if the tasks of a certain job are in
the queue, the YARN Capacity Scheduler will try to match each task’s resource requirements with
the resources available in the cluster. However, this approach can conflict with another
type of scheduling: data-location-based scheduling.

Mesos was introduced later
than YARN. However, its primary goal is also to allow a large variety of frameworks
to execute seamlessly on the same set of machines. The argument that Mesos
makes is that the data analytics ecosystem is fragmented and users need different engines for
different types of problems. Hence, multi-framework clusters will be commonplace in
the future.
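The capacity-matching step can be illustrated with a first-fit sketch (names and policy are ours, purely illustrative of the matching idea, not YARN's scheduler code): a queued task is launched on the first node whose free memory and cores cover its container request, otherwise it stays in the queue.

```python
# First-fit matching of container requests against node resources.
def schedule(tasks, nodes):
    """tasks: [(name, mem, cores)]; nodes: {node: [free_mem, free_cores]}."""
    placements = {}
    for name, mem, cores in tasks:
        for node, free in nodes.items():
            if free[0] >= mem and free[1] >= cores:
                free[0] -= mem            # reserve the container's resources
                free[1] -= cores
                placements[name] = node
                break
        else:
            placements[name] = None       # no fit: task stays in the queue
    return placements
```

Note what this policy ignores: where the task's input data lives, which is exactly the conflict with data-location-based scheduling described above.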
2.3 Storage platforms
Data stored on hardware disks can have different representations. At the bare metal
level data is stored in disk blocks (e.g. 4KB), hence the name block device or block
storage. A file system can introduce another level of abstraction to a block device and
represent data as a file or a directory to end users. An object storage can represent
block storage data in terms of unique objects. Analytics engines, like most other client
applications, commonly operate on top of files or objects. Distributed storage platforms
may incorporate a file system representation and enforce POSIX-compliance. On the
other hand, some platforms expose virtual block devices and rely on client file systems.
Another set of platforms focuses on co-locating clients with the storage medium on the same
server (direct-attached storage) to provide faster performance.
2.3.1 Network-attached storage
Network-attached storage (NAS) is a platform that separates client programs from the storage
medium and allows for file-based or object-based access to data. NFS is one example of
such systems[49]. NFS has a file server decoupled from client servers. All data is stored
on the file server’s disks, and client servers use a network protocol to access remote files.
There are many other systems with similar architectures. These systems have problems
with scalability and high availability, since all of the data is stored on one node.
Another example is Lustre[27]. Unlike NFS, it is a highly scalable distributed
filesystem that decouples client nodes from storage nodes. Lustre has many storage
nodes that manage their own data without knowledge of other cluster nodes. There
are separate metadata managers that contain a table of all files and their respective
locations. This architecture allows for high scalability of requests, unlike NFS. Ceph
is very similar to Lustre, but it also provides a block storage interface as well as an
object interface[58]. Other examples are cloud-based object stores like S3, OpenStack
Swift and Azure Blob[11][13][14].
2.3.2 Storage Area Networks
A storage area network (SAN) is a consolidation of commodity disks that provides block-level
access[15]. A SAN is made of available block devices that are integrated into a single pool.
Virtual block devices carved out of the pool can then be accessed over the network by
clients. These virtual devices appear as locally attached devices to the OS file system.
SANs can support protocols like iSCSI, FibreChannel and AoE. Unlike NAS, SANs expose
a block device interface and delegate file system concerns to the client side. OS file systems
are mounted on top of the virtual block volumes.
A Server SAN is SAN management software that consolidates all disks on
commodity servers into a single pool of disks[53]. Users can create virtual volumes
from the pool. The volumes can then be attached as new disks to virtual or physical
machines. As in other large management systems, Server SANs typically have multiple
master nodes that control metadata about all disks in the pool(s) and about the virtual
block devices. Slave/agent servers are responsible for managing the disks on their servers
and reporting their state to the master. SAN clients expose virtual volumes to their
respective operating systems. Typically, a Server SAN replicates disk blocks across
multiple disks and servers in a two- or three-way fashion. Server SANs provide data
balancing, data compaction and a variety of recovery mechanisms. A well-known example
of a Server SAN is Amazon EBS. In this project we utilize a similar Server SAN
architecture provided by Huawei Technologies Inc.: the FusionStorage solution[3].
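The pooling and replication behaviour described above can be sketched in a few lines. This is an illustrative model only, not FusionStorage's or EBS's actual API: server disks are flattened into one pool, and each block of a virtual volume is assigned replicas on distinct servers in round-robin fashion.

```python
# Illustrative Server SAN sketch: pool all disks, then place each virtual-volume
# block on `replicas` distinct servers (hypothetical names, not a real API).

def make_pool(disks_per_server):
    """disks_per_server: {server: [disk, ...]} -> flat list of (server, disk)."""
    return [(s, d) for s, disks in disks_per_server.items() for d in disks]

def place_blocks(pool, num_blocks, replicas=3):
    """Assign every block's replicas to `replicas` different servers, round-robin."""
    servers = sorted({s for s, _ in pool})
    assert len(servers) >= replicas, "need at least as many servers as replicas"
    return {b: [servers[(b + r) % len(servers)] for r in range(replicas)]
            for b in range(num_blocks)}
```

Because replicas of one block never land on the same server, the loss of any single server leaves at least two copies of every block, which is the recovery property the three-way scheme is designed for.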
2.3.3 Distributed systems with direct-attached storage
Direct-attached storage (DAS) is digital storage that is directly attached to a server, i.e. a
local disk. In this architecture, data is not sent over the network for storage but remains
on the server. Common single-node file systems such as ext4 and ext3 are mounted on
top of DAS. HDFS, for instance, is the DAS storage layer in the Hadoop framework[52].
It has a master/slave architecture. The NameNode is the master program that stores
and manages the file namespace, file block locations, permissions, access times, etc. It also
regulates access to files by client programs like HMR or Spark. HDFS is designed to
store files as sequences of blocks on the DataNodes. It is usually configured with
3-way replication, where each file block has 3 replicas scattered across the cluster. The
file block size is generally 64 MB. By scattering blocks across the cluster HDFS can scale
out to a great extent.
Whenever the HMR program (which runs in an ApplicationMaster) requires certain input
files, it contacts the NameNode to get the file information, including the locations of the
file blocks. Then it requests containers from the ResourceManager to execute tasks. The
ApplicationMaster passes the "preference nodes" information with the container request.
The preference nodes are those that contain the input file blocks. The ResourceManager
may ignore the preference because of resource unavailability and allocate the
containers on nodes without the required data. In this scenario, data is transferred to
the node where the container is allocated. However, since there are 3 replicas of each
file block, the ResourceManager rarely ignores the ApplicationMaster's preferences.
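The preference-honoring allocation just described can be sketched as follows. The names are illustrative (this is not the ResourceManager's real interface): the AM supplies the nodes holding a block's replicas, and the RM uses one of them if it has a free container, falling back to any available node otherwise.

```python
# Sketch of locality-preference container allocation (hypothetical names).
# free_containers maps node -> number of free containers on that node.

def allocate(preferred_nodes, free_containers):
    """Return (chosen node, locality honored?); None if the cluster is full."""
    for node in preferred_nodes:            # try the nodes holding replicas first
        if free_containers.get(node, 0) > 0:
            free_containers[node] -= 1
            return node, True
    for node, count in free_containers.items():   # locality ignored: any free node
        if count > 0:
            free_containers[node] -= 1
            return node, False
    return None, False
```

With 3 replicas there are three chances for the first loop to succeed, which is why locality is rarely ignored in practice.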
2.4 Distributed Cache
2.4.1 Alluxio
Alluxio is an in-memory cache, though not memory-only: its tiered storage feature
means it can in principle use any storage medium. Because Alluxio exposes a storage
integration layer through an API, applications can access any underlying persistent storage
and file systems. Alluxio can be deployed with any big data framework (Apache Spark,
Apache MapReduce, Apache Flink, Impala, etc.) on many storage systems or file systems
(Alibaba OSS, Amazon S3, EMC, NetApp, OpenStack Swift, Red Hat GlusterFS, and
more).
Alluxio is designed in the context of Hadoop[35]. This means that existing Spark and
MapReduce programs can run on top of Alluxio without any code modifications.
Alluxio's design uses a single master, called the AlluxioMaster, and multiple workers, called
AlluxioWorkers. At a high level, Alluxio can be divided into three components: the master,
the workers, and the clients. The master and workers together form the Alluxio servers,
which are the main components of a typical Alluxio cluster. The clients are generally the
applications, such as Spark or MapReduce jobs.
The master is responsible for managing the global metadata of the system, e.g. the file
system tree. Clients may communicate with the master to read or write this metadata.
Alluxio workers are responsible for managing the local resources allocated to Alluxio. These
resources include local memory, SSD, or hard disk and are user-configurable. Alluxio
workers store data as file blocks and serve requests from clients to read or write data by
reading or creating new file blocks; workers are very similar to HDFS DataNodes. The
worker is only responsible for the data in these file blocks; the actual mapping from file
to file blocks is stored only in the master. The Alluxio client provides users a gateway to
interact with the Alluxio workers. It exposes a cache system API. It initiates communication
with the master to carry out metadata operations and with the workers to read and write
data that exists in Alluxio. Data that exists in the under storage (e.g. HDFS) but is not
available in Alluxio is accessed directly through an under storage client.
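The read path just described can be condensed into a small sketch. This is not Alluxio's real client code; the dict-based master, workers, and under storage are stand-ins for the actual services:

```python
# Minimal sketch of the tiered read path: metadata from the master, data from
# a worker on a cache hit, otherwise fall through to the under storage and
# cache the block on the way back. All structures here are illustrative.

def read_block(block_id, master, workers, under_storage):
    """master: {block_id: worker}; workers: {worker: {block_id: bytes}}."""
    worker = master.get(block_id)
    if worker is not None and block_id in workers[worker]:
        return workers[worker][block_id]        # cache hit: served from a worker
    data = under_storage[block_id]              # miss: read the under storage
    target = next(iter(workers))                # cache it for future reads
    workers[target][block_id] = data
    master[block_id] = target                   # register the new block location
    return data
```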
AlluxioWorkers store file blocks inside directories, just like HDFS DataNodes. The
difference from HDFS is that an AlluxioWorker's directory is mounted as RamFS, i.e. it
resides in the OS page cache.
2.5 Common Solutions
2.5.1 Vanilla Hadoop Solution
Companies that perform regular large-scale data analytics typically deploy Hadoop in a
cluster environment separate from their main data generation and curation engines. For
instance, Taobao, China's third-largest e-commerce site, accumulates logs in a data
warehouse, periodically transferring log data to an analytics silo, i.e. HDFS.
Having another storage system for analytics may incur additional costs. For instance, in
usual Hadoop deployments, data are stored on local node disks and 3-way replication is
employed to ensure reliability. This Hadoop-specific setup increases overall storage
capacity requirements. As a result, companies end up purchasing new hardware for
the sole purpose of running data analytics, resulting in substantial upfront infrastructure
investment and increased management costs. Additionally, data ingestion can take some
time given the size of the data transferred, thus postponing a MapReduce or a Spark
job. Finally, periodic transfers have to be set up, configured and automated, which incurs
additional engineering effort.
On the other hand, once the required data is loaded into HDFS, HMR read performance
is at its optimum. The reason is that each data piece has 3 replicas (by default), so the
probability that locality will be ignored by YARN is much lower. Also, HDFS relies on
Linux file systems like ext3 and ext4, which use the OS buffer cache. Given the
large RAM size on analytics nodes, an HDFS DataNode can keep most of its file blocks
in local memory. In addition, default HDFS settings allow it to write the first replica and
asynchronously propagate the 2 other replicas. With ext3 caching into the OS buffer
during writes, HDFS write performance can approach memory speed. Default
HDFS is fault-tolerant but not quite highly available due to the asynchronous distribution
of copies. To enforce synchronous copying, the dfs.min.replication parameter needs to be
set to the value of the dfs.replication parameter.
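As a concrete illustration, such a setup might look like the hdfs-site.xml fragment below. This is a sketch: the exact property names vary across Hadoop versions, so treat them as the parameters referred to in the text rather than authoritative keys.

```xml
<!-- Illustrative hdfs-site.xml fragment; property names vary by Hadoop version. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <!-- Set equal to dfs.replication to force synchronous replication. -->
    <name>dfs.min.replication</name>
    <value>3</value>
  </property>
</configuration>
```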
2.5.2 Vanilla Spark Solution
Unlike HMR, Spark does not include a native file system. Spark can work with many
storage options like S3, NFS, HDFS, etc. Typical Spark deployments are of 3 types:
standalone, YARN or Mesos. Standalone Spark clusters are deployed for analytics
workloads running only Spark programs, whereas YARN or Mesos deployments allow
jobs from other engines to execute in parallel with Spark jobs.
One of the key differences of Spark is that it can cache MapReduce-style inter-job data,
thereby decreasing the frequency of local disk or remote storage access. Spark is most
suitable for jobs that are iterative in nature or consist of a chain of smaller jobs.
Nevertheless, Spark does not cache data by default; it is up to the user to decide at
which point in the program data partitions need to be cached. The Spark community
provides guidelines for coding techniques that can help achieve optimal performance.
However, it takes experience and knowledge of Spark internals to utilize Spark caching
most efficiently. In addition, failed tasks that do not finish a certain computation will have
to be re-tried and will re-compute the lost partitions. Improper caching and lost partitions
lead to increased job execution time. Moreover, Spark JVMs cache data partitions per user;
therefore, if another user needs to access the same data partition, it will be transferred
from disk or the remote store and cached in another Spark JVM.
2.6 Conclusion
In this Chapter, we outlined concepts and platforms that are essential building blocks of
our consolidated platform. We discussed processing engines such as HMR and Spark, all
of which are used in our final architecture. We also reviewed resource managers, focusing
on YARN, which is paramount in our platform. We described storage concepts in large-scale
systems to show readers that storage tiers can be very different in design. We also
discussed Alluxio, the recently introduced caching tier for the Hadoop ecosystem. Alluxio
helps our platform improve read performance in high data re-use scenarios. Finally,
we presented vanilla (common) Big Data stacks and pointed out possible flaws.
Chapter 3
Thesis Idea and Design
3.1 Thesis Idea
Our goal is to design a consolidated caching and storage architecture that meets the
requirements of data analytics workloads in terms of usability, cost, performance and
fault-tolerance. We propose decoupling caching and storage responsibilities from the
analytics layer and outsourcing them to external independent layers. Toward this end, we
design and implement a scalable collaborative caching tier that connects existing analytics
engines with robustness-oriented storage solutions.
In this chapter, first, we present several case studies that show usability issues in state-
of-the-art analytics engines. Second, we discuss the proposed design of the consolidated
architecture. We cover the collaborative cache, explain service interactions and describe
the consolidated storage layer. We focus on optimizing the collaborative cache such that
our platform achieves good performance and avoids common usability issues. Hence in
this dissertation we make two main contributions: (1) building an integrated caching
and storage platform for data analytics and (2) optimizing data and control flow of
collaborative caching to improve usability, performance and robustness.
Chapter 3. Thesis Idea and Design 20
3.2 Usability Issues
3.2.1 Case Study: Spark
Apache Spark[61] offers caching mechanisms for intermediate data to avoid re-computation
of RDDs when they are re-used. Spark Executors keep computation objects inside the
Spark JVM heap. The same heap is utilized for cached data. This tight coupling of execution
and cache spaces leads to a variety of interface options. However, instead of flexibility, this
diversity comes with rigid constraints and possible confusion for users during configuration
of a Spark application. A variety of options is available to users for improving job
performance. Spark's .cache method uses the Executor heap only as the default
option. Other options include MEMORY_AND_DISK and DISK_ONLY. Users can also
choose whether they want to store raw or serialized data. Spark users need to understand
how much data will be stored in the cache in order to provide enough memory to
Executors. When running Spark on a YARN[57] cluster, the configuration settings become
even trickier. YARN forces applications to run inside Containers. If an application exceeds
its Container limits, YARN will kill the application.
Executor memory falls under two categories in Spark: execution and storage. Execution
memory is used for storing computation-related objects. Storage memory, on the
other hand, is used for caching data. Both execution and storage share a unified region
called M. By default, M is set to 0.75 of the Executor heap, and the storage fraction can
occupy 50% of M. The fraction is configurable and up to the user to choose.
Figure 3.1 shows the DAG of a simple Spark program that a user wants to submit
to a YARN cluster. The program reads 10 GB of graph data from HDFS, extracts the
adjacency list in each line into an array and caches the lists. The listRDD is cached using
the default .cache command. After caching, listRDD is used in two different map functions.
The first computation extracts vertices and the second extracts edges. The
outputs of both maps are saved back to HDFS.
Let us assume that the user does not know about the 0.75 fraction for M and the
compute-storage split of memory, and submits the Spark program to the YARN cluster
with 10 containers of 1 GB each.
Figure 3.1: Spark application to extract graph data.
The job fails with multiple tasks reporting a GC: time limit exceeded exception. After
thorough investigation the user realizes there were only 3 GB of space
available for caching and the garbage collector spent too much time evicting blocks.
Let us now assume that the user knows about the M region and the 50% compute-storage
split. She submits the Spark program to the YARN cluster with a request for 27 containers
of 1 GB each. This results in a total of 27 GB of RAM allocated to 27 Spark
Executors. Each Executor reads an RDD partition of the 10 GB file from HDFS. After the
first map phase the actual data size grows to 15 GB due to object de-serialization and
initial map overheads. Only 10.1 GB of data fit in all the Executors' memory. The remaining
4.9 GB needs to be re-computed from the beginning in the second map function after the
cache. This is an obvious performance loss due to misconfiguration.
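The arithmetic behind this scenario follows directly from the stated defaults (M = 0.75 of the heap, storage = 50% of M); the 10.1 GB and 4.9 GB figures above are these values rounded:

```python
# Storage (cache) capacity under the defaults described in the text:
# unified region M = 0.75 of Executor heap, storage fraction = 0.5 of M.

def storage_capacity_gb(num_executors, heap_gb,
                        m_fraction=0.75, storage_fraction=0.5):
    return num_executors * heap_gb * m_fraction * storage_fraction

cache_gb = storage_capacity_gb(27, 1.0)   # 27 Executors x 1 GB heap -> 10.125 GB
shortfall_gb = 15.0 - cache_gb            # of 15 GB deserialized data, 4.875 GB
                                          # cannot be cached and is re-computed
```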
Let us now assume that the user knows everything about the previous runs. She decides to
submit the Spark program with 27 containers of 1 GB each, but configures
caching to be MEMORY_AND_DISK. The job finishes smoothly and faster than the previous
runs. However, after investigating the Spark UI, the user realizes that 2 Executors did not
use their full storage memory fraction, while 3 other Executors spilled almost 3 GB to disk.
Hence the user realizes that, while this is close to the best performance the program can
achieve with this configuration, it could still be improved.
All the cases above demonstrate how the tight coupling of execution and caching in
Spark can easily result in job failure, performance loss and/or under-utilization of memory
resources. We, therefore, conclude that Spark applications would benefit from an external
collaborative cache that can grow as needed and fully utilize all assigned resources by
evicting data to a remote node or disk on demand.
3.2.2 HDFS
Analytics engines generally have their own storage component (e.g. Hadoop’s HDFS)
that represents a standalone storage system[52]. Having another storage for analytics
may incur additional costs. For instance, in usual Hadoop deployments, data are stored
on local node disks and 3-way replication is employed to ensure reliability. This Hadoop-
specific setup leads to increased disk capacity requirements overall. As a result, customers
end up purchasing new hardware for the sole purpose of running data analytics, resulting
in substantial upfront infrastructure investment, and increased management costs. In
general, HDFS is not used as an enterprise storage, but is widely adopted as data ana-
lytics storage. This leads us to conclude that customers use multiple storage silos: (1)
one silo containing data for transaction processing such as enterprise and web application
processing, with (2) a second silo for analytics. This approach requires users to look for
and deploy mechanisms to periodically transfer data between silos. The emergence of
Apache Flume[24] explains the need for fast data transfer across silos.
HDFS is also considered to be a highly fault-tolerant system. However, it has only
one metadata server, and it is up to the system administrator to make it more available.
On the other hand, existing storage-oriented systems like Lustre, Ceph, Huawei's
FusionStorage and others strive to excel at fault-tolerance and high availability. For instance,
Huawei FusionStorage, in its default configuration, has 3 metadata servers (MDCs) that
are coordinated by a Zookeeper cluster. Furthermore, since storage silos process other
workloads, e.g. webserver traffic, placing the analytics stack on the same set of storage
servers is not a good idea. Therefore, our collaborative cache is decoupled from the storage
servers, i.e. placed on a different set of servers or VMs in the data center. Since propagation
from the caching layer to remote storage disks can be a bottleneck, we propose to
asynchronously propagate data to storage silos. We describe our design in more detail in
the next section.
3.3 Proposed Design
Our conceptual design addresses the usability issues discussed previously. We want to
encourage flexibility in our platform. We design a platform called Gluon that provides
flexible support for the majority of workloads with existing storage systems using
collaborative caching. To achieve that, we leverage open-source commodity compute, cache
and storage solutions. This further supports our usability claim. Our design consists
of two tiers: (1) a data analytics and in-memory collaborative caching tier and (2) a
consolidated storage tier.
As the analytics tier, we propose to integrate any engine that is compatible with the Hadoop
FileSystem API. First, our caching layer supports global collaboration across the memories
of all participating compute (and storage) nodes. The cache is designed to be scalable,
independent of the analytics engines, and to utilize the given resources efficiently.
This cache propagates analytics data to decoupled storage services and fetches data
from them on demand.
Second, Gluon supports full integration of the collaborative caching service with
traditional consolidated storage back-end services. As the storage tier, we propose to integrate
any storage solution that can ensure fault-tolerance, scalability and high availability.
The consolidated storage provides a persistent storage service for all data analytics
needs.
3.3.1 Collaborative caching layer
We propose an in-memory collaborative caching layer interposed between the storage
service and the analytics engines. The cache service provides data locality in our platform.
This helps improve read performance and reduces communication overhead with the remote
storage service. Collaboration between cache nodes can maximize cache utilization.
Analytics workloads can operate on skewed data, where some nodes have to
cache more than others. Collaboration allows us to push extra data from a local
node to remote nodes that have spare CPU cycles and available memory. The data
can also be brought back from the remote node to the local node on demand.
Figure 3.2 demonstrates the proposed architecture of our platform.
Figure 3.2: Proposed architecture.
Co-locating cache nodes with computation nodes is preferred because it allows for the best locality. In our
design, we have one cache manager and multiple cache workers. The cache manager holds
metadata about each worker. Analytics programs connect to the cache layer
using cache clients. The cache client has interfaces to communicate with the manager
and the workers. Each cache worker controls its local resources such as RAM and disk. It also
maintains its data and reports to the cache manager upon change. Cache workers are also
responsible for propagating data to the storage service using the corresponding storage clients.
Since we are co-locating the cache with execution, data caching policies are paramount. We
describe our data caching policies next.
Data movement
Figure 3.3 demonstrates how data is propagated in our collaborative cache.
Figure 3.3: Data movement of task writes in collaborative cache
Tasks can interact with any of the cache workers and are able to write data to any of them. The
policy, however, should always favour local memory first; only when this is depleted does
a task write to remote memory. If all remote nodes' memories are depleted, then a task
needs to wait until any cache worker has successfully evicted blocks to their respective
local disk and has free memory. In the background, data blocks are asynchronously
propagated by cache workers to the remote storage silo.
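The write-placement policy above can be condensed into a short sketch. The structures are illustrative (free-byte counters standing in for real cache workers), not Gluon's actual implementation:

```python
# Sketch of the write policy: favour local memory, then any remote worker's
# memory, else signal the task to wait for an eviction to free space.
# (Hypothetical names; dicts with a 'free' byte counter model the workers.)

def place_write(block_size, local, remotes):
    """Return 'local', a remote worker's name, or 'wait-for-eviction'."""
    if local["free"] >= block_size:
        local["free"] -= block_size
        return "local"
    for name, worker in remotes.items():
        if worker["free"] >= block_size:
            worker["free"] -= block_size
            return name
    return "wait-for-eviction"          # all memories depleted: block until
                                        # a worker evicts blocks to its disk
```

The asynchronous propagation to the storage silo happens in the background regardless of which branch is taken, so the task never waits on the remote storage itself.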
During reads, data is brought from the remote storage silo on demand and cached in
the local memory of the workers. We only cache a block on read when the block is not
currently present in the cache layer, i.e. we avoid block replicas in the cache. Caching
data during reads optimizes the performance of subsequent re-use of the same data set.
This is very helpful for machine learning algorithms such as Logistic Regression, which
require multiple passes over the same set of data.
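The cache-on-miss rule with replica avoidance can be sketched as follows (illustrative structures, not the real worker protocol): a block is cached on read only when no copy already exists anywhere in the cache tier.

```python
# Sketch of the read rule: serve from any worker that holds the block; on a
# miss, fetch from storage and cache exactly one copy (no replicas created).

def read_with_cache(block_id, cache_workers, storage):
    """cache_workers: {worker: {block_id: bytes}}; storage: {block_id: bytes}."""
    for worker_blocks in cache_workers.values():
        if block_id in worker_blocks:
            return worker_blocks[block_id]      # already cached somewhere
    data = storage[block_id]                    # miss: read the storage silo
    local = next(iter(cache_workers))           # cache once, at the reader's node
    cache_workers[local][block_id] = data
    return data
```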
Consistency between workers
In general, in distributed systems, data inconsistencies may arise when replicas of the
same data block are modified from different locations. However, in analytics workloads
tasks write to separate, disjoint files. The HDFS API, for instance, does not
allow file modifications, only creation or appending. This results in each reducer
writing to disjoint files. Therefore, in our cache, although we allow replicas to exist in
rare scenarios, we do not allow concurrent writes to the cached blocks.
Resource sharing
Co-locating the cache layer with execution makes them compete for the same physical
resources, such as RAM and disk. However, we already saw from the case study that
Spark splits memory into dedicated areas for execution and storage, and that the fraction
typically given to storage memory is 0.375 of the heap. We assign this amount to
the cache and set all Spark memory to be compute memory only. Unlike Spark, Hadoop
MapReduce does not use native caching. However, HMR severely suffers from inter-job
data exchange in chained or iterative workloads; hence it can only benefit from a caching
layer that stores inter-job data in memory. Local disks on the analytics layer that are
used for spilling shuffle data can also be shared with the collaborative cache for storing
evicted memory blocks. We also take the memory that would serve as the buffer cache for
the analytics-tier disks and give it to the corresponding cache worker, because disk-related
operations are offloaded to the storage service. However, we may still run out of cache
space quickly. Therefore, the cache layer can grow independently of the compute layer:
some cache nodes are co-located with the compute layer while others can be co-located
with storage or other, more idle services.
As we have previously mentioned, Spark users often need to know how much memory
their partitions occupy. With our independent caching layer, users worry less about
memory management during computation. By offloading data to an external cache service,
users do not have to worry that Spark JVMs will slow down due to memory thrashing and
long GC times.
Connectors
Our architecture requires us to introduce two clients: one for the cache and another for
storage. We integrate the storage client into the cache layer instead of the compute layer.
In our final architecture the Hadoop API connects to the cache layer.
We are required to make changes either in the configuration or in the source code of
analytics engines in order to connect seamlessly to the caching layer. For instance, Spark's
caching mechanism is fine-tuned to store data in the JVM heap or on the node disk. Storing
data in an external service is not implemented in Spark. We implement a new external
service manager in Spark that integrates seamlessly, such that a user just needs to change
one configuration setting.
3.3.2 Service Decoupling and Modularity
Our architecture proposes to decouple computation and caching from storage responsibilities.
We use connectors and client programs to help the decoupled services interact.
In the Hadoop ecosystem, applications interact with HDFS through the FileSystem API.
Application workers connect to HDFS DataNodes through an HDFS client. In our platform
we rely on the Hadoop FileSystem API, because the majority of analytics engines have
already implemented it. We can think of HDFS and the analytics engines as decoupled
services. However, in a vanilla HMR or Spark setup, HDFS is placed on the same nodes as
the analytics engine. We discussed that this placement incurs usability issues.
In our platform, we propose to remove local storage solutions such as HDFS from the
analytics nodes. We place storage solutions onto a different set of nodes that can be located
on different racks. Gluon needs to be modular and able to integrate existing analytics
and storage platforms. Figure 3.4 shows analytics and storage services that can be integrated
in Gluon. Analytics applications that run inside containers assigned by the Resource
Manager connect to the storage service through a client program. For instance, if the storage
layer is HDFS, then the storage clients can be HDFS clients. In this scenario, Spark, HMR or
any other analytics engine would connect to the storage service through the Hadoop FileSystem
API. The storage service is responsible for disk and data management as well as replication.
Decoupling the storage service helps customers deploy new analytics engines in their
system. For instance, if an enterprise stores data from transaction processing in NFS[51],
then deploying an analytics engine on top of NFS just requires installing a storage client.
There is no need for ad-hoc data ingestion into an analytics silo, which provides a significant
improvement in terms of usability and cost.
A storage client is responsible for translating storage service calls from applications.
A majority of analytics engines (e.g. the Apache projects) run inside JVMs and are
implemented in Java or Scala. Therefore, a storage client should be a compiled .jar
executable that runs inside the application JVM.
Figure 3.4: Decoupling Storage and Analytics.
The storage client is a set of functions that
translates Java calls (the Hadoop FileSystem API) into the respective calls of the storage
service. The storage service can represent data in different forms: files, file blocks, objects or
disk blocks. Depending on the data representation, the storage client can be more than just
a Java connector. For instance, HMR, Spark or any other analytics engine cannot access a
block device directly through a SCSI or iSCSI interface, because they all need a file or
object mapping to read/write data. Our platform can support file-based storage services
as well as block-based storage.
We propose to install storage clients on compute nodes. We design a storage client
based on the HDFS client and the storage service's specifications. We change HDFS calls
according to the requirements of the target storage service.
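The shape of such a storage client can be sketched as a thin adapter. This is a hypothetical illustration, not Gluon's real connector: an in-memory dict stands in for the backend, and the methods mirror the create/append/open flavour of the Hadoop FileSystem API described above.

```python
# Hypothetical storage-client adapter: translate FileSystem-style calls into
# backend operations. The dict backend stands in for NFS, an object store, or
# a Server SAN volume in the real system.

class StorageClient:
    def __init__(self, backend):
        self.backend = backend              # path -> bytes

    def create(self, path, data=b""):
        if path in self.backend:            # HDFS-style: no overwriting files
            raise FileExistsError(path)
        self.backend[path] = data

    def append(self, path, data):           # HDFS allows only create/append
        self.backend[path] = self.backend[path] + data

    def open(self, path):
        return self.backend[path]
```

Swapping backends then only requires reimplementing these few translation methods, which is the modularity Gluon relies on.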
3.3.3 Consolidated Storage Layer
Shared view
Large-scale storage solutions provide a shared view of data to client nodes. This is true
for a majority of such systems, e.g., Lustre, Ceph, NFS, S3, HDFS. Each system solves
contention issues in different ways. Some use locks while others rely on distributed object
stores. Gluon strives to provide integration with any large-scale storage solution therefore
needs to account for contention issues as well. Gluon avoids contention issues the same
Chapter 3. Thesis Idea and Design 29
way it avoids inconsistency between workers. Since writes to files are disjoint there is no
need to worry about locking an inode to flush data. The worst case contention scenario
is when tasks try to create new paths under the same directory. The directory inode is
locked by each task. Nevertheless, path creation times are typically insignificant when
compared to actual data writes in Big Data analytics workloads.
Data consistency
We already mentioned that data is asynchronously propagated to remote storage. Data
can also be brought into the cache on demand. However, analytics engines are not the only
users of the remote storage service. Other services, such as webservers or databases,
can aggregate their data inside the same storage silo. In this case, the Gluon cache
layer needs to be aware of updates from other services to make sure that its view is
consistent with the latest storage update. The data analytics and caching layer needs to
perform consistency checks regularly and without extra overhead on the cache workers
or analytics tasks.
Inconsistencies may also arise between the storage and cache layers. For instance, the
storage solution can accept data from transaction workloads and can update existing
data by adding, extending, modifying or removing files. This results in two types of
inconsistencies: (1) the storage has more up-to-date data that the cache is unaware of and
(2) the cache has more up-to-date data that the storage is unaware of.
Our design detects both types of inconsistencies and notifies the cache manager about
changes in the storage tier. In the second case, we can ignore the inconsistency, because
the cache may have extra data due to temporary files created during analytics job runs
or intermediate data that is not pushed down to the storage tier due to delete-on-finish
behaviour. The first case, on the other hand, is trickier, because a file could have been
either created or modified. When a file is created, it is just another table entry for the
cache manager, which is quite straightforward to implement. In the modification case,
however, the cache manager needs to understand which part of the file was changed,
which chunk of the file to invalidate, and how to do that without interfering with the
analytics workload. Due to these cache invalidation complexities we
leave this feature for future work.
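The detection step itself, before any invalidation, can be sketched as a comparison between the cache manager's file table and a storage listing. The modification-time comparison here is an assumption for illustration (the text does not specify the detection mechanism): storage-only or storage-newer entries are reported, while cache-only entries are ignored, matching the rule above for temporary and not-yet-propagated data.

```python
# Sketch of the consistency check (mtime-based detection is an assumption):
# report paths that are new or newer in storage; ignore cache-only entries,
# which may be temporary or not-yet-propagated analytics data.

def detect_inconsistencies(cache_table, storage_listing):
    """Both args: {path: modification_time}. Returns storage-side changes."""
    stale = []
    for path, mtime in storage_listing.items():
        if path not in cache_table or cache_table[path] < mtime:
            stale.append(path)              # created or modified in storage
    return sorted(stale)
```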
Asynchronous propagation to storage
In the background, data blocks are asynchronously propagated by cache workers to the
remote storage silo. However, cached data is typically not propagated in Spark workloads.
There are two types of data writes in analytics workloads: intermediate and final.
Intermediate data is typically written locally (not to HDFS) and is required to perform
shuffle/synchronization/re-use steps. Sometimes intermediate data can be cached data
(e.g. in Spark). In general, losing intermediate data results in task re-computation,
which can sometimes be costly. However, it is not as costly as losing final output data,
which requires the whole job to be re-computed. Intermediate data is typically destroyed
by applications upon completion.
In Gluon, we propagate both intermediate and final output data. This lowers the
probability of data re-computation. For instance, if a Spark executor crashes or gets killed
by YARN, its RDD partition is destroyed and needs to be re-computed by other executors.
However, if a Spark executor stores its RDD partition in Gluon, the partition can be
recovered from the external memory layer. Even if a whole node crashes, its RDD partitions
may already be persisted in remote storage, so all RDD partitions from lost nodes can be
fetched from cold storage. In this way we perform not only caching but also asynchronous
checkpointing of RDD partitions.
In our platform we define a reasonable trade-off between fault-tolerance and performance.
In practice, our final output propagation matches that of HDFS replica propagation in
the default mode: the default HDFS configuration does not enforce synchronous data
replication, i.e. replicas are propagated during the task run and/or after the task is
finished. HDFS administrators have to explicitly enable synchronous data propagation.
Gluon also provides this option.
3.4 Summary
In this chapter, we discussed a case study of a simple analytics application. We showed
how a usability issue can lead to a failed job, poor performance and under-utilization of
resources. We also proposed Gluon, our consolidated flexible platform that can incorporate
the majority of state-of-the-art frameworks. Our new architecture is based on usability
studies of current analytics engines and their storage solutions. First, our Gluon caching
layer supports global collaboration across the memories of all participating compute (and
storage) nodes. Second, Gluon supports full integration of the collaborative caching service
with traditional consolidated storage back-end services.
With Gluon we emphasize the principle of data locality for in-memory data on any
compute node. At the same time, we take full advantage of fast remote memory access
when opportunities for memory availability in collaborating nodes exist. We describe
data propagation from execution layer to storage layer.
Finally, as mentioned, the seamless integration between caching and consolidated
storage in Gluon means that any updates to files stored on the back-end storage can be
integrated into a new data analytics pass transparently, automatically and on-demand.
This avoids the cumbersome data manipulations that separate on-disk data silos normally
bring about, e.g., for data analytics systems based on HDFS.
Chapter 4
Implementation
This chapter presents details of the implementation of our consolidated platform as well
as platform optimizations and improvements. We start with a description of the system
components and discuss each component in detail. Then we describe how we glue all
components together. Finally, we present the implemented optimizations.
Our component for the collaborative caching layer is based on the open-source Alluxio[35].
Alluxio is an in-memory cache that can interact with YARN applications (MapReduce,
Spark etc.). It caches files from a storage service and places them inside the memory
cache.
Components for the consolidated storage layer include the open-source Global File System
2[34] and a proprietary Server SAN, Huawei FusionStorage[3]. On top of the Server SAN
we install GFS2.
We create our own set of connectors to integrate the caching and storage layers. We
essentially connect the storage layer with the analytics engines first, and then insert the
cache tier in between. We show how each component is integrated into our platform, the
challenges of integration, and the final design optimizations.
4.1 Components
4.1.1 Alluxio
Alluxio is an in-memory cache, though not memory-only: its tiered storage feature
means it can theoretically be extended to access any storage. Because Alluxio exposes a
storage integration layer through an API, applications can access any integrated underlying
persistent storage and file system. We chose Alluxio because it has a flexible code base
and focuses on data analytics caching, in contrast to Ignite[1], which also tries to
accommodate transaction-based workloads.
Alluxio's design uses a single master, called the AlluxioMaster, and multiple workers,
called AlluxioWorkers. At a high level, Alluxio can be divided into three components: the
master, the workers, and the clients. The master and workers together form the Alluxio
servers, which are the main components of a typical Alluxio cluster. The clients are
generally the applications, such as Spark or MapReduce jobs.
The master is responsible for managing the global metadata of the system, e.g. the
inode tree. AlluxioClients may communicate with the master to read from and write to
the global metadata table. Alluxio workers are responsible for managing the local resources
allocated to them, such as RAM, SSD and HDD. Alluxio workers manage all data as file
blocks and are very similar to HDFS DataNodes. A worker is only responsible for the data
on its node; the actual mapping from file to file blocks is stored only in the master. The
AlluxioClient provides users a gateway to interact with the Alluxio workers. It exposes a
cache system API. Data that exists in the under storage (e.g. HDFS) but is not available
in the Alluxio cache is accessed directly through an under storage client. AlluxioWorkers
store file blocks inside directories just like HDFS DataNodes. The difference from HDFS
is that an AlluxioWorker mounts its directory as RamFS, i.e. all data is stored in the OS
page cache.
The AlluxioClient runs inside a task executor (e.g. a Spark Executor). It initiates
communication with the master to carry out metadata operations, and with workers to
read and write data that exists in the Alluxio cache. It can access RamFS and create
random access files. It can also connect to remote nodes and pass data over the TCP/IP
network.
Depending on configuration, AlluxioClients can create two output streams during writes:
(1) a RamFS output stream and (2) an understorage output stream (e.g. an HDFS stream).
Alluxio stores file blocks in RamFS and whole files in the underlying storage. It is important
to note that a file block is typically smaller than the file itself, i.e. a file consists of
more than one block. Upon write, the AlluxioClient creates a single file stream; while
writing the file it creates multiple block streams. This behaviour applies when the
CACHE_THROUGH policy has been set. There are other write policies, such as
MUST_CACHE, THROUGH and the experimental ASYNC_THROUGH. By default Alluxio
uses MUST_CACHE, which means that writes are never propagated to the underlying
storage. In the Gluon cache we ignore all policies except one: we focus on the
ASYNC_THROUGH policy. This policy assigns a set of background threads to copy
RamFS blocks to the corresponding file in the underlying storage.
If a block is not present in RamFS, the AlluxioClient reads it from the underlying
storage. There are 3 read policies in Alluxio: CACHE_PROMOTE, CACHE and NO_CACHE.
The first policy always places a block into the highest tier, which is the RamFS directory
of the node that is reading the block. Even when a block is read from a remote RamFS,
a copy is created in the local RamFS directory. This policy results in multiple replicas of
blocks in the memory tier. In our prototype we only want to cache into the memory layer
once, and thus want to avoid replicas on different RamFS nodes. Therefore we focus on
the CACHE read policy.
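In Alluxio these two choices correspond to two client-side configuration keys. A representative alluxio-site.properties fragment (property names follow the Alluxio 1.x documentation and may differ in other versions) would be:

```properties
# Write policy: cache to RamFS, propagate to under storage asynchronously
alluxio.user.file.writetype.default=ASYNC_THROUGH
# Read policy: cache on read, but do not promote/replicate remote blocks locally
alluxio.user.file.readtype.default=CACHE
```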
4.1.2 Server SAN
Huawei FusionStorage[3] is the main component of the Huawei Server SAN solution. It
can be deployed on multiple general-purpose x86 servers to consolidate the local SSDs or
HDDs of all the servers into virtual storage resource pools that provide block storage
capabilities. Based on the storage resource pools, the FusionStorage software provides
block device interfaces for upper-layer software, for example for creating and deleting
volumes and snapshots. Volumes are accessed through the SCSI or iSCSI protocols.
FusionStorage automatically stores each piece of data as several identical copies on
different servers. The data is represented as disk blocks (e.g. 4KB). The storage
automatically ensures strong consistency between the data copies as well as even data
distribution, thereby preventing data hotspots. All the hard disks in the storage resource
pools can function as hot spare disks, which helps consolidate all disks on commodity
servers into a single pool of disks[53]. FusionStorage is similar in architecture to the Ceph
block storage solution[58], which is open-source.

Figure 4.1: High level SAN architecture of the storage tier.
In this dissertation, we only disclose FusionStorage implementation details that are
covered by the publicly available white paper[3]. Readers can find more details about a
SAN implementation in the Ceph source code. Figure 4.1 demonstrates the SAN system
architecture. Users can create virtual volumes (vol1, vol2, vol3) from the SAN pool. The
volumes can then be attached as new disks to virtual or physical machines, labelled as 1, 2
and 3. Client nodes access these volumes as block devices where data is stored in the form
of disk blocks, denoted as green and red circles. In our platform, SAN clients reside on
different nodes to enforce a decoupled architecture. The SAN servers, denoted as A, B and
C, include metadata management, disk management and caching mechanisms. Typically
SAN servers replicate disk blocks across multiple disks and servers: 3 red replicas and 3
green replicas. They provide data balancing, thin provisioning and a variety of recovery
mechanisms.
4.1.3 GFS2
GFS2 is a shared-disk file system for Linux commodity clusters. GFS2 is very different
from distributed file systems (such as HDFS, Lustre or GlusterFS) since it does not have
a metadata master and allows all nodes concurrent access to the same shared block
storage. Moreover, GFS2 can be used as a local filesystem, just like ext3. It is a
POSIX-compliant filesystem.
It is primarily designed for Storage Area Network (SAN) applications in which each
node in a GFS2 cluster has equal access to the storage. To limit access to areas of the
storage and maintain filesystem integrity, a lock manager is used; in GFS2 this is a
distributed lock manager (DLM). The DLM works on an inode basis, i.e. each writer locks
an inode while writing to it. It is also possible to use GFS2 as a local filesystem with the
lock_nolock lock manager instead of the DLM. The locking mechanism is replaceable and
can be easily swapped should a more specialized lock manager be needed in the future.
The design of GFS2 is a perfect match for SAN-like environments such as FusionStorage.
It is compatible with a variety of block device protocols, e.g., SCSI, iSCSI, FibreChannel,
AoE, or any other device which can be presented under Linux as a block device shared by
a number of nodes, for example a DRBD device.
4.1.4 YARN
YARN is essentially a system for managing distributed applications. It consists of a
central ResourceManager, which arbitrates all available cluster resources, and a per-node
NodeManager, which takes coordination from the ResourceManager and is responsible
for managing resources available on a single node.
In YARN, the ResourceManager is, primarily, a capacity scheduler. Essentially, it is
strictly limited to arbitrating the available resources in the system among the competing
applications. It optimizes for maximum cluster utilization under various constraints such
as capacity guarantees, fairness, and SLAs.
YARN has a special program called the ApplicationMaster. The ApplicationMaster
is, in effect, an instance of a library that can be used by different analytics engines to
negotiate resources from the ResourceManager and work with the NodeManager(s) to
execute and monitor the containers and their resource consumption. For instance, the
Spark driver program can run inside the ApplicationMaster, which is responsible for
negotiating appropriate resource containers from the ResourceManager, tracking their
status and monitoring progress.
YARN is designed to allow individual applications (via the ApplicationMaster) to
utilize cluster resources in a shared, secure and multi-tenant manner. It also remains
aware of cluster topology in order to efficiently schedule and optimize data access, i.e.
reduce data motion for applications to the extent possible. To meet these goals, the
ResourceManager has extensive information about an application's resource needs, which
allows it to make better scheduling decisions across all applications in the cluster. This
leads us to the ResourceRequest and the resulting Container. Essentially, an application
can issue specific resource requests via the ApplicationMaster to satisfy its resource needs.
The Scheduler responds to a resource request by granting a container, which satisfies the
requirements laid out by the ApplicationMaster in the initial ResourceRequest. The
ResourceRequest object contains hostnames and corresponding container sizes (CPU and
RAM). YARN enables relaxed locality in its default mode, which means that data locality
can be ignored if the requested host does not have the required CPU and RAM. If a
ResourceRequest has both its hostname and its container size fulfilled, the allocation is
designated NODE_LOCAL because the subsequent task execution can access data locally.
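The relaxed-locality fallback can be illustrated with a toy allocation function; this is purely our simplification of the scheduler's observable behaviour, not YARN code:

```java
public class LocalityDemo {
    // Returns the locality level granted for a request, given whether the
    // preferred host has free capacity. Mirrors YARN's relaxed-locality fallback.
    public static String allocate(boolean preferredHostHasCapacity, boolean relaxLocality) {
        if (preferredHostHasCapacity) return "NODE_LOCAL";
        if (relaxLocality) return "RACK_LOCAL";   // fall back to another node in the rack
        return "WAIT";                            // strict locality: keep waiting
    }

    public static void main(String[] args) {
        System.out.println(allocate(true, true));   // NODE_LOCAL
        System.out.println(allocate(false, true));  // RACK_LOCAL
        System.out.println(allocate(false, false)); // WAIT
    }
}
```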
4.2 Control and Data Flow
Figures 4.2, 4.3, 4.4 and 4.5 show how the components interact with each other. Since our
architecture is complex and involves many software technologies and a multitude of third-
party libraries, it is best to demonstrate the control and data flow in multiple figures.
Figure 4.2 outlines the high-level view of our implementation. It shows a snapshot of all
components during a single application run. In the figure we use a generic analytics engine;
in our prototype we have tested two different frameworks, Spark and Hadoop MapReduce.
We will walk readers through each component interaction by zooming in on each region
of Figure 4.2.

Figure 4.2: Components in the consolidated platform.
Figure 4.3 shows how a job is initially submitted to the cluster.

Figure 4.3: Top tier job submission process in Gluon.

In the first operation, the client (1) submits an application jar file to the ResourceManager.
The ResourceManager then (2) decides where to allocate a container for the main program
of the application and requests a NodeManager to (3) allocate a container. Once the
container is allocated, the main program is started within the ApplicationMaster. The
main program then (4) communicates with the AlluxioMaster to retrieve any relevant
data information, such as file block locations, permissions and file sizes. The AlluxioMaster
is always aware of the current state of each worker and of the underlying file system;
updates to the AlluxioMaster happen through the AlluxioWorkers. For example, when an
AlluxioWorker creates a new file block, it notifies the AlluxioMaster. After the
ApplicationMaster retrieves the file block information, it (5) creates a ResourceRequest
object and sends it to the ResourceManager. The ResourceRequest contains a list of
hostnames per task execution. The ResourceManager attempts to allocate a container
based on the user preference (e.g. NODE_LOCAL); if the preference cannot be satisfied,
it grants a RACK_LOCAL container. The ResourceManager then (6) requests the
NodeManagers to provide containers, and the NodeManagers (7) allocate new containers
for task execution.
Figure 4.4: Cache and bottom tier interaction with top tier in Gluon. The READ operation.
Once containers are allocated, the ApplicationMaster's main program starts the tasks
in their assigned containers (Figures 4.4, 4.5). To perform a READ operation, Task
Executors use the AlluxioClient to open an input stream from GFS2 and start (8) reading
a file block directly. Reading from a GFS2 file triggers the Server SAN client to fetch the
necessary disk blocks (e.g. 4KB blocks) that correspond to the requested inode from the
remote Server SAN. While being read, a file block is also (9) stored in RamFS, i.e. we
cache on read. If the RamFS directory is running out of space, the Task Executor requests
the AlluxioWorker to evict some blocks. The AlluxioWorker uses an LRU evictor to move
blocks to the local-disk file system (e.g. ext3).
In case a file block already exists in the RamFS of any of the AlluxioWorker nodes, a
BlockInStream object is created and the block is read directly from the memory of that
node. In this case, a blockId is calculated using the position in the file. Typically, the
ApplicationMaster assigns a partition of a file to a task; the partition information contains
the offset and the partition size.

Figure 4.5: Cache and bottom tier interaction with top tier in Gluon. The WRITE operation.
Another procedure takes place during the WRITE operation. Unlike the READ
operation, before performing the write the Task Executor adds a journal entry into the
AlluxioMaster using the AlluxioClient's createFile API. Then the Task Executor (8)
checks whether there is available memory in RamFS and creates a RamFS output stream
directly. If there is not enough space in the RamFS directory, the Executor requests the
AlluxioWorker to (9) evict some blocks and at the same time tries to write to remote
nodes with available space in their RamFS. After the Task Executor writes all file blocks
to RamFS, it notifies the AlluxioWorker to persist the written file blocks to GFS2. The
AlluxioWorker locks these blocks (the lock prevents eviction) and, immediately after the
blocks are written to RamFS, it (9) starts to move them to GFS2. Consequently, the
blocks are propagated to the remote SAN. The Task Executor does not wait for the
propagation to finish.
4.3 Connecting storage component
4.3.1 Server SAN to filesystem connection
This choice presents an interesting challenge because a Server SAN is not a file system
but block storage. Moreover, most of the popular NAS filesystems can already be
integrated into the Hadoop ecosystem; SANs, on the other hand, are rarely covered. SANs
communicate over the SCSI protocol and present data the same way a physical block
device does. The Server SAN can be replaced with Ceph, Lustre, NFS and many other
systems in our platform. NFS and Lustre are file systems and thus require less integration
effort. By connecting a SAN to the Hadoop ecosystem we can cover the majority of
storage platforms.
Figure 4.6: Over-replication problem.

If we are to connect SAN clients to Hadoop, the obvious solution is to (1) create an
individual volume from the SAN pool, (2) mount ext3/ext4 on each volume and (3)
co-locate an HDFS DataNode with each filesystem. However, as discussed, we then face
3-way replication both in Hadoop and in the Server SAN, which leads to data redundancy
and unnecessary overhead. Figure 4.6 shows how for each HDFS file block (red and green
squares) there will be 3 disk blocks (red and green circles) stored in the SAN. If we instead
configure Hadoop with no replication, we risk an unreliable system regardless of replication
in the Server SAN, because we need file-level reliability. Nevertheless, there is another
option: configure the Server SAN such that one large volume is shared among the Hadoop
nodes. This way, even if one of the Hadoop nodes fails, we still have access to the same
shared volume from the other Hadoop nodes. The challenge then becomes how to
transform shared disk-block-level access into file-level access, since Hadoop only works
with files.
We employ shared-disk GFS2 to access data on Server SAN. GFS2 is installed on the
Server SAN client nodes. GFS2 is a clustered file system that allows for synchronized
access to a shared block device. Figure 4.7 shows our storage tier architecture. In our
case the shared block device is the Server SAN volume (vol1) from the SAN storage pool.
GFS2 is mounted after installation onto the nodes where SAN is installed, i.e. Server
SAN client nodes 1, 2 and 3. The mount directory is the same on all Server SAN client
nodes, i.e. when a user creates a file a.txt under mount directory /gfs2 on Server SAN
client 1, the file a.txt appears under mount directory /gfs2 on Server SAN client 2.
The disadvantage of a clustered file system like GFS2 is contention during parallel
writes to the same file. GFS2 relies on the Distributed Lock Manager to control parallel
writes. However, in analytics workloads it is rare for tasks to write to the same output
file. For instance, HMR tasks write files in the reduce stage, and reducers each have their
own output partitions; therefore, file write contention does not happen in a typical
MapReduce scenario. Further investigation of HDFS reveals that it does not allow multiple
writers to the same file either.
Thus we can conclude that our platform will very rarely encounter file write contention.
Nevertheless, in HMR and other analytics engines, file create and delete contention is
inevitable in our platform since the DLM operates on an inode basis. This means that if
a task creates a file under a directory that is locked, the task will need to wait until the
lock-holding task has finished its creation or deletion procedures.
Figure 4.7: Final Storage Tier Architecture.
4.4 Connecting GFS2 with Analytics Engines
We focus on applications that work in the context of YARN or support the Hadoop
FileSystem API. All these applications can access data from HDFS or from file systems
that are compatible with the API. Theoretically, any data analytics application can work
with our platform as long as it is compatible with the FileSystem API. However, we have
only tested our platform with the 2 above-mentioned applications and the Hama
framework[50].
Table 4.1 lists the core HDFS API calls. There are many more calls in the actual
Hadoop API; here we show only the main methods that typically impact performance.
We translate these calls to our storage service, using basic Java File streams to access the
POSIX-compliant GFS2, essentially making the storage service accessible through the
Hadoop FileSystem API. Some challenges arise when implementing getFileLocations
because this function can impact the parallelism of tasks.
Table 4.1: Core HDFS API calls and their translation

create(Path p): Creates an FSDataOutputStream at a given HDFS path. Translation:
instead of an FSDataOutputStream we return a java.io.FileOutputStream at the given
path.

getFileStatus(Path p): Returns a FileStatus object that represents the path in HDFS.
Translation: return a FileStatus object that contains the file metadata.

mkdirs(Path f): Makes the given file and all non-existent parents into directories in
HDFS. Translation: create the directory recursively, e.g. translate to a "mkdir -p
DIR_NAME" call in the POSIX-compliant storage service.

open(Path f): Opens an FSDataInputStream at the indicated path in HDFS. Translation:
create a storage service stream java.io.FileInputStream at the given path.

getFileLocations(Path p, long start, long len): Returns an array containing hostnames,
offsets and sizes of portions of the given file in HDFS. Translation: the storage service is
fully decoupled, therefore we can return any active hostname on the compute layer.
Reading/writing data at the compute node triggers communication with the storage
master, which directs the call to the corresponding storage worker.
Initially, we developed our own connector to link Hadoop with GFS2. The connector
can interact with the HMR JobTracker, the YARN ApplicationMaster, the Spark Master
or other framework masters to spawn tasks on the servers or VMs that have the data
accessible.
Figure 4.8 shows how we connect the application tier with the storage tier. In our
platform we co-locate YARN NodeManagers with SAN clients. Note that, when a resource
manager is not used, application workers (e.g. Spark Standalone) would be co-located
with the SAN clients. Any file in GFS2 is accessible through any server or VM that has a
SAN client. The ApplicationMaster interacts with GFS2 to determine file locations. When
the ApplicationMaster receives the file metadata, it requests containers on specific nodes
from the ResourceManager. Once the ResourceManager grants the permission, the
containers are launched and tasks start executing inside them. Tasks retrieve data through
their local Connectors.
Figure 4.8: Application Tier and Storage Tier Interaction.

There is a way to include an implementation of any file system without ever changing
the Hadoop source code. We only need to compile our implementation of a file system
plugin (or a newly designed file system) into a jar file and add that jar to the Hadoop
classpath. Finally, we add a property called fs.SCHEME.impl to core-site.xml. This
property specifies the core class of our file system plugin; the specified class must inherit
from the Hadoop FileSystem class.
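For example, registering a gfs2 scheme would look like the fragment below (the implementation class name org.example.gluon.Gfs2FileSystem is a placeholder for illustration, not the exact class from our code base):

```xml
<!-- core-site.xml -->
<property>
  <name>fs.gfs2.impl</name>
  <value>org.example.gluon.Gfs2FileSystem</value>
</property>
```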
We have built our connector based on the S3 connector designed by the Hadoop
developers. Our plugin maps requests to the gfs2:/// scheme onto the local file system
under the mount directory specified in core-site.xml. The connector that we implemented
can be re-used for any shared POSIX-compliant file system such as Lustre, Ceph and
NFS. We also do not need to support S3 and other popular object stores because Hadoop
already provides that integration and we can re-use its connectors.
4.5 Cache integration
4.5.1 GFS2 to Alluxio connection
In our platform, we leverage an open-source in-memory distributed cache called
Alluxio[35]. Alluxio has a simple design and follows the HDFS structure. It has a
master-slave architecture, which makes it an ideal candidate for collaborative caching.
Unlike HDFS, Alluxio stores file blocks on RAM disks. RAM disks behave as regular disks,
but they store data in the OS page cache, which occupies part of the physical RAM of the
server.
Since Alluxio follows the HDFS structure, it already has the Hadoop FileSystem API
implemented. Alluxio is also compatible with various bottom tiers including HDFS, S3,
GCS and GlusterFS. To access bottom tiers, Alluxio provides an HDFS-like FileSystem
API: the UnderFileSystem API. We implemented our own version of the UnderFileSystem
API to allow smooth access to GFS2. We essentially re-used our previous connector design
and extended the implementation of Alluxio's LocalUnderFileSystem class, which is used
to access local directories.
When the Alluxio cache is empty, clients are directed to fetch data from the GFS2
nodes. The data access pattern is important because it determines which nodes will cache
the data; it can also impact the performance of the initial job. Figure 4.9 shows how the
data layout can impact job performance. When the YARN ApplicationMaster requests 5
file locations (a.txt through e.txt), the AlluxioMaster responds with the locations of all
files. In this example, since the data is not yet cached, the location returned for all file
requests is the local host, which is node 1. Thus the ApplicationMaster requests the
ResourceManager to launch containers on node 1. However, each node only has 3 task
slots (denoted as grey circles) available due to RAM and CPU restrictions. The
ApplicationMaster launches 3 tasks on node 1, denotes the 2 other tasks as RACK_LOCAL
and places them on nodes 2 and 3. The rack-local tasks fetch the data from node 1,
because they assume that they do not have the data available locally. This results in
caching data non-uniformly and affects the performance of new jobs that re-use the
cached data. However, in our platform all files can be accessed through any node that has
a SAN client; therefore, with the correct file locations this extra communication overhead
can be avoided.
Figure 4.9: Performance degradation due to uneven data representation.
In our work, we program a uniform data representation in our UnderFileSystem API
implementation. We use GFS2 path hashcodes and offsets to calculate a preference node
for the ApplicationMaster. The uniform data representation allows us to return locations
such that the ApplicationMaster can assume that all of the healthy nodes have the data
available. For instance, returning a list of hostnames such as [node2, node1, node3] from
getFileLocations gives the task a NODE_LOCAL designation.
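The preference-node computation can be sketched as follows. This is our illustrative reconstruction (class and method names are hypothetical), not the exact UnderFileSystem code: it hashes the file path together with the block index onto the list of healthy worker hostnames, so every node appears to hold some of the data.

```java
public class UniformLocator {
    // Deterministically map a (path, offset) pair onto the worker list so that
    // the blocks of a file spread uniformly across all healthy nodes.
    public static String preferredHost(String path, long offset, long blockSize, String[] hosts) {
        int blockIndex = (int) (offset / blockSize);
        // floorMod keeps the slot non-negative even if hashCode() is negative.
        int slot = Math.floorMod(path.hashCode() + blockIndex, hosts.length);
        return hosts[slot];
    }

    public static void main(String[] args) {
        String[] hosts = {"node1", "node2", "node3"};
        long blockSize = 64L * 1024 * 1024;  // 64MB blocks, as in our platform
        // Consecutive blocks of the same file map to different hosts (round-robin
        // shifted by the path hash).
        for (long off = 0; off < 3 * blockSize; off += blockSize) {
            System.out.println(preferredHost("/gfs2/a.txt", off, blockSize, hosts));
        }
    }
}
```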
In our platform, we deploy Alluxio such that data is cached in the AlluxioWorkers first
and then propagated to the SAN storage via GFS2. This means that consecutive reads of
the same data will hit the cache and achieve the highest performance. We also configure
Alluxio to cache on reads, which allows consecutive workloads to share cached data. We
set the cache WRITE policy to LOCAL_FIRST, which helps us keep all written data
local: a task tries to write all its file blocks on a single node until it runs out of memory,
and then scans neighbouring nodes for available memory. On reads, in case file blocks are
not in the AlluxioWorkers, they are brought in from GFS2 during job execution.
Finally, we implemented asynchronous propagation of writes by extending the
ASYNC_THROUGH feature in Alluxio. This allows data to be asynchronously propagated
to GFS2 (FusionStorage). Note that Alluxio provides ASYNC_THROUGH as an
experimental feature, which did not work on our platform because analytics engines (e.g.
HMR and Spark) create temp files and then rename them. We modified the rename
function in the AlluxioClient to enable file persistence, thereby enabling Alluxio to perform
asynchronous writes seamlessly.
We also leverage the Alluxio tiered storage feature to evict blocks onto local disks that
are shared with shuffle spills. Essentially, eviction is a copy from the RAM disk onto a
physical disk and a subsequent removal of the block from the RAM disk.
We utilize the default YARN scheduler to queue tasks and jobs: the CapacityScheduler.
We mentioned previously that the CapacityScheduler uses "relaxed" locality in its default
configuration: it ignores locality preferences if the preferred node's capacity is exceeded
(CPU and RAM are busy). This affects the Alluxio cache by producing file block replicas
in the main memory of AlluxioWorker nodes. If an AlluxioWorker does not have a file
block in its local cache but has been assigned a task working on that file block, it copies
the contents of the block from a neighbouring worker node that already has the requested
file block cached.
In our platform, the uniform distribution of file blocks in the GFS2 data representation
allows for a uniform distribution of tasks on data reads. However, block replicas are still
possible: for instance, two tasks in two different VMs may request data splits from the
same Alluxio file block, causing the same block to be cached on two different VMs. The
number of replicas can be reduced by decreasing the block size. Our platform uses a 64MB
block size, unlike the default Alluxio block size of 512MB.
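For reference, the block size corresponds to a single Alluxio client property (Alluxio 1.x naming; the exact key may vary across versions):

```properties
# 64MB blocks instead of Alluxio's 512MB default, to reduce duplicate caching
alluxio.user.block.size.bytes.default=64MB
```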
We also disable the caching of remote blocks that normally happens when an
AlluxioWorker reads a block from the memory of a remote AlluxioWorker: we modify
Alluxio such that no remote block copy is made when using the CACHE policy.
4.6 Spark integration
In Spark, caching is performed using the persist() command. Spark divides functions
into two categories: transformations and actions. Transformations are just records of
operations and are executed whenever an action is triggered (i.e. they are lazy operations).
The persist() command is a transformation; therefore data is cached on the first action
after persist(). Spark allows 3 main levels of caching: MEMORY, DISK and OFF_HEAP.
There are also options for data serialization and combinations of levels. However, all of
these levels only store data on the local machine/VM. Consequently, a user needs to worry
about how much memory to allocate to each Spark executor in order to fit the cached
RDD partitions in the available memory.
Figure 4.10: Performance comparison of Spark caching methods

In our platform, we use Alluxio, which is an external service to Spark. Therefore,
we cannot use the persist() command to cache data. For external caching purposes,
Spark developers recommend using the checkpoint() or saveAsTextFile() methods.
There are many disadvantages associated with these two commands. First, the
checkpoint() command implementation is not the same as persist(): it requires two
computations for the same action, one to perform the action and a second to perform
the actual checkpoint operation. Second, saveAsTextFile() is an action; therefore it
requires an additional computation.
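The cost difference can be illustrated with a toy model of Spark's lazy evaluation (a minimal sketch, not Spark's actual implementation): persist() caches partitions as the first action computes them, while a checkpoint-style materialization forces one extra pass over the lineage.

```python
# Toy model of lazy RDD evaluation (not the real Spark API). It counts
# how many times an expensive map function runs under persist-style
# caching vs. checkpoint-style recomputation.

class ToyRDD:
    def __init__(self, data, fn=None, parent=None):
        self.data, self.fn, self.parent = data, fn, parent
        self.cache = None          # filled by persist() on first action
        self.persisted = False

    def map(self, fn):
        return ToyRDD(None, fn, self)

    def persist(self):
        self.persisted = True      # lazy: nothing is computed yet
        return self

    def _compute(self):
        if self.persisted and self.cache is not None:
            return self.cache
        rows = self.data if self.parent is None else [
            self.fn(r) for r in self.parent._compute()]
        if self.persisted:
            self.cache = rows
        return rows

    def count(self):               # an action: forces computation
        return len(self._compute())

calls = {"n": 0}
def expensive(x):
    calls["n"] += 1
    return x * 2

base = ToyRDD(list(range(100)))

# persist(): the map runs once; the second count() hits the cache.
rdd = base.map(expensive).persist()
rdd.count(); rdd.count()
persist_calls = calls["n"]         # 100 calls total

# checkpoint-style: the first action computes the lineage, then the
# materialization pass recomputes it before later actions can read it.
calls["n"] = 0
rdd2 = base.map(expensive)
rdd2.count()                       # first action: 100 calls
saved = ToyRDD(rdd2._compute())    # checkpoint write: 100 more calls
saved.count()                      # reads the materialized copy: 0 calls
checkpoint_calls = calls["n"]      # 200: twice the work of persist()

print(persist_calls, checkpoint_calls)
```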
We performed a simple experiment where we tested the three methods in the same
application. The application reads a large file, performs two random map operations,
"caches" the RDD and finally does two back-to-back count operations. For persist()
and checkpoint() we just call these methods before the count() calls. For
saveAsTextFile() we also perform a textFile() operation to read the saved RDD back.
In all of the methods we use Alluxio as the caching layer in asynchronous mode. To
use the persist() command with Alluxio, we choose the DISK_ONLY option in Spark
and point Spark disk writes to the Alluxio RamFS directory using the spark.local.dir
configuration. This gives a fair comparison of the three methods performing the same
operation. Figure 4.10 shows the result of our test. From the test run it becomes
obvious that checkpoint() and saveAsTextFile() are very slow and cannot compete
with persist().
Consequently, we have implemented AlluxioBlockManager.scala in Spark in order
to allow the persist() command to use the Alluxio client API. The AlluxioBlockManager
class acts similarly to Spark's DiskStore class, except that instead of a FileOutputStream
object it creates an Alluxio FileOutStream object. We have tested our implementation
with Spark-1.6.3 because it provides the ExternalBlockStore class that we rely on.
Nevertheless, it is still possible to implement both classes in later versions of Spark;
we leave that for future work.
4.7 Additional optimizations
4.7.1 Asynchronous Delete
Many iterative workloads delete previous data after each iteration. The time it takes to
delete a directory can greatly affect overall job performance. To overcome this challenge,
HDFS performs asynchronous deletion, i.e. it schedules block IDs to be removed later. In
our platform, deletion is synchronous: a task has to wait until the storage tier
returns the acknowledgement that the file is deleted.
Speeding up the deletion process is tricky because the storage tier controls file-related
operations, and we try not to modify tiers other than the cache. Hence, we
introduce a queue-based deletion mechanism where delete requests are added to the
end of a queue and a background thread processes the request at the head of the
queue. The mechanism operates at the caching tier, and thus does not require
modifications to the storage tier. There are subtle issues with our approach, since
there may now be inconsistencies between the cache and remote storage with respect
to delete operations. Such inconsistencies can be severe when a user tries to create a
file or a directory with the same name as a file that is scheduled for deletion. To
overcome this problem, we move all files scheduled for deletion to a specialized
directory under the root directory and only then schedule a deletion of the specialized
directory.
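The mechanism can be sketched as follows (a minimal illustration, not the actual Gluon code; the trash-directory layout and names are assumptions):

```python
# Sketch of queue-based asynchronous deletion with a rename-then-delete
# step to avoid name clashes (illustrative only; not the Gluon sources).
import os
import queue
import shutil
import tempfile
import threading
import uuid

class AsyncDeleter:
    def __init__(self, root):
        # Hypothetical trash directory under the cache root.
        self.trash = os.path.join(root, ".gluon_trash")
        os.makedirs(self.trash, exist_ok=True)
        self.q = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def delete(self, path):
        # Rename first: the original name is immediately free for reuse,
        # so a later create with the same name cannot collide.
        target = os.path.join(self.trash, uuid.uuid4().hex)
        os.rename(path, target)
        self.q.put(target)          # the actual removal happens later

    def _drain(self):
        while True:
            target = self.q.get()
            if target is None:
                break               # shutdown sentinel
            shutil.rmtree(target, ignore_errors=True)
            self.q.task_done()

    def close(self):
        self.q.join()               # wait for queued removals to finish
        self.q.put(None)
        self.worker.join()

# Usage: delete() returns as soon as the rename completes.
root = tempfile.mkdtemp()
d = os.path.join(root, "iteration-output")
os.makedirs(d)
deleter = AsyncDeleter(root)
deleter.delete(d)
os.makedirs(d)                      # the same name can be recreated at once
deleter.close()
print(os.path.exists(d))
```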
4.7.2 File consistency checker
It is important to detect any changes in the remote storage because another workload
(e.g. transaction processing) may add files to GFS2. In that case Alluxio may not be
notified, because the Alluxio cache is only used for analytics workloads. Therefore, it
is important to constantly check for consistency between the Alluxio Master journal
entries and the actual data in GFS2.
We designed and implemented an external file consistency checker that performs fast
lookups of file paths and timestamps. It populates a hashmap with a snapshot of the
GFS2 mount and compares the hashmap entries against the Alluxio Master entries. If
there is extra data in Alluxio, it is left as is: this may be an inconsistency due to a
job's file being temporarily stored on the cache layer, or some job performing
asynchronous propagation. If, on the other hand, there is extra data in GFS2, the
Alluxio Master is notified and a new entry is added to it. The file consistency checker
wakes up every 3 seconds to check GFS2 and Alluxio. In our job runs, the consistency
checker showed no significant overheads.
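The core comparison can be sketched as below (an illustration under assumed data structures, not the actual checker; Gluon's checker additionally compares timestamps and notifies the real Alluxio Master):

```python
# Sketch of the one-directional consistency check between a storage
# snapshot and the cache master's view (illustrative; names assumed).
import os
import tempfile

def snapshot(mount_root):
    """Build a {relative_path: mtime} hashmap of the storage mount."""
    snap = {}
    for dirpath, _, files in os.walk(mount_root):
        for name in files:
            full = os.path.join(dirpath, name)
            snap[os.path.relpath(full, mount_root)] = os.path.getmtime(full)
    return snap

def reconcile(storage_snap, master_entries):
    """Return paths present in storage but unknown to the master.

    Extra entries on the master side are deliberately ignored: they may
    be transient cache-only files or pending asynchronous propagation.
    """
    return sorted(p for p in storage_snap if p not in master_entries)

# Usage with a temporary directory standing in for the GFS2 mount.
root = tempfile.mkdtemp()
for name in ("a.dat", "b.dat"):
    open(os.path.join(root, name), "w").close()

master = {"a.dat": 0}              # the master only knows about a.dat
missing = reconcile(snapshot(root), master)
print(missing)                     # ['b.dat'] would be added to the master
```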
4.8 Summary
In this chapter we showed how the storage component was integrated. We chose Server
SAN as the storage service because Hadoop did not previously work with block devices,
and we used GFS2 to connect to Server SAN. We also discussed how we connected the
application layer to GFS2, covering challenges such as over-replication that were
encountered during the implementation phase.

We also described the Alluxio integration into our platform. We pointed out overheads
that caused performance degradation and showed our approach to resolving them. We
also stressed specific configuration settings and provided a thorough explanation of why
they were chosen.

Finally, we discussed code optimizations in the cache and analytics tiers, which boost
performance and usability, and we showed how control and data flow in our platform.
The flow analysis helps to better understand the overall platform architecture and to
point out bottlenecks for future development.
Chapter 5
Evaluation
This chapter presents performance results from our experimental testbed. We outline
the configuration of our testbed as well as its throughput measurements. We then
perform three sets of comparisons.

First, we compare against vanilla Spark deployments. Our goal is to evaluate default
Spark configurations in comparison to the Gluon deployment. We use real-world data
and SparkBench workloads[36] and show how performance changes with memory
utilization on both platforms. We also discuss Gluon's remote data write/read statistics
in uniform as well as skewed workloads, and show the benefits of integrating idle
memory nodes into the computation.

Second, we focus on a Hadoop cluster that runs on top of HDFS. We run two well-
known workloads to show how Gluon can improve Hadoop performance; in this
comparison we use Intel's HiBench suite[26]. Finally, we use Apache Hama[50] to see
how a non-MapReduce framework performs on Gluon.
5.1 Environment Setup
All of our experiments are executed on a cluster of servers at the University of Toronto.
We have two sets of dedicated servers in our cluster. The first set is used for the compute
layer of the data analytics platforms. This set has 3 large servers (c181, c172, c178),
each hosting 32GB RAM, 32 Intel(R) Xeon(R) E5-2650 @ 2.00GHz cores, one 300GB
SATA HDD and one 10Gbit/s network card. The second set of servers is used purely
for the storage layer and is located on a different rack. This set includes 3 extra-large
servers (c160, c168, c171), each hosting 48GB RAM, 32 Intel(R) Xeon(R) E5-2650 @
2.00GHz cores, five 300GB SATA HDDs and one 10Gbit/s network card. Intra-rack
throughputs vary in the range 700MB/s-1GB/s whereas inter-rack throughputs are in
the range 380MB/s-550MB/s. Disk write and read throughputs are 150MB/s and
180MB/s respectively.

Our setup provides a total of 96 cores, 96GB of RAM and 900GB of HDD for execution
and cache nodes. The storage layer has 96 cores, 144GB of RAM and 4.5TB of disk space.
In the following we describe the versions and configurations for the software platforms
used in tests:
• Spark-HDFS: we install Hadoop-2.6.0 and Spark-1.6.3 on the compute layer nodes
c181, c172 and c178. We configure YARN such that a maximum of 72GB RAM and
72 CPU cores can be spared for HMR or Spark. We also provide HDFS with 600GB
of disk space for storage; 300GB of storage is reserved for shuffle data.

• vanilla Hadoop: we install Hadoop-2.6.0 on the compute layer nodes c181, c172
and c178. We configure YARN such that a maximum of 72GB RAM and 72 CPU cores
can be spared for Hadoop MapReduce. We also provide HDFS with 600GB of disk
space for storage; 300GB of storage is reserved for shuffle data. We set the Gluon cache
size to a total of 20GB, which leaves the YARN cluster with 52GB of RAM for execution.
HMR applications do not utilize local memory as intensively as Spark applications
and do not support explicit .cache commands, so all HMR memory resources are
spared for execution only.
• Gluon: we use our modified versions of the packages mentioned above. We install
our version of Alluxio on the compute nodes. In our experiments we change the local
memory sizes of Alluxio workers in accordance with the Spark-HDFS cache sizes, and
spare 10% of the shuffle storage to the Alluxio local disk cache as a default setting.
Next, we install Huawei FusionStorage (Server SAN management software) on the
storage servers c160, c168 and c171. The FusionStorage Manager is installed on c160,
with FusionStorage Agents on c160, c168 and c171, and FusionStorage Clients on
c181, c172 and c178. We then create a 3TB volume from this storage pool and attach
it to the compute layer as a new block device. Finally, we install and configure GFS2
on the compute layer and mount it on top of the 3TB volume.
5.1.1 Benchmarks
We compare Gluon to two architectures: Spark-HDFS and vanilla Hadoop. We use
cache-intensive workloads for the comparisons with Spark-HDFS. To compare Gluon
with vanilla Hadoop we use Intel's HiBench suite[26].

In the Spark evaluation we use a simple Spark Count program and two workloads from
SparkBench[36]: Logistic Regression and PageRank. The Spark Count program reads
a randomly generated 8GB file, performs two random map operations, caches the RDD
and counts the number of lines twice.
Logistic Regression is a widely adopted machine learning tool used to predict continuous
and categorical data[29][25]. For instance, it can predict whether a patient has a given
type of cancer based on a variety of characteristics such as blood tests, disease history,
age, etc. The Logistic Regression algorithm is an ideal candidate for Spark caching
because it needs to hold an RDD in the cache while it iterates over that RDD: the
algorithm calculates the parameter vector, then updates and broadcasts it in each
iteration. We run Logistic Regression on a real dataset, Wikipedia articles, which
includes almost 7 million English articles[6]. We follow the SparkBench approach and
extract plain text from the Wikipedia XML articles using the WikiXMLJ parser[7]. We
then compute TF-IDF vectors over a fixed-size vocabulary from the set of documents
and use the TF-IDF output as the input to our Spark program. The output data from
TF-IDF is approximately 18.7GB in size; its format includes two columns, (1) a
category index and (2) an array of TF-IDFs.
To evaluate data-skewed workloads we focus on the PageRank algorithm[44]. The
algorithm was first used by the Google web search engine to rank pages by measuring
the importance of website pages based on the number and quality of links to a page.
We use the LiveJournal[4] graph data as the input data set of this workload. The
LiveJournal graph contains 68 million edges, and the data size is approximately 1.3GB.
In the Hadoop evaluation we first use a read-write test called DFSIO. DFSIO is a simple
READ/WRITE test that spawns multiple mappers of the HMR framework; the mappers
write/read randomly generated data to/from the target storage (e.g. HDFS). We then
use TeraSort, probably the most well-known Hadoop benchmark. The goal of TeraSort
is to sort a given amount of data as fast as possible; it combines testing of the HDFS
and MapReduce layers of a Hadoop cluster. In our case we sort a 10GB file from
HiBench. Finally, we re-use the LiveJournal graph to run PageRank on Hadoop
MapReduce.
We also tested Gluon under a completely different platform that is gaining popularity
in the Big Data community. In 2010, Google completely replaced its MapReduce
platform in favor of Pregel[38], which is built in the context of Bulk Synchronous
Parallel (BSP), a compute paradigm based on message passing[56]. Readers are
encouraged to learn more about these concepts in the aforementioned papers. We used
a Pregel-like framework called Apache Hama[50]. The main difference between BSP
frameworks and MapReduce is that BSP workers typically load data into their local
memories, compute on that local data and pass messages to their peers. BSP algorithms
tend to have multiple iterations, unlike two-stage MapReduce, and BSP workers write
data back to the underlying storage upon finishing all the required iterations.

Hama runs on top of Gluon seamlessly, without any difficulty in installation. The
configuration settings are similar to those used when deploying Hama on top of HDFS.
Hama can run with YARN or in standalone mode.
We run the label propagation algorithm (LPA) on the LiveJournal graph used
previously. Label propagation is extensively used in social networks to detect
communities based on the influence of a particular member. It is essentially a clustering
algorithm that associates each vertex in the graph with a certain community. LPA
represents workloads from companies that process large graphs on a daily basis, such
as Facebook, Twitter and Google. We vary the number of Hama workers to see how
the job duration declines.
5.2 Comparative evaluation using Spark
5.2.1 Spark count
In our first experiment, we test our AlluxioBlockManager implementation. We added
one class to the Spark code to allow the .persist command to store data in Alluxio. In
this experiment we provide a large amount of memory (20GB) to both Spark
MEM_ONLY and Alluxio.

Figure 5.1: Performance comparison of Spark caching methods

Figure 5.1 shows results from 4 job runs. As we can see from the experiment, native
.persist is much faster than the suggested functions that were designed to interact with
external services. This confirms that our implementation is on par with native Spark
caching given the same amount of memory.
5.2.2 Logistic Regression
We start with small Executor sizes and increase them to see performance gains in two
Spark-HDFS configurations (MEM_ONLY, MEM_AND_DISK), Gluon and an
off-the-shelf Alluxio configuration. Since we focus on cache performance in this
experiment, we pre-load the 18.7GB data into the Gluon disk cache and ingest the same
data to HDFS in the Spark-HDFS platform. This setup provides equal read performance
for both architectures and focuses on in-job caching performance.
Figure 5.2: Performance comparison of Spark caching methods vs. Gluon collaborative
caching and Alluxio caching. The number of cores for all runs is 21.

Figure 5.2 shows job duration, which includes training time (95%), testing time (2%)
and warm-up time. The default Spark setting lags significantly in lower-provisioned
runs. The reason is that the size of the cached data is 7.5GB, and the MEM_ONLY
configuration has to re-compute RDD partitions that did not fit in the cache. On the
other hand, we show that Gluon in its default configuration outperforms the Spark
default by 2.88x and matches the Spark MEM_AND_DISK configuration because
blocks that do not fit in RAMdisk are evicted to disk. The Gluon(extra idle)
configuration includes another idle
memory node that does not run the Spark program. From the experiment we can see
that extra RDD partitions are redirected to the idle memory node, hence the
performance gain. In the over-provisioned scenario all configurations match because
the entire RDD is kept in local memory. Off-the-shelf Alluxio suffers a significant
performance loss due to the architectural overheads of the
saveAsTextFile/saveAsObjectFile action operations.
5.2.3 PageRank
Although PageRank from SparkBench does not use the .cache function, it relies on
the GraphLoader class from the GraphX library. GraphLoader uses .cache intensively
to construct the graph from a text file. In fact, .cache is hardcoded in the GraphX
library, and users are bound to Spark's MEM_ONLY. We had to modify the GraphX
library to allow for various caching configurations in PageRank. Because GraphX is
hardcoded for Spark caching, we were unable to store the cached graph in the
off-the-shelf Alluxio cache.
Figure 5.3 shows the results of PageRank execution in 4 different modes. We again
observe a difference in performance between Spark's default MEM_ONLY and the
other configurations in low-provision modes. The reason is that some graph parts need
to be re-constructed multiple times. Gluon-default again has the upper hand and
outperforms MEM_ONLY by 2.73x.

We also see that the optimal MEM_AND_DISK option in Spark is slower than
Gluon-default. The LiveJournal graph results in task skews where at least one Executor
receives 30% more tasks (e.g. 240) than the average (e.g. 180). This results in
non-uniform cache writes that under-utilize the cache on some nodes and over-utilize
(spill) it on others.

The Gluon-extra-idle configuration has extra idle memory nodes of memory size equal
to that of the busy nodes, i.e. if Gluon-default had 3GB of RAM assigned, then
Gluon-extra-idle has 3GB of RAM from busy and 3GB of RAM from idle nodes. We
add Gluon-extra-idle to show the benefits of utilizing idle memory.
Figure 5.3: Performance comparison of Spark caching methods vs. Gluon collaborative caching.
5.2.4 Gluon job statistics
We gathered remote read/write statistics from Gluon in a low-provisioned scenario.
Table 5.1 shows available memory, remote reads/writes, eviction data sizes and final
output data sizes.

The LocalFirst policy makes sure that local memory is utilized to the fullest before
pushing blocks to a neighbour. We did not observe task skews in the Logistic Regression
runs. This means that all AlluxioWorkers with a total 3GB cache size ran out of memory
at approximately the same time; therefore, we did not see any remote memory pushes
or fetches in the default Gluon setting. On the other hand, in Gluon-extra-idle we did
see 33% and 2% of the data being pushed to the idle node in the 3GB and 7GB cache
size runs respectively.
In PageRank, we observed approximately 17.3% remote pushes and approximately
15% remote fetches for the small cache size in the default configuration. Here, nodes
ran out of memory quite fast. The remote push/fetch statistics are attributed to task
skews that made some nodes occupy memory at faster rates.

Table 5.1: Read and write statistics

Workload             Busy cache  Idle cache  Remote MEM  Remote MEM  Evicted to  Sent to remote
                     size        size        write (%)   read (%)    local disk  storage
Logistic Regression  3GB         0GB         0           0           4.5GB       0GB
Logistic Regression  3GB         3GB         33%         34%         1.5GB       0GB
Logistic Regression  7GB         0GB         0           0           1GB         0GB
Logistic Regression  7GB         7GB         2%          2%          0GB         0GB
PageRank             3GB         0GB         17%         15%         9.7GB       100MB
PageRank             3GB         3GB         51%         54%         6.7GB       100MB
PageRank             6GB         0GB         5%          4%          6.7GB       100MB
PageRank             6GB         6GB         45%         51%         1GB         100MB
PageRank             12GB        0GB         0.2%        0%          1GB         100MB
By summing the evicted data and cache sizes we can approximate the total amount of
data cached during the whole program run. The memory pushes are highest when idle
memory is available, which shows that the full cache memory is utilized before blocks
are evicted to disk.
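As a sanity check on Table 5.1, summing busy cache, idle cache and evicted data for the Logistic Regression rows whose caches filled up should approximate the 7.5GB of cached data reported in Section 5.2.2:

```python
# Sanity check on Table 5.1 (values in GB): for runs where the cache
# filled up, busy + idle + evicted should approximate the ~7.5GB of
# cached Logistic Regression data reported in Section 5.2.2.
rows = [
    # (busy_cache, idle_cache, evicted_to_disk)
    (3.0, 0.0, 4.5),   # default, 3GB cache
    (3.0, 3.0, 1.5),   # extra idle, 3GB cache
    (7.0, 0.0, 1.0),   # default, 7GB cache
]
totals = [busy + idle + evicted for busy, idle, evicted in rows]
print(totals)          # [7.5, 7.5, 8.0] -- all close to the 7.5GB dataset

# The over-provisioned run (7GB busy + 7GB idle, 0GB evicted) is the
# exception: its caches never fill, so the sum bounds capacity, not data.
```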
5.2.5 Discussion
Overall, the main advantage of Gluon is that it can perform remote memory pushes
and fetches, which allows it to utilize memory fully. Spark caching can also be tuned
by a user to the optimal level; however, that takes a preliminary set of runs to
understand how much data is actually being cached, which may not be an option in
production systems. Although no performance tuning is required for Gluon, in
low-provisioned cache-intensive applications it achieves a 2.7x speedup over the Spark
default mode and is on par with the optimal Spark configurations.

To make the comparisons fair, we set AlluxioWorker sizes according to how much
memory the Spark application consumes.
(a) WRITE (b) READ

Figure 5.4: DFSIO - writing and reading randomly generated 10GB data using 40 mappers (cores)
5.3 Comparative evaluation using Hadoop MapReduce
In typical enterprise analytics scenarios, an application developer has to move data
from a data warehouse to HDFS to perform data analytics in HMR, an event we call
ingest. This creates usability issues and performance degradation. We take note of
these performance drops by simulating ingestion time in our tests.

5.3.1 DFSIO

In this test we measure 3 different settings for HDFS: HDFS-WRITE, HDFS-READ
and HDFS-INGESTED-READ. The third setting measures data ingestion time to
HDFS using the hadoop fs -put command and adds the regular HDFS-READ time.
We compare the 3 HDFS settings to 4 Gluon settings: asynchronous write (ASYNC),
synchronous write (SYNC), cold cache read (REMOTE) and local cache read (LOCAL).
Figure 5.4 shows write comparisons of the vanilla Hadoop setup vs. Gluon. As we can
see from the figure, Gluon accelerates HDFS write performance in asynchronous mode
by 2.5 times. Figure 5.4 also compares read performance. HDFS-INGESTED-READ
is the slowest because ingesting a 10GB file from remote storage is typically done using
a single thread. Gluon-REMOTE, on the other hand, fetches data from FusionStorage
using 40 threads. Gluon-REMOTE also represents the worst-case scenario for a
collaborative cache read, i.e. this particular case presents a 100% cache-miss example.
Readers should also note that Gluon-REMOTE data fetch latency depends on the type
of storage and the location of the storage servers; in our experiments, the remote Server
SAN is located in the same datacenter but on a different set of racks. Finally,
Gluon-LOCAL shows a speedup of 1.85x in comparison to regular HDFS-READ. This
scenario occurs when data is fully cached and read from local RAMdisks.
5.3.2 TeraSort
The workload consists of one job that spawns a set of tasks. Each task reads a 128MB
file chunk, sorts it using the standard map/reduce sort (except for the partitioner) and
then writes the results back to the underlying file system. TeraSort is a good
approximation of a typical single-stage shuffle-heavy job in the analytics world, where
a user wants to load data, perform some manipulation and then store it.

Figure 5.5: TeraSort test. One job with 40 mappers and 20 reducers reading 128MB
files, sorting their contents and writing them back. The vertical axis indicates average
duration of a full job.

Figure 5.5 shows job duration across 4 different setups: HDFS, HDFS-INGESTED,
Gluon-HOT and Gluon-COLD. Like DFSIO, HDFS is an ideal HDFS setup where we
assume that input data resides inside the distributed file system. HDFS-INGESTED takes into
account the data ingestion time. Gluon-HOT assumes that input data is fully pre-loaded
into the Alluxio cache layer, while Gluon-COLD has no partitions of input data on the
cache layer; they represent the best- and worst-case cache scenarios respectively. Unlike
DFSIO, TeraSort shows a lower speedup of Gluon-HOT over HDFS. The reason is that
TeraSort does not only perform reads but also sorts, shuffles and writes.
5.3.3 PageRank
PageRank[44] is a complex CPU-intensive algorithm that ranks pages by looking at
the number and quality of links. In our case, all we need to understand is that PageRank
is an iterative HMR job, i.e. one PageRank job produces a chain of iterations of the
same program on different inputs, where the input to each iteration is the output of
the previous one. Hence, PageRank is a good example of an iterative job, or a chain of
jobs, in Hadoop MapReduce.

In this experiment, we re-use our graph data from LiveJournal[4], which brings the
experiment closer to real-world scenarios.

Figure 5.6: PageRank test. 1 initialization job and 3 ranking job iterations with 40
mappers and 20 reducers reading 128MB file chunks, computing their contents and
writing them back. The vertical axis indicates average duration of the full chain of 4 jobs.

Figure 5.6 shows the PageRank program job duration running on top of the vanilla
HDFS cluster vs. the Gluon cluster. In an iterative job such as PageRank, the initial
data load does not affect application performance. Hence, the
effects of a cold Gluon cache are negligible in this particular scenario. We also do not
perform ingestion for the HDFS cluster, i.e. we assume that data is already in HDFS.
As we can observe from the results, Gluon outperforms HDFS significantly due to
caching intermediate job outputs in memory.
5.3.4 Discussion
HMR is different from Spark because it follows a rigid map-then-reduce paradigm:
each reduce output always goes to HDFS. This creates I/O bottlenecks for iterative
jobs or chains of jobs. In these scenarios, Gluon outperforms HDFS due to cached
writes and hot reads. Moreover, a cache layer brings HMR job performance closer to
Spark's. Single-job cases represent the worst scenarios for Gluon, since data is fetched
from the remote Server SAN. However, even in this case, the performance is comparable
to that of the ideal HDFS cluster. Moreover, we see significant performance degradation
in HDFS if data ingestion has to take place.
5.4 Graph Processing Framework - Hama
Figure 5.7 shows the results of label propagation in the Hama framework. This is where
Gluon-HOT performs best regardless of the number of workers. The reason is that the
graph is initially loaded into the memory tier, so each time Hama workers read, they
just access local memory. With Gluon-COLD, on the other hand, the graph needs to
be fetched from remote storage. Also, since the output is only written once and is
smaller than the input, Alluxio write performance does not impact job execution
significantly. Another interesting observation is that the performance gap decreases
and all solutions "catch up" as the number of workers increases: the more workers read
in parallel, the smaller the effect of graph loading. All in all, the Hama tests show that
CPU-intensive data analytics workloads are not impacted in terms of performance.
Figure 5.7: Label Propagation. 76 iterations in a job with 1-50 workers reading a 1.3GB file at the first iteration, computing its contents in all of the iterations and writing the result back in the last iteration. The vertical axis indicates average job duration of 76 iterations.
5.5 Conclusion
In this chapter we showed how small code modifications in Spark led to significant
improvements over the suggested built-in caching methods. We also showed that Gluon
outperforms Spark's default configuration by more than 2.5x in low-provisioned job
runs. Moreover, Gluon in default mode is on par with Spark's optimal configuration
for cache-intensive applications. We also showed Gluon's remote memory push/fetch
statistics in uniform as well as data-skewed workloads. The measurements show that
skewed workloads incur some data movement across busy cache workers, and this
tendency increases dramatically when idle cache workers are available.

In our evaluation, we also looked at the Hadoop MapReduce framework[8]. We
concluded that Gluon can achieve up to a 1.85x increase in read throughput if data is
re-used. We also demonstrated how ingestion can significantly degrade performance
and that it is faster to fetch remote data on-demand. Finally, we showed that Gluon
expedites iterative jobs and chains of jobs by more than 30%.
We demonstrated how our platform performs in non-traditional analytics scenarios
such as Hama. From our tests we see that there is neither significant performance
improvement nor degradation when using Hama-like CPU-intensive frameworks.

In our experiments, we used well-known Big Data applications such as Logistic
Regression, TeraSort and PageRank, and we leveraged public data sources, including
the LiveJournal graph data[4] and Wikipedia pages[6].
Chapter 6
Related Work
In this chapter we discuss research related to our work and explain how our approach
differs from previous works. We start by outlining research conducted on the caching
layer in Spark and Hadoop. We then cover integrations of HPC storage platforms with
the Hadoop ecosystem that do not use caching. Finally, we look at designs that target
full-stack integration like Gluon.
6.1 Caching in Analytics
Caching in analytics frameworks has been studied extensively. Shared memory is not
new in the analytics world, and there have been a large number of attempts to improve
Spark and Hadoop caching.

Many works focus on Hadoop and Spark caching efficiency[45][21][33][43][32][18][12][2].
However, they focus either on improving read performance or on caching techniques.
They usually work on top of HDFS and ignore integration and flexibility. By and large,
all these works are complementary to Gluon and can be integrated to further improve
Gluon's caching techniques.
Tungsten[5], designed at Databricks, focuses on Spark JVM management. It tries to
improve on-heap memory management by exporting JVM objects to off-heap native
memory using Java Unsafe APIs. This offloads work from the JVM garbage collector
and hence reduces its overhead. Tungsten is one of the most famous caching
optimizations for Spark; however, it mostly focuses on SQL-based DataFrames and
only works with Spark.
Facade[42] performs a compiler-based transformation of analytics applications. It can
then manipulate objects by moving them into native off-heap RAM. Unlike Facade or
Tungsten, we provide an external memory store and management layer that allows for
full memory utilization as well as data propagation to cold storage.
SpongeFiles[20] is a distributed cache used in Hadoop MapReduce to avoid spilling
shuffle data to disk. It uses remote memory nodes to store data that is about to be
spilled and is best suited to skew-rich MapReduce workloads. Gluon is similar to
SpongeFiles in terms of its data propagation hierarchy. However, Gluon connects to a
large variety of underlying storage platforms, checkpoints data asynchronously and
works with newer Big Data applications such as Spark.
Apache Ignite[1] provides the IGFS layer, which can act as an in-memory file system,
just like Alluxio[35]. Apache Ignite tries to be many services at once, including a
scalable database, a key-value cache and a filesystem. However, IGFS does not support
as many storage solutions as Gluon does, and the framework does not focus on
transparent integration of remote storage layers. Gluon is based on Alluxio, which has
much richer integration support and focuses on transparent data movement between
the cache and storage tiers.
HDFS also provides a CacheManager[2], with which users can manually specify
frequently accessed filenames for regular caching in the cluster, similar to the pin
function in Alluxio. Users can also specify the number of replicas that should be kept
in memory. This can be beneficial when the popular files are smaller than the total
available memory in the cluster. The cached file replicas are treated as regular HDFS
replicas during ApplicationMaster and task execution. This approach requires the
administrator to know which replicas to cache, which incurs significant usability issues
for users.
EC-Cache[47] is based on the Alluxio source code. It allows balanced access to data
from object stores and cluster file systems by avoiding selective replication and relying
on erasure coding instead. Unlike EC-Cache, Gluon can have data imbalance that causes
more remote pushes and fetches. However, during our experiments, we did not see
significant overheads due to remote memory data fetching and pushing. Moreover, EC-Cache
aims at one particular optimization, avoiding replication in the cache layer, while Gluon
provides a set of different optimizations: platform integration, cache collaboration, and
transparent propagation.
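The erasure-coding idea behind EC-Cache can be illustrated with the simplest possible code: split an object into k data units plus a single XOR parity unit, so the object survives the loss of any one unit without full replication. This is only a toy; EC-Cache uses general (k, r) erasure codes, and the function names here are illustrative.

```python
# Toy illustration of erasure-coded caching: k data units plus one XOR
# parity unit, reconstructing the object if any single data unit is lost.
# (A lone XOR parity is the r = 1 special case of a general (k, r) code.)
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    unit = (len(data) + k - 1) // k
    units = [data[i * unit:(i + 1) * unit].ljust(unit, b"\0") for i in range(k)]
    parity = reduce(xor_bytes, units)
    return units, parity

def decode(units, parity, lost_index, orig_len):
    # XOR of the surviving data units with the parity recovers the lost unit.
    survivors = [u for i, u in enumerate(units) if i != lost_index] + [parity]
    recovered = reduce(xor_bytes, survivors)
    full = units[:lost_index] + [recovered] + units[lost_index + 1:]
    return b"".join(full)[:orig_len]

obj = b"hello erasure-coded cache"
units, parity = encode(obj, k=4)
assert decode(units, parity, lost_index=2, orig_len=len(obj)) == obj
```

Storage overhead here is 1/k of the object instead of a full extra replica, which is the load-balancing-without-replication trade-off the EC-Cache paper exploits.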
6.2 HPC and shared storage integrations
The NFS Connector is a software plug-in developed by NetApp[10] that allows the Hadoop
compute layer to access an NFS server. There is no caching layer in the NFS Connector,
so data locality is not supported. However, the connector does perform spatial data
pre-fetching: for example, if fooDir/foo1.txt is being accessed, it can try to pre-fetch
all other files under the directory fooDir. With a suitable configuration of endpoints
and large OS memory buffers, some locality can be achieved. On the other hand, if the
NFS Connector fully replaces HDFS, then intermediate values are stored on remote NFS
servers, which inevitably degrades the performance of HMR or Spark programs.
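The spatial pre-fetching behavior can be sketched as follows: a read of one file triggers background fetches of its directory siblings so that later tasks find them in warm OS buffers. The namespace, store and fetch function below are all illustrative stand-ins, not the connector's actual API.

```python
# Sketch of spatial pre-fetching in the spirit of the NFS Connector:
# reading fooDir/foo1.txt also warms fooDir/foo2.txt and fooDir/foo3.txt.
import os

# Toy "NFS namespace": directory -> file names, plus the file contents.
listing = {"/fooDir": ["foo1.txt", "foo2.txt", "foo3.txt"]}
store = {"/fooDir/foo1.txt": b"a", "/fooDir/foo2.txt": b"b",
         "/fooDir/foo3.txt": b"c"}
fetch_log = []

def fetch(path):
    fetch_log.append(path)  # stands in for an RPC to the NFS server
    return store[path]

def read_with_prefetch(path, prefetched):
    data = fetch(path)
    directory = os.path.dirname(path)
    for name in listing[directory]:
        sibling = f"{directory}/{name}"
        if sibling != path and sibling not in prefetched:
            fetch(sibling)          # warm sibling files ahead of demand
            prefetched.add(sibling)
    return data

read_with_prefetch("/fooDir/foo1.txt", prefetched=set())
print(fetch_log)  # demanded file first, then its siblings
```

Note that this only helps read paths; it does nothing for the intermediate-write problem described above.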
Ceph is a highly scalable object-based parallel file system[58]. Ceph's architecture
is very similar to that of HDFS; the main difference is that Ceph is POSIX compliant.
In the Ceph-Hadoop integration, data is spread across Ceph OSD servers[39]. The URL of
the metadata server is exposed to the Hadoop computation layer, e.g. ceph://mdtServerName:port.
Once the ApplicationMaster is launched, it sends a request to the metadata servers, which
return file information and object locations back to the ApplicationMaster. Unfortunately,
Ceph completely separates client programs from the storage layer, which by definition
implies that there is no locality for client programs. Another bottleneck is that the
Ceph integration stores intermediate values on the remote storage.
Lustre is a parallel file system generally deployed in HPC clusters[27]. Like Ceph,
Lustre is highly scalable and separates the storage layer, as well as metadata storage,
from the clients. One prominent integration was designed by Intel in 2013[31]. The key
idea of this integration was to utilize Lustre during the map-to-reduce phase transition.
The integration developers realized that all clients share the "same view" of the file
system. Thus, instead of storing, shuffling and sending intermediate key-value pairs to
reducers, the developers decided to store the pairs inside Lustre and inform the reducers
on how to access them. This is feasible because mappers and reducers are typically
launched on the same set of servers. In the Lustre integration, all servers of the
compute layer have the Lustre client installed; therefore all Lustre clients share the
same view of the file system, and any file can be accessed from any client node. This
approach reduces the overhead caused by processing and sending intermediate values.
However, this integration has no cache, so tasks have no local data.
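The shared-view shuffle can be sketched in a few lines: mappers write their partitioned output into a shared directory (standing in for the Lustre mount), and each reducer simply reads its partition's files from every mapper instead of receiving them over the network. All names and the layout are illustrative, not the Intel integration's actual format.

```python
# Sketch of a shared-file-system shuffle: mappers write partition files
# into one directory visible to all nodes; reducers read them directly.
import os
import tempfile
from collections import defaultdict

shared = tempfile.mkdtemp()  # stands in for a Lustre mount point

def map_phase(mapper_id, records, num_reducers):
    parts = defaultdict(list)
    for key, value in records:
        parts[hash(key) % num_reducers].append((key, value))
    for part, pairs in parts.items():
        with open(os.path.join(shared, f"map{mapper_id}-part{part}"), "w") as f:
            for k, v in pairs:
                f.write(f"{k}\t{v}\n")

def reduce_phase(part, num_mappers):
    # No network transfer: just read this partition from every mapper's file.
    pairs = []
    for m in range(num_mappers):
        path = os.path.join(shared, f"map{m}-part{part}")
        if os.path.exists(path):
            with open(path) as f:
                pairs += [line.rstrip("\n").split("\t") for line in f]
    return pairs

map_phase(0, [("a", 1), ("b", 2)], num_reducers=2)
map_phase(1, [("a", 3)], num_reducers=2)
print(reduce_phase(hash("a") % 2, num_mappers=2))
```

The sketch also makes the drawback visible: every read and write goes to the shared mount, so nothing is ever local to a task.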
Another Lustre integration uses RDMA[37]. Lu et al. accelerate the shuffle phase in
Spark by leveraging RDMA to avoid the overhead of socket-based communication. This
approach mitigates the locality overheads of the previous integrations. However, not
all users have RDMA-enabled networks, and the storage servers may be located outside
the network zone.
Gfarm is a general-purpose distributed file system[54] with an architecture similar to
HDFS. It has a single metadata server (MDS) and multiple I/O servers. Each I/O server
manages its local file system and provides access to the files in it. The client
accesses Gfarm storage using the Gfarm client library. The key idea in Gfarm is to
measure round-trip times (RTT) from clients to I/O servers. Since Gfarm relies heavily
on data replication, there needs to be an ordering of replica locations: if a replica is
located "far" from a client, Gfarm will give that client-replica pair low preference.
Moreover, Gfarm is POSIX-compliant, which makes it more user-friendly than HDFS.
Unfortunately, the Gfarm integration works the same way as HDFS, and therefore causes
the same set of usability issues.
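The RTT-based preference can be expressed as a one-line ordering: rank the I/O servers holding a replica by measured round-trip time and read from the nearest. The host names and RTT values below are illustrative.

```python
# Sketch of Gfarm-style replica preference: order the I/O servers that
# hold a replica by ascending round-trip time from this client.

def choose_replica(replica_hosts, rtt_ms):
    """Return replica hosts ordered from nearest to farthest."""
    return sorted(replica_hosts, key=lambda host: rtt_ms[host])

# Illustrative RTT measurements from one client to three I/O servers.
rtt_ms = {"io-local": 0.2, "io-rack": 0.9, "io-remote": 14.0}
ranked = choose_replica(["io-remote", "io-local", "io-rack"], rtt_ms)
print(ranked)  # ['io-local', 'io-rack', 'io-remote']; "far" replicas ranked last
```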
GlusterFS is another open-source distributed file system[16]. However, unlike
traditional distributed file stores, GlusterFS does not have a metadata manager. The
location of a file is determined through a static hash function: if a client wants to
access a file, it passes the file's pathname as the argument of the hash function. We
did not find a particular integration of Gluster with Hadoop, except one publicly
available on GitHub[9]. We investigated this open-source integration by analyzing its
implementation of the FileSystem API, and we deem it worth examining due to the
non-traditional architecture of Gluster. According to the open-source integration, the
location of a file is determined from its pathname in the Gluster file system. The
ApplicationMaster requests the locations of file blocks from the Gluster client, which
takes the pathname(s), parses them and determines the locations on the Gluster volumes.
Once the locations are determined, the Gluster client responds to the ApplicationMaster.
Since the data layout is distributed across the volumes, high task concurrency can be
achieved. If we co-locate Gluster volumes with Hadoop NodeManagers, then all mapper and
reducer containers will be created on top of the volumes. Since the ApplicationMaster
can obtain the exact locations of file splits on the Gluster volumes, high locality can
be achieved. However, this is only true for reading data. A drawback may arise at the
file creation stage: if a mapper wants to create a file and write data to it, there is
no guarantee that Gluster will create the file in the same physical unit where the
mapper runs.
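Both properties above follow from metadata-free placement, which the sketch below illustrates with CRC32 standing in for Gluster's hash: any client can compute a file's volume from its pathname alone (good for reads), but a writer has no control over which volume a new file hashes to. Volume names and the hash choice are illustrative.

```python
# Sketch of GlusterFS-style metadata-free placement: a static hash of the
# pathname picks the volume, so no metadata server is consulted.
import zlib

VOLUMES = ["vol0", "vol1", "vol2", "vol3"]

def volume_for(pathname: str) -> str:
    # crc32 stands in here for Gluster's elastic hash over the file name.
    return VOLUMES[zlib.crc32(pathname.encode()) % len(VOLUMES)]

# Reads: fully deterministic, every client computes the same location.
assert volume_for("/data/part-0001") == volume_for("/data/part-0001")

# Writes: the drawback from the text. A mapper running next to vol0 may
# create a file whose pathname hashes to a completely different volume.
print(volume_for("/tmp/mapper-output-42"))
```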
Hadoop's performance was also measured with PVFS [41]. The paper claims that PVFS can
match HDFS. However, the PVFS configuration requires tight coupling to Hadoop; thus,
performance degradation is expected for workloads other than Hadoop.
Hadoop natively provides object store integrations such as the S3 connector.
Unfortunately, the available documentation on integrating Hadoop and Amazon S3 only
shows how to use S3 in Hadoop configurations, without any detailed explanation of how
data is manipulated. For instance, T. White in his article [59] only explains how to
configure a Hadoop cluster such that files are transferred from S3 to HDFS and then
processed. This is not helpful for our understanding of the related work because we are
looking at integrations that replace HDFS completely. The Hadoop documentation also
explains how to connect to S3 and states that S3 can be used as a default storage for
YARN applications[8]. After analyzing the Hadoop source code, it becomes clear that the
S3 connector is very simple and does not use caching or asynchronous propagation to the
remote S3 bucket.
6.3 Full-stack integrations
Azure Data Lake Store[46] is a multi-tiered storage solution for analytics processing.
It uses a RAM-to-HDD tiered architecture to propagate writes and serve reads. However,
unlike Gluon, Azure Data Lake focuses on storage techniques and related challenges such
as security. The work does not address collaboration between nodes or caching of
intermediate data in analytics.
OctopusFS[30] is another multi-tiered storage platform. OctopusFS is very similar to
Gluon in that it also uses node collaboration to propagate data in the cluster: for
instance, RAM on all nodes is filled first, and then data goes to the next layer, such
as SSD or HDD. OctopusFS also targets Big Data applications such as Hadoop and Spark.
However, the work does not integrate Spark (i.e., it only stores final output data in
the storage layer); therefore, the respective evaluation results do not show significant
performance improvements.
The Triple-H[28] architecture is very similar to Gluon. The authors use HDFS as a cache
layer with RAMdisks and SSDs, with Lustre as the remote storage. They also propagate
data to cold storage transparently, thus significantly increasing write performance.
Triple-H has a solid architecture and explores a variety of data placement strategies
in the storage hierarchy. However, the storage hierarchy is vertical in Triple-H, while
Gluon explores both horizontal and vertical hierarchies in data placement through
collaboration between nodes. Moreover, Gluon has been integrated with Spark and can
accommodate a larger variety of storage platforms.
MixApart[40] is a modified version of Hadoop developed at the University of Toronto in
2013. It was one of the first projects to fully integrate Hadoop with NFS, and it also
utilizes an on-disk caching algorithm. MixApart is motivated by two observations: first,
NFS is a popular storage in most enterprise systems, and it is quite troublesome for
companies to periodically transfer data to HDFS for analytics; second, Facebook traces
showed high data re-use in analytics workloads. MixApart has a dedicated node called the
GateWay that connects to NFS. During the execution of a MapReduce program, MixApart
uploads the data to its caching layer, called XDFS, which is also a modified version of
HDFS. MixApart's novelty is its dynamic pre-fetching of data by looking at the queue of
tasks. However, MixApart completely disregards writing to remote storage. It is also not
compatible with resource managers and is tightly coupled to an old version of Hadoop.
6.4 Conclusion
In this chapter, we have discussed the rich literature of related work. There is a
plethora of good research publications related to Big Data caching practices. However,
most of them focus on in-depth optimization of one particular function, e.g. eviction.
Others aim for too much breadth, trying to be a cache for everything: analytics and
transactional workloads. Finally, there have been many attempts to integrate the Hadoop
ecosystem with HPC storage solutions. Many of them ignore locality problems, while
others are not flexible and have become outdated.
Chapter 7
Future Work and Final Remarks
In this dissertation, we showed how usability issues in state-of-art data analytics
platforms can lead to failed jobs, bad performance or poor utilization of resources. We
proposed Gluon - our consolidated, flexible platform for data analytics that can support
many state-of-art analytics frameworks. Our new architecture is based on previous case
studies and usability issues in current analytics engines and their storage solutions.
In more detail, our contributions are as follows:
Our Gluon caching layer provides global collaboration across the memories of all par-
ticipating compute (and storage) nodes. In addition, Gluon supports full integration
of the collaborative caching service with traditional consolidated storage back-end ser-
vices. With Gluon we emphasize the principle of data locality for in-memory data on any
compute node. At the same time, we take full advantage of fast remote memory access
when opportunities for memory availability in collaborating nodes exist. We describe
data propagation from the execution layer to the storage layer. Finally, as mentioned,
the seamless integration between caching and consolidated storage in Gluon means that
any updates for any files stored on back-end storage can be integrated in a new data
analytics pass transparently, automatically and on-demand. This avoids the cumbersome
data manipulations that separate on-disk data silos normally bring about, e.g., for
data analytics systems based on HDFS.
We discovered that memory management in existing deployments can deliver good
performance only when users are very familiar with the data access patterns of their
program runs. With the global collaborative cache management provided by Gluon, we
alleviated user concerns about memory depletion as well as under-utilization in Spark
applications.
During our journey, we tried a variety of complex systems and identified the best ones
to achieve our goal. We continuously changed open-source projects' code and tested each
modification extensively. We focused on the caching layer to ensure strong collaboration
between nodes and seamless data movement between tiers. We optimized Spark to connect
easily to the cache layer. Our connectors help to avoid overheads caused by
architectural limitations.
Our results show improvements in terms of usability, performance and robustness. We
have tested our system using real-world scenarios and data. We showed that Gluon can
provide optimal performance of native Spark applications in default mode and outperform
default Spark configurations by up to 3x. We also showed that caching increases write
performance for Hadoop MapReduce by 2.5x using asynchronous propagation.
Based on this initial prototype, our work can be continued in both depth (e.g. advancing
the caching layer) and breadth (e.g. advancing integration) as follows:
• Eviction - this includes concerns regarding the eviction process and aims to further
improve eviction policies in the first two tiers. Currently, Gluon cannot estimate
RAM capacities correctly when too many processes try to write to the same cache
worker. Therefore, in some cases a task may write to full RAM, which will cause
NoSpace exceptions. We also want to coordinate asynchronous checkpointing such
that it does not interfere with eviction processes.
• Memory copies - the second set of improvements targets lower-level memory
management. We want to eliminate as many memory copies during writes and reads as
possible. We also want to avoid serialization bottlenecks in Spark and Hadoop by
moving data in raw form. Due to recent improvements and cost declines in network
hardware, remote data transfer is a smaller bottleneck than the typical
CPU-intensive serialization process.
• Shuffler - we also want to advance the notion of a "memory pool" where all memory
is consolidated into a single pool that is shared across a variety of different
applications. This includes re-doing the spilling mechanisms in Spark and Hadoop
during the shuffle phase. We believe sending spilled data to remote memory is
faster than spilling it to disk.
• Specialized Graph Processing - we have also tested our platform with Apache
Hama[50] to see if BSP-based[56] algorithms can work with Gluon. While testing
Hama, we discovered interesting patterns in skewed graphs. Since Hama workers
typically rely on message passing, the message queues become very imbalanced
across different workers. We want to investigate the opportunity of offloading
message queues to the remote memory pool in Gluon and fetching them on demand.
This should allow for full memory utilization in graph processing frameworks and,
again, make users worry less about resource provisioning.
Bibliography
[1] Apache ignite. https://ignite.apache.org/.
[2] Apache Software Foundation. HDFS centralized cache management. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html.
[3] FusionStorage block distributed storage system. http://e.huawei.com/en/products/cloud-computing-dc/cloud-computing/fusionstorage/fusionstorage-block.
[4] LiveJournal social graph. https://snap.stanford.edu/data/soc-LiveJournal1.html.
[5] Project Tungsten: Bringing Apache Spark closer to bare metal. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.
[6] Wikipedia data. http://dumps.wikimedia.org/enwiki/.
[7] Wikixmlj parser. https://code.google.com/p/wikixmlj/.
[8] Apache hadoop. http://hadoop.apache.org, 2009.
[9] Glusterfs-hadoop. https://github.com/gluster/glusterfs-hadoop, 2014.
[10] NetApp NFS Connector. https://github.com/NetApp/NetApp-Hadoop-NFS-Connector, 2014.
[11] S3 Amazon. Amazon simple storage service (amazon s3), 2012.
[12] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth
Kandula, Scott Shenker, and Ion Stoica. Pacman: Coordinated memory caching for
parallel jobs. In Proceedings of the 9th USENIX conference on Networked Systems
Design and Implementation, pages 20–20. USENIX Association, 2012.
[13] Joe Arnold. OpenStack Swift: Using, Administering, and Developing for Swift Object
Storage. O'Reilly Media, Inc., 2014.
[14] Brad Calder, Tony Wang, Shane Mainali, and Jason Wu. Windows azure blob, 2009.
[15] Tom Clark. Designing Storage Area Networks: A Practical Reference for Imple-
menting Storage Area Networks. Addison-Wesley Longman Publishing Co., Inc.,
2003.
[16] Alex Davies and Alessandro Orsaria. Scale out with glusterfs. Linux Journal,
2013(235):1, 2013.
[17] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large
clusters. Communications of the ACM, 51(1):107–113, 2008.
[18] Francis Deslauriers, Peter McCormick, George Amvrosiadis, Ashvin Goel, and An-
gela Demke Brown. Quartet: Harmonizing task scheduling and caching for cluster
computing. In HotStorage, 2016.
[19] Jens Dittrich and Jorge-Arnulfo Quiane-Ruiz. Efficient big data processing in hadoop
mapreduce. Proceedings of the VLDB Endowment, 5(12):2014–2015, 2012.
[20] Khaled Elmeleegy, Christopher Olston, and Benjamin Reed. Spongefiles: Mitigating
data skew in mapreduce using distributed memory. In Proceedings of the 2014 ACM
SIGMOD international conference on Management of data, pages 551–562. ACM,
2014.
[21] Avrilia Floratou, Nimrod Megiddo, Navneet Potti, Fatma Ozcan, Uday Kale, and
Jan Schmitz-Hermes. Adaptive caching algorithms for big data systems. 2015.
[22] Apache Giraph. Giraph, 2015.
[23] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D
Joseph, Randy H Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for
fine-grained resource sharing in the data center. In NSDI, volume 11, pages 22–22,
2011.
[24] Steve Hoffman. Apache Flume: Distributed Log Collection for Hadoop. Packt Pub-
lishing Ltd, 2013.
[25] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic
regression, volume 398. John Wiley & Sons, 2013.
[26] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. The hibench
benchmark suite: Characterization of the mapreduce-based data analysis. In Data
Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on,
pages 41–51. IEEE, 2010.
[27] Intel Corporation. Lustre * Software Release 2.x.
[28] Nusrat Sharmin Islam, Xiaoyi Lu, Md Wasi-ur Rahman, Dipti Shankar, and Dha-
baleswar K Panda. Triple-h: A hybrid approach to accelerate hdfs on hpc clus-
ters with heterogeneous storage architecture. In Cluster, Cloud and Grid Comput-
ing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pages 101–110.
IEEE, 2015.
[29] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduc-
tion to statistical learning, volume 112. Springer, 2013.
[30] Elena Kakoulli and Herodotos Herodotou. Octopusfs: A distributed file system
with tiered storage management. In Proceedings of the 2017 ACM International
Conference on Management of Data, pages 65–78. ACM, 2017.
[31] Omkar Kulkarni. Hadoop mapreduce over lustre. In Lustre User’s Group Conference,
2013.
[32] Mayuresh Kunjir, Brandon Fain, Kamesh Munagala, and Shivnath Babu. Robus:
Fair cache allocation for data-parallel workloads. In Proceedings of the 2017 ACM
International Conference on Management of Data, pages 219–234. ACM, 2017.
[33] Jaewon Kwak, Eunji Hwang, Tae-kyung Yoo, Beomseok Nam, and Young-ri Choi.
In-memory caching orchestration for hadoop. In Cluster, Cloud and Grid Computing
(CCGrid), 2016 16th IEEE/ACM International Symposium on, pages 94–97. IEEE,
2016.
[34] Steven Levine. Red Hat Enterprise Linux 6 Global File System 2. Red Hat.
[35] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Eric Baldeschwieler, Scott Shenker, and
Ion Stoica. Tachyon: Memory throughput i/o for cluster computing frameworks.
memory, 18:1, 2013.
[36] Min Li, Jian Tan, Yandong Wang, Li Zhang, and Valentina Salapura. Sparkbench:
a comprehensive benchmarking suite for in memory data analytic platform spark.
In Proceedings of the 12th ACM International Conference on Computing Frontiers,
page 53. ACM, 2015.
[37] Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, and Dhabaleswar K DK Panda. High-
performance design of apache spark with rdma and its benefits on various workloads.
In Big Data (Big Data), 2016 IEEE International Conference on, pages 253–262.
IEEE, 2016.
[38] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn,
Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph pro-
cessing. In Proceedings of the 2010 ACM SIGMOD International Conference on
Management of data, pages 135–146. ACM, 2010.
[39] Carlos Maltzahn, Esteban Molina-Estolano, Amandeep Khurana, Alex J Nelson,
Scott A Brandt, and Sage Weil. Ceph as a scalable alternative to the hadoop
distributed file system. login: The USENIX Magazine, 35:38–49, 2010.
[40] Madalin Mihailescu, Gokul Soundararajan, and Cristiana Amza. Mixapart: Decou-
pled analytics for shared storage systems. In Presented as part of the 11th USENIX
Conference on File and Storage Technologies (FAST 13), pages 133–146, 2013.
[41] Esteban Molina-Estolano, Maya Gokhale, Carlos Maltzahn, John May, John Bent,
and Scott Brandt. Mixing hadoop and hpc workloads on parallel filesystems. In
Proceedings of the 4th Annual Workshop on Petascale Data Storage, pages 1–5.
ACM, 2009.
[42] Khanh Nguyen, Kai Wang, Yingyi Bu, Lu Fang, Jianfei Hu, and Guoqing Xu.
Facade: A compiler and runtime for (almost) object-bounded big data applications.
In ACM Sigplan Notices, volume 50, pages 675–690. ACM, 2015.
[43] Hyunkyo Oh, Kiyeon Kim, Jae-Min Hwang, Junho Park, Jongtae Lim, Kyoungsoo
Bok, and Jaesoo Yoo. A distributed cache management scheme for efficient accesses
of small files in hdfs. The Journal of the Korea Contents Association, 14(11):28–38,
2014.
[44] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank
citation ranking: bringing order to the web. 1999.
[45] Qifan Pu, Haoyuan Li, Matei Zaharia, Ali Ghodsi, and Ion Stoica. Fairride: Near-
optimal, fair cache sharing. In NSDI, pages 393–406, 2016.
[46] Raghu Ramakrishnan, Baskar Sridharan, John R Douceur, Pavan Kasturi, Balaji
Krishnamachari-Sampath, Karthick Krishnamoorthy, Peng Li, Mitica Manu, Spiro
Michaylov, Rogerio Ramos, et al. Azure data lake store: A hyperscale distributed
file service for big data analytics. In Proceedings of the 2017 ACM International
Conference on Management of Data, pages 51–63. ACM, 2017.
[47] KV Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ram-
chandran. Ec-cache: Load-balanced, low-latency cluster caching with online erasure
coding. In OSDI, pages 401–417, 2016.
[48] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and
Carlo Curino. Apache tez: A unifying framework for modeling and building data
processing applications. In Proceedings of the 2015 ACM SIGMOD international
conference on Management of Data, pages 1357–1369. ACM, 2015.
[49] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. De-
sign and implementation of the sun network filesystem. In Proceedings of the Summer
USENIX conference, pages 119–130, 1985.
[50] Sangwon Seo, Edward J Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, and
Seungryoul Maeng. Hama: An efficient matrix computation with the mapreduce
framework. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE
Second International Conference on, pages 721–726. IEEE, 2010.
[51] Spencer Shepler, Mike Eisler, David Robinson, Brent Callaghan, Robert Thurlow,
David Noveck, and Carl Beame. Network file system (nfs) version 4 protocol. Net-
work, 2003.
[52] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The
hadoop distributed file system. In 2010 IEEE 26th symposium on mass storage
systems and technologies (MSST), pages 1–10. IEEE, 2010.
[53] Aameek Singh, Madhukar Korupolu, and Dushmanta Mohapatra. Server-storage
virtualization: integration and load balancing in data centers. In Proceedings of the
2008 ACM/IEEE conference on Supercomputing, page 53. IEEE Press, 2008.
[54] Osamu Tatebe, Kohei Hiraga, and Noriyuki Soda. Gfarm grid file system. New
Generation Computing, 28(3):257–275, 2010.
[55] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka,
Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehous-
ing solution over a map-reduce framework. Proceedings of the VLDB Endowment,
2(2):1626–1629, 2009.
[56] Leslie G Valiant. A bridging model for parallel computation. Communications of
the ACM, 33(8):103–111, 1990.
[57] Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Ma-
hadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth
Seth, et al. Apache hadoop yarn: Yet another resource negotiator. In Proceedings
of the 4th annual Symposium on Cloud Computing, page 5. ACM, 2013.
[58] Sage A Weil, Scott A Brandt, Ethan L Miller, Darrell DE Long, and Carlos
Maltzahn. Ceph: A scalable, high-performance distributed file system. In Pro-
ceedings of the 7th symposium on Operating systems design and implementation,
pages 307–320. USENIX Association, 2006.
[59] Tom White. Running hadoop mapreduce on amazon ec2 and amazon s3. Retrieved
March, 29:2009, 2007.
[60] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Mur-
phy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient dis-
tributed datasets: A fault-tolerant abstraction for in-memory cluster computing.
In Proceedings of the 9th USENIX conference on Networked Systems Design and
Implementation, pages 2–2. USENIX Association, 2012.
[61] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion
Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.