Scalable Collaborative Caching and Storage Platform for
Data Analytics
by
Timur Malgazhdarov
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2018 by Timur Malgazhdarov
Abstract
Scalable Collaborative Caching and Storage Platform for Data Analytics
Timur Malgazhdarov
Master of Applied Science
Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2018
The emerging Big Data ecosystem has brought about a dramatic proliferation of paradigms
for analytics. In the race for the best performance, each new engine enforces tight
coupling of analytics execution with caching and storage functionalities. This one-for-all
approach has led either to oversimplifications, where traditional functionality was dropped,
or to additional configuration options that created more confusion about optimal settings. We
avoid user confusion by following an integrated multi-service approach where we assign
responsibilities to decoupled services. In our solution, called Gluon, we build a
collaborative cache tier that connects state-of-the-art analytics engines with a variety of storage
systems. We use both open-source and proprietary technologies to implement our archi-
tecture. We show that Gluon caching can achieve 2.5x-3x speedup when compared to
uncustomized Spark caching while displaying higher resource utilization efficiency. Fi-
nally, we show how Gluon can integrate traditional storage back-ends without significant
performance loss when compared to vanilla analytics setups.
Acknowledgements
I would like to thank my supervisor, Professor Cristiana Amza, for her knowledge,
guidance and support. It was my privilege and honor to work under Professor Amza’s
supervision.
I would also like to thank my examination committee members: Professor Eyal de
Lara, Professor Ashvin Goel, and Professor Ashish Khisti for their valuable comments
and feedback. I am truly grateful to my colleagues and lab mates: Dr. Stelios Sotiriadis,
Seyed Ali Jokar, and Arnamoy Bhattacharyya for their knowledge, help, and support.
Last but not least, I would like to thank my family, especially my mother Nurgul
Yessetova for her understanding, love, and support.
Contents
Acknowledgements

Contents

1 Introduction

2 Background
2.1 Analytics Engines
2.1.1 Hadoop MapReduce (HMR)
2.1.2 Spark
2.1.3 Specialized Graph Processing
2.2 Resource Managers
2.3 Storage platforms
2.3.1 Network-attached storage
2.3.2 Storage Area Networks
2.3.3 Distributed systems with direct-attached storage
2.4 Distributed Cache
2.4.1 Alluxio
2.5 Common Solutions
2.5.1 Vanilla Hadoop Solution
2.5.2 Vanilla Spark Solution
2.6 Conclusion

3 Thesis Idea and Design
3.1 Thesis Idea
3.2 Usability Issues
3.2.1 Case Study: Spark
3.2.2 HDFS
3.3 Proposed Design
3.3.1 Collaborative caching layer
3.3.2 Service Decoupling and Modularity
3.3.3 Consolidated Storage Layer
3.4 Summary

4 Implementation
4.1 Components
4.1.1 Alluxio
4.1.2 Server SAN
4.1.3 GFS2
4.1.4 YARN
4.2 Control and Data Flow
4.3 Connecting storage component
4.3.1 Server SAN to filesystem connection
4.4 Connecting GFS2 with Analytics Engines
4.5 Cache integration
4.5.1 GFS2 to Alluxio connection
4.6 Spark integration
4.7 Additional optimizations
4.7.1 Asynchronous Delete
4.7.2 File consistency checker
4.8 Summary

5 Evaluation
5.1 Environment Setup
5.1.1 Benchmarks
5.2 Comparative evaluation using Spark
5.2.1 Spark count
5.2.2 Logistic Regression
5.2.3 PageRank
5.2.4 Gluon job statistics
5.2.5 Discussion
5.3 Comparative evaluation using Hadoop MapReduce
5.3.1 DFSIO
5.3.2 Terasort
5.3.3 PageRank
5.3.4 Discussion
5.4 Graph Processing Framework - Hama
5.5 Conclusion

6 Related Work
6.1 Caching in Analytics
6.2 HPC and shared storage integrations
6.3 Full-stack integrations
6.4 Conclusion

7 Future Work and Final Remarks

Bibliography
Chapter 1
Introduction
Several data analytics paradigms have been recently proposed in order to accommodate
the growing needs of Big Data. Each new paradigm brought with it specialization for
a particular need of data analytics workloads. At the same time, each such specialization
had as a side effect a significant departure from existing data processing paradigms.
From a usability perspective, this trend makes it increasingly difficult to analyse the
trade-offs of existing offerings and determine the appropriate platform support, including
interfaces, environments, settings and configurations for both functionality and optimal
performance. In other words, as many different paradigms have proliferated to facilitate
various data management needs, they have made usability and platform management
and integration itself a growing concern.
For example, the initial MapReduce offerings, such as Apache Hadoop[8], came with
a departure from traditional approaches to data processing. Relational data access typ-
ically used SQL-based interfaces to data maintained by consolidated storage back-ends.
Newer data analytics systems, such as Hadoop, not only introduced a new Java-based
data processing language; they also required that data reside in a distributed fashion,
on compute nodes, which formed a separate data silo for data analytics. Spark[61] came
later with yet another data processing language, Scala, and also with an even more pronounced
decoupling from persistent data storage concerns. Both paradigms imply that
input data and intermediate data are stored in a distributed fashion on new commodity
distributed file systems such as HDFS[52]. Moreover, both Apache Hadoop and Spark
had their own data caching techniques, whose only commonality was the reliance on
data locality and distributed file system principles.
On the other hand, Apache Hama[50] and Giraph[22] have been recently introduced
for better support of graph-based data analytics as compared with Apache Hadoop and
Spark. The BSP[56] data processing paradigm, which they proposed, strays from the
data locality principle used in all former data analytics paradigms. This makes typical
performance enhancements for distributed data analytics, such as network traffic
avoidance and effective caching, difficult or impossible.
In this work, we propose a scalable, unified, caching and storage platform for data
analytics, called Gluon. Our unified platform provides performance, robustness and ease
of use for any data analytics paradigm currently in use with little or no modifications.
Gluon comes with two essential services for integration of platform support for all types
of data analytics.
First, our Gluon caching layer supports global collaborative caching across the
memories of all participating compute (and storage) nodes. Second, Gluon supports
full integration of the collaborative caching service with traditional consolidated storage
back-end services.
With Gluon we emphasize the principle of data locality for in-memory data on any
compute node. At the same time, we take full advantage of fast remote memory access
when opportunities for memory availability in collaborating nodes exist. Such opportu-
nities may be present due to a variety of reasons. For example, compute nodes may be
temporarily idle due to imperfect load balancing, such as that created by fault-induced
stragglers or skewed workloads. Furthermore, unused memory may be available on back-end
storage nodes, which can be leveraged by compute nodes.
Whenever data would be normally evicted from the local in-memory cache on any
compute node, Gluon has the capability to push the data to be evicted to a remote node.
Conversely, Gluon fetches remote in-memory data on-demand from collaborative nodes
upon subsequent local access. We currently opt for disjoint caching of data items in the
collaborative in-memory cache; therefore, upon a remote fetch, the data item is discarded
locally after use.
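The eviction and fetch paths described above can be sketched in a toy model. The names and data structures below are ours, purely illustrative of the policy, not Gluon's actual code: each node keeps a local LRU cache, pushes its eviction victim to a peer with free capacity, and serves a later local miss by fetching the block from the peer and removing the remote copy, so at most one in-memory copy of each block exists.

```python
# Toy model of collaborative, disjoint caching (hypothetical names).
from collections import OrderedDict

class Node:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity          # max blocks held in local memory
        self.cache = OrderedDict()        # block_id -> data, in LRU order
        self.peers = []                   # other nodes in the cluster

    def has_room(self):
        return len(self.cache) < self.capacity

    def put(self, block_id, data):
        if block_id not in self.cache and not self.has_room():
            victim_id, victim_data = self.cache.popitem(last=False)
            for peer in self.peers:       # push the victim to an idle peer
                if peer.has_room():
                    peer.cache[victim_id] = victim_data
                    break                 # if no peer has room, drop it
        self.cache[block_id] = data
        self.cache.move_to_end(block_id)

    def get(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        for peer in self.peers:           # remote fetch on a local miss
            if block_id in peer.cache:
                return peer.cache.pop(block_id)  # disjoint: discarded after use
        return None                       # fall back to the storage tier
```

In the real system, blocks travel over the network and the fallback path reads from the consolidated storage tier rather than returning nothing.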
Next, Gluon brings together the benefits of large scale, on-demand in-memory caching
on one hand, and traditional, highly robust, on-disk data redundancy and archival
schemes on the other hand. Specifically, the global in-memory collaborative cache space
could be on the order of terabytes in total size for a cluster of compute and storage
nodes. However, when the total available cache space is close to exhaustion, we
have the option to proactively start writing out dirty blocks of cache to persistent storage.
If the need to swap out to disk arises, such blocks can subsequently simply be
discarded from the cache instead of being synchronously written out to disk. Asynchronous
disk writes to back-end storage can also effectively support a periodic, transparent
checkpointing service for data analytics objects. Any data item can be checkpointed to
stable back-end storage with RAID-level redundancy by writing it asynchronously, e.g.,
periodically and transparently, with no impact on the ongoing computation.
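The proactive write-back idea can be sketched as a background flusher (names are ours, purely illustrative): dirty blocks are queued and written to back-end storage off the critical path, so a flushed block can later be evicted by simply discarding it rather than by a synchronous disk write.

```python
# Sketch of asynchronous checkpointing: checkpoint() returns immediately,
# a background thread performs the actual write-back.
import threading, queue

class Checkpointer:
    def __init__(self, storage):
        self.storage = storage            # stands in for the back-end tier
        self.dirty = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def _drain(self):
        while True:
            item = self.dirty.get()
            if item is None:              # shutdown sentinel
                break
            block_id, data = item
            self.storage[block_id] = data # write-back to persistent storage

    def checkpoint(self, block_id, data):
        self.dirty.put((block_id, data))  # does not block the computation

    def close(self):
        self.dirty.put(None)
        self.worker.join()
```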
Finally, as mentioned, the seamless integration between caching and consolidated
storage in Gluon means that any updates for any files stored on back-end storage can
be integrated in a new data analytics pass transparently, automatically, on-demand.
This avoids the cumbersome data manipulations that separate on-disk data silos normally
bring about, e.g., for data analytics systems based on HDFS. For example, incremental
additions to log files that were previously processed by the data analytics framework
would normally need to be copied into the separate analytics data silos, possibly by
hand. In contrast, with Gluon, any data block from back-end storage can be brought
into any compute node’s cache, on-demand, at any time.
Figure 1.1 shows the proposed architecture of Gluon. The disk and storage man-
agement is fully outsourced to the consolidated storage layer. Replication, journaling,
data compression and other techniques are performed by the specialized storage software
installed on servers A, B and C. Data is asynchronously checkpointed from the cache
layer to the storage layer. The cache layer nodes share the same view of the storage files
or objects. Data is fetched from the storage layer on-demand. The cache layer manages
a memory pool for all applications running on top of the cache. Cache nodes collaborate
with each other and move data from busy nodes to more idle ones, thus fully utilizing
memory in the pool. Overall, Gluon disaggregates analytics engine components into
specialized, optimally-managed components.

Figure 1.1: Proposed architecture.
We implement our Gluon unified caching layer from a variety of (mostly) open-source
software components: YARN[57], Spark[61], Alluxio[35] (previously
Tachyon), GFS2[34], and the proprietary Huawei FusionStorage. However, our
architecture is modular: many of the existing components can be replaced, and similar
components can be interconnected for easy plug and play.
Gluon is based on a RAM disk and hence currently offers a file API: it can be placed
under any application that supports the Hadoop FileSystem API. The storage tier of
the platform is Server SAN software called FusionStorage. It consolidates all available
disks on the cluster into a storage pool. We designate the Server SAN nodes that
consolidate the disks as SAN worker nodes. From the storage pool we create a large volume
and attach it to the cluster nodes such that all nodes share this volume as one block
device. We designate the nodes that have the volume attached as storage client nodes.
Note that SAN workers and clients reside on different nodes. On top of the Server SAN
we install Global File System 2. GFS2 is installed on the storage client nodes. GFS2 is
a clustered file system that allows for synchronized access to a shared block device. In
our case the shared block device is the SAN volume from the SAN storage pool. The
cache tier is based on open-source Alluxio[35]. Alluxio is an in-memory cache that can
interact with YARN applications (MapReduce, Spark, etc.). We modify Alluxio such that
it has a shared view of GFS2 and can cache files from it, placing them inside
the in-memory cache. We extend Alluxio to support asynchronous writes to GFS2, and we
also extend Spark to allow for a seamless connection to Alluxio.
We show that our Gluon caching layer can be readily used by a variety of data
analytics packages with little or no modification. In our evaluation, we have used and
empirically tested Gluon in conjunction with Spark and Hadoop MapReduce (HMR). We
use real-world data and applications to test Spark and Gluon default configurations. We
also look at the PageRank algorithm and utilize Spark’s GraphX library, which is hardcoded
to cache graph data into Spark memory. We show that for cache-intensive workloads
Gluon outperforms Spark by 2.5x - 3x. We also show that Gluon has the same perfor-
mance as Spark with optimal configuration or over-provisioned RAM sizes. Moreover, we
show Gluon vs. HDFS comparisons in HMR workloads: Terasort and PageRank. Gluon
achieves up to 1.85x speedup in reads of re-used data. In addition, we demonstrate how
manual ingestion of data affects overall performance. Finally, Gluon expedites iterative
HMR jobs by more than 30%.
For Spark, HMR and many other analytics engines, Gluon provides the ease of use,
functionality and opportunities for transparent performance boosting that each of the
current schemes is missing.
For Spark, we found that memory management is actually very brittle. The user
needs to manually specify the appropriate memory allocation to Spark; otherwise
Spark jobs risk crashing. Moreover, out of the available
memory allocation as specified by the user, Spark always has a boundary for the memory
to be used for Spark computation versus the memory to be used for storage caching.
The newly proposed dynamic partitioner cannot solve the usability problem either, because users
still need to choose the memory fraction that can be reclaimed for storage space. With
global collaborative cache management left to Gluon, both the user’s memory concern
and the potential memory waste due to data skews are readily alleviated; moreover, we show
performance boosts for the Spark jobs whenever remote memory availability can be lever-
aged. Finally, in Spark, if a node crashes, then the data objects on that node need to be
recomputed either from scratch or from a user-inserted checkpoint. Gluon adds flexibility
by transparently performing asynchronous checkpointing of objects to stable back-end
storage with no observable overheads for the application.
For Hadoop MapReduce, we found that inter-job data exchange is tightly coupled with
HDFS. Reducers always write to HDFS and the next set of mappers has to read from
disk. Gluon expedites this exchange and asynchronously checkpoints inter-job data in
case the next job fails. Hence, it is best suited for iterative jobs, chains of jobs and high data
re-use jobs.
The next Chapter provides a brief review of popular analytics engines, storage and
cache solutions, and current vanilla deployments. Then, Chapter 3 reviews case studies
of issues that affect analytics in production systems and proposes the new design. In Chapter 4,
we reveal implementation details and go deep into the technicalities of the platform we
developed. We also discuss benchmarks for our evaluation and the stress tests we per-
formed to understand implementation bottlenecks. Chapter 5 introduces the deployment
specifics, configurations and evaluation methodology, followed by result analysis and
discussion. Related work comes next, along with a final chapter that concludes this thesis and
introduces possible directions for future work.
Chapter 2
Background
In this Chapter, we will discuss various analytics engines and storage platforms. In the
first section we will review basic mechanisms behind most popular analytics engines like
Apache MapReduce [19] and Apache Spark[61].
Over the last decade the popularity of Hadoop-related engines has been increasing.
Today, analytics engines are fragmented across different areas of data processing.
For instance, Apache Hama[50] is targeted at large-scale graph processing algorithms.
Another example is Hive[55] that converts SQL queries into a chain of MapReduce jobs.
On top of these engines, the Big Data world has introduced resource managers that are
taking over job scheduling responsibilities from engine-specific schedulers. YARN[57] and
Mesos[23] are the most popular resource managers that can support a variety of applica-
tions including Hadoop MapReduce, Spark and Tez[48].
In the second section, we will cover popular storage platforms that are used in cloud
and enterprise settings. Network-attached storage and storage area networks are still heavily
utilized by enterprises and cloud providers. Object stores such as S3[11] became
extremely popular due to the rise of cloud computing.
We will then recap how existing analytics engines are being deployed. Reviewing
the advantages and disadvantages of different solutions helps us discover various aspects
that play a crucial role when running analytics workloads in production environments.
There have been a number of attempts to consolidate analytics with large-scale storage
services. The majority of proposed architectures either introduced new usability issues or
lacked satisfactory performance. Moreover, the emergence of YARN and Mesos also set
new rules for job scheduling that affected previous integration techniques. Finally, a novel
in-memory cache, Alluxio, opened new frontiers for consolidation mechanisms.
2.1 Analytics Engines
2.1.1 Hadoop MapReduce (HMR)
The core of computing in existing data analytics systems is the algorithm called MapRe-
duce [17]. It is an execution strategy used for processing large data sets. MapReduce
spawns multiple workers in parallel on commodity machines that usually host data being
processed.
The algorithm has two phases: a map phase and a reduce phase (Figure 2.1).

Figure 2.1: A MapReduce example.

During the map phase, task executors or mappers work independently of each other on local input
data. Mappers extract input values from files and typically generate key-value pairs.
After the map phase, intermediate data gets sorted by keys and split into partitions.
These partitions are then shuffled across computing machines, and then the reduce phase
starts execution. Intermediate data will always be stored on local disk if it cannot fit
into memory. This process is called “spilling”. It is important to understand where
intermediate values are being stored. For example, some big data architectures may store
key-value partitions on remote disk storage, and then access them again to shuffle across
the network. This can hurt performance through disk and network bottlenecks.
Reducers start to execute once partitions become available to them; their job is to
reduce the number of keys by performing operations such as aggregate, filter, search etc.
The final results are then stored in the underlying file system, e.g., HDFS.
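The two phases can be illustrated with a self-contained word-count sketch (our own toy code, not HMR's): mappers emit key-value pairs, the pairs are grouped by key in a stand-in for the sort/shuffle step, and reducers aggregate the values for each key.

```python
# Minimal word-count MapReduce: map -> sort/group (shuffle) -> reduce.
from itertools import groupby

def map_phase(document):
    # Each mapper emits a (word, 1) pair per word in its local input.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Sorting by key plays the role of the shuffle; each group is one
    # reducer's input, aggregated here by summation.
    pairs.sort(key=lambda kv: kv[0])
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

def mapreduce(documents):
    intermediate = []
    for doc in documents:              # mappers would run in parallel
        intermediate.extend(map_phase(doc))
    return reduce_phase(intermediate)
```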
The Hadoop MapReduce (HMR)[8] algorithm is one implementation of the MapReduce
paradigm. HMR is designed to work in a bundle with the Hadoop Distributed File System
(HDFS). Typically, HDFS nodes are co-located with nodes that run HMR tasks. The two
systems have to work together to achieve data locality such that an HMR task does not
fetch data from a remote node.
HMR tasks typically interact with HDFS on the initial data load that occurs during the map
phase and the final data write in the reduce phase (the data write can happen in a map phase if there
is no reduce phase). After the map phase, each output record is assigned a partition
id, i.e., the reducer that will process the record. After partition assignment, intermediate
records are collected in a circular memory buffer of each map task. If they occupy more
than 80% of the buffer, they are “spilled” to a local disk. Before the spill, records of
each map task are sorted by partitions and later by keys. All the spills from map tasks of
one particular node get merged into one large file where records are sorted by partitions.
Then records get transferred to their related reducers.
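The partitioning and spill logic described above can be sketched as follows. Names and details are illustrative, not HMR's exact implementation; only the 80% threshold and the sort-by-partition-then-key order mirror the description.

```python
# Sketch of map-side partitioning and spilling (illustrative names).
def partition_for(key, num_reducers):
    # Assigns each record to a target reducer.
    return hash(key) % num_reducers

def collect(records, num_reducers, buffer_size):
    buffer, spills = [], []
    for key, value in records:
        buffer.append((partition_for(key, num_reducers), key, value))
        if len(buffer) > 0.8 * buffer_size:   # spill past 80% occupancy
            spills.append(sorted(buffer))     # by partition, then by key
            buffer = []
    if buffer:
        spills.append(sorted(buffer))         # final spill
    return spills   # per-node spills are later merged into one sorted file
```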
HMR tasks access HDFS through the FileSystem API, an abstract Java class that
defines a set of functions that need to be implemented, including open(), create(),
mkdirs(), getFileStatus(), etc. Implementing the FileSystem API in order to access another
file system allows for organized development of plug-ins.
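The plug-in idea can be shown in miniature. The sketch below is a Python stand-in for the Java FileSystem abstract class, with deliberately simplified signatures: any backend that implements the same small set of operations can be dropped under an engine unchanged.

```python
# Toy plug-in interface modeled on the FileSystem abstraction
# (simplified; real implementations return streams and rich metadata).
from abc import ABC, abstractmethod

class FileSystem(ABC):
    @abstractmethod
    def open(self, path): ...
    @abstractmethod
    def create(self, path, data): ...
    @abstractmethod
    def mkdirs(self, path): ...
    @abstractmethod
    def get_file_status(self, path): ...

class InMemoryFS(FileSystem):
    """A toy backend; a real plug-in would talk to GFS2, S3, etc."""
    def __init__(self):
        self.files, self.dirs = {}, set()

    def open(self, path):
        return self.files[path]

    def create(self, path, data):
        self.files[path] = data

    def mkdirs(self, path):
        self.dirs.add(path)
        return True

    def get_file_status(self, path):
        return {"path": path, "length": len(self.files[path])}
```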
2.1.2 Spark
Like HMR, Spark also has the MapReduce algorithm at its core. However, Spark does not
rely on a rigid map-then-reduce format but rather on a more general directed acyclic
graph (DAG) of operators. Figure 2.2 shows a DAG that describes an application.
This approach makes it possible to avoid writing to disk after each reduce phase and to pass the
computation result down the execution pipeline.

Figure 2.2: Spark application DAG example.

In a way, Spark[61] targets iterative jobs or chains of HMR jobs that are typically bound by I/O bottlenecks. A typical Spark
application has a driver program and many task programs that execute the same code.
Task programs run inside Spark Executors that are just JVMs with pre-defined heap size
and number of cores. Spark processes data in terms of Resilient Distributed Datasets,
or RDDs; these datasets represent data at a particular stage of the application. An RDD is
divided into partitions and distributed across Spark task programs.
Spark defines two types of operations: transformations and actions. Transformations
are lazy operations and thus are computed only when triggered by a following action.
Narrow transformations like map(), filter() and flatMap() do not require a data
shuffle, while wide transformations such as reduceByKey(), groupByKey() and join()
require shuffling and synchronization of data. Actions, such as count(), collect() and
the save operations, trigger execution of the recorded transformations.
Spark allows for in-memory caching. Spark’s Resilient Distributed Dataset (RDD)[60]
can be cached into the memory of an Executor at any point during computation. This
means that future steps of the computation that require the same dataset do not need
to recompute it.
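The interaction between lazy evaluation and caching can be illustrated with a toy dataset class (our own sketch, not Spark's API): transformations only record a lineage of functions, an action forces the whole chain, and without caching every action recomputes it from the original data.

```python
# Toy lazy dataset illustrating lineage, actions and caching.
class LazyDataset:
    def __init__(self, data, lineage=()):
        self.data, self.lineage = data, lineage
        self.cached = None
        self.compute_count = 0            # how often the chain actually ran

    def map(self, fn):                    # transformation: nothing runs yet
        return LazyDataset(self.data, self.lineage + (fn,))

    def cache(self):                      # materialize once, reuse afterwards
        self.cached = self._compute()
        return self

    def _compute(self):
        self.compute_count += 1
        out = self.data
        for fn in self.lineage:           # replay the recorded lineage
            out = [fn(x) for x in out]
        return out

    def collect(self):                    # action: forces evaluation
        return self.cached if self.cached is not None else self._compute()
```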
Despite many advantages of Spark, users need to understand execution mechanisms
of the framework in detail. Moreover, tight coupling of caching may result in interference
with computation memory thus making cache behavior a user problem.
In Spark, every application is equipped with a local memory cache that an application
can use throughout the program execution. The advantage of Spark cache is that data
can be saved in the heap of the executor JVM, thus accessing data from within the
same JVM is very fast. However, the same heap space is used for computation, so
careless use of heap memory can lead to major performance degradation. Therefore,
Spark users need to thoroughly understand how their application data can impact JVM
heap size. This includes the size of the data partition to be re-used, the size of each executor
heap space, Java object serialization and its implication on data size, and more. Hence,
the tight coupling of cache and compute layers in Spark creates usability issues that can
easily lead to misuse of the cache, which in turn leads to a performance drop.
2.1.3 Specialized Graph Processing
Another batch processing engine that has started to compete with MapReduce recently
is Bulk Synchronous Parallel (BSP)[56]. BSP’s key idea lies in message passing. Through
communication between workers BSP can achieve a high level of synchronization. Never-
theless, BSP has its challenges. For example, how do we identify message passing routes,
i.e., which worker should be a sender or a receiver? The good news is that in graph
algorithms we do not need to consider this issue, since the graph structure itself tells us how
communication routes are defined. In principle, all the communication is done through
passing messages to a node’s neighbours and vice versa. Therefore, BSP is a perfect match
for dependency-rich data structures like trees and graphs.
Apache Hama is a BSP framework that is a part of Apache Hadoop ecosystem[50].
Hama was inspired by Google’s BSP-based Pregel[38]. Hama spawns parallel workers
(typically a worker per CPU core); each worker processes incoming messages and prepares a set of
outgoing messages. All messages are routed in a synchronization step after all workers
have finished preparing their outgoing messages. After the synchronization step,
messages are sent out to workers. The time period from processing messages to the end of
synchronization is called a superstep. The job is considered done when there are
no workers that need to send messages, i.e., all outgoing message queues are empty.
Unlike MapReduce, Hama does not store intermediate results. Workers keep
messages in their respective queues, and queues are stored in Java heaps. Therefore, the only
two interactions Hama has with cold storage are the initial data load to workers and the
final data save from workers after the job is complete.
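The superstep loop described above can be sketched as follows (an illustrative model, not Hama's API): every worker processes its inbox and prepares outgoing messages, a synchronization step then routes all messages, and the job ends when all outgoing queues are empty.

```python
# Minimal BSP driver: compute phase, then a synchronization barrier
# that routes messages; one iteration of the loop is a superstep.
def bsp_run(workers, inboxes):
    """workers: {id: fn(inbox) -> list of (dest_id, message)}."""
    supersteps = 0
    while any(inboxes.values()):
        outgoing = []
        for wid, compute in workers.items():      # local compute phase
            outgoing.extend(compute(inboxes[wid]))
            inboxes[wid] = []
        for dest, msg in outgoing:                # synchronization step
            inboxes[dest].append(msg)
        supersteps += 1
    return supersteps
```

For example, forwarding a token along a three-worker chain takes three supersteps: two to pass it along and one final step in which the last worker consumes it.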
2.2 Resource Managers
Currently, there are two major players in resource management for data analytics
engines: YARN[57] and Mesos[23]. YARN is the more popular and older framework.
It allows for fair resource negotiation across a variety of analytics applications. The
center of YARN[57] is the ResourceManager, which is the main authority responsible
for distributing cluster resources among all applications in the system. Each node in
the cluster has a NodeManager that monitors node resources and application activity.
NodeManagers also launch containers for applications. A container is just a definition
of memory and CPU limits per application. The latest Hadoop versions heavily rely
on a capacity scheduler within YARN. This scheduler launches applications based on
their resource requirements (CPU and memory requirements). Each application has an
ApplicationMaster that negotiates resources from the ResourceManager and works with
NodeManagers to execute tasks.
When we talk about YARN, it is paramount to understand how YARN’s default
scheduler works. By default, YARN relies on its capacity scheduler, which assigns jobs based on
the available resources in the cluster. For example, if the tasks of a certain job are in
the queue, the YARN Capacity Scheduler will try to match each task’s resource requirements with
the resources available in the cluster. However, this approach can conflict with another
type of scheduling: data-location-based scheduling.

Mesos was introduced later
than YARN. However, its primary goal is also to allow a large variety of frameworks
to execute seamlessly on the same set of machines. The argument that Mesos
makes is that the data analytics ecosystem is fragmented and users need different engines for
different types of problems. Hence, multi-framework clusters will be commonplace in
the future.
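The capacity-matching step can be illustrated with a first-fit sketch (names and policy are ours, purely illustrative of the matching idea, not YARN's scheduler code): a queued task is launched on the first node whose free memory and cores cover its container request, otherwise it stays in the queue.

```python
# First-fit matching of container requests against node resources.
def schedule(tasks, nodes):
    """tasks: [(name, mem, cores)]; nodes: {node: [free_mem, free_cores]}."""
    placements = {}
    for name, mem, cores in tasks:
        for node, free in nodes.items():
            if free[0] >= mem and free[1] >= cores:
                free[0] -= mem            # reserve the container's resources
                free[1] -= cores
                placements[name] = node
                break
        else:
            placements[name] = None       # no fit: task stays in the queue
    return placements
```

Note what this policy ignores: where the task's input data lives, which is exactly the conflict with data-location-based scheduling described above.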
2.3 Storage platforms
Data stored on hardware disks can have different representations. At the bare metal
level data is stored in disk blocks (e.g. 4KB), hence the name block device or block
storage. A file system can introduce another level of abstraction to a block device and
represent data as a file or a directory to end users. An object storage can represent
block storage data in terms of unique objects. Analytics engines, like most other client
applications, commonly operate on top of files or objects. Distributed storage platforms
may incorporate a file system representation and enforce POSIX-compliance. On the
other hand, some platforms expose virtual block devices and rely on client file systems.
Another set of platforms focuses on co-locating clients with the storage medium on the same
server (direct-attached storage) to provide faster performance.
2.3.1 Network-attached storage
Network-attached storage (NAS) is a platform that separates client programs from the storage
medium and allows for file-based or object-based access to data. NFS is one example of
such systems[49]. NFS has a file server decoupled from client servers. All data is stored
on the file server’s disks, and client servers use a network protocol to access remote files.
There are many other systems with similar architectures. These systems have problems
with scalability and high availability, since all of the data is stored on one node.
Another example is Lustre[27]. Unlike NFS, it is a highly scalable distributed
filesystem that decouples client nodes from storage nodes. Lustre has many storage
nodes that manage their own data without knowledge of other cluster nodes. There
are separate metadata managers that contain a table of all files and their respective
locations. This architecture allows for high scalability of requests, unlike NFS. Ceph
is very similar to Lustre, but it also provides a block storage interface as well as an
object interface[58]. Other examples are cloud-based object stores like S3, OpenStack
Swift and Azure Blob[11][13][14].
2.3.2 Storage Area Networks
A storage area network (SAN) is a consolidation of commodity disks that provides block-level
access[15]. A SAN is made of available block devices that are integrated into a single pool.
Virtual block devices carved out of the pool can then be accessed over the network by
clients. These virtual devices appear as locally attached devices to the OS file system.
SANs can support protocols like iSCSI, FibreChannel and AoE. Unlike NAS, SANs expose
a block device interface and delegate file system concerns to the client side. OS file systems
are mounted on top of the virtual block volumes.
A Server SAN is SAN management software that consolidates all disks on
commodity servers into a single pool of disks[53]. Users can create virtual volumes
from the pool. The volumes can then be attached as new disks to virtual or physical
machines. As in other large management systems, Server SANs typically have multiple
master nodes that control metadata about all disks in the pool(s) and about the virtual
block devices. Slave/agent servers are responsible for managing the disks on their servers
and reporting their state to the master. SAN clients expose virtual volumes to their
respective operating systems. Typically, a Server SAN replicates disk blocks across
multiple disks and servers in a two- or three-way fashion. Server SANs provide data
balancing, data compaction and a variety of recovery mechanisms. A well-known example
of a Server SAN is Amazon EBS. In this project we utilize a similar Server SAN
architecture provided by Huawei Technologies Inc.: the FusionStorage solution[3].
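The pooling and replication behaviour described above can be sketched in a few lines. This is an illustrative model only, not FusionStorage's or EBS's actual API: server disks are flattened into one pool, and each block of a virtual volume is assigned replicas on distinct servers in round-robin fashion.

```python
# Illustrative Server SAN sketch: pool all disks, then place each virtual-volume
# block on `replicas` distinct servers (hypothetical names, not a real API).

def make_pool(disks_per_server):
    """disks_per_server: {server: [disk, ...]} -> flat list of (server, disk)."""
    return [(s, d) for s, disks in disks_per_server.items() for d in disks]

def place_blocks(pool, num_blocks, replicas=3):
    """Assign every block's replicas to `replicas` different servers, round-robin."""
    servers = sorted({s for s, _ in pool})
    assert len(servers) >= replicas, "need at least as many servers as replicas"
    return {b: [servers[(b + r) % len(servers)] for r in range(replicas)]
            for b in range(num_blocks)}
```

Because replicas of one block never land on the same server, the loss of any single server leaves at least two copies of every block, which is the recovery property the three-way scheme is designed for.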
2.3.3 Distributed systems with direct-attached storage
Direct-attached storage (DAS) is digital storage that is directly attached to a server, i.e. a
local disk. In this architecture, data is not sent over the network for storage but remains
on the server. Common single-node file systems such as ext4 and ext3 are mounted on
top of DAS. HDFS, for instance, is the DAS storage layer in the Hadoop framework[52].
It has a master/slave architecture. The NameNode is the master program that stores
and manages the file namespace, file block locations, permissions, access times, etc. It also
regulates access to files by client programs like HMR or Spark. HDFS is designed to
store files as sequences of blocks on the DataNodes. It is usually configured with
3-way replication, where each file block has 3 replicas scattered across the cluster. The
file block size is generally 64 MB. By scattering blocks across the cluster HDFS can scale
out to a great extent.
Whenever the HMR program (which runs in an ApplicationMaster) requires certain input
files, it contacts the NameNode to get the file information, including the locations of the
file blocks. Then it requests containers from the ResourceManager to execute tasks. The
ApplicationMaster passes the "preference nodes" information with the container request.
The preference nodes are those that contain the input file blocks. The ResourceManager
may ignore the preference because of resource unavailability and allocate the
containers on nodes without the required data. In this scenario, data is transferred to
the node where the container is allocated. However, since there are 3 replicas of each
file block, the ResourceManager rarely ignores the ApplicationMaster's preferences.
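The preference-honoring allocation just described can be sketched as follows. The names are illustrative (this is not the ResourceManager's real interface): the AM supplies the nodes holding a block's replicas, and the RM uses one of them if it has a free container, falling back to any available node otherwise.

```python
# Sketch of locality-preference container allocation (hypothetical names).
# free_containers maps node -> number of free containers on that node.

def allocate(preferred_nodes, free_containers):
    """Return (chosen node, locality honored?); None if the cluster is full."""
    for node in preferred_nodes:            # try the nodes holding replicas first
        if free_containers.get(node, 0) > 0:
            free_containers[node] -= 1
            return node, True
    for node, count in free_containers.items():   # locality ignored: any free node
        if count > 0:
            free_containers[node] -= 1
            return node, False
    return None, False
```

With 3 replicas there are three chances for the first loop to succeed, which is why locality is rarely ignored in practice.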
2.4 Distributed Cache
2.4.1 Alluxio
Alluxio is an in-memory cache, though not memory-only: its tiered storage feature
means it can in principle use any storage medium. Because Alluxio exposes a storage
integration layer through an API, applications can access any underlying persistent storage
and file systems. Alluxio can be deployed with any big data framework (Apache Spark,
Apache MapReduce, Apache Flink, Impala, etc.) on many storage systems or file systems
(Alibaba OSS, Amazon S3, EMC, NetApp, OpenStack Swift, Red Hat GlusterFS, and
more).
Alluxio is designed in the context of Hadoop[35]. This means that existing Spark and
MapReduce programs can run on top of Alluxio without any code modifications.
Alluxio's design uses a single master, called the AlluxioMaster, and multiple workers, called
AlluxioWorkers. At a high level, Alluxio can be divided into three components: the master,
the workers, and the clients. The master and workers together form the Alluxio servers,
which are the main components of a typical Alluxio cluster. The clients are generally the
applications, such as Spark or MapReduce jobs.
The master is responsible for managing the global metadata of the system, e.g. the file
system tree. Clients may communicate with the master to read or write this metadata.
Alluxio workers are responsible for managing the local resources allocated to Alluxio. These
resources include local memory, SSD, or hard disk and are user-configurable. Alluxio
workers store data as file blocks and serve requests from clients to read or write data by
reading or creating new file blocks; workers are very similar to HDFS DataNodes. The
worker is only responsible for the data in these file blocks; the actual mapping from file
to file blocks is stored only in the master. The Alluxio client provides users a gateway to
interact with the Alluxio workers. It exposes a cache system API. It initiates communication
with the master to carry out metadata operations and with the workers to read and write
data that exists in Alluxio. Data that exists in the under storage (e.g. HDFS) but is not
available in Alluxio is accessed directly through an under storage client.
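The read path just described can be condensed into a small sketch. This is not Alluxio's real client code; the dict-based master, workers, and under storage are stand-ins for the actual services:

```python
# Minimal sketch of the tiered read path: metadata from the master, data from
# a worker on a cache hit, otherwise fall through to the under storage and
# cache the block on the way back. All structures here are illustrative.

def read_block(block_id, master, workers, under_storage):
    """master: {block_id: worker}; workers: {worker: {block_id: bytes}}."""
    worker = master.get(block_id)
    if worker is not None and block_id in workers[worker]:
        return workers[worker][block_id]        # cache hit: served from a worker
    data = under_storage[block_id]              # miss: read the under storage
    target = next(iter(workers))                # cache it for future reads
    workers[target][block_id] = data
    master[block_id] = target                   # register the new block location
    return data
```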
AlluxioWorkers store file blocks inside directories, just like HDFS DataNodes. The
difference from HDFS is that an AlluxioWorker's directory is mounted as RamFS, i.e. it
resides in the OS page cache.
2.5 Common Solutions
2.5.1 Vanilla Hadoop Solution
Companies that perform regular large-scale data analytics typically deploy Hadoop in a
cluster environment separate from their main data generation and curation engines. For
instance, Taobao, China's third-largest e-commerce site, accumulates logs in a data
warehouse, periodically transferring log data to an analytics silo, i.e. HDFS.
Having another storage system for analytics may incur additional costs. For instance, in
usual Hadoop deployments, data are stored on local node disks and 3-way replication is
employed to ensure reliability. This Hadoop-specific setup increases overall storage
capacity requirements. As a result, companies end up purchasing new hardware for
the sole purpose of running data analytics, resulting in substantial upfront infrastructure
investment and increased management costs. Additionally, data ingestion can take some
time given the size of the data transferred, thus postponing a MapReduce or a Spark
job. Finally, periodic transfers have to be set up, configured and automated, which incurs
additional engineering effort.
On the other hand, once the required data is loaded into HDFS, HMR read performance
is at its optimum. The reason is that each data piece has 3 replicas (by default), so the
probability that locality will be ignored by YARN is much lower. Also, HDFS relies on
Linux file systems like ext3 and ext4, which use the OS buffer cache. Given the
large RAM size on analytics nodes, an HDFS DataNode can keep most of its file blocks
in local memory. In addition, default HDFS settings allow it to write the first replica and
asynchronously propagate the 2 other replicas. With ext3 caching into the OS buffer
during writes, HDFS write performance can approach memory speed. Default
HDFS is fault-tolerant but not quite highly available due to the asynchronous distribution
of copies. To enforce synchronous copying, the dfs.min.replication parameter needs to be
set to the value of the dfs.replication parameter.
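As a concrete illustration, such a setup might look like the hdfs-site.xml fragment below. This is a sketch: the exact property names vary across Hadoop versions, so treat them as the parameters referred to in the text rather than authoritative keys.

```xml
<!-- Illustrative hdfs-site.xml fragment; property names vary by Hadoop version. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <!-- Set equal to dfs.replication to force synchronous replication. -->
    <name>dfs.min.replication</name>
    <value>3</value>
  </property>
</configuration>
```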
2.5.2 Vanilla Spark Solution
Unlike HMR, Spark does not include a native file system. Spark can work with many
storage options like S3, NFS, HDFS, etc. Typical Spark deployments are of 3 types:
standalone, YARN or Mesos. Standalone Spark clusters are deployed for analytics
workloads running only Spark programs, whereas YARN or Mesos deployments allow
jobs from other engines to execute in parallel with Spark jobs.
One of the key differences of Spark is that it can cache MapReduce-style inter-job data,
thereby decreasing the frequency of local disk or remote storage access. Spark is most
suitable for jobs that are iterative in nature or consist of a chain of smaller jobs.
Nevertheless, Spark does not cache data by default; it is up to the user to decide at
which point in the program data partitions need to be cached. The Spark community
provides guidelines for coding techniques that can help achieve optimal performance.
However, it takes experience and knowledge of Spark internals to utilize Spark caching
most efficiently. In addition, failed tasks that do not finish a certain computation will have
to be re-tried and will re-compute the lost partitions. Improper caching and lost partitions
lead to increased job execution time. Moreover, Spark JVMs cache data partitions per user;
therefore, if another user needs to access the same data partition, it will be transferred
from disk or the remote store and cached in another Spark JVM.
2.6 Conclusion
In this Chapter, we outlined concepts and platforms that are essential building blocks of
our consolidated platform. We discussed processing engines such as HMR and Spark, all
of which are used in our final architecture. We also reviewed resource managers, focusing
on YARN, which is paramount in our platform. We described storage concepts in large-scale
systems to show readers that storage tiers can be very different in design. We also
discussed Alluxio, the recently introduced caching tier for the Hadoop ecosystem. Alluxio
helps our platform improve read performance in high data re-use scenarios. Finally,
we presented vanilla (common) Big Data stacks and pointed out possible flaws.
Chapter 3
Thesis Idea and Design
3.1 Thesis Idea
Our goal is to design a consolidated caching and storage architecture that meets the
requirements of data analytics workloads in terms of usability, cost, performance and
fault-tolerance. We propose decoupling caching and storage responsibilities from the
analytics layer and outsourcing them to external independent layers. Toward this end, we
design and implement a scalable collaborative caching tier that connects existing analytics
engines with robustness-oriented storage solutions.
In this chapter, first, we present several case studies that show usability issues in state-
of-the-art analytics engines. Second, we discuss the proposed design of the consolidated
architecture. We cover the collaborative cache, explain service interactions and describe
the consolidated storage layer. We focus on optimizing the collaborative cache such that
our platform achieves good performance and avoids common usability issues. Hence in
this dissertation we make two main contributions: (1) building an integrated caching
and storage platform for data analytics and (2) optimizing data and control flow of
collaborative caching to improve usability, performance and robustness.
Chapter 3. Thesis Idea and Design 20
3.2 Usability Issues
3.2.1 Case Study: Spark
Apache Spark[61] offers caching mechanisms for intermediate data to avoid re-computation
of RDDs when they are re-used. Spark Executors keep computation objects inside the
Spark JVM heap. The same heap is utilized for cached data. This tight coupling of execution
and cache spaces leads to a variety of interface options. However, instead of flexibility, this
diversity comes with rigid constraints and possible confusion for users during configuration
of a Spark application. A variety of options is available to users for improving job
performance. Spark's .cache method uses the Executor heap only as the default
option. Other options include MEMORY_AND_DISK and DISK_ONLY. Users can also
choose whether they want to store raw or serialized data. Spark users need to understand
how much data will be stored in the cache in order to provide enough memory to
Executors. When running Spark on a YARN[57] cluster, the configuration settings become
even trickier. YARN forces applications to run inside Containers. If an application exceeds
its Container limits, YARN will kill the application.
Executor memory falls under two categories in Spark: execution and storage. Execution
memory is used for storing computation-related objects. Storage memory, on the
other hand, is used for caching data. Both execution and storage share a unified region
called M. By default, M is set to 0.75 of the Executor heap, and the storage fraction can
occupy 50% of M. The fraction is configurable and up to the user to choose.
Figure 3.1 shows the DAG of a simple Spark program that a user wants to submit
to a YARN cluster. The program reads 10 GB of graph data from HDFS, extracts the
adjacency list in each line into an array and caches the lists. The listRDD is cached using
the default .cache command. After caching, listRDD is used in two different map functions.
The first computation extracts vertices and the second extracts edges. The
outputs of both maps are saved back to HDFS.
Let us assume that the user does not know about the 0.75 fraction for M and the
compute-storage split of memory, and submits the Spark program to the YARN cluster
with 10 containers of 1 GB each.
Figure 3.1: Spark application to extract graph data.
The job fails with multiple tasks reporting a GC: time limit exceeded exception. After
thorough investigation the user realizes there were only 3 GB of space
available for caching and the garbage collector spent too much time evicting blocks.
Let us now assume that the user knows about the M region and the 50% compute-storage
split. She submits the Spark program to the YARN cluster with a request for 27 containers
of 1 GB each. This results in a total of 27 GB of RAM allocated to 27 Spark
Executors. Each Executor reads an RDD partition of the 10 GB file from HDFS. After the
first map phase the actual data size grows to 15 GB due to object de-serialization and
initial map overheads. Only 10.1 GB of data fit in all the Executors' memory. The remaining
4.9 GB needs to be re-computed from the beginning in the second map function after the
cache. This is an obvious performance loss due to misconfiguration.
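The arithmetic behind this scenario follows directly from the stated defaults (M = 0.75 of the heap, storage = 50% of M); the 10.1 GB and 4.9 GB figures above are these values rounded:

```python
# Storage (cache) capacity under the defaults described in the text:
# unified region M = 0.75 of Executor heap, storage fraction = 0.5 of M.

def storage_capacity_gb(num_executors, heap_gb,
                        m_fraction=0.75, storage_fraction=0.5):
    return num_executors * heap_gb * m_fraction * storage_fraction

cache_gb = storage_capacity_gb(27, 1.0)   # 27 Executors x 1 GB heap -> 10.125 GB
shortfall_gb = 15.0 - cache_gb            # of 15 GB deserialized data, 4.875 GB
                                          # cannot be cached and is re-computed
```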
Let us now assume that the user knows everything about the previous runs. She decides to
submit the Spark program with 27 containers of 1 GB each, but configures
caching to be MEMORY_AND_DISK. The job finishes smoothly and faster than the previous
runs. However, after investigating the Spark UI, the user realizes that 2 Executors did not
use their full storage memory fraction, while 3 other Executors spilled almost 3 GB to disk.
Hence the user realizes that, while this is close to the best performance the program can
achieve with this configuration, it could still be improved.
All the cases above demonstrate how the tight coupling of execution and caching in
Spark can easily result in job failure, performance loss and/or under-utilization of memory
resources. We, therefore, conclude that Spark applications would benefit from an external
collaborative cache that can grow as needed and fully utilize all assigned resources by
evicting data to a remote node or disk on demand.
3.2.2 HDFS
Analytics engines generally have their own storage component (e.g. Hadoop’s HDFS)
that represents a standalone storage system[52]. Having another storage for analytics
may incur additional costs. For instance, in usual Hadoop deployments, data are stored
on local node disks and 3-way replication is employed to ensure reliability. This Hadoop-
specific setup leads to increased disk capacity requirements overall. As a result, customers
end up purchasing new hardware for the sole purpose of running data analytics, resulting
in substantial upfront infrastructure investment, and increased management costs. In
general, HDFS is not used as an enterprise storage, but is widely adopted as data ana-
lytics storage. This leads us to conclude that customers use multiple storage silos: (1)
one silo containing data for transaction processing such as enterprise and web application
processing, with (2) a second silo for analytics. This approach requires users to look for
and deploy mechanisms to periodically transfer data between silos. The emergence of
Apache Flume[24] explains the need for fast data transfer across silos.
HDFS is also considered to be a highly fault-tolerant system. However, it has only
one metadata server, and it is up to the system administrator to make it more available.
On the other hand, existing storage-oriented systems like Lustre, Ceph, Huawei's
FusionStorage and others strive to excel at fault-tolerance and high availability. For instance,
Huawei FusionStorage, in its default configuration, has 3 metadata servers (MDCs) that
are coordinated by a Zookeeper cluster. Furthermore, since storage silos process other
workloads, e.g. webserver traffic, placing the analytics stack on the same set of storage
servers is not a good idea. Therefore, our collaborative cache is decoupled from the storage
servers, i.e. placed on a different set of servers or VMs in the data center. Since propagation
from the caching layer to remote storage disks can be a bottleneck, we propose to
asynchronously propagate data to storage silos. We describe our design in more detail in
the next section.
3.3 Proposed Design
Our conceptual design addresses the usability issues discussed previously. We want to
encourage flexibility in our platform. We design a platform called Gluon that provides
flexible support for the majority of workloads with existing storage systems using
collaborative caching. To achieve that, we leverage open-source commodity compute, cache
and storage solutions. This further supports our usability claim. Our design consists
of two tiers: (1) a data analytics and in-memory collaborative caching tier and (2) a
consolidated storage tier.
As the analytics tier, we propose to integrate any engine that is compatible with the Hadoop
FileSystem API. First, our caching layer supports global collaboration across the memories
of all participating compute (and storage) nodes. The cache is designed to be scalable,
independent of the analytics engines, and to utilize the given resources efficiently.
This cache propagates analytics data to decoupled storage services and fetches data
from them on demand.
Second, Gluon supports full integration of the collaborative caching service with
traditional consolidated storage back-end services. As the storage tier, we propose to integrate
any storage solution that can ensure fault-tolerance, scalability and high availability.
The consolidated storage provides a persistent storage service for all data analytics
needs.
3.3.1 Collaborative caching layer
We propose an in-memory collaborative caching layer interposed between the storage
service and the analytics engines. The cache service provides data locality in our platform.
This helps improve read performance and reduces communication overhead with the remote
storage service. Collaboration between cache nodes can maximize cache utilization.
Analytics workloads can operate on skewed data, where some nodes have to
cache more than others. Collaboration allows us to push extra data from a local
node to remote nodes that have spare CPU cycles and available memory. The data
can also be brought back from the remote node to the local node on demand.
Figure 3.2 demonstrates the proposed architecture of our platform.
Figure 3.2: Proposed architecture.
Co-locating cache nodes with computation nodes is preferred because it allows for the best locality. In our
design, we have one cache manager and multiple cache workers. The cache manager holds
metadata about each worker. Analytics programs connect to the cache layer
using cache clients. The cache client has interfaces to communicate with the manager
and the workers. Each cache worker controls its local resources such as RAM and disk. It also
maintains its data and reports to the cache manager upon change. Cache workers are also
responsible for propagating data to the storage service using the corresponding storage clients.
Since we are co-locating the cache with execution, data caching policies are paramount. We
describe our data caching policies next.
Data movement
Figure 3.3 demonstrates how data is propagated in our collaborative cache.
Figure 3.3: Data movement of task writes in collaborative cache
Tasks can interact with any of the cache workers and are able to write data to any of them. The
policy, however, should always favour local memory first; only when this is depleted does
a task write to remote memory. If all remote nodes' memories are depleted, then a task
needs to wait until any cache worker has successfully evicted blocks to their respective
local disk and has free memory. In the background, data blocks are asynchronously
propagated by cache workers to the remote storage silo.
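The write-placement policy above can be condensed into a short sketch. The structures are illustrative (free-byte counters standing in for real cache workers), not Gluon's actual implementation:

```python
# Sketch of the write policy: favour local memory, then any remote worker's
# memory, else signal the task to wait for an eviction to free space.
# (Hypothetical names; dicts with a 'free' byte counter model the workers.)

def place_write(block_size, local, remotes):
    """Return 'local', a remote worker's name, or 'wait-for-eviction'."""
    if local["free"] >= block_size:
        local["free"] -= block_size
        return "local"
    for name, worker in remotes.items():
        if worker["free"] >= block_size:
            worker["free"] -= block_size
            return name
    return "wait-for-eviction"          # all memories depleted: block until
                                        # a worker evicts blocks to its disk
```

The asynchronous propagation to the storage silo happens in the background regardless of which branch is taken, so the task never waits on the remote storage itself.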
During reads, data is brought from the remote storage silo on demand and cached in
the local memory of the workers. We only cache a block on read when the block is not
currently present in the cache layer, i.e. we avoid block replicas in the cache. Caching
data during reads optimizes the performance of subsequent re-use of the same data set.
This is very helpful for machine learning algorithms such as Logistic Regression, which
require multiple passes over the same set of data.
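The cache-on-miss rule with replica avoidance can be sketched as follows (illustrative structures, not the real worker protocol): a block is cached on read only when no copy already exists anywhere in the cache tier.

```python
# Sketch of the read rule: serve from any worker that holds the block; on a
# miss, fetch from storage and cache exactly one copy (no replicas created).

def read_with_cache(block_id, cache_workers, storage):
    """cache_workers: {worker: {block_id: bytes}}; storage: {block_id: bytes}."""
    for worker_blocks in cache_workers.values():
        if block_id in worker_blocks:
            return worker_blocks[block_id]      # already cached somewhere
    data = storage[block_id]                    # miss: read the storage silo
    local = next(iter(cache_workers))           # cache once, at the reader's node
    cache_workers[local][block_id] = data
    return data
```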
Consistency between workers
In general, in distributed systems, data inconsistencies may arise when replicas of the
same data block are modified from different locations. However, in analytics workloads
tasks write to separate, disjoint files. The HDFS API, for instance, does not
allow file modifications, only creation or appending. This results in each reducer
writing to disjoint files. Therefore, in our cache, although we allow replicas to exist in
rare scenarios, we do not allow concurrent writes to the cached blocks.
Resource sharing
Co-locating the cache layer with execution makes them compete for the same physical
resources, such as RAM and disk. However, we already saw from the case study that
Spark splits memory into dedicated areas for execution and storage, and that the fraction
typically given to storage memory is 0.375 of the heap. We assign this amount to
the cache and set all Spark memory to be compute memory only. Unlike Spark, Hadoop
MapReduce does not use native caching. However, HMR severely suffers from inter-job
data exchange in chained or iterative workloads; hence it can only benefit from a caching
layer that stores inter-job data in memory. Local disks on the analytics layer that are
used for spilling shuffle data can also be shared with the collaborative cache for storing
evicted memory blocks. We also take the memory that would serve as the buffer cache for
the analytics-tier disks and give it to the corresponding cache worker, because disk-related
operations are offloaded to the storage service. However, we may still run out of cache
space quickly. Therefore, the cache layer can grow independently of the compute layer:
some cache nodes are co-located with the compute layer while others can be co-located
with storage or other, more idle services.
As we have previously mentioned, Spark users often need to know how much memory
their partitions occupy. With our independent caching layer, users worry less about
memory management during computation. By offloading data to an external cache service,
users do not have to worry that Spark JVMs will slow down due to memory thrashing and
long GC times.
Connectors
Our architecture requires us to introduce two clients: one for the cache and another for
storage. We integrate the storage client into the cache layer instead of the compute layer.
In our final architecture the Hadoop API connects to the cache layer.
We are required to make changes either in the configuration or in the source code of
analytics engines in order to connect seamlessly to the caching layer. For instance, Spark's
caching mechanism is fine-tuned to store data in the JVM heap or on the node disk. Storing
data in an external service is not implemented in Spark. We implement a new external
service manager in Spark that integrates seamlessly, such that a user just needs to change
one configuration setting.
3.3.2 Service Decoupling and Modularity
Our architecture proposes to decouple computation and caching from storage responsibilities.
We use connectors and client programs to help the decoupled services interact.
In the Hadoop ecosystem, applications interact with HDFS through the FileSystem API.
Application workers connect to HDFS DataNodes through an HDFS client. In our platform
we rely on the Hadoop FileSystem API, because the majority of analytics engines have
already implemented it. We can think of HDFS and the analytics engines as decoupled
services. However, in a vanilla HMR or Spark setup, HDFS is placed on the same nodes as
the analytics engine. We discussed that this placement incurs usability issues.
In our platform, we propose to remove local storage solutions such as HDFS from the
analytics nodes. We place storage solutions onto a different set of nodes that can be located
on different racks. Gluon needs to be modular and able to integrate existing analytics
and storage platforms. Figure 3.4 shows analytics and storage services that can be integrated
in Gluon. Analytics applications that run inside containers assigned by the Resource
Manager connect to the storage service through a client program. For instance, if the storage
layer is HDFS, then the storage clients can be HDFS clients. In this scenario, Spark, HMR or
any other analytics engine would connect to the storage service through the Hadoop FileSystem
API. The storage service is responsible for disk and data management as well as replication.
Decoupling the storage service helps customers deploy new analytics engines in their
system. For instance, if an enterprise stores data from transaction processing in NFS[51],
then deploying an analytics engine on top of NFS just requires installing a storage client.
There is no need for ad-hoc data ingestion into an analytics silo, which provides a significant
improvement in terms of usability and cost.
A storage client is responsible for translating storage service calls from applications.
A majority of analytics engines (e.g. the Apache projects) run inside JVMs and are
implemented in Java or Scala. Therefore, a storage client should be a compiled .jar
executable that runs inside the application JVM.
Figure 3.4: Decoupling Storage and Analytics.
The storage client is a set of functions that
translates Java calls (the Hadoop FileSystem API) into the respective calls of the storage
service. The storage service can represent data in different forms: files, file blocks, objects or
disk blocks. Depending on the data representation, the storage client can be more than just
a Java connector. For instance, HMR, Spark or any other analytics engine cannot access a
block device directly through a SCSI or iSCSI interface, because they all need a file or
object mapping to read/write data. Our platform can support file-based storage services
as well as block-based storage.
We propose to install storage clients on compute nodes. We design a storage client
based on the HDFS client and the storage service's specifications. We change HDFS calls
according to the requirements of the target storage service.
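The shape of such a storage client can be sketched as a thin adapter. This is a hypothetical illustration, not Gluon's real connector: an in-memory dict stands in for the backend, and the methods mirror the create/append/open flavour of the Hadoop FileSystem API described above.

```python
# Hypothetical storage-client adapter: translate FileSystem-style calls into
# backend operations. The dict backend stands in for NFS, an object store, or
# a Server SAN volume in the real system.

class StorageClient:
    def __init__(self, backend):
        self.backend = backend              # path -> bytes

    def create(self, path, data=b""):
        if path in self.backend:            # HDFS-style: no overwriting files
            raise FileExistsError(path)
        self.backend[path] = data

    def append(self, path, data):           # HDFS allows only create/append
        self.backend[path] = self.backend[path] + data

    def open(self, path):
        return self.backend[path]
```

Swapping backends then only requires reimplementing these few translation methods, which is the modularity Gluon relies on.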
3.3.3 Consolidated Storage Layer
Shared view
Large-scale storage solutions provide a shared view of data to client nodes. This is true
for a majority of such systems, e.g., Lustre, Ceph, NFS, S3, HDFS. Each system solves
contention issues in different ways. Some use locks while others rely on distributed object
stores. Gluon strives to provide integration with any large-scale storage solution therefore
needs to account for contention issues as well. Gluon avoids contention issues the same
Chapter 3. Thesis Idea and Design 29
way it avoids inconsistency between workers. Since writes to files are disjoint there is no
need to worry about locking an inode to flush data. The worst case contention scenario
is when tasks try to create new paths under the same directory. The directory inode is
locked by each task. Nevertheless, path creation times are typically insignificant when
compared to actual data writes in Big Data analytics workloads.
Data consistency
We already mentioned that data is asynchronously propagated to remote storage. Data
can also be brought into the cache on demand. However, analytics engines are not the only
users of the remote storage service. Other services, such as webservers or databases,
can aggregate their data inside the same storage silo. In this case, the Gluon cache
layer needs to be aware of updates from other services to make sure that its view is
consistent with the latest storage update. The data analytics and caching layer needs to
perform consistency checks regularly and without extra overhead on the cache workers
or analytics tasks.
Inconsistencies may also arise between the storage and cache layers. For instance, the
storage solution can accept data from transaction workloads and can update existing
data by adding, extending, modifying or removing files. This results in two types of
inconsistencies: (1) the storage has more up-to-date data that the cache is unaware of and
(2) the cache has more up-to-date data that the storage is unaware of.
Our design detects both types of inconsistencies and notifies the cache manager about
changes in the storage tier. In the second case, we can ignore the inconsistency, because
the cache may have extra data due to temporary files created during analytics job runs
or intermediate data that is not pushed down to the storage tier due to delete-on-finish
behaviour. The first case, on the other hand, is trickier, because a file could have been
either created or modified. When a file is created, it is just another table entry for the
cache manager, which is quite straightforward to implement. In the modification case,
however, the cache manager needs to understand which part of the file was changed,
which chunk of the file to invalidate, and how to do that without interfering with the
analytics workload. Due to these cache invalidation complexities we
leave this feature for future work.
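The detection step itself, before any invalidation, can be sketched as a comparison between the cache manager's file table and a storage listing. The modification-time comparison here is an assumption for illustration (the text does not specify the detection mechanism): storage-only or storage-newer entries are reported, while cache-only entries are ignored, matching the rule above for temporary and not-yet-propagated data.

```python
# Sketch of the consistency check (mtime-based detection is an assumption):
# report paths that are new or newer in storage; ignore cache-only entries,
# which may be temporary or not-yet-propagated analytics data.

def detect_inconsistencies(cache_table, storage_listing):
    """Both args: {path: modification_time}. Returns storage-side changes."""
    stale = []
    for path, mtime in storage_listing.items():
        if path not in cache_table or cache_table[path] < mtime:
            stale.append(path)              # created or modified in storage
    return sorted(stale)
```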
Asynchronous propagation to storage
In the background, data blocks are asynchronously propagated by cache workers to the
remote storage silo. However, cached data is typically not propagated in Spark workloads.
There are two types of data writes in analytics workloads: intermediate and final.
Intermediate data is typically written locally (not to HDFS) and is required to perform
shuffle/synchronization/re-use steps. Sometimes intermediate data can be cached data
(e.g. in Spark). In general, losing intermediate data results in task re-computation,
which can sometimes be costly. However, it is not as costly as losing final output data,
which requires the whole job to be re-computed. Intermediate data is typically destroyed
by applications upon completion.
In Gluon, we propagate both intermediate and final output data. This lowers the
probability of data re-computation. For instance, if a Spark executor crashes or gets killed
by YARN, its RDD partition is destroyed and needs to be re-computed by other executors.
However, if a Spark executor stores its RDD partition in Gluon, the partition can be
recovered from the external memory layer. Even if a whole node crashes, its RDD partitions
may already be persisted in remote storage, so all RDD partitions from lost nodes can be
fetched from cold storage. In this way we perform not only caching but also asynchronous
checkpointing of RDD partitions.
In our platform we define a reasonable trade-off between fault-tolerance and performance.
In practice, our final output propagation matches that of HDFS replica propagation in
the default mode: the default HDFS configuration does not enforce synchronous data
replication, i.e. replicas are propagated during the task run and/or after the task is
finished. HDFS administrators have to explicitly enable synchronous data propagation.
Gluon also provides this option.
3.4 Summary
In this chapter, we discussed a case study of a simple analytics application. We showed
how a usability issue can lead to a failed job, poor performance and under-utilization of
resources. We also proposed Gluon, our consolidated flexible platform that can incorporate
the majority of state-of-the-art frameworks. Our new architecture is based on usability
studies of current analytics engines and their storage solutions. First, our Gluon caching
layer supports global collaboration across the memories of all participating compute (and
storage) nodes. Second, Gluon supports full integration of the collaborative caching service
with traditional consolidated storage back-end services.
With Gluon we emphasize the principle of data locality for in-memory data on any
compute node. At the same time, we take full advantage of fast remote memory access
when opportunities for memory availability in collaborating nodes exist. We describe
data propagation from execution layer to storage layer.
Finally, as mentioned, the seamless integration between caching and consolidated
storage in Gluon means that any updates to files stored on the back-end storage can be
integrated into a new data analytics pass transparently, automatically and on-demand.
This avoids the cumbersome data manipulations that separate on-disk data silos normally
bring about, e.g., for data analytics systems based on HDFS.
Chapter 4
Implementation
This chapter presents details of the implementation of our consolidated platform as well
as platform optimizations and improvements. We start with a description of the system
components and discuss each component in detail. Then we describe how we glue all
components together. Finally, we present the implemented optimizations.
Our component for the collaborative caching layer is based on the open-source Alluxio[35].
Alluxio is an in-memory cache that can interact with YARN applications (MapReduce,
Spark etc.). It caches files from a storage service and places them inside the memory
cache.
Components for the consolidated storage layer include the open-source Global File System
2[34] and a proprietary Server SAN, Huawei FusionStorage[3]. On top of the Server SAN
we install GFS2.
We create our own set of connectors to integrate the caching and storage layers. We
essentially connect the storage layer with the analytics engines first, and then insert the
cache tier in between. We show how each component is integrated into our platform, the
challenges of integration, and the final design optimizations.
4.1 Components
4.1.1 Alluxio
Alluxio is an in-memory cache, though not memory-only: its tiered storage feature
means it can theoretically be extended to access any storage. Because Alluxio exposes a
storage integration layer through an API, applications can access any integrated underlying
persistent storage and file system. We chose Alluxio because it has a flexible code base
and focuses on data analytics caching, in contrast to Ignite[1], which also tries to
accommodate transaction-based workloads.
Alluxio's design uses a single master, called the AlluxioMaster, and multiple workers,
called AlluxioWorkers. At a high level, Alluxio can be divided into three components: the
master, the workers, and the clients. The master and workers together form the Alluxio
servers, which are the main components of a typical Alluxio cluster. The clients are
generally the applications, such as Spark or MapReduce jobs.
The master is responsible for managing the global metadata of the system, e.g. the
inode tree. AlluxioClients may communicate with the master to read from and write to
the global metadata table. Alluxio workers are responsible for managing the local resources
allocated to them, such as RAM, SSD and HDD. Alluxio workers manage all data as file
blocks and are very similar to HDFS DataNodes. A worker is only responsible for the data
on its node; the actual mapping from file to file blocks is stored only in the master. The
AlluxioClient provides users a gateway to interact with the Alluxio workers. It exposes a
cache system API. Data that exists in the under storage (e.g. HDFS) but is not available
in the Alluxio cache is accessed directly through an under storage client. AlluxioWorkers
store file blocks inside directories just like HDFS DataNodes. The difference from HDFS
is that an AlluxioWorker mounts its directory as RamFS, i.e. all data is stored in the OS
page cache.
The AlluxioClient runs inside a task executor (e.g. a Spark Executor). It initiates
communication with the master to carry out metadata operations, and with workers to
read and write data that exists in the Alluxio cache. It can access RamFS and create
random access files. It can also connect to remote nodes and pass data over the TCP/IP
network.
Depending on configuration, AlluxioClients can create two output streams during writes:
(1) a RamFS output stream and (2) an understorage output stream (e.g. an HDFS stream).
Alluxio stores file blocks in RamFS and whole files in the underlying storage. It is important
to note that a file block is typically smaller than the file itself, i.e. a file consists of
more than one block. Upon write, the AlluxioClient creates a single file stream; while
writing the file it creates multiple block streams. This behaviour applies when the
CACHE_THROUGH policy has been set. There are other write policies, such as
MUST_CACHE, THROUGH and the experimental ASYNC_THROUGH. By default Alluxio
uses MUST_CACHE, which means that writes are never propagated to the underlying
storage. In the Gluon cache we ignore all policies except one: we focus on the
ASYNC_THROUGH policy. This policy assigns a set of background threads to copy
RamFS blocks to the corresponding file in the underlying storage.
If a block is not present in RamFS, the AlluxioClient reads it from the underlying
storage. There are 3 read policies in Alluxio: CACHE_PROMOTE, CACHE and NO_CACHE.
The first policy always places a block into the highest tier, which is the RamFS directory
of the node that is reading the block. Even when a block is read from a remote RamFS,
a copy is created in the local RamFS directory. This policy results in multiple replicas of
blocks in the memory tier. In our prototype we only want to cache into the memory layer
once, and thus want to avoid replicas on different RamFS nodes. Therefore we focus on
the CACHE read policy.
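In Alluxio these two choices correspond to two client-side configuration keys. A representative alluxio-site.properties fragment (property names follow the Alluxio 1.x documentation and may differ in other versions) would be:

```properties
# Write policy: cache to RamFS, propagate to under storage asynchronously
alluxio.user.file.writetype.default=ASYNC_THROUGH
# Read policy: cache on read, but do not promote/replicate remote blocks locally
alluxio.user.file.readtype.default=CACHE
```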
4.1.2 Server SAN
Huawei FusionStorage[3] is the main component of the Huawei Server SAN solution. It
can be deployed on multiple general-purpose x86 servers to consolidate the local SSDs or
HDDs of all the servers into virtual storage resource pools that provide block storage
capabilities. Based on the storage resource pools, the FusionStorage software provides
block device interfaces for upper-layer software, for example for creating and deleting
volumes and snapshots. Volumes are accessed through the SCSI or iSCSI protocols.
FusionStorage automatically stores each piece of data as several identical copies on
different servers. The data is represented as disk blocks (e.g. 4KB). The storage
automatically ensures strong consistency between the data copies as well as even data
distribution, thereby preventing data hotspots. All the hard disks in the storage resource
pools can function as hot spare disks, which helps consolidate all disks on commodity
servers into a single pool of disks[53]. FusionStorage is similar in architecture to the Ceph
block storage solution[58], which is open-source.

Figure 4.1: High level SAN architecture of the storage tier.
In this dissertation, we only disclose FusionStorage implementation details that are
covered by the publicly available white paper[3]. Readers can find more details about a
SAN implementation in the Ceph source code. Figure 4.1 demonstrates the SAN system
architecture. Users can create virtual volumes (vol1, vol2, vol3) from the SAN pool. The
volumes can then be attached as new disks to virtual or physical machines, labelled as 1, 2
and 3. Client nodes access these volumes as block devices where data is stored in the form
of disk blocks, denoted as green and red circles. In our platform, SAN clients reside on
different nodes to enforce a decoupled architecture. The SAN servers, denoted as A, B and
C, include metadata management, disk management and caching mechanisms. Typically
SAN servers replicate disk blocks across multiple disks and servers: 3 red replicas and 3
green replicas. They provide data balancing, thin provisioning and a variety of recovery
mechanisms.
4.1.3 GFS2
GFS2 is a shared-disk file system for Linux commodity clusters. GFS2 is very different
from distributed file systems (such as HDFS, Lustre or GlusterFS) since it does not have
a metadata master and allows all nodes concurrent access to the same shared block
storage. Moreover, GFS2 can be used as a local filesystem, just like ext3. It is a
POSIX-compliant filesystem.
It is primarily designed for Storage Area Network (SAN) applications in which each
node in a GFS2 cluster has equal access to the storage. To limit access to areas of the
storage and maintain filesystem integrity, a lock manager is used; in GFS2 this is a
distributed lock manager (DLM). The DLM works on an inode basis, i.e. each writer locks
an inode while writing to it. It is also possible to use GFS2 as a local filesystem with the
lock_nolock lock manager instead of the DLM. The locking mechanism is replaceable and
can be easily swapped should a more specialized lock manager be needed in the future.
The design of GFS2 is a perfect match for SAN-like environments such as FusionStorage.
It is compatible with a variety of block device protocols, e.g., SCSI, iSCSI, FibreChannel,
AoE, or any other device which can be presented under Linux as a block device shared by
a number of nodes, for example a DRBD device.
4.1.4 YARN
YARN is essentially a system for managing distributed applications. It consists of a
central ResourceManager, which arbitrates all available cluster resources, and a per-node
NodeManager, which takes coordination from the ResourceManager and is responsible
for managing resources available on a single node.
In YARN, the ResourceManager is, primarily, a capacity scheduler. Essentially, it is
strictly limited to arbitrating the available resources in the system among the competing
applications. It optimizes for maximum cluster utilization under various constraints such
as capacity guarantees, fairness, and SLAs.
YARN has a special program called the ApplicationMaster. The ApplicationMaster
is, in effect, an instance of a library that can be used by different analytics engines to
negotiate resources from the ResourceManager and work with the NodeManager(s) to
execute and monitor the containers and their resource consumption. For instance, the
Spark driver program can run inside the ApplicationMaster, which is responsible for
negotiating appropriate resource containers from the ResourceManager, tracking their
status and monitoring progress.
YARN is designed to allow individual applications (via the ApplicationMaster) to
utilize cluster resources in a shared, secure and multi-tenant manner. It also remains
aware of cluster topology in order to efficiently schedule and optimize data access, i.e.
reduce data motion for applications to the extent possible. To meet these goals, the
ResourceManager has extensive information about an application's resource needs, which
allows it to make better scheduling decisions across all applications in the cluster. This
leads us to the ResourceRequest and the resulting Container. Essentially, an application
can issue specific resource requests via the ApplicationMaster to satisfy its resource needs.
The Scheduler responds to a resource request by granting a container, which satisfies the
requirements laid out by the ApplicationMaster in the initial ResourceRequest. The
ResourceRequest object contains hostnames and corresponding container sizes (CPU and
RAM). YARN enables relaxed locality in its default mode, which means that data locality
can be ignored if the requested host does not have the required CPU and RAM. If a
ResourceRequest has both its hostname and its container size fulfilled, the allocation is
designated NODE_LOCAL because the subsequent task execution can access data locally.
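The relaxed-locality fallback can be illustrated with a toy allocation function; this is purely our simplification of the scheduler's observable behaviour, not YARN code:

```java
public class LocalityDemo {
    // Returns the locality level granted for a request, given whether the
    // preferred host has free capacity. Mirrors YARN's relaxed-locality fallback.
    public static String allocate(boolean preferredHostHasCapacity, boolean relaxLocality) {
        if (preferredHostHasCapacity) return "NODE_LOCAL";
        if (relaxLocality) return "RACK_LOCAL";   // fall back to another node in the rack
        return "WAIT";                            // strict locality: keep waiting
    }

    public static void main(String[] args) {
        System.out.println(allocate(true, true));   // NODE_LOCAL
        System.out.println(allocate(false, true));  // RACK_LOCAL
        System.out.println(allocate(false, false)); // WAIT
    }
}
```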
4.2 Control and Data Flow
Figures 4.2, 4.3, 4.4 and 4.5 show how the components interact with each other. Since our
architecture is complex and involves many software technologies and a multitude of third-
party libraries, it is best to demonstrate the control and data flow in multiple figures.
Figure 4.2 outlines the high-level view of our implementation. It shows a snapshot of all
components during a single application run. In the figure we use a generic analytics engine;
in our prototype we have tested two different frameworks, Spark and Hadoop MapReduce.
We will walk readers through each component interaction by zooming in on each region
of Figure 4.2.

Figure 4.2: Components in the consolidated platform.
Figure 4.3 shows how a job is initially submitted to the cluster.

Figure 4.3: Top tier job submission process in Gluon.

In the first operation, the client (1) submits an application jar file to the ResourceManager.
The ResourceManager then (2) decides where to allocate a container for the main program
of the application and requests a NodeManager to (3) allocate a container. Once the
container is allocated, the main program is started within the ApplicationMaster. The
main program then (4) communicates with the AlluxioMaster to retrieve any relevant
data information, such as file block locations, permissions and file sizes. The AlluxioMaster
is always aware of the current state of each worker and of the underlying file system;
updates to the AlluxioMaster happen through the AlluxioWorkers. For example, when an
AlluxioWorker creates a new file block, it notifies the AlluxioMaster. After the
ApplicationMaster retrieves the file block information, it (5) creates a ResourceRequest
object and sends it to the ResourceManager. The ResourceRequest contains a list of
hostnames per task execution. The ResourceManager attempts to allocate a container
based on the user preference (e.g. NODE_LOCAL); if the preference cannot be satisfied,
it grants a RACK_LOCAL container. The ResourceManager then (6) requests the
NodeManagers to provide containers, and the NodeManagers (7) allocate new containers
for task execution.
Figure 4.4: Cache and bottom tier interaction with top tier in Gluon. The READ operation.
Once containers are allocated, the ApplicationMaster's main program starts the tasks
in their assigned containers (Figures 4.4, 4.5). To perform a READ operation, Task
Executors use the AlluxioClient to open an input stream from GFS2 and start (8) reading
a file block directly. Reading from a GFS2 file triggers the Server SAN client to fetch the
necessary disk blocks (e.g. 4KB blocks) that correspond to the requested inode from the
remote Server SAN. While being read, a file block is also (9) stored in RamFS, i.e. we
cache on read. If the RamFS directory is running out of space, the Task Executor requests
the AlluxioWorker to evict some blocks. The AlluxioWorker uses an LRU evictor to move
blocks to the local-disk file system (e.g. ext3).
In case a file block already exists in the RamFS of any of the AlluxioWorker nodes, a
BlockInStream object is created and the block is read directly from the memory of that
node. In this case, a blockId is calculated using the position in the file. Typically, the
ApplicationMaster assigns a partition of a file to a task; the partition information contains
the offset and the partition size.

Figure 4.5: Cache and bottom tier interaction with top tier in Gluon. The WRITE operation.
Another procedure takes place during the WRITE operation. Unlike the READ
operation, before performing the write the Task Executor adds a journal entry into the
AlluxioMaster using the AlluxioClient's createFile API. Then the Task Executor (8)
checks whether there is available memory in RamFS and creates a RamFS output stream
directly. If there is not enough space in the RamFS directory, the Executor requests the
AlluxioWorker to (9) evict some blocks and at the same time tries to write to remote
nodes with available space in their RamFS. After the Task Executor writes all file blocks
to RamFS, it notifies the AlluxioWorker to persist the written file blocks to GFS2. The
AlluxioWorker locks these blocks (the lock prevents eviction) and, immediately after the
blocks are written to RamFS, it (9) starts to move them to GFS2. Consequently, the
blocks are propagated to the remote SAN. The Task Executor does not wait for the
propagation to finish.
4.3 Connecting storage component
4.3.1 Server SAN to filesystem connection
This choice presents an interesting challenge because a Server SAN is not a file system
but block storage. Moreover, most of the popular NAS filesystems can already be
integrated into the Hadoop ecosystem; SANs, on the other hand, are rarely covered. SANs
communicate over the SCSI protocol and present data the same way a physical block
device does. The Server SAN can be replaced with Ceph, Lustre, NFS and many other
systems in our platform. NFS and Lustre are file systems and thus require less integration
effort. By connecting a SAN to the Hadoop ecosystem we can cover the majority of
storage platforms.
Figure 4.6: Over-replication problem.

If we are to connect SAN clients to Hadoop, the obvious solution is to (1) create an
individual volume from the SAN pool, (2) mount ext3/ext4 on each volume and (3)
co-locate an HDFS DataNode with each filesystem. However, as discussed, we then face
3-way replication both in Hadoop and in the Server SAN, which leads to data redundancy
and unnecessary overhead. Figure 4.6 shows how for each HDFS file block (red and green
squares) there will be 3 disk blocks (red and green circles) stored in the SAN. If we instead
configure Hadoop with no replication, we risk an unreliable system regardless of replication
in the Server SAN, because we need file-level reliability. Nevertheless, there is another
option: configure the Server SAN such that one large volume is shared among the Hadoop
nodes. This way, even if one of the Hadoop nodes fails, we still have access to the same
shared volume from the other Hadoop nodes. The challenge then becomes how to
transform shared disk-block-level access into file-level access, since Hadoop only works
with files.
We employ shared-disk GFS2 to access data on Server SAN. GFS2 is installed on the
Server SAN client nodes. GFS2 is a clustered file system that allows for synchronized
access to a shared block device. Figure 4.7 shows our storage tier architecture. In our
case the shared block device is the Server SAN volume (vol1) from the SAN storage pool.
GFS2 is mounted after installation onto the nodes where SAN is installed, i.e. Server
SAN client nodes 1, 2 and 3. The mount directory is the same on all Server SAN client
nodes, i.e. when a user creates a file a.txt under mount directory /gfs2 on Server SAN
client 1, the file a.txt appears under mount directory /gfs2 on Server SAN client 2.
The disadvantage of a clustered file system like GFS2 is contention during parallel
writes to the same file. GFS2 relies on the Distributed Lock Manager to control parallel
writes. However, in analytics workloads it is rare for tasks to write to the same output
file. For instance, HMR tasks write files in the reduce stage, and reducers each have their
own output partitions; therefore, file write contention does not happen in a typical
MapReduce scenario. Further investigation of HDFS reveals that it does not allow multiple
writers to the same file either.
Thus we can conclude that our platform will very rarely encounter file write contention.
Nevertheless, in HMR and other analytics engines, file create and delete contention is
inevitable in our platform since the DLM operates on an inode basis. This means that if
a task creates a file under a directory that is locked, the task will need to wait until the
lock-holding task has finished its creation or deletion procedures.
Figure 4.7: Final Storage Tier Architecture.
4.4 Connecting GFS2 with Analytics Engines
We focus on applications that work in the context of YARN or support the Hadoop
FileSystem API. All these applications can access data from HDFS or from file systems
that are compatible with the API. Theoretically, any data analytics application can work
with our platform as long as it is compatible with the FileSystem API. However, we have
only tested our platform with the 2 above-mentioned applications and the Hama
framework[50].
Table 4.1 lists the core HDFS API calls. There are many more calls in the actual
Hadoop API; here we show only the main methods that typically impact performance.
We translate these calls to our storage service, using basic Java File streams to access the
POSIX-compliant GFS2, essentially making the storage service accessible through the
Hadoop FileSystem API. Some challenges arise when implementing getFileLocations
because this function can impact the parallelism of tasks.
Table 4.1: Core HDFS API calls and their translation

create(Path p): Creates an FSDataOutputStream at a given HDFS path. Translation:
instead of an FSDataOutputStream we return a java.io.FileOutputStream at the given
path.

getFileStatus(Path p): Returns a FileStatus object that represents the path in HDFS.
Translation: return a FileStatus object that contains the file metadata.

mkdirs(Path f): Makes the given file and all non-existent parents into directories in
HDFS. Translation: create the directory recursively, e.g. translate to a "mkdir -p
DIR_NAME" call in the POSIX-compliant storage service.

open(Path f): Opens an FSDataInputStream at the indicated path in HDFS. Translation:
create a storage service stream java.io.FileInputStream at the given path.

getFileLocations(Path p, long start, long len): Returns an array containing hostnames,
offsets and sizes of portions of the given file in HDFS. Translation: the storage service is
fully decoupled, therefore we can return any active hostname on the compute layer.
Reading/writing data at the compute node triggers communication with the storage
master, which directs the call to the corresponding storage worker.
Initially, we developed our own connector to link Hadoop with GFS2. The connector
can interact with the HMR JobTracker, the YARN ApplicationMaster, the Spark Master
or other framework masters to spawn tasks on the servers or VMs that have the data
accessible.
Figure 4.8 shows how we connect the application tier with the storage tier. In our
platform we co-locate YARN NodeManagers with SAN clients. Note that, when a resource
manager is not used, application workers (e.g. Spark Standalone) would be co-located
with the SAN clients. Any file in GFS2 is accessible through any server or VM that has a
SAN client. The ApplicationMaster interacts with GFS2 to determine file locations. When
the ApplicationMaster receives the file metadata, it requests containers on specific nodes
from the ResourceManager. Once the ResourceManager grants the permission, the
containers are launched and tasks start executing inside them. Tasks retrieve data through
their local Connectors.
Figure 4.8: Application Tier and Storage Tier Interaction.

There is a way to include an implementation of any file system without ever changing
the Hadoop source code. We only need to compile our implementation of a file system
plugin (or a newly designed file system) into a jar file and add that jar to the Hadoop
classpath. Finally, we add a property called fs.SCHEME.impl to core-site.xml. This
property specifies the core class of our file system plugin; the specified class must inherit
from the Hadoop FileSystem class.
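For example, registering a gfs2 scheme would look like the fragment below (the implementation class name org.example.gluon.Gfs2FileSystem is a placeholder for illustration, not the exact class from our code base):

```xml
<!-- core-site.xml -->
<property>
  <name>fs.gfs2.impl</name>
  <value>org.example.gluon.Gfs2FileSystem</value>
</property>
```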
We have built our connector based on the S3 connector designed by the Hadoop
developers. Our plugin maps requests to the gfs2:/// scheme onto the local file system
under the mount directory specified in core-site.xml. The connector that we implemented
can be re-used for any shared POSIX-compliant file system such as Lustre, Ceph and
NFS. We also do not need to support S3 and other popular object stores because Hadoop
already provides that integration and we can re-use its connectors.
4.5 Cache integration
4.5.1 GFS2 to Alluxio connection
In our platform, we leverage an open-source in-memory distributed cache called
Alluxio[35]. Alluxio has a simple design and follows the HDFS structure. It has a
master-slave architecture, which makes it an ideal candidate for collaborative caching.
Unlike HDFS, Alluxio stores file blocks on RAM disks. RAM disks behave as regular disks,
but they store data in the OS page cache, which occupies part of the physical RAM of the
server.
Since Alluxio follows the HDFS structure, it already has the Hadoop FileSystem API
implemented. Alluxio is also compatible with various bottom tiers including HDFS, S3,
GCS and GlusterFS. To access bottom tiers, Alluxio provides an HDFS-like FileSystem
API: the UnderFileSystem API. We implemented our own version of the UnderFileSystem
API to allow smooth access to GFS2. We essentially re-used our previous connector design
and extended the implementation of Alluxio's LocalUnderFileSystem class, which is used
to access local directories.
When the Alluxio cache is empty, clients are directed to fetch data from the GFS2
nodes. The data access pattern is important because it determines which nodes will cache
the data; it can also impact the performance of the initial job. Figure 4.9 shows how the
data layout can impact job performance. When the YARN ApplicationMaster requests 5
file locations (a.txt through e.txt), the AlluxioMaster responds with the locations of all
files. In this example, since the data is not yet cached, the location returned for all file
requests is the local host, which is node 1. Thus the ApplicationMaster requests the
ResourceManager to launch containers on node 1. However, each node only has 3 task
slots (denoted as grey circles) available due to RAM and CPU restrictions. The
ApplicationMaster launches 3 tasks on node 1, denotes the 2 other tasks as RACK_LOCAL
and places them on nodes 2 and 3. The rack-local tasks fetch the data from node 1,
because they assume that they do not have the data available locally. This results in
caching data non-uniformly and affects the performance of new jobs that re-use the
cached data. However, in our platform all files can be accessed through any node that has
a SAN client; therefore, with the correct file locations this extra communication overhead
can be avoided.
Figure 4.9: Performance degradation due to uneven data representation.
In our work, we program a uniform data representation in our UnderFileSystem API
implementation. We use GFS2 path hashcodes and offsets to calculate a preference node
for the ApplicationMaster. The uniform data representation allows us to return locations
such that the ApplicationMaster can assume that all of the healthy nodes have the data
available. For instance, returning a list of hostnames such as [node2, node1, node3] from
getFileLocations gives the task a NODE_LOCAL designation.
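The preference-node computation can be sketched as follows. This is our illustrative reconstruction (class and method names are hypothetical), not the exact UnderFileSystem code: it hashes the file path together with the block index onto the list of healthy worker hostnames, so every node appears to hold some of the data.

```java
public class UniformLocator {
    // Deterministically map a (path, offset) pair onto the worker list so that
    // the blocks of a file spread uniformly across all healthy nodes.
    public static String preferredHost(String path, long offset, long blockSize, String[] hosts) {
        int blockIndex = (int) (offset / blockSize);
        // floorMod keeps the slot non-negative even if hashCode() is negative.
        int slot = Math.floorMod(path.hashCode() + blockIndex, hosts.length);
        return hosts[slot];
    }

    public static void main(String[] args) {
        String[] hosts = {"node1", "node2", "node3"};
        long blockSize = 64L * 1024 * 1024;  // 64MB blocks, as in our platform
        // Consecutive blocks of the same file map to different hosts (round-robin
        // shifted by the path hash).
        for (long off = 0; off < 3 * blockSize; off += blockSize) {
            System.out.println(preferredHost("/gfs2/a.txt", off, blockSize, hosts));
        }
    }
}
```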
In our platform, we deploy Alluxio such that data is cached in the AlluxioWorkers first
and then propagated to the SAN storage via GFS2. This means that consecutive reads of
the same data will hit the cache and achieve the highest performance. We also configure
Alluxio to cache on reads, which allows consecutive workloads to share cached data. We
set the cache WRITE policy to LOCAL_FIRST, which helps us keep all written data
local: a task tries to write all its file blocks on a single node until it runs out of memory,
and then scans neighbouring nodes for available memory. On reads, in case file blocks are
not in the AlluxioWorkers, they are brought in from GFS2 during job execution.
Finally, we implemented asynchronous propagation of writes by extending the
ASYNC_THROUGH feature in Alluxio. This allows data to be asynchronously propagated
to GFS2 (FusionStorage). Note that Alluxio provides ASYNC_THROUGH as an
experimental feature, which did not work on our platform because analytics engines (e.g.
HMR and Spark) create temp files and then rename them. We modified the rename
function in the AlluxioClient to enable file persistence, thereby enabling Alluxio to perform
asynchronous writes seamlessly.
We also leverage the Alluxio tiered storage feature to evict blocks onto local disks that
are shared with shuffle spills. Essentially, eviction is a copy from the RAM disk onto a
physical disk and a subsequent removal of the block from the RAM disk.
We utilize the default YARN scheduler to queue tasks and jobs: the CapacityScheduler.
We mentioned previously that the CapacityScheduler uses "relaxed" locality in its default
configuration: it ignores locality preferences if the preferred node's capacity is exceeded
(CPU and RAM are busy). This affects the Alluxio cache by producing file block replicas
in the main memory of AlluxioWorker nodes. If an AlluxioWorker does not have a file
block in its local cache but has been assigned a task working on that file block, it copies
the contents of the block from a neighbouring worker node that already has the requested
file block cached.
In our platform, the uniform distribution of file blocks in the GFS2 data representation
allows for a uniform distribution of tasks on data reads. However, block replicas are still
possible: for instance, two tasks in two different VMs may request data splits from the
same Alluxio file block, causing the same block to be cached on two different VMs. The
number of replicas can be reduced by decreasing the block size. Our platform uses a 64MB
block size, unlike the default Alluxio block size of 512MB.
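For reference, the block size corresponds to a single Alluxio client property (Alluxio 1.x naming; the exact key may vary across versions):

```properties
# 64MB blocks instead of Alluxio's 512MB default, to reduce duplicate caching
alluxio.user.block.size.bytes.default=64MB
```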
We also disable the caching of remote blocks that normally happens when an
AlluxioWorker reads a block from the memory of a remote AlluxioWorker: we modify
Alluxio such that no remote block copy is made when using the CACHE policy.
4.6 Spark integration
In Spark, caching is performed using the persist() command. Spark divides functions
into two categories: transformations and actions. Transformations are just records of
operations and are executed whenever an action is triggered (i.e. they are lazy operations).
The persist() command is a transformation; therefore data is cached on the first action
after persist(). Spark allows 3 main levels of caching: MEMORY, DISK and OFF_HEAP.
There are also options for data serialization and combinations of levels. However, all of
these levels only store data on the local machine/VM. Consequently, a user needs to worry
about how much memory to allocate to each Spark executor in order to fit the cached
RDD partitions in the available memory.
Figure 4.10: Performance comparison of Spark caching methods

In our platform, we use Alluxio, which is an external service to Spark. Therefore,
we cannot use the persist() command to cache data. For external caching purposes,
Spark developers recommend using the checkpoint() or saveAsTextFile() methods.
There are many disadvantages associated with these two commands. First, the
checkpoint() command implementation is not the same as persist(): it requires two
computations for the same action, one to perform the action and a second to perform
the actual checkpoint operation. Second, saveAsTextFile() is an action; therefore it
requires an additional computation.
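The cost difference can be illustrated with a toy model of Spark's lazy evaluation (a minimal sketch, not Spark's actual implementation): persist() caches partitions as the first action computes them, while a checkpoint-style materialization forces one extra pass over the lineage.

```python
# Toy model of lazy RDD evaluation (not the real Spark API). It counts
# how many times an expensive map function runs under persist-style
# caching vs. checkpoint-style recomputation.

class ToyRDD:
    def __init__(self, data, fn=None, parent=None):
        self.data, self.fn, self.parent = data, fn, parent
        self.cache = None          # filled by persist() on first action
        self.persisted = False

    def map(self, fn):
        return ToyRDD(None, fn, self)

    def persist(self):
        self.persisted = True      # lazy: nothing is computed yet
        return self

    def _compute(self):
        if self.persisted and self.cache is not None:
            return self.cache
        rows = self.data if self.parent is None else [
            self.fn(r) for r in self.parent._compute()]
        if self.persisted:
            self.cache = rows
        return rows

    def count(self):               # an action: forces computation
        return len(self._compute())

calls = {"n": 0}
def expensive(x):
    calls["n"] += 1
    return x * 2

base = ToyRDD(list(range(100)))

# persist(): the map runs once; the second count() hits the cache.
rdd = base.map(expensive).persist()
rdd.count(); rdd.count()
persist_calls = calls["n"]         # 100 calls total

# checkpoint-style: the first action computes the lineage, then the
# materialization pass recomputes it before later actions can read it.
calls["n"] = 0
rdd2 = base.map(expensive)
rdd2.count()                       # first action: 100 calls
saved = ToyRDD(rdd2._compute())    # checkpoint write: 100 more calls
saved.count()                      # reads the materialized copy: 0 calls
checkpoint_calls = calls["n"]      # 200: twice the work of persist()

print(persist_calls, checkpoint_calls)
```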
We performed a simple experiment where we tested the three methods in the same
application. The application reads a large file, performs two random map operations,
"caches" the RDD and finally does two back-to-back count operations. For persist()
and checkpoint() we just call these methods before the count() calls. For
saveAsTextFile() we also perform a textFile() operation to read the saved RDD back.
In all of the methods we use Alluxio as the caching layer in asynchronous mode. To
use the persist() command with Alluxio, we choose the DISK_ONLY option in Spark
and point Spark disk writes to the Alluxio RamFS directory using the spark.local.dir
configuration. This gives a fair comparison of the three methods performing the same
operation. Figure 4.10 shows the result of our test. From the test run it becomes
obvious that checkpoint() and saveAsTextFile() are very slow and cannot compete
with persist().
Consequently, we have implemented AlluxioBlockManager.scala in Spark in order
to allow the persist() command to use the Alluxio client API. The AlluxioBlockManager
class acts similarly to Spark's DiskStore class, except that instead of a FileOutputStream
object it creates an Alluxio FileOutStream object. We have tested our implementation
with Spark-1.6.3 because it provides the ExternalBlockStore class that we rely on.
Nevertheless, it is still possible to implement both classes in later versions of Spark;
we leave that for future work.
4.7 Additional optimizations
4.7.1 Asynchronous Delete
Many iterative workloads delete previous data after each iteration. The time it takes to
delete a directory can greatly affect overall job performance. To overcome this challenge,
HDFS performs asynchronous deletion, i.e. it schedules block IDs to be removed later. In
our platform, deletion is synchronous: a task has to wait until the storage tier
returns the acknowledgement that the file is deleted.
Speeding up the deletion process is tricky because the storage tier controls file-related
operations, and we try not to modify tiers other than the cache. Hence, we
introduce a queue-based deletion mechanism where delete requests are added to the
end of a queue and a background thread processes the request at the head of the
queue. The mechanism operates at the caching tier, and thus does not require
modifications to the storage tier. There are subtle issues with our approach, since
there may now be inconsistencies between the cache and remote storage with respect
to delete operations. Such inconsistencies can be severe when a user tries to create a
file or a directory with the same name as a file that is scheduled for deletion. To
overcome this problem, we move all files scheduled for deletion to a specialized
directory under the root directory and only then schedule a deletion of the specialized
directory.
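The mechanism can be sketched as follows (a minimal illustration, not the actual Gluon code; the trash-directory layout and names are assumptions):

```python
# Sketch of queue-based asynchronous deletion with a rename-then-delete
# step to avoid name clashes (illustrative only; not the Gluon sources).
import os
import queue
import shutil
import tempfile
import threading
import uuid

class AsyncDeleter:
    def __init__(self, root):
        # Hypothetical trash directory under the cache root.
        self.trash = os.path.join(root, ".gluon_trash")
        os.makedirs(self.trash, exist_ok=True)
        self.q = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def delete(self, path):
        # Rename first: the original name is immediately free for reuse,
        # so a later create with the same name cannot collide.
        target = os.path.join(self.trash, uuid.uuid4().hex)
        os.rename(path, target)
        self.q.put(target)          # the actual removal happens later

    def _drain(self):
        while True:
            target = self.q.get()
            if target is None:
                break               # shutdown sentinel
            shutil.rmtree(target, ignore_errors=True)
            self.q.task_done()

    def close(self):
        self.q.join()               # wait for queued removals to finish
        self.q.put(None)
        self.worker.join()

# Usage: delete() returns as soon as the rename completes.
root = tempfile.mkdtemp()
d = os.path.join(root, "iteration-output")
os.makedirs(d)
deleter = AsyncDeleter(root)
deleter.delete(d)
os.makedirs(d)                      # the same name can be recreated at once
deleter.close()
print(os.path.exists(d))
```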
4.7.2 File consistency checker
It is important to detect any changes in the remote storage because another workload
(e.g. transaction processing) may add files to GFS2. In that case Alluxio may not be
notified, because the Alluxio cache is only used for analytics workloads. Therefore, it
is important to constantly check for consistency between the Alluxio Master journal
entries and the actual data in GFS2.
We designed and implemented an external file consistency checker that performs fast
lookups of file paths and timestamps. It populates a hashmap with a snapshot of the
GFS2 mount and compares the hashmap entries against the Alluxio Master entries. If
there is extra data in Alluxio, it is left as is: this may be an inconsistency due to a
job's file being temporarily stored on the cache layer, or some job performing
asynchronous propagation. If, on the other hand, there is extra data in GFS2, the
Alluxio Master is notified and a new entry is added to it. The file consistency checker
wakes up every 3 seconds to check GFS2 and Alluxio. In our job runs, the consistency
checker showed no significant overheads.
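The core comparison can be sketched as below (an illustration under assumed data structures, not the actual checker; Gluon's checker additionally compares timestamps and notifies the real Alluxio Master):

```python
# Sketch of the one-directional consistency check between a storage
# snapshot and the cache master's view (illustrative; names assumed).
import os
import tempfile

def snapshot(mount_root):
    """Build a {relative_path: mtime} hashmap of the storage mount."""
    snap = {}
    for dirpath, _, files in os.walk(mount_root):
        for name in files:
            full = os.path.join(dirpath, name)
            snap[os.path.relpath(full, mount_root)] = os.path.getmtime(full)
    return snap

def reconcile(storage_snap, master_entries):
    """Return paths present in storage but unknown to the master.

    Extra entries on the master side are deliberately ignored: they may
    be transient cache-only files or pending asynchronous propagation.
    """
    return sorted(p for p in storage_snap if p not in master_entries)

# Usage with a temporary directory standing in for the GFS2 mount.
root = tempfile.mkdtemp()
for name in ("a.dat", "b.dat"):
    open(os.path.join(root, name), "w").close()

master = {"a.dat": 0}              # the master only knows about a.dat
missing = reconcile(snapshot(root), master)
print(missing)                     # ['b.dat'] would be added to the master
```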
4.8 Summary
In this chapter we showed how the storage component was integrated. We chose Server
SAN as the storage service because Hadoop did not previously work with block devices,
and we used GFS2 to connect to Server SAN. We also discussed how we connected the
application layer to GFS2, covering challenges such as over-replication that were
encountered during the implementation phase.

We also described the Alluxio integration into our platform. We pointed out overheads
that caused performance degradation and showed our approach to resolving them. We
also stressed specific configuration settings and provided a thorough explanation of why
they were chosen.

Finally, we discussed code optimizations in the cache and analytics tiers, which boost
performance and usability, and we showed how control and data flow in our platform.
The flow analysis helps to better understand the overall platform architecture and to
point out bottlenecks for future development.
Chapter 5
Evaluation
This chapter presents performance results from our experimental testbed. We outline
the configuration of our testbed as well as its throughput measurements. We then
perform three sets of comparisons.

First, we compare against vanilla Spark deployments. Our goal is to evaluate default
Spark configurations in comparison to the Gluon deployment. We use real-world data
and SparkBench workloads[36] and show how performance changes with memory
utilization on both platforms. We also discuss Gluon's remote data write/read statistics
in uniform as well as skewed workloads, and show the benefits of integrating idle
memory nodes into the computation.

Second, we focus on a Hadoop cluster that runs on top of HDFS. We run two well-
known workloads to show how Gluon can improve Hadoop performance; in this
comparison we use Intel's HiBench suite[26]. Finally, we use Apache Hama[50] to see
how a non-MapReduce framework performs on Gluon.
5.1 Environment Setup
All of our experiments are executed on a cluster of servers at the University of Toronto.
We have two sets of dedicated servers in our cluster. The first set is used for the compute
layer of the data analytics platforms. This set has 3 large servers (c181, c172, c178),
each hosting 32GB RAM, 32 Intel(R) Xeon(R) E5-2650 @ 2.00GHz cores, one 300GB
SATA HDD and one 10Gbit/s network card. The second set of servers is used purely
for the storage layer and is located on a different rack. This set includes 3 extra-large
servers (c160, c168, c171), each hosting 48GB RAM, 32 Intel(R) Xeon(R) E5-2650 @
2.00GHz cores, five 300GB SATA HDDs and one 10Gbit/s network card. Intra-rack
throughputs vary in the range 700MB/s-1GB/s whereas inter-rack throughputs are in
the range 380MB/s-550MB/s. Disk write and read throughputs are 150MB/s and
180MB/s respectively.

Our setup provides a total of 96 cores, 96GB of RAM and 900GB of HDD for execution
and cache nodes. The storage layer has 96 cores, 144GB of RAM and 4.5TB of disk space.
In the following we describe the versions and configurations for the software platforms
used in tests:
• Spark-HDFS: we install Hadoop-2.6.0 and Spark-1.6.3 on the compute layer nodes
c181, c172 and c178. We configure YARN such that a maximum of 72GB RAM and
72 CPU cores can be spared for HMR or Spark. We also provide HDFS with 600GB
of disk space for storage; 300GB of storage is reserved for shuffle data.

• vanilla Hadoop: we install Hadoop-2.6.0 on the compute layer nodes c181, c172
and c178. We configure YARN such that a maximum of 72GB RAM and 72 CPU cores
can be spared for Hadoop MapReduce. We also provide HDFS with 600GB of disk
space for storage; 300GB of storage is reserved for shuffle data. We set the Gluon cache
size to a total of 20GB, which leaves the YARN cluster with 52GB of RAM for execution.
HMR applications do not utilize local memory as intensively as Spark applications
and do not support explicit .cache commands, so all HMR memory resources are
spared for execution only.
• Gluon: we use our modified versions of the packages mentioned above. We install
our version of Alluxio on the compute nodes. In our experiments we change the local
memory sizes of Alluxio workers in accordance with the Spark-HDFS cache sizes, and
spare 10% of the shuffle storage to the Alluxio local disk cache as a default setting.
Next, we install Huawei FusionStorage (Server SAN management software) on the
storage servers c160, c168 and c171. The FusionStorage Manager is installed on c160,
with FusionStorage Agents on c160, c168 and c171, and FusionStorage Clients on
c181, c172 and c178. We then create a 3TB volume from this storage pool and attach
it to the compute layer as a new block device. Finally, we install and configure GFS2
on the compute layer and mount it on top of the 3TB volume.
5.1.1 Benchmarks
We compare Gluon to two architectures: Spark-HDFS and vanilla Hadoop. We use
cache-intensive workloads for the comparisons with Spark-HDFS. To compare Gluon
with vanilla Hadoop we use Intel's HiBench suite[26].

In the Spark evaluation we use a simple Spark Count program and two workloads from
SparkBench[36]: Logistic Regression and PageRank. The Spark Count program reads
a randomly generated 8GB file, performs two random map operations, caches the RDD
and counts the number of lines twice.
Logistic Regression is a widely adopted machine learning tool used to predict continuous
and categorical data[29][25]. For instance, it can predict whether a patient has a given
type of cancer based on a variety of characteristics such as blood tests, disease history,
age, etc. The Logistic Regression algorithm is an ideal candidate for Spark caching
because it needs to hold an RDD in the cache while it iterates over that RDD: the
algorithm calculates the parameter vector, then updates and broadcasts it in each
iteration. We run Logistic Regression on a real dataset, Wikipedia articles, which
includes almost 7 million English articles[6]. We follow the SparkBench approach and
extract plain text from the Wikipedia XML articles using the WikiXMLJ parser[7]. We
then compute TF-IDF vectors over a fixed-size vocabulary from the set of documents
and use the TF-IDF output as the input to our Spark program. The output data from
TF-IDF is approximately 18.7GB in size; its format includes two columns, (1) a
category index and (2) an array of TF-IDFs.
To evaluate data-skewed workloads we focus on the PageRank algorithm[44]. The
algorithm was first used by the Google web search engine to rank pages by measuring
the importance of website pages based on the number and quality of links to a page.
We use the LiveJournal[4] graph data as the input data set of this workload. The
LiveJournal graph contains 68 million edges, and the data size is approximately 1.3GB.
In the Hadoop evaluation we first use a read-write test called DFSIO. DFSIO is a simple
READ/WRITE test that spawns multiple mappers of the HMR framework; the mappers
write/read randomly generated data to/from the target storage (e.g. HDFS). We then
use TeraSort, probably the most well-known Hadoop benchmark. The goal of TeraSort
is to sort a given amount of data as fast as possible; it combines testing of the HDFS
and MapReduce layers of a Hadoop cluster. In our case we sort a 10GB file from
HiBench. Finally, we re-use the LiveJournal graph to run PageRank on Hadoop
MapReduce.
We also tested Gluon under a completely different platform that is gaining popularity
in the Big Data community. In 2010, Google completely replaced its MapReduce
platform in favor of Pregel[38], which is built in the context of Bulk Synchronous
Parallel (BSP), a compute paradigm based on message passing[56]. Readers are
encouraged to learn more about these concepts in the aforementioned papers. We used
a Pregel-like framework called Apache Hama[50]. The main difference between BSP
frameworks and MapReduce is that BSP workers typically load data into their local
memories, compute on that local data and pass messages to their peers. BSP algorithms
tend to have multiple iterations, unlike two-stage MapReduce, and BSP workers write
data back to the underlying storage upon finishing all the required iterations.

Hama runs on top of Gluon seamlessly, without any difficulty in installation. The
configuration settings are similar to those used when deploying Hama on top of HDFS.
Hama can run with YARN or in standalone mode.
We run the label propagation algorithm (LPA) on the LiveJournal graph used
previously. Label propagation is extensively used in social networks to detect
communities based on the influence of a particular member. It is essentially a clustering
algorithm that associates each vertex in the graph with a certain community. LPA
represents workloads from companies that process large graphs on a daily basis, such
as Facebook, Twitter and Google. We vary the number of Hama workers to see how
the job duration declines.
5.2 Comparative evaluation using Spark
5.2.1 Spark count
In our first experiment, we test our AlluxioBlockManager implementation. We added
one class to the Spark code to allow the .persist command to store data in Alluxio. In
this experiment we provide a large amount of memory (20GB) to both Spark
MEM_ONLY and Alluxio.

Figure 5.1: Performance comparison of Spark caching methods

Figure 5.1 shows results from 4 job runs. As we can see from the experiment, native
.persist is much faster than the suggested functions that were designed to interact with
external services. This confirms that our implementation is on par with native Spark
caching given the same amount of memory.
5.2.2 Logistic Regression
We start with small Executor sizes and increase them to see performance gains in two
Spark-HDFS configurations (MEM_ONLY, MEM_AND_DISK), Gluon and an
off-the-shelf Alluxio configuration. Since we focus on cache performance in this
experiment, we pre-load the 18.7GB data into the Gluon disk cache and ingest the same
data to HDFS in the Spark-HDFS platform. This setup provides equal read performance
for both architectures and focuses on in-job caching performance.
Figure 5.2: Performance comparison of Spark caching methods vs. Gluon collaborative
caching and Alluxio caching. The number of cores for all runs is 21.

Figure 5.2 shows job duration, which includes training time (95%), testing time (2%)
and warm-up time. The default Spark setting lags significantly in lower-provisioned
runs. The reason is that the size of the cached data is 7.5GB, and the MEM_ONLY
configuration has to re-compute RDD partitions that did not fit in the cache. On the
other hand, we show that Gluon in its default configuration outperforms the Spark
default by 2.88x and matches the Spark MEM_AND_DISK configuration because
blocks that do not fit in RAMdisk are evicted to disk. The Gluon(extra idle)
configuration includes another idle
memory node that does not run the Spark program. From the experiment we can see
that extra RDD partitions are redirected to the idle memory node, hence the
performance gain. In the over-provisioned scenario all configurations match because
the entire RDD is kept in local memory. Off-the-shelf Alluxio suffers a significant
performance loss due to the architectural overheads of the
saveAsTextFile/saveAsObjectFile action operations.
5.2.3 PageRank
Although PageRank from SparkBench does not use the .cache function, it relies on
the GraphLoader class from the GraphX library. GraphLoader uses .cache intensively
to construct the graph from a text file. In fact, .cache is hardcoded in the GraphX
library, and users are bound to Spark's MEM_ONLY. We had to modify the GraphX
library to allow for various caching configurations in PageRank. Because GraphX is
hardcoded for Spark caching, we were unable to store the cached graph in the
off-the-shelf Alluxio cache.
Figure 5.3 shows the results of PageRank execution in 4 different modes. We again
observe a difference in performance between Spark's default MEM_ONLY and the
other configurations in low-provision modes. The reason is that some graph parts need
to be re-constructed multiple times. Gluon-default again has the upper hand and
outperforms MEM_ONLY by 2.73x.

We also see that the optimal MEM_AND_DISK option in Spark is slower than
Gluon-default. The LiveJournal graph results in task skews where at least one Executor
receives 30% more tasks (e.g. 240) than the average (e.g. 180). This results in
non-uniform cache writes that under-utilize the cache on some nodes and over-utilize
(spill) it on others.

The Gluon-extra-idle configuration has extra idle memory nodes of memory size equal
to that of the busy nodes, i.e. if Gluon-default had 3GB of RAM assigned, then
Gluon-extra-idle has 3GB of RAM from busy and 3GB of RAM from idle nodes. We
add Gluon-extra-idle to show the benefits of utilizing idle memory.
Figure 5.3: Performance comparison of Spark caching methods vs. Gluon collaborative caching.
5.2.4 Gluon job statistics
We gathered remote read/write statistics from Gluon in a low-provisioned scenario.
Table 5.1 shows available memory, remote reads/writes, eviction data sizes and final
output data sizes.

The LocalFirst policy makes sure that local memory is utilized to the fullest before
pushing blocks to a neighbour. We did not observe task skews in the Logistic Regression
runs. This means that all AlluxioWorkers with a total 3GB cache size ran out of memory
at approximately the same time; therefore, we did not see any remote memory pushes
or fetches in the default Gluon setting. On the other hand, in Gluon-extra-idle we did
see 33% and 2% of the data being pushed to the idle node in the 3GB and 7GB cache
size runs respectively.
In PageRank, we observed approximately 17.3% remote pushes and approximately
15% remote fetches for the small cache size in the default configuration. Here, nodes
ran out of memory quite fast. The remote push/fetch statistics are attributed to task
skews that made some nodes occupy memory at faster rates.

Table 5.1: Read and write statistics

Workload             Busy cache  Idle cache  Remote MEM  Remote MEM  Evicted to  Sent to remote
                     size        size        write (%)   read (%)    local disk  storage
Logistic Regression  3GB         0GB         0           0           4.5GB       0GB
Logistic Regression  3GB         3GB         33%         34%         1.5GB       0GB
Logistic Regression  7GB         0GB         0           0           1GB         0GB
Logistic Regression  7GB         7GB         2%          2%          0GB         0GB
PageRank             3GB         0GB         17%         15%         9.7GB       100MB
PageRank             3GB         3GB         51%         54%         6.7GB       100MB
PageRank             6GB         0GB         5%          4%          6.7GB       100MB
PageRank             6GB         6GB         45%         51%         1GB         100MB
PageRank             12GB        0GB         0.2%        0%          1GB         100MB
By summing the evicted data and cache sizes we can approximate the total amount of
data cached during the whole program run. The memory pushes are highest when idle
memory is available, which shows that the full cache memory is utilized before blocks
are evicted to disk.
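As a sanity check on Table 5.1, summing busy cache, idle cache and evicted data for the Logistic Regression rows whose caches filled up should approximate the 7.5GB of cached data reported in Section 5.2.2:

```python
# Sanity check on Table 5.1 (values in GB): for runs where the cache
# filled up, busy + idle + evicted should approximate the ~7.5GB of
# cached Logistic Regression data reported in Section 5.2.2.
rows = [
    # (busy_cache, idle_cache, evicted_to_disk)
    (3.0, 0.0, 4.5),   # default, 3GB cache
    (3.0, 3.0, 1.5),   # extra idle, 3GB cache
    (7.0, 0.0, 1.0),   # default, 7GB cache
]
totals = [busy + idle + evicted for busy, idle, evicted in rows]
print(totals)          # [7.5, 7.5, 8.0] -- all close to the 7.5GB dataset

# The over-provisioned run (7GB busy + 7GB idle, 0GB evicted) is the
# exception: its caches never fill, so the sum bounds capacity, not data.
```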
5.2.5 Discussion
Overall, the main advantage of Gluon is that it can perform remote memory pushes
and fetches, which allows it to utilize memory fully. Spark caching can also be tuned
by a user to the optimal level; however, that takes a preliminary set of runs to
understand how much data is actually being cached, which may not be an option in
production systems. Although no performance tuning is required for Gluon, in
low-provisioned cache-intensive applications it achieves a 2.7x speedup over the Spark
default mode and is on par with the optimal Spark configurations.

To make the comparisons fair, we set AlluxioWorker sizes according to how much
memory the Spark application consumes.
(a) WRITE (b) READ

Figure 5.4: DFSIO - writing and reading randomly generated 10GB data using 40 mappers (cores)
5.3 Comparative evaluation using Hadoop MapReduce
In typical enterprise analytics scenarios, an application developer has to move data
from a data warehouse to HDFS to perform data analytics in HMR, an event we call
ingest. This creates usability issues and performance degradation. We take note of
these performance drops by simulating ingestion time in our tests.

5.3.1 DFSIO

In this test we measure 3 different settings for HDFS: HDFS-WRITE, HDFS-READ
and HDFS-INGESTED-READ. The third setting measures data ingestion time to
HDFS using the hadoop fs -put command and adds the regular HDFS-READ time.
We compare the 3 HDFS settings to 4 Gluon settings: asynchronous write (ASYNC),
synchronous write (SYNC), cold cache read (REMOTE) and local cache read (LOCAL).
Figure 5.4 shows write comparisons of the vanilla Hadoop setup vs. Gluon. As we can
see from the figure, Gluon accelerates HDFS write performance in asynchronous mode
by 2.5 times. Figure 5.4 also compares read performance. HDFS-INGESTED-READ
is the slowest because ingesting a 10GB file from remote storage is typically done using
a single thread. Gluon-REMOTE, on the other hand, fetches data from FusionStorage
using 40 threads. Gluon-REMOTE also represents the worst-case scenario for a
collaborative cache read, i.e. this particular case presents a 100% cache-miss example.
Readers should also note that Gluon-REMOTE data fetch latency depends on the type
of storage and the location of the storage servers; in our experiments, the remote Server
SAN is located in the same datacenter but on a different set of racks. Finally,
Gluon-LOCAL shows a speedup of 1.85x in comparison to regular HDFS-READ. This
scenario occurs when data is fully cached and read from local RAMdisks.
5.3.2 TeraSort
The workload consists of one job that spawns a set of tasks. Each task reads a 128MB
file chunk, sorts it using the standard map/reduce sort (except for the partitioner) and
then writes the results back to the underlying file system. TeraSort is a good
approximation of a typical single-stage shuffle-heavy job in the analytics world, where
a user wants to load data, perform some manipulation and then store it.

Figure 5.5: TeraSort test. One job with 40 mappers and 20 reducers reading 128MB
files, sorting their contents and writing them back. The vertical axis indicates average
duration of a full job.

Figure 5.5 shows job duration across 4 different setups: HDFS, HDFS-INGESTED,
Gluon-HOT and Gluon-COLD. Like DFSIO, HDFS is an ideal HDFS setup where we
assume that input data resides inside the distributed file system. HDFS-INGESTED takes into
account the data ingestion time. Gluon-HOT assumes that input data is fully pre-loaded
into the Alluxio cache layer, while Gluon-COLD has no partitions of input data on the
cache layer; they represent the best- and worst-case cache scenarios respectively. Unlike
DFSIO, TeraSort shows a lower speedup of Gluon-HOT over HDFS. The reason is that
TeraSort does not only perform reads but also sorts, shuffles and writes.
5.3.3 PageRank
PageRank[44] is a complex CPU-intensive algorithm that ranks pages by looking at
the number and quality of links. In our case, all we need to understand is that PageRank
is an iterative HMR job, i.e. one PageRank job produces a chain of iterations of the
same program on different inputs, where the input to each iteration is the output of
the previous one. Hence, PageRank is a good example of an iterative job, or a chain of
jobs, in Hadoop MapReduce.

In this experiment, we re-use our graph data from LiveJournal[4], which brings the
experiment closer to real-world scenarios.

Figure 5.6: PageRank test. 1 initialization job and 3 ranking job iterations with 40
mappers and 20 reducers reading 128MB file chunks, computing their contents and
writing them back. The vertical axis indicates average duration of the full chain of 4 jobs.

Figure 5.6 shows the PageRank program job duration running on top of the vanilla
HDFS cluster vs. the Gluon cluster. In an iterative job such as PageRank, the initial
data load does not affect application performance. Hence, the
effects of a cold Gluon cache are negligible in this particular scenario. We also do not
perform ingestion for the HDFS cluster, i.e. we assume that data is already in HDFS.
As we can observe from the results, Gluon outperforms HDFS significantly due to
caching intermediate job outputs in memory.
5.3.4 Discussion
HMR is different from Spark because it follows a rigid map-then-reduce paradigm:
each reduce output always goes to HDFS. This creates I/O bottlenecks for iterative
jobs or chains of jobs. In these scenarios, Gluon outperforms HDFS due to cached
writes and hot reads. Moreover, a cache layer brings HMR job performance closer to
Spark's. Single-job cases represent the worst scenarios for Gluon, since data is fetched
from the remote Server SAN. However, even in this case, the performance is comparable
to that of the ideal HDFS cluster. Moreover, we see significant performance degradation
in HDFS if data ingestion has to take place.
5.4 Graph Processing Framework - Hama
Figure 5.7 shows the results of label propagation in the Hama framework. This is where
Gluon-HOT performs best regardless of the number of workers. The reason is that the
graph is initially loaded into the memory tier, so each time Hama workers read, they
just access local memory. With Gluon-COLD, on the other hand, the graph needs to
be fetched from remote storage. Also, since the output is only written once and is
smaller than the input, Alluxio write performance does not impact job execution
significantly. Another interesting observation is that the performance gap decreases
and all solutions "catch up" as the number of workers increases: the more workers read
in parallel, the smaller the effect of graph loading. All in all, the Hama tests show that
CPU-intensive data analytics workloads are not impacted in terms of performance.
Figure 5.7: Label Propagation. 76 iterations in a job with 1-50 workers reading a 1.3GB file at the first iteration, computing its contents in all of the iterations and writing the result back in the last iteration. The vertical axis indicates average job duration of 76 iterations.
5.5 Conclusion
In this chapter we showed how small code modifications in Spark led to significant
improvements over the suggested built-in caching methods. We also showed that Gluon
outperforms Spark's default configuration by more than 2.5x in low-provisioned job
runs. Moreover, Gluon in default mode is on par with Spark's optimal configuration
for cache-intensive applications. We also showed Gluon's remote memory push/fetch
statistics in uniform as well as data-skewed workloads. The measurements show that
skewed workloads incur some data movement across busy cache workers, and this
tendency increases dramatically when idle cache workers are available.

In our evaluation, we also looked at the Hadoop MapReduce framework[8]. We
concluded that Gluon can achieve up to a 1.85x increase in read throughput if data is
re-used. We also demonstrated how ingestion can significantly degrade performance
and that it is faster to fetch remote data on-demand. Finally, we showed that Gluon
expedites iterative jobs and chains of jobs by more than 30%.
We demonstrated how our platform performs in non-traditional analytics scenarios
such as Hama. From our tests we see that there is neither significant performance
improvement nor degradation when using Hama-like CPU-intensive frameworks.

In our experiments, we used well-known Big Data applications such as Logistic
Regression, TeraSort and PageRank, and we leveraged public data sources, including
the LiveJournal graph data[4] and Wikipedia pages[6].
Chapter 6
Related Work
In this chapter we discuss research related to our work and explain how our approach
differs from previous works. We start by outlining research conducted on the caching
layer in Spark and Hadoop. We then cover integrations of HPC storage platforms with
the Hadoop ecosystem that do not use caching. Finally, we look at designs that target
full-stack integration like Gluon.
6.1 Caching in Analytics
Caching in analytics frameworks has been studied extensively. Shared memory is not
new in the analytics world, and there have been a large number of attempts to improve
Spark and Hadoop caching.

Many works focus on Hadoop and Spark caching efficiency[45][21][33][43][32][18][12][2].
However, they focus either on improving read performance or on caching techniques.
They usually work on top of HDFS and ignore integration and flexibility. By and large,
all these works are complementary to Gluon and can be integrated to further improve
Gluon's caching techniques.
Tungsten[5], designed at Databricks, focuses on Spark JVM management. It tries to
improve on-heap memory management by exporting JVM objects to off-heap native
memory using Java Unsafe APIs. This offloads work from the JVM garbage collector
and hence reduces its overhead. Tungsten is one of the most famous caching
optimizations for Spark; however, it mostly focuses on SQL-based DataFrames and
only works with Spark.
Facade[42] performs a compiler-based transformation of analytics applications. It can
then manipulate objects by moving them into native off-heap RAM. Unlike Facade or
Tungsten, we provide an external memory store and management layer that allows for
full memory utilization as well as data propagation to cold storage.
SpongeFiles[20] is a distributed cache used in Hadoop MapReduce to avoid spilling
shuffle data to disk. It uses remote memory nodes to store data that is about to be
spilled and is best suited to skew-rich MapReduce workloads. Gluon is similar to
SpongeFiles in terms of its data propagation hierarchy. However, Gluon connects to a
large variety of underlying storage platforms, checkpoints data asynchronously and
works with newer Big Data applications such as Spark.
Apache Ignite[1] provides the IGFS layer, which can act as an in-memory file system,
just like Alluxio[35]. Apache Ignite tries to be many services at once, including a
scalable database, a key-value cache and a filesystem. However, IGFS does not support
as many storage solutions as Gluon does, and the framework does not focus on
transparent integration of remote storage layers. Gluon is based on Alluxio, which has
much richer integration support and focuses on transparent data movement between
the cache and storage tiers.
HDFS also provides a CacheManager[2], with which users can manually specify
frequently accessed filenames for regular caching in the cluster, similar to the pin
function in Alluxio. Users can also specify the number of replicas that should be kept
in memory. This can be beneficial when the popular files are smaller than the total
available memory in the cluster. The cached file replicas are treated as regular HDFS
replicas during ApplicationMaster and task execution. This approach requires the
administrator to know which replicas to cache, which incurs significant usability issues
for users.
EC-Cache[47] is based on the Alluxio source code. It allows balanced access to data
from object stores and cluster file systems by avoiding selective replication and relying
on erasure coding instead. Unlike EC-Cache, Gluon can have data imbalance that causes
more remote pushes and fetches. However, during our experiments, we did not see
significant overheads due to remote memory data fetching and pushing. Moreover, EC-Cache
aims at one particular optimization, avoiding replication in the cache layer, while Gluon
provides a set of different optimizations: platform integration, cache collaboration, and
transparent propagation.
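The erasure-coding idea behind EC-Cache can be illustrated with the simplest possible code: split an object into k data units plus a single XOR parity unit, so the object survives the loss of any one unit without full replication. This is only a toy; EC-Cache uses general (k, r) erasure codes, and the function names here are illustrative.

```python
# Toy illustration of erasure-coded caching: k data units plus one XOR
# parity unit, reconstructing the object if any single data unit is lost.
# (A lone XOR parity is the r = 1 special case of a general (k, r) code.)
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    unit = (len(data) + k - 1) // k
    units = [data[i * unit:(i + 1) * unit].ljust(unit, b"\0") for i in range(k)]
    parity = reduce(xor_bytes, units)
    return units, parity

def decode(units, parity, lost_index, orig_len):
    # XOR of the surviving data units with the parity recovers the lost unit.
    survivors = [u for i, u in enumerate(units) if i != lost_index] + [parity]
    recovered = reduce(xor_bytes, survivors)
    full = units[:lost_index] + [recovered] + units[lost_index + 1:]
    return b"".join(full)[:orig_len]

obj = b"hello erasure-coded cache"
units, parity = encode(obj, k=4)
assert decode(units, parity, lost_index=2, orig_len=len(obj)) == obj
```

Storage overhead here is 1/k of the object instead of a full extra replica, which is the load-balancing-without-replication trade-off the EC-Cache paper exploits.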
6.2 HPC and shared storage integrations
The NFS Connector is a software plug-in developed by NetApp[10] that allows the Hadoop
compute layer to access an NFS server. There is no caching layer in the NFS Connector,
so data locality is not supported. However, the connector does perform spatial data
pre-fetching: for example, if fooDir/foo1.txt is being accessed, it can try to pre-fetch
all other files under the directory fooDir. With a suitable configuration of endpoints
and large OS memory buffers, some locality can be achieved. On the other hand, if the
NFS Connector fully replaces HDFS, then intermediate values are stored on remote NFS
servers, which inevitably degrades the performance of HMR or Spark programs.
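The spatial pre-fetching behavior can be sketched as follows: a read of one file triggers background fetches of its directory siblings so that later tasks find them in warm OS buffers. The namespace, store and fetch function below are all illustrative stand-ins, not the connector's actual API.

```python
# Sketch of spatial pre-fetching in the spirit of the NFS Connector:
# reading fooDir/foo1.txt also warms fooDir/foo2.txt and fooDir/foo3.txt.
import os

# Toy "NFS namespace": directory -> file names, plus the file contents.
listing = {"/fooDir": ["foo1.txt", "foo2.txt", "foo3.txt"]}
store = {"/fooDir/foo1.txt": b"a", "/fooDir/foo2.txt": b"b",
         "/fooDir/foo3.txt": b"c"}
fetch_log = []

def fetch(path):
    fetch_log.append(path)  # stands in for an RPC to the NFS server
    return store[path]

def read_with_prefetch(path, prefetched):
    data = fetch(path)
    directory = os.path.dirname(path)
    for name in listing[directory]:
        sibling = f"{directory}/{name}"
        if sibling != path and sibling not in prefetched:
            fetch(sibling)          # warm sibling files ahead of demand
            prefetched.add(sibling)
    return data

read_with_prefetch("/fooDir/foo1.txt", prefetched=set())
print(fetch_log)  # demanded file first, then its siblings
```

Note that this only helps read paths; it does nothing for the intermediate-write problem described above.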
Ceph is a highly scalable object-based parallel file system[58]. Ceph's architecture
is very similar to that of HDFS; the main difference is that Ceph is POSIX compliant.
In the Ceph-Hadoop integration, data is spread across Ceph OSD servers[39]. The URL of
the metadata server is exposed to the Hadoop computation layer, e.g. ceph://mdtServerName:port.
Once the ApplicationMaster is launched, it sends a request to the metadata servers, which
return file information and object locations back to the ApplicationMaster. Unfortunately,
Ceph completely separates client programs from the storage layer, which by definition
implies that there is no locality for client programs. Another bottleneck is that the
Ceph integration stores intermediate values on the remote storage.
Lustre is a parallel file system generally deployed in HPC clusters[27]. Like Ceph,
Lustre is highly scalable and separates the storage layer, as well as metadata storage,
from the clients. One prominent integration was designed by Intel in 2013[31]. The key
idea of this integration was to utilize Lustre during the map-to-reduce phase transition.
The integration developers realized that all clients share the "same view" of the file
system. Thus, instead of storing, shuffling and sending intermediate key-value pairs to
reducers, the developers decided to store the pairs inside Lustre and inform the reducers
on how to access them. This is feasible because mappers and reducers are typically
launched on the same set of servers. In the Lustre integration, all servers of the
compute layer have the Lustre client installed; therefore all Lustre clients share the
same view of the file system, and any file can be accessed from any client node. This
approach reduces the overhead caused by processing and sending intermediate values.
However, this integration has no cache, so tasks have no local data.
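The shared-view shuffle can be sketched in a few lines: mappers write their partitioned output into a shared directory (standing in for the Lustre mount), and each reducer simply reads its partition's files from every mapper instead of receiving them over the network. All names and the layout are illustrative, not the Intel integration's actual format.

```python
# Sketch of a shared-file-system shuffle: mappers write partition files
# into one directory visible to all nodes; reducers read them directly.
import os
import tempfile
from collections import defaultdict

shared = tempfile.mkdtemp()  # stands in for a Lustre mount point

def map_phase(mapper_id, records, num_reducers):
    parts = defaultdict(list)
    for key, value in records:
        parts[hash(key) % num_reducers].append((key, value))
    for part, pairs in parts.items():
        with open(os.path.join(shared, f"map{mapper_id}-part{part}"), "w") as f:
            for k, v in pairs:
                f.write(f"{k}\t{v}\n")

def reduce_phase(part, num_mappers):
    # No network transfer: just read this partition from every mapper's file.
    pairs = []
    for m in range(num_mappers):
        path = os.path.join(shared, f"map{m}-part{part}")
        if os.path.exists(path):
            with open(path) as f:
                pairs += [line.rstrip("\n").split("\t") for line in f]
    return pairs

map_phase(0, [("a", 1), ("b", 2)], num_reducers=2)
map_phase(1, [("a", 3)], num_reducers=2)
print(reduce_phase(hash("a") % 2, num_mappers=2))
```

The sketch also makes the drawback visible: every read and write goes to the shared mount, so nothing is ever local to a task.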
Another Lustre integration uses RDMA[37]. Lu et al. accelerate the shuffle phase in
Spark by leveraging RDMA to avoid the overhead of socket-based communication. This
approach mitigates the locality overheads of the previous integrations. However, not
all users have RDMA-enabled networks, and the storage servers may be located outside
the network zone.
Gfarm is a general-purpose distributed file system[54] with an architecture similar to
HDFS. It has a single metadata server (MDS) and multiple I/O servers. Each I/O server
manages its local file system and provides access to the files in it. The client
accesses Gfarm storage using the Gfarm client library. The key idea in Gfarm is to
measure round-trip times (RTT) from clients to I/O servers. Since Gfarm relies heavily
on data replication, there needs to be an ordering of replica locations: if a replica is
located "far" from a client, Gfarm will give that client-replica pair low preference.
Moreover, Gfarm is POSIX-compliant, which makes it more user-friendly than HDFS.
Unfortunately, the Gfarm integration works the same way as HDFS, and therefore causes
the same set of usability issues.
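The RTT-based preference can be expressed as a one-line ordering: rank the I/O servers holding a replica by measured round-trip time and read from the nearest. The host names and RTT values below are illustrative.

```python
# Sketch of Gfarm-style replica preference: order the I/O servers that
# hold a replica by ascending round-trip time from this client.

def choose_replica(replica_hosts, rtt_ms):
    """Return replica hosts ordered from nearest to farthest."""
    return sorted(replica_hosts, key=lambda host: rtt_ms[host])

# Illustrative RTT measurements from one client to three I/O servers.
rtt_ms = {"io-local": 0.2, "io-rack": 0.9, "io-remote": 14.0}
ranked = choose_replica(["io-remote", "io-local", "io-rack"], rtt_ms)
print(ranked)  # ['io-local', 'io-rack', 'io-remote']; "far" replicas ranked last
```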
GlusterFS is another open-source distributed file system[16]. However, unlike
traditional distributed file stores, GlusterFS does not have a metadata manager. The
location of a file is determined through a static hash function: if a client wants to
access a file, it passes the file's pathname as the argument of the hash function. We
did not find a particular integration of Gluster with Hadoop, except one publicly
available on GitHub[9]. We investigated this open-source integration by analyzing its
implementation of the FileSystem API, and we deem it worth examining due to the
non-traditional architecture of Gluster. According to the open-source integration, the
location of a file is determined from its pathname in the Gluster file system. The
ApplicationMaster requests the locations of file blocks from the Gluster client, which
takes the pathname(s), parses them and determines the locations on the Gluster volumes.
Once the locations are determined, the Gluster client responds to the ApplicationMaster.
Since the data layout is distributed across the volumes, high task concurrency can be
achieved. If we co-locate Gluster volumes with Hadoop NodeManagers, then all mapper and
reducer containers will be created on top of the volumes. Since the ApplicationMaster
can obtain the exact locations of file splits on the Gluster volumes, high locality can
be achieved. However, this is only true for reading data. A drawback may arise at the
file creation stage: if a mapper wants to create a file and write data to it, there is
no guarantee that Gluster will create the file in the same physical unit where the
mapper runs.
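Both properties above follow from metadata-free placement, which the sketch below illustrates with CRC32 standing in for Gluster's hash: any client can compute a file's volume from its pathname alone (good for reads), but a writer has no control over which volume a new file hashes to. Volume names and the hash choice are illustrative.

```python
# Sketch of GlusterFS-style metadata-free placement: a static hash of the
# pathname picks the volume, so no metadata server is consulted.
import zlib

VOLUMES = ["vol0", "vol1", "vol2", "vol3"]

def volume_for(pathname: str) -> str:
    # crc32 stands in here for Gluster's elastic hash over the file name.
    return VOLUMES[zlib.crc32(pathname.encode()) % len(VOLUMES)]

# Reads: fully deterministic, every client computes the same location.
assert volume_for("/data/part-0001") == volume_for("/data/part-0001")

# Writes: the drawback from the text. A mapper running next to vol0 may
# create a file whose pathname hashes to a completely different volume.
print(volume_for("/tmp/mapper-output-42"))
```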
Hadoop's performance was also measured with PVFS [41]. The paper claims that PVFS can
match HDFS. However, the PVFS configuration requires tight coupling to Hadoop; thus,
performance degradation is expected for workloads other than Hadoop.
Hadoop natively provides object store integrations such as the S3 connector.
Unfortunately, the available documentation on integrating Hadoop and Amazon S3 only
shows how to use S3 in Hadoop configurations, without any detailed explanation of how
data is manipulated. For instance, T. White in his article [59] only explains how to
configure a Hadoop cluster such that files are transferred from S3 to HDFS and then
processed. This is not helpful for our understanding of the related work because we are
looking at integrations that replace HDFS completely. The Hadoop documentation also
explains how to connect to S3 and states that S3 can be used as a default storage for
YARN applications[8]. After analyzing the Hadoop source code, it becomes clear that the
S3 connector is very simple and does not use caching or asynchronous propagation to the
remote S3 bucket.
6.3 Full-stack integrations
Azure Data Lake Store[46] is a multi-tiered storage solution for analytics processing.
It uses a RAM-to-HDD tiered architecture to propagate writes and serve reads. However,
unlike Gluon, Azure Data Lake focuses on storage techniques and related challenges such
as security. The work does not address collaboration between nodes or caching of
intermediate data in analytics.
OctopusFS[30] is another multi-tiered storage platform. OctopusFS is very similar to
Gluon in that it also uses node collaboration to propagate data in the cluster: for
instance, RAM on all nodes is filled first, and then data goes to the next layer, such
as SSD or HDD. OctopusFS also targets Big Data applications such as Hadoop and Spark.
However, the work does not integrate Spark (i.e., it only stores final output data in
the storage layer); therefore, the respective evaluation results do not show significant
performance improvements.
The Triple-H[28] architecture is very similar to Gluon. The authors use HDFS as a cache
layer with RAMdisks and SSDs, with Lustre as the remote storage. They also propagate
data to cold storage transparently, thus significantly increasing write performance.
Triple-H has a solid architecture and explores a variety of data placement strategies
in the storage hierarchy. However, the storage hierarchy is vertical in Triple-H, while
Gluon explores both horizontal and vertical hierarchies in data placement through
collaboration between nodes. Moreover, Gluon has been integrated with Spark and can
accommodate a larger variety of storage platforms.
MixApart[40] is a modified version of Hadoop developed at the University of Toronto in
2013. It was one of the first projects to fully integrate Hadoop with NFS, and it also
utilizes an on-disk caching algorithm. MixApart is motivated by two observations: first,
NFS is a popular storage in most enterprise systems, and it is quite troublesome for
companies to periodically transfer data to HDFS for analytics; second, Facebook traces
showed high data re-use in analytics workloads. MixApart has a dedicated node called the
GateWay that connects to NFS. During the execution of a MapReduce program, MixApart
uploads the data to its caching layer, called XDFS, which is also a modified version of
HDFS. MixApart's novelty is its dynamic pre-fetching of data by looking at the queue of
tasks. However, MixApart completely disregards writing to remote storage. It is also not
compatible with resource managers and is tightly coupled to an old version of Hadoop.
6.4 Conclusion
In this chapter, we have discussed the rich literature of related work. There is a
plethora of good research publications related to Big Data caching practices. However,
most of them focus on in-depth optimization of one particular function, e.g. eviction.
Others aim for too much breadth, trying to be a cache for everything: analytics and
transactional workloads. Finally, there have been many attempts to integrate the Hadoop
ecosystem with HPC storage solutions. Many of them ignore locality problems, while
others are not flexible and have become outdated.
Chapter 7
Future Work and Final Remarks
In this dissertation, we showed how usability issues in state-of-art data analytics
platforms can lead to failed jobs, bad performance or poor utilization of resources. We
proposed Gluon - our consolidated, flexible platform for data analytics that can support
many state-of-art analytics frameworks. Our new architecture is based on previous case
studies and usability issues in current analytics engines and their storage solutions.
In more detail, our contributions are as follows:
Our Gluon caching layer provides global collaboration across the memories of all par-
ticipating compute (and storage) nodes. In addition, Gluon supports full integration
of the collaborative caching service with traditional consolidated storage back-end ser-
vices. With Gluon we emphasize the principle of data locality for in-memory data on any
compute node. At the same time, we take full advantage of fast remote memory access
when opportunities for memory availability in collaborating nodes exist. We describe
data propagation from the execution layer to the storage layer. Finally, as mentioned,
the seamless integration between caching and consolidated storage in Gluon means that
any updates for any files stored on back-end storage can be integrated in a new data
analytics pass transparently, automatically and on-demand. This avoids the cumbersome
data manipulations that separate on-disk data silos normally bring about, e.g., for
data analytics systems based on HDFS.
We discovered that memory management in existing deployments can deliver good
performance only when users are very familiar with the data access patterns of their
program runs. With the global collaborative cache management provided by Gluon, we
alleviated user concerns about memory depletion as well as under-utilization in Spark
applications.
During our journey, we tried a variety of complex systems and identified the best ones
to achieve our goal. We continuously changed open-source projects' code and tested each
modification extensively. We focused on the caching layer to ensure strong collaboration
between nodes and seamless data movement between tiers. We optimized Spark to connect
easily to the cache layer. Our connectors help to avoid overheads caused by
architectural limitations.
Our results show improvements in terms of usability, performance and robustness. We
have tested our system using real-world scenarios and data. We showed that Gluon can
provide optimal performance of native Spark applications in default mode and outperform
default Spark configurations by up to 3x. We also showed that caching increases write
performance for Hadoop MapReduce by 2.5x using asynchronous propagation.
Based on this initial prototype, our work can be continued in both depth (e.g. advancing
the caching layer) and breadth (e.g. advancing integration) as follows:
• Eviction - this includes concerns regarding the eviction process and aims to further
improve eviction policies in the first two tiers. Currently, Gluon cannot estimate
RAM capacities correctly when too many processes try to write to the same cache
worker. Therefore, in some cases a task may write to full RAM, which will cause
NoSpace exceptions. We also want to coordinate asynchronous checkpointing such
that it does not interfere with eviction processes.
• Memory copies - the second set of improvements targets lower-level memory
management. We want to eliminate as many memory copies during writes and reads as
possible. We also want to avoid serialization bottlenecks in Spark and Hadoop by
moving data in raw form. Due to recent improvements and cost declines in network
hardware, remote data transfer is a smaller bottleneck than the typical
CPU-intensive serialization process.
• Shuffler - we also want to advance the notion of a "memory pool" where all memory
is consolidated into a single pool that is shared across a variety of different
applications. This includes re-doing the spilling mechanisms in Spark and Hadoop
during the shuffle phase. We believe sending spilled data to remote memory is
faster than spilling it to disk.
• Specialized Graph Processing - we have also tested our platform with Apache
Hama[50] to see if BSP-based[56] algorithms can work with Gluon. While testing
Hama, we discovered interesting patterns in skewed graphs. Since Hama workers
typically rely on message passing, the message queues become very imbalanced
across different workers. We want to investigate the opportunity of offloading
message queues to the remote memory pool in Gluon and fetching them on demand.
This should allow for full memory utilization in graph processing frameworks and,
again, make users worry less about resource provisioning.
Bibliography
[1] Apache ignite. https://ignite.apache.org/.
[2] Apache Software Foundation. HDFS centralized cache management. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html.
[3] FusionStorage block distributed storage system. http://e.huawei.com/en/products/cloud-computing-dc/cloud-computing/fusionstorage/fusionstorage-block.
[4] LiveJournal social graph. https://snap.stanford.edu/data/soc-LiveJournal1.html.
[5] Project Tungsten: Bringing Apache Spark closer to bare metal. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.
[6] Wikipedia data. http://dumps.wikimedia.org/enwiki/.
[7] Wikixmlj parser. https://code.google.com/p/wikixmlj/.
[8] Apache hadoop. http://hadoop.apache.org, 2009.
[9] Glusterfs-hadoop. https://github.com/gluster/glusterfs-hadoop, 2014.
[10] NetApp NFS Connector. https://github.com/NetApp/NetApp-Hadoop-NFS-Connector, 2014.
[11] S3 Amazon. Amazon simple storage service (amazon s3), 2012.
[12] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth
Kandula, Scott Shenker, and Ion Stoica. Pacman: Coordinated memory caching for
parallel jobs. In Proceedings of the 9th USENIX conference on Networked Systems
Design and Implementation, pages 20–20. USENIX Association, 2012.
[13] Joe Arnold. OpenStack Swift: Using, Administering, and Developing for Swift Object
Storage. O'Reilly Media, Inc., 2014.
[14] Brad Calder, Tony Wang, Shane Mainali, and Jason Wu. Windows azure blob, 2009.
[15] Tom Clark. Designing Storage Area Networks: A Practical Reference for Imple-
menting Storage Area Networks. Addison-Wesley Longman Publishing Co., Inc.,
2003.
[16] Alex Davies and Alessandro Orsaria. Scale out with glusterfs. Linux Journal,
2013(235):1, 2013.
[17] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large
clusters. Communications of the ACM, 51(1):107–113, 2008.
[18] Francis Deslauriers, Peter McCormick, George Amvrosiadis, Ashvin Goel, and An-
gela Demke Brown. Quartet: Harmonizing task scheduling and caching for cluster
computing. In HotStorage, 2016.
[19] Jens Dittrich and Jorge-Arnulfo Quiane-Ruiz. Efficient big data processing in hadoop
mapreduce. Proceedings of the VLDB Endowment, 5(12):2014–2015, 2012.
[20] Khaled Elmeleegy, Christopher Olston, and Benjamin Reed. Spongefiles: Mitigating
data skew in mapreduce using distributed memory. In Proceedings of the 2014 ACM
SIGMOD international conference on Management of data, pages 551–562. ACM,
2014.
[21] Avrilia Floratou, Nimrod Megiddo, Navneet Potti, Fatma Ozcan, Uday Kale, and
Jan Schmitz-Hermes. Adaptive caching algorithms for big data systems. 2015.
[22] Apache Giraph. Giraph, 2015.
[23] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D
Joseph, Randy H Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for
fine-grained resource sharing in the data center. In NSDI, volume 11, pages 22–22,
2011.
[24] Steve Hoffman. Apache Flume: Distributed Log Collection for Hadoop. Packt Pub-
lishing Ltd, 2013.
[25] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic
regression, volume 398. John Wiley & Sons, 2013.
[26] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. The hibench
benchmark suite: Characterization of the mapreduce-based data analysis. In Data
Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on,
pages 41–51. IEEE, 2010.
[27] Intel Corporation. Lustre * Software Release 2.x.
[28] Nusrat Sharmin Islam, Xiaoyi Lu, Md Wasi-ur Rahman, Dipti Shankar, and Dha-
baleswar K Panda. Triple-h: A hybrid approach to accelerate hdfs on hpc clus-
ters with heterogeneous storage architecture. In Cluster, Cloud and Grid Comput-
ing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pages 101–110.
IEEE, 2015.
[29] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduc-
tion to statistical learning, volume 112. Springer, 2013.
[30] Elena Kakoulli and Herodotos Herodotou. Octopusfs: A distributed file system
with tiered storage management. In Proceedings of the 2017 ACM International
Conference on Management of Data, pages 65–78. ACM, 2017.
[31] Omkar Kulkarni. Hadoop mapreduce over lustre. In Lustre User’s Group Conference,
2013.
[32] Mayuresh Kunjir, Brandon Fain, Kamesh Munagala, and Shivnath Babu. Robus:
Fair cache allocation for data-parallel workloads. In Proceedings of the 2017 ACM
International Conference on Management of Data, pages 219–234. ACM, 2017.
[33] Jaewon Kwak, Eunji Hwang, Tae-kyung Yoo, Beomseok Nam, and Young-ri Choi.
In-memory caching orchestration for hadoop. In Cluster, Cloud and Grid Computing
(CCGrid), 2016 16th IEEE/ACM International Symposium on, pages 94–97. IEEE,
2016.
[34] Steven Levine. Red Hat Enterprise Linux 6 Global File System 2. Red Hat.
[35] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Eric Baldeschwieler, Scott Shenker, and
Ion Stoica. Tachyon: Memory throughput i/o for cluster computing frameworks.
memory, 18:1, 2013.
[36] Min Li, Jian Tan, Yandong Wang, Li Zhang, and Valentina Salapura. Sparkbench:
a comprehensive benchmarking suite for in memory data analytic platform spark.
In Proceedings of the 12th ACM International Conference on Computing Frontiers,
page 53. ACM, 2015.
[37] Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, and Dhabaleswar K DK Panda. High-
performance design of apache spark with rdma and its benefits on various workloads.
In Big Data (Big Data), 2016 IEEE International Conference on, pages 253–262.
IEEE, 2016.
[38] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn,
Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph pro-
cessing. In Proceedings of the 2010 ACM SIGMOD International Conference on
Management of data, pages 135–146. ACM, 2010.
[39] Carlos Maltzahn, Esteban Molina-Estolano, Amandeep Khurana, Alex J Nelson,
Scott A Brandt, and Sage Weil. Ceph as a scalable alternative to the hadoop
distributed file system. login: The USENIX Magazine, 35:38–49, 2010.
[40] Madalin Mihailescu, Gokul Soundararajan, and Cristiana Amza. Mixapart: Decou-
pled analytics for shared storage systems. In Presented as part of the 11th USENIX
Conference on File and Storage Technologies (FAST 13), pages 133–146, 2013.
[41] Esteban Molina-Estolano, Maya Gokhale, Carlos Maltzahn, John May, John Bent,
and Scott Brandt. Mixing hadoop and hpc workloads on parallel filesystems. In
Proceedings of the 4th Annual Workshop on Petascale Data Storage, pages 1–5.
ACM, 2009.
[42] Khanh Nguyen, Kai Wang, Yingyi Bu, Lu Fang, Jianfei Hu, and Guoqing Xu.
Facade: A compiler and runtime for (almost) object-bounded big data applications.
In ACM Sigplan Notices, volume 50, pages 675–690. ACM, 2015.
[43] Hyunkyo Oh, Kiyeon Kim, Jae-Min Hwang, Junho Park, Jongtae Lim, Kyoungsoo
Bok, and Jaesoo Yoo. A distributed cache management scheme for efficient accesses
of small files in hdfs. The Journal of the Korea Contents Association, 14(11):28–38,
2014.
[44] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank
citation ranking: bringing order to the web. 1999.
[45] Qifan Pu, Haoyuan Li, Matei Zaharia, Ali Ghodsi, and Ion Stoica. Fairride: Near-
optimal, fair cache sharing. In NSDI, pages 393–406, 2016.
[46] Raghu Ramakrishnan, Baskar Sridharan, John R Douceur, Pavan Kasturi, Balaji
Krishnamachari-Sampath, Karthick Krishnamoorthy, Peng Li, Mitica Manu, Spiro
Michaylov, Rogerio Ramos, et al. Azure data lake store: A hyperscale distributed
file service for big data analytics. In Proceedings of the 2017 ACM International
Conference on Management of Data, pages 51–63. ACM, 2017.
[47] KV Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ram-
chandran. Ec-cache: Load-balanced, low-latency cluster caching with online erasure
coding. In OSDI, pages 401–417, 2016.
[48] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and
Carlo Curino. Apache tez: A unifying framework for modeling and building data
processing applications. In Proceedings of the 2015 ACM SIGMOD international
conference on Management of Data, pages 1357–1369. ACM, 2015.
[49] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. De-
sign and implementation of the sun network filesystem. In Proceedings of the Summer
USENIX conference, pages 119–130, 1985.
[50] Sangwon Seo, Edward J Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, and
Seungryoul Maeng. Hama: An efficient matrix computation with the mapreduce
framework. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE
Second International Conference on, pages 721–726. IEEE, 2010.
[51] Spencer Shepler, Mike Eisler, David Robinson, Brent Callaghan, Robert Thurlow,
David Noveck, and Carl Beame. Network file system (nfs) version 4 protocol. Net-
work, 2003.
[52] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The
hadoop distributed file system. In 2010 IEEE 26th symposium on mass storage
systems and technologies (MSST), pages 1–10. IEEE, 2010.
[53] Aameek Singh, Madhukar Korupolu, and Dushmanta Mohapatra. Server-storage
virtualization: integration and load balancing in data centers. In Proceedings of the
2008 ACM/IEEE conference on Supercomputing, page 53. IEEE Press, 2008.
[54] Osamu Tatebe, Kohei Hiraga, and Noriyuki Soda. Gfarm grid file system. New
Generation Computing, 28(3):257–275, 2010.
[55] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka,
Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehous-
ing solution over a map-reduce framework. Proceedings of the VLDB Endowment,
2(2):1626–1629, 2009.
[56] Leslie G Valiant. A bridging model for parallel computation. Communications of
the ACM, 33(8):103–111, 1990.
[57] Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Ma-
hadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth
Seth, et al. Apache hadoop yarn: Yet another resource negotiator. In Proceedings
of the 4th annual Symposium on Cloud Computing, page 5. ACM, 2013.
[58] Sage A Weil, Scott A Brandt, Ethan L Miller, Darrell DE Long, and Carlos
Maltzahn. Ceph: A scalable, high-performance distributed file system. In Pro-
ceedings of the 7th symposium on Operating systems design and implementation,
pages 307–320. USENIX Association, 2006.
[59] Tom White. Running hadoop mapreduce on amazon ec2 and amazon s3. Retrieved
March, 29:2009, 2007.
[60] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Mur-
phy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient dis-
tributed datasets: A fault-tolerant abstraction for in-memory cluster computing.
In Proceedings of the 9th USENIX conference on Networked Systems Design and
Implementation, pages 2–2. USENIX Association, 2012.
[61] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion
Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.