
User-Space Process Virtualization in the Context of

Checkpoint-Restart and Virtual Machines

A dissertation presented

by

Kapil Arya

to the Faculty of the Graduate School

of the College of Computer and Information Science

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Northeastern University

Boston, Massachusetts

August 2014

Copyright © August 2014 by Kapil Arya

NORTHEASTERN UNIVERSITY GRADUATE SCHOOL OF COMPUTER SCIENCE

Ph.D. THESIS APPROVAL FORM

THESIS TITLE: User-Space Process Virtualization in the Context of Checkpoint-Restart and Virtual Machines

AUTHOR: Kapil Arya

Ph.D. Thesis approved to complete all degree requirements for the Ph.D. degree in Computer Science

Distribution: Once completed, this form should be scanned and attached to the front of the electronic dissertation document (page 1). An electronic version of the document can then be uploaded to the Northeastern University-UMI website.

Abstract

Checkpoint-Restart is the ability to save a set of running processes to a checkpoint image on disk, and to later restart them from the disk. In addition to its traditional use in fault tolerance, recovering from a system failure, it has numerous other uses, such as for application debugging and save/restore of the workspace of an interactive problem-solving environment. Transparent checkpointing operates without modifying the underlying application program, but it implicitly relies on a “Closed World Assumption” — the world (including file system, network, etc.) will look the same upon restart as it did at the time of checkpoint. This is not valid for more complex programs. Until now, checkpoint-restart packages have adopted ad hoc solutions for each case where the environment changes upon restart.

This dissertation presents user-space process virtualization to decouple application processes from the external subsystems. A thin virtualization layer is introduced between the application and each external subsystem. It provides the application with a consistent view of the external world and allows for checkpoint-restart to succeed. The ever-growing number of external subsystems makes it harder to deploy and maintain virtualization layers in a monolithic checkpoint-restart system. To address this, an adaptive plugin-based approach is used to implement the virtualization layers, allowing the checkpoint-restart system to grow organically.

The principle of decoupling the external subsystem through process virtualization is also applied in the context of virtual machines for providing a solution to the long-standing double-paging problem. Double-paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor and leads to long delays for the guest as the contents are read back into machine memory only to be written out again. The performance rapidly drops as a result of significant lengthening of the time to complete the guest I/O request.

Acknowledgments

No dissertation is accomplished without the support of many people and I can only begin to thank all those who have helped me in completing it.

I am indebted to my advisor, Gene Cooperman, for his patience, encouragement, support, and guidance over the years. It is because of Gene that I decided to go for a Ph.D., while I was a Master's student at Northeastern. Gene taught me how to do research and to distinguish the ideas that only I would find interesting from the ideas that are important. I could not have asked for a better teacher and without him, this document would not exist.

I am thankful to Panagiotis (Pete) Manolios, Alan Mislove and William Robertson for serving on my committee and for providing their insightful input and constructive criticism. I resoundingly thank Peter Desnoyers for always being available to discuss ideas and for providing constructive feedback on several occasions.

I also want to thank the International Student and Scholar Institute (ISSI) team and Bryan Lackaye for helping with the administrative matters during my stay at Northeastern.

I was fortunate to be mentored by Alex Garthwaite during the summer internships at VMware. His guidance and encouragement are always there and never seem to fade away. Alex agreed to be the external member of my committee and I am thankful for his feedback and thoughtful comments that have not only improved the quality of this dissertation, but also provided ideas for future directions. His dictum that a good dissertation is a completed one became my mantra during the last two years.

I also want to thank Yury Baskakov for all the help that I received while working on the Tesseract project. He never got tired of my random speculations and was always there to provide further insights and also to cover my blind spots. A special thanks goes to Jerri-Ann Meyer and Joyce Spencer for their continued support of the project. Finally, I want to thank Ron Mann for his continued advice and guidance that has helped me become a better engineer.

I am grateful to Alok Singh Gehlot for his friendship, all the advice he provided me over the years, and for his constant reminder that it's not done until it's done. He was always available for me and without his guidance, I would not have been at Northeastern for my Master's and later, Ph.D.

I want to thank Rohan Garg and Jaideep Ramachandran for going through the thesis drafts, sitting through my practice talks, and providing valuable feedback. Over the years, I have had the support of a lot of friends and I want to thank Jaijun Cao, Harsh Raju Chamarthi, Tyler Denniston, Anand Gehlot, Gregory Kerr, Samaneh Kazemi Nafchi, Artem Polyakov, Sumit Purohit, Praveen Singh Solanki, Ana-Maria Visan, Vishal Vyas, and any others I regrettably failed to name. I am enormously thankful to Surbhi for her enduring friendship and companionship through all these years.

Finally, I owe much to my family. I want to express my deepest gratitude to my grandparents, Smt. Mohini Devi and Sh. Omdutt Ji, my parents, Smt. Jamana Devi and Sh. Nem Singh Ji, my aunt and uncle, Smt. Sangeeta Devi and Sh. Hari Singh Ji, my uncles Sh. Kamlesh Ji and Sh. Dilip Ji, and my siblings and cousins, Kavita, Lalita, Shilpa, and Anil, for their never-ending love, dedication and support. I am forever indebted to them.

To my grandfather

Shri Omdutt Ji Solanki

And my school teacher

Shri Devi Singh Ji Kachhwaha

Contents

Contents

List of Figures

List of Tables

1 Overview
  1.1 Closed-World Assumption
  1.2 Double-Paging Anomaly
  1.3 Process Virtualization
  1.4 Thesis Statement
  1.5 Contributions
    1.5.1 Process Virtualization through Plugins
    1.5.2 Application-Specific Plugins
    1.5.3 Third-Party Plugins
    1.5.4 Solving the Double-Paging Problem
  1.6 Organization
2 Concepts Related to Checkpoint-Restart and Virtualization
  2.1 Checkpoint-Restart
    2.1.1 Kernel-Level Transparent Checkpoint-Restart
    2.1.2 User-Level Transparent Checkpoint-Restart
    2.1.3 Fault Tolerance
  2.2 System Call Interpositioning
  2.3 Virtualization
    2.3.1 Language-Specific Virtual Machines
    2.3.2 Process Virtualization
    2.3.3 Lightweight O/S-based Virtual Machines
    2.3.4 Virtual Machines
  2.4 DMTCP Version 1
    2.4.1 Library Call Wrappers
    2.4.2 DMTCP Coordinator
    2.4.3 Checkpoint Thread
    2.4.4 Checkpoint
    2.4.5 Restart
    2.4.6 Checkpoint Consistency for Distributed Processes
3 Adaptive Plugins as a Mechanism for Virtualization
  3.1 The Ever Changing Execution Environment
    3.1.1 PID: Virtualizing Kernel Resource Identifiers
    3.1.2 SSH Connection: Virtualizing a Protocol
    3.1.3 InfiniBand: Virtualizing a Device Driver
    3.1.4 OpenGL: A Record/Replay Approach to Virtualizing a Device Driver
    3.1.5 POSIX Timers: Adapting to Application Requirements
  3.2 Virtualizing the Execution Environment
    3.2.1 Virtualize Access to External Resources
    3.2.2 Capture/Restore the State of External Resources
  3.3 Adaptive Plugins as a Synthesis of System-Level and Application-Level Checkpointing
4 The Design of Plugins
  4.1 Plugin Architecture
    4.1.1 Virtualization through Function Wrappers
    4.1.2 Event Notifications
    4.1.3 Publish/Subscribe Service
  4.2 Design Recipe for Virtualization through Plugins
  4.3 Plugin Dependencies
    4.3.1 Dependency Resolution
    4.3.2 External Resources Virtualized by Other Plugins
    4.3.3 Multiple Plugins Wrapping the Same Function
  4.4 Extending to Multiple Processes
    4.4.1 Unique Resource-id for Shared Resources
    4.4.2 Checkpointing Shared Resources
    4.4.3 Restoring Shared Resources
  4.5 Three Base Plugins
    4.5.1 Coordinator Interface Plugin
    4.5.2 Thread Plugin
    4.5.3 Memory Plugins
  4.6 Implementation Challenges
    4.6.1 Wrapper Functions
    4.6.2 New Process/Program Creation
    4.6.3 Checkpoint Deadlock on a Runtime Library Resource
    4.6.4 Blocking Library Functions and Checkpoint Starvation
5 Expressivity of Plugins
  5.1 File Descriptor Related Plugins
  5.2 Pid, System V IPC, and Timer Plugins
  5.3 Application-Specific Plugins
  5.4 SSH Connection
  5.5 Batch-Queue Plugin for Resource Managers
  5.6 Ptrace Plugin
  5.7 Deterministic Record-Replay
  5.8 Checkpointing Networks of Virtual Machines
  5.9 3-D Graphic: Support for Programmable GPUs in OpenGL 2.0 and Higher
  5.10 Transparent Checkpointing of InfiniBand
  5.11 IB2TCP: Migrating from InfiniBand to TCP Sockets
6 Tesseract: Reconciling Guest I/O and Hypervisor Swapping in a VM
  6.1 Redundant I/O
  6.2 Motivation: The Double-Paging Anomaly
  6.3 Design
    6.3.1 Extending The Hosted Platform To Be Like ESX
    6.3.2 Reconciling Redundant I/Os
    6.3.3 Tesseract’s Virtual Disk and Swap Subsystems
  6.4 Implementation
    6.4.1 Explicit Management of Hypervisor Swapping
    6.4.2 Tracking Memory Pages and Disk Blocks
    6.4.3 I/O Paths
    6.4.4 Managing Block Indirection Metadata
  6.5 Guest Disk Fragmentation
    6.5.1 BSST Defragmentation
    6.5.2 Guest VMDK Defragmentation
  6.6 Evaluation
    6.6.1 Inducing Double-Paging Activity
    6.6.2 Application Performance
    6.6.3 Double-Paging and Guest Write I/O Requests
    6.6.4 Fragmentation in Guest Read I/O Requests
    6.6.5 Evaluating Defragmentation Schemes
    6.6.6 Using SSD For Storing BSST VMDK
    6.6.7 Overheads
  6.7 Related Work
    6.7.1 Hypervisor Swapping and Double Paging
    6.7.2 Associations Between Memory and Disk State
    6.7.3 I/O and Memory Deduplication
  6.8 Observations
7 Impact for the Future
  7.1 Compiled Code In Scripting Languages: Fast-Slow Paradigm
  7.2 Support for Hadoop-style Big Data
  7.3 Cybersecurity
  7.4 Algorithmic debugging
  7.5 Reversible Debugging
  7.6 Android-Based Mobile Computing
  7.7 Cloud Computing
8 Conclusion
A Plugin Tutorial
  A.1 Introduction
  A.2 Anatomy of a plugin
  A.3 Writing Plugins
    A.3.1 Invoking a plugin
    A.3.2 The plugin mechanisms
  A.4 Application-Initiated Checkpoints
  A.5 Plugin Manual
    A.5.1 Plugin events
    A.5.2 Publish/Subscribe
    A.5.3 Wrapper functions
    A.5.4 Miscellaneous utility functions
Bibliography

List of Figures

1.1 Application surface of a running process
2.1 Architecture of DMTCP
3.1 Virtualization of Process Id
3.2 Two processes communicating over SSH
3.3 Virtualizing an SSH connection
4.2 Event notifications for write-ckpt and restart events
4.4 Nested wrappers
4.5 Plugin dependency for distributed processes
5.1 Restoring an SSH connection
6.1 Some cases of redundant I/O in a virtual machine
6.2 An example of double-paging
6.3 Double-paging with Tesseract
6.4 Write I/O and hypervisor swapping
6.5 Examples of reference count with Tesseract and with defragmentation
6.6 VMware Workstation I/O Stack
6.7 Modified scatter-gather list to avoid double-paging
6.8 Splitting scatter-gather list during read
6.9 Defragmenting the BSST
6.10 Defragmenting the guest VMDK
6.11 Trends for scores and pauses in SPECjbb runs with varying guest memory pressure and 10% host overcommitment
6.12 Maximum single pauses observed in SPECjbb instantaneous scoring with varying guest memory pressure and 10% host memory overcommitment
6.13 Scores and total pause times for SPECjbb runs with varying host overcommitment and 60 MB memhog
6.14 Comparing maximum single pauses for SPECjbb under various defragmentation schemes with varying host memory overcommitment and 60 MB memhog
6.15 Scores and pauses in SPECjbb runs under various defragmentation schemes with 10% host overcommitment
6.16 Score and pauses in SPECjbb under various defragmentation schemes with varying host overcommitment and 60 MB memhog
6.17 Comparing maximum single pauses for SPECjbb under various defragmentation schemes with 10% host memory overcommitment
6.18 Tesseract performances with BSST placed on an SSD disk

List of Tables

2.1 Comparison of various checkpointing systems
5.1 Comparison of process virtualization based checkpoint-restart with prior art
5.2 Statistics for various plugins
6.1 Holes in write I/O requests for varying host overcommitment and 60 MB memhog inside the guest
6.2 Holes in read I/O requests for Tesseract without defragmentation for varying levels of host overcommitment and 60 MB memhog inside the guest
6.3 Total I/Os with BSST and guest defragmentation
6.4 Average read and write prepare/completion times in microseconds for baseline and Tesseract with and without defragmentation

CHAPTER 1

Overview

Checkpoint-restart is a powerful mechanism to save the state of one or more running processes to disk and later restore it. In addition to the traditional use case of fault tolerance in long-running jobs, other use cases of checkpoint-restart include process migration, debugging, and save/restore of workspace.

At a high level, checkpointing a process can be viewed as writing all of process memory, including shared libraries, text and data, to a checkpoint image. Accordingly, restarting involves recreating the process memory by reading the checkpoint image from the disk. This works for simple programs, but for complex programs, one also needs to save and restore information about threads, open files, etc. In more sophisticated applications, it involves saving the network state (in-flight data, etc.), and information about the external environment such as the terminal, the standard input/output/error, and so on.

Current checkpointing techniques fall into two categories: application-level and system-level. Application-level checkpointing requires modifications to the target program to insert checkpoint-restart code. The developer identifies the relevant state and data to be checkpointed and implements the mechanism for checkpointing and restoring them. While it is flexible and allows the programmer to optimize and have greater control over the checkpointing process, there is a high cost paid by the developer for implementing and maintaining it. Further, the timing and frequency of checkpoints may not be specified in a flexible manner and could be limited to certain “safe” points in the program. System-level (or transparent) checkpointing, on the other hand, works without modifying the target application program. However, a simple implementation is less flexible in that it requires the same environment on restart (the case of homogeneous computer hosts).

1.1 Closed-World Assumption

Traditionally, checkpoint-restart packages have made a closed-world assumption:

    The execution environment (file system, network, etc.) does not change between checkpoint and restart. Thus, to save and restore the state of the processes of a computation, it suffices to save the state of the CPU registers, the process’s virtual memory, and kernel state.

While the closed-world assumption holds for simple programs, it is not valid for more complex programs (such as distributed processes), and can cause checkpoint-restart to fail in remarkable ways. For example, a process with open files will fail to restart if the underlying filesystem mount-point has changed, or if the host has a new IP address while the process remembers the old one. At a more basic level, the restarted process will have a new process id (pid) provided by the kernel. Thus, any attempt by the target application to re-use a previously cached old pid will result in a failure.
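As a concrete, hedged illustration of this failure mode (the pid caching here is hypothetical and not drawn from any particular application), consider a program that remembers the pid of a child it spawned; after a restart, that cached value may no longer name the child:

/* Illustrative sketch only: a cached pid becomes stale across restart. */
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();                 /* child pid is cached in 'child' */
    if (child == 0) { pause(); _exit(0); }

    /* ... checkpoint taken here; the process is later restarted ... */

    /* After restart, the kernel may have assigned new pids.  Without
     * virtualization, the cached value may now name an unrelated process
     * (or no process at all), so this signal goes astray or fails. */
    if (kill(child, SIGTERM) == -1)
        perror("kill(cached pid)");
    return 0;
}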

One way to overcome the closed-world assumption is application-level checkpointing — modifying the application program to account for the changing environment. As mentioned earlier, this approach is costly and hard to maintain.

For these reasons, the existing systems have been used mostly for applications that obey the closed-world assumption, such as isolated batch jobs running solely on traditional multi-core computer nodes within a cluster. The closed-world assumption is enforced by posing several restrictions on the features that an application can use or by creating special-purpose workarounds to handle exceptions to the closed-world assumption.

For example, Condor [110] restricts applications from using multi-process jobs, interprocess communication, multi-threading, timers, file locks, etc. [109]. BLCR [52] is implemented through a Linux kernel module, which restores the original pid when it is still unused and fails if it is unavailable. CRIU [111] places all target processes in a Linux container (lightweight virtual machine), which has private namespaces for kernel objects, but is isolated from other processes within the same host.

The closed-world assumption breaks down as users ask to checkpoint more general types of software that communicate with the external world. Examples include communication with system daemons (e.g., NSCD, LDAP authentication servers), 3-D graphics libraries (e.g., OpenGL), connections with database servers, networks of virtual machines, hybrid computations using CPU accelerators (e.g., GPU and Xeon Phi), Hadoop-style computations, a broader variety of network models (TCP sockets, InfiniBand, the SCIF network for the Intel Xeon Phi), competing implementations of InfiniBand libraries (QLogic/PSM versus InfiniBand OpenIB verbs), and so on.

These complex applications have created a dilemma. A system for pure transparent checkpointing has no knowledge of the application’s external world, and an application-level checkpointing system would require the writer of the target application to insert code that adapts to the modified external environment after restart. This conflict is the core problem being solved.

1.2 Double-Paging Anomaly

Hypervisors often overcommit memory to achieve higher VM consolidation on the physical host. When overcommitting host physical memory, guest memory is paged in and out from a hypervisor-level swap file to reclaim host memory. Further, guests running in the virtual machines manage their own physical address space and may overcommit memory as needed.

Double-paging is an often-cited problem in multi-level scheduling of memory between virtual machines (VMs) and the hypervisor. This problem occurs when both a virtualized guest and the hypervisor overcommit their respective physical address-spaces. When the guest pages out memory previously swapped out by the hypervisor, it initiates an expensive sequence of steps causing the contents to be read in from the hypervisor-level swap file only to be written out again, significantly lengthening the time to complete the guest I/O request. As a result, performance rapidly drops.

1.3 Process Virtualization

Often, application processes violate the closed-world assumption. When restarting from a checkpoint image, the recreated objects derived from external systems/services may not be the same as their pre-checkpoint version. This is due to the changing execution environment across a checkpoint-restart boundary. In order to successfully restart an application process, we need to virtualize these objects in such a way that the application view of the objects does not change across checkpoint and restart.

Definition: The application surface of a running application is a set of code and associated data that includes all application-specific objects (code+data) and excludes all opaque objects derived from any outside systems/services. (An opaque object is an object for which the application knows nothing about the internal structure. The opaque object is only accessible through an identifying handle.)

Figure 1.1: Application surface of a running process. The virtual names lie inside the application surface, whereas the real names lie outside the surface.

Definition: User-space process virtualization finds a surface that is at least as large as the application surface, such that any virtualized view of an object lies inside this surface and any real view lies outside this surface (see Figure 1.1). On restart, the opaque objects are recreated to provide semantically equivalent functionality to their pre-checkpoint version. Process virtualization then links these opaque objects with their virtualized view inside the application surface (through the identifying handles).

There can be more than one possible application surface. Typically one chooses an application surface close to a well-known API for the sake of stability and maintainability. A wrapper around any call to the API will update both the virtual and the real view in a consistent manner.
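To make this concrete, the following is a minimal sketch (not the implementation presented later in this dissertation) of how such a wrapper might translate between virtual and real names for the pid example of Remark 1 below; the translation-table helpers are hypothetical:

/* Sketch of a user-space virtual/real pid translation wrapper.
 * virt_to_real(), real_to_virt(), and the real_*() calls stand for a
 * hypothetical per-process translation table and the underlying libc
 * functions maintained by the wrapper layer. */
#include <sys/types.h>
#include <signal.h>

extern pid_t virt_to_real(pid_t virt_pid);   /* assumed lookup helpers */
extern pid_t real_to_virt(pid_t real_pid);
extern pid_t real_getpid(void);              /* the underlying libc calls */
extern int   real_kill(pid_t pid, int sig);

pid_t getpid(void)
{
    /* The application only ever sees the virtual pid, which stays
     * stable across checkpoint and restart. */
    return real_to_virt(real_getpid());
}

int kill(pid_t virt_pid, int sig)
{
    /* Arguments coming from the application are translated to the
     * current real name before reaching the kernel. */
    return real_kill(virt_to_real(virt_pid), sig);
}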

Remarks:

1. In virtualizing a pid, we will see that libc will retain the real pid known to the kernel. Thus libc is outside the application surface. But the application knows only the virtual pid that resides inside the application surface.

2. In the case of a shadow device driver, the user-space memory of the application may contain both some opaque objects (e.g., InfiniBand queues) and their virtualized views. In this case, the application surface excludes parts of the user-space memory of the application process.

3. Because daemons and the kernel are opaque to the application, they always lie outside the application surface.

4. An application may create an auxiliary child process (or even distributed processes in the case of MPI). In this case, the application surface includes these auxiliary processes.

The goal of user-space process virtualization is to break the tight coupling between the application process and an external subsystem not under the control of the application process. In effect, each API is designed to provide a stable interface to a single system service over the lifetime of a process. This thesis will demonstrate the ability to find an application surface and a corresponding API, for which a software translation layer can be built, enabling the application process to continue to receive the corresponding system service from an alternative external subsystem. This decouples the application process from the external subsystem.

1.4 Thesis Statement

User-space process virtualization can be used to decouple application processes from external subsystems to allow checkpoint-restart without enforcing a strict “closed-world assumption”. The method of decoupling subsystems applies beyond checkpointing, as seen in a solution to the long-standing double-paging problem.


1.5 Contributions

This dissertation shows that a checkpointing system can “adapt” to the external environment, one subsystem at a time, by using the user-space process virtualization technique. To that end, this work introduces a plugin architecture based on adaptive plugins to virtualize these external subsystems. A plugin is responsible for virtualizing and checkpointing exactly one external subsystem to allow the application to adapt to the modified external subsystem.

The plugin architecture allows us to do selective (or partial) virtualization of the underlying resources for efficiency purposes. Plugins can be loaded/unloaded to suit application requirements. Further, it allows the checkpointing system to be extended organically, in a non-monolithic manner.

1.5.1 Process Virtualization through Plugins

To demonstrate the strength of the plugin architecture for user-space process virtualization, this work presents principled techniques for the following problems, which have resisted successful checkpoint-restart solutions for at least a decade (these plugins are original with this dissertation):

• The PID plugin (§5.2) virtualizes the process and thread identifiers assigned by the kernel.

• The System V IPC plugin (§5.2) virtualizes the shared memory, semaphore, and message queue identifiers assigned by the kernel.

• The Timer plugin (§5.2) virtualizes POSIX timers as well as clock identifiers assigned by the kernel.

• The SSH plugin (§5.4) virtualizes the underlying SSH connection between two processes to allow recreation on restart.

• The IB2TCP plugin (§5.11) virtualizes the InfiniBand device driver to allow a computation to be checkpointed on the InfiniBand hardware and restarted on the TCP hardware.

Notice that the Zap [86] system virtualized the kernel resource identifiers such as pids and System V IPC ids in kernel space. However, the work of this dissertation virtualizes entirely in user space without any application or kernel modifications or kernel modules. Further, this work extends the notion of user-space virtualization to processes/services outside the kernel such as SSH connections, network daemons and device drivers. This is achieved either through interposing library calls or by creating shadow agents/processes for the external resources.

1.5.2 Application-Specific Plugins

Next, we show that plugins can be used for application-specific adaptations, providing the benefits of application-level checkpointing without having to modify the base application. The following application-specific plugins (§5.3) are original with this dissertation:

• Malloc plugin virtualizes access to the underlying memory allocation library (e.g., libc malloc, tcmalloc, etc.).

• DL plugin is used to ensure atomicity for dlopen/dlsym functions with respect to checkpoint-restart.

• CkptFile plugin provides heuristics for checkpointing open files. It also helps the file plugin to locate files on restart.

• Uniq-Ckpt plugin is used to control the checkpoint file names, locations, etc.

1.5.3 Third-Party Plugins

Finally, the success of the plugin architecture can also be seen in third-party plugins. We show that third parties can write orthogonal customized plugins to fit their needs. The following demonstrates original work due to plugins created by third-party contributors (this dissertation is not claiming these results):

• Ptrace plugin [127] virtualizes the ptrace system call to allow checkpointing of an entire gdb session for reversible debugging.

• Record-replay plugin [126] provides a light-weight deterministic replay mechanism by recording library calls for reversible debugging.

• KVM plugin [44] is used for checkpointing the KVM/Qemu virtual machine.

• Tun plugin [44] is used for checkpointing the Tun/Tap network interface when checkpointing a network of virtual machines.

• RM plugin [93] is used for checkpointing in a batch-queue environment and can handle multiple batch-queue systems.

• InfiniBand plugin [27] provides the first non-MPI-specific transparent checkpoint-restart of the InfiniBand network.

• OpenGL plugin [62] uses a record-prune-replay technique for checkpointing 3D graphics (OpenGL 2.0 and beyond).

1.5.4 Solving the Double-Paging Problem

The process virtualization principles are also applied in the context of virtual machines. The double-paging problem is directly and transparently addressed by applying the decoupling principle [11]. The guest and hypervisor I/O operations are tracked to detect redundancy and are modified to create indirections to existing disk blocks containing the page contents. The indirection is created by introducing a thin virtualization layer to virtualize access to the guest disk blocks. Further, the virtualization is done completely in user space.

1.6 Organization

The remainder of this dissertation is organized as follows.

A literature review is presented in Chapter 2 and various checkpoint-restart mechanisms are discussed. The review also includes various virtualization schemes in the context of checkpointing. (Literature for the double-paging problem is reviewed in Chapter 6.)

Chapter 3 provides several examples to motivate the need for virtualizing the execution environment. This chapter then uses this motivation to outline two basic requirements for virtualizing the execution environment. It is argued there that an adaptive plugin-based approach is well suited for process virtualization.

Chapter 4 describes the design of adaptive plugins and presents the plugin architecture. The proposed plugin architecture is shown to meet the virtualization requirements laid out in Chapter 3. This is followed by a design recipe for developing new plugins. Dependencies among multiple plugins are also discussed and an approach to dependency resolution is provided. Finally, some implementation challenges involved in designing plugins are presented.

Chapter 5 provides some case studies involving various plugins. Included there are seven plugins that provide novel checkpointing solutions for their corresponding subsystems. Some application-specific plugins are also demonstrated, along with several plugins that provide virtualization of kernel resource identifiers in user space.

Chapter 6 then turns to the double-paging problem. As with the core issue in checkpoint-restart, here too one is presented with distinct subsystems that must be combined in a unified virtualization scheme. The core problem is described and motivated, and a design and implementation of a solution are presented. We also discuss some of the side effects of the proposed solution and finally present an evaluation.

Chapter 7 provides some new directions and applications of checkpoint-restart to non-traditional use cases that can be pursued based on this dissertation, with a conclusion presented in Chapter 8.

Finally, a plugin tutorial is presented in Appendix A, thus providing a concrete view of the plugin API.

CHAPTER 2

Concepts Related to Checkpoint-Restart and Virtualization

This dissertation intersects with four broad areas. The first is that of checkpoint-restart at the process level. The second concerns system/library call interpositioning for modifying process behavior. The third concerns process-level virtualization. The fourth concerns the double-paging problem in the context of virtual machines. The literature for the first three areas is reviewed here, whereas the related work for the double-paging problem is discussed in Chapter 6. Since this work builds on the DMTCP software package, a brief overview of the legacy DMTCP software (DMTCP version 1) is also provided.

2.1 Checkpoint-Restart

Checkpoint-restart has a long history, with several mechanisms proposed over the years [90, 97, 98, 35]. It is often used for process migration, for load balancing, for fault tolerance, and so on [34]. The work of Milojicic et al. [81] provides a review of the field of process migration. Egwutuoha et al. [35] provide a survey of various checkpoint/restart implementations in high performance computing. The website checkpointing.org also lists several checkpoint-restart systems. There are three primary approaches to checkpointing: virtual machine snapshotting, application-level checkpointing, and transparent checkpointing.

Virtual machine snapshotting

Virtual machine (VM) snapshotting is a form of checkpointing for virtual machines and is often used for virtual machine migration. A complex application is treated as a black box, and its application surface is expanded to include the entire guest physical memory, operating system state, devices, etc. Checkpointing an application involves saving everything inside the application surface (i.e., the entire virtual machine). While this technique is general and has been discussed quite extensively [80], it is also slower and produces larger checkpoint images because the checkpoint module is unable to exclude unnecessary parts of guest physical memory from the application surface. Hence, it is not commonly used as a mechanism for checkpoint-restart.

Application-level checkpointing

Application-level checkpointing is the simplest form of checkpointing. The developer of the application inserts checkpointing code directly inside the application to save the process state, such as data structures, to a file on disk that is later used to resume the computation. This is application-specific and requires extensive knowledge of the application. The knowledge of the application internals provides complete flexibility, but places a larger burden on the end user. There are several techniques [129] and frameworks that provide tools to assist in application-level checkpointing. Examples include pickling for Python [120] and Boost serialization [108] for C++. A somewhat lighter mode of application-level checkpointing is the save/restore workspace feature for interactive sessions. Notably, Bronevetsky et al. have applied this to shared-memory parallelism in the context of OpenMP [24, 25] and distributed parallelism in the context of MPI [100, 23], where they provide tools to lighten the end-user burden for writing checkpointing code.

The rest of this section focuses on several varieties of transparent checkpointing, in which the end user does not need to make any changes to the target application.

Transparent checkpointing

This is sometimes called system-level or system-initiated checkpointing. It is the ability to checkpoint an application without making any changes to the application source or binary. The history of transparent checkpointing extends back at least to 1990 [73]. While there are many systems that perform single-process checkpointing [91, 33, 89, 92, 73, 74, 29, 1, 3, 76], we will focus on systems that support multiple processes and/or distributed processes. Transparent system-level checkpointing can be further broken down into kernel-level and user-level checkpointing. The two techniques are discussed in Sections 2.1.1 and 2.1.2, respectively.

2.1.1 Kernel-Level Transparent Checkpoint-Restart

In kernel-level checkpointing, the operating system is modified to support checkpointing for applications. This approach leads to checkpoints being more tightly coupled to kernel versions. While there have been several such kernel-level packages, the need to support multiple kernel versions makes them harder to maintain. It also makes future ports to other operating systems more difficult.

The Zap system and its derivatives

As an extension of CRAK (Checkpoint and Restart as a Kernel Module) [139], Zap [86, 67] implements checkpoint-restart using a kernel module. Zap can be considered a precursor to the Linux Containers (LXC) [117], as it also provides a virtualized view of the kernel resources. Zap uses a pod (process domain) abstraction that provides a group of processes with a consistent virtualized view. The pod abstraction virtualizes kernel resource identifiers to present a pod-specific view. This isolates the process from the external world and provides a conflict-free environment when migrating processes to other nodes. The downside of this implementation is the inability of processes inside a pod to communicate with processes outside the pod. Zap intercepts all system calls operating on the virtualized kernel resource identifiers, translating their arguments and return values as needed. System call interception is also required for all processes in the system and poses runtime overhead for processes outside the pods.

Zap was later extended to support distributed network applications by Laadan et al. [68] to create ZapC and by Janakiraman et al. [59] to create CRUZ. The key enhancement was the support for virtualization of the network layer to decouple the processes from the node they are running on. This allowed these systems to checkpoint-restart distributed computations over a cluster. For ZapC, network virtualization was achieved by inserting hooks into the network stack using netfilter. The source and destination addresses were translated between virtual and real addresses for both incoming and outgoing network packets.

The work of this dissertation is based entirely in user space and does not require any kernel modification or kernel modules. As explained by Laadan [66], the kernel-module based approach incurs a burden both on users, because it is cumbersome to install, and on developers, because maintaining it on top of quickly changing upstream kernels is a Sisyphean task and development quickly falls behind. Further, user-space virtualization poses no runtime overhead for processes that are not part of the computation being checkpointed. Finally, this work can be used to virtualize agents/processes/services outside the kernel. Examples include SSH connections, network daemons, and device drivers.

Berkeley Lab Checkpoint Restart (BLCR)

BLCR [52] is another widely used checkpointing system that is implemented as a kernel module. It is used primarily in high performance computing. BLCR is often used along with MPI libraries to checkpoint a distributed computation. BLCR does not have any support for virtualization and may fail if a kernel resource identifier (such as a pid) is not available at the time of restart. It also relies on MPI daemons to handle changed network addresses, mount points, etc. However, if the application has cached a directory name from before checkpoint and tries to open it after restart, it may fail.

Another notable kernel-based system was Chpox by Sudakov et al. [105]. Initially, Chpox was implemented as a kernel module for Linux 2.4, whereas a later version for Linux 2.6 required base kernel modifications as well.

Pure kernel-level approaches

A more recent attempt by Laadan et al. [68] also implemented a single-host in-kernel solution. It consisted of some user-space utilities and a series of patches to the Linux 2.6 kernel to add checkpoint support in the mainline kernel itself. This was proposed for inclusion in the Linux kernel, but ultimately not accepted due to its invasive approach that touched/modified a large number of kernel subsystems [8].

2.1.2 User-Level Transparent Checkpoint-Restart

User-level checkpointing works without any changes to the operating system kernel. The use of published APIs (e.g., POSIX and the Linux proc filesystem) to communicate with the kernel and to perform checkpoint-restart makes it highly stable.

Checkpointing library

The ground-breaking work of Plank et al. [92] on Libckpt uses a library to do the checkpointing, and the application program is linked to this user-level library. Similar techniques are used by Condor [76]. These techniques are not completely transparent to the user, as the application code is modified, recompiled, and relinked with the dynamic library. However, the amount of code changes is often fairly small (e.g., for Libckpt, the application programmer needs to rename main() to ckpt_target()). The main disadvantage of using such systems is the restrictions imposed on the operating system features, such as interprocess communication, that the application program can use [109]. Further, these systems do not support process trees or distributed computations.
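For illustration, the source change mentioned above for Libckpt amounts to something like the following sketch (the exact signature and build options depend on the Libckpt version, so this should be read as an assumption rather than a verbatim recipe):

/* Libckpt-style rename of the entry point (illustrative sketch).
 * Before: int main(int argc, char **argv) { ... }
 * After:  the entry point is renamed so that the checkpointing
 *         library's own main() can initialize itself before
 *         calling into the application code. */
#include <stdio.h>

int ckpt_target(int argc, char **argv)    /* formerly main() */
{
    printf("application code runs unchanged here\n");
    return 0;
}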

Distributed checkpointing with MPI

Although application-level checkpointing for distributed programs dates back at least to 1997 [17], most practical systems were built around MPI-based distributed computations for supporting high performance computing. They use hooks or callback functions for specific MPI implementations [31, 54, 137, 138, 104, 21, 133, 49, 52, 99]. (MPI, the Message Passing Interface, is a standard for message-based distributed high performance computation.) Most MPI implementors chose to build a custom checkpoint-restart service. This came about when InfiniBand became the preferred network for high performance computing, and there was still no package for transparent checkpointing over InfiniBand. Examples of checkpoint-restart services can be found in Open MPI [54, 55], LAM/MPI [99] (now incorporated into MVAPICH2 [77, 41]), MPICH-V [22], and MVAPICH2 [41], as well as a fault-tolerant “backplane”, CIFTS [51]. Each checkpoint-restart service would disconnect from the network prior to checkpoint, and re-connect after restart. Hence, while the network was disconnected, the MPI checkpoint-restart service was then able to delegate single-host checkpointing to the BLCR [52] kernel module. This created an extra layer of complication, but it was unavoidable at that time, due to the lack of support for transparent checkpointing over InfiniBand. On restart, the network connections are restored and the checkpointer is called upon to restore the user processes. Since such a service works at the MPI level, its ability to adapt to the environment outside of MPI is limited, and it generally proves difficult to maintain.

Bronevetsky et al. produced a novel application-level checkpointing design for the special case of MPI [23]. In this approach, a pre-compiler instruments the application MPI code with additional information needed for checkpointing, thus coming close to the ideal of transparent checkpointing. The application programmer then adds code indicating valid points in the program for a potential checkpoint. The use of a pre-compiler relieved much of the burden of adding application-specific code to support checkpointing.

Cryopid

Cryopid [18] and Cryopid2 [85] use the ptrace system call to attach to a running process and create a core dump of the application process that is later used to restart the computation. The checkpointable features supported are quite limited as compared to other checkpointing packages, and adding a new feature is often harder.

Checkpoint Restart In Userspace (CRIU)

CRIU [111] is a more recent checkpointing package based on Linux containers (LXC) [117]. The support is restricted to process trees and containers. The Linux kernel API was extended with new kernel features to support the user-space tool. Like Cryopid, it also uses the ptrace system call to inject checkpointing code inside the user processes. The checkpointing code executes in the context of a process to gather all the relevant information using the extended kernel API. Due to security issues, the checkpointing capability is only available to users with the CAP_SYS_ADMIN capability. (The CAP_SYS_ADMIN capability is a successor to the Linux setuid-root feature that is used to grant admin privileges to select applications/processes.)

Distributed MultiThreaded Checkpointing (DMTCP)

DMTCP version 1 [7] is implemented using user-space shared libraries. The original DMTCP supported TCP sockets, but was limited in that it did not support distributed computations communicating over ssh or InfiniBand. Further, even in the single-host case, it did not support virtualization of such kernel resources as pids, System V IPC, POSIX and System V shared memory, and POSIX timers. Section 2.4 provides a brief background on the architecture and the working of DMTCP version 1.

This work represents a rewrite of the original DMTCP [7], in order to introduce user-space process virtualization for checkpointing the external environment. This enables us to checkpoint a wide variety of applications. The virtualization layer is implemented completely in user space with minimal overhead. Process virtualization goes beyond virtualizing the kernel resource identifiers and can be used to virtualize even higher-level constructs and abstractions such as the SSH protocol, as discussed in Chapter 3. Table 2.1 summarizes the differences between this work and the prominent transparent checkpointing packages.

Ckpt System       Multi-host      Kernel resource    Other resource     Application-       Third-party
                  computations    virtualization     virtualization     specific tuning    plugins
BLCR              No              No                 No                 No                 No
Zap               No              Yes                No                 No                 No
CRIU              No              Yes                No                 No                 No
Cryopid2          No              No                 No                 No                 No
DMTCP (v1)        Yes             No                 No                 No                 No
Extensible CKPT   Yes             Yes                Yes                Yes                Yes

Table 2.1: Comparison of various checkpointing systems. The other resource virtualization refers to the ability to virtualize protocols, device drivers, etc.

2.1.3 Fault Tolerance

Fault tolerance [70, 58] is a broader concept not discussed here. It enables a system to continue operating properly in the event of a failure of one of its components. Several strategies can be deployed to make a system fault tolerant, such as redundancy, partial re-execution, atomic transactions, instrumentation of data, and so on.

2.2 System Call Interpositioning

The concept of wrappers, as implemented in DMTCP, has a long and independent history under the more general heading of interposition. Interposition techniques have been used for a wide variety of purposes [123, 136, 65]. See especially [123] for a survey of a wide variety of interposition techniques. The work of Garfinkel [42] discusses practical problems associated with system call interpositioning. PIN [88] and DynInst [124] are two examples of software packages that provide interposition techniques at the level of binary instrumentation.
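As background, the common user-space idiom for library-call interposition on Linux is a preloaded wrapper that forwards to the next definition of the symbol. The sketch below is illustrative only and is not taken from DMTCP's sources; the file and library names are made up for the example:

/* wrap_close.c -- a minimal library-call interposition sketch.
 * Build: gcc -shared -fPIC -o libwrap.so wrap_close.c -ldl
 * Use:   LD_PRELOAD=./libwrap.so ./myapp                      */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int close(int fd)
{
    /* Find the "real" close() that this wrapper shadows. */
    int (*real_close)(int) = (int (*)(int)) dlsym(RTLD_NEXT, "close");

    /* A checkpointing wrapper would update its bookkeeping here
     * (e.g., forget this file descriptor) before forwarding the call. */
    fprintf(stderr, "close(%d) intercepted\n", fd);
    return real_close(fd);
}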


2.3 Virtualization

Virtualization is the process of allowing unmodified source code or an unmodified binary to transparently run under varied external environments (different CPU, different network, different graphics server (e.g., X11 server), etc.). Most of the original checkpointing packages [73, 74, 26, 31, 71] ignored these issues and concentrated on homogeneous checkpointing.

Virtualization techniques have been developed since the 1960s. Since then, systems have implemented different flavors of virtualization. In this section, we discuss the four types of virtualization techniques in common use today that are closest in spirit to this work.

2.3.1 Language-Specific Virtual Machines

A language-specific virtual machine, sometimes also known as an application virtual machine, a runtime environment, or a process virtual machine, allows an application to execute on any platform without having to write any platform-specific code. This is achieved by creating a platform-independent programming environment that abstracts the details of the underlying hardware or operating system. This abstraction is provided at the level of a high-level programming language. Notable examples include the Java Virtual Machine (JVM) [75], the .NET framework [122], and Android virtual machines (Dalvik) [20, 36].

Language-specific virtual machines are often implemented using an interpreter, with an option of using just-in-time compilation for performance close to that of a compiled language [32].

2.3.2 Process Virtualization

Process virtualization allows a process to be migrated or restarted in a new external environment, while preserving the process’s view of the external world. For example, a kernel may assign to a restarted process a different pid than the original pid at the time of checkpoint. The earliest checkpointing packages had assumed that the targeted user process would not save the value of the pid of a peer process, but rather would re-discover that pid on each use. As software complexity grew, this assumption became unreliable. More recent packages either modified the Linux kernel (e.g., BLCR [52]), or ran inside a Linux Container, a lightweight virtual machine (e.g., CRIU [111]).

Process virtualization (as exemplified by this work) has been considered intensively in the context of checkpointing only recently. Nevertheless, it has important forerunners in process hijacking [136] and in the checkpointing packages [76, 135] used in Condor’s Standard Universe. Similarly, there are connections of process virtualization with dynamic instrumentation (e.g., Paradyn/DynInst [124], PIN [88]).

2.3.3 Lightweight O/S-based Virtual Machines

O/S virtualization allows several isolated execution environments to run within a single operating system kernel. This technique exhibits better performance and density compared to virtual machines. On the downside, it cannot host a guest operating system different from the host operating system, or a different guest kernel (different Linux distributions are fine). Some examples include FreeBSD Jail [61], Solaris Zones [96], Linux Containers (LXC) [117], Linux-VServer [116], OpenVZ [118] and Virtuozzo [119].

Linux Containers are a kernel-level tool for providing a type of virtualization in the form of namespaces for process spaces and network spaces. This provides an alternative approach for such tasks as that of pid virtualization. The CRIU [111] checkpointing system uses LXC namespaces to virtualize kernel resource identifiers within the container. The namespaces avoid the problem of name conflicts for kernel resource identifiers during process migration.

Although process-level virtualization and Library OS [6, 95, 107] both

operate in user space without special privileges, the goal of Library OS

is quite different. A Library OS modifies or extends the system services

provided by the operating system kernel. For example, Drawbridge [95]

presents a Windows 7 personality, so as to run Windows 7 applications un-

der newer versions of Windows. Similarly, the original exokernel operating

system [37] provided additional operating system services beyond those of

a small underlying operating system kernel, and this was argued to often be

more efficient than a larger kernel directly providing those services.

2.3.4 Virtual Machines

Hardware virtualization presents an abstract computing platform, hiding the underlying hardware and host software. On top of the host software runs a virtual machine (the guest software). The guest software executes as if it were running directly on the physical hardware, with a few restrictions on resources such as network access, display, keyboard, and disk storage. Examples

of virtual machines include VMware, Qemu/KVM [114], Xen [15], Virtu-

alBox [130], and Lguest [115]. The virtual machines often run a set of

tools inside the guest operating system to inspect and control its behavior.

Further, in some cases the guest operating system is modified to provide

additional support/features and the technique is referred to as paravirtu-

alization. Some notable examples of paravirtualization are Xen [15] and

Microsoft Hyper-V [125].

One could also include binary instrumentation techniques such as PIN [88]

and DynInst [124] in a discussion of virtualization, but this tends not to be

used much with checkpointing.

The work of this thesis introduces process virtualization for abstractions

beyond the traditional kernel resource identifiers in order to virtualize nu-

merous external subsystems such as SSH connections, InfiniBand network,


KVM and Tun/Tap interfaces, SLURM and Torque batch queues, and GPU

drivers. The modular approach to virtualize these external subsystems al-

lows the checkpointing system to grow organically (see Chapter 4). By vir-

tualizing these external environments, this work enabled some projects to

be the “first” to support checkpointing.

2.4 DMTCP Version 1

DMTCP (Distributed MultiThreaded CheckPointing) is free, open source soft-

ware (http://dmtcp.sourceforge.net, LGPL license) and traces its

roots to early 2005 [30]. The DMTCP approach has always insisted on not

making modifications to the kernel, and not requiring any root (administra-

tive) privileges. While this was sometimes more difficult than an approach

with full privileges inside the kernel, it integrates better with complex cyber

infrastructures. DMTCP’s lack of administrative privilege provides a level of

security assurance.

As a side effect of working completely in the user-space, DMTCP relies

only on the published APIs (e.g., POSIX and the Linux proc filesystem) to

perform checkpoint-restart. Thanks to the highly stable kernel API, the same

DMTCP software can be used on Linux kernels ranging from the latest bleeding-edge release to Linux 2.6.5 (released in April 2004). In this section, we provide only a brief overview of the checkpoint-restart mechanisms of DMTCP. More details can be found in Ansel et al. [7].

Using DMTCP with an application is as simple as:

dmtcp_launch ./myapp arg1 ...

# From a second terminal window:

dmtcp_command --checkpoint

dmtcp_restart ckpt_myapp_*.dmtcp

This checkpoint image contains a complete standalone image of the ap-


plication with all the relevant information required to restart it later. It can

be replicated and migrated as needed. DMTCP also creates a restart script

to help automate restart of distributed computation.

Figure 2.1: Architecture of DMTCP. A centralized DMTCP coordinator exchanges checkpoint messages with the checkpoint thread in each user process; user threads are quiesced via the checkpoint signal (SIGUSR2), and user processes communicate over their own socket connections.

As seen in Figure 2.1, a computation running under DMTCP consists of

a centralized coordinator process and several user processes. The user pro-

cesses may be local or distributed. User processes may communicate with

each other using sockets, shared-memory, pseudo-terminals, etc. Further,

each user process has a checkpoint thread which communicates with the co-

ordinator. The checkpoint thread is created by the DMTCP library, dmtcphijack.so, which is loaded into each application process at startup (before the application's main() function is called) using the LD_PRELOAD feature of the loader. The DMTCP library installs a signal handler for the checkpoint signal, which is later used to quiesce user threads. The checkpoint thread

is responsible for creating checkpoint images as and when requested by the

coordinator.
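The following is a minimal sketch, not DMTCP's actual code, of how a preloaded library can install a handler for a checkpoint signal and start a checkpoint thread from a constructor that runs before main(); the use of SIGUSR2 and the helper names are assumptions made for illustration.

#include <pthread.h>
#include <signal.h>

static void ckpt_signal_handler(int sig) {
  /* User threads block here while the checkpoint image is written. */
  (void) sig;
}

static void *checkpoint_thread(void *arg) {
  /* Wait for coordinator requests, then signal the user threads. */
  (void) arg;
  return NULL;
}

__attribute__((constructor))
static void dmtcp_init(void) {
  struct sigaction sa;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  sa.sa_handler = ckpt_signal_handler;
  sigaction(SIGUSR2, &sa, NULL);

  pthread_t tid;
  pthread_create(&tid, NULL, checkpoint_thread, NULL);
}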


2.4.1 Library Call Wrappers

The DMTCP library adds wrappers around a small number of libc func-

tions. For efficiency reasons, it avoids wrapping any frequently invoked sys-

tem calls such as read and write. The wrappers are used to gather infor-

mation about the current process and to track all forked child processes as

well as remote processes created via SSH and to automatically put them un-

der checkpoint control. The local child processes inherit the LD_PRELOAD

environment variable, whereas for the remote child processes, the command line is modified to launch them under DMTCP control. In the case of sock-

ets, DMTCP needs to know whether the sockets are TCP/IP sockets (and

whether they are listener or non-listener sockets), UNIX domain sockets, or

pseudo-terminals. Again, it uses wrappers around socket, connect, accept,

open, close, etc., to do that.

2.4.2 DMTCP Coordinator

DMTCP uses a stateless centralized process, the DMTCP coordinator, to syn-

chronize the separate phases at the time of checkpoint and restart. The

checkpoint thread in each process communicates with the DMTCP coordinator through a socket connection. The checkpoint procedure can be initiated by the coordi-

nator on an explicit request from the user through its interactive interface,

through the dmtcp_command utility, or on expiration of a predefined check-

point interval. It should be noted that the coordinator is a single point of

failure since the entire computation relies on it.

2.4.3 Checkpoint Thread

The checkpoint thread waits for a checkpoint request from the coordinator.

On receiving a checkpoint request, the checkpoint thread quiesces the user

threads (by sending a checkpoint signal) and takes the process through the

phases of creating a checkpoint image. Similarly, during restart, it takes the


process through the restart phases and finally un-quiesces the user threads.

The checkpoint thread is dormant during the normal execution of the pro-

cess and is only active during the checkpoint/restart procedures.

2.4.4 Checkpoint

On receiving the checkpoint request from the coordinator, the checkpoint

thread sends the checkpoint signal to all the user threads in the process.

This quiesces the user threads by forcing them to block inside a signal han-

dler previously installed by DMTCP. The checkpoint image is created by writ-

ing all of user-space memory to a checkpoint image file. Each process has its

own checkpoint image. Prior to creating the checkpoint image, the check-

point thread also copies into user-space memory any kernel state that is required to restart the process, such as the state associated with network sockets, files, and pseudo-terminals.

At the time of checkpoint, all of user-space memory is written to a check-

point image file. The user threads are then allowed to resume executing

application code. Note that user-space memory includes all of the run-time

libraries (libc, libpthread, etc.), which are also saved in the checkpoint im-

age.
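A minimal sketch of the underlying idea, not DMTCP's actual code: the memory image can be enumerated through the Linux proc filesystem and written region by region to the checkpoint file (the helper name and the simplistic header format here are assumptions).

#include <stdio.h>
#include <unistd.h>

/* Enumerate the process's memory regions via /proc/self/maps and append
 * each readable region to the checkpoint image file descriptor. */
void write_memory_regions(int ckpt_fd) {
  FILE *maps = fopen("/proc/self/maps", "r");
  char line[512];
  if (maps == NULL) return;
  while (fgets(line, sizeof(line), maps)) {
    unsigned long start, end;
    char perms[8];
    if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) != 3)
      continue;
    if (perms[0] != 'r')          /* skip unreadable regions */
      continue;
    /* Write a small header (addresses) followed by the region's data. */
    write(ckpt_fd, &start, sizeof(start));
    write(ckpt_fd, &end, sizeof(end));
    write(ckpt_fd, (const void *) start, end - start);
  }
  fclose(maps);
}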

DMTCP doesn’t directly handle asynchronous DMA operations that may

be pending or ongoing at the time of checkpoint. This could result in an inconsistent checkpoint state, as the “quiesce” property has been violated.

2.4.5 Restart

As the first step of the restart phase, DMTCP groups all restart images from a

single node under a single dmtcp_restart process. The dmtcp_restart process

recreates all file descriptors. It then uses a discovery service to discover the

new addresses for processes migrated to new hosts and restores network

connections. It then forks a child process for each checkpoint image. These


individual processes then restore their memory areas. Next, the user threads

are recreated using the original thread stacks. All user threads restore their

pre-checkpoint context using the longjmp library call and are forced to wait in the signal handler. The checkpoint thread then restores the kernel

state that was saved during the checkpoint phase. Finally, the checkpoint

thread un-quiesces the user threads and the user threads resume executing

application code.

2.4.6 Checkpoint Consistency for Distributed Processes

In case of distributed processes, one needs to determine a consistent global

state of the asynchronous system at the time of checkpoint. The notion of

the global state of the system was formalized by Chandy and Lamport [28].

The central idea is to use marker (snapshot) messages. A process that wants

to initiate a checkpoint, records its local state and sends a marker message

on each of its outgoing channels. All other processes save their local state

on receiving the first marker message on some incoming channel. For every

other channel, any messages received before the marker message were ob-

viously sent before the snapshot “cut off”. Hence they are included in the

local snapshot.

Chandy and Lamport were primarily concerned with “uncoordinated snap-

shots” (no centralized coordinator). DMTCP employs a strategy of “coordi-

nated snapshots” using a global barrier. This makes the implementation of

Chandy-Lamport consistency particularly easy, since messages can be sent

only prior to the global barrier. Processes are “quiesced” (frozen) at the bar-

rier. Next, the checkpoint thread of each process receives all pending data in

the network, after which a globally consistent snapshot is taken. The details

of the DMTCP implementation follow.

To initiate a checkpoint, the coordinator broadcasts a quiesce message

to each process in the computation. On receiving the message, the check-


point manager thread in each process quiesces the user threads, sends an

acknowledgement to the coordinator, and waits for the drain message. Af-

ter receiving acknowledgements from all processes, the coordinator lifts the

global barrier and broadcasts the drain message. On receiving the drain

message, the checkpoint manager thread sends a special cookie (marker mes-

sage) through the “send” end of each socket. Next, it reads data from the

“receive” end of each socket until the special cookie is received. Since user

threads in all the processes have already been quiesced, there can be no

more in-flight data. The received in-flight data has now been copied into

user-space memory, and will be included in the checkpoint image.

On restart, once the socket connections have been restored, the check-

point manager thread sends the saved in-flight data (previously read from

the “receive” end of the socket) back to its peer processes. The peer processes

then refill the network buffers, by pushing the data back into the network

through the “send” end of each restored socket connection. The checkpoint

manager thread then sends a message to the coordinator to indicate the end

of the refill phase and waits for the resume message. Once the coordina-

tor has received messages indicating end of refill phase from all involved

processes, it lifts the global barrier and broadcasts the resume message. On

receiving the resume message, the checkpoint manager un-quiesces the user

threads and they resume executing user code.
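To make the drain step concrete, here is a minimal sketch under assumed names (the cookie value and the helper are illustrative placeholders, not DMTCP's actual implementation):

#include <string.h>
#include <unistd.h>

/* Hypothetical marker value; the real cookie is an internal constant. */
#define COOKIE "DMTCP_MARKER"
#define COOKIE_LEN (sizeof(COOKIE) - 1)

/* Read in-flight data from the receive side of a socket until the peer's
 * marker cookie is seen.  The drained bytes are kept in a user-space
 * buffer so they land in the checkpoint image and can be pushed back
 * into the socket on restart. */
size_t drain_socket(int sockfd, char *buf, size_t bufsize) {
  size_t total = 0;
  while (total < bufsize) {
    ssize_t n = read(sockfd, buf + total, bufsize - total);
    if (n <= 0)
      break;
    total += (size_t) n;
    if (total >= COOKIE_LEN &&
        memcmp(buf + total - COOKIE_LEN, COOKIE, COOKIE_LEN) == 0) {
      total -= COOKIE_LEN;       /* do not save the cookie itself */
      break;
    }
  }
  return total;                  /* number of in-flight bytes saved */
}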

CHAPTER 3

Adaptive Plugins as a Mechanism

for Virtualization

This chapter introduces several important examples of the need to integrate

checkpointing with an external subsystem: Pid virtualization, SSH virtual-

ization, virtualization of the InfiniBand network, virtualization of OpenGL,

and virtualization of POSIX timers. The concept of process virtualization is

introduced in concrete examples.

Virtualization of InfiniBand [27] and OpenGL [62] were extensive projects

requiring much domain knowledge. The specific results represent long-

standing open problems and are not part of this dissertation. We use those examples to motivate the need for process virtualization and to argue for its expressivity in Chapter 5.

3.1 The Ever Changing Execution Environment

In the next subsections, five examples of strategies for process virtualization

are described, in order to make clear the rich design space available for

process virtualization. In each of these cases, the nature of its virtualization

requirement is unique. The five examples are:



1. virtualization of kernel resource identifiers, using the example of process

id (pid) (Section 3.1.1);

2. virtualization of protocols, using the SSH protocol as its example (Sec-

tion 3.1.2);

3. a shadow device driver approach for transparent checkpointing over In-

finiBand (Section 3.1.3);

4. a record-replay approach, using transparent checkpointing of OpenGL

3D-graphics as an example (Section 3.1.4); and

5. adapting to application requirements for more control over checkpoint-

ing (Section 3.1.5).

3.1.1 PID: Virtualizing Kernel Resource Identifiers

The pid is one of the simplest examples of a kernel resource identifier that needs virtualization. The operating system kernel is unlikely to assign the

same pid on restart as existed at the time of checkpoint. Even if the kernel

were to allow a mechanism to request a particular pid, the requested pid

might be in use (assigned to a different process).

If the target application has saved the pre-checkpoint pid and tries to use

it after restart, it could have undesired effects. For example, if the process

uses the saved pid to send a signal after restart, in the best case, the process

will fail because the saved pid is invalid. In the worst case, the saved pid

might correspond to some other process, and the signal will be sent to that other

process.

To avoid these situations, we must provide a mechanism such that the

processes can continue to use the saved pid after restart without any un-

desired side effects. This can be done by providing the application process

with a virtual pid that never changes for the duration of the process lifetime.

When communicating with the kernel, the corresponding real pid that the


Figure 3.1: Virtualization of kernel resource identifiers (example shown for process id). A translation table between the user processes and the kernel maps virtual pids (e.g., 4000, 4001) to the real pids assigned by the kernel (e.g., 2652, 3120).

kernel knows about is looked up in the translation table and passed on to

the kernel. Figure 3.1 shows a simple schematic of a translation layer be-

tween the user processes and the operating system kernel along with a pid

translation table to convert between virtual and real pids. At each restart,

the translation table is refreshed to update the real pids.
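A minimal sketch of such a translation table and its lookup function follows (hypothetical names and a fixed-size array; DMTCP's actual data structures differ):

#include <sys/types.h>

#define MAX_PIDS 1024

struct pid_entry { pid_t virt; pid_t real; };

static struct pid_entry pid_table[MAX_PIDS];
static int pid_table_size;

/* Translate a virtual pid to the real pid currently assigned by the
 * kernel.  The virtual column stays constant for the process lifetime;
 * the real column is refreshed on every restart. */
pid_t virt_to_real(pid_t virt) {
  for (int i = 0; i < pid_table_size; i++)
    if (pid_table[i].virt == virt)
      return pid_table[i].real;
  return virt;   /* unknown ids pass through unchanged */
}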

3.1.2 SSH Connection: Virtualizing a Protocol

Pid virtualization is a classic example of virtualizing low level kernel re-

source identifiers using a translation layer. However, the same solution

doesn’t suffice for higher level abstractions, such as an SSH connection.

Figure 3.2: SSH connection (ssh Node2 app2). The user process, app1, forks a child SSH client process (ssh) to call the SSH server (sshd) on the remote node to create a remote peer process, app2.


Recall that the ssh command operates by connecting across the net-

work to a remote SSH daemon, sshd, as shown in Figure 3.2. Since the

SSH daemon is privileged, it is not possible for the unprivileged user-space

checkpointing system to start a new SSH daemon during restart. The issue

becomes even more complicated when the client and server processes are

restarted at entirely different network addresses on different hosts.

For virtualizing an SSH connection, it doesn’t suffice to virtualize just the

network address. Instead, it must virtualize the entire SSH client-server con-

nection. In essence, the SSH daemon represents a privileged process running

a certain protocol. Regardless of whether the protocol is an explicit standard

or a de facto standard internal to the subsystem, process virtualization must

virtualize that protocol. Checkpointing and restarting the privileged SSH

daemon is not an option.

Figure 3.3: Virtualizing an SSH connection (ssh Node2 app2). The call to launch an SSH client process is intercepted to launch virtual SSH client (virt_ssh) and server (virt_sshd) processes. virt_ssh and virt_sshd are unprivileged processes.

Process virtualization provides a principled and robust algorithm for trans-

parently checkpointing an SSH connection. As shown in Figure 3.3, the SSH


connection is virtualized by creating virt_ssh and virt_sshd helper pro-

cesses that shadow the SSH client and server processes respectively. The

virt_ssh and virt_sshd processes are owned by the user and are placed

under checkpoint control. The ssh and sshd processes are not check-

pointed.

On restart, the user processes are restored along with virt_ssh and

virt_sshd processes (without the underlying SSH connection) on new

hosts. The virt_ssh process then recreates a new SSH connection (see Sec-

tion 5.4).

3.1.3 InfiniBand: Virtualizing a Device Driver

Both ssh for a traditional TCP network and the new InfiniBand network

are intimately connected with high performance implementations of MPI

(Message Passing Interface). An implementation usually retains ssh and

TCP in addition to InfiniBand support, since typical MPI implementations

bootstrap their operation through ssh in order to create additional MPI

processes (MPI ranks), and to exchange InfiniBand addresses among peers.

InfiniBand virtualization has been a particular challenge both due to its

complexity [134, 63, 16] and due to the fact that much of the state is hid-

den either within a proprietary device driver or within the hardware itself.

The solution here is to use a shadow device driver approach [106]. The

InfiniBand plugin (§5.10) maintains a replica of the device driver and hard-

ware state by intercepting and recording the InfiniBand library calls. On

restart, this replica is used to recreate and restore the state of the InfiniBand

connection.


3.1.4 OpenGL: A Record/Replay Approach to Virtualizing

a Device Driver

Scientific visualization is yet another example that requires a different kind

of virtualization solution. Some graphics computations are extremely GPU-

intensive. Further, most scientific visualizations today use OpenGL for 3D-

graphics. If a scientist walks away from a visualization and needs to restart

it the next day, time is wasted reproducing it. Further, switch-

ing between multiple scientific visualizations becomes extremely inefficient.

Hence, checkpoint-restart is a critical technology. However, it is difficult

to checkpoint, because much of the graphics state is encapsulated into a

vendor-proprietary hardware GPU chip.

The OpenGL plugin (§5.9) achieves checkpoint-restart of 3-D graphics

by using a process virtualization strategy of record (record all OpenGL calls),

prune (prune any calls not needed to reproduce the most recent graphics state),

and replay (replay the calls during restart in order to place the GPU into a

semantically equivalent state to the state that existed prior to checkpoint).

3.1.5 POSIX Timers: Adapting to Application

Requirements

A POSIX timer is an external resource maintained within the kernel and has an associated kernel resource identifier known as a timer id. As with pid virtualization, the timer id needs to be virtualized as well and can use the same

strategy.

Consider a process that is checkpointed while a timer is still armed, i.e.

the timeout specified with the timer has not expired yet. On restart, what

is the desired behavior? Should the timer expire immediately or should it

expire after exhausting the remaining timeout period? There is no single

correct answer as the desired result is application dependent. For an appli-

cation that is waiting for a response from a web server, it is desired to expire


the timer on restart. However, for an application process that is monitor-

ing a peer process for potential deadlocks, the timer should continue for the

remaining time period.
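As an illustration of the two policies, a sketch (with hypothetical helper names, not the plugin's actual code) could save the remaining timeout with timer_gettime() at checkpoint and choose at restart between re-arming the timer and letting it expire immediately:

#include <time.h>

static struct itimerspec saved_spec;     /* lives in checkpointed memory */

/* At checkpoint time, record the timer's remaining timeout. */
void save_timer_state(timer_t timerid) {
  timer_gettime(timerid, &saved_spec);
}

/* At restart time, either re-arm the recreated timer with the remaining
 * timeout or make it expire almost immediately, depending on what the
 * application needs. */
void restore_timer_state(timer_t new_timerid, int expire_immediately) {
  struct itimerspec spec = saved_spec;
  if (expire_immediately) {
    spec.it_value.tv_sec = 0;
    spec.it_value.tv_nsec = 1;           /* a zero value would disarm the timer */
  }
  timer_settime(new_timerid, 0, &spec, NULL);
}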

3.2 Virtualizing the Execution Environment

As seen in the previous section, it is imperative to virtualize the external

resources in order to fully support checkpoint-restart for any application. In

order to be successful, virtualization should be done transparently to the ap-

plication. This assumes that the application is interacting with the external

resource through a fixed set of APIs. Two basic requirements for virtualizing

an external resource for checkpointing are:

1. Virtualize external subsystems.

2. Capture/restore the state of external resources.

Next, we discuss each of these requirements, elaborate on their im-

portance and discuss what additional features are required for a complete

virtualization solution.

3.2.1 Virtualize Access to External Resources

Since external resources may change between checkpoint and restart, we

need to virtualize them. This can be achieved through a translation layer

between the application process and the resource. Virtualizing a resource

may be as simple as translating between virtual and real identifiers such

as pid-virtualization (Section 3.1.1) or it may involve more sophisticated

mechanisms like shadow device drivers (Section 3.1.3). Depending upon the

external resource, the translation may be active throughout the computation

(e.g., for pids) or only during the restart procedure (for SSH).

Further, the translation layer should ensure that the access to a resource

is atomic with respect to checkpoint-restart, i.e., a checkpoint shouldn’t be


allowed while the process is in the middle of manipulating/accessing the re-

source. Not doing this may result in an inconsistent state at restart. Consider

pid virtualization where a thread tries to send a signal to another thread us-

ing the virtual tid (thread id). The pid virtualization layer translates the

virtual tid to the real tid and sends the signal using real tid. Further con-

sider that the process is checkpointed after the translation from virtual to

real, but before the signal is actually sent. On restart, the process will re-

sume and will try to send the signal with the old real tid, which is no longer valid.

Share the virtualized view with peers

Virtualizing access to external resources gets complicated in a distributed

environment. Processes communicate with their peers. This demands a

consistent virtualization layer across all involved parties. It becomes more

evident after restart, when the translation table is updated to reflect the

current view of the external resource. These updates must be shared with all

the peer processes to allow them to update their own translation tables. For

example, in case of network address virtualization, each process must inform

its peers of its new network address on restart to allow them to restore socket

connections.

3.2.2 Capture/Restore the State of External Resources

When restarting a process from a previous checkpoint, we need to restore

the process view of the external resource. We need to identify the relevant

information that would be required to restore/recreate the external resource

during restart. This information should be gathered at the time of check-

point and should be saved as part of the checkpoint image. This information

can then be read from the checkpoint image on restart.


Quiesce the external resource

During checkpoint, the external resources should be quiesced to ensure a

consistent state. For example, an asynchronous disk read operation must be

allowed to finish before writing the process memory to the checkpoint image

to avoid inconsistent data due to ongoing memory updates (DMA).

Consistency of the computation state

As discussed above, a virtualization scheme should be transparent to the

user application. Thus, the application view of the external resource should

be consistent before and after checkpoint. Similarly, the application process

should not observe any change in its own state before and after checkpoint.

This involves preserving the state of the running process (e.g., threads, mem-

ory layout, and file descriptors) between checkpoint and restart.

Note that it is acceptable to alter the process state and/or the state of

the external resource while performing checkpoint-restart. However, such changes

should be reverted and the pre-checkpoint view of the application should

be restored before the application process is allowed to resume executing

application code.

3.3 Adaptive Plugins as a Synthesis of

System-Level and Application-Level

Checkpointing

So far we have discussed the motivation for virtualizing the execution envi-

ronment along with the basic requirements for achieving the same. In this

section we will discuss possible design choices.

There are two basic approaches for achieving the goals discussed in Sec-

tion 3.2. One is to use application-specific checkpointing by having the ap-


plication developer write extra code for supporting checkpointing. However,

as discussed in Section 2.1, this is not an ideal solution as it requires knowl-

edge of the internals of the applications and puts a burden on the developer.

The second approach is to use an existing monolithic checkpointing system

such as DMTCP version 1 and insert the virtualization code in it along with

a large number of heuristics to satisfy a variety of application needs (e.g.,

heuristics for posix timers as discussed in Section 3.1.5). However, there is

no universal set of heuristics that can be used with all applications as each

application requires specific heuristics to cater to its needs.

In this work, we present adaptive plugins as an ideal compromise be-

tween these two extreme approaches to meet the virtualization require-

ments. An adaptive plugin is responsible for virtualizing a single external

resource. By basing plugins on top of a transparent checkpointing package

such as DMTCP, the simplicity of transparent checkpointing is maintained.

With plugins, no target application code is ever modified, yet they enable

application-specific fine tuning for checkpoint-restart. We have already seen

examples where the external resource needs to be virtualized in previous

sections. The POSIX timer plugin is an example of an application-specific heuristic plugin. A memory cutout plugin to reduce the memory footprint of the

process for reducing checkpoint image size would be yet another example of

an application-specific plugin.

CHAPTER 4

The Design of Plugins

In the previous chapter, we discussed several use cases that require virtual-

ization of external resources in order to support checkpoint-restart. External

resources may include, but are not limited to, kernel resource identifiers,

protocols, and hardware device drivers. We further listed the two basic re-

quirements for virtualizing an external resource and discussed how a design

based on adaptive plugins is well suited for such tasks.

Section 4.1 introduces a basic framework of a plugin architecture that pro-

vides the same set of services for virtualizing external resources that were

introduced informally in Chapter 3. A plugin is an implementation of the

process virtualization abstraction. In process virtualization, an external sub-

system is virtualized by a plugin. All software layers above the layer of that

plugin see a modified subsystem.

Section 4.2 then uses these requirements to provide a design recipe for

virtualization through plugins. Section 4.3 then takes into account the is-

sue of dependencies among multiple plugins within the same application

process. Section 4.4 extends that design recipe to multiple processes, in-

cluding distributed processes on multiple hosts. Section 4.5 describes three

special-purpose plugins that are required for checkpointing all processes.

This chapter concludes with Section 4.6, containing some implementation

challenges.



Figure 4.1: Plugin Architecture. The target application (program and data) interacts with internal and third-party plugin libraries (e.g., the coordinator interface, thread, and memory plugins), the plugin engine, and the runtime libraries (libc, etc.) above the operating system kernel; plugins use library wrappers to virtualize resources and to capture/restore their state.

4.1 Plugin Architecture

An application consists of program and data. It interacts with the execution

environment through various libraries. For example, the libc runtime library

provides access to the kernel resources, a device driver library may provide

access to the underlying device hardware, and so on. Thus one can imagine

virtualizing the execution environment by intercepting the relevant library

calls. This allows us to inspect and modify the behavior of the underlying

subsystem as seen by the application.

Figure 4.1 shows a high level view of the plugin architecture. It has


two main components: (1) plugins, and (2) the plugin engine. Plugins

and the plugin engine are implemented as separate dynamic libraries. They

are loaded into the application using the LD_PRELOAD feature of the Linux

loader.

Plugin

A plugin is a checkpoint subsystem that virtualizes a single external resource

or subsystem with the help of function wrappers (§4.1.1). It saves and restores

the state of the external subsystem. Examples of external subsystems are:

process-id, network sockets, InfiniBand, etc. Application processes are con-

sidered as if they were independent, and inter-process communication through pids, sockets, etc. is handled through plugins. Further, a plugin is transpar-

ent to the target application and can be enabled/disabled for the application

as needed. Finally, third parties can write orthogonal customized plugins to

fit their needs.

Plugin Engine

The plugin engine provides event notification services (§4.1.2) to assist plug-

ins to capture/restore the state of their specific external resources. It further

interacts with a coordinator interface plugin to provide publish/subscribe

services (§4.1.3) to enable plugins to interact with each other and share the

translation tables for resource virtualization.

4.1.1 Virtualization through Function Wrappers

Since the underlying resources provided by the operating system may change

between checkpoint and restart, there is a need to virtualize them. The plu-

gin virtualizes the external resources by putting wrappers around interesting

library calls, which interpose when the target application makes such a call.

In case of pids, the virtualization can be done using a simple table translat-


ing between virtual and real pids as shown in Listing 4.1. The arguments

passed to the library call are modified to replace the virtual pid with the real

pid. Similarly, the return value can also be modified as required. The virtual

pid column of this table is saved as part of the checkpoint image, and at restart

time the real pid column is populated as processes/threads are recreated.

int kill(pid_t pid, int sig) {
  disable_checkpoint();
  pid_t real_pid = virt_to_real(pid);
  int ret = REAL_kill(real_pid, sig);
  enable_checkpoint();
  return ret;
}

Listing 4.1: A simple wrapper for kill

As seen in the above listing, a function wrapper is implemented by defin-

ing a function with the same name as the call it is going to wrap. The real function here refers to the function with the same signature in a later plugin or a run-

time library. It is possible for multiple plugins to create wrappers around a

single library function. The order of execution of wrappers is determined

by a plugin hierarchy corresponding to the order in which the plugins are

invoked (Section 4.3).

Capture/Restore state of external resource

Wrappers are also used to “spy” on the parameters used by an application to

create a system resource, in order to assist in creating a semantically equiv-

alent copy on restart. At the time of checkpoint, a plugin saves the current

state of its underlying resources into the process memory. The state can be

obtained from a number of places such as the process environment and the


operating system kernel. In some cases, the function wrappers can also be

used to gather the information about the external resources. For example, in

the “socket” wrapper (Listing 4.2), the socket plugin will save the associated

domain and protocol information along with the socket identifier.

int socket(int domain, int type, int protocol) {
  disable_checkpoint();
  int ret = REAL_socket(domain, type, protocol);
  if (ret != -1) {
    register_new_socket(ret, domain, type, protocol);
  }
  enable_checkpoint();
  return ret;
}

Listing 4.2: Wrapper for socket() to record socket state

Atomic transactions

Plugins may have to perform atomic operations that must not be interrupted

by a checkpoint. For example, the translation and call to real function should

be done atomically with respect to checkpoint-restart. Otherwise, there is a

possibility of checkpointing after the translation but before the real function

is called. In that case, on restart, the translated value is no longer valid

and can impact the correctness of the program. The plugin engine provides

disable_checkpoint and enable_checkpoint services for enclosing the critical

section as seen in Listing 4.1.

The disable_checkpoint and enable_checkpoint services are implemented

using a modified write-biased reader-writer lock. The modification allows a

recursive reader lock even if the writer is queued and waiting for the lock.

The checkpoint thread must acquire the writer lock before it can quiesce the


user threads. On the other hand, the user threads acquire and release the

reader lock as part of a call to disable_checkpoint and enable_checkpoint

respectively. If a checkpoint request arrives while a user thread is in the

middle of a critical section, the checkpoint thread will wait until the user

thread comes out of the critical section and releases the reader lock. A user

thread is not allowed to acquire a reader lock if the checkpoint thread is

already waiting for the writer lock to prevent checkpoint starvation.
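A minimal sketch of the idea using a plain POSIX reader-writer lock follows; DMTCP's actual lock is the modified write-biased variant described above, which also allows recursive reader acquisition while a writer is queued, and that refinement is omitted here.

#include <pthread.h>

static pthread_rwlock_t ckpt_lock = PTHREAD_RWLOCK_INITIALIZER;

/* User threads hold the reader lock around critical sections. */
void disable_checkpoint(void) { pthread_rwlock_rdlock(&ckpt_lock); }
void enable_checkpoint(void)  { pthread_rwlock_unlock(&ckpt_lock); }

/* The checkpoint thread takes the writer lock before quiescing user
 * threads, so a checkpoint cannot interrupt a critical section. */
void quiesce_for_checkpoint(void)  { pthread_rwlock_wrlock(&ckpt_lock); }
void resume_after_checkpoint(void) { pthread_rwlock_unlock(&ckpt_lock); }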

Atomicity is especially important for wrappers that create or destroy a

resource instance. For example, when creating a network socket, if the

checkpoint is taken right after the socket is created but before the socket

plugin has a chance to register it, the socket may not be created at restart, as no record of the socket exists. Thus one must atomically create and record

socket state as shown in Listing 4.2.

Wrappers can be considered the most basic of all virtualization tools. A

flexible, robust implementation of wrapper functions turns out to be surpris-

ingly subtle and is discussed in more detail in Section 4.6.1.

4.1.2 Event Notifications

Event notifications are used to inform other plugins (within the same pro-

cess) of interesting events. Any plugin can generate notifications. The plugin engine then delivers these notifications to all available plugins in a sequential

fashion. The order of delivery of notification depends on the plugin hier-

archy as discussed in Section 4.3. Plugins must declare an event hook in

order to receive event notifications. A plugin may decide to ignore any or all

notifications.

Figure 4.2 shows the “write-ckpt” and “restart” events generated by the

coordinator interface plugin which are then delivered to all other plugins by

the plugin engine.


Figure 4.2: Event notifications for the write-ckpt and restart events. The numbers in parentheses indicate the order in which messages are sent. Notice that the restart event notification is delivered in the opposite order of the write-ckpt event.

Some of the interesting notifications are:

• Initialize: generated during the process initialization phase (even be-

fore main() is called). The plugins can initialize data structures, etc. A

plugin may choose to register an exit-handler using atexit() which will

be called when the process is terminating.

• Write-Ckpt: each plugin saves the state of the external resources into

the process’s memory. The memory plugin(s) then create the checkpoint image.

• Resume: generated during the checkpoint cycle.

• Restart: generated during restart phase.

• AtFork: generated during a fork and works similarly to the libc function

pthread_atfork.


dmtcp_event_hook(is_pre_process, type, data) {
  if (is_pre_process) {
    switch (type) {
      case Initialize:
        myInit(); break;
      case Write_Ckpt:
        myWriteCkpt(); break;
      ...
    }
  }
  if (!is_pre_process) {
    switch (type) {
      case Resume:
        myResume(); break;
      case Restart:
        myRestart(); break;
      ...
    }
  }
}

Listing 4.3: An event hook inside a plugin

The Resume and Restart notifications are sent to plugins in the oppo-

site order from the Write-Checkpoint notification (see Listing 4.3 and Fig-

ure 4.2b). This is to ensure that any dependencies of a plugin are restored

before the plugin itself is restored. For example, the memory plugin (re-

sponsible for writing out or reading back the checkpoint image) is always

the lowest layer (see Figure 4.1). This is so that other plugins may save data

in the process’s memory during checkpoint, and find it again at the same

address during restart.


Figure 4.3: Publish/Subscribe example for sockets. The socket plugin in each process (on Node 1 and Node 2) sends its current local and remote addresses, via the coordinator interface plugin, to the coordinator.

4.1.3 Publish/Subscribe Service

In a distributed environment, a publish/subscribe service is needed so that a

given type of plugin may communicate with its peers in different processes.

Typically, on restart, once the process resources have been recreated, the

plugins publish their virtual ids along with the corresponding real ids using

the publish/subscribe service. Next they subscribe for updates from other

processes and update their translation tables accordingly. This was seen

for the pid virtualization plugin (Section 3.1.1). Similarly, when a parallel

computation is restarted on a new cluster, the socket plugin must exchange

socket addresses among peers.

At the heart of the publish/subscribe services is a key-value database

whose key corresponds to the virtual name and whose value corresponds to

the real name of the underlying resource. The database is populated when

plugins publish the key-value pairs. Once the plugin has published all of

the relevant key-value pairs, it may now subscribe by sending queries to the

database. The plugins are notified as soon as a match for the queried key is

available. Typically, the key-value database is used only at restart time, as it doesn’t need to be preserved across checkpoint-restart.
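For example, a pid plugin might use the key-value database roughly as follows at restart; the kv_publish and kv_query names are hypothetical placeholders for the publish/subscribe API, not DMTCP's actual interface.

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical key-value API, declared here only for illustration. */
int kv_publish(const void *key, size_t klen, const void *val, size_t vlen);
int kv_query(const void *key, size_t klen, void *val, size_t *vlen);

struct pid_map { pid_t virt; pid_t real; };

/* At restart, publish our new real pid under our stable virtual pid,
 * then query the database to refresh the real pids of known peers. */
void refresh_pid_mappings(pid_t my_virt, pid_t my_new_real,
                          struct pid_map *peers, int npeers) {
  kv_publish(&my_virt, sizeof(my_virt), &my_new_real, sizeof(my_new_real));
  for (int i = 0; i < npeers; i++) {
    size_t len = sizeof(peers[i].real);
    kv_query(&peers[i].virt, sizeof(peers[i].virt), &peers[i].real, &len);
  }
}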


Figure 4.3 shows an example of the socket plugins exchanging their cur-

rent network addresses with their peers. During the Write-Checkpoint phase,

the socket peers agree on using a unique key (see Section 4.4.1) to iden-

tify the connection. While restarting, this unique key is used to publish the

current network address.

It is possible to have multiple publish/subscribe APIs that differ accord-

ing to scope. It is left to the plugins to choose the scope best suited for their

needs. Two trivial scopes are node-private and cluster-wide. Node-private

publish/subscribe API is sufficient for plugins dealing with resources limited

to a single node, such as pseudo-terminals, shared-memory, and message-

queues. Whereas plugins dealing with resources that may span over multiple

nodes, such as sockets and InfiniBand, should use the cluster-wide publish/subscribe API.

The node-private publish/subscribe service may be implemented using

shared-memory while the cluster-wide publish/subscribe service must be

provided by some centralized resource such as the DMTCP coordinator.

4.2 Design Recipe for Virtualization through

Plugins

So far we have seen the plugin architecture and the services provided by

it. We have also seen how these services suffice to meet the virtualization

requirements. We use this information to create a typical recipe for writing a

new plugin to virtualize an “external resource”. One is usually given a name

or id (identifier) to provide a link to the external resource. The id may be for

an InfiniBand queue pair, for a graphics window, for a database connection,

for a connection from a guest virtual machine to its host/hypervisor, and so

on.


In all of these cases, the recipe is:

1. Intercept communication to the external resource (usually by inter-

posing between library calls), and translate between any real ids from

the external resource and virtual ids that are passed to the application

software. A plugin maintains this translation table of virtual/real ids.

2. Quiesce the external resource (or wait until the external resource has

itself reached a quiescent state);

3. Interrogate the state of the external resource sufficiently to be able to

reconstruct a semantically equivalent resource at restart time.

4. Checkpoint the application. The checkpoint will include state infor-

mation about the external resource, as well as a translation table of

virtual/real ids.

5. At restart time, the state information for the external resource is used

to create a semantically equivalent copy of the external resource. The

translation table is then updated to maintain the same virtual ids,

while replacing the real ids of the original external resource with the

real ids of the newly created copy of the external resource.

It is not always efficient to quiesce and save the state of an external

resource. The many disks used by Hadoop are a good example of this. The

data in an external database server is another example. It is not practical to

drain and save all of the external data in secondary storage.

There are two potential approaches. The first approach is to delay the

checkpoint during a critical phase. In the case of Hadoop, one would delay

the checkpoint until the Hadoop computation has executed a reduce oper-

ation, in order to not overly burden the resources of the Hadoop back end.

A similar approach can be taken for NVIDIA GPUs. In many cases, there

are also strategies for plugins to transparently detect this critical phase and

delay the checkpoint until that time.


The second approach is to allow for a partial closed-world assumption

in which some state (data/contents) is assumed to be compatible across

checkpoint and restart. In case of the external database server, the external

data already lies in fault tolerant storage and is compatible across checkpoint

and restart. Thus the solution is to maintain a virtual id that identifies the

external storage of the server. That virtual id is used at restart time to restore

the connection to the database server.

4.3 Plugin Dependencies

Some plugins may have dependencies on other plugins. For example, the

File plugin depends on the Pid plugin to restore file descriptors pointing to

“/proc/PID/maps” and so on. Each plugin provides the list of dependencies

which must be satisfied to successfully load the given plugin. The depen-

dency declaration also affects the level of parallelism that can be achieved

when performing phases such as Checkpoint, Resume and Restart.

Subject to the dependencies among plugins, this design provides end

users with the possibility of selective virtualization. Selectively including only

some plugins is advantageous for three reasons: (i) performance reasons

(some end-user plugins might have high overhead); (ii) software mainte-

nance (other plugins can be removed while debugging a particular plugin);

and (iii) platform-specific plugins.

4.3.1 Dependency Resolution

Similar in spirit to modern software package formats such as RPM and deb,

a plugin declares a list of features/services that it provides, depends on,

or conflicts with. For example, the socket plugin may provide services for

“TCP”, “UDS” (Unix Domain Sockets), and “Netlink” socket types and de-

pends on the “File” plugin (to restore file system based unix domain sock-

ets).


The dmtcp_launch program, which is used to launch an application under checkpoint control, compiles a list of all available plugins by looking at

various environment variables, such as LD_LIBRARY_PATH. A user-defined

list of plugins can also be specified to be loaded into the application. The

dmtcp_launch program examines this plugin list and creates a partial or-

der of dependencies among the plugins. The list of available plugins is

searched to fulfill any missing dependencies for the user-defined plugins.

If a match is found, plugins are loaded automatically. Otherwise an error is

reported. If two or more plugins provide the same feature/service, a conflict

is recorded and the user is provided with the conflicting plugins.

void dmtcp_plugin_dependencies(const char ***provides,
                               const char ***requires,
                               const char ***conflicts) {
  static const char *_provides[] = { "TCP", "UDS", "Netlink", NULL };
  static const char *_requires[] = { "File", NULL };
  static const char *_conflicts[] = { NULL };
  *provides = _provides;
  *requires = _requires;
  *conflicts = _conflicts;
}

Listing 4.4: Dependencies declared by a plugin. The dmtcp_launch utility uses these fields to generate a partial order among the given plugins and to report any missing dependencies or any conflicts.

Listing 4.4 provides an example of dependency information as exported

by the socket plugin. Since the plugins are implemented as shared libraries,

the dmtcp_launch program can perform dlopen/dlsym to find and call

the dmtcp_plugin_dependencies function to learn about the dependencies.
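A sketch of that lookup, assuming a simplified launcher (error handling and the partial-order computation are only indicated by comments):

#include <dlfcn.h>
#include <stdio.h>

typedef void (*dep_fn_t)(const char ***provides,
                         const char ***requires,
                         const char ***conflicts);

/* Load one plugin library and query its dependency declaration. */
int query_plugin_dependencies(const char *plugin_path) {
  void *handle = dlopen(plugin_path, RTLD_LAZY | RTLD_LOCAL);
  if (handle == NULL) {
    fprintf(stderr, "%s\n", dlerror());
    return -1;
  }
  dep_fn_t fn = (dep_fn_t) dlsym(handle, "dmtcp_plugin_dependencies");
  if (fn != NULL) {
    const char **provides, **requires, **conflicts;
    fn(&provides, &requires, &conflicts);
    /* ... feed the three lists into the partial-order computation ... */
  }
  dlclose(handle);
  return 0;
}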


Further, this approach assumes a common naming scheme to resolve

matches/dependencies across plugins. This could be automated by scan-

ning symbols in the object files, for example, for both definitions and uses.

If a symbol is defined in more than one plugin, it can be listed as a potential

source of conflict to help the plugin writer in debugging plugins.

Parallel event handling

In Section 4.1.2, we discussed how the plugin engine assumed serial delivery

of event notifications due to plugin dependencies expressed in a linear order

(Figure 4.2). However, for non-linear plugin dependencies, a dependency

graph can be created to relax the order of notification delivery. The event

notifications can be processed by multiple plugins in parallel as long as there

is no dependency between them. This is useful in modern multi-core systems

to allow idle CPU cores to process the event notifications for the plugins. It is

also useful for plugins that need to perform asynchronous operations during

event handling. In such cases, rather than blocking on a single plugin, the

event notification can be carried out in parallel in other plugins.

4.3.2 External Resources Virtualized by Other Plugins

Plugins may use resources that are virtualized by an earlier plugin. For ex-

ample, plugins are allowed to create threads, open sockets, use files etc.

However, if the resource is created/used in a way that bypasses the wrap-

pers created by the earlier plugin, the resource may not be virtualized or saved and restored by that earlier plugin. In such situations, only the plugin using the resource can save and restore its state. This is done to avoid circular depen-

dencies. If the save-restore/virtualization is absolutely required, the plugin

should be broken into two or more smaller plugins and the newer plugin

should be moved higher in the plugin-hierarchy.


4.3.3 Multiple Plugins Wrapping the Same Function

Multiple plugins are allowed to place wrappers around the same library

call. For example, the open("/proc/PID/maps", ...) function is

wrapped by the file plugin as well as the pid plugin. The file plugin needs

to be able to save/restore the file descriptor, whereas the pid plugin has to

convert the virtual PID to a real one. Figure 4.4 shows nested-wrappers

provided by the pid plugin and the file plugin.

Figure 4.4: Nested wrappers: the open function is wrapped both by the File plugin and by the Pid plugin, whereas close is wrapped only by the File plugin and getpid only by the Pid plugin; the innermost calls reach libc and the underlying system calls.

Once a plugin has performed all the required pre-processing actions, it

calls the function wrapper in the next plugin library. This is done by using the

RTLD_NEXT feature of the dlsym function call. The RTLD_NEXT service will find

the next occurrence of the given function in the library search order after

the current library. For example, in case of open wrapper in the File plugin

from Figure 4.4, dlsym(RTLD_NEXT, “open”) would return the address of

the open function defined in the Pid plugin. However, dlsym(RTLD_NEXT,

“close”) would return the address of the close function defined in Libc as

the close wrapper is not defined in the Pid plugin.

Since the wrappers execute both before and after the library call, a plugin

that was loaded earlier can place a wrapper around the wrapper created by

a later plugin. Thus the pre-processing takes place in the order of plugin

load sequence, whereas the post-processing takes place in the reverse order.
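A minimal sketch of such a nested wrapper for open(), resolving the next definition with dlsym(RTLD_NEXT, ...), is shown below; it is simplified relative to a real plugin, and the pre- and post-processing steps are only indicated by comments.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>

typedef int (*open_fn_t)(const char *, int, ...);

int open(const char *path, int flags, ...) {
  static open_fn_t next_open = NULL;
  if (next_open == NULL)
    next_open = (open_fn_t) dlsym(RTLD_NEXT, "open");

  mode_t mode = 0;
  if (flags & O_CREAT) {       /* the third argument exists only with O_CREAT */
    va_list ap;
    va_start(ap, flags);
    mode = (mode_t) va_arg(ap, int);
    va_end(ap);
  }
  /* pre-processing goes here (e.g., virtual-to-real translation in path) */
  int fd = next_open(path, flags, mode);
  /* post-processing goes here (e.g., record fd so it can be restored) */
  return fd;
}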


4.4 Extending to Multiple Processes

Until this point, plugins have been described in the context of a single pro-

cess. For distributed computations, the interaction among distributed pro-

cesses is critical to making the plugin model practical. As we have seen,

the plugins virtualize the resources for several reasons. However, in case

of multiple processes, several processes may be using a common resource.

For example, several processes may share a file descriptor open to the same

file. A mapped memory region may be shared. A socket may be shared

among multiple processes. Several processes may have duplicate pointers

to the same underlying resource. These duplicate pointers may be created

explicitly (e.g., the dup() system call creates a duplicate file descriptor), or

implicitly (by creating a child process; the child process automatically gets a

copy of all the file-descriptors, shared memory, etc.).

How does one ensure correctness if multiple processes are using the same

resource and hence virtualizing it independently of each other? Should all

processes save/restore the common resource or only one of them?

The correct answer is that only a single process should be allowed to

save/restore the state of the underlying resource. This is required for two

reasons: (i) for some resources, part of the state to be checkpointed can

be read only once. This is the case with data in kernel buffers or network

data; and (ii) if multiple processes recreate the resource during restart, it

may no longer be shared. In some situations, it is impossible to recreate the

resource (e.g., sockets) in multiple processes, while in other cases, recreating the resource multiple times is permitted but results in incorrect behavior (e.g., the same file can be opened by multiple processes, resulting in a loss of sharing semantics).


Single process

It is possible to have duplicate pointers within a single process. Thus the

plugins must ensure that only one copy is checkpointed and the duplication

is restored during restart. This requires the ability of the plugins to identify

duplicate resources during the checkpoint phase. For some resources, the

operating system kernel (or the execution environment) assigns a unique

id at the time of creation. Examples include sockets, pids, System V shared

memory objects, semaphores, etc. When these resources are duplicated, the

duplicates may be detected easily by querying the kernel for the resource id.

Multiple processes

The two key issues in dealing with multiple processes are: (i) checkpoint-

restart of shared resources; and (ii) finding the current location of peer pro-

cesses. We employ the publish/subscribe service to assist us in dealing with

these issues. While it allows a central coordinator to mediate among multi-

ple processes, it also implicitly produces a barrier. Hence, it is important to

use that facility sparingly for the sake of efficiency.

4.4.1 Unique Resource-id for Shared Resources

Duplicate detection for the remaining resources must be done by keeping

track of when the duplicates are created — explicitly or implicitly. This

is done by assigning a unique resource-id to each resource when it is cre-

ated. Resource duplication is tracked by putting wrappers around cor-

responding library calls (such as dup or fork). Once detected, the duplicates

are assigned the same resource-id as the original resource.

A globally unique resource-id can be created in several ways. One possi-

ble solution is to use a mixture of hostname, virtual/real pid of the process

creating the resource, creation timestamp, etc.
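A sketch of one such construction (the exact format is an assumption, not DMTCP's actual scheme):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Build a globally unique resource id from the hostname, the creating
 * process's pid, a timestamp, and a per-process counter (illustrative
 * format only). */
void make_resource_id(char *buf, size_t len) {
  static int counter = 0;
  char host[64] = "";
  gethostname(host, sizeof(host));
  snprintf(buf, len, "%s:%d:%ld:%d",
           host, (int) getpid(), (long) time(NULL), counter++);
}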


4.4.2 Checkpointing Shared Resources

Since only one process should be allowed to save the state of the shared

resources and the original resource creator might not be present, we must

select a checkpoint-leader process for each resource. The checkpoint-leader

is responsible for saving and restoring the state of the underlying resource.

Checkpoint-leader election — consensus across processes

The processes sharing the underlying resource may elect a checkpoint-leader

using several mechanisms. The basic idea is to have consensus across par-

ticipating processes. Ansel et al. [7] used the fcntl system call to set own-

ership of the file descriptors. Each process tries to set itself as the owner of

the given file descriptor. The centralized coordinator process was used to

create a global barrier to signal the end of election after each process had a

chance to make the system call. The last process to perform the system call

is considered the checkpoint-leader. An example is shown in Listing 4.5.

void checkpoint_file(int fd) {
  // Participate in checkpoint-leader election;
  // publish ourself as the owner of the resource.
  fcntl(fd, F_SETOWN, getpid());
  // Now wait for the election to be over.
  wait_for_global_barrier(LEADER_ELECTION);
  // If we are the owner, we are the ckpt-leader.
  if (fcntl(fd, F_GETOWN) == getpid()) {
    // Capture the state of the file descriptor.
    capture_state(fd);
  }
}

Listing 4.5: An example of leader election using the fcntl system call.


While this approach works for shared file descriptors, it doesn’t work for

other resources, such as files. There can be multiple unique file descriptors

that are opened on the same file. In this case, each unique file descriptor

gets a checkpoint-leader, which would result in multiple copies of the file being checkpointed. The publish/subscribe service can be used to provide a better

solution. Each process publishes itself as the checkpoint-leader using the

unique resource-id of the resource. The last process to publish is elected the

checkpoint-leader. Since files can have multiple unique file descriptors (and

hence multiple unique resource-ids) associated with them, we can publish

using the absolute file path or the inode number for leader election.
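A minimal sketch of this publish-based election is shown below. The publish() and subscribe() calls stand in for the coordinator's key-value service and are illustrative assumptions, not the exact API; the barrier call is the one from Listings 4.5 and 4.6.

#include <unistd.h>

/* Illustrative key-value interface to the coordinator:
 * publish() overwrites the value stored under a key;
 * subscribe() reads back the current value. */
void publish(const char *key, pid_t value);
pid_t subscribe(const char *key);

void elect_checkpoint_leader(const char *resource_id, int *is_leader)
{
    /* Every sharer publishes itself under the resource-id (e.g., the
     * file's inode number).  The last writer wins. */
    publish(resource_id, getpid());

    /* Barrier: make sure all sharers have had a chance to publish. */
    wait_for_global_barrier(LEADER_ELECTION);

    /* Whoever the coordinator last recorded is the checkpoint-leader. */
    *is_leader = (subscribe(resource_id) == getpid());
}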

Global barriers

As mentioned above, a global barrier allows plugins in different processes to

synchronize during checkpoint and restart. A simple implementation of the

global barrier requires a centralized coordinator that keeps a count of the processes that have reached the barrier. Once all processes have reached the barrier, the coordinator lifts it and allows them to proceed, as shown in Listing 4.6.

void wait_for_global_barrier(BarrierId id) {
  Message msg, rmsg;
  msg.type = GLOBAL_BARRIER;
  msg.barrierId = id;

  // Tell the coordinator that we have reached the barrier.
  send_msg_to_coordinator(msg);

  // Wait until all other peers reach the barrier.
  recv_msg_from_coordinator(&rmsg);
  assert(rmsg.type == GLOBAL_BARRIER_LIFTED);
  // The barrier has been lifted.
}

Listing 4.6: Global barrier.
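On the coordinator side, the corresponding logic is simply a per-barrier counter. The sketch below assumes the coordinator already tracks the number of connected processes in num_peers; MAX_BARRIERS and broadcast_to_all_peers are illustrative names, not part of the actual coordinator interface.

/* Coordinator-side view of the barrier from Listing 4.6 (sketch). */
static int num_peers;                 /* maintained as peers connect/disconnect */
static int arrived[MAX_BARRIERS];     /* processes that have reached each barrier */

void on_barrier_msg(BarrierId id)
{
    if (++arrived[id] == num_peers) {
        /* Everyone has reached the barrier: lift it. */
        broadcast_to_all_peers(GLOBAL_BARRIER_LIFTED, id);
        arrived[id] = 0;   /* reset for the next checkpoint cycle */
    }
}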


Global barriers are costly as each process has to communicate with the

centralized coordinator process. If each plugin implements several global

barriers, the performance impact can be significant in terms of checkpoint

and restart times. The total number of global barriers can be reduced signif-

icantly by using process-level anonymous global barriers that can be implemented in the coordinator interface plugin, as shown in Listing 4.7.

void implement_global_barriers() {
  // Wait on the first anonymous global barrier.
  wait_for_global_barrier(BARRIER_ANON_1);
  // Generate an event notification indicating that
  // anonymous barrier 1 has been lifted.
  generate_event(ANON_GLOBAL_BARRIER_1);

  wait_for_global_barrier(BARRIER_ANON_2);
  generate_event(ANON_GLOBAL_BARRIER_2);

  wait_for_global_barrier(BARRIER_ANON_3);
  generate_event(ANON_GLOBAL_BARRIER_3);
  ...
}

Listing 4.7: Anonymous global barriers implemented in the coordinator interface plugin.

Consider the example of leader election. On receiving the event notifica-

tion for the ANON_GLOBAL_BARRIER_1 event, each plugin will participate in

leader election for its resources by publishing itself as the checkpoint leader.

On receiving the event notification for ANON_GLOBAL_BARRIER_2, each

plugin can check to see if it is the checkpoint-leader by subscribing to the

checkpoint leader information for the unique resource id.


[Figure 4.5: Plugin dependency for distributed processes. The figure orders the File, Socket, Memory, Fork/Exec, Pid, Coordinator Interface, and Thread plugins along the Write-Checkpoint and Resume/Restart paths.]

4.4.3 Restoring Shared Resources

Note that memory regions are restored before plugins can restore the state of

their corresponding resources. In the case of shared resources, the checkpoint-leader recreates the underlying resources and then shares them with other processes using the publish/subscribe service. The checkpoint-leader publishes while the remaining processes subscribe to the resource-id.

Remark: Resources involving file-descriptors can be shared by passing them

over Unix domain sockets.
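A sketch of that sharing step using the standard SCM_RIGHTS ancillary-data mechanism is shown below; error handling is omitted, and sock is assumed to be a connected Unix domain socket between the checkpoint-leader and a peer.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send file descriptor fd to the peer on the other end of the
 * connected Unix domain socket 'sock' (checkpoint-leader side). */
int send_fd(int sock, int fd)
{
    char dummy = 'F';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;           /* pass a file descriptor */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

The receiving process performs the symmetric recvmsg call and obtains a new descriptor that refers to the same underlying kernel object.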

Note that sharing of resources forces a certain dependency among plu-

gins that is summarized in Figure 4.5. The required dependency can be

observed by noting the required actions of a plugin at the time of restart.

The pid plugin is responsible for virtualizing pids, which is required by the fork/exec plugin to restore the process-trees. Once the process-trees have

been created, the file, socket, System V shared memory, etc. plugins may

recreate/restore the resources and share them with other processes.


4.5 Three Base Plugins

In this section we discuss three special-purpose plugins: the coordinator

interface plugin, the thread plugin, and the memory plugins.

4.5.1 Coordinator Interface Plugin

A centralized coordinator process is used to synchronize checkpoint-restart

between multiple processes on the same or different hosts. A coordinator

interface plugin communicates with the coordinator process and generates

events related to checkpointing when requested by the coordinator. It cre-

ates a checkpoint-manager thread, which listens to the coordinator process

for a checkpoint message while the user threads are executing application

code. On receiving a coordinator message, the checkpoint-manager thread

generates the checkpoint, resume, or restart events, which are then delivered

to all other plugins.

The coordinator interface plugin and the coordinator process can best be

thought of as a single programming unit. It is this programming unit that

implements global barriers at the time of checkpoint or restart.

The special case of a single standalone target process can be supported by

a minimal coordinator interface plugin, which directly generates the three

basic event notifications: checkpoint, resume, and restart. In this case, one

does not need any external coordinator process.

At the other extreme, a coordinator interface plugin can be written to

support a set of redundant coordinators. This alternative eliminates the

possibility of a single point of failure.

4.5.2 Thread Plugin

The thread plugin is responsible for saving and restoring the state of all user

threads during checkpointing. The plugin engine invokes the checkpoint-

manager thread through the write-ckpt event hook. The checkpoint manager


then sends a POSIX signal to all user threads. This forces the user threads

into a checkpoint-specific signal handler (which was defined earlier within

the thread plugin). The handler causes each user thread to save its context

(register values, etc.) into the process memory and to then wait on a lock.

When the checkpoint completes, the thread plugin releases all user threads

from their locks, and user execution resumes.

On restarting, the memory plugin restores user-space memory from a

checkpoint image, and control is then passed to a restart event hook of the

thread plugin. Only the primary thread of the restarted process exists at this

time. That thread recreates the other threads, restores their context, and re-

leases the user threads from the locks that were entered prior to checkpoint.

(The state of a lock depends only on user-space memory.)

4.5.3 Memory Plugins

[Figure 4.6: Various memory plugins stacked together. Below the runtime libraries, the plugin engine, and the other plugin libraries, the memory plugins are stacked as: prepare list of memory areas, zero-page detection, compression, encryption, and write to network socket.]

Memory plugins are responsible for writing the contents of a process’s

memory into the checkpoint image. The checkpoint image is read during


the restart process to recreate the process memory. Memory plugins are the last

in the plugin loading sequence as every other plugin necessarily depends on

the memory resource. Figure 4.6 shows an example sequence of memory

plugins that perform zero-page optimizations followed by compression and

encryption before writing the checkpoint data to a network socket. A pro-

cess on the other end of the socket may then save the data onto persistent

storage.

At restart time, a special application, dmtcp_restart, is needed to boot-

strap the restart procedure to load the restoration code corresponding to

all the memory plugins involved. Control is then passed to the memory plugins, which restore the rest of the process memory. After restor-

ing memory, the rest of the plugins recreate/restore their corresponding re-

sources. User threads are then recreated and the process resumes executing

application code.

Here we list some characteristics of the memory plugins:

1. Since writing the checkpoint image is the last step in the checkpoint pro-

cess, the memory plugins must appear last in the plugin sequence.

2. If it is possible for memory plugins to alter the memory maps of the

current process, the first memory plugin must create a list of memory

areas to be written to the checkpoint image. The memory plugins can then map new memory areas for checkpoint purposes only, and these

areas will not be checkpointed.

3. The memory plugins pass information to the next memory plugin using

a pipe mechanism, i.e., each plugin may process the incoming data and

send the processed (and potentially modified) data to the next plugin.

Data piping can be implemented by creating hooks for writing and

reading memory.


4. The plugins agree on some notion of end-of-data to finish writing the

checkpoint image.

5. The last memory plugin writes the data to persistent storage (a file) or to a pipe/socket. There can be a different process on the other end of the pipe/socket, which then saves the data to a persistent device, or

it restarts the process on the fly. The last memory plugin here means

the final or lowest memory plugin (e.g., the “write to network socket”

plugin in Figure 4.6).

6. The last memory plugin is responsible for reading from the checkpoint im-

age.

7. During restart, memory plugins are responsible for restoring other run-

time libraries; thus these plugin libraries must be self-contained.

Remark: Note that the state managed by the memory plugins will not be

compressed or encrypted in our running example of memory plugins. This

is necessary to solve the problem of bootstrapping on restart. If the boot-

strapping code were also encrypted, it would be impossible to bootstrap.

4.6 Implementation Challenges

In this section we describe some of the implementation challenges that we

faced in implementing the plugin-based virtualization in DMTCP version 2.

4.6.1 Wrapper Functions

We discuss three different implementation techniques that were tried in suc-

cession, before settling on a fourth choice: a hybrid of the second and third

options:

1. dlopen/dlsym: This is a naive approach, well-known in the literature. It

allows the plugin to define a system call of the same name, whose body


uses dlopen/dlsym to open the run-time library (e.g. libc, libpthread,

etc.), and then call the system call in the run-time library. However,

this fails when creating a wrapper for the GNU implementation of

calloc. The GNU implementations of dlopen and dlsym would call

calloc, thus creating a circular dependency. Wrapping occurrences of dlopen/dlsym from a user's application creates a similar circular de-

pendency. However, a still more severe criticism is that if the wrapper

function directly calls the run-time library, then nested wrappers be-

come impossible. In our implementation, multiple plugins frequently

wish to wrap the same system call.

2. offsets within a run-time library: This was implemented in order to

avoid the use of dlopen/dlsym. A base address is chosen within

the run-time library. (It may be the start address of the library or an

unusual system call unlikely to be needed by wrappers.) For all sys-

tem calls to be wrapped, the offset from that system call to the base

address is calculated before launching the end-user application. The

end-user application is then launched and the base address is recalcu-

lated. Next, the base address is used along with offsets to determine

the addresses of the functions in the run-time library. At this point, the

functions in the run-time library can be called using the corresponding

addresses. This solves the issues caused by circular dependencies (e.g.

dlopen, dlsym, calloc). However, nested wrappers still cannot

be implemented.

3. dlsym/RTLD_NEXT: The POSIX option RTLD_NEXT for dlsym is de-

signed in part to implement wrapper functions. This option causes

dlsym to search the sequence of currently open libraries for the next

matching symbol beyond the current library. This fixes the problem of

implementing nested wrappers, but it does not solve the problem of

circular dependencies.


The ultimate solution requires an additional observation: The run-time

library sometimes internally calls a system call (as with dlopen/dlsym

calling calloc). It is a mistake for the plugin to execute the wrapper function

around this internal call. Yet, when dlsym internally calls calloc, the ELF

loader will call the first definition of calloc that it finds. The first library to

be loaded was libdmtcp.so, as part of the design of DMTCP. So, the calloc

wrapper in libdmtcp.so is called.

A standard wrapper for calloc within libdmtcp.so would then call dlsym

to determine the address of calloc within libc.so. But this would create the

circularity. Instead, the wrapper detects that this is a circular call originating

from the run-time library (libc.so). Upon detecting this, the calloc wrap-

per reverts to the second method above (offsets within a run-time library) in

order to directly call the implementation of calloc within libc. Thus the

circularity is broken.
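A simplified sketch of the resulting hybrid calloc wrapper is given below. The names offset_calloc and the thread-local resolving flag are illustrative; the initialization of offset_calloc via the offset technique and the plugin's own bookkeeping are omitted.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Address of calloc inside libc, precomputed via the offset technique
 * (method 2 above) before the application starts. */
static void *(*offset_calloc)(size_t, size_t);

/* Next calloc in library search order, resolved lazily via dlsym. */
static void *(*next_calloc)(size_t, size_t);
static __thread int resolving = 0;   /* set while dlsym is in progress */

void *calloc(size_t nmemb, size_t size)
{
    if (resolving) {
        /* dlsym itself called calloc: break the circularity by calling
         * libc directly through the precomputed offset. */
        return offset_calloc(nmemb, size);
    }
    if (next_calloc == NULL) {
        resolving = 1;
        next_calloc = (void *(*)(size_t, size_t))dlsym(RTLD_NEXT, "calloc");
        resolving = 0;
    }
    return next_calloc(nmemb, size);
}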

4.6.2 New Process/Program Creation

When a process forks to create a new child process, the thread that calls

fork() is the only thread in the new process. This poses certain challenges

for plugins, especially when dealing with locks. If, at the time of fork(), some other thread is holding a lock, the threads in the new process may deadlock on this lock. The solution is to install atfork() handlers in all plugins that use locks or similar artifacts, so that whenever a child process is created, the handler re-initializes the locks before doing anything else. An alternative is to use the AtFork event generated by the fork/exec plugin. Glibc and Firefox are two real-world examples that install atfork handlers to re-initialize the locks for their respective malloc arenas.
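A minimal sketch of such a handler, using the standard pthread_atfork interface, is shown below; plugin_lock and plugin_init are illustrative names for a plugin's own lock and initialization routine.

#include <pthread.h>

static pthread_mutex_t plugin_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called in the child immediately after fork(): the thread that held
 * plugin_lock in the parent does not exist here, so simply give the
 * child a fresh, unlocked mutex. */
static void atfork_child(void)
{
    pthread_mutex_init(&plugin_lock, NULL);
}

/* Run once when the plugin library is loaded (GCC constructor). */
__attribute__((constructor))
static void plugin_init(void)
{
    /* The prepare and parent handlers are not needed for this plugin. */
    pthread_atfork(NULL, NULL, atfork_child);
}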

New programs created by calling execve() have a different set of prob-

lems. Since the new program gets a completely new address space, all infor-

mation that was gathered by the plugin prior to exec is lost. Plugins that


need to preserve information across exec need a lifeboat where they can

put the information for later use. A typical example of lifeboat would be a

temporary file created on disk. The plugins serialize the previously captured

information to the lifeboat. Since the plugins are independent of each other,

there can be multiple lifeboats per process.

Remark: As an optimization, it is possible to provide a single lifeboat that

can be used by all the plugins.
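A minimal sketch of the lifeboat idea is shown below. The environment-variable name DMTCP_LIFEBOAT_FILE and the serialize/deserialize helpers are illustrative assumptions, not the actual DMTCP convention: before calling the real execve, the plugin serializes its tables to a temporary file and advertises the path through the environment; the plugin in the new program reads it back during initialization.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Plugin-specific (illustrative) serialization routines. */
void serialize_pid_table(FILE *f);
void deserialize_pid_table(FILE *f);

/* Pre-exec: serialize the plugin's virtual-to-real table to a
 * temporary file and pass its path to the new program. */
void write_lifeboat(void)
{
    char path[] = "/tmp/plugin-lifeboat-XXXXXX";
    int fd = mkstemp(path);                 /* unique temporary file */
    FILE *f = fdopen(fd, "w");
    serialize_pid_table(f);
    fclose(f);
    setenv("DMTCP_LIFEBOAT_FILE", path, 1); /* illustrative name */
}

/* Post-exec, in the plugin's initialization code of the new program. */
void read_lifeboat(void)
{
    const char *path = getenv("DMTCP_LIFEBOAT_FILE");
    if (path != NULL) {
        FILE *f = fopen(path, "r");
        deserialize_pid_table(f);
        fclose(f);
        unlink(path);                       /* the lifeboat is one-shot */
    }
}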

4.6.3 Checkpoint Deadlock on a Runtime Library

Resource

Atomic wrapper operations are also desired when dealing with resources

that use locks for atomicity. Suppose a user thread is quiesced while holding

the resource lock. Later on, if the resource is needed to complete the checkpoint, it can cause a deadlock within the process. For example, in one of the most frequent scenarios, a user thread is quiesced while performing malloc/free inside glibc. The checkpoint thread then blocks when it calls any of these functions during the checkpoint process. There are two possible solutions: (i) modify the checkpointing logic to never call these functions, or (ii) create wrappers around these functions which call disable_checkpoint and enable_checkpoint around the call to the real library functions, as shown in Listing 4.8.

void *malloc(size_t size) {
  disable_checkpoint();
  void *ret_val = real_malloc(size);
  enable_checkpoint();
  return ret_val;
}

Listing 4.8: Malloc wrapper to avoid deadlock during checkpointing.


4.6.4 Blocking Library Functions and Checkpoint

Starvation

There are certain wrappers around blocking library functions that need to

virtualize the underlying system resource. As discussed in Section 4.1.1,

the call to the library function and the translation between real and virtual names

should be atomic with respect to checkpointing. However, if a function call

is blocking, the checkpoint may never succeed. Examples of such functions are waitpid and pthread_join.

pid_t waitpid(pid_t pid, <args>) {
  while (true) {
    disable_checkpoint();
    pid_t real_pid = virtual_to_real(pid);
    // The WNOHANG flag tells waitpid to return
    // immediately (with 0) if the operation would block.
    pid_t ret_val = real_waitpid(real_pid, WNOHANG | <args>);
    pid_t virt_pid = real_to_virtual(ret_val);
    enable_checkpoint();

    if (ret_val > 0)    // A child changed state: success.
      return virt_pid;
    if (ret_val == -1)  // An error occurred: the function failed.
      return -1;

    // ret_val == 0: the call would have blocked.
    // A checkpoint may take place here.
    // Yield the CPU to avoid spinning.
    yield();
  }
}

Listing 4.9: Wrapper for waitpid with non-blocking calls to the real waitpid function.


In these situations, one can modify the wrapper, as seen in Listing 4.9, to call a non-blocking or timed version of the function in a loop until it succeeds or returns an error. A non-blocking version (e.g., waitpid with WNOHANG) returns immediately when the operation would block, while a timed version waits for a given time period before returning instead of blocking indefinitely.

In some situations, the blocking call may not provide a non-blocking

version. In those cases, a potential solution is to use a signalling mechanism

to force the call to return with an error. At this point, the checkpoint can

take place. However, the wrapper must be re-executed from the beginning

to avoid any stale state.

CHAPTER 5

Expressivity of Plugins

This chapter presents a large variety of examples of adaptive plugins, to

demonstrate the expressivity of the plugin framework. They fall into sev-

eral categories, each of which represents a unique type of contribution, in

generalizing the traditional functionality of checkpoint-restart.

Some of the plugins address long-standing challenges. Not only do

these plugins provide additional functionality for checkpoint-restart, but

they do so with far fewer lines of code than the previously available, less functional approaches. These include transparent checkpointing of: InfiniBand

networks by Cao et al. [27]; hardware accelerated 3-D graphics (OpenGL

2.0 and beyond) by Kazemi Nafchi et al. [62]; a network of virtual machines

by Garg et al. [44]; and GDB sessions by Visan et al. [127]. Each of these

efforts was led by a different author. Thus they represent trials of the new

plugin feature by independent users. The full details of each plugin can be

found in the publications and technical reports of those authors.

While I believe any of these could have been done by adding support

in any of the existing checkpointing packages, the amount of effort (both

in terms of person-hours and lines of code) would have been enormous.

Instead, by using the adaptive plugins to implement a process virtualization

approach, the job was made much easier. In all cases, the plugin writers



didn’t need to learn the details of DMTCP internals, allowing them to focus

only on the plugin.

Plugin        Lines of  Novelty                              Prior Art      Lines of
              code                                                          code
------------  --------  -----------------------------------  -------------  --------
SSH session      1,021  The only solution                    --             --
GDB session        938  The only solution                    --             --
Batch-Queue      1,715  The only solution                    --             --
KVM/Tun          1,100  Full snapshots of network of VMs     Single VM      ??
                                                             snapshots
OpenGL           4,500  Supports programmable GPUs           VMGL [69]      78,000
                        (OpenGL 2.0 and beyond)
InfiniBand       2,500  Native InfiniBand checkpoint for     MPI-specific   17,000
                        both MPI and non-MPI jobs            [55]
IB2TCP           1,000  InfiniBand to TCP migration for      MPI-specific   ??
                        both MPI and non-MPI jobs            [55]

Table 5.1: Process virtualization based checkpoint-restart is both more general and typically an order of magnitude less in implementation size.

The expressivity is measured along two dimensions (see Table 5.1). The

first dimension is a measurement of lines of code for the plugins. Since

each example was a “first” for that functionality, we compare with lines of

code for a previously published implementation with lesser functionality where

possible.

In the second dimension, we compare functionality with that application

identified as having the most previous functionality in the corresponding

domain. Thus a two-fold argument is presented. First, the process virtualization approach permits implementations with much larger functionality than had

previously been practical with moderate resources. Second, the process vir-

tualization approach results in an implementation with many fewer lines of

code than would have been practical by other approaches. (Of course, the


fewer lines of code in the plugin are made possible by using the base support

for plugins in DMTCP version 2.)

Note that some of the plugins discussed in this chapter were not created

as part of this thesis. Instead, they were created by different authors using

the plugin API. Further details of each plugin can be found in the publica-

tions and technical reports of those authors.

Statistics for various plugins

Table 5.2 provides several statistics, including the source lines of code, the number of library-call wrappers, and the various services used by the plugins.

The lines of code were obtained by using SLOCCount [132].

Section 5.1 provides a brief overview of the plugins related to file descrip-

tor handling. Section 5.2 provides an overview of the plugins handling pids, System V IPC, and timers. A few application-specific plugins are

discussed in Section 5.3. The remaining sections provide various case stud-

ies where new functionality was implemented, whereas previously in other

checkpoint-restart packages, the added functionality was implemented only

through independent, auxiliary applications.

5.1 File Descriptor Related Plugins

Since file descriptors may be used for file objects, socket connections, or

event notifications, the corresponding plugins share some code for handling

generic file descriptors. This results in a cleaner design and a smaller code

footprint. The shared code provides services for generating unique file de-

scriptor ids, detecting/managing duplicate file descriptors, leader election,

and re-sharing of file descriptors on restart.

Note that DMTCP version 1 provided support for checkpointing TCP and Unix domain sockets, enabling it to checkpoint distributed applications. It also pro-

vided limited support for handling files and pseudo-terminals. For this work,


Plugin          Language  Lines of Code  Wrappers  Services used

Internal Plugins
File            C/C++     2,276*         48        a,b,c,d,e
Socket          C/C++     1,356*         17        a,b,c,d
Event           C/C++       909*         12        a,b,c,d,e
Pid             C/C++     1,644          47        c,d,e
SysVIPC         C/C++     1,154          14        a,b,c,d,e
Timer           C/C++       419          14        a,c,d,e
SSH             C/C++     1,021           3        a,b,c,d,e

Contrib Plugins
Batch-Queue     C/C++     1,715          13        e†
Ptrace          C/C++       938           7        a,b,c
Record-replay   C/C++     8,071         164        a,b,c,e
KVM             C           749           2        a,b,c,e
Tun             C           351           3        a,b,c,e
OpenGL          C/C++     4,500         119        a,b,c,e,f
InfiniBand      C         2,788          34        a,b,c,d,e
IB2TCP          C/C++       804          31        c,d,e

Application-Specific Plugins
Malloc          C/C++       116          10        f
Dlopen          C/C++        28           3        f
Modify-env      C           134           0        c,e
CkptFile        C/C++        37           0        a,c
Uniq-Ckpt       C/C++        39           0        a,c

*: Uses an additional 899 lines of shared common code.
†: Uses specialized utilities to detect restart.

Plugin services:
(a) Write checkpoint hook
(b) Resume hook
(c) Restart hook
(d) Publish/Subscribe
(e) Virtualization
(f) Protect critical sections of code

Table 5.2: Statistics for various plugins.


the plugins were created by rewriting the existing solution from DMTCP ver-

sion 1. This greatly enhanced the available features and provided an easier

way for the user to fine-tune checkpointing. This section provides a brief

overview of the three plugins.

File plugin

The File plugin is responsible for handling file descriptors pointing to regular

files and directories. For implementation purposes, it also handles pseudo-

terminals (ptys) and FIFO (first-in first-out) objects, since they have semantics similar to those of file objects. Apart from restoring the relevant file descriptors,

the File plugin also needs to translate the file paths if the computation is

restarted on a system with different mount points or by a different user.

There are several ways to provide file path translation. A simple mecha-

nism involves recording the relative file paths on checkpoint and using the

relative path information on restart to find the file. Another approach may

involve wildcard substitution, where a certain component of the file path is

transparently replaced with a different one. For example, if a mount point

has changed from /mnt/foo to /bar, the plugin would replace /mnt/foo/baz

with /bar/baz.
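A sketch of this prefix substitution is shown below; it assumes the old and new prefixes have already been obtained, e.g., from a configuration file, and the function name is illustrative.

#include <stdio.h>
#include <string.h>

/* Rewrite 'path' in place if it begins with old_prefix, e.g.
 * translate_path(buf, sizeof buf, "/mnt/foo", "/bar") turns
 * "/mnt/foo/baz" into "/bar/baz".  Returns 1 if a substitution
 * was made. */
int translate_path(char *path, size_t bufsize,
                   const char *old_prefix, const char *new_prefix)
{
    size_t oldlen = strlen(old_prefix);
    if (strncmp(path, old_prefix, oldlen) != 0)
        return 0;                        /* prefix does not match */

    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s%s", new_prefix, path + oldlen);
    snprintf(path, bufsize, "%s", tmp);  /* copy back, truncating safely */
    return 1;
}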

The File plugin also deploys some heuristics to determine whether it needs to save and restore the associated file data. In some cases, the file data must

always be checkpointed. Examples include unlinked files (Linux allows a file

to be unlinked while a process still has a valid file descriptor) and temporary

files created by programs like vim and emacs.

For a simpler design, the heuristics part of the File plugin is now im-

plemented as a separate plugin (Ckpt-File). This way the user can tweak

this relatively simple newer plugin according to their wishes. Similarly, the

file path translation mechanism can also be moved into its own plugin. Naturally, the original File plugin then depends on these two plugins for their services.


Socket plugin

The Socket plugin is responsible for checkpointing and restoring the TCP/IP

sockets, Unix domain sockets, and netlink sockets. Potentially, this plugin

can also be split into three different plugins, but for implementation pur-

poses it is kept as a single unit. Further, since the Unix domain sockets may

be backed by a file on the disk, it also depends on the File plugin for file path

translation. The Socket plugin assigns a unique id to each end of a socket

connection. In our implementation, the unique id comprises the unique id of the process that originally created the socket file descriptor and a per-process monotonically increasing counter. At the time of checkpoint, the

processes on each end of a socket connection perform a handshake to ex-

change the unique socket id. On restart, this unique socket id is used to find

the current location of the peer process using the publish-subscribe service.

Event plugin

The Event plugin is responsible for checkpointing and restoring the file de-

scriptors used for event notifications. Apart from supporting the older poll

system call (used for monitoring file descriptors), this plugin provides sup-

port for epoll (similar to poll), eventfd (used for event wait/notify mech-

anism from user space), signalfd (used for accepting signals targeted at

the caller), and inotify (used for monitoring file system events) system

calls. Inotify is the most difficult to checkpoint and restart. The desired be-

havior on restart is not well-defined and may be application dependent. For

example, inotify can be used to get a notification when a file has been renamed. Suppose that the file is renamed after a checkpoint. On restart, the file will be

present with a new name and thus won’t be renamed. In this case, it is not

clear if an event notification should be generated or not. The plugin can be

modified to allow the user to specify the default behavior for use with the

application.


5.2 Pid, System V IPC, and Timer Plugins

We have already discussed the Pid plugin as an example of virtualizing the

kernel resource identifiers in Section 3.1.1.

The System V IPC (SysVIPC) plugin supports checkpointing of System V

shared memory, semaphores, and message queues. The operating system

kernel generates an identifier for each System V IPC object. The identifier

may change on restart and thus we need to virtualize it. The SysVIPC plugin

virtualizes these identifiers in a similar manner to the Pid plugin. A virtual

id is generated for each System V IPC object, and a translation table is kept for mapping between virtual and real ids. In addition to virtualizing the re-

source ids, the SysVIPC plugin also needs to checkpoint the associated state

of the System V IPC object. For example, the memory contents of the shared

memory region need to be checkpointed, the semaphore value needs to be

restored, and the message queue needs to be drained on checkpoint and re-

filled on restart. Since these objects are potentially shared between multiple

processes, the plugin performs leader election using the publish-subscribe

mechanism.

Lastly, we discussed the virtualization of clock and timer ids in Sec-

tion 3.1.5. As described there, in addition to virtualizing the resource ids,

application-specific fine-tuning is required to control the behavior of timers

on restart.

5.3 Application-Specific Plugins

The CkptFile plugin is used to provide heuristics for saving the contents of

open files during checkpoint. The plugin can be used to read wildcard pat-

terns from a configuration file for dynamically updating the heuristics. The

File plugin consults the CkptFile plugin for each open file, and the CkptFile plugin responds with whether or not to checkpoint the data of the given file.


The Environ plugin provides heuristics for restoring/updating the process

environment variables after a restart. This is useful for processes that use

environment variables to find the addresses of system services, daemons,

etc. The Environ plugin reads patterns from a configuration file to selectively

update the restarting process’s environment.

The Uniq-Ckpt plugin is responsible for keeping a rolling set of checkpoint

images as configured by the user. It can automatically delete or rename the

older checkpoint images to save disk space.

The Malloc plugin puts wrappers around malloc, free, etc. to avoid dead-

lock inside the malloc library, as explained in Section 4.6.3. The plugin can be

further used to switch to a different malloc implementation for debugging.

The Dlopen plugin provides wrappers for dlopen, dlsym, and dlclose li-

brary calls. The dlopen wrapper is used to ensure atomicity with respect to

checkpointing so that the process doesn’t get checkpointed while the library

is still being initialized. The dlsym wrapper is used to create wrappers for functions that are present in the library being loaded. The dlsym wrapper can return the address of the wrapper function (defined in the plugin) instead of the address of the library function. The wrapper function may then call the real function

in the newly loaded library.

5.4 SSH Connection

The issues involved in checkpointing an SSH session, as discussed in Section 3.1.2, are reviewed here, followed by a description of the solution based on our virtualization scheme. Previous support for distributed checkpointing

covered the common uses of ssh where it is used to launch remote jobs

but not used for active communication. In some HPC environments (e.g.,

Open MPI), this is the default behavior. Remote processes are launched over

SSH, and later establish a simple TCP socket for efficient communication.

This work provides support for active communications over SSH.


Recall that SSH allows two processes to securely communicate over an

insecure network. A user process uses an SSH client process to connect to a

remote SSH server (daemon) process. On creating a secure connection, the

SSH server process (sshd) launches the child process (app2), as shown in

Figure 3.2. The process app1 appears to read and write locally through a

pipe to app2.

The SSH daemon is a privileged process running a certain protocol. In

the process virtualization approach, the plugin must virtualize that protocol.

Further, checkpointing and restarting the privileged SSH daemon by an un-

privileged user is not possible, since the user cannot recreate the privileged

ssh daemon (sshd) on restart.

Launching remote process under checkpoint control

Recall that a process on Node1 launches a remote process on Node2 by

running the SSH client program as ssh Node2 app2. The earlier DMTCP

used a strategy of detecting an exec call that invokes ssh Node2 app2 and

replacing it by ssh Node2 dmtcp_launch app2. Ad hoc code was used

that allowed ssh to create a remote process under checkpoint control, but

it was assumed that the application would then close the SSH connection.

The solution for supporting long-lived SSH connections is shown in Fig-

ure 3.3. In essence, following a process virtualization approach, the SSH

plugin defines a wrapper function around the exec family of system calls.

It then replaces a call by exec to ssh Node2 app2 with a call to:

ssh Node2 dmtcp_launch virt_sshd app2

For technical reasons, the plugin actually creates two auxiliary processes,

virt_ssh and virt_sshd. (The code for these processes is part of the

SSH plugin, which arranges for them to run as separate processes.) These

processes also allow us to recreate the SSH connection on restart — even

in the less common situations where the app1 process has exited, leaving a

child of app1 to continue to employ the SSH connection from Node1.


Checkpoint

At the time of checkpoint, only processes app1, app2, virt_ssh, and

virt_sshd are checkpointed. The ssh and sshd processes are not under checkpoint control and are not checkpointed. Further, virt_ssh and

virt_sshd can directly “drain” any in-flight network data that has not yet

reached its destination at the time of checkpoint. Thus, they act as buffers

to hold network data prior to resume or restart. During resume, the drained

data is written directly to the corresponding pipes between the user pro-

cesses and the dmtcp helper processes.

[Figure 5.1: Restoring an SSH connection. The virt_ssh process launched sshd_helper on Node2, which relays stdio between ssh and virt_sshd. The figure shows app1, virt_ssh, and the SSH client (ssh) on Node1, and app2, virt_sshd, the SSH server (sshd), and sshd_helper on Node2, connected over a socket.]

Restart

Figure 5.1 illustrates how the four checkpointed processes are restored dur-

ing restart. The four processes on Node1 and Node2 are restarted via:

ssh Node1 dmtcp_restart <virt_ssh.ckpt> <app1.ckpt>
ssh Node2 dmtcp_restart <virt_sshd.ckpt> <app2.ckpt>


Note that in the general case, Node1 and Node2 may both have been remote

nodes. Next, an SSH connection must be created between the two processes,

virt_ssh and virt_sshd. To accomplish this, the virt_ssh will use

publish/subscribe to discover the address of the virt_sshd process. Next,

virt_ssh will fork a child process, which “execs” into the following pro-

gram:

ssh Node2 sshd_helper <virt_sshd address>

Finally, the sshd_helper process will relay the data of its stdio pipes

from the SSH server process through stdio pipes to the virt_sshd pro-

cess. The sshd_helper process exits when the virt_sshd process exits.

The sshd_helper process is never part of any subsequent checkpoint.

5.5 Batch-Queue Plugin for Resource Managers

One of the long-standing functionality requirements for batch-queue man-

agers at various HPC centers is the ability to suspend a low-priority job to allow execution of a high-priority job as soon as it arrives. While there

have been MPI-specific solutions to support this use-case (see Section 2.1.2),

they have not been integrated into the batch-queue systems due to their lack of

complete functionality. The batch-queue plugin by Polyakov [93] solves this

problem by providing a native checkpoint-restart facility that can be embed-

ded in the batch-queue itself.

The goal of the batch-queue plugin is to recreate the original parallel

computation in a transparent manner. This mechanism is invisible both to

any resource manager and to the MPI libraries themselves. During restart,

the batch-queue plugin must adapt to a new execution environment created

by the resource manager at that time. The plugin must detect the newly

available nodes during restart, and arrange for launching the restarted user

processes onto appropriate nodes. Issues specific to a resource manager may


arise during this process, such as the creation by the resource manager of a

new read-only nodefile that is inconsistent with the pre-checkpoint version

(see below).

Recall that modern resource management (RM) systems allocate resources

for jobs, which are then launched in the background in a non-interactive mode. Although the RM systems don't intervene much in a program's execution (except for PMI; see an example below), they do modify part of its execution environment. For example, some of them redirect a program's standard in-

put, output and error to special files, and later move those files to the user’s

working directory once the program is finished or killed. They also provide

services for remote launch of programs such as tm_spawn for TORQUE PBS,

lsb_launch() for Load Sharing Facility (LSF), and even standalone commands

such as srun for SLURM.

The batch-queue plugin can handle the new execution environment dur-

ing restart. It detects the available nodes, and launches the restarting pro-

cesses onto the nodes as required. The new program may not have per-

missions to overwrite some environment files (e.g., nodefile) and may need

to update these file descriptors to point to copies of the files saved during

checkpoint.

We next discuss some of the virtualization strategies provided by the

batch-queue plugin.

Support for batch system remote launch mechanism

To fully support parallel programs in modern RM systems, the remote child

processes should be automatically placed under checkpoint control. For all

supported batch systems this plugin uses the same technique to provide this

service: it patches the command line passed to the remote launch mecha-

nism by adding a prefix, dmtcp_launch <options>. For example, in the case

of TORQUE PBS, a wrapper for tm_spawn updates the passed arguments to

insert the dmtcp_launch command.
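A minimal sketch of the argv patching is shown below; it is written as a generic helper rather than as the tm_spawn wrapper itself, since the wrapper's exact signature is batch-system specific, and the function name is illustrative.

#include <stdlib.h>

/* Build a new NULL-terminated argument vector with "dmtcp_launch"
 * (plus, optionally, its options) prepended to the original command
 * line.  A wrapper around the batch system's remote-launch call
 * (tm_spawn, lsb_launch, srun, ...) would pass the patched vector to
 * the real function.  The caller frees the returned array. */
char **prepend_dmtcp_launch(char *const orig_argv[])
{
    int argc = 0;
    while (orig_argv[argc] != NULL)
        argc++;

    char **new_argv = malloc((argc + 2) * sizeof(char *));
    new_argv[0] = (char *)"dmtcp_launch";
    for (int i = 0; i <= argc; i++)      /* copies the trailing NULL too */
        new_argv[i + 1] = orig_argv[i];
    return new_argv;
}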


Communication between Batch Systems and the Application

A common issue for any resource manager is the binding of stdin/out/err

to files. Those files must be saved in the checkpoint image, for the sake of

consistency and transparency. At restart time, the plugin must discover the

bindings of stdin/out/err to the new files created by the resource manager.

Any saved content from prior to checkpoint must be written into those files.

Batch systems usually communicate with applications using special en-

vironment variables. Some batch systems use auxiliary files in addition to

the environment variables. For example, TORQUE saves a list of its allocated

nodes into a read-only nodefile, which can be cached by the application. But

at restart time, a new read-only nodefile will be generated, different from

the one cached by the application. To address this situation, the batch-queue

plugin creates a temporary file containing the original nodefile contents and

modifies the file descriptor of the restarted application to point to this alter-

nate nodefile.

Communication between MPI Application and External PMI Interface

Most modern MPI implementations use or support the Process Management

Interface (PMI) [14]. The PMI model comprises three entities: the MPI li-

brary, PMI library and the process manager. Currently there are several im-

plementations of process manager entities, including the standalone Hydra

package, and the PMI server of the SLURM resource manager.

While the multi-host capable Socket plugin transparently supports the

Hydra implementation, additional plugin support is needed to integrate the

SLURM PMI implementation. SLURM requires an MPI process to commu-

nicate with the SLURM job step daemon, which is not under checkpoint

control. In this case, an batch-queue plugin finalizes PMI session before

checkpointing and recreates it afterward.


Specialized peer-discovery and remote launch service

The processes may be restarted on different nodes. The number of slots

(number of processes per node) may be different for the new nodes. The

batch-queue plugin employs a node discovery tool to find the new nodes

and to map old resources to the newly allocated node set. For the TORQUE RM, the plugin analyzes the new nodefile, and for SLURM it parses the SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE environment variables. After this step, the resource allocation is available in an RM-independent format. Next, the old resources are mapped onto the new ones. Once the resources

have been mapped, the application is launched using the appropriate RM

system mechanism. The mapping algorithm should consider the slots when

matching resources between the old and new sets. It should be noted that

the processes that were launched on the head node of a cluster usually have

a special environment (special stdin/out/err connections and access to the

nodefile) and may need special treatment.

5.6 Ptrace Plugin

The ptrace system call is used by a superior process (e.g., gdb, strace,

etc.) to attach to an inferior process (e.g., a.out) in order to trace it. The

ptrace system call uses CPU hardware support, making it harder to check-

point. The inferior process can’t perform a checkpoint until it is detached or

allowed to run freely during the checkpoint phase. A ptrace plugin is used

to solve these problems [127]. The ptrace plugin in the superior process de-

taches the inferior process before checkpointing and re-attaches right after

restart.

The ptrace plugin in the inferior process has an added responsibility. It

is often the case that the inferior threads are quiesced while they are in

possession of a system resource, or while executing a critical section in the

code. This can result in a deadlock. To fix this, the ptrace plugin forces the


user threads to release resources before entering a quiescent state. This is

done by using Pre/Post-Quiesce event notifications. Pre-Quiesce is generated

by the user thread just before entering the quiescent state. While processing

this hook, each thread ensures that it is not holding any system resources,

locks, etc. that can result in a deadlock. The Post-Quiesce phase forces the

inferior thread to wait until the superior can attach to it after restart.

5.7 Deterministic Record-Replay

The record-replay plugin is needed by any reversible debugger that uses

checkpoint, restart, and re-execute. FReD (Fast Reversible Debugger) [112] can add reversibility to any debugger by using a checkpoint, restart, and re-

execute strategy. FReD uses DMTCP for checkpointing. Deterministic record-

replay for FReD was achieved by creating a record-replay plugin to be used

with DMTCP. This plugin is generally placed before any other plugin in the

plugin hierarchy, to allow it to “hijack” library calls. Due to its complex-

ity, the record-replay plugin is the largest plugin in terms of lines of code

(see Table 5.2).

There are several potential sources of nondeterminism in program ex-

ecution, and record-replay must address all of them: thread interleaving,

external events (I/O, etc.), and memory allocation. While correct replay of

external events is required for all kinds of programs, memory accuracy is of-

ten not an issue for higher-level languages like Python and Perl, which do

not expose the underlying heap to the user’s program.

FReD handles all these aspects by wrapping various system calls. Rele-

vant events are captured by interposing on library calls using dlopen/dlsym

for creating function wrappers for interesting library functions. The wrap-

pers record events into the log on the first execution and then return the

appropriate values (or block threads as required) on replay.

We start recording when directed by FReD (often after the first check-


point). The system records the events related to thread-interleaving, exter-

nal events, and memory allocation into a log. On replay, it ensures that the

events are replayed in the same order as they were recorded. The plugin

guarantees deterministic replay — even when executing on multiple cores

— so long as the program is free of data races.

Thread interleaving

FReD uses wrappers around library calls such as pthread_mutex_lock and pthread_mutex_unlock to enforce the cor-

rect thread interleaving during replay. Apart from the usual pthread_xxx

functions, some other functions that can enforce a certain interleaving are

blocking functions like read. For example, a thread can signal another

thread by writing into the write-end of a pipe when the other thread is do-

ing a blocking read on the read-end of the pipe.

Replay of external events

Applications typically interact with the outside world as part of their execu-

tion. They also interact with the debugger and the user, as part of the debug-

ging process. Composite debugging requires separating these streams. For

debuggers that trace a program in a separate process, the I/O by the process

being debugged is recorded and replayed whereas the I/O by the debugger

process is ignored.

For interpreted languages, the situation becomes trickier as the record-

replay plugin cannot differentiate between the debugger I/O and the appli-

cation I/O. FReD handles this situation heuristically. It designates the stan-

dard input/output/error file descriptors as pass-through devices. Activity on

the pass-through devices is ignored by the record-replay component.


Memory accuracy

One important feature of FReD is memory-accuracy: the addresses of ob-

jects on the heap do not change between original execution and replay. This

is important because it means that developers can use address literals in

expression watchpoints (assuming they are supported by the underlying de-

bugger).

With true replay of the application program, one would expect the memory layout to match the record phase, but the DMTCP libraries have to perform different actions during a normal run and on restart. This results in some

memory allocation/deallocations originating from DMTCP libraries that can

alter the memory layout. Another cause for the change in memory layout

is the memory allocated by the operating system kernel when the process

doesn’t specify a fixed address. An example is the mmap system call without

any address hint. In this case, the kernel is free to choose any address for

the memory region.

Memory-accuracy is accomplished by logging the arguments, as well as

the return values of mmap, munmap, etc. on record. On replay, the real

functions or system calls are re-executed in the exact same order. However,

the record-replay plugin provides a hint to the kernel to obtain the same

memory address as was received at record-time. FReD handles any conflicts

caused by memory allocation/deallocation originating from DMTCP itself by

forcing use of a separate allocation arena for DMTCP requests.
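A sketch of how the mmap wrapper can enforce this is given below. The helpers in_replay_mode, log_next_mmap_addr, and log_record_mmap are illustrative names standing in for the plugin's log machinery.

#include <sys/mman.h>
#include <sys/types.h>

/* Illustrative log machinery. */
int   in_replay_mode(void);
void *log_next_mmap_addr(void);
void  log_record_mmap(void *addr, size_t length, int prot, int flags,
                      int fd, off_t offset);

/* Record-replay wrapper around mmap (sketch).  On record, the returned
 * address is appended to the log; on replay, the recorded address is
 * passed back to the kernel as a hint so the region lands at the same
 * place it occupied during the original execution. */
void *mmap_wrapper(void *addr, size_t length, int prot, int flags,
                   int fd, off_t offset)
{
    if (in_replay_mode()) {
        void *recorded = log_next_mmap_addr();
        /* MAP_FIXED could be used instead of a hint if the region is
         * known to be free, at the cost of silently clobbering
         * anything already mapped there. */
        return mmap(recorded, length, prot, flags, fd, offset);
    }
    void *ret = mmap(addr, length, prot, flags, fd, offset);
    log_record_mmap(ret, length, prot, flags, fd, offset);
    return ret;
}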

5.8 Checkpointing Networks of Virtual

Machines

Garg et al. [43] used DMTCP and plugins to provide a generic checkpoint-

restart mechanism for three cases of virtual machines: user-space (stan-

dalone) QEMU [121], KVM/QEMU [114], and Lguest [115]. In all three


cases, the hypervisor (VMM — virtual machine monitor) was based on Linux

as the host operating system. These examples cover three distinct virtual-

ization scenarios: entirely user-space virtualization (QEMU), full virtualiza-

tion using a Linux kernel driver (KVM/QEMU), and paravirtualization using

a Linux kernel driver [115].

The user-space QEMU virtual machine did not require any specific plugin.

The KVM/QEMU and Lguest virtual machines required a new plugin consist-

ing of approximately 200 lines of code. In addition, the kernel driver from

Lguest required an additional 40 lines of new code to support checkpoint-

restart capability. The authors estimated the implementation time at approx-

imately five to ten person days. This is in contrast with the number of lines

of code required for libvirt.

Garg et al. [44] further implemented the first system to checkpoint a

network of virtual machines by virtualizing the tun/tap interface using a

plugin. The tun plugin consisted of approximately 350 lines of code.

5.9 3-D Graphic: Support for Programmable

GPUs in OpenGL 2.0 and Higher

Kazemi Nafchi et al. [62] describe a mechanism for transparently check-

pointing hardware-accelerated 3D graphics. The approach is based on DMTCP

with a plugin to record-prune-replay of OpenGL library calls. The calls not

relevant to the last graphics frame prior to checkpointing is discarded. The

remaining OpenGL calls are replayed on restart. The plugin uses approxi-

mately 4,500 lines of code.

Previously, Lagar-Cavilla et al. [69] presented VMGL for vector-independent

checkpoint restart. VMGL used a shadow device driver for OpenGL, which

shadows most OpenGL calls to model OpenGL state, and restores it when

restarting from a checkpoint. The code to maintain OpenGL state was ap-


proximately 78,000 lines of code.

Further, the new plugin has added functionality. Lagar-Cavilla et al. supported only OpenGL 1.5 (fixed-pipeline functionality). The approach of the

new plugin was demonstrated to apply to programmable GPUs (OpenGL 2.0

and beyond).

5.10 Transparent Checkpointing of InfiniBand

The InfiniBand plugin by Cao et al. [27] is the first to support checkpoint-

restart of a native InfiniBand network. Previous checkpoint-restart systems [55]

were MPI-specific. This plugin provides support for checkpointing UPC, an

example of a PGAS language, which runs more efficiently when it runs na-

tively over the InfiniBand fabric (instead of on top of an MPI layer). For

applications such as these, there is no alternative solution.

Compared to approximately 3,000 lines of code for the InfiniBand plugin,

the checkpoint-restart functionality in Open MPI uses approximately 17,000

lines of code (without counting the InfiniBand-specific code). This is in ad-

dition to the single-process checkpointer, BLCR, that is used by Open MPI.

5.11 IB2TCP: Migrating from InfiniBand to TCP

Sockets

Some traditional checkpoint-restart services, such as that for Open MPI [55],

offer the ability to checkpoint over one network, and restart on a second net-

work. This is especially useful for interactive debugging. A set of checkpoint

images from an InfiniBand-based production cluster can be copied to an

Ethernet/TCP-based debug cluster. Thus if a bug is encountered after run-

ning for hours on the production cluster, the most recent checkpoints can

be used to restart on the debug cluster under a symbolic debugger, such as

GDB.


The IB2TCP plugin enables checkpointing over InfiniBand and restart-

ing over Ethernet in a similar fashion. An important contribution of the IB2TCP plugin [27] is that, unlike the BLCR kernel-based approach, the DMTCP/IB2TCP approach supports using an Ethernet-based cluster that runs a different Linux kernel, something that occurs frequently in practice. Further, the IB2TCP plugin can be used with or without the InfiniBand plugin (although with limited support for checkpointing in the latter case).

CHAPTER 6

Tesseract: Reconciling Guest I/O

and Hypervisor Swapping in a VM

The previous chapters were concerned with adaptive plugins, a virtualiza-

tion mechanism that decoupled the application process from the execution

environment to facilitate transparent checkpoint-restart. In this chapter, I

will present a virtualization mechanism that decouples the guest virtual disk

from the guest operating system to prevent redundant I/O operations be-

tween the guest and the hypervisor.

Guests running in virtual machines read and write state between their

memory and virtualized disks. Hypervisors such as VMware ESXi [57] like-

wise may page guest memory to and from a hypervisor-level swap file to

reclaim memory. To distinguish these two cases, we refer to the activity

within the guest OS as paging and that within the hypervisor as swapping.

In overcommitted situations, these two sets of operations can result in a

two-level scheduling anomaly known as “double paging”. Double-paging

occurs when the guest attempts to page out memory that has previously

been swapped out by the hypervisor and leads to long delays for the guest

as the contents are read back into machine memory only to be written out

again (see Sections 6.1 and 6.2). While the double-paging anomaly is well

known [46, 48, 47, 128, 82], its impact on real workloads is not established.



Our approach addresses the double-paging problem directly in a man-

ner transparent to the guest (see Section 6.3). First, the virtual machine is

extended to track associations between guest memory and either blocks in

guest virtual disks or in the hypervisor swap file. Second, the virtual disks

are extended to support a mechanism to redirect virtual block requests to

blocks in other virtual disks or the hypervisor swap file. Third, the hyper-

visor swap file is extended to track references to its blocks. Using these

components to restructure guest I/O requests, we eliminate the main effects

of double-paging by replacing the original guest operations with indirections

between the guest and swap stores. An important benefit of this approach

is that where hypervisors typically attempt to avoid swapping pages likely

to be paged out by the guest, the two levels may now cooperate in selecting

pages since the work is complementary.

We have prototyped our approach on the VMware Workstation [56] plat-

form enhanced to explicitly swap memory in and out. While the current

implementation focuses on deduplicating guest I/Os for contents stored in

the hypervisor swap file, it is general enough to also deduplicate redundant

contents between guest I/Os themselves or between the hypervisor swap file

and guest disks (see Section 6.4).

In Section 6.5, we also show the impact of an unexpected side-effect of

our solution: a loss of locality caused by indirections to the hypervisor swap file, which can substantially slow down subsequent guest I/Os. Finally, we describe techniques to detect this loss of locality and to recover it. These techniques isolate the expensive costs of the double-paging effect and make them asynchronous with respect to the guest.

In Section 6.6, we present results using a synthetic benchmark that show,

for the first time, the cost of the double-paging problem. Finally, in Sec-

tion 6.7, we discuss related work.


[Figure 6.1: Some cases of redundant I/O in a virtual machine. Each panel shows the guest physical memory (PPN), the guest virtual disk, and the host paging device: (a) host swap out followed by guest disk read; (b) host swap out followed by the guest overwriting the entire page; (c) host swap out of an unmodified guest page; (d) host swap out followed by guest disk write (double-paging).]

6.1 Redundant I/O

Figure 6.1 shows some examples of redundant I/O resulting from bad in-

teraction between hypervisor swapping and guest I/O. In Figure 6.1a, a hypervisor swap-out is followed by the guest overwriting the entire page with a disk read. From the hypervisor's point of view, the guest has ac-

cessed the page, and so it unnecessarily swaps in the guest page. Similarly,

in Figure 6.1b, the host swap out is followed by the guest zeroing out the

entire page. Here again, the hypervisor swap in is wasteful. In Figure 6.1c,

the guest reads a page from the disk into its physical memory. The page

is “clean”, i.e., the contents have not been modified by the guest. However,

when under memory pressure, the hypervisor tries to swap out this page

as well. Ideally, the hypervisor could have discarded the page contents and


later restored them from the guest disk if needed. Finally, in Figure 6.1d, the

guest tries to page out a page that is already swapped out by the host. This

is the case of double-paging.

The first two cases (Figures 6.1a and 6.1b) have already been addressed in some commercial products such as the VMware ESX hypervisor. Further, concurrent work by Amit et al. [5] implements solutions for the first three cases (using mmap structures as the remapping mechanism or boundary in Linux) but ignores the fourth. Tesseract addresses the first two cases (Figures 6.1a and 6.1b) along with the double-paging case (Figure 6.1d). In addition, it can serve as a basis for the third case (Figure 6.1c) and a fifth case: a guest write followed by another guest write.

6.2 Motivation: The Double-Paging Anomaly

Tesseract has four objectives. First, to extend VMware’s hosted platforms,

Workstation and Fusion, to explicitly manage how the hypervisor pages out

memory so that its swap subsystem can employ many of the optimizations

used by the ESX platform. Second, to prototype the mechanisms needed

to identify redundant I/Os originating from the guest and virtual machine

monitor (VMM) and eliminate these. Third, to use this prototype to justify

restructuring the underlying virtual disks of VMs to support this optimiza-

tion. Finally, to simplify the hypervisor’s memory scheduler so that it need

not avoid paging out memory that the guest may also decide to page out. To address

these, the project initially focused on the double-paging anomaly.

One of the tasks of the hypervisor is to allocate and map host (or ma-

chine) memory to the VMs it is managing. Likewise, one of the tasks of

the guest operating system in a VM is to manage the guest physical address

space, allocating and mapping it to the processes running in the guest. In

both cases, either the set of machine memory pages or the set of guest phys-


ical pages may be oversubscribed.

In overcommitted situations, the appropriate memory scheduler must

repurpose some memory pages. For example, the hypervisor may reclaim

memory from a VM by swapping out guest pages to the hypervisor-level

swap file. Having preserved the contents of those pages, the underlying ma-

chine memory may be used for a new purpose. Likewise, the guest OS may reclaim memory within a VM to allow a guest physical page to be used by a new

virtual mapping.

As hypervisor-level memory reclamation is transparent to the guest OS,

the latter may choose to page out to a virtualized disk pages that were

already swapped by the hypervisor. In such cases, the hypervisor must synchronously allocate machine pages to hold the contents and read the already

swapped contents back into that memory so they can be saved, in turn, to

the guest OS’s swap device. This multi-level scheduling conflict is called

double-paging.

Figure 6.2 illustrates the double-paging problem. Suppose the hypervisor

decides to reclaim a machine page (MPN) that is backing a guest physical

page (PPN). In step 1, the mapping between the PPN and MPN is invalidated

and, in step 2, the contents of MPN is saved to the hypervisor’s swap file.

Suppose the guest OS later decides to reallocate PPN for a new guest virtual

mapping. It, in turn, in step 3a invalidates the guest-level mappings to that

PPN and initiates an I/O to preserve its contents in a guest virtual disk (or

guest VMDK). In handling the guest I/O request, the hypervisor must ensure

that the contents to be written are available in memory. So, in step 4, the

hypervisor faults the contents into a newly allocated page (MPN2) and, in

step 5, establishes a mapping from PPN to MPN2. This sequence puts extra

pressure on the hypervisor memory system and may further cause additional

hypervisor-level swapping as a result of allocating MPN2. In step 6, the guest

OS completes the I/O by writing the contents of MPN2 to the guest VMDK.

Finally, the guest OS is able to zero the contents of the new MPN so that the


PPN that now maps to it can be used for a new virtual mapping in step 7.

Figure 6.2: An example of double-paging. Steps: (1), (2) swap out; (3a, 3b) guest block write request; (4) memory allocation and swap in; (5) establish PPN to MPN mapping; (6) write block to guest disk; (7) zero the new MPN for reuse.

A hypervisor has no control over when a virtualized guest may page

memory out to disk, and may even employ reclamation techniques like bal-

looning [128] in addition to hypervisor-level swapping. Ballooning is a tech-

nique that co-opts the guest into choosing pages to release back to the plat-

form. It employs a guest driver or agent to allocate, and often pin, pages

in the guest’s physical address-space. Ballooning is not a reliable solution in

overcommitted situations since it requires guest execution to choose pages

and release memory and the guest is unaware of which pages are backed

by MPNs. Hypervisors that do not also page risk running out of memory.

While preferring ballooning, VMware uses hypervisor swapping to guaran-

tee progress. Because levels of overcommitment vary over time, hypervisor swapping may interleave with the guest itself paging under pressure from ballooning. This can lead to double-paging.

The double-paging problem also impacts hypervisor design. Citing the

potential effects of double-paging, some [82] have advocated avoiding the

use of hypervisor-level swapping completely. Others have attempted to mit-

igate the likelihood through techniques such as employing random page

selection for hypervisor-level swapping [128] or employing some form of

paging-aware paravirtualized interface [48, 47]. For example, VMware’s

scheduler uses heuristics to find “warm” pages to avoid paging out what

the guest may also choose to page out. These heuristics have extended ef-

fects, for example, on the ability to provide large (2MB) mappings to the

guest. Our goals are to address the double-paging problem in a manner that is transparent to the guest running in the VM, to identify and elide unnecessary intermediate steps such as steps 4, 5, and 6 in Figure 6.2, and to simplify hypervisor scheduling policies. Although we do not demon-

strate that double-paging is a problem in real workloads, we do show how

its effects can be mitigated.

6.3 Design

We now describe our prototype’s design. First, we describe how we extended

the hosted platform to behave more like VMware’s server platform, ESX.

Next, we outline how we identify and eliminate redundant I/Os. Finally, we

describe the design of the hypervisor swap subsystem and the extensions to

the virtual disks to support indirections.

6.3.1 Extending The Hosted Platform To Be Like ESX

VMware supports two kinds of hypervisors: the hosted platform in which

the hypervisor cooperatively runs on top of an unmodified host operating

system such as Windows or Linux, and ESX where the hypervisor runs as

the platform kernel, the vmkernel. Two key differences between these two


platforms are how memory is allocated and mapped to a VM, and where the

network and storage stacks execute.

In the existing hosted platform, each VM’s device support is managed in

the vmx, a user-level process running on the host operating system. Privi-

leged services are mediated by the vmmon device driver loaded into the host

kernel, and control is passed between the vmx and the VMM and its guest

via vmmon. An advantage of the hosted approach is that the virtualization

of I/O devices is handled by libraries in the vmx and these benefit from the

device support of the underlying host OS. Guest memory is mmapped into

the address space of the vmx. Memory pages are exposed to the VMM and guest

by using the vmmon device driver to pin the pages in the host kernel and

return the MPNs to the VMM. By backing the mmapped region for guest

memory with a file, hypervisor swapping is a simple matter of invalidating

all mappings for the pages to be released in the VMM, marking, if necessary,

those pages as dirty in the vmx’s address space, and unpinning the pages on

the host.

In ESX, network and storage virtual devices are managed in the vmker-

nel. Likewise, the hypervisor manages per-VM pools of memory for backing

guest memory. To page memory out to the VM’s swap file, the VMM and

vmkernel simply invalidate any guest mappings and schedule the pages’ con-

tents to be written out. Because ESX explicitly manages the swap state for

a VM including its swap file, it is able to employ a number of optimizations

unavailable on the current hosted platform. These optimizations include the

capturing of writes to entire pages of memory [4], and the cancellation of

swap-ins for swapped-out guest PPNs that are targets for disk read requests.

The first optimization is triggered when the guest accesses an unmapped

or write-protected page and faults into the VMM. At this point, the guest’s

instruction stream is analyzed. If the page is shared [128] and the effect

of the write does not change the content of the page, page-sharing is not

broken. Instead, the guest’s program counter is advanced past the write and


it is allowed to continue execution. If the guest’s write is overwriting an

entire page, one or both of two actions are taken. If the written pattern is

a known value, such as repeated 0x00, the guest may be mapped a shared

page. This technique is used, for example, on Windows guests because Win-

dows zeroes physical pages as they are placed on the freelist. Linux, which

zeroes on allocation of a physical page, is simply mapped a writeable zeroed

MPN. Separately, any pending swap-in for that PPN is cancelled. Since the

most common case is the mapping of a shared zeroed-page to the guest, this

optimization is referred to as the PShareZero optimization.

The second optimization is triggered by interposition on guest disk read

requests. If a read request will overwrite whole PPNs, any pending swap-ins

associated with those PPNs are deferred during write-preparation, the pages

are pinned for the I/O, and the swap-ins are cancelled on successful I/O

completion.

We have extended Tesseract so that its guest-memory and swap mecha-

nisms behave more like those of ESX. Instead of mmapping a pagefile to pro-

vide memory for the guest, Tesseract’s vmx process mmaps an anonymously-

backed region of its address space, uses madvise to mark the range as NOT-

NEEDED, and explicitly pins pages as they are accessed by either the vmx or

by the VMM. Paging by the hypervisor becomes an explicit operation, read-

ing from or writing to an explicit swap file. In this way, we are able to also

employ the above optimizations on the hosted platform. We consider these

as part of our baseline implementation.
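As a rough illustration of this scheme, the following minimal user-space sketch shows one plausible Linux realization. It is not the actual vmx code: the helper names are hypothetical, and the use of MADV_DONTNEED and mlock here stands in for the madvise call and the vmmon-based pinning described above.

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    /* Reserve anonymously-backed, initially-unpinned memory for guest RAM. */
    static void *reserve_guest_memory(size_t bytes) {
        void *base = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        madvise(base, bytes, MADV_DONTNEED);   /* range is "not needed" until pinned */
        return base;
    }

    /* Pin a guest page when the vmx or the VMM first accesses it. */
    static int pin_guest_page(void *base, unsigned long ppn) {
        return mlock((char *)base + ppn * PAGE_SIZE, PAGE_SIZE);
    }

    /* Explicit swap-out: the contents are assumed to have been written to the
     * hypervisor swap file already; here we only unpin and drop the backing. */
    static int release_guest_page(void *base, unsigned long ppn) {
        char *page = (char *)base + ppn * PAGE_SIZE;
        if (munlock(page, PAGE_SIZE) != 0)
            return -1;
        return madvise(page, PAGE_SIZE, MADV_DONTNEED);
    }

    int main(void) {
        void *guest = reserve_guest_memory(64 * PAGE_SIZE);
        if (guest == NULL)
            return 1;
        pin_guest_page(guest, 3);                            /* guest touches PPN 3 */
        memset((char *)guest + 3 * PAGE_SIZE, 0xAB, PAGE_SIZE);
        release_guest_page(guest, 3);                        /* explicit "swap out"  */
        return 0;
    }

The key design point captured here is that paging becomes an explicit decision of the hypervisor rather than a side effect of host-OS memory management.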

6.3.2 Reconciling Redundant I/Os

Tesseract addresses the double-paging problem transparently to the guest al-

lowing our solution to be applied to unmodified guests. To achieve this goal,

we employ two forms of interposition. The first tracks writes to PPNs by the

guest and is extended to include a mechanism to track valid relationships


between guest memory pages and disk blocks that contain the same state.

The second exploits the fact that the hypervisor interposes on guest I/O re-

quests in order to transform the requests’ scatter-gather lists. In addition,

we modify the structure of the guest VMDKs and the hypervisor swap file,

extending the former to support indirections from the VMDKs into the hy-

pervisor swap disk. Finally, when the guest reallocates the PPN and zeroes

its contents, we apply the PShareZero optimization in step 7 in Figure 6.2.

In order to track which pages have writable mappings in the guest, MPNs

are initially mapped into the guest read-only. When written by the guest, the

resulting page-fault allows the hypervisor to track that the guest page has

been modified. We extend this same tracking mechanism to also track when

guest writes invalidate associations between guest pages in memory and

blocks on disk. The task is simpler when the hypervisor, itself, modifies guest

memory since it can remove any associations for the modified guest pages.

Likewise, virtual device operations into guest pages can create associations

between the source blocks and pages. In addition, the device operations may

remove prior associations when the underlying disk blocks are written. This

approach, employed for example to speed the live migration of VMs from

one host to another [87], can efficiently track which guest pages in memory

have corresponding valid copies of their contents on disks.

The second form of interposition occurs in the handling of virtualized

guest I/O operations. The basic I/O path can be broken down into three

stages. The basic data structure describing an I/O request is the scatter-

gather list, a structure that maps one or more possibly discontiguous mem-

ory extents to a contiguous range of disk sectors. In the preparation stage,

the guest’s scatter-gather list is examined and a new request is constructed

that will be sent to the underlying physical device. It is here that the unmod-

ified hypervisor handles the faulting in of swapped out pages as shown in

steps 4 and 5 of Figure 6.2. Once the new request has been constructed, it is

issued asynchronously and some time later there is an I/O completion event.
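For concreteness, the following sketch (illustrative types only; the field names are assumptions, not VMware's actual structures) captures the shape of a scatter-gather list and of a request moving through the prepare, issue, and completion stages just described:

    #include <stddef.h>
    #include <stdint.h>

    /* One memory extent taking part in a disk transfer. */
    struct sg_elem {
        uint64_t ppn;       /* guest physical page backing the extent */
        uint32_t offset;    /* byte offset within that page           */
        uint32_t length;    /* extent length in bytes                 */
    };

    /* A scatter-gather list: possibly discontiguous memory extents mapped
     * onto one contiguous range of disk sectors.                          */
    struct sg_list {
        uint64_t        start_sector;   /* first sector of the disk range */
        size_t          nelems;
        struct sg_elem *elems;
    };

    typedef void (*io_done_fn)(struct sg_list *sg, int status);

    /* One virtual-device request on its way through the three stages:
     * prepare (rewrite the guest list), asynchronous issue, completion. */
    struct io_request {
        struct sg_list *guest_sg;      /* list as produced by the guest     */
        struct sg_list *physical_sg;   /* list as sent to the physical disk */
        io_done_fn      on_complete;   /* invoked at the completion event   */
    };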


To support the elimination of I/Os to and from virtual disks and the hy-

pervisor block-swap store (or BSST), each guest VMDK has been extended

to maintain a mapping structure allowing its virtual block identifiers to refer

to blocks in other VMDKs. Likewise, the hypervisor BSST has been extended

with per-block reference counts to track whether blocks in the swap file are

accessible from other VMDKs or from guest memory.

The tracking of associations and interposition on guest I/Os allows four

kinds of I/O elisions:

swap - guest-I/O: a guest I/O follows the hypervisor swapping out a page's contents (Figures 6.1a and 6.1d)

swap - swap: a page is repeatedly swapped out to the BSST with no intervening modification

guest-I/O - swap: the case in which the hypervisor can take advantage of prior guest reads or writes to avoid writing redundant contents to the BSST (Figure 6.1c)

guest-I/O - guest-I/O: the case in which guest I/Os can avoid redundant operations based on prior guest operations whose results are known to reside in memory (for reads) or in a guest VMDK (for writes)

For simplicity, Tesseract focuses on the first two cases since these capture the

case of double-paging. Because Tesseract does not introspect on the guest,

it cannot distinguish guest I/Os related to memory paging from other kinds

of guest I/O. But the technique is general enough to support a wider set

of optimizations such as disk deduplication for content streamed through a

guest. It also complements techniques that eliminate redundant read I/Os

across VMs [82].


Figure 6.3: Double-paging with Tesseract.

6.3.3 Tesseract’s Virtual Disk and Swap Subsystems

Figure 6.3 shows our approach embodied in Tesseract. The hypervisor swaps

guest memory to a block-swap store (BSST) VMDK, which manages a map

from guest PPNs to blocks in the BSST, a per-block reference-counting mech-

anism to track indirections from guest virtual disks, and a pool of 4KB disk

blocks. When the guest OS writes out a memory page that happens to be

swapped out by the hypervisor, the disk subsystem detects this condition

while preparing to issue the write request. Rather than bringing memory

contents for the swapped out page back to memory, the hypervisor updates

the appropriate reference counts in the BSST, issues the I/O, and updates metadata in the guest VMDK, adding a reference to the corresponding disk block in the BSST.

Figure 6.4 shows timelines for the scenario when the guest OS is paging out

an already swapped page with and without Tesseract. With Tesseract we are

able to eliminate the overheads of a new page allocation and a disk read.

Figure 6.4: Write I/O and hypervisor swapping. (a) Baseline (without Tesseract); (b) with Tesseract.

To achieve this, Tesseract modifies the I/O preparation and I/O completion steps. For write requests, the memory pages in the scatter-gather list are

checked for valid associations to blocks in the BSST. If these are found, the

target VMDK’s mapping structure is updated for those pages’ corresponding

virtual disk blocks to reference the appropriate blocks in the BSST and the

reference counts of these referenced blocks in the BSST are incremented. For

read requests, the guest I/O request may be split into multiple I/O requests

depending on where the source disk blocks reside.

Consider the state of a guest VMDK and the BSST as shown in Fig-

ure 6.5a. Here, a guest write operation wrote five disk blocks in which

two were previously swapped to the BSST. In this example, block 2 still con-

tains the swapped contents of some PPN and has a reference count reflecting

this fact and the guest write. Hence, its state has “swapped” as true and a

reference count of 2. Similarly, block 4 only has a nonzero reference count

because the PPN whose swapped contents originally created the disk block

has since been accessed and its contents paged back in. Hence, its state has

“swapped” as false and a reference count of 1. To read these blocks from

the guest VMDK now requires three read operations: one against the guest

VMDK and two against the BSST. The results of these read operations must

then be coalesced in the read completion path.

One can view the primary cost of double-paging in an unmodified hy-

pervisor as impacting the write-preparation time for guest I/Os. Likewise,

one can view the primary cost of these cases in Tesseract as impacting the

read-completion time. To mitigate these effects, we consider two forms of

defragmentation. Both strategies make two assumptions:


• the original guest write I/O request captures the guest's notion of expected locality, and

• the guest is unlikely to immediately read the same disk blocks back into memory.

Figure 6.5: Examples of reference counts with Tesseract and with defragmentation. (a) With Tesseract; (b) with Tesseract and BSST defragmentation; (c) with Tesseract and guest VMDK defragmentation.


Based on these assumptions, we extended Tesseract to asynchronously reor-

ganize the referenced state in the BSST. In Figure 6.5b, we copy the refer-

enced blocks into a contiguous sequence in the BSST and update the guest

VMDK indirections to refer to the new sequence. This approach reduces

the number of split read operations. In Figure 6.5c, we copy the referenced

blocks back to the locations in the original guest VMDK where the guest

expects them. With this approach, the typical read operation need not be

split. In effect, Tesseract asynchronously performs the expensive work that

occurred in steps 4, 5, and 6 of Figure 6.2 eliminating its cost to the guest.

6.4 Implementation

Our prototype extends VMware Workstation as described in Section 6.3.1.

Here, we provide more detail.

6.4.1 Explicit Management of Hypervisor Swapping

VMware Workstation relies on the host OS to handle much of the work as-

sociated with swapping guest memory. A pagefile is mapped into the vmx’s

address space and calls to the vmmon driver are used to lock MPNs backing

this memory as needed by the guest. When memory is released through hy-

pervisor swapping, the pages are dirtied, if necessary, in the vmx’s address

space and unlocked by vmmon. Should the host OS need to reclaim the

backing memory, it does so as if the vmx were any other process: it writes

out the state to the backing pagefiles and repurposes the MPN.

For Tesseract, we modified Workstation to support explicit swapping of

guest memory. First, we eliminated the pagefile and replaced it with a spe-

cial VMDK, the block swap store (BSST) into which swapped-out contents

are written. The BSST maintains a partial mapping from PPNs to disk blocks

tracking the contents of currently swapped-out PPNs. In addition, the BSST maintains a table of reference counts on the blocks in the BSST referenced by other guest VMDKs.

Second, we split the process for selecting pages for swapping from the

process for actually writing out contents to the BSST and unlocking the back-

ing memory. This split is motivated by the fact that having eliminated dupli-

cate I/Os between hypervisor swapping and guest paging, the system should

benefit by both levels of scheduling choosing the same set of pages. The se-

lected swap candidates are placed in a victim cache to “cool down”. Only

the coldest pages are eventually written out to disk. This victim cache is

maintained as a percentage of locked memory by the guest—for our study,

10%. Should the guest access a page in the pool, it is removed from the pool

without being unlocked.
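A minimal sketch of such a victim cache follows (hypothetical names and a fixed capacity; in the prototype the pool is sized as a fraction of the guest's locked memory):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define VICTIM_CAP 1024   /* stand-in for ~10% of the guest's locked pages */

    /* Approximately-FIFO pool of swap candidates allowed to "cool down"
     * before their contents are written to the BSST and their backing freed. */
    struct victim_cache {
        uint64_t ppn[VICTIM_CAP];
        size_t   head, count;
    };

    /* Add a candidate; when the pool is full, the oldest (coldest) page is
     * evicted and returned so the caller can actually swap it out.          */
    static bool victim_add(struct victim_cache *vc, uint64_t ppn, uint64_t *evicted) {
        if (vc->count == VICTIM_CAP) {
            *evicted = vc->ppn[vc->head];
            vc->ppn[vc->head] = ppn;
            vc->head = (vc->head + 1) % VICTIM_CAP;
            return true;                 /* caller writes *evicted to the BSST */
        }
        vc->ppn[(vc->head + vc->count++) % VICTIM_CAP] = ppn;
        return false;
    }

    /* The guest touched the page again: drop it from the pool without
     * unlocking it (the last element takes the vacated slot).            */
    static void victim_remove(struct victim_cache *vc, uint64_t ppn) {
        for (size_t i = 0; i < vc->count; i++) {
            size_t idx = (vc->head + i) % VICTIM_CAP;
            if (vc->ppn[idx] == ppn) {
                vc->ppn[idx] = vc->ppn[(vc->head + vc->count - 1) % VICTIM_CAP];
                vc->count--;
                return;
            }
        }
    }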

When the guest pages out memory, it does so to repurpose a given guest

physical page for a new linear mapping. Since this new use will access that

guest physical page, one may be concerned that this access will force the

page to be swapped in from the BSST first. However, because the guest will

either zero the contents of that page or read into it from disk and because the

VMM can detect that the whole page will be overwritten before it is visible

to the guest, the vmx is able to cancel the swap-in and complete the page

locking operation.

6.4.2 Tracking Memory Pages and Disk Blocks

There are two steps to maintaining a mapping between disk blocks and pages

in memory. The first is recognizing the pages read and written in guest and

hypervisor I/O operations. By examining scatter-gather lists of each I/O,

one can identify when the contents in memory and on disk match. While

we plan to maintain this mapping for all associations between guest disks

and guest memory, we currently only track the associations between blocks

in the BSST and main memory.


The second step is to track when these associations are broken. For guest

memory, this event happens when the guest modifies a page of memory. The

VMM tracks when this happens by trapping the fact that a writable mapping

is required and this information is communicated to the vmx. For device

accesses, on the other hand, this event is tracked either through explicit

checks in the module which provides devices the access to guest memory, or

by examining page-lists for I/O operations that read contents into memory

pages.
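A sketch of the invalidation path on a guest write fault is given below; the helper names are hypothetical and stand in for the VMM/vmx machinery described above.

    #include <stdint.h>

    /* Assumed helpers provided by the VMM/vmx (hypothetical signatures). */
    void mark_ppn_dirty(uint64_t ppn);        /* page contents are about to change */
    void drop_disk_association(uint64_t ppn); /* any on-disk copy is now stale     */
    void map_ppn_writable(uint64_t ppn);      /* upgrade the read-only mapping     */

    /* Pages start out mapped read-only; the first guest write faults into the
     * VMM, which records the modification and breaks any memory/disk
     * association the page had before resuming the guest.                      */
    void on_guest_write_fault(uint64_t ppn) {
        mark_ppn_dirty(ppn);
        drop_disk_association(ppn);
        map_ppn_writable(ppn);
    }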

6.4.3 I/O Paths

When the guest OS is running inside a virtual machine, guest I/O requests

are intercepted by the VMM, which is responsible for storage adaptor virtu-

alization, and then passed to the hypervisor, where further I/O virtualization

occurs.

Figure 6.6 identifies the primary modules in VMware Workstation’s I/O

stack. The guest operating system generates scatter-gather lists for I/O (1). Tesseract inspects the scatter-gather lists of incoming guest I/O requests in the SCSI Disk Device layer, where a request to the guest VMDK may be updated (2). Any extra I/O requests to the BSST may be issued (3) as shown in Table 6.2. The asynchronous I/O manager sends the I/O requests to the host file system (4). On completion, the asynchronous I/O manager

generates completion events (5). Waiting for the completion of all the I/O

requests needed to service the original guest I/O request is isolated to the

SCSI Disk Device layer as well (6). When running with defragmentation

enabled (see Section 6.5), Tesseract allocates a pool of worker threads for

handling defragmentation requests.

Figure 6.6: VMware Workstation I/O stack. (1) S/G list received from the guest; (2) Tesseract updates the S/G list (write: swapped pages are removed; read: guest VMDK indirections are looked up); (3) I/O request dispatched (write: a single request with holes; read: one request to the guest VMDK and one or more requests to the BSST); (4) asynchronous I/O scheduled; (5) completion events generated for each dispatched I/O; (6) guest notified of completion (write: guest-to-BSST indirections created; read: wait for all requests and merge results).

Guest Write I/Os

Guest I/O requests have PPNs in scatter-gather lists. The vmx rewrites the scatter-gather list, replacing guest PPNs with virtual pages from its address

space before passing it further to the physical device. Normally, for write

I/O requests, if a page was previously swapped, so that PPN does not have

a backing MPN, the hypervisor allocates a new MPN and brings page’s con-

tents from disk.

With Tesseract, we check if the PPNs are already swapped out to BSST

blocks by querying the PPN BSST-block mapping. We then use a virtual

6.4. IMPLEMENTATION 109

address of a special dummy page in the scatter-gather list for each page that

resides in the BSST. On completion of the I/O, metadata associated with the

guest VMDK is updated to reflect the fact that the contents of guest disk

blocks for BSST-resident pages are in the BSST. This sequence allows the

guest to page out memory without inducing double-paging.

Figure 6.7: The pages swapped out to the BSST are replaced with a dummy page to avoid double-paging; indirections are created for the corresponding guest disk blocks. (a) Scatter-gather list prepared by the guest OS for the disk write; (b) modified scatter-gather list that avoids double-paging.

Figure 6.7 illustrates how write I/O requests to the guest VMDK are han-

dled by Tesseract. Tesseract recognizes that contents for pages 2, 4, 6 and 7

in the scatter-gather list provided by the guest OS reside in the BSST (Fig-

ure 6.7a). When a new scatter-gather list to be passed to the physical device

is formed, a dummy page is used for each BSST resident (Figure 6.7b).
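The write-prepare and write-completion steps of Figure 6.7 might be sketched as follows. The helper names (bsst_lookup, bsst_block_ref, vmdk_set_indirection, DUMMY_PPN) are hypothetical stand-ins for Tesseract's metadata operations, and for simplicity each page-sized scatter-gather element is assumed to map to one guest disk block.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SG_MAX    64
    #define DUMMY_PPN 0   /* placeholder page substituted for BSST-resident pages */

    struct sg_elem { uint64_t ppn; uint32_t offset, length; };
    struct sg_list { uint64_t start_sector; size_t nelems; struct sg_elem *elems; };

    /* Assumed metadata helpers (hypothetical; see Section 6.4.4). */
    bool bsst_lookup(uint64_t ppn, uint64_t *bsst_block);
    void bsst_block_ref(uint64_t bsst_block);
    void vmdk_set_indirection(uint64_t guest_block, uint64_t bsst_block);

    /* Guest-VMDK blocks to be redirected to the BSST once the write completes. */
    struct write_plan {
        uint64_t guest_block[SG_MAX];
        uint64_t bsst_block[SG_MAX];
        size_t   n;
    };

    /* Prepare: pages already resident in the BSST are replaced by the dummy
     * page so no synchronous swap-in is needed; their guest blocks are noted. */
    static void prepare_guest_write(struct sg_list *sg, uint64_t first_guest_block,
                                    struct write_plan *plan) {
        plan->n = 0;
        for (size_t i = 0; i < sg->nelems && plan->n < SG_MAX; i++) {
            uint64_t blk;
            if (bsst_lookup(sg->elems[i].ppn, &blk)) {
                bsst_block_ref(blk);            /* hold the block until completion  */
                sg->elems[i].ppn = DUMMY_PPN;   /* do not swap the contents back in */
                plan->guest_block[plan->n] = first_guest_block + i;
                plan->bsst_block[plan->n]  = blk;
                plan->n++;
            }
        }
    }

    /* Completion: point the written guest-VMDK blocks at their BSST copies. */
    static void complete_guest_write(const struct write_plan *plan) {
        for (size_t i = 0; i < plan->n; i++)
            vmdk_set_indirection(plan->guest_block[i], plan->bsst_block[i]);
    }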

Guest Read I/Os and Guest Disk Fragmentation

Recognizing that data may reside in both the guest VMDK and the BSST is a

double-edged sword. On the guest write path it allows us to dismiss pages

that are already present in the BSST and thus avoid swapping them in just to

be written out to the guest VMDK. However, when it comes to guest reads,

the otherwise single I/O request might have to be split into multiple I/Os.

This happens when some of the data needed by the I/O is located in the

BSST.

Since data that has to be read from the BSST may not be contiguous on

disk, the number of extra I/O requests to the BSST may be as high as the

number of data pages in the original I/O request that reside in the BSST. We


refer to a collection of pages in the original I/O request for which a separate

I/O request to the BSST must be issued as a hole. Read I/O requests to the

guest VMDK which have holes are called fragmented.

We modify a fragmented request so that all pages that should be filled

in with the data from the BSST are replaced with a dummy page which will

serve as a placeholder; the data read into it from the guest VMDK is simply discarded.

So in the end for each fragmented read request we issue one modified I/O

request to the guest VMDK and N requests to the BSST, where N is the

number of holes. After all the issued I/Os are completed, we signal the

completion of the originally issued guest read I/O request.

Figure 6.8: Original guest read request split into multiple read requests due to holes in the guest VMDK.

In Figure 6.8, the guest read I/O request finds disk blocks for pages 2, 4,

6 and 7 located on the BSST, where they are taking non-contiguous space.

Tesseract issues one read request to the guest VMDK to get data for pages 1,

3, 5 and 8. In the scatter-gather list sent to the physical device, a dummy

page is used as a read target for pages 2, 4, 6 and 7. Together with that one

read I/O request to the guest VMDK, four read I/O requests are issued to

the BSST. Each of those four requests reads data from one of the four disk

blocks in the BSST.
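The corresponding read-split logic might be sketched as follows (again with hypothetical helpers; for simplicity each BSST-resident page gets its own extra request rather than coalescing contiguous BSST blocks into a single hole):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DUMMY_PPN 0

    struct sg_elem { uint64_t ppn; uint32_t offset, length; };
    struct sg_list { uint64_t start_sector; size_t nelems; struct sg_elem *elems; };

    /* One extra read to be issued against the BSST. */
    struct bsst_read { uint64_t bsst_block; uint64_t target_ppn; };

    /* Assumed metadata helper (hypothetical). */
    bool vmdk_get_indirection(uint64_t guest_block, uint64_t *bsst_block);

    /* Split a fragmented guest read: pages whose guest blocks are redirected
     * to the BSST are fetched by separate BSST reads targeting the original
     * guest pages, while the main request (with dummy pages in the holes)
     * still goes to the guest VMDK.  Completion is signalled only after the
     * main request and all extra BSST reads finish.                          */
    static size_t split_guest_read(struct sg_list *sg, uint64_t first_guest_block,
                                   struct bsst_read *extra, size_t max_extra) {
        size_t n = 0;
        for (size_t i = 0; i < sg->nelems && n < max_extra; i++) {
            uint64_t blk;
            if (vmdk_get_indirection(first_guest_block + i, &blk)) {
                extra[n].bsst_block = blk;
                extra[n].target_ppn = sg->elems[i].ppn; /* BSST read fills the real page */
                sg->elems[i].ppn    = DUMMY_PPN;        /* hole in the guest-VMDK read   */
                n++;
            }
        }
        return n;   /* number of extra read requests to issue against the BSST */
    }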

Optimization of Repeated Swaps

In addition to addressing the double-paging anomaly by tracking guest I/Os

whose contents exist in the BSST, we also implemented an optimization for

back-to-back swap-out requests for a memory page whose contents remain


clean. If a page’s contents are written out to the BSST, and later swapped

back in, we continue to track the old block in the BSST as a form of victim

cache. If the same page is chosen to be swapped out again and there has

been no intervening modification of the contents of the page in memory, we

simply adjust the reference count (see Section 6.4.4) for the block copy that

is already in the BSST.
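A sketch of this optimization follows; the helpers are hypothetical, and the dirty check stands in for the write tracking of Section 6.4.2.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed helpers (hypothetical). */
    bool     bsst_lookup(uint64_t ppn, uint64_t *bsst_block);
    bool     page_dirtied_since_swap_in(uint64_t ppn);
    uint64_t bsst_alloc_block(void);
    void     bsst_write_block(uint64_t bsst_block, uint64_t ppn);
    void     bsst_record_mapping(uint64_t ppn, uint64_t bsst_block);
    void     bsst_block_ref(uint64_t bsst_block);

    /* Swap a page out, reusing its old BSST block when the contents are clean. */
    static uint64_t swap_out_page(uint64_t ppn) {
        uint64_t blk;
        if (bsst_lookup(ppn, &blk) && !page_dirtied_since_swap_in(ppn)) {
            bsst_block_ref(blk);       /* clean copy already on disk: no new write */
            return blk;
        }
        blk = bsst_alloc_block();
        bsst_write_block(blk, ppn);    /* write the current contents to the BSST   */
        bsst_record_mapping(ppn, blk);
        return blk;
    }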

6.4.4 Managing Block Indirection Metadata

Tesseract keeps in-memory metadata for tracking PPN-to-BSST block map-

pings and for recording block indirections between guest and BSST VMDKs.

The PPN-to-BSST block mapping is stored as key-value pair using a hash

table. Indirection between guest and BSST VMDKs are tracked in a similar

manner.

Tesseract also keeps reference counts for the BSST blocks. When a new

PPN-to-BSST mapping is created, the reference count for the corresponding

BSST block is set to 1. The reference count is incremented in the write

prepare stage for PPNs found to have PPN-to-BSST block mappings. This

ensures that such BSST blocks are not repurposed while the guest write

is still in progress. Later, on the write completion path, the guest-VMDK-

to-BSST indirection is created. The reference count of the BSST blocks is

decremented during hypervisor swap in operation. It is also decremented

when the guest VMDK block is overwritten by new contents and the previous

guest block indirection is invalidated. Blocks with zero reference counts are

considered free and reclaimable.
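The reference-count transitions above can be summarized in a small sketch (array-based purely for illustration; the real metadata lives in the hash tables described earlier):

    #include <stdint.h>

    #define BSST_BLOCKS (1u << 20)      /* illustrative BSST capacity in 4 KB blocks */

    static uint32_t refcnt[BSST_BLOCKS];

    /* A new PPN-to-BSST mapping is created by hypervisor swap-out. */
    static void on_swap_out(uint64_t blk)                { refcnt[blk] = 1; }

    /* Write-prepare found a PPN backed by this block: hold it until the
     * guest-VMDK-to-BSST indirection is created on the completion path. */
    static void on_write_prepare(uint64_t blk)           { refcnt[blk]++; }

    /* The hypervisor swapped the page back in. */
    static void on_swap_in(uint64_t blk)                 { refcnt[blk]--; }

    /* A guest block whose indirection pointed here was overwritten. */
    static void on_indirection_invalidated(uint64_t blk) { refcnt[blk]--; }

    /* Blocks with a zero count are free and may be reclaimed. */
    static int block_is_free(uint64_t blk)               { return refcnt[blk] == 0; }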

Metadata Consistency

While updating metadata in memory is faster than updating it on the disk,

it poses consistency issues. What if the system crashes before the metadata

is synced back to persistent storage? To reduce the likelihood of such prob-

lems, Tesseract periodically synchronizes the metadata to disk on the same


schedule used by the VMDK management library for virtual disk state. How-

ever, because reference counts in the BSST and block-indirections in VMDKs

are written at different stages in an I/O request, crashes must be detected

and a fsck-like repair process run.

Entanglement of guest VMDKs and BSST

Once indirections are created between guest and BSST VMDK, it becomes

impossible to move just the guest VMDK. To disentangle the guest VMDK,

we must copy each block from the BSST to its guest VMDK for which there

is an indirection. This can be done both online and offline. More details

about the online process are in Section 6.5.2.

6.5 Guest Disk Fragmentation

As mentioned in Section 6.4.3, when running with Tesseract, guest read I/O

requests might be fragmented in the sense that some of the data the guest

is asking for in a single request may reside in both the BSST and the guest

VMDK.

The fragmentation level depends on the nature of the workload, the

guest OS, and swap activity at the guest and the hypervisor level. Our ex-

periments with SPECjbb2005 [103] showed that even for moderate level of

memory pressure as much as 48% of all read I/O requests had at least one

hole.

By solving the double-paging problem, Tesseract significantly reduced the write-prepare time of guest I/O requests since synchronous swap-in requests

no longer cause delays. However, a non-trivial overhead was added to read-

completion. Indeed, instead of waiting for a single read I/O request to the

guest VMDK, the hypervisor may now have to wait for several extra read

I/O requests to the BSST to complete before reporting the completion to the

guest.


To address these overheads, Tesseract was extended with a defragmen-

tation mechanism that improves read I/O access locality and thus reduces

read-completion time. We investigated two approaches to implementing

defragmentation: BSST defragmentation and guest VMDK defragmentation.

While defragmentation is intended to help reduce read-completion time, it

has its own cost. Defragmentation requests are asynchronous and reduce the time to complete affected guest I/Os, but, at the same time, they contribute

to a higher disk load and in the extreme cases may have an impact on read-

prepare times. The defragmentation activity can be throttled on detecting

performance bottlenecks due to higher disk load. ESX, for example, pro-

vides a mechanism, SIOC, that measures latencies to detect overload and

enforce proportional-share fairness [50]. The defragmentation mechanism

could participate in this protocol.

6.5.1 BSST Defragmentation

BSST defragmentation uses guest write I/O requests as a hint of which BSST

blocks might be accessed together in a single I/O read request in the future.

Given that information we then group together the identified blocks in the

BSST.

Figure 6.9 shows a scatter-gather list of the write I/O request that goes

to the guest VMDK. In that request, the contents of pages 2, 4, 6 and 7 are

already present in the BSST. As soon as these blocks are identified, a worker

thread picks up a reallocation job that will allocate a new set of contiguous

blocks in BSST and will copy the contents of BSST blocks for pages 2, 4,

6 and 7 into that new set of blocks. This copying allows those blocks to be

read later as a single I/O request issued by the guest and reflects its own

expectation of the locality of these blocks.
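A sketch of the reallocation job run by such a worker thread is shown below, using hypothetical helpers over the BSST allocator and the indirection metadata:

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed helpers (hypothetical). */
    uint64_t bsst_alloc_contiguous(size_t nblocks);
    void     bsst_copy_block(uint64_t from_block, uint64_t to_block);
    void     bsst_block_ref(uint64_t block);
    void     bsst_block_unref(uint64_t block);
    void     vmdk_set_indirection(uint64_t guest_block, uint64_t bsst_block);

    /* Copy the BSST blocks referenced by one guest write into a freshly
     * allocated contiguous run and repoint the guest-VMDK indirections, so
     * that a later read of these blocks needs one request against the BSST. */
    static void defrag_bsst_run(const uint64_t *guest_block,
                                const uint64_t *old_bsst_block, size_t n) {
        uint64_t start = bsst_alloc_contiguous(n);
        for (size_t i = 0; i < n; i++) {
            bsst_copy_block(old_bsst_block[i], start + i);
            bsst_block_ref(start + i);
            vmdk_set_indirection(guest_block[i], start + i);
            bsst_block_unref(old_bsst_block[i]);  /* old copy freed when count hits 0 */
        }
    }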

BSST defragmentation is not perfect. If multiple guest VMDK writes create indirections to the same BSST blocks, multiple copies of those blocks may be made in the BSST.

Figure 6.9: Defragmenting the BSST.

Further, since blocks are still present in both the

guest VMDK and the BSST, extra I/O requests to the BSST cannot be en-

tirely eliminated. In addition, BSST defragmentation tries to predict read

access locality from write access locality and obviously the boundaries of

read requests will not match with the boundaries of the write requests. So

each read I/O request that without defragmentation would have required

reads from both the guest VMDK and the BSST will still be split into the one

which goes to the guest VMDK and one or more requests to the BSST. All

this contributes to longer read completion times as shown in Table 6.4.

However, it is relatively easy to implement BSST defragmentation with-

out worrying too much about data races with the I/O going to the guest

VMDK. It can significantly reduce the number of extra I/Os that have to be

issued to the BSST to service the guest I/O request as shown in Table 6.3.

If a guest read I/O request preserves the locality observed at the time

of guest writes, we need more than one read I/O request from the BSST

only when it hits more than one group of blocks created during BSST de-

fragmentation. Although this is entirely dependent on a workload, one can

expect read requests to typically be smaller than write requests, and, so, the

number of extra I/O requests to BSST being reduced to one (fits into one de-

fragmented area) or two (crosses the boundary of two defragmented areas)

in many cases.


Figure 6.10: Defragmenting the guest VMDK.

6.5.2 Guest VMDK Defragmentation

Like BSST defragmentation, guest VMDK defragmentation uses the scatter-

gather lists of write I/O requests to identify BSST blocks that must be copied.

But unlike BSST defragmentation, these blocks are copied to the guest VMDK.

The goal is to restore the guest VMDK to the state it would have had with-

out Tesseract. Tesseract with guest VMDK defragmentation replaces swap-in

operations with asynchronous copying from the BSST to the guest VMDK.

For example, in Figure 6.10, blocks 2, 4, 6 and 7 are copied to the relevant

locations on the guest VMDK by a worker thread.

We enqueue a defragmentation request as soon as the scatter-gather list

of the guest write I/O request is processed and blocks to be asynchronously

fetched to the guest VMDK are identified. The defragmentation requests are

organized as a priority queue. If a guest read I/O request needs to read

data from the block that has not been copied from the BSST, the priority of

the defragmentation request that refers to the block is raised to highest and

the guest read I/O request is blocked until copying of all the missing blocks

finishes.
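The request queue and the priority boost performed on the read path might be sketched as follows (hypothetical queue primitives; locking and worker dispatch are omitted):

    #include <stdint.h>

    enum { PRIO_NORMAL = 0, PRIO_HIGHEST = 1 };

    /* One pending copy of a block from the BSST back into the guest VMDK. */
    struct defrag_req {
        uint64_t guest_block;   /* destination block in the guest VMDK */
        uint64_t bsst_block;    /* source block in the BSST            */
        int      priority;
        int      done;
    };

    /* Assumed queue primitives (hypothetical). */
    void defrag_queue_push(struct defrag_req *r);
    void defrag_queue_reprioritize(struct defrag_req *r);
    void defrag_queue_wait_done(struct defrag_req *r);   /* blocks the caller */

    /* Enqueued as soon as the write's scatter-gather list has been processed. */
    static void enqueue_defrag(struct defrag_req *r,
                               uint64_t guest_block, uint64_t bsst_block) {
        r->guest_block = guest_block;
        r->bsst_block  = bsst_block;
        r->priority    = PRIO_NORMAL;
        r->done        = 0;
        defrag_queue_push(r);
    }

    /* Called from the read path when a needed block has not been copied yet:
     * the owning request jumps to the highest priority and the read blocks
     * until the copy completes.                                              */
    static void wait_for_missing_block(struct defrag_req *r) {
        r->priority = PRIO_HIGHEST;
        defrag_queue_reprioritize(r);
        defrag_queue_wait_done(r);
    }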

While Tesseract with guest defragmentation can have an edge over Tesser-

act without defragmentation, it is not always a win. With guest defragmen-

tation, before a guest I/O read request has a chance to be issued to the

guest VMDK, it may become blocked waiting for a defragmentation request

to complete. This may end up being slower than issuing requests to the

BSST and the guest VMDK in parallel.


Disentanglement of Guest and BSST VMDKs.

Guest defragmentation has an added benefit of removing the entanglement

between guest and BSST VMDK. Once there are no block indirections be-

tween guest and BSST VMDK, the guest VMDK can be moved easily. This

also allows us to disable Tesseract’s double-paging optimization on-the-fly.

6.6 Evaluation

We ran our experiments on an AMD Opteron 6168 (Magny-Cours) with 12

1.9 GHz cores, 1.5 GB of memory and a 1 TB 7200rpm Seagate SATA drive, a

1 TB 7200rpm Western Digital SATA drive, and a 128 GB Samsung SSD drive.

We used OpenSUSE 11.4 as the host OS and a 6 VCPU 700 MB VM running

Ubuntu 11.04. We used Jenkins [113] to monitor and manage execution of

the test cases.

To ensure the same test conditions for all test runs, we created a fresh copy

of the guest virtual disk from backup before each run. For the evaluation

we ran SPECjbb2005 [103] that was modified to emit instantaneous scores

every second. It was run with 6 warehouses for 120 seconds. The heap size

was set to 450 MB. The SPECjbb benchmark creates several warehouses and

processes transactions for each of them.

We induced hypervisor-level swapping by setting a maximum limit on

the pages the VM can lock. The BSST VMDK was preallocated. Swap-out

victim cache size was chosen to be 10% of the VM’s memory size.

All experiments except the one with SSD represent results from five trial

runs. The SSD experiment represents results from three trial runs.

6.6.1 Inducing Double-Paging Activity

To control hypervisor swapping, we set a hypervisor-imposed limit on the

machine memory available for the VM. Guest paging was induced by running


the SPECjbb benchmark with a working set larger than the available guest

memory.

To induce double-paging, the guest must page out the pages that were

already swapped by the hypervisor. Since the hypervisor would choose only

the cold pages from the guest memory, we employed a custom memhog that

would lock some pages in the guest memory for a predetermined amount

of time inside the guest. While the pages were locked by this memhog, a

different memhog would repeatedly touch the rest of available guest pages

making them “hot”. At this point the pages locked by the first memhog are

considered “cold” and swapped out by the hypervisor.

Next, memhog unlocks all its memory and the SPECjbb benchmark is

started inside the guest. Once the warehouses have been created by SPECjbb,

the memory pressure increases inside the guest. The guest is forced to find

and page out “cold pages”. The pages unlocked by memhog are good candi-

dates as they have not been touched in the recent past.

We used memhog and memory locking in our setup to make the exper-

iments more repeatable. In the real world, the conditions we were simulating could be observed, for example, when an application goes through an execution phase shift, or when an application that caches a lot of data in memory without actively using it is descheduled and another memory-intensive application is woken up by the guest.

As a baseline we ran with Tesseract disabled. This effectively disabled

analysis and rewriting of guest I/O commands so that all pages affected by

an I/O command that happened to be swapped out by the hypervisor had to

be swapped back in before the command could be issued to disk.

6.6.2 Application Performance

While it is hard to control and measure the direct impact of individual

double-paging events, we use the pauses or gaps observed in the logged


instantaneous scores of each SPECjbb run to characterize the application be-

havior. Depending upon the amount of double-paging activity, the pauses

can be as big as 60 seconds in a 120 second run and negatively affect the

final score. Often the pauses are associated with garbage collection activity.

Figure 6.11: Trends for scores and pauses in SPECjbb runs with varying guest memory pressure and 10% host overcommitment. Each panel plots SPECjbb score against total SPECjbb blockage time (seconds) for the baseline and Tesseract: (a) no memhog; (b) 30 MB memhog; (c) 60 MB memhog; (d) 90 MB memhog; (e) 120 MB memhog; (f) 150 MB memhog.


Varying Levels of Guest Memory Pressure

Figure 6.11 shows scores and pause times for different sizes of memhog in-

side the guest with 10% host overcommitment. When the guest is trying

to page out pages which are swapped by the hypervisor, the latter is swap-

ping them back in and is forced to swap out some other pages. This cascade

effect is responsible for the increased pause periods for the baseline. With Tesser-

act, however, the pause periods grow at a lower rate. This growth can be

explained by longer wait times due to increased disk activity. Although the

scores are about the same for higher guest memory pressure, the total pauses

for Tesseract are less than those for the baseline.

Figure 6.12: Maximum single pauses observed in SPECjbb instantaneous scoring with varying guest memory pressure and 10% host memory overcommitment (maximum pause/blockage time in seconds vs. memhog size in MB, for the baseline and Tesseract).

Figure 6.12 shows the effect of increased memory pressure on the length

of the biggest application pause. The bars represent the range of maximum

pauses for individual sets of runs. There are five runs in each set. Notice that

Tesseract clearly outperforms the baseline. The highest maximum pause

time with Tesseract is 7 seconds, compared to 30 seconds for the baseline.

This shows that the application is more responsive with Tesseract.


Figure 6.13: Scores and total pause times for SPECjbb runs with varying host overcommitment and 60 MB memhog. Each panel plots SPECjbb score against total SPECjbb blockage time (seconds) for the baseline and Tesseract: (a) 0%; (b) 5%; (c) 15%; (d) 20% host overcommitment.

Varying Levels of Host Memory Pressure

To study the effect of increasing memory pressure by the hypervisor, we

ran the application with various levels of host overcommitment with 60 MB

memhog inside the guest.

Figure 6.13 shows the effect of increasing host memory pressure on the

application scores and total pause times. For lower host pressure (0% and

5%), the score and pause times for the baseline and Tesseract are about the

same. However, for higher memory pressure there is a significant difference

in the performance. For example, in the 20% case, the baseline observes

total pauses in the range of 80–110 seconds. Tesseract, on the other hand,

observes total pauses in a much lower range of 30–60 seconds.


Figure 6.14: Comparing maximum single pauses for SPECjbb under various defragmentation schemes (no-defrag, guest-defrag, bsst-defrag) and the baseline, with varying host memory overcommitment and 60 MB memhog.

Figure 6.14 focuses on the maximum pauses seen by the application as

host memory pressure grows. While the maximum pauses are insignificant

at lower memory pressure, with a higher pressure Tesseract clearly outper-

forms the baseline.

6.6.3 Double-Paging and Guest Write I/O Requests

Table 6.1 shows why double-paging is affecting guest write I/O performance.

As expected, if the host is not experiencing memory pressure, none of the

1,030 guest write I/O requests refer to pages swapped by the hypervisor.

As memory pressure builds up, more and more guest write I/O requests

require one or more pages to be swapped in before a write can be issued to

the physical disk. All of this contributes to a longer write-prepare time for

such requests.

Consider a setup where host memory is 20% overcommitted. Of 1,366

guest write I/O requests 981 had at least one page that had to be swapped

in. Then, 524 guest write I/O requests needed between 1 and 20 swap-

in requests completed by the hypervisor in order to proceed, 177 needed


between 21 and 50 swap-in requests completed, and, finally, 280 guest write I/O requests needed more than 50 swap-in requests.

    Host (%)  Guest I/Os  I/Os with  I/Os with   I/Os with    I/Os with  Double-paging
              issued      holes      1-20 holes  21-50 holes  >50 holes  cases
    0         1,030       0          0           0            0          0
    5         981         537        343         106          88         11,254
    10        1,042       661        358         132          171        19,381
    15        1,292       766        377         237          152        22,584
    20        1,366       981        524         177          280        32,547

Table 6.1: Holes in write I/O requests for varying host overcommitment and 60 MB memhog inside the guest.

6.6.4 Fragmentation in Guest Read I/O Requests

Table 6.2 quantifies the number of extra read I/O requests that have to be

issued to the BSST if defragmentation is not used.

    Host (%)  Guest I/Os  I/Os with  Total   Total I/Os  Score
              issued      holes      holes   issued
    0         5,152       0          0       5,152       7,010
    5         5,230       708        1,675   6,197       6,801
    10        5,206       2,161      5,820   8,865       6,271
    15        4,517       2,084      6,990   9,423       6,048
    20        5,698       2,739      11,854  14,813      2,841

Table 6.2: Holes in read I/O requests for Tesseract without defragmentation for varying levels of host overcommitment and 60 MB memhog inside the guest.

Without host memory pressure there is no hypervisor level swapping and

all 5,152 guest read I/O requests can be satisfied without going to the BSST.

At higher levels of memory pressure, the hypervisor starts swapping pages

to disk. Tesseract detects pages in guest write I/O requests that are already

in the BSST to avoid swap-in requests for such pages. The amount of work

saved by Tesseract on the write I/O path is quantified in the final column of

Table 6.1.


Figure 6.15: Scores and pauses in SPECjbb runs under various defragmentation schemes with 10% host overcommitment. Each panel plots SPECjbb score against total SPECjbb blockage time (seconds) for the baseline, no-defrag, bsst-defrag, and guest-defrag configurations: (a) 60 MB; (b) 120 MB; (c) 180 MB; (d) 240 MB memhog.

When host memory is 20% overcommitted, we can see that out of 5,698 guest read I/O requests, 2,739 required extra read I/Os to be issued to read data from the BSST. The total number of such extra I/Os to the

BSST was 11,854, which made the total number of read I/O requests issued

to both the guest VMDK and the BSST equal 14,813.

6.6.5 Evaluating Defragmentation Schemes

Figures 6.15 and 6.16 show the impact of using BSST and guest VMDK de-

fragmentation on SPECjbb throughput, while Figures 6.14 and 6.17 give

insight into SPECjbb responsiveness.

Guest defragmentation performs better than the baseline in all situations


and is as good or better than BSST defragmentation.

Figure 6.16: Scores and pauses in SPECjbb under various defragmentation schemes with varying host overcommitment and 60 MB memhog. Each panel plots SPECjbb score against total SPECjbb blockage time (seconds) for the baseline, no-defrag, bsst-defrag, and guest-defrag configurations: (a) 0%; (b) 5%; (c) 15%; (d) 20% host overcommitment.

With low levels of

host memory overcommitment Tesseract with guest VMDK defragmentation

secures better SPECjbb scores than Tesseract without defragmentation and

performs on par in responsiveness metrics.

With increasing host memory overcommitment, Tesseract without de-

fragmentation starts outperforming Tesseract with either of the defragmen-

tation schemes in both the application throughput and responsiveness as

the total and maximum pause times grow slower for the no-defragmentation

case. This is due to the fact that at higher levels of hypervisor level swapping,

guest read I/O becomes more and more fragmented and pending defrag-

mentation requests become a bottleneck leading to longer read completion

times.


Figure 6.17: Comparing maximum single pauses for SPECjbb under various defragmentation schemes with 10% host memory overcommitment (maximum pause/blockage time in seconds vs. memhog size in MB).

    Defrag     Reads      Reads     Total  BSST reads  Total reads  Defrag reads  Defrag writes
    strategy   w/o holes  w/ holes  holes  issued      issued       issued        issued
    No-Defrag  3,025      1,203     2,456  2,456       6,684        0             0
    BSST       2,946      1,235     2,889  1,235       5,416        12,674        616
    Guest      3,909      0         0      0           3,909        11,538        11,538

Table 6.3: Total I/Os with BSST and guest defragmentation.

Table 6.3 shows the I/O overheads of the two defragmentation schemes

compared to Tesseract without them. For this table, 3 runs with similar

scores and a similar number of guest read I/O requests were selected. With

BSST VMDK defragmentation enabled, Tesseract was able to reduce the

number of synchronous I/O requests to BSST VMDK from 2,889 (2.23 reads

per I/O with holes on average) to 1,235 (1 read per I/O with holes). To

do BSST VMDK defragmentation, 12,674 asynchronous reads from BSST

VMDK and 616 asynchronous writes to BSST VMDK had to be issued. This

number of writes equals the number of guest write I/O requests with holes.

Guest VMDK defragmentation eliminated holes in guest read I/O requests

entirely, so there were no guest-related reads from BSST VMDK. To achieve

this, 11,538 asynchronous reads from BSST VMDK and the same number of

asynchronous writes to the guest VMDK were issued.


Figure 6.18: Tesseract performance with the BSST placed on an SSD disk. Each panel plots SPECjbb score against total SPECjbb blockage time (seconds) for the baseline and Tesseract: (a) 15%; (b) 20%; (c) 25%; (d) 30% host overcommitment.

6.6.6 Using SSD For Storing BSST VMDK

SSDs have dramatically better performance than magnetic disks, particularly in terms of lower latencies for random reads. However, their relatively high cost keeps them out of the mainstream server market, so they are typically deployed in smaller capacities to boost performance. One potential application for SSDs in servers is as a hypervisor swap device, allowing for higher memory overcommitment since the cost of swapping is reduced.

In our experiment, we placed the BSST VMDK on a SATA SSD. Fig-

ure 6.18 shows the performance of the baseline and Tesseract. At lower

memory pressure, there is no difference in performance, but as memory pressure increases at both the guest and hypervisor levels, Tesseract starts to show benefits over the baseline.


I/O Path           Baseline   No-defrag   BSST defrag   Guest defrag
Read prepare       0          37          30            109
Read completion    0          232         247           55
Write prepare      24,262     220         256           265
Write completion   0          49          91            101

Table 6.4: Average read and write prepare/completion times in microseconds for baseline and Tesseract with and without defragmentation. Host overcommitment was 10%; memhog size was 60 MB.

6.6.7 Overheads

I/O Path Overhead

Table 6.4 presents Tesseract overheads on the I/O paths. The average overhead per I/O is on the order of microseconds. The read prepare time for guest defragmentation is higher than for the others due to contention on the guest VMDK during defragmentation. At the same time, the read completion time for the guest defragmentation case is much lower than for the other two cases, as there are no extra reads going to the BSST. On the write I/O path, the defragmentation schemes have a larger overhead. This is due to the background defragmentation of the disks, which is kicked off as soon as the write I/O is scheduled.

Memory Overhead

Per Section 6.4.4, Tesseract maintains in-memory metadata for three purposes: tracking (a) associations between PPN and BSST blocks; (b) reference counts for BSST blocks; and (c) indirections between the guest VMDK and the BSST VMDK. We use 64 bits to store a (4 KB) block number. To track associations between PPN and BSST blocks, we re-use the MPN field in the page frames maintained by the hypervisor, so there is no extra memory overhead here. In the general case, where associations between PPN and blocks in the guest VMDK have to be tracked, we need a separate memory structure with a maximum overhead of 0.2% of the VM's memory size. Each BSST block's reference count requires 4 bytes per disk block. To optimize the lookup for free/available BSST blocks, a bitmap is also maintained with one bit for each block. The guest VMDK to BSST VMDK indirection metadata requires 24 bytes for each guest VMDK block for which there is a valid indirection to the BSST. A bitmap similar to that for the BSST is maintained for guest VMDK blocks to determine whether an indirection to the BSST exists for a given guest VMDK block.
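To make the arithmetic concrete, the following is a minimal sketch, in C with hypothetical type and field names (this is not the ESX implementation), of the per-block metadata sizes just described. The 0.2% worst-case figure follows from one 8-byte block number per 4 KB page.

#include <stdint.h>

#define BLOCK_SIZE 4096          /* 4 KB disk blocks and memory pages */

/* Per-BSST-block bookkeeping (hypothetical layout): a 4-byte reference
 * count per block, plus a bitmap with one bit per block for free/available
 * lookups. */
typedef struct {
  uint32_t *refcount;
  uint64_t *free_bitmap;
} bsst_metadata_t;

/* Guest VMDK -> BSST VMDK indirection: 24 bytes per redirected guest
 * block.  The field breakdown below (two 64-bit block numbers plus 64
 * bits of state) is an illustrative assumption, not the actual layout. */
typedef struct {
  uint64_t guest_block;
  uint64_t bsst_block;
  uint64_t state;
} indirection_t;                 /* 3 x 8 bytes = 24 bytes */

/* Worst-case PPN -> guest-block tracking: one 8-byte block number per
 * 4 KB page, i.e., 8 / 4096, which is roughly 0.2% of the VM's memory. */
static inline uint64_t ppn_tracking_bytes(uint64_t vm_mem_bytes) {
  return (vm_mem_bytes / BLOCK_SIZE) * sizeof(uint64_t);
}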

6.7 Related Work

This work intersects three areas. The first is that of uncooperative hypervisor

swapping and the double-paging problem. The second concerns the tracking

of associations between guest memory and disk state. The third concerns

memory and I/O deduplication.

6.7.1 Hypervisor Swapping and Double Paging

Concurrent work by Amit et al. [5] systematically explores the behavior of uncooperative hypervisor swapping and implements an improved swap subsystem for KVM called VSwapper. The main components of their imple-

mentation are the Swap Mapper and the False Reader Preventer. The paper

identifies five primary causes for performance degradation, studies each, and

offers solutions to address them. The first, “silent swap writes”, corresponds

to our notion of guest-I/O–swap optimization which we do not yet support

because we do not support reference-counting on blocks in guest VMDKs.

The second and third, “stale swap reads” and “false swap reads”, and their

solutions are similar to the existing ESX optimizations that cancel swap-ins

for memory pages that are either overwritten by disk I/O or by the guest.

For “silent swap writes” and “stale swap reads”, the Swap Mapper uses the

same techniques Tesseract does to track valid associations between pages in

guest memory and blocks on disk. Their solution to “false swap reads”, the


False Reader Preventer, is more general, however, because it supports the

accumulation of successive guest writes in a temporary buffer to identify if

a page is entirely overwritten before the next read. The last two, “decayed swap

sequentiality” and “false page anonymity”, are not issues we consider. In

their investigation, they did not observe double-paging to have much impact

on performance. This is likely due to the fact that they followed guidelines

from VMware and provisioned guests with enough VRAM that guest pag-

ing was uncommon and most of the experiments were run with a persistent

level of overcommitment. Tesseract allows for optimizing operations involving guest I/O followed by another guest I/O on either the same pages or the same disk blocks. This is not possible with VSwapper. Also, VSwapper does not allow for defragmentation or disk deduplication.

The double-paging problem was first identified in the context of virtual

machines running on VM/370 [46, 101]. Goldberg and Hassinger [46] discuss the impact of increased paging when the virtual machine's address space exceeds the real memory with which it is backed. Seawright and MacKinnon [101] mention

the use of handshaking between the VMM and operating system to address

the issue but do not offer details.

The Cellular Disco project at Stanford describes the problem of paging

in the guest and swapping in the hypervisor [48, 47]. They address this

double-paging or redundant paging problem by introducing a virtual paging

device in the guest. The paging device allows the hypervisor to track the

paging activity of the guest and reconcile it with its own. Like our approach, the guest paging device identifies already swapped-out blocks and creates indirections to these blocks that are already persistent on disk. There is no

mention of the fact that these indirections destroy expected locality and may

impact subsequent guest read I/Os.

Subsequent papers on scheduling memory for virtual machines also refer

in passing to the general problem. Waldspurger [128], for example, men-

tions the impact of double-paging and advocates random selection of pages


by the hypervisor as a simple way to minimize overlap with page-selection

by the guest. Other projects, such as the Satori project [82], use double-

paging to advocate against any mechanism to swap guest pages from the

hypervisor.

Our approach differs from these efforts in several ways. First, we have

a system in which we can—for the first time—measure the extent to which

double-paging occurs. Second, we have an approach that directly addresses

the problem of double-paging in a manner transparent to the guest. Finally,

our techniques change the relationship between the two levels of scheduling:

by reconciling and eliding redundant I/Os, Tesseract encourages the two

schedulers to choose the same pages to be paged out.

6.7.2 Associations Between Memory and Disk State

Tracking the associations between guest memory and guest disks has been

used to improve memory management and working-set estimation for vir-

tual machines. The Geiger project [60], for example, uses paravirtualization

and intimate knowledge of the guest disks to implement a secondary cache

for guest buffer-cache pages. Lu et al. [78] implement a similar form of

victim cache for the Xen hypervisor.

Park et al. [87] describe a set of techniques to speed live-migration of

VMs. One of these techniques is to track associations between pages in

memory and blocks on disks whose contents are shared between the source

and destination machines. In cases where the contents are known to be

resident on disk, the block information is sent to the destination in place

of the memory contents. In the paper, the authors describe techniques for

maintaining this mapping both through paravirtualization and through the

use of read-only mappings for fully virtualized guests.


6.7.3 I/O and Memory Deduplication

The Satori project [82] also tracks the association between disk blocks and

pages in memory. It extends the Xen hypervisor to exploit these associations,

allowing it to elide repeated I/Os that read the same blocks from disk across VMs by immediately sharing those pages of memory across the guests.

Originally inspired by the Cellular Disco and Geiger projects, Tesseract

shares much in common with these approaches. Like many of them, it tracks

valid associations between memory pages and disk blocks that contain iden-

tical content. Like Park et al., it employs techniques that are fully transpar-

ent to the guest, allowing it to be applied in a wider set of contexts. Unlike the Satori project, which focused on eliminating redundant read operations

across VMs, Tesseract uses that mapping information to deduplicate I/Os

from a specific guest and its hypervisor. As such, our approach complements

and extends these others.

6.8 Observations

Our experience in this project has led us to question the existing interface

for issuing I/O requests with scatter-gather lists. Given that the underly-

ing physical organization of the disk blocks can differ significantly from the

virtual disk structure, it makes little sense for a scatter-gather list to require

that the target blocks on disk be contiguous. Having a more flexible structure

may allow I/Os to be expressed more succinctly and to be more effective at

communicating expected relationships or locality among those disk blocks.

Further, one can think of generalizing I/O scatter-gather lists and espe-

cially virtual disks to just be indirection tables into a large sea-of-blocks. This

allows for a natural application surface for block indirection.
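As a rough illustration of this idea (the types below are hypothetical and do not correspond to any existing I/O interface), a generalized scatter-gather element could carry an explicit target block per memory segment, so that a single request need not address contiguous blocks in the sea-of-blocks:

#include <stdint.h>
#include <stddef.h>

/* One segment of a generalized scatter-gather request: each memory
 * segment names its own target disk block, so a single request can
 * address non-contiguous blocks. */
typedef struct {
  void     *buf;         /* memory address of this segment          */
  size_t    len;         /* length in bytes                         */
  uint64_t  disk_block;  /* explicit target block for this segment  */
} sg_elem_t;

typedef struct {
  sg_elem_t *elems;      /* the indirection table for this request  */
  size_t     nelems;
} sg_request_t;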

CHAPTER 7

Impact for the Future

In this chapter, we discuss some of the future directions that can be pursued

based on this dissertation.

7.1 Compiled Code In Scripting Languages:

Fast-Slow Paradigm

For many scripting languages (Python, R, Matlab, etc.), the interpreted lan-

guage was developed first, and researchers developed an efficient compiler

after the fact. As a result, we often have fast compiled functions that run

inside the interpreted language. The compiler makes assumptions in order to generate efficient code. Unusual user applications may violate these assumptions, causing the compiled code to silently return an incorrect answer. So, a user must choose between reliable interpreted (slow) code and unreliable compiled (fast) code.

Checkpointing provides an interesting third alternative. One splits the

computation into segments. For concreteness, we will give an example with

ten segments, and we will assume that ten additional “checking” hosts (or

ten additional CPU cores) are available to run in parallel.

Initially, the compiled code is run. At the beginning of each of the ten

segments, one takes a checkpoint and copies it to a different “checking”


computer. That computer runs the next segment in interpreted mode. At

the end of that segment, the data from the corresponding checkpoint of the

compiled segment is compared with the data at the end of the interpreted

segment for correctness.

At the end, either the ten “checking” hosts (or ten “checking” CPU cores)

report that the computation is correct, or else they report that the compu-

tation must switch to interpreted mode for correctness at the beginning of

a particular segment (after which, one can return to compiled operation as

described above).
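The control flow can be sketched as follows. All helper functions here are hypothetical placeholders rather than existing DMTCP APIs, and the sketch assumes one checking host (or core) per segment, running concurrently with the compiled run.

/* Sketch of the fast-slow verification scheme for num_segments segments.
 * All helpers below are hypothetical placeholders. */
extern void take_checkpoint(int segment);            /* checkpoint at segment start */
extern void send_checkpoint_to_checker(int segment); /* checker replays it, interpreted */
extern void run_compiled_segment(int segment);       /* fast (compiled) execution */
extern int  checker_reports_mismatch(int segment);
extern void rerun_segment_interpreted(int segment);  /* restart from that checkpoint */

void fast_slow_run(int num_segments)
{
  /* Fast pass: run every segment compiled, checkpointing at each boundary
   * and handing the checkpoint to a checking host. */
  for (int i = 0; i < num_segments; i++) {
    take_checkpoint(i);
    send_checkpoint_to_checker(i);
    run_compiled_segment(i);
  }

  /* Verification: on the first mismatch, re-run that segment in
   * interpreted mode from its checkpoint, then resume compiled mode
   * as described above. */
  for (int i = 0; i < num_segments; i++) {
    if (checker_reports_mismatch(i)) {
      rerun_segment_interpreted(i);
      break;
    }
  }
}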

Wester et al. [131] implemented a speculation mechanism in the operat-

ing system. It provided coordination across all applications and kernel state

while the speculation policy was left up to the applications. A similar scheme was employed with DMTCP by Ghoshal et al. [45] in an application to MPI, and by Arya and Cooperman [9] to support the Python scripting language.

7.2 Support for Hadoop-style Big Data

Hadoop [39] and Spark [40] support a map-reduce paradigm in which the

size of intermediate data may increase during a “map” phase and may de-

crease during a “reduce” phase. Thus, the best place to checkpoint is at

the end of a “reduce” phase. With the right hooks added to Hadoop (or

Spark), Hadoop could be instructed by a plugin to move back-end data to

longer-term storage. On restart, the plugin would use those hooks to move the data from longer-term storage back to active storage, and the front end would reconnect.


7.3 Cybersecurity

Section 5.8 described the ability to checkpoint a network of virtual machines

using plugins [44]. This can be combined with DMTCP plugins to monitor

and modify the operation of a guest virtual machine. In particular, if mal-

ware uses any external services (from gettimeofday to calling back to a con-

troller on the Internet), this can be intercepted by a suitable DMTCP plugin,

and even replayed, in order to more closely examine the malware. See Visan

et al. [127] and Arya et al. [10] for examples of using record-replay through

DMTCP plugins. (While some malware tries to detect if it is running inside

a virtual machine, malware will often continue to run in this situation. Oth-

erwise, virtual machines would provide a good defense against malware.)

7.4 Algorithmic debugging

Algorithmic debugging [102, 13, 94, 83, 84, 79] is a well-developed tech-

nique that was especially explored in the 1990s. Roughly, the idea is that

an algorithmic debugger keeps a trace of the computation, and shows the

user the input and output of various subprocedures. Through a series of

questions and answers (similar to the game of 20 questions), the software

determines which low-level subprocedure caused the bug. This tended to

be used in functional languages and declarative languages such as Prolog,

because of the ease of capturing the input and output of a subprocedure.

The use of checkpoints allows one to apply this same technique to main-

stream languages including C/C++, Python, and others. Instead of en-

capsulating a small input and output, a traditional debugger (e.g., GDB,

Python pdb) would be used to allow the programmer to fully explore the

global state at the beginning and end of the subprocedure. In the case of a

failed step, checkpoint-restart would allow us to restart from the last valid

step instead of rerunning the program from the beginning.


7.5 Reversible Debugging

Reversible debugging or time-travelling debuggers have a long history [19,

38, 64, 72]. Checkpointing provides an obvious approach in this area. Some

parts of this approach have already been developed within the context of

DMTCP (decomposing debugging histories for replay [127] and reverse ex-

pression watchpoints [10]).

7.6 Android-Based Mobile Computing

Huang and Cheng have already demonstrated the use of DMTCP to check-

point processes under Android [53]. This provides the potential for truly

pervasive mobile apps, which can checkpoint themselves and migrate them-

selves to other platforms. This can provide greater software sustainability

(software engineering) by saving the entire mobile app, instead of the cur-

rent practice of saving the state of an app and re-loading the state whenever

the app is re-launched.

7.7 Cloud Computing

Cloud computing provides on-demand self-service and rapid elasticity of re-

sources for applications. These characteristics are similar to those of the old-

style mainframes from the 1960s through 1980s. However, to make the

analogy complete, we need a scheduler for the Cloud. This scheduler must

support parallel applications in addition to single-process applications. A

scheduler for the Cloud can use DMTCP to suspend or migrate jobs. The ca-

pabilities of DMTCP contributing to this goal include providing checkpoint

support for: virtual machines [44], Intel Xeon Phi [12, 2], InfiniBand [27],

MPI, and 3D-graphics (for visualization) [62].

CHAPTER 8

Conclusion

Virtualization in the context of singular systems is well understood, but it is more difficult in the context of multiple systems. This dissertation presented solutions to two long-standing problems related to virtualization. A number of future directions were presented to apply the results of this dissertation, both in the context of checkpoint-restart and in that of virtual machines.

Closed-World Assumption

This dissertation presented a framework for transparent checkpointing of

application processes that do not obey the closed world assumption. A pro-

cess virtualization approach was presented to decouple the application pro-

cesses from the external subsystems. This was achieved by introducing a

thin virtualization layer between the application and the external subsystem

that provided the application with a consistent view of the external subsys-

tem across checkpoint and restart. An adaptive plugin-based architecture

was presented to allow the checkpointing system to grow organically with

each new external subsystem. The third-party plugins, developed to pro-

vide seven novel checkpointing solutions, demonstrated the success of the

plugin-based process virtualization approach.


Double-Paging Problem

This work presented Tesseract, a system that directly and transparently (with-

out any modifications to the guest operating system) addressed the double-

paging problem. It reconciled and eliminated redundant I/O activity be-

tween the guest’s virtual disks and the hypervisor swap subsystem by track-

ing associations between the contents of the pages in guest memory and

those on disk.

Finding an Application Surface

In the first body of work, the application surface was always chosen close

to the application process. The concept of an application surface close to a

stable API served as a guide in discovering a virtualization strategy in situa-

tions where no previous virtualization strategy existed. The pid plugin is an

example of a minimal application surface at the POSIX API layer, whereas

the SSH plugin provided an application surface at the level of SSH protocol.

In the second body of work, there were several possibilities for choosing an application surface, including the guest operating system, paravirtualized guest devices, virtual devices in the hypervisor, the virtual disk interface, or the host kernel. We chose the application surface at the virtual disk device interface, as it provides a clear separation between the hypervisor and the virtual disks. This application surface included the entire guest virtual machine, including the operating system, devices, etc. However, being at the virtual disk device layer allowed us to provide block indirection without requiring any knowledge of the guest internals (virtual address space, file system, etc.) and without requiring any modifications to the host operating system.

APPENDIX A

Plugin Tutorial

A.1 Introduction

Plugins enable one to modify the behavior of DMTCP. Two of the most com-

mon uses of plugins are:

1. to execute an additional action at the time of checkpoint, resume, or

restart.

2. to add a wrapper function around a call to a library function (including

wrappers around system calls).

Plugins are used for a variety of purposes. The DMTCP_ROOT/contrib

directory contains packages that users and developers have contributed to

be optionally loaded into DMTCP.

Plugin code is expressive, while requiring only a modest number of lines

of code. The plugins in the contrib directory vary in size from 400 lines to

3000 lines of code.

Beginning with DMTCP version 2.0, much of DMTCP itself is also now a

plugin. In this new design, the core DMTCP code is responsible primarily for

copying all of user space memory to a checkpoint image file. The remaining

functions of DMTCP are handled by plugins, found in DMTCP_ROOT/plugin.

Each plugin abstracts the essentials of a different subsystem of the operating


system and modifies its behavior to accommodate checkpoint and restart.

Some of the subsystems for which plugins have been written are: virtualiza-

tion of process and thread ids; files (open, close, dup, fopen, fclose, mmap,

pty); events (eventfd, epoll, poll, inotify, signalfd); System V IPC constructs

(shmget, semget, msgget); TCP/IP sockets (socket, connect, bind, listen, ac-

cept); and timers (timer_create, clock_gettime). (The indicated system calls

are examples only and not all-inclusive.)

A.2 Anatomy of a plugin

A plugin modifies the behavior of either DMTCP or a target application,

through three primary mechanisms, plus virtualization of ids.

Wrapper functions: One declares a wrapper function with the same name

as an existing library function (including system calls in the run-time

library). The wrapper function can execute some prolog code, pass

control to the “real” function, and then execute some epilog code. Sev-

eral plugins can wrap the same function in a nested manner. One can

also omit passing control to the “real” function, in order to shadow

that function with an alternate behavior.

Events: It is frequently useful to execute additional code at the time of

checkpoint, or resume, or restart. Plugins provide hook functions to be

called during these three events and numerous other important events

in the life of a process.

Coordinated checkpoint of distributed processes: DMTCP transparently

checkpoints distributed computations across many nodes. At the time

of checkpoint or restart, it may be necessary to coordinate information

among the distributed processes. For example, at restart time, an inter-

nal plugin of DMTCP allows the newly re-created processes to “talk”

to their peers to discover the new network addresses of their peers.


This is important since a distributed computation may be restarted on

a different cluster than its original one.

Virtualization of ids: Ids (process id, timer id, System V IPC id, etc.) are

assigned by the kernel, by a peer process, and by remote processes.

Upon restart, the external agent may wish to assign a different id than

the one assigned prior to checkpoint. Techniques for virtualization of

ids are described in Appendix A.3.2.

A.3 Writing Plugins

A.3.1 Invoking a plugin

Plugins are just dynamic run-time libraries (.so files).

gcc -shared -fPIC -IDMTCP_ROOT/include -o PLUGIN1.so PLUGIN1.c

They are invoked at the beginning of a DMTCP computation as command-

line options:

dmtcp_launch --with-plugin PLUGIN1.so:PLUGIN2.so myapp

Note that one can invoke multiple plugins as a colon-separated list. One

should either specify a full path for each plugin (each .so library), or else

define LD_LIBRARY_PATH to include your own plugin directory.

A.3.2 The plugin mechanisms

The mechanisms of plugins are most easily described through examples.

This tutorial will rely on the examples in DMTCP_ROOT/test/plugin. To

get a feeling for the plugins, one can “cd” into each of the subdirectories and

execute: “make check”.


Plugin events

For context, please scan the code of plugin/example/example.c. Exe-

cuting “make check” will demonstrate the intended behavior. Plugin events

are handled by including the function dmtcp_event_hook. When a DMTCP

plugin event occurs, DMTCP will call the function dmtcp_event_hook for

each plugin. This function is required only if the plugin will handle plugin

events. See Appendix A for further details.

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{
  switch (event) {
  case DMTCP_EVENT_WRITE_CKPT:
    printf("\n*** The plugin is being called before checkpointing. ***\n");
    break;
  case DMTCP_EVENT_RESUME:
    printf("*** Resume: the plugin has now been checkpointed. ***\n");
    break;
  case DMTCP_EVENT_RESTART:
    printf("*** The plugin is now being restarted. ***\n");
    break;
  ...
  default:
    break;
  }
  DMTCP_NEXT_EVENT_HOOK(event, data);
}


Plugin wrapper functions

In its simplest form, a wrapper function can be written as follows:

unsigned int sleep(unsigned int seconds) {
  /* NEXT_FNC(sleep) forwards the call to the next definition of sleep
   * (ultimately the one in libc).  print_time() is defined elsewhere
   * in this example plugin. */
  struct timeval oldtv, tv;
  gettimeofday(&oldtv, NULL);
  printf("sleep1: "); print_time(); printf(" ... ");
  unsigned int result = NEXT_FNC(sleep)(seconds);
  gettimeofday(&tv, NULL);
  printf("Time elapsed: %f\n",
         (1e6*(tv.tv_sec - oldtv.tv_sec)
          + 1.0*(tv.tv_usec - oldtv.tv_usec)) / 1e6);
  print_time(); printf("\n");
  return result;
}

In the above example, we could also shadow the standard “sleep” function by our own implementation, if we omit the call to “NEXT_FNC”.

To see a related example, try:

cd DMTCP_ROOT/test/plugin/sleep1; make check

Wrapper functions from distinct plugins can be nested. For a nesting of

plugin sleep2 around sleep1, do:

cd DMTCP_ROOT/test/plugin

make; cd sleep2; make check

If one adds a wrapper around a function from a library other than libc.so

(e.g., libglx.so), it is best to dynamically link to that additional library:


gcc ... -o PLUGIN1.so PLUGIN1.c -lglx

Plugin coordination among multiple or distributed processes

It is often the case that an external agent will assign a particular initial id

to your process, but later assign a different id on restart. Each process must

re-discover its peers at restart time, without knowing the pre-checkpoint ids.

DMTCP provides a “Publish/Subscribe” feature to enable communica-

tion among peer processes. Two plugin events allow user plugins to discover

peers and pass information among peers. The two events are DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA and DMTCP_EVENT_SEND_QUERIES.

DMTCP guarantees to provide a global barrier between the two events.

An example of how to use the Publish/Subscribe feature is contained

in DMTCP_ROOT/test/plugin/example-db . The explanation below is

best understood in conjunction with reading that example.

A plugin processing DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA

should invoke:

int dmtcp_send_key_val_pair_to_coordinator(const void *key, size_t key_len,

const void *val, size_t val_len).

A plugin processing DMTCP_EVENT_SEND_QUERIES should invoke:

int dmtcp_send_query_to_coordinator(const void *key, size_t key_len, void

*val, size_t *val_len).
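As a minimal sketch of the two events (the key and value below are purely illustrative; see DMTCP_ROOT/test/plugin/example-db for the complete example), a plugin might publish a key/value pair and then query it back once all peers have registered:

#include <stdint.h>
#include <stddef.h>
#include "dmtcp/plugin.h"

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{
  static int32_t key = 42;     /* illustrative key (per-plugin choice)   */
  static int32_t val = 2014;   /* illustrative value to publish          */
  int32_t reply = 0;
  size_t reply_len = sizeof(reply);

  /* Process these two events after the internal DMTCP plugins have run
   * (see the note in Section A.5.1), so call the next hook first.       */
  DMTCP_NEXT_EVENT_HOOK(event, data);

  switch (event) {
  case DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA:
    /* Publish our key/value pair to the coordinator. */
    dmtcp_send_key_val_pair_to_coordinator(&key, sizeof(key),
                                           &val, sizeof(val));
    break;
  case DMTCP_EVENT_SEND_QUERIES:
    /* Query the value registered under the same key; DMTCP provides a
     * global barrier between the two events, so the data is available. */
    dmtcp_send_query_to_coordinator(&key, sizeof(key), &reply, &reply_len);
    break;
  default:
    break;
  }
}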

Using plugins to virtualize ids and other names

Often an id or name will change between checkpoint and restart. For ex-

ample, on restart, the real pid of a process will change from its pid prior

to checkpoint. Some DMTCP internal plugins maintain a translation table

in order to translate between a virtualized id passed to the user code and a

real id maintained inside the kernel. The utility to maintain this translation

table can also be used within third-party plugins. For an example of adding

virtualization to a plugin, see the plugin in plugin/ipc/timer.
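The following is a hedged sketch of the idea for POSIX timers, assuming hypothetical helpers register_new_timer() and virtual_to_real_timer() backed by such a translation table; the actual utility functions are those used by the plugin in plugin/ipc/timer.

#include <time.h>
#include <signal.h>
#include "dmtcp/plugin.h"

/* Hypothetical helpers backed by the plugin's virtual<->real id table. */
extern timer_t register_new_timer(timer_t real_id);   /* returns a virtual id */
extern timer_t virtual_to_real_timer(timer_t virt_id);

int timer_create(clockid_t clockid, struct sigevent *sevp, timer_t *timerid)
{
  timer_t real_id;
  int ret = NEXT_FNC(timer_create)(clockid, sevp, &real_id);
  if (ret == 0)
    *timerid = register_new_timer(real_id);  /* hand a virtual id to the app */
  return ret;
}

int timer_settime(timer_t timerid, int flags,
                  const struct itimerspec *new_value,
                  struct itimerspec *old_value)
{
  /* Translate the application's virtual id to the current real id. */
  timer_t real_id = virtual_to_real_timer(timerid);
  return NEXT_FNC(timer_settime)(real_id, flags, new_value, old_value);
}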


In some less common cases, it can happen that a virtualized id is passed

to a library function by the target application. Yet, that same library function

may be passed a real id by a second function from within the same library.

In these cases, it is the responsibility of the plugin implementor to choose a

scheme that allows the first library function to distinguish whether its argu-

ment is a virtual id (passed from the target application) or a real id (passed

from within the same library).

A.4 Application-Initiated Checkpoints

Application-initiated checkpoints are even simpler than full-featured plug-

ins. In the simplest form, the following code can be executed both with

dmtcp_launch and without:

#include <stdio.h>
#include "dmtcp.h"

int main() {
  if (dmtcpCheckpoint() == DMTCP_NOT_PRESENT) {
    printf("dmtcpCheckpoint: DMTCP not present. No checkpoint is taken.\n");
  }
  return 0;
}

For this program to be aware of DMTCP, it must be compiled with -fPIC and -ldl:

gcc -fPIC -IDMTCP_ROOT/include -o myapp myapp.c -ldl

The most useful functions are:

int dmtcpIsEnabled() — returns 1 when running with DMTCP; 0 otherwise.

int dmtcpCheckpoint() — returns DMTCP_AFTER_CHECKPOINT,

DMTCP_AFTER_RESTART, or DMTCP_NOT_PRESENT.

int dmtcpDelayCheckpointsLock()— DMTCP will block any check-

point requests.

int dmtcpDelayCheckpointsUnlock() — DMTCP will execute any

blocked checkpoint requests, and will permit new checkpoint requests.

The last two functions follow the common pattern of returning 0 on suc-

cess and DMTCP_NOT_PRESENT if DMTCP is not present.
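For instance, the two delay functions can bracket a critical section that must not be split by a checkpoint. The sketch below is illustrative only; error handling is omitted.

#include <stdio.h>
#include "dmtcp.h"

void update_on_disk_state(void) {
  /* Block checkpoint requests while the files are mutually inconsistent. */
  if (dmtcpDelayCheckpointsLock() == DMTCP_NOT_PRESENT)
    printf("Running without DMTCP; no checkpoints to delay.\n");

  /* ... write a group of files that must remain mutually consistent ... */

  /* Allow any blocked and future checkpoint requests again. */
  dmtcpDelayCheckpointsUnlock();
}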

A.5 Plugin Manual

A.5.1 Plugin events

dmtcp_event_hook

In order to handle DMTCP plugin events, a plugin must define an entry

point, dmtcp_event_hook.

NAME
    dmtcp_event_hook - Handle plugin events for this plugin

SYNOPSIS
    #include "dmtcp/plugin.h"

    void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)

DESCRIPTION
    When a plugin event occurs, DMTCP will look for the symbol
    dmtcp_event_hook in each plugin library.  If the symbol is found,
    that function will be called for the given plugin library.  DMTCP
    guarantees only to invoke the first such plugin library found in
    library search order.  Occurrences of dmtcp_event_hook in later
    plugin libraries will be called only if each previous function had
    invoked DMTCP_NEXT_EVENT_HOOK.  The argument, <event>, will be
    bound to the event being declared by DMTCP.  The argument, <data>,
    is required only for certain events.  See the following section,
    “Plugin Events”, for a list of all events.

SEE ALSO
    DMTCP_NEXT_EVENT_HOOK

DMTCP_NEXT_EVENT_HOOK

A typical definition of dmtcp_event_hook will invoke the hook in the next

plugin via DMTCP_NEXT_EVENT_HOOK.

NAME
    DMTCP_NEXT_EVENT_HOOK - call dmtcp_event_hook in next plugin library

SYNOPSIS
    #include "dmtcp/plugin.h"

    void DMTCP_NEXT_EVENT_HOOK(event, data)

DESCRIPTION
    This function must be invoked from within a plugin library function
    called dmtcp_event_hook.  The arguments <event> and <data> should
    normally be the same arguments passed to dmtcp_event_hook.
    DMTCP_NEXT_EVENT_HOOK may be called zero or one times.  If invoked
    zero times, no further plugin libraries will be called to handle
    events.  The behavior is undefined if DMTCP_NEXT_EVENT_HOOK is
    invoked more than once.  The typical usage of this function is to
    create a wrapper around the handling of the same event by later
    plugins.

SEE ALSO
    dmtcp_event_hook

Event Names

The rest of this section defines plugin events. The complete list of plugin

events is always contained in DMTCP_ROOT/include/plugin.h .

DMTCP guarantees to call the dmtcp_event_hook function of the plugin

when the specified event occurs.


Plugins that pass significant data through the data parameter are marked

with an asterisk (*). Most plugin events do not pass data through the data

parameter.

Note that the events REGISTER_NAME_SERVICE_DATA, SEND_QUERIES,

RESTART, RESUME, and REFILL should all be processed after the call to

DMTCP_NEXT_EVENT_HOOK() in order to guarantee that the internal DMTCP

plugins have first restored full functionality.

Checkpoint-Restart

DMTCP_EVENT_WRITE_CKPT — Invoked at final barrier before writing

checkpoint

DMTCP_EVENT_RESTART — Invoked at first barrier during restart of new

process

DMTCP_EVENT_RESUME — Invoked at first barrier during resume fol-

lowing checkpoint

Coordination of Multiple or Distributed Processes during Restart

(see Appendix A.5.2)

DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA* restart/resume

DMTCP_EVENT_SEND_QUERIES* restart/resume

WARNING: EXPERTS ONLY FOR REMAINING EVENTS

Init/Fork/Exec/Exit

DMTCP_EVENT_INIT — Invoked before main (in both the original pro-

gram and any new program called via exec)

DMTCP_EVENT_EXIT — Invoked on call to exit/_exit/_Exit, or on return from main

DMTCP_EVENT_PRE_EXEC — Invoked prior to call to exec


DMTCP_EVENT_POST_EXEC — Invoked before DMTCP_EVENT_INIT in

new program

DMTCP_EVENT_ATFORK_PREPARE — Invoked before fork (see POSIX

pthread_atfork)

DMTCP_EVENT_ATFORK_PARENT — Invoked after fork by parent (see

POSIX pthread_atfork)

DMTCP_EVENT_ATFORK_CHILD — Invoked after fork by child (see POSIX

pthread_atfork)

Barriers (finer-grained control during checkpoint-restart)

DMTCP_EVENT_WAIT_FOR_SUSPEND_MSG — Invoked at barrier during

coordinated checkpoint

DMTCP_EVENT_SUSPENDED — Invoked at barrier during coordinated

checkpoint

DMTCP_EVENT_LEADER_ELECTION — Invoked at barrier during coordi-

nated checkpoint

DMTCP_EVENT_DRAIN — Invoked at barrier during coordinated check-

point

DMTCP_EVENT_REFILL — Invoked at first barrier during resume/restart

of new process

Threads

DMTCP_EVENT_THREADS_SUSPEND — Invoked within checkpoint thread

when all user threads have been suspended

DMTCP_EVENT_THREADS_RESUME — Invoked within checkpoint thread

before any user threads are resumed.


For debugging, consider calling the following code for this event:

static int x = 1; while(x);

DMTCP_EVENT_PRE_SUSPEND_USER_THREAD — Each user thread in-

vokes this prior to being suspended for a checkpoint

DMTCP_EVENT_RESUME_USER_THREAD — Each user thread invokes

this immediately after a resume or restart (isRestart() available

to plugin)

DMTCP_EVENT_THREAD_START — Invoked before start function given

by clone

DMTCP_EVENT_THREAD_CREATED — Invoked within parent thread when

clone call returns (like parent for fork)

DMTCP_EVENT_PTHREAD_START — Invoked before start function given

by pthread_create

DMTCP_EVENT_PTHREAD_EXIT — Invoked before call to pthread_exit

DMTCP_EVENT_PTHREAD_RETURN — Invoked in child thread when thread

start function of pthread_create returns

A.5.2 Publish/Subscribe

Appendix A.3.2 provides an explanation of the Publish/Subscribe

feature for coordination among peer processes at resume- or restart-time.

An example of how to use the Publish/Subscribe feature is contained in

DMTCP_ROOT/test/plugin/example-db .

The primary events and functions used in this feature are:

DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA

int dmtcp_send_key_val_pair_to_coordinator(const void *key,


size_t key_len, const void *val, size_t val_len)

DMTCP_EVENT_SEND_QUERIES

int dmtcp_send_query_to_coordinator(const void *key, size_t key_len, void

*val, size_t *val_len)

A.5.3 Wrapper functions

For a description of including wrapper functions in a plugin, see Appendix A.3.2.

A.5.4 Miscellaneous utility functions

Numerous DMTCP utility functions are provided that can be called from

within dmtcp_event_hook(). The utility functions are still under active de-

velopment, and may change in small ways. Some of the more commonly

used utility functions follow. Functions that return “char *” will not allocate

memory, but instead will return a pointer to a canonical string, which should

not be changed.

void dmtcp_get_local_ip_addr(struct in_addr *in);

const char* dmtcp_get_tmpdir(); /* given by --tmpdir, or

DMTCP_TMPDIR, or TMPDIR */

const char* dmtcp_get_ckpt_dir();

/* given by --ckptdir, or DMTCP_CHECKPOINT_DIR, or curr

dir at ckpt time */

const char* dmtcp_get_ckpt_files_subdir();

int dmtcp_get_ckpt_signal(); /* given by --mtcp-checkpoint-

signal */

const char* dmtcp_get_uniquepid_str();

const char* dmtcp_get_computation_id_str();

uint64_t dmtcp_get_coordinator_timestamp();

uint32_t dmtcp_get_generation(); /* number of ckpt/restart sequences encountered */

const char* dmtcp_get_executable_path();

int dmtcp_get_restart_env(char *name, char *value, int

maxvaluelen);

/* For ’name’ in environment, copy its value into ’value’

param, but with

* at most length ’maxvaluelen’.

* Return 0 for success, and return code for various

errors

* See contrib/modify-env for an example of its use.

*/

Bibliography

[1] Hazim Abdel-Shafi, Evan Speight, and John K. Bennett. Efficient user-

level thread migration and checkpointing on windows NT clusters.

In Proceedings of the 3rd Conference on USENIX Windows NT Sympo-

sium - Volume 3, WINSYM’99, page 1–1, Berkeley, CA, USA, 1999.

USENIX Association. URL http://dl.acm.org/citation.cfm?

id=1268427.1268428. (Cited on page 15.)

[2] David Abdurachmanov, Kapil Arya, Josh Bendavid, Tommaso Boc-

cali, Gene Cooperman, Andrea Dotti, Peter Elmer, Giulio Eu-

lisse, Francesco Giacomini, Christopher D. Jones, Matteo Man-

zali, and Shahzad Muzaffar. Explorations of the viability of ARM

and xeon phi for physics processing. Journal of Physics: Confer-

ence Series, 513(5):052008, June 2014. ISSN 1742-6596. doi:

10.1088/1742-6596/513/5/052008. URL http://iopscience.

iop.org/1742-6596/513/5/052008. (Cited on page 136.)

[3] Saurabh Agarwal, Rahul Garg, Meeta S. Gupta, and Jose E. Mor-

eira. Adaptive incremental checkpointing for massively parallel sys-

tems. In Proceedings of the 18th Annual International Conference on

Supercomputing, ICS ’04, page 277–286, New York, NY, USA, 2004.

ACM. ISBN 1-58113-839-3. doi: 10.1145/1006209.1006248. URL

http://doi.acm.org/10.1145/1006209.1006248. (Cited on

page 15.)


[4] Ole Agesen. System and method for maintaining memory page shar-

ing in a virtual environment, February 2013. U.S. Classification

711/147, 711/152, 711/E12.102, 717/148; International Classifica-

tion G06F12/08, G06F9/455, G06F7/04; Cooperative Classification

G06F12/08, G06F9/544, G06F9/45537. (Cited on page 98.)

[5] Nadav Amit, Dan Tsafrir, and Assaf Schuster. VSwapper: a mem-

ory swapper for virtualized environments. In Proceedings of the 19th

International Conference on Architectural Support for Programming

Languages and Operating Systems, ASPLOS ’14, page 349–366, New

York, NY, USA, 2014. ACM. ISBN 978-1-4503-2305-5. doi: 10.

1145/2541940.2541969. URL http://doi.acm.org/10.1145/

2541940.2541969. (Cited on pages 94 and 128.)

[6] Glenn Ammons, Jonathan Appavoo, Maria Butrico, Dilma Da Silva,

David Grove, Kiyokuni Kawachiya, Orran Krieger, Bryan Rosenburg,

Eric Van Hensbergen, and Robert W. Wisniewski. Libra: A library

operating system for a jvm in a virtualized execution environment.

In Proceedings of the 3rd International Conference on Virtual Execution

Environments, VEE ’07, page 44–54, New York, NY, USA, 2007. ACM.

ISBN 978-1-59593-630-1. doi: 10.1145/1254810.1254817. URL

http://doi.acm.org/10.1145/1254810.1254817. (Cited on

page 24.)

[7] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: transparent

checkpointing for cluster computations and the desktop. In IEEE In-

ternational Symposium on Parallel Distributed Processing, 2009. IPDPS

2009, pages 1–12, May 2009. doi: 10.1109/IPDPS.2009.5161063.

(Cited on pages 20, 25, and 58.)

[8] Linux Kernel Mailing List (LKML) Archives. [LKML] checkpoint-

restart: naked patch serialization, March 2014. URL http://lkml.iu.edu/hypermail/linux/kernel/1011.0/00770.html.

(Cited on page 17.)

[9] Kapil Arya and Gene Cooperman. DMTCP: bringing checkpoint-

restart to python. In Proceedings of the 12th Python in Science Con-

ference, pages 2–7, 2013. URL http://conference.scipy.org/

proceedings/scipy2013/arya.html. (Cited on page 134.)

[10] Kapil Arya, Tyler Denniston, Ana-Maria Visan, and Gene Cooper-

man. Semi-automated debugging via binary search through a pro-

cess lifetime. In Proceedings of the Seventh Workshop on Program-

ming Languages and Operating Systems, PLOS ’13, page 9:1–9:7, New

York, NY, USA, 2013. ACM. ISBN 978-1-4503-2460-1. doi: 10.

1145/2525528.2525533. URL http://doi.acm.org/10.1145/

2525528.2525533. (Cited on pages 135 and 136.)

[11] Kapil Arya, Yury Baskakov, and Alex Garthwaite. Tesseract: Rec-

onciling guest I/O and hypervisor swapping in a VM. In Pro-

ceedings of the 10th ACM SIGPLAN/SIGOPS International Confer-

ence on Virtual Execution Environments, VEE ’14, page 15–28, New

York, NY, USA, 2014. ACM. ISBN 978-1-4503-2764-0. doi: 10.

1145/2576195.2576198. URL http://doi.acm.org/10.1145/

2576195.2576198. (Cited on page 9.)

[12] Kapil Arya, Gene Cooperman, Andrea Dotti, and Peter Elmer.

Use of checkpoint-restart for complex HEP software on tradi-

tional architectures and intel MIC. Journal of Physics: Confer-

ence Series, 523(1):012015, June 2014. ISSN 1742-6596. doi:

10.1088/1742-6596/523/1/012015. URL http://iopscience.

iop.org/1742-6596/523/1/012015. (Cited on page 136.)

[13] Evyatar Av-Ron. Top-Down Diagnosis of Prolog Programs. PhD thesis,

Weizmanm Institute, 1984. (Cited on page 135.)


[14] Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Jayesh

Krishna, Ewing Lusk, and Rajeev Thakur. PMI: a scalable parallel

process-management interface for extreme-scale systems. In Proceed-

ings of the 17th European MPI Users’ Group Meeting Conference on

Recent Advances in the Message Passing Interface, EuroMPI’10, page

31–41, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-15645-

2, 978-3-642-15645-8. URL http://dl.acm.org/citation.

cfm?id=1894122.1894127. (Cited on page 83.)

[15] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Har-

ris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield.

Xen and the art of virtualization. In Proceedings of the Nineteenth

ACM Symposium on Operating Systems Principles, SOSP ’03, page

164–177, New York, NY, USA, 2003. ACM. ISBN 1-58113-757-5. doi:

10.1145/945445.945462. URL http://doi.acm.org/10.1145/

945445.945462. (Cited on page 24.)

[16] Tarick Bedeir. Building an RDMA-Capable application with IB

verbs. Technical report, http://www.hpcadvisorycouncil.com/, Au-

gust 2010. http://www.hpcadvisorycouncil.com/pdf/building-an-

rdma-capable- application-with-ib-verbs.pdf. (Cited on page 35.)

[17] Adam Beguelin, Erik Seligman, and Peter Stephan. Applica-

tion level fault tolerance in heterogeneous networks of work-

stations. Journal of Parallel and Distributed Computing, 43(2):

147–155, June 1997. ISSN 0743-7315. doi: 10.1006/jpdc.

1997.1338. URL http://www.sciencedirect.com/science/

article/pii/S0743731597913381. (Cited on page 18.)

[18] Bernard Blackham. Cryopid, 2012. URL http://cryopid.

berlios.de/index.html. (Cited on page 19.)


[19] Bob Boothe. Efficient algorithms for bidirectional debugging. In

Proceedings of the ACM SIGPLAN 2000 Conference on Programming

Language Design and Implementation, PLDI ’00, page 299–310, New

York, NY, USA, 2000. ACM. ISBN 1-58113-199-2. doi: 10.1145/

349299.349339. URL http://doi.acm.org/10.1145/349299.

349339. (Cited on page 136.)

[20] Dan Bornstein. Dalvik VM internals. In Google I/O Developer Confer-

ence, volume 23, page 17–30, 2008. (Cited on page 22.)

[21] George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali,

Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier,

Oleg Lodygensky, Frederic Magniette, Vincent Neri, and Anton Se-

likhov. MPICH-V: toward a scalable fault tolerant MPI for volatile

nodes. In Proceedings of the 2002 ACM/IEEE Conference on Super-

computing, SC ’02, page 1–18, Los Alamitos, CA, USA, 2002. IEEE

Computer Society Press. URL http://dl.acm.org/citation.

cfm?id=762761.762815. (Cited on page 18.)

[22] Aurélien Bouteiller, Thomas Herault, Géraud Krawezik, Pierre

Lemarinier, and Franck Cappello. MPICH-V project: A multipro-

tocol automatic fault-tolerant MPI. International Journal of High

Performance Computing Applications, 20(3):319–333, 2006. doi:

10.1177/1094342006067469. URL http://hpc.sagepub.com/

content/20/3/319.abstract. (Cited on page 19.)

[23] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul

Stodghill. Automated application-level checkpointing of MPI pro-

grams. In Proceedings of the Ninth ACM SIGPLAN Symposium on

Principles and Practice of Parallel Programming, PPoPP ’03, page

84–94, New York, NY, USA, 2003. ACM. ISBN 1-58113-588-2. doi:


10.1145/781498.781513. URL http://doi.acm.org/10.1145/

781498.781513. (Cited on pages 15 and 19.)

[24] Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, and

Martin Schulz. Application-level checkpointing for shared mem-

ory programs. In Proceedings of the 11th International Conference

on Architectural Support for Programming Languages and Operating

Systems, ASPLOS XI, page 235–247, New York, NY, USA, 2004.

ACM. ISBN 1-58113-804-0. doi: 10.1145/1024393.1024421. URL

http://doi.acm.org/10.1145/1024393.1024421. (Cited on

page 15.)

[25] Greg Bronevetsky, Daniel Marques, Keshav Pingali, Radu Rugina, and

Sally A. McKee. Compiler-enhanced incremental checkpointing for

OpenMP applications. In Proc. of IEEE International Parallel and Dis-

tributed Processing Symposium (IPDPS), pages 1–12, May 2009. doi:

10.1109/IPDPS.2009.5160999. (Cited on page 15.)

[26] Guohong Cao and M. Singhal. On coordinated checkpointing in dis-

tributed systems. IEEE Transactions on Parallel and Distributed Sys-

tems, 9(12):1213–1225, December 1998. ISSN 1045-9219. doi:

10.1109/71.737697. (Cited on page 22.)

[27] Jiajun Cao, Gregory Kerr, Kapil Arya, and Gene Cooperman. Trans-

parent checkpoint-restart over InfiniBand. In ACM 23rd Int. Symp. on

High Performance Parallel and Distributed Computing (HPDC), 2014.

(to appear). (Cited on pages 9, 31, 71, 89, 90, and 136.)

[28] K. Mani Chandy and Leslie Lamport. Distributed snapshots: De-

termining global states of distributed systems. ACM Trans. Com-

put. Syst., 3(1):63–75, February 1985. ISSN 0734-2071. doi:

10.1145/214451.214456. URL http://doi.acm.org/10.1145/

214451.214456. (Cited on page 29.)


[29] P. Emerald Chung, Woei-Jyh Lee, Yennun Huang, Deron Liang, and

Chung-Yih Wang. Winckp: A transparent checkpointing and rollback

recovery tool for windows NT applications. In Proc. of 29th Annual

International Symposium on Fault-Tolerant Computing, page 220–223,

1999. doi: 10.1109/FTCS.1999.781053. (Cited on page 15.)

[30] Gene Cooperman, Jason Ansel, and Xiaoqin Ma. Adaptive check-

pointing for master-worker style parallelism (extended abstract). In

Proc. of 2005 IEEE Computer Society International Conference on Clus-

ter Computing. IEEE Press, 2005. conference proceedings on CD.

(Cited on page 25.)

[31] Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard,

Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs.

non-blocking coordinated checkpointing for large-scale fault tolerant

MPI. In Proceedings of the 2006 ACM/IEEE Conference on Supercom-

puting, SC ’06, New York, NY, USA, 2006. ACM. ISBN 0-7695-2700-0.

doi: 10.1145/1188455.1188587. URL http://doi.acm.org/10.

1145/1188455.1188587. (Cited on pages 18 and 22.)

[32] Timothy Cramer, Richard Friedman, Terrence Miller, David Seberger,

Robert Wilson, and Mario Wolczko. Compiling java just in time. IEEE

Micro, 17(3):36–43, May 1997. ISSN 0272-1732. doi: 10.1109/

40.591653. URL http://dx.doi.org/10.1109/40.591653.

(Cited on page 22.)

[33] William R. Dieter and James E. Lumpp, Jr. User-level checkpointing

for LinuxThreads programs. In Proceedings of the FREENIX Track:

2001 USENIX Annual Technical Conference, page 81–92, Berkeley, CA,

USA, 2001. USENIX Association. ISBN 1-880446-10-3. URL http:

//dl.acm.org/citation.cfm?id=647054.715766. (Cited on

page 15.)


[34] Fred Douglis and John Ousterhout. Transparent process migration:

Design alternatives and the sprite implementation. Software: Practice

and Experience, 21(8):757–785, August 1991. ISSN 1097-024X.

doi: 10.1002/spe.4380210802. URL http://onlinelibrary.

wiley.com/doi/10.1002/spe.4380210802/abstract.

(Cited on page 13.)

[35] Ifeanyi P. Egwutuoha, David Levy, Bran Selic, and Shiping

Chen. A survey of fault tolerance mechanisms and check-

point/restart implementations for high performance computing

systems. The Journal of Supercomputing, 65(3):1302–1326,

September 2013. ISSN 0920-8542, 1573-0484. doi: 10.

1007/s11227-013-0884-0. URL http://link.springer.com/

article/10.1007/s11227-013-0884-0. (Cited on page 13.)

[36] David Ehringer. The dalvik virtual machine architecture. Technical

report, 2010. (Cited on page 22.)

[37] Dawson R. Engler, M. Frans Kaashoek, and James O’Toole, Jr. Ex-

okernel: An operating system architecture for application-level re-

source management. In Proceedings of the Fifteenth ACM Sympo-

sium on Operating Systems Principles, SOSP ’95, page 251–266, New

York, NY, USA, 1995. ACM. ISBN 0-89791-715-4. doi: 10.1145/

224056.224076. URL http://doi.acm.org/10.1145/224056.

224076. (Cited on page 24.)

[38] Stuart I. Feldman and Channing B. Brown. IGOR: a system for

program debugging via reversible execution. In Proceedings of the

1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed

Debugging, PADD ’88, page 112–123, New York, NY, USA, 1988.

ACM. ISBN 0-89791-296-9. doi: 10.1145/68210.69226. URL http:

//doi.acm.org/10.1145/68210.69226. (Cited on page 136.)


[39] Apache Software Foundation. Apache hadoop, March 2014. URL

http://hadoop.apache.org/. (Cited on page 134.)

[40] Apache Software Foundation. Apache spark — lightning-fast clus-

ter computing, March 2014. URL http://spark.incubator.

apache.org/. (Cited on page 134.)

[41] Qi Gao, Weikuan Yu, Wei Huang, and D.K. Panda. Application-

transparent Checkpoint/Restart for MPI programs over InfiniBand.

In International Conference on Parallel Processing, 2006. ICPP 2006,

pages 471–478, August 2006. doi: 10.1109/ICPP.2006.26. (Cited on

page 19.)

[42] Tal Garfinkel. Traps and pitfalls: Practical problems in system call

interposition based security tools. In In Proc. Network and Dis-

tributed Systems Security Symposium, page 163–176, 2003. (Cited

on page 21.)

[43] Rohan Garg, Komal Sodha, and Gene Cooperman. A generic

checkpoint-restart mechanism for virtual machines. Technical report,

arXiv tech. report, arXiv:1212.1787, December 2012. URL http:

//arxiv.org/abs/1212.1787. Published: arXiv:1212.1787

[cs.OS], http://arxiv.org/abs/1212.1787. (Cited on page 87.)

[44] Rohan Garg, Komal Sodha, Zhengping Jin, and Gene Cooperman.

Checkpoint-restart for a network of virtual machines. In Proc. of 2013

IEEE Computer Society International Conference on Cluster Computing,

pages 1–8. IEEE Press, 2013. doi: 10.1109/CLUSTER.2013.6702626.

(Cited on pages 9, 71, 88, 135, and 136.)

[45] Devarshi Ghoshal, Sreesudhan R. Ramkumar, and Arun Chauhan.

Distributed speculative parallelization using checkpoint restart. In

Proceedings of the International Conference on Computational Science,


ICCS 2011, volume 4 of Proceedings of the International Conference on

Computational Science, ICCS 2011, pages 422–431, 2011. doi: 10.

1016/j.procs.2011.04.044. URL http://www.sciencedirect.

com/science/article/pii/S1877050911001025. (Cited on

page 134.)

[46] Robert P. Goldberg and Robert Hassinger. The double paging

anomaly. In Proceedings of the May 6-10, 1974, National Com-

puter Conference and Exposition, AFIPS ’74, page 195–199, New

York, NY, USA, 1974. ACM. doi: 10.1145/1500175.1500215. URL

http://doi.acm.org/10.1145/1500175.1500215. (Cited on

pages 91 and 129.)

[47] Kinshuk Govil. Virtual clusters: resource management on large shared-

memory multiprocessors. PhD thesis, Stanford University, Palo Alto,

CA, USA, 2001. AAI3000034. (Cited on pages 91, 97, and 129.)

[48] Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, and Mendel Rosen-

blum. Cellular disco: Resource management using virtual clusters

on shared-memory multiprocessors. In Proceedings of the Seventeenth

ACM Symposium on Operating Systems Principles, SOSP ’99, page

154–169, New York, NY, USA, 1999. ACM. ISBN 1-58113-140-2. doi:

10.1145/319151.319162. URL http://doi.acm.org/10.1145/

319151.319162. (Cited on pages 91, 97, and 129.)

[49] Richard L. Graham, Sung-Eun Choi, David J. Daniel, Nehal N. De-

sai, Ronald G. Minnich, Craig E. Rasmussen, L. Dean Risinger, and

Mitchel W. Sukalski. A network-failure-tolerant message-passing sys-

tem for terascale clusters. In Proceedings of the 16th International Con-

ference on Supercomputing, ICS ’02, page 77–83, New York, NY, USA,

2002. ACM. ISBN 1-58113-483-5. doi: 10.1145/514191.514205.


URL http://doi.acm.org/10.1145/514191.514205. (Cited

on page 18.)

[50] Ajay Gulati, Irfan Ahmad, and Carl A. Waldspurger. PARDA: pro-

portional allocation of resources for distributed storage access. In

Proccedings of the 7th Conference on File and Storage Technologies,

FAST ’09, page 85–98, Berkeley, CA, USA, 2009. USENIX Associa-

tion. URL http://dl.acm.org/citation.cfm?id=1525908.

1525915. (Cited on page 113.)

[51] Vishakha Gupta, Ada Gavrilovska, Karsten Schwan, Harshvard-

han Kharche, Niraj Tolia, Vanish Talwar, and Parthasarathy Ran-

ganathan. GViM: GPU-accelerated virtual machines. In Pro-

ceedings of the 3rd ACM Workshop on System-level Virtualization

for High Performance Computing, HPCVirt ’09, page 17–24, New

York, NY, USA, 2009. ACM. ISBN 978-1-60558-465-2. doi: 10.

1145/1519138.1519141. URL http://doi.acm.org/10.1145/

1519138.1519141. (Cited on page 19.)

[52] Paul H. Hargrove and Jason C. Duell. Berkeley lab checkpoint/restart

(BLCR) for linux clusters. Journal of Physics: Conference Series, 46(1):

494, September 2006. ISSN 1742-6596. doi: 10.1088/1742-6596/

46/1/067. URL http://iopscience.iop.org/1742-6596/

46/1/067. (Cited on pages 3, 17, 18, 19, and 23.)

[53] Jim Huang and Kito Cheng. Implement checkpointing for android

(slides). In Embedded Linux Conference Europe (ELCE2012). 0xlab,

November 2012. URL http://www.slideshare.net/jserv/

implement-checkpointing-for-android-elce2012. (Cited

on page 136.)

[54] J. Hursey, J.M. Squyres, T.I. Mattox, and A. Lumsdaine. The design

and implementation of Checkpoint/Restart process fault tolerance for


open MPI. In Parallel and Distributed Processing Symposium, 2007.

IPDPS 2007. IEEE International, pages 1–8, March 2007. doi: 10.

1109/IPDPS.2007.370605. (Cited on pages 18 and 19.)

[55] Joshua Hursey, Timothy I. Mattox, and Andrew Lumsdaine. Intercon-

nect agnostic Checkpoint/Restart in open MPI. In Proceedings of the

18th ACM International Symposium on High Performance Distributed

Computing, HPDC ’09, page 49–58, New York, NY, USA, 2009. ACM.

ISBN 978-1-60558-587-1. doi: 10.1145/1551609.1551619. URL

http://doi.acm.org/10.1145/1551609.1551619. (Cited on

pages 19, 72, and 89.)

[56] VMware Inc. VMware workstation, March 2014. URL http://www.

vmware.com/products/workstation. (Cited on page 92.)

[57] VMware Inc. VMware vSphere hypervisor, March 2014. URL http:

//www.vmware.com/products/esxi-and-esx/overview.

(Cited on page 91.)

[58] Pankaj Jalote. Fault Tolerance in Distributed Systems. Prentice-Hall,

Inc., Upper Saddle River, NJ, USA, 1994. ISBN 0-13-301367-7. (Cited

on page 21.)

[59] G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti, and

Yoshio Turner. Cruz: Application-transparent distributed checkpoint-

restart on standard operating systems. In International Conference

on Dependable Systems and Networks, 2005. DSN 2005. Proceedings,

pages 260–269, June 2005. doi: 10.1109/DSN.2005.33. (Cited on

page 16.)

[60] Stephen T. Jones, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-

Dusseau. Geiger: Monitoring the buffer cache in a virtual ma-

chine environment. In Proceedings of the 12th International Confer-


ence on Architectural Support for Programming Languages and Oper-

ating Systems, ASPLOS XII, page 14–24, New York, NY, USA, 2006.

ACM. ISBN 1-59593-451-0. doi: 10.1145/1168857.1168861. URL

http://doi.acm.org/10.1145/1168857.1168861. (Cited on

page 130.)

[61] Poul-Henning Kamp and Robert N. M. Watson. Jails: Confining the omnipotent root. In Proc. 2nd Intl. SANE Conference, 2000. (Cited on page 23.)

[62] Samaneh Kazemi Nafchi, Rohan Garg, and Gene Cooperman. Transparent checkpoint-restart for hardware-accelerated 3D graphics. Technical report arXiv:1312.6650, 2013. URL http://arxiv.org/abs/1312.6650v2. (Cited on pages 9, 31, 71, 88, and 136.)

[63] Gregory Kerr, Alex Brick, Gene Cooperman, and Sergey Bratus. Checkpoint-restart: Proprietary hardware and the ‘Spiderweb API’. Technical report, Recon 2011, July 2011. Talk: abstract at http://recon.cx/2011/schedule/events/112.en.html; video at https://archive.org/details/Recon_2011_Checkpoint_Restart. (Cited on page 35.)

[64] Samuel T. King, George W. Dunlap, and Peter M. Chen. Debugging operating systems with time-traveling virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, pages 1–1, Berkeley, CA, USA, 2005. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1247360.1247361. (Cited on page 136.)

[65] Naveen Kumar and Ramesh Peri. Transparent debugging of dynamically instrumented programs. SIGARCH Comput. Archit. News, 33(5):57–62, December 2005. ISSN 0163-5964. doi: 10.1145/1127577.1127589. URL http://doi.acm.org/10.1145/1127577.1127589. (Cited on page 21.)

[66] Oren Laadan. A Personal Virtual Computer Recorder. PhD thesis, Columbia University, 2011. URL http://academiccommons.columbia.edu/catalog/ac:131552. (Cited on page 16.)

[67] Oren Laadan and Jason Nieh. Transparent checkpoint-restart of multiple processes on commodity operating systems. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, ATC’07, pages 25:1–25:14, Berkeley, CA, USA, 2007. USENIX Association. ISBN 999-8888-77-6. URL http://dl.acm.org/citation.cfm?id=1364385.1364410. (Cited on page 16.)

[68] Oren Laadan, Nicolas Viennot, and Jason Nieh. Transparent, lightweight application execution replay on commodity multiprocessor operating systems. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’10, pages 155–166, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0038-4. doi: 10.1145/1811039.1811057. URL http://doi.acm.org/10.1145/1811039.1811057. (Cited on pages 16 and 17.)

[69] H. Andres Lagar-Cavilla, Niraj Tolia, M. Satyanarayanan, and Eyal de Lara. VMM-independent graphics acceleration. In Proceedings of the 3rd International Conference on Virtual Execution Environments, VEE ’07, pages 33–43, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-630-1. doi: 10.1145/1254810.1254816. URL http://doi.acm.org/10.1145/1254810.1254816. (Cited on pages 72 and 88.)

[70] Peter Alan Lee and Thomas Anderson. Fault tolerance. In Fault Tolerance, number 3 in Dependable Computing and Fault-Tolerant Systems, pages 51–77. Springer Vienna, January 1990. ISBN 978-3-7091-8992-4, 978-3-7091-8990-0. URL http://link.springer.com/chapter/10.1007/978-3-7091-8990-0_3. (Cited on page 21.)

[71] Pierre Lemarinier, Aurélien Bouteiller, Thomas Herault, Géraud Krawezik, and Franck Cappello. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, CLUSTER ’04, pages 115–124, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7803-8694-9. URL http://dl.acm.org/citation.cfm?id=1111682.1111713. (Cited on page 22.)

[72] E. Christopher Lewis, Prashant Dhamdhere, and Eric Xiaojian Chen. Virtual machine-based replay debugging, October 2008. Google Tech Talks: http://www.youtube.com/watch?v=RvMlihjqlhY; further information at http://www.replaydebugging.com. (Cited on page 136.)

[73] Kai Li, Jeffrey F. Naughton, and James S. Plank. Real-time, concurrent checkpoint for parallel programs. In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, PPOPP ’90, pages 79–88, New York, NY, USA, 1990. ACM. ISBN 0-89791-350-7. doi: 10.1145/99163.99173. URL http://doi.acm.org/10.1145/99163.99173. (Cited on pages 15 and 22.)

[74] Kai Li, Jeffrey F. Naughton, and James S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874–879, August 1994. ISSN 1045-9219. doi: 10.1109/71.298215. (Cited on pages 15 and 22.)

[75] Tim Lindholm and Frank Yellin. Java Virtual Machine Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1999. ISBN 0201432943. (Cited on page 22.)

[76] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report 1346, University of Wisconsin, Madison, Wisconsin, April 1997. (Cited on pages 15, 18, and 23.)

[77] Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K. Panda. High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 32(3):167–198, June 2004. ISSN 0885-7458, 1573-7640. doi: 10.1023/B:IJPP.0000029272.69895.c1. URL http://link.springer.com/article/10.1023/B:IJPP.0000029272.69895.c1. (Cited on page 19.)

[78] Pin Lu and Kai Shen. Virtual machine memory access tracing with hypervisor exclusive cache. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, ATC’07, pages 3:1–3:15, Berkeley, CA, USA, 2007. USENIX Association. ISBN 999-8888-77-6. URL http://dl.acm.org/citation.cfm?id=1364385.1364388. (Cited on page 130.)

[79] Machi Maeji and Tadashi Kanamori. Top-down zooming diagnosis of logic programs. Technical report, Kyoto University, 1988. (Cited on page 135.)

[80] Violeta Medina and Juan Manuel García. A survey of migration mechanisms of virtual machines. ACM Comput. Surv., 46(3):30:1–30:33, January 2014. ISSN 0360-0300. doi: 10.1145/2492705. URL http://doi.acm.org/10.1145/2492705. (Cited on page 14.)

[81] Dejan S. Milojicic, Fred Douglis, Yves Paindaveine, Richard Wheeler, and Songnian Zhou. Process migration. ACM Computing Surveys, 32(3):241–299, September 2000. ISSN 0360-0300. doi: 10.1145/367701.367728. URL http://doi.acm.org/10.1145/367701.367728. (Cited on page 13.)

[82] Grzegorz Miłós, Derek G. Murray, Steven Hand, and Michael A. Fetterman. Satori: Enlightened page sharing. In Proceedings of the 2009 Conference on USENIX Annual Technical Conference, USENIX’09, pages 1–1, Berkeley, CA, USA, 2009. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1855807.1855808. (Cited on pages 91, 97, 101, 130, and 131.)

[83] Henrik Nilsson. Declarative debugging for lazy functional languages. Citeseer, 1998. (Cited on page 135.)

[84] Henrik Nilsson and Peter Fritzson. Algorithmic debugging for lazy functional languages. In Maurice Bruynooghe and Martin Wirsing, editors, Proceedings of the 4th International Symposium on Programming Language Implementation and Logic Programming, PLILP ’92, pages 385–399, London, UK, 1992. Springer Berlin Heidelberg. ISBN 3-540-55844-6. URL http://dl.acm.org/citation.cfm?id=646448.692462. (Cited on page 135.)

[85] Mark O’Neill. Cryopid2, December 2013. URL http://sourceforge.net/projects/cryopid2. (Cited on page 19.)

[86] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The design and implementation of Zap: A system for migrating computing environments. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, OSDI ’02, pages 361–376, New York, NY, USA, 2002. ACM. ISBN 978-1-4503-0111-4. doi: 10.1145/1060289.1060323. URL http://doi.acm.org/10.1145/1060289.1060323. (Cited on pages 8 and 16.)


[87] Eunbyung Park, Bernhard Egger, and Jaejin Lee. Fast and space-efficient virtual machine checkpointing. In Proceedings of the 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’11, pages 75–86, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0687-4. doi: 10.1145/1952682.1952694. URL http://doi.acm.org/10.1145/1952682.1952694. (Cited on pages 100 and 130.)

[88] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and Anand Karunanidhi. Pinpointing representative portions of large Intel® Itanium® programs with dynamic instrumentation. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37, pages 81–92, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2126-6. doi: 10.1109/MICRO.2004.28. URL http://dx.doi.org/10.1109/MICRO.2004.28. (Cited on pages 21, 23, and 24.)

[89] Eduardo Pinheiro. EPCKPT — a checkpoint utility for the Linux kernel, 2002. URL http://www.research.rutgers.edu/edpin/epckpt/. (Cited on page 15.)

[90] James Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical report, University of Tennessee, Knoxville, TN, USA, 1997. (Cited on page 13.)

[91] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under Unix. In Proceedings of the USENIX 1995 Technical Conference, TCON’95, pages 18–18, Berkeley, CA, USA, 1995. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267411.1267429. (Cited on page 15.)


[92] James S. Plank, Jian Xu, and Robert H. B. Netzer. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee, August 1995. (Cited on pages 15 and 18.)

[93] Artem Y. Polyakov. Batch-queue plugin for DMTCP, March 2014. URL https://sourceforge.net/p/dmtcp/code/HEAD/tree/trunk/plugin/batch-queue. (Cited on pages 9 and 81.)

[94] Bernard James Pope. A Declarative Debugger for Haskell. PhD thesis, University of Melbourne, Department of Computer Science and Software Engineering, Victoria, Australia, 2007. (Cited on page 135.)

[95] Donald E. Porter, Silas Boyd-Wickizer, Jon Howell, Reuben Olinsky, and Galen C. Hunt. Rethinking the library OS from the top down. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 291–304, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0266-1. doi: 10.1145/1950365.1950399. URL http://doi.acm.org/10.1145/1950365.1950399. (Cited on page 24.)

[96] Daniel Price, Andrew Tucker, and Sun Microsystems. Solaris Zones: Operating system support for consolidating commercial workloads. In 18th Large Installation System Administration Conference, pages 241–254, 2004. (Cited on page 23.)

[97] Eric Roman. A survey of Checkpoint/Restart implementations. Technical report, Lawrence Berkeley National Laboratory, 2002. (Cited on page 13.)

[98] Jose Carlos Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa, and Song Jiang. Current practice and a direction forward in checkpoint/restart implementations for fault tolerance. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 8 pp., April 2005. doi: 10.1109/IPDPS.2005.157. (Cited on page 13.)

[99] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine, Jason Duell, Paul Hargrove, and Eric Roman. The LAM/MPI Checkpoint/Restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4):479–493, November 2005. ISSN 1094-3420, 1741-2846. doi: 10.1177/1094342005056139. URL http://hpc.sagepub.com/content/19/4/479. (Cited on pages 18 and 19.)

[100] Martin Schulz, Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, and Paul Stodghill. Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC ’04, page 38, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2153-3. doi: 10.1109/SC.2004.29. URL http://dx.doi.org/10.1109/SC.2004.29. (Cited on page 15.)

[101] Love H. Seawright and Richard A. MacKinnon. VM/370: A study of multiplicity and usefulness. IBM Syst. J., 18(1):4–17, March 1979. ISSN 0018-8670. doi: 10.1147/sj.181.0004. URL http://dx.doi.org/10.1147/sj.181.0004. (Cited on page 129.)

[102] Josep Silva. A comparative study of algorithmic debugging strategies. In Germán Puebla, editor, Logic-Based Program Synthesis and Transformation, number 4407 in Lecture Notes in Computer Science, pages 143–159. Springer Berlin Heidelberg, January 2007. ISBN 978-3-540-71409-5, 978-3-540-71410-1. URL http://link.springer.com/chapter/10.1007/978-3-540-71410-1_11. (Cited on page 135.)


[103] Standard Performance Evaluation Corporation (SPEC). SPECjbb2005, March 2014. URL http://www.spec.org/jbb2005. (Cited on pages 112 and 116.)

[104] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Parallel Processing Symposium, 1996. Proceedings of IPPS ’96, The 10th International, pages 526–531, April 1996. doi: 10.1109/IPPS.1996.508106. (Cited on page 18.)

[105] O. O. Sudakov, I. S. Meshcheriakov, and Y. V. Boyko. CHPOX: Transparent checkpointing system for Linux clusters. In 4th IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2007. IDAACS 2007, pages 159–164, September 2007. doi: 10.1109/IDAACS.2007.4488396. (Cited on page 17.)

[106] Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy. Recovering device drivers. ACM Trans. Comput. Syst., 24(4):333–360, November 2006. ISSN 0734-2071. doi: 10.1145/1189256.1189257. URL http://doi.acm.org/10.1145/1189256.1189257. (Cited on page 35.)

[107] Hajime Tazaki, Frédéric Urbani, Emilio Mancini, Mathieu Lacage, Daniel Camara, Thierry Turletti, and Walid Dabbous. Direct code execution: Revisiting library OS architecture for reproducible network experiments. In The 9th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), Santa Barbara, USA, December 2013. URL http://hal.inria.fr/hal-00880870. (Cited on page 24.)

[108] Boost Team. Boost serialization, March 2014. URL www.boost.org/libs/serialization. (Cited on page 14.)


[109] Condor Team. Condor standard universe, 2013. URL http://research.cs.wisc.edu/htcondor/manual/v7.9/2_4Road_map_Running.html. (Cited on pages 3 and 18.)

[110] Condor Team. The Condor project homepage, March 2014. URL http://www.cs.wisc.edu/condor/. (Cited on page 3.)

[111] CRIU Team. CRIU, December 2013. URL http://criu.org/. (Cited on pages 3, 20, and 23.)

[112] FReD Team. FReD software, 2011. URL https://github.com/fred-dbg/fred. (Cited on page 85.)

[113] Jenkins Team. Jenkins, March 2014. URL http://jenkins-ci.org. (Cited on page 116.)

[114] KVM Team. KVM/QEmu, March 2014. URL http://wiki.qemu.org/KVM. (Cited on pages 24 and 87.)

[115] Lguest Team. Lguest: The simple x86 hypervisor, March 2014. URL http://lguest.ozlabs.org. (Cited on pages 24, 87, and 88.)

[116] Linux-VServer Team. Linux-VServer, 2003. URL http://linux-vserver.org. (Cited on page 23.)

[117] LXC Team. LXC Linux containers, December 2013. URL https://linuxcontainers.org/. (Cited on pages 16, 20, and 23.)

[118] OpenVZ Team. OpenVZ, 2006. URL http://openvz.org. (Cited on page 23.)

[119] Parallels Virtuozzo Containers Team. Parallels Virtuozzo Containers, 2014. URL http://www.parallels.com/products/pvc/. (Cited on page 23.)


[120] Python Team. Pickle: Python object serialization, March 2014. URL https://docs.python.org/2/library/pickle.html. (Cited on page 14.)

[121] QEmu Team. QEmu, 1998. URL http://qemu.org. (Cited on page 87.)

[122] Thuan L. Thai and Hoang Lam. .NET Framework Essentials. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2001. ISBN 0596001657. (Cited on page 22.)

[123] Douglas Thain and Miron Livny. Multiple bypass: Interposition agents for distributed computing. Cluster Computing, 4(1):39–47, March 2001. ISSN 1386-7857. doi: 10.1023/A:1011412209850. URL http://dx.doi.org/10.1023/A:1011412209850. (Cited on page 21.)

[124] Mustafa M. Tikir and Jeffrey K. Hollingsworth. Hardware monitors for dynamic page migration. Journal of Parallel and Distributed Computing, 68(9):1186–1200, September 2008. ISSN 0743-7315. doi: 10.1016/j.jpdc.2008.05.006. URL http://www.sciencedirect.com/science/article/pii/S0743731508001020. (Cited on pages 21, 23, and 24.)

[125] Anthony Velte and Toby Velte. Microsoft Virtualization with Hyper-V. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 2010. ISBN 0071614036, 9780071614030. (Cited on page 24.)

[126] Ana-Maria Visan. Temporal Meta-Programming: Treating Time as a Spatial Dimension. PhD thesis, Northeastern University, 2012. (Cited on page 9.)

[127] Ana-Maria Visan, Kapil Arya, Gene Cooperman, and Tyler Denniston. URDB: A universal reversible debugger based on decomposing debugging histories. In Proc. of 6th Workshop on Programming Languages and Operating Systems (PLOS) (part of Proc. of 23rd ACM Symp. on Operating System Principles (SOSP)), 2011. Electronic proceedings at http://sigops.org/sosp/sosp11/workshops/plos/08-visan.pdf; software for latest version, FReD (Fast Reversible Debugger), at https://github.com/fred-dbg/fred. (Cited on pages 9, 71, 84, 135, and 136.)

[128] Carl A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, OSDI ’02, pages 181–194, New York, NY, USA, 2002. ACM. ISBN 978-1-4503-0111-4. doi: 10.1145/1060289.1060307. URL http://doi.acm.org/10.1145/1060289.1060307. (Cited on pages 91, 96, 97, 98, and 129.)

[129] John Paul Walters and Vipin Chaudhary. Application-level checkpointing techniques for parallel programs. In Sanjay K. Madria, Kajal T. Claypool, Rajgopal Kannan, Prem Uppuluri, and Manoj Madhava Gore, editors, Distributed Computing and Internet Technology, number 4317 in Lecture Notes in Computer Science, pages 221–234. Springer Berlin Heidelberg, January 2006. ISBN 978-3-540-68379-7, 978-3-540-68380-3. URL http://link.springer.com/chapter/10.1007/11951957_21. (Cited on page 14.)

[130] Jon Watson. VirtualBox: Bits and bytes masquerading as machines. Linux J., 2008(166), February 2008. ISSN 1075-3583. URL http://dl.acm.org/citation.cfm?id=1344209.1344210. (Cited on page 24.)

[131] Benjamin Wester, Peter M. Chen, and Jason Flinn. Operating system support for application-specific speculation. In Proceedings of the Sixth Conference on Computer Systems, EuroSys ’11, pages 229–242, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0634-8. doi: 10.1145/1966445.1966467. URL http://doi.acm.org/10.1145/1966445.1966467. (Cited on page 134.)

[132] David A. Wheeler. SLOCCount: Source lines of code counter, March 2014. URL http://www.dwheeler.com/sloccount. (Cited on page 73.)

[133] Namyoon Woo, Soonho Choi, Hyungsoo Jung, Jungwhan Moon, Heon Y. Yeom, Taesoon Park, and Hyungwoo Park. MPICH-GF: Providing fault tolerance on grid environments. In Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), poster and research demo session, Tokyo, Japan, May 2003. (Cited on page 18.)

[134] Bob Woodruff, Sean Hefty, Roland Dreier, and Hal Rosenstock. Introduction to the InfiniBand core software. In Proceedings of the Linux Symposium (Volume Two), pages 271–282, Ottawa, Canada, July 2005. (Cited on page 35.)

[135] Victor C. Zandy. ckpt — a process checkpoint library, 2005. URL http://cs.wisc.edu/~zandy/ckpt/. (Cited on page 23.)

[136] Victor C. Zandy, Barton P. Miller, and Miron Livny. Process hijacking. In The Eighth International Symposium on High Performance Distributed Computing, 1999. Proceedings, pages 177–184, 1999. doi: 10.1109/HPDC.1999.805296. (Cited on pages 21 and 23.)

[137] Youhui Zhang, Dongsheng Wong, and Weimin Zheng. User-level checkpoint and recovery for LAM/MPI. SIGOPS Oper. Syst. Rev., 39(3):72–81, July 2005. ISSN 0163-5980. doi: 10.1145/1075395.1075402. URL http://doi.acm.org/10.1145/1075395.1075402. (Cited on page 18.)

[138] Gengbin Zheng, Lixia Shi, and L. V. Kale. FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 IEEE International Conference on Cluster Computing, pages 93–103, September 2004. doi: 10.1109/CLUSTR.2004.1392606. (Cited on page 18.)

[139] Hua Zhong and Jason Nieh. CRAK: Linux Checkpoint/Restart as a kernel module. Technical Report CUCS-014-01, Dept. of Computer Science, Columbia University, November 2001. (Cited on page 16.)