
A small-scale testbed for large-scale reliable computing



Purdue University
Purdue e-Pubs

Open Access Theses Theses and Dissertations

12-2016

A small-scale testbed for large-scale reliable computing
Jason R. St. John
Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/open_access_theses

Part of the Computer Sciences Commons

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

Recommended Citation
St. John, Jason R., "A small-scale testbed for large-scale reliable computing" (2016). Open Access Theses. 899. https://docs.lib.purdue.edu/open_access_theses/899


Graduate School Form 30

PURDUE UNIVERSITY
GRADUATE SCHOOL

Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared

By Jason R. St. John

Entitled A SMALL-SCALE TESTBED FOR LARGE-SCALE RELIABLE COMPUTING

For the degree of Master of Science

Is approved by the final examining committee:

Thomas Hacker, Chair
Eric Matson
John Springer

To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material.

Approved by Major Professor(s): Thomas Hacker

Approved by: Jeffrey Whitten, Head of the Departmental Graduate Program
Date: 11/29/2016


A SMALL-SCALE TESTBED FOR

LARGE-SCALE RELIABLE COMPUTING

A Thesis

Submitted to the Faculty

of

Purdue University

by

Jason R. St. John

In Partial Fulfillment of the

Requirements for the Degree

of

Master of Science

December 2016

Purdue University

West Lafayette, Indiana


ACKNOWLEDGMENTS

I wish to gratefully acknowledge my thesis committee, my family, and my

fellow graduate students for their help and support, encouragement, and insight.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ABSTRACT

CHAPTER 1. INTRODUCTION
  1.1 Scope
  1.2 Significance
  1.3 Research Question
  1.4 Assumptions
  1.5 Limitations
  1.6 Delimitations
  1.7 Definitions
  1.8 Summary

CHAPTER 2. REVIEW OF RELEVANT LITERATURE
  2.1 HPC System Reliability
  2.2 HPC, Cloud Computing, and Virtualization
  2.3 Error Injection
  2.4 Summary

CHAPTER 3. FRAMEWORK AND METHODOLOGY
  3.1 Study Design
    3.1.1 Data Sets
    3.1.2 Overview of Statistical Model Creation
    3.1.3 System Log Analysis
      3.1.3.1 Coates: Log Processing and Sorting
      3.1.3.2 Coates: Timestamp Standardization
      3.1.3.3 Coates: Node ID Standardization
      3.1.3.4 Coates: Scatter Plots of Memory Failure Events
      3.1.3.5 Coates: Declustering of Failure Events
      3.1.3.6 Carter: Scatter Plots of Hard Disk Failure Events
      3.1.3.7 Carter: Declustering of Failure Events
      3.1.3.8 Distribution Fitting
    3.1.4 Fault Injection
      3.1.4.1 Memory Error Injection Procedure
      3.1.4.2 Machine Check Registers
      3.1.4.3 Memory Error Injection Strings
      3.1.4.4 QEMU Modifications and Hard Disk Error Simulation Procedure
      3.1.4.5 Hard Disk Errors
      3.1.4.6 Tunable Fault Injection Frequency
      3.1.4.7 Tuning the Three-Parameter Weibull Distribution
      3.1.4.8 Tuning the Lomax Distribution
      3.1.4.9 Experimental Setup
  3.2 Unit & Sampling
    3.2.1 Hypothesis
    3.2.2 Population
    3.2.3 Sample
    3.2.4 Variables
    3.2.5 Measure for Success
  3.3 Summary

CHAPTER 4. PRESENTATION OF DATA
  4.1 Statistical Models
    4.1.1 Coates: Memory Errors
    4.1.2 Carter: Hard Disk Errors
  4.2 Experiments
    4.2.1 System Log Samples
    4.2.2 Quantitative Results

CHAPTER 5. CONCLUSIONS, DISCUSSION, AND RECOMMENDATIONS
  5.1 Conclusions and Discussion
  5.2 Recommendations

APPENDIX A. FAULT INJECTION SCRIPT
APPENDIX B. QEMU SOURCE CODE MODIFICATIONS
LIST OF REFERENCES


LIST OF TABLES

3.1 Comparison of various timestamp formats
3.2 Polynomial coefficients
3.3 Polynomial roots
3.4 Number of centers vs. WSSE
3.5 Supported memory errors and MCi_STATUS register values
3.6 Supported hard disk errors and their descriptions
4.1 Parameters of the three-parameter Weibull distribution for integrated memory controller errors on the Coates system
4.2 Parameters of the Lomax distribution for hard disk errors on the Carter system
4.3 Number of attempted and successful fault injections


LIST OF FIGURES

3.1 Time of Memory Error Events vs. Node ID
3.2 Time of Hard Disk Error Events vs. Node ID
3.3 WSSE vs. Number of Clusters
3.4 Configuration file for fio
3.5 systemd service file for fio, installed on the VMs
3.6 systemd service file for changing the default polling interval for machine checks
3.7 Bash script that changes the default polling frequency for machine checks
4.1 System log screenshot of VM “centos4” experiencing an ENOMEDIUM error injection


ABBREVIATIONS

ADDR memory address

BSD Berkeley Software Distribution

CPU central processing unit

EC2 Amazon’s Elastic Compute Cloud

ECC error-correcting code

DRAM dynamic random-access memory

fio Flexible I/O Tester

GART Graphics Address Relocation Table

GUI graphical user interface

HMP QEMU’s Human Monitor Protocol

HPC high performance computing

IaaS infrastructure as a service

I/O input/output

IP Internet Protocol

ISO International Organization for Standardization

IT information technology

MAC media access control

MC machine check

MCA machine check architecture

MCE machine check exception

MTBF mean time between failures

MTTF mean time to failure

RAM random-access memory

RCAC Rosen Center for Advanced Computing


RFC Request For Comments

RHEL Red Hat Enterprise Linux

TLB Translation Lookaside Buffer

UTC Coordinated Universal Time

VM virtual machine

VMM virtual machine monitor

WSSE within groups sum of squared error


ABSTRACT

St. John, Jason R. M.S., Purdue University, December 2016. A Small-Scale Testbed For Large-Scale Reliable Computing. Major Professor: Thomas J. Hacker.

High performance computing (HPC) systems frequently suffer errors and

failures from hardware components that negatively impact the performance of jobs

run on these systems. We analyzed system logs from two HPC systems at Purdue

University and created statistical models for memory and hard disk errors. We

created a small-scale error injection testbed—using a customized QEMU build,

libvirt, and Python—for HPC application programmers to test and debug their

programs in a faulty environment so that programmers can write more robust and

resilient programs before deploying them on an actual HPC system. The

deliverables for this project are the fault injection program, the modified QEMU

source code, and the statistical models used for driving the injection.


CHAPTER 1. INTRODUCTION

High performance computing (HPC) systems are a major component of

academic and industrial scientific research, and the reliability of these systems is one

of the foremost areas of research in the HPC community. The HPC community is

continually trying to make larger, more powerful HPC systems, and the poor

reliability of commodity HPC clusters is one of the biggest hindrances to the

adoption of petascale and exascale computing systems. This chapter provides the

scope, significance, research question, and other background information on the

thesis.

1.1 Scope

This study examined the system logs of large-scale HPC systems operated by

the Rosen Center for Advanced Computing (RCAC) at Purdue University. The

systems studied are all commodity-based computing clusters running Red Hat

Enterprise Linux (RHEL) 5, which is a GNU/Linux distribution based on version

2.6.18 of the Linux kernel. This study determined the frequency at which

component failures occurred and produced statistical models of the recorded failure

events. The statistical models were used as the driving input to a synthetic fault

generator that simulated hardware failures within a virtual machine environment.

1.2 Significance

Large-scale high performance computing (HPC) systems suffer from low

reliability due to frequent component failures of commodity hardware. Due to the

series reliability model, increasing the number of nodes or processors in an HPC


cluster reduces the overall reliability of the whole cluster. This results in frequent

failures that hinder the performance and scalability of large-scale systems.

The work in this thesis provides a small-scale fault injection system based on

system logs from real HPC systems under a typical workload. These logs are used

to emulate failures at the VM level, creating a simulated faulty cluster that serves

as a testbed for parallel applications. This would allow parallel application developers to

test their code’s robustness and tolerance of node failures in a small-scale

environment that replicates the failure patterns of large-scale systems. Researchers

will be able to submit more fault-tolerant parallel applications to large-scale HPC

systems first, instead of submitting jobs that have not been properly tested for

robustness and resilience.

1.3 Research Question

Can a small-scale fault injection system based on real HPC system logs be

used to create a suitable testbed for parallel applications?

1.4 Assumptions

The assumptions for this study include:

• The filtering of the system logs by RCAC did not remove any important or

relevant messages.

• The system logs provided by RCAC are accurate and correct.

• MathWave’s EasyFit software correctly and accurately fits datasets to

statistical distributions.

• All virtual cluster nodes’ system clocks were in synchrony with each other

throughout the logging period.


• The overhead of the developed system will be minimal enough to avoid

significant performance penalties.

1.5 Limitations

The limitations for this study include:

• The developed system was tested in a virtual environment using the

QEMU-KVM hypervisor only.

• The developed system was tested using the CentOS 7.2 Linux distribution

only.

• The developed system was run on Intel processors.

• The developed system cannot reliably inject multiple hard disk errors when

the time between errors is less than roughly one minute.

1.6 Delimitations

The delimitations for this study include:

• This study cannot examine the most critical of failure events, the machine

check exception (MCE), because of an unpatched bug in the Linux kernel’s

handling of MCEs. This limitation applies to Linux kernel version 2.6.18.

• The studied HPC systems were manufactured by HP, use AMD processors,

and run Red Hat Enterprise Linux 5.5.

• The developed system was not tested on live HPC cluster nodes or with

real HPC jobs running.


1.7 Definitions

For the purposes of this thesis, we define the following terms:

• checkpoint–restart: a fault tolerance mechanism designed to back up the state

of a system (Hacker, Romero, & Carothers, 2009)

• component failure: any defect or error that occurs in a single computational

node

• machine check architecture (MCA): the platform independent framework for

detecting, reporting, and handling machine check errors (“AMD64

Architecture Programmer’s Manual Volume 2: System Programming”, 2011)

• machine check error: (commonly: machine check) a correctable or

uncorrectable hardware error detected by the machine check architecture

(MCA) (“AMD64 Architecture Programmer’s Manual Volume 2: System

Programming”, 2011)

• machine check exception (MCE): a machine check error that cannot be

corrected and frequently results in a processor’s context becoming corrupted

(“AMD64 Architecture Programmer’s Manual Volume 2: System

Programming”, 2011)

• mean time between failures (MTBF): the mean time elapsed between

consecutive failure events that includes the repair/maintenance time

• mean time to failure (MTTF): the mean time elapsed from a node being

added to a cluster to its first recorded failure event

• memory module: a hardware component composed of volatile random access

memory (RAM) that is used as the primary cache for a node; nodes usually

have many memory modules.


• spatial locality: the apparent clustering of failure events with respect to

physical node location (e.g. server rack), indicating failure event dependence

on an external factor (Hacker et al., 2009)

• temporal locality: the apparent clustering of failure events with respect to

time, indicating failure event interdependence (Hacker et al., 2009)

• Unix epoch timestamp: the count of the number of seconds since

1970-01-01T00:00:00 UTC

1.8 Summary

This chapter has provided the background information, research question,

and the scope of the thesis. The next chapter provides a review of the literature

related to HPC system logs and system reliability.


CHAPTER 2. REVIEW OF RELEVANT LITERATURE

This chapter provides a review of the literature relevant to HPC system logs

and HPC system reliability.

2.1 HPC System Reliability

Improving the reliability of high performance computing (HPC) systems is

one of the leading research areas in the HPC field. Many studies have been

performed that have covered various methods for understanding how, why, and

when failures occur in large-scale HPC systems (Atif & Strazdins, 2009;

DeBardeleben, Blanchard, Fu, Guan, & Zhang, 2011; Fu & Xu, 2007; Hacker et al.,

2009; Oliner & Stearley, 2007; Pandit, Kalbarczyk, & Iyer, 2009; Romero, 2010;

Salfner & Tschirpke, 2008; Zhang, Squillante, Sivasubramaniam, & Sahoo, 2004;

Zheng, Lan, Park, & Geist, 2009; Zhou, Zhan, Meng, Xu, & Zhang, 2010; Zhou,

Zhan, Meng, & Zhang, 2010).

HPC system logs are an invaluable tool for studying ways of improving the

reliability of HPC systems. Unfortunately, the system logs on HPC systems are

frequently of poor quality—vague, cryptic, not easily machine-parsable, etc.—which

necessitates significant processing to produce useful data. A notable flaw in the

2.6.18 version of the Linux kernel—the same kernel version used in Red Hat

Enterprise Linux (RHEL) 5—prevents the most severe of hardware failures from

being logged at all (Pandit et al., 2009). This further complicates system log

analysis on clusters running RHEL 5.

Zheng et al. (2009) presented a framework for pre-processing HPC system

logs to make these logs better suited for statistical analysis and failure prediction.

The framework consists of three components:


1. classifying messages by type and severity

2. removing temporal and spatial clustering of repeated messages for the same

event

3. identifying causal relationships between system log messages to better identify

symptomatic messages

Zheng et al. (2009) stated that their framework improved failure prediction over the

incumbent mechanism by 20% to 170%.

Salfner and Tschirpke (2008) proposed a set of techniques for processing the

system logs of a commercial telecommunication system, and their techniques

consisted of the following:

• classifying “messages based on Levenshtein’s edit distance”

• clustering related error messages

• “a statistical noise filtering algorithm”

Salfner and Tschirpke (2008) stated that their techniques significantly improved

upon other failure prediction methods, and the authors concluded that appropriate

pre-processing of the system logs is vital for producing useful data.

Hacker et al. (2009) concluded in their analysis of system logs from two large

IBM Blue Gene systems that system log messages show significant spatial and

temporal clustering, that the failure event rate varies in time following the Weibull

distribution, and that the health and reliability of individual computational nodes

can be estimated by system log analysis.

The work in this thesis partially aims to support the live migration of virtual

machines atop a cloud-like infrastructure as a service (IaaS) platform, such as

OpenNebula. On such a platform, a statistical model of previous HPC system failures,

to be combined in the future with the discrete-time, semi-Markov model introduced by

Hacker et al. (2009), would trigger live migrations as node health decreases. An ideal


implementation of this work, under perfect conditions, would result in improved

uptime for jobs because virtual machines would be live migrated off of unhealthy

hardware before failures occur.

2.2 HPC, Cloud Computing, and Virtualization

Cloud computing is a technology still in its adolescence, and it was only in

the mid-2000s that cloud computing and virtualization exploded in popularity.

Virtualization is required for almost any cost-effective cloud computing

environment; thus, for cloud computing to be viable for HPC applications, the

overhead introduced by virtualization must be minimal. Youseff, Wolski, Gorda,

and Krintz (2006) demonstrated that the Xen virtual machine monitor (VMM) adds

“no statistically significant overhead” for standard HPC benchmarks run on the

small-scale HPC system tested in their study. Youseff et al. (2006) stated that their

study showed that previous skepticism of virtualization for HPC applications was

“unwarranted” and that the many added benefits of virtualization outweigh the

negligible performance overhead.

Evangelinos and Hill (2008) stated that Amazon’s Elastic Compute Cloud

(EC2) is a promising platform for HPC applications because research institutions

will not need to build and maintain their own HPC clusters. Evangelinos and Hill

(2008) stated, however, that coupled HPC applications do suffer from a significant

performance penalty incurred by the inherent design of Amazon’s EC2 because

Amazon does not provide clients the ability to deploy virtual machines at a granular

level. Despite the performance penalty, the low cost and ability to deploy HPC

applications quickly with few requirements for local IT infrastructure demonstrate

the viability of cloud-based HPC systems in the future, especially if a cloud system

was designed with HPC applications in mind (Evangelinos & Hill, 2008).

The live migration of virtual machines is a process which involves copying

the running virtual machine’s memory pages to another physical host also running a


VM and, in the case of Xen, iteratively copies the memory pages up to 30 times

before pausing the running operation of the VM, resuming the VM on the new host,

and then destroying the original VM instance (Atif & Strazdins, 2009). Atif and

Strazdins (2009) hypothesized that the overall performance of live migrations could

be significantly improved by reducing the amount of CPU intensive and network

intensive memory copy iterations to the bare minimum of two iterations. Atif and

Strazdins (2009) demonstrated that optimizing this process resulted in a reduction

of memory page transfer by over 500% for the very memory intensive HPC

benchmarks. Their optimization showed improvements of almost 50% in total wall

clock time spent in the live migration process and around 200% performance

improvement for memory intensive and network intensive jobs (Atif & Strazdins,

2009). Additionally, live migration has already been shown to be a promising

proactive fault tolerance technique. Romero (2010) demonstrated a reduction in the

failure rate of parallel applications by live migrating multiple OpenVZ

containers—basically, advanced derivatives of FreeBSD Jails—to multiple nodes

that are less prone to failure.

2.3 Error Injection

Several error and fault injection systems have been developed previously.

Giuffrida, Kuijsten, and Tanenbaum (2013) created a fault injection tool for

software programs. Their system injects software faults in a controlled manner and

“offers strong guarantees” that attempted software fault injections do not cause

unintended side effects that could undermine the results. However, their approach

requires changes be made to the tested software at compile-time or using a

disassembler if source code is not available.

Guan, DeBardeleben, Blanchard, and Fu (2014) created a fault injector tool

that uses the QEMU virtual machine and can precisely target programs running

inside a VM. Their system is intended for introducing errors in software programs


by modifying the program’s memory by exploiting how QEMU translates virtual

machine instructions between the guest and host systems. Their system exploits the

Tiny Code Generation (TCG) system used by QEMU and “translates” the intended

VM instructions to corrupted ones before the host executes them.

Error injection into virtual machines has been investigated previously by

DeBardeleben et al. (2011). DeBardeleben et al. (2011) introduced a framework for

injecting errors that evaluates the “resilience” of parallel applications to such errors.

The approach DeBardeleben et al. (2011) took is targeted solely at causing faults

inside the application’s memory.

Levy, Dosanjh, Bridges, and Ferreira (2013) created a VM-based fault

injection tool using the Palacios virtual machine monitor (Lange et al., 2011), which

is a VMM developed for HPC systems. Their approach injects memory errors at

specific memory locations and IDE disk errors at the VM level. Their method of

injecting errors differs from our approach due to differences in the hypervisor and

due to our approach being more generalized and not necessitating physical memory

or disk addresses.

The work in this thesis is evaluated by testing the injection of machine

checks and simulation of disk failures into a virtual cluster—in other words, causing

the virtual hardware to experience similar failures and errors as those observed in

real system logs under typical workloads. The work in this thesis uses libvirt (libvirt

Virtualization API, 2015)—a virtualization API—and a custom QEMU to inject

errors into the virtual hardware which causes the virtual machine to detect and

report these errors using the inherent failure detection mechanisms within the OS.

Our approach differs from Giuffrida et al. (2013), Guan et al. (2014), and

DeBardeleben et al. (2011) by simulating errors at the virtual hardware level and

using the inherent failure detection mechanisms within the OS to detect the errors.

Our approach improves upon the work by Levy et al. (2013) by triggering

whole-disk errors, which ensures that errors are always detected given some disk

activity, supporting a larger array of memory and disk error types, and by using a


mainstream virtual machine monitor in QEMU—a key ease-of-use requirement for a

testbed system not intended to be deployed on real HPC systems, and therefore not

bound by the same performance requirements of real HPC jobs.

2.4 Summary

This chapter provided an overview of the literature related to HPC system

logs and HPC system reliability. The next chapter provides the research approach

and methodology for the thesis.


CHAPTER 3. FRAMEWORK AND METHODOLOGY

This chapter provides the framework and methodology that was used in this

research project.

3.1 Study Design

This study was a quantitative research analysis of system logs from real HPC

systems. This study created statistical models of component failures within the

systems and used the models to simulate failure events on virtual machines. The

virtual machines experienced failures similar to those found in the logs from real

HPC systems.

3.1.1 Data Sets

Memory error log data was taken from system logs from the 993-node Coates

system (Rosen Center for Advanced Computing, n.d.-b) operated by the Rosen

Center for Advanced Computing (RCAC) at Purdue University. The logs analyzed

ranged from 2009-09-01 to 2011-02-20.

Hard disk error log data was taken from system logs from the 660-node

Carter system (Rosen Center for Advanced Computing, n.d.-a) operated by RCAC

at Purdue University. The logs analyzed ranged from 2012-11-25 to 2013-08-31.

The Coates and Carter systems were operating under typical HPC workloads

throughout the log collection period. This makes the analyzed log data

generalizable to other large-scale systems operating under workload.


3.1.2 Overview of Statistical Model Creation

This section provides an overview of the steps taken to create the statistical

models. This is provided to assist other researchers that may be interested in

creating statistical models for errors on other systems.

The system logs for each system were collected in a central location and were

stripped of personally-identifiable information prior to the beginning of this study.

The following list provides a high-level overview of the procedure.

1. Error discovery

2. Error filtering

3. Error sorting

4. Log message timestamp sanitization

5. Scatter plot creation to identify clustering of errors

6. Declustering data using k-means

7. Fitting distributions to the data

8. Selecting a model from the fitted distributions

Manual searching of the system logs was performed using grep with regular

expressions to look for anomalous log events. Error messages of interest were then

filtered out from the rest of the logs and sorted based on the type of error. Due to

poor quality timestamps of the Coates system logs, these timestamps had to be

corrected prior to further processing.

Scatter plots were made of the error event data to identify the severity of

clustering. We then declustered the data using k-means cluster analysis. The

declustered data was loaded into distribution fitting software. Finally, the

continuous statistical distributions with the best fit were selected as our models.


3.1.3 System Log Analysis

The system logs were manually examined to determine the key phrases

within the logs related to failure events of memory and hard disks. The logs were

searched using a grep regular expression like the one below:

grep -Ei error messages*

The system logs from Coates needed significant processing to improve the

quality of message timestamps. The system logs from Carter did not need any

notable processing.

3.1.3.1 Coates: Log Processing and Sorting

The logs were filtered so that only failure-event-related messages were kept. The

logs were further processed to identify specific failure messages and were sorted by

those messages.

We investigated general errors detected by the integrated memory

controller—an external memory controller is called a northbridge, but this

distinction is not made in the system logs. The “BIOS and Kernel Developer’s

Guide for AMD Athlon 64 and AMD Opteron Processors” (2006) states that these

errors include errors in the GART TLB cache, errors in the HyperTransport link, or

errors in DRAM. It should be noted that the RHEL kernel version

2.6.18-194.17.1.el5 disables GART TLB error reporting by default. The “node” and

“core” values mentioned below refer to the physical CPU package (i.e. socket) and

the CPU core number within the physical CPU package, respectively. Because the

Coates cluster is built entirely of dual-socket, quad-core AMD Opteron processors,

the only valid values for “node” are 0 and 1, and the only valid values for “core” are

0, 1, 2, and 3.

The regular expression used to match these errors is

kernel:.*Northbridge Error

and an example log message is


Feb 16 14:45:50 172.18.42.117 kernel: Northbridge Error, node 1, core: 2

3.1.3.2 Coates: Timestamp Standardization

The timestamps in the Coates logs were changed from the default BSD-style syslog

format, specified in RFC 3164, to the Unix epoch format for easier machine

processing. Table 3.1 compares the differences between the easily human-readable

ISO 8601 and BSD-style syslog formats and the easily machine-readable Unix epoch

format for a sample from the Coates system logs. For the BSD-style syslog format,

the year was determined based on log file metadata because the year is not included

in these timestamps.

Table 3.1 Comparison of various timestamp formats

Format Example timestamp

ISO 8601 2011-02-13T09:02:39Z

BSD-style syslog Feb 13 04:02:39

Unix epoch 1297587759
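
As an illustration of this conversion, here is a minimal Python sketch (not the exact code used in this study); the year is supplied from log file metadata, and the UTC-5 offset is an assumption consistent with the Table 3.1 sample:

#!/usr/bin/env python3
# Minimal sketch: convert a BSD-style syslog timestamp (RFC 3164) to a
# Unix epoch timestamp. RFC 3164 timestamps omit the year, so it must be
# supplied by the caller; the UTC-5 offset is an assumption matching the
# Table 3.1 sample.
from datetime import datetime, timedelta, timezone

CLUSTER_TZ = timezone(timedelta(hours=-5))

def syslog_to_epoch(stamp, year):
    """Convert e.g. 'Feb 13 04:02:39' to a Unix epoch timestamp."""
    dt = datetime.strptime("{} {}".format(year, stamp), "%Y %b %d %H:%M:%S")
    return int(dt.replace(tzinfo=CLUSTER_TZ).timestamp())

# Reproduces the Table 3.1 sample:
# syslog_to_epoch("Feb 13 04:02:39", 2011) == 1297587759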

3.1.3.3 Coates: Node ID Standardization

To look for temporal and spatial clustering of events, we created a scatter plot to

investigate the degree of clustering. To plot the nodes on the horizontal axis, the

nodes were assigned unique IDs based on their IP addresses.

The unique node IDs were generated based on the compute node’s IP

address. The IP address was converted from dotted decimal notation, e.g.

192.168.15.76, into its packed, 32-bit, binary format represented as a decimal, e.g.

3232239436. Sequential node IDs were then generated by sorting the decimal-based

node IDs. The data set, composed of unique node IDs and the message timestamps


of failure events, was plotted as a scatter plot with the node IDs sorted from

smallest to largest along the horizontal axis and the event timestamps plotted along

the vertical axis.
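
A minimal Python sketch of this ID assignment (illustrative only; the function names are assumptions, not the code used in the study):

#!/usr/bin/env python3
# Minimal sketch: convert each node's IP address to its packed 32-bit
# integer form, then assign sequential node IDs in sorted order.
import socket
import struct

def ip_to_int(ip):
    """e.g. '192.168.15.76' -> 3232239436"""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def assign_node_ids(ip_addresses):
    """Map each unique IP address to a sequential node ID."""
    ordered = sorted(set(ip_addresses), key=ip_to_int)
    return {ip: node_id for node_id, ip in enumerate(ordered)}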

3.1.3.4 Coates: Scatter Plots of Memory Failure Events

Figure 3.1 shows a scatter plot of memory failure event times plotted against node

IDs.

3.1.3.5 Coates: Declustering of Failure Events

Notable temporal and spatial clustering of events was found in the data set, as

shown in Figure 3.1. The temporal and spatial clustering of events shows that these

events are not randomly distributed. This can occur for a number of reasons,

including faulty hardware, physical location in the data center, etc. The events for

memory errors were declustered by Rui Maximo Esteves from the University of

Stavanger in Norway using the R statistical software (Esteves, Hacker, & Rong,

2012).

3.1.3.6 Carter: Scatter Plots of Hard Disk Failure Events

Figure 3.2 shows a scatter plot of hard disk failure event times plotted against node

IDs.

3.1.3.7 Carter: Declustering of Failure Events

Notable temporal and spatial clustering of events was found in the data set, as

shown in Figure 3.2. The events for hard disk errors were declustered using the R

statistical software (The R Project for Statistical Computing, 2016) by the author.

The “neural gas” k-means algorithm was used from the “cclust” package in R. The


Figure 3.1. Time of Memory Error Events vs. Node ID


Figure 3.2. Time of Hard Disk Error Events vs. Node ID


number of centers was chosen by locating the “elbow” of the plot of within groups

sum of squared error (WSSE) versus the number of centers. WSSE is a measure of

the accuracy of the k-means fit. WSSE is the sum of the distance squared between

data points and their assigned cluster centers. When the slope of a polynomial fitted to a

plot of the WSSE vs. the number of centers approaches 0, then the optimum

number of centers has been found because increasing the number of centers does not

lower WSSE any further.

The elbow was located by fitting a fifth-degree polynomial to the WSSE plot

using MATLAB and finding the real roots of its derivative. The polynomial, in

Equation 3.1, has the coefficients shown in Table 3.2:

p_1 x^5 + p_2 x^4 + p_3 x^3 + p_4 x^2 + p_5 x + p_6 \quad (3.1)

Table 3.2 Polynomial coefficients

Coefficient Value

p1 -3.204e-07

p2 0.005271

p3 -34.83

p4 1.157e+05

p5 -1.934e+08

p6 1.31e+11

Differentiating the polynomial and solving for its roots in MATLAB produced

the solutions shown in Table 3.3. The first real root was selected and rounded down

to 3197. K-means was run using 3197 centers to get the final declustered data set

for Carter. Figure 3.3 shows a plot of WSSE vs. the number of clusters with a clear

“elbow”.
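
A minimal Python/NumPy sketch of this elbow-finding step (the study used MATLAB; NumPy may produce slightly different coefficients, so treat this as illustrative):

#!/usr/bin/env python3
# Minimal sketch: fit a fifth-degree polynomial to WSSE vs. number of
# centers (data from Table 3.4), differentiate it, and take the smallest
# real root of the derivative as the elbow.
import numpy as np

centers = np.array([1000, 1500, 2000, 2500, 3000, 3500, 4000])
wsse = np.array([23408200075.0, 7866952504.0, 2380480559.0, 823117829.2,
                 585667433.3, 458904640.7, 347255632.8])

coeffs = np.polyfit(centers, wsse, 5)    # p1..p6, highest degree first
derivative = np.polyder(coeffs)          # fourth-degree derivative
roots = np.roots(derivative)
real_roots = roots.real[np.abs(roots.imag) < 0.1]  # discard complex pairs
elbow = int(np.floor(real_roots.min()))  # rounds down, e.g. to 3197
print(elbow)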

Table 3.4 lists the number of centers vs. WSSE.


Table 3.3 Polynomial roots

Roots

3197.3 + 0.0000i

3762.3 + 0.0000i

3100.7 - 0.6492i

3100.7 + 0.6492i

Figure 3.3. WSSE vs. Number of Clusters


Table 3.4 Number of centers vs. WSSE

Number of Centers WSSE

1000 23408200075.0000

1500 7866952504.0000

2000 2380480559.0000

2500 823117829.2000

3000 585667433.3000

3500 458904640.7000

4000 347255632.8000

3.1.3.8 Distribution Fitting

The declustered data sets for Coates and Carter were then input into MathWave’s

EasyFit software for distribution fitting (EasyFit, 2010). The fit process was run

using a continuous data domain of the time between events data. The distributions

with the best goodness of fit results were chosen.
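
EasyFit is a commercial GUI tool; for readers without access to it, the same workflow can be approximated in Python with SciPy. This is a rough analogue under the assumption of maximum-likelihood fitting and a Kolmogorov-Smirnov goodness-of-fit ranking, not a reproduction of EasyFit's internals:

#!/usr/bin/env python3
# Rough SciPy analogue of the distribution-fitting step: fit candidate
# continuous distributions to the time-between-events data and rank them
# by the Kolmogorov-Smirnov statistic (smaller is better).
from scipy import stats

def fit_and_rank(interarrival_times):
    results = []
    for dist in (stats.weibull_min, stats.lomax, stats.expon):
        params = dist.fit(interarrival_times)  # maximum-likelihood fit
        ks_stat, _ = stats.kstest(interarrival_times, dist.name, args=params)
        results.append((dist.name, params, ks_stat))
    return sorted(results, key=lambda r: r[2])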

3.1.4 Fault Injection

We created a Python 3 script that can inject arbitrary machine checks into

VMs and simulate hard disk failures for virtual disk drives. Machine checks were

injected using QEMU’s built-in Human Monitor Protocol (QEMU Emulator User

Documentation: QEMU Monitor , n.d.). Hard disk errors were simulated using

modified VirtIO SCSI drivers for QEMU (Bellard, 2014).

The script has a multi-threaded design that allows for an arbitrary number

of VMs to be supported. When an error is about to be injected, the script randomly

selects the VM in which to inject the error from a preconfigured list of VMs.

Appendix A contains the source code for our fault injection script.


3.1.4.1 Memory Error Injection Procedure

Arbitrary memory errors are injected using QEMU’s Human Monitor Protocol

(HMP). The HMP has various commands that can be executed via libvirt’s “virsh”

tool. One of these commands is the “mce” command that sets the values of the

virtual CPU’s MCi_STATUS and related registers. These registers control the

Machine Check Architecture (MCA) subsystem of the processor, which is used for

memory and processor error detection and reporting.

For example, running the following command will inject an ECC error

overflow with a valid address into CPU 0 bank 4 of domain 12:

virsh -c qemu:///system qemu-monitor-command 12 --hmp --cmd 'mce 0 4 0xd426c0010b000813 0x0 0x0 0x0'
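
The full injection script is in Appendix A; a minimal Python sketch of this one step, wrapping the virsh call above, might look like the following (the function name and defaults are assumptions):

#!/usr/bin/env python3
# Minimal sketch: inject a machine check into a running QEMU domain by
# driving the HMP "mce" command through virsh. Argument order matches
# the example command above.
import subprocess

def inject_mce(domain_id, cpu, bank, status,
               arg4="0x0", arg5="0x0", arg6="0x0"):
    hmp_cmd = "mce {} {} {} {} {} {}".format(cpu, bank, status,
                                             arg4, arg5, arg6)
    subprocess.check_call(["virsh", "-c", "qemu:///system",
                           "qemu-monitor-command", str(domain_id),
                           "--hmp", "--cmd", hmp_cmd])

# ECC error overflow (ADDR valid) into CPU 0, bank 4 of domain 12:
# inject_mce(12, 0, 4, "0xd426c0010b000813")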

3.1.4.2 Machine Check Registers

The memory error messages analyzed were composed of errors detected in the

on-die northbridge of the systems’ AMD processors. These memory errors are

handled by the processor’s Machine Check Architecture, which contains several

registers indicating the type of error detected by the processor. The MCi_STATUS

register is the register that reports the detected error. The underlying bit patterns

of the MCi_STATUS register were analyzed based on AMD’s system programming

manuals (“AMD64 Architecture Programmer’s Manual Volume 2: System

Programming”, 2011; “BIOS and Kernel Developer’s Guide for AMD Athlon 64 and

AMD Opteron Processors”, 2006), the Linux kernel documentation (Thompson,

Jiang, Peterson, Harbaugh, & Chebab, 2011), and the RHEL kernel’s source code

(Torvalds, Red Hat, Inc., et al., 2010). The bit patterns required to inject the

observed failure events were recorded and added to the fault injection system.


3.1.4.3 Memory Error Injection Strings

The fault injection system supports injecting ten types of memory errors. Table 3.5

documents the supported memory errors. Uncorrectable errors are indicated by

“(UE)”. Overflow errors occur when more than one machine check occurs within the

machine check polling interval.

Table 3.5 Supported memory errors and MCi_STATUS register values

Error Type MCi_STATUS Register

ECC error (ADDR valid) 0x9426c0010b000813

ECC error overflow (ADDR valid) 0xd426c0010b000813

ECC error (ADDR invalid) 0x9026c0010b000813

ECC error overflow (ADDR invalid) 0xd026c0010b000813

L1 Cache Data Store error (UE) 0xb600200000000145

L1 Instruction Cache (Instruction Fetch) error (ADDR valid) 0x9400000000000151

L1 Instruction Cache (Instruction Fetch) error overflow (ADDR valid) 0xd400000000000151

Bus Unit (L2 Cache) error (UE) 0xb600000000020136

L2 Data Cache (Line Fill) error (ADDR valid) 0x9400400000000136

L2 Data Cache (Line Fill) error overflow (ADDR valid) 0xd400400000000136

3.1.4.4 QEMU Modifications and Hard Disk Error Simulation Procedure

QEMU does not support disk error injection by default. We modified the source file

hw/scsi/scsi-disk.c

from QEMU version 2.0.0 to support disk error injection via the VirtIO SCSI

subsystem. When a hard disk error is triggered and detected, the VM’s SCSI driver


skips to its built-in error handling code and propagates the error to the rest of the

system using the OS’s inherent failure detection mechanisms. Appendix B contains

the source code changes we made.

Hard disk errors are injected by the existence of a file under a directory tree

containing MAC addresses that map the errors to specific VMs. For example, the

existence of the following file will simulate the error ENOMEDIUM for the VM with

MAC address 52:54:00:39:ca:8b:

/tmp/qemu-disk-inject/52:54:00:39:ca:8b/ENOMEDIUM
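
A minimal Python sketch of triggering such an error (names are assumptions; the actual script is in Appendix A):

#!/usr/bin/env python3
# Minimal sketch: signal our modified QEMU to simulate a hard disk error
# for one VM by creating the flag file it looks for. The file's
# existence is the signal; its contents are irrelevant.
import os

INJECT_ROOT = "/tmp/qemu-disk-inject"

def inject_disk_error(mac_address, errno_name):
    """e.g. inject_disk_error('52:54:00:39:ca:8b', 'ENOMEDIUM')"""
    vm_dir = os.path.join(INJECT_ROOT, mac_address)
    os.makedirs(vm_dir, exist_ok=True)
    open(os.path.join(vm_dir, errno_name), "w").close()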

The Flexible I/O Tester (fio) (Axboe, 2015) was used to simulate a light disk

I/O workload on the VMs. This ensures that the injected errors are recognized by

QEMU and the guest OS.

3.1.4.5 Hard Disk Errors

The fault injection system supports injecting four types of hard disk errors;

however, only the errors ENOMEM, ENOMEDIUM, and EINVAL were used for the

simulations. Injecting the error ENOSPC immediately suspends the VM, making

this error not suitable for simulations. These four errors are all of the errors

supported by QEMU. Table 3.6 documents the supported errors and their

descriptions. The descriptions are from

/usr/include/asm-generic/errno{,-base}.h

Table 3.6 Supported hard disk errors and their descriptions

Errno Description

ENOMEM out of memory

ENOSPC no space left on device

ENOMEDIUM no medium found

EINVAL invalid argument


3.1.4.6 Tunable Fault Injection Frequency

The script supports tunable fault injection frequency. The statistical distributions

from EasyFit were mathematically analyzed to allow for arbitrary changes to the

expected values of the distributions. The scale parameters to the distributions were

made variable, via an inverse multiplier term, to provide a desired increase or

decrease in the frequency of failure events, similar to a tuning knob.

3.1.4.7 Tuning the Three-Parameter Weibull Distribution

Integrated memory controller errors on the Coates system followed a

three-parameter Weibull distribution with the parameters found in Table 4.1. The

expected value of the three-parameter Weibull distribution is shown in

Equation 3.2, where Γ is the Gamma function, γ is the shape parameter, α is the

scale parameter, and µ is the location parameter (NIST, 2003).

E[X] = \Gamma\left(1 + \frac{1}{\gamma}\right) \alpha + \mu \quad (3.2)

Rearranging the terms of Equation 3.2 gives a formula for an alternate scale

parameter with an inverse multiplier term, as shown in Equation 3.3, where E[X] is

the expected value and β is the multiplier.

\alpha = \frac{\left(E[X] - \mu\right) \frac{1}{\beta}}{\Gamma\left(1 + \frac{1}{\gamma}\right)} \quad (3.3)
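
A minimal Python sketch of Equations 3.2 and 3.3 with the Table 4.1 parameters (illustrative; variable names are assumptions):

#!/usr/bin/env python3
# Minimal sketch: compute the adjusted Weibull scale for a
# fault-frequency multiplier beta, then sample a sojourn time.
# Parameters are from Table 4.1.
import math
from scipy.stats import weibull_min

SHAPE = 0.49575    # gamma
SCALE = 5293.0     # alpha
LOCATION = 9.1171  # mu

def tuned_scale(beta):
    g = math.gamma(1 + 1 / SHAPE)
    expected = g * SCALE + LOCATION                  # Equation 3.2
    return ((expected - LOCATION) * (1 / beta)) / g  # Equation 3.3

# beta > 1 shrinks the scale, which increases fault frequency:
sojourn = weibull_min.rvs(SHAPE, loc=LOCATION, scale=tuned_scale(2.0))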

3.1.4.8 Tuning the Lomax Distribution

Hard disk errors on the Carter system followed a Lomax distribution—Pareto Type

II distribution that starts at 0—with the parameters found in Table 4.2. The

expected value of the Lomax distribution is shown in Equation 3.4, where λ is the

scale parameter, and α is the shape parameter.

E[X] = \frac{\lambda}{\alpha - 1} \quad (3.4)


Rearranging the terms of Equation 3.4 gives a formula for an alternate scale

parameter with an inverse multiplier term, as shown in Equation 3.5, where E[X] is

the expected value and β is the multiplier.

\lambda = E[X] \cdot \frac{1}{\beta} \cdot (\alpha - 1) \quad (3.5)
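
Correspondingly, a minimal Python sketch of Equations 3.4 and 3.5 with the Table 4.2 parameters (note that the two equations compose algebraically to λ/β, so the multiplier simply rescales λ):

#!/usr/bin/env python3
# Minimal sketch: compute the adjusted Lomax scale for a fault-frequency
# multiplier beta, then sample a sojourn time with scipy.stats.lomax.
# Parameters are from Table 4.2.
from scipy.stats import lomax

SHAPE = 0.38278  # alpha
SCALE = 450.36   # lambda

def tuned_scale(beta):
    expected = SCALE / (SHAPE - 1)              # Equation 3.4
    return expected * (1 / beta) * (SHAPE - 1)  # Equation 3.5, i.e. SCALE / beta

sojourn = lomax.rvs(SHAPE, scale=tuned_scale(2.0))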

3.1.4.9 Experimental Setup

The fitted distributions were then used as input for the fault injection script to

inject errors with the same statistical distributions observed in the Coates and

Carter logs.

The testbed server is a custom-built, dual-socket Intel Xeon X5650 system

with 24 cores and 36 GiB of DDR3 memory. The stock QEMU package was

installed, and our custom QEMU was compiled and installed manually into

/usr/local/bin

To replicate this, future users can apply the “diff” from Appendix B to the

QEMU 2.0.0 source tarball using “git apply”, install the needed dependencies from

the “configure” script, and then compile and install the software:

./configure
make
make install

The stock QEMU executable was backed up, and a symlink was created to

our custom QEMU executable from

/usr/libexec/qemu-kvm

We added the SELinux tag “qemu_exec_t” to the custom QEMU executable

for SELinux compatibility. Eight QEMU-KVM virtual machines were created on

the testbed server. Using virt-manager, we changed their SCSI controller type from

“hypervisor default” to “VirtIO SCSI”, and we changed the disk bus to “SCSI”

with a “raw” storage format. The server and its eight VMs run CentOS 7.2 without

a GUI.

mcelog (Kleen, 2015) was installed on the guest VMs to decode the injected

machine checks. The Flexible I/O Tester (fio) (Axboe, 2015) was configured as a


workload generator on all eight VMs. fio has been configured in time-based mode

scheduled for a two-month duration using the Intel IOMeter File Server Access

Pattern job configured with a “linear” I/O depth, as shown in Figure 3.4. A systemd

service file, shown in Figure 3.5, was created to start the fio service at boot and to

automatically restart on service failure. The eight VMs were configured to forward

their system logs to a centralized server for easy log collection and monitoring.

A systemd service file, shown in Figure 3.6, was created to set the Linux

kernel’s machine check “check_interval” from 300 seconds to 1 second. Figure 3.7

shows the Bash script that is called from the ExecStart directive from Figure 3.6.

This systemd service file and Bash script were installed on the VMs.

3.2 Unit & Sampling

The following sections will discuss the hypotheses, population, samples,

variables, and the measure for success.

3.2.1 Hypothesis

The hypotheses for this study are the following:

H0: It is not possible to develop a small-scale fault injection testbed that

can emulate the types of faults on large-scale HPC systems.

Hα: It is possible to develop a small-scale fault injection testbed that

can emulate the types of faults on large-scale HPC systems.

3.2.2 Population

The population of the memory error system log data is a set of logs from the

“Coates” compute cluster operated by the Rosen Center for Advanced Computing

(RCAC) at Purdue University. The population of the hard disk system log data is a


# This job file tries to mimic the Intel IOMeter File Server Access

# Pattern

[global]

description=Emulation of Intel IOmeter File Server Access Pattern

time_based

runtime=5256000

filename=fio-data-file

unified_rw_reporting=1

[iometer]

bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10

#readwrite=randrw

readwrite=read

#rwmixread=80

direct=1

size=4g

ioengine=libaio

# IOMeter defines the server loads as the following:

# iodepth=1 Linear

# iodepth=4 Very Light

# iodepth=8 Light

# iodepth=64 Moderate

# iodepth=256 Heavy

iodepth=1

Figure 3.4. Configuration file for fio


[Unit]

Description=flexible I/O tester

After=network.target

[Install]

WantedBy=multi-user.target

[Service]

Type=idle

PIDFile=/run/fio.pid

Restart=always

ExecStart=/usr/bin/fio /home/jason/iometer-file-access-server.fio

Figure 3.5. systemd service file for fio, installed on the VMs

[Unit]

Description=Change the default polling frequency for machine checks

After=network.target

[Install]

WantedBy=multi-user.target

[Service]

Type=oneshot

ExecStart=/usr/bin/bash /home/jason/set-mce-polling-frequency.sh

Figure 3.6. systemd service file for changing the default polling interval for machine checks


#!/usr/bin/bash

echo "1" > /sys/devices/system/machinecheck/machinecheck0/\

check_interval

echo "1" > /sys/devices/system/machinecheck/machinecheck1/\

check_interval

Figure 3.7. Bash script that changes the default polling frequency for machine checks


set of logs from the “Carter” compute cluster operated by RCAC at Purdue

University.

The population of the experimental units is a set of virtual machines

deployed on a small set of servers.

3.2.3 Sample

The fault injection system uses the Python 3 Standard Library and SciPy

(SciPy-Developers, 2013) to sample the statistical distributions with the specified

parameters.

3.2.4 Variables

The independent variable was the frequency of fault injections for each failure

event type. The dependent variables were the systems’ perceived “healthiness” and

accuracy of triggered migrations with respect to “actual” major failures.

3.2.5 Measure for Success

The study examined the data and tested the null hypothesis, using a 95%

successful injection rate as the measure for success.

3.3 Summary

This chapter provided the research approach and methodology that was used

in the execution of this research project.


CHAPTER 4. PRESENTATION OF DATA

This chapter provides the results from this research project.

4.1 Statistical Models

This section describes the statistical models for failure events observed in the

Coates and Carter data sets.

4.1.1 Coates: Memory Errors

Integrated memory controller errors on the Coates system followed a

three-parameter Weibull distribution with the parameters found in Table 4.1. This

correlates with the results from Hacker et al. (2009) that also showed a Weibull

distribution fits well with failure events.

Table 4.1 Parameters of the three-parameter Weibull distribution for integrated memory controller errors on the Coates system

Parameter Value

Shape 0.49575

Scale 5293.0

Location 9.1171


4.1.2 Carter: Hard Disk Errors

Hard disk errors on the Carter system followed a Lomax distribution (Pareto

Type II that starts at 0) with the parameters found in Table 4.2. This correlates

with the results from Schroeder, Damouras, and Gill (2010) that also showed a

Pareto distribution is the best fit for disk errors.

Table 4.2 Parameters of the Lomax distribution for hard disk errors on the Carter system

Parameter Value

Shape 0.38278

Scale 450.36

4.2 Experiments

This section contains the results of the experimental setup.

4.2.1 System Log Samples

This section contains system log samples from the VMs running under the

fault injection script. Figure 4.1 shows a system log screenshot of VM “centos4” experiencing an ENOMEDIUM error injection.

4.2.2 Quantitative Results

This section contains quantitative results from the experiments comparing

the number of attempted fault injections versus the number of faults the guest OS

of the VMs reported.

Over a 7 day time period, 15,644 total faults were randomly injected across

the eight VMs. Table 4.3 shows the number of attempted injections, the number of

successful injections, and an adjusted number of injections that accounts for


duplicates outside the scope of this thesis due to the limitations of the hard disk

injection mechanism. The adjusted value was tabulated by examining the logs on a

per-VM basis and using a sliding window of one minute. When multiple disk errors

occurred within one minute of each other, the first error time was noted and one

minute was added to this time; any errors that occurred within this window were

marked as successful and tabulated under the adjusted column. The values for

ENOMEM and EINVAL have higher adjusted values than errors attempted due to

uncertainty in how big this sliding window must be. From this data, 99.7% of disk

errors were successfully injected; 100% of memory errors were successfully injected;

and 99.99% of all total errors were successfully injected.
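
A minimal Python sketch of this adjustment (names and structure are assumptions; this is not the tabulation script itself):

#!/usr/bin/env python3
# Minimal sketch: count an attempted disk error injection as successful
# if it was logged directly or fell within a one-minute window after a
# logged error on the same VM.
def adjusted_successes(logged_times, attempted_times, window=60):
    """Timestamps are Unix epoch seconds for a single VM."""
    logged = sorted(logged_times)
    count = 0
    for t in sorted(attempted_times):
        if any(lt <= t <= lt + window for lt in logged):
            count += 1
    return count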

Table 4.3 Number of attempted and successful fault injections

Error Attempted Successful Adjusted

ENOMEDIUM 300 182 298

ENOMEM 205 122 216

EINVAL 230 150 301

ECC error (ADDR valid) 143 143 N/A

L1 Instruction Cache (Instruction Fetch) error (ADDR valid) 143 143 N/A

L2 Data Cache (Line Fill) error (ADDR valid) 14,623 14,623 N/A


CHAPTER 5. CONCLUSIONS, DISCUSSION, AND RECOMMENDATIONS

5.1 Conclusions and Discussion

We have processed and analyzed system log data from two high performance

computing clusters running typical HPC workloads operated by the Rosen Center

for Advanced Computing at Purdue University. We have produced statistical

models of hardware errors in these systems. We have developed a fault injection

testbed that, in virtual machines, injects/simulates real hardware errors using the

same statistical models we produced from these computing clusters.

As seen in the raw injection data, a number of attempted disk error

injections did not succeed. We found that hard disk errors cannot be reliably

injected with a time between errors of some time under one minute. After

examining the data, most of these failed error injections can be accounted for with

multiple injections in short succession before the virtual machines were able to

recover from previous failures. We postulate this is because the workload generator

crashes at every hard disk injection, and the generator needs time to restart itself

fully, along with the normal hard disk recovery procedures.

The statistical models we created are broadly applicable due to the high

quality of the source data—large-scale HPC systems running typical workloads for

long durations. HPC system logs of production systems under workload are rarely

made available to researchers, so we believe these statistical models will be useful for

other researchers and system administrators that do not have access to such data.

The fault injection system we created is useful for parallel application

developers and system administrators that want to test their application’s

robustness to hardware errors. Parallel application developers can test how their


programs respond to errors in main memory and CPU caches, which is especially

useful for highly optimized programs designed to stay in the CPU caches.

Additionally, system administrators can test before deploying to production how

arbitrary web applications, customer-facing web portals like Internet-based stores,

high availability software, etc. respond to failing hardware that could interrupt

business functions or result in lost revenue.

5.2 Recommendations

For future work, we recommend further analyzing the Rosen Center’s
cluster data sets for other hardware error types and generating more statistical
models for these errors. We also recommend working with the QEMU project
maintainers to create a built-in mechanism for hard disk error injection. Finally,
we recommend extending this work by investigating error prediction based on prior
hardware errors.


APPENDICES


APPENDIX A. FAULT INJECTION SCRIPT

#!/usr/bin/env python3

"""Docstring: Use QEMU and libvirt to inject arbitrary errors into VMs.

Author: Jason St. John

License: Apache 2.0

Copyright 2016 Jason St. John

Licensed under the Apache License, Version 2.0 (the "License");

you may not use this file except in compliance with the License.

You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software

distributed under the License is distributed on an "AS IS" BASIS,

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and

limitations under the License.

-------------------------------------------------------------------

Python version required: >=3.2

This script generates lists of sojourn times between machine check and
disk fault injections based on various statistical distributions and

injects the given error using libvirt and a custom QEMU.

"""

from scipy.stats import lomax

import errno, io, os, subprocess, time, threading, sys, math, random, csv

import datetime

#import libvirt

__version__ = "1.0"

__status__ = "Final"

__date__ = "2016-11-09"

__maintainer__ = "Jason St. John"

__author__ = ["Jason St. John"]

__email__ = "[email protected]"

"""

conn = libvirt.open(’qemu:///system’)

if conn == None:

print(’Failed to open connection to qemu:///system’, file=sys.stderr)

exit(1)

domain_IDs = conn.listDomainsID()

if domain_IDs == None:

print(’Failed to get a list of domain IDs’, file=sys.stderr)

print("Active domain IDs:")

if len(domain_IDs) == 0:

print(’ None’)

Page 53: A small-scale testbed for large-scale reliable computing

40

else:

for domain_ID in domain_IDs:

print(’ ’ + str(domain_ID))

"""

# These are dicts of domain (read: VM) IDs & MAC addresses for use
# with libvirt. The "high", "normal", and "low" denote fault frequency.
# Run `virsh -c qemu:///system list` to find the domain IDs you want
# to use. MAC addresses can be found via virt-manager or virsh.
domain_list_high = {
    '9': '52:54:00:52:66:c1',  # centos1
    '4': '52:54:00:42:6f:d4',  # centos4
    '6': '52:54:00:26:57:2d',  # centos6
    '7': '52:54:00:ae:f6:2a'   # centos7
}

domain_list_normal = {
    '2': '52:54:00:1e:cc:65',  # centos2
    '5': '52:54:00:e0:0a:6d',  # centos5
    '8': '52:54:00:82:53:05'   # centos8
}

domain_list_low = {
    '3': '52:54:00:cf:de:c7'   # centos3
}

"""

# Check whether the targeted VMs are running

domain_list_all = list(domain_list_high.keys())

domain_list_all += list(domain_list_normal.keys())

domain_list_all += list(domain_list_low.keys())

for domain in domain_list_all:

Page 54: A small-scale testbed for large-scale reliable computing

41

if domain not in domain_IDs:

print("Targeted VM not running: ", domain)

conn.close()

exit(2)

"""

# This is the path to the directory that our modified QEMU looks for

# disk error injection.

qemu_disk_inject_path = "/tmp/qemu-disk-inject/"

"""

Define all of the functions we will use for tweaking the distribution

parameters.

"""

def weibull_mean(shape, scale, location):
    """Return the mean of the three-parameter Weibull distribution."""
    mean = scale * math.gamma(1 + (1 / shape)) + location
    return mean

def weibull_alt_scale(mean, multiplier, shape, location):
    """Return an alternate scale parameter used to tweak the intensity."""
    alt_scale = ((mean - location) * (1 / multiplier))\
        / math.gamma(1 + (1 / shape))
    return alt_scale

def lomax_mean(shape, scale):
    """Return the mean of the Lomax distribution."""
    mean = scale / (shape - 1)
    return mean

def lomax_alt_scale(mean, multiplier, shape):
    """Return an alternate scale parameter used to tweak the intensity."""
    alt_scale = mean * (1 / multiplier) * (shape - 1)
    return alt_scale

"""

Set variables for the distributions we are using. These values were

derived from declustering of the raw fault data from the Coates

cluster.

"""

# Weibull distribution
# ECC Errors
fault_mem_shape = 0.49575
fault_mem_scale = 5293.0
fault_mem_location = 9.1171
fault_mem_mean = weibull_mean(fault_mem_shape, fault_mem_scale,
                              fault_mem_location)

# High intensity
fault_mem_high_iterations = 20000
fault_mem_high_multiplier = 400
fault_mem_high_scale = weibull_alt_scale(fault_mem_mean,
    fault_mem_high_multiplier, fault_mem_shape, fault_mem_location)

# Normal intensity
fault_mem_normal_iterations = 20000
fault_mem_normal_multiplier = 400
fault_mem_normal_scale = weibull_alt_scale(fault_mem_mean,
    fault_mem_normal_multiplier, fault_mem_shape, fault_mem_location)

# Low intensity
fault_mem_low_iterations = 20000
fault_mem_low_multiplier = 400
fault_mem_low_scale = weibull_alt_scale(fault_mem_mean,
    fault_mem_low_multiplier, fault_mem_shape, fault_mem_location)

# Lomax distribution
fault_disk_shape = 0.38278
fault_disk_scale = 450.36
fault_disk_mean = lomax_mean(fault_disk_shape, fault_disk_scale)

# High intensity
fault_disk_high_iterations = 20000
fault_disk_high_multiplier = 300
fault_disk_high_scale = lomax_alt_scale(fault_disk_mean,
    fault_disk_high_multiplier, fault_disk_shape)

# Normal intensity
fault_disk_normal_iterations = 20000
fault_disk_normal_multiplier = 300
fault_disk_normal_scale = lomax_alt_scale(fault_disk_mean,
    fault_disk_normal_multiplier, fault_disk_shape)

# Low intensity
fault_disk_low_iterations = 20000
fault_disk_low_multiplier = 300
fault_disk_low_scale = lomax_alt_scale(fault_disk_mean,
    fault_disk_low_multiplier, fault_disk_shape)

Page 57: A small-scale testbed for large-scale reliable computing

44

"""

These strings contain commands for the QEMU monitor to inject arbitrary

machine checks (MCs) and machine check exceptions (MCEs).

Note: ’_of’ designates an overflow machine check.

Format: "mce [cpu] [bank] [status] [mcgstatus] [addr] [misc]"

Reference AMD publications #24593 and #26094 for documentation on

the format of the [status] section.

"""

# L2 Data Cache (Line Fill) error (ADDR valid)

mc0_inject = "mce 0 0 0x9400400000000136 0x0 0x0 0x0"

mc0_inject_of = "mce 0 0 0xd400400000000136 0x0 0x0 0x0"

# L1 Instruction Cache (Instruction Fetch) Error (ADDR valid)

mc1_inject = "mce 0 1 0x9400000000000151 0x0 0x0 0x0"

mc1_inject_of = "mce 0 1 0xd400000000000151 0x0 0x0 0x0"

# Bus Unit (L2 Cache) Error (uncorrectable error)

mc2_inject = "mce 0 2 0xb600000000020136 0x0 0x0 0x0"

# L1 Cache Data Store Error (uncorrectable error)

mc3_inject = "mce 0 3 0xb600200000000145 0x0 0x0 0x0"

# ECC error (ADDR valid)

mc4_inject = "mce 0 4 0x9426c0010b000813 0x0 0x0 0x0"

mc4_inject_of = "mce 0 4 0xd426c0010b000813 0x0 0x0 0x0"

Page 58: A small-scale testbed for large-scale reliable computing

45

# ECC error (ADDR invalid)

mc4_inject_invalid_addr = "mce 0 4 0x9026c0010b000813 0x0 0x0 0x0"

mc4_inject_invalid_addr_of = "mce 0 4 0xd026c0010b000813 0x0 0x0 0x0"
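# Example (illustrative, with a hypothetical domain ID "4"): these strings
# are passed to the QEMU monitor through virsh, as inject_error() does
# below, e.g.:
#
#   virsh -c qemu:///system qemu-monitor-command 4 --hmp \
#       --cmd 'mce 0 4 0x9426c0010b000813 0x0 0x0 0x0'
#
# which raises an ECC error (ADDR valid) machine check on CPU 0, bank 4.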

# Populate the lists of sojourn times.
fault_mem_normal_list = []
for obs in range(fault_mem_normal_iterations):
    variate = random.weibullvariate(fault_mem_normal_scale,
                                    fault_mem_shape)
    variate += fault_mem_location
    fault_mem_normal_list.append(variate)

fault_mem_high_list = []
for obs in range(fault_mem_high_iterations):
    variate = random.weibullvariate(fault_mem_high_scale, fault_mem_shape)
    variate += (fault_mem_location * fault_mem_high_multiplier)
    fault_mem_high_list.append(variate)

fault_mem_low_list = []
for obs in range(fault_mem_low_iterations):
    variate = random.weibullvariate(fault_mem_low_scale, fault_mem_shape)
    variate += (fault_mem_location * fault_mem_low_multiplier)
    fault_mem_low_list.append(variate)

fault_disk_normal_list = lomax.rvs(fault_disk_shape,
                                   scale=fault_disk_normal_scale,
                                   size=fault_disk_normal_iterations)
fault_disk_high_list = lomax.rvs(fault_disk_shape,
                                 scale=fault_disk_high_scale,
                                 size=fault_disk_high_iterations)
fault_disk_low_list = lomax.rvs(fault_disk_shape,
                                scale=fault_disk_low_scale,
                                size=fault_disk_low_iterations)

def touch(file_name):
    """Emulate the Unix 'touch' command."""
    try:
        os.utime(file_name, None)
    except OSError:
        open(file_name, "a").close()

def write_log(domain, timestamp, errmsg):
    """Write injected errors to log files."""
    vm_log_path = "".join(["/home/jason/vm_grouped_logs/"])
    err_log_path = "".join(["/home/jason/error_grouped_logs/"])
    if not os.path.exists(vm_log_path):
        os.makedirs(vm_log_path)
    if not os.path.exists(err_log_path):
        os.makedirs(err_log_path)
    err_log_subdir = "".join(errmsg.split())
    iso_timestamp = datetime.datetime.fromtimestamp(\
        time.mktime(timestamp)).isoformat()
    with open("".join([vm_log_path, domain]), "a", newline="")\
            as vm_log_file:
        logwriter = csv.writer(vm_log_file, quoting=csv.QUOTE_NONE)
        logwriter.writerow([iso_timestamp, errmsg])
    with open("".join([err_log_path, err_log_subdir]), "a", newline="")\
            as errmsg_log_file:
        logwriter = csv.writer(errmsg_log_file, quoting=csv.QUOTE_NONE)
        logwriter.writerow([iso_timestamp, domain])

def inject_error(domain, mac_addr, inject_string):
    """Inject a fault into a virtual machine (domain)."""
    # Handle disk errors separately from MCE errors
    if inject_string.startswith(("ENO", "EIN")):
        mac_addr_path = "".join([qemu_disk_inject_path, mac_addr, "/"])
        file_path = "".join([mac_addr_path, inject_string])
        if not os.path.exists(mac_addr_path):
            os.makedirs(mac_addr_path)
        with open(file_path, "w"):
            touch(file_path)
        time.sleep(0.01)  # let the file sit for a fraction of 1 second
        # remove the file, and hence, stop triggering the injection
        try:
            os.remove(file_path)
        except OSError as e:
            if e.errno != errno.ENOENT:  # no such file or directory
                raise  # re-raise exception if a different error occurred
    else:
        try:
            subprocess.check_call(["virsh", "-c", "qemu:///system",
                                   "qemu-monitor-command " + domain +
                                   " --hmp --cmd '" + inject_string + "'"])
        except subprocess.CalledProcessError:
            print("Error injecting [[", inject_string, "]] into domain",
                  domain)

def injection_manager(domain_list, fault_list, intensity, inject_string):
    """Manage the sojourn times and fault injections per fault type."""
    for seconds in fault_list:
        # Randomly select a target domain every injection step
        if intensity == 'high':
            domain = random.choice(list(domain_list_high.keys()))
            mac_addr = domain_list_high[domain]
        elif intensity == 'normal':
            domain = random.choice(list(domain_list_normal.keys()))
            mac_addr = domain_list_normal[domain]
        elif intensity == 'low':
            domain = random.choice(list(domain_list_low.keys()))
            mac_addr = domain_list_low[domain]
        else:
            print("Invalid intensity: ", intensity)
            exit(3)
        print("\"" + inject_string + "\" will be injected into domain "\
            + domain + " in " + str(round(seconds, 2)) + " seconds.")
        time.sleep(seconds)
        inject_error(domain, mac_addr, inject_string)
        timestamp = time.localtime()
        #timestamp = time.gmtime()
        write_log(domain, timestamp, inject_string)

"""

Spawn a thread for each fault type. At each injection interval,

each thread will randomly select a domain from the list of domains

and inject one instance of its fault type. The thread will loop

through the number of sojourn times.

"""

# Memory errors
thread1 = threading.Thread(target=injection_manager, args=(
    domain_list_high, fault_mem_high_list, 'high', mc4_inject))
thread2 = threading.Thread(target=injection_manager, args=(
    domain_list_normal, fault_mem_normal_list, 'normal', mc0_inject))
thread3 = threading.Thread(target=injection_manager, args=(
    domain_list_low, fault_mem_low_list, 'low', mc1_inject))

# Disk errors
thread6 = threading.Thread(target=injection_manager, args=(
    domain_list_high, fault_disk_high_list, 'high', 'ENOMEDIUM'))
thread7 = threading.Thread(target=injection_manager, args=(
    domain_list_normal, fault_disk_normal_list, 'normal', 'ENOMEM'))
thread8 = threading.Thread(target=injection_manager, args=(
    domain_list_low, fault_disk_low_list, 'low', 'EINVAL'))

# Threads must not be 'daemon threads'

thread1.daemon = False

thread2.daemon = False

thread3.daemon = False

thread6.daemon = False

thread7.daemon = False

thread8.daemon = False


# Start all threads at the same time

thread1.start()

thread2.start()

thread3.start()

thread6.start()

thread7.start()

thread8.start()

print("There are currently ", threading.active_count(),

"active threads.")


APPENDIX B. QEMU SOURCE CODE MODIFICATIONS

The following is the output of the “git diff” command that contains our
modifications to the QEMU 2.0.0 source code.
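To illustrate how this interface is exercised, the short sketch below triggers a single ENOMEDIUM disk error injection by creating and then removing a trigger file that the modified QEMU polls. The MAC address is an assumed example taken from the script in Appendix A; substitute the address of the target VM’s NIC.

import os
import time

# Per-VM trigger directory, named by the VM's NIC MAC address (assumed example).
trigger_dir = "/tmp/qemu-disk-inject/52:54:00:52:66:c1/"
trigger_file = os.path.join(trigger_dir, "ENOMEDIUM")

os.makedirs(trigger_dir, exist_ok=True)
open(trigger_file, "a").close()  # QEMU injects while this file exists
time.sleep(0.01)                 # let at least one disk I/O hit the error path
os.remove(trigger_file)          # stop triggering the injection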


diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c

index 48a28ae..4359047 100644

--- a/hw/scsi/scsi-disk.c

+++ b/hw/scsi/scsi-disk.c

@@ -37,10 +37,14 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)

#include "hw/block/block.h"

#include "sysemu/dma.h"

+#include "net/net.h"

+

#ifdef __linux

#include <scsi/sg.h>

#endif

+#include <dirent.h>

+

#define SCSI_WRITE_SAME_MAX 524288

#define SCSI_DMA_BUF_SIZE 131072

#define SCSI_MAX_INQUIRY_LEN 256

@@ -292,6 +296,159 @@ static void scsi_dma_complete(void *opaque, int ret)


scsi_dma_complete_noio(opaque, ret);

}

+/* Read a type of disk error to inject from the dir /tmp/qemu-disk-inject,

+ * and return it. The dir structure is

+ * /tmp/qemu-disk-inject/<MAC addr>/<err type>

+ * When a VM matches the MAC address and has an empty file named one of

+ * "ENOMEDIUM", "ENOSPC", "EINVAL", or "ENOMEM", an error will be injected

+ * until the file is removed from the file system. */

+static int get_disk_error_to_inject(void)

+{

+ int err_to_inject = 0;

+ DIR *dp, *mdp;

+ struct dirent *dresult, *mdresult;

+ char macdir_name[1024];

+ int done = 0;

+ int macdone = 0;

+ macdone = macdone + 0;

+ int rc = 0;

+ rc = rc + 0;


+ char messages[128];

+ strcpy(messages, "0");

+ char vm_interface[100] = "net0";

+ char *vm_int;

+ int i;

+ int match;

+ uint8_t macaddr[6];

+ int inaddr[6];

+

+ NetClientState *ncs;

+ NICState *nic_state;

+

+ vm_int = vm_interface;

+

+ // Find the running VM's MAC address

+ ncs = qemu_find_netdev_match(vm_int);

+ if (ncs == NULL) {

+ qemu_log("NetClientState pointer NULL for interface %s\n", vm_int);

+ } else {

+ nic_state = qemu_get_nic(ncs);


+

+ if (nic_state != NULL) {

+ for (i = 0; i < 5; i++) {

+ macaddr[i] = nic_state->conf->macaddr.a[i];

+ }

+ }

+ }

+

+ errno = 0;

+ dp = opendir("/tmp/qemu-disk-inject");

+

+ if (dp == NULL) {

+ perror("SCSI disk fault injection error: Main control config diropen failed");

+ goto closefiles;

+ }

+

+ while (done == 0) {

+ // Iterate over MAC-named directories to find a match to current running VM

+

+ dresult = readdir(dp);


+

+ if ((dresult == NULL) && (errno != 0)) {

+ perror("SCSI disk fault injection error: Main control config readdir failed");

+ goto closefiles;

+ }

+

+ if ((dresult == NULL) && (errno == 0)) {

+ // No more directories.

+ done = 1;

+ break;

+ }

+ if (index(dresult->d_name, ':') != NULL) {

+ rc = sscanf (dresult->d_name, "%x:%x:%x:%x:%x:%x", &inaddr[0], &inaddr[1], &inaddr[2],\

+ &inaddr[3], &inaddr[4], &inaddr[5]);

+

+ // Is this line intended for us?

+ match = 1;

+ for (i = 0; i < 5; i++) {

+ if (inaddr[i] != macaddr[i]) {

+ match = 0;


+ }

+ }

+ if (match == 1) {

+ // Get the error to inject from the filename in this directory.

+ sprintf (macdir_name, "/tmp/qemu-disk-inject/%s", dresult->d_name);

+ mdp = opendir(macdir_name);

+

+ if (mdp == NULL) {

+ perror("SCSI disk fault injection error: MAC (node-level) config diropen failed");

+ goto closemdpfiles;

+ }

+ while (macdone == 0) {

+ mdresult = readdir(mdp);

+

+ if ((mdresult == NULL) && (errno != 0)) {

+ perror("SCSI disk fault injection error: MAC (node-level) config readdir failed");

+ goto closemdpfiles;

+ }

+

+ if ((mdresult == NULL) && (errno == 0)) {


+ // Empty MAC directory

+ goto closemdpfiles;

+ }

+ if (index(mdresult->d_name, '.') == NULL) {

+ // Scan the directory to discover the one message to be sent

+ if (strncmp (mdresult->d_name, "ENOMEDIUM", strlen("ENOMEDIUM")) == 0) {

+ err_to_inject = ENOMEDIUM;

+ /* Remove injection support for ENOSPC because it suspends the VM immediately

+ * and the error isn't logged to syslog or the Journal.

+ *

+ * } else if (strncmp (mdresult->d_name, "ENOSPC", strlen("ENOSPC")) == 0) {

+ * err_to_inject = ENOSPC;

+ */

+ } else if (strncmp (mdresult->d_name, "EINVAL", strlen("EINVAL")) == 0) {

+ err_to_inject = EINVAL;

+ } else if (strncmp (mdresult->d_name, "ENOMEM", strlen("ENOMEM")) == 0) {

+ err_to_inject = ENOMEM;

+ } else {

+ err_to_inject = 0;

+ }


+

+ macdone = 1;

+ }

+ }

+ if (mdp != NULL) {

+ closedir(mdp);

+ mdp = NULL;

+ }

+ return err_to_inject;

+

+ closemdpfiles:

+ {

+ if (mdp != NULL) {

+ closedir(mdp);

+ mdp = NULL;

+ }

+ if (dp != NULL) {

+ closedir(dp);

+ dp = NULL;

+ }


+ return 0;

+ }

+ }

+ }

+ }

+

+closefiles:

+{

+ if (dp != NULL) {

+ closedir(dp);

+ dp = NULL;

+ }

+ return 0;

+}

+}

+

static void scsi_read_complete(void * opaque, int ret)

{

SCSIDiskReq *r = (SCSIDiskReq *)opaque;

@@ -330,6 +486,7 @@ static void scsi_do_read(void *opaque, int ret)


SCSIDiskReq *r = opaque;

SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);

uint32_t n;

+ int inject_err;

if (r->req.aiocb != NULL) {

r->req.aiocb = NULL;

@@ -339,6 +496,13 @@ static void scsi_do_read(void *opaque, int ret)

goto done;

}

+ inject_err = get_disk_error_to_inject();

+ if (inject_err) {

+ // inject_err is either "0" or the value of a C errno

+ scsi_handle_rw_error(r, inject_err);

+ return;

+ }

+

if (ret < 0) {

if (scsi_handle_rw_error(r, -ret)) {


goto done;

@@ -450,6 +614,7 @@ static void scsi_write_complete(void * opaque, int ret)

SCSIDiskReq *r = (SCSIDiskReq *)opaque;

SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);

uint32_t n;

+ int inject_err;

if (r->req.aiocb != NULL) {

r->req.aiocb = NULL;

@@ -459,6 +624,13 @@ static void scsi_write_complete(void * opaque, int ret)

goto done;

}

+ inject_err = get_disk_error_to_inject();

+ if (inject_err) {

+ // inject_err is either "0" or the value of a C errno

+ scsi_handle_rw_error(r, inject_err);

+ return;

+ }

+


if (ret < 0) {

if (scsi_handle_rw_error(r, -ret)) {

goto done;

diff --git a/include/net/net.h b/include/net/net.h

index 8166345..f4ee769 100644

--- a/include/net/net.h

+++ b/include/net/net.h

@@ -98,6 +98,7 @@ typedef struct NICState {

bool peer_deleted;

} NICState;

+NetClientState *qemu_find_netdev_match(const char *id);

NetClientState *qemu_find_netdev(const char *id);

int qemu_find_net_clients_except(const char *id, NetClientState **ncs,

NetClientOptionsKind type, int max);

diff --git a/net/net.c b/net/net.c

index e3ef1e4..bbc6fd4 100644

--- a/net/net.c

+++ b/net/net.c

@@ -623,6 +623,19 @@ NetClientState *qemu_find_netdev(const char *id)


return NULL;

}

+NetClientState *qemu_find_netdev_match(const char *id)

+{

+ NetClientState *nc;

+

+ QTAILQ_FOREACH(nc, &net_clients, next) {

+ if (!strcmp(nc->name, id)) {

+ return nc;

+ }

+ }

+

+ return NULL;

+}

+

int qemu_find_net_clients_except(const char *id, NetClientState **ncs,

NetClientOptionsKind type, int max)

{


LIST OF REFERENCES

AMD64 architecture programmer’s manual volume 2: System programming [Computer software manual]. (2011, May). Pub. 24593.

Atif, M., & Strazdins, P. (2009). Optimizing live migration of virtual machines in SMP clusters for HPC applications. In Sixth IFIP international conference on network and parallel computing (pp. 51–58). Gold Coast, Queensland, AUS: IEEE. doi: 10.1109/NPC.2009.32

Axboe, J. (2015). fio: flexible I/O tester. Retrieved from https://github.com/axboe/fio

Bellard, F. (2014). QEMU: open-source processor emulator. Retrieved from http://qemu.org

BIOS and kernel developer’s guide for AMD Athlon 64 and AMD Opteron processors [Computer software manual]. (2006, February). Pub. 26094.

DeBardeleben, N. A., Blanchard, S. P., Fu, S., Guan, Q., & Zhang, Z. (2011). Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience (Tech. Rep.). Los Alamos, New Mexico, USA: Los Alamos National Laboratory. Retrieved from http://newmexicoconsortium.org/usrc/usrc-publications (LA-UR-11-10926)

EasyFit. (2010). MathWave Technologies. Retrieved from http://www.mathwave.com

Esteves, R. M., Hacker, T., & Rong, C. (2012, December). Cluster analysis for the cloud: Parallel Competitive Fitness and parallel K-means++ for large dataset analysis. In Proceedings of the 2012 IEEE 4th international conference on cloud computing technology and science (pp. 177–184). Taipei, Taiwan: IEEE. doi: 10.1109/CloudCom.2012.6427553

Evangelinos, C., & Hill, C. N. (2008). Cloud computing for parallel scientific HPC applications: Feasibility of running coupled atmosphere-ocean climate models on Amazon’s EC2. In The first workshop on cloud computing and its applications ’08. Retrieved from http://www.cca08.org/papers.html

Fu, S., & Xu, C.-Z. (2007, October). Quantifying temporal and spatial correlation of failure events for proactive management. In Proceedings of the 26th IEEE international symposium on reliable distributed systems (pp. 175–184). Beijing, China: IEEE. doi: 10.1109/SRDS.2007.18

Giuffrida, C., Kuijsten, A., & Tanenbaum, A. S. (2013, December). EDFI: A dependable fault injection tool for dependability benchmarking experiments. In Proceedings of the 2013 IEEE 19th pacific rim international symposium on dependable computing (pp. 31–40). Vancouver, British Columbia, Canada: IEEE.

Guan, Q., DeBardeleben, N., Blanchard, S., & Fu, S. (2014, May). F-SEFI: A fine-grained soft error fault injection tool for profiling application vulnerability. In Proceedings of the 2014 IEEE 28th international parallel and distributed processing symposium (pp. 1245–1254). Phoenix, Arizona: IEEE.

Hacker, T. J., Romero, R. F., & Carothers, C. D. (2009). An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing, 69(7), 652–665. doi: 10.1016/j.jpdc.2009.03.007

Kleen, A. (2015). mcelog: the Linux hardware error daemon. Retrieved from http://www.mcelog.org

Lange, J., Dinda, P., Xia, L., Bridges, P. G., Hale, K., et al. (2011). Palacios: An OS-independent embeddable VMM. Retrieved from http://www.v3vee.org/palacios/

Levy, S., Dosanjh, M. G. F., Bridges, P. G., & Ferreira, K. B. (2013, June). Using unreliable virtual hardware to inject errors in extreme-scale systems. In Proceedings of the 3rd workshop on fault-tolerance for HPC at eXtreme scale (FTXS). New York, NY: ACM.

libvirt virtualization API. (2015). Red Hat, Inc. Retrieved from https://libvirt.org/

NIST. (2003, June). NIST/SEMATECH e-handbook of statistical methods. National Institute of Standards and Technology. Retrieved from http://www.itl.nist.gov/div898/handbook/eda/section3/eda3668.htm

Oliner, A., & Stearley, J. (2007, June). What supercomputers say: A study of five system logs. In Proceedings of the 37th annual IEEE/IFIP international conference on dependable systems & networks (pp. 575–584). Edinburgh, Scotland: IEEE. doi: 10.1109/DSN.2007.103

Pandit, N., Kalbarczyk, Z., & Iyer, R. K. (2009, June/July). Effectiveness of machine checks for error diagnostics. In Proceedings of the IEEE/IFIP international conference on dependable systems & networks, 2009 (pp. 578–583). Lisbon, Portugal: IEEE. doi: 10.1109/DSN.2009.5270290

QEMU emulator user documentation: QEMU monitor. (n.d.). QEMU Project. Retrieved from http://wiki.qemu.org/download/qemu-doc.html

Romero, R. F. (2010). Live migration of parallel applications. Unpublished master’s thesis, Purdue University, West Lafayette. Retrieved from http://docs.lib.purdue.edu/techmasters/27/ (UMI No. 1489538)

Rosen Center for Advanced Computing, P. U. (n.d.-a). ITaP research computing: Overview of Carter. Retrieved from https://www.rcac.purdue.edu/compute/carter/

Rosen Center for Advanced Computing, P. U. (n.d.-b). ITaP research computing: Overview of Coates. Retrieved from https://www.rcac.purdue.edu/compute/coates/

The R project for statistical computing. (2016). Retrieved from https://www.r-project.org

Salfner, F., & Tschirpke, S. (2008, December). Error log processing for accurate failure prediction. In WASL ’08 proceedings of the first USENIX conference on analysis of system logs (p. 4). San Diego, CA, USA: USENIX Association.

Schroeder, B., Damouras, S., & Gill, P. (2010, February). Understanding latent sector errors and how to protect against them. In Proceedings of the 8th USENIX conference on file and storage technologies (pp. 71–84). San Jose, California: USENIX Association.

SciPy-Developers. (2013). SciPy: Scientific computing tools for Python. Retrieved from https://www.scipy.org

Thompson, D., Jiang, D., Peterson, D., Harbaugh, T., & Chebab, M. C. (2011). EDAC - error detection and correction. Retrieved from https://www.kernel.org/doc/Documentation/edac.txt

Torvalds, L., Red Hat, Inc., et al. (2010). The Red Hat Enterprise Linux 5 kernel source code. Retrieved from http://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/

Youseff, L., Wolski, R., Gorda, B., & Krintz, C. (2006). Evaluating the performance impact of Xen on MPI and process execution for HPC systems. In Second international workshop on virtualization technology in distributed computing.

Zhang, Y., Squillante, M. S., Sivasubramaniam, A., & Sahoo, R. K. (2004). Performance implications of failures in large-scale cluster scheduling. Lecture Notes in Computer Science, 3277, 233–252. Retrieved from http://www.springer.com/computer/lncs

Zheng, Z., Lan, Z., Park, B. H., & Geist, A. (2009). System log pre-processing to improve failure prediction. In Proceedings of the IEEE/IFIP international conference on dependable systems & networks, 2009 (pp. 572–577). Lisbon, Portugal: IEEE. doi: 10.1109/DSN.2009.5270289

Zhou, W., Zhan, J., Meng, D., Xu, D., & Zhang, Z. (2010). LogMaster: Mining event correlations in logs of large-scale cluster systems. Computing Research Repository. Retrieved from http://arxiv.org/abs/1003.0951

Zhou, W., Zhan, J., Meng, D., & Zhang, Z. (2010). Online event correlations analysis in system logs of large-scale cluster systems. Lecture Notes in Computer Science, 6289, 262–276.