Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Purdue UniversityPurdue e-Pubs
Open Access Theses Theses and Dissertations
12-2016
A small-scale testbed for large-scale reliablecomputingJason R. St. JohnPurdue University
Follow this and additional works at: https://docs.lib.purdue.edu/open_access_theses
Part of the Computer Sciences Commons
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] foradditional information.
Recommended CitationSt. John, Jason R., "A small-scale testbed for large-scale reliable computing" (2016). Open Access Theses. 899.https://docs.lib.purdue.edu/open_access_theses/899
Graduate School Form30 Updated
PURDUE UNIVERSITYGRADUATE SCHOOL
Thesis/Dissertation Acceptance
This is to certify that the thesis/dissertation prepared
By
Entitled
For the degree of
Is approved by the final examining committee:
To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material.
Approved by Major Professor(s):
Approved by:Head of the Departmental Graduate Program Date
Jason R. St. John
A SMALL-SCALE TESTBED FOR LARGE-SCALE RELIABLE COMPUTING
Master of Science
Thomas HackerChair
Eric Matson
John Springer
Thomas Hacker
Jeffrey Whitten 11/29/2016
A SMALL-SCALE TESTBED FOR
LARGE-SCALE RELIABLE COMPUTING
A Thesis
Submitted to the Faculty
of
Purdue University
by
Jason R. St. John
In Partial Fulfillment of the
Requirements for the Degree
of
Master of Science
December 2016
Purdue University
West Lafayette, Indiana
ii
ACKNOWLEDGMENTS
I wish to gratefully acknowledge my thesis committee, my family, and my
fellow graduate students for their help and support, encouragement, and insight.
iii
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.6 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.7 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
CHAPTER 2. REVIEW OF RELEVANT LITERATURE . . . . . . . . . . 62.1 HPC System Reliability . . . . . . . . . . . . . . . . . . . . . . . . 62.2 HPC, Cloud Computing, and Virtualization . . . . . . . . . . . . . 82.3 Error Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER 3. FRAMEWORK AND METHODOLOGY . . . . . . . . . . . 123.1 Study Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.1.2 Overview of Statistical Model Creation . . . . . . . . . . . . 133.1.3 System Log Analysis . . . . . . . . . . . . . . . . . . . . . . 14
3.1.3.1 Coates: Log Processing and Sorting . . . . . . . . . 143.1.3.2 Coates: Timestamp Standardization . . . . . . . . 153.1.3.3 Coates: Node ID Standardization . . . . . . . . . . 153.1.3.4 Coates: Scatter Plots of Memory Failure Events . . 163.1.3.5 Coates: Declustering of Failure Events . . . . . . . 163.1.3.6 Carter: Scatter Plots of Hard Disk Failure Events . 163.1.3.7 Carter: Declustering of Failure Events . . . . . . . 163.1.3.8 Distribution Fitting . . . . . . . . . . . . . . . . . 21
3.1.4 Fault Injection . . . . . . . . . . . . . . . . . . . . . . . . . 21
iv
Page3.1.4.1 Memory Error Injection Procedure . . . . . . . . . 223.1.4.2 Machine Check Registers . . . . . . . . . . . . . . . 223.1.4.3 Memory Error Injection Strings . . . . . . . . . . . 233.1.4.4 QEMU Modifications and Hard Disk Error Simulation
Procedure . . . . . . . . . . . . . . . . . . . . . . . 233.1.4.5 Hard Disk Errors . . . . . . . . . . . . . . . . . . . 243.1.4.6 Tunable Fault Injection Frequency . . . . . . . . . 253.1.4.7 Tuning the Three-Parameter Weibull Distribution . 253.1.4.8 Tuning the Lomax Distribution . . . . . . . . . . . 253.1.4.9 Experimental Setup . . . . . . . . . . . . . . . . . 26
3.2 Unit & Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2.2 Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2.3 Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.4 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.5 Measure for Success . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
CHAPTER 4. PRESENTATION OF DATA . . . . . . . . . . . . . . . . . . 324.1 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Coates: Memory Errors . . . . . . . . . . . . . . . . . . . . 324.1.2 Carter: Hard Disk Errors . . . . . . . . . . . . . . . . . . . . 33
4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.2.1 System Log Samples . . . . . . . . . . . . . . . . . . . . . . 334.2.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . 33
CHAPTER 5. CONCLUSIONS, DISCUSSION, AND RECOMMENDATIONS 365.1 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . 365.2 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
APPENDIX A. FAULT INJECTION SCRIPT . . . . . . . . . . . . . . . . . 38
APPENDIX B. QEMU SOURCE CODE MODIFICATIONS . . . . . . . . . 51
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
v
LIST OF TABLES
Table Page
3.1 Comparison of various timestamp formats . . . . . . . . . . . . . . . . 15
3.2 Polynomial coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Polynomial roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Number of centers vs. WSSE . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Supported memory errors and MCi STATUS register values . . . . . . 23
3.6 Supported hard disk errors and their descriptions . . . . . . . . . . . . 24
4.1 Parameters of the three-parameter Weibull distribution for integrated memorycontroller errors on the Coates system . . . . . . . . . . . . . . . . . . 32
4.2 Parameters of the Lomax distribution for hard disk errors on the Cartersystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Number of attempted and successful fault injections . . . . . . . . . . . 35
vi
LIST OF FIGURES
Figure Page
3.1 Time of Memory Error Events vs. Node ID . . . . . . . . . . . . . . . 17
3.2 Time of Hard Disk Error Events vs. Node ID . . . . . . . . . . . . . . 18
3.3 WSSE vs. Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Configuration file for fio . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 systemd service file for fio, installed on the VMs . . . . . . . . . . . . . 29
3.6 systemd service file for changing the default polling interval for machinechecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7 Bash script that changes the default polling frequency for machine checks 30
4.1 System log screenshot of VM “centos4” experiencing an ENOMEDIUMerror injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
vii
ABBREVIATIONS
ADDR memory address
BSD Berkeley Software Distribution
CPU central processing unit
EC2 Amazon’s Elastic Compute Cloud
ECC error-correcting code
DRAM dynamic random-access memory
fio Flexible I/O Tester
GART Graphics Address Relocation Table
GUI graphical user interface
HMP QEMU’s Human Monitor Protocol
HPC high performance computing
IaaS infrastructure as a service
I/O input/output
IP Internet Protocol
ISO International Organization for Standardization
IT information technology
MAC media access control
MC machine check
MCA machine check architecture
MCE machine check exception
MTBF mean time between failures
MTTF mean time to failure
RAM random-access memory
RCAC Rosen Center for Advanced Computing
viii
RFC Request For Comments
RHEL Red Hat Enterprise Linux
TLB Translation Lookaside Buffer
UTC Coordinated Universal Time
VM virtual machine
VMM virtual machine monitor
WSSE within groups sum of squared error
ix
ABSTRACT
St. John, Jason R. M.S., Purdue University, December 2016. A Small-Scale TestbedFor Large-Scale Reliable Computing. Major Professor: Thomas J. Hacker.
High performance computing (HPC) systems frequently suffer errors and
failures from hardware components that negatively impact the performance of jobs
run on these systems. We analyzed system logs from two HPC systems at Purdue
University and created statistical models for memory and hard disk errors. We
created a small-scale error injection testbed—using a customized QEMU build,
libvirt, and Python—for HPC application programmers to test and debug their
programs in a faulty environment so that programmers can write more robust and
resilient programs before deploying them on an actual HPC system. The
deliverables for this project are the fault injection program, the modified QEMU
source code, and the statistical models used for driving the injection.
1
CHAPTER 1. INTRODUCTION
High performance computing (HPC) systems are a major component of
academic and industrial scientific research, and the reliability of these systems is one
of the foremost areas of research in the HPC community. The HPC community is
continually trying to make larger, more powerful HPC systems, and the poor
reliability of commodity HPC clusters is one of the biggest hindrances to the
adoption of petascale and exascale computing systems. This chapter provides the
scope, significance, research question, and other background information on the
thesis.
1.1 Scope
This study examined the system logs of large-scale HPC systems operated by
the Rosen Center for Advanced Computing (RCAC) at Purdue University. The
systems studied are all commodity-based computing clusters running Red Hat
Enterprise Linux (RHEL) 5, which is a GNU/Linux distribution based on version
2.6.18 of the Linux kernel. This study determined the frequency at which
component failures occurred and produced statistical models of the recorded failure
events. The statistical models were used as the driving input to a synthetic fault
generator that simulated hardware failures within a virtual machine environment.
1.2 Significance
Large-scale high performance computing (HPC) systems suffer from low
reliability due to frequent component failures of commodity hardware. Due to the
series reliability model, increasing the number of nodes or processors in an HPC
2
cluster reduces the overall reliability of the whole cluster. This results in frequent
failures that hinder the performance and scalability of large-scale systems.
The work in this thesis provides a small-scale fault injection system based on
system logs from real HPC systems under a typical workload that are used to
emulate failures on a VM-level to create a simulation of a faulty cluster for a
testbed for parallel applications. This would allow parallel application developers to
test their code’s robustness and tolerance of node failures in a small-scale
environment that replicates the failure patterns of large-scale systems. Researchers
will be able to submit more fault-tolerant parallel applications to large-scale HPC
systems first, instead of submitting jobs that have not been properly tested for
robustness and resilience.
1.3 Research Question
Can a small-scale fault injection system based on real HPC system logs be
used to create a suitable testbed for parallel applications?
1.4 Assumptions
The assumptions for this study include:
• The filtering of the system logs by RCAC did not remove any important or
relevant messages.
• The system logs provided by RCAC are accurate and correct.
• MathWave’s EasyFit software correctly and accurately fits datasets to
statistical distributions.
• All virtual cluster nodes’ system clocks were in synchrony with each other
throughout the logging period.
3
• The overhead of the developed system will be minimal enough to avoid
significant performance penalties.
1.5 Limitations
The limitations for this study include:
• The developed system was tested in a virtual environment using the
QEMU-KVM hypervisor only.
• The developed system was tested using the CentOS 7.2 Linux distribution
only.
• The developed system was run on Intel processors.
• The developed system cannot reliably inject multiple hard disk errors when
the time between errors is some time under one minute.
1.6 Delimitations
The delimitations for this study include:
• This study cannot examine the most critical of failure events, the machine
check exception (MCE), because of an unpatched bug in the Linux kernel’s
handling of MCEs. This limitation applies to Linux kernel version 2.6.18.
• The studied HPC systems were manufactured by HP, use AMD processors,
and run Red Hat Enterprise Linux 5.5.
• The developed system was not be tested on live HPC cluster nodes or with
real HPC jobs running.
4
1.7 Definitions
In the broader context of thesis writing, we define the following terms:
• checkpoint–restart: a fault tolerance mechanism designed to back up the state
of a system (Hacker, Romero, & Carothers, 2009)
• component failure: any defect or error that occurs in a single computational
node
• machine check architecture (MCA): the platform independent framework for
detecting, reporting, and handling machine check errors (“AMD64
Architecture Programmer’s Manual Volume 2: System Programming”, 2011)
• machine check error: (commonly: machine check) a correctable or
uncorrectable hardware error detected by the machine check architecture
(MCA) (“AMD64 Architecture Programmer’s Manual Volume 2: System
Programming”, 2011)
• machine check exception (MCE): a machine check error that cannot be
corrected and frequently results in a processor’s context becoming corrupted
(“AMD64 Architecture Programmer’s Manual Volume 2: System
Programming”, 2011)
• mean time between failures (MTBF): the mean time elapsed between
consecutive failure events that includes the repair/maintenance time
• mean time to failure (MTTF): the mean time elapsed from a node being
added to a cluster to its first recorded failure event
• memory module: a hardware component composed of volatile random access
memory (RAM) that is used as the primary cache for a node; nodes usually
have many memory modules.
5
• spatial locality: the apparent clustering of failure events with respect to
physical node location (e.g. server rack), indicating failure event dependence
on an external factor (Hacker et al., 2009)
• temporal locality: the apparent clustering of failure events with respect to
time, indicating failure event interdependence (Hacker et al., 2009)
• Unix epoch timestamp: the count of the number of seconds since
1970-01-01T00:00:00 UTC
1.8 Summary
This chapter has provided the background information, research question,
and the scope of the thesis. The next chapter provides a review of the literature
related to HPC system logs and system reliability.
6
CHAPTER 2. REVIEW OF RELEVANT LITERATURE
This chapter provides a review of the literature relevant to HPC system logs
and HPC system reliability.
2.1 HPC System Reliability
Improving the reliability of high performance computing (HPC) systems is
one of the leading research areas in the HPC field. Many studies have been
performed that have covered various methods for understanding how, why, and
when failures occur in large-scale HPC systems (Atif & Strazdins, 2009;
DeBardeleben, Blanchard, Fu, Guan, & Zhang, 2011; Fu & Xu, 2007; Hacker et al.,
2009; Oliner & Stearley, 2007; Pandit, Kalbarczyk, & Iyer, 2009; Romero, 2010;
Salfner & Tschirpke, 2008; Zhang, Squillante, Sivasubramaniam, & Sahoo, 2004;
Zheng, Lan, Park, & Geist, 2009; Zhou, Zhan, Meng, Xu, & Zhang, 2010; Zhou,
Zhan, Meng, & Zhang, 2010).
HPC system logs are in invaluable tool for studying ways of improving the
reliability of HPC systems. Unfortunately, the system logs on HPC systems are
frequently of poor quality—vague, cryptic, not easily machine parsible, etc.—which
necessitates significant processing to produce useful data. A notable flaw in the
2.6.18 version of the Linux kernel—the same kernel version used in Red Hat
Enterprise Linux (RHEL) 5—prevents the most severe of hardware failures from
being logged at all (Pandit et al., 2009). This further complicates system log
analysis on clusters running RHEL 5.
Zheng et al. (2009) presented a framework for pre-processing HPC system
logs to make these logs better suited for statistical analysis and failure prediction.
The framework consists of three components:
7
1. classifying messages by type and severity
2. removing temporal and spatial clustering of repeated messages for the same
event
3. identifying causal relationships between system log messages to better identify
symptomatic messages
Zheng et al. (2009) stated that their framework improved failure prediction over the
incumbent mechanism by 20% to 170%.
Salfner and Tschirpke (2008) proposed a set of techniques for processing the
system logs of a commercial telecommunication system, and their techniques
consisted of the following:
• classifying “messages based on Levenshtein’s edit distance”
• clustering related error messages
• “a statistical noise filtering algorithm”
Salfner and Tschirpke (2008) stated that their techniques significantly improved
upon other failure prediction methods, and the authors concluded that appropriate
pre-processing of the system logs is vital for producing useful data.
Hacker et al. (2009) concluded in their analysis of system logs from two large
IBM Blue Gene systems that system log messages show significant spatial and
temporal clustering, that the failure events rate varies in time following the Weibull
distribution, and that the health and reliability of individual computational nodes
can be estimated by system log analysis.
The work in this thesis partially aims to support the live migration of virtual
machines atop a cloud-like infrastructure as a service (IaaS) platform—such as
OpenNebula—with which a statistical model of previous HPC system failures, to be
combined in the future with the discrete-time, semi-Markov model introduced by
Hacker et al. (2009), will trigger live migrations as node health decreases. An ideal
8
implementation of this work, under perfect conditions, would result in improved
uptime for jobs because virtual machines would be live migrated off of unhealthy
hardware before failures occur.
2.2 HPC, Cloud Computing, and Virtualization
Cloud computing is a technology still in its adolescence, and it was only in
the mid-2000s that cloud computing and virtualization exploded in popularity.
Virtualization is required for almost any cost-effective cloud computing
environment; thus, for cloud computing to be viable for HPC applications, the
overhead introduced by virtualization must be minimal. Youseff, Wolski, Gorda,
and Krintz (2006) demonstrated that the Xen virtual machine monitor (VMM) adds
“no statistically significant overhead” for standard HPC benchmarks run on the
small-scale HPC system tested in their study. Youseff et al. (2006) stated that their
study showed that previous skepticism of virtualization for HPC applications was
“unwarranted” and that the many added benefits of virtualization outweigh the
negligible performance overhead.
Evangelinos and Hill (2008) stated that Amazon’s Elastic Compute Cloud
(EC2) is a promising platform for HPC applications because research institutions
will not need to build and maintain their own HPC clusters. Evangelinos and Hill
(2008) stated, however, that coupled HPC applications do suffer from a significant
performance penalty incurred by the inherent design of Amazon’s EC2 because
Amazon does not provide clients the ability to deploy virtual machines at a granular
level. Despite the performance penalty, the low cost and ability to deploy HPC
applications quickly with few requirements for local IT infrastructure demonstrate
the viability of cloud-based HPC systems in the future, especially if a cloud system
was designed with HPC applications in mind (Evangelinos & Hill, 2008).
The live migration of virtual machines is a process which involves copying
the running virtual machine’s memory pages to another physical host also running a
9
VM and, in the case of Xen, iteratively copies the memory pages up to 30 times
before pausing the running operation of the VM, resuming the VM on the new host,
and then destroying the original VM instance (Atif & Strazdins, 2009). Atif and
Strazdins (2009) hypothesized that the overall performance of live migrations could
be significantly improved by reducing the amount of CPU intensive and network
intensive memory copy iterations to the bare minimum of two iterations. Atif and
Strazdins (2009) demonstrated that optimizing this process resulted in a reduction
of memory page transfer by over 500% for the very memory intensive HPC
benchmarks. Their optimization showed improvements of almost 50% in total wall
clock time spent in the live migration process and around 200% performance
improvement for memory intensive and network intensive jobs (Atif & Strazdins,
2009). Additionally, live migration has already been shown to be a promising
proactive fault tolerance technique. Romero (2010) demonstrated a reduction in the
failure rate of parallel applications by live migrating multiple OpenVZ
containers—basically, advanced derivatives of FreeBSD Jails—to multiple nodes
that are less prone to failure.
2.3 Error Injection
Several error and fault injection systems have been developed previously.
Giuffrida, Kuijsten, and Tanenbaum (2013) created a fault injection tool for
software programs. Their system injects software faults in a controlled manner and
“offers strong guarantees” that attempted software fault injections do not cause
unintended side effects that could undermine the results. However, their approach
requires changes be made to the tested software at compile-time or using a
disassembler if source code is not available.
Guan, DeBardeleben, Blanchard, and Fu (2014) created a fault injector tool
that uses the QEMU virtual machine and can precisely target programs running
inside a VM. Their system is intended for introducing errors in software programs
10
by modifying the program’s memory by exploiting how QEMU translates virtual
machine instructions between the guest and host systems. Their system exploits the
Tiny Code Generation (TCG) system used by QEMU and “translates” the intended
VM instructions to corrupted ones before the host executes them.
Error injection into virtual machines has been investigated previously by
DeBardeleben et al. (2011). DeBardeleben et al. (2011) introduced a framework for
injecting errors that evaluates the “resilience” of parallel applications to such errors.
The approach DeBardeleben et al. (2011) took is targeted solely at causing faults
inside the application’s memory.
Levy, Dosanjh, Bridges, and Ferreira (2013) created a VM-based fault
injection tool using the Palacios virtual machine monitor (Lange et al., 2011), which
is a VMM developed for HPC systems. Their approach injects memory errors at
specific memory locations and IDE disk errors at the VM level. Their method of
injecting errors differs from our approach due to differences in the hypervisor and
due to our approach being more generalized and not necessitating physical memory
or disk addresses.
The work in this thesis is evaluated by testing the injection of machine
checks and simulation of disk failures into a virtual cluster—in other words, causing
the virtual hardware to experience similar failures and errors as those observed in
real system logs under typical workloads. The work in this thesis uses libvirt (libvirt
Virtualization API , 2015)—a virtualization API—and a custom QEMU to inject
errors into the virtual hardware which causes the virtual machine to detect and
report these errors using the inherent failure detection mechanisms within the OS.
Our approach differs from Giuffrida et al. (2013), Guan et al. (2014), and
DeBardeleben et al. (2011) by simulating errors at the virtual hardware level and
using the inherent failure detection mechanisms within the OS to detect the errors.
Our approach improves upon the work by Levy et al. (2013) by triggering
whole-disk errors which ensures that errors are always detected given some disk
activity, supporting a larger array of memory and disk error types, and by using a
11
mainstream virtual machine monitor in QEMU—a key ease-of-use requirement for a
testbed system not intended to be deployed on real HPC systems, and therefore not
bound by the same performance requirements of real HPC jobs.
2.4 Summary
This chapter provided an overview of the literature related to HPC system
logs and HPC system reliability. The next chapter provides the research approach
and methodology for the thesis.
12
CHAPTER 3. FRAMEWORK AND METHODOLOGY
This chapter provides the framework and methodology that was used in this
research project.
3.1 Study Design
This study was a quantitative research analysis of system logs from real HPC
systems. This study created statistical models of component failures within the
systems and used the models to simulate failure events on virtual machines. The
virtual machines experienced failures similar to those found in the logs from real
HPC systems.
3.1.1 Data Sets
Memory error log data was taken from system logs from the 993-node Coates
system (Rosen Center for Advanced Computing, n.d.-b) operated by the Rosen
Center for Advanced Computing (RCAC) at Purdue University. The logs analyzed
ranged from 2009-09-01 to 2011-02-20.
Hard disk error log data was taken from system logs from the 660-node
Carter system (Rosen Center for Advanced Computing, n.d.-a) operated by RCAC
at Purdue University. The logs analyzed ranged from 2012-11-25 to 2013-08-31.
The Coates and Carter systems were operating under typical HPC workloads
throughout the log collection period. This makes the analyzed log data
generalizable to other large-scale systems operating under workload.
13
3.1.2 Overview of Statistical Model Creation
This section provides an overview of the steps taken to create the statistical
models. This is provided to assist other researchers that may be interested in
creating statistical models for errors on other systems.
The system logs for each system were collected in a central location and were
stripped of personally-identifiable information prior to the beginning of this study.
The following list provides a high-level overview of the procedure.
1. Error discovery
2. Error filtering
3. Error sorting
4. Log message timestamp sanitization
5. Scatter plot creation to identify clustering of errors
6. Declustering data using k-means
7. Fitting distributions to the data
8. Selecting a model from the fitted distributions
Manual searching of the system logs was performed using grep with regular
expressions to look for anomalous log events. Error messages of interest were then
filtered out from the rest of the logs and sorted based on the type of error. Due to
poor quality timestamps of the Coates system logs, these timestamps had to be
corrected prior to further processing.
Scatter plots were made of the error event data to identify the severity of
clustering. We then declustered the data using k-means cluster analysis. The
declustered data was loaded into distribution fitting software. Finally, the
continuous statistical distributions with the best fit were selected as our models.
14
3.1.3 System Log Analysis
The system logs were manually examined to determine the key phrases
within the logs related to failure events of memory and hard disks. The logs were
searched using a grep regular expression like the one below:
grep -Ei error messages*
The system logs from Coates needed significant processing to improve the
quality of message timestamps. The system logs from Carter did not need any
notable processing.
3.1.3.1 Coates: Log Processing and Sorting
The logs were filtered so that only failure-event-related messages were kept. The
logs were further processed to identify specific failure messages and were sorted by
those messages.
We investigated general errors detected by the integrated memory
controller—an external memory controller is called a northbridge, but this
distinction is not made in the system logs. The “BIOS and Kernel Developer’s
Guide for AMD Athlon 64 and AMD Opteron Processors” (2006) states that these
errors include errors in the GART TLB cache, errors in the HyperTransport link, or
errors in DRAM. It should be noted that the RHEL kernel version
2.6.18-194.17.1.el5 disables GART TLB error reporting by default. The “node” and
“core” values mentioned below refer to the physical CPU package (i.e. socket) and
the CPU core number within the physical CPU package, respectively. Because the
Coates cluster is built entirely of dual-socket, quad-core AMD Opteron processors,
the only valid values for “node” are 0 and 1, and the only valid values for “core” are
0, 1, 2, and 3.
The regular expression used to match these errors is
kernel:.*Northbridge Error
and an example log message is
15
Feb 16 14:45:50 172.18.42.117 kernel: Northbridge Error, node 1, core: 2
3.1.3.2 Coates: Timestamp Standardization
The timestamps in the Coates logs were changed from the default BSD-style syslog
format, specified in RFC 3164, to the Unix epoch format for easier machine
processing. Table 3.1 compares the differences between the easily human-readable
ISO 8601 and BSD-style syslog formats and the easily machine-readable Unix epoch
format for a sample from the Coates system logs. For the BSD-style syslog format,
the year was determined based on log file metadata because the year is not included
in these timestamps.
Table 3.1Comparison of various timestamp formats
Format Example timestamp
ISO 8601 2011-02-13T09:02:39Z
BSD-style syslog Feb 13 04:02:39
Unix epoch 1297587759
3.1.3.3 Coates: Node ID Standardization
To look for temporal and spatial clustering of events, we created a scatter plot to
investigate the degree of clustering. To plot the nodes on the horizontal axis, the
nodes were assigned unique IDs based on their IP addresses.
The unique node IDs were generated based on the compute node’s IP
address. The IP address was converted from dotted decimal notation, e.g.
192.168.15.76, into its packed, 32-bit, binary format represented as a decimal, e.g.
3232239436. Sequential node IDs were then generated by sorted the decimal-based
node IDs. The data set, composed of unique node IDs and the message timestamps
16
of failure events, was plotted as a scatter plot with the node IDs sorted from
smallest to largest along the horizontal axis and the event timestamps plotted along
the vertical axis.
3.1.3.4 Coates: Scatter Plots of Memory Failure Events
Figure 3.1 shows a scatter plot of memory failure event times plotted against node
IDs.
3.1.3.5 Coates: Declustering of Failure Events
Notable temporal and spatial clustering of events was found in the data set, as
shown in Figure 3.1. The temporal and spatial clustering of events shows that these
events are not randomly distributed. This can be for a number of various reasons
including faulty hardware, physical location in the data center, etc. The events for
memory errors were declustered by Rui Maximo Esteves from the University of
Stavanger in Norway using the R statistical software (Esteves, Hacker, & Rong,
2012).
3.1.3.6 Carter: Scatter Plots of Hard Disk Failure Events
Figure 3.2 shows a scatter plot of hard disk failure event times plotted against node
IDs.
3.1.3.7 Carter: Declustering of Failure Events
Notable temporal and spatial clustering of events was found in the data set, as
shown in Figure 3.2. The events for hard disk errors were declustered using the R
statistical software (The R Project for Statistical Computing , 2016) by the author.
The “neural gas” k-means algorithm was used from the “cclust” package in R. The
17
Fig
ure
3.1.
Tim
eof
Mem
ory
Err
orE
vents
vs.
Node
ID
18
Fig
ure
3.2.
Tim
eof
Har
dD
isk
Err
orE
vents
vs.
Node
ID
19
number of centers was chosen by locating the “elbow” of the plot of within groups
sum of squared error (WSSE) versus the number of centers. WSSE is a measure of
the accuracy of the k-means fit. WSSE is the sum of the distance squared between
data points and their assigned cluster. When the slope of a polynomial fitted to a
plot of the WSSE vs. the number of centers approaches 0, then the optimum
number of centers has been found because increasing the number of centers does not
lower WSSE any further.
The elbow was located by fitting a fifth-degree polynomial to the WSSE plot
using MatLab and finding the real roots of the polynomial. The polynomial, in
Equation 3.1, has the coefficients shown in Table 3.2:
p1 ∗ x5 + p2 ∗ x4 + p3 ∗ x3 + p4 ∗ x2 + p5 ∗ x+ p6 (3.1)
Table 3.2Polynomial coefficients
Coefficient Value
p1 -3.204e-07
p2 0.005271
p3 -34.83
p4 1.157e+05
p5 -1.934e+08
p6 1.31e+11
Differentiating the polynomial and solving for its roots in MatLab produced
the solutions shown in Table 3.3. The first real root was selected and rounded down
to 3197. K-means was run using 3197 centers to get the final declustered data set
for Carter. Figure 3.3 shows a plot of WSSE vs. the number of clusters with a clear
“elbow”.
Table 3.4 lists the number of centers vs. WSSE.
20
Table 3.3Polynomial roots
Roots
3197.3 + 0.0000i
3762.3 + 0.0000i
3100.7 - 0.6492i
3100.7 + 0.6492i
Figure 3.3. WSSE vs. Number of Clusters
21
Table 3.4Number of centers vs. WSSE
Number of Centers WSSE
1000 23408200075.0000
1500 7866952504.0000
2000 2380480559.0000
2500 823117829.2000
3000 585667433.3000
3500 458904640.7000
4000 347255632.8000
3.1.3.8 Distribution Fitting
The declustered data sets for Coates and Carter were then input into MathWave’s
EasyFit software for distribution fitting (EasyFit , 2010). The fit process was run
using a continuous data domain of the time between events data. The distributions
with the best goodness of fit results were chosen.
3.1.4 Fault Injection
We created a Python 3 script that can inject arbitrary machine checks into
VMs and simulate hard disk failures for virtual disk drives. Machine checks were
injected using QEMU’s built-in Human Monitor Protocol (QEMU Emulator User
Documentation: QEMU Monitor , n.d.). Hard disk errors were simulated using
modified VirtIO SCSI drivers for QEMU (Bellard, 2014).
The script has a multi-threaded design that allows for an arbitrary number
of VMs to be supported. When an error is about to be injected, the script randomly
selects the VM in which to inject the error from a preconfigured list of VMs.
Appendix A contains the source code for our fault injection script.
22
3.1.4.1 Memory Error Injection Procedure
Arbitrary memory errors are injected using QEMU’s Human Monitor Protocol
(HMP). The HMP has various commands that can be executed via libvirt’s “virsh”
tool. One of these commands is the “mce” command that sets the values of the
virtual CPU’s MCi STATUS and related registers. These registers control the
Machine Check Architecture (MCA) subsystem of the processor, which is used for
memory and processor error detection and reporting.
For example, running the following command will inject an ECC error
overflow with a valid address into CPU 0 bank 4 of domain 12:
virsh -c qemu:///system qemu-monitor-command 12 –hmp –cmd ’mce 0 4
0xd426c0010b000813 0x0 0x0 0x0’
3.1.4.2 Machine Check Registers
The memory error messages analyzed were composed of errors detected in the
on-die northbridge of the systems’ AMD processors. These memory errors are
handled by the processor’s Machine Check Architecture, which contains several
registers indicating the type of error detected by the processor. The MCi STATUS
register is the register that reports the error detected. The underlying bit patterns
of the MCi STATUS register were analyzed based on AMD’s system programming
manuals (“AMD64 Architecture Programmer’s Manual Volume 2: System
Programming”, 2011; “BIOS and Kernel Developer’s Guide for AMD Athlon 64 and
AMD Opteron Processors”, 2006), the Linux kernel documentation (Thompson,
Jiang, Peterson, Harbaugh, & Chebab, 2011), and the RHEL kernel’s source code
(Torvalds, Red Hat, Inc., et al., 2010). The bit patterns required to inject the
observed failure events were recorded and added to the fault injection system.
23
3.1.4.3 Memory Error Injection Strings
The fault injection system supports injecting ten types of memory errors. Table 3.5
documents the supported memory errors. Uncorrectable errors are indicated by
“(UE)”. Overflow errors occur when more than one machine check occurs within the
machine check polling interval.
Table 3.5Supported memory errors and MCi STATUS register values
Error Type MCi STATUS Register
ECC error (ADDR valid) 0x9426c0010b000813
ECC error overflow (ADDR valid) 0xd426c0010b000813
ECC error (ADDR invalid) 0x9026c0010b000813
ECC error overflow (ADDR invalid) 0xd026c0010b000813
L1 Cache Data Store error (UE) 0xb600200000000145
L1 Instruction Cache (Instruction Fetch)
error (ADDR valid) 0x9400000000000151
L1 Instruction Cache (Instruction Fetch)
error overflow (ADDR valid) 0xd400000000000151
Bus Unit (L2 Cache) error (UE) 0xb600000000020136
L2 Data Cache (Line Fill) error (ADDR valid) 0x9400400000000136
L2 Data Cache (Line Fill) error overflow (ADDR valid) 0xd400400000000136
3.1.4.4 QEMU Modifications and Hard Disk Error Simulation Procedure
QEMU does not support disk error injection by default. We modified the source file
hw/scsi/scsi-disk.c
from QEMU version 2.0.0 to support disk error injection via the VirtIO SCSI
subsystem. When a hard disk error is triggered and detected, the VM’s SCSI driver
24
skips to its built-in error handling code and propagates the error to the rest of the
system using the OS’s inherent failure detection mechanisms. Appendix B contains
the source code changes we made.
Hard disk errors are injected by the existence of a file under a directory tree
containing MAC addresses that map the errors to specific VMs. For example, the
existence of the following file will simulate the error ENOMEDIUM for the VM with
MAC address 52:54:00:39:ca:8b:
/tmp/qemu-disk-inject/52:54:00:39:ca:8b/ENOMEDIUM
The Flexible I/O Tester (fio) (Axboe, 2015) was used to simulate a light disk
I/O workload on the VMs. This ensures that the injected errors are recognized by
QEMU and the guest OS.
3.1.4.5 Hard Disk Errors
The fault injection system supports injecting four types of hard disk errors;
however, only the errors ENOMEM, ENOMEDIUM, and EINVAL were used for the
simulations. Injecting the error ENOSPC immediately suspends the VM, making
this error not suitable for simulations. These four errors are all of the errors
supported by QEMU. Table 3.6 documents the supported errors and their
descriptions. The descriptions are from
/usr/include/asm-generic/errno{,-base}.h
Table 3.6Supported hard disk errors and their descriptions
Errno Description
ENOMEM out of memory
ENOSPC no space left on device
ENOMEDIUM no medium found
EINVAL invalid argument
25
3.1.4.6 Tunable Fault Injection Frequency
The script supports tunable fault injection frequency. The statistical distributions
from EasyFit were mathematically analyzed to allow for arbitrary changes to the
expected values of the distributions. The scale parameters to the distributions were
made variable, via an inverse multiplier term, to provide a desired increase or
decrease in the frequency of failure events, similar to a tuning knob.
3.1.4.7 Tuning the Three-Parameter Weibull Distribution
Integrated memory controller errors on the Coates system followed a
three-parameter Weibull distribution with the parameters found in Table 4.1. The
expected value of the three-parameter Weibull distribution is shown in
Equation 3.2, where Γ is the Gamma function, γ is the shape parameter, α is the
scale parameter, and µ is the location parameter (NIST, 2003).
E[X] = Γ(
1 +1
γ
)∗ α + µ (3.2)
Rearranging the terms of Equation 3.2 gives a formula for an alternate scale
parameter with an inverse multiplier term, as shown in Equation 3.3, where E[X] is
the expected value and β is the multiplier.
α =
((E[X]− µ) ∗
(1β
))Γ(1 + 1
γ
) (3.3)
3.1.4.8 Tuning the Lomax Distribution
Hard disk errors on the Carter system followed a Lomax distribution—Pareto Type
II distribution that starts at 0—with the parameters found in Table 4.2. The
expected value of the Lomax distribution is shown in Equation 3.4, where λ is the
scale parameter, and α is the shape parameter.
E[X] =λ
α− 1(3.4)
26
Rearranging the terms of Equation 3.4 gives a formula for an alternate scale
parameter with an inverse multiplier term, as shown in Equation 3.5, where E[X] is
the expected value and β is the multiplier.
λ = E[X] ∗ 1
β∗ (α− 1) (3.5)
3.1.4.9 Experimental Setup
The fitted distributions were then used as input for the fault injection script to
inject errors with the same statistical distributions observed in the Coates and
Carter logs.
The testbed server is a custom-built, dual-socket Intel Xeon X5650 system
with 24 cores and 36 GiB of DDR3 memory. The stock QEMU package was
installed, and our custom QEMU was compiled and installed manually into
/usr/local/bin
To replicate this, future users can apply the “diff” from Appendix B to the
QEMU 2.0.0 source tarball using “git apply”, install the needed dependencies from
the “configure” script, and then compile and install the software:
./configure make make install
The stock QEMU executable was backed up, and a symlink was created to
our custom QEMU executable from
/usr/libexec/qemu-kvm
We added the SELinux tag “qemu exec t” to the custom QEMU executable
for SELinux compatibility. Eight QEMU-KVM virtual machines were created on
the testbed server. Using virt-manager, we changed their SCSI controller type from
“hypervisor default” to “VirtIO SCSI”, and we changed the disk bus to “SCSI”
with a “raw” storage format. The server and its eight VMs run CentOS 7.2 without
a GUI.
mcelog (Kleen, 2015) was installed on the guest VMs to decode the injected
machine checks. The Flexible I/O Tester (fio) (Axboe, 2015) was configured as a
27
workload generator on all eight VMs. fio has been configured in time-based mode
scheduled for a two month duration using the Intel IOMeter File Server Access
Pattern job configured with a “linear” I/O depth, as shown in Figure 3.4. A systemd
service file, shown in Figure 3.5, was created to start the fio service at boot and to
automatically restart on service failure. The eight VMs were configured to forward
their system logs to a centralized server for easy log collection and monitoring.
A systemd service file, shown in Figure 3.6, was created to set the Linux
kernel’s machine check “check interval” from 300 seconds to 1 second. Figure 3.7
shows the Bash script that is called from the ExecStart directive from Figure 3.6.
This systemd service file and Bash script were installed on the VMs.
3.2 Unit & Sampling
The following sections will discuss the hypotheses, population, samples,
variables, and the measure for success.
3.2.1 Hypothesis
The hypotheses for this study are the following:
H0: It is not possible to develop a small-scale fault injection testbed that
can emulate the types of faults on large-scale HPC systems.
Hα: It is possible to develop a small-scale fault injection testbed that
can emulate the types of faults on large-scale HPC systems.
3.2.2 Population
The population of the memory error system log data is a set of logs from the
“Coates” compute cluster operated by the Rosen Center for Advanced Computing
(RCAC) at Purdue University. The population of the hard disk system log data is a
28
# This job file tries to mimic the Intel IOMeter File Server Access
# Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
time_based
runtime=5256000
filename=fio-data-file
unified_rw_reporting=1
[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
#readwrite=randrw
readwrite=read
#rwmixread=80
direct=1
size=4g
ioengine=libaio
# IOMeter defines the server loads as the following:
# iodepth=1 Linear
# iodepth=4 Very Light
# iodepth=8 Light
# iodepth=64 Moderate
# iodepth=256 Heavy
iodepth=1
Figure 3.4. Configuration file for fio
29
[Unit]
Description=flexible I/O tester
After=network.target
[Install]
WantedBy=multi-user.target
[Service]
Type=idle
PIDFile=/run/fio.pid
Restart=always
ExecStart=/usr/bin/fio /home/jason/iometer-file-access-server.fio
Figure 3.5. systemd service file for fio, installed on the VMs
[Unit]
Description=Change the default polling frequency for machine checks
After=network.target
[Install]
WantedBy=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/bin/bash /home/jason/set-mce-polling-frequency.sh
Figure 3.6. systemd service file for changing the default pollinginterval for machine checks
30
#!/usr/bin/bash
echo "1" > /sys/devices/system/machinecheck/machinecheck0/\
check_interval
echo "1" > /sys/devices/system/machinecheck/machinecheck1/\
check_interval
Figure 3.7. Bash script that changes the default polling frequency formachine checks
31
set of logs from the “Carter” compute cluster operated by RCAC at Purdue
University.
The population of the experimental units is a set of virtual machines
deployed on a small set of servers.
3.2.3 Sample
The fault injection system uses the Python 3 Standard Library and SciPy
(SciPy-Developers, 2013) to sample the statistical distributions with the specified
parameters.
3.2.4 Variables
The independent variable was the frequency of fault injections for each failure
event type. The dependent variables were the systems’ perceived “healthiness” and
accuracy of triggered migrations with respect to “actual” major failures.
3.2.5 Measure for Success
The study examined the data and tested the null hypothesis with a 95%
successful injection rate.
3.3 Summary
This chapter provided the research approach and methodology that was used
in the execution of this research project.
32
CHAPTER 4. PRESENTATION OF DATA
This chapter provides the results from this research project.
4.1 Statistical Models
This section describes the statistical models for failure events observed in the
Coates and Carter data sets.
4.1.1 Coates: Memory Errors
Integrated memory controller errors on the Coates system followed a
three-parameter Weibull distribution with the parameters found in Table 4.1. This
correlates with the results from Hacker et al. (2009) that also showed a Weibull
distribution fits well with failure events.
Table 4.1Parameters of the three-parameter Weibull distribution for integratedmemory controller errors on the Coates system
Parameter Value
Shape 0.49575
Scale 5293.0
Location 9.1171
33
4.1.2 Carter: Hard Disk Errors
Hard disk errors on the Carter system followed a Lomax distribution (Pareto
Type II that starts at 0) with the parameters found in Table 4.2. This correlates
with the results from Schroeder, Damouras, and Gill (2010) that also showed a
Pareto distribution is the best fit for disk errors.
Table 4.2Parameters of the Lomax distribution for hard disk errors on the Carter system
Parameter Value
Shape 0.38278
Scale 450.36
4.2 Experiments
This section contains the results of the experimental setup.
4.2.1 System Log Samples
This section contains system log samples from the VMs running under the
fault injection script.
4.2.2 Quantitative Results
This section contains quantitative results from the experiments comparing
the number of attempted fault injections versus the number of faults the guest OS
of the VMs reported.
Over a 7 day time period, 15,644 total faults were randomly injected across
the eight VMs. Table 4.3 shows the number of attempted injections, the number of
successful injections, and an adjusted number of injections that accounts for
34
Figure 4.1. System log screenshot of VM “centos4” experiencing anENOMEDIUM error injection
35
duplicates outside the scope of this thesis due to the limitations of the hard disk
injection mechanism. The adjusted value was tabulated by examining the logs on a
per-VM basis and using a sliding window of one minute. When multiple disk errors
occurred within one minute of each other, the first error time was noted and one
minute was added to this time; any errors that occurred within this window were
marked as successful and tabulated under the adjusted column. The values for
ENOMEM and EINVAL have higher adjusted values than errors attempted due to
uncertainty in how big this sliding window must be. From this data, 99.7% of disk
errors were successfully injected; 100% of memory errors were successfully injected;
and 99.99% of all total errors were successfully injected.
Table 4.3Number of attempted and successful fault injections
Error Attempted Successful Adjusted
ENOMEDIUM 300 182 298
ENOMEM 205 122 216
EINVAL 230 150 301
ECC error (ADDR valid) 143 143 N/A
L1 Instruction Cache (Instruction Fetch) 143 143 N/A
error (ADDR valid)
L2 Data Cache (Line Fill) error 14 623 14 623 N/A
(ADDR valid)
36
CHAPTER 5. CONCLUSIONS, DISCUSSION, AND RECOMMENDATIONS
5.1 Conclusions and Discussion
We have processed and analyzed system log data from two high performance
computing clusters running typical HPC workloads operated by the Rosen Center
for Advanced Computing at Purdue University. We have produced statistical
models of hardware errors in these systems. We have developed a fault injection
testbed that, in virtual machines, injects/simulates real hardware errors using the
same statistical models we produced from these computing clusters.
As seen in the raw injection data, a number of attempted disk error
injections did not succeed. We found that hard disk errors cannot be reliably
injected with a time between errors of some time under one minute. After
examining the data, most of these failed error injections can be accounted for with
multiple injections in short succession before the virtual machines were able to
recover from previous failures. We postulate this is because the workload generator
crashes at every hard disk injection, and the generator needs time to restart itself
fully, along with the normal hard disk recovery procedures.
The statistical models we created are broadly applicable due to the high
quality of the source data—large-scale HPC systems running typical workloads for
long durations. HPC system logs of production systems under workload are rarely
made available to researchers so we believe these statistical models will be useful for
other researchers and system administrators that do not have access to such data.
The fault injection system we created is useful for parallel application
developers and system administrators that want to test their application’s
robustness to hardware errors. Parallel application developers can test how their
37
programs respond to errors in main memory and CPU caches, which is especially
useful for highly optimized programs designed to stay in the CPU caches.
Additionally, system administrators can test before deploying to production how
arbitrary web applications, customer-facing web portals like Internet-based stores,
high availability software, etc. respond to failing hardware that could interrupt
business functions or result in lost revenue.
5.2 Recommendations
For future work, we recommend further analysis of the Rosen Center’s
cluster data sets for other hardware error types and generate more statistical models
for these errors. We also recommend working with the QEMU project maintainers
to create a built-in mechanism for hard disk error injection. Finally, we recommend
extending this work by investigating error prediction based on prior hardware errors.
APPENDICES
38
APPENDIX A. FAULT INJECTION SCRIPT
#!/usr/bin/env python3
"""Docstring: Use QEMU and libvirt to inject arbitrary errors into VMs.
Author: Jason St. John
License: Apache 2.0
Copyright 2016 Jason St. John
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-------------------------------------------------------------------
Python version required: >=3.2
This script generates lists of sojourn times between machine check and
39
disk fault injections based on various statistical distributions and
injects the given error using libvirt and a custom QEMU.
"""
from scipy.stats import lomax
import io, os, subprocess, time, threading, sys, math, random, csv
import datetime
#import libvirt
__version__ = "1.0"
__status__ = "Final"
__date__ = "2016-11-09"
__maintainer__ = "Jason St. John"
__author__ = ["Jason St. John"]
__email__ = "[email protected]"
"""
conn = libvirt.open(’qemu:///system’)
if conn == None:
print(’Failed to open connection to qemu:///system’, file=sys.stderr)
exit(1)
domain_IDs = conn.listDomainsID()
if domain_IDs == None:
print(’Failed to get a list of domain IDs’, file=sys.stderr)
print("Active domain IDs:")
if len(domain_IDs) == 0:
print(’ None’)
40
else:
for domain_ID in domain_IDs:
print(’ ’ + str(domain_ID))
"""
# These are dicts of domain (read: VM) IDs & MAC addresses for use
# with libvirt. The "high", "normal", and "low" denote fault frequency.
# Run ‘virsh -c qemu:///system list‘ to find the domain IDs you want
# to use. MAC addresses can be found via virt-manager or virsh.
domain_list_high = {
’9’: ’52:54:00:52:66:c1’,#centos1
’4’: ’52:54:00:42:6f:d4’,#centos4
’6’: ’52:54:00:26:57:2d’,#centos6
’7’: ’52:54:00:ae:f6:2a’#centos7
}
domain_list_normal = {
’2’: ’52:54:00:1e:cc:65’,#centos2
’5’: ’52:54:00:e0:0a:6d’,#centos5
’8’: ’52:54:00:82:53:05’#centos8
}
domain_list_low = {
’3’: ’52:54:00:cf:de:c7’#centos3
}
"""
# Check whether the targeted VMs are running
domain_list_all = list(domain_list_high.keys())
domain_list_all += list(domain_list_normal.keys())
domain_list_all += list(domain_list_low.keys())
for domain in domain_list_all:
41
if domain not in domain_IDs:
print("Targeted VM not running: ", domain)
conn.close()
exit(2)
"""
# This is the path to the directory that our modified QEMU looks for
# disk error injection.
qemu_disk_inject_path = "/tmp/qemu-disk-inject/"
"""
Define all of the functions we will use for tweaking the distribution
parameters.
"""
def weibull_mean(shape, scale, location):
"""Return the mean of the three parameter Weibull distribution."""
mean = scale * math.gamma(1 + (1 / shape)) + location
return mean
def weibull_alt_scale(mean, multiplier, shape, location):
"""Return an alternate scale parameter used to tweak the intensity."""
alt_scale = ((mean - location) * (1 / multiplier))\
/ math.gamma(1 + (1 / shape))
return alt_scale
def lomax_mean(shape, scale):
"""Return the mean of the Lomax distribution."""
mean = scale / (shape - 1)
return mean
42
def lomax_alt_scale(mean, multiplier, shape):
"""Return an alternate scale parameter used to tweak the intensity."""
alt_scale = mean * (1 / multiplier) * (shape - 1)
return alt_scale
"""
Set variables for the distributions we are using. These values were
derived from declustering of the raw fault data from the Coates
cluster.
"""
# Weibull distribution
# ECC Errors
fault_mem_shape = 0.49575
fault_mem_scale = 5293.0
fault_mem_location = 9.1171
fault_mem_mean = weibull_mean(fault_mem_shape, fault_mem_scale,
fault_mem_location)
# High intensity
fault_mem_high_iterations = 20000
fault_mem_high_multiplier = 400
fault_mem_high_scale = weibull_alt_scale(fault_mem_mean,
fault_mem_high_multiplier, fault_mem_shape, fault_mem_location)
# Normal intensity
fault_mem_normal_iterations = 20000
fault_mem_normal_multiplier = 400
fault_mem_normal_scale = weibull_alt_scale(fault_mem_mean,
fault_mem_normal_multiplier, fault_mem_shape, fault_mem_location)
43
# Low intensity
fault_mem_low_iterations = 20000
fault_mem_low_multiplier = 400
fault_mem_low_scale = weibull_alt_scale(fault_mem_mean,
fault_mem_low_multiplier, fault_mem_shape, fault_mem_location)
# Lomax distribution
fault_disk_shape = 0.38278
fault_disk_scale = 450.36
fault_disk_mean = lomax_mean(fault_disk_shape, fault_disk_scale)
# High intensity
fault_disk_high_iterations = 20000
fault_disk_high_multiplier = 300
fault_disk_high_scale = lomax_alt_scale(fault_disk_mean,
fault_disk_high_multiplier, fault_disk_shape)
# Normal intensity
fault_disk_normal_iterations = 20000
fault_disk_normal_multiplier = 300
fault_disk_normal_scale = lomax_alt_scale(fault_disk_mean,
fault_disk_normal_multiplier, fault_disk_shape)
# Low intensity
fault_disk_low_iterations = 20000
fault_disk_low_multiplier = 300
fault_disk_low_scale = lomax_alt_scale(fault_disk_mean,
fault_disk_low_multiplier, fault_disk_shape)
44
"""
These strings contain commands for the QEMU monitor to inject arbitrary
machine checks (MCs) and machine check exceptions (MCEs).
Note: ’_of’ designates an overflow machine check.
Format: "mce [cpu] [bank] [status] [mcgstatus] [addr] [misc]"
Reference AMD publications #24593 and #26094 for documentation on
the format of the [status] section.
"""
# L2 Data Cache (Line Fill) error (ADDR valid)
mc0_inject = "mce 0 0 0x9400400000000136 0x0 0x0 0x0"
mc0_inject_of = "mce 0 0 0xd400400000000136 0x0 0x0 0x0"
# L1 Instruction Cache (Instruction Fetch) Error (ADDR valid)
mc1_inject = "mce 0 1 0x9400000000000151 0x0 0x0 0x0"
mc1_inject_of = "mce 0 1 0xd400000000000151 0x0 0x0 0x0"
# Bus Unit (L2 Cache) Error (uncorrectable error)
mc2_inject = "mce 0 2 0xb600000000020136 0x0 0x0 0x0"
# L1 Cache Data Store Error (uncorrectable error)
mc3_inject = "mce 0 3 0xb600200000000145 0x0 0x0 0x0"
# ECC error (ADDR valid)
mc4_inject = "mce 0 4 0x9426c0010b000813 0x0 0x0 0x0"
mc4_inject_of = "mce 0 4 0xd426c0010b000813 0x0 0x0 0x0"
45
# ECC error (ADDR invalid)
mc4_inject_invalid_addr = "mce 0 4 0x9026c0010b000813 0x0 0x0 0x0"
mc4_inject_invalid_addr_of = "mce 0 4 0xd026c0010b000813 0x0 0x0 0x0"
# Populate the lists of sojourn times.
fault_mem_normal_list = []
for obs in range(fault_mem_normal_iterations):
variate = random.weibullvariate(fault_mem_normal_scale,
fault_mem_shape)
variate += fault_mem_location
fault_mem_normal_list.append(variate)
fault_mem_high_list = []
for obs in range(fault_mem_high_iterations):
variate = random.weibullvariate(fault_mem_high_scale, fault_mem_shape)
variate += (fault_mem_location * fault_mem_high_multiplier)
fault_mem_high_list.append(variate)
fault_mem_low_list = []
for obs in range(fault_mem_low_iterations):
variate = random.weibullvariate(fault_mem_low_scale, fault_mem_shape)
variate += (fault_mem_location * fault_mem_low_multiplier)
fault_mem_low_list.append(variate)
fault_disk_normal_list = lomax.rvs(fault_disk_shape,
scale=fault_disk_normal_scale,
size=fault_disk_normal_iterations)
fault_disk_high_list = lomax.rvs(fault_disk_shape,
46
scale=fault_disk_high_scale,
size=fault_disk_high_iterations)
fault_disk_low_list = lomax.rvs(fault_disk_shape,
scale=fault_disk_low_scale,
size=fault_disk_low_iterations)
def touch(file_name):
"""Emulate the Unix ’touch’ command."""
try:
os.utime(file_name, None)
except:
open(file_name, "a").close()
def write_log(domain, timestamp, errmsg):
"""Write injected errors to log files."""
vm_log_path = "".join(["/home/jason/vm_grouped_logs/"])
err_log_path = "".join(["/home/jason/error_grouped_logs/"])
if not os.path.exists(vm_log_path):
os.makedirs(vm_log_path)
if not os.path.exists(err_log_path):
os.makedirs(err_log_path)
err_log_subdir = "".join(errmsg.split())
iso_timestamp = datetime.datetime.fromtimestamp(\
time.mktime(timestamp)).isoformat()
with open("".join([vm_log_path, domain]), "a", newline="")\
as vm_log_file:
logwriter = csv.writer(vm_log_file, quoting=csv.QUOTE_NONE)
logwriter.writerow([iso_timestamp, errmsg])
with open("".join([err_log_path, err_log_subdir]), "a", newline="")\
47
as errmsg_log_file:
logwriter = csv.writer(errmsg_log_file, quoting=csv.QUOTE_NONE)
logwriter.writerow([iso_timestamp, domain])
def inject_error(domain, mac_addr, inject_string):
"""Inject a fault into a virtual machine (domain)."""
# Handle disk errors separately from MCE errors
if inject_string.startswith(("ENO", "EIN")):
mac_addr_path = "".join([qemu_disk_inject_path, mac_addr, "/"])
file_path = "".join([mac_addr_path, inject_string])
if not os.path.exists(mac_addr_path):
os.makedirs(mac_addr_path)
with open(file_path, "w"):
touch(file_path)
time.sleep(0.01) # let the file sit for a fraction of 1 second
# remove the file, and hence, stop triggering the injection
try:
os.remove(file_path)
except OSError as e:
if e.errno != errno.ENOENT: # no such file or directory
raise # re-raise exception if a different error occurred
else:
try:
subprocess.check_call(["virsh", "-c", "qemu:///system",
48
"qemu-monitor-command " + domain +
" --hmp --cmd \’" + inject_string + "\’"])
except subprocess.CalledProcessError:
print("Error injecting [[", inject_string, "]] into domain",
domain)
def injection_manager(domain_list, fault_list, intensity, inject_string):
"""Manage the sojourn times and fault injections per fault type."""
for seconds in fault_list:
# Randomly select a target domain every injection step
if intensity == ’high’:
domain = random.choice(list(domain_list_high.keys()))
mac_addr = domain_list_high[domain]
elif intensity == ’normal’:
domain = random.choice(list(domain_list_normal.keys()))
mac_addr = domain_list_normal[domain]
elif intensity == ’low’:
domain = random.choice(list(domain_list_low.keys()))
mac_addr = domain_list_low[domain]
else:
print("Invalid intensity: ", intensity)
exit(3)
print("\"" + inject_string + "\" will be injected into domain "\
+ domain + " in " + str(round(seconds,2)) + " seconds.")
time.sleep(seconds)
inject_error(domain, mac_addr, inject_string)
timestamp = time.localtime()
#timestamp = time.gmtime()
49
write_log(domain, timestamp, inject_string)
"""
Spawn a thread for each fault type. At each injection interval,
each thread will randomly select a domain from the list of domains
and inject one instance of its fault type. The thread will loop
through the number of sojourn times.
"""
# Memory errors
thread1 = threading.Thread(target=injection_manager, args=(
domain_list_high, fault_mem_high_list, ’high’, mc4_inject))
thread2 = threading.Thread(target=injection_manager, args=(
domain_list_normal, fault_mem_normal_list, ’normal’, mc0_inject))
thread3 = threading.Thread(target=injection_manager, args=(
domain_list_low, fault_mem_low_list, ’low’, mc1_inject))
# Disk errors
thread6 = threading.Thread(target=injection_manager, args=(
domain_list_high, fault_disk_high_list, ’high’, ’ENOMEDIUM’))
thread7 = threading.Thread(target=injection_manager, args=(
domain_list_normal, fault_disk_normal_list, ’normal’, ’ENOMEM’))
thread8 = threading.Thread(target=injection_manager, args=(
domain_list_low, fault_disk_low_list, ’low’, ’EINVAL’))
# Threads must not be ’daemon threads’
thread1.daemon = False
thread2.daemon = False
thread3.daemon = False
thread6.daemon = False
thread7.daemon = False
thread8.daemon = False
50
# Start all threads at the same time
thread1.start()
thread2.start()
thread3.start()
thread6.start()
thread7.start()
thread8.start()
print("There are currently ", threading.active_count(),
"active threads.")
51
APPENDIX B. QEMU SOURCE CODE MODIFICATIONS
The next page begins the multi-page output of the “git diff” command that
contains our modifications to the QEMU 2.0.0 source code.
52
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 48a28ae..4359047 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -37,10 +37,14 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
#include "hw/block/block.h"
#include "sysemu/dma.h"
+#include "net/net.h"
+
#ifdef __linux
#include <scsi/sg.h>
#endif
+#include <dirent.h>
+
#define SCSI_WRITE_SAME_MAX 524288
#define SCSI_DMA_BUF_SIZE 131072
#define SCSI_MAX_INQUIRY_LEN 256
@@ -292,6 +296,159 @@ static void scsi_dma_complete(void *opaque, int ret)
53
scsi_dma_complete_noio(opaque, ret);
}
+/* Read a type of disk error to inject from the dir /tmp/qemu-disk-inject,
+ * and return it. The dir structure is
+ * /tmp/qemu-disk-inject/<MAC addr>/<err type>
+ * When a VM matches the MAC address and has an empty file named one of
+ * "ENOMEDIUM", "ENOSPC", "EINVAL", or "ENOMEM", an error will be injected
+ * until the file is removed from the file system. */
+static int get_disk_error_to_inject(void)
+{
+ int err_to_inject = 0;
+ DIR *dp, *mdp;
+ struct dirent *dresult, *mdresult;
+ char macdir_name[1024];
+ int done = 0;
+ int macdone = 0;
+ macdone = macdone + 0;
+ int rc = 0;
+ rc = rc + 0;
54
+ char messages[128];
+ strcpy(messages, "0");
+ char vm_interface[100] = "net0";
+ char *vm_int;
+ int i;
+ int match;
+ uint8_t macaddr[6];
+ int inaddr[6];
+
+ NetClientState *ncs;
+ NICState *nic_state;
+
+ vm_int = vm_interface;
+
+ // Find the running VM’s MAC address
+ ncs = qemu_find_netdev_match(vm_int);
+ if (ncs == NULL) {
+ qemu_log("NetClientState pointer NULL for interface %s\n", vm_int);
+ } else {
+ nic_state = qemu_get_nic(ncs);
55
+
+ if (nic_state != NULL) {
+ for (i = 0; i < 5; i++) {
+ macaddr[i] = nic_state->conf->macaddr.a[i];
+ }
+ }
+ }
+
+ errno = 0;
+ dp = opendir("/tmp/qemu-disk-inject");
+
+ if (dp == NULL) {
+ perror("SCSI disk fault injection error: Main control config diropen failed");
+ goto closefiles;
+ }
+
+ while (done == 0) {
+ // Iterate over MAC-named directories to find a match to current running VM
+
+ dresult = readdir(dp);
56
+
+ if ((dresult == NULL) && (errno != 0)) {
+ perror("SCSI disk fault injection error: Main control config readdir failed");
+ goto closefiles;
+ }
+
+ if ((dresult == NULL) && (errno == 0)) {
+ // No more directories.
+ done = 1;
+ break;
+ }
+ if (index(dresult->d_name, ’:’) != NULL) {
+ rc = sscanf (dresult->d_name, "%x:%x:%x:%x:%x:%x", &inaddr[0], &inaddr[1], &inaddr[2],\
+ &inaddr[3], &inaddr[4], &inaddr[5]);
+
+ // Is this line intended for us?
+ match = 1;
+ for (i = 0; i < 5; i++) {
+ if (inaddr[i] != macaddr[i]) {
+ match = 0;
57
+ }
+ }
+ if (match == 1) {
+ // Get the error to inject from the filename in this directory.
+ sprintf (macdir_name, "/tmp/qemu-disk-inject/%s", dresult->d_name);
+ mdp = opendir(macdir_name);
+
+ if (mdp == NULL) {
+ perror("SCSI disk fault injection error: MAC (node-level) config diropen failed");
+ goto closemdpfiles;
+ }
+ while (macdone == 0) {
+ mdresult = readdir(mdp);
+
+ if ((mdresult == NULL) && (errno != 0)) {
+ perror("SCSI disk fault injection error: MAC (node-level) config readdir failed");
+ goto closemdpfiles;
+ }
+
+ if ((mdresult == NULL) && (errno == 0)) {
58
+ // Empty MAC directory
+ goto closemdpfiles;
+ }
+ if (index(mdresult->d_name, ’.’) == NULL) {
+ // Scan the directory to discover the one message to be sent
+ if (strncmp (mdresult->d_name, "ENOMEDIUM", strlen("ENOMEDIUM")) == 0) {
+ err_to_inject = ENOMEDIUM;
+ /* Remove injection support for ENOSPC because it suspends the VM immediately
+ * and the error isn’t logged to syslog or the Journal.
+ *
+ * } else if (strncmp (mdresult->d_name, "ENOSPC", strlen("ENOSPC")) == 0) {
+ * err_to_inject = ENOSPC;
+ */
+ } else if (strncmp (mdresult->d_name, "EINVAL", strlen("EINVAL")) == 0) {
+ err_to_inject = EINVAL;
+ } else if (strncmp (mdresult->d_name, "ENOMEM", strlen("ENOMEM")) == 0) {
+ err_to_inject = ENOMEM;
+ } else {
+ err_to_inject = 0;
+ }
59
+
+ macdone = 1;
+ }
+ }
+ if (mdp != NULL) {
+ closedir(mdp);
+ mdp = NULL;
+ }
+ return err_to_inject;
+
+ closemdpfiles:
+ {
+ if (mdp != NULL) {
+ closedir(mdp);
+ mdp = NULL;
+ }
+ if (dp != NULL) {
+ closedir(dp);
+ dp = NULL;
+ }
60
+ return 0;
+ }
+ }
+ }
+ }
+
+closefiles:
+{
+ if (dp != NULL) {
+ closedir(dp);
+ dp = NULL;
+ }
+ return 0;
+}
+}
+
static void scsi_read_complete(void * opaque, int ret)
{
SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -330,6 +486,7 @@ static void scsi_do_read(void *opaque, int ret)
61
SCSIDiskReq *r = opaque;
SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
uint32_t n;
+ int inject_err;
if (r->req.aiocb != NULL) {
r->req.aiocb = NULL;
@@ -339,6 +496,13 @@ static void scsi_do_read(void *opaque, int ret)
goto done;
}
+ inject_err = get_disk_error_to_inject();
+ if (inject_err) {
+ // inject_err is either "0" or the value of a C errno
+ scsi_handle_rw_error(r, inject_err);
+ return;
+ }
+
if (ret < 0) {
if (scsi_handle_rw_error(r, -ret)) {
62
goto done;
@@ -450,6 +614,7 @@ static void scsi_write_complete(void * opaque, int ret)
SCSIDiskReq *r = (SCSIDiskReq *)opaque;
SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
uint32_t n;
+ int inject_err;
if (r->req.aiocb != NULL) {
r->req.aiocb = NULL;
@@ -459,6 +624,13 @@ static void scsi_write_complete(void * opaque, int ret)
goto done;
}
+ inject_err = get_disk_error_to_inject();
+ if (inject_err) {
+ // inject_err is either "0" or the value of a C errno
+ scsi_handle_rw_error(r, inject_err);
+ return;
+ }
+
63
if (ret < 0) {
if (scsi_handle_rw_error(r, -ret)) {
goto done;
diff --git a/include/net/net.h b/include/net/net.h
index 8166345..f4ee769 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -98,6 +98,7 @@ typedef struct NICState {
bool peer_deleted;
} NICState;
+NetClientState *qemu_find_netdev_match(const char *id);
NetClientState *qemu_find_netdev(const char *id);
int qemu_find_net_clients_except(const char *id, NetClientState **ncs,
NetClientOptionsKind type, int max);
diff --git a/net/net.c b/net/net.c
index e3ef1e4..bbc6fd4 100644
--- a/net/net.c
+++ b/net/net.c
@@ -623,6 +623,19 @@ NetClientState *qemu_find_netdev(const char *id)
64
return NULL;
}
+NetClientState *qemu_find_netdev_match(const char *id)
+{
+ NetClientState *nc;
+
+ QTAILQ_FOREACH(nc, &net_clients, next) {
+ if (!strcmp(nc->name, id)) {
+ return nc;
+ }
+ }
+
+ return NULL;
+}
+
int qemu_find_net_clients_except(const char *id, NetClientState **ncs,
NetClientOptionsKind type, int max)
{
LIST OF REFERENCES
65
LIST OF REFERENCES
AMD64 architecture programmer’s manual volume 2: System programming[Computer software manual]. (2011, May). Pub. 24593.
Atif, M., & Strazdins, P. (2009). Optimizing live migration of virtual machines inSMP clusters for HPC applications. In Sixth IFIP international conferenceon network and parallel computing (pp. 51–58). Gold Coast, Queensland,AUS: IEEE. doi: 10.1109/NPC.2009.32
Axboe, J. (2015). fio: flexible I/O tester. Retrieved fromhttps://github.com/axboe/fio
Bellard, F. (2014). QEMU: open-source processor emulator. Retrieved fromhttp://qemu.org
BIOS and kernel developer’s guide for AMD Athlon 64 and AMD Opteronprocessors [Computer software manual]. (2006, February). Pub. 26094.
DeBardeleben, N. A., Blanchard, S. P., Fu, S., Guan, Q., & Zhang, Z. (2011).Experimental framework for injecting logic errors in a virtual machine toprofile applications for soft error resilience (Tech. Rep.). Los Alamos, NewMexico, USA: Los Alamos National Laboratory. Retrieved fromhttp://newmexicoconsortium.org/usrc/usrc-publications(LA-UR-11-10926)
EasyFit. (2010). MathWave Technologies. Retrieved fromhttp://www.mathwave.com
Esteves, R. M., Hacker, T., & Rong, C. (2012, December). Cluster analysis for thecloud: Parallel Competitive Fitness and parallel K-means++ for largedataset analysis. In Proceedings of the 2012 IEEE 4th internationalconference on cloud computing technology and science (pp. 177–184). Taipei,Taiwan: IEEE. doi: 10.1109/CloudCom.2012.6427553
Evangelinos, C., & Hill, C. N. (2008). Cloud computing for parallel scientific HPCapplications: Feasibility of running coupled atmosphere-ocean climate modelson Amazon’s EC2. In The first workshop on cloud computing and itsapplications ’08. Retrieved from http://www.cca08.org/papers.html
Fu, S., & Xu, C.-Z. (2007, October). Quantifying temporal and spatial correlationof failure events for proactive management. In Proceedings of the 26th IEEEinternational symposium on reliable distributed systems (pp. 175–184).Beijing, China: IEEE. doi: 10.1109/SRDS.2007.18
66
Giuffrida, C., Kuijsten, A., & Tanenbaum, A. S. (2013, December). EDFI: Adependable fault injection tool for dependability benchmarking experiments.In Proceedings of the 2013 IEEE 19th pacific rim international symposium ondependable computing (pp. 31–40). Vancouver, British Columbia, Canada:IEEE.
Guan, Q., DeBardeleben, N., Blanchard, S., & Fu, S. (2014, May). F-SEFI: Afine-grained soft error fault injection tool for profiling applicationvulnerability. In Proceedings of the 2014 IEEE 28th international parallel anddistributed processing symposium (pp. 1245–1254). Phoenix, Arizona: IEEE.
Hacker, T. J., Romero, R. F., & Carothers, C. D. (2009). An analysis of clusteredfailures on large supercomputing systems. Journal of Parallel and DistributedComputing , 69 (7), 652–665. doi: 10.1016/j.jpdc.2009.03.007
Kleen, A. (2015). mcelog: the Linux hardware error daemon. Retrieved fromhttp://www.mcelog.org
Lange, J., Dinda, P., Xia, L., Bridges, P. G., Hale, K., et al. (2011). Palacios: AnOS-independent embeddable VMM. Retrieved fromhttp://www.v3vee.org/palacios/
Levy, S., Dosanjh, M. G. F., Bridges, P. G., & Ferreira, K. B. (2013, June). Usingunreliable virtual hardware to inject errors in extreme-scale systems. InProceedings of the 3rd workshop on fault-tolerance for HPC at eXtreme scale(FTXS). New York, NY: ACM.
libvirt virtualization API. (2015). Red Hat, Inc. Retrieved fromhttps://libvirt.org/
NIST. (2003, June). NIST/SEMATECH e-handbook of statistical methods. NationalInstitute of Standards and Technology. Retrieved fromhttp://www.itl.nist.gov/div898/handbook/eda/section3/eda3668.htm
Oliner, A., & Stearley, J. (2007, June). What supercomputers say: A study of fivesystem logs. In Proceedings of the 37th annual IEEE/IFIP internationalconference on dependable systems & networks (pp. 575–584). Edinburgh,Scotland: IEEE. doi: 10.1109/DSN.2007.103
Pandit, N., Kalbarczyk, Z., & Iyer, R. K. (2009, June / July). Effectiveness ofmachine checks for error diagnostics. In Proceedings of the IEEE/IFIPinternational conference on dependable systems & networks, 2009 (pp.578–583). Lisbon, Portugal: IEEE. doi: 10.1109/DSN.2009.5270290
QEMU emulator user documentation: QEMU monitor. (n.d.). QEMU Project.Retrieved from http://wiki.qemu.org/download/qemu-doc.html
Romero, R. F. (2010). Live migration of parallel applications. Unpublished master’sthesis, Purdue University, West Lafayette. Retrieved fromhttp://docs.lib.purdue.edu/techmasters/27/ (UMI No. 1489538)
Rosen Center for Advanced Computing, P. U. (n.d.-a). ITaP research computing:Overview of Carter. Retrieved fromhttps://www.rcac.purdue.edu/compute/carter/
67
Rosen Center for Advanced Computing, P. U. (n.d.-b). ITaP research computing:Overview of Coates. Retrieved fromhttps://www.rcac.purdue.edu/compute/coates/
The R project for statistical computing. (2016). Retrieved fromhttps://www.r-project.org
Salfner, F., & Tschirpke, S. (2008, December). Error log processing for accuratefailure prediction. In WASL ’08 proceedings of the first USENIX conferenceon analysis of system logs (p. 4). San Diego, CA, USA: USENIXAssociation.
Schroeder, B., Damouras, S., & Gill, P. (2010, February). Understanding latentsector errors and how to protect against them. In Proceedings of the 8thUSENIX conference on file and storage technologies (pp. 71–84). San Jose,California: USENIX Association.
SciPy-Developers. (2013). SciPy: Scientific computing tools for Python. Retrievedfrom https://www.scipy.org
Thompson, D., Jiang, D., Peterson, D., Harbaugh, T., & Chebab, M. C. (2011).EDAC - error detection and correction. Retrieved from https://www.kernel.org/doc/Documentation/edac.txt
Torvalds, L., Red Hat, Inc., et al. (2010). The Red Hat Enterprise Linux 5 kernelsource code. Retrieved from http://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/
Youseff, L., Wolski, R., Gorda, B., & Krintz, C. (2006). Evaluating the performanceimpact of Xen on MPI and process execution for HPC systems. In Secondinternational workshop on virtualization technology in distributed computing.
Zhang, Y., Squillante, M. S., Sivasubramaniam, A., & Sahoo, R. K. (2004).Performance implications of failures in large-scale cluster scheduling. LectureNotes in Computer Science, 3277 , 233–252. Retrieved fromhttp://www.springer.com/computer/lncs
Zheng, Z., Lan, Z., Park, B. H., & Geist, A. (2009). System log pre-processing toimprove failure prediction. In Proceedings of the IEEE/IFIP internationalconference on dependable systems & networks, 2009 (pp. 572–577). Lisbon,Portugal: IEEE. doi: 10.1109/DSN.2009.5270289
Zhou, W., Zhan, J., Meng, D., Xu, D., & Zhang, Z. (2010). LogMaster: Miningevent correlations in logs of large-scale cluster systems. Computing ResearchRepository. Retrieved from http://arxiv.org/abs/1003.0951
Zhou, W., Zhan, J., Meng, D., & Zhang, Z. (2010). Online event correlationsanalysis in system logs of large-scale cluster systems. Lecture Notes inComputer Science, 6289 , 262–276.