
TEL AVIV UNIVERSITY
The Iby and Aladar Fleischman Faculty of Engineering

The Zandman-Slaner School of Graduate Studies

On Fault Tolerance, Locality, and Optimality in Locally Repairable Codes

A thesis submitted toward the degree of

Master of Science in Electrical Engineering

by

Oleg Kolosov

July 2018


TEL AVIV UNIVERSITY
The Iby and Aladar Fleischman Faculty of Engineering

The Zandman-Slaner School of Graduate Studies

On Fault Tolerance, Locality, and Optimality in Locally Repairable Codes

A thesis submitted toward the degree of

Master of Science in Electrical Engineering

by

Oleg Kolosov

This research was carried out in
The School of Electrical Engineering,
Department of Electrical Engineering - Systems

This work was carried out under the supervision of

Dr. Itzhak Tamo and Dr. Gala Yadgar

July 2018


Abstract

Erasure codes are used in large-scale storage systems to allow recovery of data from a failed node. A recently developed class of erasure codes, termed locally repairable codes (LRCs), offers tradeoffs between storage overhead and repair cost. LRCs facilitate more efficient recovery scenarios by storing additional parity blocks in the system, but these additional blocks may eventually increase the number of blocks that must be reconstructed. Existing codes differ in their use of the additional parity blocks, but also in their locality semantics and in the parameters for which they are defined. As a result, existing theoretical models cannot be used to directly compare different LRCs to determine which code will offer the best recovery performance, and at what cost.

In this study, we perform the first systematic comparison of existing LRC approaches. We analyze Xorbas, Azure's LRCs, and the recently proposed Optimal-LRCs in light of two new metrics: the average degraded read cost and the normalized repair cost. We show the tradeoff between these costs and the code's fault tolerance, and that different approaches offer different choices in this tradeoff. Our experimental evaluation on a Ceph cluster deployed on Amazon EC2 further demonstrates the different effects of realistic network and storage bottlenecks on the benefit from each examined LRC approach. Despite these differences, the normalized repair cost metric can reliably identify the LRC approach that would achieve the lowest repair cost in each setup.


Table of Contents

List of Figures

1: Introduction

2: Preliminaries
2.1 Erasure Codes
2.2 Locally Repairable Codes
2.2.1 Types of LRCs
2.3 Challenges

3: Methodology
3.1 Metrics
3.2 Xorbas
3.3 Azure-LRC
3.3.1 Removing Division Constraints
3.4 Azure-LRC+1
3.5 New Optimal-LRC Construction
3.5.1 Overview
3.5.2 Technical introduction
3.5.3 The construction
3.5.4 Properties of the code C
3.5.4.1 Locality
3.5.4.2 Dimension and distance
3.5.4.3 Optimality
3.6 Evaluation parameters

4: Theoretical Analysis
4.1 Data-LRC vs. full-LRC
4.2 Optimality of Optimal-LRC
4.3 NRC vs. d
4.4 Target fault tolerance

5: System-Level Evaluation Setup
5.1 Ceph Storage System
5.1.1 System structure
5.1.2 Erasure coded pool
5.1.3 Block Reconstruction
5.2 LRC plugin
5.2.1 Constructing local groups in Ceph
5.2.2 Issues discovered in the LRC plugin
5.3 Optimal-LRC implementation
5.3.1 Implementation overview
5.3.2 Calculation of generator matrix in Matlab
5.3.3 Implementation in Ceph
5.4 Amazon EC2 deployment

6: Results
6.1 Amount of data read and transferred
6.2 Repair time
6.3 Different storage types
6.4 Foreground workloads
6.5 Multiple zones
6.5.1 Basic setup and its limitations
6.5.2 Weighted evaluation

7: Related Work
7.1 Locally Repairable Codes
7.2 Minimum Storage Regenerating Codes
7.3 System Level Optimizations
7.4 Write Performance

8: Conclusions

References

Appendix A: NRC and the degraded cost
Appendix B: Minimum distance
Appendix C: Codes with d ≥ 5

List of Figures

2.1 (10,6) Reed-Solomon
2.2 (11,6,3) Azure-LRC
2.3 (12,6,3) Optimal-LRC
2.4 Three LRCs with the same k, demonstrating different tradeoffs between locality and overhead.
3.1 (16,10,5) Xorbas. ⊗ marks a function computed by the local parities, not a real block.
3.2 (11,6,3) Azure-LRC+1
4.1 NRC and the degraded cost for all the codes in our evaluation. The repair cost of the full-LRCs is always lower than that of Azure-LRC.
4.2 d for all the codes in our evaluation.
4.3 NRC for (n, k, r) Azure-LRC and (n + 1, k, r) Azure-LRC+1 and Optimal-LRC. Adding a local parity always reduces repair cost, despite the increase in overhead.
4.4 Examples where (n, k, r) Optimal-LRC does not achieve the lowest NRC. In both cases, an alternative (n, k, r − 1) Optimal-LRC achieves a lower NRC, possibly at the cost of reducing d.
4.5 Repair-distance ratio (NRC/d). For each (n, k), different codes achieve their minimal rd-ratio (marked by the small triangle) with different values of r.
4.6 NRC of codes with d ≥ 4. Azure-LRC and Optimal-LRC are the most flexible codes, defined for all (k, n) combinations.
5.1 PGs in a pool contain objects which are distributed to OSDs
5.2 LRC definitions using layers in Ceph.
5.3 (10,6,3) Azure-LRC+1 with a local group of size 2, containing a global parity and its replication
5.4 (11,6,3) Azure-LRC+1 PG representation describing a case of a failed OSD and possible CRUSH behavior in selecting its replacement
5.5 (15,10,4) Azure-LRC
6.1 The number of average read blocks per data block repaired, compared to expected ARC and NRC.
6.2 Recovery time of LRCs normalized to Reed-Solomon with the same k and n.
6.3 Throughput of RADOS benchmark during repair with LRC in (15,10,4) and RS(15,10).

1 Introduction

In large-scale storage systems consisting of hundreds of thousands of servers, node failures are the norm rather than the exception. For this reason, redundancy is added to ensure availability of the data despite the failures. Typically, the redundancy of hot data is achieved by replication of each data object, ensuring the availability of the data as long as one replica is available. This also allows efficient reconstruction of data that was stored on a failed node from the surviving replicas.

Due to the high overhead of replication, most of the data is stored redundantly by erasure coding. With an (n, k) erasure code, the data is split into k data blocks that are used to generate n − k parity blocks. The blocks are distributed across n different nodes, so that the original data can be reconstructed as long as at least k blocks are available. The storage overhead of erasure coding is n/k—considerably lower than that of replication. However, reconstruction of one data block requires reading k surviving blocks—an overhead considerably higher than that of replication.

Storage systems distinguish between two types of node failures. Transient failures may be caused by system restarts or updates, after which the node is available again. During this time, read operations of data stored on the failed node are served as degraded reads—only the required data blocks are reconstructed from the surviving blocks. Permanent failures occur when the node malfunctions and is no longer accessible. Typically, a failure is considered permanent after 15 minutes of unavailability, which triggers full recovery of its data. Recent studies indicate that transient failures comprise 90% of failure events [36], and only the remaining 10% trigger full node recovery. Nevertheless, recovery traffic incurs significant load on the data center's servers and network—up to 180TB of data transfer between racks each day, according to a recent study on Facebook's data centers [31].

The vast majority of failures (up to 98% [31]) constitute exactly one unavailable node. Thus, several approaches have been used to design erasure codes that can withstand several concurrent failures, but optimize the recovery cost of a single node. These include preprocessing the surviving data to minimize repair network bandwidth [8, 29, 32], and reducing the amount of data read from each surviving node [9, 13, 17, 19, 30, 48]. These codes can reduce the amount of data read by up to 50%, but in many realistic settings, this reduction is no more than 25%, or is not applicable due to the required I/O granularity [25].

A different approach increases the storage overhead and utilizes the added redundancy to optimize the recovery of a single data node. An (n, k, r) locally repairable code (LRC) supports the local recovery of an unavailable block by reading at most r surviving blocks. These codes were originally designed to reduce the cost of degraded reads, and thus most of them optimize only the recovery of data blocks [14, 15]. Others further optimize the recovery of all of the parity blocks, but do so for a limited set of system parameters [36]. In a recent work [39], new codes were constructed that support local recovery of both data and parity blocks, with the same storage overhead as previously known constructions.

LRCs present an inherent tradeoff. On the one hand, they considerably reduce the amount of data that must be read for degraded reads and recovery. On the other hand, in order to store the additional parity, the system must store more blocks on each of its nodes, or allocate more nodes for the same amount of data. In the first case, more data must be reconstructed whenever a node fails, while the latter increases the probability of failure in the system. As a result, LRCs not only increase the system's storage overhead, but might also increase its overall recovery costs. Different codes offer different tradeoffs between storage overhead and recovery cost, and between recovery cost and the cost of degraded reads. Furthermore, they are defined for different (n, k, r) combinations and differ in their locality semantics. Thus, directly comparing their costs and benefits is a nontrivial task, which makes it hard to choose the optimal code and configuration for a given system.

In this study, we perform the first comprehensive analysis of the different LRC approaches. We take into account the overall cost of recovery, including data and parity blocks, as well as the maximum number of failed blocks the code can recover. Our analysis includes Xorbas [36], Azure-LRC—the LRC codes used by Microsoft Azure [15], and Optimal-LRC—a recently proposed theoretically optimal code [39]. We also define a new code, Azure-LRC+1, which is based on Azure-LRC and supports efficient recovery of all parity blocks.

We conduct a theoretical comparison between the different LRC approaches. Our analysis demonstrates the limitations of existing measures, such as locality and average repair cost. Thus, we define new metrics that model each code's overhead, full-node repair cost, degraded read cost, and fault tolerance. Our results demonstrate the tradeoff between the objectives measured by these costs, and how different codes optimize different objectives.

We follow the theoretical analysis with an evaluation of these codes in a Ceph cluster deployed in AWS EC2. Our experimental evaluation shows that we can accurately predict the amount of data required by each code for reconstructing an entire storage node. This prediction also provides a good estimate of the time required for reconstruction, for most combinations of storage type, network configuration, and foreground traffic.

2 Preliminaries

2.1 Erasure Codes

The storage overhead of an erasure code is defined as n/k. Its minimal distance, d, is defined as the smallest number of node failures that may cause data loss. In other words, there is at least one combination of d node failures from which the code will not be able to recover the data. An important class of codes, termed maximum distance separable (MDS) codes, is characterized by the relation d = n − k + 1, and provides the largest possible d for given n and k. In MDS codes, k surviving blocks are required to recover a lost block. Reed-Solomon codes [33] are the most commonly used MDS codes, owing to their parameter flexibility and efficient implementation. Reed-Solomon codes and many other erasure codes use a generator matrix for encoding the data blocks. A k × n generator matrix is a collection of vectors that form a basis for the vector space of the code; the vectors form the rows of the generator matrix, and every codeword is a linear combination of these vectors. While the generator matrix is used for encoding, the parity-check matrix does the opposite and is used for decoding.
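As a small illustration of generator-matrix encoding, the following Python sketch encodes three data bits with a toy systematic (6,3) code over GF(2); the matrix and parameters are our own example, not a code used in this thesis.

# A minimal sketch (ours, illustrative) of generator-matrix encoding over
# GF(2); real deployments use larger fields such as GF(2^8).
import numpy as np

k, n = 3, 6
# Systematic generator matrix G = [I_k | P]: the first k symbols of a
# codeword are the data blocks, the remaining n-k symbols are parities.
G = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]])

data = np.array([1, 0, 1])     # k data blocks (one bit each, for brevity)
codeword = data @ G % 2        # encoding is a vector-matrix product mod 2
print(codeword)                # -> [1 0 1 1 0 1]: data followed by parities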

2.2 Locally Repairable Codes

An (n, k, r) locally repairable code (LRC) consists of k data blocks and n − k parity blocks. The data blocks are grouped into local groups no larger than r. A local parity is computed from each local group of blocks and can be used for the recovery of any block in this group. In total, each local group of an LRC contains at most r + 1 blocks. In case of an arbitrary failure of one block in a local group, r surviving blocks are required for its recovery. A global parity is a function of all data blocks, and can thus be used to recover any lost block. Pyramid codes [14], which are based on (n, k) Reed-Solomon codes, were the first suggested family of LRCs. Another family, Azure-LRC, is a variation of Pyramid codes and is used in Windows Azure [15].

Figure 2.1: (10,6) Reed-Solomon
Figure 2.2: (11,6,3) Azure-LRC
Figure 2.3: (12,6,3) Optimal-LRC

Figure 2.1 depicts a (10,6) Reed-Solomon code, and Figure 2.2 shows the (11,6,3) Azure-LRC which results from replacing one of its global parities with two local parities. In this example, P3 was replaced with L0 and L1, which can be used in the recovery of groups (X0, X1, X2) and (X3, X4, X5), respectively. In the new code, any of the data blocks can be repaired by reading the remaining three blocks in its local group. Thus, the recovery cost of a data block is reduced by 50%. However, the overhead increases by 10%, from 10/6 to 11/6. Note also that the new code is non-MDS: it can repair any four missing blocks but not any five; therefore d = 5, but n − k + 1 = 6.

Azure-LRC successfully reduces the repair cost of data blocks and local parities, and, as a result, the degraded read cost. However, due to the allocation of blocks to nodes, when an entire node must be reconstructed, this node will include a significant number of global parities, which will require k surviving blocks for recovery. For example, in (11,6,3) Azure-LRC, which contains m = 3 global parities, an average of 3/11 = 27.2% of the blocks stored on each node will be global parities.

2.2.1 Types of LRCs

Coding theory distinguishes between two types of (n, k, r) LRCs. In codes with information-symbol locality, only the data blocks can be repaired in a local fashion by r surviving blocks, while the global parities require k blocks for recovery. We refer to these codes as data-LRCs. Pyramid and Azure-LRC are data-LRCs. In contrast, in codes with all-symbol locality, all the blocks, including the global parities, can be repaired locally from r surviving blocks. We refer to such codes as full-LRCs.

Optimal-LRC is a recently proposed full-LRC [39]. In this code, k data blocks and m global parities are divided into groups of size r, and a local parity is added to each group, allowing repair of any lost block by the r surviving blocks in its group. r does not necessarily divide m + k, but Optimal-LRC requires that n mod (r + 1) ≠ 1. Figure 2.3 shows a (12,6,3) Optimal-LRC. Each of the global parities, P0, P1, and P2, can be reconstructed from the other global parities and the local parity L2. The overhead of this code is higher than that of the (11,6,3) Azure-LRC in Figure 2.2, but its minimum distance is also higher (d = 6).

Figure 2.4: Three LRCs with the same k ((10,6,3) Azure-LRC, (11,6,2) Azure-LRC, and (12,6,3) Optimal-LRC), demonstrating different tradeoffs between locality and overhead.

Full-LRCs introduce a new point in the tradeoff between fault tolerance and performance, which previously consisted only of MDS codes and data-LRCs. Gopalan et al. [11] proved that an upper bound on the minimal distance for an (n, k, r) LRC is d ≤ n − k − ⌈k/r⌉ + 2. Codes that achieve this bound are regarded as optimal; in particular, Optimal-LRC has been shown to achieve this bound. Specifically, the minimum distance of Optimal-LRC was shown to be [39]:

d = n − k − ⌈k/r⌉ + 2, if (r + 1) | n;
d ≥ n − k − ⌈k/r⌉ + 1, if (r + 1) ∤ n and r | (k + 1).
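The bound is easy to evaluate numerically; the following Python sketch (our illustration) checks it against the examples above:

# Our illustrative check of the Gopalan et al. upper bound on the minimum
# distance of an (n, k, r) LRC.
from math import ceil

def gopalan_bound(n: int, k: int, r: int) -> int:
    # Upper bound d <= n - k - ceil(k/r) + 2 on the minimum distance.
    return n - k - ceil(k / r) + 2

# (11,6,3) Azure-LRC of Figure 2.2: 11 - 6 - 2 + 2 = 5, matching d = 5 above.
print(gopalan_bound(11, 6, 3))   # -> 5
# (12,6,3) Optimal-LRC of Figure 2.3: (r+1) | n, so the code meets the bound.
print(gopalan_bound(12, 6, 3))   # -> 6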

2.3 Challenges

Azure-LRC provides an appealing tradeoff when compared to Reed-Solomon codes: a 10% increase in storage overhead can halve the cost of all degraded reads and most block repairs. Unfortunately, the comparison between data-LRCs and full-LRCs is not similarly straightforward. Consider, for example, the three codes in Figure 2.4. The (11,6,2) Azure-LRC has three local parities, one more than the (10,6,3) Azure-LRC, which reduces its r from 3 to 2, but increases its overhead by 10%. The (12,6,3) Optimal-LRC also has three local parities. However, rather than reducing r, the additional local parity enables local repair of the global parities. Thus, r represents different locality semantics in each of these models. In addition, each model represents a different tradeoff between the cost of degraded read and the cost of full node repair, and between these costs and the overhead.

It is not entirely clear which of these codes will have the lowest repair cost. Clearly, r alone cannot serve as a metric for comparing data-LRCs to full-LRCs. The average repair cost (ARC) used in previous analyses [15] fails to capture the effect the code's overhead has on its repair cost. In the next section, we introduce three composite metrics that facilitate a systematic comparison of LRCs.

The task of comparing different codes is further complicated by the fact that existing codes are not all defined for the same range of parameters. Our new metrics alleviate this problem to some extent. To eliminate the problem completely, we adopt a somewhat 'flexible' interpretation of Azure-LRC. We also use a new construction of Optimal-LRC, which is optimal for parameters for which an explicit construction has not been given before.

Finally, theoretically proven benefits are not always achievable in real systems. The repair-cost benefit of different codes may be determined by factors such as storage and network bandwidth, the nature and priority of the foreground load, and the system-level implementation. Thus, we complement our theoretical analysis with an evaluation on a distributed cluster in Amazon EC2, where we verify our metrics and identify additional factors that should be taken into account when designing an erasure-coded storage system.

3 Methodology

3.1 Metrics

The starting point of our theoretical analysis consists of the existing measures described above: r is the maximal number of blocks required for the recovery of any block or a data block, in full-LRCs and data-LRCs, respectively. The overhead of the code is n/k, and its minimal distance is d. We use d to represent the code's fault tolerance, despite its inherent limitation—two codes with the same d may be considered equally fault tolerant, although one may prevent data loss in more combinations of correlated failures than the other [15]. The mean time to data loss (MTTDL) [12] is considered a more accurate measure for fault tolerance. However, to calculate the MTTDL of a code, one must construct a Markov chain for every specific set of n, k, r parameters. In addition, this model does not always yield an analytic closed-form equation. Thus, d is more appropriate for our large-scale analysis. For a limited comparison of a small set of constructions, d can be replaced with MTTDL.

The average repair cost (ARC) has been used in previous studies [15], and is based on the assumption that the probability of repair due to failure is the same for all blocks. It is defined as

ARC = (∑_{i=1}^{n} cost(b_i)) / n,

where b_i is the i-th block in the code, and cost(b_i) is the number of blocks required for the repair of b_i. For example, the ARC of the (10,6,3) Azure-LRC in Figure 2.4 is ((8 × 3) + (2 × 6))/10 = 3.6. Similarly, in the same figure, the ARC of the (11,6,2) Azure-LRC is 2.73 and that of the (12,6,3) Optimal-LRC is 3.

ARC does not take into account the higher overhead of some of these codes, which implies that more blocks will have to be repaired in the event of a node failure. We address this by defining a new composite metric. The normalized repair cost (NRC) of a code is the product of its ARC and overhead:

NRC = ARC × n/k = (∑_{i=1}^{n} cost(b_i)) / k.

NRC can also be viewed as the average cost of repairing a lost data block, where the cost of repairing the parity blocks is amortized over the k data blocks. For example, the NRC of the (10,6,3) Azure-LRC in Figure 2.4 is ((8 × 3) + (2 × 6))/6 = 6. Similarly, the NRC of the (11,6,2) Azure-LRC is 5 and that of the (12,6,3) Optimal-LRC is 6.

ARC is also inappropriate for modeling the cost of degraded reads. By definition, degraded reads refer to data blocks only, while ARC averages the repair cost of all blocks—data and parity alike. We define the average degraded read cost ('degraded cost', in short) as the average cost of repairing data blocks only:

Degraded cost = (∑_{i=1}^{k} cost(b_i)) / k,

where blocks b_1, ..., b_k are the object's data blocks. For example, the degraded cost of the (10,6,3) Azure-LRC and the (12,6,3) Optimal-LRC in Figure 2.4 is (6 × 3)/6 = 3. Similarly, the degraded cost of the (11,6,2) Azure-LRC is 2. Note that in the general case, the degraded cost is not always equal to r.
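All three metrics can be computed directly from per-block repair costs. The following Python sketch (our illustration; the cost list encodes the (10,6,3) Azure-LRC of Figure 2.4) reproduces the examples above:

# Our illustrative computation of ARC, NRC, and the degraded cost from a
# per-block repair-cost list; the data blocks come first, so costs[:k]
# holds the data-block repair costs.

def arc(costs, n):
    return sum(costs) / n

def nrc(costs, k):
    return sum(costs) / k

def degraded_cost(costs, k):
    return sum(costs[:k]) / k

# (10,6,3) Azure-LRC: 6 data blocks and 2 local parities cost 3 blocks each,
# and 2 global parities cost k = 6 blocks each.
costs = [3] * 6 + [3] * 2 + [6] * 2
print(arc(costs, 10))            # -> 3.6
print(nrc(costs, 6))             # -> 6.0
print(degraded_cost(costs, 6))   # -> 3.0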

We base our analysis on three existing LRCs: Xorbas, Azure-LRC, and Optimal-LRC. We use Reed-Solomon as a baseline for some of our comparisons. Below, we describe how we extended the definitions of these codes for our analysis and evaluation.

3.2 Xorbas

Xorbas [36] is a full-LRC, in which the global parities can be recovered from the local parities. Figure 3.1 shows a (16,10,5) Xorbas code. Each local parity belongs to a group containing five data blocks. The special construction of Xorbas ensures that any of the global parities can be reconstructed by the remaining global parities and the two local parities. Thus, r = 5 for all the blocks in the code. This special property can be maintained if we remove the same number of blocks from each group. For example, a (13,8,4) Xorbas code can be obtained by removing two data blocks and one global parity from the original construction. The number of global parities can be further reduced to achieve a lower overhead, without reducing r. For example, a (12,8,4) Xorbas code has the same r as the (13,8,4) code, but a smaller d.

Figure 3.1: (16,10,5) Xorbas. ⊗ marks a function computed by the local parities, not a real block.

The (16,10,5) code is the only configuration shown in the original paper [36], and the additional constructions we described are merely a simple extension of it. However, it is not clear how to generalize it to a much larger set of parameters (n, k, r).

3.3 Azure-LRC

3.3.1 Removing Division Constraints

We use Azure-LRC as the data-LRC in our evaluation. It is explicitly defined in its original paper only for (n, k, r) where r divides k [15]. For the sake of analysis, we extend this code to the general case as follows. For an (n, k, r) LRC, the number of local parities is l = ⌈k/r⌉. The first l − 1 groups each contain r data blocks and one local parity. The remaining group contains k mod r data blocks and one local parity. For the code to have at least one global parity, we must only ensure that k + l < n. Although this extension results in an asymmetric allocation of data blocks to groups, it allows us to consider Azure-LRC in most (n, k, r) combinations.
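The resulting group structure is easy to compute; the following Python sketch (ours, illustrating the extension just described) returns the data-group sizes and the number of global parities:

# Our illustration of the generalized Azure-LRC grouping: l = ceil(k/r)
# local groups, where the last group holds k mod r data blocks if r does
# not divide k (and r otherwise).
from math import ceil

def azure_lrc_groups(n: int, k: int, r: int):
    l = ceil(k / r)                    # number of local parities
    assert k + l < n, "need at least one global parity"
    sizes = [r] * (l - 1) + [k - r * (l - 1)]
    return sizes, n - k - l            # data-group sizes, global parities

# (11,6,4): two groups of 4 and 2 data blocks, and 3 global parities.
print(azure_lrc_groups(11, 6, 4))      # -> ([4, 2], 3)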

3.4 Azure-LRC+1

For the sake of analysis, we define a new full-LRC which is based on Azure-LRC. An (n, k, r) Azure-LRC+1 code is constructed by adding one local parity to the group of global parities of an (n − 1, k, r) Azure-LRC. This local parity is computed as an XOR of all the global parities. Figure 3.2 shows an (11,6,3) Azure-LRC+1 constructed from the (10,6,3) Azure-LRC in Figure 2.4. When one global parity block is missing, it can be repaired from the remaining global parities and this additional local parity. Thus, Azure-LRC+1 will have l + 1 local parities, l = ⌈k/r⌉, and can be constructed as long as k + l + 1 < n. This naïve definition implies that an Azure-LRC+1 construction may result in a local parity added to a 'group' of one global parity. Nevertheless, it has the added value of being directly and easily applicable to any system that uses Azure-LRC or Pyramid codes, and is thus an important aspect of our analysis.

Figure 3.2: (11,6,3) Azure-LRC+1
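Continuing in the same spirit, the following sketch (ours) derives the Azure-LRC+1 layout from the underlying (n − 1, k, r) Azure-LRC:

# Our illustration: an (n, k, r) Azure-LRC+1 is an (n-1, k, r) Azure-LRC
# plus one local parity computed as the XOR of all its global parities.
from math import ceil

def azure_lrc_plus1(n: int, k: int, r: int):
    l = ceil(k / r)                    # local parities of the base code
    assert k + l + 1 < n, "constructible only if k + l + 1 < n"
    g = (n - 1) - k - l                # global parities of the base code
    return l, g, 1                     # data-group parities, globals, extra parity

print(azure_lrc_plus1(11, 6, 3))       # -> (2, 2, 1), as in Figure 3.2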

3.5 New Optimal-LRC Construction

3.5.1 Overview

The original Optimal-LRC construction [39] was shown to be optimal for the cases described in Section 2. However, the extension of this construction to all admissible parameters results in a code with lower d, which is suboptimal. The minimum distance of Optimal-LRC, shown in Section 2.2.1, is proven to be optimal when (r + 1) | n. Another case shown in Section 2.2.1 is when (r + 1) ∤ n and r | (k + 1). In this case the minimum distance is at most 1 less than the optimal d, but can be equal to the optimal value. The remaining cases were not covered in [39], but our calculations have shown that d might be 2 less than the optimal value for certain configurations. To address this issue, we have devised a new construction of LRC codes in the spirit of the original construction.

The advantage of the new construction is that it applies to all parameters n, k, r such that n mod (r + 1) ≠ 1, in contrast to the previous construction, which comprised two different instances depending on some divisibility constraints. Furthermore, it can be shown that our new construction attains the largest possible minimum distance even when the upper bound n − k − ⌈k/r⌉ + 2 is not attainable. In summary, the new construction is the first optimal construction for all admissible parameters.

Informally, the construction follows the same steps as in [39], where a good polynomial (see [39] for more details) forms the main ingredient for the code construction. However, the encoding polynomial used to encode the message vector has a different structure, which sometimes leads to a polynomial with a larger degree than the degree of the encoding polynomial in [39].

3.5.2 Technical introduction


The codes in [39] are constructed as certain subcodes of Reed-Solomon (RS) codes. Namely, for a given n the code is constructed as a subcode of the RS code of length n and dimension k + ⌈k/r⌉ − 1. While the "parent" RS code is obtained by evaluating all the polynomials of degree ≤ k + ⌈k/r⌉ − 2, the LRC codes in [39] are isolated by evaluating the subset of polynomials of the form

f_a(x) = ∑_{i=0}^{r−1} ∑_{j=0}^{⌈k/r⌉−1} a_{ij} g(x)^j x^i,

where deg(f_a) ≤ k + ⌈k/r⌉ − 2 and where g(x) is a polynomial constant on each of the repair groups A_i.

As pointed out in [39], it is possible to lift the condition (r + 1) | n, obtaining LRC codes whose distance is at most one less than the right-hand side of

d_min(C) ≤ n − k − ⌈k/r⌉ + 2.    (3.1)

At the same time, [39] did not give a concrete construction of such codes, and did not resolve the question of optimality. In this construction we point out a way to lift the divisibility assumption, constructing optimal LRC codes for almost all parameters.

Our results can be summarized as follows.

Theorem 1. Suppose that the following assumptions on the parameters are satisfied:

(1) Let s := n mod (r + 1) and suppose that s ≠ 1;

(2) Let m = ⌈n/(r + 1)⌉, and assume that m(r + 1) ≤ q.

Then there exists an explicitly constructible (n, k, r) LRC code C whose distance is the largest possible for its parameters n, k, and r.

Remark: After this construction was completed, we became aware that most of its results are implied by an earlier work by A. Zeh and E. Yaakobi [49]. Specifically, we prove a bound on the distance of LRC codes of length n given in Theorem 2, which is sometimes stronger than the bound (3.1). We also construct a family of LRC codes obtained as shortenings of the codes in [39] and use the bounds (3.1), (3.11) to show that they have the largest possible minimum distance for their parameters. It turns out that our strengthened bound is a particular case of [49, Thm.6], and that the fact that shortening optimal LRC codes preserves optimality is shown in [49, Thm.13]. This implies that the codes in [39] can be shortened without sacrificing the optimality property.

In this section we give an explicit algebraic construction of the shortened codes from [39], which is not directly implied by [49]. We believe that the construction of the codes presents some interest. We also give an independent, self-contained proof of the needed particular case of the bound on their distance.

3.5.3 The construction

Let F = F_q, let n be the code length, and let r be the target locality parameter. As stated above, we assume that s ≠ 1 (the case s = 0 accounts for the original construction in [39] and is included below). Let t := r + 1 − s.

Let Ā ⊂ F be a subset of size m(r + 1). Suppose that Ā is partitioned into disjoint subsets of size r + 1:

Ā = ∪_{i=1}^{m} A_i.    (3.2)

The set A of n coordinates of the code C is formed of arbitrary m − 1 blocks in this partition, say A_1, ..., A_{m−1}, and an arbitrary subset of the block A_m of size s (our construction includes the case of (r + 1) | n in [39], in which case this subset is empty). Denote by B the subset of A_m that is not included in A, so that

A = ∪_{i=1}^{m−1} A_i ∪ (A_m \ B).

Let g(x) ∈ F[x] be a polynomial of degree r + 1 that is constant on each of the blocks of the partition (3.2), and let γ be the value of g(x) on the points in the set A_m. Without loss of generality we will assume that γ = 0 (if not, we can take the polynomial g(x) − γ as the new polynomial g(x)). A way to construct such polynomials relies on the structure of subgroups of F and was presented in [39] (see also [22]).

The codewords of C are formed as evaluations of specially constructed polynomials f(x) on the set of points A. To define the polynomials, let k′ := k + t and define the quantity

S_{k′,r}(i) = ⌊k′/r⌋ if i < k′ mod r, and ⌊k′/r⌋ − 1 if i ≥ k′ mod r, for i = 0, ..., r − 1.

Next let a ∈ F^k be a data vector. Write a as a concatenation of two vectors:

a = ((a_{ij}, i = 0, ..., r − 1, j = 1, ..., S_{k′,r}(i)), (b_m, m = 0, ..., s − 2)).    (3.3)

The total number of entries in the vectors in (3.3) equals

(k′ mod r)⌊k′/r⌋ + (r − (k′ mod r))(⌊k′/r⌋ − 1) + (s − 1) = ⌊k′/r⌋r − r + (k′ mod r) + r − t = k′ − t = k,

so (3.3) is a valid representation of the k-dimensional vector a.
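A quick numeric check of this counting identity (our illustration; the parameter choices are arbitrary, with s ≥ 2):

# Our illustrative check that the number of entries in (3.3) equals k:
# the sum of S_{k',r}(i) over i = 0..r-1, plus the s-1 entries of b.
def entries(n: int, k: int, r: int) -> int:
    s = n % (r + 1)
    t = r + 1 - s
    kp = k + t                                    # k' = k + t
    S = [kp // r if i < kp % r else kp // r - 1 for i in range(r)]
    return sum(S) + (s - 1)

for (n, k, r) in [(14, 9, 4), (16, 10, 6), (12, 6, 4)]:
    assert entries(n, k, r) == k                  # the identity k' - t = k
print("counting identity holds for all three examples")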

Given a, let us construct the polynomial

f_a(x) = ∑_{i=0}^{r−1} f_i(x) x^i + h_B(x) ∑_{m=0}^{r−t−1} b_m x^m,    (3.4)

where h_B(x) = ∏_{β∈B} (x − β), deg(h_B) = t, is the annihilator polynomial of B, and

f_i(x) := ∑_{j=1}^{S_{k′,r}(i)} a_{ij} g(x)^j.

Let {α_1, α_2, ..., α_n} be the set of elements of F that corresponds to the indices in the set A. Define the evaluation map

ev: a ↦ c_a := (f_a(α_i), i = 1, ..., n).    (3.5)

Varying a ∈ F^k, we obtain a linear (n, k) code C. We summarize the construction as follows.

CONSTRUCTION 1. For given n, k, r the LRC code C of length n and dimension k is the image of the linear map ev : F^k → F^n defined in (3.5).

We note that the code C forms a shortening of the code in [39] by the coordinates in B, so overall C is a shortened subcode of the RS code of length n.

3.5.4 Properties of the code C

3.5.4.1 Locality

Let us show that the code C has locality r. Let i be the erased coordinate. If i ∈ A_j, j = 1, ..., m − 1, then consider the restriction (f_a)|_{A_j} of the polynomial f_a to the set A_j, |A_j| = r + 1. From (3.4), deg (f_a)|_{A_j} ≤ r − 1, so on the set A_j it can be interpolated from its r values. Once the polynomial (f_a)|_{A_j} is found, we compute the value c_i = (f_a)|_{A_j}(α_i), completing the repair task.

Now suppose that i ∈ A_m \ B. We note that the restricted polynomial (f_a)|_{A_m} has degree at most r − 1 and satisfies f_a(α) = (f_a)|_{A_m}(α) = 0 for any α ∈ B. Now note that |A_m \ B| = s, and that s − 1 out of these coordinates are known. Together with the zero values at the points of B, this gives s − 1 + t = r known values, implying that it is possible to find the restricted polynomial (f_a)|_{A_m}. Once this polynomial is computed, evaluating it at the point α_i again gives back the value of the missing coordinate.
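In both cases the repair step is ordinary polynomial interpolation. The following Python sketch (our toy illustration over a small prime field, not the construction's actual field or polynomials) recovers a lost evaluation of a polynomial of degree at most r − 1 from r surviving evaluations:

# Our toy illustration of local repair by Lagrange interpolation over the
# prime field GF(q), q = 929: a lost evaluation of a polynomial of degree
# <= r-1 is recovered from r surviving evaluations.
q = 929

def lagrange_eval(points, x):
    # Evaluate at x the unique polynomial of degree < len(points)
    # passing through the given (xi, yi) points, modulo q.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % q
                den = den * (xi - xj) % q
        total = (total + yi * num * pow(den, q - 2, q)) % q   # Fermat inverse
    return total

f = lambda x: (7 * x * x + 3 * x + 5) % q   # degree r-1 = 2, so r = 3
group = [(a, f(a)) for a in (1, 2, 3, 4)]   # a local group of r+1 = 4 blocks
lost = group.pop(0)                         # erase one block
assert lagrange_eval(group, lost[0]) == lost[1]   # repaired from r blocks
print("repaired value:", lagrange_eval(group, lost[0]))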

3.5.4.2 Dimension and distance

Lemma 1. [39, Thm. 2.1] Let C be an (n, k, r) LRC code; then

k ≤ n − ⌈n/(r + 1)⌉.    (3.6)

The results about the parameters of the code C are summarized in the following proposition.

Proposition 1. Let C be an LRC code with locality r given by Construction 1. Then dim(C) = k and

d ≥ n − k − ⌈(k + t)/r⌉ + 2.    (3.7)


Proof. We begin with bounding the degree of the polynomials f_a(x) in (3.4). Suppose that r ∤ k′; then the maximum degree is

deg(f_a) = ⌊k′/r⌋(r + 1) + (k′ mod r) − 1 = k′ + ⌊k′/r⌋ − 1    (3.8)
         = k′ + ⌈k′/r⌉ − 2.    (3.9)

Now consider the case r | k′, namely,

deg(f_a) ≤ (⌊k′/r⌋ − 1)(r + 1) + (r − 1) = k′ + ⌈k′/r⌉ − 2.    (3.10)

To prove that dim(C) = k it suffices to show that the image of a nonzero a ∈ F^k under the map (3.5) is nonzero. We will prove an even stronger fact, namely that wt(c_a) ≥ 2 for any a ≠ 0. We know that f_a(x) has t of its zeros in B, so the number of zeros in the set ∪_{i=1}^{m−1} A_i ∪ (A_m \ B) is at most

deg(f_a) − t ≤ k′ + ⌈k′/r⌉ − 2 − t.

Noting that ⌈n/(r + 1)⌉ = (n + t)/(r + 1) and using (3.6), we obtain

n − k − ⌈k′/r⌉ ≥ ⌈n/(r + 1)⌉ − ⌈(k + t)/r⌉
             ≥ ⌈n/(r + 1)⌉ − ⌈(n − ⌈n/(r + 1)⌉ + t)/r⌉
             = ⌈n/(r + 1)⌉ − ⌈(n − (n + t)/(r + 1) + t)/r⌉
             = ⌈n/(r + 1)⌉ − ⌈(n + t)/(r + 1)⌉ = 0.

Thus the number of nonzero values of f_a(x) within the support of the codeword is at least two.¹ Hence the mapping (3.5) is injective on F^k, which proves that dim(C) = k. The weight of a nonzero vector satisfies wt(c_a) ≥ n − deg(f_a), which together with (3.9)-(3.10) proves inequality (3.7) for the distance of C.

¹ The distance of C is in fact greater than 2, as is shown in the second part of this lemma. The reason that we obtain 2 here is that we rely on a universal bound (3.6) for the rate of the code C, which is valid for all parameters.

3.5.4.3 Optimality

Finally, let us prove that the constructed codes are distance-optimal. The following upper bound on the distance of LRC codes tightens the bound (3.1) in some cases.

Theorem 2. Let C be an (n, k, r) LRC such that s := n mod (r + 1) ≠ 0, 1. Suppose that C has m := ⌈n/(r + 1)⌉ disjoint repair groups A_i such that |A_i| = r + 1 for i = 1, ..., m − 1, and |A_m| = s. If either r | k, or r ∤ k and k mod r ≥ s, then

d_min(C) ≤ n − k − ⌈k/r⌉ + 1.    (3.11)

Proof. The minimum distance of a q-ary (n, k) code (with or without locality) equals

d = n − max_{I ⊆ [n]} { |I| : |C_I| < q^k }.

Let A′_i ⊂ A_i be an arbitrary subset of size |A_i| − 1 and let

I′ = A′_1 ∪ · · · ∪ A′_{⌊(k−1)/r⌋} ∪ A′_m.

Note that

|I′| = r⌊(k − 1)/r⌋ + s − 1.    (3.12)

If r | k then, since s ≤ r, (3.12) becomes

k − r + s − 1 ≤ k − 1.

Similarly, if r ∤ k and k mod r ≥ s, then (3.12) becomes

k − 1 − ((k mod r) − 1) + s − 1 ≤ k − 1.

Thus in either case |I′| ≤ k − 1.


If |I′| < k − 1, let us add to it arbitrary k − 1 − |I′| coordinates which are not in the set

A_1 ∪ · · · ∪ A_{⌊(k−1)/r⌋} ∪ A_m,

again calling the resulting subset I′. By construction, |C_{I′}| ≤ q^{k−1}.

Now consider a larger subset of coordinates that is formed of the complete repair groups and the set I′,

I = I′ ∪ A_1 ∪ · · · ∪ A_{⌊(k−1)/r⌋} ∪ A_m.

Because of the locality property, the coordinates in I depend on the coordinates in I′, and so |C_I| = |C_{I′}|. Finally, |I| = |I′| + ⌊(k − 1)/r⌋ + 1 = k + ⌈k/r⌉ − 1. Therefore the minimum distance is at most d ≤ n − |I|, giving (3.11).

Proposition 2. The codes given by Construction 1 have the largest possible value of the distance for their parameters.

Proof. The distance of the code C is given in (3.7). The difference between n − k − ⌈k/r⌉ + 2 in (3.1) and the right-hand side of (3.7) equals

∆ := ⌈(k + t)/r⌉ − ⌈k/r⌉ ≤ 1

(since t ≤ r − 1).

If ∆ = 0, then the code C is optimal by (3.1).

Let us prove that C is optimal even when ∆ = 1. Let us find the parameters for which this holds true. Let k = ru + v, where 0 ≤ v ≤ r − 1. If v = 0 (i.e., r | k), then clearly ∆ = 1. Otherwise, suppose that v ≥ 1 and compute

⌈(k + t)/r⌉ − ⌈k/r⌉ = ⌈u + 1 + (v + 1 − s)/r⌉ − (u + 1),

which equals 1 if and only if v + 1 − s > 0, i.e., if and only if v = k mod r ≥ s.

In summary, ∆ = 1 if and only if either r | k, or r ∤ k and k mod r ≥ s. However, in both these cases, according to (3.11), the maximum possible distance is one smaller than the bound (3.1). This again establishes the optimality of the code C.

Remark: In conclusion we note that the code family in [39] affords an easy extension to the case when each repair group is resilient to more than one erasure (i.e., it supports a code with distance ρ > 2). The construction in [39] assumes that (r + ρ − 1) | n. Using the ideas in the previous section, specifically, polynomials of the form (3.4), it is easy to lift this assumption, obtaining codes that support local correction of multiple erasures for any length n ≤ q such that ⌈n/(r + ρ − 1)⌉(r + ρ − 1) ≤ q.

3.6 Evaluation parameters

We computed the ARC, NRC, degraded cost, overhead, and minimal distance of each of the codes described above, for all (n, k, r) combinations for which they are defined, where 9 ≤ n ≤ 19 and n/k ≤ 2. These combinations include specific sets of parameters that appear in the literature and in documented deployments: (18,12,3) Azure-LRC [15], (16,10,5) Xorbas, (14,10) Reed-Solomon [31], and (9,6) Reed-Solomon [47]. For clarity of presentation, we show only results for 12 ≤ n ≤ 18 and n/k ≤ 1.6, which include the more common combinations (the full graph is included in the appendix). This range of parameters suffices for demonstrating our observations, which we verified on the complete range.

4 Theoretical Analysis

In this section, we describe our observations from comparing the NRC and d of the different codes in all the parameter combinations. We begin with an overview of all the codes and parameters, and then zoom in on subsets of these results in order to focus on more subtle aspects of our comparison.

Figure 4.1 shows NRC and the degraded read cost of the different codes. For the same n, k, and r, the degraded cost is usually the same for all codes. It is different when r does not divide k, where the codes differ in their allocation of blocks to groups. As we expected, for the same n and k, increasing r increases each code's degraded read cost and NRC. However, when comparing different codes, neither r nor the degraded read cost can indicate which code will have the lowest full-node repair cost. For example, the NRC of (14,10,6) Azure-LRC+1 is lower than that of (14,10,5) Azure-LRC, although its degraded cost is higher.

Figure 4.2 shows the minimum distance, d, of the different codes. Figures 4.1 and 4.2 together demonstrate a clear tradeoff between repair costs and fault tolerance. In general, for given n, k, and r, to increase d one must either increase n or increase r, thus increasing both the degraded cost and NRC. Nevertheless, different codes offer different points in this tradeoff.

4.1 Data-LRC vs. full-LRC

For the same (n, k, r), there is always one full-LRC with a lower NRC than that of Azure-LRC. However, in most settings, the reduction in NRC is coupled with a reduction in d. In the settings in which it is defined, Xorbas achieves the same d but a higher NRC than Azure-LRC. However, its NRC is not the lowest. Optimal-LRC and Azure-LRC+1 achieve the same d and NRC in many settings. In the settings where the NRC of Azure-LRC+1 is lower than that of Optimal-LRC, its d is also lower (except for a few corner cases discussed below).

Figure 4.1: NRC and the degraded cost for all the codes in our evaluation. The repair cost of the full-LRCs is always lower than that of Azure-LRC.

Figure 4.2: d for all the codes in our evaluation.

In Figure 4.3, we compare the NRC of (n, k, r) Azure-LRC to that of the (n + 1, k, r) full-LRCs with the same d. The full-LRCs use an additional local parity to allow fast repair of the global parities. This addition always reduces the repair cost, despite the increase in storage overhead.

4.2 Optimality of Optimal-LRC

Despite its optimal properties, our analysis reveals that for a given (n, k, r), Optimal-LRC does not always achieve the lowest NRC. Optimal-LRC is designed to accommodate the global parities with the data blocks in the same group. However, when the number of global parities is much smaller than r, this results in increasing the size of one of the groups, thus increasing the NRC. For example, Figure 4.4(a) shows a (12,8,5) Azure-LRC whose NRC is lower than that of (12,8,5) Optimal-LRC, and Figure 4.4(b) shows a (16,10,6) Azure-LRC+1 whose NRC is lower than that of (16,10,6) Optimal-LRC. In both cases, Optimal-LRC can achieve a lower NRC with a smaller r, possibly at the cost of reducing d.

Figure 4.3: NRC for (n, k, r) Azure-LRC and (n + 1, k, r) Azure-LRC+1 and Optimal-LRC. Adding a local parity always reduces repair cost, despite the increase in overhead.

4.3 NRC vs. d

Our results demonstrate a subtle tradeoff between repair cost (NRC) and d. Codes with the same (n, k, r) may or may not have the same d, and are thus not directly comparable: one may satisfy fault tolerance requirements that the other does not. To facilitate a more systematic comparison, we defined another composite metric, the repair-distance ratio (rd-ratio), NRC/d. This can be viewed as a measure of the efficiency with which a code allocates its local parities, with the conflicting objectives of maximizing d and minimizing NRC.

Figure 4.5 shows the rd-ratio of all LRCs. It shows that the code with the lowest rd-ratio is different for different (n, k, r) combinations, and is not necessarily a full-LRC. For example, when (n, k, r) is (14,10,5), Azure-LRC has the lowest rd-ratio. Another interesting observation is that when fixing n and k, different codes achieve their minimal rd-ratio with different values of r. For example, the rd-ratio of (17,12,5) Optimal-LRC is lower than that of (17,12,4) Optimal-LRC. When we fix (n, k) and consider the "best" r for each code, we observe that Optimal-LRC achieves the lowest rd-ratio. This demonstrates that this code is optimal in its allocation of local parity blocks—it efficiently reduces the repair cost with minimal reduction in d. The rd-ratio can be generalized to reflect different weights of NRC and d, e.g., by defining it as NRC/d^x.
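Computing the rd-ratio is a one-line extension of the metric sketches of Section 3.1 (our illustration; d = 4 is assumed for the (10,6,3) Azure-LRC, its value under the Gopalan bound):

# Our illustration of the rd-ratio (NRC/d), reusing the cost-list convention
# of the Section 3.1 sketch; lower values indicate a more efficient
# allocation of local parities.
def rd_ratio(costs, k, d):
    return (sum(costs) / k) / d            # NRC divided by d

costs = [3] * 8 + [6] * 2                  # (10,6,3) Azure-LRC block costs
print(rd_ratio(costs, 6, 4))               # -> 1.5, i.e., NRC = 6 with d = 4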

Figure 4.4: Examples where (n, k, r) Optimal-LRC does not achieve the lowest NRC. In both cases, an alternative (n, k, r − 1) Optimal-LRC achieves a lower NRC, possibly at the cost of reducing d.
(a) (12,8,5) Azure-LRC [d=4, NRC=7.25] vs. (12,8,5) Optimal-LRC [d=4, NRC=7.5] and the alternative (12,8,4) Optimal-LRC [d=3, NRC=5.25]: (12,8,5) Optimal-LRC has a higher NRC than that of Azure-LRC; reducing r from 5 to 4 improves its NRC from 7.5 to 5.25.
(b) (16,10,6) Azure-LRC+1 [d=6, NRC=7.4] vs. (16,10,6) Optimal-LRC [d=5, NRC=8.6] and the alternative (16,10,5) Optimal-LRC [d=5, NRC=7.2]: (16,10,6) Optimal-LRC has a higher NRC than that of Azure-LRC+1; reducing r from 6 to 5 improves its NRC from 8.6 to 7.2.

4.4 Target fault tolerance

The required fault tolerance in a distributed storage system is determined by many factors, including the number of nodes, their organization into racks and clusters, and the anticipated causes of failure. Nevertheless, once the required level of fault tolerance is determined, the goal is to select a code which will provide this level at the lowest cost. In this context, the code with the lowest rd-ratio may not be the optimal choice—a different code may have a higher ratio but provide the required level of fault tolerance at a lower overhead or repair cost.

We defined a threshold value of d, dth, and compared the NRC of all the codes for which d ≥ dth.

Figure 4.5: Repair-distance ratio (NRC/d). For each (n, k), different codes achieve their minimal rd-ratio (marked by the small triangle) with different values of r.

Figure 4.6: NRC of codes with d ≥ 4. Azure-LRC and Optimal-LRC are the most flexible codes, defined for all (k, n) combinations.

We considered dth ∈ {3, 4, 5}, corresponding to the minimum distance of commonly deployed configurations. Figure 4.6 shows the NRC of all the LRCs whose d ≥ 4. Many constructions do not provide the required fault tolerance, and are thus absent from this figure. Different codes achieved the lowest NRC for different (k, n) combinations. However, we note that a construction of Azure-LRC and Optimal-LRC with the required d was defined for every (k, n) combination. This demonstrates the flexibility of both codes. We observed similar results when setting the threshold dth to 3 or 5 (for dth = 3 this includes all examined configurations; both graphs are included in the appendix), where increasing the threshold removed more codes from the comparison, and vice versa.

Our theoretical evaluation results demonstrate the challenges in comparing different LRC codes and approaches, and the limitations of existing evaluation metrics. Our new metrics, NRC, degraded cost, and rd-ratio, provide a framework for directly comparing all codes in all parameter combinations. Our comparison demonstrates the benefit of full-LRCs, the flexibility of Optimal-LRC, and the realistic settings in which they may reduce the amount of data read, and thus the system repair cost. In the next section, we validate these results on a real distributed system prototype, and extend our notion of 'repair cost' to additional performance measures.


5 System-Level Evaluation Setup

The goal of our system-level evaluation was threefold: to validate the accuracy of NRC when predicting the amount of data read for node reconstruction, to evaluate its ability to estimate repair time and bandwidth, and to compare the recovery efficiency of the different LRCs in a real system. We omitted the minimum distance, d, from this part of our analysis, because it could not be measured empirically. We focused on four representative (n, k) combinations, and compared Reed-Solomon, Azure-LRC, Azure-LRC+1, and Optimal-LRC in these setups. We excluded Xorbas from this part of our analysis due to design limitations described below.

5.1 Ceph Storage System

5.1.1 System structure

We performed our evaluation in Ceph, a distributed open-source storage system [44]. Ceph's object storage service, RADOS [46], is responsible for object placement, failure detection, and failure recovery. Ceph's storage nodes are called object storage devices (OSDs). OSDs have a logical affiliation to pools. Pools are the storage units visible to the client, and serve as the client's interface for storing and reading data.

In Ceph, nodes communicate with one another directly, without a centralized gateway; thus, Ceph is fully scalable. A Ceph cluster is composed of at least one monitor (MON) and metadata server (MDS), and tens to thousands of object storage devices (OSDs). The MON keeps a master copy of a collection of "maps" (known as the cluster map1): the Monitor map, OSD map, PG map, CRUSH map, and MDS map. The CRUSH map, for example, contains a list of storage devices, the hierarchical relationships between them, and a list of rules telling CRUSH how to allocate data in pools.

1 http://docs.ceph.com/docs/master/architecture

26

Page 37: TEL AVIV UNIVERSITYprimage.tau.ac.il/.../theses/exeng/free/9932978264204146.pdfrecovery of all of the parity blocks, but do so for a limited set of system parameters [36]. In a recent

[Figure: Pool A contains PGs; the objects in each PG are distributed to OSD.0-OSD.3]

Figure 5.1: PGs in a pool contain objects, which are distributed to OSDs.

Each of these maps is required for the synchronization of the storage cluster between the storage devices. Periodically, each OSD checks its own state and the state of other OSDs, and reports back to the MON. It is possible to use more than one MON; using multiple MONs ensures higher storage availability. The MDS stores metadata for the Ceph file system. It allows clients to perform metadata operations, such as search or rename, without placing a burden on the storage cluster.

Objects in Ceph are assigned to placement groups, which define the allocation of blocks to OSDs. Placement groups make up the pools. Ceph harnesses the multiple placement groups in each pool for load balancing and parallel I/O. The number of placement groups residing in a pool is defined when the pool is created; in our experiments, each pool contained 512 placement groups. Each placement group contains a collection of objects striped across the pool's OSDs. For an erasure code of n blocks, each placement group maps the n blocks to n different OSDs, so each placement group can be described as an array pointing to n OSDs. Each OSD belongs to multiple placement groups.

Figure 5.1 shows the relationship between PGs (which contain objects) and OSDs. Each PG contains one or more objects, each split into multiple blocks (3 in this example). Each of the blocks is mapped to a different OSD.

The primary OSD in each placement group is responsible for encoding the data and distributing the data and parity blocks to the remaining, secondary, OSDs. When one of the OSDs in a placement group fails, a replacement OSD is assigned to the group. The primary OSD is responsible for reading the required data from the surviving OSDs, reconstructing the missing block, and sending it to the replacement OSD for permanent storage.

The mapping of placement groups to OSDs is implemented by CRUSH as a pseudo-random function to ensure load balancing [45]. CRUSH can take into account node location (for example, to prevent allocating data on two nodes in the same rack), predefined rules, and system state.
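As a rough illustration of such a mapping, the following Python sketch maps a placement group to n distinct OSDs with a deterministic pseudo-random draw. It is a simplified stand-in for CRUSH, ignoring failure domains, weights, and placement rules:

import hashlib

# Sketch: deterministic pseudo-random mapping of a placement group to
# n distinct OSDs. This is a simplified stand-in for CRUSH; it ignores
# node location, weights, and rules.
def map_pg_to_osds(pg_id: int, n: int, num_osds: int) -> list[int]:
    chosen = []
    attempt = 0
    while len(chosen) < n:
        digest = hashlib.sha256(f"{pg_id}:{attempt}".encode()).digest()
        osd = int.from_bytes(digest[:4], "big") % num_osds
        if osd not in chosen:   # retry on collision
            chosen.append(osd)
        attempt += 1
    return chosen

print(map_pg_to_osds(pg_id=7, n=11, num_osds=40))  # 11 distinct OSDs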

Ceph's design imposes certain limitations on our evaluation. When the failed OSD and the primary OSD belong to different locality groups, the repair data must be transferred across groups. In complex network topologies, this might incur the cross-rack or cross-zone traffic that LRCs were designed to avoid. In addition, degraded reads are currently implemented by reconstructing the entire object at the primary OSD. This means that all k data blocks are read, even if only r blocks are required to repair the missing block. As a result, for degraded reads, there is no observable difference between MDS codes and LRCs.

We chose to use Ceph despite these limitations. As far as we know, it is the only open-source distributed storage system that implements LRCs as part of its main distribution. Furthermore, at the time we began this research, it was the only system to support online erasure coding, without requiring that objects are first replicated and then erasure-coded in the background.

5.1.2 Erasure coded pool

In Ceph, object sizes can range between 4KB and 2GB, with a default of 4MB; the size is defined when creating the erasure-coded pool. For an (n, k) erasure code, during encoding, a client write request is sent to the primary OSD. The data is transformed into an object (or several objects), which is split into k data blocks (shards) and n − k additional parity blocks. Each of the blocks has an ID (rank): the ID of the first block is 0, of the second 1, and so on, for n blocks in total. The CRUSH algorithm distributes these blocks to n different OSDs. The OSD containing the first block (ID 0) is defined as the primary OSD of the placement group.
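The following Python sketch illustrates this flow up to (but not including) the parity computation, which Ceph delegates to the erasure-code plugin; the function name and zero-padding policy are our own:

# Sketch: split an object into k data shards and append n-k parity
# shards (parity encoding is stubbed out here). The shard index
# corresponds to the block ID (rank), 0..n-1.
def split_object(data: bytes, n: int, k: int) -> list[bytes]:
    shard_len = -(-len(data) // k)              # ceiling division
    padded = data.ljust(shard_len * k, b"\0")   # zero-pad the tail
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parities = [b"\0" * shard_len for _ in range(n - k)]  # placeholder
    return shards + parities

blocks = split_object(b"example object payload", n=14, k=10)
print(len(blocks), len(blocks[0]))  # 14 equal-sized shards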

5.1.3 Block Reconstruction

Block reconstruction is initiated when a node is detected to be permanently unavailable. The failed node is replaced by an active one in each placement group by CRUSH (a different node per placement group). This procedure, however, can cause unnecessary traffic: in steady state, the storage load in the cluster is balanced equally between all active nodes, and once a node fails, the cluster attempts to re-balance the load, replacing multiple nodes in the placement groups and thus creating increased disk access. To avoid this unwanted traffic, and to force CRUSH to replace only the failed node, we used a system parameter2.

Once a replacement node is set, the reconstruction process is initiated: the data that resided on the failed node is reconstructed from the surviving nodes and stored on the replacement node. We distinguish between reconstruction of a primary OSD and of a non-primary (secondary) OSD:

1. When a primary OSD is reconstructed, the replacement OSD receives all the required blocks for reconstruction from the surviving OSDs in the placement group, and performs the reconstruction process on the new primary OSD.

2. When a secondary OSD is reconstructed, the required blocks for reconstruction are sent from the surviving OSDs to the primary OSD. The primary OSD performs the reconstruction process and sends the reconstructed block to the replacement OSD.

5.2 LRC plugin

5.2.1 Constructing local groups in Ceph

In Ceph, erasure codes are implemented as plugins within its Erasure Code infrastructure. This infrastructure allows a modular implementation of erasure codes to work in synergy with Ceph: to add an erasure code, it is only required to implement a minimum of encode and decode functions under Ceph's guidelines3. We used the Jerasure Erasure Code plugin [4], which contains an implementation of Reed-Solomon based on the Jerasure [28] and GF-Complete [27] libraries. To encode data with the Jerasure plugin, it is enough to define the number of data and parity blocks; the plugin performs all operations on the fly, including the construction of the required generator and parity-check matrices. To implement the LRCs in our evaluation we used the Locally Repairable Erasure Code plugin (LRC plugin) [5], which incorporates the Jerasure plugin for local group encoding and decoding. When defining a new pool, the configuration of

2 chooseleaf_stable is set to true
3 described in the Ceph repository in src/erasure-code/ErasureCodeInterface.h


the code is also defined. The LRC plugin provides two methods for configuring the code: a high-level and a low-level configuration.

The high-level configuration allows the user to define the number of data blocks, k, the number of coding blocks, m (global parities), and the locality, r. With this configuration, the plugin automatically constructs the local groups according to a function in the plugin's source code.

Unlike in the high-level configuration, in the low-level configuration all local groups are defined manually. In the low-level configuration interface of this plugin, an LRC is defined in layers: each layer specifies the dependency of parity blocks ('c') on the relevant data blocks ('D'), and the ordering of the layers specifies the order in which the primary OSD attempts to recover the missing block. Recovery begins by finding the lowest layer that contains the missing data block, and in which sufficient blocks survive. Thus, if only one block is missing from a local group, it is reconstructed according to the respective layer. Otherwise, it is reconstructed from the global parities in the highest layer.

Figure 5.2 shows how (11,6,3) Azure-LRC is specified in Ceph4, and how we use the same interface to specify Azure-LRC+1. (11,6,3) Azure-LRC consists of 3 groups, described by 3 layers. Layers 2 and 3 describe the local data groups, each containing 3 data blocks and one local parity. The top layer (layer 1) specifies the dependency of the global parities on all the data blocks in the code. When a block is lost, Ceph attempts to reconstruct it using each layer, going from bottom to top, i.e., from local groups to global parities. For example, if block 7 is lost, it is reconstructed using the 'D' blocks in locations 4, 5, 6, according to layer 3. If block 10 is lost, Ceph skips layers 3 and 2, because a global parity cannot be reconstructed locally, reaching layer 1 and successfully reconstructing it there. Note that a single data block or local parity block will always be reconstructed in its local group; layer 1 only serves failures of the global parities.

(11,6,3) Azure-LRC+1 is specified in a similar manner, with the exception of global parity handling: it has 4 layers, with the additional layer defining the global-parity local group. When a global parity is lost, it is reconstructed using layer 4.

4 Ceph’s LRC plugin actually implements Pyramid codes, and not the LRCs used in Azure. However, thedata read by these two codes in all single-node failure scenarios is identical. The precise parity calculationsand fault tolerance of Azure-LRC are outside the scope of this study.


(11,6,3) Azure-LRC:
blk nr   0 1 2 3 4 5 6 7 8 9 10
layer 1  D D D _ D D D _ c c c
layer 2  D D D c
layer 3  _ _ _ _ D D D c

(11,6,3) Azure-LRC+1:
blk nr   0 1 2 3 4 5 6 7 8 9 10
layer 1  D D D _ D D D _ c c _
layer 2  D D D c
layer 3  _ _ _ _ D D D c
layer 4  _ _ _ _ _ _ _ _ D D c

Figure 5.2: LRC definitions using layers in Ceph ('_' marks a block that does not participate in the layer).
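The bottom-to-top recovery rule can be sketched as follows (our illustration; layers are modeled as sets of the block positions they cover, following Figure 5.2):

# Sketch: choose the layer used to repair a missing block, following the
# plugin's rule: take the lowest layer that contains the block and in
# which all other member blocks survive; otherwise fall back upward.
def repair_layer(layers, lost, failed):
    # 'layers' is ordered top (global) to bottom (local), as in Figure 5.2,
    # so we scan it in reverse. Each layer is the set of positions it covers.
    for layer in reversed(layers):
        if lost in layer and all(b == lost or b not in failed for b in layer):
            return layer
    return None  # unrecoverable using a single layer

# (11,6,3) Azure-LRC from Figure 5.2: layer 1 covers the data blocks and
# global parities; layers 2 and 3 are the local groups.
layers = [
    {0, 1, 2, 4, 5, 6, 8, 9, 10},  # layer 1
    {0, 1, 2, 3},                  # layer 2
    {4, 5, 6, 7},                  # layer 3
]
print(repair_layer(layers, lost=7, failed={7}))    # -> {4, 5, 6, 7}
print(repair_layer(layers, lost=10, failed={10}))  # -> layer 1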

Figure 5.3: (10,6,3) Azure-LRC+1 with a local group of size 2, containing a global parity and its replication.

5.2.2 Issues discovered in the LRC plugin

LRC high-level configuration. The LRC high-level configuration described in Section 5.2 requires that the following conditions hold:

1. k + m must be a multiple of l

2. k must be a multiple of (k + m)/l

3. m must be a multiple of (k + m)/l

These limitations constrained us to a very small set of possible (n, k, r) configurations. To overcome this limitation, we defined all LRC layers using the low-level configuration, in which these limitations do not exist.
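For illustration, these conditions translate directly into a small validity check (a sketch; Ceph performs the equivalent checks internally):

# Sketch: the three constraints the LRC high-level configuration
# imposes on (k, m, l), as listed above.
def valid_high_level(k: int, m: int, l: int) -> bool:
    if (k + m) % l != 0:
        return False              # condition 1
    groups = (k + m) // l         # number of local groups
    return k % groups == 0 and m % groups == 0  # conditions 2 and 3

print(valid_high_level(k=4, m=2, l=3))   # True: two groups of size 3
print(valid_high_level(k=10, m=4, l=2))  # False: k is not a multiple of 7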

Single data block in LRC layer. We discovered that the LRC implementation in Ceph does not allow constructing a local group with a single block and a local parity. Figure 5.3 shows a (10,6,3) Azure-LRC+1, which is an example of such a configuration: the code contains a local group of size 2, consisting of a global parity and a local parity that is its replication. Such configurations are unlikely in practice, but for the completeness of our experiments we removed this constraint from the Ceph source code.


[Figure: four PG diagrams, blocks numbered by their hosting OSD]
(a) An active (11,6,3) Azure-LRC+1 PG
(b) OSD 3 failed
(c) OSD 4 replaced OSD 3, but it is already in use
(d) Reallocation eventually chose OSD 11 to replace OSD 3

Figure 5.4: (11,6,3) Azure-LRC+1 PG representation describing a case of a failed OSD and possible CRUSH behavior in selecting its replacement.

Reflecting the correct amount of data required for recovery. During our initial experiments, we identified several design decisions in Ceph that cause it to read more data than is actually required for the reconstruction of a failed node. These extra reads are performed to maintain Ceph's scalability and load balancing, but are irrelevant to the comparison of different LRC approaches. Thus, we eliminated them from our experiments by making the following modifications:

1. Redundant reads during reconstruction. When the LRC plugin calculates the minimum required blocks for reconstruction in a layer, it actually selects all available blocks instead of the required minimum, causing the system to read more data than necessary. In our experiments, this affected only global-parity reconstruction, since in the other local groups all surviving blocks participate in reconstruction anyway. For example, the (11,6,3) Azure-LRC in Figure 5.2 has three layers, where layer 1 specifies k = 6 data blocks and 3 global parities. In the case of a global parity loss, k blocks are required for reconstruction, but the original LRC plugin reads all eight surviving blocks to reconstruct the missing block.

We implemented a fix on top of the existing source code to ensure that only k blocks are read during the reconstruction of a global parity in all LRCs.


2. Suboptimal choice of replacement for a failed OSD. When an OSD fails, Ceph's CRUSH algorithm selects an active OSD as its replacement in the placement group (PG), copying the reconstructed data of the failed OSD to its replacement. CRUSH is not limited to choosing a replacement OSD from those that are not in use, and might choose an OSD that is already used in that PG. Such an event is called a collision [45]. This issue occurs with CRUSH's default algorithm, straw25. By the time the collision is detected, the reconstruction process has already been initiated and the required blocks have been read; Ceph then discards the reconstructed data, and CRUSH again chooses an available replacement OSD. Figure 5.4 illustrates an example of this process. Figure 5.4(a) depicts (11,6,3) Azure-LRC+1 in its PG representation, each block numbered by the number of the OSD hosting it. In Figure 5.4(b), OSD 3 has failed. At this stage, CRUSH selects a replacement according to the straw2 algorithm, and might choose an OSD that is already in use: in Figure 5.4(c), OSD 4 has been chosen to replace OSD 3, although it is already allocated to this PG. Once a replacement is chosen, the reconstruction process begins reading the required blocks from the surviving OSDs of the local group. Only after the blocks have been read does CRUSH detect the collision and re-run the algorithm, choosing the available OSD 11, as seen in Figure 5.4(d). In small clusters, CRUSH might even require multiple iterations of the algorithm to choose an available OSD; this problem diminishes in larger clusters.

We solved this issue by changing the allocation algorithm to uniform. This algorithm can be used when all storage devices have the same weight (a value given to each OSD, usually relative to its storage capacity), which is the case in our evaluation setup. With the uniform algorithm, CRUSH allocates blocks in constant time and in a pseudo-random fashion.

3. Non-uniform block allocation. Ceph attempts to allocate blocks uniformly across all OSDs. For example, in the (15,10,4) Azure-LRC depicted in Figure 5.5, the number of global parities is m = 2, meaning that m/n = 13.3% of all the blocks stored in the system are global parities. For this configuration, we expect 13.3% of the blocks on each OSD to be global parities, and the rest to be data blocks and local parities. However, CRUSH's block allocation is not perfectly uniform.

5 The algorithm is specified at https://www.spinics.net/lists/ceph-devel/msg21635.html


Figure 5.5: (15,10,4) Azure-LRC

By logging the exact allocation of blocks in the cluster, we detected deviations in the range of 1.1%-4.2% from the expected allocation of blocks. For example, in the case of a single node failure with (15,10,4) Azure-LRC, we noticed that only 11.5% of the recovered blocks were global parities, instead of the expected 13.3%. The deviation from uniform allocation is expected to decrease for clusters with more OSDs or more written data. We verified that this deviation from uniform allocation caused the deviation from the expected amount of data read, which was sufficient for the purpose of our analysis.

5.3 Optimal-LRC implementation

5.3.1 Implementation overview

Encoding in Optimal-LRC is implemented as a multiplication of a data vector of size k by a k × n generator matrix. The generator matrix for each (n, k, r) Optimal-LRC was created from the polynomial described in Section 3.5.3 and the guidelines in [39], which we describe below. Once the generator matrix is constructed, we transform it into a systematic form, in which the data blocks are not encoded and are stored on the storage nodes in their original form. Next, we multiply the generator matrix on the right by a diagonal invertible matrix to ensure that the decoding process of local recovery consists only of XOR operations, avoiding finite-field operations.

The following is a high-level description of the methods used to construct the generator polynomial fa(x) described in Section 3.5.3, and the generator matrix derived from it. We describe the entire process in detail in Section 5.3.2.

1. The first step in constructing the encoding polynomial is choosing a polynomial g(x) (termed a good polynomial) which satisfies the following two conditions, described both in our new construction in Section 3.5.3 and in the original construction in [39]: a polynomial g(x) ∈ F[x] is a good polynomial if it is of degree r + 1 and is constant on ⌈n/(r + 1)⌉ disjoint sets Ai ⊆ F, each of size r + 1. The sets Ai can be arbitrary subsets; however, we restricted ourselves to the case where they are cosets of additive or multiplicative subgroups of the underlying field F (as shown in the examples in [39]). Section III.B in [39] describes how to construct such good polynomials, and Algorithm 2 in the sequel describes this step in detail.

2. Together with the construction of g(x), we also need to define the ⌈n/(r + 1)⌉ disjoint sets Ai of size r + 1, in order to satisfy the second condition for g(x) to be a good polynomial [39]. The field elements of the sets Ai are called locations, and the process of defining them is described in Algorithm 1.

3. Once the polynomial g(x) and its corresponding sets Ai are found, we proceed to construct the encoding polynomial fa(x). fa(x) is defined by g(x) and the k data elements to be stored, a = (aij), i = 0, ..., r − 1, j = 0, ..., k/r − 1, as described in Section 3.5.3. Once fa(x) is constructed, it is evaluated at each of the locations α ∈ Ai (n in total). Using standard techniques, each evaluation of the polynomial is viewed as a multiplication of the length-k row vector of data elements by a column vector of length k. Hence, each location corresponds to a column of length k, and collecting the n columns into a matrix gives the k × n generator matrix. Algorithm 3 describes the entire procedure of constructing fa(x) and the corresponding generator matrix.

4. At this step the k × n generating matrix is constructed, and the next goal is to provide simple local reconstruction for each missing block; in other words, local reconstruction should be done by performing only XOR operations between surviving blocks within the local group. This is achieved as follows. We first calculate, for each local group, the coefficients of the linear dependency between the columns of the generating matrix that correspond to it. Then we multiply each column by the coefficient found in the previous step. The resulting generating matrix enables simple XOR reconstruction: in a local group of size r + 1, the XOR of any r columns yields the remaining column. This operation is described in Algorithm 4 (a short sketch of the resulting repair procedure follows this list).

5. To ensure efficient data access, it is customary to transform the generator matrix into a systematic form (also described in Algorithm 4). Let G be the k × n generator matrix constructed in the previous step; it can be transformed into a systematic matrix G′ of the same order as follows. Assuming that the first k columns of G are linearly independent (if not, permute the columns of G so the resulting matrix satisfies this property), write G = [C | P] and define G′ = C^{−1}G, where C is the k × k matrix formed by the first k columns of G and P is the k × (n − k) matrix of the remaining columns. It is easy to verify that the resulting matrix maintains all the desired characteristics of G. This is the generator matrix we used for Optimal-LRC encoding.
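To make the payoff of step 4 concrete, the following minimal Python sketch (our illustration, not the Ceph code) reconstructs a missing block inside a local group whose generator columns XOR to zero:

# Sketch: once the generator columns of a local group XOR to zero,
# any missing block equals the XOR of the group's surviving blocks.
def xor_reconstruct(surviving_blocks: list[bytes]) -> bytes:
    missing = bytearray(len(surviving_blocks[0]))
    for block in surviving_blocks:
        for i, b in enumerate(block):
            missing[i] ^= b
    return bytes(missing)

group = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]
parity = xor_reconstruct(group)  # local parity = XOR of the data blocks
print(xor_reconstruct([group[0], group[2], parity]) == group[1])  # True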

5.3.2 Calculation of generator matrix in Matlab

We implemented the steps described in Section 5.3.1 in Matlab. The algorithms below describe our implementation.

Algorithm 1 create optimal lrc gen matrix
Input: n, k, r
Output: generator matrix

1: if n mod (r + 1) == 1 then // Check if state is legal
2:   return // Illegal state - local group of size 1
3: end if
4: g(x) = calculate good polynomial(n, k, r)
5: for i ∈ {0, ..., 2^8 − 1} do // calculate locations over all elements of F_{2^8}
6:   values ← g(i) // array of evaluations of g(x) for the 2^8 locations
7:   base ← i // array of all possible locations
8: end for
9: s = n mod (r + 1) // for the case (r + 1) ∤ n
10: for i ∈ {1, ..., ⌈n/(r + 1)⌉} do // Collect ⌊n/(r + 1)⌋ groups of locations, each of size r + 1
11:   for 1 ≤ j ≤ r + 1; l_j such that values[l_1] == values[l_2] == ... == values[l_{r+1}] do
12:     locations ← base[l_j]
13:   end for
14: end for
15: if s > 0 then // Meaning we have a small local group of size s
16:   for 1 ≤ j ≤ s; l_j such that values[l_1] == values[l_2] == ... == values[l_s] do
17:     locations ← base[l_j]
18:   end for
19: end if
20: if g(A_{⌈n/(r+1)⌉}) ≠ 0 then // as in the construction in Section 3.5.3
21:   g(x) = g(x) − g(A_{⌈n/(r+1)⌉}) // normalize g(x)
22: end if
23: gen matrix temp = calc and evaluate fa(n, r, s, locations, g(x))
24: generator matrix = formalize generator matrix(gen matrix temp)
25: return generator matrix


Algorithm 1 describes the main construction of the Optimal-LRC generator matrix. Its main elements are:

1. Construct good polynomial g(x) (line 4)

2. For the finite field of 2^8 elements, collect ⌈n/(r + 1)⌉ disjoint sets Ai, each of size r + 1, such that g(α) = g(β) for all α, β ∈ Ai. In case n mod (r + 1) > 0, there is one local group smaller than r + 1; add its elements to the locations array as well. In total, locations will contain n elements (line 10).

3. Evaluate fa(x): fa(x) is a polynomial with coefficients ai,j, each representing an information (data) symbol. By evaluating fa(x) at the n locations and logging the coefficient of each information symbol (k coefficients), we construct the n × k generator matrix (line 23).

4. Formalize generator matrix: Convert the generator matrix to a systematic form which is also simple to decode (line 24).

Algorithm 2 describes the construction of a good polynomial over an extension field of Fp, according to the guidelines set in [39]. The good polynomial is constant on disjoint subsets of points of size m·p^t, where m and p are coprime; working over an extension field of p = 2 requires m to be odd. We followed these instructions, which handle most cases; however, some cases do not fall into this category, such as the case of r = 5 [39]. It is not specifically explained in [39] how to construct the good polynomial in such cases, but relying on a counting argument, it is shown that such polynomials exist (see the construction in Section 3.5.3). In those cases, one good polynomial is found using a brute-force search.


Algorithm 2 calculate good polynomial
Input: n, k, r
Output: g(x)

1: found group ← false
2: for m ∈ {0, ..., r + 1} do
3:   if (m odd) && (m == r + 1) then // Multiplicative subgroup
4:     g(x) = x^{r+1}
5:     found group ← true
6:     return g(x)
7:   end if
8: end for
9: if exists t such that (2^t == r + 1) && found group == false then // Additive subgroup
10:   g(x) = x · (x + 1) · (x + α) · (x + α + 1) · ... // construct polynomial of this form
11:   // such that it is of degree r + 1
12:   return g(x)
13: else // No group found
14:   sweep all possible polynomials such that the good-polynomial conditions hold
15:   return
16: end if
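The multiplicative-subgroup case (lines 3-8) can be verified numerically. The following Python sketch checks that g(x) = x^(r+1) is constant on a coset of the multiplicative subgroup of order r + 1 in GF(2^8); the 0x11d reduction polynomial is our assumption (any primitive polynomial works):

# Sketch: g(x) = x^(r+1) is a good polynomial when r+1 divides 255,
# since it is constant on each coset of the multiplicative subgroup
# of order r+1 in GF(2^8).
def gf_mul(a: int, b: int) -> int:  # carry-less multiplication mod 0x11d
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return p

def gf_pow(a: int, e: int) -> int:
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

r = 4                                                  # r + 1 = 5 divides 255
gen = gf_pow(2, 255 // (r + 1))                        # element of order r + 1
subgroup = {gf_pow(gen, i) for i in range(r + 1)}
coset = {gf_mul(3, h) for h in subgroup}               # a coset A_i of size r + 1
values = {gf_pow(x, r + 1) for x in coset}             # evaluate g on the coset
print(len(coset), values)                              # 5 elements, one value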

Algorithm 3 calc and evaluate fa
Input: n, r, s, locations, g(x)
Output: gen matrix

1: t = r + 1 − s
2: hB = 1
3: for i ∈ {1, ..., t} do // constructing hB(x) as defined in Equation 3.4
4:   hB(x) = hB(x) · (x + locations[n − i])
5: end for
6: for i ∈ {0, ..., r − 1}, j ∈ {1, ..., S_{k′,r}(i)} do // fa(x) is constructed as a matrix, with each row holding the coefficients of the polynomial per i, m as defined in Equation 3.4
7:   fa(x) ← g(x)^j · x^i // each assertion is a new row in the matrix
8: end for
9: for m ∈ {0, ..., r − t − 1} do
10:   fa(x) ← hB(x) · x^m // each assertion is a new row in the matrix
11: end for
12: for i ∈ {0, ..., n − 1} do // evaluate fa at the n locations
13:   gen matrix ← fa(locations[i]) // implemented in Matlab with an M matrix
14: end for
15: return gen matrix


Algorithm 3 describes the calculation and evaluation of the generator polynomial fa(x), which is used for the construction of the generator matrix. fa(x) is constructed according to Equation 3.4. In order to use Matlab's finite-field evaluation function on fa(x), we had to construct it as a matrix. In Equation 3.4, fa(x) is described as a collection of sums containing r + (r − t) elements in total. The matrix representing fa(x) was constructed such that each row contains an element of fa(x), for a total of r + (r − t) rows. For example, the sum ∑_{i=0}^{r−1} f_i(x)x^i contributes r rows, one per value of i. Once the matrix was constructed, it was evaluated in Matlab using the polyval (polynomial evaluation) function, which supports evaluation in finite fields.

Algorithm 4 formalize generator matrix
Input: gen matrix temp
Output: generator matrix final

1: for i ∈ {1, ..., ⌈n/(r + 1)⌉} do // for each local group Ai in gen matrix temp
2:   coef ← linear coefficients of(Ai) // description below
3: end for
4: gen matrix xored = gen matrix temp × coef
5: for i ∈ {1, ..., k} do // select k vectors at the locations of the k information symbols
6:   tmp for systematic[i] ← gen matrix xored[data block location]
7: end for
8: inv tmp for systematic ← invert(tmp for systematic)
9: generator matrix final ← inv tmp for systematic × gen matrix xored
10: return generator matrix final

Algorithm 4 describes the steps taken to bring the generator matrix to its final form. First, we bring it to a state in which the XOR of the columns that correspond to a local group is zero. As previously described, this allows us to reconstruct a lost block using only XOR operations. Matlab has a function that calculates the coefficients of the linear dependency between linearly dependent vectors; however, it does not support finite fields. For our purpose, we initially constructed a greedy search function, evaluating all possible coefficient options to find the linear combination. This function proved extremely slow: a search for r = 5 can take hours, evaluating (2^8)^5 options. To solve this run-time issue, we analyzed the sets of linear combinations found by the greedy search and discovered that they are equal to the locations of each local group. We utilized this knowledge and used locations instead of calculating the linear coefficients with a greedy search. One case remained problematic, however: when the good polynomial was not constructed using Algorithm 2 but with a brute-force search, locations did not provide the linear combination, and since Matlab does not support this calculation for finite fields, the combination was calculated manually.

Next, we bring the generator matrix to a systematic form. The k column vectors at the locations of the data blocks are collected from the generator matrix G into a matrix C of size k × k. By inverting this matrix and multiplying it by G, the resulting generator matrix G′ = C^{−1}G becomes systematic, as described by:

G = [C | P], and
G′ = C^{−1}G = C^{−1} · [C | P] = [I_k | C^{−1}P],

where P is the k × (n − k) matrix of the remaining columns.
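The same transformation is easy to demonstrate in code. The sketch below works over GF(2) instead of GF(2^8) only to keep it short; the function names and the toy matrix are ours:

# Sketch: systematize a generator matrix over GF(2) by inverting the
# matrix C of its first k columns and computing G' = C^-1 * G.
def inverse_gf2(m):
    k = len(m)
    aug = [row[:] + [int(i == j) for j in range(k)] for i, row in enumerate(m)]
    for col in range(k):                      # Gauss-Jordan elimination mod 2
        pivot = next(r for r in range(col, k) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def matmul_gf2(a, b):
    return [[sum(a[i][t] & b[t][j] for t in range(len(b))) % 2
             for j in range(len(b[0]))] for i in range(len(a))]

G = [[1, 1, 0, 1, 0],
     [0, 1, 1, 0, 1]]                         # k = 2, n = 5
C = [row[:2] for row in G]                    # first k columns
G_sys = matmul_gf2(inverse_gf2(C), G)
print(G_sys)                                  # first k columns form I_k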

5.3.3 Implementation in Ceph

As described in Section 5.2.1, Ceph can construct a generator matrix on the fly for its LRC plugin. That transformation from Reed-Solomon codes to Azure-LRC relies on the fact that Azure-LRC is based on Reed-Solomon (Section 2). However, it does not serve our purpose in implementing Optimal-LRC, which is not based on Reed-Solomon. Specifically, the layers framework of the LRC plugin described in Section 5.2.1 cannot be used for Optimal-LRC. Instead, we used the functionality provided by Ceph's erasure-coding infrastructure and stored a file containing the pre-calculated generator matrices of Optimal-LRC in Ceph's source code, accessed by the optlrc encode function described below. We used Matlab to construct the generator matrix for each (n, k, r) Optimal-LRC in our evaluation. Beyond these initial calculations, the encoding and decoding processes of Optimal-LRC are equivalent to those of the original Ceph LRC implementation. The differences in encoding and decoding complexities are negligible compared to the I/O and network times of a large-scale storage system [8, 15, 17, 20, 26]. Similarly, there is no significant difference in the overhead of their implementation and metadata storage and maintenance. The matrices in the file were calculated offline, as described in Section 5.3.2. In the common case, the calculation of each generator matrix took less than a minute of runtime. The uncommon cases, also referred to in Section 5.3.2 (Algorithm 2), are (n, k, r) configurations that require some manual calculation of the generator matrix.

We modified the following Ceph Erasure Code functions to serve the Optimal-LRC implementation:


1. minimum to decode - This function receives the IDs of the missing blocks (Section 5.1.2) in the placement group and returns the PG IDs of the surviving blocks required for their reconstruction. The original function relies on the LRC plugin's layers, and is thus not suitable for Optimal-LRC.

For the purpose of our evaluation, it was enough to assume a single node failure in the cluster (resulting in at most one lost block per placement group). Our implementation also covers cases in which there is at most one block loss in each local group (for a total of ⌈n/(r + 1)⌉ local groups in the code), but it does not cover cases in which more than one block is lost in a local group. Our function returns the IDs of the surviving blocks in the local group of the missing block, i.e., all the blocks in that local group except the lost one. The original function also addresses the case of global parity loss in Azure-LRC; this case is irrelevant for Optimal-LRC, since all Optimal-LRC blocks belong to local groups.

2. encode chunks - The original implementation of this function first separates local groups into layers and inputs each layer into a Jerasure encoding function. In our implementation, this function only serves as a wrapper for the optlrc encode function described below.

3. decode chunks - The original function receives the locations of all missing blocks and handles each one separately, relying on LRC layers. Our function is a variation of the original, supporting Optimal-LRC instead of LRC layers. Each local group is decoded by optlrc decode local, which is described below. Our function can handle up to one lost block in each local group.

We implemented the following two functions as part of our Optimal-LRC implementation in Ceph:

1. optlrc encode - Our function is an analogue of the Jerasure plugin's encode chunks, which is used for encoding the local parities of each LRC local group. The core of our encoding function is based on jerasure matrix encode from the Jerasure library6.

6 http://jerasure.org/jerasure-2.0/


jerasure matrix encode encodes parity blocks using a generator matrix of the form G_{m×k}, where m = n − k represents all parities, both global and local.

optlrc encode retrieves the generator matrix calculated beforehand (Section 5.3.3) for the (n, k, r) configuration being encoded. Next, the function transforms the Optimal-LRC generator matrix from the form G_{n×k} (the one generated by Matlab and stored in Ceph's source code) to the form required by jerasure matrix encode (G_{m×k}), omitting the matrix columns representing data blocks, and passes this matrix to it. The output of jerasure matrix encode is the encoded blocks, which are the product of the data blocks and the Optimal-LRC generator matrix.

2. optlrc decode local - Our decoding function is an analogue of the Jerasure plugin's decode chunks, which decodes lost blocks of Reed-Solomon encoding. To reconstruct local parities and data blocks, optlrc decode local receives a pointer to the surviving blocks and XORs them to reconstruct the missing block. Since we rely on the assumption that there is at most one lost block in a local group, global parity reconstruction is equivalent to local parity or data block reconstruction.
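A minimal sketch of the surviving-block selection this relies on (our illustration; the group layout is hypothetical):

# Sketch of the minimum_to_decode logic described above: with at most
# one lost block per local group, repair needs exactly the other blocks
# of the lost block's group. Groups are given as lists of block IDs.
def minimum_to_decode(lost: int, groups: list[list[int]]) -> list[int]:
    group = next(g for g in groups if lost in g)
    return [b for b in group if b != lost]

# Hypothetical (n=8, r=3) layout with all blocks in local groups,
# as in Optimal-LRC.
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(minimum_to_decode(5, groups))  # -> [4, 6, 7]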

5.4 Amazon EC2 deployment

We deployed our Ceph cluster on 20 instances in the Amazon Elastic Compute Cloud (EC2). We used t2.medium instances, each equipped with two Intel Xeon processors and 4GiB RAM [2]. We allocated two storage volumes to each instance, and used them to initialize two OSDs (40 OSDs in total). An additional instance was allocated to accommodate the monitor, metadata server, and client. For this purpose, we used a t2.2xlarge instance equipped with eight Intel Xeon processors and 32GiB RAM.

EC2 data centers belong to different regions, which correspond to distinct geographical locations. Each region contains several availability zones, which are connected by low-latency links and guarantee failure tolerance within the region [3]. We deployed our cluster in a single availability zone in the Frankfurt region in all experiments but one: in the "load" experiment (described below), we deployed our cluster in three separate zones in the N. Virginia region. The N. Virginia region was chosen for this purpose because it contained 6 availability zones (Frankfurt had 3), which is sufficient for the purpose of our experiment, taking into account that not all zones are always available for use. We used


General Purpose SSDs as our storage devices in all experiments but two: we replaced these SSDs with Cold HDD volumes in one experiment, and with Provisioned IOPS SSDs in the load experiment [1].

In our basic "node repair" experiment, we populated the cluster with 200GB of data, written as 64MB objects. These objects are distributed across 512 placement groups. Thus, each OSD stored, on average, 5GB of data, plus additional parity blocks according to the evaluated code. We killed one OSD daemon on one instance and removed this OSD from the cluster. This initialized the repair process, which was performed by the primary OSD in each affected placement group. We recorded the amount of data read from each device and the CPU utilization of each instance, until the full recovery of the cluster. We describe variations of this experiment with a foreground workload ("load") and with slower storage ("HDD") in the following section.


6 Results

6.1 Amount of data read and transferred

Figure 6.1 shows the number of blocks read by each code during repair, normalized to the number of data blocks on the failed OSD. We also present the ARC and NRC of each code, for comparison. We use an (n, k) Reed-Solomon code in each configuration as our baseline. The results show the considerable reduction in repair cost achieved by LRCs, and that full-LRCs achieve a larger reduction, as shown in our theoretical evaluation.

For a given (n, k, r) combination, both ARC and NRC can predict which code will incur the highest and lowest repair costs. At the same time, both are inaccurate in their prediction of the actual repair cost, for a different reason in each case. ARC inherently underestimates the absolute cost, because it does not take into account the code's overhead. As a result, it is not useful for comparing codes with different storage overheads. For example, the ARC of (14,10) Reed-Solomon and (15,10) Reed-Solomon is 10 for both, but they read 12.53 and 13.24 blocks per data block recovered, respectively.
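For concreteness, the following Python sketch computes ARC and NRC for Azure-LRC using the repair-cost accounting of our theoretical evaluation (a data or local-parity block reads the other blocks of its local group; a global parity reads k blocks); it reproduces the (14,10,5) and (14,10,6) values in Table 6.1:

import math

# Sketch: ARC and NRC for Azure-LRC. A local group of g data blocks
# plus one local parity has g+1 blocks, each costing g reads; each
# global parity costs k reads. ARC divides the total repair cost by n,
# NRC by k (charging the storage overhead to the repair cost).
def azure_lrc_costs(n: int, k: int, r: int):
    num_groups = math.ceil(k / r)
    group_sizes = [min(r, k - i * r) for i in range(num_groups)]
    num_global = n - k - num_groups
    total = sum(g * (g + 1) for g in group_sizes)  # local groups
    total += num_global * k                        # global parities
    return total / n, total / k                    # (ARC, NRC)

print(azure_lrc_costs(14, 10, 5))  # (5.71..., 8.0), as in Table 6.1
print(azure_lrc_costs(14, 10, 6))  # (5.85..., 8.2)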


[Figure: bar chart of blocks read per data block recovered; x-axis groups r values per (n, k) configuration; series: RS, Opt-LRC, Azure, Azure+1, with the expected NRC and ARC marked]

Figure 6.1: The average number of blocks read per data block repaired, compared to the expected ARC and NRC.


Code               ARC          Adjusted ARC  NRC         Adjusted NRC  Data read    Time (s)
(13,10) RS         10           10            13          13            11.71        89
(14,10,5) Azure    5.71 (0.57)  5.5 (0.55)    8 (0.62)    7.7 (0.59)    6.74 (0.58)  59 (0.66)
(14,10,6) Azure    5.85 (0.58)  5.6 (0.56)    8.2 (0.63)  7.84 (0.6)    6.86 (0.59)  57 (0.64)
(15,10,5) Azure+1  4.4 (0.44)   4.5 (0.45)    6.6 (0.51)  6.75 (0.52)   5.96 (0.51)  57 (0.64)
(15,10,6) Azure+1  4.53 (0.53)  4.59 (0.46)   6.8 (0.52)  6.88 (0.53)   6.08 (0.52)  57 (0.64)

Table 6.1: Adjusted ARC and NRC according to the block distribution on the failed OSD. The adjusted NRC corresponds to the actual amount of data read for recovery. The values in parentheses show the costs normalized to Reed-Solomon with the same k and d.

The inaccuracy of NRC is the result of our limited evaluation setup. Although CRUSH attempts to uniformly distribute data and parity blocks on all OSDs, its mapping is deterministic, and the actual distribution with 40 OSDs is not perfectly uniform. As a result, some OSDs store more blocks than others, and the percentage of data and parity blocks on each OSD differs. We verified that this is the cause of the inaccuracy by distinguishing between the blocks on the failed OSD according to the number of blocks required for their repair, and observing that their percentage differed from the expected one. We adjusted the NRC and ARC in several setups according to the observed distribution, although we still assumed 5GB of data on each OSD1.

Table 6.1 shows the detailed metrics and results for Reed-Solomon, Azure-LRC, and Azure-LRC+1, with k = 10 and d = 4. The required storage overhead is different for each code, which makes it difficult to directly compare their repair costs. The mapping of OSDs to placement groups is also different for each n. The table shows the calculated and the adjusted ARC and NRC of all codes, with the repair cost of each LRC compared to that of Reed-Solomon in parentheses. The adjusted NRC provides a fairly accurate prediction of the amount of data read for recovery (we discuss the recovery time below). This confirms that in a large-scale storage system with uniform block distribution, NRC can accurately predict the average repair cost of an entire storage node.

The amount of data transferred between nodes was almost identical to the amount of data read. We verified that the differences were caused by the role of the primary OSD in the reconstruction process: when the primary stored one of the blocks required for reconstruction, it did not have to transfer this block to another OSD. On the other hand, the primary always had to transfer the reconstructed block to the replacement OSD. In light of this simple correlation, we omit the amount of data transferred from the rest of our discussion.

1 Ceph does not report the number of data blocks stored on each OSD, and we could not distinguish between data blocks and local parity blocks because they require the same number of blocks for repair.


[Figure: bar chart of normalized repair time; x-axis groups r values per (n, k) configuration; series: Opt-LRC, Azure, Azure+1, with the NRC marked]

Figure 6.2: Recovery time of LRCs normalized to Reed-Solomon with the same k and n.

6.2 Repair time

Figure 6.2 shows the recovery time of LRCs normalized to Reed-Solomon with the same k and n (normalizing to Reed-Solomon with the same d yields equivalent results). Our results show that the reduction in the amount of data read for repair does not directly translate to a reduction in repair time. This is the result of additional bottlenecks in the system, such as queuing and batching delays. We verified that the CPU utilization is the same for all codes, ruling out encoding costs as a bottleneck. However, the I/O bandwidth utilized by the codes was slightly different: Reed-Solomon typically achieved a higher throughput than the LRCs, as it reads considerably more data than the other codes, which allows it to saturate the storage devices. Thus, the reduction in repair time achieved by the LRCs was smaller than that predicted by NRC. Overall, the full-LRCs achieved the greatest reduction in repair time.


Code          NRC         SSD        Opt HDD    Cold HDD
Reed-Solomon  15          100        115        303
Azure-LRC     6.6 (0.44)  58 (0.58)  65 (0.56)  134 (0.44)
Azure-LRC+1   4.8 (0.32)  49 (0.49)  53 (0.46)  134 (0.44)
Optimal-LRC   6 (0.4)     54 (0.54)  57 (0.49)  143 (0.47)

Table 6.2: NRC of all codes and their recovery time in seconds. n = 15 and k = 10 for all codes, and r = 4 for the LRCs, with the repair time normalized to Reed-Solomon in parentheses.

[Figure: line plot of throughput (MB/sec) over time (sec) during repair; series: Azure, Azure+1, Opt-LRC, RS]

Figure 6.3: Throughput of the RADOS benchmark during repair with the (15,10,4) LRCs and (15,10) RS.

6.3 Different storage types

LRCs reduce the amount of data read during recovery, and thus their benefit is expected to increase with the cost of storage I/O. We repeated our repair experiment for one configuration, replacing the SSD storage volumes with two types of hard drives, Optimized HDD and Cold HDD, with maximum throughputs of 500 and 250 IOPS, respectively [1]. The amount of data read from all storage types was the same. Table 6.2 shows the repair time, in seconds, for all the codes and storage types. As expected, in the setups where the repair time of Reed-Solomon was longer, the reduction in repair time achieved by all LRCs was higher and closer to the reduction predicted by NRC.


6.4 Foreground workloads

Local repair is also designed to minimize the interference with application workloads running in the system at the time of failure. To evaluate this interference, we repeated the repair experiment with the (15,10,4) configuration, in which each LRC has a different NRC. We ran a Ceph benchmark called RADOS Bench [41], which writes objects for a given amount of time (220 seconds in our experiment), reads all the objects, and terminates. For this experiment, we increased the number of outstanding recovery requests allowed per OSD from 15 to 150. We killed one OSD 100 seconds after the benchmark started to read. The repair process took place while the benchmark was still reading the data, but the system recovered before the benchmark terminated.

Figure 6.3 shows the throughput of the benchmark's I/O requests during its read phase. The black circles mark the time at which recovery was fully completed, and the measurements continue until the benchmark terminates. The differences between the codes were smaller than we expected. This is the result of Ceph's restrictions on repair throughput, and of the high I/O parallelism of SSDs. Nevertheless, the results show that the different codes completed their repair in the order of their NRC: Azure-LRC+1 was the fastest and Reed-Solomon the slowest. The throughput reduction experienced by the benchmark was greatest with Reed-Solomon and smallest with the full-LRCs, Azure-LRC+1 and Optimal-LRC.

6.5 Multiple zones

The first LRCs were motivated by the goal of restricting the repair cost to the locality of the failed node. In production systems, this means that blocks in the same group are assigned to a group of nodes on the same rack or in the same zone of the datacenter [15]. To evaluate the different LRCs in a similar environment, we repeated the repair experiment with our instances deployed in three availability zones in the same EC2 region (N. Virginia). In this experiment, we deployed six instances in each zone, for a total of 18 instances running 36 OSDs. To make up for the reduced I/O bandwidth within each zone, we replaced the General Purpose SSDs with Provisioned IOPS SSDs, which increased the maximum IOPS per volume from 150 to 2500.


Zones         1      3             1         3
              Reads  Reads         Time (s)  Time (s)
Reed-Solomon  13.24  14.37 (8.5%)  100       133 (33%)
Azure-LRC     7.73   8.11 (4.9%)   76        138 (81.5%)
Azure-LRC+1   5.96   6.3 (5.7%)    57        144 (252%)
Optimal-LRC   5.94   6.3 (6%)      56        144 (277%)

Table 6.3: Number of blocks read per lost data block and repair time of all n = 15, k = 10 (r = 5 for the LRCs) codes when running on one zone and on three zones. The increases in the amount of data read and in repair time compared to one zone appear in parentheses.

6.5.1 Basic setup and its limitations

First, we chose (15,10,5) as our configuration. We instructed CRUSH to assign placement groups to OSDs such that groups are allocated in the same zone, by editing the CRUSH map [45]. Recall, however, that recovery in Ceph is handled by the primary OSD, which reads the data required for repair from the secondary OSDs and reconstructs the missing data. Thus, when the primary OSD is in a different zone than the failed OSD, the repair data is transferred between zones, rather than directly to the replacement node in the same zone.

Table 6.3 shows the amount of data read and the repair time for all codes in the same configuration, when deployed on one zone and on three zones. The amount of data read in the three-zone experiment was higher due to a different allocation of placement groups to OSDs than in the one-zone experiment. Indeed, the amount of data required for repair does not depend on the physical location of the nodes, and the reduction achieved by the different LRCs is similar in both experiments.

The recovery time increased considerably when the nodes were distributed across three zones. The network bandwidth between zones is high, but users are limited in the bandwidth they may consume. As a result, the recovery time of Reed-Solomon increased by 33%. The increase in the recovery time of the LRCs was substantially higher. The reason is the uneven distribution of the recovery load: in Reed-Solomon, all the surviving nodes from all zones participated equally in the repair process, whereas in the LRCs, the recovery load was concentrated in the zone of the failed OSD. Azure-LRC was more balanced than the other LRCs because the recovery of the global parities was still distributed across all the nodes in the system; thus, its recovery time increased by only 82%.


Code          NRC  Reads  1 zone (Baseline)  1 zone (3 groups)  3 zones
Reed-Solomon  15   14.42  121                179                190
Azure-LRC     10   9.73   88 (0.73)          158 (0.88)         162 (0.85)
Azure-LRC+1   7.5  7.21   80 (0.66)          148 (0.82)         148 (0.78)

Table 6.4: Number of blocks read per lost data block and repair time when running on one zone and on three zones, with the repair time normalized to Reed-Solomon in parentheses.

The full-LRCs exhibited the worst load distribution. All the data was read from theOSDs in the zone of the failed OSD, and two-thirds of it was transferred to a primaryOSD on another zone (assuming the distribution of primary OSD to zones was uniform).As previously mentioned, to alleviate the pressure on the SSDs in that zone, we usedProvisioned IOPS SSDs rather than general purpose SSDs, which increased the maximumIOPS per volume from 150 to 2500.

6.5.2 Weighted evaluation

We changed the configuration and the method of the experiment in order to bypass Ceph's limitation for our specific configuration. First, we ensured that the primary OSD always resides in the same zone as the failed OSD. We ran this experiment twice: in the first setup, both the primary OSD and the failed OSD belonged to a data group; in the second setup, both belonged to a global-parity group. For full-LRCs, both setups are equivalent: all blocks are reconstructed from blocks in the same group. For Reed-Solomon, all recovery scenarios follow the second setup: recovery requires blocks from different groups. For data-LRCs, data and local parity blocks are recovered according to the first setup, and global parities are recovered according to the second setup. We calculated the weighted average of these two setups to obtain the expected recovery time for each code.

We used (15,8,4) as our configuration, having excluded it from our previous analysis due to its high overhead. Nevertheless, it has the desirable property that all the groups in all codes have the same size (5). This ensures that all placement groups include the same number of OSDs in each zone. In this configuration, the full-LRCs are equivalent in their distribution of data and parity blocks to groups; we use Azure-LRC+1 as our full-LRC in this experiment.
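The weighted average itself is a simple calculation. The sketch below illustrates it for a data-LRC in the (15,8,4) configuration, where 10 of the 15 blocks (data and local parities) follow the first, in-group setup and the 5 global parities follow the second, cross-group setup; the timing inputs are placeholders, not measured values:

# Sketch: expected repair time as a weighted average of the two setups.
# num_global is the number of blocks recovered via the cross-group
# setup; the remaining blocks follow the in-group setup.
def weighted_repair_time(n, num_global, t_in_group, t_cross_group):
    w = num_global / n
    return (1 - w) * t_in_group + w * t_cross_group

# Placeholder times; a full-LRC would use num_global=0 and
# Reed-Solomon num_global=n.
print(weighted_repair_time(n=15, num_global=5,
                           t_in_group=150.0, t_cross_group=180.0))  # 160.0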


For comparison, we repeated this experiment with all OSDs in a single zone, but with the same restriction on the allocation of OSDs to groups. This setup eliminates the cross-zone network bottleneck, while I/O parallelism remains limited as in the three-zone experiment. Our baseline is the unrestricted setup used in the rest of this section.

Table 6.4 shows the amount of data read and the weighted average of the repair time in this experiment. It shows that restricting the number of nodes that participate in the repair process significantly reduces its throughput. When all the OSDs are deployed in the same zone, this restriction increases the repair time by 48% to 85%. The increase is lower for Reed-Solomon because it can still utilize twice as many OSDs as the LRCs. The addition of the cross-zone network bottleneck further increases the repair time of Reed-Solomon (by 6%) and of Azure-LRC (by 2.5%), but does not affect Azure-LRC+1, which does not incur any cross-zone transfers for repair.

These results demonstrate the well-known tradeoff between I/O parallelism and locality. They confirm that data-LRCs and full-LRCs are expected to achieve the highest benefit in large-scale deployments, where sufficient I/O parallelism can be achieved within a single zone.


7 Related Work

Erasure coding is an important field, growing in popularity as cloud storage services become more common. Our research focuses on reducing the repair cost of erasure coding. In this section we survey the relevant research on repair-cost reduction and other aspects of erasure coding.

Efforts to reduce the repair cost can be classified into two main lines of research: LRCs, which attempt to reduce the repair cost by reducing the number of nodes participating in the repair process [11, 14, 15, 24, 36], and regenerating codes, which strive to attain the same general goal by reducing the network bandwidth utilized in the course of repair [8, 31, 32].

7.1 Locally Repairable Codes

The benefits of codes with locality were first realized in Pyramid codes [14], before the notion was isolated into a stand-alone concept in the information-theory community [11]. The basic construction of Pyramid codes [14] assumes that an MDS code is subdivided into multiple local groups, which can be used for local repair along with the global parities of the MDS code, which provide additional protection against data block loss. Unlike the initial MDS code on which a Pyramid code is based, the Pyramid code itself and the entire LRC family are non-MDS, meaning their minimum distance is smaller than that of an MDS code, as explained in Section 2. The work on Pyramid codes indicated possible savings in repair cost, thereby propelling further research on LRCs. In particular, Huang et al. [15] developed LRC codes and observed substantial savings in the repair cost of Microsoft Azure storage attained by using them, and Gopalan et al. [11] developed the coding-theoretic side of the notion of LRCs. Sathiamoorthy et al. [36] presented the full-LRC approach in Xorbas, described in detail in Section 3.2. By careful construction of the code's local parities, they were the first to reduce the repair cost of the global parities as well.


Finally, a family of codes called Sector-Disk codes [20, 26] addresses the recovery of a lost block within an otherwise healthy node. The codes constructed in these works add parity blocks that allow efficient recovery of bad hard disk sectors or SSD blocks, which would otherwise require entire parity nodes for failure recovery. This approach relies on the assumption that recovery in disks has an aspect of locality. Thus, storing local parity sectors (instead of entire parity disks) improves recovery performance.

Our study complements previous research by providing a thorough analysis and comparison of state-of-the-art LRCs. We have presented an improvement to the code developed by Tamo and Barg [39] by expanding the possible (n, k, r) configurations and improving minimum-distance optimality (Section 3.5.4.3). We proved that our new construction provides the largest possible distance for (n, k, r) LRC configurations. Our work has also presented a method of relaxing the (n, k, r) constraints in Azure-LRC (Section 3.3.1). In Section 3.4 we presented a method of transforming Azure-LRC, which is a data-LRC, into a full-LRC.

7.2 Minimum Storage Regenerating Codes

Minimum storage regenerating (MSR) codes [8, 32] are a class of MDS codes designed to optimize recovery network bandwidth rather than the number of accessed storage devices. MSR codes and related families, such as RotatedRS [17], Hitchhiker-XOR [30], Butterfly [9] and Zigzag [40] codes, divide each data and parity block into smaller chunks, such that only a subset of each block's chunks is required for the repair of a failed node. RotatedRS [17] allows blocks that are encoded with RS [33] and required for both reconstruction and reads to be combined into a single degraded read operation, minimizing network bandwidth. Similarly, Hitchhiker-XOR [30] combines multiple RS-encoded bytes of data to further encode the RS global parities with XOR operations, reducing network traffic and disk I/O without additional storage. Butterfly codes [9] require transferring only half of the surviving data when reconstructing a lost block, but are constructed only for two parities. Similarly to Butterfly, each Zigzag [40] parity is a linear combination of chunks from the k data blocks, each of which is split into smaller chunks; however, Zigzag offers a construction for any r = n − k parity blocks. Rashmi et al. [29] constructed an MSR code that reduces the amount of data read from some of the surviving nodes, but is applicable only to clusters with n = 2k. These codes reduce the rebuilding ratio: the portion of the surviving nodes' data that must be read during recovery. All MDS codes with the same d have the same overhead, and can be directly compared by their rebuilding ratio. However, this metric is also limited in its ability to predict recovery costs in a real system: these costs depend on the granularity of the non-sequential I/O accesses incurred when reading arbitrary chunks from each block [25].
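To make the rebuilding-ratio metric concrete, here is a minimal sketch under the definition above; the parameters are illustrative, and the 1/r ratio stands in for an optimal-rebuilding code such as Zigzag:

# Minimal sketch of the rebuilding ratio: the portion of the surviving
# nodes' data read during recovery of one block. Values are illustrative;
# see the cited works for the exact constructions.
def rebuilding_ratio(data_read, surviving_blocks, block_size=1.0):
    return data_read / (surviving_blocks * block_size)

n, k = 14, 10
r = n - k

# Reed-Solomon reads k whole blocks out of the n - 1 surviving ones.
rs = rebuilding_ratio(data_read=k, surviving_blocks=n - 1)

# An MDS code with optimal rebuilding reads 1/r of each surviving block.
optimal = rebuilding_ratio(data_read=(n - 1) / r, surviving_blocks=n - 1)

print(f"RS: {rs:.2f}, optimal rebuilding: {optimal:.2f}")   # 0.77 vs. 0.25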

7.3 System Level Optimizations

An alternative approach reduces the recovery costs of existing codes. One such approach was introduced in the work of Wang et al. [42] and developed by Guruswami and Wootters [13]. In particular, the work of Guruswami and Wootters [13] considers linear repair schemes for Reed-Solomon codes. They showed that their Reed-Solomon reconstruction scheme, which reads partial blocks instead of the k whole blocks otherwise required, reduces the network repair bandwidth compared to the trivial approach. A different approach achieves the reduction by delaying recovery to amortize its costs over more than one failure [37]. This approach can also be applied to existing local reconstruction codes; however, it reduces the fault tolerance of the system, which is equivalent to reducing d. Yet another approach to node repair relies on attached non-volatile memory for caching additional parity blocks, thus reducing traffic and storage overheads [35]. This effectively increases the storage overhead of the system. Our comparative framework, using NRC, can also be extended to evaluate the above approaches.
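For reference, a minimal sketch of how NRC can be computed in such an evaluation, assuming the definition used in this work (per-block repair costs summed over all n blocks and normalized by k); the example code and its costs are hypothetical:

# Minimal sketch of the NRC metric, assuming the definition used in this
# work: the sum of per-block repair costs over all n blocks, normalized
# by k. The example costs below are hypothetical.
def nrc(per_block_repair_costs, k):
    return sum(per_block_repair_costs) / k

# Hypothetical (n=10, k=6, r=3) data-LRC: 8 blocks (data + local parities)
# cost r=3 reads each, and the 2 global parities cost k=6 reads each.
print(nrc([3] * 8 + [6] * 2, k=6))   # 6.0 blocks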

Another relevant work, by Xia et al. [47], builds on Azure-LRC to dynamically adjust the system's overall storage overhead and average recovery speed by migrating hot and cold data to arrays with more or fewer parity nodes, respectively. A similar approach can be applied to the codes in our study. Our evaluation based on NRC can more accurately capture the true repair cost in different configurations.

Partial-Parallel-Repair [23] and repair pipelining [21] are approaches for reducing recovery latency by parallelizing recovery operations. Partial-Parallel-Repair (PPR) [23] divides reconstruction operations into several fragments and schedules them in parallel on the nodes participating in the reconstruction. PPR has been shown to reduce reconstruction time compared to Reed-Solomon. Repair pipelining [21] also breaks down reconstruction operations into smaller fragments, but by pipelining them it performs better than PPR. This approach is compatible with existing erasure codes, including LRCs, and does not require additional storage overhead. It uses a chain of nodes in the cluster to implement the pipeline, and is thus susceptible to poor links or slow-performing nodes, which might degrade recovery performance. Compared to standard Azure-LRC, Azure-LRC with repair pipelining has been shown to reduce recovery time.
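To illustrate why pipelining shortens repair, consider the following toy timing model (ours, not taken from the cited evaluations):

# Toy timing model of repairing one block of size B (MB) by combining
# k surviving blocks over links of bandwidth W (MB/s). Conventional
# repair ships k whole blocks to one node; pipelining splits each block
# into s fragments and streams them along a chain of the k helpers,
# overlapping the transfers.
def conventional_repair_time(k, B, W):
    return k * B / W                     # the requestor's link is the bottleneck

def pipelined_repair_time(k, B, W, s):
    # the pipeline drains after s fragment-times plus k - 1 hop delays
    return (B / (s * W)) * (s + k - 1)

k, B, W = 10, 256.0, 100.0               # hypothetical parameters
print(conventional_repair_time(k, B, W))         # 25.6 s
print(pipelined_repair_time(k, B, W, s=32))      # ~3.3 s, approaching B / W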

Giza [7] is an approach for reducing cross-data-center latency. In order to protect user data, storage systems can apply an additional layer of protection on top of erasure coding. On top of LRC, Microsoft Windows Azure optionally replicates user data to a secondary data center, which ensures protection from large data center failures [7]. Giza separates the data and metadata paths while maintaining strong data consistency. As a result, it achieves lower latency than a naive global-consistency approach. This approach has been effectively applied to Azure-LRC and can be applied to the entire LRC family. However, the additional replication is a service suitable only for users whose data is extremely valuable; otherwise, the additional storage cost might be too expensive.

7.4 Write Performance

Whenever erasure-coded data is updated, its parity must also be updated. Optimizing these additional writes is an important objective in erasure-coded systems. There are two main methods for applying write operations: in in-place updates, the update is applied directly to the current data and overwrites it; in out-of-place updates, also known as logging, the current data is marked as invalid and the new data is appended to the end of the log.

In-place updates in erasure-coded systems may incur a large I/O overhead, because every write also incurs a parity update. This problem is especially acute in the case of small writes, i.e., writes of less than an entire stripe [10]. Logging reduces this overhead for small writes by amortizing the parity writes over several data writes. However, it may degrade read performance by fragmenting logically sequential data. Specifically, this fragmentation may slow down recovery, which requires sequential reads of both data and parity [6].

To address the small-write problem, Stodolsky et al. [38] introduced the approach of parity logging: small parity updates are accumulated until the resulting writes are large enough for efficient I/O. Chan et al. [6] used parity logging to improve recovery performance by applying in-place updates to data while logging parity updates. Most parity logging approaches [6, 16, 38] log only the delta of the parity when new data is written. This somewhat improves the performance of logging, but still requires a significant amount of write-after-read for each parity update. Thus, erasure coding with parity logging is still considered worse than replication with respect to update performance [6].
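To illustrate the delta being logged, here is a minimal sketch for an XOR-based parity; the helper is ours, and linear codes over larger fields would scale the delta by the corresponding encoding coefficient:

# Minimal sketch of parity-delta logging for an XOR-based parity.
# The old data must still be read to compute the delta, which is the
# write-after-read cost noted above.
def parity_delta(old_data: bytes, new_data: bytes) -> bytes:
    # delta such that new_parity = old_parity XOR delta
    return bytes(a ^ b for a, b in zip(old_data, new_data))

old_data, new_data = b"\x01\x02\x03", b"\x01\x0f\x03"
delta = parity_delta(old_data, new_data)        # appended to the parity log
old_parity = b"\x10\x20\x30"
new_parity = bytes(p ^ d for p, d in zip(old_parity, delta))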

A different approach, which reduces the number of writes, was presented in Parix [18]. Parix performs speculative partial writes that enable fast parity logging by calculating parities from data deltas when updating parity values. This method was shown to outperform existing parity logging schemes. However, speculative writes can increase the number of IOPS in the cluster if the speculations are incorrect. These approaches for improving write performance can also be applied to LRCs; however, as mentioned, they can degrade read performance or increase the number of IOPS in the system.


8 Conclusions

In this study, we performed the first systematic comparison of full-LRCs and data-LRCs. To perform this analysis, we extended the popular data-LRC used in Windows Azure (Azure-LRC) to a full-LRC that efficiently reconstructs global parities (Azure-LRC+1). We also extended an existing full-LRC to apply to a wide range of parameter combinations, and implemented it in a distributed object store (Optimal-LRC). In our analysis, we demonstrated the limitations of existing metrics and introduced a new metric (NRC) that successfully captures each code's overhead and repair cost. Using this metric, we showed the advantage of Optimal-LRC's flexibility, and that it indeed offers the optimal tradeoff between repair cost and fault tolerance.

We further evaluated these codes on a small cluster deployed on Amazon EC2. We validated the new metric, NRC, and showed that it can successfully predict the repair cost of the different codes, along with the benefit from different possible allocations of local parities. We showed how this benefit depends on the underlying storage device by comparing SSD- and HDD-based clusters. We demonstrated the codes' sensitivity to the network architecture by deploying our cluster on three separate data centers. We also showed how the network and storage bandwidth consumed by repair with different codes affect an application running in the foreground during repair. These results are valuable for determining which code best suits each system architecture, and how to achieve the required fault tolerance and recovery efficiency objectives at the lowest cost.


References

[1] Amazon EBS volumes. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumes.html, 2017. [Accessed: 2017-09-24].

[2] Amazon EC2 instance types. https://aws.amazon.com/ec2/instance-types, 2017. [Accessed: 2017-09-22].

[3] Amazon EC2 regions and availability zones. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html, 2017. [Accessed: 2017-09-22].

[4] Jerasure erasure code plugin. http://docs.ceph.com/docs/hammer/rados/operations/erasure-code-jerasure/, 2017. [Accessed: 2017-09-24].

[5] Locally repairable erasure code plugin. http://docs.ceph.com/docs/hammer/rados/operations/erasure-code-lrc/, 2017. [Accessed: 2017-09-24].

[6] J. C. Chan, Q. Ding, P. P. Lee, and H. H. Chan. Parity logging with reserved space: towards efficient updates and recovery in erasure-coded clustered storage. In FAST, pages 163–176, 2014.

[7] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips. Giza: Erasure coding objects across global data centers. In 2017 USENIX Annual Technical Conference (ATC), pages 539–551, 2017.

[8] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Transactions on Information Theory, 56(9):4539–4551, Sept 2010.


[9] E. En-Gad, R. Mateescu, F. Blagojevic, C. Guyot, and Z. Bandic. Repair-optimal MDS array codes over GF(2). In IEEE International Symposium on Information Theory (ISIT), 2013.

[10] G. A. Gibson. Redundant disk arrays: Reliable, parallel secondary storage. 1992.

[11] P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin. On the locality of codeword symbols. IEEE Transactions on Information Theory, 58(11):6925–6934, November 2012.

[12] K. M. Greenan, J. S. Plank, J. J. Wylie, et al. Mean time to meaningless: MTTDL, Markov models, and storage system reliability. In HotStorage, pages 1–5, 2010.

[13] V. Guruswami and M. Wootters. Repairing Reed-Solomon codes. In 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2016.

[14] C. Huang, M. Chen, and J. Li. Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems. Trans. Storage, 9(1):3:1–3:28, Mar. 2013.

[15] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure coding in Windows Azure Storage. In USENIX Annual Technical Conference (ATC), 2012.

[16] C. Jin, D. Feng, H. Jiang, and L. Tian. RAID6L: A log-assisted RAID6 storage architecture with improved write performance. In Mass Storage Systems and Technologies (MSST), 2011 IEEE 27th Symposium on, pages 1–6. IEEE, 2011.

[17] O. Khan, R. Burns, J. S. Plank, W. Pierce, and C. Huang. Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads. In 10th Usenix Conference on File and Storage Technologies (FAST), 2012.

[18] H. Li, Y. Zhang, Z. Zhang, S. Liu, D. Li, X. Liu, and Y. Peng. Parix: Speculative partial writes in erasure-coded systems. In 2017 USENIX Annual Technical Conference (ATC), pages 581–587, 2017.


[19] J. Li and X. Tang. Optimal exact repair strategy for the parity nodes of the (k + 2, k) zigzag code. IEEE Transactions on Information Theory, 62(9):4848–4856, Sept 2016.

[20] M. Li and P. P. C. Lee. STAIR codes: A general family of erasure codes for tolerating device and sector failures. Trans. Storage, 10(4):14:1–14:30, Oct. 2014.

[21] R. Li, X. Li, P. P. Lee, and Q. Huang. Repair pipelining for erasure-coded storage. In 2017 USENIX Annual Technical Conference (ATC), pages 567–579, 2017.

[22] J. Liu, S. Mesnager, and L. Chen. New constructions of optimal locally recoverable codes via good polynomials. IEEE Transactions on Information Theory, 2018. To appear.

[23] S. Mitra, R. Panta, M.-R. Ra, and S. Bagchi. Partial-parallel-repair (PPR): a distributed technique for repairing erasure coded storage. In Proceedings of the Eleventh European Conference on Computer Systems, page 30. ACM, 2016.

[24] F. Oggier and A. Datta. Self-repairing homomorphic codes for distributed storage systems. In Proc. 2011 IEEE INFOCOM, pages 1215–1223, 2011.

[25] L. Pamies-Juarez, F. Blagojevic, R. Mateescu, C. Guyot, E. En-Gad, and Z. Bandic. Opening the chrysalis: On the real repair performance of MSR codes. In 14th Usenix Conference on File and Storage Technologies (FAST), 2016.

[26] J. S. Plank and M. Blaum. Sector-disk (SD) erasure codes for mixed failure modes in RAID systems. Trans. Storage, 10(1):4:1–4:17, Jan. 2014.

[27] J. S. Plank, K. M. Greenan, and E. L. Miller. Screaming fast Galois field arithmetic using Intel SIMD instructions. In 11th USENIX Conference on File and Storage Technologies (FAST), 2013.

[28] J. S. Plank, J. Luo, C. D. Schuman, L. Xu, and Z. Wilcox-O'Hearn. A performance evaluation and examination of open-source erasure coding libraries for storage. In 7th Usenix Conference on File and Storage Technologies (FAST), 2009.


[29] K. Rashmi, P. Nakkiran, J. Wang, N. B. Shah, and K. Ramchandran. Having your cake and eating it too: Jointly optimal erasure codes for I/O, storage, and network-bandwidth. In 13th USENIX Conference on File and Storage Technologies (FAST), 2015.

[30] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran. A “hitchhiker’s” guide to fast and efficient data reconstruction in erasure-coded data centers. In ACM SIGCOMM, 2014.

[31] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran. A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster. In 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2013.

[32] K. V. Rashmi, N. B. Shah, and P. V. Kumar. Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction. IEEE Transactions on Information Theory, 57(8):5227–5239, Aug 2011.

[33] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics, 8(2):300–304, 1960.

[34] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26–52, 1992.

[35] E. Rosenfeld, N. Amit, and D. Tsafrir. Using disk add-ons to withstand simultaneous disk failures with fewer replicas. 7th Annual Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA), 2013.

[36] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing elephants: novel erasure codes for big data. In 39th International Conference on Very Large Data Bases (VLDB), 2013.

[37] M. Silberstein, L. Ganesh, Y. Wang, L. Alvisi, and M. Dahlin. Lazy means smart: Reducing repair bandwidth costs in erasure-coded distributed storage. In International Conference on Systems and Storage (SYSTOR), 2014.


[38] D. Stodolsky, G. Gibson, and M. Holland. Parity logging overcoming the small write problem in redundant disk arrays. In ACM SIGARCH Computer Architecture News, volume 21, pages 64–75. ACM, 1993.

[39] I. Tamo and A. Barg. A family of optimal locally recoverable codes. IEEE Transactions on Information Theory, 60(8):4661–4676, Aug 2014.

[40] I. Tamo, Z. Wang, and J. Bruck. Zigzag codes: MDS array codes with optimal rebuilding. IEEE Transactions on Information Theory, 59(3):1597–1616, March 2013.

[41] F. Wang, M. Nelson, S. Oral, S. Atchley, S. Weil, B. W. Settlemyer, B. Caldwell, and J. Hill. Performance and scalability evaluation of the Ceph parallel file system. In Proceedings of the 8th Parallel Data Storage Workshop. ACM, 2013.

[42] Z. Wang, A. Dimakis, and J. Bruck. Rebuilding for array codes in distributed storage systems. In GLOBECOM Workshops (GC Wkshps), pages 1905–1909. IEEE, 2010.

[43] H. Weatherspoon, J. Kubiatowicz, et al. Erasure coding vs. replication: A quantitative comparison. In IPTPS, volume 1, pages 328–338. Springer, 2002.

[44] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In 7th Symposium on Operating Systems Design and Implementation (OSDI), 2006.

[45] S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In ACM/IEEE Conference on Supercomputing (SC), 2006.

[46] S. A. Weil, A. W. Leung, S. A. Brandt, and C. Maltzahn. RADOS: A scalable, reliable storage service for petabyte-scale storage clusters. In 2nd International Workshop on Petascale Data Storage (PDSW): Held in Conjunction with Supercomputing, 2007.

[47] M. Xia, M. Saxena, M. Blaum, and D. A. Pease. A tale of two erasure codes in HDFS. In 13th USENIX Conference on File and Storage Technologies (FAST), 2015.


[48] M. Ye and A. Barg. Explicit constructions of MDS array codes and RS codes with optimal repair bandwidth. In 2016 IEEE International Symposium on Information Theory (ISIT), 2016.

[49] A. Zeh and E. Yaacobi. Bounds and constructions of codes with multiple localities, 2016. arXiv:1601.02763.


A NRC and the degraded cost

[Figure: the NRC of Opt-LRC, Azure-LRC, Azure-LRC+1, and Xorbas, alongside the average degraded read cost, measured in blocks, for a range of (n, k, r) configurations.]


B Minimum distance

[Figure: the minimum distance d of Opt-LRC, Azure-LRC, Azure-LRC+1, and Xorbas for a range of (n, k, r) configurations.]


C Codes with d ≥ 5

[Figure: the NRC of Opt-LRC, Azure-LRC, Azure-LRC+1, and Xorbas for configurations with d ≥ 5: (n, k, r) = (15, 10, 5), (16, 10, 5), (16, 11, 5), (17, 11, 5), (17, 12, 5), and (18, 12, 5).]

