2012 Storage Developer Conference. © 2012 GoDaddy.com. All Rights Reserved.

Improving Cloud Storage Cost and Data Resiliency with Erasure Codes

Michael Penick

Commodity Storage

- Hosting storage
- FTP backup
- Goals
  - Inexpensive (use “commodity” hardware)
  - Resilient to failures
  - Highly available
  - Customizable

MogileFS

- Open source distributed filesystem
- Written by Brad Fitzpatrick
- No single point of failure
- Automatic/asynchronous file replication
- Shared-nothing design (disks)
- Local filesystem agnostic
- Flat namespace

[Diagram: MogileFS architecture, showing Clients, Tracker, Storage Node, and MetadataDB]

“NebulaFS”

- Large file support
- Offsite replication
- Self-healing
- Data retention
- C++ client (PHP and Perl SWIG wrappers)
- Metadata sharding
- Range GETs

[Diagram: NebulaFS architecture, showing Clients, combined Tracker / Storage Nodes each paired with MySQL, and additional Storage Nodes]

FTP Backup

[Diagram: FTP backup stack. FTP Presentation (Net::FTPServer), VFS DB, NebulaFSAPI, and NebulaFS (Metadata DB, Super Nodes, Storage Nodes)]

Widely Applicable

- “Storage service” (REST) layer
- New product integrations
  - Online File Folder (videos and images)
  - Website Builder / Photo Album
  - Go Daddy Cloud Servers (snapshots)
  - Email
  - …

Object Storage

[Diagram: object storage stack. RESTful Presentation (S3, GDCS), VFS, VFS DB, User DB, NebulaFSAPI, and NebulaFS (Metadata DB, Super Nodes, Storage Nodes)]

Why?

[Figure: two charts of storage share by product. First chart (~3.25 PB): legend Aries, FTP, WST/PA, OFF, VDC, Other; slice values 5.39%, 8.51%, 83.89%, 1.20%, 1.01%. Second chart (~10.8 PB): legend Aries, FTP, WST/PA, OFF, VDC, Email, Other; slice values 1.80%, 2.56%, 38.44%, 1.44%, 55.44%, 0.30%.]

The Problem

- NebulaFS = Inexpensive, resilient, highly available storage
- Problem: Disk drives fail... a lot.
  - F = mean time to failure of a single device
  - In a system of n devices, the mean time to failure is F/n
- Solution: Replicate the data
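
The F/n estimate can be made concrete with a quick calculation. The sketch below assumes independent, identical drives; the function name and the 1,000,000-hour per-drive figure are illustrative assumptions, not values from the slides.

```python
def system_mttf(f_hours: float, n_devices: int) -> float:
    """Mean time to the first failure among n independent, identical devices."""
    if n_devices < 1:
        raise ValueError("need at least one device")
    return f_hours / n_devices


# Assumed per-drive MTTF in hours (illustrative only).
drive_mttf = 1_000_000.0

for n in (1, 30, 180):
    hours = system_mttf(drive_mttf, n)
    print(f"{n:>3} drives: a failure every {hours:,.0f} hours ({hours / 24:,.0f} days)")
```

The more drives a system has, the more routine failures become, which is why the design has to treat drive loss as a normal event.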

Replication

[Diagram: replication. The file is duplicated into Copy 1 through Copy 4, and the copies are written to Disk 1, Disk 2, ..., Disk n.]

Replication

- Simple and effective
- Durability:
  - 99.99999% over 1 year (or 0.1 of 1 million objects lost)
  - 99.99% over 3 years (or 100 of 1 million objects lost)
- Problem: 100% overhead per copy
  - +300% overhead for 3 onsite and 1 offsite copy
- There has to be a better way.
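
The "objects per million" figures follow directly from the durability percentages, and the overhead figure from the copy count. A minimal sketch of that arithmetic, with hypothetical function names and `Decimal` used to avoid floating-point noise in values this close to 100%:

```python
from decimal import Decimal


def expected_losses(durability_percent: str, objects: int = 1_000_000) -> Decimal:
    """Expected number of objects lost, given durability as a percentage string."""
    loss_probability = (Decimal(100) - Decimal(durability_percent)) / Decimal(100)
    return loss_probability * objects


def replication_overhead_percent(copies: int) -> int:
    """Extra storage beyond the first copy, as a percentage of the object size."""
    return (copies - 1) * 100


print(expected_losses("99.99999"))      # 1-year durability from the slide
print(expected_losses("99.99"))         # 3-year durability from the slide
print(replication_overhead_percent(4))  # 3 onsite copies + 1 offsite copy
```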

Erasure Codes

- Forward error correction code
- Add redundant data (codes) to message so that it can be recovered
- Where’s EC used?
  - Optical media
  - Media streaming
  - File systems (RAID-6, several distributed FS, …)

Erasure code (write)

[Diagram: erasure-coded write with k = 4 and m = 3. The file is divided into 4 data pieces (the last one zero-padded), encoded to produce 3 additional coding pieces, and the 7 pieces are spread across Disk 1, Disk 2, ..., Disk n. The data pieces are labeled Copy 1 and the coding pieces Copy .75.]

Erasure code (read)

[Diagram: erasure-coded read. Pieces are read from Disk 1, Disk 2, ..., Disk n, verified, and k of them are decoded back into the original file.]

Erasure codes

- What?
  - k: number of original pieces
  - m: number of redundant pieces (codes)
- How?
  - k = 4, m = 3: only 75% overhead (tolerates 3 failures)
  - k = 10, m = 6: only 60% overhead (tolerates 6 failures)
  - k = 9, m = 3: only 33% overhead (tolerates 3 failures)
- AKA: k = 10, m = 6 is “10 of 16”
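
The overhead percentages above are simply m / k. A short sketch (the function name is hypothetical) that reproduces the three configurations and the "k of k + m" naming:

```python
def ec_overhead_percent(k: int, m: int) -> float:
    """Redundant storage as a percentage of the original data size."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    return 100.0 * m / k


for k, m in [(4, 3), (10, 6), (9, 3)]:
    print(f"{k} of {k + m}: {ec_overhead_percent(k, m):.0f}% overhead, "
          f"tolerates {m} lost pieces")
```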

Trade-offs (positive)

- Better resilience to failure
- Durability for 10 of 16:
  - 99.9999999999% over 1 year (or 0.000001 of 1 million objects)
  - 99.99999% over 3 years (or 0.1 of 1 million objects)
- Durability for 9 of 12:
  - 99.99999% over 1 year (or 0.1 of 1 million objects)
  - 99.99% over 3 years (or 100 of 1 million objects)
- Significant savings (includes a full offsite copy)
  - 10 of 16: (4 - 2.60) / 4 = 35% savings (60% w/o offsite)
  - 9 of 12: (4 - 2.33) / 4 = 42% savings (67% w/o offsite)
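
The savings figures can be reproduced with a few lines of arithmetic. This sketch assumes the baseline is 4 full replicas (3 onsite plus 1 offsite) and that the "w/o offsite" figure compares an erasure-coded object with no offsite copy against that same baseline; that reading is an inference, chosen because it matches the numbers on the slide. Function names are hypothetical.

```python
def stored_copies(k: int, m: int, offsite_copies: int = 1) -> float:
    """Raw storage per object, in units of the object's own size."""
    return (k + m) / k + offsite_copies


def savings_percent(k: int, m: int, offsite_copies: int = 1,
                    baseline_copies: int = 4) -> float:
    """Savings versus the replication baseline (assumed: 4 full copies)."""
    used = stored_copies(k, m, offsite_copies)
    return 100.0 * (baseline_copies - used) / baseline_copies


for k, m in [(10, 6), (9, 3)]:
    with_offsite = savings_percent(k, m)
    without_offsite = savings_percent(k, m, offsite_copies=0)
    print(f"{k} of {k + m}: {with_offsite:.0f}% savings "
          f"({without_offsite:.0f}% w/o offsite)")
```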

Trade-offs (negative)

- Computationally expensive
- Increased number of IOPS
- Complexity (additional metadata)
- More nodes and connections

Erasure Codes

- Optimal erasure code
  - Any k of the k + m pieces can recover the message
  - Reed-Solomon (and Cauchy Reed-Solomon)
- Libraries (Jerasure, Zfec, Luby, librs, …)
  - Stability/performance evaluation
  - Paper: “A Performance Comparison of Open-Source Erasure Coding Libraries for Storage Applications” http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
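
The "any k pieces" property is easiest to see with the simplest optimal code: a single XOR parity piece (m = 1). The sketch below is not Reed-Solomon and is not taken from any of the libraries listed; it only illustrates the property, and the function names are hypothetical. Handling m > 1 requires Galois-field arithmetic, which is what the libraries provide.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal pieces (zero-padded) and append one XOR parity piece."""
    piece_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(piece_len * k, b"\0")
    pieces = [padded[i * piece_len:(i + 1) * piece_len] for i in range(k)]
    pieces.append(reduce(xor_bytes, pieces))
    return pieces


def decode(pieces: List[Optional[bytes]], k: int, length: int) -> bytes:
    """Rebuild the data from k + 1 pieces, at most one of which is None (erased)."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("a single parity piece tolerates only one erasure")
    restored = list(pieces)
    if missing:
        present = [p for p in pieces if p is not None]
        # XOR of all surviving pieces equals the erased piece.
        restored[missing[0]] = reduce(xor_bytes, present)
    return b"".join(restored[:k])[:length]


message = b"erasure codes trade CPU for disk"
stored = encode(message, k=4)
stored[2] = None  # simulate a failed disk
assert decode(stored, k=4, length=len(message)) == message
```

Whichever one of the five pieces is erased, the remaining four are enough to rebuild the message, at a storage overhead of 25%.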

EC Libraries - zfec

- Reed-Solomon
- Written in C (Python and Haskell bindings)
- Download: http://pypi.python.org/pypi/zfec
- Documentation is the source code

EC Libraries – zfec Encoding

EC Libraries – zfec Decoding

EC Libraries – zfec Decoding contd.

k = 6, m = 2, erasures = { 0, 2, -1 }

index = { 6, 1, 7, 3, 4, 5 }

inpkts: coding 0, data 1, coding 1, data 3, data 4, data 5

outpkts: data 0, data 2
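
The index array in this example can be derived mechanically from the erasure list. The sketch below assumes the convention the example illustrates: a surviving data piece stays in its own slot, each erased data slot is filled with the next unused surviving coding piece (numbered k, k + 1, ...), and -1 terminates the erasure list. `build_index` is a hypothetical helper written for illustration, not a zfec function.

```python
from typing import List


def build_index(k: int, m: int, erasures: List[int]) -> List[int]:
    """Return the piece id to place in each of the k input slots for decoding."""
    lost = set()
    for piece_id in erasures:
        if piece_id == -1:  # terminator, as in the example above
            break
        lost.add(piece_id)

    # Coding pieces are numbered k .. k + m - 1; keep only the survivors.
    spare_coding = [i for i in range(k, k + m) if i not in lost]

    index = []
    for slot in range(k):
        if slot not in lost:
            index.append(slot)  # data piece survives, stays in place
        elif spare_coding:
            index.append(spare_coding.pop(0))  # substitute a coding piece
        else:
            raise ValueError("more than m pieces lost; cannot decode")
    return index


print(build_index(6, 2, [0, 2, -1]))  # [6, 1, 7, 3, 4, 5]
```

Slots 0 and 2 receive coding pieces 6 and 7 ("coding 0" and "coding 1"), and the decoder's output is the two missing data pieces.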

EC Libraries - Jerasure

- Reed-Solomon, Cauchy Reed-Solomon, and Minimal Density Codes
- Written in C (no bindings)
- Download: http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html
- Good documentation and examples

EC Libraries – Jerasure Encoding

EC Libraries – Jerasure Decoding

EC Libraries – Performance

[Chart: encoding throughput in MB/s (axis 0 to 2500) for Jerasure RS, Jerasure CRS, and zfec at w = 8, w = 16, and w = 32]

EC Libraries – Performance

[Chart: decoding throughput in MB/s (axis 0 to 900) for Jerasure RS, Jerasure CRS, and zfec at w = 8, w = 16, and w = 32]

Integration – EC Library

- EC library (Phase I)
  - Read/Copy
  - Write
  - Repair

Integration – EC Library

- Inputs/outputs abstracted
- boost::asio (HTTP)
- PHP/Perl bindings
- Random access reads (i.e. Range GET)
- Data validated/corrected on-the-fly
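
Random access reads mean a Range GET only needs the data pieces that overlap the requested bytes. The sketch below shows one way to map a byte range onto pieces. It assumes the object is divided into k contiguous data pieces of equal length; the actual NebulaFS layout is not described in the slides, and `pieces_for_range` is a hypothetical helper.

```python
from typing import List, Tuple


def pieces_for_range(offset: int, length: int, piece_len: int,
                     k: int) -> List[Tuple[int, int, int]]:
    """Return (piece_index, start_in_piece, byte_count) reads covering the range."""
    if piece_len <= 0 or k <= 0:
        raise ValueError("piece_len and k must be positive")
    if offset < 0 or length < 0 or offset + length > piece_len * k:
        raise ValueError("range falls outside the object")

    reads = []
    pos, end = offset, offset + length
    while pos < end:
        piece = pos // piece_len
        start = pos % piece_len
        count = min(piece_len - start, end - pos)
        reads.append((piece, start, count))
        pos += count
    return reads


# Bytes 900..1199 of an object stored as 4 pieces of 1000 bytes each.
print(pieces_for_range(offset=900, length=300, piece_len=1000, k=4))
# [(0, 900, 100), (1, 0, 200)]
```

When every touched piece is healthy, no decoding is needed; coding pieces only come into play if one of the required data pieces is unavailable or fails validation.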

Integration – EC Library Writes

[Diagram: EC library write path. The input is split into k data pieces, m coding pieces are computed, and the pieces are written to Disk 1, Disk 2, ..., Disk n.]

Integration – EC Library Failures

Integration – EC Library Reads

[Diagram: EC library read path. Pieces are read from Disk 1, Disk 2, ..., Disk n, and k of them are decoded back into the original data.]

Integration – EC Library Copy

[Diagram: EC library copy. Pieces stored on Disk 1, Disk 2, ..., Disk n are copied to a second set of disks (Disk 1, Disk 2); the label k marks the pieces involved.]

Integration – EC Library Repair

[Diagram: EC library repair. Pieces on Disk 1, Disk 2, ..., Disk n, with k pieces used to rebuild a missing piece.]

Integration – Reads/Writes

Integration – Reads/Writes DB

- Increased number of “file_device” entries
- Decreased number of “file” entries
- Change the meaning of “class”

Integration – Write

Integration – Read

Integration – Recovery

Lessons Learned

- CRC32 can be slow
  - Intel’s Slicing-by-8 algorithm
- Block size can limit your smallest file size
- Lighttpd doesn’t support “Transfer-Encoding: chunked”
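
Slicing-by-8 speeds up CRC32 by consuming 8 input bytes per loop iteration with eight lookup tables, instead of one byte with one table. The sketch below shows the table construction and the inner loop for the standard reflected CRC-32 polynomial. It is written in Python to show the structure, not the speed; in pure Python it will be slower than the built-in `zlib.crc32`, which is used here only as a reference to check the result.

```python
import struct
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial (the one zlib uses)


def _make_tables():
    base = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        base.append(crc)
    tables = [base]
    # Table n advances a byte through n additional zero bytes.
    for _ in range(7):
        prev = tables[-1]
        tables.append([(v >> 8) ^ base[v & 0xFF] for v in prev])
    return tables


T = _make_tables()


def crc32_slicing_by_8(data: bytes, crc: int = 0) -> int:
    crc ^= 0xFFFFFFFF
    pos = 0
    end8 = len(data) - len(data) % 8
    while pos < end8:
        # Read 8 bytes as two little-endian 32-bit words.
        one, two = struct.unpack_from("<II", data, pos)
        one ^= crc
        crc = (T[7][one & 0xFF] ^ T[6][(one >> 8) & 0xFF] ^
               T[5][(one >> 16) & 0xFF] ^ T[4][one >> 24] ^
               T[3][two & 0xFF] ^ T[2][(two >> 8) & 0xFF] ^
               T[1][(two >> 16) & 0xFF] ^ T[0][two >> 24])
        pos += 8
    # Remaining 0..7 bytes use the classic one-byte-at-a-time update.
    for byte in data[pos:]:
        crc = (crc >> 8) ^ T[0][(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


payload = b"NebulaFS block payload" * 10
assert crc32_slicing_by_8(payload) == zlib.crc32(payload)
```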

Performance Test Setup

- 6 super nodes (each a tracker and a storage node)
- 180 drives
- Drives not evenly distributed (i.e. not 30 drives per node)
- EC strips maximally distributed
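
One simple way to spread strips as widely as possible is round-robin placement, sketched below. This is an assumed policy for illustration; the slides do not say how NebulaFS chooses strip locations, and `place_strips` and the node names are hypothetical.

```python
from collections import Counter
from typing import List


def place_strips(total_strips: int, nodes: List[str], start: int = 0) -> List[str]:
    """Assign strips to nodes round-robin so per-node counts differ by at most one."""
    if not nodes:
        raise ValueError("need at least one node")
    return [nodes[(start + i) % len(nodes)] for i in range(total_strips)]


nodes = [f"node{i}" for i in range(1, 7)]  # 6 super nodes
placement = place_strips(16, nodes)        # 10 of 16 means 16 strips per object
print(Counter(placement))
```

With 16 strips over 6 nodes, no node holds more than 3 strips, so losing any two nodes costs at most 6 strips, which a 10 of 16 code can still tolerate.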

Performance Test Results

[Chart: write throughput in MB/s (axis 0 to 80) by file size (1KB, 1MB, 16MB, 32MB, 64MB, 128MB, 256MB) for ec_1_of_2, ec_6_of_9, ec_9_of_12, ec_10_of_16, and replication]

Performance Test Results

[Chart: read throughput in MB/s (axis 0 to 200) by file size (1KB, 1MB, 16MB, 32MB, 64MB, 128MB, 256MB) for ec_1_of_2, ec_6_of_9, ec_9_of_12, ec_10_of_16, and replication]

Migrations

[Diagram: migration. A Migration Script alongside the RESTful Presentation (S3, GDCS), VFS, VFS DB, User DB, NebulaFSAPI, and NebulaFS (Metadata DB, Super Nodes, Storage Nodes)]

Future

- Finish Phase III
  - Repairs
  - Offsite copy
- Net new growth
- Optimizations
- Open source

Questions

Thank You!
