
Toward Achieving Tapeless Backup at PB Scales

Hakim Weatherspoon, University of California, Berkeley

Frontiers in Distributed Information Systems, San Francisco. Thursday, July 31, 2003


OceanStore Context: Ubiquitous Computing

• Computing everywhere:
  – Desktop, laptop, palmtop.
  – Cars, cellphones.
  – Shoes? Clothing? Walls?
• Connectivity everywhere:
  – Rapid growth of bandwidth in the interior of the net.
  – Broadband to the home and office.
  – Wireless technologies such as CDMA, satellite, laser.


Archival Storage

• Where is persistent information stored?
  – Want: geographic independence for availability, durability, and freedom to adapt to circumstances.
• How is it protected?
  – Want: encryption for privacy, secure naming and signatures for authenticity, and Byzantine commitment for integrity.
• Is it available/durable?
  – Want: redundancy with continuous repair and redistribution for long-term durability.


Path of an Update


Questions about Data?

• How do we use redundancy to protect data against loss?
• How do we verify data?
• How many resources (storage, bandwidth) does it take to keep data durable?


Archival Dissemination Built into Update

• Erasure codes:
  – Redundancy without the overhead of strict replication.
  – Produce n fragments, any m of which suffice to reconstruct the data (m < n).
  – Rate r = m/n; storage overhead is 1/r. (A minimal coding sketch follows below.)
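The sketch below illustrates the "any m of n" property with the simplest possible erasure code, a single XOR parity fragment (so n = m + 1). This is not OceanStore's actual coder, which uses a stronger scheme supporting arbitrary m < n; the function names are illustrative.

```python
# Minimal single-parity erasure code (n = m + 1): the simplest case of
# "any m of n fragments reconstruct the data". Illustrative only; a real
# archival coder (e.g. Reed-Solomon/Cauchy) supports arbitrary m < n.
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_fragments: list[bytes]) -> list[bytes]:
    """Return the m data fragments plus one parity fragment."""
    return data_fragments + [reduce(xor, data_fragments)]

def reconstruct(fragments: dict[int, bytes], m: int) -> bytes:
    """Rebuild the data from any m of the n = m + 1 fragments."""
    missing = [i for i in range(m + 1) if i not in fragments]
    assert len(missing) <= 1, "a single parity tolerates only one lost fragment"
    if missing:
        fragments[missing[0]] = reduce(xor, fragments.values())
    return b"".join(fragments[i] for i in range(m))

m = 4
frags = encode([b"ocea", b"nsto", b"re a", b"rch."])
survivors = {i: f for i, f in enumerate(frags) if i != 2}  # lose fragment 2
print(reconstruct(survivors, m))  # b'oceanstore arch.'
# rate r = m/n = 4/5, so storage overhead 1/r = 1.25x
```

With m = 16 and n = 32, as in the prototype measurements later in the talk, r = 1/2 and the storage overhead is 2x; the r = 1/4 configuration on the next slide (m = 16, n = 64) costs 4x.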


Durability

• Fraction of Blocks Lost Per Year (FBLPY)*:
  – r = 1/4, erasure-encoded block (e.g. m = 16, n = 64).
  – Increasing the number of fragments increases the durability of a block:
    • same storage cost and repair time.
  – The n = 4 fragment case is equivalent to replication on four servers. (See the sketch below.)

* Erasure Coding vs. Replication, H. Weatherspoon and J. Kubiatowicz, in Proc. of IPTPS 2002.
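As a back-of-the-envelope illustration of why many fragments beat few at the same overhead (a sketch only; the cited IPTPS 2002 paper models repair epochs and measured server failure distributions, and the per-fragment survival probability p below is an assumed number):

```python
# Sketch: probability that a block is lost when each of its n fragments
# independently survives with probability p; the block is lost if fewer
# than m fragments survive. Assumptions: independence, and p = 0.9 chosen
# only for illustration.
from math import comb

def block_loss_prob(m: int, n: int, p: float) -> float:
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m))

p = 0.9
print(block_loss_prob(1, 4, p))    # 4 replicas (any 1 of 4 suffices): 1e-4
print(block_loss_prob(16, 64, p))  # m = 16, n = 64, same 4x overhead: ~3e-36
```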


Naming and Verification Algorithm

• Use a cryptographically secure hash algorithm to detect corrupted fragments.
• Verification tree:
  – n is the number of fragments.
  – Store log(n) + 1 hashes with each fragment.
  – Total of n·(log(n) + 1) hashes.
• The top hash is the block GUID (B-GUID).
  – Fragments and blocks are self-verifying. (A sketch follows after the figure.)

[Figure: verification tree over four encoded fragments. The data hashes to Hd and fragments F1–F4 hash to H1–H4; interior hashes H12 = H(H1, H2), H34 = H(H3, H4), and H14 = H(H12, H34) combine with Hd to form the B-GUID. Each stored fragment carries its verification path: Fragment 1 = {H2, H34, Hd, F1 fragment data}, Fragment 2 = {H1, H34, Hd, F2}, Fragment 3 = {H4, H12, Hd, F3}, Fragment 4 = {H3, H12, Hd, F4}.]
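The sketch below works through the four-fragment verification tree in the figure: each fragment ships with its sibling hash, uncle hash, and the data hash, so any holder of the B-GUID can verify it. The hash composition order and the use of SHA-1 are assumptions for illustration, not the exact OceanStore encoding.

```python
# Sketch of the four-fragment verification tree in the figure. Hash
# composition/order and SHA-1 are illustrative assumptions.
import hashlib

def H(*parts: bytes) -> bytes:
    h = hashlib.sha1()
    for p in parts:
        h.update(p)
    return h.digest()

def build_tree(fragments: list[bytes], data: bytes):
    h1, h2, h3, h4 = (H(f) for f in fragments)
    h12, h34 = H(h1, h2), H(h3, h4)
    h14 = H(h12, h34)
    hd = H(data)
    b_guid = H(h14, hd)                       # top hash = block GUID
    # Verification path stored with each fragment (sibling, uncle, data hash),
    # matching the figure: Fragment 1 carries {H2, H34, Hd}, etc.
    paths = [(h2, h34, hd), (h1, h34, hd), (h4, h12, hd), (h3, h12, hd)]
    return b_guid, paths

def verify(index: int, fragment: bytes, path, b_guid: bytes) -> bool:
    sibling, uncle, hd = path
    h = H(fragment)
    pair = H(h, sibling) if index % 2 == 0 else H(sibling, h)
    h14 = H(pair, uncle) if index < 2 else H(uncle, pair)
    return H(h14, hd) == b_guid               # self-verifying against the B-GUID

frags = [b"frag1", b"frag2", b"frag3", b"frag4"]
b_guid, paths = build_tree(frags, b"the original data block")
print(verify(0, frags[0], paths[0], b_guid))      # True
print(verify(0, b"corrupted", paths[0], b_guid))  # False
```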



Enabling Technology

[Figure: fragments named by a GUID.]


Complex Objects I

[Figure: a data block d is the unit of coding; its verification tree produces the GUID of d, and its encoded fragments are the unit of archival storage.]


Complex Objects II

[Figure: an object version is a data B-tree of blocks d1–d9 with indirect blocks, rooted at block M and named by a version GUID (VGUID). Each block, e.g. d1, is separately the unit of coding: its verification tree yields the GUID of d1 and its encoded fragments are the unit of archival storage.]


Complex Objects III

[Figure: two successive versions, VGUID_i and VGUID_i+1, of the same object. An update that changes blocks d8 and d9 writes new blocks d'8 and d'9; unchanged data blocks and indirect blocks are shared copy-on-write, and the new root M keeps a backpointer to the previous version. The object itself is named by AGUID = hash{name + keys}.]
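A minimal sketch of the copy-on-write versioning shown in the figure, using a flat tuple of block GUIDs in place of the real B-tree of indirect blocks; all names are illustrative.

```python
# Sketch of copy-on-write versioning: each version shares the GUIDs of
# unchanged blocks with its predecessor and keeps a backpointer to it.
# A flat tuple stands in for the real B-tree of indirect blocks.
import hashlib
from dataclasses import dataclass

def guid(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

store: dict[str, bytes] = {}          # GUID -> read-only block contents

def put_block(data: bytes) -> str:
    g = guid(data)
    store[g] = data
    return g

@dataclass(frozen=True)
class Version:
    block_guids: tuple[str, ...]              # stands in for the B-tree root M
    backpointer: "Version | None" = None      # previous version (copy on write)

    @property
    def vguid(self) -> str:
        return guid("".join(self.block_guids).encode())

# Version i: blocks d1..d4.
v_i = Version(tuple(put_block(d) for d in (b"d1", b"d2", b"d3", b"d4")))

# Version i+1: only block 3 changes, so only d'3 is newly stored; the other
# block GUIDs are shared with version i.
blocks = list(v_i.block_guids)
blocks[2] = put_block(b"d'3")
v_next = Version(tuple(blocks), backpointer=v_i)

print(v_i.vguid != v_next.vguid)  # True: every version gets its own VGUID
print(len(store))                 # 5 blocks stored in total, not 8
```

The AGUID = hash{name + keys} from the figure stays fixed across versions; mapping it to the latest VGUID is the job of the mutable-data machinery on the next slide.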


Mutable Data

• Need mutable data for a real system:
  – An entity in the network maintains the A-GUID to V-GUID mapping.
  – Byzantine commitment for integrity:
    • Verifies client privileges.
    • Creates a serial order.
    • Atomically applies updates.
• Versioning system:
  – Each version is inherently read-only. (A minimal sketch of the mapping and serial ordering follows below.)
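The sketch below is a stand-in for that network entity: it checks privileges, serializes updates, and maps the A-GUID to the latest V-GUID. The Byzantine agreement among replicas and real signature checks are elided, and all names are illustrative.

```python
# Toy stand-in for the entity that orders updates for one object and maps
# its AGUID to the latest VGUID. Byzantine agreement, thresholds, and real
# signature verification are elided; names are illustrative.
import hashlib

def guid(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

class SerializerStub:
    def __init__(self, name: str, owner_key: str):
        self.owner_key = owner_key
        self.aguid = guid(f"{name}+{owner_key}".encode())  # AGUID = hash{name + keys}
        self.versions: list[str] = []                      # serial order of VGUIDs

    def apply_update(self, client_key: str, new_root: bytes) -> str:
        if client_key != self.owner_key:                   # verify client privileges (toy check)
            raise PermissionError("update rejected")
        prev = self.versions[-1].encode() if self.versions else b""
        vguid = guid(prev + new_root)                      # new read-only version
        self.versions.append(vguid)                        # atomically appended, in serial order
        return vguid

obj = SerializerStub("my_file", owner_key="alice-public-key")
v1 = obj.apply_update("alice-public-key", b"root-of-version-1")
v2 = obj.apply_update("alice-public-key", b"root-of-version-2")
print(obj.aguid)                 # stable name across versions
print(obj.versions[-1] == v2)    # the AGUID now maps to the newest VGUID
```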


Deployment

• PlanetLab global network:
  – 98 machines at 42 institutions in North America, Europe, Asia, and Australia.
  – 1.26 GHz Pentium III (1 GB RAM), 1.8 GHz Pentium 4 (2 GB RAM).
  – North American machines (2/3) are on Internet2.


Deployment

• Deployed the storage system in November 2002:
  – ~50 physical machines.
  – 100 virtual nodes:
    • 3 clients, 93 storage servers, 1 archiver, 1 monitor.
  – Supports the OceanStore API:
    • NFS, IMAP, etc.
  – Fault injection.
  – Fault detection and repair.


Performance

• Performance of the archival layer:
  – Performance of an OceanStore server in archiving objects.
  – Analyze the operations of archiving data (this includes signing updates in a BFT protocol):
    • No archiving.
    • Archiving (synchronous), m = 16, n = 32.
• Experiment environment:
  – OceanStore servers were analyzed on a 42-node cluster.
  – Each machine in the cluster is an IBM xSeries 330 1U rackmount PC with:
    • two 1.0 GHz Pentium III CPUs,
    • 1.5 GB ECC PC133 SDRAM,
    • two 36 GB IBM UltraStar 36LZX hard drives,
    • a single Intel PRO/1000 XF gigabit Ethernet adaptor connecting to a Packet Engines gigabit switch,
    • running a Linux 2.4.17 SMP kernel.


Performance: Throughput

• Data throughput:
  – No archive: 8 MB/s.
  – Archive: 2.8 MB/s.


Performance: Latency

• Latency:
  – Fragmentation: y-intercept 3 ms, slope 0.3 ms/kB (0.3 s/MB).
  – Archive = No archive + Fragmentation. (A short computation follows the figure.)

[Figure: update latency (ms) vs. update size (kB), with linear fits y = 1.2x + 36.4 (Archive), y = 0.6x + 29.6 (No Archive), and y = 0.3x + 3.0 (Fragmentation).]
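Plugging an update size into those fits gives a quick sense of the cost; a minimal sketch (the series-to-fit assignment is read off the figure as labeled above):

```python
# Predicted update latency (ms) from the linear fits in the figure;
# x is the update size in kB.
def latency_ms(slope: float, intercept: float, size_kb: float) -> float:
    return slope * size_kb + intercept

for name, slope, intercept in [("Archive",       1.2, 36.4),
                               ("No Archive",    0.6, 29.6),
                               ("Fragmentation", 0.3,  3.0)]:
    print(f"{name}: {latency_ms(slope, intercept, 32.0):.1f} ms for a 32 kB update")
# Archive: 74.8 ms, No Archive: 48.8 ms, Fragmentation: 12.6 ms
```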


Closer Look: Update Latency

• Threshold signature dominates small-update latency.
  – Common RSA tricks are not applicable.
• Batch updates to amortize the signature cost (worked example after the tables below).
• Tentative updates hide latency.

Update Latency (ms)

  Key Size   Update Size   5% Time   Median Time   95% Time
  512-bit    4 kB               39            40         41
  512-bit    2 MB             1037          1086       1348
  1024-bit   4 kB               98            99        100
  1024-bit   2 MB             1098          1150       1448

Latency Breakdown

  Phase       Time (ms)
  Check             0.3
  Serialize         6.1
  Apply             1.5
  Archive           4.5
  Sign             77.8
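A small worked example of the batching argument, combining the Sign phase with the other measured phases from the breakdown above (the batch size of 10 is an assumed illustration, not a measured configuration):

```python
# Amortizing the per-batch threshold signature over several updates,
# using the measured phase times above (batch size 10 is an assumption).
per_update = 0.3 + 6.1 + 1.5 + 4.5   # check + serialize + apply + archive (ms)
sign = 77.8                          # one threshold signature per batch (ms)

for batch in (1, 10):
    print(f"batch of {batch:2d}: {(batch * per_update + sign) / batch:.1f} ms per update")
# batch of  1: 90.2 ms per update
# batch of 10: 20.2 ms per update
```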


Current Situation

• Stabilized the routing layer under churn and extraordinary circumstances.
• NSF infrastructure grant:
  – Deploy the code as a service for Berkeley.
  – Target 1/3 PB.
• Future collaborations:
  – CMU for a PB store.
  – Internet Archive?


Conclusion

• Storage-efficient, self-verifying mechanism.
  – Erasure codes are good.
• Self-verifying data assist in:
  – Secure read-only data.
  – Secure caching infrastructures.
  – Continuous adaptation and repair.

For more information: http://oceanstore.cs.berkeley.edu/
Papers:
  – Pond: the OceanStore Prototype
  – Naming and Integrity: Self-Verifying Data in P2P Systems