30
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance Ricardo Koller Raju Rangaswami School of Computing and Information Sciences College of Engineering and Computing September 9, 2010

I/O Deduplication: Utilizing Content Similarity to Improve ...researcher.watson.ibm.com/researcher/files/us-xmeng/dedup.pdf · I/O Deduplication: Utilizing Content Similarity to

Embed Size (px)

Citation preview

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

I/O Deduplication: Utilizing Content Similarity toImprove I/O Performance

Ricardo Koller Raju Rangaswami

School of Computing and Information Sciences

College of Engineering and Computing

September 9, 2010

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

I/O Deduplication

It is a storage solution that uses content similarity for improvingI/O: eliminate duplicated I/O’s and reduce seek times.

◮ It is not Data Deduplication, used in arhival storage [Venti],COW disks [QEMU].

◮ It consists of 3 techniques:

X Content based cacheX Dynamic replica retrievalX Selective duplication

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Outline

1 Content Based Cache

2 Dynamic Replica Retrieval

3 Selective Duplication

4 Related Work

5 Limitations & Future work

6 Conclusions

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Workloads

Block traces collected downstream of an active page cache forthree weeks.

web-vm Two VM’s hosting web-servers: web-mail & onlinecourse management.

mail Our department mail server.

homes NFS server that serves the home directories of ourresearch group.

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Outline

1 Content Based Cache

2 Dynamic Replica Retrieval

3 Selective Duplication

4 Related Work

5 Limitations & Future work

6 Conclusions

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Motivation 1: Frequency

1000

10000

100000

1e+06

1e+07

1e+08

Read Write Read+Write

Hits

Web-vm

SectorContent

Read Write Read+Write

Mail

Read Write Read+Write

Homes

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Motivation 2: Recency

1000

10000

100000

1e+06

1e+07

Read Write Read+Write

Reu

se d

ista

nce

Web-vm

SectorContent

Read Write Read+Write

Mail

Read Write Read+Write

Homes

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Design: Content based Cache

Sec

tor-

to-H

ash

Funct

ion

Sec

tor

Digest-to-Hash Function

MD5 Digest

s

s

s

s

p

s

s

s

s

s

s

s

p

age{data, refs count}p

ector{sector, digest}s

◮ Reads sectors are searchedfor hits and in case of miss,content is searched andpossibly inserted into thecache.

◮ Placed at the block layer

X Write-though cache tomaintain semantics.

X ARC for second levelcache.

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Evaluation

◮ The I/O deduplication system was implemented as a modulefor kernel 2.6.20

◮ Traces replayed at 100X using a modified version of btreplay

◮ Measurements of I/O time were taken using blktrace

◮ Performed on a single Intel(R) Pentium 4 CPU 2.00GHZ with1GB of memory and a WD disk running at 7200RPM

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Evaluation: Content Addressed Cache

1e-05

0.0001

0.001

0.01

0.1

1

web-vm mail homes

Rea

d hi

t rat

ioSector 4MB

Content 4MBSector 200MB

Content 200MB

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Evaluation: Hits versus Cache Size

0.0001

0.001

0.01

0.1

1

1 10 100 1000 10000

Rea

d hi

t rat

io

Cache size (MBytes)

ARC - ReadLRU - Read

1 10 100 1000 10000

Cache size (MBytes)

ARC - Read/WriteLRU - Read/Write

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Outline

1 Content Based Cache

2 Dynamic Replica Retrieval

3 Selective Duplication

4 Related Work

5 Limitations & Future work

6 Conclusions

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Motivation: Duplication

Workloads web-vm mail homes

Unique 4K pages (millions) 1.9 27 62Total 4K pages (millions) 5.2 73 183Disk Static similarity 2.67 2.64 2.94

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Motivation: Duplication

Workloads web-vm mail homes

Unique 4K pages (millions) 1.9 27 62Total 4K pages (millions) 5.2 73 183Disk Static similarity 2.67 2.64 2.94

0 0.5

1 1.5

2 2.5

3 3.5

4

10 100 1000 no limit

Wor

kloa

d st

atic

sim

ilarit

y

Maximum number of copies

web-vmmail

homes

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Design: Dynamic Replica Retrieval

Sec

tor-

to-H

ash

Funct

ion

Sec

tor

Digest-to-Hash Function

MD5 Digest

s

s

s

s

p

s

s

s

s

s

s

s

p

◮ Reduce seek times byindirecting requests basedon head position: choosethe duplicate that’s closerto the head.

◮ The yellow entries share anuncached page.

◮ Current head position basedon completed reads.

◮ Placed above the I/Oscheduler:

X Indirect only if there areno adjacent requests.

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Evaluation: Dynamic Replica Retrieval

0

0.005

0.01

0.015

0.02

web-vm mail homes

vanilla dedup

Per

-req

uest

dis

k I/O

rea

d tim

e (s

ec)

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Outline

1 Content Based Cache

2 Dynamic Replica Retrieval

3 Selective Duplication

4 Related Work

5 Limitations & Future work

6 Conclusions

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Motivation: Working Set Overlap

0

20

40

60

80

100

120

1 2 3 4 5 6 7

Con

tent

acc

ess

over

lap

(%)

Intervals

web-vm mail homes

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Selective Duplication

read( )

head

Exported space

Scratch space

Data is duplicated at scratch spaces interspersed across the disk.

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Dynamic Replica Retrieval

0

0.005

0.01

0.015

0.02

web-vm mail homes

v d

Per

-req

uest

dis

k I/O

rea

d tim

e (s

ec)

Without Selective Duplication

web-vm mail homes

v d

With Selective Duplication

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

All Together

Workload Vanilla (rd sec) I/O dedup (rd sec) Improvement

web-vm 3098.61 1641.90 47%mail 4877.49 3467.30 28%home 1904.63 1160.40 39%

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Overhead

◮ Memorymem(P ,WSS ,HTB) = 13 ∗ P + 36 ∗ P ∗ WSS + 8 ∗ HTB

For a content cache of 1GB , static similarity of 4 and a hash tableof a million buckets, the metadata is 48MB (4.6%).

◮ CPU

X if HTB = 1e3, cpu read miss(P) = O(P) + 100000 cycles.X if HTB = 1e6, cpu read miss(P) = O(1) + 100000 cycles.

For our machine running at 2GHz , the 100000 + 1000 cycles are90µs.

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Outline

1 Content Based Cache

2 Dynamic Replica Retrieval

3 Selective Duplication

4 Related Work

5 Limitations & Future work

6 Conclusions

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Related Work

◮ I/O Performance Optimization

X Duplication of popular data: FS2, Borg

◮ Content Addressed Storage

X Archival storage: Venti

◮ I/O Deduplication

X Satori (COW-disk sharing mode)

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Outline

1 Content Based Cache

2 Dynamic Replica Retrieval

3 Selective Duplication

4 Related Work

5 Limitations & Future work

6 Conclusions

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Limitations & Future Work

◮ Integration with the page cache

◮ Multiple disks

◮ Variable sized chunks

◮ Write requests ”special handling”, leave them for later?pdflush?

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Limitations & Future Work

◮ Page replacement strategies for content

X LRU and ARC are further from the optimal when used withcontent

◮ I/O scheduling based on duplicated blocks

X Greedy option is sub-optimal

X I/O scheduling with duplicated blocks is NP-Hard (as shownfor regular I/O scheduling by [Andrews, 1996].)

X Design an approximation algorithm with some nice approx.ratio

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Outline

1 Content Based Cache

2 Dynamic Replica Retrieval

3 Selective Duplication

4 Related Work

5 Limitations & Future work

6 Conclusions

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Summary and Conclusions

◮ For systems where content is more frequent than sector andreuse distances are shorter for content compared to sertor,content based caches can be more effective than sector ones.

◮ On disk duplications can be used for reducing I/O times.

Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions

Questions?http://dsrl.cs.fiu.edu/projects/iodedup/