Upload
phungkhue
View
220
Download
2
Embed Size (px)
Citation preview
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
I/O Deduplication: Utilizing Content Similarity toImprove I/O Performance
Ricardo Koller Raju Rangaswami
School of Computing and Information Sciences
College of Engineering and Computing
September 9, 2010
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
I/O Deduplication
It is a storage solution that uses content similarity for improvingI/O: eliminate duplicated I/O’s and reduce seek times.
◮ It is not Data Deduplication, used in arhival storage [Venti],COW disks [QEMU].
◮ It consists of 3 techniques:
X Content based cacheX Dynamic replica retrievalX Selective duplication
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Outline
1 Content Based Cache
2 Dynamic Replica Retrieval
3 Selective Duplication
4 Related Work
5 Limitations & Future work
6 Conclusions
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Workloads
Block traces collected downstream of an active page cache forthree weeks.
web-vm Two VM’s hosting web-servers: web-mail & onlinecourse management.
mail Our department mail server.
homes NFS server that serves the home directories of ourresearch group.
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Outline
1 Content Based Cache
2 Dynamic Replica Retrieval
3 Selective Duplication
4 Related Work
5 Limitations & Future work
6 Conclusions
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Motivation 1: Frequency
1000
10000
100000
1e+06
1e+07
1e+08
Read Write Read+Write
Hits
Web-vm
SectorContent
Read Write Read+Write
Read Write Read+Write
Homes
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Motivation 2: Recency
1000
10000
100000
1e+06
1e+07
Read Write Read+Write
Reu
se d
ista
nce
Web-vm
SectorContent
Read Write Read+Write
Read Write Read+Write
Homes
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Design: Content based Cache
Sec
tor-
to-H
ash
Funct
ion
Sec
tor
Digest-to-Hash Function
MD5 Digest
s
s
s
s
p
s
s
s
s
s
s
s
p
age{data, refs count}p
ector{sector, digest}s
◮ Reads sectors are searchedfor hits and in case of miss,content is searched andpossibly inserted into thecache.
◮ Placed at the block layer
X Write-though cache tomaintain semantics.
X ARC for second levelcache.
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Evaluation
◮ The I/O deduplication system was implemented as a modulefor kernel 2.6.20
◮ Traces replayed at 100X using a modified version of btreplay
◮ Measurements of I/O time were taken using blktrace
◮ Performed on a single Intel(R) Pentium 4 CPU 2.00GHZ with1GB of memory and a WD disk running at 7200RPM
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Evaluation: Content Addressed Cache
1e-05
0.0001
0.001
0.01
0.1
1
web-vm mail homes
Rea
d hi
t rat
ioSector 4MB
Content 4MBSector 200MB
Content 200MB
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Evaluation: Hits versus Cache Size
0.0001
0.001
0.01
0.1
1
1 10 100 1000 10000
Rea
d hi
t rat
io
Cache size (MBytes)
ARC - ReadLRU - Read
1 10 100 1000 10000
Cache size (MBytes)
ARC - Read/WriteLRU - Read/Write
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Outline
1 Content Based Cache
2 Dynamic Replica Retrieval
3 Selective Duplication
4 Related Work
5 Limitations & Future work
6 Conclusions
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Motivation: Duplication
Workloads web-vm mail homes
Unique 4K pages (millions) 1.9 27 62Total 4K pages (millions) 5.2 73 183Disk Static similarity 2.67 2.64 2.94
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Motivation: Duplication
Workloads web-vm mail homes
Unique 4K pages (millions) 1.9 27 62Total 4K pages (millions) 5.2 73 183Disk Static similarity 2.67 2.64 2.94
0 0.5
1 1.5
2 2.5
3 3.5
4
10 100 1000 no limit
Wor
kloa
d st
atic
sim
ilarit
y
Maximum number of copies
web-vmmail
homes
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Design: Dynamic Replica Retrieval
Sec
tor-
to-H
ash
Funct
ion
Sec
tor
Digest-to-Hash Function
MD5 Digest
s
s
s
s
p
s
s
s
s
s
s
s
p
◮ Reduce seek times byindirecting requests basedon head position: choosethe duplicate that’s closerto the head.
◮ The yellow entries share anuncached page.
◮ Current head position basedon completed reads.
◮ Placed above the I/Oscheduler:
X Indirect only if there areno adjacent requests.
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Evaluation: Dynamic Replica Retrieval
0
0.005
0.01
0.015
0.02
web-vm mail homes
vanilla dedup
Per
-req
uest
dis
k I/O
rea
d tim
e (s
ec)
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Outline
1 Content Based Cache
2 Dynamic Replica Retrieval
3 Selective Duplication
4 Related Work
5 Limitations & Future work
6 Conclusions
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Motivation: Working Set Overlap
0
20
40
60
80
100
120
1 2 3 4 5 6 7
Con
tent
acc
ess
over
lap
(%)
Intervals
web-vm mail homes
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Selective Duplication
read( )
head
Exported space
Scratch space
Data is duplicated at scratch spaces interspersed across the disk.
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Dynamic Replica Retrieval
0
0.005
0.01
0.015
0.02
web-vm mail homes
v d
Per
-req
uest
dis
k I/O
rea
d tim
e (s
ec)
Without Selective Duplication
web-vm mail homes
v d
With Selective Duplication
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
All Together
Workload Vanilla (rd sec) I/O dedup (rd sec) Improvement
web-vm 3098.61 1641.90 47%mail 4877.49 3467.30 28%home 1904.63 1160.40 39%
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Overhead
◮ Memorymem(P ,WSS ,HTB) = 13 ∗ P + 36 ∗ P ∗ WSS + 8 ∗ HTB
For a content cache of 1GB , static similarity of 4 and a hash tableof a million buckets, the metadata is 48MB (4.6%).
◮ CPU
X if HTB = 1e3, cpu read miss(P) = O(P) + 100000 cycles.X if HTB = 1e6, cpu read miss(P) = O(1) + 100000 cycles.
For our machine running at 2GHz , the 100000 + 1000 cycles are90µs.
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Outline
1 Content Based Cache
2 Dynamic Replica Retrieval
3 Selective Duplication
4 Related Work
5 Limitations & Future work
6 Conclusions
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Related Work
◮ I/O Performance Optimization
X Duplication of popular data: FS2, Borg
◮ Content Addressed Storage
X Archival storage: Venti
◮ I/O Deduplication
X Satori (COW-disk sharing mode)
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Outline
1 Content Based Cache
2 Dynamic Replica Retrieval
3 Selective Duplication
4 Related Work
5 Limitations & Future work
6 Conclusions
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Limitations & Future Work
◮ Integration with the page cache
◮ Multiple disks
◮ Variable sized chunks
◮ Write requests ”special handling”, leave them for later?pdflush?
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Limitations & Future Work
◮ Page replacement strategies for content
X LRU and ARC are further from the optimal when used withcontent
◮ I/O scheduling based on duplicated blocks
X Greedy option is sub-optimal
X I/O scheduling with duplicated blocks is NP-Hard (as shownfor regular I/O scheduling by [Andrews, 1996].)
X Design an approximation algorithm with some nice approx.ratio
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Outline
1 Content Based Cache
2 Dynamic Replica Retrieval
3 Selective Duplication
4 Related Work
5 Limitations & Future work
6 Conclusions
Content Based Cache Dynamic Replica Retrieval Selective Duplication Related Work Limitations & Future work Conclusions
Summary and Conclusions
◮ For systems where content is more frequent than sector andreuse distances are shorter for content compared to sertor,content based caches can be more effective than sector ones.
◮ On disk duplications can be used for reducing I/O times.