FLAP: Flash-aware Prefetching for Improving
SSD-based Disk Cache
Yang Liu The Big Data Research Institute of Modern Educational Technology Center, Henan University of Economics and Law,
Zhengzhou 450046, China
Email: yangliu.huel@foxmail.com
Wang Wei
Department of Computer Science, Henan Industry and Trade Vocational College, Zhengzhou 450003, China
Email: wangwei.hntvc@foxmail.com
Abstract—In modern enterprise storage systems, there is a trend toward using NAND flash based solid state disks (SSDs) as a second-level disk cache that reduces slow accesses to hard disk drives (HDDs) by caching the hot data of HDDs in SSDs. However, using SSDs for both caching and prefetching has rarely been discussed, due to the performance penalty caused by unsuccessful prefetching, including garbage collection cost and wasted disk bandwidth. In this paper, a FLash-Aware Prefetching (FLAP) scheme is presented that provides aggressive prefetching to improve the performance of sequential disk accesses. FLAP has been evaluated using well-known real-world workloads. Experimental results show that FLAP offers performance improvements under sequential workloads compared with pure caching, while reducing the internal garbage collection of SSDs.
Index Terms—Hybrid Storage System; Flash Cache;
Prefetching; Garbage Collection
I. INTRODUCTION
Under current technology trends, the access time gap between processor and disk keeps increasing [1]. To bridge this
gap, multi-level buffer caches are employed along data
retrieving paths to reduce time-consuming disk accesses.
Besides multiple volatile buffer caches, many modern storage systems utilize NAND flash based large
nonvolatile memory as second-level buffer caches to
speed-up I/O accesses in storage servers [2]. As a promising nonvolatile storage with attractive
features, NAND flash based SSD caches in storage
systems play important roles in improving data access
performance. However, studies have shown that storage traffic exhibits much longer reuse distances at the SSD
cache due to the existence of the first-level DRAM cache.
Specifically, accesses to a SSD cache exhibit poorer
temporal locality than those to the first-level DRAM cache because these accesses are misses from the higher
level. Even if the capacity of the SSD cache keeps increasing, the cache miss rate cannot be reduced further beyond a certain size. To improve the resource utilization of the SSD cache,
prefetching would be a potential choice for storage
servers to improve data access performance. Specifically,
it could fully leverage the capacity advantage of the SSD
cache to perform aggressive prefetching for sequential accesses. In this way, more data can be fetched sequentially from disks in a single prefetching request, and thus more expensive disk accesses can be avoided.
Unlike a typical cache, flash cache has several unique features. Firstly, internal garbage collection (GC) cost
must be considered in a flash cache. In detail, NAND
flash is organized in units of pages and blocks. A typical flash page is 4 KB in size and a flash block is made up of
64 flash pages (256 KB). Reads and writes are performed
on a page basis and flash erasures are performed on a
block basis. Erase is the slowest operation while write is slower than read. A flash must perform an erase on a
block before it can write any data to a page belonging to
the block. Each page on flash can be in one of three different states, namely, valid, invalid, and erased. When data is written to an erased page, its state becomes valid. If a page contains an older version of data, it is said to be in the invalid state. Note that out-of-place updates and cache evictions result in a large number of invalid pages.
When the SSD has few free pages available for write requests, the garbage collector is employed to reclaim those invalid pages and create free pages. First, it selects a victim block based on a greedy policy, such as choosing the block with the maximum number of invalid pages. All valid pages within the victim block are then copied and written into another free block; this extra copying is called write amplification. After that, the victim block is erased to create a free block.
The efficiency of garbage collector is one of the dominant
factors affecting flash memory performance. Secondly, the lifespan of SSD cache can be drastically reduced by
the write-amplification, which is measured by the ratio of
the amount of data physically written to flash memory to
the amount of data logically written by the host. It has been reported that each flash memory cell can sustain
about 100K erase operations for Single-Level-Cell (SLC)
and only 10K erase operations for Multi-Level-Cell (MLC). Evidently, write-amplification becomes very high when pages with different access patterns are
mixed together. Lastly, unlike a DRAM cache, the SSD cache is nonvolatile and therefore serves as persistent storage.
This study is motivated by the following challenges
caused by the properties of flash:
Taking the GC cost into account, the cache allocation cost of the SSD cache is significantly higher than that of a conventional DRAM cache. Thus, the performance penalty of false prefetching is extremely expensive.
Since the prefetched data and cached data have remarkable disparities in spatial and temporal locality, placing them in the same SSD cache area will cause extensive write-amplification due to GC. As a result, the lifespan and performance of the flash cache layer will be impaired.
This paper addresses the above-mentioned problems.
More specifically, it has the following contributions:
We have built the cost-benefit model of the SSD-
based prefetching considering the internal
features of flash storage, especially the GC cost.
We employ a relationship graph tool to mine the
sequentiality of the workload. Algorithms are also
designed to choose the optimal prefetching length so as to balance the prefetching aggressiveness
and prediction accuracy.
We split the space of the SSD-based disk cache into a cache partition and a prefetch partition. A novel time-aware cache allocation scheme is utilized to manage the data layout of the prefetched data, by which the expensive internal garbage collection cost of prefetching can be minimized.
The rest of this paper is organized as follows: The next section briefly describes the background on sequential prefetching technology. Section III describes the
design of FLAP. Sections IV and V present the
experiment methodology and results. Related work is discussed in Section VI. Finally, Section VII concludes
this paper.
II. SEQUENTIAL PREFETCHING TECHNOLOGY
BACKGROUND
In this section, we present background knowledge on sequential prefetching technology. We also present a Relationship Graph model to learn the sequentiality strength of a given workload. With this graph, we can evaluate the hit probability of a prefetching request so that the prefetch length can be adjusted accordingly.
A. Sequentiality and Prefetching
Sequentiality, i.e., accessing data contiguously, is a common characteristic of practical I/O workloads. The ubiquity of sequentiality derives from the fact that file
systems and databases tend to contiguously store the file
data on storage disks. Hence, many common file
operations such as copy, scan, backup and recovery lead to sequential accesses. Additionally, some important
benchmarks also exhibit strong sequentiality, such as
TPC-D [3] and SPC-2 [4].
A stream is a sequence of I/O requests of an application. It is considered to be sequential if the
corresponding data are accessed contiguously. Due to the
semantic gap between the file system and the storage sub-
system, the only available information for sequential stream detection is Logical Block Address (LBA). With
LBA, sequential prefetching schemes can be easily
deployed without any hint of applications or file systems. By mining the history of the LBAs of past accesses, the
sequential prefetching schemes are able to achieve high
prediction accuracy. Since they are less sophisticated than
other kinds of forecasting techniques, sequential prefetching schemes are widely adopted in practical
storage systems.
Existing sequential prefetching schemes are mainly divided into three classes, namely, Prefetch Always (PA),
Prefetch On a Miss (PoM) and Prefetch On a Hit (PoH)
[5, 6]. Regarding the first, PA does not need a prediction module and always fetches the data contiguous to a request. Assuming there is abundant cache space, it achieves the highest hit rate but the lowest efficiency, because much of the prefetched data will never be accessed. In
contrast, PoM only prefetches data on a miss, and thereby its miss rate is high. Additionally, the prefetching of
random requests with little spatial locality will pollute the
cache. PoH is popular in practice because it achieves both high hit rate and cache size economy. In PoH, a trigger
mechanism [7] is employed to avoid cache misses.
Specifically, when PoH prefetches a set of disk blocks
into the cache pages on a cache hit or miss, it chooses a trigger page by a trigger offset before the end of the
prefetched set. After that, when the trigger page is hit by a future request, the next prefetching request for this stream is activated.
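To make the trigger mechanism concrete, the following is a minimal Python sketch (with hypothetical names such as PoHPrefetcher and trigger_offset, not taken from the paper) of how a prefetch-on-hit scheme might place a trigger page inside each prefetched set and continue the stream when that page is hit:

```python
class PoHPrefetcher:
    """Minimal Prefetch-on-Hit sketch: each prefetched set carries a trigger
    page located trigger_offset pages before its end; hitting the trigger
    issues the next prefetch for the same stream (names are illustrative)."""

    def __init__(self, prefetch_len=8, trigger_offset=2):
        self.prefetch_len = prefetch_len
        self.trigger_offset = trigger_offset
        self.triggers = {}  # trigger LBA -> LBA where the next prefetch starts

    def _issue_prefetch(self, start_lba):
        # Fetch [start_lba, start_lba + prefetch_len) from the HDD into the
        # cache (cache insertion itself is outside the scope of this sketch).
        end = start_lba + self.prefetch_len
        trigger = end - self.trigger_offset   # trigger page inside the set
        self.triggers[trigger] = end          # next prefetch continues at 'end'
        return list(range(start_lba, end))

    def on_access(self, lba, cache_hit):
        """Called for every demand access; returns the LBAs to prefetch, if any."""
        if lba in self.triggers:              # trigger page hit: continue the stream
            return self._issue_prefetch(self.triggers.pop(lba))
        if not cache_hit:                     # a miss may start a new prefetched set
            return self._issue_prefetch(lba + 1)
        return []


if __name__ == "__main__":
    p = PoHPrefetcher(prefetch_len=4, trigger_offset=2)
    print(p.on_access(100, cache_hit=False))  # miss -> prefetch 101..104, trigger at 103
    print(p.on_access(103, cache_hit=True))   # trigger hit -> prefetch 105..108
```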
B. Sequential Stream Detection
The role of sequential stream detection module is to
detect the sequential access pattern from the workload. It
serves two primary purposes: the continuation of existing streams and the identification of new streams. For the first purpose, a stream data structure is used to
represent the run-time object associated with each
identified sequential stream. This structure keeps
attributes describing the corresponding stream, e.g. the timestamp of the most recent request, the average request
length, the average inter-arrival rate, the LBA of the next
expected contiguous request, etc. We define the LBA of the next expected request as trigger, which serves as the
index of the stream. All the streams are organized by a
hash linked list named StreamQueue, using their triggers
as the keys and the pointers of the stream structures as the values. When a new request arrives, the request address is
searched in the hash table to see if any stream would be
triggered. If so, it means an extension of a stream occurs.
As a result, an update to the located stream is performed and a next prefetching operation is issued consequently.
For the second purpose, a particular address table
referred to as HistoryTable is used to record the I/O access history for further mining. When an incoming
request doesn’t extend any existing stream, its LBA is
searched in HistoryTable. If an address is hit, then it
suggests that two consecutive requests are found and a new stream is created. After that, the hit address is
required to be removed from the HistoryTable. Otherwise,
if no address is hit, the LBA of the expected succeeding
request is inserted into the HistoryTable for future sequential stream detection.
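As an illustration of the two structures described above, the following Python sketch (field names and data layout are assumptions, not the paper's implementation) maintains a StreamQueue keyed by each stream's trigger LBA and a HistoryTable of expected successor addresses:

```python
class StreamDetector:
    """Sketch of sequential stream detection: StreamQueue is a dict keyed by
    each stream's trigger (the LBA of the next expected request) and
    HistoryTable records the expected successor LBAs of requests that did not
    extend any stream."""

    def __init__(self):
        self.stream_queue = {}    # trigger LBA -> stream state
        self.history_table = {}   # expected next LBA -> first request's LBA

    def on_request(self, lba, length, now):
        next_lba = lba + length
        stream = self.stream_queue.pop(lba, None)
        if stream is not None:
            # Continuation of an existing stream: update its attributes and
            # re-index it by its new trigger; a prefetch would be issued here.
            stream["len"] += length
            stream["last_time"] = now
            self.stream_queue[next_lba] = stream
            return stream
        if lba in self.history_table:
            # Two consecutive requests found: create a new stream and remove
            # the hit address from the HistoryTable.
            del self.history_table[lba]
            stream = {"len": length, "last_time": now}
            self.stream_queue[next_lba] = stream
            return stream
        # Otherwise remember where a successor of this request would land.
        self.history_table[next_lba] = lba
        return None
```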
C. Relationship Graph
Prior studies demonstrate that there is a strong
correlation between the accessed length and the
remaining length of a sequential stream. In addition, this correlation is stable for a given type of workload, such as
workload of database [3] and networked storage server
[8]. Inspired by these observations, we propose using the
Relationship Graph to mine this correlation. Based on this graph, FLAP can evaluate the hit probability of a prefetching request and adjust the prefetch length accordingly. As shown in Fig. 1, every vertex of the relationship
graph represents a unique length, which is identified in
units of flash page size. For example, a workload whose maximum stream length is n will build n vertices, one for each possible length. Each vertex has two
fields: Len and Count. The former field represents the
current length. The latter field specifies the count of
streams whose length is equal to or larger than its Len field. The locality strength between the accessed length
and the remaining length of a stream is represented as a
weighted edge. For example, a weight of 2 on edge(A,B) means that there have been two completed streams whose length has grown from 1 to at least 3. We use P(S.Len,d) to represent the hit probability of a prefetching request. In this function, S.Len represents the current stream length, and d is the prefetching depth. Considering a stream whose current length and prefetching length are both 1, the hit probability of this prefetching operation equals P(1,1) = edge(A,A)/A.Count = 40%; when the prefetching length increases to 3, the hit probability drops to P(1,3) = edge(A,C)/A.Count = 10%.
Figure 1. Diagram illustrating the construction of the Relationship Graph: two snapshots of the vertices (Len, Count) and weighted edges, before and after a completed stream of length 5 is inserted.
When a sequential stream is evicted from the StreamQueue, FLAP considers this stream to be completed. Subsequently, the relationship graph is updated with the length of this completed stream. An
example is provided in Fig.1 to illustrate the process of
updating. The original graph in Fig.1a consists of 10 sequential streams including 6 streams with length 1, 2
streams with length 2, 1 stream with length 3 and 1
stream with length 4. Subsequently, a sequential stream
with length 5 is completed, which activates the graph updating procedure. In this procedure, a new vertex E is built since the maximum stream length is now 5. After that, every existing vertex whose Len field is less than 5 has its Count field increased by 1. Additionally, for any edge(X,Y), if X.Len+Y.Len <= 5, then the edge weight is increased by 1.
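The vertex/edge bookkeeping and the hit-probability lookup can be sketched as follows. This is a minimal Python illustration that follows the update rule stated above; it is not the authors' code, but it reproduces the P(1,1) = 40% and P(1,3) = 10% values of the Fig. 1 example:

```python
from collections import defaultdict


class RelationshipGraph:
    """Vertex v keeps Count, the number of completed streams with length >= v;
    edge (a, b) counts streams that, after an accessed length of a, still had
    at least b more pages (i.e., total length >= a + b)."""

    def __init__(self):
        self.count = defaultdict(int)   # Len -> Count
        self.edge = defaultdict(int)    # (accessed_len, remaining_len) -> weight

    def add_completed_stream(self, length):
        for v in range(1, length + 1):
            self.count[v] += 1
        for x in range(1, length):
            for y in range(1, length - x + 1):  # x + y <= length
                self.edge[(x, y)] += 1

    def hit_probability(self, current_len, depth):
        """P(S.Len, d): probability that a prefetch of 'depth' pages issued at
        accessed length 'current_len' will actually be used."""
        total = self.count.get(current_len, 0)
        if total == 0:
            return 0.0
        return self.edge.get((current_len, depth), 0) / total


if __name__ == "__main__":
    g = RelationshipGraph()
    # The ten streams of the first snapshot in Fig. 1: 6 of length 1,
    # 2 of length 2, 1 of length 3 and 1 of length 4.
    for length, n in [(1, 6), (2, 2), (3, 1), (4, 1)]:
        for _ in range(n):
            g.add_completed_stream(length)
    print(g.hit_probability(1, 1))  # edge(A,A)/A.Count = 4/10 = 0.4
    print(g.hit_probability(1, 3))  # edge(A,C)/A.Count = 1/10 = 0.1
```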
III. DESIGN OF THE FLAP TECHNIQUE
One key issue for FLAP is to determine an optimal
prefetching strength for the flash cache. Without a proper strategy, the prefetching depth can be either too conservative or too aggressive. In the case of conservative prefetching, only a few costly physical accesses to magnetic disks will be avoided and the capacity advantage of the flash cache won't be leveraged. In the case of aggressive prefetching, the system suffers from inaccurate prediction penalties, incurring bandwidth waste, premature evictions and cache pollution [8]. To address this problem, we exploit a cost-benefit model in this section to make an acceptable tradeoff between the prefetching depth and the hit probability, thus maximizing the earning of flash cache prefetching.
Another key issue for FLAP is to minimize the GC
penalty. To achieve this goal, we separate the flash cache
into cache partition and prefetch partition, where only prefetched data could be allocated in the prefetch
partition. Moreover, a time-aware allocation scheme is
presented to optimize the data layout of the prefetched data in the prefetch partition so as to minimize the write-
amplification of GC.
A. Cost Model of HDD
We model prefetching using time as the metric to measure cost and benefit. The related notations are presented in Table I. Specifically, disk service times can be broken into three primary components: seek time,
rotational latency, and data transfer time. Seek time is the
amount of time needed to move a disk head to the correct radial position. Rotational latency is the amount of time
needed for the desired sector to rotate under the disk head.
Here, we refer to the combination of seek time and rotational latency as the head positioning time, which is represented as $T_{HDD}^{position}$. The data transfer time depends on the disk bandwidth and can be calculated as $\frac{d \cdot PageSize}{B_{HDD}}$.
TABLE I. THE NOTATIONS FOR THE COST-BENEFIT MODEL

Notation | Description
d | The prefetching depth, in units of the flash page size
PageSize | The page size of flash
B_HDD | The bandwidth of the HDD
T_HDD^position | The head positioning time (seek time plus rotational latency) of the HDD
P | The actual access probability of the prefetched data
S | A sequential stream
S.Len | The current length of a stream
S.avgRL | The average request length of a stream
From the HDD’s perspective, the cost of prefetching with a length of d is derived by Equation (1). In contrast,
the cost of non-prefetching is derived by Equation (2), where P(S.Len,d) denotes the probability that the
prefetched data d will actually be accessed.
$$C_{HDD}^{prefetch}(d) = T_{HDD}^{position} + \frac{d \cdot PageSize}{B_{HDD}} \qquad (1)$$
$$C_{HDD}^{non\text{-}prefetch}(d) = \frac{d \cdot P(S.Len,d)}{S.avgRL} \cdot T_{HDD}^{position} + \frac{d \cdot P(S.Len,d) \cdot PageSize}{B_{HDD}} \qquad (2)$$
The benefit of prefetching derives from the avoidance
of the costly disk accesses. Assuming prediction accuracy
P(S.Len,d) is constant, the more aggressive the prefetching length is, the more costly disk accesses it
avoids. However, in practice, the hit probability decreases
with increasing prefetching length. Thus, a sophisticated cost-benefit model is needed to trade off the access probability against the prefetching length so as to maximize the prefetching earning.
B. Cost Model of SSD-based Disk Cache
Although SSD-based caches bear similarities to
DRAM-based caches, there are significant differences between them. For conventional DRAM-based caches,
the cost of caching and prefetching is just negligible.
However, for the SSD-based disk cache, the cache
allocation, cache hit and garbage collection costs are extremely expensive, such that they must be carefully
considered in the cost model.
Figure 2. Diagram illustrating the page state transitions of caching (left) and prefetching (right): clean pages become valid through cache allocation (program); valid pages are invalidated by write hits, cache eviction, read hits on prefetched data, or exceeding the expired time; invalid pages return to clean through garbage collection (copyback and erase).
Generally, there are three states for the pages of SSD-
based disk cache, namely, clean (erased), valid, and invalid. Fig. 2 illustrates the transformations between
these page states. For both cached data and prefetched
data, when they are loaded they are filled into the chosen clean pages by the cache allocation module. The cost of cache allocation is measured by the time needed to load them into the SSD-based disk cache. It can be derived by Equation (3).
$$C_{SSD}^{allocation}(d) = d \cdot C_{SSD}^{page\ program} \qquad (3)$$
When the cache space is full of data, the previously
cached blocks or prefetched blocks will be evicted to make room for the arriving data. Commonly, LRU or its
variant CLOCK [2] is adopted as the replacement policy
for the cached data. For prefetching, FIFO will be
preferred. In FLAP, the prefetched blocks which have
been accessed by user requests are inclined to be evicted as early as possible by the replacement algorithm. By
doing this, the GC module inside SSD could find the
victim candidate including more invalid pages than
before, thus improving the GC efficiency. As illustrated in Fig. 2(b), the valid pages that have been accessed or have exceeded their expired time will be invalidated immediately by FLAP. This is due to the fact that sequentially accessed data has little temporal locality; that is, there is little value in caching it. The read hit cost is derived by Equation (4).
$$C_{SSD}^{hit}(d) = d \cdot P(S.Len, d) \cdot C_{SSD}^{page\ read} \qquad (4)$$
Figure 3. Diagram illustrating the greedy garbage collection policy in the SSD-based disk cache: the valid pages of the victim block are copied and written to the write pointer (one flash page read and one flash page write each), and the victim block is then erased (one flash block erase).
When cached data and prefetched data are eventually evicted from the SSD-based disk cache, the invalid pages they leave behind must be reclaimed by the costly GC operations once the number of clean pages falls below a predefined threshold. Fig. 3 illustrates an example of the
greedy GC policy. We assume there are 4 pages per block
and 4 total blocks in the cache space. Firstly, the first block which has the maximum invalid pages is selected
as the victim block. Consequently, the valid pages of the
victim block are read and then written to a clean page at
the write pointer. Finally, the victim block is erased. Quantitatively, the GC cost is derived as equation (5).
$$C_{SSD}^{GC}(d) = \frac{d}{N}\left(u \cdot N \cdot C_{SSD}^{copyback} + C_{SSD}^{erase}\right) \qquad (5)$$
Here, u denotes the utilization of the selected victim flash block, and N is the number of pages in a flash block. Thus, u·N denotes the number of valid pages in the victim block. In Fig. 3, the u of the victim block equals 0.25. Obviously, the smaller u is, the less GC cost we pay.
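As a rough numerical illustration (not from the paper; it simply plugs the flash timing parameters later listed in Table II, namely N = 64 pages per block, a copyback of 225 us and a block erase of 1.5 ms, together with the u = 0.25 of Fig. 3, into Equation (5) for a prefetch of d = 64 pages):

$$C_{SSD}^{GC}(64) = \frac{64}{64}\left(0.25 \cdot 64 \cdot 225\,\mu\mathrm{s} + 1.5\,\mathrm{ms}\right) = 3.6\,\mathrm{ms} + 1.5\,\mathrm{ms} = 5.1\,\mathrm{ms}$$

which is already comparable to the 4.7 ms HDD positioning time in Table II, underscoring why keeping u low matters for the prefetching earning.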
C. Prefetch Partition
To minimize the GC cost, FLAP splits the SSD-based disk cache into a separate prefetch partition and cache partition. The rationale is that the prefetched data and cached data have different access patterns. The cache partition is designed to exploit temporal locality, and LRU is the commonly used replacement policy there. As the SSD-based disk cache is the last-level disk cache, the newly cached data must stay for a long time because the
reuse distance is very long. However, prefetched data exhibits strong spatial locality but poor temporal locality.
Therefore, prefetched pages are immediately evicted after access, or discarded when they exceed the predicted expired time, thus making room for new data and facilitating GC by lowering the utilization rate u.
By separating the SSD-based disk cache into a prefetch and a cache partition, we are able to reduce the u of the victim block and thus the GC cost. Fig. 4 shows an
example that highlights the benefits of splitting the SSD-
based disk cache into a prefetch and cache partition. The
left side shows the behavior of a unified SSD-based disk cache and the right side shows the behavior of splitting
the SSD-based disk cache into a prefetch and cache
partition. Fig. 4 assumes there are 4 pages per block and 4 total blocks in an SSD-based disk cache. Ci represents the ith cached page, and Pi represents the ith prefetched page.
Initially, as illustrated in Fig. 4(a), there are 4 cached
pages and 4 prefetched pages in the cache space. After that, as illustrated by Fig. 4(b), the prefetched pages,
including P1, P2, P3, P4, have been evicted from the
prefetching list because they have been accessed or
exceeded their expired time. Consequently, by reading all valid data from victim blocks containing invalid pages,
garbage collection proceeds, erasing those blocks and
then sequentially rewriting the valid data. In this example, when the SSD-based disk cache is split into prefetch and
cache partitions, only 1 block is erased in garbage
collection since u is equal to 0, thus minimizing the GC
cost. This dramatically reduces the internal operations of SSD, including reads, writes and erases, compared to the
unified SSD-based disk cache that mixes prefetched data
and cached data.
Figure 4. Diagram illustrating the benefit of splitting the SSD-based disk cache into a prefetch partition and a cache partition: garbage collection on the unified cache costs 2 flash block erases, 4 flash page writes, and 4 flash page reads, whereas the split layout requires only 1 flash block erase.
D. Time-aware Allocation Scheme
To achieve the goal of minimizing the GC cost of
SSD-based prefetching, FLAP employs a time-aware
cache allocation scheme, by which the utilization rate of
victim blocks (u) can nearly drop to zero. The idea is that the prefetched data with adjacent expired time should be
allocated contiguously in physical address space.
Specifically, the basic granularity of prefetching is the flash page size. Any prefetching request is first aligned to the page size and then split into multiple page-level sub-requests, each of which contains a complete page and is assigned a timestamp of its expired time. When FLAP allocates clean pages to the page-level prefetching sub-requests, the sub-requests with adjacent expired times are filled into contiguous pages. By doing this, prefetching operations are optimized to minimize the GC cost.
Any generated prefetching sub-request is associated
with two timestamps, namely, expectedT and expiredT. Note that, “Time” here refers to the logical time, which is
measured by the number of total I/O accesses. The
expectedT represents the expected time when the
prefetched strip will be requested by future user demands. The expiredT represents the time threshold, beyond
which the sub-request should be cancelled. Using
Formula 6, we derive the expectedT of the sub-requests
in a prefetching request, where currentT is the current system time, SS represents the strip size, S.avgT
represents the average arrival interval of the stream, and i
represents the index of the sub-request within the prefetching request. In this formula, we set expectedT to 2 times the interval time to avoid prematurely degrading the sub-request. The expiredT is set to 5 times the expectedT; with this setting, the rate of incorrectly cancelled prefetching sub-requests is only 1-3% in our experiments.
$$expectedT_i = currentT + \frac{2 \cdot i \cdot SS \cdot S.avgT}{E(P_{len})} \qquad (6)$$
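A minimal Python sketch of the time-aware allocation idea is given below. The data layout and the reading of E(P_len) in Formula (6) as the average request length are our assumptions, not the paper's implementation; the point is that sub-requests are ordered by expiredT so that pages with adjacent expiry times fill the same flash blocks:

```python
def expected_times(currentT, d, strip_size, avg_interval, avg_req_len):
    """Sketch of Formula (6) under the stated reading: the i-th sub-request is
    expected to be demanded after roughly i*strip_size/avg_req_len further
    requests, doubled for safety; expiredT is five times expectedT, following
    the text literally."""
    out = []
    for i in range(1, d + 1):
        expectedT = currentT + 2 * i * strip_size * avg_interval / avg_req_len
        out.append({"index": i, "expectedT": expectedT, "expiredT": 5 * expectedT})
    return out


def time_aware_allocate(sub_requests, pages_per_block=64):
    """Time-aware allocation sketch: sort page-level sub-requests by expiredT
    so that pages expiring at adjacent (logical) times end up physically
    contiguous, ideally filling whole flash blocks that can later be reclaimed
    with u close to 0.  Each sub-request is a dict with an 'expiredT' field."""
    ordered = sorted(sub_requests, key=lambda sr: sr["expiredT"])
    return [ordered[i:i + pages_per_block]
            for i in range(0, len(ordered), pages_per_block)]
```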
Figure 5. Diagram illustrating the benefit of the time-aware allocation scheme: when garbage collection is triggered at t1, the conventional allocation scheme leaves a victim block with u = 3/4, while the time-aware allocation scheme, which groups sub-requests with adjacent expired times, leaves a victim block with u = 0.
Let's assume there are currently 4 active sequential streams in the workload, each of which has just been triggered to issue a prefetching request of 4 pages. Then, all these 4 prefetching requests are divided into 16 page-level sub-requests. We use $P_{s_i}^{t_j}$ to denote a page-level sub-request of stream $S_i$ with the expired time $t_j$. As illustrated in Fig. 5, the conventional allocation scheme
on the left side will allocate the clean pages for the prefetching requests sequentially, neglecting the expired-
time property. However, the time-aware allocation
scheme on the right side will allocate the clean pages for
the page-level sub-requests with respect to their expired time. That is, it tries to aggregate the sub-requests with
close expired-time values together. By doing this, when tnow is equal to t1, the utilization rate u under the conventional allocation scheme is as high as 3/4, while that under the time-aware allocation scheme is only 0. This demonstrates that time-aware cache allocation can significantly reduce the GC overhead of SSD-based prefetching, since u is proportional to the GC cost.
E. Cost-benefit Model
The prefetching cost of SSD-based disk cache includes
the disk access cost, cache allocation cost, read hit cost,
and GC cost, which is derived by Equation (7). The benefit of prefetching can be measured by the disk access
cost of non-prefetching, which is formulated by Equation
(8). Thus, the earning of SSD-based prefetching can be
quantified by Equation (9).
$$Cost^{prefetch}(d) = C_{HDD}^{prefetch}(d) + C_{SSD}^{allocation}(d) + C_{SSD}^{hit}(d) + C_{SSD}^{GC}(d) \qquad (7)$$

$$Benefit(d) = C_{HDD}^{non\text{-}prefetch}(d) \qquad (8)$$

$$Earning(d) = \left(\frac{d \cdot P(S.Len,d)}{S.avgRL} - 1\right) T_{HDD}^{position} - \left(1 - P(S.Len,d)\right)\frac{d \cdot PageSize}{B_{HDD}} - \left(C_{SSD}^{page\ program} + P(S.Len,d) \cdot C_{SSD}^{page\ read}\right) d - \left(u \cdot N \cdot C_{SSD}^{copyback} + C_{SSD}^{erase}\right)\frac{d}{N} \qquad (9)$$
To calculate the optimal prefetching depth that maximizes the earning of prefetching, an iterative method is utilized. Starting from d = 1 with a step of 1, a loop iterates to calculate the earning of prefetching until d is equal to the upper bound of the relationship graph. By doing this, the prefetching depth that brings the highest earning is considered to be optimal. Note that we assume u is 0 on account of the efficiency of the time-aware allocation scheme.
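The earning computation of Equation (9) and the iterative depth search can be sketched as follows. This is a Python illustration under the u = 0 assumption made above; the parameter names are ours, and the timing values would come from Table II:

```python
def earning(d, P, S_avgRL, T_pos, B_hdd, page_size,
            C_program, C_read, C_copyback, C_erase, N, u=0.0):
    """Sketch of Equation (9): avoided disk-access benefit minus the HDD
    prefetch cost and the SSD allocation, read-hit and GC costs.  All times
    share one unit (e.g. microseconds); S_avgRL and d are in flash pages;
    P is the hit probability P(S.Len, d) from the relationship graph."""
    hdd_term = (d * P / S_avgRL - 1.0) * T_pos
    transfer_term = (1.0 - P) * d * page_size / B_hdd
    ssd_term = (C_program + P * C_read) * d
    gc_term = (u * N * C_copyback + C_erase) * d / N
    return hdd_term - transfer_term - ssd_term - gc_term


def optimal_depth(graph, stream_len, max_depth, **params):
    """Iterative search described in the text: try d = 1, 2, ... up to the
    relationship graph's upper bound and keep the depth with the highest
    earning."""
    best_d, best_e = 0, float("-inf")
    for d in range(1, max_depth + 1):
        e = earning(d, graph.hit_probability(stream_len, d), **params)
        if e > best_e:
            best_d, best_e = d, e
    return best_d, best_e
```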
IV. EXPERIMENT METHODOLOGY
In order to evaluate the efficiency of the proposed techniques, we implemented a simulator for a flash-based SSD cache, which is a modified version of DiskSim 4.0 [9] with the SSD extension package [10]. The relevant parameters of the flash memory used in the experiments are listed in Table II. To study the performance and endurance impacts of
FLAP, we pick a set of real-world enterprise-scale workloads, namely RAD, Live Map (LM), and MSN [11]. LM is a read-intensive sequential I/O workload collected at the Microsoft Live Maps servers that provide satellite images. The tile front-end server (TFE) takes a user request for a location and passes it to the tile back-end server (TBE). RAD was collected at the back-end SQL server of a RADIUS authentication service. MSN was collected at several of Microsoft's Live file servers. The CFS server stores metadata information and blobs correlating users to files stored on the back-end file server (BEFS). Table III summarizes the basic characteristics of these traces, including the average request size for reads and writes, the percentage of read requests, the sequentiality, and the request inter-arrival time.
TABLE II. SIMULATION PARAMETERS

Device | Parameter | Configuration
SSD | Number of SSDs | 1
SSD | Interface | SATA
SSD | Capacity | 4 GB
SSD | Dies Per Package | 1
SSD | Planes Per Die | 1
SSD | Blocks Per Plane | 16384
SSD | Pages Per Block | 64
SSD | Page Size | 4 KB
SSD | Page Program | 300 us
SSD | Page Read | 125 us
SSD | Copyback | 225 us
SSD | Block Erase | 1.5 ms
SSD | GC Policy | Greedy (5%)
HDD | Number of HDDs | 3
HDD | Interface | SATA
HDD | Positioning Time | 4.7 ms
HDD | Bandwidth | 72 MB/s
We compare FLAP with the following techniques: no prefetching, traditional prefetching, and no partition. First, no prefetching means using the whole second-level SSD cache as a pure cache without prefetching. Second, traditional prefetching represents Linux's prefetching scheme: for sequential accesses, it adaptively grows the prefetching window up to a tunable upper bound, which is 128 KB by default. Note that the partition scheme is enabled when using traditional prefetching. Finally, no partition means traditional prefetching without the partition scheme.
V. RESULTS
In this section, we present a number of measurements
illustrating the performance improvements with FLAP.
These measurements include the hit rate, average response time, and erase times. In Fig. 6, we examine the hit rates of the second-level
flash cache under different workloads using various cache
management schemes. Note that the first-level DRAM cache size is fixed at 0.5% of the backing store. As we can see, in the beginning it takes several quarters to warm up the system. After that, the hit rates of the different schemes become stable. In Fig. 6(a)-6(b), the LM-TBE and RAD-BE are both workloads with strong sequentiality. As expected, FLAP is more efficient than the other schemes. In the warm-up period, the hit rates of the no prefetching scheme are low since caching is a passive scheme. In
because prefetching is a proactive scheme. When the hit
rates get stable, FLAP outperforms no prefetching,
TABLE III. CHARACTERISTICS OF ENTERPRISE-SCALE WORKLOAD TRACES

Workload | Description | Avg. Req. Size Read/Write (KB) | Read (%) | Seq. (%) | Inter-arrival (ms)
RAD | RAD-BE | 106.0/11.6 | 20.0 | 58.1 | 11.85
Live Map | LM-TBE | 53.04/61.90 | 78.9 | 70.4 | 3.87
MSN | MSN-CFS | 9.71/8.67 | 73.8 | 8.3 | 6.58
MSN | MSN-BEFS | 10.68/11.15 | 67.2 | 4.35 | 1.10
Figure 6. Flash cache hit rate over time under various workloads, comparing FLAP, traditional prefetching, no prefetching, and no partition: (a) under LM-TBE, (b) under RAD-BE, (c) under MSN-CFS, (d) under MSN-BEFS.
Figure 7. Average response time versus SSD cache size (as a percentage of the backing store) under various workloads, comparing FLAP, traditional prefetching, no partition, and no prefetching: (a) under LM-TBE, (b) under RAD-BE, (c) under MSN-CFS, (d) under MSN-BEFS.
traditional prefetching and no partition schemes by up to
13.55%, 5.71%, and 6.28%, respectively. The success of FLAP should be attributed to its high prefetching prediction accuracy, as well as to the separation of sequential streams, which avoids cache pollution. In Fig. 6(c)-6(d), the MSN-CFS and MSN-BEFS are both workloads with poor sequentiality. All the schemes achieve similar performance since there is little sequentiality for sequential prefetching schemes to exploit.
In Fig. 7, we examine the average response time of the
second-level flash cache under different workloads using various cache management schemes. Note that the first-level DRAM cache size is fixed at 0.5% of the backing store. We can observe that the average response time under the various prefetching schemes decreases as the second-level flash cache capacity increases, showing the dependency of performance on cache size. It is noticeable that FLAP performs better than the other schemes, especially when the second-level cache size is small. This is because the cost-benefit model of FLAP makes the prefetching decision more accurate. In addition, the partition and time-aware allocation schemes reduce the internal GC overhead, thereby reducing the response time.
Figure 8. Erase times of several policies (FLAP without partition, FLAP without time-aware allocation, and the FLAP baseline) normalized to a pure cache without prefetching, under each trace.
Fig. 8 examines the erase times of the different
schemes normalized to pure cache without prefetching
under various workloads. It shows that the erase times of FLAP are lower than those of the pure cache without prefetching by up to 40% under sequential workloads. This suggests that the partition and time-aware allocation schemes are successful in reducing the write-amplification of GC. We deduce that sequential prefetching separates sequential streams into the prefetch partition; as a result, the cache partition is not polluted by sequential data. In addition, the time-aware allocation scheme minimizes the GC overhead of the prefetch partition.
VI. RELATED WORK
Sequential prefetching is the most popular data
prefetching on hard disks. There have been several recent studies on it. Some of these studies issue aggressive
large-size prefetching requests to reduce the prefetching
cost. For example, STEP [8] is inclined to perform aggressive prefetching to hide disk access latency and
reduce the number of expensive disk operations. AMP [7]
gradually and heuristically adjusts the prefetching degree
and trigger distance on a per-stream basis so as to achieve the highest possible aggregate throughput.
Also, several of these studies focus on improving
prediction accuracy. Cao and Felten [12] proposed integrating application-controlled prefetching, caching
and scheduling for file systems. Chang and Gibson [13]
suggested leveraging the application hints by speculative
execution without modifying the code. C-Miner [14] performs prefetching by mining the correlation between
data blocks. TAP [5] adopts a table based prefetching
approach, where only the addresses of the accessed blocks are stored. Therefore, TAP is more likely to identify potential sequential streams in the workload because it retains more history information. STEP [8] also uses a table to detect sequential streams. This table is organized as a balanced tree, and each node
represents a detected or new sequential stream.
Since cache is a scarce resource in computer systems, many studies focus on how to efficiently manage
the prefetched data. The SARC algorithm [15] splits the
cache into prefetch cache and re-reference cache. It
focuses on balancing the cache allocation between them to adapt to the sequentiality of the workload. Li [16]
proposed a prefetch cache sizing scheme based on a
gradient descent-based greedy algorithm. TAP [5] uses a downward-pressure approach to reduce the prefetch size when the cache hit rate is stable. It also evicts the data of the previous request when the arriving request hits in an already detected sequential stream. The purpose is to keep the prefetch cache as small as possible while
keeping a stable hit rate.
Numerous hybrid storage systems have been proposed
in order to combine the positive properties of HDDs and SSDs. Most previous work employs the SSD as a cache
on top of the larger hard disk to improve random access
performance. For example, Intel’s Turbo Memory [1] uses NAND-based non-volatile memory as an HDD
cache. It determines the optimal set of data to pin in the flash cache so as to maximize the performance improvement.
Mercury [2] is designed as a persistent, write-through hypervisor cache for flash memory. FlashVM [17] is a
system architecture and a core virtual memory subsystem
built in the Linux kernel that uses dedicated flash for
paging. It modifies the paging system along code paths for allocating, reading and writing back pages to optimize
for the performance characteristics of flash. Hystor [18]
manages both SSDs and HDDs as one single block device. By monitoring I/O access patterns at runtime, Hystor can
effectively identify blocks that are best suitable to be held
in SSD. FlashTier [19] provides an interface designed for
caching. It achieves memory-efficient address space management, improved performance and cache
consistency to quickly recover cached data following a
crash.
Some researchers have combined solid-state flash devices and prefetching. Flashy prefetching [20] implements a data prefetcher for flash-based SSDs. It dynamically controls the aggressiveness of prefetching through access pattern detection and feedback. Unlike FLAP, Flashy prefetching retrieves data from the SSD device into the volatile DRAM cache. FAST [21] is an I/O prefetching technique that uses an entry-level SSD to provide end users with fast application launch performance. Unlike FAST, FLAP focuses on improving the run-time performance of the hybrid storage system.
VII. CONCLUSION
The SSD-based disk cache, which plays an important
role in the storage sub-system, has its own distinct
features, such as internal GC operations and limited lifespan. These special features introduce new challenges for the design of flash-based sequential prefetching techniques. Considering the technology trend that storage servers are equipped with more processors and terabyte-scale SSD-based disk caches, it is worthwhile to incorporate more sophisticated and powerful sequential prefetching schemes to improve the performance of sequential accesses. In this study, we have introduced a relationship graph
as a tool to determine the hit probability of a prefetching
request in a stream with a given length. Moreover, we
have incorporated this tool into our flash-aware cost-benefit model in order to explore the optimal prefetching length according to the sequentiality strength of the workload, which maximizes the benefit of aggressive prefetching while avoiding the performance penalty and cache pollution due to inaccurate prefetching.
We have designed a FLAP prefetching technique to
adapt the prefetching operation to the SSD-based disk cache. The flash layer is divided into a cache partition and a prefetch partition. In addition, a time-aware allocation scheme is applied to the prefetch partition so as to improve the internal GC efficiency. We believe that the insight behind FLAP is widely applicable not only to SSD-based disk caches, but to any system that has a flash acceleration layer and sequential workloads.
ACKNOWLEDGMENT
We would like to thank the anonymous reviewers for their insight and suggestions for improvement. This work is supported in part by the National Basic Research
Program of China (973 Program) under Grant
No.2011CB302303, the National Natural Science
Foundation of China (NSFC) under Grant No.60933002, the Soft Science Project of Henan Province under Grant
No.142400410123 and No. 142400410106.
REFERENCES
[1] J. Matthews, S. Trika, D. Hensgen, R. Coulson, and K. Grimsrud, “Intel turbo memory: Nonvolatile disk caches in the storage hierarchy of mainstream computer systems,” Trans. Storage, vol. 4, no. 2, pp. 401-424, May 2008.
[2] S. Byan, J. Lentini, A. Madan, and L. Pabon, “Mercury: Host-side flash caching for the data center,” in MSST, IEEE 28th Symposium on, 2012, pp. 1-12.
[3] W. W. Hsu, A. J. Smith, and H. C. Young, “I/O reference behavior of production database workloads and the tpc benchmarks - an analysis at the logic level,” ACM Trans. Database Syst., vol. 26, no. 1, pp. 96-143, Mar. 2001.
[4] Storage Performance Council. [Online]. Available: http: //www.storageperformance.org.
[5] M. Li, E. Varki, S. Bhatia, and A. Merchant, “Tap: table-based prefetching for storage caches,” in FAST’08, Berkeley, CA, USA, 2008.
[6] S. Bhatia, E. Varki, and A. Merchant, “Sequential prefetch cache sizing for maximal hit rate,” in IEEE International Symposium on MASCOTS’10, Miami Beach, Florida, USA, Aug 2010, pp. 89-98.
[7] B. S. Gill, L. Angel, and D. Bathen, “Amp: Adaptive multi-stream prefetching in a shared cache,” in FAST’07, 2007, pp. 185-198.
[8] S. Liang, S. Jiang, and X. Zhang, “Step: sequentiality and thrashing detection based prefetching to improve performance of networked storage servers,” in ICDCS ’07, Washington, DC, USA, 2007.
[9] G. Ganger, B. Worthington, Y. Patt, J. Griffin, J. Bucy, J. Schindler, and S. Schlosser, DiskSim 4.0. [Online]. Available: http://www.pdl.cmu.edu/DiskSim/.
[10] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy, “Design tradeoffs for ssd performance,” in USENIX 2008 Annual Technical Conference on Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 57-70.
[11] Storage Networking Industry Association. [Online]. Available: http://iotta.snia.org.
[12] P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “Implementation and performance of integrated application- controlled file caching, prefetching, and disk scheduling,” ACM Trans. Comput. Syst., vol. 14, no. 4, pp. 311–343, Nov. 1996.
[13] F. Chang and G. A. Gibson, “Automatic i/o hint generation through speculative execution,” in OSDI ’99, Berkeley, CA, USA, 1999, pp. 1–14.
[14] Z. Li, Z. Chen, S. M. Srinivasan, and Y. Zhou, “C-miner: mining block correlations in storage systems,” in FAST’04, Berkeley, CA, USA, pp. 173–186.
[15] B. S. Gill and D. S. Modha, “Sarc: sequential prefetching in adaptive replacement cache,” in Proceedings of the annual conference on USENIX 2005 Annual Technical Conference, Berkeley, CA, USA, 2005, pp. 293–308.
[16] C. Li and K. Shen, “Managing prefetch memory for data- intensive online servers,” in FAST’05, Berkeley, CA, USA, 2005, pp. 253–266.
[17] M. Saxena and M. M. Swift, “FlashVM: virtual memory management on flash,” in Proceedings of the 2010 USENIX conference on USENIX annual technical conference. Berkeley, CA, USA, 2010.
[18] F. Chen, D. A. Koufaty, and X. Zhang, “Hystor: making the best use of solid state drives in high performance storage systems,” in ICS’11, New York, NY, USA, 2011, pp. 22-32.
[19] M. Saxena, M. M. Swift, and Y. Zhang, “Flashtier: a lightweight, consistent and durable storage cache,” in EuroSys ’12. New York, NY, USA, 2012, pp. 267-280.
[20] A. J. Uppal, R. C.-L. Chiang, and H. H. Huang, “Flashy prefetching for high-performance flash drives,” in MSST, IEEE 28th Symposium on, San Diego, CA, USA 2012, pp. 1-12.
[21] Y. Joo, J. Ryu, S. Park, and K. G. Shin, “Fast: quick application launch on solid-state drives,” in FAST’11, Berkeley, CA, USA, 2011.
[22] Junfeng Liu, Heng Li, Yingming Liu, “Data Interpretation Technology for Continuous Measurement Production Profile Logging,” Journal of Multimedia, Vol 9, No 5 (2014), 729-735, May 2014.
Yang Liu received the B.S. and M.S. degree in computer science and technology from Zhengzhou University, Zhengzhou, China, in 2003 and 2006, respectively, and Ph.D. degree in computer architecture from Huazhong University of Science and Technology, Wuhan, China in 2013. He is a researcher at the Big Data Research
Institute, Modern Educational Technology Center, Henan University of Economics and Law, Zhengzhou, China. His research interest includes hybrid storage system, multi-level cache management and prefetching techniques.
Wang Wei received the B.S degree in computer science from Chongqing Communication Institute, Chongqing, China, in 2006. She is an assistant professor at the Computer Science Department of Henan Industry and Trade Vocational College. Her research interest includes database technology and big data.