FLAP: Flash-aware Prefetching for Improving
SSD-based Disk Cache
Yang Liu The Big Data Research Institute of Modern Educational Technology Center, Henan University of Economics and Law,
Zhengzhou 450046, China
Email: yangliu.huel@foxmail.com
Wang Wei
Department of Computer Science, Henan Industry and Trade Vocational College, Zhengzhou 450003, China
Email: wangwei.hntvc@foxmail.com
Abstract—In modern enterprise storage systems, there is a trend toward using NAND flash based solid state disks (SSDs) as a second-level disk cache that reduces slow accesses to hard disk drives (HDDs) by caching the hot data of HDDs in SSDs. However, using SSDs for both caching and prefetching has rarely been discussed, due to the performance penalty caused by unsuccessful prefetching, including garbage collection cost and wasted disk bandwidth. In this paper, a FLash-Aware Prefetching (FLAP) scheme is presented that provides aggressive prefetching to improve the performance of sequential disk accesses. FLAP has been evaluated using well-known real-world workloads. Experimental results show that FLAP offers performance improvements under sequential workloads compared with pure caching, while reducing the internal garbage collection of SSDs.
Index Terms—Hybrid Storage System; Flash Cache;
Prefetching; Garbage Collection
I. INTRODUCTION
Under current technology trends, the access time gap between processor and disk keeps increasing [1]. To bridge this
gap, multi-level buffer caches are employed along data
retrieving paths to reduce time-consuming disk accesses.
Besides multiple volatile buffer caches, many modern storage systems utilize NAND flash based large
nonvolatile memory as second-level buffer caches to
speed-up I/O accesses in storage servers [2]. As a promising nonvolatile storage with attractive
features, NAND flash based SSD caches in storage
systems play important roles in improving data access
performance. However, studies have shown that storage traffic exhibits much longer reuse distances at the SSD
cache due to the existence of the first-level DRAM cache.
Specifically, accesses to a SSD cache exhibit poorer
temporal locality than those to the first-level DRAM cache because these accesses are misses from the higher
level. Even if the capacity of the SSD cache keeps increasing, the cache miss rate cannot be reduced further beyond a certain size. To improve the resource utilization of the SSD cache,
prefetching would be a potential choice for storage
servers to improve data access performance. Specifically,
it could fully leverage the capacity advantage of the SSD
cache to perform aggressive prefetching for sequential accesses. In this way, more data can be fetched sequentially from disks in a single prefetching request, and thus more expensive disk accesses can be avoided.
Unlike a typical cache, flash cache has several unique features. Firstly, internal garbage collection (GC) cost
must be considered in a flash cache. In detail, NAND
flash is organized in units of pages and blocks. A typical flash page is 4 KB in size and a flash block is made up of
64 flash pages (256 KB). Reads and writes are performed
on a page basis and flash erasures are performed on a
block basis. Erase is the slowest operation while write is slower than read. A flash must perform an erase on a
block before it can write any data to a page belonging to
the block. Each page on flash can be in one of three different states, namely, valid, invalid, and erased. When data is written to an erased page, its state becomes valid. If a page contains an older version of data, it is said to be in the invalid state. Note that out-of-place updates and cache evictions result in a large number of invalid pages.
When the SSD has few free pages available for write requests, the garbage collector is employed to reclaim those invalid pages and create free pages. First, it selects a victim block based on a greedy policy, such as choosing the block with the maximum number of invalid pages. All valid pages within the victim block are then copied and written into another free block; this extra copying is called write amplification. After that, the victim block is erased to create a free block.
The efficiency of garbage collector is one of the dominant
factors affecting flash memory performance. Secondly, the lifespan of SSD cache can be drastically reduced by
the write-amplification, which is measured by the ratio of
the amount of data physically written to flash memory to
the amount of data logically written by the host. It has been reported that each flash memory cell can sustain
about 100K erase operations for Single-Level-Cell (SLC)
and only 10K erase operations for Multi-Level-Cell (MLC). Evidently, write-amplification becomes very high when pages with different access patterns are
mixed together. Lastly, unlike a DRAM cache, the SSD cache is nonvolatile and therefore serves as persistent storage.
This study is motivated by the following challenges
caused by the properties of flash:
Taking the GC cost into account, the cache allocation cost of the SSD cache is significantly higher than that of a conventional DRAM cache. Thus, the performance penalty of false prefetching is extremely expensive.
Since the prefetched data and cached data have remarkable disparities in spatial and temporal locality, placing them in the same SSD cache area will cause extensive write-amplification due to GC. As a result, the lifespan and performance of the flash cache layer will be impaired.
This paper addresses the above-mentioned problems.
More specifically, it has the following contributions:
We have built the cost-benefit model of the SSD-
based prefetching considering the internal
features of flash storage, especially the GC cost.
We employ a relationship graph tool to mine the
sequentiality of the workload. Algorithms are also
designed to choose the optimal prefetching length so as to balance the prefetching aggressiveness
and prediction accuracy.
We split the space of the SSD-based disk cache into a cache partition and a prefetch partition. A novel time-aware cache allocation scheme is utilized to manage the data layout of the prefetched data, by which the expensive internal garbage collection cost of prefetching can be minimized.
The rest of this paper is organized as follows: The next section briefly describes the background on sequential prefetching technology. Section III describes the
design of FLAP. Sections IV and V present the
experiment methodology and results. Related work is discussed in Section VI. Finally, Section VII concludes
this paper.
II. SEQUENTIAL PREFETCHING TECHNOLOGY
BACKGROUND
In this section, we present background knowledge on sequential prefetching technology. We also present a Relationship Graph model to learn the sequentiality strength of a given workload. With this graph, we can evaluate the hit probability of a prefetching request so that the prefetch length can be adjusted accordingly.
A. Sequentiality and Prefetching
Sequentiality, i.e., accessing data contiguously, is a common characteristic of practical I/O workloads. The ubiquity of sequentiality derives from the fact that file
systems and databases tend to contiguously store the file
data on storage disks. Hence, many common file
operations such as copy, scan, backup and recovery lead to sequential accesses. Additionally, some important
benchmarks also exhibit strong sequentiality, such as
TPC-D [3] and SPC-2 [4].
A stream is a sequence of I/O requests of an application. It is considered to be sequential if the
corresponding data are accessed contiguously. Due to the
semantic gap between the file system and the storage sub-
system, the only available information for sequential stream detection is Logical Block Address (LBA). With
LBA, sequential prefetching schemes can be easily
deployed without any hint of applications or file systems. By mining the history of the LBAs of past accesses, the
sequential prefetching schemes are able to achieve high
prediction accuracy. Since they are less sophisticated than
other kinds of forecasting techniques, sequential prefetching schemes are widely adopted in practical
storage systems.
Existing sequential prefetching schemes are mainly divided into three classes, namely, Prefetch Always (PA),
Prefetch On a Miss (PoM) and Prefetch On a Hit (PoH)
[5, 6]. Regarding the first, PA does not need a prediction module and always fetches the data contiguous to a request. Assuming there is abundant cache space, it achieves the highest hit rate but the lowest efficiency, because much of the prefetched data will never be accessed. In
contrast, PoM only prefetches data on a miss, and thereby its miss rate is high. Additionally, the prefetching of
random requests with little spatial locality will pollute the
cache. PoH is popular in practice because it achieves both high hit rate and cache size economy. In PoH, a trigger
mechanism [7] is employed to avoid cache misses.
Specifically, when PoH prefetches a set of disk blocks
into the cache pages on a cache hit or miss, it chooses a trigger page by a trigger offset before the end of the
prefetched set. After that, when the trigger page is hit by a future request, the next prefetching request for this stream is activated.
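To make the trigger mechanism concrete, the following is a minimal Python sketch (with hypothetical names such as PoHPrefetcher and trigger_offset, not taken from the paper) of how a prefetch-on-hit scheme might place a trigger page inside each prefetched set and continue the stream when that page is hit:

```python
class PoHPrefetcher:
    """Minimal Prefetch-on-Hit sketch: each prefetched set carries a trigger
    page located trigger_offset pages before its end; hitting the trigger
    issues the next prefetch for the same stream (names are illustrative)."""

    def __init__(self, prefetch_len=8, trigger_offset=2):
        self.prefetch_len = prefetch_len
        self.trigger_offset = trigger_offset
        self.triggers = {}  # trigger LBA -> LBA where the next prefetch starts

    def _issue_prefetch(self, start_lba):
        # Fetch [start_lba, start_lba + prefetch_len) from the HDD into the
        # cache (cache insertion itself is outside the scope of this sketch).
        end = start_lba + self.prefetch_len
        trigger = end - self.trigger_offset   # trigger page inside the set
        self.triggers[trigger] = end          # next prefetch continues at 'end'
        return list(range(start_lba, end))

    def on_access(self, lba, cache_hit):
        """Called for every demand access; returns the LBAs to prefetch, if any."""
        if lba in self.triggers:              # trigger page hit: continue the stream
            return self._issue_prefetch(self.triggers.pop(lba))
        if not cache_hit:                     # a miss may start a new prefetched set
            return self._issue_prefetch(lba + 1)
        return []


if __name__ == "__main__":
    p = PoHPrefetcher(prefetch_len=4, trigger_offset=2)
    print(p.on_access(100, cache_hit=False))  # miss -> prefetch 101..104, trigger at 103
    print(p.on_access(103, cache_hit=True))   # trigger hit -> prefetch 105..108
```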
B. Sequential Stream Detection
The role of sequential stream detection module is to
detect the sequential access pattern from the workload. It
serves two primary purposes: the continuation of existing streams and the identification of new streams. For the first purpose, a stream data structure is used to
represent the run-time object associated with each
identified sequential stream. This structure keeps
attributes describing the corresponding stream, e.g. the timestamp of the most recent request, the average request
length, the average inter-arrival rate, the LBA of the next
expected contiguous request, etc. We define the LBA of the next expected request as trigger, which serves as the
index of the stream. All the streams are organized by a
hash linked list named StreamQueue, using their triggers
as the keys and the pointers of the stream structures as the values. When a new request arrives, the request address is
searched in the hash table to see if any stream would be
triggered. If so, it means an extension of a stream occurs.
As a result, an update to the located stream is performed and a next prefetching operation is issued consequently.
For the second purpose, a particular address table
referred to as HistoryTable is used to record the I/O access history for further mining. When an incoming
request doesn’t extend any existing stream, its LBA is
searched in HistoryTable. If an address is hit, then it
suggests that two consecutive requests are found and a new stream is created. After that, the hit address is
required to be removed from the HistoryTable. Otherwise,
if no address is hit, the LBA of the expected succeeding
request is inserted into the HistoryTable for future sequential stream detection.
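As an illustration of the two structures described above, the following Python sketch (field names and data layout are assumptions, not the paper's implementation) maintains a StreamQueue keyed by each stream's trigger LBA and a HistoryTable of expected successor addresses:

```python
class StreamDetector:
    """Sketch of sequential stream detection: StreamQueue is a dict keyed by
    each stream's trigger (the LBA of the next expected request) and
    HistoryTable records the expected successor LBAs of requests that did not
    extend any stream."""

    def __init__(self):
        self.stream_queue = {}    # trigger LBA -> stream state
        self.history_table = {}   # expected next LBA -> first request's LBA

    def on_request(self, lba, length, now):
        next_lba = lba + length
        stream = self.stream_queue.pop(lba, None)
        if stream is not None:
            # Continuation of an existing stream: update its attributes and
            # re-index it by its new trigger; a prefetch would be issued here.
            stream["len"] += length
            stream["last_time"] = now
            self.stream_queue[next_lba] = stream
            return stream
        if lba in self.history_table:
            # Two consecutive requests found: create a new stream and remove
            # the hit address from the HistoryTable.
            del self.history_table[lba]
            stream = {"len": length, "last_time": now}
            self.stream_queue[next_lba] = stream
            return stream
        # Otherwise remember where a successor of this request would land.
        self.history_table[next_lba] = lba
        return None
```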
C. Relationship Graph
Prior studies demonstrate that there is a strong
correlation between the accessed length and the
remaining length of a sequential stream. In addition, this correlation is stable for a given type of workload, such as
workload of database [3] and networked storage server
[8]. Inspired by these observations, we propose using the
Relationship Graph to mine this correlation. Based on this graph, FLAP can evaluate the hit probability of a prefetching request and adjust the prefetch length accordingly. As shown in Fig. 1, every vertex of the relationship
graph represents a unique length, which is identified in
units of flash page size. For example, a workload whose maximum stream length is n will build n vertices, one for each possible length. Each vertex has two
fields: Len and Count. The former field represents the
current length. The latter field specifies the count of
streams whose length is equal to or larger than its Len field. The locality strength between the accessed length
and the remaining length of a stream is represented as a
weighted edge. For example, a weight of 2 on edge(A,B) means that there have been two completed streams whose length has grown from 1 to at least 3. We use P(S.Len,d) to represent the hit probability of a prefetching request. In this function, S.Len represents the current stream length, and d is the prefetching depth. Considering a stream whose current length and prefetching length are both 1, the hit probability of this prefetching operation equals P(1,1) = edge(A,A)/A.Count = 40%; when the prefetching length increases to 3, the hit probability drops to P(1,3) = edge(A,C)/A.Count = 10%.
Figure 1. Diagram illustrating the construction of the Relationship Graph: two snapshots of the vertices (Len, Count) and weighted edges, before and after a completed stream of length 5 is inserted.
When a sequential stream is evicted from the StreamQueue, FLAP considers this stream to be completed. Subsequently, the relationship graph is updated with the length of this completed stream. An
example is provided in Fig.1 to illustrate the process of
updating. The original graph in Fig.1a consists of 10 sequential streams including 6 streams with length 1, 2
streams with length 2, 1 stream with length 3 and 1
stream with length 4. Subsequently, a sequential stream
with length 5 is completed, which activates the graph updating procedure. In this procedure, a new vertex E is built since the maximum stream length is now 5. After that, every existing vertex whose Len field is less than 5 has its Count field increased by 1. Additionally, for any edge(X,Y), if X.Len+Y.Len <= 5, then the edge weight is increased by 1.
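The vertex/edge bookkeeping and the hit-probability lookup can be sketched as follows. This is a minimal Python illustration that follows the update rule stated above; it is not the authors' code, but it reproduces the P(1,1) = 40% and P(1,3) = 10% values of the Fig. 1 example:

```python
from collections import defaultdict


class RelationshipGraph:
    """Vertex v keeps Count, the number of completed streams with length >= v;
    edge (a, b) counts streams that, after an accessed length of a, still had
    at least b more pages (i.e., total length >= a + b)."""

    def __init__(self):
        self.count = defaultdict(int)   # Len -> Count
        self.edge = defaultdict(int)    # (accessed_len, remaining_len) -> weight

    def add_completed_stream(self, length):
        for v in range(1, length + 1):
            self.count[v] += 1
        for x in range(1, length):
            for y in range(1, length - x + 1):  # x + y <= length
                self.edge[(x, y)] += 1

    def hit_probability(self, current_len, depth):
        """P(S.Len, d): probability that a prefetch of 'depth' pages issued at
        accessed length 'current_len' will actually be used."""
        total = self.count.get(current_len, 0)
        if total == 0:
            return 0.0
        return self.edge.get((current_len, depth), 0) / total


if __name__ == "__main__":
    g = RelationshipGraph()
    # The ten streams of the first snapshot in Fig. 1: 6 of length 1,
    # 2 of length 2, 1 of length 3 and 1 of length 4.
    for length, n in [(1, 6), (2, 2), (3, 1), (4, 1)]:
        for _ in range(n):
            g.add_completed_stream(length)
    print(g.hit_probability(1, 1))  # edge(A,A)/A.Count = 4/10 = 0.4
    print(g.hit_probability(1, 3))  # edge(A,C)/A.Count = 1/10 = 0.1
```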
III. DESIGN OF THE FLAP TECHNIQUE
One key issue for FLAP is to determine an optimal
prefetching strength for the flash cache. Without a proper strategy, the prefetching depth can be either too conservative or too aggressive. In the case of conservative prefetching, only a few costly physical accesses to magnetic disks will be avoided and the capacity advantage of the flash cache won't be leveraged. In the case of aggressive prefetching, the system suffers from inaccurate prediction penalties, incurring bandwidth waste, premature evictions and cache pollution [8]. To address this problem, we exploit a cost-benefit model in this section to make an acceptable tradeoff between the prefetching depth and the hit probability, thus maximizing the earning of flash cache prefetching.
Another key issue for FLAP is to minimize the GC
penalty. To achieve this goal, we separate the flash cache
into cache partition and prefetch partition, where only prefetched data could be allocated in the prefetch
partition. Moreover, a time-aware allocation scheme is
presented to optimize the data layout of the prefetched data in the prefetch partition so as to minimize the write-
amplification of GC.
A. Cost Model of HDD
We model prefetching using time as the metric to measure cost and benefit. The related notations are presented in Table I. Specifically, disk service times can be broken into three primary components: seek time,
rotational latency, and data transfer time. Seek time is the
amount of time needed to move a disk head to the correct radial position. Rotational latency is the amount of time
needed for the desired sector to rotate under the disk head.
Here, we refer to the combination of seek time and rotational latency as the head positioning time, which is represented as $T_{HDD}^{position}$. The data transfer time depends on the disk bandwidth and can be calculated as $\frac{d \cdot PageSize}{B_{HDD}}$.
TABLE I. THE NOTATIONS FOR THE COST-BENEFIT MODEL

Notation | Description
d | The prefetching depth, in units of the flash page size
PageSize | The page size of flash
B_HDD | The bandwidth of the HDD
T_HDD^position | The head positioning time (seek time plus rotational latency) of the HDD
P | The actual access probability of the prefetched data
S | A sequential stream
S.Len | The current length of a stream
S.avgRL | The average request length of a stream
From the HDD’s perspective, the cost of prefetching with a length of d is derived by Equation (1). In contrast,
the cost of non-prefetching is derived by Equation (2), where P(S.Len,d) denotes the probability that the
prefetched data d will actually be accessed.
$$C_{HDD}^{prefetch}(d) = T_{HDD}^{position} + \frac{d \cdot PageSize}{B_{HDD}} \qquad (1)$$
$$C_{HDD}^{non\text{-}prefetch}(d) = \frac{d \cdot P(S.Len,d)}{S.avgRL} \cdot T_{HDD}^{position} + \frac{d \cdot P(S.Len,d) \cdot PageSize}{B_{HDD}} \qquad (2)$$
The benefit of prefetching derives from the avoidance
of the costly disk accesses. Assuming prediction accuracy
P(S.Len,d) is constant, the more aggressive the prefetching length is, the more costly disk accesses it
avoids. However, in practice, the hit probability decreases
with increasing prefetching length. Thus, a sophisticated cost-benefit model is needed to trade off the access probability against the prefetching length so as to maximize the prefetching earning.
B. Cost Model of SSD-based Disk Cache
Although SSD-based caches bear similarities to
DRAM-based caches, there are significant differences between them. For conventional DRAM-based caches,
the cost of caching and prefetching is just negligible.
However, for the SSD-based disk cache, the cache
allocation, cache hit and garbage collection costs are extremely expensive, such that they must be carefully
considered in the cost model.
Figure 2. Diagram illustrating the page state transitions of caching (left) and prefetching (right): clean pages become valid through cache allocation (program); valid pages are invalidated by write hits, cache eviction, read hits on prefetched data, or exceeding the expired time; invalid pages return to clean through garbage collection (copyback and erase).
Generally, there are three states for the pages of SSD-
based disk cache, namely, clean (erased), valid, and invalid. Fig. 2 illustrates the transformations between
these page states. For both cached data and prefetched
data, when they are loaded they are filled into the chosen clean pages by the cache allocation module. The cost of cache allocation is measured by the time needed to load them into the SSD-based disk cache. It can be derived by Equation (3).
$$C_{SSD}^{allocation}(d) = d \cdot C_{SSD}^{page\ program} \qquad (3)$$
When the cache space is full of data, the previously
cached blocks or prefetched blocks will be evicted to make room for the arriving data. Commonly, LRU or its
variant CLOCK [2] is adopted as the replacement policy
for the cached data. For prefetching, FIFO will be
preferred. In FLAP, the prefetched blocks which have
been accessed by user requests are inclined to be evicted as early as possible by the replacement algorithm. By
doing this, the GC module inside SSD could find the
victim candidate including more invalid pages than
before, thus improving the GC efficiency. As illustrated in Fig. 2(b), the valid pages that have been accessed or have exceeded their expired time will be invalidated immediately by FLAP. This is due to the fact that sequentially accessed data has little temporal locality; that is, there is little value in caching it. The read hit cost is derived by Equation (4).
$$C_{SSD}^{hit}(d) = d \cdot P(S.Len, d) \cdot C_{SSD}^{page\ read} \qquad (4)$$
Figure 3. Diagram illustrating the greedy garbage collection policy in the SSD-based disk cache: the valid pages of the victim block are copied and written to the write pointer (one flash page read and one flash page write each), and the victim block is then erased (one flash block erase).
When cached data and prefetched data are eventually evicted from the SSD-based disk cache, the invalid pages they leave behind must be reclaimed by the costly GC operations once the number of clean pages falls below a predefined threshold. Fig. 3 illustrates an example of the
greedy GC policy. We assume there are 4 pages per block
and 4 total blocks in the cache space. Firstly, the first block which has the maximum invalid pages is selected
as the victim block. Consequently, the valid pages of the
victim block are read and then written to a clean page at
the write pointer. Finally, the victim block is erased. Quantitatively, the GC cost is derived as equation (5).
$$C_{SSD}^{GC}(d) = \frac{d}{N}\left(u \cdot N \cdot C_{SSD}^{copyback} + C_{SSD}^{erase}\right) \qquad (5)$$
Here, u denotes the utilization of the selected victim flash block, and N is the number of pages in a flash block. Thus, u·N denotes the number of valid pages in the victim block. In Fig. 3, the u of the victim block equals 0.25. Obviously, the smaller u is, the less GC cost we pay.
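As a rough numerical illustration (not from the paper; it simply plugs the flash timing parameters later listed in Table II, namely N = 64 pages per block, a copyback of 225 us and a block erase of 1.5 ms, together with the u = 0.25 of Fig. 3, into Equation (5) for a prefetch of d = 64 pages):

$$C_{SSD}^{GC}(64) = \frac{64}{64}\left(0.25 \cdot 64 \cdot 225\,\mu\mathrm{s} + 1.5\,\mathrm{ms}\right) = 3.6\,\mathrm{ms} + 1.5\,\mathrm{ms} = 5.1\,\mathrm{ms}$$

which is already comparable to the 4.7 ms HDD positioning time in Table II, underscoring why keeping u low matters for the prefetching earning.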
C. Prefetch Partition
To minimize the GC cost, FLAP splits the SSD-based disk cache into a separate prefetch partition and cache partition. The rationale is that the prefetched data and cached data have different access patterns. The cache partition is designed to exploit temporal locality, and LRU is the commonly used replacement policy there. As the SSD-based disk cache is the last-level disk cache, the newly cached data must stay for a long time because the
reuse distance is very long. However, prefetched data exhibits strong spatial locality but poor temporal locality.
Therefore, prefetched pages are immediately evicted after access, or discarded when they exceed the predicted expired time, thus making room for new data and facilitating GC by lowering the utilization rate u.
By separating the SSD-based disk cache into a prefetch and a cache partition, we are able to reduce the u of the victim block and thus the GC cost. Fig. 4 shows an
example that highlights the benefits of splitting the SSD-
based disk cache into a prefetch and cache partition. The
left side shows the behavior of a unified SSD-based disk cache and the right side shows the behavior of splitting
the SSD-based disk cache into a prefetch and cache
partition. Fig. 4 assumes there are 4 pages per block and 4 total blocks in an SSD-based disk cache. Ci represents the ith cached page, and Pi represents the ith prefetched page.
Initially, as illustrated in Fig. 4(a), there are 4 cached
pages and 4 prefetched pages in the cache space. After that, as illustrated by Fig. 4(b), the prefetched pages,
including P1, P2, P3, P4, have been evicted from the
prefetching list because they have been accessed or
exceeded their expired time. Consequently, by reading all valid data from victim blocks containing invalid pages,
garbage collection proceeds, erasing those blocks and
then sequentially rewriting the valid data. In this example, when the SSD-based disk cache is split into prefetch and
cache partitions, only 1 block is erased in garbage
collection since u is equal to 0, thus minimizing the GC
cost. This dramatically reduces the internal operations of SSD, including reads, writes and erases, compared to the
unified SSD-based disk cache that mixes prefetched data
and cached data.
Figure 4. Diagram illustrating the benefit of splitting the SSD-based disk cache into a prefetch partition and a cache partition: garbage collection on the unified cache costs 2 flash block erases, 4 flash page writes, and 4 flash page reads, whereas the split layout requires only 1 flash block erase.
D. Time-aware Allocation Scheme
To achieve the goal of minimizing the GC cost of
SSD-based prefetching, FLAP employs a time-aware
cache allocation scheme, by which the utilization rate of
victim blocks (u) can nearly drop to zero. The idea is that the prefetched data with adjacent expired time should be
allocated contiguously in physical address space.
Specifically, the basic granularity of prefetching is the flash page size. Any prefetching request is first aligned to the page size and then split into multiple page-level sub-requests, each of which contains a complete page and is assigned a timestamp of its expired time. When FLAP allocates clean pages to the page-level prefetching sub-requests, the sub-requests with adjacent expired times are filled into contiguous pages. By doing this, prefetching operations are optimized to minimize the GC cost.
Any generated prefetching sub-request is associated
with two timestamps, namely, expectedT and expiredT. Note that, “Time” here refers to the logical time, which is
measured by the number of total I/O accesses. The
expectedT represents the expected time when the
prefetched strip will be requested by future user demands. The expiredT represents the time threshold, beyond
which the sub-request should be cancelled. Using
Formula 6, we derive the expectedT of the sub-requests
in a prefetching request, where currentT is the current system time, SS represents the strip size, S.avgT
represents the average arrival interval of the stream, and i
represents the index of the sub-request within the prefetching request. In this formula, we set expectedT to 2 times the interval time to avoid prematurely degrading the sub-request. The expiredT is set to 5 times the expectedT; with this setting, the rate of incorrectly cancelled prefetching sub-requests is only 1-3% in our experiments.
$$expectedT_i = currentT + \frac{2 \cdot i \cdot SS \cdot S.avgT}{E(P_{len})} \qquad (6)$$
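A minimal Python sketch of the time-aware allocation idea is given below. The data layout and the reading of E(P_len) in Formula (6) as the average request length are our assumptions, not the paper's implementation; the point is that sub-requests are ordered by expiredT so that pages with adjacent expiry times fill the same flash blocks:

```python
def expected_times(currentT, d, strip_size, avg_interval, avg_req_len):
    """Sketch of Formula (6) under the stated reading: the i-th sub-request is
    expected to be demanded after roughly i*strip_size/avg_req_len further
    requests, doubled for safety; expiredT is five times expectedT, following
    the text literally."""
    out = []
    for i in range(1, d + 1):
        expectedT = currentT + 2 * i * strip_size * avg_interval / avg_req_len
        out.append({"index": i, "expectedT": expectedT, "expiredT": 5 * expectedT})
    return out


def time_aware_allocate(sub_requests, pages_per_block=64):
    """Time-aware allocation sketch: sort page-level sub-requests by expiredT
    so that pages expiring at adjacent (logical) times end up physically
    contiguous, ideally filling whole flash blocks that can later be reclaimed
    with u close to 0.  Each sub-request is a dict with an 'expiredT' field."""
    ordered = sorted(sub_requests, key=lambda sr: sr["expiredT"])
    return [ordered[i:i + pages_per_block]
            for i in range(0, len(ordered), pages_per_block)]
```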
Figure 5. Diagram illustrating the benefit of the time-aware allocation scheme: when garbage collection is triggered at t1, the conventional allocation scheme leaves a victim block with u = 3/4, while the time-aware allocation scheme, which groups sub-requests with adjacent expired times, leaves a victim block with u = 0.
Let's assume there are currently 4 active sequential streams in the workload, each of which has just been triggered to issue a prefetching request of 4 pages. Then, all these 4 prefetching requests are divided into 16 page-level sub-requests. We use $P_{s_i}^{t_j}$ to denote a page-level sub-request of stream $S_i$ with the expired time $t_j$. As illustrated in Fig. 5, the conventional allocation scheme
on the left side will allocate the clean pages for the prefetching requests sequentially, neglecting the expired-
time property. However, the time-aware allocation
scheme on the right side will allocate the clean pages for
the page-level sub-requests with respect to their expired time. That is, it tries to aggregate the sub-requests with
close expired-time values together. By doing this, when tnow is equal to t1, the utilization rate u under the conventional allocation scheme is as high as 3/4, while that under the time-aware allocation scheme is only 0. This demonstrates that time-aware cache allocation can significantly reduce the GC overhead of SSD-based prefetching, since u is proportional to the GC cost.
E. Cost-benefit Model
The prefetching cost of SSD-based disk cache includes
the disk access cost, cache allocation cost, read hit cost,
and GC cost, which is derived by Equation (7). The benefit of prefetching can be measured by the disk access
cost of non-prefetching, which is formulated by Equation
(8). Thus, the earning of SSD-based prefetching can be
quantified by Equation (9).
$$Cost^{prefetch}(d) = C_{HDD}^{prefetch}(d) + C_{SSD}^{allocation}(d) + C_{SSD}^{hit}(d) + C_{SSD}^{GC}(d) \qquad (7)$$

$$Benefit(d) = C_{HDD}^{non\text{-}prefetch}(d) \qquad (8)$$

$$Earning(d) = \left(\frac{d \cdot P(S.Len,d)}{S.avgRL} - 1\right) T_{HDD}^{position} - \left(1 - P(S.Len,d)\right)\frac{d \cdot PageSize}{B_{HDD}} - \left(C_{SSD}^{page\ program} + P(S.Len,d) \cdot C_{SSD}^{page\ read}\right) d - \left(u \cdot N \cdot C_{SSD}^{copyback} + C_{SSD}^{erase}\right)\frac{d}{N} \qquad (9)$$
To calculate the optimal prefetching depth that maximizes the earning of prefetching, an iterative method is utilized. Starting from d = 1 with a step of 1, a loop iterates to calculate the earning of prefetching until d is equal to the upper bound of the relationship graph. By doing this, the prefetching depth that brings the highest earning is considered to be optimal. Note that we assume u is 0 on account of the efficiency of the time-aware allocation scheme.
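The earning computation of Equation (9) and the iterative depth search can be sketched as follows. This is a Python illustration under the u = 0 assumption made above; the parameter names are ours, and the timing values would come from Table II:

```python
def earning(d, P, S_avgRL, T_pos, B_hdd, page_size,
            C_program, C_read, C_copyback, C_erase, N, u=0.0):
    """Sketch of Equation (9): avoided disk-access benefit minus the HDD
    prefetch cost and the SSD allocation, read-hit and GC costs.  All times
    share one unit (e.g. microseconds); S_avgRL and d are in flash pages;
    P is the hit probability P(S.Len, d) from the relationship graph."""
    hdd_term = (d * P / S_avgRL - 1.0) * T_pos
    transfer_term = (1.0 - P) * d * page_size / B_hdd
    ssd_term = (C_program + P * C_read) * d
    gc_term = (u * N * C_copyback + C_erase) * d / N
    return hdd_term - transfer_term - ssd_term - gc_term


def optimal_depth(graph, stream_len, max_depth, **params):
    """Iterative search described in the text: try d = 1, 2, ... up to the
    relationship graph's upper bound and keep the depth with the highest
    earning."""
    best_d, best_e = 0, float("-inf")
    for d in range(1, max_depth + 1):
        e = earning(d, graph.hit_probability(stream_len, d), **params)
        if e > best_e:
            best_d, best_e = d, e
    return best_d, best_e
```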
IV. EXPERIMENT METHODOLOGY
In order to evaluate the efficiency of the proposed techniques, we implemented a simulator for a flash-based SSD cache, which is a modified version of DiskSim 4.0 [9] with the SSD extension package [10]. The relevant parameters of the flash memory used in the experiments are listed in Table II. To study the performance and endurance impacts of
FLAP, we pick a set of real-world enterprise-scale workloads, namely RAD, Live Map (LM), and MSN [11]. LM is a read-intensive sequential I/O workload collected at the Microsoft Live Maps servers that provide satellite images. The tile front-end server (TFE) takes a user request for a location and passes it to the tile back-end server (TBE). RAD was collected at the back-end SQL server of a RADIUS authentication service. MSN was collected at several of Microsoft's Live file servers. The CFS server stores metadata information and blobs correlating users to files stored on the back-end file server (BEFS). Table III summarizes the basic characteristics of these traces, including the average request size for reads and writes, the percentage of read requests, the sequentiality, and the request inter-arrival time.
TABLE II. SIMULATION PARAMETERS

Device | Parameter | Configuration
SSD | Number of SSDs | 1
SSD | Interface | SATA
SSD | Capacity | 4 GB
SSD | Dies Per Package | 1
SSD | Planes Per Die | 1
SSD | Blocks Per Plane | 16384
SSD | Pages Per Block | 64
SSD | Page Size | 4 KB
SSD | Page Program | 300 us
SSD | Page Read | 125 us
SSD | Copyback | 225 us
SSD | Block Erase | 1.5 ms
SSD | GC Policy | Greedy (5%)
HDD | Number of HDDs | 3
HDD | Interface | SATA
HDD | Positioning Time | 4.7 ms
HDD | Bandwidth | 72 MB/s
We compare FLAP with the following techniques: no prefetching, traditional prefetching, and no partition. First, no prefetching means using the whole second-level SSD cache as a pure cache without prefetching. Second, traditional prefetching represents Linux's prefetching scheme: for sequential accesses, it adaptively grows the prefetching window up to a tunable upper bound, which is 128 KB by default. Note that the partition scheme is enabled when using traditional prefetching. Finally, no partition means traditional prefetching without the partition scheme.
V. RESULTS
In this section, we present a number of measurements
illustrating the performance improvements with FLAP.
These measurements include the hit rate, average response time, and erase times. In Fig. 6, we examine the hit rates of the second-level
flash cache under different workloads using various cache
management schemes. Note that the first-level DRAM cache size is fixed at 0.5% of the backing store. As we can see, in the beginning it takes several quarters to warm up the system. After that, the hit rates of the different schemes become stable. In Fig. 6(a)-6(b), the LM-TBE and RAD-BE are both workloads with strong sequentiality. As expected, FLAP is more efficient than the other schemes. In the warm-up period, the hit rates of the no prefetching scheme are low since caching is a passive scheme. In
because prefetching is a proactive scheme. When the hit
rates get stable, FLAP outperforms no prefetching,
TABLE III. CHARACTERISTICS OF ENTERPRISE-SCALE WORKLOAD TRACES

Workload | Description | Avg. Req. Size Read/Write (KB) | Read (%) | Seq. (%) | Inter-arrival (ms)
RAD | RAD-BE | 106.0/11.6 | 20.0 | 58.1 | 11.85
Live Map | LM-TBE | 53.04/61.90 | 78.9 | 70.4 | 3.87
MSN | MSN-CFS | 9.71/8.67 | 73.8 | 8.3 | 6.58
MSN | MSN-BEFS | 10.68/11.15 | 67.2 | 4.35 | 1.10
Figure 6. Flash cache hit rate over time under various workloads, comparing FLAP, traditional prefetching, no prefetching, and no partition: (a) under LM-TBE, (b) under RAD-BE, (c) under MSN-CFS, (d) under MSN-BEFS.
Figure 7. Average response time versus SSD cache size (as a percentage of the backing store) under various workloads, comparing FLAP, traditional prefetching, no partition, and no prefetching: (a) under LM-TBE, (b) under RAD-BE, (c) under MSN-CFS, (d) under MSN-BEFS.
traditional prefetching and no partition schemes by up to
13.55%, 5.71%, and 6.28%, respectively. The success of FLAP should be attributed to its high prefetching prediction accuracy, as well as to the separation of sequential streams, which avoids cache pollution. In Fig. 6(c)-6(d), the MSN-CFS and MSN-BEFS are both workloads with poor sequentiality. All the schemes achieve similar performance since there is little sequentiality for sequential prefetching schemes to exploit.
In Fig. 7, we examine the average response time of the
second-level flash cache under different workloads using various cache management schemes. Note that the first-level DRAM cache size is fixed at 0.5% of the backing store. We can observe that the average response time under the various prefetching schemes decreases as the second-level flash cache capacity increases, showing the dependency of performance on cache size. It is noticeable that FLAP performs better than the other schemes, especially when the second-level cache size is small. This is because the cost-benefit model of FLAP makes the prefetching decision more accurate. In addition, the partition and time-aware allocation schemes reduce the internal GC overhead, thereby reducing the response time.
Figure 8. Erase times of several policies (FLAP without partition, FLAP without time-aware allocation, and the FLAP baseline) normalized to a pure cache without prefetching, under each trace.
Fig. 8 examines the erase times of the different
schemes normalized to pure cache without prefetching
under various workloads. It shows that the erase times of FLAP are lower than those of the pure cache without prefetching by up to 40% under sequential workloads. This suggests that the partition and time-aware allocation schemes are successful in reducing the write-amplification of GC. We deduce that sequential prefetching separates sequential streams into the prefetch partition; as a result, the cache partition is not polluted by sequential data. In addition, the time-aware allocation scheme minimizes the GC overhead of the prefetch partition.
VI. RELATED WORK
Sequential prefetching is the most popular data
prefetching on hard disks. There have been several recent studies on it. Some of these studies issue aggressive
large-size prefetching requests to reduce the prefetching
cost. For example, STEP [8] is inclined to perform aggressive prefetching to hide disk access latency and
reduce the number of expensive disk operations. AMP [7]
gradually and heuristically adjusts the prefetching degree
and trigger distance on a per-stream basis so as to achieve the highest possible aggregate throughput.
Also, several of these studies focus on improving
prediction accuracy. Cao and Felten [12] proposed integrating application-controlled prefetching, caching
and scheduling for file systems. Chang and Gibson [13]
suggested leveraging the application hints by speculative
execution without modifying the code. C-Miner [14] performs prefetching by mining the correlation between
data blocks. TAP [5] adopts a table based prefetching
approach, where only the addresses of the accessed blocks are stored. Therefore, TAP is more likely to identify potential sequential streams in the workload because it retains more history information. STEP [8] also uses a table to detect sequential streams. This table is organized as a balanced tree, and each node
represents a detected or new sequential stream.
Since cache is a scarce resource in computer systems, many studies focus on how to efficiently manage
the prefetched data. The SARC algorithm [15] splits the
cache into prefetch cache and re-reference cache. It
focuses on balancing the cache allocation between them to adapt to the sequentiality of the workload. Li [16]
proposed a prefetch cache sizing scheme based on a
gradient descent-based greedy algorithm. TAP [5] uses a downward-pressure approach to reduce the prefetch size when the cache hit rate is stable. It also evicts the data of the previous request when the arriving request hits in an already detected sequential stream. The purpose is to keep the prefetch cache as small as possible while
keeping a stable hit rate.
Numerous hybrid storage systems have been proposed
in order to combine the positive properties of HDDs and SSDs. Most previous work employs the SSD as a cache
on top of the larger hard disk to improve random access
performance. For example, Intel’s Turbo Memory [1] uses NAND-based non-volatile memory as an HDD
cache. It determines the optimal set of data to pin in the flash cache so as to maximize the performance improvement.
Mercury [2] is designed as a persistent, write-through hypervisor cache for flash memory. FlashVM [17] is a
system architecture and a core virtual memory subsystem
built in the Linux kernel that uses dedicated flash for
paging. It modifies the paging system along code paths for allocating, reading and writing back pages to optimize
for the performance characteristics of flash. Hystor [18]
manages both SSDs and HDDs as one single block device. By monitoring I/O access patterns at runtime, Hystor can
effectively identify blocks that are best suitable to be held
in SSD. FlashTier [19] provides an interface designed for
caching. It achieves memory-efficient address space management, improved performance and cache
consistency to quickly recover cached data following a
crash.
Some researchers have combined solid-state flash devices and prefetching. Flashy prefetching [20] implements a data prefetcher for flash-based SSDs. It dynamically controls the aggressiveness of prefetching through access pattern detection and feedback. Unlike FLAP, Flashy prefetching retrieves data from the SSD device into the volatile DRAM cache. FAST [21] is an I/O prefetching technique that uses an entry-level SSD to provide end users with fast application launch performance. Unlike FAST, FLAP focuses on improving the run-time performance of the hybrid storage system.
VII. CONCLUSION
The SSD-based disk cache, which plays an important
role in the storage sub-system, has its own distinct
features, such as internal GC operations and limited lifespan. These special features introduce new challenges for the design of flash-based sequential prefetching techniques. Considering the technology trend that storage servers are equipped with more processors and terabyte-scale SSD-based disk caches, it is worthwhile to incorporate more sophisticated and powerful sequential prefetching schemes to improve the performance of sequential accesses. In this study, we have introduced a relationship graph
as a tool to determine the hit probability of a prefetching
request in a stream with a given length. Moreover, we
have incorporated this tool into our flash-aware cost-benefit model in order to explore the optimal prefetching length according to the sequentiality strength of the workload, which maximizes the benefit of aggressive prefetching while avoiding the performance penalty and cache pollution due to inaccurate prefetching.
We have designed a FLAP prefetching technique to
adapt the prefetching operation to the SSD-based disk cache. The flash layer is divided into a cache partition and a prefetch partition. In addition, a time-aware allocation scheme is applied to the prefetch partition so as to improve the internal GC efficiency. We believe that the insight behind FLAP is widely applicable not only to SSD-based disk caches, but to any system that has a flash acceleration layer and sequential workloads.
ACKNOWLEDGMENT
We would like to thank the anonymous reviewers for their insight and suggestions for improvement. This work is supported in part by the National Basic Research
Program of China (973 Program) under Grant
No.2011CB302303, the National Natural Science
Foundation of China (NSFC) under Grant No.60933002, the Soft Science Project of Henan Province under Grant
No.142400410123 and No. 142400410106.
REFERENCES
[1] J. Matthews, S. Trika, D. Hensgen, R. Coulson, and K. Grimsrud, “Intel turbo memory: Nonvolatile disk caches in the storage hierarchy of mainstream computer systems,” Trans. Storage, vol. 4, no. 2, pp. 401-424, May 2008.
[2] S. Byan, J. Lentini, A. Madan, and L. Pabon, “Mercury: Host-side flash caching for the data center,” in MSST, IEEE 28th Symposium on, 2012, pp. 1-12.
[3] W. W. Hsu, A. J. Smith, and H. C. Young, “I/O reference behavior of production database workloads and the tpc benchmarks - an analysis at the logic level,” ACM Trans. Database Syst., vol. 26, no. 1, pp. 96-143, Mar. 2001.
[4] Storage Performance Council. [Online]. Available: http: //www.storageperformance.org.
[5] M. Li, E. Varki, S. Bhatia, and A. Merchant, “Tap: table-based prefetching for storage caches,” in FAST’08, Berkeley, CA, USA, 2008.
[6] S. Bhatia, E. Varki, and A. Merchant, “Sequential prefetch cache sizing for maximal hit rate,” in IEEE International Symposium on MASCOTS’10, Miami Beach, Florida, USA, Aug 2010, pp. 89-98.
[7] B. S. Gill, L. Angel, and D. Bathen, “Amp: Adaptive multi-stream prefetching in a shared cache,” in FAST’07, 2007, pp. 185-198.
[8] S. Liang, S. Jiang, and X. Zhang, “Step: sequentiality and thrashing detection based prefetching to improve performance of networked storage servers,” in ICDCS ’07, Washington, DC, USA, 2007.
[9] G. Ganger, B. Worthington, Y. Patt, J. Griffin, J. Bucy, J. Schindler, and S. Schlosser, DiskSim 4.0. [Online]. Available: http://www.pdl.cmu.edu/DiskSim/.
[10] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy, “Design tradeoffs for ssd performance,” in USENIX 2008 Annual Technical Conference on Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 57-70.
[11] Storage Networking Industry Association. [Online]. Available: http://iotta.snia.org.
[12] P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “Implementation and performance of integrated application- controlled file caching, prefetching, and disk scheduling,” ACM Trans. Comput. Syst., vol. 14, no. 4, pp. 311–343, Nov. 1996.
[13] F. Chang and G. A. Gibson, “Automatic i/o hint generation through speculative execution,” in OSDI ’99, Berkeley, CA, USA, 1999, pp. 1–14.
[14] Z. Li, Z. Chen, S. M. Srinivasan, and Y. Zhou, “C-miner: mining block correlations in storage systems,” in FAST’04, Berkeley, CA, USA, pp. 173–186.
[15] B. S. Gill and D. S. Modha, “Sarc: sequential prefetching in adaptive replacement cache,” in Proceedings of the annual conference on USENIX 2005 Annual Technical Conference, Berkeley, CA, USA, 2005, pp. 293–308.
[16] C. Li and K. Shen, “Managing prefetch memory for data- intensive online servers,” in FAST’05, Berkeley, CA, USA, 2005, pp. 253–266.
[17] M. Saxena and M. M. Swift, “FlashVM: virtual memory management on flash,” in Proceedings of the 2010 USENIX conference on USENIX annual technical conference. Berkeley, CA, USA, 2010.
[18] F. Chen, D. A. Koufaty, and X. Zhang, “Hystor: making the best use of solid state drives in high performance storage systems,” in ICS’11, New York, NY, USA, 2011, pp. 22-32.
[19] M. Saxena, M. M. Swift, and Y. Zhang, “Flashtier: a lightweight, consistent and durable storage cache,” in EuroSys ’12. New York, NY, USA, 2012, pp. 267-280.
[20] A. J. Uppal, R. C.-L. Chiang, and H. H. Huang, “Flashy prefetching for high-performance flash drives,” in MSST, IEEE 28th Symposium on, San Diego, CA, USA 2012, pp. 1-12.
[21] Y. Joo, J. Ryu, S. Park, and K. G. Shin, “Fast: quick application launch on solid-state drives,” in FAST’11, Berkeley, CA, USA, 2011.
[22] Junfeng Liu, Heng Li, Yingming Liu, “Data Interpretation Technology for Continuous Measurement Production Profile Logging,” Journal of Multimedia, Vol 9, No 5 (2014), 729-735, May 2014.
Yang Liu received the B.S. and M.S. degree in computer science and technology from Zhengzhou University, Zhengzhou, China, in 2003 and 2006, respectively, and Ph.D. degree in computer architecture from Huazhong University of Science and Technology, Wuhan, China in 2013. He is a researcher at the Big Data Research
Institute, Modern Educational Technology Center, Henan University of Economics and Law, Zhengzhou, China. His research interest includes hybrid storage system, multi-level cache management and prefetching techniques.
Wang Wei received the B.S degree in computer science from Chongqing Communication Institute, Chongqing, China, in 2006. She is an assistant professor at the Computer Science Department of Henan Industry and Trade Vocational College. Her research interest includes database technology and big data.