
Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk

Myoungsoo Jung (UT Dallas), Mahmut Kandemir (PSU)
University of Texas at Dallas, Computer Architecture and Memory Systems Lab

Takeaway

• Observations:
– Employing more and more flash chips is not a promising solution
– Unbalanced flash chip utilization and low parallelism
• Challenges:
– The degree of parallelism and utilization depends highly on incoming I/O request patterns
• Our approach:
– Sprinkles I/O requests based on the internal resource layout rather than the order imposed by a storage queue
– Commits more memory requests to a specific internal flash resource

Revisiting NAND Flash Performance

• Memory cell performance (excluding data movement)
– READ: 20 us ~ 115 us (70 ~ 200 MB/sec)
– WRITE: 200 us ~ 5 ms (1.6 ~ 20 MB/sec)
• Flash interface (ONFI 3.0)
– SDR: 50 MB/sec
– NV-DDR: 200 MB/sec
– NV-DDR2: 533 MB/sec
• ONFI 4.0: 800 MB/sec

Revisiting NAND Flash Performance

• Flash interface (ONFI 4.0): 800 MB/sec
• PCI Express (single lane)
– 2.x: 500 MB/sec
– 3.0: 985 MB/sec
– 4.0: 1969 MB/sec
• PCIe 4.0 (16 lanes): 31.51 GB/sec

Revisiting NAND Flash Performance

• Memory cell: ~200 MB/s vs. flash interface: ~800 MB/s vs. host interface: ~31 GB/s
• Performance disparity (even under an ideal situation)
• How can we reduce the performance disparity?
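To make the disparity concrete, here is a back-of-the-envelope sketch using the numbers from the slides above; the derived chip count is an illustration under ideal assumptions, not a figure from the paper.

```python
# Back-of-the-envelope view of the bandwidth disparity on the slides.
CELL_BW_MB_S = 200        # per-chip cell bandwidth (read upper bound)
ONFI_BW_MB_S = 800        # ONFI 4.0 channel interface
PCIE_BW_MB_S = 31_510     # PCIe 4.0 x16 host interface (31.51 GB/s)

# Chips that must work in parallel for their cells to fill one channel.
chips_per_channel = ONFI_BW_MB_S / CELL_BW_MB_S        # 4.0
# Channels that must work in parallel to fill the host interface.
channels_needed = PCIE_BW_MB_S / ONFI_BW_MB_S          # ~39.4

print(f"{chips_per_channel:.0f} chips/channel x {channels_needed:.0f} channels "
      f"= ~{chips_per_channel * channels_needed:.0f} busy chips to saturate the host link")
```

In other words, even under ideal conditions, well over a hundred chips must be kept busy at once, which is why per-chip utilization becomes the central problem.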

Internal Parallelism

[Figure: a single host-level I/O request striped across the internal resources — channels CH A/CH B, each with WAY 0/WAY 1]

Unfortunately, the performance of many-chip SSDs is not significantly improved as the amount of internal resources increases.

Many-chip SSD Performance

[Plot: performance stagnates as more chips are added]

Utilization and Idleness

[Plots: utilization sharply goes down while idleness keeps growing]

I/O Service Routine in a Many-chip SSD

[Figure: I/O service path from the device-level queue to the flash controllers]
• NVMHC: I/O request parsing, queuing, and data movement initiation on arrivals from the device-level queue
• Core (Flash Translation Layer): memory request building, address translation, transaction decision, and memory request commitment; flash memory requests carry virtual addresses until address translation maps them to physical addresses
• Flash Controllers: transaction handling and the execution sequence — striping & pipelining, interleaving & sharing, and out-of-order scheduling for system- and flash-level parallelism
• Memory requests: data size is the same as the atomic flash I/O unit size

A flash transaction should be decided before entering the execution stage

Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
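As a minimal sketch of the memory-request-building step described above, the core splits each host I/O request into page-sized memory requests; a 4 KB atomic unit and the names MemRequest/build_mem_requests are assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed atomic flash I/O unit, in bytes

@dataclass
class MemRequest:
    io_id: int         # host I/O request this fragment belongs to
    virtual_page: int  # virtual address; physical mapping comes later (FTL)

def build_mem_requests(io_id: int, offset: int, length: int) -> list[MemRequest]:
    """Split one host I/O request into page-sized memory requests."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return [MemRequest(io_id, page) for page in range(first, last + 1)]

# A 20 KB request starting at byte 6144 touches pages 1..6 (six requests).
print(len(build_mem_requests(io_id=3, offset=6144, length=20480)))
```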

Challenge Examples

• Virtual Address Scheduler
• Physical Address Scheduler

Virtual Address Scheduler (VAS)

[Figure: I/O requests 1–5 committed in queue order over CHANNEL A, chips C0–C8 (chip ID / plane ID grid); BUS/CELL timing and ready/busy (RB) state at CHIP 3 (C3)]
• Requests are committed in virtual-address (queue) order, so each of Req. 1–5 pays its full latency back-to-back
• Tail collisions at a busy chip leave other chips idle — e.g., a stall due to the I/O request 3 collision at C5

Physical Address Scheduler (PAS)

[Figure: the same I/O requests 1–5 reordered by physical address over CHANNEL A, chips C0–C8; BUS/CELL timing at CHIP 3 (C3)]
• Reordering by physical address enables pipelining: the BUS activity of one request overlaps the CELL activity of another, saving cycles for Req. 1–5
• Tail collisions still occur when consecutive requests target the same busy chip
• A contrast sketch of the two commit orders follows below
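The difference between the two baselines can be sketched as two commit-order policies; this is a simplified illustration (the chip_of mapping and request tuples are hypothetical), not the controllers' actual logic.

```python
# VAS vs. PAS in miniature: same memory requests, different commit orders.
from collections import defaultdict, deque

def vas_commit(queue):
    """VAS: commit strictly in storage-queue (FIFO) order; a busy chip at
    the head stalls everything behind it (tail collision)."""
    return list(queue)

def pas_commit(queue, chip_of):
    """PAS: regroup the same requests per physical chip and round-robin
    across chips, so bus/cell phases of different chips can pipeline."""
    per_chip = defaultdict(deque)
    for req in queue:
        per_chip[chip_of(req)].append(req)   # FIFO preserved inside a chip
    order = []
    while any(per_chip.values()):
        for chip in sorted(per_chip):
            if per_chip[chip]:
                order.append(per_chip[chip].popleft())
    return order

reqs = [("io1", 0), ("io1", 3), ("io2", 1), ("io2", 4)]   # (io id, phys page)
print(vas_commit(reqs))                                   # queue order
print(pas_commit(reqs, chip_of=lambda r: r[1] % 3))       # interleaved by chip
```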

Observations

• # of chips < # of memory requests
– The total number of chips is much smaller than the total number of memory requests coming from different I/O requests
• There exist many requests heading to the same chip, but to different internal resources
– Multiple memory requests can be built into a high-FLP transaction if we could change the commit order

Insights

• Stalled memory requests can be immediately served
– If the scheduler could compose requests beyond the boundary of I/O requests and commit them regardless of their order
• The scheduler can have more flexibility in building a flash transaction with high FLP
– If it can commit requests targeting different flash internal resources

Sprinkler

• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
• Improving transactional locality
– Supply many memory requests to the underlying flash controllers

RIOS: Resource-driven I/O Scheduling

[Figure: memory requests from I/O requests 1–11 sprinkled over chips C0–C8 according to the resource layout]
• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
• RIOS is out-of-order scheduling
– Fine-granule out-of-order execution
– Maximizing utilization (a minimal scheduling sketch follows below)
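A minimal sketch of the resource-driven idea under stated assumptions: one commit per idle chip per scheduling step, with illustrative data structures rather than the paper's implementation.

```python
# RIOS in miniature: walk the chips, not the queue. Any pending memory
# request for an idle chip may be committed, across I/O request boundaries.
from collections import defaultdict, deque

def rios_schedule(pending, idle_chips):
    """pending: iterable of (io_id, chip, page); idle_chips: set of chip ids.
    Returns one memory request to commit per idle chip this step."""
    by_chip = defaultdict(deque)
    for req in pending:
        by_chip[req[1]].append(req)          # arrival order kept per chip
    committed = []
    for chip in sorted(idle_chips):          # the resource layout drives the scan
        if by_chip[chip]:
            committed.append(by_chip[chip].popleft())
    return committed

# Requests from three different I/O requests; chips 0 and 2 are idle.
pending = [(1, 0, 10), (1, 1, 11), (2, 2, 20), (3, 0, 30)]
print(rios_schedule(pending, idle_chips={0, 2}))   # [(1, 0, 10), (2, 2, 20)]
```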

FARO: FLP-Aware Request Over-commitment

• High Flash-Level Parallelism (FLP)
– Bring as many requests as possible to flash controllers, allowing them to coalesce many memory requests into a single flash transaction
• Consideration
– A careless memory request over-commitment can introduce more resource contention
• Overlap depth
– The number of memory requests heading to different planes and dies, but the same chip
• Connectivity
– Maximum number of memory requests that belong to the same I/O request

FARO: FLP-Aware Request Over-commitment

[Figure: two candidate transaction builds at chip C3, labeled RIOS and FARO — one with overlap depth 4 / connectivity 2, the other with overlap depth 4 / connectivity 1; see the metric sketch below]
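The two metrics can be computed directly from a candidate batch; this sketch follows the slide's definitions, with illustrative field names.

```python
# FARO's two batch metrics: a batch is a set of memory requests headed to
# the same chip.
from collections import Counter

def overlap_depth(batch):
    """Number of memory requests heading to different (die, plane) pairs
    of the same chip -- how much of the chip one transaction fills."""
    return len({(r["die"], r["plane"]) for r in batch})

def connectivity(batch):
    """Maximum number of memory requests in the batch that belong to the
    same host I/O request."""
    return max(Counter(r["io_id"] for r in batch).values())

batch = [{"io_id": 1, "die": 0, "plane": 0}, {"io_id": 1, "die": 0, "plane": 1},
         {"io_id": 2, "die": 1, "plane": 0}, {"io_id": 3, "die": 1, "plane": 1}]
print(overlap_depth(batch), connectivity(batch))  # 4 2
```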

Sprinkler

[Figure: I/O requests 1–5 scheduled by Sprinkler over CHANNEL A, chips C0–C8 (chip ID / plane ID grid); BUS/CELL activity at CHIP 3 (C3) is pipelined across ready/busy (RB) states, saving cycles for Req. 1–5; a toy composition of RIOS and FARO follows]
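Putting the pieces together, a toy composition (not the paper's algorithm): RIOS scans idle chips and gathers candidates across I/O boundaries, while a greedy FARO stand-in grows each batch toward high overlap depth; a fuller version would also weigh connectivity when ranking batches.

```python
# Toy Sprinkler step: RIOS-style resource scan + greedy FARO-style batching.
def sprinkle(pending, idle_chips, max_batch=4):
    """pending: list of dicts with chip/die/plane/io_id keys; returns one
    multi-request flash transaction per idle chip."""
    transactions = {}
    for chip in sorted(idle_chips):              # RIOS: resource-driven scan
        batch, seen_units = [], set()
        for r in (r for r in pending if r["chip"] == chip):
            unit = (r["die"], r["plane"])        # FARO: grow overlap depth
            if unit not in seen_units and len(batch) < max_batch:
                batch.append(r)
                seen_units.add(unit)
        if batch:
            transactions[chip] = batch
    return transactions
```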

Evaluations

• Simulation
– NFS (NANDFlashSim), http://nfs.camelab.org
– 64 ~ 1024 flash chips, dual-die, four-plane (our SSD simulator simultaneously executes 1024 NFS instances)
– Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
• Workloads
– Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
– High transactional-locality workloads: cfs2, msnfs2~3
• Schedulers
– VAS: Virtual Address Scheduler, using FIFO
– PAS: Physical Address Scheduler, using extra queues
– SPK1: Sprinkler, using only FARO
– SPK2: Sprinkler, using only RIOS
– SPK3: Sprinkler, using both FARO and RIOS

Throughput

[Plots: Bandwidth and IOPS — up to 300 MB/s and 4x improvements]
• Compared to VAS: 42 MB/s ~ 300 MB/s improvement
• Compared to PAS: 1.8 times better performance

I/O and Queuing Latency

[Plots: Avg. Latency and Queue Stall Time]
• SPK1 is worse than PAS, and SPK2 is worse than SPK1 for large request sizes
• SPK1 by itself cannot secure enough memory requests and still has a parallelism dependency
• SPK3 (Sprinkler) reduces the device-level latency and queue pending time by at least 59% and 86%, respectively

Idleness Evaluation

[Plots: Inter-chip Idleness and Intra-chip Idleness]
• SPK1 shows worse inter-idleness reduction than PAS, but better intra-idleness reduction
• Considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46%)

Conclusion and Related Work

• Conclusion:
– Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
– Sprinkler offers at least 56.6% shorter latency and 1.8x ~ 2.2x better bandwidth than a modern SSD controller
• Related work:
– Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA'10, MICRO'10 Y. Kim, MICRO'07, PACT'07]
– Physical address scheduling [ISCA'12, TC'11]

Parallelism Breakdown

[Plots: VAS, SPK1 (FARO-only), SPK2 (RIOS-only), SPK3 (Sprinkler)]

# of Transactions

[Plots: 64 chips and 1024 chips]

Time Series Analysis

[Plot: time series with GC activity marked]

Sensitivity Test

[Plots: 64 chips, 256 chips, and 1024 chips]
