Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas Computer Architecture and Memory systems Lab




Takeaway
• Observations:
– Employing more and more flash chips is not a promising solution
– Unbalanced flash chip utilization and low parallelism
• Challenges:
– The degree of parallelism and utilization depends highly on incoming I/O request patterns
• Our approach:
– Sprinkles I/O requests based on the internal resource layout rather than the order imposed by a storage queue
– Commits more memory requests to a specific internal flash resource


Revisiting NAND Flash Performance

Memory Cell Performance (excluding data movement)
– READ: 20 us ~ 115 us
– WRITE: 200 us ~ 5 ms
– Throughput: READ 70 ~ 200 MB/sec, WRITE 1.6 ~ 20 MB/sec

Flash Interface (ONFI 3.0)
– SDR: 50 MB/sec
– NV-DDR: 200 MB/sec
– NV-DDR2: 533 MB/sec

ONFI 4.0: 800 MB/sec


Revisiting NAND Flash Performance

ONFI 4.0: 800 MB/sec

PCI Express (single lane)
– 2.x: 500 MB/sec
– 3.0: 985 MB/sec
– 4.0: 1969 MB/sec

PCIe 4.0 (16 lanes): 31.51 GB/sec


Revisiting NAND Flash Performance

[Figure: flash cell read 200 MB/s, ONFI 4.0 interface 800 MB/s, PCIe 4.0 (16 lanes) 31 GB/s]

Performance Disparity (even under an ideal situation)
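The disparity can be made concrete with a little arithmetic on the peak rates quoted on these slides. The sketch below is only a back-of-the-envelope estimate: it assumes perfect parallelism, treats one ONFI 4.0 bus as one channel, and takes 1 GB as 1000 MB. The helper name `units_to_saturate` is hypothetical.

```python
import math

# Peak rates quoted on the preceding slides, in MB/sec.
FLASH_READ_MBPS = 200.0     # upper end of the cell READ range
FLASH_WRITE_MBPS = 20.0     # upper end of the cell WRITE range
ONFI4_MBPS = 800.0          # ONFI 4.0 flash interface
PCIE4_X16_MBPS = 31_510.0   # PCIe 4.0, 16 lanes (assumes 1 GB = 1000 MB)


def units_to_saturate(target_mbps: float, unit_mbps: float) -> int:
    """Smallest number of units whose combined peak rate reaches the target."""
    return math.ceil(target_mbps / unit_mbps)


# Chips needed to fill one ONFI 4.0 bus with reads.
chips_per_channel_read = units_to_saturate(ONFI4_MBPS, FLASH_READ_MBPS)
# ONFI 4.0 buses needed to fill the host link.
channels_for_host = units_to_saturate(PCIE4_X16_MBPS, ONFI4_MBPS)
# Chips needed to fill the host link with reads, then with writes.
chips_for_host_read = units_to_saturate(PCIE4_X16_MBPS, FLASH_READ_MBPS)
chips_for_host_write = units_to_saturate(PCIE4_X16_MBPS, FLASH_WRITE_MBPS)

print(chips_per_channel_read, channels_for_host,
      chips_for_host_read, chips_for_host_write)
```

Even with every chip running at its best-case rate, the host link could absorb the output of well over a hundred chips for reads and more than a thousand for writes, which is why keeping many chips busy at once matters.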


How can we reduce the performance disparity?


Internal Parallelism

[Figure: a single host-level I/O request shown against the internal flash layout, organized by channel (CH A, CH B) and way (WAY 0, WAY 1)]


Unfortunately, the performance of many-chip SSDs does not improve significantly as the amount of internal resources increases


Many-chip SSD Performance

Performance stagnates


Utilization and Idleness

Utilization sharply goes down

Idleness keeps growing


I/O Service Routine in a Many-chip SSD

[Figure: I/O service routine. Labels recovered from the diagram:]
• Arrivals: I/O Request, Device-level Queue
• NVMHC: Queuing, Parsing, Data Movement Initiation, Memory Request Building
• Core (Flash Translation Layer): Address Translation, from Flash Memory Request (Virtual Address) to Flash Memory Request (Physical Address)
• Flash Controllers: Memory Request Commitment, Transaction Handling
• Execution Sequence: Striping & Pipelining
• Transaction Decision: Interleaving & Sharing
• Out-of-order Scheduling, System- and Flash-level Parallelism

Memory Requests: data size is the same as atomic flash I/O unit size

A flash transaction should be decided before entering the execution stage

Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
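To illustrate the memory request building step, here is a minimal sketch of how a host-level I/O request might be cut into memory requests of one atomic flash I/O unit each. The 8 KB unit size, the `MemoryRequest` type, and the function name are assumptions made for illustration; the slides do not give them.

```python
from dataclasses import dataclass

PAGE_SIZE = 8192  # assumed atomic flash I/O unit in bytes; not stated on the slide


@dataclass(frozen=True)
class MemoryRequest:
    io_id: int         # host-level I/O request this piece belongs to
    virtual_page: int  # page-granular virtual address, before translation


def build_memory_requests(io_id, offset, length, page_size=PAGE_SIZE):
    """Split one I/O request (byte offset and length) into page-sized pieces."""
    if length <= 0:
        return []
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return [MemoryRequest(io_id, page) for page in range(first, last + 1)]


# A 32 KB request aligned to a page boundary becomes four memory requests.
aligned = build_memory_requests(io_id=1, offset=0, length=32768)
# An unaligned 8 KB request touches two pages.
unaligned = build_memory_requests(io_id=2, offset=4096, length=8192)
print(len(aligned), [r.virtual_page for r in unaligned])
```

Because the host decides both `offset` and `length`, the number of memory requests and the chips they land on are outside the device's control, which is the challenge the slide points at.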


Challenge Examples

• Virtual Address Scheduler
• Physical Address Scheduler


Virtual Address Scheduler (VAS)

[Figure: timing diagram of I/O requests 1 to 5 on Channel A over a 3x3 chip array (C0 to C8), with a close-up of Chip 3 (C3). Per-chip timelines show BUS and CELL activity, the RB signal (true/false), chip ID, plane ID, and physical offset. Annotations: Idle; Tail Collision; Stall due to the I/O Request 3 collision at C5; latency of Req. 1 to Req. 5]


Physical Address Scheduler (PAS)

[Figure: timing diagram of I/O requests 1 to 5 on Channel A over a 3x3 chip array (C0 to C8), with a close-up of Chip 3 (C3). Per-chip timelines show BUS and CELL activity, the RB signal (true/false), chip ID, plane ID, and physical offset. Annotations: Pipelining; Collision; Tail Collision; Cycle Saved; latency of Req. 1 to Req. 5]


Observations

• # of chips < # of memory requests
– The total number of chips is relatively small compared with the total number of memory requests coming from different I/O requests
• Many requests head to the same chip, but to different internal resources
– Multiple memory requests can be built into a high-FLP transaction if we can change the commit order
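The second observation can be checked mechanically on a pending queue. The following sketch counts, per chip, how many distinct (die, plane) targets are waiting; the tuple layout and the sample queue are made up for illustration and do not come from the slides.

```python
from collections import defaultdict

# Each pending memory request is (io_id, chip, die, plane), in queue order.
# The values are illustrative only.
pending = [
    (1, 3, 0, 0), (1, 4, 0, 0),
    (2, 3, 0, 1), (2, 5, 0, 0),
    (3, 3, 1, 0), (3, 5, 0, 1),
    (4, 3, 1, 1),
]


def potential_flp(requests):
    """Per chip, count the distinct (die, plane) targets among pending requests."""
    targets = defaultdict(set)
    for _io_id, chip, die, plane in requests:
        targets[chip].add((die, plane))
    return {chip: len(resources) for chip, resources in targets.items()}


flp = potential_flp(pending)
print(flp)
```

In this sample, chip 3 has four requests aimed at four different internal resources, but they belong to four different I/O requests, so a scheduler that respects I/O request order would not see them together.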


Insights
• Stalled memory requests can be served immediately
– If the scheduler can compose requests beyond the boundaries of I/O requests and commit them regardless of their order
• The scheduler has more flexibility in building a flash transaction with high FLP
– If it can commit memory requests that target different internal flash resources


Sprinkler

• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
• Improving transactional locality
– Supply many memory requests to underlying flash controllers


RIOS: Resource-driven I/O Scheduling

• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout

[Figure: queued I/O requests 1 to 11 mapped onto a 3x3 chip array (C0 to C8)]


RIOS: Resource-driven I/O Scheduling

• RIOS
– Out-of-Order Scheduling
– Fine Granule Out-of-Order Execution
– Maximizing Utilization

[Figure: queued I/O requests 1 to 11 mapped onto a 3x3 chip array (C0 to C8)]
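The effect of out-of-order, resource-driven commitment can be seen in a toy model. The sketch below is not the RIOS algorithm itself: it ignores bus contention, planes, and I/O request boundaries, and simply compares an in-order committer (in the spirit of a FIFO queue) with one that commits any pending request whose target chip is idle. The function `simulate` and its parameters are hypothetical.

```python
def simulate(requests, service_time, out_of_order):
    """Return the tick at which the last request finishes.

    requests: target chip IDs in arrival order (one entry per memory request).
    service_time: ticks a chip stays busy per request (assumed constant).
    out_of_order: if True, commit any request whose chip is idle;
                  if False, stop at the first request whose chip is busy.
    """
    pending = list(requests)
    busy_until = {}  # chip ID -> tick at which the chip becomes idle
    tick = 0
    while pending:
        if out_of_order:
            remaining = []
            for chip in pending:
                if busy_until.get(chip, 0) <= tick:
                    busy_until[chip] = tick + service_time
                else:
                    remaining.append(chip)
            pending = remaining
        else:
            while pending and busy_until.get(pending[0], 0) <= tick:
                chip = pending.pop(0)
                busy_until[chip] = tick + service_time
        tick += 1
    return max(busy_until.values(), default=0)


queue = [3, 3, 4, 4, 5, 5]  # two requests each for chips C3, C4, C5
in_order_ticks = simulate(queue, service_time=2, out_of_order=False)
resource_driven_ticks = simulate(queue, service_time=2, out_of_order=True)
print(in_order_ticks, resource_driven_ticks)
```

In the in-order run the second request for C3 blocks everything behind it, so C4 and C5 sit idle; the resource-driven run keeps all three chips busy and finishes in half the time for this queue.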


FARO: FLP-Aware Request Over-commitment

• High Flash-Level Parallelism (FLP)
– Bring as many requests as possible to flash controllers, allowing them to coalesce many memory requests into a single flash transaction
• Consideration
– Careless over-commitment of memory requests can introduce more resource contention


FARO: FLP-Aware Request Over-commitment

• Overlap Depth
– The number of memory requests heading to different planes and dies, but the same chip
• Connectivity
– Maximum number of memory requests that belong to the same I/O request

[Figure: memory requests committed to chip C3, with RIOS and FARO labels. Annotations: Overlap depth: 4, Connectivity: 2; Overlap depth: 4, Connectivity: 1]
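The two metrics defined above can be computed directly from a group of memory requests bound for one chip. This is a sketch of one reading of the definitions: overlap depth is taken as the number of distinct (die, plane) targets in the group, and connectivity as the largest number of requests sharing an I/O request. The `MemReq` tuple and both sample groups are hypothetical.

```python
from collections import Counter, namedtuple

# Hypothetical record for one memory request after address translation.
MemReq = namedtuple("MemReq", "io_id chip die plane")


def overlap_depth(group):
    """Distinct (die, plane) targets among requests bound for the same chip."""
    if len({r.chip for r in group}) > 1:
        raise ValueError("all requests in a group must target the same chip")
    return len({(r.die, r.plane) for r in group})


def connectivity(group):
    """Largest number of requests in the group that share one I/O request."""
    per_io = Counter(r.io_id for r in group)
    return max(per_io.values(), default=0)


# Four requests for chip 3 coming from two I/O requests.
group_a = [MemReq(1, 3, 0, 0), MemReq(1, 3, 0, 1),
           MemReq(2, 3, 1, 0), MemReq(2, 3, 1, 1)]
# Four requests for chip 3 coming from four different I/O requests.
group_b = [MemReq(1, 3, 0, 0), MemReq(2, 3, 0, 1),
           MemReq(3, 3, 1, 0), MemReq(4, 3, 1, 1)]

print(overlap_depth(group_a), connectivity(group_a))
print(overlap_depth(group_b), connectivity(group_b))
```

Both groups reach the same overlap depth, so they would coalesce into equally parallel transactions; they differ only in how many of their requests belong to the same I/O request.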


Sprinkler

[Figure: timing diagram of I/O requests 1 to 5 on Channel A over a 3x3 chip array (C0 to C8), with a close-up of chip C3. Timelines show overlapped BUS and CELL activity, the RB signal (true/false), chip ID, and plane ID. Annotations: Pipelining; Cycle Saved; latency of Req. 1 to Req. 5]


Evaluations
• Simulation
– NFS (NANDFlashSim) http://nfs.camelab.org
– 64 ~ 1024 flash chips, dual die, four plane (our SSD simulator simultaneously executes 1024 NFS instances)
– Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
• Workloads
– Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
– High transactional locality workloads: cfs2, msnfs2~3
• Schedulers
– VAS: Virtual Address Scheduler, using FIFO
– PAS: Physical Address Scheduler, using extra queues
– SPK1: Sprinkler, using only FARO
– SPK2: Sprinkler, using only RIOS
– SPK3: Sprinkler, using both FARO and RIOS


Throughput

[Figures: Bandwidth and IOPS. Annotations: 300 MB/s improvement; 4x improvement]

• Compared to VAS: 42 MB/s ~ 300 MB/s improvement
• Compared to PAS: 1.8 times better performance


I/O and Queuing Latency

[Figures: Avg. Latency and Queue Stall Time. Annotations: SPK1 is worse than PAS; SPK2 is worse than SPK1; Large request size]

SPK1 by itself cannot secure enough memory requests and still has the parallelism dependency

SPK3 (Sprinkler) reduces the device-level latency and queue pending time by at least 59% and 86%, respectively.


Idleness Evaluation

[Figures: Inter-chip Idleness and Intra-chip Idleness]

• SPK1 shows worse inter-chip idleness reduction than PAS
• SPK1 shows better intra-chip idleness reduction than PAS

When considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46%)


Conclusion and Related Work
• Conclusion:
– Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
– Sprinkler offers at least 56.6% shorter latency and 1.8 ~ 2.2 times better bandwidth than a modern SSD controller
• Related work:
– Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA’10, MICRO’10 Y.Kim, MICRO’07, PACT’07]
– Physical Address Scheduling [ISCA’12, TC’11]


Parallelism Breakdown

[Figure panels: VAS; SPK1 (FARO-only); SPK2 (RIOS-only); SPK3 (Sprinkler)]


# of Transactions

[64-chips] [1024-chips]


Time Series Analysis


GC


Sensitivity Test

[64-chips]

[1024-chips]

[256-chips]