Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk
Myoungsoo Jung (UT Dallas), Mahmut Kandemir (PSU)
University of Texas at Dallas, Computer Architecture and Memory Systems Lab
Takeaway
• Observations:
  – Employing more and more flash chips is not a promising solution
  – Unbalanced flash chip utilization and low parallelism
• Challenges:
  – The degree of parallelism and utilization depends highly on incoming I/O request patterns
• Our approach:
  – Sprinkle I/O requests based on the internal resource layout rather than the order imposed by a storage queue
  – Commit more memory requests to a specific internal flash resource
Revisiting NAND Flash Performance
• Memory Cell Performance (excluding data movement)
  – READ: 20 us ~ 115 us (70 ~ 200 MB/sec)
  – WRITE: 200 us ~ 5 ms (1.6 ~ 20 MB/sec)
• Flash Interface (ONFI 3.0)
  – SDR: 50 MB/sec
  – NV-DDR: 200 MB/sec
  – NV-DDR2: 533 MB/sec
• ONFI 4.0: 800 MB/sec
Revisiting NAND Flash Performance
• ONFI 4.0: 800 MB/sec
• PCI Express (single lane)
  – 2.x: 500 MB/sec
  – 3.0: 985 MB/sec
  – 4.0: 1969 MB/sec
• PCIe 4.0 (16 lanes): 31.51 GB/sec
Revisiting NAND Flash Performance
Flash interface (200 MB/s) vs. ONFI 4.0 channel (800 MB/s) vs. PCIe host link (31 GB/s): a large performance disparity, even under an ideal situation
How can we reduce the performance disparity?
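The disparity is easy to quantify with the slides' own numbers. The sketch below is a back-of-the-envelope count only; real throughput depends on workload mix and controller overheads:

```python
# Bandwidth figures taken from the slides (best case, ideal conditions).
cell_write_mbps = 20           # per-chip write bandwidth, upper end
onfi4_channel_mbps = 800       # ONFI 4.0 channel bandwidth
pcie4_x16_mbps = 31.51 * 1000  # PCIe 4.0 x16 host link, ~31.51 GB/s

# Chips needed to keep one ONFI 4.0 channel busy with writes.
chips_per_channel = onfi4_channel_mbps / cell_write_mbps

# Channels needed to keep the PCIe 4.0 x16 link busy.
channels_needed = pcie4_x16_mbps / onfi4_channel_mbps

print(int(chips_per_channel), round(channels_needed, 1))  # 40 39.4
```

By this rough count, on the order of 40 write-busy chips per channel times roughly 40 channels are needed before the host link becomes the bottleneck, which is exactly why SSDs pack in many chips.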
[Figure: flash array organized as channels (CH A, CH B) and ways (WAY 0, WAY 1)]
Internal Parallelism
A Single Host-level I/O Request
Unfortunately, the performance of many-chip SSDs is not significantly improved as the amount of internal resources increases
Many-chip SSD Performance
Performance stagnates
Utilization and Idleness
Utilization sharply goes down
Idleness keeps growing
[Figure labels: Flash Memory Request (Physical Address), Flash Memory Request (Virtual Address)]
I/O Service Routine in a Many-chip SSD
• NVMHC: Queuing, Memory Request Building
• Core (Flash Translation Layer): Memory Request Commitment, Transaction Handling
• Flash Controllers
• Device-level Queue: Arrivals, I/O Request Parsing, Data Movement Initiation
• Memory Requests: data size is the same as the atomic flash I/O unit size
• Address Translation, Execution Sequence, Striping & Pipelining, Transaction Decision, Interleaving & Sharing, Out-of-order Scheduling, System- and Flash-level Parallelism
A flash transaction should be decided before entering the execution stage.
Challenge: I/O access patterns and sizes are all determined by host-side kernel modules.
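The "memory request building" step above can be sketched as splitting each host-level I/O request into atomic-unit pieces. A minimal sketch; the 4 KB page size is an assumption for illustration, not stated in the slides:

```python
PAGE = 4096  # assumed atomic flash I/O unit (page) size, in bytes

def build_memory_requests(byte_offset, length):
    """Split one host-level I/O request into page-sized memory requests,
    each covering one atomic flash I/O unit."""
    first_page = byte_offset // PAGE
    last_page = (byte_offset + length - 1) // PAGE
    return list(range(first_page, last_page + 1))

# A 16 KB request starting at byte 100 straddles page boundaries on
# both ends, so it becomes five page-granular memory requests.
print(build_memory_requests(100, 16384))  # [0, 1, 2, 3, 4]
```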
Challenge Examples
• Virtual Address Scheduler
• Physical Address Scheduler
Virtual Address Scheduler (VAS)
[Figure: I/O requests 1~5 arrive over CHANNEL A and are striped across chips C0~C8 (Chip ID / Plane ID, physical offsets); the timing diagram for CHIP 3 (C3) shows BUS/CELL occupancy, ready/busy (RB) transitions, and the latency of Req. 1~5]
• Idle: stall due to the I/O Request 3 collision at C5
• Tail collisions occur when consecutive requests target the same chip
Physical Address Scheduler (PAS)
[Figure: the same I/O requests 1~5 over CHANNEL A to chips C0~C8 (Chip ID / Plane ID); the CHIP 3 (C3) timing diagram shows BUS/CELL pipelining with ready/busy (RB) transitions, cycles saved on Req. 1~5]
• Pipelining overlaps the BUS transfer of one request with the CELL operation of another
• Tail collisions still remain when consecutive requests target the same chip
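The contrast between the two schedulers can be caricatured in a few lines. This is a toy model, not the paper's implementation; the chip IDs and the commit sequence are made up for illustration:

```python
from collections import defaultdict, deque

def tail_collisions(commit_order):
    # Count back-to-back commits to the same chip: each one stalls the
    # channel until the chip's ready/busy (RB) line clears.
    return sum(a == b for a, b in zip(commit_order, commit_order[1:]))

def interleave_by_chip(commit_order):
    # PAS-style reordering: round-robin over per-chip queues so that
    # consecutive commits hit different chips whenever possible.
    queues = defaultdict(deque)
    for chip in commit_order:
        queues[chip].append(chip)
    out = []
    while any(queues.values()):
        for chip in list(queues):
            if queues[chip]:
                out.append(queues[chip].popleft())
    return out

fifo = [3, 3, 3, 5, 5]          # VAS: virtual-address (arrival) order
print(tail_collisions(fifo))                       # 3 stalls
print(tail_collisions(interleave_by_chip(fifo)))   # 0 stalls
```

Note that PAS only reorders; it cannot create parallelism that the request stream does not contain, which is what motivates the observations on the next slides.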
Observations
• # of chips < # of memory requests
  – The total number of chips is small relative to the total number of memory requests coming from different I/O requests
• There exist many requests heading to the same chip, but to different internal resources
  – Multiple memory requests can be built into a high-FLP transaction if we could change the commit order
Insights
• Stalled memory requests can be served immediately
  – If the scheduler could compose requests beyond the boundary of I/O requests and commit them regardless of their order
• The scheduler can have more flexibility in building a flash transaction with high FLP
  – If it can commit requests targeting different flash internal resources
Sprinkler
• Relaxing the parallelism dependency
  – Schedule and build memory requests based on the internal resource layout
• Improving transactional locality
  – Supply many memory requests to the underlying flash controllers
RIOS: Resource-driven I/O Scheduling
[Figure: memory requests 1~11 mapped onto chips C0~C8 based on the internal resource layout]
• Relaxing the parallelism dependency
  – Schedule and build memory requests based on the internal resource layout
RIOS: Resource-driven I/O Scheduling
[Figure: memory requests 1~11 committed out of order across chips C0~C8]
• RIOS
  – Out-of-order scheduling
  – Fine-granule out-of-order execution
  – Maximizing utilization
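A resource-driven scan, as opposed to queue-order commitment, can be sketched as follows. The request tuples and the idle-chip list are hypothetical; the actual RIOS policy in the paper is richer than this:

```python
from collections import defaultdict

# Device-level queue entries: (io_request_id, chip, die, plane).
queue = [(1, 3, 0, 0), (1, 3, 0, 1), (2, 5, 0, 0),
         (3, 3, 1, 0), (3, 5, 1, 1), (4, 0, 0, 0)]

def rios_commit_order(queue, idle_chips):
    # RIOS idea: scan the whole device-level queue and commit memory
    # requests whose target chip is idle, regardless of the order their
    # I/O requests arrived in (fine-granule out-of-order execution).
    by_chip = defaultdict(list)
    for req in queue:
        by_chip[req[1]].append(req)
    order = []
    for chip in idle_chips:   # driven by resource state, not arrival order
        order.extend(by_chip.get(chip, []))
    return order

# Request 4 (last to arrive) is committed first because chip 0 is idle.
print(rios_commit_order(queue, idle_chips=[0, 5, 3]))
```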
FARO: FLP-Aware Request Over-commitment
• High Flash-Level Parallelism (FLP)
  – Bring as many requests as possible to flash controllers, allowing them to coalesce many memory requests into a single flash transaction
• Consideration
  – A careless memory-request over-commitment can introduce more resource contention
• Overlap Depth
  – The number of memory requests heading to different planes and dies, but the same chip
• Connectivity
  – Maximum number of memory requests that belong to the same I/O request
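Both FARO metrics can be computed directly from a batch of memory requests. A sketch with hypothetical request fields; the thresholds FARO actually applies when over-committing are in the paper:

```python
from collections import Counter

def overlap_depth(batch, chip):
    # Number of memory requests heading to the same chip but to
    # different dies/planes (distinct internal resources of that chip).
    return len({(r["die"], r["plane"]) for r in batch if r["chip"] == chip})

def connectivity(batch):
    # Maximum number of memory requests in the batch that belong to the
    # same host-level I/O request.
    return max(Counter(r["io"] for r in batch).values())

batch = [
    {"io": 1, "chip": 3, "die": 0, "plane": 0},
    {"io": 1, "chip": 3, "die": 0, "plane": 1},
    {"io": 2, "chip": 3, "die": 1, "plane": 0},
    {"io": 3, "chip": 3, "die": 1, "plane": 1},
]
print(overlap_depth(batch, chip=3), connectivity(batch))  # 4 2
```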
FARO: FLP-Aware Request Over-commitment
[Figure: candidate memory-request batches for chip C3 under RIOS and FARO; one batch has overlap depth 4 and connectivity 2, the other overlap depth 4 and connectivity 1]
Sprinkler
[Figure: I/O requests 1~5 over CHANNEL A to chips C0~C8 (Chip ID / Plane ID); the CHIP 3 (C3) timing diagram shows BUS/CELL pipelining with ready/busy (RB) transitions, cycles saved on Req. 1~5]
Evaluations
• Simulation
  – NFS (NANDFlashSim), http://nfs.camelab.org
  – 64 ~ 1024 flash chips, dual die, four planes (our SSD simulator simultaneously executes 1024 NFS instances)
  – Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
• Workloads
  – Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
  – High transactional-locality workloads: cfs2, msnfs2~3
• Schedulers
  – VAS: Virtual Address Scheduler, using FIFO
  – PAS: Physical Address Scheduler, using extra queues
  – SPK1: Sprinkler, using only FARO
  – SPK2: Sprinkler, using only RIOS
  – SPK3: Sprinkler, using both FARO and RIOS
Throughput
[Figures: Bandwidth and IOPS]
• Compared to VAS: 42 MB/s ~ 300 MB/s improvement
• Compared to PAS: 1.8 times better performance
• Up to a 4x improvement
I/O and Queuing Latency
[Figures: Avg. Latency and Queue Stall Time]
• SPK1 is worse than PAS, and SPK2 is worse than SPK1 (for large request sizes)
• SPK1 by itself cannot secure enough memory requests and still has a parallelism dependency
• SPK3 (Sprinkler) reduces the device-level latency and queue pending time by at least 59% and 86%, respectively
Idleness Evaluation
[Figures: Inter-chip Idleness and Intra-chip Idleness]
• SPK1 shows worse inter-idleness reduction than PAS, but better intra-idleness reduction
• When considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46%)
Conclusion and Related Work
• Conclusion:
  – Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
  – Sprinkler offers at least 56.6% shorter latency and 1.8 ~ 2.2x better bandwidth than a modern SSD controller
• Related work:
  – Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA'10, MICRO'10 Y. Kim, MICRO'07, PACT'07]
  – Physical address scheduling [ISCA'12, TC'11]
Parallelism Breakdown
[Figures: # of transactions for VAS, SPK1 (FARO-only), SPK2 (RIOS-only), and SPK3 (Sprinkler), on 64-chip and 1024-chip configurations]
Time Series Analysis
[Figure: throughput over time; GC marks garbage collection activity]
Sensitivity Test
[Figures: 64-chip, 256-chip, and 1024-chip configurations]