
Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk

Myoungsoo Jung (UT Dallas), Mahmut Kandemir (PSU)
University of Texas at Dallas, Computer Architecture and Memory Systems Lab

Takeaway

• Observations:
– Employing more and more flash chips is not a promising solution
– Unbalanced flash chip utilization and low parallelism
• Challenges:
– The degree of parallelism and utilization depends highly on incoming I/O request patterns
• Our approach:
– Sprinkles I/O requests based on the internal resource layout rather than the order imposed by a storage queue
– Commits more memory requests to a specific internal flash resource

Revisiting NAND Flash Performance

• Memory cell performance (excluding data movement)
– READ: 20 us ~ 115 us (70 ~ 200 MB/sec)
– WRITE: 200 us ~ 5 ms (1.6 ~ 20 MB/sec)
• Flash interface (ONFI 3.0)
– SDR: 50 MB/sec
– NV-DDR: 200 MB/sec
– NV-DDR2: 533 MB/sec
• ONFI 4.0: 800 MB/sec

Revisiting NAND Flash Performance

• Flash interface (ONFI 4.0): 800 MB/sec
• PCI Express (single lane)
– 2.x: 500 MB/sec
– 3.0: 985 MB/sec
– 4.0: 1969 MB/sec
• PCIe 4.0 (16 lanes): 31.51 GB/sec

Revisiting NAND Flash Performance

• Memory cell: ~200 MB/s vs. flash interface: ~800 MB/s vs. host interface: ~31 GB/s
• Performance disparity (even under an ideal situation)
• How can we reduce the performance disparity?
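To make the disparity concrete, here is a back-of-the-envelope sketch using the numbers from the slides above; the derived chip count is an illustration under ideal assumptions, not a figure from the paper.

```python
# Back-of-the-envelope view of the bandwidth disparity on the slides.
CELL_BW_MB_S = 200        # per-chip cell bandwidth (read upper bound)
ONFI_BW_MB_S = 800        # ONFI 4.0 channel interface
PCIE_BW_MB_S = 31_510     # PCIe 4.0 x16 host interface (31.51 GB/s)

# Chips that must work in parallel for their cells to fill one channel.
chips_per_channel = ONFI_BW_MB_S / CELL_BW_MB_S        # 4.0
# Channels that must work in parallel to fill the host interface.
channels_needed = PCIE_BW_MB_S / ONFI_BW_MB_S          # ~39.4

print(f"{chips_per_channel:.0f} chips/channel x {channels_needed:.0f} channels "
      f"= ~{chips_per_channel * channels_needed:.0f} busy chips to saturate the host link")
```

In other words, even under ideal conditions, well over a hundred chips must be kept busy at once, which is why per-chip utilization becomes the central problem.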

Internal Parallelism

[Figure: a single host-level I/O request striped across the internal resources — channels CH A/CH B, each with WAY 0/WAY 1]

Unfortunately, the performance of many-chip SSDs is not significantly improved as the amount of internal resources increases.

Many-chip SSD Performance

[Plot: performance stagnates as more chips are added]

Utilization and Idleness

[Plots: utilization sharply goes down while idleness keeps growing]

I/O Service Routine in a Many-chip SSD

[Figure: I/O service path from the device-level queue to the flash controllers]
• NVMHC: I/O request parsing, queuing, and data movement initiation on arrivals from the device-level queue
• Core (Flash Translation Layer): memory request building, address translation, transaction decision, and memory request commitment; flash memory requests carry virtual addresses until address translation maps them to physical addresses
• Flash Controllers: transaction handling and the execution sequence — striping & pipelining, interleaving & sharing, and out-of-order scheduling for system- and flash-level parallelism
• Memory requests: data size is the same as the atomic flash I/O unit size

A flash transaction should be decided before entering the execution stage

Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
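As a minimal sketch of the memory-request-building step described above, the core splits each host I/O request into page-sized memory requests; a 4 KB atomic unit and the names MemRequest/build_mem_requests are assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed atomic flash I/O unit, in bytes

@dataclass
class MemRequest:
    io_id: int         # host I/O request this fragment belongs to
    virtual_page: int  # virtual address; physical mapping comes later (FTL)

def build_mem_requests(io_id: int, offset: int, length: int) -> list[MemRequest]:
    """Split one host I/O request into page-sized memory requests."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return [MemRequest(io_id, page) for page in range(first, last + 1)]

# A 20 KB request starting at byte 6144 touches pages 1..6 (six requests).
print(len(build_mem_requests(io_id=3, offset=6144, length=20480)))
```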

Challenge Examples

• Virtual Address Scheduler
• Physical Address Scheduler

Virtual Address Scheduler (VAS)

[Figure: I/O requests 1–5 committed in queue order over CHANNEL A, chips C0–C8 (chip ID / plane ID grid); BUS/CELL timing and ready/busy (RB) state at CHIP 3 (C3)]
• Requests are committed in virtual-address (queue) order, so each of Req. 1–5 pays its full latency back-to-back
• Tail collisions at a busy chip leave other chips idle — e.g., a stall due to the I/O request 3 collision at C5

Physical Address Scheduler (PAS)

[Figure: the same I/O requests 1–5 reordered by physical address over CHANNEL A, chips C0–C8; BUS/CELL timing at CHIP 3 (C3)]
• Reordering by physical address enables pipelining: the BUS activity of one request overlaps the CELL activity of another, saving cycles for Req. 1–5
• Tail collisions still occur when consecutive requests target the same busy chip
• A contrast sketch of the two commit orders follows below
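The difference between the two baselines can be sketched as two commit-order policies; this is a simplified illustration (the chip_of mapping and request tuples are hypothetical), not the controllers' actual logic.

```python
# VAS vs. PAS in miniature: same memory requests, different commit orders.
from collections import defaultdict, deque

def vas_commit(queue):
    """VAS: commit strictly in storage-queue (FIFO) order; a busy chip at
    the head stalls everything behind it (tail collision)."""
    return list(queue)

def pas_commit(queue, chip_of):
    """PAS: regroup the same requests per physical chip and round-robin
    across chips, so bus/cell phases of different chips can pipeline."""
    per_chip = defaultdict(deque)
    for req in queue:
        per_chip[chip_of(req)].append(req)   # FIFO preserved inside a chip
    order = []
    while any(per_chip.values()):
        for chip in sorted(per_chip):
            if per_chip[chip]:
                order.append(per_chip[chip].popleft())
    return order

reqs = [("io1", 0), ("io1", 3), ("io2", 1), ("io2", 4)]   # (io id, phys page)
print(vas_commit(reqs))                                   # queue order
print(pas_commit(reqs, chip_of=lambda r: r[1] % 3))       # interleaved by chip
```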

Observations

• # of chips < # of memory requests
– The total number of chips is much smaller than the total number of memory requests coming from different I/O requests
• There exist many requests heading to the same chip, but to different internal resources
– Multiple memory requests can be built into a high-FLP transaction if we could change the commit order

Insights

• Stalled memory requests can be immediately served
– If the scheduler could compose requests beyond the boundary of I/O requests and commit them regardless of their order
• The scheduler can have more flexibility in building a flash transaction with high FLP
– If it can commit requests targeting different flash internal resources

Sprinkler

• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
• Improving transactional locality
– Supply many memory requests to the underlying flash controllers

RIOS: Resource-driven I/O Scheduling

[Figure: memory requests from I/O requests 1–11 sprinkled over chips C0–C8 according to the resource layout]
• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
• RIOS is out-of-order scheduling
– Fine-granule out-of-order execution
– Maximizing utilization (a minimal scheduling sketch follows below)
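A minimal sketch of the resource-driven idea under stated assumptions: one commit per idle chip per scheduling step, with illustrative data structures rather than the paper's implementation.

```python
# RIOS in miniature: walk the chips, not the queue. Any pending memory
# request for an idle chip may be committed, across I/O request boundaries.
from collections import defaultdict, deque

def rios_schedule(pending, idle_chips):
    """pending: iterable of (io_id, chip, page); idle_chips: set of chip ids.
    Returns one memory request to commit per idle chip this step."""
    by_chip = defaultdict(deque)
    for req in pending:
        by_chip[req[1]].append(req)          # arrival order kept per chip
    committed = []
    for chip in sorted(idle_chips):          # the resource layout drives the scan
        if by_chip[chip]:
            committed.append(by_chip[chip].popleft())
    return committed

# Requests from three different I/O requests; chips 0 and 2 are idle.
pending = [(1, 0, 10), (1, 1, 11), (2, 2, 20), (3, 0, 30)]
print(rios_schedule(pending, idle_chips={0, 2}))   # [(1, 0, 10), (2, 2, 20)]
```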

FARO: FLP-Aware Request Over-commitment

• High Flash-Level Parallelism (FLP)
– Bring as many requests as possible to flash controllers, allowing them to coalesce many memory requests into a single flash transaction
• Consideration
– A careless memory request over-commitment can introduce more resource contention
• Overlap depth
– The number of memory requests heading to different planes and dies, but the same chip
• Connectivity
– Maximum number of memory requests that belong to the same I/O request

FARO: FLP-Aware Request Over-commitment

[Figure: two candidate transaction builds at chip C3, labeled RIOS and FARO — one with overlap depth 4 / connectivity 2, the other with overlap depth 4 / connectivity 1; see the metric sketch below]
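The two metrics can be computed directly from a candidate batch; this sketch follows the slide's definitions, with illustrative field names.

```python
# FARO's two batch metrics: a batch is a set of memory requests headed to
# the same chip.
from collections import Counter

def overlap_depth(batch):
    """Number of memory requests heading to different (die, plane) pairs
    of the same chip -- how much of the chip one transaction fills."""
    return len({(r["die"], r["plane"]) for r in batch})

def connectivity(batch):
    """Maximum number of memory requests in the batch that belong to the
    same host I/O request."""
    return max(Counter(r["io_id"] for r in batch).values())

batch = [{"io_id": 1, "die": 0, "plane": 0}, {"io_id": 1, "die": 0, "plane": 1},
         {"io_id": 2, "die": 1, "plane": 0}, {"io_id": 3, "die": 1, "plane": 1}]
print(overlap_depth(batch), connectivity(batch))  # 4 2
```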

Sprinkler

[Figure: I/O requests 1–5 scheduled by Sprinkler over CHANNEL A, chips C0–C8 (chip ID / plane ID grid); BUS/CELL activity at CHIP 3 (C3) is pipelined across ready/busy (RB) states, saving cycles for Req. 1–5; a toy composition of RIOS and FARO follows]
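Putting the pieces together, a toy composition (not the paper's algorithm): RIOS scans idle chips and gathers candidates across I/O boundaries, while a greedy FARO stand-in grows each batch toward high overlap depth; a fuller version would also weigh connectivity when ranking batches.

```python
# Toy Sprinkler step: RIOS-style resource scan + greedy FARO-style batching.
def sprinkle(pending, idle_chips, max_batch=4):
    """pending: list of dicts with chip/die/plane/io_id keys; returns one
    multi-request flash transaction per idle chip."""
    transactions = {}
    for chip in sorted(idle_chips):              # RIOS: resource-driven scan
        batch, seen_units = [], set()
        for r in (r for r in pending if r["chip"] == chip):
            unit = (r["die"], r["plane"])        # FARO: grow overlap depth
            if unit not in seen_units and len(batch) < max_batch:
                batch.append(r)
                seen_units.add(unit)
        if batch:
            transactions[chip] = batch
    return transactions
```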

Evaluations

• Simulation
– NFS (NANDFlashSim), http://nfs.camelab.org
– 64 ~ 1024 flash chips, dual-die, four-plane (our SSD simulator simultaneously executes 1024 NFS instances)
– Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
• Workloads
– Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
– High transactional-locality workloads: cfs2, msnfs2~3
• Schedulers
– VAS: Virtual Address Scheduler, using FIFO
– PAS: Physical Address Scheduler, using extra queues
– SPK1: Sprinkler, using only FARO
– SPK2: Sprinkler, using only RIOS
– SPK3: Sprinkler, using both FARO and RIOS

Throughput

[Plots: Bandwidth and IOPS — up to 300 MB/s and 4x improvements]
• Compared to VAS: 42 MB/s ~ 300 MB/s improvement
• Compared to PAS: 1.8 times better performance

I/O and Queuing Latency

[Plots: Avg. Latency and Queue Stall Time]
• SPK1 is worse than PAS, and SPK2 is worse than SPK1 for large request sizes
• SPK1 by itself cannot secure enough memory requests and still has a parallelism dependency
• SPK3 (Sprinkler) reduces the device-level latency and queue pending time by at least 59% and 86%, respectively

Idleness Evaluation

[Plots: Inter-chip Idleness and Intra-chip Idleness]
• SPK1 shows worse inter-idleness reduction than PAS, but better intra-idleness reduction
• Considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46%)

Conclusion and Related Work

• Conclusion:
– Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
– Sprinkler offers at least 56.6% shorter latency and 1.8x ~ 2.2x better bandwidth than a modern SSD controller
• Related work:
– Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA'10, MICRO'10 Y. Kim, MICRO'07, PACT'07]
– Physical address scheduling [ISCA'12, TC'11]

Parallelism Breakdown

[Plots: VAS, SPK1 (FARO-only), SPK2 (RIOS-only), SPK3 (Sprinkler)]

# of Transactions

[Plots: 64 chips and 1024 chips]

Time Series Analysis

[Plot: time series with GC activity marked]

Sensitivity Test

[Plots: 64 chips, 256 chips, and 1024 chips]
