
Server-side optimization for next-generation SSDs in G-Cube


This information is confidential and was prepared by G-Cube solely for the use of our client and investor; it is not to be relied on by any 3rd party without G-Cube's prior written consent

Server Side Optimization for SSD (SSOS)

G-Cube Inc.

141028-IR-Business proposal v05SEO

Agenda

• Introduction

• OS’s storage stack

• Storage stack optimization for next-generation SSD

• Ongoing work


Introduction

• It’s time for low-latency I/O.
  - Very high I/O demand in computer systems: social network services, cloud platforms, and desktop users.

• New technology: Storage Class Memory (SCM)
  - High performance and robustness, plus archival capabilities and low cost.
  - Deployable as a disk-drive replacement as well as a scalable main-memory alternative.
  - NAND flash, phase-change memory (PCM), MRAM, RRAM, …


There Is No Free Lunch!

• What SCM vendors give us: high throughput and ultra-low latency.

• What matters to us: application performance (response time, throughput, TPM, …).

• Between the two sits the storage stack in a modern OS, so the device and the application show different performance numbers.


User Experience with New Computing Resources, New Service Framework, but Traditional System & Interfaces

• Resource: many-core CPUs (64 cores and up) and storage with no mechanical movement (~microseconds). Fast and parallel.

• Service: mobile apps, cloud, big data, HPC… creating I/O pressure that is highly frequent, concurrent, and randomized.

• System: the traditional stack (virtual memory, file system, block I/O & I/O scheduler, device driver, HW interfaces) is non-scalable and unaware of the new resources, so the user experience is poor.

• Interfaces: narrow, carrying limited information.

• What we want: a device-aware, scalable system with rich, context-aware interfaces, delivering the best UX.


SSD Performance in Modern Operating System

[Figure: performance vs. I/O load (users, applications). The SSD as users experience it (UX) falls far below the SSD vendor's quoted performance, though still above HDD; closing that gap means up to a 2~4x performance improvement.]



Technical Strategy

Server Side Optimization for SSD: Make fast & broad I/O path between user & SSD

[Figure: performance (1/average response time) vs. I/O load (users, applications). SSD (now) saturates well below SSD (hardware); the proposed product raises both SSD peak performance and SSD peak load through a fast I/O path and a broad I/O path.]

• User latency, response time

• Throughput

“Exploiting peak device throughput even from concurrent small random access workloads”


Storage Stack in Modern OS

Apps → Filesystem → Block layer (IO scheduler) → Device driver → Hard disk drive

The block layer is long and complex, built to mitigate the performance gap between CPU, memory, and the hard disk drive. The whole stack is “diskified.”


Storage Stack Optimizations for NVM SSD

Apps → Filesystem → Block layer (IO scheduler) → Device driver → NVM SSD

De-diskify:
• Employ SSD features
• Remove diskified features in the software storage stack
• Enlarge interfaces between layers
• Make the I/O path synchronous


Performance Metrics

• Latency: IOPS (I/Os per second), measured with small random workloads.

• Throughput: bandwidth (MB/s), measured with large sequential workloads.

• Our goal: exploit peak device throughput from small random access workloads.
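The two metrics are linked by request size: bandwidth is simply IOPS times bytes per request. A minimal sketch (function names are illustrative, not from the slides) makes the goal above concrete: reaching peak device bandwidth from small requests demands very high IOPS.

```python
def bandwidth_mb_s(iops, request_bytes):
    """Sustained bandwidth implied by an IOPS figure at a fixed request size."""
    return iops * request_bytes / 1e6

def iops_for_bandwidth(target_mb_s, request_bytes):
    """IOPS needed to hit a bandwidth target at a fixed request size."""
    return target_mb_s * 1e6 / request_bytes

# Small random 4 KiB reads at 100,000 IOPS sustain only ~409.6 MB/s,
# while one large sequential stream reaches the same bandwidth trivially.
print(bandwidth_mb_s(100_000, 4096))
```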


Focusing on Latency…

•New Hardware Abstraction Layer.-Get rid of disk-based HAL overhead.

•Polling-based interfaces-Avoid interrupt handling overhead for NVM SSD with a few micro-second response time.


Polling vs Interrupt

[Figure: timelines of the two completion models.
• Interrupt-based I/O event handling: the application issues the I/O command and its context sleeps during the device response; completion then passes through IRQ and SoftIRQ handling plus a schedule delay before the application context resumes (component latencies of roughly 4~5 µs, 8~15 µs, and 2~3 µs; ~14~23 µs end to end).
• Polling-based I/O event handling: the application issues the I/O command and busy-waits (polling) on the device response, resuming in roughly 4~5 µs.]
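A userspace analogy of the two completion models can be sketched with threads (a toy, not the kernel mechanism; all names are illustrative): the polling waiter spins on a completion flag and never enters the scheduler, while the interrupt-style waiter sleeps and pays wakeup latency.

```python
import threading
import time

def submit_io(service_time_s, done):
    """Toy 'device': signals completion after a fixed service time."""
    def device():
        time.sleep(service_time_s)
        done.set()
    threading.Thread(target=device, daemon=True).start()

def wait_by_polling(done):
    """Polling model: busy-wait on the completion flag; the waiting context
    never sleeps, so no IRQ/SoftIRQ handling or schedule delay applies."""
    spins = 0
    while not done.is_set():
        spins += 1
    return spins

def wait_by_interrupt(done):
    """Interrupt model: sleep until the 'device' wakes us; cheap in CPU time
    but pays wakeup and scheduling latency on every completion."""
    done.wait()
    return done.is_set()

done = threading.Event()
submit_io(0.001, done)
wait_by_polling(done)   # spins until the device responds
```

Busy-waiting only pays off when the device response time is comparable to the interrupt-path overhead, which is exactly the regime of a microsecond-latency NVM SSD.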


Performance Evaluation

[Figure: throughput of the polling-based path vs. peak device throughput for sequential read, sequential write, random read, and random write.]


Device Throughput

•Temporal merge-Device-level scatter-gather I/O

•Multiple I/O path-Non-blocking write


Temporal Merge vs Back/Front Merge

[Figure: front/back merge coalesces only requests that are contiguous in the storage address space, leaving the rest unmerged; temporal merge gathers scattered host-memory buffers into a single request, issued as a single I/O.]
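The contrast can be sketched with a toy model (the `(start_sector, n_sectors)` representation and function names are illustrative, and real front/back merge checks queue adjacency rather than globally sorting): address-based merging coalesces only contiguous neighbors, while temporal merging batches everything pending in the same window.

```python
def front_back_merge(requests):
    """Address-based merging (simplified): coalesce only requests that are
    contiguous in the storage address space. Requests are (start, n) sectors."""
    merged = []
    for start, n in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_n = merged[-1]
            merged[-1] = (prev_start, prev_n + n)   # back-merge onto neighbor
        else:
            merged.append((start, n))
    return merged

def temporal_merge(requests):
    """Temporal merging: every request pending in the same time window joins
    one device-level scatter-gather request, contiguous or not."""
    return [list(requests)] if requests else []

pending = [(0, 8), (100, 8), (8, 8), (500, 8)]
front_back_merge(pending)   # three requests survive: only (0,8)+(8,8) merge
temporal_merge(pending)     # one scatter-gather request carries all four
```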


Merge I/O requests from temporal locality

[Figure: read requests issued concurrently from CPU0~CPU7 are temporally merged into a small number of requests (REQ 1~REQ 4); the issuing contexts busy-wait while each merged request performs its I/O.]


Performance Evaluation

[Figure: throughput with temporal merge vs. peak device throughput for sequential read, sequential write, random read, and random write.]


Multiple I/O Path

• For handling buffered writes from the filesystem.

[Figure: buffered writes (asynchronous) A~E from contexts on CPU0~CPU7 are handed to a flush thread and scheduled through multiple I/O paths, so writers do not block on the device.]
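The flush-thread idea above can be sketched as a producer/consumer pair (a userspace toy with illustrative names, not the kernel writeback path): callers enqueue pages and return immediately, and a background thread performs the actual writes.

```python
import queue
import threading

class NonBlockingWriter:
    """Sketch of a non-blocking buffered-write path: write() enqueues a page
    and returns at once; a background flush thread does the device write."""
    def __init__(self):
        self._q = queue.Queue()
        self._written = []
        self._thread = threading.Thread(target=self._flush_loop, daemon=True)
        self._thread.start()

    def write(self, page):
        """Non-blocking: hand the page to the flush thread and return."""
        self._q.put(page)

    def _flush_loop(self):
        while True:
            page = self._q.get()
            if page is None:              # shutdown sentinel
                return
            self._written.append(page)    # stand-in for the device write

    def drain(self):
        """Stop the flush thread and return everything flushed, in order."""
        self._q.put(None)
        self._thread.join()
        return self._written

w = NonBlockingWriter()
for page in "ABCDE":
    w.write(page)      # returns immediately, like the slide's A~E writes
w.drain()
```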


Performance Evaluation

[Figure: throughput vs. ideal max throughput for sequential read, sequential write, random read, and random write. Every workload reaches the ideal except random read, caused by a misunderstood read-ahead context in the EXT2 filesystem.]


Latency Again…

• Pipelined post I/O processing-Post I/O processing: DMA unmap, notifying I/O comple-tion, free pages.

-More effective for small-size requests.


Queuing & Pipelined Post I/O Processing

[Figure: old design. Temporally merged pages are transferred by PCI DMA; only after DMA DONE does post-processing run for all n pages, fully serialized.]


Pipelined Post I/O Processing

[Figure: new design. As each temporally merged page finishes its DMA, its post-processing runs immediately (pipelined, one page at a time), and block I/O completion is signaled through new interfaces.]
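A toy latency model shows why the pipelining helps (the numbers and function names are illustrative, not measurements from the slides): in the old path the two phases run back to back, while the pipelined path exposes only the slower stage per page after the pipeline fills.

```python
def serialized_completion_us(n_pages, dma_us, post_us):
    """Old path: post-processing for all n pages starts only after the whole
    DMA transfer reports DMA DONE, so the two phases simply add up."""
    return n_pages * dma_us + n_pages * post_us

def pipelined_completion_us(n_pages, dma_us, post_us):
    """Pipelined path: each page's post-processing (DMA unmap, completion
    notification, page free) overlaps the DMA of the pages behind it.
    Standard two-stage pipeline latency: fill time for one page, then one
    bottleneck stage per remaining page."""
    return dma_us + post_us + (n_pages - 1) * max(dma_us, post_us)
```

For example, 8 merged pages at a hypothetical 2 µs DMA and 1 µs post-processing each take 24 µs serialized but 17 µs pipelined; the relative gain grows as post-processing becomes a larger fraction of the total, which is why the technique matters most for small requests.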


Performance Evaluation

[Figure: IOPS (0~500,000) for Seq.Read, Seq.Write, Rand.Read, Rand.Write, and Rand.R/W (7/3). FIO, 512-byte requests, 16 threads; pipelined post I/O processing increases IOPS by 14% (60,000 IOPS).]


Issue: Scalable I/O for Many-core Machine

Many-core CPU → Virtual memory (page cache tree) → Block layer (IO scheduler) → Device driver → NVM SSD

[Figure: buffered-read throughput (MB/sec) vs. number of threads (1~32) for initial writers, rewriters, readers, re-readers, random readers, and random writers, on a ramdisk and on a PCI-E SSD, motivating the scalability issue.]


Scalable Lock for Many-Core CPU

• Access to the radix tree in the Linux kernel
  - For adding/removing/updating a page in the tree.
  - Guarded by spin_lock_irq on tree_lock, which is non-scalable.

• Possible approaches for a scalable lock
  - Less-contended lock structures
    · With the help of application hints.
    · Finer-grained locks, at the cost of extra memory.
  - Remote Core Locking
    · No cache invalidation storms.
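The finer-grained approach above can be sketched as a sharded index (a userspace toy with illustrative names, not the kernel radix tree): each shard gets its own lock, so updates to different shards no longer contend on a single tree_lock, at the cost of one lock per shard.

```python
import threading

class ShardedPageIndex:
    """Sketch of a finer-grained lock: split the page index into shards, each
    with its own lock, instead of one global lock over the whole structure."""
    def __init__(self, n_shards=16):
        self._shards = [dict() for _ in range(n_shards)]
        self._locks = [threading.Lock() for _ in range(n_shards)]

    def _index(self, key):
        return hash(key) % len(self._shards)

    def insert(self, key, page):
        i = self._index(key)
        with self._locks[i]:          # only this shard is held, not the index
            self._shards[i][key] = page

    def remove(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].pop(key, None)

    def lookup(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key)
```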


Conclusion

Software Storage Stack Optimizations for Next-generation SSD: a broad and thin I/O stack between user applications and the NVM SSD.

• High throughput (broad way)
• Low latency (thin layer)
• Extended interfaces for NVM SSD


References

• Young Jin Yu, Dong In Shin, Woong Shin, Nae Young Song, Jae Woo Choi, Hyeong Seog Kim, Hyeonsang Eom, Heon Young Yeom, "Optimizing the Block I/O Subsystem for Fast Storage Devices", ACM Transactions on Computer Systems (TOCS), Volume 32, Issue 2, June 2014, Article No. 6.

• Dong In Shin, Young Jin Yu, Hyeong S. Kim, Jae Woo Choi, Do Yung Jung, Heon Y. Yeom, "Dynamic Interval Polling and Pipelined Post I/O Processing for Low-Latency Storage Class Memory", USENIX HotStorage 2013, San Jose, June 26-28, 2013.

• Jae Woo Choi, Young Jin Yu, Hyeonsang Eom, Heon Young Yeom, Dong In Shin, "SAN Optimization for High Performance Storage with RDMA Data Transfer", 7th Parallel Data Storage Workshop (PDSW), in conjunction with SC'12, Salt Lake City, USA, 10-16 Nov. 2012.

• Young Jin Yu, Dong In Shin, Woong Shin, Nae Young Song, Hyeonsang Eom, Heon Young Yeom, "Exploiting Peak Device Throughput from Random Access Workload", USENIX HotStorage 2012, Boston, USA, 13-14 Jun. 2012.