
Server-side optimization for next-generation SSDs in G-Cube


This information is confidential and was prepared by G-Cube solely for the use of our client and investor; it is not to be relied on by any 3rd party without G-Cube's prior written consent

Server Side Optimization for SSD (SSOS)

G-Cube Inc.

141028-IR-Business proposal v05SEO

Agenda

• Introduction

• OS’s storage stack

• Storage stack optimization for next-generation SSD

• Ongoing work


Introduction

• It’s time for low-latency I/O.
  - Very high I/O demand in computer systems: social network services, cloud platforms, and desktop users.

• New technology: Storage Class Memory (SCM)
  - High performance and robustness, plus archival capabilities and low cost.
  - Deployable as a disk-drive replacement as well as a scalable main-memory alternative.
  - NAND flash, phase-change memory (PCM), MRAM, RRAM, …


There Is No Free Lunch!

• What SCM vendors give us: high throughput and ultra-low latency.

• What matters to us: application performance (response time, throughput, TPM, …).

• Between the two sits the storage stack in a modern OS, so the device and the application show different performance numbers.


User Experience with New Computing Resources, New Service Framework, but Traditional System & Interfaces

• Resource: many-core CPUs (64 cores and up) and storage with no mechanical movement (~microseconds). Fast and parallel.

• Service: mobile apps, cloud, big data, HPC… creating I/O pressure that is highly frequent, concurrent, and randomized.

• System: the traditional stack (virtual memory, file system, block I/O & I/O scheduler, device driver, HW interfaces) is non-scalable and unaware of the new resources, so the user experience is poor.

• Interfaces: narrow, carrying limited information.

• What we want: a device-aware, scalable system with rich, context-aware interfaces, delivering the best UX.


SSD Performance in Modern Operating System

[Figure: performance vs. I/O load (users, applications). The SSD as users experience it (UX) falls far below the SSD vendor's quoted performance, though still above HDD; closing that gap means up to a 2~4x performance improvement.]



Technical Strategy

Server Side Optimization for SSD: Make fast & broad I/O path between user & SSD

[Figure: performance (1/average response time) vs. I/O load (users, applications). SSD (now) saturates well below SSD (hardware); the proposed product raises both SSD peak performance and SSD peak load through a fast I/O path and a broad I/O path.]

• User latency, response time

• Throughput

“Exploiting peak device throughput even from concurrent small random access workloads”


Storage Stack in Modern OS

Apps → Filesystem → Block layer (IO scheduler) → Device driver → Hard disk drive

The block layer is long and complex, built to mitigate the performance gap between CPU, memory, and the hard disk drive. The whole stack is “diskified.”


Storage Stack Optimizations for NVM SSD

Apps → Filesystem → Block layer (IO scheduler) → Device driver → NVM SSD

De-diskify:
• Employ SSD features
• Remove diskified features in the software storage stack
• Enlarge interfaces between layers
• Make the I/O path synchronous


Performance Metrics

• Latency: IOPS (I/Os per second), measured with small random workloads.

• Throughput: bandwidth (MB/s), measured with large sequential workloads.

• Our goal: exploit peak device throughput from small random access workloads.
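The two metrics are linked by request size: bandwidth is simply IOPS times bytes per request. A minimal sketch (function names are illustrative, not from the slides) makes the goal above concrete: reaching peak device bandwidth from small requests demands very high IOPS.

```python
def bandwidth_mb_s(iops, request_bytes):
    """Sustained bandwidth implied by an IOPS figure at a fixed request size."""
    return iops * request_bytes / 1e6

def iops_for_bandwidth(target_mb_s, request_bytes):
    """IOPS needed to hit a bandwidth target at a fixed request size."""
    return target_mb_s * 1e6 / request_bytes

# Small random 4 KiB reads at 100,000 IOPS sustain only ~409.6 MB/s,
# while one large sequential stream reaches the same bandwidth trivially.
print(bandwidth_mb_s(100_000, 4096))
```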


Focusing on Latency…

•New Hardware Abstraction Layer.-Get rid of disk-based HAL overhead.

•Polling-based interfaces-Avoid interrupt handling overhead for NVM SSD with a few micro-second response time.


Polling vs Interrupt

[Figure: timelines of the two completion models.
• Interrupt-based I/O event handling: the application issues the I/O command and its context sleeps during the device response; completion then passes through IRQ and SoftIRQ handling plus a schedule delay before the application context resumes (component latencies of roughly 4~5 µs, 8~15 µs, and 2~3 µs; ~14~23 µs end to end).
• Polling-based I/O event handling: the application issues the I/O command and busy-waits (polling) on the device response, resuming in roughly 4~5 µs.]
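A userspace analogy of the two completion models can be sketched with threads (a toy, not the kernel mechanism; all names are illustrative): the polling waiter spins on a completion flag and never enters the scheduler, while the interrupt-style waiter sleeps and pays wakeup latency.

```python
import threading
import time

def submit_io(service_time_s, done):
    """Toy 'device': signals completion after a fixed service time."""
    def device():
        time.sleep(service_time_s)
        done.set()
    threading.Thread(target=device, daemon=True).start()

def wait_by_polling(done):
    """Polling model: busy-wait on the completion flag; the waiting context
    never sleeps, so no IRQ/SoftIRQ handling or schedule delay applies."""
    spins = 0
    while not done.is_set():
        spins += 1
    return spins

def wait_by_interrupt(done):
    """Interrupt model: sleep until the 'device' wakes us; cheap in CPU time
    but pays wakeup and scheduling latency on every completion."""
    done.wait()
    return done.is_set()

done = threading.Event()
submit_io(0.001, done)
wait_by_polling(done)   # spins until the device responds
```

Busy-waiting only pays off when the device response time is comparable to the interrupt-path overhead, which is exactly the regime of a microsecond-latency NVM SSD.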


Performance Evaluation

[Figure: throughput of the polling-based path vs. peak device throughput for sequential read, sequential write, random read, and random write.]


Device Throughput

•Temporal merge-Device-level scatter-gather I/O

•Multiple I/O path-Non-blocking write


Temporal Merge vs Back/Front Merge

[Figure: front/back merge coalesces only requests that are contiguous in the storage address space, leaving the rest unmerged; temporal merge gathers scattered host-memory buffers into a single request, issued as a single I/O.]
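The contrast can be sketched with a toy model (the `(start_sector, n_sectors)` representation and function names are illustrative, and real front/back merge checks queue adjacency rather than globally sorting): address-based merging coalesces only contiguous neighbors, while temporal merging batches everything pending in the same window.

```python
def front_back_merge(requests):
    """Address-based merging (simplified): coalesce only requests that are
    contiguous in the storage address space. Requests are (start, n) sectors."""
    merged = []
    for start, n in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_n = merged[-1]
            merged[-1] = (prev_start, prev_n + n)   # back-merge onto neighbor
        else:
            merged.append((start, n))
    return merged

def temporal_merge(requests):
    """Temporal merging: every request pending in the same time window joins
    one device-level scatter-gather request, contiguous or not."""
    return [list(requests)] if requests else []

pending = [(0, 8), (100, 8), (8, 8), (500, 8)]
front_back_merge(pending)   # three requests survive: only (0,8)+(8,8) merge
temporal_merge(pending)     # one scatter-gather request carries all four
```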


Merge I/O requests from temporal locality

[Figure: read requests issued concurrently from CPU0~CPU7 are temporally merged into a small number of requests (REQ 1~REQ 4); the issuing contexts busy-wait while each merged request performs its I/O.]


Performance Evaluation

[Figure: throughput with temporal merge vs. peak device throughput for sequential read, sequential write, random read, and random write.]


Multiple I/O Path

• For handling buffered writes from the filesystem.

[Figure: buffered writes (asynchronous) A~E from contexts on CPU0~CPU7 are handed to a flush thread and scheduled through multiple I/O paths, so writers do not block on the device.]
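The flush-thread idea above can be sketched as a producer/consumer pair (a userspace toy with illustrative names, not the kernel writeback path): callers enqueue pages and return immediately, and a background thread performs the actual writes.

```python
import queue
import threading

class NonBlockingWriter:
    """Sketch of a non-blocking buffered-write path: write() enqueues a page
    and returns at once; a background flush thread does the device write."""
    def __init__(self):
        self._q = queue.Queue()
        self._written = []
        self._thread = threading.Thread(target=self._flush_loop, daemon=True)
        self._thread.start()

    def write(self, page):
        """Non-blocking: hand the page to the flush thread and return."""
        self._q.put(page)

    def _flush_loop(self):
        while True:
            page = self._q.get()
            if page is None:              # shutdown sentinel
                return
            self._written.append(page)    # stand-in for the device write

    def drain(self):
        """Stop the flush thread and return everything flushed, in order."""
        self._q.put(None)
        self._thread.join()
        return self._written

w = NonBlockingWriter()
for page in "ABCDE":
    w.write(page)      # returns immediately, like the slide's A~E writes
w.drain()
```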


Performance Evaluation

[Figure: throughput vs. ideal max throughput for sequential read, sequential write, random read, and random write. Every workload reaches the ideal except random read, caused by a misunderstood read-ahead context in the EXT2 filesystem.]


Latency Again…

• Pipelined post I/O processing-Post I/O processing: DMA unmap, notifying I/O comple-tion, free pages.

-More effective for small-size requests.


Queuing & Pipelined Post I/O Processing

[Figure: old design. Temporally merged pages are transferred by PCI DMA; only after DMA DONE does post-processing run for all n pages, fully serialized.]


Pipelined Post I/O Processing

[Figure: new design. As each temporally merged page finishes its DMA, its post-processing runs immediately (pipelined, one page at a time), and block I/O completion is signaled through new interfaces.]
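A toy latency model shows why the pipelining helps (the numbers and function names are illustrative, not measurements from the slides): in the old path the two phases run back to back, while the pipelined path exposes only the slower stage per page after the pipeline fills.

```python
def serialized_completion_us(n_pages, dma_us, post_us):
    """Old path: post-processing for all n pages starts only after the whole
    DMA transfer reports DMA DONE, so the two phases simply add up."""
    return n_pages * dma_us + n_pages * post_us

def pipelined_completion_us(n_pages, dma_us, post_us):
    """Pipelined path: each page's post-processing (DMA unmap, completion
    notification, page free) overlaps the DMA of the pages behind it.
    Standard two-stage pipeline latency: fill time for one page, then one
    bottleneck stage per remaining page."""
    return dma_us + post_us + (n_pages - 1) * max(dma_us, post_us)
```

For example, 8 merged pages at a hypothetical 2 µs DMA and 1 µs post-processing each take 24 µs serialized but 17 µs pipelined; the relative gain grows as post-processing becomes a larger fraction of the total, which is why the technique matters most for small requests.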


Performance Evaluation

[Figure: IOPS (0~500,000) for Seq.Read, Seq.Write, Rand.Read, Rand.Write, and Rand.R/W (7/3). FIO, 512-byte requests, 16 threads; pipelined post I/O processing increases IOPS by 14% (60,000 IOPS).]


Issue: Scalable I/O for Many-core Machine

Many-core CPU → Virtual memory (page cache tree) → Block layer (IO scheduler) → Device driver → NVM SSD

[Figure: buffered-read throughput (MB/sec) vs. number of threads (1~32) for initial writers, rewriters, readers, re-readers, random readers, and random writers, on a ramdisk and on a PCI-E SSD, motivating the scalability issue.]


Scalable Lock for Many-Core CPU

• Access to the radix tree in the Linux kernel
  - For adding/removing/updating a page in the tree.
  - Guarded by spin_lock_irq on tree_lock, which is non-scalable.

• Possible approaches for a scalable lock
  - Less-contended lock structures
    · With the help of application hints.
    · Finer-grained locks, at the cost of extra memory.
  - Remote Core Locking
    · No cache invalidation storms.
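The finer-grained approach above can be sketched as a sharded index (a userspace toy with illustrative names, not the kernel radix tree): each shard gets its own lock, so updates to different shards no longer contend on a single tree_lock, at the cost of one lock per shard.

```python
import threading

class ShardedPageIndex:
    """Sketch of a finer-grained lock: split the page index into shards, each
    with its own lock, instead of one global lock over the whole structure."""
    def __init__(self, n_shards=16):
        self._shards = [dict() for _ in range(n_shards)]
        self._locks = [threading.Lock() for _ in range(n_shards)]

    def _index(self, key):
        return hash(key) % len(self._shards)

    def insert(self, key, page):
        i = self._index(key)
        with self._locks[i]:          # only this shard is held, not the index
            self._shards[i][key] = page

    def remove(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].pop(key, None)

    def lookup(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key)
```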


Conclusion

Software Storage Stack Optimizations for Next-generation SSD: a broad and thin I/O stack between user applications and the NVM SSD.

• High throughput (broad way)
• Low latency (thin layer)
• Extended interfaces for NVM SSD


References

• Young Jin Yu, Dong In Shin, Woong Shin, Nae Young Song, Jae Woo Choi, Hyeong Seog Kim, Hyeonsang Eom, Heon Young Yeom, "Optimizing the Block I/O Subsystem for Fast Storage Devices", ACM Transactions on Computer Systems (TOCS), Volume 32, Issue 2, June 2014, Article No. 6.

• Dong In Shin, Young Jin Yu, Hyeong S. Kim, Jae Woo Choi, Do Yung Jung, Heon Y. Yeom, "Dynamic Interval Polling and Pipelined Post I/O Processing for Low-Latency Storage Class Memory", USENIX HotStorage 2013, San Jose, June 26-28, 2013.

• Jae Woo Choi, Young Jin Yu, Hyeonsang Eom, Heon Young Yeom, Dong In Shin, "SAN Optimization for High Performance Storage with RDMA Data Transfer", 7th Parallel Data Storage Workshop (PDSW), in conjunction with SC'12, Salt Lake City, USA, 10-16 Nov. 2012.

• Young Jin Yu, Dong In Shin, Woong Shin, Nae Young Song, Hyeonsang Eom, Heon Young Yeom, "Exploiting Peak Device Throughput from Random Access Workload", USENIX HotStorage 2012, Boston, USA, 13-14 Jun. 2012.