
Source: nvmw.ucsd.edu/nvmw18-program/unzip/current/nvmw2018-paper55...

Dynamic Near Data Processing Framework for SSDs

Gunjae Koo*, Kiran Kumar Matam*, Te I†, H.V. Krishina Giri Nara*, Jing Li‡, Hung-Wei Tseng†, Steven Swanson‡, Murali Annavaram*

* University of Southern California
† North Carolina State University
‡ University of California, San Diego

Conventional Storage = Cheap Passive Devices

Conventional storage devices
• Slow, limited bandwidth (SATA: 150 to 600 MB/s)
• Passive devices (read, write, erase)

* Figures from Intel and Western Digital

Storage in Modern Server Systems

Storage devices for Big Data
• Huge volumes of data (figure labels: slow, slower, much slower)
• Data movement is critical for performance

Intelligent Storage

NVM-based storage devices
• No seek time, higher bandwidth over PCIe
• Potential to be active systems

* Figures from Intel

[Figure: SSD internals: SSD processor (storage processor, SP), DRAM, and NAND flash packages]

Near Data Processing (NDP)

[Figure: host CPU connected through the storage interface to the storage processor (SP). Timeline legend: data computation @ host; data transfer from storage, internal and external (host – storage); data computation @ storage. Two timelines are compared: W/O NDP and With NDP.]

Near Data Processing (NDP) on SSDs

[Figure: host CPU connected through the storage interface to the SP, with W/O NDP and With NDP timelines (data computation @ host; data transfer from storage, internal and external; data computation @ storage). On the SSD, the SP also runs garbage collection and wear-leveling alongside data computation @ storage.]

Obstacles to in-SSD processing

• Less powerful embedded processor

• Dynamic computation resource availability

• Manual workload partitioning is difficult

Summarizer: Dynamic NDP framework for SSD
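The first obstacle can be made concrete with a toy timing model. The sketch below is not from the slides: the function names, the assumption that the phases run back to back without overlap, and every number are hypothetical, chosen only to show how a slow embedded processor can make "offload everything" slower than the host-only path.

```python
def time_without_ndp(data_mb, internal_bw, external_bw, host_rate):
    """All data crosses the host-storage link, then the host CPU computes."""
    return data_mb / internal_bw + data_mb / external_bw + data_mb / host_rate


def time_with_ndp(data_mb, internal_bw, external_bw, ssd_rate, result_ratio):
    """The SSD computes on the data and ships only the (smaller) result."""
    result_mb = data_mb * result_ratio
    return data_mb / internal_bw + data_mb / ssd_rate + result_mb / external_bw


# Hypothetical numbers (MB and MB/s), for illustration only.
data_mb, internal_bw, external_bw = 1000.0, 2000.0, 1000.0
host_rate = 4000.0  # host CPU processing throughput

baseline = time_without_ndp(data_mb, internal_bw, external_bw, host_rate)

# A slow embedded processor makes full offloading lose to the baseline...
slow_sp = time_with_ndp(data_mb, internal_bw, external_bw,
                        ssd_rate=500.0, result_ratio=0.01)
# ...while a faster one wins because little data crosses the external link.
fast_sp = time_with_ndp(data_mb, internal_bw, external_bw,
                        ssd_rate=2000.0, result_ratio=0.01)

print(f"w/o NDP: {baseline:.2f}s, slow SP: {slow_sp:.2f}s, fast SP: {fast_sp:.2f}s")
```

With these made-up inputs the slow storage processor ends up behind the host-only path and the fast one ahead of it, which is the kind of sensitivity that motivates deciding the split at run time rather than fixing it by hand.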

Summarizer – Basic Concept

[Figure: host CPU and SSD connected through the storage interface; labels on the SSD side: AP, "Monitoring resources"]

Summarizer – Detailed Firmware Architecture

[Figure: block diagram of the firmware architecture, with the following labeled components]
• Host side: User Applications / Operating Systems, NVMe Host Driver, Host CPU, Host Memory (SQ, CQ)
• Storage Interface (PCIe / NVMe)
• SSD side: I/O Controller (NVMe command decoder), Request queue, Response queue, SSD Firmware, Flash Translation Layer (FTL), Summarizer (Task Controller, TQ, User Functions), SSD Embedded Processors, SSD SoC Interconnection, DRAM Controller, SSD DRAM, Flash Controller, NAND Flash

Summarizer – Initialization (Function Offloading)

[Figure: firmware architecture diagram, initialization step]
• Function offloading: the host issues INIT(foo), a new NVMe command, carrying foo()
• Function registration: foo() is placed in User Functions as f#1
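The registration step can be pictured as a small function table. This is only a sketch: the class name UserFunctionTable, its methods, and the bodies of foo() and goo() are hypothetical, and the real firmware is not written in Python. It shows INIT handing a function to the device and getting back the slot (f#1, f#2) that later commands refer to.

```python
class UserFunctionTable:
    """Toy stand-in for the "User Functions" block in the SSD firmware."""

    def __init__(self):
        self._functions = {}  # function id -> callable
        self._next_id = 1

    def init(self, func):
        """Model of INIT(foo): store the offloaded function, return its id."""
        func_id = self._next_id
        self._functions[func_id] = func
        self._next_id += 1
        return func_id

    def lookup(self, func_id):
        """Return the function registered under func_id."""
        return self._functions[func_id]


def foo(page):
    """Hypothetical user function: count the non-zero bytes in a page."""
    return sum(1 for b in page if b != 0)


def goo(page):
    """Hypothetical second user function: page length in bytes."""
    return len(page)


table = UserFunctionTable()
foo_id = table.init(foo)  # registered as f#1
goo_id = table.init(goo)  # registered as f#2
sample = table.lookup(foo_id)(bytes([0, 3, 0, 7]))
print(foo_id, goo_id, sample)
```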

Summarizer – Computation (Dynamic mode)

[Figure: firmware architecture diagram, computation steps; User Functions holds foo() as f#1 and goo() as f#2]
• The host issues RD&PROC(LBA, foo), a new NVMe command
• New NVMe command decode: the request continues inside the SSD as RD&PROC(PPA, foo)

• RD&PROC(PPA, foo) is split into page-level requests: RD&P(PPA1, foo), RD&P(PPA2, foo)
• Page data is returned for RD&P(PPA1, foo)

• Page data arrives for RD&P(PPA1, foo) and is held as (buf1, foo); labels: CC/Proc, "Register in TQ"

• Page data arrives for RD&P(PPA1, foo); labels: CC, "TQ is full"
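One way to read the "Register in TQ" and "TQ is full" labels is as a per-page decision on a bounded task queue. The sketch below follows that reading. The TaskController class here, its method names, and the rule that a page is handed back to the host unprocessed when the TQ is full are assumptions made for illustration, not the firmware's actual interface.

```python
from collections import deque


class TaskController:
    """Toy model of the dynamic-mode decision made for each page read."""

    def __init__(self, tq_capacity):
        self.tq = deque()            # pending in-SSD tasks
        self.tq_capacity = tq_capacity
        self.to_host = []            # pages returned without processing

    def on_page_read(self, ppa, page, func):
        """Called when page data for RD&P(ppa, func) is available."""
        if len(self.tq) < self.tq_capacity:
            self.tq.append((ppa, page, func))  # "Register in TQ"
            return "in-ssd"
        self.to_host.append((ppa, page))       # "TQ is full"
        return "host"

    def run_one(self):
        """Run the oldest queued task, as an idle embedded core would."""
        ppa, page, func = self.tq.popleft()
        return ppa, func(page)


def foo(page):
    """Hypothetical user function: sum of the bytes in a page."""
    return sum(page)


tc = TaskController(tq_capacity=2)
decisions = [tc.on_page_read(ppa, bytes([ppa] * 4), foo) for ppa in (1, 2, 3)]
first = tc.run_one()
print(decisions, first)
```

Because the decision is made per page, the share of work done inside the SSD follows whatever capacity the task queue has at that moment, with no fixed partition chosen in advance.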

Summarizer – Finalization

[Figure: firmware architecture diagram, finalization step; User Functions holds foo() as f#1 and goo() as f#2]
• The host issues FINAL(foo), a new NVMe command
• Results of foo() are returned
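Seen from the host, the three commands form an INIT, RD&PROC, FINAL sequence. The sketch below imitates that sequence against a made-up device model. ToySummarizerSSD, its in_ssd_budget parameter (a crude stand-in for resource availability), and the convention that an unprocessed page is returned to the caller are all assumptions. The point is that the host combines its own partial result with the one returned at finalization.

```python
class ToySummarizerSSD:
    """Hypothetical device model; method names mirror the slide's commands."""

    def __init__(self, pages, in_ssd_budget):
        self.pages = pages                  # LBA -> page data
        self.in_ssd_budget = in_ssd_budget  # pages the SSD can process itself
        self.func = None
        self.partial = 0
        self.processed = 0

    def init(self, func):
        """INIT(foo): register the function and reset the partial result."""
        self.func, self.partial, self.processed = func, 0, 0

    def rd_proc(self, lba):
        """RD&PROC(LBA, foo): process in the SSD if possible, else return the page."""
        page = self.pages[lba]
        if self.processed < self.in_ssd_budget:
            self.processed += 1
            self.partial += self.func(page)
            return None
        return page

    def final(self):
        """FINAL(foo): hand the in-SSD partial result back to the host."""
        return self.partial


def foo(page):
    """Hypothetical user function: sum of the bytes in a page."""
    return sum(page)


pages = {lba: bytes([lba] * 4) for lba in range(6)}
ssd = ToySummarizerSSD(pages, in_ssd_budget=4)
ssd.init(foo)

host_partial = 0
for lba in pages:
    page = ssd.rd_proc(lba)
    if page is not None:          # the SSD was busy: process on the host
        host_partial += foo(page)

total = ssd.final() + host_partial
print(ssd.final(), host_partial, total)
```

This only works as written because the example function is a sum, so partial results from the two sides can simply be added; other functions would need their own merge step.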

Evaluation Platform

• LS2085a intelligent SSD development platform

• ARM cores running FTL and Summarizer firmware

• FPGA implementing NAND flash controller

• PCIe Gen. 3 x4 lanes for host communication

[Figure: LS2085a block diagram: CPU cores with L1D (32KB), L1I (48KB), and L2 (1MB) caches; interconnection; DDR4 memory controller and DRAM; PCIe (host – LS2085a); PCIe (LS2085a – FPGA); FPGA (Altera Stratix V); NAND flash DIMMs]

[Figure: photo of the development board; labels: ARM processor, DRAM, Altera Stratix V, NAND flash DIMMs, PCIe (to host)]

Evaluation - Performance

[Figure: TPC-H Query6 execution time, normalized to the baseline (CPU only). x-axis: static offloading at 0, 0.2, 0.4, 0.6, 0.8, and 1, followed by Dynamic; y-axis: 0 to 4. Stacked series: SSD time and Host time.]
• Static workload offloading: ranges from CPU only processing (baseline) to SSD only processing
• Summarizer dynamic offloading
• SSD time: SSD processing + transfer time (internal + external + in-SSD processing)
• Host time: host CPU processing time

[Figure: stacked bars of SSD time and Host time for "CPU only" and "Dynamic" on TPC-H Query6; y-axis: execution time (normalized to baseline), 0.0 to 1.2. Data labels as extracted: 0.70, 0.60 (shown as 0.62 on the following build of the slide), 0.30, 0.24.]

[Figure: W/O NDP and With NDP timelines: data computation @ host; data transfer from storage, internal and external (host – storage); data computation @ storage]
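The data labels can be turned into totals with a little arithmetic. The sketch below assumes that each pair of labels is the two stacked segments of one bar (0.70 and 0.30 for CPU only, 0.60 or 0.62 and 0.24 for Dynamic). The extracted text does not confirm that pairing, and it shows both 0.60 and 0.62 for the same label, so both variants are computed.

```python
# Data labels as they appear on the slides (normalized execution time).
# Assumption: each pair is the two stacked segments of one bar.
bars = {
    "CPU only": (0.70, 0.30),
    "Dynamic (0.60 label)": (0.60, 0.24),
    "Dynamic (0.62 label)": (0.62, 0.24),
}

baseline = sum(bars["CPU only"])
totals = {name: sum(parts) for name, parts in bars.items()}

for name, total in totals.items():
    reduction = 1 - total / baseline
    print(f"{name}: total {total:.2f}, {reduction:.0%} below CPU only")
```

Under that pairing the CPU only bar sums to 1.00, as expected for a normalized baseline, and the Dynamic bar sums to 0.84 or 0.86 depending on which label is correct.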

[Figure: TPC-H Query6 chart (Static vs. Dynamic, SSD time and Host time), annotated "Performance degraded by static NDP"]

[Figure: four panels of execution time (normalized to baseline), annotated 16%, 10%, 20%, and 7%]

Design Exploration – Better SSD Processor

[Figure: host CPU and SSD connected through the storage interface; label on the SSD side: AP]

Better embedded processor is cost effective

Design Exploration – Higher Internal Bandwidth

[Figure: speedup (y-axis, 0% to 120%) versus embedded processor performance (x-axis: X1, X2, X4, X8, X16) for TPC-H Query6, TPC-H Query1, TPC-H Query14, String Similarity Join, and Average]

Summarizer is a cost effective NDP solution with powerful storage processors

Conclusion

▪ Dynamic computation offloading framework
• Opportunistic in-SSD computation
• Page-level task control
• Optimal performance improvement

▪ Summarizer programming model

✓ Dynamic NDP framework for SSDs
• Opportunistically enables in-SSD processing
• Page-level NDP control
• Automatic workload partitioning

✓ Summarizer programming model
• Evaluation on the real development platform
• Explored design space for future SSDs

Thank you

(We thank Dell EMC for supporting the SSD development board)

Summarizer: Trading Communication with Computing Near Storage (MICRO ‘17)