VEGAS: Soft Vector Processor with Scratchpad Memory

Christopher Han-Yu Chou

Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux

University of British Columbia


Motivation

Embedded processing on FPGAs
  High performance, computationally intensive
  Soft processors, e.g. Nios/MicroBlaze, too slow

How to deliver high performance?
  Multiprocessor on FPGA
  Custom hardware accelerators (Verilog RTL)
  Synthesized accelerators (C to FPGA)


Motivation

Soft vector processor to the rescue
  Previous works have demonstrated soft vector processor as a viable option to provide:
    Scalable performance and area
    Purely software-based
    Decouples hardware/software development

Key performance bottlenecks
  Memory access latency
  On-chip data storage efficiency


Contribution

VEGAS Architecture key features
  Cacheless Scratchpad Memory
  Fracturable ALUs
  Concurrent memory access via DMA

Advantages
  Eliminates on-chip data replication
    Also: huge # of vectors, long vector lengths
  More parallel ALUs
  Fewer memory loads/stores


VEGAS Architecture

Scalar Core: NiosII/f @ 200MHz
DMA Engine & External DDR2
Vector Core: VEGAS @ 120MHz
Concurrent Execution, FIFO synchronized
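The concurrent, FIFO-synchronized execution named on this slide can be pictured as a producer/consumer pair. The following is a minimal Python sketch, not the VEGAS hardware interface: the queue depth, the instruction tuples, and the names `scalar_core`, `vector_core`, and `run` are all hypothetical.

```python
import queue
import threading

def scalar_core(fifo, n_instr):
    # Hypothetical scalar core: issues vector instructions, then a stop marker.
    for k in range(n_instr):
        fifo.put(("vadd", k))   # blocks only while the FIFO is full
    fifo.put(None)

def vector_core(fifo, done):
    # Hypothetical vector core: drains the FIFO in issue order.
    while True:
        instr = fifo.get()      # blocks only while the FIFO is empty
        if instr is None:
            break
        done.append(instr)

def run(n_instr=16, depth=4):
    # depth is an assumed FIFO size; the slides do not give one.
    fifo = queue.Queue(maxsize=depth)
    done = []
    consumer = threading.Thread(target=vector_core, args=(fifo, done))
    consumer.start()
    scalar_core(fifo, n_instr)
    consumer.join()
    return done
```

The two sides only wait on each other when the FIFO is full or empty, which is the sense in which the cores run concurrently in this model.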


Scratchpad Memory in Action

[Figure: Vector Scratchpad Memory connected to Vector Lanes 0 to 3, with srcA, srcB, and Dest operands]


Scratchpad Memory in Action

[Figure: scratchpad operation with srcA and Dest operands]


Scratchpad Advantage

Performance
  Huge working set (256kB++)
  Explicitly managed by software
  Async load/store via concurrent DMA

Efficient data storage
  Double-clocked memory (Trad. RF: 2x copies)
  8b data stays as 8b (Trad. RF: 4x copies)
  No cache (Trad. RF: +1 copy)
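The three storage factors can be combined into a rough per-element byte count. The slide does not give a total, so the way the factors are composed here (the 2x and 4x multiply, the cache copy adds one byte) is an assumption, and the function names are hypothetical.

```python
def traditional_bytes_per_8b_element(rf_copies=2, widen_factor=4, cache_copies=1):
    # Assumption: the register file is duplicated (2x) and stores each
    # 8b element in a 32b slot (4x); the cache holds one more 8b copy.
    return rf_copies * widen_factor + cache_copies

def scratchpad_bytes_per_8b_element():
    # 8b data stays as 8b, single copy, no cache.
    return 1
```

Under that reading, one 8b element occupies 9 bytes on chip in the traditional organization and 1 byte in the scratchpad.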


Scratchpad Advantage

Accessed by address register
  Huge # of vectors in scratchpad
    VEGAS uses only 8 vector addr. reg. (V0..V7)
    Modify content to access different vectors
    Auto-increment lessens need to change V0..V7
  Long vector lengths
    Fill entire scratchpad
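A small software model may help show how a handful of address registers can reach many vectors. This is a sketch under stated assumptions: the class `Scratchpad`, the method `vadd_b`, and the exact auto-increment rule (add `inc` bytes to every register used, after the operation) are hypothetical and not taken from the VEGAS ISA.

```python
class Scratchpad:
    def __init__(self, size_bytes, num_addr_regs=8):
        self.mem = bytearray(size_bytes)
        self.v = [0] * num_addr_regs   # V0..V7 hold byte addresses, not data
        self.vl = 0                    # vector length in elements

    def vadd_b(self, dst, src_a, src_b, inc=0):
        # Byte-wise add of two vectors located by address registers.
        d, a, b = self.v[dst], self.v[src_a], self.v[src_b]
        for k in range(self.vl):
            self.mem[d + k] = (self.mem[a + k] + self.mem[b + k]) & 0xFF
        # Assumed auto-increment: advance each register used by inc bytes.
        for r in {dst, src_a, src_b}:
            self.v[r] += inc

sp = Scratchpad(64)
sp.mem[0:4] = bytes([1, 2, 3, 4])
sp.mem[8:12] = bytes([10, 20, 30, 250])
sp.v[0], sp.v[1], sp.v[2] = 16, 0, 8
sp.vl = 4
sp.vadd_b(0, 1, 2, inc=4)
```

After the call, the registers already point at the next 4-byte vectors, so a loop over many vectors needs no explicit register updates.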


Scratchpad Advantage: Median Filter

Vector address registers easier than unrolling

Traditional Vector Median Filter

  For J = 0..12
    For I = J..24
      V1 = vector[I]              vector load
      V2 = vector[J]              vector load
      CompareAndSwap( V1, V2 )
      vector[J] = V2              vector store
      vector[I] = V1              vector store

Optimize away 1 vector load + 1 vector store using temp
Total of 222 loads and 222 stores
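The loop nest above is a partial selection sort applied to whole vectors at once. A minimal Python sketch follows, assuming a 25-entry window (e.g. 5x5), an inner loop that starts at J+1, and a CompareAndSwap that leaves the smaller value in V2; the function name `vector_median_25` is hypothetical, and the sketch does not try to reproduce the load/store counts on the slide.

```python
def vector_median_25(windows):
    # windows[k][p] is the k-th neighbour of output pixel p (assumed layout).
    vec = [list(v) for v in windows]
    n = len(vec)
    mid = n // 2
    for j in range(mid + 1):
        v2 = vec[j]                      # kept in a temp across the inner loop
        for i in range(j + 1, n):
            v1 = vec[i]
            lo = [min(a, b) for a, b in zip(v1, v2)]   # element-wise min
            hi = [max(a, b) for a, b in zip(v1, v2)]   # element-wise max
            v2 = lo
            vec[i] = hi                  # one vector store per inner step
        vec[j] = v2                      # j-th smallest value of every pixel
    return vec[mid]
```

Holding `v2` in a temporary across the inner loop is what removes one vector load and one vector store per iteration.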


Scratchpad Advantage: Median Filter

VIPERS ISA

  .L14:
    vld.b  v2, vbase2, vinc0
    vmax   v31, v2, v4
    vmin   v4, v2, v4
    vst.b  v31, vbase2, vinc1
    addi   r2, r2, 1
    bge    r6, r2, .L14


Fracturable ALUs

Multiplier: uses 4 x 16b multipliers
Multiplier also does shifts + rotate
Adder: uses 4 x 8b adders
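One way to picture an adder built from four 8b adders is a carry chain that can be cut at element boundaries. The sketch below is an illustration of that idea, not the VEGAS circuit; the function name `fracturable_add` and the carry-cut rule are assumptions.

```python
def fracturable_add(a, b, width):
    # Add two 32b words as 1 x 32b, 2 x 16b, or 4 x 8b elements.
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16, or 32")
    result = 0
    carry = 0
    for k in range(4):                       # four 8b adder slices
        if (k * 8) % width == 0:
            carry = 0                        # cut the carry chain here
        s = ((a >> (8 * k)) & 0xFF) + ((b >> (8 * k)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * k)
        carry = s >> 8
    return result
```

With `width=32` the carry ripples through all four slices; with `width=8` every slice is an independent adder.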


Fracturable ALUs Advantage

Increased processing power
  4-Lane VEGAS
    4 x 32b operations / cycle
    8 x 16b operations / cycle
    16 x 8b operations / cycle

Median filter example
  32b data: 184 cycles / pixel
  16b data: 93 cycles / pixel
  8b data: 47 cycles / pixel
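The per-cycle figures follow from packing narrower elements into each 32b lane, and the median filter cycle counts track that ratio closely. A short check in Python, with hypothetical helper names and a 32b lane width assumed from the figures above:

```python
def ops_per_cycle(lanes, elem_bits, lane_bits=32):
    # Each lane processes lane_bits // elem_bits elements per cycle.
    return lanes * (lane_bits // elem_bits)

# Cycles per pixel for the median filter, copied from the slide.
MEDIAN_CYCLES = {32: 184, 16: 93, 8: 47}

def speedup_vs_32b(elem_bits):
    return MEDIAN_CYCLES[32] / MEDIAN_CYCLES[elem_bits]
```

Going from 32b to 8b data gives a measured speedup of about 3.9x against an ideal of 4x.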


Area and Frequency

VEGAS

Num. Lanes   ALM     DSP   M9K   Fmax (MHz)
1            3831    8     40    131
2            4881    12    40    131
4            6976    20    40    130
8            11824   36    40    125
16           19843   68    40    122
32           36611   132   40    116
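The table can be probed for scaling trends. The sketch below only restates the numbers above; the linear DSP model (4 per lane plus 4) is an observation fitted to this table, not a statement from the authors, and the helper names are hypothetical.

```python
# lanes -> (ALM, DSP), copied from the table.
TABLE = {1: (3831, 8), 2: (4881, 12), 4: (6976, 20),
         8: (11824, 36), 16: (19843, 68), 32: (36611, 132)}

def dsp_model(lanes):
    # Fitted to the table: 4 DSP blocks per lane plus a fixed 4.
    return 4 * lanes + 4

def alm_per_added_lane(l0, l1):
    # Average ALM cost of each lane added between two configurations.
    return (TABLE[l1][0] - TABLE[l0][0]) / (l1 - l0)
```

In this data each added lane costs roughly 1000 to 1200 ALMs, while M9K usage stays at 40 regardless of lane count.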


ALM Usage


Performance

Benchmark   NiosII/f   VEGAS V1   VEGAS V32   Speedup (NiosII/f vs. V32)
fir         509919     85549      4693        108x
motest      1668869    82515      24717       67x
median      1388       185        7           208x
autocor     124338     45027      2822        44x
conven      48988      3462       1897        25x
imgblend    1231172    175890     35485       34x
filt3x3     6556592    813471     75349       87x
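The speedup column can be recomputed from the NiosII/f and V32 columns. A short Python check, using only the numbers in the table (the names `ROWS` and `speedup` are hypothetical):

```python
# benchmark -> (NiosII/f, VEGAS V32, speedup as printed on the slide)
ROWS = {
    "fir":      (509919, 4693, 108),
    "motest":   (1668869, 24717, 67),
    "median":   (1388, 7, 208),
    "autocor":  (124338, 2822, 44),
    "conven":   (48988, 1897, 25),
    "imgblend": (1231172, 35485, 34),
    "filt3x3":  (6556592, 75349, 87),
}

def speedup(name):
    nios, v32, _ = ROWS[name]
    return nios / v32
```

Truncating the ratio reproduces the printed speedup for every row except median, where 1388 / 7 is about 198 rather than 208; a plausible explanation, not confirmed by the slides, is that the V32 figure shown for median is rounded.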


Area-Delay Product

Area*Delay measures “throughput per mm²”
Compared to earlier vector processors, VEGAS offers 2-3x better throughput per unit area
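As a rough illustration of the metric, an area-delay product can be formed from numbers given elsewhere in this deck. This sketch makes two assumptions that the slides do not state: ALM count stands in for area, and the V1 and V32 columns of the fir benchmark correspond to the 1-lane and 32-lane rows of the area table. It compares VEGAS configurations with each other, not VEGAS with VIPERS or VESPA.

```python
def area_delay(area_alms, runtime):
    # Lower is better: less area for the same work, or less time.
    return area_alms * runtime

def fir_v1_over_v32():
    v1 = area_delay(3831, 85549)     # 1 lane, fir
    v32 = area_delay(36611, 4693)    # 32 lanes, fir
    return v1 / v32
```

Under these assumptions the 32-lane configuration has the lower (better) product on fir, by a factor of about 1.9.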


Integer Matrix Multiply

4096 x 4096 integers (64MB data set)

Intel Core 2 (65nm), 2.5GHz, 16GB DDR2
  Vanilla IJK: 474 s
  Vanilla KIJ: 134 s
  Tiled IJK: 93 s
  Tiled KIJ: 68 s

VEGAS (65nm Altera Stratix3)
  Vector: 44 s (Nios only: 5407 s)
  256kB Scratchpad, 32 Lanes (about 50% of chip)
  200MHz NIOS, 100MHz Vector, 1GB DDR2 SODIMM
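The IJK, KIJ, and tiled variants differ only in loop order and blocking. The Python sketch below shows the three loop structures on small square matrices; it is illustrative only, is not the benchmark code behind the timings above, and the function names and tile size are hypothetical.

```python
def matmul_ijk(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += a[i][k] * b[k][j]    # walks down a column of b
            c[i][j] = acc
    return c

def matmul_kij(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            aik = a[i][k]
            row_b, row_c = b[k], c[i]
            for j in range(n):
                row_c[j] += aik * row_b[j]  # walks along rows of b and c
    return c

def matmul_tiled_kij(a, b, tile=2):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for kk in range(0, n, tile):
        for ii in range(0, n, tile):
            for jj in range(0, n, tile):
                # KIJ order inside one tile-sized block.
                for k in range(kk, min(kk + tile, n)):
                    for i in range(ii, min(ii + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

In the KIJ order the innermost loop streams along rows, which suits row-major storage and maps naturally onto long vector operations.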


Conclusions

Key features
  Scratchpad Memory
    Enhance performance with fewer loads/stores
    No on-chip data replication; efficient storage
    Double-clocked to hide memory latency
  Fracturable ALUs
    Operates on 8b, 16b, 32b data efficiently
    Single vector core accelerates many applications

Result
  2-3x better Area-Delay product than VIPERS/VESPA
  Outperforms Intel Core 2 at Integer Matrix Multiply


Issues / Future Work

No floating-point yet
  Adding “complex function” support, to include floating-point or similar operations

Algorithms with only short vectors
  Split vector processor into 2, 4, 8 pieces
  Run multiple instances of algorithm

Multiple vector processors
  Connecting them to work cooperatively
  Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)
