Softcore Vector Processor

Team ASP

Brandon Harris

Arpith Jacob

Outline

• Motivation

• Smith-Waterman

• Solution

• System Architecture
  • Overview
  • Functional Unit
  • Instruction Controller
  • Processing Element
  • Memory Controller

• ISA

• Results

• Future Research

Motivation

• Smith-Waterman sequence alignment

• Similar Problems
  • HMMer, BLAST, RNA Secondary Structure Prediction
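For readers unfamiliar with the algorithm, the following is a minimal Python sketch of Smith-Waterman local alignment scoring. It is not the SVP implementation. The function name, the linear gap model, and the scoring values (match 3, mismatch -3, gap -2) are assumptions chosen for illustration. Each evaluation of the inner `max` is one "cell update", the unit counted by the MCUPS figures reported later.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Return (best local alignment score, number of cell updates).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # One cell update: restart at 0, extend diagonally,
            # or open a gap from the cell above or to the left.
            curr[j] = max(0,
                          prev[j - 1] + sub,
                          prev[j] + gap,
                          curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best, len(a) * len(b)


score, cells = smith_waterman_score("TGTTACGG", "GGTTGACTA")
print(score, cells)
```

The recurrence is dominated by `max` operations over neighbouring cells, and the cells on one anti-diagonal do not depend on each other, which is what makes the problem a good fit for many processing elements working in lockstep.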

Our Solution

• Softcore Vector Processor

• Massively Parallel

• Software programmable

• Configurable Instantiation

• Why Softcore?

• Optimize for specific applications

• Adapt to changes in algorithms

• FPGA technology improves with time

Architectural Overview

• Streaming Architecture

• Memory Mapped FIFOs

• Read Once Data

• Write Once Data

• Provides communication between components

[Figure: block diagram with Software, DMA, and SVP Functional Unit blocks]
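The streaming idea can be illustrated with a small Python sketch. Everything here is hypothetical: the `Fifo` class stands in for a memory-mapped FIFO, and `functional_unit` stands in for whatever component consumes the input stream. The point is only that each word is written once by the producer and consumed once by the reader.

```python
from collections import deque


class Fifo:
    """Hypothetical stand-in for a memory-mapped FIFO."""

    def __init__(self):
        self._q = deque()

    def write(self, word):
        self._q.append(word)       # each word is written exactly once

    def read(self):
        return self._q.popleft()   # reading consumes the word (read once)

    def empty(self):
        return not self._q


def functional_unit(stream_in, stream_out, fn):
    """Drain stream_in, apply fn to each word, and push results downstream."""
    while not stream_in.empty():
        stream_out.write(fn(stream_in.read()))


stream_in, stream_out = Fifo(), Fifo()
for word in [1, 2, 3]:
    stream_in.write(word)

functional_unit(stream_in, stream_out, lambda w: w * 2)
result = [stream_out.read() for _ in range(3)]
print(result)
```

Because the only interface between components is a pair of FIFOs, components can in principle be chained or swapped without either side knowing how the other is implemented.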

Functional Unit

[Figure: Functional Unit block diagram showing an Instruction Controller with Instr. Mem, three Processing Elements with Reg Files, a Memory Controller, Shared Local Memory, Stream In, and Stream Out]

Instruction Controller

• SIMD Instruction Broadcast

[Figure: the Instruction Controller broadcasts addi R5, R1, 10 to three Processing Elements. Their R1 registers hold 0, 1, and 2, and the results are 10, 11, and 12.]

[Figure: a broadcast Ld into R2 uses R3 with offset 0 as the address. Every Processing Element holds the same pointer, ptr1, in its R3.]

• Instruction Register Broadcast

• 40% Register Savings

[Figure: a broadcast Ldir into R2 uses R0 and Instruction Controller register IR3. ptr1 is held once in the Instruction Controller rather than in each Processing Element.]
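The two broadcast mechanisms on these slides can be mimicked with a toy Python model. This is a sketch under stated assumptions, not the hardware design: the class names, the operand order `(rd, rs, imm)`, the 16 instruction registers, and the memory contents are all hypothetical. It reproduces the slide's `addi` example (R1 = 0, 1, 2 gives R5 = 10, 11, 12) and shows why an instruction-register load lets the pointer live once in the controller instead of in every register file.

```python
class ProcessingElement:
    def __init__(self, pe_id, num_regs=6):
        self.regs = [0] * num_regs   # R0..R5, as drawn on the slide
        self.regs[1] = pe_id         # R1 holds 0, 1, 2 in the slide example


class InstructionController:
    def __init__(self, pes, shared_mem, num_iregs=16):
        self.pes = pes
        self.mem = shared_mem
        self.iregs = [0] * num_iregs  # controller-side registers (assumed count)

    def addi(self, rd, rs, imm):
        # SIMD broadcast: every PE executes the same instruction on its own data.
        for pe in self.pes:
            pe.regs[rd] = pe.regs[rs] + imm

    def ld(self, rd, rs, offset):
        # Address comes from a PE register, so each PE must hold the pointer.
        for pe in self.pes:
            pe.regs[rd] = self.mem[pe.regs[rs] + offset]

    def ldir(self, rd, rs, ir):
        # Address comes from a controller register broadcast with the instruction.
        base = self.iregs[ir]
        for pe in self.pes:
            pe.regs[rd] = self.mem[pe.regs[rs] + base]


PTR1 = 4                                # hypothetical address
mem = list(range(100, 116))             # hypothetical shared local memory
pes = [ProcessingElement(i) for i in range(3)]
ic = InstructionController(pes, mem)

ic.addi(5, 1, 10)
r5_values = [pe.regs[5] for pe in pes]  # 10, 11, 12

ic.addi(3, 0, PTR1)                     # every PE spends R3 on the pointer
ic.ld(2, 3, 0)
with_ld = [pe.regs[2] for pe in pes]

ic.addi(2, 0, 0)                        # clear R2 before the second variant
ic.iregs[3] = PTR1                      # pointer stored once, in the controller
ic.ldir(2, 0, 3)
with_ldir = [pe.regs[2] for pe in pes]
print(r5_values, with_ld, with_ldir)
```

Both load variants fetch the same word, but the `ldir` form leaves R3 free in every Processing Element, which is the kind of saving the "Register Savings" bullet refers to.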

Processing Element

[Figure: Processing Element datapath with Register File (Ra Addr, Rb Addr), Data Select, Pipeline Register, ALU, Pipeline Register, Compare, and Write Enables. Signals shown: Ra Data Left, Ra Data Right, Rb Data Left, Rb Data Right, Immediate, Wr Enable Left, Wr En Right, Mem Wr Enable, and Data to the Memory Controller.]

[Figure: example of bmseti R17 EQ 16]
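One plausible reading of `bmseti`, based on how it is paired with `bmend` in the sample code, is that it sets a per-PE enable bit from a comparison so that the instructions up to `bmend` only take effect in the matching Processing Elements. The Python sketch below models that reading. It is an assumption, not a description of the actual hardware, and the helper names, register names, and PE count are hypothetical.

```python
PE_NUM_ELEMENTS = 4  # hypothetical size for the example

# Each PE is modelled as a small dict of named registers.
pes = [{"id": i, "r3": 0} for i in range(PE_NUM_ELEMENTS)]


def bmseti_eq(pes, reg, imm):
    """Build the enable mask: one bit per PE, set where reg == imm."""
    return [pe[reg] == imm for pe in pes]


def broadcast(pes, mask, op):
    """Apply op to every PE whose mask bit is set; others ignore it."""
    for pe, enabled in zip(pes, mask):
        if enabled:
            op(pe)


all_enabled = [True] * len(pes)

# Unmasked instruction: every PE updates r3.
broadcast(pes, all_enabled, lambda pe: pe.update(r3=1))

# bmseti id EQ PE_NUM_ELEMENTS - 1: only the last PE stays enabled.
mask = bmseti_eq(pes, "id", PE_NUM_ELEMENTS - 1)
broadcast(pes, mask, lambda pe: pe.update(r3=99))

# bmend: the mask goes back to all PEs enabled.
mask = all_enabled

r3_values = [pe["r3"] for pe in pes]
print(r3_values)
```

Under this reading the write enables in the datapath are what the mask gates: a disabled PE still receives the broadcast instruction but does not commit its result.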

Memory Controller

[Figure: the Memory Controller connects the IC and PE 0-3 to three Dual Ported Block RAMs. Caption: Single Cycle Read]

[Figure: the same diagram. Caption: Multiple Cycle Write]

Instruction Set Architecture

• Custom ISA

• Two Sets of Instruction Types
  • Instruction Controller
  • Processing Element

• Optimized for target applications

• Max, Min, Loop

• Expandable

• Core vs. Application Specific

Sample Code

_query_loop:
    subir %r8, %r3, %ir10
    nop
    nop
    max %r4, %r4, %r8
    add %r3, %r19, PE_ZERO_REG

    bmseti PE_ID_REG EQ PE_NUM_ELEMENTS - 1
    icaddi %ir15, %ir8, PE_NUM_ELEMENTS - 1
    nop
    nop
    ldir PE_MEM_REG, PE_ZERO_REG(%ir15)
    nop
    nop
    nop
    nop
    addi %r3, PE_MEM_REG, 0
    bmend

    ld PE_MEM_REG, PE_ZERO_REG(DB_ADDRESS)
    icaddi %ir7, %ir7, 1
    icaddi %ir9, %ir9, 1

    icloop %ir4, %ir5, _query_loop

_query_loop:
    icaddi %ir15, %ir8, PE_NUM_ELEMENTS - 1
    subir %r8, %r3, %ir10
    add %r3, %r19, PE_ZERO_REG
    ldir PE_MEM_REG, PE_ZERO_REG(%ir15)
    max %r4, %r4, %r8

    bmseti PE_ID_REG EQ PE_NUM_ELEMENTS - 1
    icaddi %ir7, %ir7, 1
    icaddi %ir9, %ir9, 1
    addi %r3, PE_MEM_REG, 0
    bmend

    ld PE_MEM_REG, PE_ZERO_REG(DB_ADDRESS)

    icloop %ir4, %ir5, _query_loop

Results

• VHDL Implementation
  • Simulated
  • Synthesized

• Smith-Waterman
  • 16 PE version tested
  • Millions of Cell Updates Per Second (MCUPS)

Smith-Waterman Speedup

System    Freq       MCUPS   Speedup
P4        1.8 GHz       15      1
SVP16     150 MHz       52      3.47
SVP32     150 MHz      103      6.87
SVP64     125 MHz      167     11.13
SVP128    120 MHz      302     20.13
SVP128    150 MHz      378     25.20
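The Speedup column is consistent with dividing each MCUPS figure by the 15 MCUPS of the P4 baseline. The short Python check below recomputes it from the table values. The dictionary keys are labels made up for this example.

```python
BASELINE_MCUPS = 15  # P4 at 1.8 GHz, from the table

mcups = {
    "SVP16 @ 150 MHz": 52,
    "SVP32 @ 150 MHz": 103,
    "SVP64 @ 125 MHz": 167,
    "SVP128 @ 120 MHz": 302,
    "SVP128 @ 150 MHz": 378,
}

# Speedup relative to the P4 baseline, rounded to two decimals as in the table.
speedup = {name: round(value / BASELINE_MCUPS, 2) for name, value in mcups.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
```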

Comparative Performance

System*        Freq      PEs/Chip   MCUPS/PE   Chips   MCUPS/Chip   Cost ($1000)   MCUPS/$1000
SVP128         150 MHz      128       2.95        1        378           5              75
SVP128         120 MHz      128       2.36        1        302           5              60
SVP64          125 MHz       64       2.61        1        167           5              33
SVP32          150 MHz       32       3.22        1        103           5              20
Kestrel         20 MHz       64       0.78        8         50          25†             16
GeneMatcher2   192 MHz      192       5.21       16       1000          69              14
Fuzion 150     200 MHz     1536       1.63        1       2500           ?               ?

* Reference [1]
† Estimated
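For the single-chip SVP rows, the two derived columns can be recomputed from the raw figures. The Python sketch below assumes that MCUPS/PE is the chip throughput divided by the PE count, rounded to two decimals, and that MCUPS/$1000 is the throughput divided by cost, truncated to an integer. Both are inferences from the numbers in the table rather than definitions given in the slides.

```python
# (system, freq_mhz, pes_per_chip, mcups_per_chip, cost_in_thousands_usd)
svp_rows = [
    ("SVP128", 150, 128, 378, 5),
    ("SVP128", 120, 128, 302, 5),
    ("SVP64", 125, 64, 167, 5),
    ("SVP32", 150, 32, 103, 5),
]


def derived_columns(pes, mcups, cost_k):
    """Return (MCUPS per PE, MCUPS per $1000) under the assumptions above."""
    per_pe = round(mcups / pes, 2)
    per_k_dollars = int(mcups / cost_k)  # the table values appear truncated
    return per_pe, per_k_dollars


derived = [derived_columns(pes, mcups, cost) for _, _, pes, mcups, cost in svp_rows]

for (name, freq, *_), (per_pe, per_k) in zip(svp_rows, derived):
    print(f"{name} @ {freq} MHz: {per_pe} MCUPS/PE, {per_k} MCUPS/$1000")
```

The multi-chip rows are left out on purpose, since the table does not state whether their cost figures are per chip or per system.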

Performance

PEs    Freq (MHz)   Area   BRAM
16        150        13%     22
32        150        22%     38
64        125        41%     70
128       120        80%    134

• Hardware
  • Xilinx Virtex-4 LX200

Future Work

• Software Development
  • How can HMMer and other systolic algorithms be implemented?

• ISA Expansion
  • What additional instructions are needed?
  • What instructions can be added to optimize?

• Hardware Development
  • How can we optimize the hardware to make it faster and smaller?
  • What hardware can we add to enhance performance?
  • How can we take advantage of advances in FPGAs, such as DSP48s?

Acknowledgments

• Special Thanks
  • Young Cho
  • Roger Chamberlain
  • Jeremy Buhler
  • Joseph Lancaster

• References
  • [1] Di Blas et al., “The Kestrel Parallel Processor,” IEEE Transactions on Parallel and Distributed Systems, January 2005
  • [2] A. Jacob et al., “Whole Genome Comparison Using Commodity Workstations,” Technical Report, 2003

Questions?

