Softcore Vector Processor
Team ASP
Brandon Harris
Arpith Jacob
Outline
• Motivation
• Smith-Waterman
• Solution
• System Architecture
  • Overview
  • Functional Unit
  • Instruction Controller
  • Processing Element
  • Memory Controller
• ISA
• Results
• Future Research
Motivation
• Smith-Waterman sequence alignment
Motivation
• Similar Problems
  • HMMer, BLAST, RNA Secondary Structure Prediction
• Smith-Waterman sequence alignment
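For readers unfamiliar with the target application, the following is a minimal Python sketch of Smith-Waterman local alignment scoring. It is not taken from the slides: the function name, the linear gap penalty, and the default scoring values are assumptions chosen for illustration. Each pass through the inner loop is one cell update, the unit counted by the MCUPS figures in the Results slides.

```python
def smith_waterman_score(query, database, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score (linear gap penalty, assumed scoring)."""
    prev = [0] * (len(database) + 1)   # previous row of the score matrix
    best = 0
    for q in query:
        curr = [0]                     # column 0 is always zero
        for j, d in enumerate(database, start=1):
            diag = prev[j - 1] + (match if q == d else mismatch)
            up = prev[j] + gap         # gap in the database sequence
            left = curr[j - 1] + gap   # gap in the query sequence
            cell = max(0, diag, up, left)   # one cell update
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


example_score = smith_waterman_score("GTTAC", "GTTGAC", match=3, mismatch=-3, gap=-2)
```

The cells along an anti-diagonal of the matrix do not depend on each other, which is what makes the recurrence a natural fit for an array of processing elements.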
Our Solution
• Softcore Vector Processor
• Massively Parallel
• Software programmable
• Configurable Instantiation
• Why Softcore?
• Optimize for specific applications
• Adapt to changes in algorithms
• FPGA technology improves with time
Architectural Overview
• Streaming Architecture
• Memory Mapped FIFOs
• Read Once Data
• Write Once Data
• Provides communication between components
[Figure: block diagram with the labels Software, DMA, SVP Functional Unit (two instances), DMA, Software]
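The read-once, write-once FIFO coupling described above can be sketched in software. The Python below is only an illustration under assumptions: the names `Fifo`, `functional_unit`, and `run_chain`, and the toy computations passed in, are hypothetical and not part of the SVP design.

```python
from collections import deque


class Fifo:
    """Toy stand-in for a memory-mapped FIFO: each word is written once and read once."""

    def __init__(self):
        self._words = deque()

    def write(self, word):
        self._words.append(word)

    def read(self):
        # Reading consumes the word, so it cannot be read a second time.
        return self._words.popleft()

    def empty(self):
        return not self._words


def functional_unit(stream_in, stream_out, transform):
    """Drain the input FIFO, apply a (hypothetical) computation, fill the output FIFO."""
    while not stream_in.empty():
        stream_out.write(transform(stream_in.read()))


def run_chain(words, transforms):
    """Software writes a stream, each unit in the chain processes it, software reads it back."""
    fifo = Fifo()
    for word in words:
        fifo.write(word)
    for transform in transforms:
        next_fifo = Fifo()
        functional_unit(fifo, next_fifo, transform)
        fifo = next_fifo
    results = []
    while not fifo.empty():
        results.append(fifo.read())
    return results
```

Because the FIFOs are the only connection between stages, a stage never needs to know who produced its input or who consumes its output.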
Functional Unit
[Figure: block diagram with the labels Instruction Controller, Instr. Mem, three Processing Elements (each with its own Reg File), Memory Controller, Shared Local Memory, Stream In, Stream Out]
Instruction Controller
• SIMD Instruction Broadcast
[Figure: an addi instruction with operands R5, R1, and 10 is broadcast to three Processing Elements. Each element has registers R0 to R5, with R0 = 0 and R1 = 0, 1, 2 respectively. The results shown are 10, 11, 12.]
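The broadcast in the figure can be mimicked in a few lines of Python. This is a sketch under assumptions, not the hardware's behavior in detail: the helper names are hypothetical, and the operand order (destination, source, immediate) is inferred from the sample code later in the deck.

```python
NUM_REGS = 6  # R0..R5, as in the figure


def make_pes(count):
    """Create `count` register files with R0 = 0 and R1 = PE index (assumed initial state)."""
    pes = []
    for pe_id in range(count):
        regs = [0] * NUM_REGS
        regs[1] = pe_id
        pes.append(regs)
    return pes


def broadcast_addi(pes, dest, src, imm):
    """Apply one `addi` to every PE; each PE uses its own register file."""
    for regs in pes:
        regs[dest] = regs[src] + imm


pes = make_pes(3)
broadcast_addi(pes, dest=5, src=1, imm=10)
r5_values = [regs[5] for regs in pes]  # [10, 11, 12]
```

One instruction produces a different result in each element because only the data differs, not the instruction.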
Instruction Controller
• SIMD Instruction Broadcast
[Figure: a load (Ld) into R2 from the address held in R3, with offset 0, is broadcast to three Processing Elements. Every element holds the same pointer, ptr1, in its own R3.]
Instruction Controller
• SIMD Instruction Broadcast
• Instruction Register Broadcast
  • 40% Register Savings
[Figure: an Ldir instruction with operands R2, R0, and IR3 is broadcast to three Processing Elements. The pointer ptr1 is held once in the Instruction Controller's register file and sent to every element, so the elements no longer keep their own copy.]
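The difference between the two load styles on these slides can be sketched as follows. The Python is illustrative only: the function names, the memory contents, and the exact addressing (pointer plus a per-element offset register) are assumptions based on the figures and on the `ldir` form used in the sample code.

```python
def broadcast_ld(pe_regs, memory, dest, base):
    """Plain load: every PE reads the pointer from its own register file."""
    for regs in pe_regs:
        regs[dest] = memory[regs[base]]


def broadcast_ldir(pe_regs, ic_regs, memory, dest, offset, ic_base):
    """Instruction-register load: the pointer is read once in the controller
    and travels with the broadcast instruction."""
    pointer = ic_regs[ic_base]
    for regs in pe_regs:
        regs[dest] = memory[pointer + regs[offset]]


memory = list(range(100, 116))  # hypothetical shared local memory contents
PTR1 = 4

# Without instruction registers: each PE spends R3 on a copy of ptr1.
with_ld = [[0, 0, None, PTR1, None, None] for _ in range(3)]
broadcast_ld(with_ld, memory, dest=2, base=3)

# With instruction registers: ptr1 lives in IR3 only, and R3 stays free in every PE.
with_ldir = [[0, 0, None, None, None, None] for _ in range(3)]
ic_regs = [0, 0, 0, PTR1]
broadcast_ldir(with_ldir, ic_regs, memory, dest=2, offset=0, ic_base=3)
```

Both variants load the same word into R2; the second frees one register in every element by keeping shared values in the controller.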
Processing Element
[Figure: Processing Element datapath. Labels: two Register Files (left and right) with Ra Addr and Rb Addr inputs, Wr Enable Left, Wr En Right, Ra Data Left, Ra Data Right, Rb Data Left, Rb Data Right, Immediate, Data Select, Pipeline Register, ALU, Pipeline Register, Compare, Write Enables, Data, Memory Controller, Mem Wr Enable. Example instruction shown: bmseti R17 EQ 16.]
Memory Controller
[Figure, two frames: block diagram with the labels Memory Controller, IC, PE 0-3, and three Dual Ported Block RAMs. Frame captions: Single Cycle Read; Multiple Cycle Write.]
Instruction Set Architecture
• Custom ISA
• Two Sets of Instruction Types
  • Instruction Controller
  • Processing Element
• Optimized for target applications
• Max, Min, Loop
• Expandable
• Core vs. Application Specific
Sample Code
_query_loop:
    subir %r8, %r3, %ir10
    nop
    nop
    max %r4, %r4, %r8
    add %r3, %r19, PE_ZERO_REG
    bmseti PE_ID_REG EQ PE_NUM_ELEMENTS - 1
    icaddi %ir15, %ir8, PE_NUM_ELEMENTS - 1
    nop
    nop
    ldir PE_MEM_REG, PE_ZERO_REG(%ir15)
    nop
    nop
    nop
    nop
    addi %r3, PE_MEM_REG, 0
    bmend
    ld PE_MEM_REG, PE_ZERO_REG(DB_ADDRESS)
    icaddi %ir7, %ir7, 1
    icaddi %ir9, %ir9, 1
    icloop %ir4, %ir5, _query_loop

Sample Code
_query_loop:
    icaddi %ir15, %ir8, PE_NUM_ELEMENTS - 1
    subir %r8, %r3, %ir10
    add %r3, %r19, PE_ZERO_REG
    ldir PE_MEM_REG, PE_ZERO_REG(%ir15)
    max %r4, %r4, %r8
    bmseti PE_ID_REG EQ PE_NUM_ELEMENTS - 1
    icaddi %ir7, %ir7, 1
    icaddi %ir9, %ir9, 1
    addi %r3, PE_MEM_REG, 0
    bmend
    ld PE_MEM_REG, PE_ZERO_REG(DB_ADDRESS)
    icloop %ir4, %ir5, _query_loop
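The bmseti and bmend pair in the sample code appears to restrict a group of instructions to the Processing Elements that satisfy a condition. The slides do not spell out the exact semantics, so the Python sketch below rests on an assumed reading: bmseti sets one mask bit per element, masked-off elements ignore the instructions that follow, and bmend re-enables all elements. The function name and the dictionary layout of an element are hypothetical.

```python
def run_masked(pes, condition, body):
    """Apply `body` only to PEs where `condition` holds (assumed bmseti ... bmend behavior)."""
    mask = [condition(pe) for pe in pes]   # bmseti: one mask bit per PE
    for pe, enabled in zip(pes, mask):
        if enabled:
            body(pe)                       # masked-off PEs keep their state
    return mask                            # after bmend, all PEs are active again


NUM_ELEMENTS = 4  # hypothetical array size for the example
pes = [{"id": i, "r3": 0, "mem": 50 + i} for i in range(NUM_ELEMENTS)]


def copy_mem_to_r3(pe):
    """Stand-in for `addi %r3, PE_MEM_REG, 0` inside the masked region."""
    pe["r3"] = pe["mem"] + 0


# Mirrors `bmseti PE_ID_REG EQ PE_NUM_ELEMENTS - 1`: only the last PE is enabled.
mask = run_masked(pes, lambda pe: pe["id"] == NUM_ELEMENTS - 1, copy_mem_to_r3)
r3_values = [pe["r3"] for pe in pes]  # [0, 0, 0, 53]
```

Under this reading, masking gives a SIMD array a form of per-element branching without changing the broadcast instruction stream.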
Results
• VHDL Implementation
  • Simulated
  • Synthesized
• Smith-Waterman
  • 16 PE version tested
  • Millions of Cell Updates Per Second (MCUPS)
Smith-Waterman Speedup
System   Freq      MCUPS   Speedup
P4       1.8 GHz   15      1
SVP16    150 MHz   52      3.47
SVP32    150 MHz   103     6.87
SVP64    125 MHz   167     11.13
SVP128   120 MHz   302     20.13
SVP128   150 MHz   378     25.20
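The Speedup column is each system's MCUPS divided by the P4 baseline of 15 MCUPS. A short Python check, using only the numbers from the table (the variable and function names are illustrative):

```python
BASELINE_MCUPS = 15  # P4 1.8 GHz row

# (system, frequency, MCUPS) as listed in the table
measurements = [
    ("SVP16", "150 MHz", 52),
    ("SVP32", "150 MHz", 103),
    ("SVP64", "125 MHz", 167),
    ("SVP128", "120 MHz", 302),
    ("SVP128", "150 MHz", 378),
]


def speedup(mcups, baseline=BASELINE_MCUPS):
    """Speedup over the baseline, rounded to two decimals as in the table."""
    return round(mcups / baseline, 2)


speedups = [speedup(mcups) for _, _, mcups in measurements]

for (system, freq, mcups), s in zip(measurements, speedups):
    print(f"{system:7s} {freq:8s} {mcups:4d} MCUPS  {s:.2f}x")
```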
Comparative Performance
System*        Freq      PEs/Chip   MCUPS/PE   Chips   MCUPS/Chip   Cost ($1000)   MCUPS/$1000
SVP128         150 MHz   128        2.95       1       378          5              75
SVP128         120 MHz   128        2.36       1       302          5              60
SVP64          125 MHz   64         2.61       1       167          5              33
SVP32          150 MHz   32         3.22       1       103          5              20
Kestrel        20 MHz    64         0.78       8       50           25†            16
GeneMatcher2   192 MHz   192        5.21       16      1000         69             14
Fuzion 150     200 MHz   1536       1.63       1       2500         ?              ?
* Reference [1]
† Estimated
Performance
PEs   Freq (MHz)   Area   BRAM
16    150          13%    22
32    150          22%    38
64    125          41%    70
128   120          80%    134
• Hardware
  • Xilinx Virtex-4 LX200
Future Work
• Software Development
  • How can HMMer and other systolic algorithms be implemented?
• ISA Expansion
  • What additional instructions are needed?
  • What instructions can be added to optimize?
• Hardware Development
  • How can we optimize the hardware to make it faster and smaller?
  • What hardware can we add to enhance performance?
  • How can we take advantage of advances in FPGAs, such as DSP48s?
Acknowledgments
• Special Thanks
  • Young Cho
  • Roger Chamberlain
  • Jeremy Buhler
  • Joseph Lancaster
• References
  • [1] Di Blas et al., “The Kestrel Parallel Processor,” IEEE Transactions on Parallel and Distributed Systems, January 2005
  • [2] A. Jacob et al., “Whole Genome Comparison Using Commodity Workstations,” Technical Report, 2003
Questions?
Team ASP
Brandon Harris
Arpith Jacob