Frank Vahid, UCR 1
Building Fake Body Parts: Digital Mockups
Frank Vahid
Univ. of California, Riverside
Support provided by NSF, SRC, Dept. of Educ.
Also CareFusion, Xilinx, METI
Chen Huang (UC Riverside, now Amazon)
Bailey Miller (UC Riverside, intern at SpaceX)
Prof. Tony Givargis (UC Irvine)
Ting-Shuo Chou (UC Irvine)
Others...
Models of physical world that run in real-time
http://www.nhlbi.nih.gov/
Test cyber-physical systems
Physical mockup
Transducer models
Environment model
Digital mockup
Issue: Real-time achieved via inaccuracy
Weibel lung complexity
4 gen: 32 ODEs
6 gen: 128 ODEs
8 gen: 512 ODEs
10 gen: 2048 ODEs
“2-3 minutes to simulate one breath accurately”
[Figure: branching airway tree; node labels include V[1],R[1], V[2],R[2], and V[7],R[7]]
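The four ODE counts listed for the Weibel lung model double with every added generation and fit 2^(g+1) for g generations. The sketch below only checks that pattern; `weibel_ode_count` is a hypothetical helper, and the closed form is inferred from the slide's four data points, not taken from the lung model itself.

```python
def weibel_ode_count(generations: int) -> int:
    """ODE count that fits the four slide data points: 2 ** (generations + 1).

    Assumption: inferred from the slide's numbers only, not from the model.
    """
    return 2 ** (generations + 1)


# Data points copied from the slide: generations -> number of ODEs.
slide_counts = {4: 32, 6: 128, 8: 512, 10: 2048}

for gens, odes in slide_counts.items():
    print(gens, weibel_ode_count(gens), odes)
```

Each extra generation doubles the work per timestep, which is why accuracy and real-time execution pull against each other.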
PC & GPU
[Chart: Performance (ms), axis 0 to 1000, for the Weibel, Neuron, Weibel + gas, Hemodynamic, and Weibel + hemo models on PC(1), PC(4), and GPU; off-scale bars are labeled 1430, 1490, 1522, and 1184]
Speedup vs real-time
PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x
Parallel computations + neighbor communication: seems like a great match for FPGAs
FPGAs: Sw circuits (parallel)
C Code for FIR Filter (processor)
for (i=0; i < 128; i++) y += c[i] * x[i];
...
• 1000’s of instructions – several thousand cycles
Circuit for FIR Filter (processor + FPGA)
[Figure: a row of parallel multipliers (*) feeding a tree of adders (+) that narrows to a single sum]
~7 cycles (though slower clock)
Speedup > 10x-100x
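To make the circuit picture concrete, here is a minimal software model of the multiply-then-adder-tree structure. It is a sketch, not the circuit itself: `adder_tree_sum` is a hypothetical name and the coefficients and samples are made up. For 128 taps the tree has seven adder stages, which is one plausible reading of the "~7 cycles" figure.

```python
def adder_tree_sum(products):
    """Sum values pairwise, level by level, as a hardware adder tree would.

    Returns (total, levels), where levels is the number of adder stages.
    """
    level = list(products)
    if not level:
        return 0, 0
    levels = 0
    while len(level) > 1:
        if len(level) % 2:              # pad odd-sized levels with a zero input
            level.append(0)
        # In hardware, every addition on this level can happen in the same cycle.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels


c = list(range(128))                    # made-up coefficients
x = [2] * 128                           # made-up input samples
products = [ci * xi for ci, xi in zip(c, x)]   # 128 multipliers, all independent
total, levels = adder_tree_sum(products)
print(total, levels)
```

The sequential loop needs 128 dependent additions; the tree needs only log2(128) = 7 dependent stages, which is where the parallel speedup comes from.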
FPGAs “101” (A Quick Intro)
[Figure: a lookup table (LUT) built from a 4x2 memory, addressed by inputs a, b (a1, a0) with data outputs d1, d0 giving F, G; a 2x2 switch matrix (SM) with inputs w, x and outputs y, z; and an FPGA drawn as LUTs connected through SMs, configured to compute outputs D, E from inputs a, b, c]
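A LUT is just a small memory whose address lines are the logic inputs. The sketch below models the 4x2 memory case (two inputs, two outputs); `make_lut` is a hypothetical helper and the half-adder contents are made up for illustration, not read off the slide.

```python
def make_lut(contents):
    """Model a 2-input, 2-output LUT as a 4x2 memory.

    contents[address] holds the (F, G) output bits for address = 2*a + b.
    """
    assert len(contents) == 4

    def lut(a, b):
        return contents[2 * a + b]

    return lut


# Hypothetical configuration: F = a AND b, G = a XOR b (a half adder).
half_adder = make_lut([(0, 0), (0, 1), (0, 1), (1, 0)])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, half_adder(a, b))
```

Loading different memory contents turns the same hardware into a different pair of logic functions, which is what makes the fabric programmable.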
HLS
[Chart: Performance (ms), axis 0 to 1000, for the Weibel, Neuron, Weibel + gas, Hemodynamic, and Weibel + hemo models on PC(1), PC(4), GPU, and HLS / FPGAs; off-scale bars are labeled 1430, 1490, 1522, and 1184]
Speedup vs real-time
PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS/FPGA: 3.2x
High-level synthesis:
Compiler that converts program to circuits
Network of synchronized PEs on FPGAs
[Figure: FPGA holding a digital mockup built from a network of PEs]
General Processing Element
• Iterative ODE solver (Euler/RK4)
• 0.1 ms / 0.01 ms timestep
• 1 PE: 300 MHz
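For reference, the two iterative solvers named for the general PE can be sketched as follows. This is a generic textbook version, not the PE's instruction sequence; the test equation y' = -y and the unitless step sizes are assumptions chosen only to show how step size and method affect accuracy.

```python
import math


def euler_step(f, t, y, h):
    """Forward Euler: one slope evaluation per step."""
    return y + h * f(t, y)


def rk4_step(f, t, y, h):
    """Classical fourth-order Runge-Kutta: four slope evaluations per step."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(step, f, y0, t_end, h):
    """Advance y from t = 0 to t_end with a fixed step h."""
    t, y = 0.0, y0
    for _ in range(round(t_end / h)):
        y = step(f, t, y, h)
        t += h
    return y


def decay(t, y):
    return -y                           # y' = -y, exact solution exp(-t)


exact = math.exp(-1.0)
for h in (0.1, 0.01):
    euler_err = abs(integrate(euler_step, decay, 1.0, 1.0, h) - exact)
    rk4_err = abs(integrate(rk4_step, decay, 1.0, 1.0, h) - exact)
    print(h, euler_err, rk4_err)
```

Euler is cheaper per step but needs a smaller timestep for the same accuracy, which is the trade-off behind offering both methods and both timesteps.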
Synthesis tool
Phase 1: Maps ODEs to virtual PEs using simulated annealing
Phase 2: Converts virtual PEs to physical circuits using FPGA place-route
(Figure labels: 10K iterations, 150K iterations)
General PEs
[Chart: Performance (ms), axis 0 to 1000, for the Weibel, Neuron, Weibel + gas, Hemodynamic, and Weibel + hemo models on PC(1), PC(4), GPU, HLS, and General PEs; off-scale bars are labeled 1430, 1490, 1522, and 1184]
Speedup vs real-time
PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PEs: 4.9x
Problem: More PEs → Lower frequency
[Figure: FPGA showing the inter-PE critical path and the internal PE critical path (DSP, INST MEM, DATA MEM)]
[Chart: Frequency vs. PE network size; Frequency (MHz), 0 to 400, against # PEs, 0 to 600]
11-gen Weibel model, Virtex6 240T FPGA, general PEs
[Chart: kODEs/sec, 0 to 2000, against # PEs (1, 10, 20, 50, 100, 200, 400); series: Real ODEs/sec, Lost ODEs/sec due to freq drop]
Use model structure to improve
Graph embedding: Map guest graph to host graph, minimizing max wire length
Guest: Virtual PEs
Host: Physical PEs
Avoid using FPGA placement (Phase 2)
Phase 2 – Map virtual PEs to physical PEs
Embedding algorithm
• H-tree embedding
• Linear embedding
• Direct map embedding
[Figure: guest graphs (1, 2, 3, …) and their host placements on the FPGA]
[1] Zienicke, P. 1990. Embeddings of Treelike Graphs into 2-Dimensional Meshes. (WG '90).
[2] Aleliunas, R., and Rosenberg, A.L. 1982. On Embedding Rectangular Grids in Square Grids. (Computers '82).
[3] Berman, F., and Snyder, L. 1987. On mapping parallel algorithms into parallel architectures. (PDC '87).
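As an illustration of the H-tree idea, the sketch below places a complete binary tree (the guest) on integer grid coordinates (the host) and reports the longest parent-child wire. It is a standard H-tree layout under assumed conventions (heap numbering, Manhattan wire length), not necessarily the algorithm used in the tool or in the cited papers; `h_tree_embed` and `max_wire_length` are hypothetical names.

```python
def h_tree_embed(levels):
    """Place a complete binary tree with `levels` edge levels on a 2D grid.

    Nodes use heap numbering (root 1, children 2n and 2n+1). Branches
    alternate between horizontal and vertical, and the branch length halves
    every two levels, which gives the H-shaped layout.
    Returns {node: (x, y)}.
    """
    pos = {1: (0, 0)}
    for node in range(1, 2 ** levels):           # every non-leaf node
        depth = node.bit_length() - 1            # root is depth 0
        step = 2 ** ((levels - 1 - depth) // 2)
        x, y = pos[node]
        if depth % 2 == 0:                       # horizontal branch
            pos[2 * node] = (x - step, y)
            pos[2 * node + 1] = (x + step, y)
        else:                                    # vertical branch
            pos[2 * node] = (x, y - step)
            pos[2 * node + 1] = (x, y + step)
    return pos


def max_wire_length(pos):
    """Longest parent-child wire, measured as Manhattan distance."""
    return max(
        (abs(pos[n][0] - pos[n // 2][0]) + abs(pos[n][1] - pos[n // 2][1])
         for n in pos if n > 1),
        default=0,
    )


placement = h_tree_embed(4)                      # 31-node tree
print(len(placement), max_wire_length(placement))
```

Because the placement is computed directly from the tree structure, no iterative placement search is needed, and the longest wire grows only with the square root of the tree size.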
2D grid of physical PEs
Bypass FPGA placement
[Figure: FPGA as a 2D grid of physical PEs, each holding a pair of equations; labels shown include EqP1/EqV1, EqP2/EqV2, EqP4/EqV4, EqP5/EqV5, EqP6/EqV6, and EqP7/EqV7]
(Phase 1: May require "graph folding" first to reduce #PEs)
Compare/backup: Simulated annealing
Cost function:
C = w1*sum + w2*max + w3*gaps
sum = sum of wire distances
max = max wire length (Euclidean dist.)
gaps = wires across architectural features
[Figure: FPGA placements with PEs labeled P1 and P2]
Neighbor function: Swap PEs based on distance to neighbors
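The cost function above can be evaluated for a candidate placement as sketched below. The data layout is an assumption: PEs sit at grid coordinates, and an "architectural feature" is modeled as a vertical line at a given x position that a wire crosses when its endpoints lie on opposite sides. `placement_cost` and `swap` are hypothetical names, and the weights default to 1 only for illustration.

```python
import math


def placement_cost(pos, wires, gap_columns, w1=1.0, w2=1.0, w3=1.0):
    """C = w1*sum + w2*max + w3*gaps for one placement.

    pos: {pe: (x, y)} grid location of each PE
    wires: list of (pe_a, pe_b) connections
    gap_columns: x positions of architectural features (assumed model)
    """
    lengths = [math.dist(pos[a], pos[b]) for a, b in wires]   # Euclidean
    gaps = sum(
        1
        for a, b in wires
        for g in gap_columns
        if min(pos[a][0], pos[b][0]) < g < max(pos[a][0], pos[b][0])
    )
    return w1 * sum(lengths) + w2 * max(lengths, default=0.0) + w3 * gaps


def swap(pos, a, b):
    """Neighbor move: exchange the locations of two PEs."""
    new = dict(pos)
    new[a], new[b] = pos[b], pos[a]
    return new


# Made-up example: a chain P1 - P2 - P3 with P2 placed far from P1.
pos = {"P1": (0, 0), "P2": (3, 0), "P3": (1, 0)}
wires = [("P1", "P2"), ("P2", "P3")]
before = placement_cost(pos, wires, gap_columns=[1.5])
after = placement_cost(swap(pos, "P2", "P3"), wires, gap_columns=[1.5])
print(before, after)
```

In the made-up example, swapping P2 and P3 shortens both the total and the longest wire and removes one feature crossing, so an annealer would accept the move.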
Results
[Figure: three placements: no placement strategy (4 generations shown), simulated annealing placement (5 generations shown), embedding placement (5 generations shown)]
Results
[Chart: Circuit frequency (MHz), 0 to 400, by model and #PEs, for No placement strategy, Simulated Annealing, and Embedding; some configurations are marked "Not routable"]

Strategy   # LUTs   # BRAM   # DSP   Equivalent LUTs
None       58362    512      256     306682
SA         58567    512      256     306887
Embed      58569    512      256     306889
No impact on size

2D Neuron model - 256 PEs - Xilinx Virtex6

Strategy   Total power (mW)   Dynamic power (mW)   Static power (mW)
None       15525              8744                 6481
SA         16604              10013                6590
Embed      19859              12999                6859
20% more power

[Chart: Frequency vs. PE network size; Frequency (MHz), 0 to 400, against # PEs, 0 to 600]
Graph emb (Gen PEs)
Miller, B., F. Vahid, and T. Givargis. Embedding-Based Placement of Processing Element Networks on FPGAs for Physical Model Simulation. ACM Int. Symp. on FPGAs, 2013.
Speedup vs real-time (avg)
PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb(GPE): 11.2x
Custom Processing Element
• Custom datapath to solve specific type of equation
[Figure: custom PE inside the FPGA digital mockup; labeled blocks: Controller, Const ROM, Data RAM (We), Address, Input_sel, Inputs, Output, SUB, MUL]
V’ = F1 – F2
F’ = P1-P2-(F*CR)*CL
Custom PE for each ODE type
Modified synthesis tool to create custom PEs for the given ODEs first, then synthesize ODEs to PEs
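A custom PE only has to evaluate one fixed right-hand side, so its datapath reduces to a few subtractors and multipliers. The sketch below performs one Euler step of the two ODE types shown, with the expressions evaluated exactly as written on the slide. `custom_pe_step` is a hypothetical name, the operand values are made up, and treating CR and CL as stored constants is an assumption.

```python
def custom_pe_step(V, F, F1, F2, P1, P2, CR, CL, h):
    """One Euler step of the two ODE types on the slide.

    V' = F1 - F2
    F' = P1 - P2 - (F*CR)*CL   (evaluated exactly as written on the slide)
    """
    dV = F1 - F2                        # one SUB
    dF = P1 - P2 - (F * CR) * CL        # two MULs and two SUBs
    return V + h * dV, F + h * dF


# Made-up operand values, only to exercise the arithmetic.
new_V, new_F = custom_pe_step(V=1.0, F=2.0, F1=3.0, F2=1.0,
                              P1=10.0, P2=4.0, CR=0.5, CL=2.0, h=0.5)
print(new_V, new_F)
```

A general PE would fetch and decode instructions to do the same arithmetic; hard-wiring the operations is what makes the custom PE faster but tied to one equation type.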
Custom PEs
[Chart: Performance (ms), axis 0 to 1000, for the Weibel, Neuron, Weibel + gas, Hemodynamic, and Weibel + hemo models on PC(1), PC(4), GPU, HLS, General PEs, and Custom PEs; off-scale bars are labeled 1430, 1490, 1522, and 1184]
Huang, Vahid, Givargis. Synthesis of networks of custom processing elements for real-time physical system emulation. Transactions on Design Automation of Electronic Systems (TODAES), 2013 (to appear).
Speedup vs real-time (avg)
PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb(GPE): 11.2x, Custom PE: 6.1x
Networks of Heterogeneous PEs
Huang, Miller, Vahid, Givargis. Synthesis of Heterogeneous Processing Elements for Physical System Emulation. CODES+ISSS 2012, Oct, 2012.
• General PE: slow, flexible (can solve any type of ODE)
• Custom PE: fast, inflexible (only solves one type of ODE)
• Multi-type PE: combines multiple types of ODEs into a single custom PE
[Figure: FPGA digital mockup]
Huge solution space:
• How to choose types of PEs?
• How many PEs to allocate?
• How to bind ODEs to PEs?
Automatic allocation and binding
[Flowchart (simulated annealing): Initial random allocation → PE allocator → New PE allocation → ODE-to-PE mapper → Cycles of each PE → Better solution? (Y/N) → Best solution]
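The allocation-and-binding loop can be sketched as a small simulated-annealing search. Everything numeric here is made up: the cycle counts, the area costs, the area budget, and the cooling schedule. The mapper is a simple greedy stand-in, and the search starts from one general PE instead of a random allocation, so this shows the shape of the loop and is not the tool's algorithm.

```python
import math
import random

# Hypothetical numbers: cycles per ODE and area per PE are made up.
CYCLES = {"general": 10, "custom": 2}
AREA = {"general": 3, "custom": 1}


def map_odes(odes, alloc):
    """ODE-to-PE mapper (greedy): returns the cycles of the busiest PE.

    odes: {ode_type: count}. alloc: {"general": n, ode_type: n, ...}, where
    a key naming an ODE type is a custom PE that solves only that type.
    """
    worst, leftover = 0, 0
    for typ, count in odes.items():
        pes = alloc.get(typ, 0)
        if pes:
            worst = max(worst, math.ceil(count / pes) * CYCLES["custom"])
        else:
            leftover += count           # no custom PE: fall back to general PEs
    if leftover:
        if not alloc.get("general", 0):
            return math.inf             # some ODEs cannot be bound at all
        per_pe = math.ceil(leftover / alloc["general"])
        worst = max(worst, per_pe * CYCLES["general"])
    return worst


def area(alloc):
    return sum(n * AREA["general" if k == "general" else "custom"]
               for k, n in alloc.items())


def anneal(odes, budget, iters=2000, temp=5.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    kinds = ["general"] + list(odes)
    current = {"general": 1}            # fixed feasible start (assumption)
    cur_cost = map_odes(odes, current)
    best, best_cost = dict(current), cur_cost
    for _ in range(iters):
        cand = dict(current)
        k = rng.choice(kinds)           # PE allocator: add or remove one PE
        cand[k] = max(0, cand.get(k, 0) + rng.choice((-1, 1)))
        if area(cand) > budget:
            continue
        cost = map_odes(odes, cand)
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = dict(cand), cost
        temp *= cooling
    return best, best_cost


odes = {"V": 8, "F": 8}                 # made-up workload: 8 ODEs of each type
best, best_cost = anneal(odes, budget=8)
print(best, best_cost)
```

The cost being minimized is the cycle count of the slowest PE, since synchronized PEs can only advance a timestep as fast as the busiest one.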
Heterogeneous PEs
[Chart: Performance (ms), axis 0 to 1000, for the Weibel, Neuron, Weibel + gas, Hemodynamic, and Weibel + hemo models on PC(1), PC(4), GPU, HLS, General PEs, Custom PEs, and Heterogeneous PEs; off-scale bars are labeled 1430, 1490, 1522, and 1184]
C. Huang, B. Miller, F. Vahid, T. Givargis. Synthesis of Custom Networks of Heterogeneous Processing Elements for Complex Physical System Emulation. IEEE/ACM Conf on Hardware/Software Codesign and System Synthesis (CODES/ISSS, part of ESWEEK), Finland, Oct 2012.
Speedup vs real-time (avg)
PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb(GPE): 11.2x, Custom PE: 6.1x, Heterog PE: 34.5x
Network of general/custom/heterogeneous PEs vs. HLS (regularity extraction)
Heterogeneous PE vs. each alternative, as (Speed, Size):
• (10x, 1.1x) HLS
• (7x, 0.85x) general PE
• (6x, 1.35x) custom PE
[Chart: Performance (ms), 0 to 350, against Size (equivalent LUTs), 0 to 300,000, for HLS, General PE, Custom PE, and Hetero PE]
Performance (ms): time to emulate 1000 ms, using Euler with 0.01 ms step.
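Since performance is reported as the time needed to emulate 1000 ms of model time, a speedup over real time follows directly, as does the number of Euler steps per run. The sketch below spells out that arithmetic; the function names are hypothetical, the speedup definition is an assumption consistent with the axis labels, and the 200 ms run time is made up.

```python
SIMULATED_MS = 1000.0       # model time emulated per run (from the slide)
STEP_MS = 0.01              # Euler timestep (from the slide)


def euler_steps():
    """Number of Euler steps needed to cover the simulated interval."""
    return round(SIMULATED_MS / STEP_MS)


def speedup_vs_real_time(wall_clock_ms):
    """Values above 1 mean the emulation runs faster than real time."""
    return SIMULATED_MS / wall_clock_ms


print(euler_steps())
print(speedup_vs_real_time(200.0))      # 200 ms is a made-up run time
```

Any bar on the performance charts below 1000 ms therefore corresponds to faster-than-real-time execution.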
Speedup / dollar
• CPU (I7-950 + Intel X58 board): $480
• GPU (GTX460 + I3-540 + H55 board): $380
• FPGA (Xilinx Virtex6 240T-2 board): $1800
[Chart: Normalized speedup (speedup / dollar), 0 to 0.03, for the Weibel, Neuron, Weibel + gas, Hemodynamic, and Weibel + hemo models on PC(1), PC(4), GPU, and Heterogeneous PEs]
Heterogeneous PEs:
• 3x better than PC(4)
• 4.5x better than GPU
FPGA: Easier to build custom interfaces
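The speedup-per-dollar comparison can be roughly reproduced from numbers already on the slides. The sketch below pairs each platform's average speedup with its listed price; that pairing is an assumption, since the chart itself is plotted per model, so the ratios only approximate the 3x and 4.5x quoted above.

```python
platforms = {
    # name: (average speedup vs real-time, price in dollars), from the slides
    "PC(4)": (3.1, 480),
    "GPU": (1.6, 380),
    "Heterogeneous PEs": (34.5, 1800),
}

per_dollar = {name: s / price for name, (s, price) in platforms.items()}
hetero = per_dollar["Heterogeneous PEs"]

for name, value in per_dollar.items():
    print(f"{name}: {value:.5f} speedup/$, "
          f"heterogeneous PE advantage {hetero / value:.1f}x")
```

Even at several times the board price, the FPGA comes out ahead because its speedup grows faster than its cost.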
Other projects
• Assistive monitoring
  – www.cs.ucr.edu/~vahid/assistivemonitoring/
  – http://www.youtube.com/watch?feature=player_embedded&v=Sf8tU-78lXs
  – ..\Desktop\Fall montage.mp4, ..\Desktop\Frank_pullChair_013113_cam3.video.wmv
• Web-based learning
  – "Textbook is dead"
  – Multi-univ synergy
  – pcpp.zyante.com (C++)
• Embedded systems educ.
  – New prog. model, virtual lab, programmingembeddedsystems.com
  – Also riosscheduler.org
• Drunk driving (DUI)
  – ..\Desktop\dui.MOV
  – duicam.org
  – http://www.utsandiego.com/news/2013/feb/11/ucr-drunken-driving-app/
Summary
http://www.meti.com/
Speedup vs real-time (avg)
PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb(GPE): 11.2x, Custom PE: 6.1x, Heterog PE: 34.5x (Grph emb+HPE: 48.5x)
• FPGAs: Fastest cost-effective execution of physical models
  – http://www.youtube.com/watch?v=ThUKVhqoA3Q
• Future
  – Manycore device
  – Beyond testing CPS
    • Implement end-products
Questions?