Frank Vahid, UCR 1 Building Fake Body Parts: Digital Mockups Frank Vahid Univ. of California, Riverside Support provided by NSF, SRC, Dept. of Educ. Also

Frank Vahid, UCR 1

Building Fake Body Parts: Digital Mockups

Frank Vahid

Univ. of California, Riverside

Support provided by NSF, SRC, Dept. of Educ.

Also CareFusion, Xilinx, METI

Chen Huang (UC Riverside, now Amazon)

Bailey Miller (UC Riverside, intern at SpaceX)

Prof. Tony Givargis (UC Irvine)

Ting-Shuo Chou (UC Irvine)

Others...

Frank Vahid, UCR 2Bailey Miller, UCR 2

Frank Vahid, UCR 3Frank Vahid, UCR 3

Models of physical world that run in real-time

http://www.nhlbi.nih.gov/

Test cyber-physical systems

Physical mockup

Transducermodels

Environmentmodel

Digital mockup




Frank Vahid, UCR 4

Issue: Real-time achieved via inaccuracy

Frank Vahid, UCR 4

Weibel lung complexity4 gen: 32 ODEs6 gen: 128 ODEs8 gen: 512 ODEs10 gen: 2048 ODEs

“2-3 minutes to simulate one breath accurately”

V[1],R[1]

V[2],R[2]V[7],R[7]


0

100200

300400

500

600

700

800

900

1000

Weib

el

Neuro

n

Weib

el +

gas

Hemod

ynam

ic

Weib

el +

hem

o

Per

form

ance

(m

s)

PC(1)PC(4)GPU

Speedup vs real-time

PC(1): 0.8xPC(4): 3.1xGPU: 1.6x

PC & GPU14301490

1522

1184

Parallel computations + • Neighbor communication Seem like great match for FPGAs

Frank Vahid, UCR 6

for (i=0; i < 128; i++) y[i] += c[i] * x[i]......

FPGAs: Sw circuits (parallel)

for (i=0; i < 128; i++) y += c[i] * x[i]......

* * * * * * * * * * * *

+ + + + + +

+ + +

+ +

+

C Code for FIR Filter

Processor Processor

• 1000’s of instructions– Several thousand cycles

Circuit for FIR Filter

Processor FPGA

~ 7 cycles (though slower clock) Speedup > 10x-100x

Frank Vahid, UCR 7

2x2 switch matrix

y

z

01

01

w

x

FPGAs “101” (A Quick Intro)

a b a1

a0

4x2 Memory

ab

d1 d0

F G

00011011

LUT

F G

1 0 SM

SM

SM

SM

SM

SM

SM

LUT

SM

SM

SM

SM

SM

LUT

1110

1010

00011011

a b

0

1111

0

1110

1010

SM

a b c

D E

FPGA

a b c

D E

1 11 01 10 0

0 00 00 01 0

0110 1100

0000 1111


0

100200

300400

500

600

700

800

900

1000

Weib

el

Neuro

n

Weib

el +

gas

Hemod

ynam

ic

Weib

el +

hem

o

Per

form

ance

(m

s)

PC(1)PC(4)GPUHLS / FPGAs

Speedup vs real-time

PC(1): 0.8xPC(4): 3.1xGPU: 1.6xHLS/FPGA: 3.2x

HLS14301490

1522

1184

High-level synthesis:

Compiler that converts program to circuits


Network of synchronized PEs on FPGAs

FPGA

Digital mockup

General Processing Element• Iterative ODE solver (Euler/RK4) 0.1 ms / 0.01 ms timestep

PEPE

PE 1 PE: 300 MHz

Frank Vahid, UCR 10

Synthesis tool

10K iterations

150K iterations

Convert virtual PEs to physical circuits using

FPGA place-route

1

2

Phase

Maps ODEs to virtual PEs using

simulated annealing


0

100200

300400

500

600

700

800

900

1000

Weib

el

Neuro

n

Weib

el +

gas

Hemod

ynam

ic

weibel

+ he

mo

Per

form

ance

(m

s)

PC(1)PC(4)GPUHLSGeneral PEs

Speedup vs real-timePC(1): 0.8xPC(4): 3.1xGPU: 1.6xHLS: 3.2xGeneral PEs: 4.9x

General PEs1430

14901522

1184

Frank Vahid, UCR 12

Problem: More PEs Lower frequency

FPGA

Inter-PE critical path

FPGA

DSP

INST MEM

DATA MEM

Internal PE critical path

0100200300400

0 200 400 600Freq

uenc

y (M

Hz)

# PEs

Frequency vs. PE network size

11-gen Weibel model, Virtex6 240T FPGA, general PEs

0

500

1000

1500

2000

1 10 20 50 100 200 400

# PEs

kOD

Es/

sec

Real ODEs/sec

Lost ODEs/sec due to freq drop

Frank Vahid, UCR 13

FPGA

Use model structure to improve

Graph embedding: Map guest graph to host graph, minim. max wire length

Guest

Host

Virtual PEs

Physical PEs

Avoid using FPGA placement (Phase 2)


FPGA

12 3

…

FPGA FPGA

Phase 2 – Map virtual PEs to physical PEs

Embedding algorithm

H-tree embedding Linear embedding Direct map embedding

Guest

Host

[1] Zienicke, P. 1990. Embeddings of Treelike Graphs into 2-Dimensional Meshes. (WG '90).[2] Aleliunas, R., and Rosenberg, A.L. 1982. On Embedding Rectangular Grids in Square Grids. (Computers ‘82).[3] Berman, F., and Snyder, L. 1987. On mapping parallel algorithms into parallel architectures, (PDC, ‘87).

Frank Vahid, UCR 15

2D grid of physical PEs

FPGA

Bypass FPGA placement

EqP1EqV1

EqP4EqV4

EqP2EqV2

EqP5EqV5

EqP2EqV2

EqP6EqV6

EqP7EqV7

(Phase 1: May require "graph folding" first to reduce #PEs)

Frank Vahid, UCR 16

FPGA

Compare/backup: Simulated annealing

Cost function:

C = w1*sum + w2*max + w3*gaps

Sum = sum of wire distancesMax = max wire length (Euclidean dist.)Gaps = wires across architectural features

FPGA

P1 P1

P2

Neighbor function: Swap PEs based on distance to neighbors

Frank Vahid, UCR 17

Results

No placement strategy

Simulated annealing placement

Embedding placement

4 generations shown 5 generations shown 5 generations shown

F V

"No placement strategy" shows fewer nodes, misleads viewer into thinking the placement area is smaller. Can you show all 30 nodes for all three strategies?

Frank Vahid, UCR 18

Results

0

100

200

300

400

Model and #PEs

Cir

cuit

freq

uenc

y (M

Hz)

No placement strategy Simulated Annealing Embedding

Not routable

Strategy # LUTS # BRAM #DSP Equivalent LUTs

None 58362 512 256 306682

SA 58567 512 256 306887

Embed 58569 512 256 306889

No impact on size

2D Neuron model - 256PE – Xilinx Virtex6

Strategy Total power (mW)

Dynamic power (mW)

Static power (mW)

None 15525 8744 6481

SA 16604 10013 6590

Embed 19859 12999 6859

20% more power

0100200300400

0 200 400 600Fre

qu

en

cy (

MH

z)

# PEs

Frequency vs. PE network size


Miller, B., F. Vahid, and T. Givargis. Embedding-Based Placement of Processing element Networks on FPGAs for Physical Model Simulation. ACM Int. Symp. on FPGAs, 2013.

Graph emb (Gen PEs)

Speedup vs real-time (avg)

PC(1): 0.8xPC(4): 3.1xGPU: 1.6xHLS: 3.2xGeneral PE: 4.9xGrph emb(GPE): 11.2x

Frank Vahid, UCR 20

Custom Processing Element

• Custom datapath to solve specific type of equation

MUL

Const ROM

Address

Input_sel

Address

Inputs

Output

SUB

Controller

WeData RAM

Controller

PE

SUB MUL

FPGADigital mockup

V’ = F1 – F2

F’ = P1-P2-(F*CR)*CL

Custom PE for each ODE type

Modified synthesis tool to create custom PEs for given ODEs first, then synthesis ODEs to PEs


0

100200

300400

500

600

700

800

900

1000

Weib

el

Neuro

n

Weib

el +

gas

Hemod

ynam

ic

weibel

+ he

mo

Per

form

ance

(m

s)

PC(1)PC(4)GPUHLSGeneral PEsCustom PEs

Huang, Vahid, Givargis. Synthesis of networks of custom processing elements for real-time physical system emulation. Transactions on Design Automation of Electronic Systems (TODAES), 2013 (to appear).

Custom PEs1430

14901522

1184


PC(1): 0.8xPC(4): 3.1xGPU: 1.6xHLS: 3.2xGeneral PE: 4.9xGrph emb(GPE): 11.2xCustom PE: 6.1x

Frank Vahid, UCR 22

Networks of Heterogeneous PEs

Huang, Miller, Vahid, Givargis. Synthesis of Heterogeneous Processing Elements for Physical System Emulation. CODES+ISSS 2012, Oct, 2012.

• General PE: – Slow, flexible (can solve any types of ODEs)

• Custom PE: – Fast, inflexible (only solves one type of ODEs)

• Multi-Type PE– Combined multiple types of ODEs into single custom PE

FPGADigital mockup

Huge solution space:How to choose types of PEs?

How many PEs to allocate?

How to bind ODEs to PEs?

Frank Vahid, UCR 23

Automatic allocation and binding

Initial random allocation

PE allocator

ODE-to-PE mapper

New PE allocation

Cycles of each PE

Better solution Best solutionN

Y

Simulated annealing


0

100200

300400

500

600

700

800

900

1000

Weib

el

Neuro

n

Weib

el +

gas

Hemod

ynam

ic

weibel

+ he

mo

Per

form

ance

(m

s)

PC(1)PC(4)GPUHLSGeneral PEsCustom PEsHeterogeneous PEs

Heterogeneous PEs

C. Huang, B. Miller, F. Vahid, T. Givargis. Synthesis of Custom Networks of Heterogeneous Processing Elements for Complex Physical System Emulation. IEEE/ACM Conf on Hardware/Software Codesign and System Synthesis (CODES/ISSS, part of ESWEEK), Finland, Oct 2012.

14301490

1522

1184


PC(1): 0.8xPC(4): 3.1xGPU: 1.6xHLS: 3.2xGeneral PE: 4.9xGrph emb(GPE): 11.2xCustom PE: 6.1xHeterog PE: 34.5x

Frank Vahid, UCR 25

Network of general/custom/heterogeneous PEsVS HLS (regularity extraction)

Heterogeneous PE:

(10x, 1.1x) HLS

(7x, 0.85x) general PE

(6x, 1.35x) custom PE

(Speed, Size)0

50

100

150

200

250

300

350

0 50,000 100,000 150,000 200,000 250,000 300,000

HLS

General PE

Custom PE

Hetero PE

Performance (ms): time to emulate 1000 ms, using Euler with 0.01 ms step.

Size (equivalent LUTs)

Frank Vahid, UCR 26

Speedup / dollar

CPU (I7-950 + Intel X58 board): $480 GPU(GTX460 + I3-540 + H55 board): $380FPGA (Xilinx Virtex6 240T-2 board): $1800

0.00000

0.00500

0.01000

0.01500

0.02000

0.02500

0.03000

Weibel

Neuron

Weibel

+ gas

Hemodynam

ic

weibel +

hem

o

Nor

mal

ized

spe

edup

(sp

eedu

p /

dolla

r)

PC(1)PC(4)GPUHeterogeneous PEs

Heterogeneous PEs:

3X better than PC(4)

4.5x better than GPU

FPGA: Easier to build custom interfaces

Frank Vahid, UCR 27

Other projects

• Assistive monitoring- www.cs.ucr.edu/~vahid/assistivemonitoring/ - http://www.youtube.com/watch?feature=player_embedded&v=Sf8tU-78lXs

– ..\Desktop\Fall montage.mp4 ..\Desktop\Frank_pullChair_013113_cam3.video.wmv

• Web-based learning– "Textbook is dead"

– Multi-univ synergy

– pcpp.zyante.com (C++)

• Embedded systems educ.– New prog. model, virtual lab,

programmingembeddedsystems.com

– Also riosscheduler.org

• Drunk driving (DUI)– ..\Desktop\dui.MOV

– duicam.org • http://www.utsandiego.com/news/2013/feb/11/ucr-drun

ken-driving-app/

http://www.cs.ucr.edu/~vahid/assistivemonitoring/

http://www.youtube.com/watch?feature=player_embedded&v=Sf8tU-78lXs

http://pcpp.zyante.com/

http://programmingembeddedsystems.com/

http://riosscheduler.org/

http://duicam.org/

http://www.utsandiego.com/news/2013/feb/11/ucr-drunken-driving-app/

http://www.utsandiego.com/news/2013/feb/11/ucr-drunken-driving-app/

Frank Vahid, UCR 28

Summary

Frank Vahid, UCR 28http://www.meti.com/


PC(1): 0.8xPC(4): 3.1xGPU: 1.6xHLS: 3.2xGeneral PE: 4.9xGrph emb(GPE): 11.2xCustom PE: 6.1xHeterog PE: 34.5x(Grph emb+HPE: 48.5x)

• FPGAs: Fastest cost-effective execution of physical models

–http://www.youtube.com/watch?v=ThUKVhqoA3Q

• Future– Manycore device– Beyond testing CPS

• Implement end-products

http://www.meti.com/



http://www.youtube.com/watch?v=ThUKVhqoA3Q



F V

F V

Bailey, Please include data on algorithm runtime and size. We should almost never only report one metric, everything is a tradeoff, gotta be more complete. Please provide numerical summary in conclusion, including speedups over PC (1), PC(4), and GPU. Perhaps show on original table from Slide 5?

Frank Vahid, UCR 29

Questions?

Frank Vahid, UCR 29

Documents

Frank Vahid, UCR 1 Building Fake Body Parts: Digital Mockups Frank Vahid Univ. of California, Riverside Support provided by NSF, SRC, Dept. of Educ. Also