1/21
Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
Chen Huang and Frank Vahid
Dept. of Computer Science and Engineering, University of California, Riverside, USA
{chuang,vahid}@cs.ucr.edu
This work was supported in part by NSF CNS-1016792
Chen Huang UC Riverside
2/21
Outline
Haar-feature based object detection algorithm
Custom design space exploration: Feature mapping problem
Experimental results
3/21
Haar-feature based object detection algorithm
Figure: a 20x20 sub-window moves across the image along the X axis (0 to 320) and the Y axis (0 to 240).
(320 - 20) * (240 - 20) = 66,000 sub-windows
Figure: the original image and its scaled images; faces are detected on different scales.
Figure label: Face found
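The sub-window count on this slide can be reproduced with a short calculation. The sketch below is a hypothetical helper (the name `count_subwindows` does not come from the slides) and assumes, as the slide's formula does, that the window moves one pixel at a time in both directions.

```python
def count_subwindows(width, height, window=20):
    """Number of window positions, using the slide's (W - w) * (H - w) formula."""
    return (width - window) * (height - window)


# The 320 x 240 image from the slide with a 20 x 20 sub-window.
print(count_subwindows(320, 240))  # 66000
```

This count covers the original image only; each scaled image adds its own, smaller, set of sub-windows.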
4/21
Face detection in sub-window
Each 20 x 20 sub-window is tested against facial Haar features and either passes or fails.
Calculate Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)
Constant time Pixel_Sum calculation: need 4 corner values
Integral image: each entry stores the pixel sum of the rectangle from the top-left corner to this point

Original image
1 1 1
1 1 1
1 1 1

Integral image
1 2 3
2 4 6
3 6 9

Corners of rectangle R1:
p1 p2
p3 p4
Pixel_Sum(R1) = P4 - P2 - P3 + P1 = 4
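The integral-image lookup can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware design; the function names and the inclusive (top, left, bottom, right) rectangle convention are assumptions made here.

```python
def integral_image(img):
    """Entry (y, x) holds the sum of all pixels from the top-left corner to (y, x)."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def pixel_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom and columns left..right (inclusive)."""
    p4 = ii[bottom][right]
    p2 = ii[top - 1][right] if top > 0 else 0
    p3 = ii[bottom][left - 1] if left > 0 else 0
    p1 = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return p4 - p2 - p3 + p1  # four corner values, constant time


ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
ii = integral_image(ones)
print(ii)                         # [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(pixel_sum(ii, 1, 1, 2, 2))  # 9 - 3 - 3 + 1 = 4
```

With the slide's 3 x 3 image of ones, the bottom-right 2 x 2 rectangle gives 9 - 3 - 3 + 1 = 4, matching Pixel_Sum(R1). A Haar-feature value is then the difference of two such sums.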
5/21
Cascade decision process
Frontal-face has 2000 features
Divided into multiple stages
S1: 2 features, S2: 5 features, S3: 16 features, ..., S22: 212 features
A sub-window that passes every stage: Face detected
Failing any stage rejects the current sub-window
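The early-exit behavior of the cascade can be sketched as follows. This is a simplified illustration: each stage is modeled as a hypothetical Python callable that returns pass or fail, whereas a real stage evaluates its Haar features against trained thresholds.

```python
def run_cascade(stages, window):
    """Return (accepted, stages_evaluated); the first failing stage rejects the window."""
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index
    return True, len(stages)


# Hypothetical stand-in stages; not the trained face cascade.
stages = [
    lambda w: sum(w) > 10,
    lambda w: max(w) < 100,
    lambda w: w[0] == 1,
]
print(run_cascade(stages, [1, 5, 6]))  # (True, 3)
print(run_cascade(stages, [1, 2, 3]))  # (False, 1)
```

Because most sub-windows fail in an early stage, the number of features actually examined per window is usually far below the full feature count.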
6/21
Algorithm FPGA implementation
Blocks on the FPGA: Frame grabber, Image scaler, Buffer controller, Integral image (20 x 20 sub-window), Classifier (Haar feature calculation/decision), Rectangle drawer
Input: Video in
Output: Video out (objects in rectangles)
7/21
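Integral image and Classifier
Figure: the FPGA block diagram (Video in, Frame grabber, Image scaler, Buffer controller, Integral image, Classifier, Rectangle drawer, Video out with objects in rectangles), with the integral image and classifier shown in detail.
Integral Image Buffer (20 x 20 17-bit register file)
Data delivery to 12 classifier ports: a1 a2 a3 a4, b1 b2 b3 b4, c1 c2 c3 c4
Classifier datapath: three Rect sums are each multiplied by a constant (figure labels: -1, x2, x2, x3) and added to give the Feature value.
Feature value > Feature threshold controls a mux that selects the Left value or the Right value, which is added to the Feature sum.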
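The classifier datapath on this slide can be mirrored in software. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the choice that a feature value above the threshold selects the right value (otherwise the left value) is an assumption, since the slide does not say which way the mux is wired.

```python
def feature_value(rect_sums, weights):
    """Weighted sum of the rectangle sums (the multiply-by-constant and add steps)."""
    return sum(w * s for w, s in zip(weights, rect_sums))


def feature_vote(rect_sums, weights, threshold, left_value, right_value):
    """Mux step. Assumption: value > threshold picks right_value, else left_value."""
    value = feature_value(rect_sums, weights)
    return right_value if value > threshold else left_value


def feature_sum(features):
    """Accumulate the selected values of several features."""
    return sum(feature_vote(*f) for f in features)


# Hypothetical numbers: (rect_sums, weights, threshold, left_value, right_value).
features = [
    ([10, 4], [-1, 2], 0, 5, 7),        # value -2, not above 0, selects 5
    ([10, 4, 1], [-1, 2, 3], 0, 5, 7),  # value 1, above 0, selects 7
]
print(feature_sum(features))  # 12
```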
8/21
Communication bottleneck
General communication architecture: each classifier port reads the 20 x 20 integral image through a 400-to-1 mux
400-to-1 17-bit MUX: 2300 LUTs
12 MUXes: 27,600 LUTs, 40% of Virtex5 110T (69,120)
Drawbacks:
Does not scale well for multiple classifiers
Wire congestion problem
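The LUT figures on this slide follow from simple multiplication, and extending them shows why the general architecture does not scale. The sketch below assumes the mux cost grows linearly with the number of classifiers and ignores all other logic; the function name is hypothetical.

```python
LUTS_PER_MUX = 2300        # one 400-to-1, 17-bit mux (from the slide)
PORTS_PER_CLASSIFIER = 12  # one mux per classifier port (from the slide)
DEVICE_LUTS = 69120        # Virtex5 110T (from the slide)


def general_comm_luts(num_classifiers):
    """Mux LUTs if every port of every classifier gets its own 400-to-1 mux."""
    return num_classifiers * PORTS_PER_CLASSIFIER * LUTS_PER_MUX


for n in (1, 2, 4):
    luts = general_comm_luts(n)
    print(n, luts, f"{luts / DEVICE_LUTS:.0%}")
```

Under this assumption, one classifier uses 27,600 LUTs for muxes alone, and four classifiers would need 110,400 LUTs, more than the device holds.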
9/21
Custom communication architecture for multi-classifier
Figure: multiple classifiers (CF1 CF2 CF3 CF4) each read the integral image through 400-1 muxes.
Feature number assigned to each classifier number:
CF1 CF2 CF3 CF4
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
10/21
Custom communication architecture for multi-classifier
Figure: multiple classifiers (CF1 CF2 CF3 CF4) read the integral image through smaller, port-specific muxes.
Example ports: CF1_port1, CF2_port9, CF3_port7, CF4_port2
Example mux sizes: 24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux
Feature number assigned to each classifier number:
CF1 CF2 CF3 CF4
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
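One way to see why a custom architecture needs smaller muxes: a port only has to select among the integral-image entries that the features mapped to its classifier actually read on that port. The sketch below is an illustration with made-up data; `mux_sizes`, the two-port features, and the entry numbers are all hypothetical.

```python
def mux_sizes(mapping, entries_needed):
    """mapping: classifier -> list of features.
    entries_needed: feature -> tuple of integral-image entries, one per port.
    Returns {(classifier, port): number of distinct entries that port must reach}."""
    sizes = {}
    for classifier, features in mapping.items():
        per_port = {}
        for feature in features:
            for port, entry in enumerate(entries_needed[feature]):
                per_port.setdefault(port, set()).add(entry)
        for port, entries in per_port.items():
            sizes[(classifier, port)] = len(entries)
    return sizes


# Hypothetical example: three features, two ports each.
mapping = {"CF1": [1, 2], "CF2": [3]}
entries_needed = {1: (5, 24), 2: (5, 46), 3: (92, 24)}
print(mux_sizes(mapping, entries_needed))
```

Here CF1 port 0 needs a single input and port 1 needs two, far fewer than 400. Which features share a classifier therefore determines the mux sizes.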
11/21
Feature mapping problem
Mapping 26 features into 4 Classifiers
Figure: cascade of Stage 1, Stage 2, ..., Stage n; passing every stage means Object found, failing any stage means Reject.
Stage and feature:
        CF1 CF2 CF3 CF4
Stage 1   1   2   3   4
          5
Stage 2   6   7   8   9
         10  11  12
Stage 3  13  14  15  16
         17  18  19  20
         21  22  23  24
         25  26
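The table on this slide can be reproduced by dealing the features of each stage across the classifiers in order and starting every stage again at CF1. The sketch below assumes stage sizes of 5, 7, and 14 features, as read off the table; the function name is hypothetical.

```python
def round_robin_rows(stage_sizes, num_classifiers):
    """Rows of feature numbers; each row is one step in which every classifier
    handles at most one feature, and each stage starts on a new row."""
    rows, next_feature = [], 1
    for size in stage_sizes:
        stage_features = list(range(next_feature, next_feature + size))
        next_feature += size
        for start in range(0, size, num_classifiers):
            rows.append(stage_features[start:start + num_classifiers])
    return rows


rows = round_robin_rows([5, 7, 14], 4)
for row in rows:
    print(row)
print(len(rows))  # 8
```

The partially filled rows (5, then 10 11 12, then 25 26) leave classifiers idle, which is one reason the choice of mapping affects total stage delay.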
12/21
Feature mapping problem
Mapping 26 features into 4 Classifiers
#possible mappings grows exponentially with #features
Simulated Annealing neighbor moves: Swap, Migrate
Objective: Min (Total stage delay * Total wire number)
Total stage delay: Performance
Total wire number: Size
1 million iterations (30 min)
Stage and feature:
        CF1 CF2 CF3 CF4
Stage 1   1   2   3   4
          5
Stage 2   6   7   8   9
         10  11  12
Stage 3  13  14  15  16
         17  18  19  20
         21  22  23  24
         25  26
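A simulated-annealing search with swap and migrate moves can be sketched as below. This is not the authors' tool: the cost model (stage delay as the busiest classifier per stage, wire number as distinct entries per classifier port), the cooling schedule, and the integral-image entries per feature are all assumptions chosen for illustration.

```python
import math
import random


def cost(assign, stage_of, entries, num_cf):
    """Total stage delay * total wire number under the assumed model."""
    per_stage, wires = {}, {}
    for feature, cf in assign.items():
        counts = per_stage.setdefault(stage_of[feature], [0] * num_cf)
        counts[cf] += 1
        for port, entry in enumerate(entries[feature]):
            wires.setdefault((cf, port), set()).add(entry)
    total_delay = sum(max(counts) for counts in per_stage.values())
    total_wires = sum(len(s) for s in wires.values())
    return total_delay * total_wires


def neighbor(assign, num_cf, rng):
    """Swap the classifiers of two features, or migrate one feature."""
    new = dict(assign)
    features = list(new)
    if rng.random() < 0.5 and len(features) >= 2:
        a, b = rng.sample(features, 2)
        new[a], new[b] = new[b], new[a]
    else:
        new[rng.choice(features)] = rng.randrange(num_cf)
    return new


def anneal(stage_of, entries, num_cf, iterations=2000, seed=0):
    rng = random.Random(seed)
    current = {f: i % num_cf for i, f in enumerate(sorted(stage_of))}
    current_cost = cost(current, stage_of, entries, num_cf)
    best, best_cost = current, current_cost
    temperature = float(current_cost) or 1.0
    for _ in range(iterations):
        candidate = neighbor(current, num_cf, rng)
        candidate_cost = cost(candidate, stage_of, entries, num_cf)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a shrinking probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature = max(temperature * 0.995, 1e-3)
    return best, best_cost


# 26 features in stages of 5, 7, and 14; entries per feature are made up.
stage_of, feature = {}, 1
for stage, size in enumerate([5, 7, 14], start=1):
    for _ in range(size):
        stage_of[feature] = stage
        feature += 1
entries = {f: ((7 * f) % 40, (11 * f) % 40) for f in stage_of}
best, best_cost = anneal(stage_of, entries, num_cf=4)
print(best_cost)
```

The search starts from a round-robin mapping and keeps the best mapping seen, so the returned cost is never worse than the starting point.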
13/21
Automatic VHDL code generation
Feature mapping: Classifier 1 is assigned features 1, 4, 66, 3 (needs entry: 5, 24, 46, 92)
Integral Image entries 5, 24, 46, 92 feed a MUX whose output dout drives Classifier 1.
Scheduling: a BRAM drives the MUX Select input. In the figure, schedule slots 1 2 3 4 read entries 24 5 92 46, which corresponds to select values 2, 1, 4, 3.
Structural RTL code for communication components:
Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
C1: classifier port map(dout, …);
Bram1: bram generic map(2, 1, 4, 3, …) port map(…., select);
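A generator for the communication components might look like the sketch below. It is a guess at the idea, not the authors' generator: it assumes the BRAM contents are the 1-based mux input positions of the entries needed in each schedule slot, which is consistent with the values 2, 1, 4, 3 on the slide, and it leaves out the arguments that the slide elides with "…".

```python
def generate_comm_vhdl(index, mux_inputs, schedule):
    """mux_inputs: integral-image entries wired to the mux, in input order.
    schedule: entry needed in each slot.
    Returns (select values, VHDL-like lines)."""
    selects = [mux_inputs.index(entry) + 1 for entry in schedule]  # 1-based
    inputs = ", ".join(f"II({e})" for e in mux_inputs)
    select_list = ", ".join(str(s) for s in selects)
    lines = [
        f"Mux{index}: mux{len(mux_inputs)} port map({inputs}, select, dout);",
        # Simplified: the slide's bram instance has further, elided arguments.
        f"Bram{index}: bram generic map({select_list}) port map(select);",
    ]
    return selects, lines


selects, lines = generate_comm_vhdl(1, [5, 24, 46, 92], [24, 5, 92, 46])
print(selects)  # [2, 1, 4, 3]
for line in lines:
    print(line)
```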
14/21
Review of custom design space exploration
Object detection application
Custom design space exploration
Program analysis
Design exploration
Design generation
Resource constraints, performance requirements
Map to different FPGAs
Pareto design points (axes: Execution time, Size)
Different number of classifiers
Communication bottleneck: 400-1 mux
Feature mapping problem
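Pareto design points are the designs that no other design beats on both size and execution time. A minimal filter is sketched below; the function name and the sample points are hypothetical and are not measurements from this work.

```python
def pareto_front(points):
    """points: iterable of (size, execution_time), smaller is better for both.
    Returns the non-dominated points, sorted by size."""
    front = []
    best_time = float("inf")
    for size, time in sorted(set(points)):
        if time < best_time:  # strictly faster than every smaller design
            front.append((size, time))
            best_time = time
    return front


# Hypothetical (size, execution time) pairs.
designs = [(10, 5), (20, 3), (15, 6), (30, 3), (25, 1)]
print(pareto_front(designs))  # [(10, 5), (20, 3), (25, 1)]
```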
15/21
Experiment scenarios
Different implementations
Desktop: Pentium4 3.0 GHz, fixed-point C
FPGA: 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF on Xilinx Virtex5 LX50T, LX110T, and LX155T
Each classifier has 12 ports
Feature sets
Face: 2135 features
Eye: 1066 features
Sample images
Face(simple), Face(complex), Eye
16/21
Experiment: FPGA resource utilization
Map to different Xilinx Virtex5 FPGAs
Figure: bar chart of design size (number of LUTs, 0 to 90,000), split into Comms and Static, for 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF(12 mux), 2 CF, 4 CF, 8 CF, 16 CF.
Device capacities marked on the chart: LX50T (29,000), LX110T (69,000), LX155T (97,000)
Communication architecture:
General comm. architecture: 400-1 mux
Custom comm. architecture: smaller muxes (24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux)
17/21
Components' timing info
Performance of different components (Xilinx Virtex5 110T FPGA)
Image scaler: 130 MHz, 6 cycles/pixel, 124 frame/sec
Buffer controller: 65 MHz, 11 cycles/window, 110 frame/sec
Classifier: 65 MHz, (3 + examined features/#CF) cycles/window, 0.6 (min) to 201 (max) frame/sec
Performance upper bound (110 fps)
Figure: the FPGA block diagram (Video in, Frame grabber, Image scaler, Buffer controller, Integral image, Classifier, Rectangle drawer, Video out with objects in rectangles).
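The timing figures on this slide can be tied together with two small helpers. This is a sketch: the function names are hypothetical, the pairing of frame rates with components is a reading of the slide (image scaler 124, buffer controller 110, classifier up to 201), and the classifier formula uses plain division as written on the slide.

```python
def classifier_cycles_per_window(examined_features, num_cf):
    """Slide formula: 3 + examined features / #CF cycles per window."""
    return 3 + examined_features / num_cf


def bottleneck(component_fps):
    """The slowest component bounds the frame rate of the whole pipeline."""
    name = min(component_fps, key=component_fps.get)
    return name, component_fps[name]


print(classifier_cycles_per_window(16, 4))  # 7.0
print(bottleneck({
    "Image scaler": 124,
    "Buffer controller": 110,
    "Classifier (max)": 201,
}))  # ('Buffer controller', 110)
```

More classifiers reduce the cycles per window, but the frame rate cannot exceed the 110 fps that the buffer controller delivers.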
18/21
Performance comparison
FPGA implementations are 0.6 to 25X faster than desktop C
Figure: bar chart of performance (frame/sec., 0 to 120) for Desktop (Pentium 4 3.0 GHz), 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF, 2 CF, 4 CF, 8 CF, and 16 CF on the Face(simple), Face(complex), and Eye images.
Upper bound (determined by buffer controller)
19/21
Comparison to previous work
Compared to Cho’s [FPGA 09] implementation of the same algorithm with 320x240 pixels on the same FPGA.
               Size (LUTs)   Performance (fps)
Cho's (1 CF)   64,143        17.5
Ours (1 CF)    45,713        19.3
Cho's (3 CFs)  84,232        28.8
Ours (16 CFs)  77,059        90.9
More scalable due to custom design space exploration
3x faster with 8% fewer LUTs (ours with 16 CFs vs. Cho's with 3 CFs)
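The headline comparison can be checked directly from the table. The helper below is a hypothetical convenience function; the numbers are the ones reported on this slide.

```python
def compare(ours, theirs):
    """Each argument is (luts, fps). Returns (speedup, fraction of LUTs saved)."""
    speedup = ours[1] / theirs[1]
    lut_saving = 1 - ours[0] / theirs[0]
    return speedup, lut_saving


# Ours (16 CFs) vs. Cho's (3 CFs)
speedup, saving = compare((77059, 90.9), (84232, 28.8))
print(f"{speedup:.2f}x faster, {saving:.1%} fewer LUTs")

# Ours (1 CF) vs. Cho's (1 CF)
speedup_1cf, saving_1cf = compare((45713, 19.3), (64143, 17.5))
print(f"{speedup_1cf:.2f}x faster, {saving_1cf:.1%} fewer LUTs")
```

The first comparison gives roughly 3.2x the frame rate with about 8.5% fewer LUTs, in line with the slide's "3x faster with 8%" summary.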
20/21
Video Demo http://www.youtube.com/watch?v=gkQVanU5P5U
21/21
Conclusions
Effectively implemented an object detection algorithm on a modern series of FPGAs
Custom design space exploration is necessary for complex applications
Future work: Implement more applications using custom search/optimization
Thank you!