Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Page 1: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Chen Huang and Frank Vahid

Dept. of Computer Science and Engineering, University of California, Riverside, USA, {chuang,vahid}@cs.ucr.edu

This work was supported in part by NSF CNS-1016792

Page 2: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Chen Huang UC Riverside

Outline

Haar-feature based object detection algorithm

Custom design space exploration: Feature mapping problem

Experimental results

Page 3: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Haar-feature based object detection algorithm

A 20x20 sub-window moves across the image along the X axis (0 to 320) and the Y axis (0 to 240):

(320 - 20) * (240 - 20) = 66,000 sub-windows

The original image is also scaled, so faces are detected on different scales.

[Figure: original image and scaled images, movement of the 20x20 sub-window, face found]
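
The sub-window count on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function names are made up for illustration, and the scale step of 1.2 is an assumption, since the slide does not state how the scaled images are generated.

```python
def count_sub_windows(width, height, window=20):
    """Window positions, using the slide's (W - window) * (H - window) count."""
    if width < window or height < window:
        return 0
    return (width - window) * (height - window)


def scaled_sizes(width, height, scale=1.2, window=20):
    """Yield successive image sizes until the image is smaller than the window.

    scale=1.2 is an assumed value; the slides do not give the scaling step.
    """
    w, h = float(width), float(height)
    while int(w) >= window and int(h) >= window:
        yield int(w), int(h)
        w, h = w / scale, h / scale


original = count_sub_windows(320, 240)
total = sum(count_sub_windows(w, h) for w, h in scaled_sizes(320, 240))
print(original)  # 66000, as on the slide
print(total)     # total over all assumed scales
```

The total over all scales depends entirely on the assumed scale step, so only the single-scale figure should be compared against the slide.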

Page 4: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Face detection in sub-window

[Figure: a 20 x 20 sub-window is checked against facial Haar features and either passes or fails]

Calculate Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)

Constant time Pixel_Sum calculation

The integral image stores, at each point, the pixel sum of the rectangle from the top-left corner to this point. A rectangle sum needs 4 corner values:

Pixel_Sum(R1) = P4 - P2 - P3 + P1 = 4

Original image    Integral image
1 1 1             1 2 3
1 1 1             2 4 6
1 1 1             3 6 9

[Figure: rectangle R1 with corner points P1, P2, P3, and P4 in the integral image]
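
The integral image example above can be checked in software. The sketch below is illustrative only: the function names and the (top, left, bottom, right) rectangle convention are assumptions, and R1 is taken to be the bottom-right 2 x 2 block of the 3 x 3 image, which is the reading that gives P4 = 9, P2 = 3, P3 = 3, P1 = 1.

```python
def integral_image(img):
    """Return ii where ii[y][x] is the sum of img over rows 0..y and columns 0..x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def pixel_sum(ii, top, left, bottom, right):
    """Rectangle sum (inclusive corners) from four lookups: P4 - P2 - P3 + P1."""
    p4 = ii[bottom][right]
    p2 = ii[top - 1][right] if top > 0 else 0
    p3 = ii[bottom][left - 1] if left > 0 else 0
    p1 = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return p4 - p2 - p3 + p1


def haar_feature_value(ii, rect_w, rect_b):
    """Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B), each rect as (top, left, bottom, right)."""
    return pixel_sum(ii, *rect_w) - pixel_sum(ii, *rect_b)


image = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
ii = integral_image(image)
print(ii)                         # [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(pixel_sum(ii, 1, 1, 2, 2))  # 4
```

Each rectangle costs four lookups regardless of its size, which is why a classifier needs a fixed number of integral image reads per feature.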

Page 5: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Cascade decision process

Frontal-face has 2000 features, divided into multiple stages:

S1: 2 features
S2: 5 features
S3: 16 features
...
S22: 212 features

A sub-window that passes every stage is reported as a detected face. Failing any stage rejects the current sub-window.
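
The early-exit behaviour of the cascade can be sketched as follows. The data layout (a list of stages, each holding feature functions and a stage threshold) and the "sum at or above threshold passes" rule are assumptions made for illustration; the slides only state that failing any stage rejects the sub-window.

```python
def run_cascade(stages, window):
    """Return (accepted, features_examined) for one sub-window.

    stages: list of (features, stage_threshold); each feature is a function
    mapping the window to a score. This layout is hypothetical.
    """
    examined = 0
    for features, stage_threshold in stages:
        stage_sum = 0.0
        for feature in features:
            stage_sum += feature(window)
            examined += 1
        if stage_sum < stage_threshold:
            return False, examined  # failing any stage rejects the window
    return True, examined


# Toy cascade with 2, 5, and 16 identical features in the first three stages.
stages = [([lambda w: w] * n, 0.5 * n) for n in (2, 5, 16)]
print(run_cascade(stages, 1.0))  # (True, 23)
print(run_cascade(stages, 0.0))  # (False, 2)
```

Most sub-windows are rejected in the first stages, so the number of features examined per window is usually far below the total.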

Page 6: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Algorithm FPGA implementation

[Block diagram of the FPGA design: frame grabber (video in), image scaler, buffer controller, integral image (20 x 20 sub-window), classifier (Haar feature calculation/decision), and rectangle drawer (video out, objects in rectangles)]

Page 7: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Integral image and Classifier

Integral Image Buffer (20 x 20 17-bit register file)

[Figure: classifier datapath. Data delivery from the integral image buffer feeds 12 ports (a1 a2 a3 a4, b1 b2 b3 b4, c1 c2 c3 c4). The ports produce three rect sums, each multiplied by a constant (labels: -1, x2, x2, x3) and added into the feature sum. The feature sum is compared (>) with the feature threshold, and a mux selects the left value or the right value as the feature value, which feeds an adder.]

Page 8: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Communication bottleneck

General communication architecture: each classifier port reads the 20 x 20 integral image through a 400-to-1 mux.

400-to-1 17-bit MUX: 2300 LUTs

12 MUXes: 27,600 LUTs, 40% of Virtex5 110T (69,120)

Drawbacks:

Does not scale well for multiple classifiers

Wire congestion problem
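
The LUT figures on this slide are simple multiplication, and extending them shows why the general architecture does not scale. The sketch below assumes that every additional classifier needs its own 12 full 400-to-1 muxes; that linear model is an assumption for illustration, not a measurement from the slides.

```python
LUTS_PER_400_TO_1_MUX = 2300   # 17-bit wide mux, from the slide
PORTS_PER_CLASSIFIER = 12      # one mux per classifier port
VIRTEX5_110T_LUTS = 69120      # device capacity, from the slide


def general_comm_luts(num_classifiers):
    """Mux LUTs if every port of every classifier gets a full 400-to-1 mux."""
    return num_classifiers * PORTS_PER_CLASSIFIER * LUTS_PER_400_TO_1_MUX


for n in (1, 2, 4):
    luts = general_comm_luts(n)
    print(n, luts, f"{100 * luts / VIRTEX5_110T_LUTS:.1f}%")
```

Under this model a single classifier already uses about 40% of the device for muxes alone, and four classifiers would exceed its capacity.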

Page 9: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Custom communication architecture for multi-classifier

Multiple classifiers (CF1 CF2 CF3 CF4) read the integral image, each port through a 400-1 mux.

Feature number by classifier:

CF1  CF2  CF3  CF4
 1    2    3    4
 5    6    7    8
 9   10   11   12
13   14   15   16

Page 10: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Custom communication architecture for multi-classifier

Custom communication architecture: multiple classifiers (CF1 CF2 CF3 CF4) read the integral image through smaller, port-specific muxes.

Example ports: CF1_port1, CF2_port9, CF3_port7, CF4_port2

Example mux sizes: 24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux

Feature number by classifier:

CF1  CF2  CF3  CF4
 1    2    3    4
 5    6    7    8
 9   10   11   12
13   14   15   16
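
One way to read the custom architecture is that each classifier port only needs a mux over the integral image entries that the features mapped to that classifier actually read on that port. The sketch below computes those per-port input sets. It is an interpretation, not the authors' tool: the function name is made up, the feature data is invented, and only two ports per feature are shown to keep the example short.

```python
from collections import defaultdict


def mux_inputs(feature_entries, mapping):
    """Distinct integral image entries needed per (classifier, port).

    feature_entries: {feature: [entry read on port 0, entry read on port 1, ...]}
    mapping: {feature: classifier number}
    """
    needed = defaultdict(set)
    for feature, classifier in mapping.items():
        for port, entry in enumerate(feature_entries[feature]):
            needed[(classifier, port)].add(entry)
    return {key: sorted(entries) for key, entries in needed.items()}


# Hypothetical entries for 8 features, mapped round-robin onto 4 classifiers
# in the same pattern as the feature number table above.
feature_entries = {
    1: [5, 24], 2: [7, 30], 3: [5, 24], 4: [9, 31],
    5: [5, 46], 6: [7, 30], 7: [92, 24], 8: [9, 40],
}
mapping = {f: (f - 1) % 4 + 1 for f in feature_entries}
muxes = mux_inputs(feature_entries, mapping)
for (classifier, port), entries in sorted(muxes.items()):
    print(f"CF{classifier}_port{port}: {len(entries)}-1 mux over {entries}")
```

The mux width for a port is the size of its entry set, so a mapping that groups features with shared entries onto the same classifier yields smaller muxes.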

Page 11: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Feature mapping problem

Mapping 26 features into 4 Classifiers

Stage and feature:

Stage     CF1  CF2  CF3  CF4
Stage 1    1    2    3    4
           5
Stage 2    6    7    8    9
          10   11   12
Stage 3   13   14   15   16
          17   18   19   20
          21   22   23   24
          25   26

[Figure: classifier cascade. Stage 1, Stage 2, ..., Stage n; passing every stage gives "Object found", failing any stage gives "Reject"]

Page 12: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Feature mapping problem

Mapping 26 features into 4 Classifiers

#possible mapping grows exponentially with #features

Simulated Annealing
  Neighbor: Swap, Migrate
  Objective: Min (Total stage delay * Total wire number)
    Total stage delay: Performance
    Total wire number: Size
  1 million iterations (30 min)

[Figure: stage and feature table for CF1 to CF4, with features 1 to 26 across Stage 1 to Stage 3]
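
A small simulated annealing loop with swap and migrate moves and the product objective might look like the sketch below. The cost models are assumptions: stage delay is taken as the feature count of the busiest classifier in each stage, and wire number as the count of distinct (classifier, port, entry) connections. The linear cooling schedule, the iteration count, and the random entry data are also illustrative choices, not values from the slides.

```python
import math
import random


def total_stage_delay(assign, stage_of, num_cf):
    """Sum over stages of the busiest classifier's feature count (assumed model)."""
    load = {}
    for feature, cf in enumerate(assign):
        key = (stage_of[feature], cf)
        load[key] = load.get(key, 0) + 1
    return sum(max(load.get((s, cf), 0) for cf in range(num_cf))
               for s in set(stage_of))


def total_wires(assign, entries):
    """Distinct (classifier, port, entry) connections (assumed size model)."""
    return len({(cf, port, entry)
                for feature, cf in enumerate(assign)
                for port, entry in enumerate(entries[feature])})


def cost(assign, stage_of, entries, num_cf):
    return total_stage_delay(assign, stage_of, num_cf) * total_wires(assign, entries)


def neighbor(assign, num_cf, rng):
    new = list(assign)
    if rng.random() < 0.5:  # swap: exchange the classifiers of two features
        a, b = rng.sample(range(len(new)), 2)
        new[a], new[b] = new[b], new[a]
    else:                   # migrate: move one feature to a random classifier
        new[rng.randrange(len(new))] = rng.randrange(num_cf)
    return new


def anneal(stage_of, entries, num_cf, iterations=5000, t0=10.0, seed=0):
    rng = random.Random(seed)
    current = [f % num_cf for f in range(len(stage_of))]  # round-robin start
    current_cost = cost(current, stage_of, entries, num_cf)
    best, best_cost = current, current_cost
    for i in range(iterations):
        temp = t0 * (1 - i / iterations) + 1e-9  # linear cooling
        cand = neighbor(current, num_cf, rng)
        cand_cost = cost(cand, stage_of, entries, num_cf)
        accept = cand_cost <= current_cost or \
            rng.random() < math.exp((current_cost - cand_cost) / temp)
        if accept:
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


# 26 features in stages of 5, 7, and 14, with made-up entries on 4 ports each.
data_rng = random.Random(1)
stage_of = [0] * 5 + [1] * 7 + [2] * 14
entries = [[data_rng.randrange(30) for _ in range(4)] for _ in stage_of]
start = [f % 4 for f in range(len(stage_of))]
best, best_cost = anneal(stage_of, entries, num_cf=4)
print(cost(start, stage_of, entries, 4), "->", best_cost)
```

The round-robin start corresponds to the table layout on these slides, which gives a total stage delay of 2 + 2 + 4 = 8 under the assumed delay model.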

Page 13: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Automatic VHDL code generation

Classifier 1

Feature mapping: 1, 4, 66, 3 (needs entry: 5, 24, 46, 92)

Scheduling (step: integral image entry): 1: 24, 2: 5, 3: 92, 4: 46

[Figure: the integral image feeds a MUX whose output dout drives Classifier 1; a BRAM drives the mux select]

Structural RTL code for communication components:

Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);

C1: classifier port map(dout, …);

Bram1: bram generic map(2, 1, 4, 3, …) port map(…, select);
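
A generator for these structural lines could be as small as the sketch below. It assumes that the BRAM generic values are the 1-based mux input positions of the scheduled entries, which is consistent with the example (entries 24, 5, 92, 46 on a mux wired to 5, 24, 46, 92 give 2, 1, 4, 3) but is not stated on the slide. The function name is hypothetical, and the "…" in the output mirrors the elisions in the original code.

```python
def generate_port_logic(name, mux_entries, schedule):
    """Emit structural VHDL-like lines for one classifier port.

    mux_entries: integral image entries wired to the mux inputs, in input order.
    schedule: entry needed at each step, converted to 1-based select values.
    """
    selects = [mux_entries.index(entry) + 1 for entry in schedule]
    inputs = ", ".join(f"II({e})" for e in mux_entries)
    sel_list = ", ".join(str(s) for s in selects)
    lines = [
        f"Mux{name}: mux{len(mux_entries)} port map({inputs}, select, dout);",
        f"C{name}: classifier port map(dout, …);",
        f"Bram{name}: bram generic map({sel_list}, …) port map(…, select);",
    ]
    return selects, lines


selects, lines = generate_port_logic(1, [5, 24, 46, 92], [24, 5, 92, 46])
print(selects)  # [2, 1, 4, 3]
print("\n".join(lines))
```

Generating the text from the mapping and schedule keeps the RTL in step with whatever the design space exploration produces.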

Page 14: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Review of custom design space exploration

[Flow diagram: the object detection application, together with resource constraints and performance requirements, enters custom design space exploration (program analysis, design exploration, design generation). The output is a set of Pareto design points (execution time vs. size) that map to different FPGAs.]

Explored in this work:

Different number of classifiers

Communication bottleneck (400-1 mux)

Feature mapping problem

Page 15: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Experiment scenarios

Different implementations
  Desktop: Pentium4 3.0 GHz, fixed-point C
  FPGA: 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF on Xilinx Virtex5 LX50T, LX110T, and LX155T

Feature sets
  Face: 2135 features
  Eye: 1066 features

Sample images
  Face(simple), Face(complex), Eye

Classifier: 12 ports

Page 16: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Experiment: FPGA resource utilization

Map to different Xilinx Virtex5 FPGAs: LX50T (29,000 LUTs), LX110T (69,000 LUTs), LX155T (97,000 LUTs)

Communication architecture:
  General comm. architecture: 400-1 mux
  Custom comm. architecture: 24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux

[Bar chart: design size (number of LUTs, 0 to 90,000), split into Comms and Static, for 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF(12 mux), 2 CF, 4 CF, 8 CF, and 16 CF, with the capacities of the three FPGAs marked]

Page 17: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Components' timing info (Xilinx Virtex5 110T FPGA)

Performance of different components:

Component           Clock     Cycles                                      Frame/sec
Image scaler        130 MHz   6 cycles/pixel                              124
Buffer controller   65 MHz    11 cycles/window                            110
Classifier          65 MHz    (3 + examined features/#CF) cycles/window   0.6 (min) to 201 (max)

Performance upper bound (110 fps)

[Block diagram of the FPGA design: frame grabber (video in), image scaler, buffer controller, integral image, classifier, and rectangle drawer (video out, objects in rectangles)]
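
The per-window cycle counts can be turned into frame rates once the number of windows per frame is known. The slide does not state that number, so the sketch below back-derives it from the buffer controller figures (65 MHz, 11 cycles/window, 110 fps); that derivation, the function names, and the reading of "examined features" as an average per window are all assumptions.

```python
def fps(clock_hz, cycles_per_unit, units_per_frame):
    """Frames per second for a component that spends a fixed cycle count per unit."""
    return clock_hz / (cycles_per_unit * units_per_frame)


# Assumed: windows per frame implied by 65 MHz / (11 cycles * 110 fps).
WINDOWS_PER_FRAME = 65e6 / (11 * 110)


def classifier_fps(examined_features, num_cf, clock_hz=65e6):
    """Classifier rate with (3 + examined features / #CF) cycles per window."""
    cycles = 3 + examined_features / num_cf
    return fps(clock_hz, cycles, WINDOWS_PER_FRAME)


def system_fps(examined_features, num_cf, upper_bound=110.0):
    """Overall rate, capped by the buffer controller upper bound."""
    return min(upper_bound, classifier_fps(examined_features, num_cf))


print(round(WINDOWS_PER_FRAME))             # about 53,719 windows per frame
print(round(classifier_fps(2135, 1), 2))    # all face features, 1 CF
print(round(classifier_fps(3, 1), 1))       # 3 examined features per window
print(system_fps(3, 1))                     # capped at the upper bound
```

Under these assumptions, examining all 2135 face features on one classifier gives roughly 0.6 fps and examining 3 features per window gives roughly 201 fps, which lines up with the classifier's min and max figures in the timing table.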

Page 18: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Performance comparison

FPGA implementations are 0.6 to 25X faster than desktop C

[Bar chart: performance (frame/sec., 0 to 120) on Face(complex), Face(simple), and Eye for Desktop (Pentium 4 3.0 GHz), 1 CF(1 mux), 1 CF(3 mux), 1 CF(6 mux), 1 CF, 2 CF, 4 CF, 8 CF, and 16 CF, with the upper bound (determined by buffer controller) marked]

Page 19: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Comparison to previous work

Compared to Cho's [FPGA 09] implementation of the same algorithm with 320x240 pixels on the same FPGA.

Implementation    Size (LUTs)   Performance (fps)
Cho's (1 CF)      64,143        17.5
Ours (1 CF)       45,713        19.3
Cho's (3 CFs)     84,232        28.8
Ours (16 CFs)     77,059        90.9

More scalable due to custom design space exploration

3x faster with 8% fewer LUTs
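
The headline ratios follow directly from the table. The short check below uses only the numbers on this slide; the dictionary layout and function name are illustrative.

```python
# (LUTs, fps) for each implementation, copied from the comparison table.
results = {
    "Cho's (1 CF)": (64143, 17.5),
    "Ours (1 CF)": (45713, 19.3),
    "Cho's (3 CFs)": (84232, 28.8),
    "Ours (16 CFs)": (77059, 90.9),
}


def compare(ours, theirs):
    """Return (speedup, fractional LUT saving) of `ours` relative to `theirs`."""
    our_luts, our_fps = results[ours]
    their_luts, their_fps = results[theirs]
    return our_fps / their_fps, 1 - our_luts / their_luts


for ours, theirs in [("Ours (1 CF)", "Cho's (1 CF)"),
                     ("Ours (16 CFs)", "Cho's (3 CFs)")]:
    speedup, saving = compare(ours, theirs)
    print(f"{ours} vs {theirs}: {speedup:.2f}x faster, {100 * saving:.1f}% fewer LUTs")
```

The 16 CF design against the 3 CF design comes out at about 3.16x and 8.5%, matching the rounded "3x" and "8%" on the slide.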

Page 20: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Video Demo http://www.youtube.com/watch?v=gkQVanU5P5U

Page 21: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration

Conclusions

Effectively implemented object detection algorithm on a modern series of FPGAs

Custom design space exploration is necessary for complex applications

Future work: Implement more applications using custom search/optimization

Thank you!