6th June, ISCA 2005, NEC Corporation

An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems

Shorin KYO *1, Shin'ichiro OKAZAKI *1, Tamio ARAI *2

*1 Media and Information Research Laboratories, NEC Corporation
*2 School of Engineering, University of Tokyo


Outline

1. Challenges of Embedded Image Recognition Systems

2. Integrated Memory Array Processor (IMAP) Architecture

3. Programming Language and Compiler Design

4. Evaluations

5. Summary


Three Basic Requirements

1) High Performance
   Realtime response and robustness

2) Cost/Power Efficiency
   Low cost, easy cooling (< 2 Watt), high quality / reliability, low EMI

3) High Flexibility (Scalability and Versatility)
   Able to handle the combination of [applications × situations × targets]

Ex. Embedded Driver Assistant Systems: lane marks, preceding obstacles, side/back obstacles, traffic signs, pedestrians

(figure: required performance in GOPS, axis ticks up to 1000, plotted against robustness)


Applications × Situations × Targets

Dynamic Back Up Aid
Cross Traffic Warning
Following Distance Warning
Park Slot Measurement
Backup Parking Assist
Stop&Go
Side Pre-Crash
Cut-In
Front Pre-Crash
Lane Change Assist
Pedestrian Protection
Blind Spot Detection
Drowsiness Warning
Traffic Sign Recognition


COR: Control versus Operational circuit Ratio

Trading-off items
1) Performance (higher)
2) Cost (lower)
3) Flexibility (higher)

(figure: processor classes placed by % of control circuitry (flexibility) against % of operational circuitry ((peak) performance), with axis ticks at 100; cost is die size / power consumption of control circuit plus operation circuit)

a) Desktop/Server CPU (GPPs)
b) MIMDs (Multi-Cores)
c) DSPs
d) Highly parallel SIMDs
e) Special purpose LSI

Example processors shown: Itanium, Sparc64, SPE (CELL), FR1000, FR500, IMAP-CE, IMAPCAR, CODEC LSI


Overcoming the Flexibility Gap

(figure: flexibility against performance for five processor classes, each drawn with its share of control (Ctrl.) circuits and operational (Op.) circuits, under a fixed cost and technology constraint (a technology barrier); a flexibility gap is marked between the classes)

(a) GPPs
(b) DSPs and MIMDs
(c) Highly parallel SIMDs
(d) Custom logics + DSP core
(e) Custom logics only

Challenge of embedded image processors ⇒ Minimizing COR while overcoming the "Flexibility Gap"


2. Integrated Memory Array Processor (IMAP) Architecture


IMAP Series Processors

(figure: peak performance in GOPS, log scale from 0.1 to 1000, against year, 1990 to 2010)

Processor   Specification
IMAP-2      40MHz, 64PE/Chip
IMAP-CE     100MHz, 128PE/Chip, 4-Way VLIW, 50GOPS, 0.18um, 2 to 4 Watt
IMAPCAR     100MHz, 128PE/Chip, 4-Way VLIW+MAC, 100GOPS, -40℃ to 85℃, 0.13um, <2 Watt

Earlier points also plotted: IMAP-1, IMAP-VISION (point labels: 15MHz, 8PE/Chip and 40MHz, 32PE/Chip)
Publication labels: ISSCC'95, CAMP'97, ISSCC'03

IMAP-CE die (32.7M Tr, 0.18um), 11.0mm x 11.0mm: sixteen PE8 blocks (PE8: eight PEs integration block), CP, EXTIF, DPLL


Block Diagram and Features

1) 100MHz 128 4Way VLIW linear array PEs
2) Two level memory architecture + user DMA
3) Automated mapping of image data to each PE
4) 128 individual RAM blocks configuration
5) 1DC (One Dimensional C) + "Line methods"
6) Enhanced PE instruction set design for 1DC

(figure: block diagram)
- Host Processor
- Control Processor (CP) with P$, D$, STK RAM; CP instruction broadcast (SIMD) to the PEs
- PE 0 to PE 127, each a 4 Way VLIW PE (ALUx1, MULx1, LOGx1, LSUx1) with its own IMEM
- Inside a PE: ADD, MUL, RDU, LOG, LSU, COMM units; 24 x 8b General Purpose Registers; paths To/Fr other PEs, To/Fr IMEM, To/Fr CP
- Video IN and Video OUT through SR0, SR1, SR2, SR3
- External Mem. I/F to EMEM (SDRAM/SSRAM)
- Memory size labels: 2KB, 128, 64MB～
- Bandwidth labels: 12.8 GByte/s, 0.8 GByte/s

(figure: data mapping: source (image) data is split so that column(s) of the image go to the IMEM of one PE, one pixel data per entry)
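The mapping of image columns to PE memories in feature 3 can be pictured with a small emulation. The Python sketch below is not NEC code. It assumes the simplest case, an image exactly as wide as the PE array, so that column x is stored in the memory of PE x. The names `map_columns_to_pes` and `NUM_PES` are hypothetical.

```python
NUM_PES = 128  # PE count per chip, taken from the feature list


def map_columns_to_pes(image):
    """Return one list per PE holding that PE's image column.

    Assumes every line has exactly NUM_PES pixels (a simplification).
    """
    if any(len(row) != NUM_PES for row in image):
        raise ValueError("this sketch only handles images NUM_PES pixels wide")
    # imem[pe][y] is the pixel at line y, column pe.
    return [[row[pe] for row in image] for pe in range(NUM_PES)]


# A tiny 3-line test image whose pixel value encodes (line, column).
image = [[(y * NUM_PES + x) % 256 for x in range(NUM_PES)] for y in range(3)]
imem = map_columns_to_pes(image)
print(len(imem), len(imem[0]))
```

With this layout every PE can read the pixel of the current line from its own memory at the same address, which is what lets one broadcast instruction process a whole image line.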


Memory Access Pattern Categories

(figure: processing flow: Sensors → Pre-processing → Low-level Feature Extraction → Higher level Feature extraction → Local Feature based Discrimination / Measurements → High-level Decision. Low-level image processing maps pixels to pixels, intermediate-level image processing maps pixels to symbols; image processing leads into image recognition.)

Category                      Data flow                                    Example
Point Op. (PO)                Input Image X → Output Image Y
Local Neigh. Op. (LNO)        Input Image X → Output Image Y               2d-filters, NN
Recursive Neigh. Op. (RNO)    Input Image X → Output Image Y               distance trans.
Global Op. (GlO)                                                           FFT
Geometric Op. (GeO)                                                        affine
Statistical Op. (SO)          Input Image X → Output vector / scalar V     histogram
Object Op. (OO)               Input Image X → Output vector / scalar V     labelling/propagation

E.R. Komen: Low-level Image Processing Architectures, Ph.D. Thesis, TUD, Netherlands, 1990.

P.P. Jonker: Architectures for Multidimensional Low- and Intermediate Level Image Processing, Proc. of IAPR Workshop on Machine Vision Applications (MVA'90), pp.307--316, 1990.


Memory Access Pattern Parallelization Issue

Conventional continuous (or strided) address data supply (ex. streaming data supply) is not sufficient for parallelizing most of the required memory access patterns.

Op. group   Access pattern        Parallelizable with streaming data supply
PO          Completely local      ○
LNO         Local Neighborhood    ○
SO          Global                ×
GlO         Global                ×
GeO         Global                ×
RNO         Recursive             ×
OO          Data dependent        ×

(figure: a unified RAM feeding an array of SIMD + VLIW PEs)


Memory Access Pattern Parallelization Design

                Unconstrained     Constrained pixel update
                pixel update      Statically constrained    Dynamically constrained
Locality: No    SO, GlO, GeO      -                         -
Locality: Yes   PO, LNO           RNO                       OO

Statically constrained: update location is statically predictable
Dynamically constrained: update location must be dynamically determined

Line Methods (PUL: Pixel Updating Line), each drawn over the image and the PE array: row-wise (PUL), row-systolic, slant-systolic, autonomous

Requires one RAM block / PE configuration
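Why a statistical operation such as a histogram benefits from one RAM block per PE can be sketched as follows. This Python emulation is an illustration under assumptions, not the IMAP implementation. Each emulated PE updates a private table at an address chosen by its own pixel value, which is the data dependent access that a single streamed address cannot express, and the partial tables are then summed. `local_histograms` and `merge` are hypothetical names.

```python
def local_histograms(columns, levels=256):
    """Each emulated PE counts the pixels of its own column in a private table."""
    tables = []
    for column in columns:            # one iteration per emulated PE
        table = [0] * levels          # private RAM block of this PE
        for pixel in column:
            table[pixel] += 1         # the address depends on the data
        tables.append(table)
    return tables


def merge(tables):
    """Sum the per-PE tables into the final histogram."""
    return [sum(counts) for counts in zip(*tables)]


columns = [[0, 1, 1], [1, 2, 2], [2, 2, 255]]   # 3 PEs, 3 lines each
hist = merge(local_histograms(columns))
print(hist[0], hist[1], hist[2], hist[255])
```

On real hardware the per-PE updates would run in the same cycle. The loop over `columns` only serializes them for the emulation.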


Line Methods (1): Combination of PULs

Examples built by combining PULs:
- 90 degree rotation
- Thinning
- Connected component labeling

(figure: each example shown as a sum of PUL patterns over the PE array; labels: "+ Propagation", "2 times")


Line Methods (2): Expected Speedup (when using N PEs)

N/2 to N times speedup by N PEs

*1: When under a unified RAM approach
*2: When using the memory array architecture





3. Programming Language and Compiler Design


1DC: An Extended C Language

One (vector like) data structure and six operators

    int d, e;
    sep char a, b;
    sep char c, ary[256];

Correspondence between parallelizing techniques and the 1DC syntax.


1DC: Line-wise Parallel Operation

• Sequential Languages (Ex. C)

    for (y = 0; y < {number of lines}; y++)
        for (x = 0; x < {number of columns}; x++)
            .........

• When using 1DC, skip the {number of columns} loop

    for (y = 0; y < {number of lines}; y++)
        ...........

Ex. An Edge Detection Filter

(figure: the line being processed at y=0, y=120, y=200, y={number of lines})
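The two loop shapes can be compared in a short emulation. The Python sketch below is not 1DC. `sub_abs` stands in for a single line-wide operation that the PE array would execute on all columns at once, and the vertical difference filter is a hypothetical choice of edge detector, picked because it needs no neighbor references.

```python
def edge_sequential(img):
    """Sequential C style: explicit loops over lines and columns."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(w):
            out[y][x] = abs(img[y + 1][x] - img[y - 1][x])
    return out


def sub_abs(line_a, line_b):
    """Stand-in for one line-wide operation: all columns at once."""
    return [abs(a - b) for a, b in zip(line_a, line_b)]


def edge_linewise(img):
    """1DC style: only the loop over lines remains in the program."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        out[y] = sub_abs(img[y + 1], img[y - 1])
    return out


img = [[0, 0, 0, 0], [0, 9, 9, 0], [5, 5, 5, 5], [0, 0, 0, 0]]
print(edge_linewise(img) == edge_sequential(img))
```

Both versions compute the same image. The difference is that the column loop has moved out of the program and into the hardware.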


Average Filter in 1DC (1)

    sep uchar src[256], dst[256];

    ave33()
    {
        int i;
        sep int csum;
        for (i = 1; i < LINES - 1; i++) {
            csum = src[i-1] + src[i] + src[i+1];          /*1*/
            /* divide before storing: the nine-pixel sum does not fit in a uchar */
            dst[i] = (:>csum + csum + :<csum) / 9;        /*2*/
        }
    }

Step 1: Summing three lines at the same time

(figure: for columns 6, 7, 8, src[i-1] holds a6, a7, a8, src[i] holds b6, b7, b8 and src[i+1] holds c6, c7, c8; adding the three lines gives csum = a6+b6+c6, a7+b7+c7, a8+b8+c8)


Average Filter in 1DC (2)

    ave33()
    {
        int i;
        sep int csum;
        for (i = 1; i < LINES - 1; i++) {
            csum = src[i-1] + src[i] + src[i+1];          /*1*/
            /* divide before storing: the nine-pixel sum does not fit in a uchar */
            dst[i] = (:>csum + csum + :<csum) / 9;        /*2*/
        }
    }

Step 2: Neigh. ref. (:>, :<) and "+"

(figure: csum and its two neighbor-shifted copies :<csum and :>csum are added column by column, so that column 7 of dst[i] receives (a6+b6+c6) + (a7+b7+c7) + (a8+b8+c8), and likewise for the other columns)
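The two steps of ave33 can be checked with a plain Python emulation. This is a sketch under assumptions, not the 1DC compiler's behavior. Each list element plays the role of one PE, the neighbor references are modeled as list shifts, and a missing neighbor at either end of the array is assumed to contribute 0. Which of `:>` and `:<` reads the left neighbor is not stated on the slides, and for this symmetric sum it does not matter. The helper names are hypothetical.

```python
def shift_from_left(v, fill=0):
    """Value held by each PE's left neighbor (fill at the array edge)."""
    return [fill] + v[:-1]


def shift_from_right(v, fill=0):
    """Value held by each PE's right neighbor (fill at the array edge)."""
    return v[1:] + [fill]


def ave33(src):
    """Emulate ave33(): src is a list of lines, one pixel per PE per line."""
    lines, width = len(src), len(src[0])
    dst = [[0] * width for _ in range(lines)]
    for i in range(1, lines - 1):
        # Step 1: sum three lines, column by column.
        csum = [a + b + c for a, b, c in zip(src[i - 1], src[i], src[i + 1])]
        # Step 2: add the column sums of both neighbors, then average.
        left, right = shift_from_left(csum), shift_from_right(csum)
        dst[i] = [(l + c + r) // 9 for l, c, r in zip(left, csum, right)]
    return dst


src = [[9, 9, 9, 9], [9, 9, 9, 9], [9, 9, 9, 9]]
print(ave33(src)[1])
```

Interior columns of the constant test image come out as 9. The two edge columns are lower only because of the zero fill assumed here.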


Toward Efficient Execution of 1DC Codes

1DC program → 1DC compiler / linker → IMAP

Hardware support for the Row, Systolic, Slant and Autonomous operation of the PE array:
- Fast PE grouping
- Pipelined data exchange
- Fast left/right referencing
- Fast index addressing

(figure: the four PE array operation modes shown over the block diagram: Video IN, Video OUT, SR0 to SR3, Host Processor, Control Processor (CP) with P$, D$, STK RAM, 4 Way VLIW PEs 0 to 127 with IMEM, External Mem. I/F, SDRAM/SSRAM)


Programming Environment

Tools: 1DC Source Code, 1DC Optimizing Compiler, Library, IMAP Assembler, Linker, 1DC Symbolic Debugger
Target: IMAP-CE PCI board

Debugger windows and functions:
- 1DC Source code window
- Timing measurement result for each source code line
- Assign variables to sliders (real-time value tuning debugging)
- Source image window
- Image recognition result window





4. Evaluation


Operation Group Kernels

Flexibility against various memory access patterns

Op. Grp.   Kernel Name                     IPC
PO         Color format trans.             1.40
LNO        3x3 ave. filter                 1.33
SO         Histogram                       1.66
GlO        FFT                             1.55
GeO        90 degree rotation              1.23
RNO        Distance transform              1.52
OO         Connected component labeling    1.40

(figure: bar chart per operation group kernel (PO, LNO, SO, GlO, GeO, RNO, OO, Ave.); axes: speedup, ticks 0 to 8, and parallelism (max. 128), ticks 0 to 140; legend: GPP, IMAP-CE, Parallelism)

IMAP-CE@100MHz, 1DC compiler codes
[email protected], Intel C compiler codes


Highly Parallel vs. Sub-Word SIMD

Flexibility against algorithmic complexity

Benchmark kernels (only PO, LNO kernels are used due to the nature of MMX inst.)

name        Purpose
Add2        dyadic arithmetic
GreyOpen3   3x3 grey morphology
Gauss5      5x5 filter
Mexican13   13x13 conv.
Var5Oct     5x5 texture analysis
Canny       edge detection (3x3)
Smoothing   edge preserving smoothing (7x7)

Processor       Op. Freq.   PE #       Peak Perf.
P4 (SIMD)       2.4GHz      1PEx8x2    38.4GOPS
IMAP-CE         100MHz      128PEx4    51.2GOPS
IMAP-CE / GPP   x 1/24      x 32       x 1.33

GOPS: in byte operation

(figure: bar chart per benchmark kernel (Add2, GreyOpen3, Gauss5, Mexican13, Var5Oct, Canny, Smoothing, Ave.); axes: speed-up, ticks 0 to 10, and complexity as # of if-clause per pixel op., ticks 0 to 20; legend: GPP(MMX), IMAP-CE, Complexity)

IMAP-CE@100MHz, 1DC compiler codes
[email protected], MMX codes
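The peak performance figures in the processor table follow from frequency × PE count × byte operations per PE per cycle. The short Python check below reproduces them from the table's own numbers. The function name `peak_gops` is hypothetical, and 8 × 2 is read as sixteen byte operations per cycle for the single P4 PE.

```python
def peak_gops(freq_mhz, pes, ops_per_pe_per_cycle):
    """Peak byte operations per second, in GOPS."""
    return freq_mhz * pes * ops_per_pe_per_cycle / 1000.0


p4 = peak_gops(2400, 1, 8 * 2)      # 1PE x 8 x 2 at 2.4GHz
imap_ce = peak_gops(100, 128, 4)    # 128PE x 4 at 100MHz

print(p4, imap_ce)
print(round(imap_ce / p4, 2))
```

The ratio row of the table decomposes the same way: 100MHz / 2.4GHz = 1/24 in frequency, (128 × 4) / (1 × 8 × 2) = 32 in operations per cycle, and 32 / 24 ≈ 1.33 in peak performance.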


Compared with Some Recent Media Processors

(figure: on-chip memory organizations between the image and the PEs)
- 128 bank memory (IMAP)
- One to several banks (scratch pad memories): SRF of Imagine (Stanford), Frame Buffer of Morphosys (UC), Local Store of SPE (CELL: Sony)
- On chip vector partitioning & chaining: VIRAM (UCB), CODE (Stanford)
- Other labels: 2KB, static vector partitioning

1024 point 1D-FFT performance compared with other media processors

Processor Name       Cycle count   Word Size   Die-size   Pwr(W)   Tech(um)
Imagine (Float)      2176          16          12*12      4        0.15
Morphosys2           2636          16          16*16      4        0.13
IMAP-CE (IMAPCAR)    5000 (3700)   8           11*11      4 (2)    0.18 (0.13)
VIRAM                5280          16          15*18      2        0.18
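The cycle counts in the FFT table convert to execution time once a clock frequency is known. The Python sketch below does this only for IMAP-CE and IMAPCAR, whose 100MHz clock is stated earlier in the talk. The clock frequencies of the other processors are not given, so no times are derived for them. `fft_time_us` is a hypothetical helper name.

```python
def fft_time_us(cycles, freq_mhz):
    """Execution time in microseconds for a given cycle count."""
    return cycles / freq_mhz


print(fft_time_us(5000, 100))   # IMAP-CE, 1024 point 1D-FFT
print(fft_time_us(3700, 100))   # IMAPCAR, 1024 point 1D-FFT
```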


A Real Application: Vehicle Detection

Flexibility at the application level

Input: forward looking camera
Processing steps: Lane Mark Detection, then Vehicle Detection (Search, Validate, Tracking vehicles)
Tracking uses four local windows in max. six vehicles

(figure: processing time in ms, axis 0 to 80, for Lane Mark Detection and Vehicle Detection)
IMAP-CE@100MHz: use 1DC
[email protected]: use C


The Uneven Workload Issue

(figure: processing time distribution, 0% to 100%, between Search and Validate on IMAP-CE and on GPP)

- Search: PE array fully utilized
- Validation: partial activation of PE array during sequential validation of each candidate area







Summary

Embedded Image Recognition Processor requirements:
1) High Performance
2) Low Cost / High Reliability
3) High Flexibility

The IMAP approach:
Parallel and systolic algorithm design methodology
+ Hardware support of parallelizing methods
+ Extended C Compiler & GUI Debugger

(figure: flexibility against performance under the Technology Barrier, with points (a) to (e) and the Flexibility Gap; labels: GPPs, Media Extended DSPs, Assembly programmed DSPs, Highly parallel SIMD, Wired logics (+DSP core))


The END

(Thank you for your attention)
