10
JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.3, JUNE, 2017 ISSN(Print) 1598-1657 https://doi.org/10.5573/JSTS.2017.17.3.473 ISSN(Online) 2233-4866 Manuscript received Jun. 11, 2017; accepted Jun. 14, 2017 Department of Electrical Engineering KASIT, Daejeon 305-701, Korea E-mail : [email protected],[email protected], [email protected] A Memory-efficient Hand Segmentation Architecture for Hand Gesture Recognition in Low-power Mobile Devices Sungpill Choi, Seongwook Park, and Hoi-Jun Yoo Abstract—Hand gesture recognition is regarded as new Human Computer Interaction (HCI) technologies for the next generation of mobile devices. Previous hand gesture implementation requires a large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile devices to users. Therefore, in this paper, we presents a low latency and memory-efficient hand segmentation architecture for natural hand gesture recognition. To obtain both high memory-efficiency and low latency, we propose a streaming hand contour tracing unit and a fast contour filling unit. As a result, it achieves 7.14 ms latency with only 34.8 KB on-chip memory, which are 1.65 times less latency and 1.68 times less on-chip memory, respectively, compare to the best-in-class. Index Terms—Hand gesture, low latency, memory- efficient design, image processing, human computer interaction I. INTRODUCTION Recently, hand gesture recognition becomes a salient technology due to the advent of hands-free mobile devices such as a Head Mounted Display (HMD) [1]. However, because most of HMDs are implemented by software-based interface system, these implementations suffer from slow response time. As a results, this high latency causes degradation of interaction accuracy and gives uncomfortable experience to users. To solve this problem, in [2], hand gesture system is recommended to process within 16.66 ms (60 fps). However, it has been failed to achieve 60 fps requirement in previous works [3, 4]. In the hand gesture recognition, there are three steps; hand segmentation, hand analyzing, and gesture detection. In the hand segmentation phase, hands’ candidate regions are searched from input images. After that, in the hand analyzing phase, each candidate region is verified to decide what hands are, and hands’ detail information such as fingertips and a palm center is detected. Finally, in the gesture detection phase, gesture information is recognized from a sequence of hand analyzing information. In [3], they utilize stereo vision algorithm for hand segmentation. However, it already spends 16.66 ms only for hand segmentation, and there is no timing margin for additional processing. Moreover, for high throughput and small on-chip memory, [3] utilizes low accuracy stereo algorithm [5]. In [4], it uses color based hand segmentation algorithm (Gaussian Mixture Model or GMM). However, it also achieve only 30 fps without hardware acceleration. This implemen- tation requires 64 MB for GMM look up table and it is impossible to integrate on chip memory. Moreover, although this approach is very efficient for latency in high performance CPU implementation, it is requires large latency of external memory access for mobile devices. In this work, we choose GMM based algorithm for

A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.3, JUNE, 2017 ISSN(Print) 1598-1657 https://doi.org/10.5573/JSTS.2017.17.3.473 ISSN(Online) 2233-4866

Manuscript received Jun. 11, 2017; accepted Jun. 14, 2017 Department of Electrical Engineering KASIT, Daejeon 305-701, Korea E-mail : [email protected],[email protected], [email protected]

A Memory-efficient Hand Segmentation Architecture for Hand Gesture Recognition in Low-power Mobile

Devices

Sungpill Choi, Seongwook Park, and Hoi-Jun Yoo

Abstract—Hand gesture recognition is regarded as new Human Computer Interaction (HCI) technologies for the next generation of mobile devices. Previous hand gesture implementation requires a large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile devices to users. Therefore, in this paper, we presents a low latency and memory-efficient hand segmentation architecture for natural hand gesture recognition. To obtain both high memory-efficiency and low latency, we propose a streaming hand contour tracing unit and a fast contour filling unit. As a result, it achieves 7.14 ms latency with only 34.8 KB on-chip memory, which are 1.65 times less latency and 1.68 times less on-chip memory, respectively, compare to the best-in-class. Index Terms—Hand gesture, low latency, memory-efficient design, image processing, human computer interaction

I. INTRODUCTION

Recently, hand gesture recognition becomes a salient technology due to the advent of hands-free mobile devices such as a Head Mounted Display (HMD) [1]. However, because most of HMDs are implemented by software-based interface system, these implementations

suffer from slow response time. As a results, this high latency causes degradation of interaction accuracy and gives uncomfortable experience to users. To solve this problem, in [2], hand gesture system is recommended to process within 16.66 ms (60 fps). However, it has been failed to achieve 60 fps requirement in previous works [3, 4].

In the hand gesture recognition, there are three steps; hand segmentation, hand analyzing, and gesture detection. In the hand segmentation phase, hands’ candidate regions are searched from input images. After that, in the hand analyzing phase, each candidate region is verified to decide what hands are, and hands’ detail information such as fingertips and a palm center is detected. Finally, in the gesture detection phase, gesture information is recognized from a sequence of hand analyzing information. In [3], they utilize stereo vision algorithm for hand segmentation. However, it already spends 16.66 ms only for hand segmentation, and there is no timing margin for additional processing. Moreover, for high throughput and small on-chip memory, [3] utilizes low accuracy stereo algorithm [5]. In [4], it uses color based hand segmentation algorithm (Gaussian Mixture Model or GMM). However, it also achieve only 30 fps without hardware acceleration. This implemen- tation requires 64 MB for GMM look up table and it is impossible to integrate on chip memory. Moreover, although this approach is very efficient for latency in high performance CPU implementation, it is requires large latency of external memory access for mobile devices.

In this work, we choose GMM based algorithm for

Page 2: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

474 SUNGPILL CHOI et al : A MEMORY-EFFICIENT HAND SEGMENTATION ARCHITECTURE FOR HAND GESTURE …

latency and GMM is implemented by deep pipelined accelerator instead of LUT design for reducing LUT memory. In Fig. 1, we profiles the previous work [4], and hand segmentation occupies almost 60% of the overall processing time. Therefore, we optimize hand segmen- tation processing for 60 fps operation. We also simulate a hand segmentation phase by Altera Cyclone IV FPGA and analyzed timing breakdown of the best-in-class segmentation hardware [6, 7]. As shown in Fig. 1, over 48% of total processing is memory access which is a dominant bottleneck in latency requirement.

Therefore, in this paper, we propose low-memory access and memory-efficient architecture and finally achieve the 7.14 ms latency with 34.8 KB on-chip memory which is equal to 61.83 fps for overall hand gesture recognition processing. In this paper, we will introduce detail algorithm in section II and hand segmentation architecture in section III. After that, we will show implementation results in section IV and section V gives the conclusion.

II. HAND SEGMENTATION ALGORITHM

1. Hand Segmentation Flow Before explaining detail hardware architecture, we

briefly introduce hand segmentation algorithm in this section. Fundamental algorithm is described in [8]. The role of hand segmentation is searching hands’ candidate regions from color input images. As shown in Fig. 2, there are 5 stages of the hand segmentation. In first stage, GMM [4] or simple color thresholding algorithm [8] are utilized to detect hand color regions. Among them, we uses GMM because of its high hand color extraction accuracy. After that, the largest blob is selected as a hand because it has the largest area in an input image for HCI applications. For this, connected pixel segmentation or contour based segmentation is used. Generally, because GMM is based on stochastic model, there are some holes inside segments. Therefore, contour based algorithm is widely used for binary image segmentation. After this processing, by using contour information, it calculates inside area of each contour and selects the largest blob among contours. However, due to noise from an input image and GMM detection results, the selected largest contour is rough, and it can causes degradation of hand analysis accuracy. Therefore, contour approximation is usually utilized. Finally, contour hole filling is processed for generating hands’ region of interest (ROI).

In the next 2 sub-sections, we will introduce problems of convectional hand segmentation processing and present proposed hand segmentation flow.

Hand Segmentation

(13.24 ms)

Hand Analyzing(6.73 ms)

Gesture Detection(3 ms)

InputImage

Hand Segmentation

HandAnalyzing

Gesture Detection

Overall≈ 22.27 ms

(45 fps) MemoryAccess(48%)

ALUOperation

(35%)

Cmp. Operation (14%)

Exp. Operation (3%)

Fig. 1. Gesture recognition overall flow and its workload.

Contour Approximation

Contour Tracing

Largest Blob Selection

Hand Color Detection

Processing Sequence

Input Image FillingResult

Fig. 2. Hand segmentation flow.

Page 3: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.3, JUNE, 2017 475

2. Memory Efficient Hand Contour Selection Algorithm

In the conventional algorithms, area calculation and

contour approximation are processed sequentially after contour tracing. Thus, they require intermediate buffers and memory access. This memory access occupies almost 43% of overall hand contour selection time which is a dominant factor of increasing latency in the hand contour selection.

For reducing this redundant memory access, we adopt streaming processing in the hand contour selection. For utilizing streaming processing, each area calculation and contour approximation is processed in a pixel-wiser manner, and they are processed immediately after one contour point is traced.

In the area calculation, we utilize (1) which is a simple polygon based method.

1 11

1Area ( )( )2

n

i i i ii

x x y y+ -=

= + -å (1)

Also, because contour tracing points always

accompany neighbor pixels, 1i iy y -- term can be -1, 0 and 1. Thus, Eq. (1) can be modified as (2).

1

1 11

1

11Area ( ) 02

1

i in

i i i i i ii

i i

y yw x x w y y

y y

-

+ -=

-

- <ìï= + = =íï >î

å (2)

As a result, the area calculation is implemented with

only one addition, one-subtraction and one comparison. Moreover, it can be operated seamlessly after on contour point is traced.

In the contour approximation, we adopt curvature based contour approximation [10]. As shown in Fig. 3(a), basic concept of this contour approximation is storing small curvature points in contour points. In Fig. 3(b), each curvature of points is calculated by angle between a center point and +k, and –k successive contour points. In this work, default k is 4, and it can be configured from 1 to 9. However, in [10], it has a problem that it can make poor approximation results if contour is a gradual curve. Therefore, we suppress long path approximation in this algorithm. If approximation points is larger than 7,

algorithm forces to store a contour pixel and resumes contour approximation after that point. Thanks to this, contour approximation can be operated seamlessly after one contour point is traced.

As a result, both of algorithm make streaming processing possible in the hand contour selection. The flow of this processing is described in Fig. 4. Due to streaming processing, 66% on-chip memory and 43% latency can be reduced. In addition, because there is no dependence between the area calculation and contour approximation, they work simultaneously. Therefore, it makes additional 50% reduction of pipeline register in the hand contour selection.

3. Memory Efficient Contour Filling Algorithm

In the previous contour filling works [7, 11], they use

too much redundant memory in spite of their low latency. In [7], it uses kernel convolution based contour filling algorithm, and this requires multiple bits (8-bit) for each pixel at filling operation. Also, in [11], it requires at least 3-bit and additional segments’ labeling bits. In the previous works, they implement general purpose contour algorithm. Thus, they needs 11bits for each pixel. However, in the hand segmentation, it is only required to detect largest 2 segments in an input image. In [7], it cannot calculate area until whole filling processing ends, therefore it is hard to reduce 8-bits memory per pixel constraint. In [11], it can reduce labeling memory bits by 2bits, but 3-bits filling memory cannot be reduced in the contour filling operation. Thus, it requires overall 5 bits

74

83 83

117 117

170

150 150

170

Fig. 3. Pixel-wise contour approximation.

Page 4: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

476 SUNGPILL CHOI et al : A MEMORY-EFFICIENT HAND SEGMENTATION ARCHITECTURE FOR HAND GESTURE …

per each pixel. In this work, we propose new memory-efficient and low-latency contour filling algorithm for hand segmentation which is called Fast Contour Filling (FCF). If there are 2 hands in an input image, it requires only 3bits per each pixels.

Fundamental of FCF is similar with that of [11]. Fig. 5 shows FCF’s overall flow. First, all contour points are initialized as start pixels as shown in Fig. 5(a). After that, for descent contour pixels (dash lines), they are labeled as Fig. 5(b). In end pixel labeling phase, descent contour points are usually end pixels, but there are some exception (will presents at the next paragraph). Due to start and end pixel labeling, it can easily fill as shown in Fig. 5(c). As shown in Fig. 5(d), the original contour is successfully filled. In this algorithm, each label requires only start and end information. Therefore, for the hand segmentation, it requires only 3-bits for each pixel

(outside pixel, hand ontour1, hand contour2). As a result, it can reduce overall 62.5% and 40% less buffer memory than [7] and [11], respectively.

Moreover, if the number of pixels is N, and the number of contour points is K, [7] and [11] require 3N memory access and 2N+2K memory access for kernel convolution, respectively. However, FCF only requires 2N+2K memory access although it requires less memory size. Generally, because K is under 2% of N, FCF is less than or equal to previous works.

For our algorithm, criteria of end pixel labeling is important. As shown in Fig. 6, there is a basic rule; Descent contour points are end pixels. Descents contour points are defined in Fig. 6(a) and Eq. (3). However, there are 3 labeling exceptions as described in Fig. 6(b) and Eqs. (4, 5). For inside 1 and 2 case, although there are descent contour points, all of them are not actual end

Contour Approx.Area Cal.One Point

Contour Tracing

Contour Approx.Area Cal.One Point

Contour Tracing

Contour Approx.Area Cal.One Point

Contour Tracing

Approx.Area Cal.One Point

Contour Tracing

One PointContour Tracing

One PointContour Tracing

Approx.Area Cal.

Approx.Area Cal.

Area Cal.

Area Cal.

Area Cal.

Area Cal.

Area Cal.

Area Cal.

Fig. 4. Pixel-wise streaming hand contour selection flow.

Fig. 5. Fast contour filling flow.

Falling After Falling Before Falling Both

Inside 1 Inside 2 Return

(a) Rule : Descent Contour Points are End Pixel

(b) Exception Case of RuleStart Pixel End Pixel

Fig. 6. End Pixel labeling criteria.

Page 5: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.3, JUNE, 2017 477

points. Therefore, for these patterns, it skips end labeling. Another exception case is that there is a returning path as shown in Eq. (6). In this case, the contour points are always labeled as an end pixel.

1 1Rule(Descent Contour Points) : i i i iy y y y+ -< <U (3)

1 1 1 1 1Inside1 : i i i i i i i iy y y y x x x x- + - - += < < <I I I (4)

1 1 1 1 1Inside2 : i i i i i i i iy y y y x x x x- + - - += > > >I I I (5) 1 1 1 1Return : i i i iy y x x- + - += =I (6)

III. HAND SEGMENTATION HARDWARE

ARCHITECTURE

The proposed processor block diagram is shown in Fig. 7. There are 3 functional logic units, which are a GMM unit, a Streaming Hand Contour Tracing Unit (SHCTU) and a Fast Contour Filling Unit (FCFU), and 2 memory units, which are a global SRAM buffer and a contour buffer.

The GMM unit utilizes 35 depth deep pipeline architecture. Thanks to this deep pipeline and no stall in pipeline operation, it achieves low latency in hand color detection. Next, the SHCTU adopts our proposed pixel-wise streaming hand segmentation algorithm. Therefore, it makes possible to achieve low latency and small on-chip memory. Finally, the FCFU adopts our proposed FCF algorithm described in section II. Also, because each line filling operation is independent with other line filling operations, for low latency in contour filling, it is implemented as a 3-way parallel filling hardware. In the next 2 sub-sections, we will introduce the details of the SHCTU and the FCFU.

1. Streaming Hand Contour Tracing Unit

As shown in Fig. 8(a), there are two single cycle operating combinational logics for k-curvature approximation and area calculation. Because x, y coordinates 9-bits (0~319) and 8-bits (0~239) respectively, ALU operation of two processing requires small timing. Therefore, overall combinational can be implemented by single cycle combinational logic at 50 MHz operation.

Basically, the k-curvature approximation hardware always approximates all incoming contours. As shown in Fig. 8(b), incoming contours are stored in the temporal FIFO. After the number of contour points is sufficient, it starts to approximate contour points. First, it calculates with vectors of +4 points and -4 points to generate length values and inner product values. To reduce the burden of the square root calculation, it first calculates the inner product of two vectors and squares its results. Also, a

Streaming Hand ContourTracing Unit

GM

M U

nit

Hand Segmentation Core

HSV Image Data

Hand Candidate

ContourBuffer(6KB)

Fast Contour Filling Unit

Global SRAM Buffer (28.8KB)

Contour Labeler

HoleFiller

Area Calculator

Contour Tracer

ContourApproximator

Fig. 7. Overall block diagram of hand segmentation core.

Contour Tracing

ContourStack

43210-1-2-3-4

FIFON-5N-4N-3N-2N-1N

Single Cycle Operation

X,Y

OneNext

Contour Point -1

0

K-Curvature Approximation

Area CalculationAssert

Flush Signal when Small Size

Approx.Contour

Point

Thres.Angle

Θ

+4 Vector

0

+4

-4 Vector

0

-4

(a) Overall Hand Contour Tracing UnitK-Curvature Approximation

X,YOneNext

Contour Point

9 Points

Inner Product

Length Calculatior

Normalizer

Thre

shol

ding

Req

Div

Div

Combinational Logic

Overflow

Rst

Length Calculator

3 bitCounter

43210-1-2-3-4

ContourStack

N-5N-4N-3N-2N-1N

(b) K-Curvature Approximation Hardware

Previous Contour

Point

Combinational Logic

Area Calculation

XY

XY

Current Contour

Point

Are

a R

egis

ter

Contour Tracing

Thre

shol

ding

Tracing Finish Signal

Pointer Reset

ContourStack

N-5N-4N-3N-2N-1N

(c) Area Calculation Hardware

Fig. 8. Streaming Hand Contour Tracing Unit.

Page 6: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

478 SUNGPILL CHOI et al : A MEMORY-EFFICIENT HAND SEGMENTATION ARCHITECTURE FOR HAND GESTURE …

length value is calculated from a square value. Finally, this result is normalized by the length value. Therefore, it can get the square of a cosine value by this hardware. By thresholding, it can be decided whether it is stored or not. And, if it is not stored, internal 3-bit contour is increased. If this contour is overflowed, a contour point is stored to the contour stack. In this operation, approximation range K (default: 4) is reconfigurable from 1 to 9. Therefore, there are actually19 entries for FIFO. The initialized curvature threshold value is 70˚, and it is reconfigurable from 0˚ to 180˚.

As shown in Fig. 8(c), if the size of contour is small when area is default 10% of overall resolution (this is reconfigurable variable), the area calculation hardware outputs a contour stack pointer reset signal to empty a contour pointer. By resetting a contour stack pointer, it can easily flushing invalid contour points in memory. In addition, because area is calculated by Eq. (2), it can determine whether a valid contour or not immediately after tracing contour. Thus, there is no stall in contour tracing by post processing.

A contour stack is consisted of two regions. One is external contour points for two hands’ candidate blobs and the other is for internal contour. An internal contour stack is used to remove invalid internal holes. For example, ‘okay’ hand gesture includes an empty hole which is not a part of hand. Generally, internal contour is small, and it stores only large internal contour points. Based on our experiments, we allocates 4KB for the external contour stack and 2KB for the internal contour stack. In the contour stack, there are additional contour start pointer registers and end pointer registers. The Area calculation hardware decides its threshold from area information. Moreover, the contour approximation hardware also decides its starting address of the contour stack by referring to those registers. For external contour points, due to its large size, fragmentation of the contour stack can waste a memory. Therefore, we implement the contour stack as circular queue architecture to store only two largest hand contours. As a result, contours can be stored without memory fragmentation.

Finally, after all contours are traced, there are only two largest contours in the external contour stack and a few large inside contours in the internal contour stack. Thanks to this memory-efficient streaming architecture, it can reduces 26.1% of overall processing latency from

reducing 36% of overall memory access reduction compared to conventional contour tracing hardware.

2. Fast Contour Filling Unit

For generating region of interest, contour filling is performed after hand contour tracing. Fig. 9 shows the FCFU architecture. This hardware adopts FCF algorithm which is presented in section II. There are two parts in FCF, a contour labeler and a hole filler.

First, after contour tracing, the contour labeler gets contour points successively from the contour stack. Because this contour is approximated, it should be reconstructed as a connected line. Therefore, the line drawer which adopts Bresenham’s line drawing algorithm [12] reconstructs points between two approximated contour points and those points are stored in temporal contour FIFO. Then, the contour labeler initializes all contour points as start pixels within 1 clock cycle. After finishing it, the contour labeler draws end pixel by using FCF algorithm.

Finally, the hole filler fills inside of labeled contours. Because each line filling operation is independent from each other, it is processed as 3-way in parallel for reducing latency. Moreover, as shown in Fig. 10, filling decision is decided by previous buffer output and current

ContourStack

Fast Contour Filling Unit

N-5N-4N-3N-2N-1N

N+1

Global SRAM Buffer

Start & End Contour Labeler Hole FillerN

N-1

FIFO

M U XLine Drawer

Start Pixel Initializer

N-1NN+1Contour FIFO

End Pixel Labeler

P1P2P3P4P5

SRAM BANK1 (9.6KB)

SRAM BANK2 (9.6KB)

SRAM BANK3 (9.6KB)

Bank 1Filling FSM

Bank 2Filling FSM

Bank 3Filling FSM

Fig. 9. Fast Contour Filling Unit.

Filling

0 1Start

2 53 4End

CLKCurrent ADDRBuffer Output

Current Operation

0 1 2 53 4

0 1 2 3 4 5 6

N/A 0 Start 0 0 End 0

NOP NOP Fill [2] Fill [3] Fill [4] NOP NOP

Hole Filling State Out Out Out In In In Out

Fig. 10. Hole filling timing diagram.

Page 7: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.3, JUNE, 2017 479

state, therefore, it accesses memory only twice (read & write). This FSM transition table is described in Table 1.

FCF prevents its start and end pixels from contamination by other labeling or contour data. Therefore, values in global SRAM buffer is allocated as Table 2. For contour tracing, 3 values (0, 1, and 2) are used. 2 values (4, 5) are used for hand 1, and other 2 values (6, 7) are used for hand 2. In addition, values (2, 3) is used for internal contours. In this case, value 2 is used for both tracing and internal contour filling. However, filling operation is processed after tracing operation and only value 0 is used for filling operation within values (0, 1 and 2). Thus, this labeling does not cause data contamination between tracing and filling. However, as shown in Fig. 11(a), when approximated contours are overlapped, start and end labeling information can be contaminated. It is a serious problem because it fills all the holes with wrong value, if overlapped points override end pixels. Therefore, the FCFU fills contours three times to eliminate errors as shown in Fig. 11(b). In fact, there are little information

loss in edges of ROI, but it is small portion of overall region area. We simulated cases for this overlapping. It is almost not occurred in general hand detection case, and it only generates several pixels overlapped in artificial generated test set. Therefore, we just handle an overlapping problem and neglect this small accuracy degradation.

As a result, because the FCF unit is possible to reduce required memory bit size which is explained in section II and process low latency, it can reduce overall 40.5% required memory than previous works and overall 17.7% latency.

IV. IMPLEMENTATION RESULTS

Proposed hardware, as shown in Fig. 12, is implemented by 65 nm CMOS technology processes, and it is operated at 1.2 V with 50 MHz frequency. It occupies 0.5748 mm2 with 34.8 KB SRAM. Power consumption is 5.544 mW and its latency is 7.14 ms.

In Fig. 13, there are hand segmentation results of the proposed hand segmentation architecture. Fig. 13(a) are

Table 1. Hole filling state transition table

0 or End

Buffer Output

NOP

Operation

Out

Current State

Start FILLOut

End NOPIn

0 or Start FILLIn

Out

Next State

In

Out

In

Table 2. Value allocation for fast contour filling

Hand 1 End Pixel

Description

4

Pixel Data

Hand 1 Start Pixel5

Hand 2 End Pixel6

Hand 2 Start Pixel7

Non Contour

Description

0

Pixel Data

Contour Points1

Searched Conotur/Internal End pixel2

Internal Start Pixel3

(a) Contour Overlapping Problem

Overlap PointsExternal Approx. PointsInternal Approx. Points

(b) Contour Filling Order

Hand 1 Labeling

Hand 1 Fiiling

Hand 2 Labeling

Hand 2 Fiiling

Internal Labeling

Internal Fiiling

Fig. 11. Filling operation flow.

Process

Area

Frequency

Gate

PowerConsumption

Latency

65nm CMOS

0.5748 mm2

50 MHz

283k logic gates

5.544 mW

7.14 ms

OperatingVoltage 1.2V

SRAM 34.8 KB

0.68 mm

0.85 mm

Global SRAMBANK3

Global SRAMBANK2

Global SRAMBANK1

Contour Stack

GMM CoreHCTCFC

F

Fig. 12. Layout summary.

(a) Input Image (b) Hand Segment (c) Hand ROI

Fig. 13. Hardware simulation results.

Page 8: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

480 SUNGPILL CHOI et al : A MEMORY-EFFICIENT HAND SEGMENTATION ARCHITECTURE FOR HAND GESTURE …

input images from a camera, and Fig. 13(b) are streaming hand contour tracing results. In Fig. 13(b) bottom, because it traces internal contours, it detects an invalid hand region, and it makes possible more accurate ROI results. Finally, Fig. 13(c) shows filling results from hand contour detection results. At 50 MHz operation, latency of hand segmentation are 6.75 ms, 6.94 ms, and 6.91 ms, respectively. We test it for 1000 hand images, and the worst latency and the average latency are 7.14 ms and 6.87 ms, respectively. The type of test inputs is CIF, and curvature approximation range K=4, threshold=70˚, minimum hands’ candidate area is 7680 pixels and minimum internal hands’ contour area is 768 pixels. And each parameter can be reconfigurable by changing internal control register file.

As shown in Table 3, the proposed work shows the highest energy and memory efficiency than previous works show. In previous [7, 9], they only implement contour tracing and filling operation. On the other hand, our work covers the overall process of the hand segmentation. Also, thanks to memory-efficient SHCTU and FCFU, it requires only 34.8 KB on-chip memory. For evaluating memory efficiency, we propose new figure of merit (memory/pixel). This FOM shows that the proposed work has the smallest memory requirement for the same resolution. This is 40% and 98.7% less memory than previous works [7, 9] respectively.

Fig. 14 shows performance improvements of the proposed work. If there is no proposed scheme, hand segmentation for CIF 50 MHz requires 11.75 ms. However, by adopting proposed SHCTU, this latency is reduced to 8.68 ms which is 26.1% latency reduction.

Moreover, our FCFU reduces additional latency and it achieves 7.14 ms. As the result, overall 39.2% of latency is reduced. Also, proposed hand contour architecture reduced required memory to 34.8 KB. Compared with the previous work [7], it achieve overall 40.5% of required memory reduction which is normalized to CIF resolution.

V. CONCLUSIONS

A memory-efficient hand segmentation architecture has been realized as hand gesture recognition preprocessor for mobile platform. By exploiting streaming hand contour tracing architecture, intermediate memory for hand segmentation can be removed and it achieve to reduce latency. Also, proposed new contour filling algorithm and architecture reduces more memory. As a result, thanks to both architecture, it can reduces 39.2% latency and 40.5% memory than conventional hand segmentation architecture. The proposed architecture is capable of hand segmentation at 7.14 ms low latency and with 34.8 KB low on-chip SRAM memory for a CIF image at 50 MHz, enabling natural hand gesture processing. Finally, this low latency makes possible to natural hand gesture recognition by achieving 16.17 ms (< 60 fps) latency in overall hand gesture recognition.

ACKNOWLEDGMENTS

This work was supported by the IT R&D program of MOTIE/KEIT. [10049270, SoC-SW platform for computer-vision based UI/UX on wearable smart devices]

Table 3. Performance comparison table

Algorithm

Implementation

Latency

Memory

Power

Contour Tracing and Filling

FPGA

4.90 ms

On-chip : 500 KBOff-chip : 2 MB

1.25 W

[9]

Energy Efficiency 79.75nJ/Pixel

Resolution CIF

Color-based Hand Segment.

ASIC

7.14 ms

On-chip : 34.8 KB

5.544 mW

This Work

0.52nJ/Pixel

CIF

Contour Tracing and Filling

FPGA

13.89 ms

On-chip : 234 KB

N/A

[7]

N/A

VGA

Memory Efficiency 35.55B/Pixel 0.45B/Pixel0.76B/Pixel

No Scheme StreamingHand Contour Tracing Unit

+ Fast Contour Filling Unit

11.75

8.687.14

26.1%

17.7%

39.2%

Late

ncy

(ms)

[7]

58.5

34.8

40.5%

Nor

mal

ized

Mem

ory

(KB

)

Proposed Hand Contour Architecture

Fig. 14. Performance Improvement.

Page 9: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.3, JUNE, 2017 481

REFERENCES

[1] Witt, Hendrik, Tom Nicolai, and Holger Kenn. "Designing a Wearable User Interface for Hands-free Interaction in Maintenance Applications." PerCom Workshops. 2006.

[2] Segen, Jakub, and Senthil Kumar. "Gesture vr: vision-based 3d hand interace for spatial interaction." Proceedings of the sixth ACM international conference on Multimedia. ACM, 1998.

[3] Po-Kuan Huang, Tung-Yang Lin, Hsu-Ting Lin, Chi-Hao Wu, Ching-Chun Hsiao and et al., "Real-time stereo matching for 3D hand gesture recognition," SoC Design Conference (ISOCC), 2012 International, pp.29,32, 4-7 Nov. 2012

[4] Taehee Lee, and Hollerer, T., "Handy AR: Markerless Inspection of Augmented Reality Objects Using Fingertip Tracking," Wearable Computers, 2007 11th IEEE International Symposium on, pp.83,90, 11-13 Oct. 2007

[5] http://vision.middlebury.edu/stereo/eval/#alg49 [6] Cheng, O., Abdulla, W., Salcic, Z., "Hardware–

Software Codesign of Automatic Speech Recognition System for Embedded Real-Time Applications," in Industrial Electronics, IEEE Transactions on , vol.58, no.3, pp.850-859, March 2011

[7] Kim, D.K., Dae Ro Lee, Thien Cong Pham, Thuy Tuong Nguyen, Jae Wook Jeon, "Real-time component labeling and boundary tracing system based on FPGA," in Robotics and Biomimetics, 2007. ROBIO 2007. IEEE International Conference on , pp.189-194, 15-18 Dec. 2007

[8] Yeo, Hui-Shyong, Byung-Gook Lee, and Hyotaek Lim. "Hand tracking and gesture recognition system for human-computer interaction using low-cost hardware." Multimedia Tools and Applications ,vol.74 no.8, pp.2687-2715 April 2015

[9] Ratnayake, K., Amer, A., "A real-time implementation of chaotic contour tracing and filling of video objects on reconfigurable hardware," in Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on , vol., no., pp.1089-1094, 7-10 Oct. 2007

[10] Wu, Wen-Yen, and Mao-Jiun J. Wang. "Detecting

the dominant points by the curvature-based polygonal approximation." CVGIP: Graphical Models and Image Processing , vol.55, no.2 pp.79-88, 1993

[11] Ren, Mingwu, Wankou Yang, and Jingyu Yang. "A new and fast contour-filling algorithm." Pattern Recognition, vol.38, no.12, pp.2564-2577, 2005

[12] Bresenham, J.E., "Algorithm for computer control of a digital plotter," in IBM Systems Journal , vol.4, no.1, pp.25-30, 1965

[13] Sungpill Choi, Seongwook Park, Gyeonghoon Kim, Hoi-Jun Yoo, "A 124.9fps memory-efficient hand segmentation processor for hand gesture in mobile devices," in Circuits and Systems (ISCAS), 2015 IEEE International Symposium on, pp.742-745, 24-27 May 2015

Sungpill Choi (S'13 – M’15) received the B.S. and M.S. degrees in Department of Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea in 2012 and 2014, respectively. He is currently working

toward the Ph.D. degree in the same department at KAIST. His research interests include the low-power and memory-efficient architecture design for mobile vision SoC especially object segmentation, stereoscopic and machine learning.

Seong-Wook Park (S'12 – M’14) received the B.S. and M.S. degrees in Department of Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea in 2012 & 2014, respectively. He is currently working

toward the Ph.D. degree in the same department at KAIST. His research interests include the micro- architecture design for low-power mobile vision SoC & deep learning applications, and the cognitive computing for artificial intelligence. Furthermore, he is also interested in designing a recognition processor for low-power mobile interactive system.

Page 10: A Memory-efficient Hand Segmentation Architecture for Hand ...€¦ · large memory and computation power for hand segmentation, which fails to give real-time interaction with mobile

482 SUNGPILL CHOI et al : A MEMORY-EFFICIENT HAND SEGMENTATION ARCHITECTURE FOR HAND GESTURE …

Hoi-Jun Yoo (M’95 – SM’04 – F’08) graduated from the Electronic Department of Seoul National University, Seoul, Korea, in 1983 and received the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of

Science and Technology (KAIST), Daejeon, in 1985 and 1988, respectively. Since 1998, he has been the faculty of the Department of Electrical Engineering at KAIST and now is a full professor. From 2001 to 2005, he was the director of Korean System Integration and IP Authoring Research Center (SIPAC). From 2003 to 2005, he was the full time Advisor to Minister of Korea Ministry of Information and Communication and National Project Manager for SoC and Computer. In 2007, he founded System Design Innovation & Application Research Center (SDIA) at KAIST. Since 2010, he has served the general chair of Korean Institute of Next Generation Computing. His current interests are computer vision SoC, body area networks, biomedical devices and circuits. He is a co-author of DRAM Design (Korea: Hongleung, 1996), High Performance DRAM (Korea: Sigma, 1999), Networks on Chips (Morgan Kaufmann, 2006), Low-Power NoC for High-Performance SoC Design (CRC Press, 2008), Circuits at the Nanoscale (CRC Press, 2009), Embedded Memories for Nano-Scale VLSIs (Springer, 2009), Mobile 3D Graphics SoC from Algorithm to Chip (Wiley, 2010), and Bio-Medical CMOS ICs (Springer, 2011). Dr. Yoo received the Electronic Industrial Association of Korea Award for his contribution to DRAM technology in 1994, Hynix Development Award in 1995, the Korea Semiconductor Industry Association Award in 2002, Best Research of KAIST Award in 2007, Scientist/Engineer of this month Award from Ministry of Education, Science and Technology of Korea in 2010, Best Scholarship Awards of KAIST in 2011, and Order of Service Merit from Ministry of Public Administration and Security of Korea in 2011 and has been co-recipients of ASP-DAC Design Award 2001, Outstanding Design Awards of 2005, 2006, 2007, 2010, 2011 A-SSCC, Student Design Context Award of 2007, 2008, 2010, 2011 DAC/ISSCC. He has served as a member of the executive committee of ISSCC, Symposium on VLSI, and A-SSCC and the TPC chair of the A-SSCC 2008 and ISWC 2010, IEEE Fellow,

IEEE Distinguished Lecturer (’10-’11), Far East Chair of ISSCC (‘11-‘12), Technology Direction Sub-Committee Chair of ISSCC (’13), Vice TPC Chair of ISSCC (’14), and currently TPC Chair of ISSCC (’15).