RESEARCH PAPER
SCIENCE CHINA Information Sciences
September 2012 Vol. 55 No. 9: 2102–2108
doi: 10.1007/s11432-011-4457-7
© Science China Press and Springer-Verlag Berlin Heidelberg 2012 info.scichina.com www.springerlink.com
High-speed reconstruction for ultra-low resolution faces
WANG Li∗, CHEN JianSheng, HE JinPing & SU GuangDa
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Received September 13, 2010; accepted May 27, 2011; published online April 12, 2012
Abstract In this paper, a learning-based high-speed reconstruction system for ultra-low resolution faces is implemented using a software/hardware co-design paradigm. The hardware component, working at 60 MHz, contains a field programmable gate array, which is reconfigured to contain parallel processing units, and multiple memories that provide parallel data. The hardware component effectively handles the generation and sorting of the computationally intensive similarity metrics, which solves the processing speed problem in learning-based super-resolution reconstruction for ultra-low resolution faces. The system can reconstruct faces from 8 × 6, 16 × 12, and 32 × 24 sized images, with 4 × 4, 8 × 8, or 16 × 16 times magnification. The experimental results verify the effectiveness of our system in terms of both visual effect and low root mean square errors. The processing speed is improved by a factor of up to 7900 compared with a pure software implementation in C.
Keywords neighborhood image parallel computer, template matching, pixel compensation, pipeline
Citation Wang L, Chen J S, He J P, et al. High-speed reconstruction for ultra-low resolution faces. Sci China
Inf Sci, 2012, 55: 2102–2108, doi: 10.1007/s11432-011-4457-7
1 Introduction
Super-resolution (SR) reconstruction for ultra-low resolution faces has been a hot research topic in recent
years. It has both significant academic value and broad application prospects. For example, in video
sequences captured by traditional surveillance systems, human faces are usually extremely small in size.
This is generally caused by the wide field of view requirement in most surveillance scenarios. These face
images contain only hundreds or even only several tens of effective pixels, which means that they may not
be visually distinguishable even by human inspection. SR reconstruction for face images is a promising
solution to this problem. Face SR is the first step in a face recognition system and has strict requirements
in terms of reconstruction speed and visual effect.
Aiming at the SR reconstruction of ultra-low resolution face images, Baker and Kanade [1,2] developed
a learning based hallucination method by introducing high frequency information extracted from training
data to the reconstruction process to achieve reasonable visual reconstruction effects. Liu et al. developed
a two-step statistical approach integrating global and local models [3] to further improve the hallucination
method. Wang et al. obtained a face shape and texture model using principal component analysis [4].
Nevertheless, all the above reconstruction algorithms employ highly complex calculations. Moreover,
speed bottlenecks occur if the methods are implemented on microprocessors or DSPs. Thus, the SR
∗Corresponding author (email: [email protected])
technique is seldom applied in practice. To solve this problem, we propose an SR reconstruction algorithm
that is able to generate highly realistic reconstruction results with low root mean square (RMS) errors
and is simple in structure and suitable for parallel implementation on a field programmable gate array
(FPGA).
Our work in face SR reconstruction is based on the NIPC-3 neighborhood image parallel computer,
the PCB board of which has been successfully developed in our laboratory [5]. The NIPC-3 platform
adopts a PCI interface for communicating with the host computer. The core computational component of
the NIPC-3 on which our SR reconstruction implementation resides is a Cyclone II FPGA manufactured by Altera. In addition, the NIPC-3 also includes SRAMs, a video A/D
component, CPLD and PCI interface chips. The results acquired from the NIPC-3 can be transferred
to the host PC through a PCI interface for continued processing. Thus, this is a software/hardware
co-design system. The essential idea of the NIPC-3 platform is parallel processing of neighborhood data
in images. A recent domestic peer evaluation of the NIPC-3 platform concluded that in terms of the
size of the neighborhood image core and processing speed of neighborhood data, the NIPC-3 is superior
to other similar hardware systems reported worldwide [5,6]. The NIPC-3 usually acts as an application
specific computation acceleration unit for processing the most time consuming tasks. At the same time,
the host PC is responsible for controlling the algorithm flow so that a balance can be achieved between
efficiency and flexibility. Such a design paradigm is adopted in our implementation.
The rest of this paper is organized as follows. Section 2 presents our face SR reconstruction algorithm.
The hardware implementation of our system is discussed in Section 3, with experimental results presented
in Section 4. The last section concludes our work.
2 Face super-resolution algorithm
The proposed SR reconstruction algorithm is illustrated in Figure 1. Symbol ai represents the i-th low
resolution (LR) face image in the training set, which contains a number of high resolution (HR) training
images aai and the corresponding LR training images ai. Each LR image is down-sampled from a HR
image. The mean of the image patch corresponds to a pixel point in the LR image. b3×3(x, y) is the 3×3
local image block centered at pixel b(x, y) in target LR image b, which can be compared with the LR
training set. Then, similarity measurements d^{a_1}_1 d^{a_1}_2 d^{a_1}_3 · · · d^{a_1}_N, d^{a_2}_1 d^{a_2}_2 d^{a_2}_3 · · · d^{a_2}_N, . . . , d^{a_M}_1 d^{a_M}_2 d^{a_M}_3 · · · d^{a_M}_N
are obtained. Furthermore, we can acquire the minimum feature distance di in the i-th training image.
There are M training samples, so we can get M feature distances. After sorting the M feature distances,
the five closest distances can be found and the corresponding HR image patches aai(Ak) are recorded.
The next step is linear combination of the HR patches. Finally, a pixel compensation to fuse the HR
patch is implemented. The above operation is repeated until all the pixels in the target LR image have
been calculated. In this way, we can construct an HR face image.
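As a concrete illustration, the LR training images can be generated by averaging each s × s HR patch into one LR pixel, which is the construction described above. The following C sketch makes that assumption explicit; the function name and row-major 8-bit gray buffer layout are ours, not from the paper.

```c
#include <stdint.h>
#include <assert.h>

/* Build the LR training image by averaging each s x s HR patch into
 * one LR pixel, so that each LR pixel equals the mean of its HR
 * patch. Row-major 8-bit gray layout; names are illustrative. */
void downsample_mean(const uint8_t *hr, int hr_w, int hr_h,
                     uint8_t *lr, int s)
{
    int lr_w = hr_w / s, lr_h = hr_h / s;
    for (int y = 0; y < lr_h; y++) {
        for (int x = 0; x < lr_w; x++) {
            unsigned sum = 0;
            for (int dy = 0; dy < s; dy++)
                for (int dx = 0; dx < s; dx++)
                    sum += hr[(y * s + dy) * hr_w + (x * s + dx)];
            /* each LR pixel is the patch mean */
            lr[y * lr_w + x] = (uint8_t)(sum / (unsigned)(s * s));
        }
    }
}
```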
Three aspects of the algorithm are described in this paper, namely, the template matching process
including the similarity measurement and sorting, high resolution image patch fusion, and pixel compensation.
2.1 Template matching process
In natural images, neighboring pixels are highly correlated and the local image structure can be used to
enhance the reliability of the super-resolution result [7]. When comparing b(x, y) with the training set, a
3×3 local image block b3×3(x, y) centered at b(x, y) can be considered as the template for a global search
across all training samples. Then, the nearest k 3× 3 image blocks in the LR training images are found.
Finally, the reconstructed HR patch is fused from the HR patches corresponding to the center pixels of the matched local image blocks. The parameter k is set to 5 in our paper. In our hardware implementation, the 3 × 3
local image block of a different pixel is considered as a different template during the matching process,
hence, template updates correspond to pixel updates. The similarity measurement between two image
blocks is defined in equation (1), where b3×3(x, y) is the local image block centered at b(x, y) in the low
Figure 1 Proposed face SR reconstruction algorithm.
resolution image to be constructed, and ai3×3(m,n) is the 3×3 local image block centered at ai(m,n) of the
i-th training sample. The similarity measurement between two image blocks is acquired by calculating the
accumulated sum of the absolute difference values. Equation (1) is called the MAD algorithm. Multipliers
are usually critical resources in FPGAs. However, if MAD were used to measure the similarity of two
image blocks instead of multipliers, the limitation on multiplier resources would not be a problem.
MAD = Σ |b3×3(x, y) − ai3×3(m, n)|. (1)
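A software model of the measurement in Eq. (1) is a short loop; the hardware computes the same accumulated absolute difference in parallel. The function name and row-major 9-element block layout below are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>

/* Similarity measurement of Eq. (1): accumulated absolute difference
 * between two 3 x 3 blocks given as row-major 9-element arrays.
 * Summing |a - b| needs no multipliers, which is why the MAD form
 * suits an FPGA with scarce multiplier resources. */
unsigned mad3x3(const uint8_t b[9], const uint8_t a[9])
{
    unsigned sum = 0;
    for (int i = 0; i < 9; i++)
        sum += (unsigned)abs((int)b[i] - (int)a[i]);
    return sum;
}
```

Identical blocks give a distance of 0; smaller values mean higher similarity.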
The template matching process is the most time consuming component in the whole algorithm, with
C simulation results showing that it consumes over 98% of the total running time. Therefore, we implemented the template matching process using the NIPC-3, while other parts of the algorithm are
implemented using software on the host computer considering that their time consumption is much less
significant. NIPC-3 can provide parallel data and parallel processing and has characteristics of highly
parallel computing. Thus, it can effectively solve the problem of fast template matching.
2.2 Fusion of HR patches
LR image pixels correspond to HR image patches. The five closest pixels corresponding to the target
pixel are found in training samples. This is equivalent to finding five image patches in the HR training
samples. The HR patch size is dependent on the magnification. If the magnification is s×s, the HR patch
size is also s× s. Fusion of the HR patches occurs according to Eqs. (2) and (3), where aai(A) is the HR
patch of the training samples, aa(A) is the fusion patch, and Wi is the combination weight. ai(m,n) is
the mean of aai(A), which is also equal to the pixel value of the LR training image, and a(m,n) is the
fusing pixel value. The closer the feature distance, the greater the corresponding weight. The weights can be calculated by the least squares method [8].
aa(A) = Σ_{i=1}^{5} Wi × aai(A), (2)

a(m,n) = Σ_{i=1}^{5} Wi × ai(m,n). (3)
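Eqs. (2) and (3) amount to a per-pixel weighted sum over the five selected HR patches. A minimal C sketch, assuming the weights have already been computed (e.g. by least squares) and sum to 1; the patch storage layout and names are our assumptions:

```c
#include <assert.h>

/* Weighted fusion of the k = 5 nearest HR patches, Eq. (2).
 * Patches are s x s, row-major; w[i] are precomputed combination
 * weights assumed to sum to 1. Applying the same loop to the five
 * LR pixel values gives Eq. (3). */
void fuse_patches(const double *patches[5], const double w[5],
                  double *out, int s)
{
    for (int p = 0; p < s * s; p++) {
        double acc = 0.0;
        for (int i = 0; i < 5; i++)
            acc += w[i] * patches[i][p];   /* Eq. (2), per pixel */
        out[p] = acc;
    }
}
```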
2.3 Pixel compensation
The HR fusion patch aas×s differs somewhat from the original HR patch. Our algorithm is designed to
make fusion patch bbs×s approximate the original HR patch. As shown in Figure 2, the mean of the HR
patch aa(A) corresponds to the LR image pixel a(m,n), so the mean of the reconstruction patch bb(A)
Figure 2 Pixel compensation schemes.
should be equal to pixel b(x, y). If aa(A) replaced bb(A) directly, a(m,n) would in general not be equal to b(x, y). If aa(A) is compensated using the difference between a(m,n) and b(x, y), this would ensure
that the mean of the reconstructed HR patch would equal the target pixel value. The pixel compensation
is carried out according to Eq. (4).

bb(A) = aa(A) + (b − a). (4)
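Eq. (4) adds the same scalar offset b − a to every pixel of the fused patch, so the patch mean shifts exactly onto the target LR pixel value. A C sketch (function and parameter names are illustrative):

```c
#include <assert.h>

/* Pixel compensation, Eq. (4): add the offset (b - a) to every pixel
 * of the fused s x s patch, where a is the fused pixel value a(m,n)
 * and b is the target LR pixel b(x,y). After this, the patch mean
 * equals b. */
void compensate_patch(double *patch, int s, double a, double b)
{
    double diff = b - a;
    for (int p = 0; p < s * s; p++)
        patch[p] += diff;
}
```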
3 Hardware implementation of face reconstruction
Figure 3 shows the architecture of the NIPC-3 platform. The shared memory consists of two SRAM
chips, which serve as the interface for data communication among the three subsystems (FPGA, DSP, and
PCI). After neighborhood processing, the results can be sent to the DSP for further processing, or be
transferred to the host computer through the PCI bus [9]. Data are transferred from the shared memory
to the neighborhood image frame memory. Then through neighborhood generation in line order [5,10],
data are transferred from a serial format to a parallel format for high speed processing. Neighborhood
image frame memory consists of 4 SRAM chips of type IS61SP6464. The four memory chips can store
64-bit data in each storage unit. For gray images, each pixel takes up 8 bits and each memory unit
stores 8 neighboring pixels in the same line. Therefore, 32 pixels on the same line can be accessed within
one read of the memory chip through pipelining. Our system forms 32× n (line×column) neighborhood
data. In our algorithm, n is set to 3. We use one FPGA chip of type EP2C70F896C6 to implement
neighborhood image processing, with the result of the process transferred to shared memory.
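The storage layout described above can be modeled in software: each 64-bit storage unit holds 8 neighboring 8-bit gray pixels from the same line, so the four parallel SRAM chips deliver 32 pixels per read. The byte order within a word below is our assumption, not specified in the paper.

```c
#include <stdint.h>
#include <assert.h>

/* Software model of the frame-memory word layout: pack 8 neighboring
 * 8-bit gray pixels from one line into a 64-bit storage unit, and
 * extract pixel i back out. Byte order is an assumption. */
uint64_t pack_pixels(const uint8_t px[8])
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)px[i] << (8 * i);
    return w;
}

uint8_t pixel_from_word(uint64_t w, int i)  /* i in 0..7 */
{
    return (uint8_t)(w >> (8 * i));
}
```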
3.1 MAD module
Figure 4 presents the internal modules for template matching. The neighborhood image frame memory
stores all training sample data in line order. A section of the FPGA internal memory is used as the
template RAM which stores the input LR image.
The corresponding 3 × 3 neighborhood data are needed to perform a global search in each training
sample using the MAD operation to find the nearest feature distance di. We obtain M feature distances
when retrieving M training samples. The key in the design of the hardware is to find the closest five
points in the M feature distances, and then to transfer these to the host for further processing.
3.2 Comparator and pipeline
The function of the module comparator is to find the five closest points in the M feature distances
obtained by the MAD module. We need to search the M feature distances once to obtain a maximum
2106 Wang L, et al. Sci China Inf Sci September 2012 Vol. 55 No. 9
Figure 3 Architecture of the NIPC-3 platform. Figure 4 Template matching module.
Figure 5 Pipeline design.
or minimum value. Thus to obtain the five closest points, we need to search the M feature distances
five times. In order to use the M feature distance data repeatedly, internal RAM in FPGA is needed to
store the M feature distance data temporarily. RAMa and RAMb in Figure 4 complete this function.
In our algorithm, the template update control signal is not driven by the completion signal from the comparator. Instead, it is driven by the completion signal of the MAD operation, which indicates that the training samples have been searched once. Then, construction of the next pixel begins.
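The comparator's selection strategy, five sequential passes over the M feature distances with one minimum extracted per pass, can be sketched in C as follows. The function name is ours; already-selected entries are skipped in software where the hardware would mask them out. The sketch assumes m ≥ 5.

```c
#include <assert.h>

/* Software model of the comparator: select the indices of the five
 * smallest of the m feature distances by scanning the array five
 * times, one minimum per pass, as the hardware does. */
void select_five_min(const unsigned *d, int m, int idx[5])
{
    char used[m];                       /* C99 VLA marking selections */
    for (int i = 0; i < m; i++) used[i] = 0;
    for (int k = 0; k < 5; k++) {       /* one pass per minimum */
        int best = -1;
        for (int i = 0; i < m; i++)
            if (!used[i] && (best < 0 || d[i] < d[best]))
                best = i;
        idx[k] = best;
        used[best] = 1;                 /* exclude from later passes */
    }
}
```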
Figure 5 illustrates the design of the pipeline, where signal D(MAD) refers to the delay of the MAD module, with the other symbols defined analogously. Data from the MAD module are written to RAMa. Signal w_full is set once the M data have been written, after which the data are read from RAMa and the comparator starts comparing; at the same time, w_full triggers the template update module to start the next pixel's MAD operation, whose data are written into RAMb. Thus, while data are read from RAMa, writing into RAMb proceeds. RAMa and RAMb form a ping-pong memory through a 2 × 2 exchange switch. A two-level pipeline is thereby formed to improve overall system efficiency.
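The RAMa/RAMb ping-pong can be modeled as a two-bank buffer whose read and write roles swap whenever w_full is asserted; the struct and function names below are illustrative, not taken from the design.

```c
#include <assert.h>

/* Model of the ping-pong memory behind the 2x2 exchange switch:
 * while the comparator reads one pixel's distances from one bank,
 * the MAD unit writes the next pixel's distances into the other.
 * Roles swap each time w_full is asserted. */
typedef struct {
    unsigned *bank[2];   /* RAMa and RAMb */
    int write_sel;       /* bank currently written by the MAD unit */
} pingpong_t;

unsigned *write_bank(pingpong_t *pp) { return pp->bank[pp->write_sel]; }
unsigned *read_bank(pingpong_t *pp)  { return pp->bank[1 - pp->write_sel]; }
void on_w_full(pingpong_t *pp)       { pp->write_sel ^= 1; } /* swap roles */
```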
4 Experimental results and analysis
Our system analyzed face reconstruction using three sizes of images, 8 × 6, 16 × 12, and 32 × 24. The
magnification can be 4×4, 8×8 or 16×16 times. Figure 6 illustrates the reconstruction results for 32×24, 16×12 and 8×6 images, respectively, with a magnification of 16×16. We can see that the reconstruction
results for our system are satisfactory in terms of visual reconstruction effect. In Table 1 we compare
the RMS error [11] (given in Eq. (5)) for our method with those for cubic interpolation and Baker's method [2]. The RMS value is calculated as the average over 100 test face images with magnification of 4 × 4 and
16× 16 for 32× 24 input images using Eq. (5).
RMS = sqrt( (1/(M × N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} ||f(x, y) − f̂(x, y)||^2 ). (5)
Figure 6 Reconstruction results. (a) Low resolution image; (b) magnified directly; (c) our algorithm; (d) original high
resolution image.
Table 1 Mean RMS for 32 × 24 input images, 100 test samples, at 4 × 4 and 16 × 16 magnification
Cubic interpolation Baker method Our method
Mean RMS (4×4) 20.0786 18.6954 11.2721
Mean RMS (16×16) 21.5335 21.0397 13.2362
Table 2 NIPC-3 compared with Intel x64 with four CPU cores, 2.83 GHz main frequency and 8 GB RAM (C)
4×4 times (s) 8×8 times (s) 16×16 times (s)
NIPC-3 C Speedup factor NIPC-3 C Speedup factor NIPC-3 C Speedup factor
32×24 (500 samples) 0.266 2115.4 7952.6 0.485 2122.9 4377.1 1.183 2146.2 1814.2
16×12 (600 samples) 0.047 195.4 4157.4 0.063 196.0 3111.1 0.094 196.9 2094.7
8×6 (600 samples) 0.016 21.4 1337.5 0.018 21.5 1194.4 0.022 21.7 986.4
Table 3 32× 24 FPGA hardware resource utilization
Device EP2C70F896C6
Total logic elements 47669/68416 (70%)
Total registers 7724
Total pins 437/622 (70%)
Total memory bits 120832/1152000 (10%)
Embedded Multiplier 9-bit elements 0/300 (0%)
Total PLLs 1/4 (25%)
Our system achieves high speed processing while ensuring a good visual effect. Because ultra-low resolution face reconstruction is still at the algorithm research stage, worldwide reports of face SR systems have been sparse, and the processing speed of SR algorithms is seldom reported.
We only compare the NIPC-3 performance with the host computer. The configuration parameters for
the host computer are an Intel x64 with four CPU cores, 2.83 GHz main frequency and 8 GB RAM.
Code for comparing face reconstruction algorithms on the host computer was written in C. Table 2 gives
a comparison of the time consumption for a 16×16 magnification. For example, reconstructing a 32×24
input image would take 2146.2 s using the host computer (about 35 min), whereas it only takes 1.183 s
using the NIPC-3 system. The speedup factor is 1814.2. Reconstructing a 16×12 input image needs only
0.094 s using the NIPC-3 system, with a speedup factor of 2094.7. Such a significant improvement in the
reconstruction speed has actually made it possible for the SR reconstruction to be applied in real-time
applications such as intelligent surveillance systems.
Table 3 gives the resource utilization when reconstructing 32× 24 face images.
5 Discussion
We introduced a pixel compensation method, which contributes to the good visual effect of the reconstruction result and decreases the RMS error. Meanwhile, we have solved the problem of processing large
amounts of data at high speed through the parallel storage and processing of the NIPC-3. For a 32× 24
input image, the system completes a reconstruction on the order of one second using 500 training samples. To further improve the speed, we can decrease the number of training samples or reduce the magnification to satisfy real-time requirements, provided that the reconstruction results are not obviously affected. The system
combining software with hardware performs slower than a pure hardware system. However, when the
whole system satisfies real-time constraints, the microprocessor has the role of controlling the algorithmic
process. The whole system benefits from the speed advantage of the hardware and the flexibility of the
microprocessor.
Acknowledgements
This work was supported by National Basic Research Program of China (Grant No. 2007CB310600), and Key
Research Foundation of Public Security (Grant No. 2005ZDGGQHDX005).
References
1 Baker S, Kanade T. Limits on super-resolution and how to break them. Comput Vis Pattern Recogn, 2000, 9: 372–379
2 Baker S, Kanade T. Hallucinating faces. In: Proceedings of the 4th IEEE International Conference on Automatic Face
and Gesture Recognition. America: IEEE Inc., 2000. 83–88
3 Liu C, Shum H Y, Zhang C S. A two-step approach to hallucinating faces: global parametric model and local non-
parametric model. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition Kauai Marriott. Piscataway: IEEE Press, 2001. 192–198
4 Wang X G. Hallucinating face by eigentransformation. IEEE Trans Syst Man Cy C, 2005, 35: 425–434
5 Su G D, Liu J X, Shang Y, et al. Theory and application of image neighborhood parallel processing. In: IEEE 16th
International Conference on Image Processing (ICIP2009). Cairo: IEEE Publisher, 2009. 2313–2316
6 Su G D. New Neighborhood function pipeline structures in neighborhood image processor. Acta Electron Sin, 2000,
28: 1–4
7 Freeman W T, Jones T R, Pasztor E C. Example-based super-resolution. IEEE Comput Graph, 2002, 22: 56–65
8 Yang D Q, Su G D, Xu T W. A new technique for face image resolution enhancement based on face hallucination. In:
Proceeding of the 27th Chinese Control Conference. Kunming: IEEE Publisher, 2008. 477–481
9 Chen B Y. The Development of a Large Neighborhood Image Processing System. Beijing: Tsinghua University, 2006.
1–50
10 Su G D. Image Parallel Processing Technology. Beijing: Tsinghua University Press, 2001. 65–70
11 Chen W F, Liu C H, Lander K. Comparison of human face matching behavior and computational image similarity
measure. Sci China Ser F-Inf Sci, 2009, 52: 316–321