RESEARCH PAPER
SCIENCE CHINA Information Sciences
September 2012 Vol. 55 No. 9: 2102–2108
doi: 10.1007/s11432-011-4457-7
© Science China Press and Springer-Verlag Berlin Heidelberg 2012 info.scichina.com www.springerlink.com
High-speed reconstruction for ultra-low resolution faces
WANG Li∗, CHEN JianSheng, HE JinPing & SU GuangDa
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Received September 13, 2010; accepted May 27, 2011; published online April 12, 2012
Abstract In this paper, a learning-based high-speed reconstruction system for ultra-low resolution faces is implemented using a software/hardware co-design paradigm. The hardware component, working at 60 MHz, contains a field programmable gate array, which is reconfigured to contain parallel processing units, and multiple memories that provide parallel data. The hardware component effectively handles the generation and sorting of the computationally intensive similarity metrics, which solves the processing speed problem in learning-based super-resolution reconstruction for ultra-low resolution faces. The system can reconstruct faces from 8 × 6, 16 × 12, and 32 × 24 sized images, with 4 × 4, 8 × 8, or 16 × 16 times magnification. The experimental results verify the effectiveness of our system in terms of both visual effect and low root mean square errors. The processing speed is improved by a factor of up to 7900 compared with a pure software implementation in C.
Keywords neighborhood image parallel computer, template matching, pixel compensation, pipeline
Citation Wang L, Chen J S, He J P, et al. High-speed reconstruction for ultra-low resolution faces. Sci China
Inf Sci, 2012, 55: 2102–2108, doi: 10.1007/s11432-011-4457-7
1 Introduction
Super-resolution (SR) reconstruction for ultra-low resolution faces has been a hot research topic in recent
years. It has both significant academic value and broad application prospects. For example, in video
sequences captured by traditional surveillance systems, human faces are usually extremely small in size.
This is generally caused by the wide field of view requirement in most surveillance scenarios. These face
images contain only hundreds or even only several tens of effective pixels, which means that they may not
be visually distinguishable even by human inspection. SR reconstruction for face images is a promising
solution to this problem. Face SR is the first step in a face recognition system and has strict requirements
in terms of reconstruction speed and visual effect.
Aiming at the SR reconstruction of ultra-low resolution face images, Baker and Kanade [1,2] developed
a learning based hallucination method by introducing high frequency information extracted from training
data to the reconstruction process to achieve reasonable visual reconstruction effects. Liu et al. developed
a two-step statistical approach integrating global and local models [3] to further improve the hallucination
method. Wang et al. obtained a face shape and texture model using principal component analysis [4].
Nevertheless, all the above reconstruction algorithms employ highly complex calculations. Moreover,
speed bottlenecks occur if the methods are implemented on microprocessors or DSPs. Thus, the SR
∗Corresponding author (email: [email protected])
technique is seldom applied in practice. To solve this problem, we propose an SR reconstruction algorithm
that is able to generate highly realistic reconstruction results with low root mean square (RMS) errors
and is simple in structure and suitable for parallel implementation on a field programmable gate array
(FPGA).
Our work in face SR reconstruction is based on the NIPC-3 neighborhood image parallel computer,
the PCB board of which has been successfully developed in our laboratory [5]. The NIPC-3 platform
adopts a PCI interface for communicating with the host computer. The core computational component of
the NIPC-3 on which our SR reconstruction implementation resides is a Cyclone II FPGA manufactured by Altera. In addition, the NIPC-3 also includes SRAMs, a video A/D
component, CPLD and PCI interface chips. The results acquired from the NIPC-3 can be transferred
to the host PC through a PCI interface for continued processing. Thus, this is a software/hardware
co-design system. The essential idea of the NIPC-3 platform is parallel processing of neighborhood data
in images. A recent domestic peer evaluation of the NIPC-3 platform concluded that in terms of the
size of the neighborhood image core and processing speed of neighborhood data, the NIPC-3 is superior
to other similar hardware systems reported worldwide [5,6]. The NIPC-3 usually acts as an application
specific computation acceleration unit for processing the most time consuming tasks. At the same time,
the host PC is responsible for controlling the algorithm flow so that a balance can be achieved between
efficiency and flexibility. Such a design paradigm is adopted in our implementation.
The rest of this paper is organized as follows. Section 2 presents our face SR reconstruction algorithm.
The hardware implementation of our system is discussed in Section 3, with experimental results presented
in Section 4. The last section concludes our work.
2 Face super-resolution algorithm
The proposed SR reconstruction algorithm is illustrated in Figure 1. Symbol ai represents the i-th low
resolution (LR) face image in the training set, which contains a number of high resolution (HR) training
images aai and the corresponding LR training images ai. Each LR image is down-sampled from a HR
image. The mean of the image patch corresponds to a pixel point in the LR image. b3×3(x, y) is the 3×3
local image block centered at pixel b(x, y) in target LR image b, which can be compared with the LR
training set. Then, similarity measurements d^{a_1}_1 d^{a_1}_2 d^{a_1}_3 · · · d^{a_1}_N, d^{a_2}_1 d^{a_2}_2 d^{a_2}_3 · · · d^{a_2}_N, . . . , d^{a_M}_1 d^{a_M}_2 d^{a_M}_3 · · · d^{a_M}_N
are obtained. Furthermore, we can acquire the minimum feature distance di in the i-th training image.
There are M training samples, so we can get M feature distances. After sorting the M feature distances,
the five closest distances can be found and the corresponding HR image patches aai(Ak) are recorded.
The next step is linear combination of the HR patches. Finally, a pixel compensation to fuse the HR
patch is implemented. The above operation is repeated until all the pixels in the target LR image have
been calculated. In this way, we can construct an HR face image.
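As a concrete illustration, the LR training images can be generated by averaging each s × s HR patch into one LR pixel, which is the construction described above. The following C sketch makes that assumption explicit; the function name and row-major 8-bit gray buffer layout are ours, not from the paper.

```c
#include <stdint.h>
#include <assert.h>

/* Build the LR training image by averaging each s x s HR patch into
 * one LR pixel, so that each LR pixel equals the mean of its HR
 * patch. Row-major 8-bit gray layout; names are illustrative. */
void downsample_mean(const uint8_t *hr, int hr_w, int hr_h,
                     uint8_t *lr, int s)
{
    int lr_w = hr_w / s, lr_h = hr_h / s;
    for (int y = 0; y < lr_h; y++) {
        for (int x = 0; x < lr_w; x++) {
            unsigned sum = 0;
            for (int dy = 0; dy < s; dy++)
                for (int dx = 0; dx < s; dx++)
                    sum += hr[(y * s + dy) * hr_w + (x * s + dx)];
            /* each LR pixel is the patch mean */
            lr[y * lr_w + x] = (uint8_t)(sum / (unsigned)(s * s));
        }
    }
}
```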
Three aspects of the algorithm are described in this paper, namely, the template matching process
including the similarity measurement and sorting, high resolution image patch fusion, and pixel compensation.
2.1 Template matching process
In natural images, neighboring pixels are highly correlated and the local image structure can be used to
enhance the reliability of the super-resolution result [7]. When comparing b(x, y) with the training set, a
3×3 local image block b3×3(x, y) centered at b(x, y) can be considered as the template for a global search
across all training samples. Then, the nearest k 3× 3 image blocks in the LR training images are found.
Finally, the reconstructed HR patch is fused from the HR patches corresponding to the center pixels of the matched local image blocks. The parameter k is set to 5 in our paper. In our hardware implementation, the 3 × 3
local image block of a different pixel is considered as a different template during the matching process,
hence, template updates correspond to pixel updates. The similarity measurement between two image
blocks is defined in equation (1), where b3×3(x, y) is the local image block centered at b(x, y) in the low
Figure 1 Proposed face SR reconstruction algorithm.
resolution image to be constructed, and ai3×3(m,n) is the 3×3 local image block centered at ai(m,n) of the
i-th training sample. The similarity measurement between two image blocks is acquired by calculating the
accumulated sum of the absolute difference values. Equation (1) is called the MAD algorithm. Multipliers
are usually critical resources in FPGAs. However, if MAD were used to measure the similarity of two
image blocks instead of multipliers, the limitation on multiplier resources would not be a problem.
MAD = Σ |b3×3(x, y) − ai3×3(m, n)|. (1)
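A software model of the measurement in Eq. (1) is a short loop; the hardware computes the same accumulated absolute difference in parallel. The function name and row-major 9-element block layout below are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>

/* Similarity measurement of Eq. (1): accumulated absolute difference
 * between two 3 x 3 blocks given as row-major 9-element arrays.
 * Summing |a - b| needs no multipliers, which is why the MAD form
 * suits an FPGA with scarce multiplier resources. */
unsigned mad3x3(const uint8_t b[9], const uint8_t a[9])
{
    unsigned sum = 0;
    for (int i = 0; i < 9; i++)
        sum += (unsigned)abs((int)b[i] - (int)a[i]);
    return sum;
}
```

Identical blocks give a distance of 0; smaller values mean higher similarity.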
The template matching process is the most time consuming component in the whole algorithm, with
C simulation results showing that it consumes over 98% of the total running time. Therefore, we implemented the template matching process using the NIPC-3, while other parts of the algorithm are
implemented using software on the host computer considering that their time consumption is much less
significant. NIPC-3 can provide parallel data and parallel processing and has characteristics of highly
parallel computing. Thus, it can effectively solve the problem of fast template matching.
2.2 Fusion of HR patches
LR image pixels correspond to HR image patches. The five closest pixels corresponding to the target
pixel are found in training samples. This is equivalent to finding five image patches in the HR training
samples. The HR patch size is dependent on the magnification. If the magnification is s×s, the HR patch
size is also s× s. Fusion of the HR patches occurs according to Eqs. (2) and (3), where aai(A) is the HR
patch of the training samples, aa(A) is the fusion patch, and Wi is the combination weight. ai(m,n) is
the mean of aai(A), which is also equal to the pixel value of the LR training image, and a(m,n) is the
fusing pixel value. The closer the feature distance, the greater the corresponding weight. The weights can be calculated by the least squares method [8].
aa(A) = Σ_{i=1}^{5} Wi × aai(A), (2)

a(m,n) = Σ_{i=1}^{5} Wi × ai(m,n). (3)
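Eqs. (2) and (3) amount to a per-pixel weighted sum over the five selected HR patches. A minimal C sketch, assuming the weights have already been computed (e.g. by least squares) and sum to 1; the patch storage layout and names are our assumptions:

```c
#include <assert.h>

/* Weighted fusion of the k = 5 nearest HR patches, Eq. (2).
 * Patches are s x s, row-major; w[i] are precomputed combination
 * weights assumed to sum to 1. Applying the same loop to the five
 * LR pixel values gives Eq. (3). */
void fuse_patches(const double *patches[5], const double w[5],
                  double *out, int s)
{
    for (int p = 0; p < s * s; p++) {
        double acc = 0.0;
        for (int i = 0; i < 5; i++)
            acc += w[i] * patches[i][p];   /* Eq. (2), per pixel */
        out[p] = acc;
    }
}
```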
2.3 Pixel compensation
The HR fusion patch aas×s differs somewhat from the original HR patch. Our algorithm is designed to
make fusion patch bbs×s approximate the original HR patch. As shown in Figure 2, the mean of the HR
patch aa(A) corresponds to the LR image pixel a(m,n), so the mean of the reconstruction patch bb(A)
Figure 2 Pixel compensation schemes.
should be equal to pixel b(x, y). If aa(A) replaced bb(A) directly, a(m,n) would in general not be equal to b(x, y). If aa(A) is compensated using the difference between a(m,n) and b(x, y), this would ensure
that the mean of the reconstructed HR patch would equal the target pixel value. The pixel compensation
is carried out according to Eq. (4).

bb(A) = aa(A) + (b − a). (4)
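Eq. (4) adds the same scalar offset b − a to every pixel of the fused patch, so the patch mean shifts exactly onto the target LR pixel value. A C sketch (function and parameter names are illustrative):

```c
#include <assert.h>

/* Pixel compensation, Eq. (4): add the offset (b - a) to every pixel
 * of the fused s x s patch, where a is the fused pixel value a(m,n)
 * and b is the target LR pixel b(x,y). After this, the patch mean
 * equals b. */
void compensate_patch(double *patch, int s, double a, double b)
{
    double diff = b - a;
    for (int p = 0; p < s * s; p++)
        patch[p] += diff;
}
```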
3 Hardware implementation of face reconstruction
Figure 3 shows the architecture of the NIPC-3 platform. The shared memory consists of two SRAM
chips, which serve as the interface for data communication among the three subsystems (FPGA, DSP, and
PCI). After neighborhood processing, the results can be sent to the DSP for further processing, or be
transferred to the host computer through the PCI bus [9]. Data are transferred from the shared memory
to the neighborhood image frame memory. Then through neighborhood generation in line order [5,10],
data are transferred from a serial format to a parallel format for high speed processing. Neighborhood
image frame memory consists of 4 SRAM chips of type IS61SP6464. The four memory chips can store
64-bit data in each storage unit. For gray images, each pixel takes up 8 bits and each memory unit
stores 8 neighboring pixels in the same line. Therefore, 32 pixels on the same line can be accessed within
one read of the memory chip through pipelining. Our system forms 32× n (line×column) neighborhood
data. In our algorithm, n is set to 3. We use one FPGA chip of type EP2C70F896C6 to implement
neighborhood image processing, with the result of the process transferred to shared memory.
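The storage layout described above can be modeled in software: each 64-bit storage unit holds 8 neighboring 8-bit gray pixels from the same line, so the four parallel SRAM chips deliver 32 pixels per read. The byte order within a word below is our assumption, not specified in the paper.

```c
#include <stdint.h>
#include <assert.h>

/* Software model of the frame-memory word layout: pack 8 neighboring
 * 8-bit gray pixels from one line into a 64-bit storage unit, and
 * extract pixel i back out. Byte order is an assumption. */
uint64_t pack_pixels(const uint8_t px[8])
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)px[i] << (8 * i);
    return w;
}

uint8_t pixel_from_word(uint64_t w, int i)  /* i in 0..7 */
{
    return (uint8_t)(w >> (8 * i));
}
```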
3.1 MAD module
Figure 4 presents the internal modules for template matching. The neighborhood image frame memory
stores all training sample data in line order. A section of the FPGA internal memory is used as the
template RAM which stores the input LR image.
The corresponding 3 × 3 neighborhood data are needed to perform a global search in each training
sample using the MAD operation to find the nearest feature distance di. We obtain M feature distances
when retrieving M training samples. The key in the design of the hardware is to find the closest five
points in the M feature distances, and then to transfer these to the host for further processing.
3.2 Comparator and pipeline
The function of the module comparator is to find the five closest points in the M feature distances
obtained by the MAD module. We need to search the M feature distances once to obtain a maximum
2106 Wang L, et al. Sci China Inf Sci September 2012 Vol. 55 No. 9
Figure 3 Architecture of the NIPC-3 platform. Figure 4 Template matching module.
Figure 5 Pipeline design.
or minimum value. Thus to obtain the five closest points, we need to search the M feature distances
five times. In order to use the M feature distance data repeatedly, internal RAM in FPGA is needed to
store the M feature distance data temporarily. RAMa and RAMb in Figure 4 complete this function.
In our algorithm, the template update control signal is not driven by the completion signal from the comparator. Instead, it is driven by the completion signal of the MAD operation, which indicates that the training samples have been searched once. Then, construction of the next pixel begins.
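The comparator's selection strategy, five sequential passes over the M feature distances with one minimum extracted per pass, can be sketched in C as follows. The function name is ours; already-selected entries are skipped in software where the hardware would mask them out. The sketch assumes m ≥ 5.

```c
#include <assert.h>

/* Software model of the comparator: select the indices of the five
 * smallest of the m feature distances by scanning the array five
 * times, one minimum per pass, as the hardware does. */
void select_five_min(const unsigned *d, int m, int idx[5])
{
    char used[m];                       /* C99 VLA marking selections */
    for (int i = 0; i < m; i++) used[i] = 0;
    for (int k = 0; k < 5; k++) {       /* one pass per minimum */
        int best = -1;
        for (int i = 0; i < m; i++)
            if (!used[i] && (best < 0 || d[i] < d[best]))
                best = i;
        idx[k] = best;
        used[best] = 1;                 /* exclude from later passes */
    }
}
```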
Figure 5 illustrates the design of the pipeline, where signal D(MAD) refers to the delay of the MAD module, with the other symbols defined analogously. Data from the MAD module are written to RAMa. Signal w_full is set once the M data have been written, after which the data are read from RAMa and the comparator starts comparing; at the same time, w_full triggers the template update module to start the next pixel's MAD operation, whose data are written into RAMb. Thus, while data are read from RAMa, writing into RAMb proceeds. RAMa and RAMb form a ping-pong memory through a 2 × 2 exchange switch. A two-level pipeline is thereby formed to improve overall system efficiency.
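The RAMa/RAMb ping-pong can be modeled as a two-bank buffer whose read and write roles swap whenever w_full is asserted; the struct and function names below are illustrative, not taken from the design.

```c
#include <assert.h>

/* Model of the ping-pong memory behind the 2x2 exchange switch:
 * while the comparator reads one pixel's distances from one bank,
 * the MAD unit writes the next pixel's distances into the other.
 * Roles swap each time w_full is asserted. */
typedef struct {
    unsigned *bank[2];   /* RAMa and RAMb */
    int write_sel;       /* bank currently written by the MAD unit */
} pingpong_t;

unsigned *write_bank(pingpong_t *pp) { return pp->bank[pp->write_sel]; }
unsigned *read_bank(pingpong_t *pp)  { return pp->bank[1 - pp->write_sel]; }
void on_w_full(pingpong_t *pp)       { pp->write_sel ^= 1; } /* swap roles */
```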
4 Experimental results and analysis
Our system analyzed face reconstruction using three sizes of images, 8 × 6, 16 × 12, and 32 × 24. The
magnification can be 4×4, 8×8 or 16×16 times. Figure 6 illustrates the reconstruction results for 32×24, 16×12 and 8×6 images, respectively, with a magnification of 16×16. We can see that the reconstruction
results for our system are satisfactory in terms of visual reconstruction effect. In Table 1 we compare
the RMS error [11] (given in Eq. (5)) for our method with those for cubic interpolation and Baker's method [2]. The RMS value is calculated as the average over 100 test face images with magnification of 4 × 4 and
16× 16 for 32× 24 input images using Eq. (5).
RMS = sqrt( (1/(M × N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} ||f(x, y) − f̂(x, y)||^2 ). (5)
Figure 6 Reconstruction results. (a) Low resolution image; (b) magnified directly; (c) our algorithm; (d) original high
resolution image.
Table 1 Mean RMS for 32 × 24 input images, 100 test samples, at 4 × 4 and 16 × 16 magnification
Cubic interpolation Baker method Our method
Mean RMS (4×4) 20.0786 18.6954 11.2721
Mean RMS (16×16) 21.5335 21.0397 13.2362
Table 2 NIPC-3 compared with Intel x64 with four CPU cores, 2.83 GHz main frequency and 8 GB RAM (C)
4×4 times (s) 8×8 times (s) 16×16 times (s)
NIPC-3 C Speedup factor NIPC-3 C Speedup factor NIPC-3 C Speedup factor
32×24 (500 samples) 0.266 2115.4 7952.6 0.485 2122.9 4377.1 1.183 2146.2 1814.2
16×12 (600 samples) 0.047 195.4 4157.4 0.063 196.0 3111.1 0.094 196.9 2094.7
8×6 (600 samples) 0.016 21.4 1337.5 0.018 21.5 1194.4 0.022 21.7 986.4
Table 3 32× 24 FPGA hardware resource utilization
Device EP2C70F896C6
Total logic elements 47669/68416 (70%)
Total registers 7724
Total pins 437/622 (70%)
Total memory bits 120832/1152000 (10%)
Embedded Multiplier 9-bit elements 0/300 (0%)
Total PLLs 1/4 (25%)
Our system achieves high speed processing while ensuring a good visual effect. Because ultra-low resolution face reconstruction is still at the algorithm research stage, worldwide reports of face SR systems have been sparse, and the processing speed of SR algorithms is seldom reported.
We only compare the NIPC-3 performance with the host computer. The configuration parameters for
the host computer are an Intel x64 with four CPU cores, 2.83 GHz main frequency and 8 GB RAM.
Code for comparing face reconstruction algorithms on the host computer was written in C. Table 2 gives
a comparison of the time consumption for a 16×16 magnification. For example, reconstructing a 32×24
input image would take 2146.2 s using the host computer (about 35 min), whereas it only takes 1.183 s
using the NIPC-3 system. The speedup factor is 1814.2. Reconstructing a 16×12 input image needs only
0.094 s using the NIPC-3 system, with a speedup factor of 2094.7. Such a significant improvement in the
reconstruction speed has actually made it possible for the SR reconstruction to be applied in real-time
applications such as intelligent surveillance systems.
Table 3 gives the resource utilization when reconstructing 32× 24 face images.
5 Discussion
We introduced a pixel compensation method, which contributes to the good visual effect of the reconstruction result and decreases the RMS error. Meanwhile, we have solved the problem of processing large
amounts of data at high speed through the parallel storage and processing of the NIPC-3. For a 32× 24
input image, the system completes a reconstruction on the order of one second using 500 training samples. To further improve the speed, we can decrease the number of training samples or reduce the magnification to satisfy real-time requirements, provided that the reconstruction results are not obviously affected. The system
combining software with hardware performs slower than a pure hardware system. However, when the
whole system satisfies real-time constraints, the microprocessor has the role of controlling the algorithmic
process. The whole system benefits from the speed advantage of the hardware and the flexibility of the
microprocessor.
Acknowledgements
This work was supported by National Basic Research Program of China (Grant No. 2007CB310600), and Key
Research Foundation of Public Security (Grant No. 2005ZDGGQHDX005).
References
1 Baker S, Kanade T. Limits on super-resolution and how to break them. Comput Vis Pattern Recogn, 2000, 9: 372–379
2 Baker S, Kanade T. Hallucinating faces. In: Proceedings of the 4th IEEE International Conference on Automatic Face
and Gesture Recognition. America: IEEE Inc., 2000. 83–88
3 Liu C, Shum H Y, Zhang C S. A two-step approach to hallucinating faces: global parametric model and local non-
parametric model. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition Kauai Marriott. Piscataway: IEEE Press, 2001. 192–198
4 Wang X G. Hallucinating face by eigentransformation. IEEE Trans Syst Man Cy C, 2005, 35: 425–434
5 Su G D, Liu J X, Shang Y, et al. Theory and application of image neighborhood parallel processing. In: IEEE 16th
International Conference on Image Processing (ICIP2009). Cairo: IEEE Publisher, 2009. 2313–2316
6 Su G D. New Neighborhood function pipeline structures in neighborhood image processor. Acta Electron Sin, 2000,
28: 1–4
7 Freeman W T, Jones T R, Pasztor E C. Example-based super-resolution. IEEE Comput Graph, 2002, 22: 56–65
8 Yang D Q, Su G D, Xu T W. A new technique for face image resolution enhancement based on face hallucination. In:
Proceeding of the 27th Chinese Control Conference. Kunming: IEEE Publisher, 2008. 477–481
9 Chen B Y. The Development of a Large Neighborhood Image Processing System. Beijing: Tsinghua University, 2006.
1–50
10 Su G D. Image Parallel Processing Technology. Beijing: Tsinghua University Press, 2001. 65–70
11 Chen W F, Liu C H, Lander K. Comparison of human face matching behavior and computational image similarity
measure. Sci China Ser F-Inf Sci, 2009, 52: 316–321