
Low Resolution Face Recognition in the Wild

Pei Li1, Loreto Prieto2, Domingo Mery2, Patrick Flynn1

1Department of Computer Science and Engineering, University of Notre Dame

2Department of Computer Science, Pontificia Universidad Catolica de Chile

Abstract—Although face recognition systems have achieved impressive performance in recent years, the low-resolution face recognition (LRFR) task remains challenging, especially when the LR faces are captured under non-ideal conditions, as is common in surveillance-based applications. Faces captured in such conditions are often contaminated by blur, nonuniform lighting, and nonfrontal face pose. In this paper, we analyze face recognition techniques using data captured under low-quality conditions in the wild. We provide a comprehensive analysis of experimental results for two of the most important tasks in real surveillance applications, and demonstrate practical approaches to handle both cases that show promising performance. The following three contributions are made: (i) we conduct experiments to evaluate super-resolution methods for low-resolution face recognition; (ii) we study face re-identification on various public face datasets including real surveillance and low-resolution subsets of large-scale datasets, present a baseline result for several deep learning based approaches, and improve them by introducing a GAN pre-training approach and fully convolutional architecture; and (iii) we explore low-resolution face identification by employing a state-of-the-art supervised discriminative learning approach. Evaluations are conducted on challenging portions of the SCface and UCCSface datasets.

I. INTRODUCTION

In recent years, we have witnessed tremendous improvements in face recognition system performance, especially when employing specially designed deep learning architectures. Some state-of-the-art methods based on deep learning models, including [47], [44], [36], [48], [54] and [30], have achieved an accuracy over 99 percent on public face datasets such as LFW [14]. Although these algorithms can deal effectively with faces with significant pose variations, these faces generally need to be large in area. Also, pre-processing techniques such as face frontalization and face alignment are needed. These procedures (which are designed for high-resolution face image data) cannot be applied directly to low-quality face images. The use of surveillance systems in a wide range of public places is increasing, creating a challenging use case for face recognition in an environment where detected faces will be low in resolution. The research in this paper is motivated by this need. Practical face recognition systems for images captured in surveillance scenarios can address face identification tasks (using a watch list) and face re-identification tasks (where a subject is matched to a previous appearance in a surveillance system).

This work is based in part on prior work published in [24][23] but presents many new contributions. In the original work [24], we focused on prototyping lightweight deep neural networks for LR face re-identification and showed that LR faces can be used for person re-identification. We used data from a real surveillance system, contributed a dataset called VBOLO and conducted an extensive study on it for the person re-identification task. In this work, we address the LR face recognition task in a larger scope, which includes qualitatively exploring the gap between controlled facial images and video-quality face images using super-resolution techniques, face identification and face re-identification. For face re-identification, we focus on boosting the performance reported in [24] and extend our evaluation by introducing additional datasets, namely SCface, UCCSface and Megaface. The following is a brief road-map to this paper:

In Section II, we provide a literature review and summary descriptions of several newly released real surveillance datasets, collected from operational surveillance camera networks in checkpoints at public transportation stations or inside buildings. In Section III-A, we present several baseline face recognition results obtained by evaluating state-of-the-art super-resolution methods [28], [59]. A summary of the performance of super-resolution augmented face recognition techniques employing low-resolution face inputs from two popular face datasets (AR [32] and YouTube Faces (YTF) [55]) is also presented. The performance gap between recognition of faces captured in a controlled environment and recognition of faces captured in the wild is shown. In Section III-B, we address the low-resolution face identification task employing center regularization. The goal of this work is to learn a common feature space that locates low-resolution and high-resolution face images of the same subject as close as possible in feature space, without generating thousands of pairs for training. By conducting comprehensive evaluations on two different datasets, we are able to show the performance gap between open-set and closed-set face identification when employing low-resolution inputs. In Section III-C, we first summarize the findings of our previous work [24] based on a real surveillance dataset named VBOLO. Different deep architectures are designed and evaluated on LR surveillance face images as a baseline, and are then improved by introducing fully convolutional spatial pyramid pooling (SPP). To test how well these results generalize to larger-scale data, we further study face re-identification on several surveillance and low-resolution datasets from videos or images collected online.

arXiv:1805.11529v1 [cs.CV] 29 May 2018

In addition, we present a novel DCGAN-based pre-training approach to boost the performance further.

In Section IV, concluding remarks are provided.

II. RELATED WORKS AND DATASETS

A. Low-resolution Face Recognition

The low-resolution (LR) face recognition task is a subset of the general face recognition problem. One of the most useful application scenarios for this task is recognition from surveillance systems. In this scenario, faces are captured in the wild from cameras with a large standoff, positioned above head height, and sometimes under challenging lighting conditions. Even if the faces could be detected by a specially trained face detector, it is hard to construct a robust feature representation due to the lack of information embedded in the image itself. Although high-resolution (HR) face recognition systems have achieved nearly perfect performance on several datasets, LR face recognition remains a challenging problem. Approaches to this problem in the literature follow two main themes. Some techniques employ super-resolution (SR) and deblurring techniques to increase the input LR face size to a point where an HR face recognition technique may work well. Another popular approach is to learn a unified feature space for LR and HR face images, within which feature vector distance plays its typical role as a matching score.

SR techniques are widely used to turn LR images into better quality images. Face SR approaches are thus an intuitive way to recover LR face images for recognition. Hennings-Yeomans et al. [12] included previously extracted face features as measures of fit of the SR result, and performed SR from both reconstruction and recognition perspectives. Yang et al. [59] proposed a joint dictionary training method for general SR that employs both LR and HR image patches. Zou et al. [67] designed a linear regression model with two elements (namely, new data and discriminative constraints) to learn the mapping. Jiang et al. [15] proposed a coarse-to-fine face SR approach via a multi-layer locality-constrained iterative neighbor embedding. Kolouri et al. [21] introduced a single frame SR technique that uses a transport-based formulation of this problem. Wang et al. [53] deployed deep learning pre-training with a carefully selected loss function to achieve SR for matching between LR and HR face images, and achieved state-of-the-art performance on a new surveillance dataset.

Another approach to solving the LR face recognition problem is to seek a unified feature space that preserves proximity between faces of different resolutions. Li et al. [22] proposed a coupled mapping method that projects face images with different resolutions into a unified feature space. Biswas et al. [3] simultaneously embedded the LR and HR faces in a common space such that the distance between them in the transformed space approximates the distance between two HR face images of the same subject. Ren et al. [42] used a coupled mapping strategy with both HR and LR counterparts for learning the projection directions as well as exploiting discriminant information. Shekhar et al. [45] proposed robust dictionary learning for LR face recognition that shares common sparse codes. Qiu et al. [39] tried to learn a domain adaptive dictionary to handle the matching of two faces captured in source and target domains. Li et al. [23] proposed several shallow network structures to learn a latent space between LR and HR images, and evaluated the proposed methods on a new surveillance dataset.

There are also some other methods, such as [41] and [13], which explore more robust features directly to improve the recognition rate in blurry and degraded face images. In addition, methods like those proposed in [10], [34] and [35] work to restore the upsampled LR images using deblurring techniques.

Most of the approaches mentioned above are evaluated on low-quality versions of constrained face datasets like Multi-PIE, FERET or FRGC. The images are created by directly down-sampling or blurring original images. However, the LR face recognition problem becomes a challenge when faces are captured in an unconstrained environment. In such cases, the LR face recognition problem needs to be explored more extensively.

B. Face Re-Identification

Here we briefly introduce person re-identification (ReID) and our motivation for face re-identification. A typical end-to-end ReID system generally involves the following steps: person detection, preprocessing, feature extraction and matching. ReID is widely used for surveillance purposes and is modeled as follows: given a pair of images, the system must decide whether the images are from the same person or not. Often, the pictures are captured from different cameras in a surveillance network. Traditional approaches to the ReID problem focus on two main components: feature extraction and similarity computation for matching. We can divide most of the existing methods into two categories: methods employing deep learning, and those without. For the non-deep-learning methods, researchers have proposed and used different handcrafted features such as symmetry-driven accumulation of local features (SDALF) [2], color histograms [57], [64], color names [60], local binary patterns [18], aggregation of patch summary features [65], metric learning approaches [57], [18], [20], [26], [31], [27], [57], [16], [5], [63] and various combinations of these. Several deep learning approaches use “Siamese” deep convolutional neural networks [61], [51], [52], [29] for feature extraction and metric learning at the same time, providing novel end-to-end solutions. Ahmed et al. [1] proposed a new deep learning framework based on the idea from Yi et al. [61], where two novel layers are employed for computing cross-input neighborhood differences by integrating local relationships based on mid-level features. They additionally showed that the features acquired from the head and neck could be an important clue for person re-identification. Wu et al. [56] improved performance based on Ahmed’s idea by using a deeper architecture and a new optimization method. Other deep network structures such as [50] and [46] have also been designed and effectively solved the ReID problem on older ReID datasets. Qiu et al. [39] attempted to perform facial ReID by using domain adaptation methods to reconcile different facial poses; however, their experiments were performed on the Multi-PIE [9] dataset, in which face images have controlled poses and illuminations.


Although an increasing number of surveillance cameras have been deployed in public areas, the quality of the video frames is usually low and people captured in the frame appear under uncontrolled pose and illumination conditions. Thus, general person re-identification can be a challenging task. As [4] shows, body and gait might play a role in recognizing the target in low-resolution video frames; however, obscuring the target produced a dramatic drop in human-level recognition performance. The face-mask-out experiment in [23] also demonstrates that the face could be an indispensable part of identity recognition. Thus, the very low resolution face recognition problem should be addressed as a component of the re-identification problem.

C. Datasets

There are some good surveillance datasets and also some large-scale unconstrained face datasets that contain natural low-resolution faces that could be utilized in this challenging task. Here, we describe the datasets that we consider appropriate for the LRFR task in our experiments. Most of the LR face images used for research are generated by downsampling a standard face recognition dataset that is collected under a controlled environment. We select the AR dataset to study the LRFR task under controlled scenarios, and other unconstrained LR face datasets for further exploration. The AR dataset is used to illustrate the difference in data distribution between LR face images artificially generated from high-quality controlled face images and those directly collected in unconstrained scenarios such as surveillance camera networks.

1) AR Face: The images in the AR database [32] were taken from 100 subjects (50 women and 50 men) with different facial expressions, illumination conditions, and occlusions by sunglasses or scarves under strictly controlled conditions. In our work, these images are used to estimate the baseline performance of face recognition methods on aligned face images in controlled environments.

2) Megaface Challenge 2 LR subset: The Megaface Challenge 2 (MF2) training dataset [33] is the largest (in number of identities) publicly available facial recognition dataset, with 4.7 million face images and over 672,000 identities. The MF2 dataset was obtained by running the Dlib [19] face detector on images from Flickr [49], yielding 40 million unlabeled faces across 130,154 distinct Flickr accounts. Automatic identity labeling is performed using a clustering algorithm. We performed a subset selection from the Megaface Challenge 2 training set with tight bounding boxes to generate an LR subset of this dataset. Faces smaller than 50x50 pixels are gathered for each identity, and identities with fewer than five images available are then eliminated. This subset selection approach produced 6,700 identities and 85,344 face images in total. The extraction process does yield some non-face images, as does the original dataset processing. No further data cleaning is conducted on this subset.
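The subset selection just described can be sketched as a two-step filter. This is only an illustration under stated assumptions: the record layout `(identity, width, height)`, the function name `select_lr_subset`, and the reading of "smaller than 50x50" as both sides being below 50 pixels are ours, not taken from the MF2 tooling.

```python
from collections import defaultdict


def select_lr_subset(faces, max_side=50, min_images=5):
    """Filter face records down to a low-resolution subset.

    faces: iterable of (identity, width, height) tuples (hypothetical layout).
    Keeps faces whose bounding box is smaller than max_side on both sides
    (our assumed reading of "smaller than 50x50"), then drops identities
    left with fewer than min_images faces.
    """
    by_identity = defaultdict(list)
    for identity, width, height in faces:
        if width < max_side and height < max_side:
            by_identity[identity].append((width, height))
    return {ident: boxes for ident, boxes in by_identity.items()
            if len(boxes) >= min_images}
```

For example, an identity with five 40x40 faces is kept, while an identity with only four such faces, or one whose faces are all 60 pixels wide, is dropped.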

3) YouTubeFaces: The YouTubeFaces Database [55] is designed for studying the problem of unconstrained face recognition in video. The dataset contains 3,425 videos of 1,595 different people. An average of 2.15 videos is available for each subject.

4) SCface: SCface [8] is a database of static images of human faces collected in an uncontrolled indoor environment using five video surveillance cameras of various qualities. The database contains 4,160 static images (in the visible and infrared spectra) of 130 subjects. We choose the HR and LR visible face subset for training and testing.

5) UCCSface: The UnConstrained College Students (UCCS) dataset [43] contains high-resolution images captured from an 18 megapixel camera at the University of Colorado at Colorado Springs, capturing people walking on a campus sidewalk from a standoff of 100 to 150 meters, at one frame per second. The dataset consists of more than 70,000 hand-cropped face regions. An identity is manually assigned to many of the faces. We used the database subset that has assigned identities (180 identities total). Although the data are captured by high-definition cameras, the face regions are tiny compared to the entire image frame due to the large standoff, and contain a lot of noise and blur.

6) VBOLO face: This dataset [23][25] (formerly known as EBOLO) was collected in several sessions at various checkpoints within public transportation facilities such as tunnels, bridges, and hallways. These capture environments include different camera mount heights and depression angles, illuminations, backgrounds, resolutions, pedestrian poses, and distractors. This dataset provides a good scenario for the facial ReID problem. It uses a small set of known individuals ("actors"), who move in and out of the surveillance cameras’ fields of view, together with unknown persons denoted as “distractors”. The “actors” change clothing randomly between each “appearance” in a camera’s field of view. Compared to a typical body-based ReID dataset, which has only a few images for each subject, the VBOLO dataset has a large number of annotations for each subject from consecutive video frames, which mimics a real scenario for surveillance tracking and detection. Matching on this dataset is challenging because:

• Faces change size significantly (e.g., from 12x12 to 150x150) and exhibit significant pose variations as well.

• The two cameras supplying the probe and gallery images may have different resolutions and points of view.

We employ videos captured at two distinct locations (denoted as Station 1 and Station 2) in this research. Each of the collections has nine actors with nine appearances each.

III. METHODS AND EXPERIMENTS

A. Super-resolution Techniques

1) Description: In order to explore the gap between constrained and unconstrained low-resolution face recognition performance, we designed a small super-resolution (SR) experiment with the AR [32] and YouTube Faces (YTF) [55] datasets. In this experiment, the idea was to evaluate the matching performance of two face images: a low-resolution (LR) image and a high-resolution (HR) image.

In the data selection process, two images or video frames were selected from the datasets for each subject. As illustrated in Fig. 1, ‘Original #1’ and ‘Original #2’ were used for LR and HR purposes respectively. The first one was downsampled to each of three low-resolution sizes: (a) 21 × 15, (b) 16 × 12, and (c) 11 × 8 pixels. The second one was upsampled to the matcher’s required input size of 224 × 224 pixels. In both cases, bicubic interpolation was used [7]. The images or frames were selected as follows.

• AR dataset: original LR and HR face images were taken from face images #01 and #14 respectively of each subject. Since the AR dataset has 100 subjects, we obtained 100 LR–HR pairs of face images for this experiment.

• YouTube Faces dataset: original LR and HR face images were chosen as the earliest and latest frame containing a frontal face image in the latest video clip available for each subject. The detection of frontal faces was performed by applying Zhu’s approach [66] to each frame and selecting face images where Zhu’s method returned a pose of 0. The dataset has 1,595 subjects; however, only 1,463 subjects have two suitable frames available in the last video clip. Thus, we have 1,463 LR–HR pairs of face images for this experiment.

After the images and frames were selected and scaled to the target HR and LR sizes, each of the LR images was upscaled to the same size as the HR image (224 × 224 pixels) by each of the following methods: (a) bicubic interpolation (BI) [7], (b) deep learning SR (DL) [28], and (c) sparse representation super-resolution (SP) [59]. The new images are tagged with the intermediate LR resolution and SR method as follows: SR:21x15-BI, SR:21x15-DL, SR:21x15-SP, SR:16x12-BI, SR:16x12-DL, SR:16x12-SP, SR:11x8-BI, SR:11x8-DL, and SR:11x8-SP. In addition, we included in the experiments the ‘Direct method’, where original images #1 and #2 are compared without downsampling, using bicubic interpolation only to reach the size of 224 × 224 pixels required by the matcher. This experimental design is depicted in Figure 1.

2) Experiments and Results: The experimental approach is presented in Fig. 1. The VGG-face trained network in [36] was used to produce a feature vector for the HR image and all nine of the SR images. Matching scores were obtained for each match and nonmatch pair involving one HR image and one of the nine different upscaled SR images. The cosine distance was used as a match score.

The cumulative match characteristic curves for the AR and YouTube Faces datasets can be seen in Figures 2 and 3 respectively. As might be expected, the performance decreases with decreasing resolution: the Direct method (with no downsampling) achieves better results than those obtained from LR images. Moreover, 21×15 LR images obtain better performance than 16×12 LR images, which in turn perform better than 11×8 LR images.
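As a rough illustration of how a cumulative match characteristic curve is obtained from feature vectors, the sketch below scores every probe against every gallery entry and records the rank at which the true match appears. It assumes that `probes[i]` and `gallery[i]` belong to the same subject and uses cosine similarity (higher is better) as the score; the function names are ours, and the feature extractor itself is left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cmc_curve(probes, gallery, max_rank):
    """Fraction of probes whose true match is found within rank 1..max_rank.

    Assumes probes[i] and gallery[i] are features of the same subject i.
    """
    hits = [0] * max_rank
    for i, probe in enumerate(probes):
        scores = [cosine_similarity(probe, g) for g in gallery]
        # Rank of the true match: 1 + number of gallery entries scoring higher.
        rank = 1 + sum(1 for s in scores if s > scores[i])
        for r in range(rank, max_rank + 1):
            hits[r - 1] += 1
    return [h / len(probes) for h in hits]
```

The value at index 9 of the returned list corresponds to a rank-10 correct match score of the kind reported in Table I.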

In order to show the degradation of the performance with low-quality images, we conducted another experiment. Since the original faces in YouTube video frames might be of low quality, we evaluated the performance not only on all 1,463 pairs, but also on the 500 pairs with the highest quality and on the 500 pairs with the lowest quality. We call these two new subsets HQ-YT and LQ-YT respectively. To measure the quality of the original face images, we used a score based on the ratio between the high-frequency coefficients and the low-frequency coefficients of the wavelet transform of the image [37]. Low score values indicate low quality. We evaluated the performance at rank 10 when changing the down-sampling target resolution in HQ-YT and LQ-YT. This can be seen in Table I, where high-quality images yield a significantly better performance than the low-quality ones.

One clear point from the results of Figures 2 and 3 is that, for these datasets and these algorithm implementations, at the same resolution, sparse representation super-resolution (SP) consistently outperforms bicubic interpolation, which in turn outperforms deep learning super-resolution. The low performance of the deep learning method is due to the introduction of artifacts in severely degraded images, as can be seen in the example of Fig. 1.

Fig. 1: Super-resolution experiment: two images, #01 (for LR purposes) and #02 (for HR purposes) are matched using three different super-resolution algorithms for two different LR sizes.

Fig. 2: Cumulative match characteristic plot for AR Face Dataset.

Fig. 3: Cumulative match characteristic plot for YouTube Faces Dataset.

Subset  Method  21x15   16x12   11x8
HQ-YT   BI      80.8%   53.2%   15.6%
HQ-YT   DL      69.6%   35.2%    9.6%
HQ-YT   SP      87.2%   66.4%   26.2%
LQ-YT   BI      81.4%   45.8%   10.8%
LQ-YT   DL      64.4%   27.2%    7.2%
LQ-YT   SP      87.2%   61.0%   14.6%

TABLE I: Rank-10 correct match score for subsets HQ-YT and LQ-YT

B. Low Resolution Face Identification

In this section, we further study unconstrained low-resolution face recognition. We first focus on cross-resolution face identification, which applies when the enrolled face images are mostly collected in controlled scenarios with high resolution while the low-resolution faces are captured by surveillance cameras under uncontrolled pose and lighting conditions. This is a challenging recognition task which relies strongly on a good resolution-invariant representation.

1) Supervised discriminative cross-resolution learning:

Description: Most existing approaches to cross-resolution matching learn a unified space in which to represent low- and high-resolution faces. This requires a carefully designed face pair mining strategy during the training process, which is both time-consuming and performance-sensitive. In order to fully explore the intrinsic connection between the HR and LR domains, we decide to involve the HR and LR face images equally, expecting to learn a common feature space that is able to cluster LR and HR faces of the same subject while maintaining low inter-class proximity despite the difference in resolution. We propose an approach based on [54], which employs a novel regularization term to further force the features from the same subject to cluster, which yields more discriminative features. To further stabilize the training process and reduce the overfitting of the model on smaller datasets, we also employ an L2 regularization term. The loss function is presented in Equation 1. Here $x_i$ denotes the feature of the $i$-th face image in a mini-batch of $m$ images drawn from different resolutions, $n$ is the total number of classes, and $c_{y_i}$ is the feature center of class $y_i$, which is updated during training.
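To make the three terms of this loss concrete, the following is a minimal sketch that evaluates it for one mini-batch in plain Python. It is not the authors' implementation: the function name and argument layout are hypothetical, the factor of one half on the center term follows the usual center-loss convention and is an assumption about Equation 1, and the update of the class centers during training is not shown.

```python
import math


def center_regularized_loss(features, labels, W, b, centers, lam, eta):
    """Softmax loss + (lam / 2) * center term + eta * L2 penalty on W.

    features: m feature vectors x_i; labels: class index y_i per feature;
    W: n weight vectors (one per class); b: n biases; centers: n class centers.
    The 1/2 factor on the center term is an assumed convention.
    """
    softmax_term = 0.0
    center_term = 0.0
    for x, y in zip(features, labels):
        logits = [sum(wk * xk for wk, xk in zip(w, x)) + bj
                  for w, bj in zip(W, b)]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
        softmax_term += log_sum_exp - logits[y]  # -log softmax probability
        center_term += sum((xk - ck) ** 2 for xk, ck in zip(x, centers[y]))
    l2_term = sum(wk * wk for w in W for wk in w)
    return softmax_term + 0.5 * lam * center_term + eta * l2_term
```

With all weights and biases at zero and two classes, the softmax term for a single sample is log 2, so the remaining value comes entirely from the distance of the feature to its class center.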

$$L = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}} + \frac{\lambda}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2 + \eta \lVert W \rVert_2^2 \qquad (1)$$

Experiment and result: In most previous studies, because of the lack of low-resolution images collected in the wild, high-resolution datasets like Multi-PIE [9] and FERET [38] are used for this task, with down-sampling of the HR face images to create LR counterparts. Due to the difference between synthetic LR images created by this technique and real surveillance images, the method has limitations when applied to the real world. Two experiment protocols are defined as follows:

• Closed-set face identification is typically treated as a classification problem, and the label is predicted directly from the classification layer when a deep learning method is employed. However, when the subjects used for evaluation do not appear in the training stage, classification architectures tend not to be flexible enough. This subject-disjoint training and testing protocol is more practical in a real-life scenario.

• Open-set face identification requires a 1 to N match when an individual is presented to the system. However, it also requires that the system correctly reject individuals who are not enrolled in the system database. This is very similar to how most surveillance systems work. An individual who randomly shows up in the scene may or may not be in the face database. In this case, the system must correctly reject the probe if the person is not in the database and correctly identify the person if he or she is in the database.
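The difference between the two protocols can be illustrated with a small decision rule: closed-set identification always returns the best-scoring enrolled identity, whereas open-set identification may also reject the probe. The sketch below assumes similarity scores where higher means more alike and a hypothetical fixed `threshold`; how the threshold is chosen is outside the scope of this illustration.

```python
def identify_open_set(probe_scores, threshold):
    """Return the best-matching enrolled identity, or None to reject.

    probe_scores: dict mapping enrolled identity -> similarity to the probe
    (higher is more similar). threshold is a hypothetical operating point.
    """
    if not probe_scores:
        return None
    best_id = max(probe_scores, key=probe_scores.get)
    return best_id if probe_scores[best_id] >= threshold else None
```

Setting the threshold to negative infinity recovers the closed-set behavior, since the top-scoring identity is then always accepted.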

Evaluations are conducted on two surveillance-quality datasets (SCface and UCCSface), which we consider both more challenging and closer to a real scenario.

SCface: Each of the 130 subjects in the subset of the SCface dataset that we used has one mugshot HR face image taken by a high-definition camera and several LR face images captured by five visible-light cameras placed at three different standoff distances (1m, 2.6m and 4.2m). Each individual therefore has 1 HR face image and 15 surveillance-quality images, which results in 2080 faces in total. We conduct two sets of experiments with two different protocols as defined in [58]. We split the subjects into a training set of 80 subjects and a testing set of 50 subjects in a subject-disjoint fashion. For Experiment 1, we used the HR images as gallery images, and those captured at the three standoffs as probe images. In Experiment 2, we chose face images from the 1m standoff as gallery images and images from the 2.6m and 4.2m standoffs as probe images. The other settings are identical to Experiment 1. All the HR and LR images are resized to 64x64 for presentation to the network for training and testing. We use the cosine distance as the matching score, and the rank-1 rates are reported in Tables II and III. We achieve an improvement of nearly 5 percent in rank-1 rate in Experiment 1 and 9 percent in Experiment 2 compared to the state of the art under the same protocol. The features produced by our model are more robust when the gallery and probe images differ greatly in resolution: the performance of other methods drops rapidly when the probe images change from the 1m standoff to the 4.2m standoff, whereas performance with the features from our model holds up better.

Method    HD-1m   HD-2.6m  HD-4.2m
SCface     6.18    6.18     1.82
CLPM       3.08    4.32     3.46
SSR       18.09   13.2      7.04
CSCDN     18.97   13.58     6.99
CCA       20.69   14.85     9.79
DCA       25.53   18.44    12.19
C-RSDA    18.46   18.08    15.77
Ours      31.71   20.8     20.4

TABLE II: Exp1: Rank-1 rate on SCface with HD and three standoff distances

Method                   1.0m-2.6m
CLPM                     29.12
SDA                      40.08
CMFA                     39.56
Coupled mapping method   43.24
LMCM                     60.40
Ours                     69.6

TABLE III: Exp2: Rank-1 rate on SCface with 1.0m and 2.6m standoff distances

UCCSface: UCCSface is another dataset designed to be close to a real surveillance setting. We followed the experiment settings provided in [53] and [43] and evaluated both closed-set and open-set scenarios. For closed-set evaluation, 180 subjects are used; for open-set evaluation, we compared our result with the performance reported in [43] with the defined openness at 14.11 percent. In the closed-set evaluation, our method beats the UCCS baseline by nearly 20 percent in rank-1 accuracy and also outperforms the DNN method by nearly 35 percent in rank-1 rate under the same training and evaluation protocol. For open-set evaluation, we achieve 73.6 percent accuracy when compared to the UCCSface baseline result.

C. Low resolution face re-identification

In this section, we explore surveillance-level face re-identification and evaluate it on several datasets captured in an unconstrained environment. We employed the VBOLO dataset for an in-depth study and the SCface, UCCS, and Megaface datasets for other topical explorations.

1) Experiments and Results: VBOLO dataset:

Resolutions                  Rank-1 rate  Method
Original vs Original         78           UCCSface
Original vs Original         95.1         Ours
80x80 (HR) vs 16x16 (LR)     59.03        DNN
80x80 (HR) vs 16x16 (LR)     93.4         Ours

TABLE IV: Rank-1 rate on UCCSface dataset

a) Actor-Disjoint Experiment with selected deep networks: We explored the literature and were inspired by recent state-of-the-art patch matching approaches that may exhibit robustness to small misalignments in automatic or manual annotations and robustness to different effective resolutions. We also expect the network to handle matching of faces captured by the same or different cameras with different subject standoffs. We tried four state-of-the-art face matching approaches with basic DNN architectures, and boosted them with fully convolutional structures to reduce overfitting on our dataset. Further, to let the network better accommodate resolution changes, we employed a spatial pyramid pooling (SPP) layer, hoping to learn discriminative features and the mapping between different sizes of LR faces captured by the surveillance cameras.
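The reason an SPP layer helps with varying face sizes is that it turns a feature map of any spatial size into a vector of fixed length. The sketch below shows this for a single 2-D feature map using max pooling; the pyramid levels `(1, 2, 4)` and the bin-edge rounding are illustrative assumptions, not the configuration used in our networks.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool one 2-D feature map into sum(l * l for l in levels) values.

    feature_map: list of rows of numbers, any height and width >= 1.
    levels: pyramid levels (assumed values); level l splits the map into
    an l x l grid of bins, each contributing one max value.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for level in levels:
        for by in range(level):
            # Integer bin edges; every bin covers at least one row.
            y0 = (by * height) // level
            y1 = max(y0 + 1, ((by + 1) * height + level - 1) // level)
            for bx in range(level):
                x0 = (bx * width) // level
                x1 = max(x0 + 1, ((bx + 1) * width + level - 1) // level)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, min(y1, height))
                                  for x in range(x0, min(x1, width))))
    return pooled
```

A 4x4 map and a 7x5 map both yield 21 values with the default levels, which is what allows a fixed-size fully connected layer to follow convolutional layers applied to inputs of different sizes.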

b) Matching protocols: We defined two different matching protocols for this dataset. The first protocol is designed to match people in the same camera between different appearances (usually called “single camera ReID”). This experiment aims to test face identification performance on people across different appearances in the video, where the clothes differ between appearances. The second protocol matches probe and gallery images acquired from different camera locations. This protocol aims to evaluate the comprehensive performance of the ReID model, which includes single-camera and multi-camera person ReID at the same time. In order to increase the matching complexity, we added distractors to the protocol to obtain more non-match pairs.

We train and test several state-of-the-art deep learning patch-matching architectures on our dataset following the two matching protocols. As in the experiment setting of the last subsection, the training and testing sets are disjoint by actor ID in order to mimic the reality that the targets are very unlikely to appear in the training set. Six of the actors are used for training and three for testing. Each set of experiments was conducted five times using the random pair sampling procedure below, and the results were averaged.

c) Training pairing strategy: Creating pairs and sampling the generated pairs in training for matching is a key step of preprocessing. Since we have on average around 200 faces for each person in each appearance, the numbers of positive and negative pairs are highly unbalanced. In addition to pairs created using faces in different appearances, we decide to add face pairs from the same appearance in the training data in order to increase the number of training pairs. We denote the training set as $T$, and a particular face in $T$ as $t_{ijf}$, where $i \in \{1, \dots, n\}$ is the ID of the actor, $j \in \{1, \dots, m\}$ is the appearance number, and $f$ denotes the frame number. We first randomly shuffle all the faces in order to break the temporal continuity of the frames, to avoid getting positive face pairs more often from frames close in time to each other. For each face $t_{ijf}$ with fixed $i$ and $j$, we create a positive pair by randomly selecting $t_{ij'f'}$ with $j' \neq j$ and $f' \neq f$. To create an equal number of negative pairs, we randomly select a face $t_{i'j'f'}$ with $i' \neq i$ and with $j'$ and $f'$ randomly chosen, and pair the selected face with the previous face $t_{ijf}$. By exploiting the pairing approaches mentioned above, we are able to obtain balanced face pairs from each identity in each appearance.
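The sampling procedure can be sketched as follows. This is a simplified illustration rather than the code used in our experiments: faces are represented by hypothetical `(actor_id, appearance, frame)` tuples, only the cross-appearance positive pairs of the formal description are generated (the additional same-appearance pairs are omitted), and the quadratic candidate search is kept for clarity.

```python
import random


def sample_pairs(faces, seed=0):
    """Create one positive and one negative pair per anchor face.

    faces: list of (actor_id, appearance, frame) tuples (hypothetical layout).
    Positive partner: same actor, different appearance and frame number.
    Negative partner: any face of a different actor.
    """
    rng = random.Random(seed)
    faces = list(faces)
    rng.shuffle(faces)  # break the temporal continuity of consecutive frames
    positives, negatives = [], []
    for anchor in faces:
        i, j, f = anchor
        same = [t for t in faces if t[0] == i and t[1] != j and t[2] != f]
        other = [t for t in faces if t[0] != i]
        if same and other:  # skip anchors with no valid partner
            positives.append((anchor, rng.choice(same)))
            negatives.append((anchor, rng.choice(other)))
    return positives, negatives
```

Because each anchor contributes exactly one pair of each kind, the positive and negative sets have equal size by construction.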


d) VGG-Face [36] & SVM: We employed the pre-trained VGG-Face descriptor model for facial feature extraction. We first pair the faces from our dataset and then extract features using the VGG-Face descriptor. The face images are resized from their original size to 224x224 for input to the network. Each pair of faces is given a binary label indicating whether the two faces are from the same person, and the Euclidean distance between the pair's feature vectors is used as the input to a linear SVM trained for binary prediction. We conduct the experiment with the two matching protocols mentioned above. We selected the SVM hyperparameter value that gave the best cross-validation (CV) rate and obtained a testing AUC of 0.695, as shown in Figure 5. This indicates that the network extracts usable face features from poor-quality faces taken from surveillance videos. However, since the VGG-Face descriptor was trained on a variety of HR faces that are well aligned using facial landmarks, this baseline result is not outstanding when applied to surveillance-quality faces that are both LR and hard to align. Also, LR face images need to be upscaled before being fed into the pre-trained deep network, which is not efficient for learning LR discriminative features for re-identification.
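As a rough illustration of how an AUC is obtained from pair distances, the sketch below computes the ROC AUC directly from the ranking of distances. A linear SVM on a single scalar input has a decision function that is monotone in that input, so, up to the sign of the learned weight, its AUC is determined by this ranking. The function name and the toy values are assumptions; the paper's actual features and SVM settings are not reproduced here.

```python
def auc_from_distances(distances, labels):
    """ROC AUC for a 'smaller distance means same person' decision rule.

    distances: Euclidean distances between paired feature vectors.
    labels:    1 if the pair shows the same person, 0 otherwise.
    Uses the rank-sum (Mann-Whitney) formulation; ties count as 0.5.
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:        # matching pair ranked closer: correct ordering
                wins += 1.0
            elif p == n:     # tie
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a random ordering of match and non-match pairs, which gives a reference point for reading the 0.695 baseline.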

e) Siamese Network: Siamese classification structures were first applied to face verification in [6]. The Siamese architecture does not require categorical information in training. It learns a feature representation with two identical towers of network layers with shared weights. As in a traditional convolutional network, each tower has a series of convolutional, activation, and max-pooling layers. The two feature representations coming from the towers are concatenated and fed into fully connected layers trained with a contrastive loss. Since the two towers are identical in structure and weights, this kind of network maps the two inputs into the same target space and is trained in an end-to-end fashion. We propose a tiny base network with a simple architecture, motivated by [53], which concludes that a deeper network architecture and a large number of filter channels may degrade recognition performance. Our basic network is shown in Table V; it has three convolutional layers followed by max-pooling and a fully connected layer. We used a moderate filter size and channel number in the tiny network. An input size of 32x32 was chosen. We trained the network from scratch with a batch size of 8 and the SGD optimizer. The Siamese net converged within 5 epochs with an AUC of 0.861 on Station 1 data and 0.871 on Station 2 data with single-camera matching, and 0.838 on the data from both stations with both single- and cross-camera matching.
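For readers unfamiliar with the contrastive loss, the following sketch shows its commonly used form for a single pair of embeddings. The margin value, the factor of 0.5, and the function names are assumptions; the text does not state which variant or margin was used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same, margin=1.0):
    """Contrastive loss for one pair of tower outputs.

    same=True  pulls the embeddings together: 0.5 * d^2.
    same=False pushes them at least `margin` apart: 0.5 * max(0, margin - d)^2.
    `margin` is a hypothetical default, not a value from the paper.
    """
    d = euclidean(u, v)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

Non-matching pairs that are already farther apart than the margin contribute zero loss, so training effort concentrates on pairs that are still confusable.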

f) MatchNet: MatchNet [11] is another state-of-the-art patch-matching approach that employs a two-tower structure with shared weights, similar to the Siamese net. However, instead of feeding the two concatenated feature vectors produced by the towers directly into a decision layer with a carefully designed loss function, it uses a series of fully connected layers as a subnet to learn the feature comparison for binary classification using cross-entropy. Compared to the Siamese net, MatchNet has more flexibility in the metric subnet shown in Figure 4, which takes the paired features and maps them to a unified space that minimizes their distance. However, it converges more slowly since the fully connected layers have many more parameters and higher complexity. A softmax layer and cross-entropy loss are employed during training. We obtain the best result on our basic net using the SGD optimizer. As shown in Figure 5, it obtained an AUC of 0.902 on data from Station 1 and 0.847 on data from Station 2 for single-camera matching, and 0.827 on data from both stations for single- and cross-camera matching.

Fig. 4: Overview of the three deep architectures: (a) Siamese net, (b) MatchNet, (c) 6-channel net. [Figure labels: feature towers with 3x32x32 inputs, contrastive loss, metric subnet, softmax layer, 6x32x32 input.]

Parameters                                  6-channel   MatchNet   Siamese
c(32)-m-c(32)-m-c(64)-fc(64)                77.1        77.3       78.3
c(8)-m-c(16)-m-c(32)-c(32)-m                76.7        78.1       77.8
c(8)-m-c(16)-c(16)-m-c(32)-c(32)-m          77.5        79.3       79.8
c(8)-c(8)-m-c(16)-c(16)-m-c(32)-c(32)-m     76.8        75.9       78.5

TABLE V: Model selection for a convolution kernel size of 3×3. Testing accuracy is shown with different network layouts.
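The decision stage described for MatchNet, a learned comparison followed by softmax and cross-entropy, can be sketched in a few lines. A single linear layer stands in here for the series of fully connected layers, and all names, weights, and shapes are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy loss for one example; label is the true class index."""
    return -math.log(softmax(logits)[label])


def metric_subnet_logits(feat_a, feat_b, weights, bias):
    """Concatenate the two tower features and apply one linear layer.

    weights: one row per output class (2 rows: non-match, match).
    A real metric subnet would stack several such layers with nonlinearities.
    """
    x = list(feat_a) + list(feat_b)  # paired features, concatenated
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]
```

Unlike a fixed distance, the comparison function here is itself learned, which is the source of both the added flexibility and the extra parameters mentioned above.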

g) Six-Channel Net: Inspired by the two-channel model proposed in [62], we tried to improve it by incorporating three color channels. This approach abandons the two-tower structure and directly stacks the two face images into six channels, which are fed into the first layer of the network, with a hinge loss and a one-bit binary output, as shown in Figure 4. Compared to the previous two architectures, it has greater flexibility: it has twice as many parameters as the two-tower structures and is able to learn feature maps jointly from six image channels instead of three. However, it converges slowest among the three, and L2 regularization is needed for better performance. The best AUC values we achieve using the six-channel net on Station 1 and Station 2 single-camera matching are 0.891 and 0.818. It achieves an AUC of 0.846 for both single- and cross-camera matching, which outperforms the Siamese net (0.838) and MatchNet (0.827) by roughly 1 and 2 percentage points, respectively.
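A minimal sketch of the six-channel input construction and the hinge loss follows. The nested-list image layout, the function names, and the ±1 label convention are assumptions made for illustration.

```python
def stack_pair(face_a, face_b):
    """Stack two 3-channel faces into one 6-channel input.

    Each face is a list of 3 channel planes (each plane an H x W nested list).
    The result is the 6 planes of face_a followed by face_b.
    """
    if len(face_a) != 3 or len(face_b) != 3:
        raise ValueError("each face must have exactly 3 color channels")
    return list(face_a) + list(face_b)


def hinge_loss(score, label):
    """Hinge loss for one pair; label is +1 (match) or -1 (non-match)."""
    return max(0.0, 1.0 - label * score)
```

Because the two images are joined before the first convolution, every filter sees both faces at once, in contrast to the two-tower designs where the faces only meet after feature extraction.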

h) Fully convolutional structure and SPP pooling: In this section, we try to improve the three architectures mentioned above (the Siamese net, MatchNet, and the 6-channel net) with a fully convolutional structure and a Spatial Pyramid Pooling (SPP) layer [17]. A fully convolutional CNN (FCN) is one where all the learnable layers are convolutional. A convolutional layer has fewer parameters than a fully connected layer, which would potentially reduce overfitting on a small dataset while keeping more spatial information in the features. We replaced the fully connected layer with a convolutional layer in the previous three models, tried several hyperparameter settings to adjust the number of layers and filters, and chose the best set based on the testing accuracy observed. The performance comparison is summarized in Table V. Figure 7 demonstrates that the FCN architecture improves the performance (as measured by the AUC) over the three basic network architectures (Siamese, MatchNet, 6-channel net) by roughly 1 percent, 5 percent, and 4 percent, respectively. Further, we are inspired by the study of Zagoruyko et al. [62], who used an SPP layer for patch matching and claimed a remarkable improvement in performance. Whereas [62] only tested their architecture with SPP using size-identical image pairs, we take advantage of the fully convolutional architecture of our network and feed face images of various sizes into the network. By replacing the last max-pooling layer with an SPP layer before the decision layer, we can further test the assumption from [62] on the VBOLO dataset with our modified architecture. We simplified the problem by setting up three convolutional Siamese networks, each responsible for matching at a specific resolution level. In this case, we have a) low-to-low, b) high-to-high, and c) low-to-high resolution matching with three separate but identical networks. These three networks try to learn different metrics and features with face pairs close to their original sizes. We resize faces that are smaller than 32×32 to 16×16 and denote them as low-resolution faces. Faces bigger than 32×32 are resized to 64×64 and denoted as high-resolution faces. We train and test the three subnets with the SPP layer on top, as shown in Figure 6. 4×4 SPP pooling is applied at the end of each tower. We obtained a slight (0.1 percent) AUC improvement using the SPP layer together with the fully convolutional architecture. Compared with previous work [23], we achieve a significant improvement in performance on the data from Station 2 by exploiting unified deep feature and metric learning instead of optimizing feature and metric separately.

Fig. 5: Face matching with the three basic nets (first line in Table V) for three data subsets. (a) Station 1. (b) Station 2. (c) Station 1 and Station 2.

Fig. 6: Architecture of the Siamese network with SPP layer. [Figure labels: Face Patch 1, Face Patch 2, SPP, Decision Layer.]
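To make concrete why an SPP layer allows inputs of different sizes, the sketch below max-pools a single-channel feature map into a fixed number of bins regardless of its height and width. The pyramid levels `(4, 2, 1)` are an assumption; the text only states that 4×4 SPP pooling is used, and the bin boundaries follow a common floor/ceil convention rather than a confirmed implementation detail.

```python
def spp_max_pool(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid max pooling for one channel.

    feature_map: H x W nested list of numbers.
    levels:      grid sizes n; each level contributes n * n pooled values.
    Returns a flat list whose length depends only on `levels`.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        if h < n or w < n:
            raise ValueError("feature map smaller than pyramid level")
        for r in range(n):
            r0 = (r * h) // n                  # floor of the bin start
            r1 = ((r + 1) * h + n - 1) // n    # ceil of the bin end
            for c in range(n):
                c0 = (c * w) // n
                c1 = ((c + 1) * w + n - 1) // n
                pooled.append(max(feature_map[y][x]
                                  for y in range(r0, r1)
                                  for x in range(c0, c1)))
    return pooled
```

With these levels, a 16×16 map and a 64×64 map both yield 16 + 4 + 1 = 21 values per channel, so the decision layer that follows can keep a fixed input size.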

i) Improved Pretraining Approach on larger datasets: Although we achieved several sets of promising results on the VBOLO dataset, more comprehensive experiments need to be conducted on larger-scale public datasets. In addition, we face the challenge that most deep architectures struggle to exploit LR images well due to over-fitting and the limited intrinsic dimensionality of the input. We therefore set out to improve the training process and evaluate it on some larger general LR unconstrained face datasets and surveillance-quality face datasets. To achieve this goal, we employ the DCGAN introduced in [40] to obtain a pre-trained discriminator as an initialization for the feature towers. Using this method has two advantages:
• By optimizing the DCGAN on the LR training set, we can understand how the network perceives the LR images and adjust parameters such as the activation function and number of layers by looking at the intermediate output of the generator, as shown in Figure 8. Visualizations of filters and feature maps are not intuitive enough to guide a training strategy.
• The pretrained discriminator provides initial weights learned on general LR faces, which can be transferred to other LR face datasets via fine-tuning; this can stabilize and accelerate the training process.

Fig. 7: Face matching with fully convolutional networks on faces from two stations with multi-camera and multi-appearances. (a) 6-channel. (b) MatchNet. (c) Siamese net. (d) Siamese net with SPP layer.

The GAN discriminator is trained on the Megaface2 LR subset and fine-tuned using the target datasets VBOLO, SCFace, and UCCSface. We compared the model trained from scratch with a fully convolutional MatchNet model pre-trained using DCGAN.

We split the data for training and testing as follows. For UCCSface, we choose 90 identities for training and 90 for testing. For Megaface, we use 2999 identities for training and 6699 identities for testing. For SCFace, we select 65 identities for training and 65 for testing. We followed the VBOLO pairing and matching protocol for training and testing. The validation set is a randomly selected 20 percent of the training set. All the experiments are run five times and the error rates are averaged. We find that when starting from weights pretrained on a larger amount of low-resolution faces using the DCGAN discriminator, the validation rate is more stable and tends to be higher than when training the network from scratch. Better performance was achieved on the same datasets. For VBOLO, UCCSface, Megaface, and SCface, we achieve relative decreases in error rate of 9.2, 21.8, 10.6, and 11.3 percent, respectively, compared with training the model from scratch (Table VI).
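The reported decreases can be checked against Table VI. The sketch below reads them as relative reductions computed from the first value in each table cell; that reading is an assumption, but it reproduces the reported figures to within rounding (the Megaface value computes to about 10.7 against the reported 10.6).

```python
def relative_reduction(baseline, improved):
    """Relative decrease in error rate, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


# First value of each Table VI cell: (train from scratch, DCGAN-pretrained).
table_vi = {
    "VBOLO": (20.7, 18.8),
    "UCCSface": (18.8, 14.7),
    "Megaface": (22.5, 20.1),
    "SCface": (27.4, 24.3),
}

reductions = {name: relative_reduction(scratch, pretrained)
              for name, (scratch, pretrained) in table_vi.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1f} percent relative decrease")
```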

We also identified some architectural changes needed for a DCGAN model to successfully converge on LR face images compared with HR images. For higher-resolution modeling, Radford et al. [40] suggested that doing the following would result in stable training:
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator)
• Replace the Tanh activation function with ReLU or LeakyReLU functions
• Add batch normalization in both the generator and the discriminator

However, we only achieved stable adversarial convergence using the Tanh activation function in both the generator and the discriminator, except for the last layer of the discriminator, in which we employed a sigmoid nonlinearity. Batch normalization did not help our DCGAN converge on the Megaface LR subset and is not applied in our model.
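The discriminator configuration described above can be summarized as a layer plan. In the sketch below, the Tanh hidden activations, the sigmoid output, the absence of batch normalization, and the use of strided convolutions come from the text; the 32×32 input, the kernel size 4, stride 2, padding 1, and the stopping size of 4 are hypothetical values chosen only to make the arithmetic concrete.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_plan(input_size=32, kernel=4, stride=2, padding=1, min_size=4):
    """List the layers of a pooling-free discriminator for LR faces.

    All numeric defaults are illustrative assumptions, not values
    reported in the paper.
    """
    plan = []
    size = input_size
    while size > min_size:
        size = conv_out_size(size, kernel, stride, padding)
        plan.append({"op": "strided_conv", "out_size": size,
                     "activation": "tanh", "batch_norm": False})
    plan.append({"op": "output", "out_size": 1,
                 "activation": "sigmoid", "batch_norm": False})
    return plan
```

Under these assumed settings the spatial size halves at each strided convolution (32, 16, 8, 4), so downsampling is learned rather than performed by pooling.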

Datasets             VBOLO       UCCSface    Megaface    SCface
DCGAN-pretrained     18.8/17.6   14.7/11.8   20.1/19.8   24.3/24.1
Train from scratch   20.7/19.5   18.8/18.6   22.5/21.8   27.4/28.5

TABLE VI: Average error rate on MatchNet models

IV. CONCLUSIONS AND COMMENTARY

In this paper, we provide several novel contributions. First, we illustrate the performance gap between LR unconstrained and LR constrained face recognition when using a state-of-the-art super-resolution algorithm. Second, two important application scenarios based on LR face recognition are defined: unconstrained LR face identification in the wild and LR face re-identification. For general low-resolution face identification, we exploit a novel approach to handle the multi-dimensional mismatch caused by the quality difference between probe and gallery face images. We also design different deep networks for the person re-identification problem and demonstrate better performance compared to our previous work [23], [25]. We exploit a novel strategy using DCGAN pre-training to obtain both a visualization of what the network learns and improved results on larger-scale datasets. We report results from extensive experiments on selected datasets and find that incompatible features for matching, caused by dimensional mismatch, are the most challenging issue, especially in the low-to-high resolution face identification task. The results demonstrate that the approaches we proposed target different tasks, work efficiently, and yield impressive performance.

REFERENCES

[1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.

[2] L. Bazzani, M. Cristani, and V. Murino. Symmetry-driven accumulation of local features for human characterization and re-identification. Comput. Vis. Image Underst., 117(2):130–144, 2013.

[3] S. Biswas, K. W. Bowyer, and P. J. Flynn. Multidimensional scaling for matching low-resolution facial images. In Proceedings of the IEEE Conference on Biometrics: Theory Applications and Systems (BTAS), pages 1–6, 2010.


Fig. 8: Generated LR faces from DCGAN generator

[4] A. M. Burton, S. Wilson, M. Cowan, and V. Bruce. Face recognition in poor-quality video: Evidence from security surveillance. Psychological Science, 10(3):243–248, 1999.

[5] D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1268–1277, 2016.

[6] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 539–546, 2005.

[7] R. Gonzalez, R. Woods, and S. Eddins. Digital Image Processing. Prentice Hall Press, 2009.

[8] M. Grgic, K. Delac, and S. Grgic. Scface–surveillance cameras face database. Multimedia tools and applications, 51(3):863–879, 2011.

[9] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010.

[10] Z. Haichao, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang. Close the loop: Joint blind image restoration and recognition with sparse representation prior. In Proceedings of the IEEE International Conference on Computer Vision, pages 770–777, Nov 2011.

[11] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3279–3286, 2015.

[12] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

[13] C. Herrmann, D. Willersinn, and J. Beyerer. Low-resolution convolutional neural networks for video face recognition. In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 221–227, 2016.

[14] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[15] J. Jiang, R. Hu, Z. Wang, and Z. Han. Face super-resolution via multilayer locality-constrained iterative neighbor embedding and intermediate dictionary learning. IEEE Transactions on Image Processing, 23(10):4220–4231, Oct 2014.

[16] C. Jose and F. Fleuret. Scalable metric learning via weighted approximate rank component analysis. arXiv preprint arXiv:1603.00370, 2016.

[17] H. Kaiming, Z. Xiangyu, R. Shaoqing, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proceedings of the IEEE European Conference on Computer Vision, 2014.

[18] S. Khamis, C.-H. Kuo, V. K. Singh, V. D. Shet, and L. S. Davis. Joint learning for attribute-consistent person re-identification. In Proceedings of the European Conference on Computer Vision, pages 134–146. Springer, 2014.

[19] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul):1755–1758, 2009.

[20] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2288–2295, 2012.

[21] S. Kolouri and G. K. Rohde. Transport-based single frame super resolution of very low resolution face images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2015.

[22] B. Li, H. Chang, S. Shan, and X. Chen. Low-resolution face recognition via coupled locality preserving mappings. IEEE Signal Processing Letters, 17(1):20–23, Jan 2010.

[23] P. Li, J. Brogan, and P. J. Flynn. Toward facial re-identification: Experiments with data from an operational surveillance camera plant. In Proceedings of the IEEE Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–8, 2016.

[24] P. Li, M. L. Prieto, P. J. Flynn, and D. Mery. Learning face similarity for re-identification from real surveillance video: A deep metric solution. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), pages 243–252, 2017.

[25] P. Li, M. L. Prieto, P. J. Flynn, and D. Mery. Learning face similarity for re-identification from real surveillance video: A deep metric solution. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), Oct 2017.

[26] W. Li and X. Wang. Locally aligned feature transforms across views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3594–3601, 2013.

[27] Z. Li, S. Chang, F. Liang, T. Huang, L. Cao, and J. Smith. Learning locally-adaptive decision functions for person verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3610–3617, 2013.

[28] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang. Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing, 25(7):3194–3207, 2016.

[29] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan. End-to-end comparative attention networks for person re-identification. arXiv preprint arXiv:1606.04404, 2016.

[30] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, 2017.

[31] N. Martinel, C. Micheloni, and G. L. Foresti. Saliency weighted features for person re-identification. In Proceedings of the ECCV Workshops, pages 191–208, 2014.

[32] A. M. Martinez. The ar face database. CVC technical report, 1998.

[33] A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. arXiv preprint arXiv:1705.00393, 2017.

[34] M. Nishiyama, A. Hadid, H. Takeshima, J. Shotton, T. Kozakaya, and O. Yamaguchi. Facial deblur inference using subspace analysis for recognition of blurred faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):838–845, April 2011.

[35] J. Pan, Z. Hu, Z. Su, and M.-H. Yang. Deblurring face images with exemplars. In Proceedings of the European Conference on Computer Vision, pages 47–62, 2014.

[36] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference, Swansea, 2015.

[37] S. Pertuz, D. Puig, and M. Garcia. Analysis of focus measure operators for shape-from-focus. Pattern Recognition, 2013.

[38] P. J. Phillips, H. Wechsler, J. Huang, and P. J. Rauss. The feret database and evaluation procedure for face-recognition algorithms. Image and Vision Computing, 16(5):295–306, 1998.

[39] Q. Qiu, J. Ni, and R. Chellappa. Dictionary-based domain adaptation methods for the re-identification of faces. In Person Re-Identification, pages 269–285, 2014.

[40] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[41] E. Rahtu, J. Heikkila, V. Ojansivu, and T. Ahonen. Local phase quantization for blur-insensitive image analysis. Image and Vision Computing, 30(8):501–512, 2012.

[42] C.-X. Ren, D.-Q. Dai, and H. Yan. Coupled kernel embedding for low-resolution face image recognition. IEEE Transactions on Image Processing, 21(8):3770–3783, 2012.

[43] A. Sapkota and T. E. Boult. Large scale unconstrained open set face database. In Proceedings of the IEEE Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–8, 2013.

[44] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[45] S. Shekhar, V. M. Patel, and R. Chellappa. Synthesis-based recognition of low resolution faces. In Proceedings of the International Joint Conference on Biometrics (IJCB), pages 1–6, Oct 2011.

[46] A. Subramaniam, M. Chatterjee, and A. Mittal. Deep neural networks with inexact matching for person re-identification. In Proceedings of the Advances in Neural Information Processing Systems, pages 2667–2675, 2016.

[47] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Proceedings of the Advances in Neural Information Processing Systems, pages 1988–1996, 2014.

[48] Y. Sun, X. Wang, and X. Tang. Sparsifying neural network connections for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4856–4864, 2016.

[49] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

[50] E. Ustinova, Y. Ganin, and V. Lempitsky. Multiregion bilinear convolutional neural networks for person re-identification. arXiv preprint arXiv:1512.05300, 2015.

[51] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In Proceedings of the European Conference on Computer Vision, pages 791–808, 2016.

[52] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In Proceedings of the European Conference on Computer Vision, pages 135–153, 2016.

[53] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang. Studying very low resolution recognition using deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4792–4800, 2016.

[54] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, pages 499–515, 2016.

[55] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529–534, 2011.

[56] L. Wu, C. Shen, and A. v. d. Hengel. Personnet: Person re-identification with deep convolutional neural networks. arXiv preprint arXiv:1601.07255, 2016.

[57] F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person re-identification using kernel-based metric learning methods. In Proceedings of the European Conference on Computer Vision, pages 1–16, 2014.

[58] F. Yang, W. Yang, R. Gao, and Q. Liao. Discriminative multidimensional scaling for low-resolution face recognition. IEEE Signal Processing Letters, 2017.

[59] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.

[60] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li. Salient color names for person re-identification. In Proceedings of the European Conference on Computer Vision, pages 536–551, 2014.

[61] D. Yi, Z. Lei, and S. Z. Li. Deep metric learning for practical person re-identification. arXiv preprint arXiv:1407.4979, 2014.

[62] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353–4361, 2015.

[63] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. arXiv preprint arXiv:1603.02139, 2016.

[64] Z. Zhang, Y. Chen, and V. Saligrama. A novel visual word co-occurrence model for person re-identification. In Proceedings of the ECCV Workshops, pages 122–133, 2014.

[65] R. Zhao, W. Ouyang, and X. Wang. Person re-identification by salience matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 2528–2535, 2013.

[66] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, 2012.

[67] W. W. Zou, P. C. Yuen, and R. Chellappa. Low-resolution face tracker robust to illumination variations. IEEE Transactions on Image Processing, 22(5):1726–1739, 2013.
