T. Swearingen and A. Ross, Predicting Missing Demographic ...rossarun/pubs/SwearingenRoss... · where gender or ethnicity is deduced from a single attribute of a single record. A

Predicting Missing Demographic Information inBiometric Records using Label Propagation

TechniquesThomas Swearingen

Department of Computer Science and EngineeringMichigan State University

East Lansing, Michigan 48824Email: [email protected]

Arun RossDepartment of Computer Science and Engineering

Michigan State UniversityEast Lansing, Michigan 48824

Email: [email protected]

Abstract—Biometric systems use biological attributes such asface, fingerprint, or iris to automatically recognize an individual.In many law enforcement applications, the biometric record of aperson in the database is often supplemented with biographic anddemographic data such as name, address, age, ethnicity, gender,etc. In such applications, some of the records may have missingor incorrect demographic data. In this work, we explore thepotential of a label propagation method to impute demographicdata to partially incomplete biometric records. The proposedmethod utilizes a graph-like structure to capture similaritiesbetween biometric records based on the available data. Thisstructure is then used by the label propagation method to predictmissing data. Experiments using the face image, name, genderand ethnicity of subjects in two datasets confirm the efficacy ofthe scheme in predicting missing data in biometric records.

I. INTRODUCTION

In classical biometric recognition, a biometric probe sampleis compared against a set of biometric gallery samples in orderto recognize an individual. For example, an unknown faceimage (the probe) may be compared against a set of knownface images in a database (the gallery) in order to recognizeit. The gallery samples pertaining to multiple individuals areoften assumed to be independent of one another. Therefore, theprobe sample is compared against every gallery sample (or asubset of gallery samples) independently in order to generatematch scores. These match scores are then used to either verifythe claimed identity of the probe sample, or to determine theidentity associated with the probe sample [1].

We explore the use of a graph structure to model therelationship between gallery samples. In this graph, each nodecorresponds to an identity and the edges between pairs ofnodes describe the similarity between identities. Each identity(node) is a combination of biometric, biographic (e.g., name,occupation) and demographic (e.g., ethnicity, gender) data ofa person. The relationship between two nodes (manifested asan edge or a set of edges) is defined by the similarity betweentheir biometric, biographic and demographic data. Utilizing agraph has several advantages. The output of the identificationprocess could be a subgraph consisting of not only “matching”candidates whose face images look similar to the probe image,

but also other candidate images that are “related” to the probe.For example, when searching for the identity of a probe samplein the graph, the output may consist of gallery identities thatare in social or professional proximity to the individual (suchas a close friend or a co-worker). This would be useful incases where the identity of the probe is not in the gallery, butrelated identities are present in the graph.

In this work, we focus on one advantage of such a graph— the ability to deduce missing (or incorrect) data ina node. Some nodes are likely to have incomplete informa-tion (e.g., there may be missing demographic information).Deducing missing information can improve the integrity ofthe gallery database. Since a similar pair of nodes are likelyto have a stronger edge between them, the confluence ofinformation from neighboring nodes can be used to imputemissing values to an incomplete node.

In the graph formulation, every node can be viewed as arecord consisting of several fields such as name, age, gender,ethnicity, face image, etc. The edge weights in the graph is afunction of the degree of similarity between two participatingnodes and is based on the available fields in the nodes. Somenodes may be complete, in that all their field values areavailable, while other nodes may be incomplete.

Many operational biometric systems store the biographicand demographic data of a person in addition to the biometricdata. Examples include the UIDAI Aadhaar program in India,and the OBIM and TWIC programs in the United States.In such programs, the gallery database could be viewed asconsisting of “records” of individuals. The main contributionof this work is a method for predicting demographic attributesof a biometric record that does not rely on the face imageor name alone, but exploits the existing relationship betweenrecords. The method is based on the label propagation tech-nique. Section II provides a review of related literature. SectionIII presents methods to predict demographic information fromfaces and names, which will be used as a baseline for compar-ison to the label propagation method. Section IV details thelabel propagation method. Section V reports the experimentsand results. Section VI discusses the results and Section VII

T. Swearingen and A. Ross, "Predicting Missing Demographic Information in Biometric Records using Label Propagation Techniques," Proc. of the 15th International Conference of the Biometrics Special Interest Group (BIOSIG), (Darmstadt, Germany), September 2016

offers concluding remarks.

II. LITERATURE REVIEW

A. Predicting Demographic Information

The field of soft biometrics, amongst other things, hasfocused on deducing demographic information from biometricdata. There is a rich literature on this topic and we refer theinterested reader to [2]. In particular, the problems of age [3],race [4] and gender [5] prediction from face images have beenstudied in detail. In addition, there have been some preliminaryattempts to predict a person’s occupation or name from aface photo [6], [7], but the success of such methods has beenextremely limited compared to gender/race prediction.

B. Label Propagation

Label propagation algorithms are examples of semi-supervised learning techniques. Label propagation operatesunder the assumption that points on the same manifold arelikely to have the same label. Since both labeled and unlabeleddata are available, the goal is to induce labels on the unlabeleddata from the labeled data using the natural structure of themanifolds within the data. This is accomplished by construct-ing a graph of nodes and edges. The nodes are simply the datapoints while the edge weights represent the similarity betweenthese data points. In the past, label propagation has been usedto improve the quality of labeling in datasets where there aremissing or incorrect labels [8], [9].

III. PREDICTING DEMOGRAPHIC INFORMATION FROM ASINGLE ATTRIBUTE/RECORD

In this work, we consider two types of demographic at-tributes: gender and ethnicity. Each attribute has two labelsassociated with it. The gender labels are Male and Female,while the ethnicity labels are White and Non-White. However,this work is applicable even when other types of attributesand labels are used. Before we describe the label propagationmethod used in this work, we first establish baseline methodswhere gender or ethnicity is deduced from a single attributeof a single record.

A. Demographic Prediction from Name

We use two data sources from United States Census Bureau(USCB) for predicting gender and ethnicity from name. Thefirst dataset, USCB-1990, is used to predict gender from aforename. The USCB published a list of forenames that werereported in the 1990 census and their corresponding frequen-cies for male and female categories [10]. The second dataset,USCB-2000, is used to predict ethnicity from a surname. TheUSCB published a list of surnames that were reported in the2000 census and their corresponding ethnicity posteriors [11].Figure 1 shows an overview of these methods.

Name

Name-Based

Gender

Likelihood

Male/

Female/

Unknown

Maximum

Posterior

Classi er

Pr(F ) Pr(M)

(n)

Pr(N = n|G = F )

Pr(N = n|G = M)

USCB-1990 Name Data

Compute

Posterior

(a) USCB-1990

Name

USCB-2000 Name Data

Name-Based

Ethnicity

Posteriors

White/

Non-White/

Unknown(n)

Pr(E = W |N = n)

Pr(E = NW |N = n)

Maximum

Posterior

Classi er

(b) USCB-2000

Fig. 1: Overview of the USCB-1990 Gender-from-Name Demographic Clas-sifier and the USCB-2000 Ethnicity-from-Name Demographic Classifier. InFigure 1a, a forename, n, is input into the system and a gender label,Male/Female/Unknown, is output. In Figure 1b, a surname, n, is input intothe system and an ethnicity label, White/Non-White/Unknown, is output.

B. Demographic Prediction from Face Image

As stated earlier, there are a number of publications thatdiscuss the possibility of deducing gender and ethnicity fromface images. In this work, we used the Intraface SDK todeduce ethnicity and gender from face images. Intraface isa face attribute extractor that includes functionality for deter-mining ethnicity and gender [12]. Given an input face image,the software outputs gender and ethnicity labels. The authorsin [12] tested their software on the PubFig dataset [13] whereit obtained an F1 score of 96.1% for gender prediction and anaverage F1 score of 91.8% for ethnicity prediction. Therefore,it is an appropriate choice for automatic demographic attributeextraction from face images.

IV. PREDICTING DEMOGRAPHIC INFORMATION USINGMULTIPLE RECORDS

In contrast to the baseline demographic prediction schemesdiscussed in the previous section, here we employ a graph-based method that uses evidence from multiple records (nodes)to estimate missing demographic values in a biometric record.In the proposed method, the gallery data is organized in agraph structure with nodes corresponding to biometric recordsand edge weights corresponding to degrees of pair-wise sim-ilarity between nodes. In such a graph, we identify two typesof nodes:

1. Complete Node: A nodal record which has no missingfields.

2. Incomplete Node: A nodal record that has one or moremissing demographic fields.

In order to propagate demographic labels from the completenodes to the incomplete nodes, we use a label propagationmethod [14]. Suppose that we are given a set of records/nodesR = {R1, . . . , Rv, Rv+1, . . . , Rn} where the first v recordsare complete and the remaining n− v records are incomplete.Each record has 4 fields: face, name, gender and ethnicity.


Therefore, let Ri = {Fi, Ni, Gi, Ei}. Here, gender and ethnic-ity are viewed as binary attributes who values are in the labelset L = {0,1}. For ethnicity, 0 is non-white and 1 is white.For gender, 0 is female and 1 is male. Let {y1, y2, . . . , yv},yi ∈ L, be the gender (or ethnicity) labels of the completenodes.

Algorithm 1 Demographic Label Propagation1: procedure PROPAGATELABELS(R, Y, σ, α)2: for i, j ∈ [1, n] do3: if i = j then4: Wij = 05: else6: Wij = exp

(− fdiff(Ri,Rj)

2

2σ2

). Edge weights are based on record similarity.

7: end if8: end for9: Dii = zeros(n)

10: for i ∈ [1, n] do11: Dii =

∑nj=1Wij

. Diagonal entries are the sum of the corresponding row in W .12: end for13: S = D− 1

2WD− 12

14: F ∗ = (I − αS)−1 Y15: for i ∈ (v, n] do16: li = argmax0≤j<k F

∗ij

17: end for18: return li’s . Labels for incomplete nodes.19: end procedure

Algorithm 1 details the label propagation method. To prop-agate the demographic labels, we first calculate the affinitymatrix for the graph. Note that the fdiff(Ri, Rj) function onLine 6 produces a scalar difference score between recordsRi and Rj using the match scores between available fields.We then normalize the affinity matrix with the sum of eachrow resulting in the similarity matrix S. Next, we constructa matrix Y of size n × k, where k is the cardinality of thelabel set. Without loss of generality, let us assume that thegender labels are missing for the incomplete nodes, Rv+1 toRn. Since k = 2, Y will be of size n × 2. The ith row ofY pertains to record Ri, and has two columns: Yi,1 = 1, ifGi = 0 and Yi,2 = 1, if Gi = 1. However, for incompleterecords that are missing gender labels, both Yi,1 and Yi,2 areset to 0. We then use Y to let information “flow” from thecomplete nodes to the incomplete nodes by using the noderelationships manifested as values in S.

As noted by [14], rather than iteratively pushing labelinformation between nodes, we can compute the final valuesdirectly as F ∗ = (I − αS)−1

Y . The (I − αS)−1 term canbe seen as a diffusion kernel that diffuses the complete nodelabeling from the upper part of Y onto the incomplete nodesin the lower part of Y . A label for an incomplete node isprimarily dictated by the subset of nodes that sent the mostinformation to it. The labels for the incomplete nodes can bedirectly derived from F ∗. The ith row of F ∗ has two columns;if F ∗

i,0 > F ∗i,1 then incomplete node i is predicted to have label

value 0, else it is predicted to have label value 1.

Name: L. Martin

Gender: Male

Ethnicity: Non-White

Name: A. Monk

Gender: Female

Ethnicity: White

Name: S. Burgin

Gender: Male

Ethnicity: White

(a) Knox County Arrest Dataset

Name: K. Durant

Gender: Male


Name: N. Jonas

Gender: Male

Ethnicity: White

Name: Z. Kravitz

Gender: Female


(b) Online Celebrity Dataset

Fig. 2: Example of biometric records from the two datasetsassembled in this work.

V. EXPERIMENTS AND RESULTS

A. Dataset

Most of the publicly available face datasets do not includeinformation about the name, gender and ethnicity of thesubjects. Therefore, we assembled two datasets from imagesdownloaded from the Web: (1) Knox County Arrest Datasetand (2) Online Celebrity Dataset. Two subsets are next ap-propriated from each dataset, one for gender prediction andone for ethnicity prediction. In each subset, one of the twodemographic attributes has equal-sized cohorts. Figure 2 showsexamples from both datasets and Table I summarizes thedemographic statistics of the datasets.

Knox County Arrest Dataset: The Knox County Sheriff’sOffice posts information about arrested individuals every 24hours. We downloaded arrestee information using an auto-mated script in order to compile the Knox County ArrestDataset. The Knox County Arrest Dataset consists of 1,422records each of which includes forename, surname, gender,ethnicity, and a face image.

Online Celebrity Dataset: We also assembled anotherdataset, that we refer to as the Online Celebrity Dataset,which contains biographic and demographic details of severalcelebrities. It consists of 521 records and each record containsforename, surname, gender, ethnicity, and 2 face images.

B. Demographic Prediction

Single Record Prediction: Table II and Table III showthe baseline results of gender prediction and ethnicity pre-diction, respectively, based on only names or faces from theOnline Celebrity Dataset. Table II shows results on the OnlineCelebrity G dataset. Table III show results on the OnlineCelebrity E dataset.

Graph-Based Multi-Record Prediction: For the labelpropagation method, the KC-G Dataset is used as the complete


TABLE I: Demographic counts of datasets by male (M), female (F), white (W), and non-white (NW).

Dataset Gender EthnicityM F W NW

Knox County Arrest Dataset 1001 421 1099 323KC-G: Knox County Subset G (for Gender Prediction) 421 421 678 164

KC-E: Knox County Subset E (for Ethnicity Prediction) 498 148 323 323Online Celebrity Dataset 246 275 399 122

OC-G: Online Celebrity Subset G (for Gender Prediction) 246 246 375 117OC-E: Online Celebrity Subset E (for Ethnicity Prediction) 140 104 122 122

TABLE II: Results of gender prediction via name (USCB-1990) and face (Intraface) on the OC-G Dataset.

USCB-1990 (%) Intraface (%)Overall 97.8 97.6Female 98.8 97.6Male 96.7 97.6

TABLE III: Results of ethnicity prediction via name (USCB-2000) and face (Intraface) on the OC-E Dataset.

USCB-2000 (%) Intraface (%)Overall 60.7 85.7

Non-White 40.2 73.0White 81.1 98.4

node set and the OC-G Dataset is used as the incompletenode set for gender prediction. For ethnicity prediction, theKC-E Dataset is used as the complete node set and the OC-E Dataset is used as the incomplete node set. The optimalparameter values for the label propagation method are foundby performing a grid search over the σ and α parameters. Thevalue for each parameter is varied in the interval [0.01, 0.99]in steps of 0.01. The overall test classification accuracy asa function of σ and α is shown in Figure 3. For genderprediction, the optimal value of σ is 0.12 and the optimalvalue of α is 0.91. For ethnicity prediction, the optimal valueof σ is 0.15 and the optimal value of α is 0.13. Table IVshows the results of gender and ethnicity prediction usingthe label propagation method with these optimal values. Thelabel propagation method uses all of the fields except the(missing) field that is being predicted. When finding similarity(edge weights) between records in the complete node set,the face, name, gender, and ethnicity fields are used. Whenpredicting gender and comparing nodes where at least onenode is from the incomplete node set, then only the face,name, and ethnicity fields are used. When predicting ethnicityand comparing nodes where at least one node is from theincomplete node set, then only the face, name, and genderfields are used.

The face similarity scores are obtained using a COTS facematcher. The name similarity is calculated using the leven-shtein distance between the two names. For gender (ethnicity),the score is set to 1.0 if the gender (ethnicity) matches, and0.0 if it does not.

50

1

60

0.8

70

1

Accura

cy (

%)

80

0.6 0.8

α

90

0.6

σ

0.4

100

0.40.2

0.20 0

(a) Gender prediction accuracy as a function of σ and α

50

1

60

0.8

70

1

Accura

cy (

%)

80

0.6 0.8

α

90

0.6

σ

0.4

100

0.40.2

0.20 0

(b) Ethnicity prediction accuracy as a function of σ and α

Fig. 3: Impact of label propagation parameters, σ and α, on demographicprediction.

VI. DISCUSSION

Based on the results shown in Section V, we observe thatgender prediction from face and name works well (>95%in all cases). However, ethnicity prediction exhibits a lowerperformance: only 85.7% from face and 60.7% from name.

We observe that the graph-based label propagation methodresults in similar performance as that of single attribute-based approaches. The label propagation prediction accuracyis 98.37% for gender and 80.33% for ethnicity. The differencebetween the two approaches is that the graph-method basedis more easily extensible to other attributes. The name-based


TABLE IV: Results of gender and ethnicity prediction using the label propagation technique on the Knox County Dataset G andE Subsets, respectively, as the complete nodes and Online Celebrity Dataset G and E subsets, respectively, as the incompletenodes.

Demographic Overall Accuracy (%) Female Accuracy (%) Male Accuracy (%)Gender 98.37 97.56 99.19

Non-White Accuracy (%) White Accuracy (%)Ethnicity 80.33 61.48 99.18

and face-based methods have a limited number of attributes forwhich they can be effective predictors (e.g., gender, ethnicity,age, etc.).

The advantage of the record-based label propagation methodis that it utilizes the evidence from multiple records and multi-ple fields in order to predict missing values. Thus, prediction isbased on relationships that exist between records. This type ofrelationship is not captured in single attribute-based classifiers,but is more easily captured in graph-based approaches. Whenpredicting a full range of attributes (e.g. occupation, education-level, etc.), we believe a graph-based method will yield betterresults than single attribute-based predictors.

VII. CONCLUSION

In this work, we demonstrated the benefit of structuringa biometric gallery using a graph structure. Such a structurenot only captures the relationship between gallery records, itcan also be used to deduce missing (or overwrite incorrect)information in these records. A label propagation schemewas adopted to illustrate the possibility of imputing missinggender and ethnicity information. Experiments on two datasetsdemonstrated that the proposed method is (a) capable ofimputing missing information and (b) generalizable acrossdatasets. Future work would involve testing the method ondemographic attributes that have more than two labels.

ACKNOWLEDGEMENT

This project was supported by Award No. 2015-R2-CX-0005, from the National Institute of Justice, Office of JusticePrograms, U.S. Department of Justice. The opinions, findings,and conclusions or recommendations expressed in this publi-cation are those of the authors and do not necessarily reflectthose of the Department of Justice.

REFERENCES

[1] A. Jain, A. Ross, and K. Nandakumar, Introduction to Biometrics.Springer US, 2011.

[2] A. Dantcheva, P. Elia, and A. Ross, “What else does your biometric datareveal? A survey on soft biometrics,” IEEE Transactions on InformationForensics and Security, vol. 11, no. 3, pp. 441–467, March 2016.

[3] Y. Fu, G. Guo, and T. S. Huang, “Age synthesis and estimation viafaces: A survey,” IEEE Transactions on Pattern Analysis and MachineIntelligence, vol. 32, no. 11, pp. 1955–1976, Nov 2010.

[4] S. Fu, H. He, and Z. G. Hou, “Learning race from face: A survey,” IEEETransactions on Pattern Analysis and Machine Intelligence, vol. 36,no. 12, pp. 2483–2509, Dec 2014.

[5] E. Makinen and R. Raisamo, “Evaluation of gender classification meth-ods with automatically detected and aligned faces,” IEEE Transactionson Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 541–547, March 2008.

[6] W.-T. Chu and C.-H. Chiu, “Predicting occupation from single facialimages,” in Proceedings of the IEEE International Symposium onMultimedia, ser. ISM ’14. Washington, DC, USA: IEEE ComputerSociety, 2014, pp. 9–12.

[7] H. Chen, A. C. Gallagher, and B. Girod, “What’s in a name? First namesas facial attributes,” in IEEE Conference on Computer Vision and PatternRecognition, 2013, pp. 3366–3373.

[8] D. Liu, S. Yan, X. S. Hua, and H. J. Zhang, “Image retagging using col-laborative tag propagation,” IEEE Transactions on Multimedia, vol. 13,no. 4, pp. 702–712, Aug 2011.

[9] J. Tang, M. Li, Z. Li, and C. Zhao, “Tag ranking based on salient regiongraph propagation,” Multimedia Syst., vol. 21, no. 3, pp. 267–275,2015.

[10] “Frequently occuring first names and surnames from the 1990 census,”United States Census Bureau, Tech. Rep., 1995.

[11] D. L. Word, C. D. Coleman, R. Nunziata, and R. Kominski,“Demographic aspects of surnames from census 2000,” United StatesCensus Bureau, Tech. Rep.

[12] F. De la Torre, W.-S. Chu, X. Xiong, F. Vicente, X. Ding, and J. Cohn,“Intraface,” in Automatic Face and Gesture Recognition (FG), 2015 11thIEEE International Conference and Workshops on, May 2015, pp. 1–8.

[13] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attributeand Simile Classifiers for Face Verification,” in IEEE InternationalConference on Computer Vision (ICCV), Oct 2009.

[14] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf,“Learning with local and global consistency,” in Advances in NeuralInformation Processing Systems 16, S. Thrun, L. Saul, and B. Scholkopf,Eds. MIT Press, 2004, pp. 321–328.


Documents

T. Swearingen and A. Ross, Predicting Missing Demographic ...rossarun/pubs/SwearingenRoss... · where gender or ethnicity is deduced from a single attribute of a single record. A