
arXiv:1606.03238v2 [cs.CV] 15 Jun 2016

IDNet: Smartphone-based Gait Recognition with Convolutional Neural Networks

Matteo Gadaleta, Student Member, IEEE, and Michele Rossi, Senior Member, IEEE

Abstract—Here, we present IDNet, an original user authentication framework based on smartphone-acquired motion signals. Its goal is to recognize a target user from her/his way of walking, using the accelerometer and gyroscope (inertial) signals provided by a commercial smartphone worn in the front pocket of the user's trousers. Our design features several innovations including: a robust and smartphone-orientation-independent walking cycle extraction block, a novel feature extractor based on convolutional neural networks, a one-class support vector machine to classify walking cycles, and the coherent integration of these into a multi-stage authentication system. Our system exploits convolutional neural networks as universal feature extractors for gait recognition, and combines classification results from subsequent walking cycles within a multi-stage decision making framework. Experimental results show the superiority of our approach against state-of-the-art techniques, leading to misclassification rates (either false negatives or positives) smaller than 0.15% in fewer than five walking cycles. Design choices are discussed and motivated throughout, assessing their impact on the authentication performance.

Index Terms—Biometric gait analysis, smartphone inertial sensors, authentication systems, convolutional neural networks, support vector machines, feature extraction, signal processing, accelerometer, gyroscope.

I. INTRODUCTION

WEARABLE technology is advancing at a very fast pace. Many wearable devices, such as smart watches and wristbands, are currently available in the consumer market, and they often possess miniaturized inertial motion sensors (accelerometer and gyroscope) as well as other sensing hardware capable of gathering biological signals such as photoplethysmographic traces, skin temperature and so forth. Other wearables, such as Zephyr's Bioharness chestband [1], deliver a number of physiological signals via their wireless interfaces (e.g., Bluetooth low energy), including electrocardiogram, heart rate, respiration rate, body orientation and activity level. The same holds true for recent smartphones, which also possess unprecedented sensing capabilities, including motion sensors, while also allowing for the collection of user feedback and for the real-time assessment of their health condition. A notable example is provided by Apple's HealthKit, which turns mobile phones into personalized health data hubs. So, the sensing technology is already available, most wearables already have it and, thanks to smartphones' pervasiveness and connectivity, in the near future these devices may give rise to the first worldwide human sensor network. This is already happening, as testified by, e.g., the mPower Parkinson's disease study from Sage Bionetworks, a nonprofit biomedical research organization that in March 2016 released a huge dataset capturing the everyday experiences of thousands of people (featuring millions of data points), to help advance scientific progress toward effective medical treatments [2]. This data was collected through a mobile application designed for iPhones and allowed a revolutionary measurement campaign.

The authors are with the Department of Information Engineering, University of Padova, via Gradenigo 6/b, 35131, Padova, Italy. E-mail: {gadaleta,rossi}@dei.unipd.it

Two major problems then arise: analyzing the wearable data, and authenticating the mobile users who provide it, so that we can assess with reasonably high accuracy whether the data sources are genuine. Note that certifying the data sources is a necessary step toward the widespread use of this technology in the medical field.

In this paper, we propose IDNet (IDentification Network), a new system for the authentication of mobile users from their smartphone-acquired motion data. In fact, as shown in [3], [4], modern phones possess highly accurate inertial sensors, which allow for non-obtrusive gait biometrics. IDNet features Convolutional Neural Networks (CNN) [5] and tools from machine learning, such as Support Vector Machines (SVM) [6], combining them in an innovative fashion. Specifically, we develop algorithms for i) walking cycle extraction, ii) feature identification and, finally, iii) user authentication. CNNs are used as universal feature extractors to discriminate gait signatures from different subjects. Single- as well as multi-step classifiers are finally combined with CNNs to authenticate the user from multiple walking cycles. As we show shortly, our solution authenticates the target user with high accuracy and outperforms state-of-the-art techniques such as [7]–[12].

The main contributions of this paper are:

• The design and validation of an original preprocessing technique that includes: a robust algorithm for the extraction of walking cycles and an original transformation to move smartphone-acquired motion signals into an orientation invariant reference system. Subsequent processing is carried out within this reference system, as this considerably improves authentication results, see Section III. As opposed to making motion data orientation independent, previous papers either use data acquired from a sensor in a known and fixed position [7], [8], [10], [11], [13]–[16], or use orientation independent features at the cost of losing information about the direction of the forces [17].

• The design of a new CNN-based feature extraction tool, which is trained only once on a representative set of users and then used at runtime as a universal feature extractor, see Section IV. Note that with CNNs, statistical features are automatically extracted as part of the CNN training phase (automatic feature engineering) as opposed to the selection of predefined and often arbitrary features, as commonly done in the literature [8]–[10], [13].

• The combination of CNN-extracted features with a one-class SVM classifier [18], which is solely trained on the target subject, see Section V. The resulting SVM scores are then accumulated across multiple walking cycles to get higher accuracies, through a new multi-stage decision framework, see Section VI.

• The coherent integration of these techniques into the IDNet authentication framework, which uses smartphone-acquired accelerometer and gyroscope motion data (previous algorithms solely used the accelerometer signal).

• The experimental validation of IDNet, proving its superiority against state-of-the-art solutions, see Section IV, and achieving authentication errors below 0.15% using fewer than five walking cycles, see Section VI.

II. RELATED WORK

Interest in gait analysis began in the 1960s, when walking patterns from healthy people, termed normal patterns, were investigated by Murray et al. [19]. These measurements were performed through the analysis of photos acquired using interrupted light photography. Murray compared normal gait parameters against those from pathologic gaits [20] and showed that gait is unique to each individual. Since then, many studies followed, and human identification systems based on gait recognition have been enjoying growing interest.

Although most of these systems are based on computer vision [21], our interest in this paper is in human gait identification through smartphone inertial sensors. Ailisto et al. [22] were the first to look at this problem and they did it through accelerometer data. In their paper, they used a triaxial accelerometer worn on a belt with fixed orientation: the x-axis pointed forward, the y-axis to the left and the z-axis was aligned with the direction of gravity. Only data points from the x and z axes were used for identification purposes. Gait cycle extraction was performed through a simple peak detection method, and a template was built for each subject. User identification employed a template matching technique, for which different methods were explored: temporal correlation, frequency domain analysis and data distribution statistics.

In [23], Derawi et al. improved the detection system through more robust preprocessing, cycle detection and template comparison algorithms. Data were acquired using a mobile phone worn on the hip, and only the vertical z-axis was considered for their motion analysis. Dynamic Time Warping (DTW) [24] was used as the distance measure, to assure robustness against non-linear temporal shifts. This scheme was also tested in [15], where majority voting and cyclic rotation were compared as inference rules. In a further paper [16], Hidden Markov Models (HMM) were explored. Accelerometer data were split into windows of fixed length, which were then utilized to train HMMs. Good identification results were obtained, but at the cost of long authentication phases (longer than 30 seconds).

Classification algorithms based on machine learning were also investigated. Either gait cycle extraction [25] or fixed window lengths [8] are possible signal segmentation methods.

After that, a feature extraction technique is applied to each segment, and statistical measures such as mean, standard deviation, root mean square, zero-crossing rate or histogram bin counts are commonly used. However, more advanced features are required for better results, like cepstral coefficients, which are widely used in speech recognition systems [8], or features extracted through frequency analysis, i.e., using Fourier [7] or wavelet transforms [25]. Supervised algorithms are typically used for classification, including k-Nearest Neighbours (k-NN) [8], [10], [12], [13], Support Vector Machines (SVM) [9], [13], [25], Multi Layer Perceptrons (MLP) [9], [13] and Classification Trees (CT) [9], [13].

We stress that in most of the related work the acquisition system was placed according to a controlled and well known orientation on the subject's body. In real scenarios, this is however unlikely to occur. It is thus important to implement an algorithm whose performance is invariant to the smartphone orientation. In the mPower Parkinson's measurement campaign, for example, participants were required to put their mobile phone in the right front pocket of their trousers, walk 20 steps in a straight line, turn around, stand still for 30 seconds and walk 20 steps back. The phone orientation in the pocket is somewhat unconstrained (and unknown) and the phone reference system is with good probability misaligned with respect to the direction of motion, which makes the definition of walking templates impossible. To deal with this, two different approaches can be used. The first consists in the extraction of features that are rotation invariant (e.g., correlation matrices of Fourier transforms) [17]. The second relies on the transformation of inertial signals [9], projecting them into a new orientation invariant three-dimensional reference system, which is extracted directly from the data. Here, we adopt the latter approach.

Accelerometer-based gait analysis is also of interest in the medical field. Using time-frequency analysis, Huang et al. showed that signals acquired by a waist-worn device on a patient with cervical disc herniation differed before and after the surgery [14]. In [13], classification algorithms were used to discriminate a group of subjects with non-specific chronic low back pain from healthy subjects. Complex parameters, e.g., dynamic symmetry and cyclic stability of gait, were extracted by Jiang et al. [26]. However, their evaluation requires placing sensors on the legs, and fine gait details are difficult to extract from signals acquired by a single waist-worn sensor, such as a smartphone.

III. SIGNAL PROCESSING FRAMEWORK

The aim of IDNet is to correctly recognize a subject from his/her way of walking, through the acquisition of inertial signals from a standard smartphone. The proposed processing workflow is shown in Fig. 1. Walking data is first acquired, then we perform some preprocessing entailing:

1) pre-filtering to remove motion artifacts (Section III-A),
2) the extraction of walking cycles (Section III-B),
3) a transformation to move the raw walking data into a new orientation independent reference system (Section III-C),

4) a normalization to represent each walking cycle (accelerometer and gyroscope data) through fixed-length, zero-mean and unit-variance vectors (Section III-D).

Fig. 1: Signal processing workflow. Accelerometer and gyroscope signals are acquired and preprocessed (filtering, cycle extraction, orientation independent transformation, normalization); the resulting walking cycles feed either a classical machine learning pipeline or the CNN-based authentication chain, whose performance is then evaluated.

After this, the walking cycles are ready to be processed to identify the user. The standard approach, called "Classical Machine Learning", entails the computation of a number of pre-established statistical features, the most informative of which are selected and used to train a classifier. Various machine learning techniques are usually exploited to this purpose, and are trained through a supervised approach. Hence, the classification performance is assessed and the whole process is usually iterated for a further feature selection phase. In this way, the features that are used for the classification task are progressively refined until a final feature set is attained. Note that statistical features are often assessed by the designer through educated guesses and a trial and error approach.

As opposed to this, with IDNet we advocate the use of convolutional neural networks (see Sections IV and V). These have been successfully used by the video processing community [27] but, to the best of our knowledge, have never been exploited for the analysis of inertial data from wearable devices. One of the main advantages of this approach is that statistical features are automatically assessed by the CNN as a result of a supervised training phase. In Section IV, the CNN is trained to act as a universal feature extractor, whereas in Section V a one-class SVM is trained as the final classifier. Once the CNN is trained, our system operates assuming that the smartphone only has access to the walking patterns of the target user (i.e., the legitimate user) and the SVM is solely trained using his/her walking data. Our system is based on the premise that, at runtime, the CNN should be capable of producing discriminant features for unseen users and the one-class SVM, once trained on the target, should reliably detect impostors, although their walks were not used for training.

The processing blocks are described in greater detail in the following subsections.

Notation: With $\mathbf{x} \in \mathbb{R}^n$ we mean a column vector $\mathbf{x} = (x_1, x_2, \dots, x_n)^T$ with elements $x_i \in \mathbb{R}$, where $(\cdot)^T$ is the transpose operator. $|\mathbf{x}| = n$ returns the number of elements in vector $\mathbf{x}$. $\bar{x} = (\sum_{i=1}^n x_i)/n$, whereas $\|\mathbf{x}\| = (\sum_{i=1}^n x_i^2)^{1/2}$ is the L2-norm operator. If $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, we define their inner product as $\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T\mathbf{y}$ and their entrywise product as $\mathbf{x} \circ \mathbf{y} = (x_1 y_1, x_2 y_2, \dots, x_n y_n)^T$. Vector $\mathbf{1}_n = (1, 1, \dots, 1)^T$ with $|\mathbf{1}_n| = n$. Matrices are denoted by uppercase and bold letters. For example, if $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^n$, we define a $3 \times n$ matrix as $\mathbf{M} = [\mathbf{x}, \mathbf{y}, \mathbf{z}]^T$, whose rows contain the three vectors. In addition, element $(i, j)$ of matrix $\mathbf{X}$ is denoted by $X_{i,j} \in \mathbb{R}$. With $\vec{r}$ we mean a 3D vector $\vec{r} = (r_1, r_2, r_3)^T$ and $\hat{r}$ is the corresponding 3D versor $\hat{r} = \vec{r}/\|\vec{r}\|$. For any two 3D vectors $\vec{r}$ and $\vec{s}$ we indicate their cross-product as $\vec{r} \times \vec{s} = (r_2 s_3 - r_3 s_2, r_3 s_1 - r_1 s_3, r_1 s_2 - r_2 s_1)^T$. The gravity vector is referred to as $\vec{\rho}$. With $u(i)$ we mean a time series, where $i = 1, 2, \dots$ is the discrete time index. For acceleration data, the boldface letter $\mathbf{a}$ is reserved for vectors, $a(i)$ for time series and $\mathbf{A}$ for matrices. The same notation is adopted for the gyroscope data, using $\mathbf{g}$, $g(i)$ and $\mathbf{G}$, respectively, for vectors, time series and matrices.

A. Data Acquisition and Filtering

Data were acquired from 50 subjects, over a period of six months, through Android smartphones worn in the right front pocket of the users' trousers. The following devices were used: Asus Zenfone 2, Samsung S3 Neo, Samsung S4, LG G2, LG G4 and a Google Nexus 5. Several acquisition sessions of about five minutes were performed for each subject, in variable conditions, e.g., with different shoes and clothes. We asked each subject to walk in whatever way she/he felt comfortable with, to mimic real world scenarios. For the data acquisition, we developed an Android inertial data logger application, which saves accelerometer, gyroscope and magnetometer signals into non-volatile memory and then automatically transfers them to an Internet server for further processing. The magnetometer signal is not used for identification purposes. In general, IDNet can be used carrying the device in other positions, but we remark that each position requires a dedicated training.

In Fig. 2, we plot the power of accelerometer and gyroscope signals at different frequencies using Welch's method [28], considering a full walking trace and setting the Hanning window length to 1 s, with half-window overlap. Most of the signal power is located at low frequencies, mostly below 40 Hz (where the power is at least 30 dB smaller than the maximum). The raw inertial signals were acquired with an average sample frequency ranging between 100 and 200 Hz (depending on the smartphone model), which is more than appropriate to capture most of the walking signal's energy.
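For reference, this PSD analysis can be reproduced with SciPy's Welch estimator. A minimal sketch, where the sampling rate fs and the synthetic signal a_x are placeholders for one interpolated accelerometer axis:

```python
import numpy as np
from scipy.signal import welch

fs = 200                         # samples/s after interpolation (assumption)
a_x = np.random.randn(fs * 60)   # placeholder for one minute of walking data

# Welch PSD: 1 s Hanning window, 50% overlap, as described above
nperseg = fs
f, Pxx = welch(a_x, fs=fs, window="hann", nperseg=nperseg,
               noverlap=nperseg // 2)
print(f[np.argmax(Pxx)])         # frequency of the power peak
```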

At the first block of the IDNet processing chain, due to the non-uniform sampling performed by the smartphone operating system, we apply a cubic spline interpolation to represent the input data through evenly spaced points (200 points/second). Hence, a low-pass Finite Impulse Response (FIR) filter with a cutoff frequency of $f_{c1} = 40$ Hz is used for denoising and to reduce the motion artifacts that may appear at higher frequencies. In fact, given the power profile of Fig. 2, the selected cutoff frequency only removes noise and preserves the relevant (discriminative) information about the user's motion.

Fig. 2: Power spectral density of accelerometer (continuous lines, one for each axis) and gyroscope (dashed lines) data.

In the following, with $a_x(i)$ and $g_x(i)$ we respectively mean the filtered and interpolated acceleration and gyroscope time series along axis $x$, where $i = 1, 2, \dots$ is the sample number. The same notation holds for axes $y$ and $z$.
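A minimal sketch of this interpolation and filtering stage, assuming raw timestamps t_raw and samples x_raw for one axis; the FIR filter order (numtaps) is our assumption, as it is not reported in the paper:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import firwin, filtfilt

def preprocess_axis(t_raw, x_raw, fs=200, fc1=40.0, numtaps=101):
    """Resample one inertial axis on a uniform grid and low-pass filter it."""
    # Cubic spline interpolation onto evenly spaced points (fs points/s)
    t_uniform = np.arange(t_raw[0], t_raw[-1], 1.0 / fs)
    x_uniform = CubicSpline(t_raw, x_raw)(t_uniform)

    # Low-pass FIR filter, cutoff fc1 = 40 Hz; filtfilt avoids phase delay
    taps = firwin(numtaps, fc1, fs=fs)
    return t_uniform, filtfilt(taps, 1.0, x_uniform)
```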

B. Extraction of Walking Cycles

For the extraction of walking cycles we use a template-based and iterative method that solely considers the accelerometer's magnitude signal. This signal is in fact inherently invariant to the rotation of the smartphone and, as such, allows for the precise assessment of walking cycles regardless of how the user carries the device in her/his front pocket. For each sample $i = 1, 2, \dots$ the acceleration magnitude is computed as:

$$a_{mag}(i) = \left(a_x(i)^2 + a_y(i)^2 + a_z(i)^2\right)^{1/2}. \quad (1)$$

To identify the template, a reference point in $a_{mag}(i)$ has to be located. To do so, inspired by [11], we first pass $a_{mag}(i)$ through a low-pass filter with cutoff frequency $f_{c2} = 3$ Hz. Thus, we detect the first minimum of this filtered signal, which corresponds to the heel strike [29], and the corresponding index is called $\bar{i}$. This minimum is then refined by looking at the original signal $a_{mag}(i)$ in a neighbourhood of $\bar{i}$ and picking the minimum value of $a_{mag}(i)$ in that neighborhood. This identifies a new index $i^\star$ for which $a_{mag}(i^\star)$ is a local minimum. As an example, in Fig. 3, we show this minimum through a red vertical (dashed-dotted) line. As a second step, we pick a window of one second centered in $i^\star$, which in Fig. 3 is represented through two vertical blue (dashed) lines. Now, the samples of $a_{mag}(i)$ falling between the two blue lines define the first gait template, which we call $\mathbf{T}$, with $|\mathbf{T}| = N_s$ samples, where $N_s$ corresponds to the number of samples measured in one second.

Fig. 3: Template extraction using the accelerometer magnitude $a_{mag}(i)$. The first template is the signal between the blue dashed vertical lines. The red dashed line in the center corresponds to $i^\star$.

The extracted template is then iteratively refined and, at the same time, used to identify subsequent walking cycles. To this end, we first define the following correlation distance: for any two real vectors $\mathbf{u}$ and $\mathbf{v}$ of the same size $n$ we have:

$$\mathrm{corr\_dist}(\mathbf{u}, \mathbf{v}) = 1 - \frac{(\mathbf{u} - \bar{u}\mathbf{1}_n) \cdot (\mathbf{v} - \bar{v}\mathbf{1}_n)}{\|\mathbf{u} - \bar{u}\mathbf{1}_n\| \, \|\mathbf{v} - \bar{v}\mathbf{1}_n\|}. \quad (2)$$

The template $\mathbf{T}$ is then processed with the acceleration magnitude through the following Eq. (3), leading to a further metric $\varphi(i)$, where $i = 1, 2, \dots$ is the sample index:

$$\mathbf{v}_{rect}(a_{mag}(i)) = (a_{mag}(i), \dots, a_{mag}(i + N_s - 1))^T, \qquad \varphi(i) = \mathrm{corr\_dist}(\mathbf{T}, \mathbf{v}_{rect}(a_{mag}(i))), \quad i = 1, \dots \quad (3)$$

Fig. 4: Correlation distance $\varphi(i)$ between $a_{mag}(i)$ and the template $\mathbf{T}$ of Fig. 3. Local minima identify the beginning of walking cycles.

As can be seen from Fig. 4, the function $\varphi(i)$ exhibits some local minima, which are promptly located by comparing $\varphi(i)$ with a suitable threshold $\varphi_{th}$ and performing a fast search inside the regions where $\varphi(i) < \varphi_{th}$. The indices corresponding to these minima determine the optimal alignments between the template $\mathbf{T}$ and $a_{mag}(i)$. In particular, the second of these identifies the beginning of the second gait cycle. From these facts we have that:

1) the samples between the second and the third minima correspond to the second gait cycle. It is thus possible to locate accelerometer and gyroscope vectors for this walking cycle, which are respectively defined as $\mathbf{a}_x$, $\mathbf{a}_y$, $\mathbf{a}_z$ and $\mathbf{g}_x$, $\mathbf{g}_y$, $\mathbf{g}_z$, still expressed in the $(x, y, z)$ coordinate system of the smartphone. We remark that the number of samples does not necessarily match the template length and usually differs from cycle to cycle, as it depends on the length and duration of walking steps.

2) A second template $\mathbf{T}'$ is obtained by reading $N_s$ samples starting from the second minimum.

At this point, a new template is obtained through a weighted average of the old template $\mathbf{T}$ and the new one $\mathbf{T}'$:

$$\mathbf{T} = \alpha\mathbf{T} + (1 - \alpha)\mathbf{T}', \quad (4)$$

where for the results in this paper we used $\alpha = 0.9$. The new template $\mathbf{T}$ is then considered for the extraction of the next walking cycle and the procedure is iterated. Note that this technique makes it possible to obtain an increasingly robust template at each new cycle.

A template matching approach exploiting a similar rationale was used in [11], where the authors employed the Pearson product-moment correlation coefficient between the template and $a_{mag}(i)$. The main differences between [11] and our approach are: we obtain the template $\mathbf{T}$ in a neighborhood of $i^\star$, using a fixed number of samples $N_s$, whereas they take the samples between two adjacent minima of $\varphi(i)$ (which may then differ in size for different cycles). In Eq. (4), a discrete-time filter is utilized to refine the template $\mathbf{T}$ at each walking cycle, making it more robust against speed changes. In previous work [11], the template is instead kept unchanged up to a point when minima can no longer be detected, and a new template has to be obtained.
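To make the procedure concrete, the following sketch implements Eqs. (2)–(4), given the magnitude signal of Eq. (1). The threshold phi_th, the refinement window around the first minimum, and the single computation of $\varphi(i)$ with the initial template (the paper refines the template, and hence $\varphi$, at each new cycle) are simplifying assumptions:

```python
import numpy as np
from scipy.signal import argrelmin, firwin, filtfilt

def corr_dist(u, v):
    """Correlation distance of Eq. (2)."""
    uc, vc = u - u.mean(), v - v.mean()
    return 1.0 - np.dot(uc, vc) / (np.linalg.norm(uc) * np.linalg.norm(vc))

def extract_cycles(a_mag, fs=200, fc2=3.0, phi_th=0.5, alpha=0.9):
    """Template-based walking cycle extraction (sketch of Eqs. (1)-(4))."""
    Ns = fs  # template length: one second of samples

    # First minimum of the 3 Hz low-passed magnitude (heel strike),
    # refined on the raw signal in a +/- 0.25 s neighbourhood (assumption)
    lp = filtfilt(firwin(101, fc2, fs=fs), 1.0, a_mag)
    i_bar = argrelmin(lp)[0][0]
    lo, hi = max(i_bar - fs // 4, 0), i_bar + fs // 4
    i_star = lo + np.argmin(a_mag[lo:hi])

    T = a_mag[i_star - Ns // 2 : i_star + Ns // 2].copy()  # first template

    # Correlation distance phi(i) of Eq. (3), then thresholded local minima
    phi = np.array([corr_dist(T, a_mag[i:i + Ns])
                    for i in range(len(a_mag) - Ns)])
    minima = [i for i in argrelmin(phi)[0] if phi[i] < phi_th]

    cycles = []
    for m0, m1 in zip(minima[1:], minima[2:]):   # from the second minimum on
        cycles.append(a_mag[m0:m1])              # one walking cycle
        T = alpha * T + (1 - alpha) * a_mag[m0:m0 + Ns]  # Eq. (4) update
    return cycles
```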

Finally, a normalization phase is required to represent all the cycles through the same number of points $N$, as this is required by the following feature extraction and classification algorithms. Before doing this, however, a transformation of accelerometer and gyroscope signals is performed to express these inertial signals in a rotation invariant reference system, as described next.

C. Orientation Independent Transformation

To evaluate the new orientation invariant coordinate system, three orthogonal versors $\hat{\xi}$, $\hat{\zeta}$, $\hat{\psi}$ are to be found, whose orientation is independent of that of the smartphone and aligned with gravity and the direction of motion. Specifically, our aim is to express accelerometer and gyroscope signals in a coordinate system that remains fixed during the walk, with versor $\hat{\zeta}$ pointing up (and parallel to the user's torso), versor $\hat{\xi}$ pointing forward (aligned with the direction of motion) and $\hat{\psi}$ tracking the lateral movement and being orthogonal to the other two. This entails inferring the orientation of the mobile device carried in the front pocket from the acceleration signal acquired during the walk. To this end, we adopt a technique similar to those of [30], [31].

Gravity is the main low frequency component in the accelerometer data, and will be our starting point for the transform. Moreover, although it is a constant vector, it continuously changes when represented in the $(x, y, z)$ coordinate system of the smartphone, due to the user's mobility and the subsequent change of orientation of the device. So, even the gravity vector $\vec{\rho}$ is not constant when expressed through the smartphone coordinates. As the first axis of the new reference system, we consider the mean direction of gravity within the current walking cycle. Let $n_k$ be the number of samples in the current cycle $k$, with $k = 1, 2, \dots$. We recall that with $\mathbf{a}_x$, $\mathbf{a}_y$ and $\mathbf{a}_z$ we mean the acceleration samples in the current cycle $k$ along the three axes $x$, $y$ and $z$, with $|\mathbf{a}_x| = |\mathbf{a}_y| = |\mathbf{a}_z| = n_k$, whereas with $\mathbf{g}_x$, $\mathbf{g}_y$ and $\mathbf{g}_z$ we indicate the gyroscope samples in the same cycle $k$, with $|\mathbf{g}_x| = |\mathbf{g}_y| = |\mathbf{g}_z| = n_k$. The gravity vector $\vec{\rho}_k$ within cycle $k$ is estimated as:

$$\vec{\rho}_k = (\bar{a}_x, \bar{a}_y, \bar{a}_z)^T. \quad (5)$$

The first versor of the new system, $\hat{\zeta}$, is obtained as:

$$\hat{\zeta} = \frac{\vec{\rho}_k}{\|\vec{\rho}_k\|}. \quad (6)$$

Now, we define the acceleration matrix $\mathbf{A} = [\mathbf{a}_x, \mathbf{a}_y, \mathbf{a}_z]^T$ of size $3 \times n_k$, whose rows correspond to $\mathbf{a}_x$, $\mathbf{a}_y$ and $\mathbf{a}_z$. Likewise, the gyroscope matrix is $\mathbf{G} = [\mathbf{g}_x, \mathbf{g}_y, \mathbf{g}_z]^T$, whose rows correspond to $\mathbf{g}_x$, $\mathbf{g}_y$ and $\mathbf{g}_z$. The projected acceleration and gyroscope vectors along axis $\hat{\zeta}$ are:

$$\mathbf{a}_\zeta = \mathbf{A} \cdot \hat{\zeta}, \qquad \mathbf{g}_\zeta = \mathbf{G} \cdot \hat{\zeta}, \quad (7)$$

where the new vectors have the same size $n_k$. By removing this component from the original accelerometer signal, we project the latter on a plane that is orthogonal to $\hat{\zeta}$. This is the horizontal plane (parallel to the floor). We represent this flattened acceleration data through a new matrix $\mathbf{A}^f = [\mathbf{a}^f_x, \mathbf{a}^f_y, \mathbf{a}^f_z]^T$ of size $3 \times n_k$, where $\mathbf{a}^f_x$, $\mathbf{a}^f_y$ and $\mathbf{a}^f_z$ are vectors of size $n_k$ that describe the acceleration on the new plane:

$$\mathbf{A}^f = \mathbf{A} - \hat{\zeta}\mathbf{a}_\zeta^T. \quad (8)$$

Analyzing this flattened acceleration, we see that during a walking cycle it is unevenly distributed on the horizontal plane. Also, the acceleration points on this plane are dispersed around a preferential direction, which has the highest excursion (variance). Here, we assume that the direction with the largest variance in our measurement space contains the dynamics of interest, i.e., it is parallel to the direction of motion, as was also observed and verified in previous research [30]. Given this, we pick this direction as the second axis (versor $\hat{\xi}$) of the new reference system. This is done by applying Principal Component Analysis (PCA) [32] on the projected points, which finds the direction along which the variance of the measurements is maximized. The PCA procedure is as follows:

1) Find the empirical mean along each direction $x$, $y$ and $z$ (rows 1, 2 and 3 of the flattened acceleration matrix $\mathbf{A}^f$). Store the mean in a new vector $\mathbf{u}$ of size $3 \times 1$, i.e.:

$$u_i = \frac{1}{n_k}\sum_{j=1}^{n_k} A^f_{i,j}, \quad i = 1, 2, 3. \quad (9)$$

2) Subtract the empirical mean vector $\mathbf{u}$ from each column of matrix $\mathbf{A}^f$, obtaining the new matrix $\mathbf{A}^f_{norm}$:

$$\mathbf{A}^f_{norm} = \mathbf{A}^f - \mathbf{u}(\mathbf{1}_{n_k})^T. \quad (10)$$

3) Compute the sample $3 \times 3$ autocovariance matrix $\mathbf{\Sigma}$:

$$\mathbf{\Sigma} = \frac{\mathbf{A}^f_{norm}(\mathbf{A}^f_{norm})^T}{n_k - 1}. \quad (11)$$

4) The eigenvalues and the corresponding eigenvectors of $\mathbf{\Sigma}$ are evaluated. The eigenvector $\vec{v}$ associated with the maximum eigenvalue identifies the direction of maximum variance in the dataset (i.e., the first principal component of the PCA transform).

Hence, versor $\hat{\xi}$ is evaluated as:

$$\hat{\xi} = \frac{\vec{v}}{\|\vec{v}\|}. \quad (12)$$

Accelerometer and gyroscope data are then projected along $\hat{\xi}$ through the following equations: $\mathbf{a}_\xi = \mathbf{A} \cdot \hat{\xi}$ and $\mathbf{g}_\xi = \mathbf{G} \cdot \hat{\xi}$. Being $\hat{\xi}$ placed on a plane that is orthogonal to $\hat{\zeta}$, these two versors are also orthogonal. The third axis is then obtained through a cross product:

$$\hat{\psi} = \hat{\zeta} \times \hat{\xi}, \quad (13)$$

and the new accelerometer and gyroscope data along this axis are respectively obtained as $\mathbf{a}_\psi = \mathbf{A} \cdot \hat{\psi}$ and $\mathbf{g}_\psi = \mathbf{G} \cdot \hat{\psi}$. The transformed vectors $(\mathbf{a}_\xi, \mathbf{a}_\psi, \mathbf{a}_\zeta)$ and $(\mathbf{g}_\xi, \mathbf{g}_\psi, \mathbf{g}_\zeta)$, along with the magnitude vectors $\mathbf{a}_{mag}$ and $\mathbf{g}_{mag}$, are the output of the Orientation Independent Transformation block of Fig. 1.

Fig. 5: Raw accelerometer data from two different walks, acquired from a smartphone worn in the right front pocket with different orientations. Accelerometer data in the smartphone reference system $(x, y, z)$ (left), and after the transformation $(\xi, \psi, \zeta)$ (right). IDNet implements a PCA-based transformation that makes walking data rotation invariant, i.e., subject-specific gait patterns emerge in the new coordinate system (see the red colored patterns in the right plots).

An example of this transform is shown in Fig. 5, where accelerometer and gyroscope data from two different walks from the same subject are plotted. These signals were acquired carrying the phone in the right front pocket of the subject's trousers using two different orientations. As highlighted in the figure, our transform makes walking data rotation invariant. In fact, subject-specific gait patterns emerge in the new coordinate system (see the red colored patterns in the right plots).
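A compact sketch of the whole transformation (Eqs. (5)–(13)) for one walking cycle:

```python
import numpy as np

def orientation_invariant_transform(A, G):
    """Project one cycle onto the (xi, psi, zeta) system (sketch of Eqs. 5-13).

    A, G: 3 x n_k arrays whose rows are the x, y, z accelerometer and
    gyroscope samples of the current cycle.
    """
    # Eqs. (5)-(6): mean gravity direction within the cycle -> versor zeta
    rho = A.mean(axis=1)
    zeta = rho / np.linalg.norm(rho)

    # Eqs. (7)-(8): remove the gravity component, flatten onto the horizontal plane
    a_zeta = A.T @ zeta                        # n_k projected samples
    A_flat = A - np.outer(zeta, a_zeta)

    # Eqs. (9)-(12): first principal component of the flattened data -> versor xi
    A_norm = A_flat - A_flat.mean(axis=1, keepdims=True)
    Sigma = A_norm @ A_norm.T / (A_flat.shape[1] - 1)
    eigval, eigvec = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
    v = eigvec[:, -1]                          # direction of maximum variance
    xi = v / np.linalg.norm(v)

    # Eq. (13): third versor via cross product
    psi = np.cross(zeta, xi)

    R = np.stack([xi, psi, zeta])              # rows: the new basis versors
    return R @ A, R @ G                        # transformed accel. and gyro data
```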

D. Normalization

Each gait cycle has a different duration, which depends on the walking speed and stride length. So, considering the accelerometer and gyroscope data collected during a full walking cycle, we remain with variable-size acceleration and gyroscope vectors, which are now expressed in the new orientation invariant coordinate system discussed in Section III-C. However, since feature extraction and classification algorithms require $N$-sized vectors for each cycle, where $N$ has to be fixed, a further adjustment is necessary. We cope with this cycle length variability through a further spline interpolation to represent all walking cycles through vectors of $N = 200$ samples each. This specific value of $N$ was selected to avoid aliasing. In fact, assuming a maximum cycle duration of $\tau = 2$ seconds, which corresponds to a very slow walk, and a signal bandwidth of $B = 40$ Hz, a number of samples $N > 2B\tau = 160$ samples/cycle is required. Amplitude normalization was also implemented, to obtain vectors with zero mean and unit variance, as this leads to better training and classification performance. This results in a total of eight $N$-sized vectors for each walking cycle, which are inputted into the feature extraction and classification algorithms of the following sections.
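A minimal sketch of this normalization step for one of the eight per-cycle vectors:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def normalize_cycle(x, N=200):
    """Resample one variable-length cycle vector to N samples and z-score it.

    x is one of the eight per-cycle vectors
    (a_xi, a_psi, a_zeta, a_mag, g_xi, g_psi, g_zeta, g_mag).
    """
    n_k = len(x)
    # Spline interpolation onto a fixed N-point grid
    grid = np.linspace(0, n_k - 1, N)
    y = CubicSpline(np.arange(n_k), x)(grid)
    # Zero mean, unit variance
    return (y - y.mean()) / y.std()
```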

IV. CONVOLUTIONAL NEURAL NETWORK

In this section, we present the chosen Convolutional Neural Network (CNN) architecture for IDNet (Section IV-A), along with its optimization, training, and a quantitative comparison against the most common classifiers from the literature (Section IV-B).

A. CNN Architecture

Fig. 6: IDNet authentication framework (data acquisition → preprocessing → CNN feature extraction → feature selection (PCA) → classification (OSVM) → multi-stage authentication). CL1 and CL2 are convolutional layers (20@(1×10) and 40@(4×10), with max pooling after CL2), FL1 and FL2 are fully connected layers ($F = 40$ features, $K = 35$ outputs); the input layer takes 8×200 samples. X@(Y×Z) indicates the number of kernels, X, and the size of the kernel matrix, Y×Z.

CNNs are feed-forward deep neural networks differing from fully connected multilayer networks for the presence of one or more convolutional layers. At each convolutional layer, a number of kernels is defined. Each of them has a number of weights, which are convolved with the input in such a way that the same set of weights, i.e., the same kernel, is applied to all the input data, moving the convolution operation across the input span. Note that, as the same weights are reused (shared weights), and each kernel operates on a small portion of the input signal, it follows that the network connectivity structure is sparse. This leads to advantages such as a considerably reduced computational complexity with respect to fully connected feed-forward neural networks. For more details the reader is referred to [33]. CNNs have been proven to be excellent feature extractors for images [34] and here we prove their effectiveness for motion data. The CNN architecture that we designed to this purpose is shown in Fig. 6. It is composed of a cascade of two convolutional layers, followed by a pooling and a fully-connected layer. The convolutional layers perform a dimensionality reduction (or feature extraction) task, whereas the fully-connected one acts as a classifier. Accelerometer and gyroscope data from each walking cycle are processed according to the algorithms of Section III. We refer to the input matrix for a generic walking cycle as $\mathbf{X} = (\mathbf{a}_\xi, \mathbf{a}_\psi, \mathbf{a}_\zeta, \mathbf{a}_{mag}, \mathbf{g}_\xi, \mathbf{g}_\psi, \mathbf{g}_\zeta, \mathbf{g}_{mag})^T$, where all the vectors are normalized to $N$ samples (see Section III-D). In detail, we have (CL = Convolutional Layer, FL = Fully-connected Layer):

• CL1: The first convolutional layer implements one-dimensional kernels (1×10 samples), performing a first filtering of the input and processing each input vector (rows of $\mathbf{X}$) separately. This means that at this stage we do not capture any correlation among different accelerometer and gyroscope axes. The activation functions are linear and the number of convolutional kernels is referred to as $N_{k1}$.

• CL2: With the second convolutional layer we seek discriminant and class-invariant features. Here, the cross-correlation among input vectors is considered (kernels of size 4×10 samples) and the output activation functions are non-linear hyperbolic tangents. Max pooling is applied to the output of CL2 to further reduce its dimensionality and increase the spatial invariance of features [35]. With $N_{k2}$ we mean the number of convolutional kernels used for CL2.

• FL1: This is a fully connected layer, i.e., each output neuron of CL2 is connected to all input neurons of this layer (weights are not shared). Hyperbolic tangent activation functions are used at the output neurons. The FL1 output vector is termed $\mathbf{f} = (f_1, \dots, f_F)^T$, and contains the $F$ features extracted by the CNN.

• FL2: Each output neuron in this layer corresponds to a specific class (one class per user), for a total of $K$ neurons, where $K$ is the number of subjects considered for the training phase. The $K$-dimensional output vector $\mathbf{y} = (y_1, \dots, y_K)^T$ is obtained by a softmax activation function, which implies that $y_j \in (0, 1)$, $j = 1, \dots, K$, and $\sum_{j=1}^{K} y_j = 1$ (stochastic vector). Also, $y_j$ can be thought of as the probability that the current data matrix $\mathbf{X}$ belongs to class (user) $j$.

The network is trained in a supervised manner for a total of $K$ subjects, solving a multi-class classification problem where each of the input matrices $\mathbf{X}$ in the dataset is assigned to one of $K$ mutually exclusive classes. The target output vector $\mathbf{t} = (t_1, \dots, t_K)^T$ has binary entries and is encoded using a 1-of-$K$ coding scheme, i.e., they are all zero except for the entry corresponding to the subject that generated the input data.
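A PyTorch sketch of this architecture follows. The max pooling window and the use of LazyLinear to infer the flattened size are our assumptions, as the paper does not report these details:

```python
import torch
import torch.nn as nn

class IDNetCNN(nn.Module):
    """Sketch: input 8x200, CL1 20@(1x10) with linear activations,
    CL2 40@(4x10) + tanh + max pooling, FL1 (F=40 features, tanh),
    FL2 (K-way softmax output)."""

    def __init__(self, K=35, F=40, Nk1=20, Nk2=40):
        super().__init__()
        self.cl1 = nn.Conv2d(1, Nk1, kernel_size=(1, 10))   # linear activations
        self.cl2 = nn.Conv2d(Nk1, Nk2, kernel_size=(4, 10))
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))        # assumed pool size
        self.fl1 = nn.LazyLinear(F)                         # -> feature vector f
        self.fl2 = nn.Linear(F, K)                          # -> class scores

    def forward(self, x):                     # x: (batch, 1, 8, 200)
        h = self.cl1(x)                       # (batch, Nk1, 8, 191)
        h = self.pool(torch.tanh(self.cl2(h)))
        f = torch.tanh(self.fl1(h.flatten(1)))   # CNN-extracted features
        return f, torch.log_softmax(self.fl2(f), dim=1)
```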

B. CNN Optimization and Results

In this section, we propose some approaches for the optimization of the CNN, quantify its classification performance and compare it against classification techniques from the literature. As said above, the output of layer FL2 is the stochastic vector $\mathbf{y}$, whose $j$-th entry $y_j$, $j = 1, \dots, K$, can be seen as the probability that the input pattern belongs to user $j$, i.e., $y_j = y_j(\mathbf{w}, \mathbf{X}) = \mathrm{Prob}(t_j = 1 \,|\, \mathbf{w}, \mathbf{X})$, where $\mathbf{w}$ is the vector containing all the CNN weights, $\mathbf{X}$ is the current input matrix (walking cycle) and $t_j = 1$ if $\mathbf{X}$ belongs to class $j$, $t_j = 0$ otherwise. If $\mathcal{X}$ is the set of all training examples, we define the batch set as $\mathcal{B} \subset \mathcal{X}$. Let $\mathbf{X} \in \mathcal{B}$ and denote the corresponding output vector by $\mathbf{y}(\mathbf{w}, \mathbf{X})$ and its $j$-th entry by $y_j(\mathbf{w}, \mathbf{X})$. The corresponding target vector is $\mathbf{t}(\mathbf{X}) = (t_1(\mathbf{X}), \dots, t_K(\mathbf{X}))^T$. The CNN is then trained through a stochastic gradient descent algorithm which minimizes a categorical cross-entropy loss function $L(\mathbf{w})$, defined as [6, Eq. (5.24) of Section 5.2]:

$$L(\mathbf{w}) = -\sum_{\mathbf{X} \in \mathcal{B}} \sum_{j=1}^{K} t_j(\mathbf{X}) \log(y_j(\mathbf{w}, \mathbf{X})). \quad (14)$$

During training, Eq. (14) is iteratively minimized by rotating the walking cycles (training examples) in the batch set $\mathcal{B}$ so as to span the entire input set $\mathcal{X}$. Training continues until a stopping criterion is met (see below).

Walking patterns from $K$ subjects are used to train the CNN, and the same number of cycles $N_c$ is considered for each of them, for a total of $KN_c$ training cycles. $N_t$ randomly chosen walking cycles from each subject are used to obtain a test set $\mathcal{P}$. The remaining cycles are split into training $\mathcal{T}$ and validation $\mathcal{V}$ sets, with $|\mathcal{P}| = KN_t$, $|\mathcal{T}| = KN_c$, $\mathcal{X} = \mathcal{P} \cup \mathcal{T} \cup \mathcal{V}$, where all the sets have null pairwise intersection and are built picking input patterns from $\mathcal{X}$ evenly at random. Set $\mathcal{V}$ is used to terminate the training phase, and termination occurs when the loss function $L(\mathbf{w})$ evaluated on $\mathcal{V}$ does not decrease for twenty consecutive training epochs. After that, the network weights which led to the minimum validation loss are used to assess the CNN performance on set $\mathcal{P}$. This is done through an accuracy measure, defined as the number of walking cycles correctly classified by the CNN divided by the total number of cycles in $\mathcal{P}$. In the following graphs, we show the mean accuracy obtained averaging the test set performance over ten different networks, all of them trained through the just explained approach by considering $K = 35$ subjects from our dataset and $N_t = 100$ cycles per subject.
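A minimal sketch of this training protocol (stochastic gradient descent on Eq. (14) with early stopping on the validation loss, patience of twenty epochs), assuming the IDNetCNN module sketched earlier and PyTorch data loaders train_loader and val_loader:

```python
import copy
import torch
import torch.nn as nn

def train_cnn(model, train_loader, val_loader, max_epochs=500, patience=20):
    """Early-stopped training on the categorical cross-entropy of Eq. (14)."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # lr is an assumption
    loss_fn = nn.NLLLoss()            # with log_softmax outputs = cross-entropy
    best_loss, best_weights, stall = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for X, t in train_loader:                 # t: integer class labels
            opt.zero_grad()
            _, log_y = model(X)
            loss_fn(log_y, t).backward()          # Eq. (14) on the batch
            opt.step()

        model.eval()
        with torch.no_grad():                     # validation loss
            val = sum(loss_fn(model(X)[1], t).item() for X, t in val_loader)

        if val < best_loss:                       # keep the best weights so far
            best_loss, stall = val, 0
            best_weights = copy.deepcopy(model.state_dict())
        else:
            stall += 1
            if stall >= patience:                 # 20 epochs without improvement
                break

    model.load_state_dict(best_weights)
    return model
```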

As a first set of results, we look at the impact of $F$ (neurons in layer FL1) and of the number of convolutional kernels in CL1 and CL2. Since the last layer FL2 acts as a classifier, $F$ can be seen as the number of features extracted by the CNN. In general, a too small $F$ can lead to poor classification results; too many features, instead, would make the state space too big to be effectively dealt with (curse of dimensionality) [36]. Besides $F$, we also investigate the right number of kernels to use within each convolutional layer. Three networks are considered by picking different $(N_{k1}, N_{k2})$ pairs. For network 1 we use $(N_{k1} = 10, N_{k2} = 20)$, network 2 has $(N_{k1} = 20, N_{k2} = 40)$ and network 3 has $(N_{k1} = 30, N_{k2} = 50)$. In Fig. 7, we show the accuracy performance of these networks as a function of $F$. From this plot, it can be seen that at least $F = 20$ neurons have to be used at the output of FL1 and that the accuracy performance stabilizes around $F = 40$, leading to negligible improvements as $F$ grows beyond this value. As for the number of kernels, we conclude that small networks (network 1) perform worse than bigger ones (networks 2 and 3), but increasing the number of kernels beyond that used for network 2 does not lead to appreciable improvements. Hence, for the results of this paper we used $F = 40$ with $(N_{k1} = 20, N_{k2} = 40)$.

Fig. 7: CNN test accuracy vs the number of features $F$ in layer FL1. Three curves are shown for three different network configurations (number of kernels in layers CL1 and CL2).

A key performance comparison is shown in Fig. 8, where the accuracy is plotted against $N_c$ for a CNN classifier and four selected classification algorithms from the literature, i.e., Classification Trees (CT) [37], Naive Bayes (NB) classifiers [38], $k$-Nearest Neighbors ($k$-NN) [39] and Support Vector Machines (SVM) [40].¹ These approaches were used in a large number of papers including [8]–[10], [13], [25]. For their training, 112 features were extracted from the signal samples in $\mathbf{X}$, including their variance, mean trend, windowed mean difference, variance trend, windowed variance difference, maxima and minima, spectral entropy, zero crossing rate and bin counts. These features were then utilized to train the selected classifiers in a supervised manner. Note that, while the CNN automatically extracts its features (vector $\mathbf{f}$), with previous techniques these are manually selected based on experience.

From Fig. 8, we see that the CNN-based classification approach surpasses all the previous classifiers from the literature, delivering better accuracies across the entire range of $N_c$. Also, the accuracy increases with an increasing $N_c$ until it saturates and no noticeable improvements are observed. While a higher $N_c$ is always beneficial, a higher number of cycles also entails a longer acquisition time, which we would rather avoid. For this reason, for the following results we have used $N_c = 40$, as it provided a good tradeoff between accuracy and complexity across all our experiments.

¹For SVM, we considered a linear kernel, as it outperformed polynomial and radial basis function ones (results are omitted in the interest of space). A one-versus-all strategy was used to solve the considered multiclass problem with the binary classifiers.

Fig. 8: CNN test accuracy vs the number of walking cycles $N_c$ used for training. Results for CT, NB, $k$-NN and SVM classifiers from the literature are also shown.

Fig. 9: Test accuracy of CT, NB, $k$-NN and SVM classifiers. "CNN" indicates training with CNN-extracted features, whereas "Manual" means standard feature extraction.

To illustrate the superiority of CNN features with respect to manually extracted ones, in the following we conduct an instructive experiment. We consider the CNN as a feature extraction block, by removing the output vector $\mathbf{y}$ and using the inner feature vector $\mathbf{f}$ to train the above classifiers from the literature (CT, NB, $k$-NN and SVM). The corresponding accuracy results are provided in Fig. 9. All the classifiers perform better when trained using CNN features, with typical improvements in the test accuracy of more than 10%. For instance, for a $k$-NN classifier trained with $N_c = 30$ cycles per subject, the accuracy increases from 71% (manually extracted features) to 94% (CNN features). The best performance is provided by the combined use of CNN features and SVM.

A last consideration is in order. Most of the previous papers only used accelerometer data, but our results show that using both gyroscope and accelerometer data provides further improvements, see Fig. 10.

Fig. 10: Impact of gyroscope data. Lines represent the mean accuracy (averaged over ten networks), whereas markers indicate the results of the ten network instances.

V. ONE-CLASS SUPPORT VECTOR MACHINE TRAINING

In this section, we further extend the IDNet CNN-based authentication chain through the design of an SVM classifier which is trained solely using the motion data of the target subject. This is referred to as One-Class Classification (OCC) and is important for practical applications where motion signals of the target user are available, but those belonging to other subjects are not. More importantly, with this approach the classification framework can be extended to users that were not considered in the CNN training.

A. Revised Classification Architecture

Due to the generalization properties of convolutional deep networks, once trained, the CNN can be used as a universal feature extractor, providing meaningful features even for subjects that were not included in the training. To take advantage of this, we discard the output neurons of FL2 and utilize the CNN as a dimensionality reduction tool that, given an input matrix $\mathbf{X}$, returns a user dependent feature vector $\mathbf{f}$. The CNN is then trained only once considering the optimizations of Section IV-B. All its weights and biases are then precomputed and will not be modified at classification time. Considering the diagram of Fig. 6, at the output of the CNN we obtain the feature vector $\mathbf{f}$. We then apply a feature selection block to reduce the number of features from $F$ to $S \leq F$ (dimensionality reduction). PCA is used to accomplish this task and the new feature vector is called $\mathbf{s}$. Hence, we have $\mathbf{s} = \Upsilon(\mathbf{f})$, where $\Upsilon(\cdot): \mathbb{R}^F \to \mathbb{R}^S$ is the PCA transform.

A One-Class Support Vector Machine (OSVM) is then used as the classification algorithm (Section V-B). It defines a boundary around the feature (training) vectors belonging to the target subject. At runtime, as a new walking cycle is processed, the OSVM takes the feature vector $\mathbf{s}$ and outputs a score, which is a distance measure between the current feature vector and the SVM boundary [6, Chapter 7]. As we discuss shortly, this score relates to the likelihood that the current walking cycle belongs to the target user.


B. One-Class SVM Design

Next, we design the OSVM block of Fig. 6. It differs from a standard binary SVM classifier in that the SVM boundary is built solely using patterns from the positive class (target user). The strategy proposed by Schölkopf is to map the data into the feature space of a kernel, and to separate them from the origin with maximum margin [41]. The corresponding minimization problem is similar to that of the original SVM formulation [40]. The trick is to use a hyperplane (in the space transformed by a suitable kernel function) to discriminate the target vectors. The OSVM takes as input the reduced feature vector $\mathbf{s} = (s_1, \dots, s_S)^T$ and we use the following Radial Basis Function (RBF) kernel, that for any $\mathbf{s}, \mathbf{s}' \in \mathbb{R}^S$ is defined as:

$$\Psi(\mathbf{s}, \mathbf{s}') = (\Phi(\mathbf{s}) \cdot \Phi(\mathbf{s}')) = \exp\left(-\gamma \|\mathbf{s} - \mathbf{s}'\|^2\right), \quad (15)$$

where $\Phi(\mathbf{s})$ is a feature map and $\gamma$ is the RBF kernel parameter, which intuitively relates to the radius of influence that each training vector has for the space transformation. With $\ell$ we mean the number of training points (feature vectors), $\boldsymbol{\omega}$ and $b$ are the hyperplane parameters in the transformed domain (through Eq. (15)) and $\boldsymbol{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_\ell)^T$ is the vector of slack variables, which are introduced to deal with outliers. Given this, the following quadratic program is defined to separate the feature vectors in the training set, $\mathbf{s}_1, \dots, \mathbf{s}_\ell$, from the origin:

$$\min_{\boldsymbol{\omega}, \boldsymbol{\varepsilon}, b} \; \frac{1}{2}\|\boldsymbol{\omega}\|^2 + \frac{1}{\nu\ell}\sum_{j=1}^{\ell}\varepsilon_j - b \quad (16)$$
$$\text{subject to } (\boldsymbol{\omega} \cdot \Phi(\mathbf{s}_j)) \geq b - \varepsilon_j, \quad \varepsilon_j \geq 0, \quad j = 1, \dots, \ell.$$

$\nu \in (0, 1)$ is one of the most important parameters and sets an upper bound on the fraction of outliers and a lower bound on the fraction of Support Vectors (SV) [41]. The decision function for a generic feature vector $\mathbf{s}$, defined as $d(\mathbf{s}) \in \{-1, +1\}$, is obtained solving Eq. (16), and only depends on the training vectors through the following relations:

$$d(\mathbf{s}) = \mathrm{sgn}(h(\mathbf{s})), \qquad h(\mathbf{s}) = \sum_{j=1}^{\ell} \alpha_j \Psi(\mathbf{s}_j, \mathbf{s}) - b. \quad (17)$$

Now, $\alpha_j \geq 0$, $\forall j$, and only some of the training vectors have $\alpha_j > 0$. These are the support vectors associated with the classification problem and are the only ones that count in the definition of the SVM boundary. $h(\mathbf{s})$ is the score associated with vector $\mathbf{s}$. It weighs the distance from the SVM boundary, i.e., it is greater than zero if $\mathbf{s}$ resides inside the boundary, zero if it lies on it and negative otherwise.

Hence, the SVM is trained using a set of $\ell$ feature vectors from the target user, obtaining the SVM boundary (and the related decision function) through Eq. (17). After training, we test the performance of the obtained SVM classifier considering feature vectors from the positive class $C_1$ (target user) and the negative one $C_0$ (any other user). Note that the vectors used for this test were not considered during the SVM training.

Fig. 11: OSVM: F-measure as a function of $\gamma$ and $\nu$.

As is customary for binary classification approaches, the two most important metrics to assess the goodness of a classifier are the precision and the recall. The precision is the fraction of true positives, i.e., the fraction of patterns identified as the target class that in fact belong to the target user, while the recall corresponds to the fraction of target patterns that are correctly classified out of the entire positive class of samples [42]. Often, these two metrics are combined into their harmonic mean, which is called F-measure and is used as the single quality parameter.
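Explicitly, denoting precision by $P$ and recall by $R$, the F-measure is:

$$F = \frac{2PR}{P + R}.$$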

In Fig. 11, the F-measure is plotted as a function of the two SVM parameters $\gamma$ and $\nu$. As seen from this plot, the area where the classifier's performance is maximum is quite ample. This is good, as it means that even selecting $\gamma$ and $\nu$ once and for all at design stage, the performance of the SVM classifier is not expected to change much if the signal statistics change or a new target user is considered. In other words, this relatively weak dependence on the parameters entails an intrinsic robustness for the classifier. For the results that follow we have used $\gamma = 0.3$ and $\nu = 0.02$.

Two last considerations are in order. The first relates to the PCA transformation $\Upsilon(\cdot)$, and in particular to how many and which principal components have to be retained for the output feature vectors. In fact, as pointed out in [43], two options are possible to go from the CNN-extracted feature vector $\mathbf{f}$ to $\mathbf{s}$. The first is to retain the $S \leq F$ entries of the transformed vector (expressed in the PCA basis) that correspond to the principal components with the highest variance, whereas a second option is to retain those with the smallest. Fig. 12 shows the F-measure of the OSVM classifier as a function of $S$ for $F = 40$ (number of CNN-extracted features). From this plot we see that picking $S < F$ in general provides better results and also that considering the principal components with the lowest variance provides better results for this class of problems. This is in accordance with [43].
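As a sketch of this feature selection and classification stage with scikit-learn: sklearn's PCA sorts components by decreasing variance, so retaining the lowest-variance components amounts to keeping the last $S$ coordinates. The value of $S$ and the feats placeholder below are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

S = 25                                   # retained components, S <= F (assumed)
feats = np.random.randn(1000, 40)        # placeholder: CNN features, target user

pca = PCA().fit(feats)                   # components by decreasing variance
s_train = pca.transform(feats)[:, -S:]   # keep the S lowest-variance components

# One-class SVM with the RBF kernel of Eq. (15) and the parameters above
osvm = OneClassSVM(kernel="rbf", gamma=0.3, nu=0.02).fit(s_train)

# At runtime: score h(s) of Eq. (17) for new walking cycles
scores = osvm.decision_function(pca.transform(feats[:5])[:, -S:])
```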

The last consideration regards the number of feature vectors belonging to the target user that should be used for the OSVM training. Note that this number is related to the walking time required for a new subject to train his/her personal authentication system. To perform this analysis, a fixed number of cycles was randomly extracted from the whole target dataset and used to train the OSVM. The remaining walking cycles were used as the positive test set. In Fig. 13 we show the F-measure as a function of this number of cycles. From these


Fig. 12: OSVM: F-measure as a function of the number of retained PCA features S. The number of CNN-extracted features is F = 40.

Fig. 13: F-measure as a function of the number of walking cycles used to train the OSVM classifier.

results, it follows that increasing the number of cycles beyond 1,000 leads to little improvement. This number corresponds to about 15 minutes of walking activity, distributed among different acquisition sessions. Multiple sessions are recommended to account for some statistical variation due to wearing different clothes.

Once all the model's parameters are defined, the OSVM score can be analyzed. Let pθ(h(s)) = p(h(s) | s ∈ Cθ) be the estimated probability density function (pdf) of the OSVM score h(s) ∈ R, provided that the walking cycle belongs to a user of class Cθ with θ ∈ {0, 1}. Empirical pdfs pθ(h(s)) obtained from our dataset are provided in Fig. 14.
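The paper does not state how the empirical pdfs of Fig. 14 are estimated; a Gaussian kernel density estimate is one plausible choice, sketched here on placeholder score samples.

```python
# Estimate the class-conditional score pdfs p_1(h(s)) and p_0(h(s)).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
target_scores = rng.normal(1.0, 1.0, 2000)     # h(s) samples, class C1 (placeholder)
impostor_scores = rng.normal(-1.0, 1.0, 2000)  # h(s) samples, class C0 (placeholder)

p1 = gaussian_kde(target_scores)   # estimate of p_1(h(s))
p0 = gaussian_kde(impostor_scores) # estimate of p_0(h(s))

grid = np.linspace(-5.0, 5.0, 500)  # evaluate the densities, e.g., to plot Fig. 14
pdf1, pdf0 = p1(grid), p0(grid)
```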

VI. MULTI-STAGE AUTHENTICATION

The processing pipeline discussed so far returns a score for each walking cycle. However, as seen in Fig. 14, when a score falls near the point where the two pdfs intersect, there is high uncertainty about the identity of the user who generated it. In IDNet, we resolve this indetermination by jointly considering the scores from successive walking cycles.

Fig. 14: Empirical probability distribution functions of the scores for class C1 (p1(h(s))) and C0 (p0(h(s))).

Let O = (o1, o2, . . . ) be a sequence of subsequent OSVM scores from the same subject, where oi = h(si) ∈ R and i = 1, 2, . . . is the walking cycle index. From our previous analysis, oi can be thought of as a random process having probability density function pθ(h(si)) = pθ(oi), θ ∈ {0, 1}, and our objective is to reliably estimate θ from the scores in O. Toward this, we assume that subsequent scores belong to the same user and that they are independent and identically distributed (i.i.d.), i.e., they are independently drawn from pθ(·), with θ unknown.

For the estimation of θ we use Wald's sequential probability ratio test (SPRT) [44], [45]. We define the two hypotheses {H1 : θ = 1}, meaning that the sequence O belongs to the target user (class C1), and {H0 : θ = 0}, meaning that another user generated it (class C0). Hence, we assess which one of these is true through SPRT sequential binary testing. That is, we keep measuring new scores and use them to decrease our uncertainty about θ. Considering n samples (o1, o2, . . . , on), the final decision takes on two values, Dn = 0 or Dn = 1, where Dn = j, j ∈ {0, 1}, means that hypothesis Hj is accepted and the alternative hypothesis is rejected. Owing to our assumptions (i.i.d. scores, generated by the same subject), for n scores On = (o1, o2, . . . , on) the joint pdf is:

$$p_\theta(O_n) = \prod_{j=1}^{n} p_\theta(o_j)\,, \qquad \theta \in \{0, 1\}\,. \qquad (18)$$

Defining λj = p1(oj)/p0(oj), the likelihood ratio of the sequence O truncated at index n, denoted On, is

$$\frac{p_1(O_n)}{p_0(O_n)} = \prod_{j=1}^{n} \frac{p_1(o_j)}{p_0(o_j)} = \prod_{j=1}^{n} \lambda_j\,, \qquad (19)$$

and applying the logarithm, we get:

$$\Lambda_n = \log\!\left(\frac{p_1(O_n)}{p_0(O_n)}\right) = \sum_{j=1}^{n} \log(\lambda_j)\,. \qquad (20)$$

If we wait a further step n + 1 before making a decision, from Eq. (20) the new log-likelihood Λn+1 is conveniently obtained as Λn+1 = Λn + log(λn+1). The SPRT test starts


Fig. 15: Results of the multi-stage authentication framework. False positive and negative rates are shown in the top graphs, the number of walking cycles required to make a final decision on the user's identity is shown in the bottom ones. Upper shaded areas extend for a full standard deviation from the mean and include about 80% of the events.

from time 1, obtaining one-class OSVM scores o1, o2, . . . for each successive walking cycle. After n cycles, the cumulative log-likelihood ratio is Λn = Λn−1 + log(λn), with Λ0 = 0. Two suitable thresholds A and B are defined: the test continues to the next cycle n + 1 if A < Λn < B, H1

is accepted if Λn ≥ B, whereas H0 is accepted if Λn ≤ A. Moreover, defining α as the probability of accepting H1 when H0 is true and β as that of accepting H0 when H1 is true, A and B can be approximated as A = log(β/(1 − α)) and B = log((1 − β)/α), see [44].
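Putting Eqs. (18)-(20) and the thresholds together, a minimal sketch of the sequential test follows; p1 and p0 are assumed to be callables returning scalar density values (e.g., wrapping the density estimates of the previous stage), and the function name is illustrative.

```python
# Wald's SPRT over successive OSVM scores.
import numpy as np

def sprt(scores, p1, p0, alpha=1e-3, beta=1e-3):
    """Return (decision, cycles): 1 accepts H1 (target user), 0 accepts H0."""
    A = np.log(beta / (1.0 - alpha))  # accept H0 when Lambda_n <= A
    B = np.log((1.0 - beta) / alpha)  # accept H1 when Lambda_n >= B
    Lam = 0.0                         # Lambda_0 = 0
    for n, o in enumerate(scores, start=1):
        # Lambda_n = Lambda_{n-1} + log(lambda_n), with lambda_n = p1(o_n)/p0(o_n)
        Lam += np.log(float(p1(o))) - np.log(float(p0(o)))
        if Lam >= B:
            return 1, n
        if Lam <= A:
            return 0, n
    return None, len(scores)  # undecided within the observed cycles
```

Shrinking α and β pushes A and B apart, trading more walking cycles for lower error rates, which is the trade-off visible in Fig. 15.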

A. Experimental Results

The motion data from K = 35 subjects was used to train the CNN feature extractor, with Nc = 40, F = 40 and S = 20. One user out of the remaining 15 was considered as the target user and the other 14 as the negatives for the final tests. The following results were obtained through a leave-one-out cross-validation approach for the sessions of the target user, i.e., out of twelve sessions, eleven are used for training and one for the final tests. The session that is left out is rotated and the final results are averaged across all trials. The authentication results of the multi-stage framework are shown in Fig. 15. False positive rates (i.e., a user is mistakenly authenticated as the target) and false negative ones (i.e., the target is not recognized) are smaller than 0.15% for an appropriate choice of the SPRT thresholds (via α and β). Also, a reliable authentication requires fewer than five walking cycles in 80% of the cases. This means that the framework is both very accurate and fast. We remark that the best authentication results obtained in previous papers lead to error rates ranging from 5 to 15% [7]-[12]. A direct comparison with these approaches is very difficult to carry out due to the different datasets (e.g., number of subjects and walking time) and acquisition settings (e.g., smartphone or sensor location). The reader can nevertheless refer to Section IV-B for a fair comparison between

our single-step classification framework and classical feature extraction techniques on our dataset.

As for our assumptions, in light of the small number of cycles required, it is reasonable to presume that the same subject generates the scores in O. For the i.i.d. assumption, we extended the decision framework to the first-order autoregressive model of [45, Chapter 3, p. 158], which allows tracking the correlation across successive cycles. However, this did not lead to any appreciable performance improvement and only implied a higher complexity. The reason is that scores are only lightly correlated in time.

VII. CONCLUSIONS

In this paper we have proposed IDNet, a user authentication framework for smartphone-acquired inertial signals. Various schemes performing manual feature extraction and using the selected features for user classification have appeared in the recent literature. In sharp contrast with these, IDNet exploits convolutional neural networks, as they allow for automatic feature engineering and have excellent generalization capabilities. These deep neural networks are then used as universal feature extractors to feed classification techniques, combining them with one-class support vector machines and a novel multi-stage decision algorithm. With our framework, the neural network is trained once and for all and subsequently utilized for new users. The one-class classifier is solely trained using motion data from the target subject; it returns a score quantifying the dissimilarity of newly acquired data with respect to that of the target. Subsequent scores are then accumulated through a multi-stage decision approach.

Experimental results show the superiority of IDNet against prior work, leading to misclassification rates smaller than 0.15% in fewer than five walking cycles. Design choices and the optimization of the various processing blocks were discussed and compared against classical approaches.


REFERENCES

[1] Zephyr Technology Corporation, "Bioharness 3 - Wireless Professional Heart Rate Monitor and Physiological Monitor," 2016. [Online]. Available: http://www.zephyranywhere.com/

[2] B. M. Bot, C. Suver, E. C. Neto, M. K. A. Klein, C. Bare, M. Doerr, A. Pratap, J. Wilbanks, E. R. Dorsey, S. H. Friend, and A. D. Trister, "The mPower study, Parkinson disease mobile data collected using ResearchKit," Nature Scientific Data, vol. 3, pp. 70-73, Mar 2016.

[3] M. W. Whittle, Gait Analysis: An Introduction. Edinburgh: Butterworth-Heinemann, 2007.

[4] H. Chan, H. Zheng, H. Wang, and R. Sterritt, "Evaluating and overcoming the challenges in utilizing smart mobile phones and standalone accelerometer for gait analysis," in IET Irish Signals and Systems Conference (ISSC 2012), Maynooth, Ireland, Jun 2012.

[5] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, Ohio, US, Jun 2014.

[6] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.

[7] H. M. Thang, V. Q. Viet, N. D. Thuc, and D. Choi, "Gait identification using accelerometer on mobile phone," in International Conference on Control, Automation and Information Sciences (ICCAIS), Saigon, Vietnam, Nov 2012.

[8] C. Nickel, T. Wirtl, and C. Busch, "Authentication of smartphone users based on the way they walk using k-NN algorithm," in International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), Piraeus-Athens, Greece, Jul 2012.

[9] Y. Watanabe, "Influence of Holding Smart Phone for Acceleration-Based Gait Authentication," in International Conference on Emerging Security Technologies (EST), Houston, Texas, US, Sept 2014.

[10] S. Choi, I. H. Youn, R. LeMay, S. Burns, and J. H. Youn, "Biometric gait recognition based on wireless acceleration sensor using k-nearest neighbor classification," in International Conference on Computing, Networking and Communications (ICNC), Honolulu, Hawaii, US, Feb 2014.

[11] Y. Ren, Y. Chen, M. C. Chuah, and J. Yang, "Smartphone based user verification leveraging gait recognition for mobile healthcare systems," in IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), New Orleans, Louisiana, US, Jun 2013.

[12] S. Sprager and M. B. Juric, "An Efficient HOS-Based Gait Authentication of Accelerometer Data," IEEE Transactions on Information Forensics and Security, vol. 10, no. 7, pp. 1486-1498, 2015.

[13] H. Chan, H. Zheng, H. Wang, R. Sterritt, and D. Newell, "Smart mobile phone based gait assessment of patients with low back pain," in Ninth International Conference on Natural Computation (ICNC), San Diego, California, US, Jul 2013.

[14] G.-S. Huang, C. C. Wu, and J. Lin, "Gait analysis by using tri-axial accelerometer of smart phones," in International Conference on Computerized Healthcare (ICCH), Hong Kong, China, Dec 2012.

[15] C. Nickel, M. O. Derawi, P. Bours, and C. Busch, "Scenario test of accelerometer-based biometric gait recognition," in Third International Workshop on Security and Communication Networks (IWSCN), Gjøvik, Norway, May 2011.

[16] C. Nickel, C. Busch, S. Rangarajan, and M. Mobius, "Using hidden Markov models for accelerometer-based biometric gait recognition," in IEEE 7th International Colloquium on Signal Processing and its Applications (CSPA), Penang, Malaysia, Mar 2011.

[17] T. Kobayashi, K. Hasida, and N. Otsu, "Rotation invariant feature extraction from 3-D acceleration signals," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011.

[18] B. Scholkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.

[19] M. P. Murray, A. B. Drought, and R. C. Kory, "Walking patterns of normal men," The Journal of Bone & Joint Surgery, vol. 46, no. 2, pp. 335-360, 1964.

[20] M. P. Murray, "Gait as a total pattern of movement: Including a bibliography on gait," American Journal of Physical Medicine & Rehabilitation, vol. 46, no. 1, pp. 290-333, 1967.

[21] M. S. Nixon, T. Tan, and R. Chellappa, Human Identification Based on Gait. Springer, 2006.

[22] J. Mantyjarvi, M. Lindholm, E. Vildjiounaite, S. M. Makela, and H. A. Ailisto, "Identifying users of portable devices from gait pattern with accelerometers," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, Pennsylvania, US, Mar 2005.

[23] M. O. Derawi, C. Nickel, P. Bours, and C. Busch, "Unobtrusive user-authentication on mobile phones using biometric gait recognition," in 6th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), Darmstadt, Germany, Oct 2010.

[24] E. Keogh and C. Ratanamahatana, "Exact indexing of dynamic time warping," Knowledge and Information Systems, vol. 7, no. 3, pp. 358-386, 2005.

[25] F. Juefei-Xu, C. Bhagavatula, A. Jaech, U. Prasad, and M. Savvides, "Gait-ID on the move: Pace independent human identification using cell phone accelerometer dynamics," in Fifth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Washington DC, US, Sept 2012.

[26] S. Jiang, B. Zhang, G. Zou, and D. Wei, "The possibility of normal gait analysis based on a smart phone for healthcare," in IEEE International Conference on Green Computing and Communications (GreenCom), Internet of Things (iThings), and Cyber, Physical and Social Computing (CPSCom), Beijing, China, Aug 2013.

[27] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, US, Jun 2014.

[28] P. D. Welch, "The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms," IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 70-73, 1967.

[29] T. Teixeira, D. Jung, G. Dublon, and A. Savvides, "PEM-ID: Identifying people by gait-matching using cameras and wearable accelerometers," in Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), Como, Italy, Aug 2009.

[30] K. Kunze, P. Lukowicz, K. Partridge, and B. Begole, "Which Way Am I Facing: Inferring Horizontal Device Orientation from an Accelerometer Signal," in IEEE International Symposium on Wearable Computers, Linz, Austria, Sept 2009.

[31] Z.-A. Deng, G. Wang, Y. Hu, and D. Wu, "Heading Estimation for Indoor Pedestrian Navigation Using a Smartphone in the Pocket," MDPI Sensors, vol. 15, no. 9, pp. 21518-21536, 2015.

[32] C. R. Rao, "The Use and Interpretation of Principal Component Analysis in Applied Research," Sankhyā: The Indian Journal of Statistics, vol. 26, no. 4, pp. 329-358, Dec 1964.

[33] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks. MIT Press, 1998, pp. 255-258.

[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1106-1114.

[35] D. Scherer, A. Muller, and S. Behnke, "Evaluation of pooling operations in convolutional architectures for object recognition," in 20th International Conference on Artificial Neural Networks (ICANN), Thessaloniki, Greece, 2010.

[36] R. Hanka and T. P. Harte, Computer Intensive Methods in Control and Signal Processing: The Curse of Dimensionality. Birkhauser Boston, 1997, ch. Curse of Dimensionality: Classifying Large Multi-Dimensional Images with Neural Networks, pp. 249-260.

[37] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, California, US: Morgan Kaufmann Publishers Inc., 1993.

[38] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2, pp. 131-163, 1997.

[39] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.

[40] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.

[41] B. Scholkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, J. C. Platt et al., "Support vector method for novelty detection," Neural Information Processing Systems (NIPS), vol. 12, pp. 582-588, 1999.

[42] D. R. Musicant, V. Kumar, and A. Ozgur, "Optimizing F-Measure with Support Vector Machines," in 16th International FLAIRS Conference (FLAIRS), St. Augustine, Florida, US, May 2003.

[43] D. M. J. Tax and K. R. Muller, Artificial Neural Networks and Neural Information Processing. Berlin, Heidelberg: Springer, 2003, ch. Feature Extraction for One-Class Classification, pp. 342-349.

[44] A. Wald, Sequential Analysis. New York, NY, US: Dover, 1947.

[45] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential Analysis: Hypothesis Testing and Changepoint Detection. CRC Press, 2015.