Soft Authentication with Low-Cost Signatures

Senaka Buthpitiya, Anind K. Dey, Martin Griss
Carnegie Mellon University

{sbuthpit@andrew, anind@cs, martin.griss@sv}.cmu.edu

Abstract—As mobile context-aware services gain mainstream popularity, there is increased interest in developing techniques that can detect anomalous activities for applications such as user authentication, adaptive assistive technologies and remote elder-care monitoring. Existing approaches have limited applicability as they regularly poll power-hungry sensors (e.g., accelerometer, GPS), reducing the availability of devices to perform anomaly detection. This paper presents SALCS (Soft Authentication with Low-Cost Signatures), an approach for anomaly detection on a user’s routine comprised of a collection of anomaly detection techniques utilizing soft-sensor data (e.g., call-logs, messages) and radio channel information (e.g., GSM cell IDs), all of which are available as part of a phone’s routine usage. Using these information sources we model aspects of a person’s routine, such as movement, messaging and conversation patterns. We present extensive evaluations of the individual anomaly detection techniques and compare the collection, SALCS, to an existing power-hungry approach, showing that SALCS has a 7.6% higher detection rate and gives 5x better coverage throughout the day.

I. INTRODUCTION

A smart phone accompanies its owner throughout (nearly) all aspects of his or her life, becoming an indispensable assistant the busy user relies on to help navigate their life, using map applications to navigate the physical world, email and instant messaging applications to keep in touch, media player applications to be entertained, etc. Over a period of time the owner will trust the smart phone with an enormous amount of personal information, everything from contact lists to call-logs to messaging histories to passwords. As a smart phone becomes more capable of sensing the physical and virtual context of the user with an array of hard-sensors (e.g., GPS, accelerometer) and soft-sensors (e.g., email, social network, calendar), using these sensors to tailor the assistance it provides to the owner, the phone adds this multitude of information about the owner and their behavior to the treasure trove of information it already possesses.

While smart phones are invaluable assistants due to their omnipresence and the multitude of information they hold, these properties also make smart phones a major vulnerability to the user’s privacy. A mobile device can easily be stolen and used to gain sensitive information about the owner. Additionally, stolen or “borrowed” mobile devices can be used by malicious individuals to impersonate the owner to gain personal information via the owner’s contacts or simply to play a nasty prank on the owner. In securing mobile devices and the information they hold, the most commonly used and most straightforward approach is password-based authentication of the user. Text-based password schemes were common in early mobile computing devices, but the cumbersomeness of entering text passwords on mobile devices and the tediousness of remembering text passwords have seen them give way to more secure graphical password schemes. Graphical password schemes have been shown to be more user-friendly (in terms of password memorability and ease of entry) and more secure than text-based password schemes [1]. With repetitive use, however, even graphical passwords are susceptible to eavesdropping and tedium [2], and remembering multiple graphical passwords is difficult [3]. Still, graphical passwords remain the best option as the final word in an authentication process when other authentication mechanisms cast doubt on the user’s identity.

Based on the rich set of information and data streams available to the mobile phone, is it possible to model the behavior of the owner, and use these models to authenticate the user? To frame the problem in broader terms, is it possible to detect anomalous behavior of the user, i.e., behavior that is significantly different from the usual behavior of the device owner? Variations in data streams used to model behavior could be caused by noise, novel behavior or anomalous behavior. Noise is variation which is not of interest to a human analyst and acts as a hindrance to data analysis. Novel patterns are typically incorporated into the normal model over time [4]. The ability to recognize novel and anomalous behavior enables a variety of device security applications such as improved authentication mechanisms, impersonation prevention, and prevention of information theft from stolen devices. With such authentication schemes based on anomalous behavior detection, authentication can be pushed to the background without disturbing the user, reverting to the most secure graphical password schemes only when anomalous behavior is detected or suspected.

There has been some work in authenticating users or detecting anomalous user behavior using sensor signatures. Most of this work focuses on using a single sensor stream such as GPS [5], [6], [7], accelerometer [8], [9], or WiFi signal strengths [10]. These approaches suffer from one or more of the following major drawbacks: 1) large power drain due to constant polling of sensors, reducing the effective battery life and, therefore, anomaly detection / authentication capability to a matter of hours, 2) limited coverage, i.e., a low likelihood that the data available for a given time period, on an average day, is sufficient for the system to perform anomaly detection / authentication, 3) coverage only at a very coarse granularity, and 4) high noise sensitivity. Combining multiple sensor streams for user authentication on mobile devices has not yet received much research attention.

In this paper we present SALCS (Soft Authentication with Low-Cost Signatures), an approach for modeling various aspects of a user’s behavior for detecting deviation from routine (i.e., anomaly detection) by combining multiple sensor streams. SALCS only uses sensor streams that are believed to be readily available as part of a mobile device’s routine usage, i.e., the user already has these streams enabled, so there is no additional sensing cost. Initially, we model different aspects of a user’s behavior in isolation by taking information from each of these data streams. The aspects we consider are: 1) the way the user interacts with contacts, and 2) the user’s movement patterns. Then we focus on combining these models to give better accuracy for anomaly detection, and to improve coverage of detection throughout the day. The sensor streams used in this approach are 1) messaging logs, 2) call logs, 3) GSM signal strengths, and 4) WiFi signal strengths.

2014 IEEE International Conference on Pervasive Computing and Communications (PerCom)

978-1-4799-3445-4/14/$31.00 ©2014 IEEE

The main advantages of the SALCS approach are 1) better coverage and 2) no extra taxation of the battery, as the information needed for modeling is already available throughout the day from routine device usage. Our experiments show that each of the models has reasonably high levels of detection accuracy, and that in combination they outperform the approach using GPS information from prior work [7], which has the best anomaly detection results in prior work. We also show that by combining these models our approach attains better coverage throughout the day than by using a single power-hungry sensor stream.

In this paper, we make the following contributions toward the problem of unobtrusively authenticating mobile phone users:
• We present the idea of modeling various aspects of user behavior for authentication.
• We show that data streams available as part of phones’ routine operation can be effectively used as the data sources for modeling user behavior.
• We evaluate the authentication accuracy and coverage, with extensive real-world data, and show that models based on readily available data streams can be combined to achieve better authentication accuracy and coverage than models based on a single power-hungry sensor stream.

II. RELATED WORK

Chandola et al., in their survey of anomaly detection techniques [4], provide a working definition for anomalies as significant variations that are of interest to a human analyst, thus differentiating anomalies from noise. This definition can be further refined to differentiate novel patterns from anomalies, as novel patterns are incorporated into a routine over time.

Most existing work in mobile anomaly detection focuses on using a single sensor information stream from the device to detect anomalies in user behavior, with the majority of work using GPS [11], [12]. Generating bounding regions of frequent travel, and using Markov Random Fields for learning geo-tracks of people, have been proposed for anomaly detection [12], [11], but these have not produced results that can be used for comparison. Other work on behavior prediction can be applied to anomaly detection. Ziebart et al. use Markov models to predict where a user is going [6]. Similarly, Krumm and Horvitz use the partially traveled route to predict a user’s destination [5]. In [7], Buthpitiya et al. present an approach using an n-gram model that is able to skip over previous locations which are detractors (or non-contributors) to the current prediction (thus becoming more robust to noise in data caused by either GPS resolution or minor variations in a user’s movements). Since this approach has been the most effective at detecting anomalous behavior using GPS traces, we use it as a baseline for comparison against the SALCS approach presented here.

The major drawbacks of GPS-based approaches in general are: 1) large power drain due to constant GPS polling, reducing the effective battery life to a matter of hours, 2) limited coverage, as anomaly detection works only outdoors and only at a coarse granularity, and 3) reduced authentication reliability when the user is stationary for long periods (i.e., limited coverage).

Related work using WiFi signatures for user behavior modeling, like GPS-based work, is targeted at either behavior prediction or anomaly detection, employing statistical approaches such as Bayesian classification or n-gram modeling [10], [13]. Stable WiFi signatures covering large areas tend to be limited to large corporate campuses or college campuses; therefore these approaches do not provide good coverage throughout the day for many users. Due to the limited coverage of the WiFi-only approaches for anomaly detection, we will not use them as a baseline for performance comparison to the SALCS approach. The majority of accelerometer-based authentication mechanisms focus on gait features such as step-cycle periods [8], [9] modeled using histograms, frequency domain analysis and data distribution statistics [14]. The major drawbacks of these approaches are 1) requiring specific placement and orientation on the user’s body, and 2) requiring constant polling of the accelerometer, increasing the energy consumption and reducing the battery life of the device [15]. Approaches using accelerometer-based activity recognition for anomaly detection require large amounts of training data to learn activity patterns throughout a day’s routine. Other work that focuses on getting the user to perform gestures for authentication [16] has low accuracy levels and defeats the objective of unobtrusive authentication.

Shi et al. [17] provide an important exception to the earlier work that focuses on a single data stream. This work uses GPS readings in conjunction with mobile phone call- and browser-history to detect anomalies, and tests the approach with a large amount of user data. While this work puts forward the idea of combining sensor streams for authentication, its reliance on GPS makes the approach power hungry, and the simplistic models used in this approach make it easy to trick (e.g., by remaining in a location the system has learned as familiar).

III. MODELING BEHAVIOR ASPECTS

This section describes the aspects of user behavior we model and our modeling approaches using data available as part of the device’s routine usage, i.e., the user already has these streams enabled, so there is no additional sensing cost. The section concludes with a description of how the models are combined to form SALCS. The aspects of user behavior we model are:
• Message response patterns - user’s response types and response delays for messages from different contacts.
• Calling patterns - call durations/times to each contact.
• Outdoor mobility patterns - movement patterns outdoors at a coarse granularity.
• Indoor mobility patterns - movement patterns indoors.

A. Message Response Model

Here we model the message response aspect of a user’s behavior based on the hypothesis that a delay in the user responding to a received message will depend heavily on the significance the user places on the sender (or contact) of the original message. Though the content of a message may also affect the response delay, we assume, for this analysis, that the criticality of message content will not vary significantly for a single contact. Therefore the message response model consists of a collection of smaller models, one for each contact the user communicates with. While we consider only SMS messages in this work, it is straightforward to extend this approach to emails and instant messages as well. The intuition behind using this model for user authentication is that if the device is stolen the thief would not respond to messages for fear of discovery, or if the thief were attempting to gain personal information about the device owner via the contacts the response delay to messages would be anomalous.

To create the message response model, we model (in addition to the response delays) the likelihoods for the user 1) initiating a messaging conversation, and 2) responding to a message. All likelihoods are calculated per-contact (all unknown numbers are treated as a single contact) using Maximum Likelihood Estimation (MLE) on the training data. For each contact, the response delay distribution is modeled as a Gaussian Mixture Model (GMM), where a GMM is comprised of a variable number of one-dimensional Gaussian distributions. To train the GMM, the delays in responding to messages from the contact are compiled and used with the Expectation-Maximization (EM) algorithm and the Bayesian information criterion (BIC) to decide on the number of Gaussian distributions to use in the mixture model and to train the final GMM. In the left plate of Figure 1, we show the Gaussian mixture models representing the response delay distributions for the top 6 contacts of a single user, in terms of number of interactions. The varying number of peaks and peak locations for each GMM shows the difference in urgency in responding to messages from various contacts by the same user. The right plate of Figure 1 shows the response delay distributions of the top-most contact of 6 different users. Once again the varying number of peaks and peak locations for each GMM shows the difference in urgency, of different users, in responding to messages from their top contact. In this approach we have not incorporated features such as time-of-day, which could possibly increase the accuracy of the model, at the cost of a drastic increase in training data requirements.

Fig. 1: Visualization of response delays modeled as GMMs. Left: 6 most interacted-with contacts of a user. Right: most interacted-with contact of 6 different users.
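The per-contact training step (EM to fit each candidate mixture, BIC to choose the number of components) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `fit_gmm_1d` and `select_gmm_by_bic`, the quantile-based initialization, the variance floor `min_var`, the fixed iteration count and the cap `max_k` are all assumptions made for illustration.

```python
import math

def _normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def _mixture_pdf(x, weights, means, variances):
    return sum(w * _normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def fit_gmm_1d(delays, k, iterations=200, min_var=1e-3):
    """Fit a k-component 1-D GMM with EM (hypothetical helper).

    Returns (weights, means, variances, log_likelihood).
    """
    xs = sorted(delays)
    n = len(xs)
    # Assumed initialization: means at evenly spaced quantiles, shared global variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(xs) / n
    variances = [max(sum((x - centre) ** 2 for x in xs) / n, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each delay.
        resp = []
        for x in xs:
            parts = [w * _normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(parts) or 1e-300
            resp.append([p / total for p in parts])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, min_var)
    log_lik = sum(math.log(_mixture_pdf(x, weights, means, variances) or 1e-300) for x in xs)
    return weights, means, variances, log_lik

def select_gmm_by_bic(delays, max_k=3):
    """Pick the component count with the lowest BIC; returns (k, weights, means, variances)."""
    best = None
    n = len(delays)
    for k in range(1, max_k + 1):
        weights, means, variances, log_lik = fit_gmm_1d(delays, k)
        # A 1-D GMM with k components has 3k - 1 free parameters.
        bic = (3 * k - 1) * math.log(n) - 2.0 * log_lik
        if best is None or bic < best[0]:
            best = (bic, k, weights, means, variances)
    return best[1:]
```

For a contact whose replies are either near-immediate or delayed by several minutes, the selection step would be expected to prefer more than one component, which corresponds to the multiple peaks visible in Figure 1.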

For authentication, this approach takes into account all messaging activity, $A$, in the past $t$ time period. $A$ is comprised of individual actions $(a_1, a_2, \ldots, a_N)$, such as initiating a conversation, responding to or ignoring a received message for the duration of $t$. For simplicity we assume $a_1, a_2, \ldots, a_N$ are independent of each other. The messaging model, $M$, can be used to estimate the likelihood $L$ of the activity $A$ being generated by the device owner as,

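\[
\begin{aligned}
L &= P(a_1, a_2, \ldots, a_N \mid M) \\
  &= P(a_1 \mid a_2, \ldots, a_N, M) \cdot P(a_2, \ldots, a_N \mid M) \\
  &= P(a_1 \mid M) \cdot P(a_2, \ldots, a_N \mid M) \qquad \{a_1 \perp a_2, \ldots, a_N\} \\
  &= P(a_1 \mid M) \cdot P(a_2 \mid M) \cdot \ldots \cdot P(a_N \mid M) \\
  &= \prod_{i=1}^{N} P(a_i \mid M)
\end{aligned}
\]

where,

\[
P(a_i \mid M) =
\begin{cases}
P(\text{Init. conv. with a contact}) \\
P(\text{Init. conv. with an unknown num.}) \\
P(\text{Responding to an unknown num.}) \\
P(\text{Responding to msg from contact}) = P(\text{responding} \mid M) \cdot P(\text{delay} \mid M) \\
P(\text{Unreplied msg from a contact}) = 1 - P(\text{responding} \mid M)
\end{cases}
\]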

This likelihood is then compared to an anomaly threshold (see Figure 2a and description) and if the likelihood is lower, an anomaly is flagged. Time windows with fewer than a pre-selected activity threshold number of activities are disregarded as not having sufficient information for classification, thus impacting the coverage of this approach.
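The windowed decision rule above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the per-action probabilities are assumed to come from the trained per-contact model, and working in log space with a small probability floor is an added numerical convenience rather than something the text specifies.

```python
import math

def reply_probability(p_responding, p_delay):
    """P(responding | M) * P(delay | M) for a replied message."""
    return p_responding * p_delay

def unreplied_probability(p_responding):
    """1 - P(responding | M) for a message left unanswered."""
    return 1.0 - p_responding

def window_log_likelihood(action_probs, floor=1e-12):
    """log L = sum of log P(a_i | M), i.e. the log of the product under independence."""
    return sum(math.log(max(p, floor)) for p in action_probs)

def classify_window(action_probs, anomaly_threshold, activity_threshold):
    """Return True (anomalous), False (normal) or None (too few actions to classify)."""
    if len(action_probs) < activity_threshold:
        return None  # window is not covered by this model
    return window_log_likelihood(action_probs) < math.log(anomaly_threshold)
```

Returning `None` for sparse windows makes the coverage cost explicit: such windows contribute no decision at all rather than a default "normal" verdict.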

B. Call Patterns Model

We take an approach similar to our message response model to build a call patterns model for modeling how users respond to and make calls. Despite the many factors (e.g., reason for the call, time of day, owner’s activity) that may impact calling behavior, we assume that call duration and delays in responding to missed calls depend primarily on who the caller is. This model consists of a collection of smaller ones, one for each contact. The intuition behind authenticating using this model is that if the device is stolen, the thief would not answer incoming calls and not respond to these missed calls, while the thief might call numbers that the device owner would not, all of which would be detected as anomalous.

To create the voice call patterns model, we model likelihoods for each of the following factors per-contact (unknown numbers are considered as a single contact): 1) duration of an incoming call, 2) duration of an outgoing call, 3) responding to a missed call, and 4) delay in responding to a missed call. The likelihoods of responding to a missed call are estimated using MLE from training data. For each contact, the outgoing / incoming call durations and missed call response delay distributions are modeled as three separate GMMs trained and tuned using EM and BIC. As with the message response model, we have not incorporated features such as time-of-day due to training data limitations even though such features could possibly increase the accuracy of the model.

For authentication, our approach takes into account all voice call behavior, $V$, in the past $t$ time period. $V$ is comprised of individual actions $(v_1, v_2, \ldots, v_N)$, such as responding to a missed call (with a delay of $d_{v_i}$), making an outgoing call (with a duration of $d_{v_i}$), or answering an incoming call (with a duration of $d_{v_i}$). For simplicity we assume $v_1, v_2, \ldots, v_N$ are independent of each other. The voice call patterns model can be used to estimate the likelihood $L$ of the behavior $V$ being generated by the device owner, where

\[
L = \prod_{i=1}^{N} P(v_i \mid M)
\]

(using a similar derivation to the messaging model’s), where,

\[
P(v_i \mid M) =
\begin{cases}
P(d_{v_i}\text{ long incoming call to contact}) \\
P(d_{v_i}\text{ long incoming call to unk. num.}) \\
P(d_{v_i}\text{ long outgoing call to contact}) \\
P(d_{v_i}\text{ long outgoing call to unk. num.}) \\
P(\text{Resp. to miss-call from contact}) = P(\text{responding} \mid M) \cdot P(\text{delay} \mid M) \\
P(\text{Resp. to miss-call from unk. num.}) = P(\text{responding} \mid M) \cdot P(\text{delay} \mid M)
\end{cases}
\]

This likelihood is then compared to an anomaly threshold (see Figure 2b and description) and if the likelihood is lower, an anomaly is flagged. Time windows with fewer than an activity threshold number of actions are disregarded as not having sufficient information for classification, impacting coverage.

C. Outdoor Mobility Model

Here we present our approach to modeling the movement patterns of a user’s behavior, using GSM tower information as the data source. While this works indoors as well, we call it an outdoor mobility model as the model requires the user to move a significant distance within a GSM cell to generate a new signature, which happens mostly when the user is outdoors. Our model is based on the same hypothesis as the geo-trace model [7], that a user’s next location is dependent largely on the present location and a finite number of past locations, and these sequences of dependent locations can be combined to create a unique signature for a user’s routine behavior. Unlike [7], in this work we do not depend on power-hungry GPS chips for modeling data: instead we use readily available GSM tower information to localize the user. Though this has a lower granularity than GPS localization, the power savings and ability to work even when the device is in covered areas offset this disadvantage. We use n-gram models to learn and monitor movement patterns. Commonly used to model languages, n-gram models are in essence (n−1)th-order Markov models and are very robust at modeling sequences of words (i.e., labels) [18].

Since the n-gram model deals with labels, the GSM tower signal data needs to be converted into discrete labels. To convert the GSM tower signal data into labels, we use the ID of the tower the device is currently connected to, and discard the list of towers visible to the device (our experimental results show the use of the entire tower list with only a few weeks of training causes an increase in the true-positive rate at the expense of a drastic increase in the false-positive rate). Next, we quantize the signal strength of the connected tower, using a quantization scale with unique and arbitrary labels associated with each level in the scale. Finally the quantized signal strength’s level label and the tower ID are concatenated to form a unique label indicative of the user’s present location. For this model, we will term such a label as an area-label ($a_i$) and a sequence of labels as an area-trace ($a_1, a_2, \ldots, a_N$).
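A minimal sketch of the area-label construction follows. The quantization edges (in dBm), the `L<level>` naming of levels, the `:` separator and the function names are all assumptions chosen for illustration; the text only requires that each level has a unique, arbitrary label.

```python
import bisect

# Hypothetical quantization scale: boundaries in dBm between signal-strength levels.
DEFAULT_EDGES = (-100, -85, -70)

def quantize_signal(dbm, edges=DEFAULT_EDGES):
    """Map a signal strength to the index of its quantization level."""
    return bisect.bisect_right(edges, dbm)

def area_label(tower_id, dbm, edges=DEFAULT_EDGES):
    """Concatenate the connected tower's ID with its quantized signal level."""
    return f"{tower_id}:L{quantize_signal(dbm, edges)}"
```

For example, `area_label("4021", -90)` yields `"4021:L1"` under the assumed edges, and the label changes only when the user connects to a different tower or the signal strength crosses a level boundary.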

The n-gram movement model consists of probabilities for short area-traces $(a_1, a_2, \ldots, a_N)$. These probabilities can be estimated using Maximum Likelihood Estimation (MLE) on the training data by counting the occurrences of area-labels and area-traces:

\[
P(a_1, a_2, \ldots, a_N) = \prod_{i=1}^{N} P(a_i \mid a_{i-n+1}^{i-1})
\]

\[
P_{MLE}(a_i \mid a_{i-n+1}^{i-1}) = \frac{C(a_{i-n+1}, \ldots, a_{i-1}, a_i)}{C(a_{i-n+1}, \ldots, a_{i-1})}
\]
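The counting estimate can be sketched in a few lines. The helper names are hypothetical, and one simplification is assumed: a context is counted only where it is followed by a label inside the trace, so that the estimates for each context sum to one.

```python
from collections import Counter

def ngram_counts(trace, n):
    """Count n-grams and their (n-1)-label contexts in an area-trace."""
    grams = Counter()
    contexts = Counter()
    for i in range(len(trace) - n + 1):
        gram = tuple(trace[i:i + n])
        grams[gram] += 1
        contexts[gram[:-1]] += 1
    return grams, contexts

def p_mle(label, context, grams, contexts):
    """MLE estimate of P(label | context); zero for contexts or n-grams never seen."""
    context = tuple(context)
    seen = contexts[context]
    if seen == 0:
        return 0.0
    return grams[context + (label,)] / seen
```

With the trace `["A", "B", "A", "B", "A", "C"]` and `n = 2`, the context `A` is followed by `B` twice and by `C` once, so the estimate for `B` after `A` is 2/3, while any transition absent from training gets probability zero.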

However, MLE assigns zero probabilities to all area-traces that do not occur in the training data. This is explained as the “closed-world” assumption by Krumm and Horvitz [5] and gives rise to awkward behavior of the n-gram mobility model when the user travels to locations that were not visited during training. To avoid this problem we apply Good-Turing discounting and Katz backoff smoothing [19], which were developed for language modeling, as described in [7]. The key idea of smoothing (also known as discounting) is to discount the MLE probability for each observed n-gram in the training data to reserve some probability mass for unseen events. The trained n-gram model can then be used to estimate the next area-label, $a_i$, given the previous $n-1$ area-labels from a user’s context as $P(a_i \mid a_{i-n+1}, a_{i-n+2}, \ldots, a_{i-1}) = P(a_i \mid a_{i-n+1}^{i-1})$. When the area-labels are created, we limit the number of times the same area-label can consecutively recur (we term this process label collapsing), since such recurrence indicates the user is stationary (at the resolution of the GSM signal strength data). As we attempt to model the user’s location transitions, and not time spent at locations, when the user is stationary for an extended period of time the area-label generation ceases (until a change in location is sensed), essentially pushing the model into a sleep state.
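Label collapsing can be sketched as a single pass over the trace. The function name and the `max_repeat` parameter are hypothetical; the text does not state the limit that was used.

```python
def collapse_labels(trace, max_repeat=1):
    """Keep at most max_repeat consecutive copies of the same label."""
    collapsed = []
    run_length = 0
    for label in trace:
        if collapsed and label == collapsed[-1]:
            run_length += 1
        else:
            run_length = 1
        if run_length <= max_repeat:
            collapsed.append(label)
    return collapsed
```

With `max_repeat=1`, a stationary stretch such as `A, A, A` contributes a single `A`, so the n-gram context is filled with transitions rather than with time spent in one place.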

The trained movement model is used for detecting anomalies in behavior by continuously feeding the model area-labels in real-time. The n-gram model then outputs a likelihood estimate for the current area-label being generated by the user, given the area-trace seen up to that point in time. The probability estimate is compared against a heuristically decided threshold, the anomaly threshold (see Figure 2c and description), and if the estimate is below the threshold anomalous behavior is flagged.

D. Indoor Mobility Model

Next we present an approach to model the indoor movement patterns of a device owner’s behavior using WiFi signal data. This model is based on the same hypothesis as the outdoor mobility model, that a user’s next location is dependent largely on his or her present location and a finite number of past locations, and these sequences of dependent locations combine to create a unique signature for each person. In this model we utilize WiFi information that is periodically scanned and gathered (i.e., WiFi signatures) by the mobile device as part of its routine usage without incurring a sensing power penalty. Even though previous work [20] has shown that WiFi signature-based localization has lower granularity and accuracy than GPS-based localization, it has the distinct advantage of working in WiFi-equipped indoor environments. Without performing explicit localization, we hypothesize that WiFi signatures are sufficient to perform anomaly detection with high accuracy. While we label this as an indoor mobility model, it is applicable in any area with WiFi coverage (e.g., large corporate or college campuses). However, as stated before, this model’s coverage is limited to areas where sufficient WiFi access is available.

We use n-gram models to learn and monitor movement patterns, in a similar manner to the outdoor mobility model. As the n-gram model deals with labels, the WiFi signature information has to be converted into discrete location labels. A WiFi signature consists of a list of WiFi access points (APs), the signal strength of each AP, and a list of networks to which each of the APs belong. While a straightforward approach would be to assign a unique label to each WiFi signature, this approach would make the labels extremely susceptible to minor variations in the environment (e.g., changes in signal strength caused by changes in weather, an AP disappearing). To make the label assigning process more robust, we use a Hamming distance-based hierarchical clustering approach to group together signatures which are from the same or close-by locations and assign each cluster a unique label. This label creation approach is possible as we are interested only in the user’s transitions between locations and not in the precise location. We assign each signature a Hamming code, a binary string where each bit represents one of the networks seen in the training data and the bit is set to true if the network is seen in the current signature (we discard the signal strength data, making the labeling scheme less susceptible to noise). Using the Hamming distance between the signatures, we perform hierarchical clustering to produce a binary tree of clusters where the leaf nodes are the individual WiFi signatures. Next the cluster tree is traversed depth-first and if the distance between the two child clusters is greater than a threshold (chosen based on the required size of the vocabulary), the child clusters are assigned separate unique labels. For this model, we will call each unique label a region-label and a sequence of labels a region-trace. As with the outdoor mobility model, label collapsing is used when creating region-traces and MLE with Good-Turing discounting and Katz backoff smoothing are applied to train the n-gram model. The trained indoor mobility model is used in an identical manner to the outdoor mobility model for anomaly detection.
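The region-label construction can be sketched as below. Two simplifications are assumed: average linkage is used (the text does not name a linkage criterion), and instead of building the full tree and traversing it depth-first, merging simply stops once the closest pair of clusters is farther apart than the threshold, which gives the same kind of threshold cut. All names are hypothetical.

```python
def encode_signature(networks_seen, vocabulary):
    """Binary string: one bit per network in the training vocabulary."""
    return tuple(1 if net in networks_seen else 0 for net in vocabulary)

def hamming_distance(code_a, code_b):
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))

def region_labels(codes, threshold):
    """Agglomerative clustering of signature codes; returns one region-label per signature."""
    clusters = [[i] for i in range(len(codes))]

    def average_distance(c1, c2):
        total = sum(hamming_distance(codes[i], codes[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are far apart and keep separate labels
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [None] * len(codes)
    for index, members in enumerate(clusters):
        for member in members:
            labels[member] = f"R{index}"
    return labels
```

Raising the threshold merges more signatures into each region and shrinks the vocabulary; lowering it has the opposite effect, which is the trade-off behind choosing the threshold from the required vocabulary size.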

E. SALCS: Combined Behavior Models

The experimental results (see Figures 2 and 3) for the individual models show that while each model provides reasonable accuracy, the coverage of these models is quite low. Therefore in this section we describe our approach to combining the behavior models (i.e., SALCS) employing a straightforward voting scheme. Each model is configured to detect anomalies using up to one hour of data. Each model presents its vote towards the final decision only if it has sufficient data to make a prediction. In the case of tied votes, we give precedence to the outdoor movement, the indoor movement and the message response models, in that order (ordered by comparing performance using ROC curves in Figure 2).
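The voting scheme can be sketched as follows. The model keys and the function name are hypothetical, and placing the call patterns model last in the tie-break order is an assumption: the text lists the precedence of only the other three models.

```python
# Tie-break order; "call" in last place is an assumption, not stated in the text.
PRECEDENCE = ("outdoor", "indoor", "message", "call")

def combine_votes(votes):
    """Combine per-model votes.

    votes maps a model name to True (anomaly), False (normal) or None (abstains
    because it lacks sufficient data). Returns True, False, or None if no model voted.
    """
    cast = {model: vote for model, vote in votes.items() if vote is not None}
    if not cast:
        return None
    anomalous = sum(1 for vote in cast.values() if vote)
    normal = len(cast) - anomalous
    if anomalous != normal:
        return anomalous > normal
    # Tie: defer to the highest-precedence model that voted.
    for model in PRECEDENCE:
        if model in cast:
            return cast[model]
    return None
```

Because abstaining models are simply skipped, a window is covered whenever at least one model has enough data, which is how the combination improves coverage over any single model.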

IV. EXPERIMENTAL DATA

For this work, we use a dataset consisting of a variety of soft and hard sensor information streams recorded from mobile phones. The dataset has data from 30 users (20 male and 10 female) living in the <city removed> area and includes 6 students, 13 white-collar workers (5 of whom are in management), 10 blue-collar workers, and 1 businessman. Each user carried an Android mobile phone with data logging software installed for a period of 3 months, using the device as their personal mobile phone. Any use during this period is assumed to be normal behavior and usage of the device by the participant (this includes, e.g., lending the phone to make calls). Since each data stream has varying update rates and power costs in sensing, the rate at which each stream is sensed and logged varies. The logging rates were adjusted to get usable data while allowing the phone battery to last approximately 24 hours.

V. EVALUATIONS

In this section we describe the experimental designs and evaluations performed on the behavior models. We first describe the experiments performed on the individual behavior models and present the results. Next, we describe the evaluations performed on SALCS, present the results from these experiments and contrast SALCS’s results with those of the individual models and a GPS-based anomaly detection system presented in [7].

A. Evaluating Individual Behavior Models

The evaluations on the individual models were performed using the 30-user, 3-month dataset. We use a 3-fold validation approach for each user, where one fold (i.e., one month of data) is used to train a model to be tested. We use the other two-thirds of the user’s data as testing data of “normal behavior”. To simulate “anomalous behavior” we use the complete data sets of the other 29 users in the dataset. This entire process is repeated for each fold of the 30 users in the dataset and the results presented here are the averaged results from these experiments. This experimental procedure assumes that during the data collection period each user did not have any substantial anomalies in their own behavior.

To simulate anomalies to test the call and messaging models we replace the numbers in each user’s logs with numbers from a shared list. The process for this number substitution is to first sort the numbers in a user’s logs in descending order of the number of interactions the user has with each number, then replace each number with the number in the corresponding position in the shared list. The shared list is a list of randomly generated non-repeating numbers. To test the WiFi signatures across users, we run the WiFi signature clustering and cluster labeling steps for each user on their own dataset to generate the sequences of cluster labels and then share these sequences across users. Once again the labeling of clusters is done from a shared list of labels. The case of major variations from routine where the user leaves known areas is straightforward to detect, but these experiments simulate harder-to-detect cases where the user has broken from routine but remains within known areas (e.g., within the user’s college campus).
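The number substitution can be sketched as a rank-based remapping. The function name is hypothetical, and breaking ties between equally frequent numbers by their string order is an assumption; the text does not say how ties are resolved.

```python
from collections import Counter

def substitute_numbers(log_numbers, shared_list):
    """Replace each number with the shared-list entry at its interaction rank."""
    counts = Counter(log_numbers)
    # Most-interacted-with number first; ties broken by the number itself (assumption).
    ranked = sorted(counts, key=lambda number: (-counts[number], number))
    mapping = {number: shared_list[rank] for rank, number in enumerate(ranked)}
    return [mapping[number] for number in log_numbers]
```

After substitution, every user's most frequent contact maps to the same shared entry, so one user's logs can be scored against another user's model as if both were talking to the same set of contacts.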

The first evaluation generated Receiver Operating Characteristic (ROC) curves for each of the models to show how changing each model’s anomaly threshold affects the true positive rate (TPR) and false positive rate (FPR) of anomaly detection. On the ROC plots, the y-axis shows correct anomaly detections (as a fraction of the total number of anomalies present in the testing data), and the x-axis shows incorrect detections (as a fraction of non-anomalous test cases in the testing dataset). Points along the diagonal mean the two rates are equal, so the model is performing no better than a coin flip. The upper left corner represents a perfect anomaly detector, detecting all anomalies without any false detections. Each point on a ROC curve corresponds to a certain detection threshold. These graphs make clear the trade-off between activity threshold / time window size, TPR and FPR. For the message response model, the five ROC curves displayed in Figure 2a are generated with activity thresholds of 2, 5, 10, 15 and 25 and time windows of 1 hour for the first three thresholds and 4 hours for the last two thresholds (with higher activity thresholds and smaller time windows, the number of testing data points is too low for meaningful results). For the call patterns model, the seven curves displayed in Figure 2b are generated with time windows of 15 minutes, 30 minutes, 1, 2, 4, 8, and 12 hours and corresponding activity thresholds of 1, 1, 3, 3, 6, 6, 6 events. For the outdoor mobility model, four separate ROC curves were generated where the minimum length of an area-trace (number of area-labels in a trace) the model has to see before it performs a classification is set to 4, 8, 12, and 16 labels (see Figure 2c). All ROC curves are generated using 10-gram mobility models. For the indoor mobility model, we generate five separate ROC curves, where the minimum length of a region-trace is set to 1, 2, 4, 8, and 10, using a 10-gram indoor mobility model (see Figure 2d).
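The construction of a ROC curve from a threshold sweep can be sketched as below. The sketch assumes each test case has already been reduced to a single anomaly score where a higher score means more anomalous; that score convention, the function name `roc_points`, and the sample data are assumptions for illustration and are not taken from the paper.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per detection threshold.

    scores: anomaly score per test case (higher = more anomalous; assumed).
    labels: True where the test case is a real anomaly.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for threshold in thresholds:
        # A test case is flagged when its score reaches the threshold.
        true_pos = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        false_pos = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Made-up scores for six test cases, three of them anomalous.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels, thresholds=[0.0, 0.5, 1.0])
```

The lowest threshold flags everything and lands at the upper right corner, the highest flags nothing and lands at the origin, and intermediate thresholds trace the curve between them.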

The ROC curves for the message response model (see Figure 2a) show that the model is effective at detecting anomalies when it has seen at least 4 hours of anomalous messaging logs, but its effectiveness drops (to around 75% TPR at a 20% FPR) when the model is forced to classify with just one hour of data. For the call patterns model, the ROC curves (see Figure 2b) show that the model can only be used with moderate effectiveness (with around a 70% TPR at a 20% FPR). The ROC curves also show how the model’s classification capabilities improve as more prior data (i.e., more events) are seen by the model. The ROC curves in Figure 2c indicate the outdoor mobility model can be used to effectively detect anomalies (with over 85% TPR at a 20% FPR). As expected, the model’s performance improves as more prior data (i.e., longer area-traces) are available, but the model is able to perform quite well even with short area-traces. The ROC curves with minimum trace lengths of 8 and 10 in Figure 2d indicate that the indoor mobility model is effective at anomaly detection (with over a 90% TPR at a 20% FPR) when presented with longer histories of movement. The degradation in performance for this model is quite low (remaining over 80% TPR at 20% FPR) when the model sees at least the present and previous (i.e., 2) region-tags. While ROC curves for the outdoor and indoor mobility models are generated for various history lengths, the history lengths are only loosely correlated with time. A new tag is added to a history when the user has moved sufficiently. In the outdoor movement model, a new tag is added when the user moves far enough for the GSM cell tower signal strength to change to a different level of quantization. For the indoor movement model the amount of movement required is much less, in most cases moving out of or into the range of one or more WiFi access points.

Next we experiment on the message response and call patterns models to identify the effect of the testing data window size and the event threshold on the accuracy of detecting anomalies. Increasing the activity threshold increases the information the model receives about the activities, thereby increasing classification accuracy. On the other hand, increasing the activity threshold for a particular window size reduces the probability of a window containing sufficient activities to qualify for anomaly detection, reducing both accuracy and coverage. The results from these experiments are shown in Figures 4a and 4b. All of the curves in Figure 4a show an increase in accuracy, as expected, when the activity threshold is initially increased. As the activity threshold is increased further, the curves representing the smaller time windows show a decrease in accuracy. This accuracy drop is due to only a small amount of data from the testing set qualifying for classification, which biases the results. All of the curves in Figure 4b show an increase in accuracy, as expected, when the activity threshold is initially increased.

Fig. 2: ROC plots for individual models: (a) message response model, (b) calling patterns model, (c) outdoor mobility model, and (d) indoor mobility model.

Fig. 3: Coverage maps for (a) message response model, (b) calling patterns model, (c) outdoor mobility model, and (d) indoor mobility model.

Fig. 4: Variation of accuracy of the message response model and call pattern model vs. activity threshold for different time window sizes (a and b). Variation of accuracy of the outdoor mobility model and indoor mobility model vs. length of trace history considered, i.e., n (c and d).

In the next experiment, we explore the impact of the area-trace history considered by the indoor and outdoor mobility models on the accuracy of detecting anomalies. There are two aspects to the area-trace history considered by the models. The first, and more straightforward, is the minimum number of area-labels that the model needs to see before making a classification (i.e., the trace-length-threshold). The second is the length of the longest segments of area-traces modeled, i.e., the value of n in the n-gram model. Increasing either of these increases the amount of information the model has and should therefore increase the accuracy of the classification. However, increasing the trace-length-threshold reduces the likelihood of a user having sufficient data within a time window for classification, while increasing n increases both the model size that has to be held on the device and the training data requirements. The results from this experiment (see Figures 4c and 4d) show that as the history considered by the model is increased, the accuracy increases as expected but with diminishing gains. This indicates that the influence of a prior location on a user’s next location diminishes the further back that prior location is in the area-trace. In our experiments, increasing n to 20 keeps the model size under 10MB and 20MB for the outdoor mobility model and indoor mobility model respectively. Since retaining a model of this scale on a modern mobile device is not a strain on resources, model size does not impose any major restriction on history size. On the other hand, the increase in training data requirements meant that a 1 month long training dataset did not contain sufficient data to train a model with n > 12, causing the training process to use discounting to estimate the likelihoods of longer area-traces. Using the results of this experiment, we conclude that n = 10 for both models provides a reasonable level of accuracy in return for a plausible training data requirement. As the models are used in the real world over a period of time, they can be retrained with data accumulated while in use to increase the size of the area-trace history they can consider.

The final set of statistics we generate are for coverage: the likelihood of a user having sufficient data (during each hour of a day) for a model to make an authentication / anomaly classification. To generate coverage statistics for a user-model pair, the data tracks of a user are grouped into 1-hour chunks for each of the 24 1-hour slots in a day. The coverage of each 1-hour slot is calculated as the percentage of chunks in that slot that qualify for classification by the model. For a chunk of data to qualify for anomaly detection with the message response model and the call pattern model, the chunk needs to have at least 5 messaging events (received or sent messages) and 3 call events (missed, outgoing or incoming), respectively. For the outdoor and indoor movement models, a 1-hour chunk requires at least two and five labels respectively. The qualifying threshold for each model was selected with the aid of its ROC curves such that each model has at least 75% TPR at a 20% FPR. The coverage likelihoods averaged across all 30 users for the four models are shown in Figures 3a-3d. In summary, these experiments indicate that the amount of time during each day where there is sufficient data is somewhat low for the message response model and the call pattern model. The indoor and outdoor mobility models, while having greater coverage, still do not cover the whole day. The higher average coverage of the indoor mobility model over the outdoor mobility model is due to the amount of movement required by each model’s label generation process. Therefore it would be ideal for these models to be combined to give more complete coverage of the day.
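The per-slot coverage calculation can be sketched as follows for a single user and a single model. The function name `hourly_coverage`, the `observed_days` parameter, and the sample timestamps are assumptions for illustration; the sketch also assumes that an hour with no events counts as a non-qualifying chunk.

```python
from collections import defaultdict
from datetime import datetime


def hourly_coverage(event_times, observed_days, min_events):
    """Return, for each hour slot 0..23, the fraction of days whose
    1-hour chunk in that slot holds at least `min_events` events."""
    per_chunk = defaultdict(int)  # (date, hour) -> number of events
    for t in event_times:
        per_chunk[(t.date(), t.hour)] += 1
    qualifying = defaultdict(int)  # hour -> number of qualifying chunks
    for (_, hour), count in per_chunk.items():
        if count >= min_events:
            qualifying[hour] += 1
    return [qualifying[hour] / observed_days for hour in range(24)]


# Made-up example: five messages at 09:xx on day one, two on day two.
events = [datetime(2012, 3, 1, 9, m) for m in range(5)]
events += [datetime(2012, 3, 2, 9, m) for m in range(2)]
coverage = hourly_coverage(events, observed_days=2, min_events=5)
```

With the messaging threshold of 5 events, only the first day's 09:00 chunk qualifies, so the 09:00 slot has a coverage of one half and every other slot has zero.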

B. Evaluating SALCS

To evaluate SALCS, we again use the previous dataset with the evaluation procedure used for the individual models. The testing data from each stream (when available) are grouped into 1-hour chunks of each day and provided to SALCS for classification.



Fig. 5: Evaluation results for SALCS: (a) a slice of SALCS’s ROC hyper-surface and the ROC curve of the GPS mobility model, (b) coverage maps of SALCS and the mobility model.

To compare and contrast the performance and coverage of SALCS, we use the GPS-based mobility model (using n-gram models) described in [7]. We recreate the GPS mobility model for anomaly detection using the equal partitioning method with 10-gram language models. The baseline ROC curves for this model (see Figure 5a) were created on our dataset using the evaluation process described earlier with 30-minute and 1-hour testing windows.

SALCS has four anomaly thresholds (one for each model in the combination) that can be varied. This results in a four-dimensional ROC hyper-surface if all possible threshold values are iterated over. For comparison against the GPS mobility model, we vary all four threshold values linearly through their ranges to generate a slice of the ROC hyper-surface (see Figure 5a). When contrasted against the ROC curve of the GPS mobility model with a 1-hour time window, SALCS’s ROC curve outperforms the GPS model. The ROC curve for SALCS with the 30-minute test window shows slightly higher performance, indicating that SALCS can perform at reasonable accuracy even at a fine time granularity. In contrast, the system presented in [17] (using GPS and other sensors) operates with a TPR of 90% with upwards of 20 minutes of data, assuming that an intruder is continuously interacting with either the mobile phone’s browser or dialer. A summary of the results is presented in Table I, which shows a 7.6% relative improvement in the total anomalies detected at an FPR of 20% over the GPS-based system (93.18% versus 86.58% TPR). We also generate a coverage map for SALCS, showing the likelihood for an average user from our dataset to have sufficient data to perform anomaly detection (see Figure 5b). Contrasting this against the GPS mobility model’s coverage map, we see the coverage of SALCS is much higher. Table I also summarizes the coverage results, showing a coverage increase throughout the day of more than 5 times over the GPS-based system.
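One way to picture the slice through the four-dimensional threshold space is sketched below. It is illustrative only: the combination rule shown (a weighted majority vote over the models that have enough data, with `None` marking a model without sufficient data) is an assumption made for this sketch, as are the names `threshold_slice` and `vote`, the threshold ranges, and the weights.

```python
def threshold_slice(ranges, steps):
    """Yield one threshold per model at each step, moving all thresholds
    linearly and together from the low end to the high end of their ranges."""
    if steps < 2:
        raise ValueError("need at least two steps to span the ranges")
    for k in range(steps):
        t = k / (steps - 1)
        yield tuple(low + t * (high - low) for low, high in ranges)


def vote(scores, thresholds, weights):
    """Weighted vote over the models that produced a score.

    Returns True (anomaly), False (normal), or None when no model had
    sufficient data, i.e., the chunk falls outside coverage.
    """
    total = 0.0
    flagged = 0.0
    for score, threshold, weight in zip(scores, thresholds, weights):
        if score is None:  # this model lacked sufficient data
            continue
        total += weight
        if score >= threshold:
            flagged += weight
    if total == 0.0:
        return None
    return flagged / total >= 0.5


# Hypothetical ranges and equal weights for the four models.
slice_points = list(threshold_slice([(0.0, 1.0)] * 4, steps=3))
decision = vote([0.9, None, 0.2, 0.7], slice_points[1], weights=(1, 1, 1, 1))
uncovered = vote([None] * 4, slice_points[1], weights=(1, 1, 1, 1))
```

Evaluating TPR and FPR at each point of `slice_points` yields a single curve that can be plotted alongside a one-threshold baseline, which is what makes the comparison in a two-dimensional ROC plot possible.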

C. Computational Complexity of SALCS

In this section we describe the computational complexity of evaluating the models that together form SALCS. We omit the computational complexity of training the models, as the training operations are performed off-line, i.e., when the phone is docked with the user’s home computer and charging.

The complexity approximations for the message response model and call pattern model are similar in nature, therefore we present a generic calculation. Assume a contact list of length n and m events (i.e., either call events or messaging events) in the time window used for anomaly detection. Parsing the m-event log, past-to-present, while keeping track of the last missed call / message (if any) from each number present in the log, has a complexity of O(m). Looking up the model parameters for each contact from a sorted list of length n has a complexity of O(log2(n)), while sampling a Gaussian mixture model is constant time. Therefore, the complexity of evaluating either model is O(m log2(n)). In our experimental data, m < 30 and the longest contact list was n = 1400, a very manageable computational load.
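The O(m log2(n)) evaluation pattern can be sketched as a single pass over the event log with a binary search per answered event. The event tuple layout, the `"in"`/`"out"` direction strings, and the names `lookup` and `response_delays` are assumptions for illustration; the Gaussian mixture evaluation itself is omitted.

```python
from bisect import bisect_left


def lookup(sorted_numbers, params, number):
    """Binary search in a sorted contact list: O(log2 n) comparisons."""
    i = bisect_left(sorted_numbers, number)
    if i < len(sorted_numbers) and sorted_numbers[i] == number:
        return params[i]
    return None


def response_delays(events, sorted_numbers, params):
    """Single past-to-present pass over m events: O(m) iterations,
    each doing at most one O(log2 n) parameter lookup."""
    pending = {}  # number -> time of the last unanswered incoming event
    results = []
    for time, number, direction in events:
        if direction == "in":
            pending[number] = time
        elif number in pending:
            delay = time - pending.pop(number)
            results.append((number, delay, lookup(sorted_numbers, params, number)))
    return results


# Made-up contacts and events (times in minutes).
numbers = ["A", "B", "C"]
params = ["params-A", "params-B", "params-C"]
events = [(0, "B", "in"), (5, "A", "in"), (30, "B", "out"), (40, "C", "out")]
delays = response_delays(events, numbers, params)
```

With the bounds reported above (m < 30, n = 1400), each lookup needs at most 11 comparisons, so one evaluation involves on the order of a few hundred comparisons.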

In the case of the outdoor mobility model, assume the n-gram model has a vocabulary of s words (formed by concatenating cell tower IDs and discretized signal power-levels). By representing the n-gram model as a tree with a branching factor of s, obtaining the likelihood of an n-gram (assuming the n-gram is present) is O(n × s). If the n-gram is not found, back-off smoothing requires the likelihood of an (n − 1)-gram subsequence. This process repeats until a subsequence is found in the model, and in the worst case it repeats until a unigram is reached. The querying complexity in the worst case is $\sum_{i=1}^{n} (i \times s) = \frac{n(n+1)}{2}\, s$. Since n is a constant value, the complexity is O(s) (where s < 1000 for our experiments).
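The tree look-up with back-off can be sketched as below. This is a simplified stand-in rather than the smoothing scheme used in the paper: it backs off by dropping the oldest label and multiplying by a fixed penalty, and the `penalty` and `floor` values, the function names, and the sample labels are all assumptions. Children are stored in dictionaries here, so each tree step takes expected constant time; the O(n × s) bound above corresponds to scanning up to s children at each of n levels.

```python
def insert(trie, ngram, prob):
    """Store the likelihood of `ngram` (a tuple of labels) in the tree."""
    node = trie
    for label in ngram:
        node = node.setdefault("children", {}).setdefault(label, {})
    node["prob"] = prob


def find(trie, ngram):
    """Walk one tree level per label; return None if the n-gram is absent."""
    node = trie
    for label in ngram:
        node = node.get("children", {}).get(label)
        if node is None:
            return None
    return node.get("prob")


def backoff_likelihood(trie, ngram, penalty=0.4, floor=1e-6):
    """Try the full n-gram, then successively shorter suffixes.

    In the worst case this performs look-ups of length n, n-1, ..., 1,
    i.e., n(n+1)/2 tree steps in total.
    """
    weight = 1.0
    sequence = tuple(ngram)
    while sequence:
        prob = find(trie, sequence)
        if prob is not None:
            return weight * prob
        sequence = sequence[1:]  # drop the oldest label
        weight *= penalty        # hypothetical fixed back-off penalty
    return weight * floor        # nothing matched, not even a unigram


# Made-up model with one bigram and one unigram.
model = {}
insert(model, ("cell1", "cell2"), 0.5)
insert(model, ("cell2",), 0.2)
seen = backoff_likelihood(model, ("cell1", "cell2"))
backed_off = backoff_likelihood(model, ("cell9", "cell2"))
```

For n = 10, the worst case amounts to 55 tree steps per query, which is why the cost depends only on the vocabulary size once n is fixed.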

The indoor mobility model’s n-gram look-up complexity remains constant-time, unlike the outdoor mobility model, as the vocabulary size is fixed at training time by selecting the number of WiFi signature clusters (see Section III-D). This model’s complexity is dominated by the process of converting a list of observed networks to a “word” in the vocabulary. If the number of networks in the model is m and the last scan retrieved a list of p networks, creating the bit-vector used to look up the word has, in our implementation, O(m.p) complexity. While there is room for optimization, in our experiments m < 1000 and p < 25, so implementing these improvements was not considered a priority.
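The O(m.p) conversion can be pictured with the sketch below, which checks every model network against the list returned by the last scan. The function name `scan_to_bitvector` and the network identifiers are hypothetical, and the subsequent mapping from bit-vector to vocabulary word is not shown.

```python
def scan_to_bitvector(model_networks, scan):
    """Build one bit per model network: 1 if the network was seen in the scan.

    `scan` is a plain list, so each membership test costs O(p) and the
    whole conversion costs O(m * p) for m model networks.
    """
    return tuple(1 if network in scan else 0 for network in model_networks)


# Made-up access point identifiers.
model_networks = ["ap1", "ap2", "ap3", "ap4"]
scan = ["ap3", "ap1", "guest-net"]
bits = scan_to_bitvector(model_networks, scan)
```

One possible optimization, offered here only as a suggestion, is to convert `scan` to a `set` first, which reduces the expected cost to O(m + p); with m < 1000 and p < 25 the difference is small.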

D. Power Consumption of SALCS

Next we present the results of a power consumption experiment contrasting the power usage of SALCS with that of the GPS model. The experiment was conducted with 5 students, each carrying an Android Nexus One device loaded with the SALCS system and a separate application running the GPS model. The applications were loaded with models trained offline (i.e., on a desktop computer), as we expect the models to be trained or re-trained when the users dock the device with their home computers. The power consumption of each application was monitored using PowerTutor [21] and the overall results are presented in Table II. As the table shows, the amount of power consumed by SALCS is approximately 3000X less than that consumed by the GPS model.



TABLE I: Summary of results comparing the combined behavior models against the GPS mobility model.

                          TPR            FPR            Coverage (Likelihood)
                          (at 20% FPR)   (at 80% TPR)   0000-0800   0800-1200   1200-1800   1800-0000   Overall
SALCS (60 min. window)    93.18%         14.20%         21.96%      46.89%      56.86%      47.74%      41.28%
GPS model                 86.58%         15.07%         5.02%       9.66%       11.07%      8.37%       8.14%
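The summary figures quoted in the text can be reproduced from the entries of Table I with a few lines of arithmetic. The sketch assumes that the Overall column is the average of the four time slots weighted by their length in hours; this is an assumption, although the reported values are consistent with it.

```python
# Slot lengths in hours: 0000-0800, 0800-1200, 1200-1800, 1800-0000.
hours = [8, 4, 6, 6]
salcs_coverage = [21.96, 46.89, 56.86, 47.74]  # percent, from Table I
gps_coverage = [5.02, 9.66, 11.07, 8.37]       # percent, from Table I


def day_average(coverage):
    """Hour-weighted average coverage over the 24-hour day."""
    return sum(h * c for h, c in zip(hours, coverage)) / sum(hours)


salcs_overall = day_average(salcs_coverage)    # close to the reported 41.28%
gps_overall = day_average(gps_coverage)        # close to the reported 8.14%
coverage_ratio = salcs_overall / gps_overall   # a little over 5
tpr_gain = (93.18 - 86.58) / 86.58 * 100       # relative TPR gain, about 7.6%
```

The relative TPR gain of about 7.6% corresponds to an absolute difference of 6.6 percentage points at a 20% FPR.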

TABLE II: Power consumption comparison between the combined behavior models and the GPS mobility model.

                          Average (mW)   St. Dev. (mW)
SALCS                     0.0419         0.0178
GPS Model
  Sensing                 175.0028       35.0250
  Processing              0.0356         0.0136
  Total                   175.0384       35.0386

VI. DISCUSSION

From the experimental results in the previous section it is clear that SALCS performs with high accuracy and high coverage throughout the day. SALCS’s high accuracy in our experiments with real-world data shows its robustness against noisy real-world data. The models in the experiments were trained with just one month of user data, indicating that sparse training data does not affect the performance of the models significantly. However, with more training data, we expect SALCS to show further improvements in anomaly detection accuracy. With larger training datasets it is possible to incorporate a wider feature set for each of the models (e.g., the message response model can incorporate the time of day as a feature to model the delay in responding to a message), thus improving the accuracy of the models. As SALCS is deployed in the real world, new data collected by the device (after a period of time to ensure the logged behavior is normal) can be used to further improve the accuracy of the models. Each application can tune SALCS’s four-dimensional parameter space of threshold values according to its requirements (e.g., a personal assistant might not want to miss diversions from routine, at the expense of a few extra false positives). Therefore SALCS can be used to detect anomalous behavior of the user for a variety of different applications such as 1) improving the adaptiveness of digital assistants, 2) elder-care and child-care monitoring systems where the phone continuously monitors a phone-carrying user’s behavior and reports to a caregiver only if there is a significant variation, thus maintaining a reasonable level of privacy for the user while still keeping the user within quick reach of a caregiver, and 3) supplementing existing authentication systems by pushing authentication into the background and only reverting to the more secure password-based schemes when an anomaly is detected (or data is insufficient for detection).

VII. CONCLUSION

In this paper, we present novel methods of modeling various aspects of user behavior using information sources typically available as part of a modern mobile phone’s routine usage, at no extra sensing power cost. SALCS, formed by combining these models, outperforms an anomaly detection system that uses power-hungry information streams such as GPS. We also show that, using these readily available information sources, SALCS provides better anomaly detection coverage than systems using dedicated, power-hungry sensors.

While SALCS uses a straightforward voting approach with a static weighting of models, we plan to investigate the use of ensemble learning approaches to better combine these models. We also plan to investigate the improvement that can be gained by incorporating models that require information sources with higher power requirements (e.g., GPS).

REFERENCES

[1] I. Jermyn, A. Mayer, F. Monrose, M. Reiter, and A. Rubin, “The design and analysis of graphical passwords,” in 8th USENIX Security Symposium, 1999, pp. 1–14.

[2] R. Biddle, S. Chiasson, and P. Van Oorschot, “Graphical passwords: Learning from the first twelve years,” ACM Computing Surveys (CSUR), vol. 44, no. 4, p. 19, 2012.

[3] K. Everitt, T. Bragin, J. Fogarty, and T. Kohno, “A comprehensive study of frequency, interference, and training of multiple graphical passwords,” in CHI 2009, 2009, pp. 889–898.

[4] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 1–58, 2009.

[5] J. Krumm and E. Horvitz, “Predestination: Inferring destinations from partial trajectories,” in Ubicomp 2006, 2006, pp. 243–260.

[6] B. D. Ziebart et al., “Navigate like a cabbie: probabilistic reasoning from observed context-aware behavior,” in UbiComp, 2008.

[7] S. Buthpitiya, Y. Zhang, A. Dey, and M. Griss, “n-gram geo-trace modeling,” in Pervasive Computing, vol. 6696, 2011, pp. 97–114.

[8] D. Gafurov, E. Snekkenes, and P. Bours, “Gait authentication and identification using wearable accelerometer sensor,” in Automatic Identification Adv. Technologies, IEEE Workshop on, 2007, pp. 220–225.

[9] M. Derawi, C. Nickel, P. Bours, and C. Busch, “Unobtrusive user-authentication on mobile phones using biometric gait recognition,” in Proc. IIH-MSP, 2010, pp. 306–311.

[10] J. Zhu and Y. Zhang, “Towards accountable mobility model: A language approach on user behavior modeling in office wlan,” in Computer Communications and Networks (ICCCN 2011), 2011, pp. 1–6.

[11] B. Liao, “Anomaly detection in gps data based on visual analytics,” Master’s thesis, University of Illinois at Urbana-Champaign, 2010.

[12] T.-S. Ma, “Real-time anomaly detection for traveling individuals,” in Assets ’09, 2009, pp. 273–274.

[13] L. Vu, Q. Do, and K. Nahrstedt, “Jyotish: A novel framework for constructing predictive model of people movement from joint wifi/bluetooth trace,” in PerCom, 2011, pp. 54–62.

[14] J. Mantyjarvi, M. Lindholm, E. Vildjiounaite, S. Makela, and H. Ailisto, “Identifying users of portable devices from gait pattern with accelerometers,” in Proc. ICASSP, 2005, pp. 973–976.

[15] B. Priyantha, D. Lymberopoulos, and J. Liu, “Littlerock: Enabling energy-efficient continuous sensing on mobile phones,” IEEE Pervasive Comput., vol. 10, no. 2, pp. 12–15, 2011.

[16] S. Patel, J. Pierce, and G. Abowd, “A gesture-based authentication scheme for untrusted public terminals,” in UIST, 2004, pp. 157–160.

[17] E. Shi, Y. Niu, M. Jakobsson, and R. Chow, “Implicit authentication through learning user behavior,” in Information Security, M. Burmester, G. Tsudik, S. Magliveras, and I. Ilic, Eds., 2011, vol. 6531, pp. 99–113.

[18] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, Tech. Rep., 1948.

[19] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Transactions on Information Systems, vol. 22, no. 2, pp. 179–214, 2004.

[20] H. Liu, H. Darabi, P. Banerjee, and J. Liu, “Survey of wireless indoor positioning techniques and systems,” IEEE Trans. Syst., Man, Cybern. C, vol. 37, no. 6, pp. 1067–1080, 2007.

[21] L. Zhang et al., “Accurate online power estimation and automatic battery behavior based power model generation for smartphones,” in Proc. CODES+ISSS, 2010, pp. 105–114.

