
Guest Editors’ Introduction

Data Surveillance

Simson Garfinkel and Michael D. Smith, Harvard University

Can you find a terrorist in your database? Do the register receipts from discount drug stores hold the secret to stopping avian flu outbreaks before they become epidemics? Can anonymized data be “re-identified,” compromising privacy and possibly jeopardizing personal safety?

Government and industry are increasingly turning to data mining with the hope that advanced statistical techniques will connect the dots and uncover important patterns in massive databases. Proponents hope that this so-called data surveillance technology will be able to anticipate and prevent terrorist attacks, detect disease outbreaks, and allow for detailed social science research—all without the corresponding risks to personal privacy because machines, not people, perform the surveillance.

Emergence of data surveillance

The US public got its first look at data surveillance in 2002, when it learned about the Pentagon’s Total Information Awareness (TIA) project. The project, one of several sponsored by the Information Awareness Office (IAO) at DARPA, quickly became a lightning rod for attacks by privacy activists, critics of the Bush administration, and even conspiracy theorists. Some of these attacks were motivated by what the IAO proposed; others were based on the fact that the IAO was headed by Admiral John Poindexter, who had a pivotal role in the 1980s Iran-Contra Affair; and still others were based on the IAO’s logo—the all-seeing Eye of Providence floating above an unfinished pyramid (similar to the seal on the back of the US$1 bill), carefully watching over the Earth.

Critics charge that data surveillance is fraught with problems and hidden dangers. One school of thought says that surveillance is surveillance: whether it’s done by a person or a computer, some kind of violation of personal privacy or liberty has occurred. The database, once created, must be protected with extraordinary measures and is subject to abuse. Moreover, if data surveillance technology says a person is a potential terrorist, that person could then be subject to additional surveillance, questioning, or detention—even if they haven’t done anything wrong. After extensive media coverage and congressional hearings, Congress terminated funding for the IAO in 2003.

Data surveillance jumped again to the front pages of US newspapers in May 2006, when USA Today published a story alleging that the nation’s telephone companies had shared the telephone records of millions of Americans with the National Security Agency (NSA). According to USA Today, the NSA used this information to create a database of detailed information for every telephone call made within the nation’s borders. The spy agency then mined this database to uncover hidden terrorist networks.

A database of every phone call within the US extending years back through time could certainly prove useful for fighting domestic terrorists. If a French male of Moroccan descent were arrested in Minnesota after receiving 50 hours of flight training and a US$14,000 wire transfer from a known terrorist, such a database could provide a report of every person who called or received a telephone call from that individual. This kind of social network analysis would prove invaluable in finding the would-be pilot’s comrades-in-arms, but it might also identify his flight teacher, his pizza delivery service, and even the teenager who mows his lawn.

Yet in all of the media coverage of the TIA, the NSA database, and similar projects, many questions seem to be left unasked and unanswered: Does the technology really work—can you find the terrorist in the database? Can you find the terrorist without destroying everybody else’s privacy in the process? Is it possible to perform privacy-protecting data mining—at least in theory? And is it possible to turn that theory into practice, or do too many real-world concerns get in the way?

The workshop

To explore these questions, as well as to improve the public’s general understanding of them, Harvard’s Center for Research on Computation and Society held a day-long workshop in June 2006 on data surveillance and privacy protection. Robert Popp, who served as deputy of the IAO under Poindexter and was a driving force behind the TIA program, gave the keynote address. Popp presented his and Poindexter’s vision for using data surveillance for countering terrorism; an article based on that presentation appears on p. 18 of this special issue. (For more information about the workshop, please visit http://crcs.deas.harvard.edu/workshop/2006/.)

Whether data surveillance can find real terrorists and stop actual attacks before they happen is still unproven in the academic literature. Although there’s no denying that data surveillance has resulted in arrests, it’s not clear if those individuals were actually terrorists or merely people writing novels about terrorists. On the other hand, attendees learned that data surveillance is useful for many activities beyond hunting terrorists.

For example, it might be able to detect and help public health officials contain an outbreak of avian flu. Kenneth Mandl, a researcher at the Harvard Medical School Center for Biomedical Informatics, showed how emergency room admission records could anticipate the deaths associated with pneumonia and influenza reported to the US Centers for Disease Control. This isn’t tremendously surprising, of course—many deaths result from individuals who didn’t seek treatment until it was too late, then went to the emergency room. But what’s exciting, Mandl reported, is that pediatric admissions peak roughly a month before adult emergency room admissions. With this knowledge, public health officials could build a system that predicted adult outbreaks by monitoring admissions of children. If outbreaks can be predicted, it might be possible to nip them in the bud.
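To make the lead-lag idea concrete, here is a minimal Python sketch (ours, not Mandl’s method) that estimates how many weeks one admissions series leads another by scanning lagged correlations; the weekly granularity, lag range, and synthetic data are assumptions for illustration only.

```python
import numpy as np

def lead_in_weeks(pediatric, adult, max_lag=8):
    """Find the lag (in weeks) at which the pediatric series best
    predicts the adult series, by maximizing lagged correlation."""
    p = (pediatric - pediatric.mean()) / pediatric.std()
    a = (adult - adult.mean()) / adult.std()
    best_lag, best_corr = 0, -1.0
    for lag in range(1, max_lag + 1):
        # Compare pediatric week t against adult week t + lag.
        corr = float(np.corrcoef(p[:-lag], a[lag:])[0, 1])
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr

# Synthetic check: the adult series trails the pediatric one by 4 weeks.
rng = np.random.default_rng(0)
peds = rng.poisson(50, 104).astype(float)
adults = np.roll(peds, 4) + rng.normal(0, 1, 104)
print(lead_in_weeks(peds, adults))   # -> (4, ~1.0)
```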

The Realtime Outbreak and Disease Surveillance (RODS) project at the University of Pittsburgh might one day be able to provide health officials with even better advance warning. The RODS project monitors the sale of over-the-counter cold remedies and other healthcare products in more than 20,000 stores throughout the US. The theory here is that people will attempt to treat themselves with over-the-counter cold medications before they get so sick that they report to the hospital emergency room. Their work so far indicates that sales of these medications peak two weeks before hospital admissions do.

RODS researchers stress that this so-called biosurveillance (the continuous collation and analysis of medically related statistical data) doesn’t violate privacy because no personally identifiable information is ever assembled or reported. The system collects only aggregate sales from a sampling of the nation’s largest drugstores and mass-merchandise chains. Of course, in the case of an actual avian flu outbreak or bioterrorism attack, it would be helpful to know the names of those infected. But collecting this information isn’t necessary to achieve the project’s primary goals and would create unacceptable risks to those involved because the data would be so easily subject to abuse.
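As a toy illustration of how aggregate-only monitoring might flag an outbreak signal, the following sketch (our own simplification, not the RODS algorithm) raises an alert when a week’s total sales exceed a trailing baseline by several standard deviations; the window and threshold values are arbitrary assumptions.

```python
import statistics

def detect_spikes(weekly_sales, window=8, threshold=3.0):
    """Flag weeks whose aggregate sales exceed the trailing mean by
    more than `threshold` standard deviations. Operates only on
    aggregate counts -- no personally identifiable information."""
    alerts = []
    for i in range(window, len(weekly_sales)):
        history = weekly_sales[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.stdev(history) or 1.0   # guard against zero
        if (weekly_sales[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts

sales = [100, 98, 103, 99, 101, 97, 102, 100, 100, 160]
print(detect_spikes(sales))   # [9] -- the final week is anomalous
```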

In his presentation, Mandl discussed five techniques that organizations collecting data could adopt to protect privacy in large-scale data mining efforts. Policies must be set in place to limit access to sensitive data, and organizations must then police themselves to make sure that those policies are enforced. When possible, data subjects should be given the ability to exert personal control over their own information. Data should be de-identified whenever possible. Finally, says Mandl, all data must be stored with encryption, so that it’s protected in the event of a breach.

All of Mandl’s techniques assume that sensitive information will be collected and ultimately used in a highly controlled environment. Indeed, this is a model familiar to most people today. Law enforcement, national intelligence organizations, businesses, and even journalists collect a lot of sensitive information during the day-to-day course of their work and then carefully control who can access the information and how they can use it. Big fences and background investigations are an unfortunate but necessary part of the security model for these organizations.

But another approach on the horizon is to use advanced algorithms and cryptographic theory to avoid the problem in the first place. Algorithms and systems now under development make it possible to collect information in a kind of predigested form that allows some queries (but not others) and that makes it impossible to recover the original, undigested data. The US National Science Foundation-funded project on Privacy, Obligations, and Rights in Technologies of Information Assessment is developing a host of tools and technologies to enable this kind of privacy-preserving data mining. Other work is being done at the University of California, Los Angeles’s Center for Information and Computer Security.
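One way to picture “predigested” data that answers some queries but not others is a Bloom filter: it can report whether a given identifier was ever observed but cannot enumerate or reconstruct the identifiers it holds. The sketch below is a generic illustration of that idea, not the design of any project named above.

```python
import hashlib

class BloomFilter:
    """Predigested store: supports 'was x observed?' queries, but the
    original items cannot be listed or recovered from the bit array."""
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("617-555-0199")
print("617-555-0199" in seen)   # True
print("617-555-0100" in seen)   # False (with high probability)
```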

In this special issue of IEEE Security & Privacy, we present two articles based on presentations at our workshop. In addition to the article from Popp and Poindexter, we have an article by Jeff Jonas, founder of IBM’s Entity Analytic Solutions division, who explains the process of entity resolution—a technique by which names in different databases are determined to represent the same person.
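To give a flavor of what entity resolution involves, here is a deliberately simplified sketch (not Jonas’s algorithm) that treats two records as the same person when their normalized name tokens mostly overlap and their dates of birth match; real systems use far richer evidence.

```python
import re

def normalize(name):
    """Lowercase and strip punctuation so that 'Smith, John Q.' and
    'john q smith' yield the same token set."""
    return set(re.findall(r"[a-z]+", name.lower()))

def same_entity(rec_a, rec_b):
    """Guess whether two records describe the same person using
    name-token overlap (Jaccard similarity) plus an exact DOB match."""
    a, b = normalize(rec_a["name"]), normalize(rec_b["name"])
    overlap = len(a & b) / max(len(a | b), 1)
    return overlap >= 0.5 and rec_a["dob"] == rec_b["dob"]

print(same_entity({"name": "Smith, John Q.", "dob": "1970-01-01"},
                  {"name": "john q smith",   "dob": "1970-01-01"}))  # True
```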

One of the fundamental technical disagreements between these two articles pertains to the kinds of queries performed. Popp and Poindexter advocate pattern-based queries—for example, a standing query could scan for anyone who buys large quantities of fertilizer, fuel oil, and 55-gallon drums, and then rents a truck. Such queries might find new, spontaneous terrorist groups, but they’re also more likely to pick up people who have no evil intent—for example, farmers. Jonas, meanwhile, argues that we should focus our limited resources using relationship information—for example, looking for terrorists by looking for individuals who have nonobvious relationships with other terrorists.

Although many of the intelligence techniques and operational details of the global war on terrorism must necessarily remain secret for them to be effective, we believe that it’s both possible and necessary to have an informed academic debate on both the tools and the appropriateness of data surveillance. If these techniques are effective, they could be tremendously beneficial to our society in the fight against terrorism, disease, and even economic inefficiency. But if they don’t work, we need to know that, too—so that we can spend our limited resources on other approaches.

Simson L. Garfinkel is a postdoctoral fellow at Harvard’s Center for Research on Computation and Society. His research interests include computer security, computer forensics, and privacy. Contact him at [email protected].

Michael D. Smith is the Gordon McKay Professor of Computer Science and Electrical Engineering and the associate dean for computer science and engineering at Harvard’s Division of Engineering and Applied Sciences. His research interests include dynamic optimization, machine-specific and profile-driven compilation, high-performance computer architecture, and practical applications of computer security.


Countering Terrorism through Information and Privacy-Protection Technologies

Robert Popp, National Security Innovations
John Poindexter, JMP Consulting

Security and privacy aren’t dichotomous or conflicting concerns—the solution lies in developing and integrating advanced information technologies for counterterrorism along with privacy-protection technologies to safeguard civil liberties. Coordinated policies can help bind the two to their intended use.

The terrorist attacks of September 11, 2001, transformed America like no other event since Pearl Harbor. The resulting battle against terrorism has become a national focus, and “connecting the dots” has become the watchword for using information and intelligence to protect the US from future attacks.

Advanced and emerging information technologies offer key assets in confronting a secretive, asymmetric, and networked enemy. Yet, in a free and open society, policies must ensure that these powerful technologies are used responsibly and that privacy and civil liberties remain protected. In short, Americans want the government to protect them from terrorist attacks, but fear the privacy implications of the government’s use of powerful technology inadequately controlled by regulation and oversight. Some people believe the dual objectives of greater security and greater privacy present competing needs and require a trade-off; others disagree.1–3

This article describes a vision for countering terrorism through information and privacy-protection technologies. This vision was initially imagined as part of a research and development (R&D) agenda sponsored by DARPA in 2002 in the form of the Information Awareness Office (IAO) and the Total Information Awareness (TIA) program. It includes a critical focus and commitment to delicately balancing national security objectives with privacy and civil liberties. We strongly believe that the two don’t conflict and that the ultimate solution lies in utilizing information technologies for counterterrorism along with privacy-protection technologies to safeguard civil liberties, and twining them together with coordinated policies that bind them to their intended use.

Background and motivation

Terrorists are typically indistinguishable from the local civilian population. They aren’t part of an organized, conventional military force—rather, they form highly adaptive organizational webs based on tribal or religious affinities. They conduct quasi-military operations using instruments of legitimate activity found in any open or modern society, making extensive use of the Internet, cell phones, the press, schools, houses of worship, prisons, hospitals, commercial vehicles, and financial systems. Terrorists deliberately attack civilian populations with the objective of killing as many people as possible and creating chaos and destruction. They see weapons of mass destruction not as an option of last resort but as an equalizer—a weapon of choice.

Of the numerous challenges to countering terrorism, none is more significant than being able to detect, identify, and preempt terrorists and terrorist cells whose identities and whereabouts are unknown a priori. (Alan Dershowitz’s Preemption: A Knife that Cuts Both Ways [W.W. Norton & Company, 2006] offers an extensive discussion of preemption and the need for a legal structure.) In our judgment, if preemption is the goal, the key to detecting terrorists is to look for patterns of activity indicative of terrorist plots based on observations of current plots and past terrorist attacks, including estimates about how terrorists will adapt to avoid detection. Our fundamental hypothesis is that if terrorists plan to launch an attack, the plot must involve people (the terrorists, their financiers, and so forth). The transactions all these people conduct will manifest in databases owned by public, commercial, and government sectors and will leave a signature—detectable clues—in the information space. Because terrorists operate worldwide, data associated with their activities will be mixed with data about people who aren’t terrorists. If the government wants access to this activity data, then it must also have some way to protect the privacy of those who aren’t involved in terrorism.

This hypothesis has several inherent critical challenges. First, can counterterrorism analysts imagine and understand the numerous signatures that terrorist plans, plots, and activities will create? Second, if they do understand these signatures, can analysts detect them when they’re embedded in a world of information noise before the attacks happen (in this context, noise refers to transactions corresponding to nonterrorists)? Finally, can analysts detect these signatures without adversely violating the privacy or civil liberties of nonterrorists? Ultimately, the goal should be to understand the level of improvement possible in our counterterrorism capabilities if the government could use advanced information technologies and access a greater portion of the information space, but also to consider the impact—if any—on policies such as privacy, and then mitigate this impact with privacy-protection technology and corresponding policy.2,3

Countering terrorism

Information technology plays a crucial role in—and is a major tenet of—our counterterrorism strategy because it ultimately has to make sense out of and connect the relatively few and sparse dots embedded within the massive amounts of information potentially available to, and already flowing into, the government’s intelligence and counterterrorism agencies.

Numerous information technologies can help intelligence analysts detect and understand the clues terrorists leave behind when plotting their next move. In the simplest terms, these technologies fall into one of two broad categories: collections and analytics. Figure 1 provides a simple illustration of this counterterrorism framework. For collections, we won’t discuss the vast array of sensor technologies that fall within this category here; instead, see Table 1, which provides a sample of the authorization provided to the US intelligence community for its foreign and domestic intelligence and counterintelligence data collections.

For analytics, key intelligence tools include collaboration; text analysis and decision aides; natural language processing (in particular, speech-to-text transcription and foreign-to-English translation); pattern analysis; and predictive (anticipatory) modeling. These technologies help analysts create models (and discover instances of new models) of terrorist activity patterns; search and exploit vast amounts of multimedia, multiformat, and multilingual speech and text; extract entities and entity relationships from massive amounts of data; collaborate, reason, and share information and analyses so that analysts can hypothesize, test, and propose theories and mitigating strategies about plausible futures; and advise decision- and policy-makers on the impact of current or future policies and prospective courses of action. We don’t discuss these technologies in detail here, but more information appears elsewhere.1,4,5

In our view, modeling tools play a crucial role in countering terrorism. The analytical community first creates scenarios of terrorist plots and attacks using previous attacks, intelligence reports, red teams, war games, tabletop exercises, and the like. These terrorism scenarios would consist of a range of transactions and steps that terrorists must perform in support of their plot to attack a specific target type using a specific mode of attack. Analysts then codify these scenarios in a set of quantitative and computational models based on a wide range of nonlinear mathematical and nondeterministic stochastic computational approaches for capturing social phenomena and pathological behavior. These models are essentially hypotheses about terrorist plots and would be translated into a series of questions about the types of transactions terrorists would need to execute, the types of evidence analysts would need to accrue, the keywords and patterns analysts would need to associate, and the like.

Terrorist activity isn’t easily reduced or amenable to classical analytical methods; moreover, the associated data can be incredibly poor due to ambiguous, erroneous, and conflicting reports. No single theory or modeling approach is sufficient, so we must integrate an ensemble of models that together have more information than any single model, to estimate a range of plausible futures and provide competing explanations as to what the information means. Robust adaptive strategies that hedge across these plausible futures will provide practical, actionable options for the decision-maker to consider.1,4,5

Early results show promise

The importance (and promise) of these information technologies has already emerged through experiments conducted with several entities in the intelligence community. Experiments let us assess these technologies for utility and merit in the context of real-world problems before large amounts of funds are expended to fully implement them. Moreover, to push the envelope of what’s possible, failure in the experimental environment is an acceptable outcome for a particular technology.

Figure 2 shows an approach to understanding how to measure the operational payoff of information technologies for counterterrorism. As the graphic shows, when doing traditional analysis, an analyst spends much of his or her time on the major processes broadly defined as research, analysis, and production. This bathtub curve shows that analysts spend much time doing research and production but too little time doing analysis. An objective of conducting experiments with this curve is to determine whether we can improve analyses via information technology by reversing this trend and inverting the curve.4,5

Specifically, Figure 2 shows the results of an experiment in which the intelligence question posed to analysts was, “What is the threat posed by Al Qaeda’s weapons of mass destruction capabilities to several cities in the US?” The data were drawn from various classified intelligence sources, foreign news reports, and the Associated Press (AP) and other wire services.4,5 The information technologies used in the experiments included a peer-to-peer collaboration tool, a structured argumentation decision aide, a multilingual processing tool for audio phonetic searching/indexing as well as text filtering/categorization, and several graph-based link analysis tools. The results of the experiment show an inverted bathtub curve, allowing for more and better analysis in a shorter period of time as a result of analysts using information technologies. The obvious significance is that analysts spend a greater percentage of their time doing what is most important in our view—namely, the critical-thinking tasks instead of the more mundane research and production tasks. The results also included an impressive savings in analyst labor (that is, half as many analysts participated in the IT-enhanced analysis) and an increase in the number of reports produced (that is, analysts created five reports in the time it took to create one manually).

Figure 1. Counterterrorism framework. Information technologies fall into one of two broad categories: collections and analytics.

Our explanation for the bathtub curve’s inversion for the intelligence question at hand includes the following:

• The time spent in the research phase shrank dramatically by using the collaboration tool (Groove) across multiple agencies to harvest and share “all” pertinent data.
• The structured argumentation modeling tool (SEAS, for Structured Evidentiary Argumentation System) let analysts explicitly represent their hypotheses for comparison and assessment, and identify evidentiary data gaps for which data must be searched and harvested.
• The multilingual processing tool (FastTalk) let analysts phonetically index and search vast quantities of foreign audio streams and thereby reduce the time required to find pertinent data.
• The link analysis tools (Analyst Notebook) let analysts automatically capture portions of their analysis in an easy-to-understand visual format.4,5

Figure 3 shows the utility of various information technologies in detainee operations support. In this scenario, actual government interrogators questioned actual detainees at the US military facility at Guantanamo Bay, Cuba, and wanted analytical support to make sense of the stacks of real reports from hundreds of interrogation sessions. The analysts used a link analysis tool to find nonobvious relationships between different entities (people, places, and things), a group detection tool to find nonobvious groupings among entities, an entity resolution tool to resolve entities and aliases in the interrogation reports, a Bayesian classification tool to classify detainees of unknown status as either statistically more likely to resemble known terrorists or nonterrorists, and a link chart visualization tool to pull everything together. These tools showed the interrogators web-like diagrams of connections (or relationships) among different entities that weren’t readily apparent, inconsistencies in detainee stories, salient relationships across detainees, useless data to disregard, and data that could be most informative for follow-up interrogations. The tools’ output also included a rank-ordered list of detainees with the likelihood that each had attributes resembling known terrorists or nonterrorists.4–6 (It should be noted that officials at Guantanamo Bay established the “ground truth” in terms of which detainees were terrorists and which ones weren’t.) Based on conversations with the intelligence analysts who performed this work, anecdotal evidence suggests that the detainees classified as “likely a terrorist” were in fact terrorists, and no cases were found in which detainees who weren’t terrorists were classified as “likely a terrorist.”
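As a rough illustration of the Bayesian classification step described above, the sketch below trains naive per-attribute probability estimates from known positive and negative record sets and ranks unknowns by log-likelihood ratio. The set-of-attributes representation, Laplace smoothing, and example attributes are our assumptions, not details of the actual tool.

```python
import math

def train(records):
    """Estimate per-attribute probabilities with Laplace smoothing.
    Each record is a set of observed attribute strings."""
    counts = {}
    for rec in records:
        for attr in rec:
            counts[attr] = counts.get(attr, 0) + 1
    n = len(records)
    return {a: (c + 1) / (n + 2) for a, c in counts.items()}, n

def log_likelihood(model, rec, vocab):
    probs, n = model
    score = 0.0
    for attr in vocab:
        p = probs.get(attr, 1 / (n + 2))   # unseen attribute
        score += math.log(p if attr in rec else 1 - p)
    return score

def rank_unknowns(known_pos, known_neg, unknowns):
    """Rank unknown records from most to least 'positive-like' by
    naive-Bayes log-likelihood ratio."""
    vocab = set().union(*known_pos, *known_neg, *unknowns)
    pos, neg = train(known_pos), train(known_neg)
    scored = [(log_likelihood(pos, u, vocab)
               - log_likelihood(neg, u, vocab), u) for u in unknowns]
    return sorted(scored, key=lambda t: t[0], reverse=True)

pos = [{"trained_abroad", "wire_transfer"}, {"trained_abroad", "false_id"}]
neg = [{"wire_transfer"}, {"student_visa"}]
unk = [{"trained_abroad", "wire_transfer", "false_id"}, {"student_visa"}]
for score, rec in rank_unknowns(pos, neg, unk):
    print(round(score, 2), sorted(rec))
```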

Table 1. Sample of the US intelligence community’s legal authority for data collection.

Executive Order (EO) 12333: Authorizes US intelligence activities.
Foreign Intelligence Surveillance Act (FISA) of 1978: Prescribes procedures for physical and electronic surveillance and collection of intelligence information between or among foreign powers.
USA Patriot Act: Dramatically expands the authority of American law enforcement for fighting terrorism in the US and abroad.
US Department of Defense (DoD) Directive 5240.1-R: Provides the DoD with implementation guidance for EO 12333.
Army Regulation 381-10: Provides the Army with implementation guidance for DoD Directive 5240.1-R.
US Signals Intelligence Directive (USSID) 18: Governs signals intelligence (SIGINT) for the National Security Agency (NSA).

Figure 2. The analyst “bathtub” curve. The red curve represents the baseline distribution of time an analyst manually spends on research, analysis, and production (roughly 26, 7, and 67 percent, respectively); the blue curve represents the improvement due to information technology enhancements (roughly 25, 58, and 17 percent).

Figure 4 shows an experiment in which a novel multilingual IT front-end system automatically ingests, transforms, extracts, and autopopulates in near real time the back-end analytical models from massive amounts of text data. In this experiment, the problem concerned understanding and forecasting the preconditions and root causes that give rise to instability in nation-states. Failed states are important because they offer a safe haven and potential breeding ground for terrorists. The challenge posed to analysts here was to assess and forecast the level of instability in two specific countries in Southeast Asia. The data came from a variety of open sources and included more than 1 million English documents and 2,300 non-English documents. The information technologies used included a back-end rebel activity model (RAM) based on a Bayesian network and hidden Markov models (HMMs) that measured the amount of rebel activity (on the part of separatists, insurgents, terrorists, Islamic extremists, and so forth); a front-end language-independent text-based transformation and categorization tool based on a Hilbert engine (a technology that numerically encodes ASCII text into vectors in Hilbert space); and a linguistic pattern analyzer (LPA) that automatically populates the HMMs in the RAM model.

The experiment’s results were impressive—given a corpus of 1,236,300 documents, a human would need 117 man-years to read it all (assuming it took 12 minutes to read each document), or 280 humans to read the documents in six months. The automated front-end system based on LPA, the Hilbert engine, and RAM would take a mere 0.05 man-years, with a one-time cost of 0.76 man-years to configure LPA with the numerous multilingual scripts. Assuming it costs US$100K per man-year, the automated front end would provide a savings of US$11,695,141 over the human method.
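The man-year arithmetic can be checked mechanically. The sketch below reproduces the rough magnitudes, assuming about 2,100 working hours per man-year (the article doesn’t state its constant, so the figures match only approximately).

```python
DOCS = 1_236_300
MIN_PER_DOC = 12
HOURS_PER_MAN_YEAR = 2_100      # assumed; not stated in the article
COST_PER_MAN_YEAR = 100_000     # US$, stated in the article

manual = DOCS * MIN_PER_DOC / 60 / HOURS_PER_MAN_YEAR   # ~118 man-years
automated = 0.05 + 0.76         # runtime plus one-time LPA configuration
print(f"manual ~= {manual:.0f} man-years, "
      f"savings ~= ${(manual - automated) * COST_PER_MAN_YEAR:,.0f}")
```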

Signatures in silos

One of the major criticisms leveled against an approach such as ours is that what we’re describing is mass dataveillance—warehousing massive amounts of data in a megadatabase and using data mining techniques that will lead to multiple false positives and a massive invasion of Americans’ privacy. We disagree. Although we appreciate the significant information policy challenges concerning data analysis in actual transaction spaces, we believe technology and enabling policies can help preserve civil liberties and protect the privacy of those people who aren’t terrorists while keeping us all safer from attack.

Figure 3. Detainee operations support at Guantanamo Bay. The analysts used information technologies to build web-like diagrams of relationships between entities that weren’t immediately apparent.

Data mining commonly refers to using techniques rooted in statistics, rule-based logic, or artificial intelligence to comb through large amounts of data to discover previously unknown but statistically significant patterns. However, the general counterterrorism problem is much harder because, unlike commercial data mining applications, we must find extremely rare instances of patterns across an extremely wide variety of activities and hidden relationships among individuals. Table 2 gives a series of reasons why commercial data mining isn’t the same as terrorism detection in this context. We call our technique for counterterrorism “activity data analysis,” not data mining.7

Instead of warehousing data in one megadatabase, we believe data must be left distributed over the large number of heterogeneous databases residing with their data owners. In an intelligence context, agency silos and stovepipes aren’t necessarily bad—they allow analysts from different agencies to create alternative competing hypotheses, and they also protect agency-specific sources and methods. In our judgment, the goal shouldn’t be to tear down these silos, but to punch holes in them and enable collaboration across agencies when appropriate and advantageous.

Advanced search and discovery tools should be used to search and query relevant databases—under rigorous access control and privacy protections—with the results of the search or query added to the analytical models. Because finding evidence of a suspicious terrorist plot isn’t easy, we believe two basic types of queries are necessary: subject-based queries (sometimes referred to as particularized suspicion) and pattern-based queries (sometimes referred to as nonparticularized suspicion).

Figure 4. A multilingual IT front-end system. This tool automatically ingests, transforms, extracts, and autopopulates in near real time the back-end analytical models from massive amounts of text data.

Table 2. Data mining vs. terrorism detection.

Commercial data mining | Terrorism detection
Discover comprehensive models of databases to develop statistically valid patterns | Detect connected instances of rare patterns
No starting points | Known starting points or matches with patterns estimated by analysts
Apply models over entire data | Reduce search space; results are starting points for human analysis
Independent instances (records) | Linked transactions (networks)
No correlation between instances | Significant autocorrelation
Minimal consolidation needed | Consolidation is key
Dense attributes | Sparse attributes
Sampling okay | Sampling destroys connections
Homogeneous data | Heterogeneous data
Uniform privacy policy | Nonuniform privacy policy

Subject-based queries let analysts start with known suspects and look for links to other suspects, people, places, things, or suspicious activities, and do so within well-defined and practiced sets of legal and regulatory protocols. Law enforcement personnel have used this technique successfully for years as part of their background investigations and as a forensic tool. In the previous section, we gave examples of how subject-based queries can be beneficial for counterterrorism purposes (such as in the Guantanamo Bay detainee example), but this might not be enough. To get ahead of the terrorism problem, we need to consider pattern-based queries that don’t require a subject’s prior identification.

Pattern-based queries let analysts take a predictive model, create specific patterns that correspond to anticipated terrorist plots, and use (largely existing) discovery tools and advanced search methods to find instances of these patterns in the information space. This latter approach becomes essential because it can provide clues about terrorist sleeper cells made up of people who have never engaged in activity that would link them to known terrorists. Nonparticularized suspicion raises the question of civil liberties even higher, though—currently, no well-defined or practiced legal or regulatory protocols govern its operation, so a new privacy policy framework for management and oversight is needed (we’ll briefly discuss this later).
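A pattern-based standing query might look something like the following sketch, which flags a subject whose transaction stream contains every element of a hypothetical pattern within a fixed time window. The item names and the 90-day window are illustrative assumptions, echoing the fertilizer example from the guest editors’ introduction.

```python
from datetime import datetime, timedelta

# Hypothetical standing pattern: the same subject buys fertilizer,
# fuel oil, and 55-gallon drums, then rents a truck, within 90 days.
PATTERN = {"fertilizer", "fuel_oil", "drums_55gal", "truck_rental"}

def matches_pattern(transactions, window_days=90):
    """transactions: time-sorted list of (timestamp, item) for one
    subject. True if some window contains every PATTERN item."""
    for i, (start, _) in enumerate(transactions):
        window_end = start + timedelta(days=window_days)
        seen = {item for ts, item in transactions[i:] if ts <= window_end}
        if PATTERN <= seen:
            return True
    return False

txns = [(datetime(2006, 1, 3), "fertilizer"),
        (datetime(2006, 1, 20), "fuel_oil"),
        (datetime(2006, 2, 2), "drums_55gal"),
        (datetime(2006, 3, 1), "truck_rental")]
print(matches_pattern(txns))   # True -- and also true for many farmers
```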

With respect to false positives, some of our critics have stated that pattern-based queries create more false positives than they help resolve. Dealing with false positives—which are a legitimate concern given that the government might get it wrong and stigmatize or inconvenience nonterrorists—requires pattern-based queries to be issued iteratively in a privacy-sensitive manner (specifically, via anonymization and selective revelation techniques). Handling them also requires multiple stages of human-driven analysis in which analysts can’t act on the results of such queries until a third-party legal authority has established sufficient probable cause. Analysts would refine queries in stages, seeking to gain more confirmation while invoking numerous privacy-protection techniques in the process. This isn’t unlike the tried and proven signal-processing analysis techniques found in antisubmarine warfare, in which human-driven analysis addresses false positives at various stages in a similar manner.8

Safeguarding civil liberties

Americans expect their government to protect them from enemy attack as well as safeguard (or at least not violate) their civil liberties and privacy. We believe these two ideals aren’t mutually exclusive: Figure 5 shows how our goal (and challenge) is to maximize security at an acceptable level of privacy. In other words, we can pick acceptable levels of privacy and, through the development and use of technology, create new privacy-versus-security curves, thus increasing security. A full discussion of what privacy means from a legal and regulatory context is beyond this article’s scope, but for a working definition, we would argue that personal privacy is violated only if the violated party suffers some tangible loss, such as unwarranted arrest or detention. The right balance between the two must be understood, as well as the corresponding social costs, benefits, and roles played by the public, government, and private sectors.

As discussed earlier, analysts must systematically use information technologies to detect and discover instances of known or emerging terrorist signatures, but they must also be able to exploit the permitted information sources they need and do so while protecting the privacy of nonterrorists. Privacy-protection technology is a key part of the solution, not only to protect privacy but also to encourage the intelligence, law enforcement, and counterterrorism communities to share data without fear of compromising sources and methods. However, the American public has legitimate concerns about whether protections for privacy are adequate to address the potential negative consequences of increased government use of permitted information sources. These concerns are heightened because there is little understanding or knowledge about how the government might use this data.

The R&D community has explored several promising privacy-protection technologies, especially those that are most relevant to the pattern-based query approach. We briefly describe some of them here, but more detailed information appears elsewhere.9–11

Figure 5. Security vs. privacy curves. Laws and policies dictate where we are on a curve; new privacy technology can create new curves.

Privacy appliance

Our privacy appliance concept involves the use of a separate tamper-resistant, cryptographically protected device placed on top of databases. The appliance would be a trusted, guarded interface between the user and the database, analogous to a firewall, smart proxy, or Web accelerator. It would implement several privacy functions and accounting policies to enforce access rules established between the database owner and the user. It would also explicitly publish the details of its technology, verify the user’s access permissions and credentials (packaged with the query in terms of specific legal and policy authorities), and filter out queries not permitted or that illegally violate privacy. Finally, it would create an immutable audit log that captures the user’s activity and transmits it to an appropriate trusted third-party oversight authority to ensure that abuses are detected, stopped, and reported. (Granted, our privacy appliance concept assumes the third party is trusted, which is often the hardest problem to solve.) The privacy appliance’s operation must be automated to respond to the dynamic, time-sensitive nature and scale of the problem and to ensure the privacy policy’s implementation. Figure 6 illustrates the privacy appliance concept in terms of some of its key privacy functions as well as how it would work operationally.
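A minimal sketch of the appliance’s gatekeeping loop might look like the following; the types, field-based release rules, and in-memory audit list are toy assumptions of ours, not the authors’ design.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    uid: str
    clearance: int                  # 0 = lowest

@dataclass
class Appliance:
    """Toy privacy appliance: checks clearance, filters queries,
    strips identifying fields per clearance, and logs everything."""
    table: list                     # list of dicts (the guarded database)
    allowed_fields: dict            # clearance level -> releasable fields
    audit: list = field(default_factory=list)

    def query(self, user, predicate):
        self.audit.append(("query", user.uid, user.clearance))
        fields = self.allowed_fields.get(user.clearance)
        if fields is None:
            self.audit.append(("denied", user.uid))
            raise PermissionError("no access at this clearance")
        return [{k: row[k] for k in fields if k in row}
                for row in self.table if predicate(row)]

db = [{"name": "J. Doe", "state": "MN", "flagged": True}]
box = Appliance(table=db,
                allowed_fields={0: ["state"],
                                2: ["name", "state", "flagged"]})
print(box.query(User("analyst7", 0), lambda r: r["flagged"]))
# [{'state': 'MN'}] -- identity withheld at low clearance, access logged
```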

Data transformation

Used within the privacy appliance, data transformation employs well-known mathematical encoding techniques to transform data from a plaintext representation to cipher, thus making computer processing more efficient and the data unintelligible to humans. Once transformed, analysts could apply a plethora of data analysis functions—such as data searching, alias and entity resolution, and pattern-query matching—to understand the data’s significance, keeping the identities of subjects hidden from analysts but still allowing the detection of terrorist activity patterns. Because the data is represented in unintelligible cipher, no personally identifiable data is disclosed to the analyst, and privacy protection is maintained.
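One common way to realize such a transformation is a keyed, deterministic pseudonym (for example, an HMAC): equal plaintexts map to equal ciphers, so matching and linking still work, but the token can’t be inverted without the key. This is an illustrative assumption on our part, not necessarily the encoding the authors had in mind.

```python
import hashlib
import hmac

SECRET_KEY = b"held-by-the-appliance"   # hypothetical; never given to analysts

def pseudonym(value: str) -> str:
    """Deterministic keyed transform: equal plaintexts yield equal
    ciphers, so records can still be matched and linked, while the
    token reveals nothing about the original identity."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Two reports naming the same person link up under the cipher,
# while the analyst sees only an opaque token.
assert pseudonym("John Q. Smith") == pseudonym("John Q. Smith")
print(pseudonym("John Q. Smith"))
```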

Anonymization

Similar to data transformation, anonymization is a technique used within the privacy appliance: it generalizes or obfuscates data, providing the system with a guarantee that any personally identifiable information in the released data can’t be determined, yet the data still remains useful from an analytical viewpoint. As an example, instead of releasing to an analyst a database record consisting of [name(first, last); telephone #(area code, exchange, line number); address(street, town, state, zip code)], an anonymized version of this database record could be [name(first); telephone #(area code); address(state)]. For this approach to work, analysts will have to make connections between queries and thus will require some sort of anonymized unique identifier as well. A much more thorough treatment of various anonymization techniques and applications for privacy appears elsewhere.10,11
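Applied to the article’s example record, the generalization step might look like this sketch (the field names and sample values are our own):

```python
def anonymize(record):
    """Generalize the example record: keep only the first name,
    the telephone area code, and the state."""
    return {
        "name": record["name"]["first"],
        "telephone": record["telephone"]["area_code"],
        "address": record["address"]["state"],
    }

full = {"name": {"first": "Jane", "last": "Doe"},
        "telephone": {"area_code": "617", "exchange": "555",
                      "line_number": "0199"},
        "address": {"street": "1 Main St", "town": "Cambridge",
                    "state": "MA", "zip_code": "02138"}}
print(anonymize(full))
# {'name': 'Jane', 'telephone': '617', 'address': 'MA'}
```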

Figure 6. Privacy appliance concept. A tamper-resistant, cryptographically protected device serves as a trusted privacy-enforcing guard between the user and the database.

Selective revelation

Another technique to employ in the privacy appliance is selective revelation, which gives incremental access to and analysis of increasingly personally identifiable data. In this approach, what an analyst gets back in response to a pattern-based query varies in depth and specificity depending on the analyst, the investigation’s status, and other criteria. The analyst’s knowledge of an individual’s identity would occur only after a sufficient level of suspicion and appropriate legal threshold were met. The approach proceeds incrementally by requiring data owners to release subsets of data—anonymized, filtered, or statistically characterized—to an analyst’s pattern-based query. Initially, no personally identifiable data is provided in response to the query. If the results turn out to be meaningful after iteration and refined patterns or queries—say, only an acceptably few individuals match the query, or the level of suspicion or probable cause has been heightened—then additional permissions and authorization through an appropriate (yet currently nonexistent) legal framework would need to be secured to release personally identifiable data of individuals under suspicion.
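The tiered-release idea can be sketched as follows; the three levels and the court-order gate are hypothetical simplifications standing in for the legal framework the authors describe as not yet existing.

```python
# Hypothetical tiers: each higher authorization level reveals strictly
# more of the record; full identity requires legal sign-off.
TIERS = {
    0: lambda r: {"match": True},                        # existence only
    1: lambda r: {"state": r["state"],
                  "age": r["age"] // 10 * 10},           # coarse view
    2: lambda r: dict(r),                                # full identity
}

def reveal(record, level, court_order=False):
    """Return the view of `record` appropriate to the analyst's
    authorization level."""
    if level >= 2 and not court_order:
        raise PermissionError("identity release requires legal authority")
    return TIERS[min(level, 2)](record)

suspect = {"name": "J. Doe", "state": "MN", "age": 34}
print(reveal(suspect, 0))                    # {'match': True}
print(reveal(suspect, 1))                    # {'state': 'MN', 'age': 30}
print(reveal(suspect, 2, court_order=True))  # full record
```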

Immutable audit

Another technique to be used in the privacy appliance is an immutable audit, which automatically and permanently records all accesses to data, with no possibility of undetected alteration or tampering. To prevent potential abuses by malicious agents, audit logs would be designed so that any misdeeds or corruption are detected with the highest probability. Audit logs would be cryptographically protected and transmitted to a trusted third-party oversight authority. Privacy tools to query and analyze audit logs are also critical. The contents of the audit log could contain fields such as the analyst’s identity and credentials, the authorizations and permissions allowed, the date and time of the data access, the data requested, and the data returned.
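A standard way to make a log tamper-evident is to hash-chain its entries, so that altering any past entry breaks every later hash. The sketch below illustrates that general technique, not a design specified by the authors.

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only log in which each entry commits to its predecessor's
    hash; modifying any past entry invalidates the whole chain."""
    def __init__(self):
        self.entries = []
        self.last_hash = "0" * 64

    def append(self, analyst, query, returned):
        body = {"ts": time.time(), "analyst": analyst,
                "query": query, "returned": returned,
                "prev": self.last_hash}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append((body, digest))
        self.last_hash = digest

    def verify(self):
        """Re-derive every hash; False if anything was altered."""
        prev = "0" * 64
        for body, digest in self.entries:
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if body["prev"] != prev or digest != recomputed:
                return False
            prev = digest
        return True

log = AuditLog()
log.append("analyst7", "pattern:fertilizer+truck", returned=3)
print(log.verify())   # True; flipping any logged field makes this False
```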

Self-reporting data

An important technology that isn’t directly related to the privacy appliance but is important from a civil liberties perspective is self-reporting data, a method for truth maintenance as well as for reporting on the data’s distribution. Data used in analysis should be active (that is, it should report back to a central authority about where it is and what it’s being used for); this point is essential to correct any information that’s later proved to be false.

Privacy laws

Government access to personally identifiable data raises legitimate concerns about the protection of civil liberties, privacy, and due process. Given the limited applicability of current privacy laws to the modern digital era, practical policies for new information technology use, redress, and oversight are vital. New privacy policy can help ensure that controls and protections accompany the use of the information technologies we discussed earlier. Here, we list several basic principles as examples of the types of policy to consider:12

• Neutrality. New information technology should build in existing legal and policy limitations on access to personally identifiable or third-party data.
• Minimize intrusiveness. Disclosure of personally identifiable data is voluntary but might be required as a condition of service (such as driver’s licenses); thus, the data should be anonymized or rendered pseudonymous and disaggregated (when possible).
• Intermediate, not ultimate, consequence. Personal identification by a new information technology shouldn’t directly lead to an ultimate consequence (such as arrest); instead, analysts should view it as cause for additional investigation.
• Audits and oversight. New information technology should have strong built-in technological safeguards, such as audit and oversight mechanisms, to detect and deter abuse.
• Accountability. New information technology should be used in a manner that ensures accountability of the executive branch to the legislative branch for its use.
• Necessity of redress mechanisms. Robust legal mechanisms for the correction of false positives should be in place.
• People and policy. Internal policy controls, training, administrative oversight, enhanced congressional oversight, and civil and criminal penalties for abuse should all be in place.

We hope these considerations will be taken into account along with legal protocols for pattern-based searches; technology can and does play a key role in carefully balancing security with privacy.

Information and privacy-protection technologies are powerful tools for counterterrorism, but it’s a mistake to view technology as the complete solution to the problem. Rather, the solution is a product of the whole system—the people, culture, policy, process, and technology. Technological tools can help analysts do their jobs better, automate some functions that analysts would otherwise have to perform manually, and even do some early sorting of masses of data. But in the complex world of counterterrorism, the technologies alone aren’t likely to be the only source for a conclusion or decision.

Ultimately, the goal should be to understand the level of improvement possible in our counterterrorism operations using advanced tools such as those described here, but also to consider their impact—if any—on privacy. If research shows that a significant improvement in our ability to detect and preempt terrorism is possible while still protecting the privacy of nonterrorists, then it’s up to the government and the public to decide whether to change existing laws and policies. However, research is critical to prove the value (and limits) of this work, so it’s unrealistic to draw conclusions about its outcomes prior to R&D completion. As has been reported,6 research and development continues on information technologies to improve national security; encouragingly, the Office of the Director of National Intelligence (ODNI) is embarking on an R&D program to address many of the concerns raised about potential privacy infringements.

Acknowledgments

The views expressed herein are the authors’ alone and don’t reflect the views of any private-sector or governmental entity.

References
1. R. Popp and J. Yen, eds., Emergent Information Technologies and Enabling Policies for Counter-Terrorism, Wiley & Sons/IEEE Press, 2006.
2. Report to Congress Regarding the Terrorism Information Awareness Program, DARPA, May 2003; [response to Consolidated Appropriations Resolution, Pub. L. no. 108-7, div. M, sec. 111(b), 2003].
3. J. Poindexter, “Overview of the Information Awareness Office,” DARPATech 2002, DARPA, 2002; www.fas.org/irp/agency/dod/poindexter.html.
4. R. Popp et al., “Countering Terrorism through Information Technology,” Comm. ACM, vol. 47, no. 3, 2004, pp. 36–43.
5. E. Jonietz, “Total Information Overload,” MIT Tech. Rev., vol. 106, no. 6, 2003, p. 68.
6. S. Harris, “Signals and Noise,” Nat’l J., vol. 38, no. 24, 2006, pp. 50–58.
7. M. DeRosa, Data Mining and Data Analysis for Counterterrorism, CSIS Press, 2004.
8. T. Senator, “Multi-Stage Classification,” Proc. 5th IEEE Int’l Conf. Data Mining (ICDM 05), IEEE CS Press, 2005, pp. 386–393.
9. “Security with Privacy,” DARPA’s Information Systems Advanced Technology (ISAT) study, Dec. 2002; www.cs.berkeley.edu/~tygar/papers/ISAT-final-briefing.pdf.
10. L. Sweeney, “Weaving Technology and Policy Together to Maintain Confidentiality,” J. Law, Medicine and Ethics, vol. 25, nos. 2–3, 1997, pp. 98–110.
11. L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l J. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, 2002, pp. 557–570.
12. P. Rosenzweig, “Privacy and Consequences: Legal and Policy Structures for Implementing New Counter-Terrorism Technologies and Protecting Civil Liberty,” Emergent Information Technologies and Enabling Policies for Counter-Terrorism, R. Popp and J. Yen, eds., Wiley & Sons/IEEE Press, 2006, pp. 421–438.

Robert Popp, now CEO of National Security Innovations, recently served as a senior government executive within the US Department of Defense as deputy director of the Information Awareness Office (IAO) at DARPA and assistant deputy undersecretary of defense for advanced systems and concepts in the Office of the Secretary of Defense. He serves on the Defense Science Board, is a senior associate for the Center for Strategic and International Studies, and is an associate editor of IEEE Transactions on Systems, Man, and Cybernetics. Popp holds two patents, authored numerous journal and conference papers, and edited Emergent Information Technologies and Enabling Policies for Counter-Terrorism (Wiley & Sons/IEEE Press, 2006). Popp has a PhD in electrical engineering from the University of Connecticut, and a BA/MA in computer science from Boston University. Contact him at [email protected].

John Poindexter, now a private consultant, most recently served as director of the Information Awareness Office (IAO) at DARPA. He also serves on the board of directors for Saffron Technology, a computer software company that produces associative memory applications. Prior to working at DARPA, Poindexter served as National Security Advisor and Deputy National Security Advisor under President Ronald Reagan from 1983 to 1986. He has a PhD and an MS in physics, both from the California Institute of Technology. Contact him at [email protected].



Threat and Fraud Intelligence, Las Vegas Style

JEFF JONAS
IBM

Matching and relating identities is of the utmost importance for Las Vegas casinos. The author describes a specific matching technique known as identity resolution. This approach provides superior results over traditional identity matching systems.

Las Vegas, Nevada, is possibly the most interesting real-world setting for a high-stakes game of data surveillance. Most of the 38 million people who visit the city annually are attracted by the gambling, entertainment, shopping, architecture, dining, and shows.1 However, among them are a few thousand “opportunists” who converge on Las Vegas solely to exploit its vulnerabilities. Some have become so infamous that gaming regulators have banned them from ever again stepping foot in a Nevada casino. In fact, if a casino gets caught doing business with such a person, it can be heavily fined or, worse, lose its gaming license.

If you’re a casino operator, knowing with whom you’re doing business isn’t just good business in terms of protecting corporate assets—it’s a matter of legal responsibility. Finding a few bad actors, while minimizing the disruption, inconvenience, and privacy invasions to tens of millions of innocent tourists, has by necessity grown from an art mastered by a few practitioners into a teachable discipline. Elements of that discipline include regulatory policy, industry best practice procedures, staff development, and information technology.

This article presents the general problem domain of matching and relating identities, examines traditional approaches to the problem, and introduces identity resolution and relationship awareness. This combination offers improved accuracy, scalability, and sustainability over traditional methods.

Enterprise surveillance
Before the age of electronic surveillance, casino security personnel peered out from behind one-way mirrors with binoculars while standing high above the casino floor on specially constructed catwalks.

Casinos began replacing these catwalks with electronic surveillance cameras in the early 1980s, and camera usage and placement were soon codified by state law.2 Prior to the 1990s, casinos also hired people to compare casino and hotel guest lists against watch lists containing subjects of interest—that is, both crooks and highly desired customers the casinos wanted to pamper. But as visitor volume grew, everything else did as well. Las Vegas hotels soon had 3,000 or more rooms—at times, more than 100,000 people a day make their way through a mega-resort; currently, 18 of the 20 largest hotels in the US are in Las Vegas (www.airhighways.com/las_vegas.htm).

Today, casino operators rely on automated systems to focus the casino’s finite surveillance and investigatory resources. Just as casinos use perimeter surveillance systems to ensure that no one makes his or her way into the Mirage hotel’s volcano fire spectacular, information-based systems monitor for subjects of interest who might be engaged in inappropriate transactions on the casino’s premises. These technologies also watch for fraud and “insider threats” (when employees secretly work against the enterprise’s interests)—an especially dangerous scenario in a business where a single corrupt dealer can cost the casino US$250,000 in 15 minutes (if a dealer lets a player use a “pre-ordered” deck in a table game, for example).

For the gaming industry, subjects of interest come from several sources:

• Gaming regulatory compliance. Nevada gaming regulators publish an exclusionary list of the individuals banned via statute from transacting with casinos under penalty of license revocation or fines.3

• Federal regulatory compliance. The US Department of the Treasury’s Office of Foreign Assets Control (OFAC) publishes a list of specially designated nationals that describes the countries, individuals, and organizations banned by various federal statutes from transacting with US businesses (www.ustreas.gov/offices/enforcement/ofac/).

• Legally barred. Casinos can “formally trespass” a party, after which the person will be arrested if he or she returns.

• Convicted cheaters. Many casinos consider those people previously arrested for felonious acts against the gaming industry to be a potential risk that warrants additional levels of scrutiny. Sometimes a patron is identified as a former gaming felon and permitted to play, as any decision to act involves human oversight and consideration of all available facts (for example, the crime occurred many years ago and involved slots, but this individual is playing blackjack today).

• Suspected cheaters and card counters. Although card counting isn’t illegal, casinos have the right to prevent anyone from engaging in gaming activity, which is likely if the player is determined to have a technique that materially changes the natural odds of the game. Many casinos subscribe to one or more subscription services that facilitate information sharing (within the gaming industry) of gaming arrests, individuals suspected of illegally manipulating games, and card counting.

• Self-declared problem gamblers. Casino patrons can place themselves on a voluntary self-exclusion list as a “problem gambler,” after which the casino inherits a degree of responsibility to neither market to nor allow the person to engage in casino activity. Problem gamblers have sued casinos that inadvertently allowed them to accrue additional losses (www.americangaming.org/publications/rglu_detail.cfv?id=229 and www.casinocitytimes.com/news/article.cfm?contentId=160709), although such suits are rarely successful.

Like any large organization, casinos have many disparate information systems, each with its own data sets. These include systems concerned with hotel reservations, hotel property management, customer loyalty, credit, point of sale, human resource job applicants, human resource employment, and vendors, to name a few. Arguably, system data that has a nexus with a subject of interest would constitute a degree of risk to the enterprise and might warrant additional scrutiny. Whether a known cheater has just rented a room, joined the loyalty club, or applied for a job, management appreciates being notified.

In many cases, however, relationships are nonobvious. Traditional practices, for example, help casinos catch and detain roulette cheaters (in this case, the surveillance-room operator simply observes an illegal activity while spot-checking a game). Although the dealer can claim embarrassment for missing a blatant scam, gaming regulators take the fact that the dealer lives in the same apartment unit as the cheater as evidence of collusion, so both are arrested. Or consider a promotions manager who “randomly” selects a ticket for a prize drawing and congratulates the winner of a new car; the recipient has a different last name, but she is, in fact, the manager’s sister. This is evidenced again by common information between the manager’s employment data and the winner’s self-provided information. Traditional analysis might not discover these connections.

Detecting fraud becomes harder as casinos and their information systems grow in complexity. Let’s look at three hard-to-detect scenarios from the gaming industry that illustrate this.

Las Vegas problem scenario #1
An individual barred by gaming regulators from transacting with casinos has just enrolled in your slot club, using a slightly different name and a date of birth in which the month and day are transposed. He’s now playing in your casino, which places your gaming license at risk. How would you know?

Las Vegas problem scenario #2
An employee who works in surveillance has just put in for an address change in your payroll system. This same address is consistent with that of an individual arrested early last year for a $375,000 baccarat scam and now serving time in jail. You had always suspected help from the inside but had no evidence. This latest piece of data in the payroll system would be an important lead, but how would you ever discover this important new fact?

Las Vegas problem scenario #3
Your marketing team buys a list for a new direct mail campaign. It has been scrubbed of problem gamblers who have voluntarily placed themselves on the self-exclusionary list. You send the remaining people on the list a promotional offer for the upcoming New Year’s Eve event. In the months between when the promotional offer mails and New Year’s Eve, two recipients place themselves on the self-exclusionary list. Can you detect them before they arrive? Can you detect them when they arrive?


Problems with scenarios
These scenarios are very difficult to discover, especially manually, even though the enterprise contains all the necessary evidence. That’s because this evidence is trapped across isolated operational systems, and although these three problem scenarios clearly involve identity matching, traditional matching algorithms address more mundane missions such as cleaning up customer mailing lists, detecting duplicate enrollment in loyalty-club programs, or detecting the arrival of a high roller who didn’t contact his host. Traditional algorithms aren’t well suited to these scenarios—especially problem scenarios two and three.

Matching is further hampered by the poor quality of the underlying data. Lists containing subjects of interest commonly have typographical errors. Data from operational business systems is plagued by both intentional errors (those who intentionally misspell their names to frustrate data matching efforts) and legitimate natural variability (Bob versus Robert, and 123 Main Street versus 123 S. Maine Street).

International data complicates matters further still. In 2005, 12 percent of tourists who visited Las Vegas came from abroad.1 However, data entry operators (and programmers) might not know how to handle international names—for example, the name might be entered as “Haj Imhemed Otmane Abderaqib” in West Africa or “Hajj Mohamed Uthman Abd Al Ragib” in Iraq: both English spellings signify the same individual.

Dates are often a problem as well. Months and days are sometimes transposed, especially in international settings. Numbers often have transposition errors or might have been entered with a different number of leading zeros.
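To make this concrete, here’s a minimal sketch (illustrative only, not any production system’s logic) of how a matcher might treat two dates as possibly equal under month/day transposition, and two identifiers as possibly equal when they differ only in leading zeros:

from datetime import date

def dates_possibly_equal(a: date, b: date) -> bool:
    # Exact match, or a month/day transposition (only possible when
    # the day is 12 or less, so the swapped form is a valid date).
    if a == b:
        return True
    try:
        return a == date(b.year, b.day, b.month)
    except ValueError:
        return False

def ids_possibly_equal(a: str, b: str) -> bool:
    # Driver's license 0001133107 versus 1133107: leading zeros only.
    return a.lstrip("0") == b.lstrip("0")

assert dates_possibly_equal(date(1974, 3, 6), date(1974, 6, 3))
assert ids_possibly_equal("0001133107", "1133107")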

Naïve identity matching
Organizations typically employ three general types of identity matching systems:

• Merge/purge and match/merge. Direct marketing organizations developed these systems to eliminate duplicate customer records in mailing lists. These systems generally operate on data in batches; when organizations need a new de-duplicated list, they run the process again from scratch.

• “Binary” matching engines. This system tests an identity in one data set for its presence in a second data set. These matching engines are also sometimes used to compare one identity with another single identity (versus a list of possibilities), with the output often expected to be a confidence value pertaining to the likelihood that the two identity records are the same. These systems were designed to help organizations recognize individuals with whom they had previously done business (the recognition becomes apparent during certain transactions, like checking into the hotel) or, alternatively, recognize that the identity under evaluation is known as a subject of interest—that is, on a watch list—thus warranting special handling. This type of identity matching system can be batch-handled or conducted in real time, although real time is typically preferred.

• Centralized identity catalogues. These systems collect identity data from disparate and heterogeneous data sources and assemble it into unique identities, while retaining pointers to the original data source and record with the purpose of creating an index. Such systems help users locate enterprise content much in the same way a library’s card catalog helps people locate books.

Each of the three types of identity matching systems uses either probabilistic or deterministic matching algorithms. Probabilistic techniques rely on training data sets to compute attribute distribution and frequency. Mark is a common first name, for example, but Rody is rare. These statistics are stored and used later to determine confidence levels in record matching. As a result, any two records containing simply the name Rody and a residence in Maine might be considered the same person with a high degree of probability. These systems lose accuracy when the underlying data’s statistics deviate from the original training set. To remedy this situation, such systems must be retrained from time to time and then all the data reprocessed.

Deterministic techniques rely on pre-coded expert rules to define when records should be matched. One rule might be that if the names are close (Robert versus Rob) and the social security numbers are the same, the system should consider the records as matching identities. These systems fail—sometimes spectacularly—when the rules are no longer appropriate for the data being collected.
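The distinction is easy to sketch in code. The following toy comparison is mine, with invented name frequencies and an invented rule; real systems use far richer statistics and rule sets:

import math

# Probabilistic: rarer attribute values carry more evidence. The
# frequencies would normally come from a training data set.
NAME_FREQUENCY = {"mark": 0.012, "rody": 0.00001}  # invented numbers

def probabilistic_score(a: dict, b: dict) -> float:
    score = 0.0
    if a["first_name"].lower() == b["first_name"].lower():
        freq = NAME_FREQUENCY.get(a["first_name"].lower(), 0.001)
        score += -math.log10(freq)  # a rare name like Rody scores far higher
    if a.get("state") and a.get("state") == b.get("state"):
        score += 1.0
    return score

# Deterministic: a pre-coded expert rule (names close + same SSN).
def deterministic_match(a: dict, b: dict) -> bool:
    names_close = a["first_name"][:3].lower() == b["first_name"][:3].lower()
    return names_close and bool(a.get("ssn")) and a.get("ssn") == b.get("ssn")

print(probabilistic_score({"first_name": "Rody", "state": "ME"},
                          {"first_name": "Rody", "state": "ME"}))  # 6.0
print(deterministic_match({"first_name": "Robert", "ssn": "537-27-6402"},
                          {"first_name": "Rob", "ssn": "537-27-6402"}))  # True

The sketch also shows each approach’s weakness: the frequency table goes stale as the data drifts, and the hard-coded rule breaks when the data stops fitting it.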

Nonobvious relationship awareness
Nonobvious relationship awareness (NORA) is a system that Systems Research and Development of Nevada (which I founded) developed specifically to solve Las Vegas casinos’ identity matching problems. It ran on a single server, accepted data feeds from numerous enterprise information systems, and built a model of identities and relationships between identities (such as shared addresses or phone numbers) in real time. If a new identity matched or related to another identity in a manner that warranted human scrutiny (based on basic rules, such as good guy connected to very bad guy), the system would immediately generate an intelligence alert.

Requirements for the system became ambitious:

• Sequence neutrality. It needed to react to new data as that data loaded. Matches and nonmatches had to be automatically re-evaluated to see if the matches were still probable as the new data loaded. This capability was designed to eliminate the necessity of database reloads. (See http://jeffjonas.typepad.com/jeff_jonas/2006/01/sequence_neutra.html for more on sequence neutrality.)

• Relationship aware. Relationship awareness was designed into the identity resolution process so that newly discovered relationships could generate real-time intelligence. Discovered relationships also persisted in the database, which is essential for generating alerts beyond one degree of separation.

• Perpetual analytics. When the system discovered something of relevance during the identity matching process, it had to publish an alert in real time to secondary systems or users before the opportunity to act was lost.

• Context accumulation. Identity resolution algorithms evaluate incoming records against fully constructed identities, which are made up of the accumulated attributes of all prior records. This technique enabled new records to match to known identities in toto, rather than relying on binary matching that could only match records in pairs. Context accumulation improved accuracy and greatly improved the handling of low-fidelity data that might otherwise have been left as a large collection of unmatched orphan records.

• Extensible. The system needed to accept new data sources and new attributes through the modification of configuration files, without requiring that the system be taken offline.

• Knowledge-based name evaluations. The system needed detailed name evaluation algorithms for high-accuracy name matching. Ideally, the algorithms would be based on actual names taken from all over the world and developed into statistical models to determine how and how often each name occurred in its variant form. This empirical approach required that the system be able to automatically determine the culture that the name most likely came from because names vary in predictable ways depending on their cultural origin.

• Real time. The system had to handle additions, changes, and deletions from real-time operational business systems. Processing times had to be fast enough that matching results and accompanying intelligence (such as whether the person is on a watch list or the address is missing an apartment number based on prior observations) could be returned to the operational systems in sub-seconds.

• Scalable. The system had to be able to process records on a standard transaction server, adding information to a repository that holds tens of millions of identities.

Following IBM’s acquisition of the company, NORA’s underlying code base was improved and the technology renamed IBM Identity Resolution and Relationship Resolution. Today, the system is implemented in C++ on top of an industry-standard SQL database with a Web services interface and optional plug-ins (knowledge-based name libraries, postal base files, and so on). Data integration services transform operational data into prescribed and uniform XML documents. Configuration files specify the deterministic matching rules and probabilistic-emulating thresholds, relationship scoring, and conditions under which to issue intelligence alerts.
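As a rough illustration of what such configuration expresses (the field names and values here are invented for this sketch; this is not the product’s actual file format):

# Hypothetical configuration sketch -- names and values invented.
RESOLUTION_CONFIG = {
    "match_rules": [
        # deterministic rule: similar name plus a shared, non-generic identifier
        {"require": ["name_similar", "unique_id_equal"],
         "unless": ["unique_id_is_generic"]},
    ],
    "generic_threshold": 50,   # identities sharing a value before it's "generic"
    "relationship_scores": {"shared_address": 3, "shared_phone": 2},
    "alerts": [
        # publish when conflicting roles fall within one degree of separation
        {"role_a": "employee", "role_b": "subject_of_interest",
         "max_degrees": 1, "publish_to": "surveillance_room"},
    ],
}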

Although the gaming industry has relatively low daily transactional volumes, the identity resolution engine is capable of performing in real time against extraordinary data volumes. The gaming industry’s requirement of less than 1 million affected records a day means that a typical installation might involve a single Intel-based server and any one of several leading SQL database engines. But performance testing has demonstrated that the system can handle multibillion-row databases consisting of hundreds of millions of constructed identities and ingest new identities at a rate of more than 2,000 identity resolutions per second; such ultra-large deployments require 64 or more CPUs and multiple terabytes of storage, and move the performance bottleneck from the analytic engine to the database engine itself.

Identity resolution
Identity resolution is designed to assemble i identity records from j data sources into k constructed, persistent identities. The term “persistent” indicates that matching outcomes are physically stored in a database at the moment a match is computed. Think of each constructed identity as a bag filled with all the observed attributes from the specific source identity records from which that identity was originally derived. Thus, newly presented identity records are never evaluated against another single identity record (binary matching), but rather are evaluated against fully preconstructed identities (each one built from i previously resolved identity records).
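In code, the bag-of-attributes idea might look like the following toy resolver (my sketch of the concept, not the product’s algorithm; the match test is deliberately simplistic):

from dataclasses import dataclass, field

@dataclass
class Identity:
    # A constructed, persistent identity: the accumulated attributes
    # of every source record resolved into it (the "bag").
    entity_id: int
    records: list = field(default_factory=list)     # pointers to source records
    attributes: dict = field(default_factory=dict)  # attribute -> set of values

    def absorb(self, record: dict) -> None:
        self.records.append(record)
        for attr, value in record.items():
            self.attributes.setdefault(attr, set()).add(value)

class Resolver:
    def __init__(self):
        self.identities = []
        self.next_id = 1

    def resolve(self, record: dict) -> Identity:
        # Evaluate the incoming record against fully constructed
        # identities (context accumulation), never record against record.
        for identity in self.identities:
            if self._matches(record, identity):
                identity.absorb(record)
                return identity
        new = Identity(self.next_id)
        self.next_id += 1
        new.absorb(record)
        self.identities.append(new)
        return new

    @staticmethod
    def _matches(record: dict, identity: Identity) -> bool:
        # Toy rule: any shared phone or driver's license value.
        return any(record.get(attr) in identity.attributes.get(attr, set())
                   for attr in ("phone", "drivers_license"))

A real engine layers fuzzy name evaluation, generic detection, and the re-evaluation needed for sequence neutrality on top of this skeleton.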

Accurately evaluating the similarity of proper names is undoubtedly one of the most complex (and most important) elements of any identity matching system. Dictionary-based approaches—for example, mapping Bob and Robert to the same database key—fail to handle the complex ways in which names from different cultures can appear.


The same is true for systems like Soundex, which is a phonetic algorithm for indexing names by their sound when pronounced in English. The basic aim is for names with the same pronunciation to be encoded to the same string so that matching can occur despite minor differences in spelling. Such systems’ attempts to neutralize slight variations in name spelling by assigning some form of reduced “key” to a name (by eliminating vowels or eliminating double consonants) frequently fail because of external factors—for example, different fuzzy matching rules are needed for names from different cultures.
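For reference, here’s a compact Soundex sketch (the classic American variant, simplified; the full standard’s H/W separator rule is omitted). The last line shows the kind of cross-cultural miss just described:

def soundex(name: str) -> str:
    # Names that sound alike in English should reduce to the same
    # four-character key, despite minor spelling differences.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    key, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))    # R163 R163: a success
print(soundex("Haj"), soundex("Hajj"))         # H200 H200: a success
print(soundex("Imhemed"), soundex("Mohamed"))  # I553 M530: a miss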

We found that the deterministic method is essential for eliminating dependence on training data sets. As such, the system no longer needed periodic reloads to account for statistical changes to the underlying universe of data. But we found many common conditions in which deterministic techniques failed—specifically, certain attributes were so overused that it made more sense to ignore them than to use them for identity matching and detecting relationships. For example, two people with the first name of “Rick” who share the same social security number are probably the same person—unless the number is 111-11-1111. Two people who have the same phone number probably live at the same address—unless that phone number is a travel agency’s phone number. We refer to such values as generic because the overuse diminishes the usefulness of the value itself. It’s impossible to know all of these generic values a priori—for one reason, they keep changing—thus probabilistic-like techniques are used to automatically detect and remember them.

Thus, our deployed identity resolution system uses a hybrid matching approach that combines deterministic expert rules with a probabilistic-like component to detect generics in real time (to avoid the drawback of training data sets). The result is expert rules that look something like this:

If the name is similar
AND there is a matching unique identifier
THEN match
UNLESS this unique identifier is generic

A unique identifier might include social security or credit-card numbers, or a passport number, but wouldn’t include such values as phone number or date of birth. The term “generic” here means the value has become so widely used (across a predefined number of discrete identities) that we can no longer use this same value to disambiguate one identity from another.
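In code, that hybrid rule might be sketched as follows (my paraphrase of the rule above; the threshold for declaring a value generic is invented, and in practice generics are re-detected continuously as they change):

from collections import defaultdict

GENERIC_THRESHOLD = 25        # invented: distinct identities sharing a value
seen_with = defaultdict(set)  # identifier value -> entity IDs observed with it

def record_observation(value: str, entity_id: int) -> None:
    seen_with[value].add(entity_id)

def is_generic(value: str) -> bool:
    # 111-11-1111 lands here quickly: overuse destroys the value's
    # power to disambiguate one identity from another.
    return len(seen_with[value]) >= GENERIC_THRESHOLD

def hybrid_match(name_similar: bool, shared_unique_ids: set) -> bool:
    # IF the name is similar AND there is a matching unique identifier
    # THEN match, UNLESS that unique identifier is generic.
    return name_similar and any(not is_generic(v) for v in shared_unique_ids)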

The actual deterministic matching rules are much more elaborate in practice because they must explicitly address fuzzy matching (such as transposition errors in numbers, malformed addresses, and month and day reversal in dates of birth), contain rules to deal with conflicting attributes (such as when the name, address, and phone number are the same but the junior versus senior designations differ), and so on.

As an example of sequence-neutral behavior, let’s consider an identity resolution system that encounters a record in 2002 for Marc R. Smith at 123 Main Street, phone number 713-730-5769, and driver’s license number 0001133107. The following year, it encounters a second record for Randal Smith, born 6/17/1974, phone number 713-731-5577. With no other information, the system concludes that these are two different people (see Figure 1).

[Figure 1. Records in an identity resolution system. These two records are initially determined not to be the same person.]

Then, in 2004, the system encounters a new record for Marc Randy Smith, phone number 713-731-5577 and driver’s license number 1133107. Because this 2004 record matches both the 2002 and the 2003 records, the system would collapse all three records into a single identity (see Figure 2).

[Figure 2. Records in an identity resolution system. The third record, arriving in 2004, provides the evidence to conjoin the original two records—resulting in one entity, comprised of three records.]

In 2005, the system encounters a fourth record: Randy Smith Sr., born 6/17/1934, with the phone number 713-731-5577. The system now splits the records into two distinct identities: the 2003 and 2005 records are for Randy Smith Sr., the father, whereas the 2002 and 2004 records are for Marc Randy Smith, the son (see Figure 3).

[Figure 3. Records in an identity resolution system. (a) The 2005 record appears to match the existing identity. (b) However, the new information learned in the 2005 record contains evidence that there are really two discrete identities (a junior and a senior) and, as such, the records are decoupled to immediately correct for this new information.]
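The 2002–2004 merge is easy to reproduce in miniature (a toy transitive merge via union-find, standing in for real identity resolution; the 2005 split requires the fuller re-evaluation machinery, which this sketch omits):

def resolve(records):
    # Records sharing a phone number, or a driver's license once
    # leading zeros are stripped, join the same identity.
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    first_seen = {}
    for i, rec in enumerate(records):
        for attr in ("phone", "dl"):
            if attr not in rec:
                continue
            key = (attr, rec[attr].lstrip("0") if attr == "dl" else rec[attr])
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i
    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec["name"])
    return list(groups.values())

records = [
    {"name": "Marc R Smith",     "phone": "713-730-5769", "dl": "0001133107"},  # 2002
    {"name": "Randal Smith",     "phone": "713-731-5577"},                      # 2003
    {"name": "Marc Randy Smith", "phone": "713-731-5577", "dl": "1133107"},     # 2004
]
print(resolve(records))  # all three names resolve to one identity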

Users of identity resolution in threat and fraud intelligence missions invariably choose to limit the conditions in which the system generates intelligence “alerts” because they’re dealing with a finite number of investigative resources. Through personal communication, I learned that one such Las Vegas casino reported its alert criteria yields two “leads” a day on weekdays and five leads a day on weekends. Notably, a human analyst reviews all intelligence leads before action is taken. (Actions come in many forms; in gaming, for example, an action might involve mild additional scrutiny, more extensive surveillance, a physical conversation with the subject, a request for identification, removal from the premises, or, in certain settings, handing a case over to authorities for an arrest.) Although misidentification is exceedingly rare, further analysis might conclude that the intelligence doesn’t warrant any action. For example, a blackjack customer who has a 10-year-old felony slot conviction could be allowed to continue gambling at the casino’s discretion.

An unexpected outcome from deployments of identity resolution in the gaming industry was the discovery of insider threats that previously hadn’t been considered. The original intention was to discover subjects of interest attempting to transact with the casino, but because the design placed all identities in the same dataspace, the good guys (customers), bad guys (subjects of interest), and most trusted (employees and vendors) commingled as they were matched against each other. This resulted in exciting new kinds of alerts—such as an employee who was also a vendor.

Although matching accuracy is highly dependent on the available data, using the techniques described here achieves the goals of identity resolution, which essentially boil down to accuracy, scalability, and sustainability even in extremely large transactional environments (billions of records representing hundreds of millions of unique identities).

Relationship awareness
Detecting relationships is vastly simplified when a mechanism for doing so is physically embedded into the identity matching algorithm. Stating the obvious, before we can usefully analyze meaningful relationships, we should first resolve unique identities. As such, identity resolution must occur first. We discovered that it was computationally efficient to observe relationships at the moment the identity record is resolved because in-memory residual artifacts (which are required to match an identity) comprise a significant portion of what’s needed to determine relevant relationships. Relevant relationships, much like matched identities, were then persisted in the same database.

A database supporting related identities essentially contains pairs of persistent keys—for example, entity ID #22 might be paired with entity ID #309 to denote that these two identities are in some manner related. Notably, some relationships are stronger than others; a relationship score that’s assigned with each relationship pair captures this strength. Living at the same address three times over 10 years should yield a higher score than living at the same address once for three months.
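A sketch of such a store (the schema and weights are invented for illustration):

from collections import Counter

relationship_score = Counter()  # (entity_a, entity_b) -> accumulated strength

def observe_shared_attribute(a: int, b: int, weight: int = 1) -> None:
    # Persist (and strengthen) the pair at the moment identity
    # resolution observes a shared attribute.
    pair = (min(a, b), max(a, b))  # one row per pair, order-independent
    relationship_score[pair] += weight

# Sharing an address three times over ten years outweighs sharing
# an address once:
for _ in range(3):
    observe_shared_attribute(22, 309, weight=3)
observe_shared_attribute(22, 487, weight=3)
assert relationship_score[(22, 309)] > relationship_score[(22, 487)]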

As identities are matched and relationships detected, the system evaluates user-configurable rules to determine if any new insight warrants publishing an intelligence alert to a specific system or user. One simplistic way to do this is via conflicting roles. A typical rule might be notification any time a role “employee” is associated with a role “bad guy,” for example. In this case, associated might mean zero degrees of separation (they’re the same person) or one degree of separation (they’re roommates). Relationships are maintained in the database to one degree of separation; higher degrees are determined by walking the tree. Although the technology supports searching for any degree of separation between identities, higher orders include many insignificant leads and are thus less useful.
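A sketch of the conflicting-roles check over stored one-degree pairs (the roles, IDs, and rule are invented for illustration; higher degrees would walk the tree outward from here):

from collections import defaultdict

edges = defaultdict(set)   # persistent one-degree relationship pairs
roles = {}                 # entity ID -> role

def relate(a: int, b: int) -> None:
    edges[a].add(b)
    edges[b].add(a)

def conflicting_role_alert(entity: int, bad_role: str = "bad guy"):
    # Zero degrees of separation: the entity holds the bad role itself.
    if roles.get(entity) == bad_role:
        return f"alert: {entity} is a {bad_role}"
    # One degree: the entity is directly related to someone who does.
    for neighbor in edges[entity]:
        if roles.get(neighbor) == bad_role:
            return f"alert: {entity} is related to {neighbor} ({bad_role})"
    return None

roles[7], roles[42] = "employee", "bad guy"
relate(7, 42)  # e.g., the dealer shares an address with a known cheater
print(conflicting_role_alert(7))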

Today, when an employee updates his or her employment record, if this data reveals a relationship to a current or former criminal investigation, such intelligence is detected in real time and published to the appropriate party. Notably, this doesn’t mean that any criminal activity has occurred or is imminent; rather, it simply helps focus finite investigatory resources.

Applications outside of gaming
Identity resolution has helped organizations deal with real-time identity awareness in many sectors, including retail, financial services, national security, and disaster response.

In the retail sector, organized retail theft (ORT) is a multibillion-dollar-a-year fraud problem. Ring leaders hire groups of people to steal select products such as razor blades, batteries, infant formula, and so on. This type of fraud is exceedingly hard to detect because the shoplifting incidents appear as one-sie, two-sie events, with no real way to see trends of organized activity. Several years ago, using the identity resolution technique with relationship awareness, individual shoplifting incidents from four retailers were processed. The system determined the number of unique parties involved in the shoplifting, relying on common addresses to create relationships. The result was a report of shared criminal facilities—the first such view of its kind—identifying hundreds of addresses where more than one person had been arrested for stealing at more than one location. One such find was a “Fagin operation” (named after a character in Charles Dickens’s Oliver Twist), in which an adult employed children to steal a particularly popular brand of jeans.

In 1998, the US government recognized how it could use such technology to help discover nonobvious relationships to identify potential criminal activity from within. Following 9/11, this same technology found its way into several national security missions.

The Hurricane Katrina disaster demonstrated yet another possible use of this technology. After the storm passed, more than 50 Web sites emerged to host the identities of the missing and the found. People identified as missing on one Web site were identified as found on another. Some people registered the same person countless times on a single Web site (sometimes with different name variations and sometimes just in desperation). The question, then, was how many unique people were actually reported missing, how many unique people were reported found, and how many of these people could be reunited if the identities matched? Working in partnership with several agencies within the Louisiana state government, in an effort led by the state’s Office of Information Technology, this technology helped reunite many loved ones.

Achieving real-time enterprise awareness is vital to protecting corporate assets, not to mention the integrity of the brand itself. The ability to handle real-time transactional data with sustained accuracy will continue to be of “front and center” importance as organizations seek competitive advantage.

However, as those with criminal intent become more sophisticated, so must organizations raise the bar in their ability to detect and preempt potentially damaging transactional activity. The identity resolution technology described here provides a growing number of industries with enterprise awareness accurate to the split second, to address the most pressing societal problems such as fraud detection and counterterrorism. IBM sells this technology as an off-the-shelf product. In more recent developments, such a technique can now be performed using only anonymized data, which greatly reduces the risk of information leakage (unintended disclosure) and, when implemented correctly, can enhance privacy protections. I hope to publish a similar technical piece on this new technology in the coming year.

References
1. “Top 25 Frequently Asked Questions,” Las Vegas Convention and Visitors Authority Research Dept., Mar. 2006; www.lvcva.com/getfile/2005Top25Questions.pdf?fileID=106.
2. “Surveillance Standards for Nonrestricted Licensees,” Nevada Gaming Commission and State Gaming Control Board, Nov. 2005; http://gaming.nv.gov/stats_regs/reg5_survel_stnds.pdf.
3. “Regulation 28, List of Excluded Persons,” Nevada Gaming Commission and State Gaming Control Board, Feb. 2001; http://gaming.nv.gov/stats_regs/reg28.pdf.

Jeff Jonas is chief scientist of the IBM Entity Analytic Solutions group and an IBM Distinguished Engineer. He is a member of the Markle Foundation Task Force on National Security in the Information Age and actively contributes his insights on privacy, technology, and homeland security to leading national think tanks, privacy advocacy groups, and policy research organizations, including the Center for Democracy and Technology, Heritage Foundation, and the Office of the Secretary of Defense Highlands Forum. Recently, Jonas was named a senior advisor to the Center for Strategic and International Studies. Contact him via www.jeffjonas.typepad.com.
