Intelligence and Security Informatics for International Security:
Framework and Case Studies
Hsinchun Chen, Ph.D.
McClelland Professor of MIS
Director, Artificial Intelligence Lab
NSF COPLINK Center
Management Information Systems Department
Eller College of Management, University of Arizona
Intelligence and Security Informatics (ISI)
• “development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)
Building a New Discipline
Conferences and Workshops:
• NSF/DOJ/CIA, ISI 2003, Tucson, AZ
• NSF/CIA/DHS, ISI 2004, Tucson, AZ
• IEEE NSF/CIA/DHS, IEEE ISI 2005, Atlanta, Georgia
• PAKDD ISI Workshop 2006, Singapore
• IEEE NSF/CIA/DHS, IEEE ISI 2006, San Diego, CA
• IEEE ISI 2007 (NJ); IEEE ISI 2008 (Taiwan)
Building a New Discipline
Professional Societies:
• IEEE Intelligent Transportation Systems Society (ITSS): hosting IEEE ISI
• President: Wang; VP: Zeng; BOG: Chen
• IEEE ITSS Technical Committee on Homeland Security (TCHS)
• IEEE Systems, Man, and Cybernetics Society (SMCS) Technical Committee on Homeland Security (TCHS)
Building a New Discipline
Journal Special Issues (Appeared and In Press)
• “Intelligence and Security Informatics,” Journal of the American Society for Information Science and Technology, special issue on Intelligence and Security Informatics, Volume 56, Number 3, Pages 217-220, 2005.
• “Artificial Intelligence for Homeland Security,” IEEE Intelligent Systems, special issue on AI for Homeland Security, Volume 20, Number 5, Pages 12-16, 2005.
• “Intelligence and Security Informatics for Homeland Security: Information, Communication, and Transportation,” IEEE Transactions on Intelligent Transportation Systems, special section, 2006 (in press).
• “Intelligence and Security Informatics: Information Systems Perspective,” Decision Support Systems, special issue on Intelligence and Security Informatics, 2006 (in press).
Building a New Discipline
Books:
• H. Chen, “Intelligence and Security Informatics for International Security: Information Sharing and Data Mining,” Springer, forthcoming, 2005.
• H. Chen, T. S. Raghu, R. Ramesh, A. Vinze, and D. Zeng, “Handbooks in Information Systems -- National Security,” Elsevier Scientific, forthcoming, 2006.
• H. Chen, E. Reid, and J. Sinai, “Terrorism Informatics,” Springer, forthcoming, 2006.
Building a New Discipline
Call for Participation:
• PAKDD ISI Workshop 2006, Singapore, April 9-10, 2006 (Springer LNCS)
• IEEE NSF/CIA/DHS, IEEE ISI 2006, San Diego, CA, April 22-24, 2006 (Springer LNCS)
• IEEE Transactions on Knowledge and Data Engineering (TKDE), special issue
• IEEE Transactions on Intelligent Transportation Systems (TITS), special issue
• IEEE Transactions on Systems, Man, and Cybernetics (TSMC), special issue
Building a New Discipline
Call for Participation:
• IEEE ITSS and SMC Technical Committee (TC) involvement
• IEEE annual ITSS and SMC conference special sessions (IEEE SMC October 2006, Taipei, Taiwan)
• IEEE ISI 2008 in Taipei, Taiwan (pending approval)
Building a New Discipline
Intelligence and Security Informatics for International Security:
Information Sharing and Data Mining
• Intelligence and Security Informatics (ISI): Challenges and Opportunities
• An Information Sharing and Data Mining Research Framework
• ISI Research: Literature Review
• National Security Critical Mission Areas and Case Studies
– Intelligence and Warning
– Border and Transportation Security
– Domestic Counter-terrorism
– Protecting Critical Infrastructure and Key Assets
– Defending Against Catastrophic Terrorism
– Emergency Preparedness and Responses
• The Partnership and Collaboration Framework
Outline
• Federal authorities are actively implementing comprehensive strategies and measures to achieve three objectives:
– Preventing future terrorist attacks
– Reducing the nation’s vulnerability
– Minimizing the damage and recovering from attacks that occur
• Science and technology have been identified in the “National Strategy for Homeland Security” report as the keys to winning the new counter-terrorism war.
Introduction
• Six critical mission areas:
– Intelligence and Warning
– Border and Transportation Security
– Domestic Counter-terrorism
– Protecting Critical Infrastructure and Key Assets
– Defending Against Catastrophic Terrorism
– Emergency Preparedness and Response
Information Technology and National Security
• Facing the critical missions of national security and various data and technical challenges, we believe there is a pressing need to develop the science of “Intelligence and Security Informatics” (ISI).
Problems and Challenges
Federal Initiatives and Funding Opportunities in ISI
• Research and funding opportunities in ISI are abundant:
– National Science Foundation (NSF), Information Technology Research (ITR) Program
– Department of Homeland Security (DHS)
– National Institutes of Health (NIH), National Library of Medicine (NLM), Informatics for Disaster Management Program
– Centers for Disease Control and Prevention (CDC), National Center for Infectious Diseases (NCID), Bioterrorism Extramural Research Grant Program
– Department of Defense (DOD), Advanced Research & Development Activity (ARDA) Program
– Department of Justice (DOJ), National Institute of Justice (NIJ)
• KDD techniques can play a central role in improving the counter-terrorism and crime-fighting capabilities of intelligence, security, and law enforcement agencies by reducing cognitive and information overload.
• Many of these KDD technologies could be applied in ISI studies (Chen et al., 2003a; Chen et al., 2004b). Given the special characteristics of crimes, criminals, and crime-related data, we categorize existing ISI technologies into six classes:
– information sharing and collaboration
– crime association mining
– crime classification and clustering
– intelligence text mining
– spatial and temporal crime mining
– criminal network mining
An ISI Research Framework
[Figure: A knowledge discovery research framework for ISI]
• The potential negative effects of intelligence gathering and analysis on the privacy and civil liberties of the public have been well publicized (Cook & Cook, 2003).
• There exist many laws, regulations, and agreements governing data collection, confidentiality, and reporting, which could directly impact the development and application of ISI technologies.
Caveats for Data Mining
• In the context of domestic security, surveillance is considered an important intelligence tool that has the potential to contribute significantly to national security but also to infringe on civil liberties (Strickland, 2005).
• Data mining using public or private sector databases for national security purposes must proceed with caution:
– The search for general information must ensure anonymity.
– The acquisition of specific identity, if required, must be court authorized under appropriate standards or warrants.
– The peril of the “security-industrial complex”: the marriage of private data and technology companies and government anti-terror initiatives (R. O’Harrow, “No Place to Hide”).
Domestic Security, Civil Liberties, and Knowledge Discovery
• Information Sharing and Collaboration
• Crime Association Mining
• Crime Classification and Clustering
• Intelligence Text Mining
• Crime Spatial and Temporal Mining
• Criminal Network Analysis
ISI Research: Literature Review
• Information sharing across jurisdictional boundaries of intelligence and security agencies has been identified as one of the key foundations for ensuring national security (Office of Homeland Security, 2002).
• Information sharing faces several difficulties:
– Legal and cultural issues regarding information sharing
– Integrating and combining data that are organized in different schemas and stored in different database systems running on different hardware platforms and operating systems (Hasselbring, 2000)
Information Sharing and Collaboration
• Three approaches to data integration have been proposed (Garcia-Molina et al., 2002):
– Federation: maintains data in their original, independent sources but provides a uniform data access mechanism (Buccella et al., 2003; Haas, 2002).
– Warehousing: an integrated system in which copies of data from different data sources are migrated and stored to provide uniform access.
– Mediation: relies on “wrappers” to translate queries and pass them to multiple data sources.
• These techniques are not mutually exclusive. All of them depend, to a great extent, on the matching between different databases.
Approaches to Data Integration
• The task of database matching can be broadly divided into schema-level and instance-level matching (Lim et al., 1996; Rahm & Bernstein, 2001).
– Schema-level matching is performed by aligning semantically corresponding columns between two sources.
– Instance-level or entity-level matching connects records describing a particular object in one database to records describing the same object in another database.
– Instance-level matching is frequently performed after schema-level matching is completed.
• Information integration approaches have been used in law enforcement and intelligence agencies for investigation support.
• Information sharing has also been undertaken in intelligence and security agencies through cross-jurisdictional collaborative systems.
– E.g., COPLINK (Chen et al., 2003b)
Database and Application
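Instance-level matching can be illustrated with a small sketch. It assumes schema-level matching has already aligned the fields, so both sources expose hypothetical `id`, `name`, and `dob` keys. The similarity measure (Python's `difflib`) and the 0.85 threshold are illustrative choices, not the method used in any system cited on these slides.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_records(source_a, source_b, threshold=0.85):
    """Pair records from two sources that plausibly describe the same person."""
    matches = []
    for rec_a in source_a:
        for rec_b in source_b:
            # Require exact agreement on date of birth, then compare names.
            if rec_a["dob"] != rec_b["dob"]:
                continue
            if name_similarity(rec_a["name"], rec_b["name"]) >= threshold:
                matches.append((rec_a["id"], rec_b["id"]))
    return matches


# Hypothetical records from two agencies (invented for illustration).
agency_a = [
    {"id": "A1", "name": "John Smith", "dob": "1970-01-01"},
    {"id": "A2", "name": "Maria Lopez", "dob": "1982-05-17"},
]
agency_b = [
    {"id": "B7", "name": "Jon Smith", "dob": "1970-01-01"},
    {"id": "B9", "name": "Maria Lopez", "dob": "1990-11-30"},
]

print(match_records(agency_a, agency_b))
```

The pairwise loop is quadratic in the number of records. A real deployment would likely need blocking or indexing, which this sketch omits.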
• One of the most widely studied approaches is association rule mining, a process of discovering frequently occurring item sets in a database.
• An association is expressed as a rule X → Y, indicating that item set X and item set Y occur together in the same transaction (Agrawal et al., 1993).
• Each rule is evaluated using two probability measures, support and confidence, where support is defined as prob(X ∪ Y) and confidence as prob(X ∪ Y) / prob(X).
– E.g., “diaper → milk with 60% support and 90% confidence” means that 60% of customers buy both diapers and milk in the same transaction and that 90% of the customers who buy diapers also buy milk.
Crime Association Mining
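The support and confidence measures can be computed directly from a list of transactions. The sketch below uses a small invented transaction set. The function names are illustrative, and it only evaluates one given rule rather than searching for frequent item sets.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent together) divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Invented example data: five market-basket transactions.
transactions = [
    {"diaper", "milk"},
    {"diaper", "milk", "bread"},
    {"diaper", "milk"},
    {"diaper"},
    {"bread"},
]

rule_support = support(transactions, {"diaper", "milk"})          # 3 of 5 transactions
rule_confidence = confidence(transactions, {"diaper"}, {"milk"})  # 3 of the 4 diaper transactions
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here the rule diaper → milk holds in three of five transactions (support 0.60), and three of the four transactions containing diaper also contain milk (confidence 0.75).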
• Crime association mining techniques can include incident association mining and entity association mining (Lin & Brown, 2003).
• Two approaches, similarity-based and outlier-based, have been developed for incident association mining:
– The similarity-based method detects associations between crime incidents by comparing crimes’ features (O'Hara & O'Hara, 1980).
– The outlier-based method focuses only on the distinctive features of a crime (Lin & Brown, 2003).
• The task of finding and charting associations between crime entities such as persons, weapons, and organizations often is referred to as entity association mining (Lin & Brown, 2003) or link analysis.
Crime Association Mining Techniques
• Three types of link analysis approaches have been suggested: heuristic-based, statistical-based, and template-based.
– Heuristic-based approaches rely on decision rules used by domain experts to determine whether two entities in question are related.
– Statistical-based approaches, e.g., Concept Space (Chen & Lynch, 1992), measure the weighted co-occurrence associations between records of entities (persons, organizations, vehicles, and locations) stored in crime databases.
– Template-based approaches have been primarily used to identify associations between entities extracted from textual documents such as police report narratives.
Link Analysis Approaches
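The idea behind co-occurrence-based link analysis can be sketched by counting how often two entities appear in the same record. This sketch uses raw counts only. It does not reproduce the Concept Space weighting, and the record contents and entity labels are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how many records mention each unordered pair of entities."""
    counts = Counter()
    for entities in records:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical incident records, each listing the entities it mentions.
records = [
    ["person:A", "vehicle:V1"],
    ["person:A", "person:B", "vehicle:V1"],
    ["person:B", "location:L1"],
]

links = cooccurrence_counts(records)
for pair, count in links.most_common(3):
    print(pair, count)
```

Pairs with higher counts are candidate associations for an analyst to review. A weighted scheme would additionally discount entities that appear in many records.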
• Classification is the process of mapping data items into one of several predefined categories based on attribute values of the items (Hand, 1981; Weiss & Kulikowski, 1991).
• Widely used classification techniques:
– Discriminant analysis (Eisenbeis & Avery, 1972)
– Bayesian models (Duda & Hart, 1973; Heckerman, 1995)
– Decision trees (Quinlan, 1986, 1993)
– Artificial neural networks (Rumelhart et al., 1986)
– Support vector machines (SVM) (Vapnik, 1995)
• Several of these techniques have been applied in the intelligence and security domain to detect financial fraud and computer network intrusion.
Crime Classification and Clustering
• Clustering groups similar data items into clusters without knowing their class membership. The basic principle is to maximize intra-cluster similarity while minimizing inter-cluster similarity (Jain et al., 1999)
• Various clustering methods have been developed, including hierarchical approaches such as complete-link algorithms (Defays, 1977), partitional approaches such as k-means (Anderberg, 1973; Kohonen, 1995), and Self-Organizing Maps (SOM) (Kohonen, 1995).
• The use of clustering methods in the law enforcement and security domains can be categorized into two types: crime incident clustering and criminal clustering.
Crime Classification and Clustering
• Text mining has attracted increasing attention in recent years as natural language processing capabilities advance (Chen, 2001). An important task of text mining is information extraction, a process of identifying and extracting from free text select types of information such as entities, relationships, and events (Grishman, 2003). The most widely studied information extraction subfield is named entity extraction.
• Four major named-entity extraction approaches have been proposed:
– Lexical lookup
– Rule-based
– Statistical model
– Machine learning
• Intelligence text mining aims to identify people, organizations, locations, properties, and relationships of interest.
Intelligence Text Mining
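Of the four approaches, lexical lookup is the simplest to illustrate: entities are found by matching text against a hand-built lexicon. The sketch below uses a tiny invented lexicon (`GAZETTEER`) and an invented sentence. A realistic system would need much larger lexicons and would still miss unseen names.

```python
import re

# Hypothetical lexicon mapping entity types to known names (lowercase).
GAZETTEER = {
    "person": ["john doe"],
    "organization": ["acme shipping"],
    "location": ["tucson", "san diego"],
}


def lexical_lookup(text):
    """Return (mention, entity_type) pairs found in `text`, in order of appearance."""
    found = []
    lowered = text.lower()
    for entity_type, names in GAZETTEER.items():
        for name in names:
            # Word boundaries avoid matching inside longer words.
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
                found.append((m.start(), text[m.start():m.end()], entity_type))
    return [(mention, etype) for _, mention, etype in sorted(found)]


narrative = "John Doe drove from Tucson to San Diego for Acme Shipping."
print(lexical_lookup(narrative))
```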
• Most crimes, including terrorism, have significant spatial and temporal characteristics (Brantingham & Brantingham, 1981).
• Crime spatial and temporal mining aims to gather intelligence about environmental factors that prevent or encourage crimes (Brantingham & Brantingham, 1981), identify geographic areas of high crime concentration (Levine, 2000), and detect crime trends (Schumacher & Leitner, 1999).
• Two major approaches for crime temporal pattern mining:
– Visualization: presents individual or aggregated temporal features of crimes using a periodic view or timeline view.
– Statistical approach: builds statistical models from observations to capture the temporal patterns of events.
Crime Spatial and Temporal Mining
• Three approaches for crime spatial pattern mining (Murray et al., 2001):
– Visual approach (crime mapping): presents a city or region map annotated with various crime-related information.
– Clustering approaches: have been used in hot spot analysis, a process of automatically identifying areas with high crime concentration. Partitional clustering algorithms such as the k-means methods are often used for finding hot spots of crimes. They usually require the user to predefine the number of clusters to be found.
– Statistical approaches: used to conduct hot spot analysis, to test the significance of hot spots (Craglia et al., 2000), and to predict crime.
Crime Spatial and Temporal Mining
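A minimal k-means sketch shows how partitional clustering locates hot spots and why the number of clusters must be chosen up front. The incident coordinates and the two starting centroids below are invented. A fixed number of iterations is used instead of a convergence test to keep the example short.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic k-means: the number of clusters equals the number of starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Invented incident locations on an arbitrary planar grid.
incidents = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
hot_spots, members = kmeans(incidents, [(0.0, 0.0), (10.0, 10.0)])
print(hot_spots)
```

Each returned centroid is a candidate hot spot center. Choosing a different number of starting centroids can change the result substantially, which is the limitation noted above.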
• Criminals seldom operate alone but instead interact with one another to carry out various illegal activities. Relationships between individual offenders form the basis for organized crime and are essential for the effective operation of a criminal enterprise.
• Criminal enterprises can be viewed as a network consisting of nodes (individual offenders) and links (relationships).
• Structural network patterns in terms of subgroups, between-group interactions, and individual roles thus are important to understanding the organization, structure, and operation of criminal enterprises.
Criminal Network Analysis
• Social Network Analysis (SNA) provides a set of measures and approaches for structural network analysis (Wasserman & Faust, 1994).
• SNA is capable of:
– Subgroup detection
– Central member identification
– Discovery of patterns of interaction
• SNA also includes visualization methods that present networks graphically.
– The Smallest Space Analysis (SSA) approach (Wasserman & Faust, 1994) is used extensively in SNA to produce two-dimensional representations of social networks.
• Network Topological Analysis aims to identify topological characteristics of complex networks (e.g., random, small world, and scale-free networks) and their dynamics and guiding properties.
Criminal Network Analysis
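Central member identification can be sketched with degree centrality, one of the simplest SNA measures: the share of other network members to whom a node is directly linked. The network below is invented, and degree is only one of several centrality measures used in SNA.

```python
from collections import defaultdict


def degree_centrality(links):
    """Map each node to (number of distinct neighbors) / (number of other nodes)."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical undirected relationships between five offenders.
links = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
scores = degree_centrality(links)
central_member = max(scores, key=scores.get)
print(central_member, scores[central_member])
```

In this example node A is linked to three of the four other members, so it has the highest score (0.75).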
• The six classes of KDD techniques reviewed above constitute the key components of our proposed ISI research framework. Our focus on the KDD methodology, however, does NOT exclude other approaches.
• Researchers from different disciplines can contribute to ISI.
– DB, AI, data mining, algorithms, networking, and grid computing researchers can contribute to core information infrastructure, integration, and analysis research of relevance to ISI.
– IS and management science researchers could help develop the quantitative, system, and information theory based methodologies needed for the systematic study of national security.
– Cognitive science, behavioral research, and management and policy research are critical to understanding individual, group, organizational, and societal impacts and to developing effective national security policies.
Conclusion and Future Direction
• Intelligence and Warning
• Border and Transportation Security
• Domestic Counter-terrorism
• Protecting Critical Infrastructure and Key Assets
• Defending Against Catastrophic Terrorism
• Emergency Preparedness and Responses
National Security Critical Mission Areas and Case Studies
• By analyzing the communication and activity patterns among terrorists and their contacts, detecting deceptive identities, or employing other surveillance and monitoring techniques, intelligence and warning systems may issue timely, critical alerts to prevent attacks or crimes from occurring.
Intelligence and Warning
Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed
1 | Detecting deceptive identities | Authoritative source; structured criminal identity records | Association mining | Intelligence and warning
2 | Dark Web Portal | Open source; Web hyperlink data | Web spidering and archiving; portal access | Intelligence and warning
3 | Jihad on the Web | Open source; multilingual, Web data | Web spidering; multilingual indexing; link and content analysis | Intelligence and warning
4 | Analyzing al Qaeda network | Open source; news articles | Statistics-based; network topological analysis | Intelligence and warning
Four case studies of relevance to intelligence and warning
• The capabilities of counter-terrorism and crime-fighting can be greatly improved by creating a “smart border,” where information from multiple sources is integrated and analyzed to help locate wanted terrorists or criminals. Technologies such as information sharing and integration, collaboration and communication, and biometrics and speech recognition will be greatly needed in such smart borders.
Border and Transportation Security
Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed
5 | BorderSafe information sharing | Authoritative source; structured criminal identity records | Information sharing and integration; database federation | Border and transportation security
6 | Cross-border network analysis | Authoritative source; structured criminal identity records | Network topological analysis | Border and transportation security
Two case studies of relevance to Border and Transportation Security
• Because terrorists, both international and domestic, may be involved in local crimes, information technologies that help find cooperative relationships between criminals and their interaction patterns would also be helpful for analyzing domestic terrorism.
Domestic Counter-terrorism
Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed
7 | COPLINK Detect | Authoritative source; structured data | Association mining | Domestic counter-terrorism
8 | Criminal network analysis | Authoritative source; structured data | Social network analysis; cluster analysis; visualization | Domestic counter-terrorism
9 | Domestic extremists on the Web | Open source; Web-based text data | Web spidering; link and content analysis | Domestic counter-terrorism
10 | Dark networks analysis | Authoritative and open sources | Network topological analysis | Domestic counter-terrorism
Four case studies of relevance to Domestic Counter-terrorism in Chapter 7
• Criminals and terrorists are increasingly using cyberspace to conduct illegal activities, share ideology, solicit funding, and recruit. One aspect of protecting cyber infrastructure is to determine the source and identity of unwanted threats or intrusions.
Protecting Critical Infrastructure and Key Assets
Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed
11 | Identity tracing in cyberspace | Open source; multilingual, text, Web data | Feature extraction; classification | Protecting critical infrastructure
12 | Writeprint feature selection | Open source; multilingual, text, Web data | Feature extraction; feature selection | Protecting critical infrastructure
13 | Arabic authorship analysis | Open source; multilingual, text, Web data | Feature extraction; classification | Protecting critical infrastructure
Three case studies of relevance to Protecting Critical Infrastructure and Key Assets
• Biological attacks may cause contamination, infectious disease outbreaks, and significant loss of life. Information systems that can efficiently and effectively collect, access, analyze, and report data about catastrophe-leading events can help prevent, detect, respond to, and manage these attacks.
Defending Against Catastrophic Terrorism
Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed
14 | BioPortal for information sharing | Authoritative source; structured data | Information integration and messaging; GIS analysis and visualization | Defending against catastrophic terrorism
15 | Hotspot analysis | Authoritative source; structured data | Statistics-based SaTScan; clustering; SVM | Defending against catastrophic terrorism
Two case studies of relevance to Defending Against Catastrophic Terrorism
• Information technologies that help optimize response plans, identify experts, train response professionals, and manage consequences help defend against catastrophes in the long run. Moreover, information systems that provide social and psychological support to the victims of terrorist attacks can also help society recover from disasters.
Emergency Preparedness and Responses
Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed
16 | Terrorism expert finder | Open source; structured, citation data | Bibliometric analysis | Emergency preparedness and responses
17 | Chatterbot for terrorism information | Open source; structured data | Dialog system | Emergency preparedness and responses
Two case studies of relevance to Emergency Preparedness and Responses
• Dark Web Collection Building
• Dark Web Content Analysis
• Dark Web Forum Authorship Analysis and Visualization
ISI Dark Web Case Studies
Terrorists’ Communication on the Internet
• The Internet enables diverse forms of communication.
• The complexity of communication can range from text-only messages to the use of multimedia.
• Below is a comparison of communication media on the Internet.
Medium | Access | Temporal Flow | Directionality | Focus | Feasibility of Automatic Surveillance
Emails | Private | Asynchronous | One to one | Not focused | Not feasible (only email service providers have access to the data)
Instant messengers | Private | Synchronous | One to one | Not focused | Not feasible (only network servers have access to the data)
Forums, newsgroups, discussion boards | Public | Asynchronous | Many to many | Usually focused | Feasible (all registered group members have access to the data from all over the world)
Chat rooms | Public/Private | Synchronous | Many to many | Usually not focused | Not feasible (content is not retained; only chat room servers have access to the data)
Forum as Communication Tool for Terrorists and Their Supporters
[Figure: A typical forum, board view. Labeled elements: the title of the board; multiple pages of the board; the title of the thread; multiple pages of one thread; # of replies of the thread; # of views of the thread.]
Forum as Communication Tool for Terrorists and Their Supporters
[Figure: A typical forum, thread view. Labeled elements: post time; the title of the thread; the body of the message; user ID; the virtual rank of the author in the forum; other information about the author; reply of the main thread.]
3: Handle Different Forum Software
• Identify forum software packages that were used
• Identify the URL patterns of the forum software
• Customize spiders based on the forum package
List of Forum Software
Software Package | Language
Crosstar | PHP
DCForum | PHP
ezboard | CGI
IM | PHP
Invision Power Board | PHP
newbb | PHP
phpBB | PHP
rafia | PHP
vBulletin | PHP
WebRing | CGI
WebWiz | ASP
YaBB | PHP
4. Identify Threads, Posts, Authors, etc.
• Identify and record URL patterns for threads, posts, authors, time posted, # of views, etc. of the forum software.
• Meta data from both board files and thread files need to be extracted. – vBulletin example: (# represents numbers)
• URL pattern of boards: forumdisplay.php?forumid=#&daysprune=#&sortorder=&sortfield=lastpost&perpage=#&pagenumber=#
• URL pattern of topics: showthread.php?threadid=###&perpage=##&pagenumber=##
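The vBulletin URL patterns above can be recognized with standard URL parsing. The sketch below classifies a URL as a board page or a thread page from the script name and the `forumid` or `threadid` parameter. The function name, the return format, and the example host are assumptions made for illustration, not part of the spider described on these slides.

```python
from urllib.parse import parse_qs, urlparse


def classify_vbulletin_url(url):
    """Return (kind, id) where kind is 'board', 'thread', or 'other'."""
    parsed = urlparse(url)
    script = parsed.path.rsplit("/", 1)[-1]
    params = parse_qs(parsed.query)  # parameters with empty values are dropped
    if script == "forumdisplay.php" and "forumid" in params:
        return ("board", params["forumid"][0])
    if script == "showthread.php" and "threadid" in params:
        return ("thread", params["threadid"][0])
    return ("other", None)


# Hypothetical URLs following the two patterns listed above.
board_url = ("http://forum.example.org/vb/forumdisplay.php"
             "?forumid=4&daysprune=30&sortorder=&sortfield=lastpost&perpage=25&pagenumber=2")
thread_url = "http://forum.example.org/vb/showthread.php?threadid=123&perpage=15&pagenumber=2"

print(classify_vbulletin_url(board_url))
print(classify_vbulletin_url(thread_url))
```

Other forum packages use different script names and parameters, so a spider would need one such classifier per package.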
Forum Board File Example: Google Groups
[Figure: Google Groups board page. Labeled elements: the title of the group; description of the group; the title of the thread; author; post time.]
Dark Web Forums Identification
Forum Identification: Overall Distribution by ISP (# of forums)
Provider | Middle-Eastern | Latin-American | US Domestic
Websites | 48 | 4 | 18
Yahoo! Groups | 20 | 11 | 31
Google Groups | 0 | 32 | 47
MSN | 0 | 5 | 9
AOL | 0 | 0 | 5
Local ISP | 0 | 8 | 0
Forum Identification: Distribution Analysis
# of Forums by Category: US Domestic
Category | # of Forums
White Supremacy | 31
Militia | 30
Neo Nazis | 21
Black Separatist | 14
Others | 10
Christian Identity | 4
Neo Confederate | 0
Racist Skinhead | 0
Forum Identification: Distribution Analysis
# of Forums by Category: Middle-Eastern
Category | # of Forums
Sunni Muslim | 48
Others | 17
Secular | 1
Jewish | 0
Shi'a Muslim | 0
Communist/Socialist | 0
Forum Collection (Yahoo Groups)
• US Domestic
Group Name | Forum Name | Messages | Members
Animal Liberation Front | Animal Liberation Front | 890 | 31
National Alliance | American National Socialist Group | 258 | 69
Neo Nazi | Angelic_Adolf | 464 | 77
World Knights of the Ku Klux Klan | aryannationsknights | 32 | 14
Westboro Baptist Church | Peace Love And Unity Topeka | 48 | 19
Council of Conservative Citizens | Citizens Councils News Update | 434 | 79
Neo Nazi | Neo-Nazi | 1614 | 154
New Black Panther Party | New Black Panther Party | 5051 | 1102
National Socialist Movement | NSM World | 5660 | 789
United Nuwaubian Nation of Moors | NUWAUBU RIGHT KNOWLEDGE | 5269 | 258
Neo Nazi | smashnazism | 103 | 13
Sons of Liberty | Southern Sons Of Liberty | 1614 | 154
Neo Nazi | thejapanesenazis | 218 | 41
World Knights of the Ku Klux Klan | World_Knights | 248 | 7
Forum Collection (Yahoo Groups)
• Middle East
Group Name | Forum Name | Messages | Members
AlQaeda | azzamy | 560 | 1052
Al-Dawa | dawa-support · Dawa Committee Supporters | 203 | 77
General Jihad (the exact affiliation is not clear) | friends_in_islam | 1541 | 77
Hezbollah | hezbollah_iran · ya MAHDI adrekni | 854 | 92
General Jihad (the exact affiliation is not clear) | Islamic_Action_Group · Islamic Action Group | 886 | 151
General Jihad (the exact affiliation is not clear) | islamicresistance · Islamic Resistance - Speak out for truth, justice, and Palstine | 2306 | 121
General Jihad (the exact affiliation is not clear) | islamic-union · الإتحاد الإسلامي | 1123 | 256
Al-Aqsa Martyrs' Brigades | kataeb · kataeb al aqsa, كتائب شهداء الاقصى | 400 | 142
Hamas | kataeb_qassam | 855 | 188
AlQaeda | taybah3 · { الطيبة { طيبة | 406 | 89
AlQaeda | Usama_bin_laden · اسامة بن لادن | 360 | 535
General Jihad (the exact affiliation is not clear) | wa-islamah · وا إسلاماه | 5026 | 5336
Forum Filetype Analysis (Website Forums)
Website Forums Collection
File Type | Arabic: # of Files | Arabic: Volume (Bytes) | US Domestic: # of Files | US Domestic: Volume (Bytes)
Total | 496,186 | 20,658,746,269 | 116,419 | 7,694,035,712
Indexable Files | 208,174 | 12,132,567,109 | 93,655 | 6,511,416,058
HTML Files | 2 | 832,049 | 33 | 570,175
Word Files | 0 | 0 | 0 | 0
PDF Files | 0 | 0 | 0 | 0
Dynamic Files | 208,171 | 12,131,735,060 | 93,620 | 6,510,845,724
Text Files | 98 | 1,054,027,204 | 2 | 20,967,860
Excel Files | 0 | 0 | 0 | 0
Powerpoint Files | 0 | 0 | 0 | 0
XML Files | 0 | 0 | 2 | 229
Multimedia Files | 226,118 | 6,661,118,184 | 21,518 | 1,136,758,815
Image Files | 224,485 | 4,119,229,029 | 21,177 | 373,953,750
Audio Files | 393 | 232,709,714 | 107 | 405,282,727
Video Files | 1,240 | 2,309,179,441 | 234 | 357,522,338
Archive Files | 0 | 0 | 0 | 0
Non-Standard Files | 61,894 | 1,865,060,976 | 1,246 | 45,860,839
Findings
• 3. Multimedia content heavily used
– Discussion and multimedia file content examples from Middle-Eastern forums
Content Type | 3asfh (www.3asfh.net) | Shawati (www.shawati.com)
Discussions | (1) Poems praising extremist actions; (2) List of clerics with phone numbers and emails; (3) Responses to media postings such as the “Desecration of the Qur'an” video | (1) Allegations of abuse of Iraqi children by American soldiers; (2) Links to websites of clerics; (3) Praise of the late Saudi King; (4) Reports from Iraq jihadists
Images | (1) Banners of Bin Laden; (2) Banners praising Palestinian extremists | (1) Pictures showing cadavers purported to be innocent Iraqis killed by American soldiers; (2) Picture of Chechen martyrs
Audio | (1) Readings from the Qur'an; (2) Jihad hymns | (1) Audio recordings of speeches by extremist clerics
Video | (1) Desecration of the Qur'an | Video showing the shooting of an Iraqi “collaborator”
Findings
• 3. Multimedia content contains rich messages
– Discussion and multimedia file content examples from Middle-Eastern forums
– On http://www.alm2sda.net/vb/ we found the following:
• Mentions leaders’ names (Bin Laden, Zarqawi, and Sayyid Qutb)
• Provides information about different kinds of bombs (i.e., how to prepare them, weight of each type)
• Includes news reporting of operations and events
• Provides detailed descriptions with images of different missiles
• Some of the members are from Hamas and they are recruiting other members to join
• Provides information on distributing viruses (under E-Jihad)
5. Event Tracking in Extremist Forums: “US Against US”
• Chronology of Events in Iraq in 2003-2004
• Two types of external events
– Attack events carried out by extremists against Western countries
• Istanbul Attacks, Nov. 15, 2003
• Madrid Bombings, Mar. 11, 2004
• Berg Beheading, May 11, 2004
– Attack events carried out by Westerners on extremists’ own land: strong response!
• Feb. 24, 2003: The United States, Great Britain, and Spain submit a proposed resolution to the UN Security Council stating, “Iraq has failed to take the final opportunity afforded to it in Resolution 1441.” The resolution concludes it is time to authorize use of military force.
• March 17, 2003: Great Britain's ambassador to the UN says the diplomatic process on Iraq has ended. Arms inspectors evacuate. Pres. George W. Bush gives Saddam Hussein and his sons 48 hours to leave Iraq or face war.
• March 21, 2003: The major phase of the war begins with heavy aerial attacks on Baghdad and other cities.
• March 24, 2003: Troops march within sixty miles of Baghdad.
• April 9, 2003: The fall of Baghdad: U.S. forces advance into central Baghdad.
Analyzing Terror Campaign on the Internet: Technical Sophistication, Media Richness,
and Web Interactivity
65
Existing Studies on Dark Web
Organization | Description | Access
Archive
1. Internet Archive (IA) | 1996-. Spidering (every 2 mths.) to collect open access HTML pages | Via http://www.archives.org
Research Center
2. Artificial Intelligence (AI) Lab, University of Arizona | 2003-. Spidering (every 2 mths.) to collect terrorist Web sites. Has 942 Web sites: U.S. Domestic (422), Latin America (188), and Middle Eastern (332) Web sites, 97 gigabytes size, 541,800 multimedia files | Via testbed portal called Dark Web Portal
3. Anti-terrorism Coalition (ATC) | 2003-. Jihad Watch. Has 448 terrorist Web sites & forums | Via http://www.jihadwatch.org
4. Prism (ICT, Israel) | 2002-. Limited # of Web sites. Project for Research of Islamist Movements. | Access reports via Web site http://www.e-prism.org
5. MEMRI | 2003-. Jihad & Terrorism Studies Project. | Access reports via http://www.memri.org
6. Site Institute | 2003-. Manually capture Web sites every 24 hrs. | Access reports & subscribe to fee-based intelligence reports & alerts http://siteinstitute.org
7. Weimann (Univ. Haifa, Israel) | 1998-. Manually capture Web sites daily | Closed collection
Vigilante Community
8. Internet Haganah | 2001-. Spidering. Confronting the Global Jihad Project. Has 100s links to Web sites. | Provides snapshots of terrorist Web sites http://haganah.us
9. Johnathanrgalt | 2001-. Spidering Islamic Terror Sites on the Web. Has 60-70 sites. Monitors sites that closed. | Provides snapshots to terrorist Web sites that are closed. http://www.geocities.com/johnathanrgalt
Table 1: Organizations that Capture and Analyze Terrorists’ Web sites
66
Web Usage Analysis in e-Government
• Several large-scale studies have been dedicated to studying governments’ Web usage:
  – The Cyberspace Policy Research Group (CyPRG; www.cyprg.arizona.edu)
  – United Nations Online Network in Public Administration and Finance (UNPAN; www.unpan.org)
  – European Commission's IST program (www.cordis.lu/ist/)
• In addition to technical sophistication and media richness, e-Government research has also studied the interactivity and transparency of government Web sites (Demchak et al., 2001).
67
Dark Web Collection Building Method
Figure 1. The Dark Web Collection Building Procedure
1. Identify Terrorist Groups
  – Government Reports (FBI, US State Department, UN Security Council, etc.)
  – Research Centers (ATC, MEMRI, Dartmouth, Norwegian Research, etc.)
  – Terrorism Lexicon (organization names, leader names, slogans, special keywords, ...)
2. Identify Terrorist Group URLs
  – Government Reports (FBI, US State Department, UN Security Council, etc.)
  – Research Centers (ATC, MEMRI, Dartmouth, Norwegian Research, etc.)
  – Search Engines (Google, Yahoo, etc.)
  – Initial Seed URLs
3. Expand Terrorist Group URLs by Link and Forum Analysis
  – Back-link Extraction
  – Out-link Extraction
  – Website Forum Analysis
  – Filtering
  – Expanded URLs
4. Download Terrorist Site Contents
  – Automatic Web Crawler (downloads multilingual, multimedia Web contents)
  – Dark Web Testbed
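The out-link extraction in step 3 can be illustrated with a short Python sketch. This is not the crawler used to build the collection; the class and function names, the example hosts, and the choice to compare at host level are all assumptions made for illustration. It parses already-downloaded seed pages and returns linked hosts that are not yet in the seed list, which would still need the filtering step before being accepted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class OutLinkParser(HTMLParser):
    """Collect the href targets of <a> tags found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def expand_seed_urls(pages, seed_hosts):
    """Return hosts linked from the seed pages that are not seeds themselves.

    pages: dict mapping page URL -> HTML text (hypothetical, pre-downloaded).
    seed_hosts: set of host names already in the seed list.
    """
    candidates = set()
    for url, html in pages.items():
        parser = OutLinkParser(url)
        parser.feed(html)
        for link in parser.links:
            host = urlparse(link).netloc.lower()
            if host and host not in seed_hosts:
                candidates.add(host)
    return sorted(candidates)


# Hypothetical example pages; the hosts are placeholders.
pages = {
    "http://seed.example/index.html": '<a href="http://other.example/a">x</a>'
                                      '<a href="/local">y</a>',
}
new_hosts = expand_seed_urls(pages, {"seed.example"})
```

Back-link extraction would need an external source of inbound links (for example a search engine query), so it is left out of this sketch.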
68
Dark Web Analysis Framework (DWAF): Technical Sophistication Measures
• Technical sophistication (TS)
– To study the level of advancement of the techniques used by terrorists to establish and maintain their Web presence.
– Table 1 shows the TS measures identified from Palmer and Griffith (1998).
Measures | Weights
Basic HTML Techniques
Use of Lists | 0/1
Use of Tables | 0/2
Use of Frames | 0/2
Use of Forms | 0/1.5
Embedded Multimedia
Use of Background Image | 0/1
Use of Background Music | 0/2
Stream Audio/Video | 0/3.5
Advanced HTML
Use of phtml/shtml | 0/2.5
Use Predefined Functions? | 0/2
Use Self-defined Functions? | 0/4.5
Dynamic Web Programming
Use CGI | 0/2.5
Use PHP | 0/4.5
Use JSP/ASP | 0/5.5
Comments: All attributes can be automatically identified from terrorist Websites using programs.
Table 1. Technical Sophistication Measures
69
Dark Web Analysis Framework: Media Richness Measures
• Media richness (MR)
– To study how effectively the information is disseminated from terrorist Websites to their target audiences (basic non-interaction function of Websites).
– Table 2 shows the MR measures identified from computer-mediated communication literature (Trevino et al., 1987; Palmer & Griffith, 1998 ).
Measures | Scores
Hyperlink | # of hyperlinks
File/Software Download | # of downloads
Animation | # of animations
Image | # of images
Video/Audio File | # of video/audio files
Comments: “Push Media” and “Content Search” may need manual identification. Other measures can be automatically extracted.
Table 2. Media Richness Measures
70
Dark Web Analysis Framework: Interactivity Measures
• Web interactivity (WI)
  – To study how effectively the terrorist Websites facilitate the interactions between the terrorists and their supporters.
  – Contains multiple sub-levels:
    • One-to-one interaction
    • Community-level interaction
    • Transaction-level interaction
  – Table 3 shows the WI measures identified from literature (Berthon et al., 1999).
Measures | Weights
One-to-one interaction (automatic extraction + manual identification)
Email Feedback | 0/1.75
Email List | 0/2.25
Contact Address | 0/1.25
Feedback Form | 0/2.75
Guest Book | 0/1.5
Community-level interaction (automatic extraction + manual identification)
Private Messages | 0/4.25
Online Forums | 0/4.25
Chat Rooms | 0/4.75
Transaction-level interaction (automatic extraction + manual identification)
Online Shop | 0/4
Online Payment | 0/4
Online Application Form | 0/4
Table 3. Web Interactivity Measures
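The "0/w" weights in the measure tables read as: a site scores 0 when the attribute is absent and w when it is present. A minimal Python sketch of that scoring for the Web interactivity measures follows. The weights are copied from Table 3, but the feature keys, the function name, and the choice to sum weights per level are assumptions; the slides do not state how the weighted average scores reported later were aggregated.

```python
# Weights copied from Table 3; the dictionary keys are hypothetical names.
WI_WEIGHTS = {
    "one_to_one": {
        "email_feedback": 1.75,
        "email_list": 2.25,
        "contact_address": 1.25,
        "feedback_form": 2.75,
        "guest_book": 1.5,
    },
    "community": {
        "private_messages": 4.25,
        "online_forums": 4.25,
        "chat_rooms": 4.75,
    },
    "transaction": {
        "online_shop": 4.0,
        "online_payment": 4.0,
        "online_application_form": 4.0,
    },
}


def level_scores(site_features):
    """Sum the weights of the attributes present on a site, per level.

    site_features: set of attribute names detected on the site.
    Summation is an assumed aggregation, chosen only for illustration.
    """
    return {
        level: sum(w for name, w in weights.items() if name in site_features)
        for level, weights in WI_WEIGHTS.items()
    }


# Hypothetical site that offers a guest book and an online forum.
scores = level_scores({"guest_book", "online_forums"})
```

The same pattern would apply to the technical sophistication weights in Table 1, with a different weight dictionary.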
71
Middle East Terrorist Web Collection File Type Breakdown
• Dynamic files (e.g., PHP, ASP, JSP, etc.) are widely used in terrorist Web sites, indicating a high level of technical sophistication.
• Multimedia is also heavily used in terrorist Web sites.
Terrorist Collection # of Files Volume(Bytes)
Total 222,687 12,362,050,865
Indexable Files 179,223 4,854,971,043
HTML Files 44,334 1,137,725,685
Word Files 278 16,371,586
PDF Files 3,145 542,061,545
Dynamic Files 130,972 3,106,537,495
Text Files 390 45,982,886
Powerpoint Files 6 6,087,168
XML Files 98 204,678
Multimedia Files 35,164 5,915,442,276
Image Files 31,691 525,986,847
Audio Files 2,554 3,750,390,404
Video Files 919 1,230,046,468
Archive Files 1,281 483,138,149
Non-Standard Files 7,019 1,108,499,397
Pie chart, Number of Files Distribution (Arabic, Terrorist): Indexable Files 80%, Multimedia Files 16%, Archive Files 0%, Non-Standard Files 4%
Pie chart, Volume Distribution (Arabic, Terrorist): Indexable Files 39%, Multimedia Files 48%, Archive Files 4%, Non-Standard Files 9%
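The distribution charts can be cross-checked against the file type table with a few lines of Python. The counts and byte volumes below are copied from the Middle East collection table; the function and variable names are hypothetical.

```python
# (number of files, volume in bytes) per top-level category, from the table.
COLLECTION = {
    "Indexable Files": (179_223, 4_854_971_043),
    "Multimedia Files": (35_164, 5_915_442_276),
    "Archive Files": (1_281, 483_138_149),
    "Non-Standard Files": (7_019, 1_108_499_397),
}


def shares(collection, index):
    """Return each category's percentage share of the column at `index`."""
    total = sum(row[index] for row in collection.values())
    return {name: 100 * row[index] / total for name, row in collection.items()}


count_share = shares(COLLECTION, 0)   # share of the number of files
volume_share = shares(COLLECTION, 1)  # share of the total volume
```

Rounded to whole percentages, the volume shares reproduce the chart values, and the four categories add up to the table's totals.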
72
US Government Web Collection File Type Breakdown
US Government Collection # of Files Volume (Bytes)
Total 277,274 19,341,345,384
Indexable Files 221,684 6,502,288,302
HTML Files 71,518 2,632,912,620
Word Files 298 210,906,045
PDF Files 841 663,293,376
Dynamic Files 145,590 2,071,734,849
Text Files 2,878 555,403,447
Excel Files 4 98,560
Powerpoint Files 5 725,017
XML Files 554 367,214,389
Multimedia Files 49,582 10,835,029,216
Image Files 45,707 850,011,712
Audio Files 3,429 8,153,419,931
Video Files 449 1,831,597,573
Archive Files 538 286,312,990
Non-Standard Files 5,471 1,717,714,876
Pie chart, Number of Files Distribution (US): Indexable Files 80%, Multimedia Files 18%, Non-Standard Files 2%, Archive Files 0%
Pie chart, Volume Distribution (US): Indexable Files 33%, Multimedia Files 56%, Non-Standard Files 10%, Archive Files 1%
• As in the terrorist collection, dynamic files and multimedia are heavily used in government Web sites.
73
Analysis Results: Technical Sophistication
• Overall, the technical sophistication of terrorist Web sites is on par with US government Web sites.
• US government Web sites are better at the use of basic HTML techniques and dynamic Web programming.
• Terrorist Web sites are using more embedded multimedia.
High-level Attributes | Weighted Average Score (US) | Weighted Average Score (Terrorists) | t-Test Result
Basic HTML Techniques | 0.913043 | 0.710526 | p < 0.0001**
Embedded Multimedia | 0.565217 | 0.833333 | p = 0.0027**
Dynamic HTML | 1.789855 | 1.771929 | p = 0.139
Dynamic Web Programming | 2.159420 | 1.407894 | p = 0.0066**
Average | 1.356884 | 1.180921 | p = 0.060
Table 4. Technical Sophistication Comparison Results
74
Analysis Results: Media Richness
• Overall, terrorist Web sites score lower than US government Web sites in terms of media richness.
• Terrorist Web sites have significantly fewer hyperlinks and less downloadable content.
Attributes | Average Count per Site (US) | Average Count per Site (Terrorists) | t-Test Result
Hyperlink | 3513.254654 | 3172.658483 | p < 0.0001**
File/Software Download | 400.9674532 | 151.868427 | p = 0.0103*
Image | 582.352456 | 540.0484563 | p = 0.466
Video/Audio File | 91.55434783 | 50.9736828 | p < 0.0001**
Average | 1154.531598 | 978.8871471 | p < 0.0001**
Table 5. Media Richness Comparison Results
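The comparison tables report t-test results on per-site scores for the two collections. The slides do not say which t-test variant was used, so the following Python sketch assumes Welch's unequal-variance form and computes only the t statistic and degrees of freedom (turning these into a p-value needs a t-distribution table or a statistics library). The sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test.

    a, b: per-site scores or counts for the two groups being compared.
    """
    va = variance(a) / len(a)  # squared standard error of group a's mean
    vb = variance(b) / len(b)  # squared standard error of group b's mean
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-site counts for two small groups of sites.
t_stat, dof = welch_t([2, 4, 6, 8], [1, 2, 3, 4])
```

A larger absolute t value at a given number of degrees of freedom corresponds to a smaller p-value, which is what the starred entries in the tables indicate.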
75
Result Discussions: Technical Sophistication and Media Richness
• Terrorist Web sites use more embedded media but achieved lower media richness scores.
– Government sites use many background images to improve the look of their pages.
– The small background images were not counted as “embedded media.”
– Terrorists use fewer background images, but more media with rich content such as historical pictures, posters, video/audio recordings, etc.
The Arizona state government homepage alone contains 43 images, 42 of which are small background images (less than 4KB).
76
Result Discussions: Sample Media Provided by Terrorists
• Historical pictures or event pictures
• Many movie clips of several “martyrdom operations” in Iraq were posted at http://wwwlb.dm.net.lb/ubb/Forum4/
Flash animation and pictures depicting Marxist symbols, historical locations, and personalities on the Website of the Iranian People’s Fadaee Guerilla. (Source: http://siahkal.com/)
Documentation (with pictures) of an assassination attempt of Libyan president Mu’amar Kdhafi by members of the “Fighting Islamic Group” guerilla. (Source: http://www.almuqatila.com/)
The hero martyr Abdullah Radwan, may God have mercy upon him, hiding in the crowds and awaiting the arrival of the dictator. Shown clearly inside the red circle
77
Result Discussions: Sample Media Provided by Terrorists
• Posters praising terrorist leaders or inviting men to join Jihad.
A Hamas poster inviting men to join the military struggle.(Source: http://www.palestine-info.com).
Have you fought for the sake of God?You say no.Then you should have your mouth shot.
Emir Zarqawi, may God save him.Eagle of Iraq, volcano of Jihad, and the beheader.
Poster depicting terrorist leader in Iraq, Abu Mus’ab Zarqawi. (Source: http://www.islamic-f.net/vb/)
78
Result Discussions: Sample Media Provided by Terrorists
• Audio/video recordings from terrorist leaders as well as other extremist religious teachings.
A list of audio streams from the website of extremist cleric sheikh Hamed Al Ali. The audio files consist of preaching in the Salafi ideology and political issues.(Source: http://www.h-alali.net)
Kashmiri Jihad and the conference for recognizing Al-Taiba Pakistani extremist organization.
Anbaar Iraqi terrorism websites, audio section. Presents holy war songs and hymns.(Source: http://www.anbaar.net/audio/)
79
Analysis Results: Web Interactivity
• At the overall Web interactivity level, terrorist Web sites do not show significant differences from US government Web sites.
• At one-to-one interaction level, the government Web sites are doing significantly better by providing their contact information (e.g., email, mail address, etc.) on their sites.
• However, terrorist Web sites are doing much better in supporting community-based interaction by providing online forums and chat rooms; while few government Web sites do.
• We did not identify transaction-based interaction in terrorist Web sites, although such interaction might be hidden in their sites.
Attributes | Weighted Average Score (US) | Weighted Average Score (Terrorists) | t-Test Result
One-to-one | 0.342857 | 0.292169 | 0.024*
Community | 0.028571 | 0.168675 | 0.0025**
Transaction | 0.3 | Not presented |
Average (Transaction not included) | 0.185714 | 0.230422 | 0.056
Table 6. Web Interactivity Comparison Results
80
Result Discussions: Web Interactivity
• Terrorists use guest books and forums intensively to facilitate the communications among themselves and their supporters.
The Qalaa forum, one of the largest terrorist forums, has tens of thousands of threads and hundreds of thousands of replies. (Source: http://www.qal3ati.net/)
An Al Qaeda guest book with 176 signatures (Source: http://www.alfida.jeeran.com/)
Welcome to the guest book of the Fida’ Website (Website of Sacrifice)
82
Authorship Identification Characteristics
• Features
  – Attributes or writing style features that are the most effective discriminators.
  – Lexical
    • Word or character-based measures (e.g., sentence length, vocabulary richness, etc.).
  – Syntactic
    • Sentence-level writing style (e.g., punctuation, function words).
  – Structural
    • Text organization and layout (de Vel et al., 2001)
  – Content Specific
    • Keywords on specific topics (Martindale & McKenzie, 1995)
• Techniques
  – Analytical methods used to discriminate between authors.
  – Machine learning approaches typically outperform statistical methods due to greater computational power and ability to handle noisy data.
• Parameters
  – Number of categories and number of records per category used in experiments.
  – Generally, there will be some degree of drop-off in performance as the number of authors increases.
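To make the lexical and syntactic feature types concrete, here is a small Python sketch that computes a handful of such measures for one message. It is an illustration only: the function name, the five-word function word list, and the simple regular expressions are assumptions, and the feature sets described in these slides are far larger.

```python
import re
import string

# Tiny illustrative function word list (an assumption, not the real list).
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def extract_features(text):
    """Compute a few lexical and syntactic style measures for one message."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        # Lexical, word-based measures.
        "avg_word_length": sum(len(w) for w in words) / total,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "vocab_richness": len(set(words)) / total,  # type/token ratio
        # Syntactic measures.
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "function_word_freq": {
            fw: words.count(fw) / total for fw in FUNCTION_WORDS
        },
    }


features = extract_features("The cat sat. The dog ran to the cat!")
```

Each message becomes one such feature vector, and a classifier is then trained on vectors labeled with their authors.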
83
Online and Multilingual Messages
• Online Messages
  – Increasingly popular area due to growing misuse of the Internet (cyber crime).
  – Email
    • Objective is to classify a set of emails as belonging to a particular author.
    • de Vel et al., 2000, 2001
  – Web Forums
    • Attribute authorship of posted messages in chat groups.
    • Zheng et al., 2005; Li et al., 2005; Abbasi & Chen, 2005
  – Online Newspapers
    • Evaluated online newspaper corpus.
    • Stamatatos et al., 2001
• Multilingual Content
  – Applying authorship analysis techniques across different languages.
  – Greek newspaper corpus.
    • Stamatatos et al., 2001
  – Greek, Chinese, and English novels.
    • Peng et al., 2003
  – Chinese and English Web forum messages.
    • Zheng et al., 2005
  – Russian novels
    • Khemeniv, 2003
84
Arabic Feature Extraction Component
Figure: Arabic feature extraction component, with steps numbered 1 to 4. Elements shown: Incoming Message, Elongation Filter (Count +1, Degree +5), Filtered Message, Root Dictionary, Root Clustering Algorithm, Similarity Scores (SC), max(SC)+1, Generic Feature Extractor, All Remaining Feature Values, Feature Set.
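Arabic elongation is written with the tatweel (kashida) character, U+0640, which stretches a word without changing its meaning. A minimal Python sketch of an elongation filter follows. Reading "Count" as the number of elongated words and "Degree" as the number of elongation characters is an assumption based on the figure labels, and the function names are hypothetical.

```python
TATWEEL = "\u0640"  # Arabic elongation (kashida) character


def filter_elongation(word):
    """Strip elongation characters; return (clean word, number removed)."""
    removed = word.count(TATWEEL)
    return word.replace(TATWEEL, ""), removed


def filter_message(message):
    """Return (filtered message, elongated word count, elongation degree).

    The count is the number of words containing elongation; the degree is
    the total number of elongation characters (assumed interpretation).
    """
    filtered_words = []
    count = 0
    degree = 0
    for word in message.split():
        clean, removed = filter_elongation(word)
        if clean:
            filtered_words.append(clean)
        if removed:
            count += 1
            degree += removed
    return " ".join(filtered_words), count, degree


# First word carries two elongation characters; the second carries none.
message = "\u0633\u0640\u0640\u0644\u0627\u0645 \u0643\u062a\u0628"
filtered, count, degree = filter_message(message)
```

Filtering first means that later steps, such as root matching, see the normalized word, while the count and degree are kept as style features in their own right.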
85
Arabic Feature Set
Figure: Arabic feature set hierarchy (418 features in total).
Lexical (79): Char-Based (48) = Char-Level (4), Letter Frequency (35), Special Char. (9); Word-Based (31) = Word-Level (6), Vocab. Richness (8), Word Length Dist. (15), Elongation (2)
Syntactic (262): Punctuation (12), Function Words (200), Word Roots (50)
Structural (62): Word Structure (48), Technical Structure (14)
Content Specific (15): Race/Nationality (11), Violence (4)
Structural sub-features shown: Message Level, Paragraph Level, Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks
86
An Authorship Identification Framework
Figure: authorship identification framework in three stages.
  – Collection: Web forum messages (text format and HTML format) are collected from the Web and the Dark Web.
  – Extraction: the elongation filter, root dictionary, and clustering algorithm produce filtered words and word root feature values; writing features and technical structure features are extracted as feature values. Feature types in the feature set: Lexical, Syntactic, Structural, Content.
  – Experiment: Experiments 1 to 4 measure SVM and C4.5 accuracy; pair-wise t-tests are used for feature set relevance, technique relevance, and predictive ability.
87
Experiment Results
Features | English C4.5 | English SVM | Arabic C4.5 | Arabic SVM
F1 | 85.76% | 88.0% | 61.27% | 87.77%
F1+F2 | 87.23% | 90.77% | 65.40% | 91.00%
F1+F2+F3 | 88.30% | 96.5% | 71.23% | 94.23%
F1+F2+F3+F4 | 90.10% | 97.00% | 71.93% | 94.83%
Chart: accuracy (50 to 100%) of C4.5 and SVM by feature set (F1, F1+F2, F1+F2+F3, F1+F2+F3+F4) for the English and Arabic datasets.
Summary of Previous Authorship Visualization Studies
• All previous studies used n-grams.
• None of the previous studies used an automated technique for evaluating the visualizations.
• None of the studies were applied to online messages.
• There is no indication of whether the techniques can be successfully applied in a multilingual setting, such as in cyberspace.
Study | Type | Visualization Name | Features | Techniques | Dataset | Evaluation
Kjell et al., 1994 | Authorship Identification | Nebulas, Histograms | N-grams | PCA, Cosine similarity | Federalist papers | Manual
Shaw et al., 1999 | Authorship Identification | SFA | N-grams | PCA | Biblical Texts | Manual
Ribler & Abrams, 2000 | Similarity Detection | Patterngrams | N-grams | Matching Algorithm | Student Programs | Manual
90
Authorship Visualization Process Design
Figure: authorship visualization process.
  – Stages: Collect Messages, Extract Features, Reduce Dimensionality, Generate Visualization Data, Create Visualizations, Perform Analysis.
  – Components: The Web, Storage, Feature Set, Feature Extractor, Principal Component Analysis, Entropy Based Feature Selection, Dynamic Sliding Window Algorithm, Ink Blot Algorithm, Writeprints, Ink Blots, Identification, Authentication.
  – Data passed between components: input messages, extracted messages, feature usage values, feature vectors, eigenvectors, key features, pattern coordinates, blot sizes/colors.
91
Writeprints Using PCA
• Transform by multiplying feature usage vectors with eigenvectors.
  – The sum of the product of the primary eigenvector and the feature vector is the x-coordinate.
  – The sum of the product of the secondary eigenvector and the feature vector is the y-coordinate.
• Plot transformations onto 2D/3D plane.
• Sliding Window Algorithm (Kjell et al., 1994)
  – An iterative algorithm used to generate more data points in order to create better writing patterns for text documents by capturing usage variations at a finer level of granularity.
Sliding Window Algorithm Illustration
1. Compute eigenvectors for 2 principal components of the feature group.
2. Extract feature usage vectors Z from the message text.
3. Transform into 2-dimensional space: x = Zx, y = Zy.
Repeat steps 2 and 3.
Sample feature usage vectors shown: 1,0,0,2,1,2 and 0,1,3,0,1,0
Sample eigenvector values shown: 0.533 0.956 -0.541 0.445 0.034 0.089 0.653 0.456 0.975 -0.085 0.143 -0.381
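The transformation can be sketched in Python as a dot product per window. This is a simplified illustration, not the presented implementation: the function names, the window and step sizes, and the use of raw character counts as feature usage are assumptions, and the eigenvectors are taken as given rather than computed by PCA. The twelve eigenvector values from the illustration are read here as two 6-dimensional eigenvectors, which is also an assumption about the original layout.

```python
def feature_usage(window_text, features):
    """Count how often each feature string occurs in the window."""
    return [window_text.count(f) for f in features]


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def writeprint_points(text, features, eig_x, eig_y, window=100, step=20):
    """Slide a window over the text and project each usage vector to 2D."""
    points = []
    last_start = max(len(text) - window, 0)
    for start in range(0, last_start + 1, step):
        z = feature_usage(text[start:start + window], features)
        points.append((dot(z, eig_x), dot(z, eig_y)))
    return points


# One feature usage vector and the eigenvector values from the illustration.
z = [1, 0, 0, 2, 1, 2]
eig_x = [0.533, 0.956, -0.541, 0.445, 0.034, 0.089]
eig_y = [0.653, 0.456, 0.975, -0.085, 0.143, -0.381]
x, y = dot(z, eig_x), dot(z, eig_y)

# Toy text with two special-character features and identity eigenvectors.
points = writeprint_points("a-b_c-d", ["-", "_"], [1, 0], [0, 1],
                           window=4, step=2)
```

Each window contributes one point, so a longer message yields a denser pattern; this is why very short messages give the technique too few points to work with.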
93
Selected Feature Groups
• Based on these criteria, the categories selected are highlighted.
• Function words were omitted since there were too many.
• Structural features could not be captured using the sliding window, so they were transformed using feature vectors at the message level.
Feature Group English Arabic
Char-Level Lexical 6 4
Letter Usage 26 35
Special Char. 21 15
Word-Level Lexical 6 6
Word Length 20 15
Punctuation 8 12
Function Words 150 250
Structural 14 14
Content Specific 15 15
Vocab. Richness 8 8
Interpreting Writeprint
Feature x y
~ 0 0
@ 0.022814 -0.01491
# 0 0
$ -0.01253 -0.17084
% 0 0
^ -0.01227 -0.01744
& -0.01753 -0.0777
* -0.03017 -0.05931
- -0.12656 0.991784
_ 0.998869 0.047184
= -0.05113 -0.07576
+ 0.142534 0.021726
> -0.1077 0.392182
< -0.10618 0.213193
[ 0 0
] 0 0
{ 0 0
} 0 0
/ -0.05075 -0.09065
\ 0 0
| -0.05965 0.428848
Special Char. Eigenvectors
Figure: Special Char. Writeprints for Author A, Author B, Author C, and Author D.
Determining Blot Size and Color
• The size of a blot is proportional to the ratio of entropy reduction to message length (except for structural features).
  – Done to compensate for biases in favor of lengthier messages (blot overflow).
  – For example, letter/word/punctuation usage is greater in longer messages.
• Color is based on feature usage. Heavy usage is red, low usage is blue, and everything in between is yellow.
• Thus, correct author-message matches should result in predominantly red ink blot patterns (“hot”) and a minimal amount of blue (“cold”).
Size: $d \propto e / c$, where d = blot size, c = message length in characters, and e = entropy reduction
Tuning Blot Colors
• The color settings for each feature are “tuned” on the training set (by optimizing settings of q1 and q3).
  – This is done by maximizing the ratio of red to blue area in correct messages and maximizing the ratio of blue to red in incorrect messages.
  – The terms “low”, “medium” and “high” are defined based on usage rank thresholds set by q1 and q3.
  – Since decision trees tend to pick outliers (in order to maximize entropy reduction/info. gain), this approach works well.
Figure: feature usage scale from min to max, divided at q1 and q3 into Low, Medium, and High ranges, shown for the feature's initial setting and for its tuned setting.
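The blot size and color rules can be put together in a short Python sketch. Several details are assumptions: the scale constant, treating d as a diameter so that area grows with its square, the handling of values exactly at q1 or q3, and all function names. The tuning of q1 and q3 on a training set is not shown.

```python
def blot_size(entropy_reduction, message_length, scale=1.0):
    """Size d proportional to e / c (scale is a hypothetical constant)."""
    return scale * entropy_reduction / message_length


def blot_color(usage, q1, q3):
    """Map feature usage to a color using the q1 and q3 thresholds."""
    if usage > q3:
        return "red"     # heavy usage
    if usage < q1:
        return "blue"    # low usage
    return "yellow"      # everything in between


def red_blue_ratio(blots):
    """Ratio of total red area to total blue area.

    blots: list of (size, color) pairs; area is taken as size squared,
    dropping constant factors (an assumption about how d maps to area).
    """
    red = sum(size ** 2 for size, color in blots if color == "red")
    blue = sum(size ** 2 for size, color in blots if color == "blue")
    return red / blue if blue else float("inf")


size = blot_size(entropy_reduction=0.5, message_length=250)
ratio = red_blue_ratio([(2.0, "red"), (1.0, "blue"), (3.0, "yellow")])
```

Under this reading, the candidate author whose blots give the largest red-to-blue ratio for a message would be selected as its most likely writer.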
100
Ink Blots: Al-Aqsa Martyr Dataset
This image shows 10 potential authors for a single message. Using Ink Blots, we can easily identify the correct author (the one with the greatest ratio of red/blue blots).
101
Evaluating Visualization Techniques
Figure: the authorship visualization process diagram from the “Authorship Visualization Process Design” slide, repeated. Stages: Collect Messages, Extract Features, Dimensionality Reduction, Generate Visualization Data, Create Visualizations, Perform Analysis.
102
Writeprint Results
Forum | Writeprint (10-message groups) | SVM (10-message groups) | Writeprint (5-message groups) | SVM (5-message groups) | Writeprint (1-message groups*) | SVM (1-message groups*)
USENET Software | 100.00% | 50.00% | 95.00% | 55.00% | 76.19% | 93.00%
White Knights of KKK | 100.00% | 60.00% | 100.00% | 65.00% | 85.14% | 94.00%
Al-Aqsa Martyrs | 100.00% | 50.00% | 90.00% | 60.00% | 68.89% | 87.00%
* It should be noted that for individual messages, Writeprint was not able to perform on messages shorter than 250 characters (approximately 35 words) due to the need to maintain a minimum sliding window size and gather sufficient data points for the evaluation algorithm. The table below shows the number of single messages classified out of the testing set of 100 per forum.
Forum | Messages Classifiable
USENET Software | 53
White Knights of KKK | 60
Al-Aqsa Martyrs | 74
103
Ink Blot Results
Forum | Ink Blots (shorter messages, < 200 characters) | SVM (shorter messages, < 200 characters) | Ink Blots (all test messages) | SVM (all test messages)
USENET Software | 97.87% | 94.59% | 95.00% | 93.00%
WK of the KKK | 97.50% | 92.31% | 88.00% | 94.00%
Al-Aqsa Martyrs | 69.23% | 84.62% | 75.00% | 87.00%
• In comparing the Ink Blots to SVM, the Ink Blots technique outperformed SVM on the USENET dataset but was outperformed overall when testing all messages.
• When evaluating the shorter test messages of length less than 200 characters (the messages unclassifiable by Writeprints), the Ink Blots tended to outperform SVM.
• Overall, the Ink Blot technique did not work as well on the Arabic messages.
  – This could be attributable to the inability of the entropy-based feature selection technique to identify features that were clear-cut enough to distinguish authors within the Al-Aqsa Martyrs forum.
105
• Ensuring Data Security and Confidentiality
• Reaching Agreements among Partners
• The COPLINK Chronicle
• Future Directions
The Partnership and Collaboration Framework
106
• The Department of Homeland Security has proposed to establish a network of research centers across the nation
  – To create a multidisciplinary environment for developing technologies to counter various threats to homeland security
• A variety of barriers need to be addressed, including:
  – Security and confidentiality
    • Data regarding crimes, criminals, terrorist organizations, and potential terrorist attacks may be highly sensitive and confidential
    • Improper use of data could lead to fatal consequences
  – Trust and willingness to share information
    • Different agencies may not be motivated to share information and collaborate if there is no immediate gain
    • Fear that information being shared would be misused, resulting in legal liabilities
  – Data ownership and access control
    • Who owns a particular data set? Who is allowed to access, aggregate, or input data? Who owns the derivative data (knowledge)?
Introduction
107
• The NSF COPLINK Center at the Artificial Intelligence (AI) Lab of the University of Arizona is intended to become a part of the national network of ISI research laboratories.
– The COPLINK Center is a leading NSF research center for law enforcement and intelligence information and knowledge management
  – The COPLINK Center has encountered many of these non-technical challenges in its partnerships with various law enforcement and federal agencies, such as:
    • Tucson Police Department (TPD)
    • Phoenix Police Department (PPD)
    • Tucson Customs and Border Protection (CBP)
• We present some of our experiences and lessons learned in this section.
The NSF COPLINK Center
108
• At the COPLINK Center, we have taken the necessary measures to ensure data privacy, security, and confidentiality
  – Only law enforcement data are shared between agencies
  – All personnel who have access to law enforcement data are screened
    • Background information and fingerprints are checked by TPD investigators
    • All personnel sign a non-disclosure agreement (NDA) provided by TPD and take the Terminal Operator Certificate (TOC) test every year
    • Requirements are similar to those imposed upon non-commissioned civilian personnel in a police department
  – All law enforcement data reside behind a firewall and in a secure room accessible only by activated cards
  – When an employee stops working on projects involving these data:
    • Their card is de-activated
    • The NDA is perpetual and remains in effect
Ensuring Data Security and Confidentiality
109
• A sample individual user data license agreement was developed by university contracting officers and lawyers in several institutions and government agencies.
• Most of the terms and conditions are applicable to national security projects that demand confidentiality.
• It consists of the following sections:
  – Permitted Uses
  – Access to the Information
  – Indemnification
  – Delivery and Acceptance
A Sample Individual User Data License
110
• Agreements between agencies within their respective jurisdictions are required to receive advance approval from their governing hierarchy
  – This precludes informal information sharing agreements.
• Requirements varied from agency to agency according to the statutes by which they were governed.
  – The ordinances governing information sharing by the city of Tucson varied somewhat from those governing the city of Phoenix.
• Similar language existed in the ordinances and statutes governing this exchange, but the process varied significantly
• It appears as though the size of the jurisdiction is proportional to the level of bureaucracy required.
  – Negotiating a contract between University of Arizona and ARJIS (Automated Regional Justice Information System) of Southern California required six to nine months of discussion between legal staff, contract specialists, and agency officials.
Reaching Agreements among Partners
111
• TPD has recently developed a generic Inter-Governmental Agreement (IGA) that could be adopted between different law enforcement agencies.
  – The IGA was condensed from MOUs (Memoranda of Understanding), policies, and agreements that previously existed
  – The IGA was drafted in a generic manner, including language from those laws, but excluding reference to any particular chapter or section.
• Sharing of information between agencies with disparate information systems has also led to bridging boundaries between software vendors and agencies (their customers).
  – We ensured that non-disclosure agreements existed
  – We ensured that contract language assured compliance with the vendors’ licensing policies.
• We believe the MOU and IGA can be used as templates for information sharing agreements and contracts and serve as a component of an ISI partnership framework.
Inter-Governmental Agreement (IGA)
112
• Many agencies, partners, and individuals have contributed significantly to the success of this program
• The COPLINK System
  – Has been cited as a national model for public safety information sharing and analysis
  – Has been adopted in more than 150 law enforcement and intelligence agencies
  – Has been featured in the New York Times, Newsweek, Los Angeles Times, Washington Post, and Boston Globe, among others
  – Was selected as a finalist for the prestigious International Association of Chiefs of Police (IACP)/Motorola 2003 Webber Seavey Award for Quality in Law Enforcement
• The research has recently been expanded to border protection (BorderSafe), disease and bioagent surveillance (BioPortal), and terrorism informatics research (Dark Web), funded by NSF, CIA, and DHS
The COPLINK System
113
• September 1994-August 1998, NSF/ARPA/NASA, Digital Library Initiative (DLI) funding: Selected concept association and data mining techniques developed under the DLI program.
• July 1997-January 2000, DOJ, National Institute of Justice (NIJ) funding: Initial COPLINK research -- database integration and access for a law enforcement Intranet.
• January 2000, first COPLINK prototype: Developed and tested in Tucson Police Department.
• May 2000, Knowledge Computing Corporation (KCC) founded: KCC received venture capital funding and licensed COPLINK technology.
• November 2, 2002, DC Sniper investigation, New York Times: “An electronic cop that plays hunches.”
• April 15, 2003, Newsweek and ABC News: “Google for cops.”
• September 2003-August 2005, NSF, DHS, CNRI funding for BorderSafe project: Cross-jurisdictional information sharing and criminal network analysis.
• September 2003-August 2006, NSF, Digital Government Program funding for Dark Web project: Social network analysis and identity deception detection for law enforcement and homeland security.
• August 2004-July 2008, NSF, Information Technology Research (ITR) Program funding for BioPortal project: A national center of excellence for infectious disease informatics.
The COPLINK Chronicle
114
– The BorderSafe project
  • Continue to contribute to border safety and cross-jurisdictional criminal network analysis research
– The Dark Web project
  • Help create an invaluable terrorism research testbed
  • Develop advanced terrorism analysis methods
– The BioPortal project
  • Contribute to the development of a national or even international infectious disease and bioagent information sharing and analysis system
Future Directions
115
• New technologies should be developed in a legal and ethical framework without compromising privacy or civil liberties of private citizens.
• Large scale non-sensitive data testbeds consisting of data from diverse, authoritative, and open sources and in different formats should be created and made available to the ISI research community.
• The ultimate goal of ISI research is to enhance our national security. However, the question of how this type of research has impacted and will impact society, organizations, and the general public remains unanswered.
• Active ISI research will help improve knowledge discovery and dissemination and enhance information sharing and collaboration among academics, local, state, and federal agencies, and industry, thereby bringing positive impacts to all aspects of our society.
Conclusions and Future Directions
116
• Tucson Police Department
• Phoenix Police Department
• Pima County Sheriff Department
• Tucson Customs and Border Protection
• San Diego, Automated Regional Justice Information Systems (ARJIS)
• Corporation for National Research Initiatives (CNRI)
• California Department of Health Services
• New York State Department of Health
• United States Geological Survey
• Library of Congress
• San Diego Supercomputer Center (SDSC)
• National Center for Supercomputing Applications (NCSA)
Acknowledgements
117
For more information:
AI Lab web site: http://ai.arizona.edu