Almaden Research Center
© 2006 IBM Corporation
IOP ’06: Open Source Intelligence Lessons Learned
Issues in using open source for intelligence
Growth and complexity of heterogeneous content
Not all open source data is equal – Quantitative vs. Qualitative
Requirements of Ecoinformatics Architectures
Digital content is growing at a dramatic rate
[Figure: growth chart; x-axis: Years]
10^24 bytes = 1 trillion terabytes of data, which is equivalent to all the information consumed visually by all humans in a year
Source: IBM 2005 GTO
The scale of open source data and its heterogeneous form increase the complexity of extracting intelligence
[Figure: data volume scale with marks at 10^9, 10^12, 10^15, 10^21, 10^24, 10^27; data labels: Storage online, Medical data stored, Personal multimedia, Surveillance bytes, Photos multimedia; other labels: Scalable, Heterogeneity, Intelligence, Structured data, Free form text]
Source: IBM 2005 GTO
Industry Publication
Company Internal Content
Company Publication
Industry Journals
Conference Proceedings
NGO Publications
Website affiliated with an organization
User Groups / Forums
Newsletters
Content Aggregators
News & Press Releases
Legal Filings
Government Publications
Blogs / Weblogs
Non-affiliated Websites
[Figure: the sources above arranged along an axis from Qualitative to Quantitative, with a bracket labeled "sources in the periphery"]
Open Source Intelligence from the periphery requires an understanding of its topology, including strengths and weaknesses
These are authoritative sources, where data is trusted and is defended
These are credentialed opinions; the source is known and can be weighted
Open opinion; it is impossible to verify the authority of the source
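The idea that a known source "can be weighted" can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the numeric weights, the `Mention` type, the `weighted_support` helper, and the assignment of the sample sources to tiers are not taken from the slides.

```python
from dataclasses import dataclass

# Hypothetical tier weights. The slide only says that credentialed
# sources "can be weighted"; it does not give any numbers.
TIER_WEIGHTS = {
    "authoritative": 1.0,  # trusted and defended data
    "credentialed": 0.6,   # known source, opinion
    "open": 0.2,           # authority cannot be verified
}


@dataclass
class Mention:
    source: str
    tier: str       # one of the keys in TIER_WEIGHTS
    supports: bool  # True if the item supports the claim under review


def weighted_support(mentions):
    """Return the weighted share of supporting mentions in [0, 1].

    Returns None when there are no mentions to weigh.
    """
    total = sum(TIER_WEIGHTS[m.tier] for m in mentions)
    if total == 0:
        return None
    support = sum(TIER_WEIGHTS[m.tier] for m in mentions if m.supports)
    return support / total


# Invented sample data; the tier chosen for each source is an assumption.
mentions = [
    Mention("government publication", "authoritative", True),
    Mention("industry journal", "credentialed", True),
    Mention("anonymous blog", "open", False),
    Mention("forum post", "open", False),
]
score = weighted_support(mentions)
```

In this toy case two of the four mentions disagree with the claim, but both are open opinion, so the weighted score stays well above one half. A plain count would have reported an even split.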
Ecoinformatics Architectures need to be multi-layered
[Figure: layered architecture diagram; recoverable labels listed below]
Customer Applications: Performance Management, Drug Research, Business Insights Workbench
Applications: Network Associations, Search, Topic Tracking, Buzz Analysis
Cross-Page Annotators: Classification, Clustering, Communities, Ranking
Per-Page Annotators: Auto Entity Spotters, Auto Geography Spotter, Porn & Dup Detection, Customer Taxonomy Spotter, Date Spotters, Language Spotters, Source Spotters
DATA ACQUISITION: Un-Structured Data, Structured Data
Data sources: World Wide Web, Blogs, Newspapers, Licensed Feeds, Data Bases, Intranet Data, Taxonomies, Commercial Data Bases
Other labels: Parsing/Tokenizing, Annotation, Searching, Index, Store, Natural Clustering, Affinity Analysis, Snippet Analysis, Trending
Scale labels: 10's, 100's, 1000's (pages/second); axis labels: Relevancy, Volume
Product labels: WebFountain, Business Insights Workbench, WS OmniFind II
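The split between per-page and cross-page annotators can be sketched as a two-stage pipeline. This is a minimal illustration under stated assumptions, not the architecture's actual code: the function names, the dictionary-based page format, the naive matching rules, and the sample pages are all hypothetical.

```python
import re
from collections import defaultdict


# Per-page annotators: each one looks at a single page in isolation.
def spot_dates(page):
    # Hypothetical date spotter: only finds four-digit years.
    return re.findall(r"\b(?:19|20)\d{2}\b", page["text"])


def spot_entities(page, known_entities):
    # Hypothetical entity spotter: exact match against a supplied list.
    return [e for e in known_entities if e in page["text"]]


def annotate_page(page, known_entities):
    return {
        "url": page["url"],
        "dates": spot_dates(page),
        "entities": spot_entities(page, known_entities),
    }


# Cross-page annotator: needs the annotations of many pages at once.
def cluster_by_entity(annotated_pages):
    clusters = defaultdict(list)
    for ann in annotated_pages:
        for entity in ann["entities"]:
            clusters[entity].append(ann["url"])
    return dict(clusters)


# Invented sample pages and entity names, for illustration only.
pages = [
    {"url": "http://example.org/a",
     "text": "Acme Corp and Example University met in 2005."},
    {"url": "http://example.org/b",
     "text": "A 2006 report mentions Acme Corp."},
]
known = ["Acme Corp", "Example University"]

annotated = [annotate_page(p, known) for p in pages]
clusters = cluster_by_entity(annotated)
```

Per-page steps depend on nothing but their own page, so they can run in parallel across a crawl. Cross-page steps such as clustering or ranking have to wait for the per-page annotations.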
Finding intelligence can require a different view of the same information
Open Source Trend on Web
[Figure: # of Web Pages (000) by month, Jan to Dec 2005; annotation: "Some event happened in August"]
[Figure: # of Web Pages (000) by Year, 2001 to 2005]
[Figure: % of OSI web documents (0.0% to 4.5%) by person: Congressman Rob Simmons, Douglas Rushkoff, Eliot Jardines, Major General Patrick Cammaert, Mr Arno Reuser, Robert Steele; annotation: "One dominant voice"]
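As a rough sketch of "different views of the same information", the same set of documents can be aggregated once by month and once by person. The records and names below are invented toy data, and the helpers `by_month` and `share_by_person` are hypothetical, not part of any tool named in the slides.

```python
from collections import Counter

# Toy records, invented for illustration: (month, person mentioned).
docs = [
    ("Jan", "Person A"),
    ("Aug", "Person A"),
    ("Aug", "Person A"),
    ("Aug", "Person B"),
    ("Sep", "Person C"),
    ("Aug", "Person A"),
]


def by_month(records):
    """View 1: document volume over time."""
    return Counter(month for month, _ in records)


def share_by_person(records):
    """View 2: each person's share of all documents."""
    counts = Counter(person for _, person in records)
    total = sum(counts.values())
    return {person: count / total for person, count in counts.items()}


monthly = by_month(docs)
shares = share_by_person(docs)

peak_month = monthly.most_common(1)[0][0]   # month with the most documents
dominant = max(shares, key=shares.get)      # person with the largest share
```

The time view shows that something happened in one month. The per-person view shows who the documents are about. Neither view alone gives both answers.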
Context
Query                                                                   Results
Robert Steele                                                           6,440,000
"Robert Steele"                                                         170,000
"Robert Steele" and Open Source Intelligence                            2,400
"Robert David Steele" and "Open Source Intelligence" within 5 words     73
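The narrowing effect of a "within 5 words" constraint can be sketched with a small proximity matcher. This is an illustration only: search engines define proximity in different ways, and the definition used here (at most `max_gap` words between the two phrases), the function names, and the sample sentences are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(tokens, phrase):
    """Return the start index of every exact occurrence of the phrase."""
    words = tokenize(phrase)
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def within(text, phrase_a, phrase_b, max_gap=5):
    """True if both phrases occur with at most max_gap words between them."""
    tokens = tokenize(text)
    len_a = len(tokenize(phrase_a))
    len_b = len(tokenize(phrase_b))
    for a in phrase_positions(tokens, phrase_a):
        for b in phrase_positions(tokens, phrase_b):
            if a <= b:
                gap = b - (a + len_a)  # words between end of A and start of B
            else:
                gap = a - (b + len_b)  # words between end of B and start of A
            if 0 <= gap <= max_gap:
                return True
    return False


# Invented sample sentences, for illustration only.
near = "Robert David Steele spoke about open source intelligence"
far = ("Robert David Steele one two three four five six seven eight "
       "open source intelligence")

near_match = within(near, "Robert David Steele", "Open Source Intelligence")
far_match = within(far, "Robert David Steele", "Open Source Intelligence")
```

Both sample sentences contain both phrases, so a plain "and" query would return both. Only the first passes the proximity test, which is the kind of context filtering that takes the result counts from millions down to tens.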
[Figure: Network of Conference Attendees to auto-spotted Companies and Universities]
In this network view we don’t care about association with “Open Source Intelligence”, only about association with companies and universities
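A network view of this kind can be approximated by linking each attendee to every organization that appears on the same page. The sketch below assumes naive substring spotting and page-level co-occurrence. The `build_network` helper, the attendee and organization names, and the sample pages are all hypothetical.

```python
from collections import defaultdict


def build_network(pages, attendees, organizations):
    """Link each attendee to the organizations that share a page with them."""
    edges = defaultdict(set)
    for text in pages:
        people = [p for p in attendees if p in text]
        orgs = [o for o in organizations if o in text]
        for person in people:
            edges[person].update(orgs)
    # Sort for stable, readable output.
    return {person: sorted(orgs) for person, orgs in edges.items()}


# Invented sample data, for illustration only.
attendees = ["Attendee A", "Attendee B"]
organizations = ["Acme Corp", "Example University"]
pages = [
    "Attendee A of Example University presented with Acme Corp.",
    "Attendee B joined Acme Corp.",
    "Attendee A wrote about open source intelligence.",
]

network = build_network(pages, attendees, organizations)
```

The third page mentions the topic but no organization, so it adds no edges. That matches the point of this view: the association of interest is with companies and universities, not with the topic.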
Conclusions on Open Source Intelligence
Computers don’t create intelligence, people do – computers enable smart people
Not all open source content is equal – know the sources
Not everything you see is right – it’s all about the CONTEXT
Ecoinformatics architecture supports:
- Large-scale analytics of open source content
- Integration of content other than open source
- Powerful text analytic tools to support analysis of on-topic stores