Snapshot of Semantic Web Commercial State of the Art
(presented at Science on the Semantic Web, Rutgers, October 2002)
Amit Sheth
CTO, Semagix Inc.
Large Scale Distributed Information Systems (LSDIS) Lab, University of Georgia; http://lsdis.cs.uga.edu
October 24, 2002 © Amit Sheth
Based on Keynote: CONTENT- AND SEMANTIC-BASED INFORMATION RETRIEVAL @ SCI 2002
I am not selling any product here.
It is interesting to note that SW = Software has moved to SW = Semantic Web.
Fundamental Issue
• Ontology creation and maintenance
– Human consensus + automatic KB (assertion) extraction
• Automatic semantic annotation
• Extremely fast computations exploiting semantic metadata
– Especially named relationships
Central Role of Metadata
Produce, Aggregate: Where is the content? Whose is it?
Catalog/Index: What is this content about?
Integrate, Syndicate: What other content is it related to?
Personalize: What is the right content for this user?
Interactive Marketing: What is the best way to monetize this interaction?
Broadcast, Wireline, Wireless, Interactive TV
Semantic Metadata
Applications, Back End
"A Web content repository without metadata is like a library without an index." - Jack Jia, IWOV
"Metadata increases content value in each step of content value chain." - Amit Sheth
A Metadata Classification
Data (Heterogeneous Types/Media)
Content Independent Metadata (creation-date, location, type-of-sensor...)
Content Dependent Metadata (size, max colors, rows, columns...)
Direct Content Based Metadata (inverted lists, document vectors, LSI)
Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
Domain Specific Metadata (area, population (Census); land-cover, relief (GIS); metadata concept descriptions from ontologies)
Ontologies, Classifications, Domain Models
User
More Semantics for Relevance to tackle Information Overload!!
Semantic Metadata Extraction, Semantic Annotation
WWW, Enterprise Repositories
Nexis, UPI, AP
Feeds/Documents
Digital Maps
Digital Audios
Digital Videos
Digital Images
Data Stores
METADATA EXTRACTORS
Key challenge: Create/extract as much (semantic) metadata automatically as possible
Semantic Content Organization and Retrieval Engine (SCORE) technology
• Automatically aggregates and extracts information from disparate sources and multiple formats
• Automatically tags/annotates and categorizes content
• Automatically creates relevant associations
  - Maps content topics and their relationships
• Semantic query engine relates information and knowledge both internal and external to the organization into a single view
Semagix Freedom Product Components
Market Guide (MG)
ZDNet (ZD)
Hoover’s (H)
Data supplied from NASA (DPL)
Federation of American Scientists (FAS)
Central Intelligence Agency (CIA)
The Interdisciplinary Center (ICT)
Federal Bureau of Investigation (FBI)
Capital Advantage (CA)
Office of Foreign Assets Control (OFAC)
PERSON (OFAC, FBI, DPL)
-politician (OFAC, FBI, CIA, CA)
politician associated with politicalOrganization
politician held politicalOffice
politician associated with politicalOffice
-terrorist (OFAC, FBI, DPL)
terrorist memberOf organization
terrorist appears on watchList
-companyExecutive (MG)
companyExecutive holdsOffice companyPosition
person has permanent address address (OFAC, FBI)
person has dob(date of birth) (OFAC, FBI)
person has pob(place of birth) (OFAC, FBI)
Knowledge Sources Used
THING
-event (ICT)
terroristOrganization participated in terroristSponsoredEvent (ICT)
-politicalOffice (CIA, CA)
politicalOffice office(s) within govtOrganization
politicalOffice associated with organization
-watchList (OFAC, FBI, DPL)
terroristOrganization appears on watchList (OFAC, FBI, DPL)
-organization (OFAC, FBI, FAS, ICT, CA, CIA)
organization appears on watchList
organization memberOf suborganization
-company
company manufactures product (ZD)
company identifiedBy tickerSymbol (H)
companyPosition position in company (MG)
company memberOf industry (H)
-tickerSymbol (H)
tickerSymbol memberOf exchange (H)
PLACE
-organization located in place (H, OFAC)
-religiousAffiliation practiced in place (CIA)
-company headquarters in city (H)
Entity Classes and Relationships populated by these knowledge sources:
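
One way to picture the entity classes and named relationships listed above is as subject-relationship-object triples that can be traversed by relationship name. The following is a minimal sketch in Python, not Semagix's implementation: the relationship names (memberOf, appearsOn) follow the slides, while the instance names (person_x, org_a, ...) and the helpers build_index and watchlists_for are hypothetical.

```python
from collections import defaultdict

# (subject, relationship, object) triples. Relationship names follow the slides;
# all instance names are made up for illustration.
TRIPLES = [
    ("person_x", "memberOf", "org_a"),
    ("org_a", "memberOf", "org_b"),          # nested organization
    ("org_b", "appearsOn", "watchlist_1"),
    ("person_y", "appearsOn", "watchlist_2"),
]

def build_index(triples):
    """Index objects by (subject, relationship) for direct lookup."""
    index = defaultdict(list)
    for subject, relationship, obj in triples:
        index[(subject, relationship)].append(obj)
    return index

def watchlists_for(entity, index, max_depth=3):
    """Collect watchlists reachable from entity via at most max_depth memberOf hops."""
    found = set()
    frontier = [entity]
    seen = {entity}
    for _ in range(max_depth + 1):
        next_frontier = []
        for node in frontier:
            found.update(index.get((node, "appearsOn"), []))
            for org in index.get((node, "memberOf"), []):
                if org not in seen:
                    seen.add(org)
                    next_frontier.append(org)
        frontier = next_frontier
    return found

index = build_index(TRIPLES)
print(watchlists_for("person_x", index))
```

Here person_x is linked to a watchlist only through a nested organization, which is the kind of association a plain name match against a watchlist would miss.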
JIVA
Video with Editorialized Text on the Web
Auto Categorization
Semantic Metadata
Automatic Categorization & Metadata Tagging (unstructured text)
Extraction Agent
Enhanced Metadata Asset
Semantic Metadata Extraction/Annotation: Semi-structured source
Web Page
Semantic Metadata
Syntax Metadata
Semantic Content Enhancement Workflow
Enabling powerful linking of actionable information and facilitating important semantic applications such as knowledge discovery and link analysis
(user’s task of manually retrieving all the information he needs to know is greatly minimized; he can spend more time making effective decisions)
Semantic Metadata (Content Tags)
Company: Cisco Systems, Inc.
Classification: Channel Partners, E-Business Solutions
Channel Partner: Siemens Network
Channel Partner: Voyager Network
Channel Partner: Wipro Group
E-Business Solution: CIS-1270 Security
E-Business Solution: CIS-320 Learning
E-Business Solution: CIS-6250 Finance
E-Business Solution: CIS-1005 e-Market
Ticker: CSCO
Industry: Telecommunication, . . .
Sector: Computer Hardware
Executive: John Chambers
Competition: Nortel Networks
Syntactic Metadata
Producer: BusinessWire
Source: Bloomberg
Date: Sept. 10 2001
Location: San Jose, CA
URL: http://bloomberg.com/1.htm
Media: Text
XML content item with enriched semantic tagging, ready to be queried
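
As a rough illustration of what such an XML content item might look like, the sketch below builds one with Python's standard xml.etree.ElementTree and then queries it by tag name. The element and attribute names (contentItem, semanticMetadata, syntacticMetadata, tag, name) and the helper build_content_item are assumptions made for this example; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET

def build_content_item(semantic, syntactic):
    """Build an XML content item from (name, value) metadata pairs.

    Element names are hypothetical; the real schema is not given in the slides.
    """
    item = ET.Element("contentItem")
    for section_name, pairs in (("semanticMetadata", semantic),
                                ("syntacticMetadata", syntactic)):
        section = ET.SubElement(item, section_name)
        for name, value in pairs:
            tag = ET.SubElement(section, "tag", {"name": name})
            tag.text = value
    return item

# A few of the tags from the example, as (name, value) pairs.
semantic = [
    ("Company", "Cisco Systems, Inc."),
    ("Ticker", "CSCO"),
    ("Competition", "Nortel Networks"),
]
syntactic = [
    ("Producer", "BusinessWire"),
    ("Source", "Bloomberg"),
    ("Media", "Text"),
]

item = build_content_item(semantic, syntactic)
xml_text = ET.tostring(item, encoding="unicode")

# Query the tagged item: find the ticker among the semantic tags.
tickers = [t.text for t in item.findall("./semanticMetadata/tag[@name='Ticker']")]
print(xml_text)
print(tickers)
```

Keeping semantic and syntactic metadata in separate sections mirrors the split shown in the example and lets a query target either kind on its own.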
E-Business Solution Ontology (diagram)
Company: Cisco Systems
Channel Partner: Voyager Network, Siemens Network, Wipro Group, Ulysys Group
E-Business Solution: CIS-1270 Security, CIS-320 Learning, CIS-6250 Finance, CIS-1005 e-Market
Other entity classes: Ticker, Industry, Sector, Competition, Executives
Relationship labels: belongs to, represented by, channel partner of, competes with, provider of, works for
Semantic Enhancement
Uniquely exploiting real-world semantic associations in the right context
Semantic Metadata Extraction (also syntactic)
Content Tags
Semantic Metadata
Classification: Channel Partners, E-Business Solutions
Company: Cisco Systems, Inc.
Syntactic Metadata
Producer: BusinessWire
Source: Bloomberg
Date: Sept. 10 2001
Location: San Jose, CA
URL: http://bloomberg.com/1.htm
Media: Text
Classification (Channel Partners, E-Business Solutions)
Content Tags
Semantic Metadata
Classification: Channel Partners, E-Business Solutions
Classification Committee: Knowledge-base, Machine Learning & Statistical Techniques
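
The slides name a classification committee that combines knowledge-base, machine learning and statistical techniques, but do not say how the members' outputs are combined. The sketch below assumes a simple majority vote; the three toy classifiers and the function committee_classify are hypothetical stand-ins, not the classifiers used in the product.

```python
from collections import Counter

def knowledge_base_classifier(text):
    # Hypothetical rule: a known partner name implies the category.
    return "Channel Partners" if "Siemens Network" in text else "Other"

def keyword_classifier(text):
    # Stand-in for a statistical technique: a single keyword test.
    return "Channel Partners" if "partner" in text.lower() else "Other"

def length_classifier(text):
    # Deliberately weak stand-in for a learned model.
    return "Other" if len(text) < 20 else "Channel Partners"

def committee_classify(text, classifiers):
    """Return the category with the most votes, plus the vote counts."""
    votes = Counter(classifier(text) for classifier in classifiers)
    category, _ = votes.most_common(1)[0]
    return category, dict(votes)

committee = [knowledge_base_classifier, keyword_classifier, length_classifier]
print(committee_classify("Cisco names Siemens Network as channel partner", committee))
```

With three members and two possible categories a vote cannot tie; a larger committee or more categories would need an explicit tie-breaking rule.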
Content Asset Index Evolution
Focused relevant content organized by topic (semantic categorization)
Automatic content aggregation from multiple content providers and feeds
Related relevant content not explicitly asked for (semantic associations)
Competitive research inferred automatically
Automatic 3rd party content integration
Semantic Application Example – Analyst Workbench
Related Stock News
Semantic Web – Intelligent Content
Industry News
Technology Products
COMPANY
SEC
EPA
Regulations
Competition
COMPANIES in Same or Related INDUSTRY
COMPANIES in INDUSTRY with Competing PRODUCTS
Impacting INDUSTRY or Filed By COMPANY
Important to INDUSTRY or COMPANY
Intelligent Content = What You Asked for + What you need to know!
Syntax Metadata
Semantic Metadata
led by
Same entity
Human-assisted inference
Knowledge-based & Manual Associations
Blended Semantic Browsing and Querying (Intelligence Analyst Workbench)
Innovations that affect User Experience
• BSBQ: Blended Semantic Browsing and Querying
– Ability to query and browse relevant desired content in a highly contextual manner
• Seamless access/processing of Content, Metadata and Knowledge
– Ability to retrieve relevant content, view related metadata, access relevant knowledge and switch between all the above, allowing the user to follow a train of thought
• dACE: dynamic Automatic Content Enhancement
– Ability to provide enhanced annotation features, allowing the user to retrieve relevant knowledge about significant pieces of content during content consumption
• Semantic Engine APIs with XML output
– Ability to create customized APIs for the Semantic Engine involving Semantic Associations with XML output to cater to any user application
Visionics, AcSys, Security Portal
Check-in
Interrogation
Boarding Gate, Airport, Airspace
Semagix Ontology, Metabase
Threat Scoring
Gov’t Watchlists, News Media
Web Info
LexisNexis, RiskWise
Passenger Records, Reservation Data
Airline Data, Airport Data
Airline and Airport Data, Future and Current Risks
Airport LEO
ARC AvSec Manager, Data Management
Data Mining
IPG
Sources Used
Knowledge Sources:
FBI - Most Wanted Terrorists
Denied Persons Lists
Terrorism Files
ICT
Office of Foreign Asset Control (OFAC)
Hamas terrorists
CNN Locations
FAA_Airport_Codes
About.com
Comtex_International
Hindustan Times
JerusalemPost
CNN
Newstrove_Hamas
Content Sources:
Africa News Service
AFX News – Asia/UK/Europe
AP Worldstream
Asia Pulse
BusinessWire
ComputerWire (CTW)
EFE News Services
FWN Select
Itar-TASS
Knight Ridder News (Open)
Knight-Ridder Open
M2 - International
M2 Airline Industry Information
New World Publishing
PR Newswire
PRLine (PRL)
Resource News International
RosBusiness
United Press International
UPI Spotlights
Semagix’s Semantic Technology enables flight authorities to:
- take a quick look at the passenger’s history
- check quickly if the passenger is on any official watchlist
- interpret and understand the passenger’s links to other organizations (possibly terrorist)
- verify if the passenger has boarded the flight from a “high risk” region
- verify if the passenger originally belongs to a “high risk” region
- check if the passenger’s name has been mentioned in any news article along with the name of a known bad guy
Interrogation Kiosk – Unique Advantages of Semagix
Smith, John
Threat Score Components
LEXIS NEXIS ANNOTATION
Action: Information about or related to the passenger returned by Lexis Nexis is enhanced by linking important entities to Semagix’s rich ontology
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge, recognize entities in a piece of text and further automatically correlate them with other data in the ontology to present a clear picture of the passenger to the flight official
Flight Country Check 45 0.15
Person Country Check 25 0.15
Nested Organizations Check 75 0.8
Aggregate Link Analysis Score: 17.7
LINK ANALYSIS
Action: Semantic analysis of the various components (watchlist, Lexis Nexis, ontology search, metabase search, etc.) to come up with an aggregate threat score for the passenger
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge, recognize entities in a piece of text, automatically correlate them with other data in the ontology, and search for relevant content to present an overall idea of the threat level of the passenger, allowing the flight official to take quick action
appearsOn watchList:
FBI
ONTOLOGY SEARCH
Action: Semagix’s rich ontology is searched for this name, and associated information such as position, aliases, and relationships (past or present) of this name to other organizations, watchlists, countries, etc. is retrieved
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge about a passenger and automatically correlate it with other data in the ontology to present a visual association picture to the flight official
METABASE SEARCH
Action: Semagix’s rich metabase is searched for this name and associated content stories mentioning the passenger’s name are retrieved
Ability Proven: Ability to automatically aggregate and retrieve relevant content stories, field reports, etc. about the passenger that can be used by flight officials to determine if the passenger has any connections with known bad people or organizations
WATCHLIST ANALYSIS
Action: Semagix’s rich ontology is automatically searched for the possible appearance of this name on any of the watchlists
Ability Proven: Ability to automatically aggregate and correlate relevant rich domain knowledge and rank the threat factors to indicate the threat level of the passenger on the watchlist front
What it will take an RDBMS to support the flight security application
Link Analysis Component / Step | # Queries (Voquette) | # Queries (RDBMS) | Time (Voquette) | Time (RDBMS)
Direct Watchlist Match (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Organization Watchlist Match (person name, organization name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Nested Organization Watchlist Match (person name, organization name)
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organization's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Flight Origin (country name)
  retrieve country entity | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  see if country is on a list containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Person Origin (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's home country | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organization's relationships to lists containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Field Report Search (person name)
  perform SSE query for field reports that mention this person | 1 SSE Request | 2 SQL Queries | .03 sec | 5-30 sec
  retrieve a list of people associated with these field reports | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  determine which people are on watchlists, terrorists, etc… | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Total | 18 requests | 39-64 SQL Queries | .33 sec | 30-80 sec.
Query Comparison: Semagix vs. RDBMS
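
The totals in the comparison can be re-derived from the per-step figures. The sketch below tallies them in Python; the step counts (5 CACS requests, 12 single SQL queries, 1 SSE request) are read off the table rows, and the names COSTS, STEP_COUNTS and totals are made up for this example. Summing the listed per-step Voquette times gives about 0.34 sec, close to the .33 sec stated on the slide, while the RDBMS ranges come out at 39-64 queries and roughly 30-80 sec.

```python
# Per-step figures as listed in the comparison table:
# kind -> (voquette_sec, rdbms_queries_min, rdbms_queries_max, rdbms_sec_min, rdbms_sec_max)
COSTS = {
    "CACS": (0.05, 5, 10, 5.0, 10.0),
    "SQL": (0.005, 1, 1, 0.005, 0.005),
    "SSE": (0.03, 2, 2, 5.0, 30.0),
}

# Number of steps of each kind across the six link analysis components.
STEP_COUNTS = {"CACS": 5, "SQL": 12, "SSE": 1}

def totals(costs, counts):
    """Sum requests, query counts and times over all steps."""
    def column(i):
        return sum(costs[kind][i] * n for kind, n in counts.items())
    return {
        "requests": sum(counts.values()),
        "voquette_sec": round(column(0), 3),
        "rdbms_queries": (column(1), column(2)),
        "rdbms_sec": (round(column(3), 2), round(column(4), 2)),
    }

print(totals(COSTS, STEP_COUNTS))
```

Almost all of the RDBMS time comes from the five entity lookups and the field report search; the twelve single-query steps cost the same on both sides.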
Performance
Population/update rate in an ontology with 1 million entities/relationships: > 10,000 entities/relationships per hr.
Incremental index update frequency: 1 minute (near real-time)
Query response time (64 concurrent users): 65 ms
Query response time (light load): 1 - 10 ms
Queries per server per hour: > 1,980,000
More at www.semagix.com and
http://lsdis.cs.uga.edu/lib/presentations.html