Web scale
Named Entity Mining
"There's simply too much information out there"
WI-IAT 2011
in memoriam of
Herbert A. Simon …
stuck
April 2011
Herbert Simon's Brookings Institute Lecture "Designing Organizations for an Information-Rich World"
Johns Hopkins University, September 1, 1969
1. Tales & legends
Find & procure a crystal-clear plastic replacement for polycarbonate LEXAN 943
Main constraints:
• more resistant to detergents than LEXAN 943 (which cracks under the combined effect of mechanical stress and exposure to detergents)
• compatible with existing tooling: shrinkage must be close to that of LEXAN 943
• optical characteristics close to LEXAN 943
• weldable by ultrasonic welding
• compliant with fire & smoke resistance requirements 2 according to NFF16-101/102 and V0 according to standard UL 94
Deadline: one week
organization centric search
Where is the SA-24 Grinch 9K338 Igla-S portable air defense missile system sold/operated?
location centric search
Recent information (past month)
about the call for proposals
"outils Web innovants en entreprise" (innovative Web tools in the enterprise)?
time centric search
"pro" searches focus on named entities:
Location, Orgs, People, Time
2. Introducing
WebNEM
relevant query ? → query → results + browsing/ranking → again ? where ?
Attention-greedy & burdensome
product
specifications
get
manufacturer
or distributor
find
compliant
products
"SA-24 Grinch
9K338 Igla-S"
Goal : Attention-saver process
exploratory data analysis
of high dimensional data
"In exploratory data analysis of high dimensional data
one of the main tasks is the formation of a
simplified, usually visual, overview of data sets.
....
Clustering and projection
are among the examples of useful methods
to achieve this task."
Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their
application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)
WebNEM
collection of
relevant data,
anywhere in the web
+ projection on
Named Entities space
topical web crawler
named entity recognition
visualization/exploratory analysis tools
"Web scale" collection : brute force
never-ending crawl
fast answer,
"any" topic
a priori
"whole" Web indexing
general index
"everywhere"
huge resources required
(data size based)
user
query
"Web scale" collection : our approach
"close to optimal" resources
(usage based)
user
query
on-demand topical crawl
delayed answer,
but less garbage
tailored index
anywhere
relevant
built on order
Web slices
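The on-demand topical crawl can be pictured as a best-first traversal that only indexes and expands pages judged relevant to the user query. The sketch below is a minimal illustration under stated assumptions: the toy in-memory web (`TOY_WEB`), the term-overlap `relevance` score and the `min_score` threshold are all hypothetical and are not taken from the system described in these slides.

```python
import heapq

# Hypothetical toy "web": url -> (page text, outgoing links). Not real data.
TOY_WEB = {
    "a": ("hydrogen storage for fuel cells", ["b", "c"]),
    "b": ("fuel cells and hydrogen tanks", ["d"]),
    "c": ("celebrity gossip and sports", ["e"]),
    "d": ("metal hydrides for hydrogen storage", []),
    "e": ("more sports results", []),
}

def relevance(text, topic_terms):
    """Naive score: fraction of topic terms present in the page text."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)

def topical_crawl(seeds, topic, max_pages=10, min_score=0.25):
    """Best-first crawl: visit the most promising pages first, skip garbage."""
    topic_terms = set(topic.lower().split())
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    web_slice = []
    while frontier and len(web_slice) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text, topic_terms)
        if score < min_score:
            continue  # irrelevant page: neither indexed nor expanded
        web_slice.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Outlinks inherit the parent's score as crawl priority.
                heapq.heappush(frontier, (-score, link))
    return web_slice

web_slice = topical_crawl(["a"], "hydrogen storage fuel cells")
```

In this toy run the off-topic page "c" is dropped and its outlink "e" is never fetched, which is the "less garbage" side of the trade-off; the price is that the slice has to be built before the first answer is available.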
Projection : when to extract entities ?
Named Entity Recognition is resource intensive
process step   data         size     required response time
crawl time     whole web    10^10    asynchronous
query time     collection   10^2     real-time
crawl time     web slice    10^4     asynchronous
www.squido.fr
our SaaS Web mining system
large scale
Named Entity extraction (EN/FR)
beta released to customers
June 2011
WebNEM with Squido
index
focused
crawl
search
topic
shallow
entity extraction
page
cleaning
user
queries
user
collections
deep
entity extraction
visualization
visualization
Page cleaning
instead of this, work on this
fast heuristic
DOM processing
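One way to picture fast heuristic page cleaning is a single pass over the markup that ignores boilerplate containers and very short text fragments. This is only a sketch: the list of skipped tags (`SKIP_TAGS`) and the word threshold (`MIN_WORDS`) are assumptions chosen for illustration, not the heuristics actually used by the system.

```python
from html.parser import HTMLParser

# Assumed heuristics (not from the slides): skip these containers entirely
# and drop very short text fragments, which are often menus or buttons.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
MIN_WORDS = 5

class PageCleaner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped container
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if self.skip_depth == 0 and len(text.split()) >= MIN_WORDS:
            self.blocks.append(text)

def clean_page(html):
    """Return the text blocks that survive the heuristics, one per line."""
    parser = PageCleaner()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)

sample = """<html><body>
<nav><a href="/">Home</a> <a href="/news">News and press releases archive page</a></nav>
<p>Hydrogen storage remains a key challenge for fuel cells.</p>
<script>var tracking = "one two three four five six";</script>
<footer>Copyright notice and legal terms of use apply</footer>
</body></html>"""
cleaned = clean_page(sample)
```

A length threshold of this kind also discards short but legitimate content such as bylines or dates, so it trades some recall for accuracy.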
Shallow extraction
Web docs → format parse → detect language → tokenize → sentence split → gazetteers → grammar → index
Deep extraction
POS tagger
grammar
orthomatcher
index
morpho analyzer
NP/VP chunker
≅ shallow extraction + elaborate linguistics
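The shallow steps (sentence split, tokenize, gazetteer lookup, grammar) can be illustrated with a few lines of code. This is a minimal sketch, not the actual pipeline: the miniature `GAZETTEERS` and the single "first name + capitalized word" rule are hypothetical stand-ins for real gazetteers and grammars.

```python
import re

# Hypothetical miniature gazetteers; real ones hold many thousands of entries.
GAZETTEERS = {
    "Location": {"austin", "taipei", "lisbon"},
    "Organization": {"honda"},
    "FirstName": {"takanobu", "herbert"},
}

def split_sentences(text):
    """Split after sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Keep word characters, apostrophes and hyphens together."""
    return re.findall(r"[\w'-]+", sentence)

def shallow_extract(text):
    """Return (entity_type, surface_form) pairs from lookup plus one rule."""
    entities = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        i = 0
        while i < len(tokens):
            low = tokens[i].lower()
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Grammar rule: known first name + capitalized word -> Person.
            if low in GAZETTEERS["FirstName"] and nxt[:1].isupper():
                entities.append(("Person", f"{tokens[i]} {nxt}"))
                i += 2
                continue
            for etype in ("Location", "Organization"):
                if low in GAZETTEERS[etype]:
                    entities.append((etype, tokens[i]))
            i += 1
    return entities

found = shallow_extract("Honda President Takanobu Ito spoke in Austin. Sales grew.")
```

Because the Person rule relies on capitalization, it fails on text with the wrong character case, which is one reason case-based rules are fragile on Web content.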
3.Annoyances
Linguistic processing throughput
deep extraction
too expensive
when crawling
shallow
extraction
OK
penalty
on
quality
workaround :
• asynch deep extraction on smaller collections
• query time sanitization
Page cleaning
need evaluation
goal : ↗accuracy ? cost : ↘ recall ?
performance impact ?
↘ +1 processing step
↗ less text in later steps
"Multiple dates" usage ?
<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>
<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 9-13, 2008</DATE>
<DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>
<DATE TYPE="DateDay" D="14" M="10" Y="2008">October 12-17, 2008</DATE>
<DATE TYPE="DateDay" D="16" M="2" Y="2009">February 15-18, 2009</DATE>
<DATE TYPE="DateDay" D="17" M="9" Y="2007">September 16-19, 2007</DATE>
<DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>
<DATE TYPE="DateDay" D="26" M="5" Y="2009">May 24-29, 2009</DATE>
<DATE TYPE="DateDay" D="27" M="10" Y="2009">October 25-29, 2009</DATE>
<DATE TYPE="DateDay" D="7" M="10" Y="2008">October 5-9 2008</DATE>
<DATE TYPE="DateDay" D="8" M="2" Y="2009">February 7-10, 2009</DATE>
<DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>
<DATE TYPE="DateDay" D="9" M="10" Y="2007">October 7-12, 2007</DATE>
<DATE TYPE="DateMonth" M="11" Y="2009">November, 2009</DATE>
<DATE TYPE="DateMonth" M="2" Y="2009">February, 2009</DATE>
<DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>
retrieve
by date
sort
by date
?
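The listing above appears to be ordered on the D attribute compared as a string ("11", "12", ..., "2", "26", ...), and each D value is consistent with the midpoint of its range rounded down (10-13 gives 11, 12-17 gives 14). The sketch below shows how such annotations could be sorted chronologically instead. The sort key, and the choice to place month-level dates before day-level dates of the same month, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Four of the annotations listed above, in their original relative order.
TAGGED = """
<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>
<DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>
<DATE TYPE="DateDay" D="9" M="10" Y="2007">October 7-12, 2007</DATE>
<DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>
"""

def range_midpoint(first_day, last_day):
    """Single representative day for a range, rounded down."""
    return (first_day + last_day) // 2

def sort_key(elem):
    # Assumption: a month-level date (no D attribute) sorts as day 0,
    # i.e. before any day-level date of the same month.
    return (int(elem.get("Y")), int(elem.get("M")), int(elem.get("D", 0)))

root = ET.fromstring(f"<dates>{TAGGED}</dates>")
ordered = sorted(root.iter("DATE"), key=sort_key)
labels = [e.text for e in ordered]
```

Comparing the attributes as integers, year first, is enough to fix the ordering; how a range or a month-level date should rank against a single day remains a design decision.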
Publishing date ?
critical for
time centric
searches
published 05/2011,
tagged as 7 jul 2011
& many more…
wrong
spelling
Tapei→Taipei
location is also a first name
"University of Michigan, Ann Arbor, MI"→Ann Arbor (person)
compound first names
"Jean-Claude Marin"→Claude Marin
wrong character case (very frequent on titles)
breaks all case-based rules
barrack obama→not extracted
How To Buy Electric Trucks→Buy Electric (organization)
In Virginia Life Is Sweet→Virginia Life (person)
polymorphism
"Nagy Bocsa", "Nagy-Bocsa", "Nagy"
sanitize parser output
for tokenization
transliteration, case, punctuation, …
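A sanitization step of this kind can be sketched as a normalization function that maps surface variants to one matching key. The three rules below (diacritics removal, punctuation to spaces, case folding) are illustrative assumptions, and the accented upper-case variant in the example is invented for the demonstration.

```python
import re
import unicodedata

def sanitize(surface):
    """Normalize an entity surface form for matching (illustrative rules only)."""
    # Transliteration: strip diacritics by dropping combining marks.
    decomposed = unicodedata.normalize("NFKD", surface)
    ascii_like = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Punctuation: hyphens, underscores, dots and commas become spaces.
    spaced = re.sub(r"[-_.,]+", " ", ascii_like)
    # Case and whitespace.
    return " ".join(spaced.casefold().split())

# The last variant is a made-up example combining case and accent noise.
variants = ["Nagy Bocsa", "Nagy-Bocsa", "NAGY  BÓCSA"]
keys = {sanitize(v) for v in variants}
```

All three variants collapse to the same key here. A shortened form such as "Nagy" alone would not, since merging it requires coreference-style matching rather than string normalization.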
4. Results
Reminder
The following results are obtained
automatically
from unstructured content
picked on the web
by an autonomous system,
without previous knowledge
of the topic or the visited Web sites
Let's try it with a use case
"hydrogen storage for fuel cells"
What's inside a collection
of 66 highly ranked documents ?
run a few cycles
(shallow extraction only)
entity
weight function
(tf-idf, …)
some
10^4 pages
People Orgs Location Time
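An entity weight function in the tf-idf family can be sketched as follows. The slides only say "(tf-idf, …)", so the exact formula here (per-page relative frequency times log inverse page frequency, summed over pages) is an assumption, and the three toy pages are invented.

```python
import math
from collections import Counter

def entity_tfidf(pages):
    """pages: list of entity lists, one per page. Returns {entity: weight}."""
    n_pages = len(pages)
    doc_freq = Counter()
    for entities in pages:
        doc_freq.update(set(entities))  # count each entity once per page
    weights = Counter()
    for entities in pages:
        if not entities:
            continue
        counts = Counter(entities)
        for entity, count in counts.items():
            tf = count / len(entities)
            idf = math.log(n_pages / doc_freq[entity])
            weights[entity] += tf * idf
    return dict(weights)

# Toy collection: entity mentions extracted from three hypothetical pages.
pages = [
    ["Honda", "Austin", "Honda"],
    ["Honda", "Argonne"],
    ["Honda", "Austin"],
]
weights = entity_tfidf(pages)
```

With this weighting an entity present on every page gets weight zero, while entities concentrated in a few pages rise to the top, which fits the special attention paid to outliers.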
Special attention paid
to so-called outliers
Organizations > 900 : overload…
page cleaning + entity sanitization
=> better details & accuracy
↗attention ↘information : top 50
academic team ?
H2 military usage ?
new questions are instantly popping up
?
People
authors lead to
relevant content
(classic IR method,
even in libraries !)
?
Countries
political threats
on Lithium battery
supplies
argument in favor of
H2 technology
Cities
"Austin is in a unique position
to offer its electric grid as a
real world proving ground"
"Direct Methanol Fuel Cells"
⇒alternative to H2
changeover from nickel to lithium
will be complete by 2016 and 2018
Multiple-dates timeline
[Timeline figure: domains plotted against time, from history to outlook]
Honda President Takanobu Ito says
around 10 percent of Honda’s global sales
will be hybrids by 2015
In a few clicks...
DMFC alternative to H2
Austin,
TX
hydrogen storage
for fuel cells ?
changeover from
nickel to lithium
by 2016/2018
5. Perspectives
To clean or not to clean ?
performance impact: run pipeline with/without cleaning, time full pipeline
"attention" impact: corpus, label examples +/-, clean set vs. full set
Publishing date extraction
heuristic
DOM processing
prototype ready
need large scale
evaluation
build gold
standard from
RSS feeds
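Building a gold standard from RSS feeds could look like the sketch below: each feed item gives a (link, pubDate) pair against which an extractor's output can be scored. The feed content, the example.org URLs and the `accuracy` measure are hypothetical; only the RSS 2.0 element names `item`, `link` and `pubDate` are standard.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 snippet; URLs and dates are invented for illustration.
FEED = """<rss version="2.0"><channel>
<item><link>http://example.org/a</link>
<pubDate>Thu, 07 Jul 2011 09:30:00 GMT</pubDate></item>
<item><link>http://example.org/b</link>
<pubDate>Mon, 16 May 2011 14:00:00 +0200</pubDate></item>
<item><link>http://example.org/c</link></item>
</channel></rss>"""

def gold_standard(feed_xml):
    """Map each item link to its publishing date; skip items without one."""
    gold = {}
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if link and pub:
            gold[link.strip()] = parsedate_to_datetime(pub.strip()).date()
    return gold

def accuracy(predicted, gold):
    """Share of gold pages whose extracted date matches exactly."""
    hits = sum(1 for url, day in gold.items() if predicted.get(url) == day)
    return hits / len(gold) if gold else 0.0

gold = gold_standard(FEED)
```

Exact-day matching is the strictest possible measure; a tolerance of a day or so may be more appropriate when feeds and pages disagree on time zones.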
A zest of Linked Data ?
too slow & fat
for crawling...
use it "offline"
disambiguation, gazetteers, infoboxes, ...
Play with graphs
entity co-occurrence, page similarity, ...
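Both graph ideas can be prototyped from the per-page entity lists alone. The sketch below counts entity pairs that appear on the same page and compares pages by Jaccard similarity of their entity sets. Jaccard is one possible binary similarity measure, picked here as an assumption, and the toy pages are invented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(pages):
    """Count, for each entity pair, the number of pages mentioning both."""
    edges = Counter()
    for entities in pages:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

def jaccard(page_a, page_b):
    """Page similarity: shared entities over all entities of both pages."""
    a, b = set(page_a), set(page_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy collection of entity mentions per page.
pages = [
    ["Honda", "Austin", "Honda"],
    ["Honda", "Austin", "Lisbon"],
    ["Lisbon"],
]
edges = cooccurrence(pages)
similarity = jaccard(pages[0], pages[1])
```

The edge counts can be used directly as weights of an entity graph, and the pairwise similarities as weights of a page graph.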
UI/user experience
• search facets
• word clouds
• maps
• dashboards
• infoboxes
• highlighting
• graphs
Lexical Taxonomies Induction
22nd International Joint Conference on Artificial Intelligence (IJCAI 2011),
Barcelona, Spain, July 19-22nd, 2011
another kind of projection
6. Digest
a. A real need for attention saving…
b. WebNEM results are encouraging
c. Work in progress, lots of paths to explore
"There's simply
too much
information out
there."
"Leaders feel
misled. Stupid.
Trapped."
Final word by Herbert Simon
"Filtering by intelligent programs
is the main part of the answer"
[to information overload]
MANY THANKS!
joint work of
www.ixxo.fr
www.slideshare.net/fpouilloux
www.linkedin.com/pub/st%C3%A9phanie-jacquemont/20/271/767
www.linkedin.com/in/fpouilloux
CREDITS
Photos
2. Home page, The 2011 IEEE/WIC/ACM International Conference on Web
Intelligence
4. Designing Organizations for an Information-Rich World, The Herbert A.
Simon Collection
5. Vlad the Impaler, Wikimedia commons
7. Missile 9M342 of the portable anti-aircraft missile system Igla-S,
©vitalykuzmin.net
10. Internet Map 2005, ©www.opte.org
33. The Inspector, ©DePatie-Freleng Enterprises
36. Nanomaterials for Solid State Hydrogen Storage, book cover,
©springer.com
40. EnerDel/Argonne lithium-ion battery, ©Argonne National Laboratory
40. Pennybacker Bridge - Austin, TX, ©Andy Heatwole
41. 20060206211301_132363.jpg, pulpo.org, ©Jumpedforjoy
44. Linking Open Data cloud diagram, ©Richard Cyganiak and Anja
Jentzsch, lod-cloud.net
44. Taji crawl, ©The U.S. Army, www.flickr.com/soldiersmediacenter
48. Views of the solar corona by the Transition Region and Coronal
Explorer, Stanford-Lockheed Institute for Space Research, NASA Small
Explorer program
49. Hyperformance book cover, www.tjwaters.com
50. Dr Simon solving puzzles, The Herbert A. Simon Collection
Websites
• wi-iat-2011.org
• The Herbert A. Simon Collection, Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/index.html
• www.google.com
• online.barrons.com
• www.me.utexas.edu/~dmfc-muri
• www.alsace-industrie.fr
• www.hybridcars.com
• www.me.utexas.edu/blogs/meyersresearchgroup
Bibliography
• Simon, H. A. (1971), "Designing Organizations for an Information-Rich World", Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/item.jsp?q=/box00055/fld04178/bdl0002/doc0001
• Waters, T. J. (2011), "Hyperformance", www.tjwaters.com/hyperformance-excerpt.html
• Navigli, R., Velardi, P., Faralli, S. (2011), "A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch", Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22, 2011, pp. 1872-1877.