Web Scale Named Entity Mining

"There's simply too much information out there"

WI-IAT 2011

in memoriam of

Herbert A. Simon …

stuck

April 2011

Herbert Simon's Brookings Institute Lecture "Designing Organizations for an Information-Rich World"

Johns Hopkins University, September 1, 1969

1. Tales & legends

Find & procure a crystal plastic replacement for polycarbonate LEXAN 943

Main constraints:

• more resistant to detergent agents than LEXAN 943 (problem of cracking under the combined effect of mechanical stress and exposure to detergent agents)

• compatible with existing tools - shrinkage must be close to LEXAN 943

• optical characteristics close to LEXAN 943

• weldable by ultrasonic welding

• compliant with resistance to fire & smoke requirements 2 according to NFF16-101/102 and V0 according to standard UL 94

Deadline: one week

organization centric search

Where is the SA-24 Grinch 9K338 Igla-S portable air defense missile system sold/operated ?

location centric search

Recent information (past month) about the call for proposals "outils Web innovants en entreprise" ?

time centric search

"pro" searches focus on named entities

Location, Orgs, People, Time

2. Introducing WebNEM

relevant

query ?

query

again ?

where ?

+ browsing/ranking

results

Attention-greedy & burdensome

product specifications

get manufacturer or distributor

find compliant products

"SA-24 Grinch 9K338 Igla-S"

Goal : Attention-saver process

exploratory data analysis of high dimensional data

"In exploratory data analysis of high dimensional data one of the main tasks is the formation of a simplified, usually visual, overview of data sets. … Clustering and projection are among the examples of useful methods to achieve this task."

Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)

Lourenço, Lobo, Bação – JOCLAD 2004

WebNEM

collection of relevant data, anywhere on the web

+ projection on Named Entities space

topical web crawler

named entity recognition

visualization/exploratory analysis tools

"Web scale" collection : brute force

never-ending crawl

fast answer, "any" topic

a priori "whole" Web indexing

general index

"everywhere"

huge resources required (data size based)

user query

"Web scale" collection : our approach

"close to optimal" resources

(usage based)

user query

on-demand topical crawl

delayed answer, but less garbage

tailored index

anywhere

relevant

built on order

Web slices

Projection : when to extract entities ?

Named Entity Recognition is resource intensive

process step | data       | size  | required response time
crawl time   | whole web  | 10^10 | asynchronous
query time   | collection | 10^2  | real-time
crawl time   | web slice  | 10^4  | asynchronous

www.squido.fr

our SaaS Web mining system

large scale Named Entity extraction (EN/FR)

beta released to customers, June 2011

WebNEM with Squido

[Architecture diagram. Components: topic, focused crawl, page cleaning, shallow entity extraction, index, search, user queries, user collections, deep entity extraction, visualization]

Page cleaning

instead of this, work on this

fast heuristic DOM processing
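
A minimal sketch of what a fast heuristic DOM cleaning step could look like, assuming that boilerplate sits in a fixed set of tags (the `SKIP_TAGS` set, the `TextCleaner` class and the `clean_page` function are illustrative names, not part of Squido):

```python
from html.parser import HTMLParser

# Assumption: content under these tags is treated as boilerplate.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextCleaner(HTMLParser):
    """Collect visible text while skipping subtrees rooted at boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_page(html):
    """Return the text of a page with the assumed boilerplate removed."""
    parser = TextCleaner()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A single pass over the markup keeps the cost low, which matters when the step runs at crawl time. A production heuristic would probably also use text density or link density rather than tag names alone.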

Shallow extraction

[Pipeline diagram. Components: Web docs, format parse, detect language, tokenize, sentence split, gazetteers, grammar, index]
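
The tokenize, sentence split and gazetteer steps of shallow extraction can be sketched roughly as follows. This is a toy illustration under stated assumptions: the gazetteer entries, the `MAX_NGRAM` limit and all function names are hypothetical, and the grammar step is left out.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> entity type.
GAZETTEER = {
    "honda": "Organization",
    "austin": "Location",
    "takanobu ito": "Person",
}
MAX_NGRAM = 3  # longest gazetteer entry, in tokens (assumption)


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Keep words, allowing inner hyphens and apostrophes.
    return re.findall(r"\w+(?:[-']\w+)*", sentence)


def shallow_extract(text):
    """Return (surface form, type) pairs found by longest-match lookup."""
    entities = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        i = 0
        while i < len(tokens):
            # Try the longest candidate first so "Takanobu Ito" wins over "Ito".
            for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                etype = GAZETTEER.get(candidate.lower())
                if etype:
                    entities.append((candidate, etype))
                    i += n
                    break
            else:
                i += 1
    return entities
```

Lookup is case-insensitive here, which sidesteps some case problems but would also match common words that happen to be in the gazetteer.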

Deep extraction

[Pipeline diagram. Components: morpho analyzer, POS tagger, NP/VP chunker, grammar, orthomatcher, index]

≅ shallow extraction + elaborate linguistics

3. Annoyances

Linguistic processing throughput

deep extraction: too expensive when crawling

shallow extraction: OK, penalty on quality

workaround :

• asynch deep extraction on smaller collections

• query time sanitization

Page cleaning

needs evaluation

goal : ↗accuracy ? cost : ↘ recall ?

performance impact ?

↘ +1 processing step

↗ less text in later steps

"Multiple dates" usage ?

<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>


<DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>

<DATE TYPE="DateDay" D="14" M="10" Y="2008">October 12-17, 2008</DATE>


<DATE TYPE="DateDay" D="17" M="9" Y="2007">September 16-19, 2007</DATE>

<DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>

<DATE TYPE="DateDay" D="26" M="5" Y="2009">May 24-29, 2009</DATE>


<DATE TYPE="DateDay" D="7" M="10" Y="2008">October 5-9 2008</DATE>


<DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>


<DATE TYPE="DateMonth" M="11" Y="2009">November, 2009</DATE>

<DATE TYPE="DateMonth" M="2" Y="2009">February, 2009</DATE>

<DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>

retrieve by date

sort by date ?
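
One possible way to retrieve and sort by date from such annotations, sketched under the assumption that a missing `D` attribute (the `DateMonth` case) should sort before any day of that month. The sample uses three of the tags shown above; `parse_dates`, `sort_by_date` and `retrieve` are illustrative names.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>
<DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>
<DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>
"""


def parse_dates(fragment):
    """Turn DATE annotations into ((year, month, day), text) pairs."""
    # Wrap the fragment in a root element so it parses as one XML document.
    root = ET.fromstring("<ROOT>" + fragment + "</ROOT>")
    dates = []
    for el in root.iter("DATE"):
        year = int(el.get("Y"))
        month = int(el.get("M", 0))  # 0 = unknown (assumption)
        day = int(el.get("D", 0))    # 0 = unknown (assumption)
        dates.append(((year, month, day), el.text))
    return dates


def sort_by_date(dates):
    # Tuples compare field by field: year, then month, then day.
    return sorted(dates, key=lambda item: item[0])


def retrieve(dates, year, month=None):
    """Keep the dates that fall in the given year (and month, if given)."""
    return [d for d in dates
            if d[0][0] == year and (month is None or d[0][1] == month)]
```

With several dates per page, each page yields several keys, so both retrieval and sorting need a rule for choosing among them. This sketch leaves that question open.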

Publishing date ?

critical for time centric searches

published 05/2011, tagged as 7 jul 2011

& many more…

wrong spelling

Tapei→Taipei

location is also a first name

"University of Michigan, Ann Arbor, MI"→Ann Arbor (person)

compound first names

"Jean-Claude Marin"→Claude Marin

wrong character case (very frequent on titles)

breaks all case-based rules

barrack obama→not extracted

How To Buy Electric Trucks→Buy Electric (organization)

In Virginia Life Is Sweet→Virginia Life (person)

polymorphism

"Nagy Bocsa", "Nagy-Bocsa", "Nagy"

sanitize parser output for tokenization, transliteration, case, punctuation, …
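
A rough sketch of entity sanitization for case, punctuation and a crude form of transliteration (stripping diacritics). The `sanitize` and `group_variants` functions are illustrative assumptions, not the system's actual rules.

```python
import re
import unicodedata


def sanitize(name):
    """Map surface variants of a name to one normalized key."""
    # Crude transliteration: decompose characters and drop combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Punctuation: hyphens, underscores and other non-word characters become spaces.
    spaced = re.sub(r"[\W_]+", " ", stripped)
    # Case: fold to one case, then collapse runs of whitespace.
    return " ".join(spaced.casefold().split())


def group_variants(names):
    """Group surface forms that share the same normalized key."""
    groups = {}
    for name in names:
        groups.setdefault(sanitize(name), []).append(name)
    return groups
```

This merges "Nagy Bocsa" and "Nagy-Bocsa" but leaves the shortened form "Nagy" in its own group. Linking partial names to full names needs context, which simple normalization cannot provide.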

4. Results

Reminder

The following results were obtained automatically from unstructured content picked on the web by an autonomous system, without previous knowledge of the topic or the visited Web sites

Let's try it with a use case

"hydrogen storage for fuel cells"

What's inside a collection of 66 highly ranked documents ?

run a few cycles (shallow extraction only)

entity weight function (tf-idf, …)
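
As an illustration of an entity weight function, here is one plausible tf-idf variant that sums per-document scores into a collection-level weight. The smoothing formula and the `entity_tfidf` name are assumptions; the slides do not say which variant is used.

```python
import math
from collections import Counter


def entity_tfidf(docs):
    """docs: one list of entity mentions per document. Returns entity weights."""
    n_docs = len(docs)
    # Document frequency: number of documents mentioning each entity.
    df = Counter(entity for doc in docs for entity in set(doc))
    weights = Counter()
    for doc in docs:
        tf = Counter(doc)  # raw mention counts in this document
        for entity, count in tf.items():
            # Smoothed idf, so entities present everywhere keep a small weight.
            idf = math.log((1 + n_docs) / (1 + df[entity])) + 1
            weights[entity] += count * idf
    return weights
```

Calling `entity_tfidf(docs).most_common(50)` would give a ranked shortlist of entities for a collection.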

some 10^4 pages

People, Orgs, Location, Time

Special attention paid to so-called outliers

Organizations > 900 : overload…

page cleaning + entity sanitization => better details & accuracy

↗attention ↘information : top 50

academic team ?

H2 military usage ?

new questions are instantly popping up

People

authors lead to relevant content (classic IR method, even in libraries !)

Countries

political threats on Lithium battery supplies

argument in favor of H2 technology

Cities

"Austin is in a unique position to offer its electric grid as a real world proving ground"

"Direct Methanol Fuel Cells" ⇒ alternative to H2

changeover from nickel to lithium will be complete by 2016 and 2018

Multiple-dates timeline

[Timeline chart: domains against time, from history to outlook]

Honda President Takanobu Ito says around 10 percent of Honda’s global sales will be hybrids by 2015

In a few clicks...

DMFC alternative to H2

Austin, TX

hydrogen storage for fuel cells ?

changeover from nickel to lithium by 2016/2018

5. Perspectives

To clean or not to clean ?

performance impact

"attention" impact

run pipeline with/without cleaning

corpus

label examples +/-

clean

set

full

set

time full

pipeline

Publishing date extraction

heuristic DOM processing

prototype ready

needs large scale evaluation

build gold standard from RSS feeds

A zest of Linked Data ?

too slow & fat for crawling...

use it "offline"

disambiguation, gazetteers, infoboxes, ...

Play with graphs

entity co-occurrence, page similarity, ...
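
Both graph ideas can be prototyped in a few lines. The sketch below assumes each page is represented by the list of entities extracted from it. Jaccard overlap is only one possible page similarity, chosen here for simplicity, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(pages):
    """pages: one entity list per page. Returns weighted edges between entities."""
    edges = Counter()
    for entities in pages:
        # Count each unordered pair once per page; sorting makes the key canonical.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def jaccard(page_a, page_b):
    """Similarity of two pages as the overlap of their entity sets."""
    a, b = set(page_a), set(page_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Edge weights from `cooccurrence` can feed a graph layout or clustering tool, while `jaccard` gives a pairwise page similarity that ignores mention counts.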

UI/user experience

• search facets

• word clouds

• maps

• dashboards

• infoboxes

• highlighting

• graphs

Lexical Taxonomies Induction

22nd International Joint Conference on Artificial Intelligence (IJCAI 2011),

Barcelona, Spain, July 19-22nd, 2011

another kind of projection

a. A real need for attention-saving…

b. WebNEM results are encouraging

c. Work in progress, lots of paths to explore

6. Digest

"There's simply too much information out there."

"Leaders feel misled. Stupid. Trapped."

Final word by Herbert Simon

"Filtering by intelligent programs is the main part of the answer" [to information overload]

www.ixxo.fr

www.slideshare.net/fpouilloux

www.linkedin.com/pub/st%C3%A9phanie-jacquemont/20/271/767

www.linkedin.com/in/fpouilloux

MANY THANKS!

joint work of

CREDITS

Photos

2. Home page, The 2011 IEEE/WIC/ACM International Conference on Web Intelligence

4. Designing Organizations for an Information-Rich World, The Herbert A. Simon Collection

5. Vlad the Impaler, Wikimedia commons

7. Missile 9M342 of the portable anti-aircraft missile system Igla-S, ©vitalykuzmin.net

10. Internet Map 2005, ©www.opte.org

33. The Inspector, ©DePatie-Freleng Enterprises

36. Nanomaterials for Solid State Hydrogen Storage, book cover, ©springer.com

40. EnerDel/Argonne lithium-ion battery, ©Argonne National Laboratory

40. Pennybacker Bridge - Austin, TX, ©Andy Heatwole

41. 20060206211301_132363.jpg, pulpo.org, ©Jumpedforjoy

44. Linking Open Data cloud diagram, ©Richard Cyganiak and Anja Jentzsch, lod-cloud.net

44. Taji crawl, ©The U.S. Army, www.flickr.com/soldiersmediacenter

48. Views of the solar corona by the Transition Region and Coronal Explorer, Stanford-Lockheed Institute for Space Research, NASA Small Explorer program

49. Hyperformance book cover, www.tjwaters.com

50. Dr Simon solving puzzles, The Herbert A. Simon Collection

Websites

• wi-iat-2011.org

• The Herbert A. Simon Collection, Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/index.html

• www.google.com

• online.barrons.com

• www.me.utexas.edu/~dmfc-muri

• www.alsace-industrie.fr

• www.hybridcars.com

• www.me.utexas.edu/blogs/meyersresearchgroup

Bibliography

• Simon, H. A. (1971), "Designing Organizations for an Information-Rich World", Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/item.jsp?q=/box00055/fld04178/bdl0002/doc0001

• Waters, T. J. (2011), "Hyperformance", www.tjwaters.com/hyperformance-excerpt.html

• Navigli, R., Velardi, P., Faralli, S. (2011), "A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch", Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22, 2011, pp. 1872-1877.