Approximate subgraph matching for detection of topic variations

Mitja Trampuš
Dunja Mladenić
AI Lab, Jožef Stefan Institute, Slovenia
DiversiWeb Workshop

AI Lab, Jozef Stefan Institute ailab.ijs.si

Mining Diversity

• Web content varies in many aspects, e.g.
  o Topical
  o Social (author, target audience, people written about)
  o Geographical (publisher, places written about)
  o Sentiment (positive/negative)
  o Writing style (structure, vocabulary)
  o Coverage bias

• This work: (micro-)topical diversity
  o Macroscopic = largely solved
  o Microscopic = challenge


Task:
Given a collection of texts on a topic,
• identify a common template
• align texts to the template


Template representation

• Syntactic
  o info1: X people were killed / killed X people / resulted in X casualties
  o info2: blew up Y / destroyed Y / attacked in a Y

• Semantic
  o kill(bomber, people); count(people, X)
  o destroy(bomber, Y)

[Figure: the semantic template drawn as a graph, with nodes people, bomber, X, Y and edges kill, destroy, count]
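A minimal sketch of how the semantic template above could be stored and aligned with one text, assuming facts are already available as (predicate, subject, object) triples. The names `TEMPLATE`, `VARIABLES`, and `align` are hypothetical, and the matching here is exact and greedy (no backtracking, no taxonomy), so it is only an illustration of the task, not the approximate matching the talk describes.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (predicate, subject, object)

# Template from the slide: kill(bomber, people); count(people, X); destroy(bomber, Y)
TEMPLATE: List[Triple] = [
    ("kill", "bomber", "people"),
    ("count", "people", "X"),
    ("destroy", "bomber", "Y"),
]
VARIABLES = {"X", "Y"}  # assumed convention: these labels are open slots


def _bind(term: str, value: str, bindings: Dict[str, str]) -> bool:
    """Match one template term against a concrete value, updating bindings."""
    if term in VARIABLES:
        if term in bindings and bindings[term] != value:
            return False
        bindings[term] = value
        return True
    return term == value


def align(template: List[Triple], facts: List[Triple]) -> Optional[Dict[str, str]]:
    """Return slot bindings if every template triple matches some fact, else None."""
    bindings: Dict[str, str] = {}
    for pred, subj, obj in template:
        for f_pred, f_subj, f_obj in facts:
            trial = dict(bindings)
            if f_pred == pred and _bind(subj, f_subj, trial) and _bind(obj, f_obj, trial):
                bindings = trial
                break
        else:
            return None  # no fact matches this template triple
    return bindings


# Made-up facts for one text, already normalized to the template's vocabulary.
facts = [
    ("kill", "bomber", "people"),
    ("count", "people", "12"),
    ("destroy", "bomber", "church"),
]
slots = align(TEMPLATE, facts)
```

Aligning a text then amounts to reading off the slot values; a text that lacks one of the template's facts yields `None`.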


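Prerequisite: Semantic Graph

[Figure: three example semantic graphs, one per text. Labels appearing in the figure:
 1. patients, terrorist, hospital, kill, demolish, 100, count, treatment, attack, receive, withstand, execute
 2. police officer, bomber, police station, slaughter, blow up, 2, count
 3. believers, suicide bomber, church, kill, destroy, 12, count, vest, exit, car, drive, wear, run]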


Mining Templates

• Template := subgraph with frequent specializations
  o Specializations implied by background taxonomy (WordNet)
  o Threshold frequency manually defined

[Figure: two example graphs (believers, suicide bomber, church, kill, tear down, count 12; police officer, bomber, police station, slaughter, blow up, count 2) and the template they specialize (people, bomber, building, kill, destroy, count X)]
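The definition above can be sketched as a support count: a candidate template edge is kept if enough graphs contain a specialization of it. The sketch below assumes a toy child-to-parent map (`HYPERNYM`) as a hypothetical stand-in for WordNet, and assumes the example graphs have the same edge directions as the template (bomber acts on people and on a building). All names are illustrative.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object)

# Toy hypernym map (child -> parent); a hypothetical stand-in for WordNet.
HYPERNYM: Dict[str, str] = {
    "believers": "people",
    "police officer": "people",
    "suicide bomber": "bomber",
    "church": "building",
    "police station": "building",
    "slaughter": "kill",
    "tear down": "destroy",
    "blow up": "destroy",
}


def is_specialization(label: str, general: str) -> bool:
    """True if `label` equals `general` or lies below it in the taxonomy."""
    while label != general:
        if label not in HYPERNYM:
            return False
        label = HYPERNYM[label]
    return True


def support(pattern: Triple, graphs: List[List[Triple]]) -> float:
    """Fraction of graphs containing at least one specialization of `pattern`."""
    hits = sum(
        any(all(is_specialization(s, g) for s, g in zip(t, pattern)) for t in graph)
        for graph in graphs
    )
    return hits / len(graphs)


# The two example graphs from the figure, written as triples (assumed structure).
graphs = [
    [("suicide bomber", "kill", "believers"), ("suicide bomber", "tear down", "church")],
    [("bomber", "slaughter", "police officer"), ("bomber", "blow up", "police station")],
]
MIN_SUPPORT = 0.3  # a manually chosen threshold, here the value from the results slide

general_ok = support(("bomber", "destroy", "building"), graphs) >= MIN_SUPPORT
too_specific = support(("bomber", "destroy", "church"), graphs)
```

With these two graphs, the edge destroy(bomber, building) is supported by both, while destroy(bomber, church) is supported by only one of them.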


Semantic Graph Construction

1. Data: Google News crawl
2. HTML cleanup
3. Named entity tagging
4. Pronoun resolution (he/she/him/her)
5. Named entity consolidation (Barack Obama vs President Obama)
6. Parsing, triple/fact/assertion extraction (for now: subj-verb-obj only)
7. Ontology/taxonomy alignment
8. Merging triples into a graph
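The last step of the pipeline above, together with entity consolidation, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the alias table `ALIASES`, the function `merge_triples`, and the example triples are all made up, and the earlier steps (crawling, tagging, parsing) are assumed to have already produced subject-verb-object triples.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object)

# Hypothetical alias table, standing in for named entity consolidation (step 5).
ALIASES: Dict[str, str] = {"President Obama": "Barack Obama"}


def merge_triples(triples: List[Triple]) -> Dict[str, Set[Tuple[str, str]]]:
    """Build an adjacency map: node -> set of outgoing (verb, object) edges."""
    graph: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for subj, verb, obj in triples:
        subj = ALIASES.get(subj, subj)  # map surface forms to one canonical name
        obj = ALIASES.get(obj, obj)
        graph[subj].add((verb, obj))
        graph.setdefault(obj, set())  # object-only nodes still appear in the graph
    return dict(graph)


# Made-up triples; both subject mentions should end up on a single node.
triples = [
    ("President Obama", "sign", "bill"),
    ("Barack Obama", "meet", "senators"),
]
graph = merge_triples(triples)
```

Because both mentions are mapped to the same canonical name before insertion, their edges end up attached to one node, which is what makes the per-text triples form a connected graph.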


Approximate subgraph matching

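[Figure: two-step pipeline over the two example graphs (believers, suicide bomber, church, kill, tear down, count 12; police officer, bomber, police station, slaughter, blow up, count 2)]

• GENERALIZE: each graph is rewritten with the labels people, person, location, kill, destroy, number, count
• FREQUENT SUBTREE MINING: the generalized graphs yield the shared pattern people, person, location, kill, destroy, number, count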
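The two steps can be sketched in a deliberately simplified form: every label is lifted to its most general ancestor, and then edges that occur in enough graphs are kept. Counting single edges is a stand-in used here for brevity; real frequent subtree mining searches for larger connected patterns. The taxonomy `HYPERNYM`, the extra wear/vest edge, and the threshold of 0.6 are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object)

# Toy taxonomy (child -> parent); a hypothetical stand-in for WordNet.
HYPERNYM: Dict[str, str] = {
    "believers": "people", "police officer": "people",
    "suicide bomber": "bomber", "bomber": "person",
    "church": "building", "police station": "building", "building": "location",
    "slaughter": "kill", "tear down": "destroy", "blow up": "destroy",
    "12": "number", "2": "number",
}


def generalize(label: str) -> str:
    """Walk up the taxonomy to the most general ancestor of `label`."""
    while label in HYPERNYM:
        label = HYPERNYM[label]
    return label


def frequent_triples(graphs: List[List[Triple]], min_support: float) -> Set[Triple]:
    """Generalized edges that occur in at least `min_support` of the graphs."""
    counts: Counter = Counter()
    for graph in graphs:
        # A set, so each graph votes at most once per generalized edge.
        counts.update({tuple(generalize(x) for x in t) for t in graph})
    return {t for t, c in counts.items() if c / len(graphs) >= min_support}


graphs = [
    [("suicide bomber", "kill", "believers"), ("suicide bomber", "tear down", "church"),
     ("believers", "count", "12"), ("suicide bomber", "wear", "vest")],
    [("bomber", "slaughter", "police officer"), ("bomber", "blow up", "police station"),
     ("police officer", "count", "2")],
]
pattern = frequent_triples(graphs, min_support=0.6)
```

The edge about the vest appears in only one of the two graphs, so it falls below the example threshold and is dropped from the pattern.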


Approximate subgraph matching

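[Figure: SPECIALIZE step. The general pattern (people, person, location, kill, destroy, number, count) is specialized against the two example graphs (believers, suicide bomber, church, kill, tear down, count 12; police officer, bomber, police station, slaughter, blow up, count 2) into the template people, bomber, building, kill, destroy, number, count]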
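One way to picture the specialization step is as a search for the most specific label that still covers all instances matched by a general node. The sketch below computes the deepest common ancestor in a toy taxonomy; it requires every instance to be covered rather than applying a frequency threshold, and the taxonomy and function names are hypothetical, so it should be read as an illustration rather than the method used in the talk.

```python
from typing import Dict, List

# Toy taxonomy (child -> parent); a hypothetical stand-in for WordNet.
HYPERNYM: Dict[str, str] = {
    "suicide bomber": "bomber", "bomber": "person",
    "church": "building", "police station": "building", "building": "location",
    "believers": "people", "police officer": "people",
}


def ancestors(label: str) -> List[str]:
    """Return `label` followed by its ancestors, most specific first."""
    chain = [label]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def specialize(instances: List[str]) -> str:
    """Most specific label that still covers every instance."""
    common = set(ancestors(instances[0]))
    for label in instances[1:]:
        common &= set(ancestors(label))
    # Walking the first chain bottom-up, the first shared entry is the deepest one.
    for candidate in ancestors(instances[0]):
        if candidate in common:
            return candidate
    raise ValueError("instances share no ancestor")


# Instances matched by the general nodes "person" and "location" in the two graphs.
person_slot = specialize(["suicide bomber", "bomber"])
location_slot = specialize(["church", "police station"])
```

On the toy taxonomy this reproduces the move shown in the figure: person narrows to bomber and location narrows to building, while people stays as it is.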


Preliminary results

• 5 test domains; for each:
  o ~10 graphs, ~10000 nodes
  o 10-60 seconds

• At min. support 30%
  o 20 maximal patterns, 9 manually judged as interesting


Conclusion

• Future work:
  o Mapping text -> semantics
    • Other ontologies?
  o Interestingness measure for assertions and patterns
  o Evaluation (precision, recall; multiple domains)
  o Alternative approaches to generalizing subgraphs

• Template extraction is achievable, but not easy
• Human filtering of results hard to avoid
• Current approach reasonably fast


Q?

Thank you.

Can we extract all relations? Kind of …

Thousands of small quakes resumed 18 months ago and continue to rattle Mammoth Lakes, June Lake and other Mono County resort towns. The temblors, most measuring 1 to 3 on the Richter scale, started beneath Mammoth Mountain.

Subject – Verb – Object

Triplets

Semantic Graph


Templates - why

• Interpret content
  o news archives: structure/annotate old texts, enable semantic search
  o Wikipedia: suggestions for infobox entries

• Generate content
  o Wikipedia: a starting point for new articles / a checklist of information to be included

• No normative definition of “good template”


Evaluation

• Qualitative
  o Usage-specific
  o Not useful for tuning algorithms

• Quantitative
  o Precision
  o Recall
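For the quantitative measures, a minimal sketch of how precision and recall could be computed over sets of mined patterns, assuming a reference set of relevant patterns is available. The pattern identifiers and the three "missed" patterns below are hypothetical and only serve to exercise the function; the 9-of-20 split mirrors the preliminary results if "judged as interesting" is read as "relevant".

```python
from typing import Set, Tuple


def precision_recall(found: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = |found & relevant| / |found|; recall = |found & relevant| / |relevant|."""
    hits = len(found & relevant)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: 20 mined patterns, 9 of them relevant,
# plus 3 relevant patterns the miner is assumed to have missed.
found = {f"p{i}" for i in range(20)}
relevant = {f"p{i}" for i in range(9)} | {"m0", "m1", "m2"}
precision, recall = precision_recall(found, relevant)
```

Precision needs only the manual judgments of the mined patterns, whereas recall also needs the relevant patterns that were not found, which is why it is harder to obtain in practice.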