NL search: hype or reality?

Giuseppe Attardi
Dipartimento di Informatica
Università di Pisa

With H. Zaragoza, J. Atserias, M. Ciaramita of Yahoo! Research Barcelona
Hakia

Hakia's Aims and Benefits

Hakia is building the Web's new "meaning-based" search engine with the sole purpose of improving search relevancy and interactivity, pushing the current boundaries of Web search. The benefits to the end user are search efficiency, richness of information, and time savings.

Hakia's Promise

The basic promise is to bring search results by meaning match - similar to the human brain's cognitive skills - rather than by the mere occurrence (or popularity) of search terms. Hakia's new technology is a radical departure from the conventional indexing approach, because indexing has severe limitations in handling full-scale semantic search.

Hakia's Appeal

Hakia's capabilities will appeal to all Web searchers - especially those engaged in research on knowledge-intensive subjects, such as medicine, law, finance, science, and literature.
Hakia "meaning-based" search

Ontological Semantics

A formal and comprehensive linguistic theory of meaning in natural language
A set of resources, including:
– a language-independent ontology of 8,000 interrelated concepts
– an ontology-based English lexicon of 100,000 word senses
– an ontological parser which "translates" every sentence of the text into its text meaning representation
– an acquisition toolbox which ensures the homogeneity of the ontological concepts and lexical entries produced by different acquirers of limited training
OntoSem Lexicon Example

Bow

    (bow-n1
      (cat n)
      (anno (def "instrument for archery"))
      (syn-struc ((root $var0) (cat n)))
      (sem-struc (bow)))

    (bow-n2
      (cat n)
      (anno (def "part of string-instruments"))
      (syn-struc ((root $var0) (cat n)))
      (sem-struc (stringed-instrument-bow)))
Lexicon (Bow)

    (bow-v1
      (cat v)
      (anno (def "to give in to someone or something"))
      (syn-struc ((subject ((root $var2) (cat np)))
                  (root $var0) (cat v)
                  (pp-adjunct ((root to)
                               (cat prep)
                               (obj ((root $var3) (cat np)))))))
      (sem-struc (yield-to
                   (agent (value ^$var2))
                   (caused-by (value ^$var3)))))
QDEX

QDEX extracts all possible queries that can be asked of a Web page, at various lengths and in various forms
Queries (sequences) become gateways to the originating documents, paragraphs and sentences during retrieval
QDEX vs Inverted Index

An inverted index has a huge "active" data set prior to a query from the user.
Enriching this data set with semantic equivalences (concept relations) will further increase the operational burden in an exponential manner.
QDEX has a tiny active set for each query, and semantic associations can be easily handled on the fly.
QDEX combinatorics

The critical point in the QDEX system is to be able to decompose sentences into a handful of meaningful sequences without getting lost in the combinatorial explosion.
For example, a sentence with 8 significant words can generate over a billion sequences (of 1, 2, 3, 4, 5, and 6 words), of which only a few dozen make sense by human inspection.
The challenge is how to reduce a billion possibilities to the few dozen that make sense. hakia uses OntoSem technology to meet this challenge.
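To make the combinatorics concrete, here is a minimal C++ sketch (not hakia's code, whose decomposition is proprietary) that enumerates the order-preserving word subsequences of length 1 to 6 of an 8-word sentence; the space grows far larger once reorderings and semantic paraphrases are counted as well.

    #include <iostream>
    #include <string>
    #include <vector>

    // Brute-force enumeration of the order-preserving word subsequences of
    // length 1..6 that a QDEX-style decomposer must prune.
    static void enumerate(const std::vector<std::string>& words, size_t start,
                          std::vector<std::string>& current, size_t maxLen,
                          size_t& count) {
        if (!current.empty()) {
            ++count;                              // one candidate query sequence
            if (current.size() == maxLen) return; // do not grow past maxLen words
        }
        for (size_t i = start; i < words.size(); ++i) {
            current.push_back(words[i]);
            enumerate(words, i + 1, current, maxLen, count);
            current.pop_back();
        }
    }

    int main() {
        std::vector<std::string> sentence = {"hakia", "builds", "a", "meaning",
                                             "based", "semantic", "search", "engine"};
        std::vector<std::string> current;
        size_t count = 0;
        enumerate(sentence, 0, current, 6, count);
        std::cout << count << " candidate sequences\n";  // 246 for 8 words
    }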
Semantic Rank

A pool of relevant paragraphs comes from the QDEX system for the given query terms
Final relevancy is determined by an advanced sentence analysis and concept match between the query and the best sentence of each paragraph
Morphological and syntactic analyses are also performed
No keyword matching or Boolean algebra is involved
The credibility and age of the Web page are also taken into account
Powerset

Powerset Demo

NL Question on Wikipedia
What companies did IBM acquire?
Which company did IBM acquire in 1989?
Google query on Wikipedia
Same queries, poorer results
Try yourself

Who acquired IBM?
IBM acquisitions 1996
IBM acquisitions
What do liberal democrats say about healthcare
– 1.4 million matches
Problems

The parser from Xerox is a quite sophisticated constituent parser:
– it produces all possible parse trees
– fairly slow
Workaround: index only the most relevant portion of the Web
Reality

Semantic Document Analysis

Question Answering
– return precise answers to natural language queries
Relation Extraction
Intent Mining
– assess the attitude of the document author with respect to a given subject
– opinion mining: the attitude is a positive or negative opinion
Semantic Retrieval Approaches

Used in QA, Opinion Retrieval, etc.
Typical 2-stage approach:
1. perform IR and rank by topic relevance
2. postprocess results with filters and rerank
Generally slow:
– requires several minutes to process each query
Single-stage approach

Single-stage approach:
– enrich the index with opinion tags
– perform normal retrieval with a custom ranking function
Proved effective at the TREC 2006 Blog Opinion Mining Task
Enriched Index for TREC Blog

Overlay words with tags

[Figure: the sentence "music is a touch lame" at positions 1-5, with tags overlaid at the same positions: music → soundtrack, ART; touch → little, bit; lame → weak, plate, NEGATIVE]
Enhanced Queries

music NEGATIVE:lame
music NEGATIVE:*

Achieved 3rd best P@5 at TREC Blog Track 2006
Enriched Inverted Index

Inverted Index

Stored compressed
– ~1 byte per term occurrence
Efficient intersection operation (sketched below)
– O(n) where n is the length of the shortest postings list
– using skip lists further reduces cost
Size: ~1/8 of the original text
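A minimal sketch of the underlying operation (a generic two-pointer merge, not IXE's implementation): the cost is linear in the postings scanned, and skip pointers, or the galloping of the next slide, avoid touching every posting of the longer list.

    #include <vector>

    // Sketch (not IXE code): intersection of two sorted posting lists by
    // linear merge; each step advances at least one cursor.
    std::vector<int> intersect(const std::vector<int>& a,
                               const std::vector<int>& b) {
        std::vector<int> out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])      ++i;   // a skip list would jump ahead here
            else if (b[j] < a[i]) ++j;
            else { out.push_back(a[i]); ++i; ++j; }  // common docID
        }
        return out;
    }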
Small Adaptive Set Intersection

[Figure: posting lists for the query terms world, wide, web]

    world: 3  9  12  20  40  47
    wide:  1  8  10  25  40
    web:   2  4   6  21  30  35  40  41

(the only docID common to all three lists is 40)
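A sketch of the adaptive strategy under its usual interpretation (small adaptive set intersection in the style of Demaine et al.; IXE's actual code may differ): drive the scan from the shortest list and gallop into the others instead of scanning them linearly.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Gallop (doubling search) from the saved position lo to find target;
    // lo persists across calls so each list is traversed at most once.
    static bool contains(const std::vector<int>& list, int target, size_t& lo) {
        size_t step = 1, hi = lo;
        while (hi < list.size() && list[hi] < target) {  // gallop forward
            lo = hi;
            hi += step;
            step *= 2;
        }
        hi = std::min(hi + 1, list.size());
        lo = std::lower_bound(list.begin() + lo, list.begin() + hi, target)
             - list.begin();
        return lo < list.size() && list[lo] == target;
    }

    int main() {
        std::vector<std::vector<int>> lists = {
            {1, 8, 10, 25, 40},             // wide (shortest: drives the scan)
            {3, 9, 12, 20, 40, 47},         // world
            {2, 4, 6, 21, 30, 35, 40, 41},  // web
        };
        std::vector<size_t> pos(lists.size(), 0);
        for (int doc : lists[0]) {
            bool inAll = true;
            for (size_t i = 1; i < lists.size() && inAll; ++i)
                inAll = contains(lists[i], doc, pos[i]);
            if (inAll) std::printf("match: %d\n", doc);  // prints 40
        }
    }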
IXE Search Engine Library

C++ OO architecture
Fast indexing
– sort-based inversion
Fast search
– efficient algorithms and data structures
– query compiler
  • Small Adaptive Set Intersection
– suffix array with supra index
– memory-mapped index files
Programmable API library
Template metaprogramming
Object Store database
IXE Performance

TREC Terabyte 2005:
– 2nd fastest
– 2nd best P@5
Query Processing

Query compiler
– one cursor on posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– returns the first result r >= min
A single operator for all kinds of queries: e.g. proximity
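A sketch of how one operator suffices (the class and member names are taken from the slide; the bodies are an assumed reconstruction, not IXE source): conjunction becomes a fixpoint of next(min) calls over the children.

    #include <vector>

    struct Result { int doc = 0; };

    struct QueryCursor {
        virtual bool next(Result& min) = 0;   // advance to first result >= min
        virtual ~QueryCursor() = default;
    };

    // Leapfrog conjunction: keep advancing children until they all agree on
    // the same doc (assumes a cursor asked again for the doc it just
    // returned yields that same doc).
    struct CursorAnd : QueryCursor {
        std::vector<QueryCursor*> children;
        bool next(Result& min) override {
            for (size_t i = 0; i < children.size(); ) {
                Result r = min;
                if (!children[i]->next(r)) return false;  // a child is exhausted
                if (r.doc > min.doc) {   // candidate moved: restart the round
                    min = r;
                    i = 0;
                } else {
                    ++i;                 // child matched the current candidate
                }
            }
            return true;                 // all children match min.doc
        }
    };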
IXE Composability

[Class diagram: Collection<DocInfo> and Collection<PassageDoc> parameterize the document store (DocInfo holds name, date, size; PassageDoc adds text and passage boundaries), while Cursor, QueryCursor and PassageQueryCursor all expose the same next() interface]
Passage Retrieval

Documents are split into passages
Matches are searched in passages ± n nearby
Results are ranked passages
Efficiency requires a special store for passage boundaries
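One plausible realization of the boundary store (an assumption for illustration, not IXE's actual layout): keep the sorted passage start offsets of each document, so mapping a match position to its passage is a binary search.

    #include <algorithm>
    #include <vector>

    // Sketch: boundaries[i] is the start offset of passage i (boundaries[0] == 0).
    // Returns the index of the passage containing position pos.
    int passageOf(const std::vector<int>& boundaries, int pos) {
        auto it = std::upper_bound(boundaries.begin(), boundaries.end(), pos);
        return int(it - boundaries.begin()) - 1;
    }

    // e.g. with boundaries {0, 120, 310}, a match at position 150 is in passage 1.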
QA Using Dependency Relations

Build dependency trees for both question and answer
Determine the similarity of corresponding paths in the dependency trees of question and answer
PiQASso Answer Matching

[Pipeline figure]
Question: "What metal has the highest melting point?"
Candidate: "Tungsten is a very dense material and has the highest melting point of any metal."
1. Parsing
2. Answer type check (expected type: SUBSTANCE)
3. Relation extraction: <tungsten, material, pred>, <tungsten, has, subj>, <point, has, obj>, …
4. Matching distance
5. Distance filtering
6. Popularity ranking
→ ANSWER: Tungsten
QA Using Dependency Relations

Further developed by Cui et al., NUS
Score computed by a statistical translation model
Second best at TREC 2004
Wikipedia Experiment

Tagged Wikipedia with:
– POS
– LEMMA
– NE (WSJ, IEER)
– WN Super Senses
– Anaphora
– Parsing (head, dependency)
Tools Used

SST tagger [Ciaramita & Altun]
DeSR dependency parser [Attardi & Ciaramita]
– fast: 200 sentences/sec
– accurate: 90% UAS
Dependency Parsing

Produces dependency trees
Word-word dependency relations
Far easier to understand and to annotate

[Figure: dependency tree of "Rolls-Royce Inc. said it expects its sales to remain steady", with SUBJ, OBJ, MOD and TO arcs]
Classifier-based Shift-Reduce Parsing

[Figure: parsing "He/PP saw/VVD a/DT girl/NN with/IN a/DT telescope/NNS ./SENT"; at each step a classifier chooses a Shift, Left or Right action given the top of the stack and the next input token]
CoNLL 2007 Results

    Language    UAS     LAS
    Catalan     92.20   87.64
    Chinese     86.73   86.86
    English     86.99   85.85
    Italian     85.54   81.34
    Czech       83.40   77.37
    Turkish     83.56   76.87
    Arabic      82.53   72.66
    Hungarian   81.81   76.81
    Greek       80.75   73.92
    Basque      76.86   69.84
EvalIta 2007 Results

    Collection    UAS     LAS
    Cod. Civile   91.37   79.13
    Newspaper     85.49   76.62

Best statistical parser
Experiment

Experimental data sets
Wikipedia
Yahoo! Answers
English Wikipedia Indexing

Original size: 4.4 GB
Number of articles: 1,400,000
Tagging time: ~3 days (6 days with previous tools)
Parsing time: 40 hours
Indexing time: 9 hours (8 days with UIMA + Lucene)
Index size: 3 GB
Metadata: 12 GB
Scaling Indexing

Highly parallelizable
Using Hadoop in stream mode
Example (partial)
TERM POS LEMMA WNSS HEAD DEP
The DT the 0 2 NMOD
Tories NNPS tory B-noun.person 3 SUB
won VBD win B-verb.competition 0 VMOD
this DT this 0 5 NMOD
election NN election B-noun.act 3 OBJ
Stacked View

            1     2               3                   4     5
    TERM    The   Tories          won                 this  election
    POS     DT    NNPS            VBD                 DT    NN
    LEMMA   the   tory            win                 this  election
    WNSS    0     B-noun.person   B-verb.competition  0     B-noun.act
    HEAD    2     3               0                   5     3
    DEP     NMOD  SUB             VMOD                NMOD  OBJ
Implementation

Special version of Passage Retrieval
Tags are overlaid on words (see the sketch below)
– dealt with as terms in the same position as the corresponding word
– not counted, to avoid skewing TF/IDF
– given an ID in the lexicon
Retrieval is fast:
– a few msec per query on a 10 GB index
Provided as both a Linux library and a Windows DLL
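A toy sketch of the overlay idea (an in-memory stand-in, not the IXE data structures): the tag is indexed as an extra term at the same position as the word it annotates, so a query like NEGATIVE:lame reduces to a same-position check.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // term -> positions within one document
        std::map<std::string, std::vector<int>> index = {
            {"music", {1}}, {"is", {2}}, {"a", {3}}, {"touch", {4}}, {"lame", {5}},
            {"NEGATIVE", {5}},           // overlaid on "lame": same position
        };
        // query NEGATIVE:lame -- matches iff the posting lists share a position
        for (int p : index["NEGATIVE"])
            for (int q : index["lame"])
                if (p == q)
                    std::printf("NEGATIVE:lame matches at position %d\n", p);
    }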
Java Interface

Generated using SWIG
Results accessible through a ResultIterator
The list of terms or tags for a sentence is generated on demand
Proximity queries

Did France win the World Cup?
proximity 15 [MORPH/win:* DEP/SUB:france 'world cup']

Born in the French territory of New Caledonia, he was a vital player in the French team that won the 1998 World Cup and was on the squad, but played just one game, as France won Euro 2000.
France repeated the feat of Argentina in 1998, by taking the title as they won their home 1998 World Cup, beating Brazil.
Both England (1966) and France (1998) won their only World Cups whilst playing as host nations.
Proximity queries

Who won the World Cup in 1998?
proximity 13 [MORPH/win:* DEP/SUB:* 'world cup' WSJ/DATE:1998]

With the French national team, Dugarry won World Cup 1998 and Euro 2000.
He captained Arsenal and won the World Cup with France in 1998.

Did France win the World Cup in 2002?
proximity 30 [MORPH/win:* DEP/SUB:france 'world cup' WSJ/DATE:2002]

No result.

Who won it in 2002?
proximity 6 [MORPH/win:* DEP/SUB:* 'world cup' 2002]

He has 105 caps for Brazil, and helped his country win the World Cup in 2002 after finishing second in 1998.
2002 - Brazil wins the Football World Cup, becoming the first team to win the trophy 5 times
Dependency Queries

deprel [ pattern headPattern ]

Semantics: the clause matches any document that contains a match for pattern whose head matches headPattern

Implementation:
    search for pattern
    for each match at (doc, pos):
        find h = head(doc, pos)
        find a match for headPattern at (doc, h±2)
Finding heads

How to find head(doc, pos)?
Solution: store the HEAD positions in a special posting list.
A posting list stores the positions where a term occurs in a document.
The HEADS posting list stores the heads of each term in a document.
Finding Heads

To retrieve head(doc, pos), one accesses the posting list of HEADS for doc and extracts the pos-th item.
Posting lists are efficient since they are stored compressed on disk and accessed through memory mapping.
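Putting the two slides together, a minimal sketch (in-memory stand-ins for the compressed, memory-mapped posting lists; names like deprelMatch are mine, not IXE's):

    #include <cstdlib>
    #include <vector>

    struct Doc {
        std::vector<int> heads;   // HEADS posting list: heads[pos] = head of token pos
    };

    // head(doc, pos): the pos-th entry of the document's HEADS posting list
    int head(const Doc& doc, int pos) { return doc.heads[pos]; }

    // deprel [ pattern headPattern ]: true if some occurrence of pattern has
    // its head within +-2 positions of an occurrence of headPattern (the
    // slack absorbs multi-word matches, per the h±2 on the earlier slide).
    bool deprelMatch(const Doc& doc,
                     const std::vector<int>& patternPositions,
                     const std::vector<int>& headPatternPositions) {
        for (int pos : patternPositions) {
            int h = head(doc, pos);
            for (int hp : headPatternPositions)
                if (std::abs(hp - h) <= 2) return true;
        }
        return false;
    }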
Dependency Paths

deprel [ pattern0 pattern1 … patterni ]

Note: opposite direction from XPath

Multiple Tags

DEP/SUB:MORPH/insect:*
Dependency queries

Who won the elections?

deprel [ election won ]
deprel [ DEP/OBJ:election MORPH/win:* ]

The Scranton/Shafer team won the election over Philadelphia mayor Richardson Dilworth and Shafer became the state's lieutenant
Collect

What are the causes of death?
deprel [ from MORPH/die:* ]

She died from throat cancer in Sherman Oaks, California.
Wilson died from AIDS.
Demo

Deep Search on Wikipedia
– Web interface
– queries with tags and deprel
Browsing of Deep Search results
– sentences are collected
– a graph of sentences/entities is created
  • WebGraph [Boldi-Vigna]
– results clustered through most frequent entities
Issues

Dependency relations are crude for English (30 in total)
– SUB, OBJ, NMOD
Better for Catalan (168)
– distinguish time/location/cause adverbials
The relation might not be direct
– e.g. "die from cancer"
Queries can't express the SUB/OBJ relationship
Semantic Relations?

The movie is not a masterpiece

Target-Opinion

Or a few general relation types?
Directly/Indirectly
Affirmative/Negative/Dubitative
Active/Passive
Translating Queries

Compile the NL query into query syntax
Learn from examples, e.g. Yahoo! Answers
Generic Quadruple

(Subject, Object, Verb, Mode)
Support searching for quadruples
Rank based on distance

[Figure: S, O and M nodes linked to a central V node]
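A hypothetical sketch of distance-based ranking for quadruple matches (the slides do not specify the actual formula; this is one plausible reading, with all names mine):

    #include <cstdlib>

    // A (Subject, Object, Verb, Mode) match within one sentence.
    struct Quadruple {
        int subjPos, objPos, verbPos, modePos;   // token positions
    };

    // Smaller spread of the matched tokens around the verb -> better rank.
    double score(const Quadruple& q) {
        int spread = std::abs(q.subjPos - q.verbPos)
                   + std::abs(q.objPos  - q.verbPos)
                   + std::abs(q.modePos - q.verbPos);
        return 1.0 / (1.0 + spread);
    }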
Related Work

Chakrabarti

Proposes to use proximity queries
On a Web index built with Lucene and UIMA
We are hiring

Three projects starting:
1. Semantic Search on Italian Wikipedia
   2 research fellowships (assegni di ricerca), Fond. Cari Pisa
2. Deep Search
   2 PostDocs, Yahoo! Research
3. Machine Translation
   2 PostDocs
Questions

Discussion

Are there other uses than search?
– better query refinement
– semantic clustering
– Vanessa Murdock's aggregated result visualization
Interested in getting access to the resource for experimentation?
Shall relation types be learned?
Will it scale?