NL search: hype or reality?

Giuseppe Attardi
Dipartimento di Informatica
Università di Pisa

With H. Zaragoza, J. Atserias, M. Ciaramita of Yahoo! Research Barcelona
Hakia

Hakia's Aims and Benefits

Hakia is building the Web's new "meaning-based" search engine with the sole purpose of improving search relevancy and interactivity, pushing the current boundaries of Web search. The benefits to the end user are search efficiency, richness of information, and time savings.

Hakia's Promise

The basic promise is to bring search results by meaning match - similar to the human brain's cognitive skills - rather than by the mere occurrence (or popularity) of search terms. Hakia's new technology is a radical departure from the conventional indexing approach, because indexing has severe limitations in handling full-scale semantic search.

Hakia's Appeal

Hakia's capabilities will appeal to all Web searchers - especially those engaged in research on knowledge-intensive subjects, such as medicine, law, finance, science, and literature.
Hakia "meaning-based" search

Ontological Semantics

A formal and comprehensive linguistic theory of meaning in natural language
A set of resources, including:
– a language-independent ontology of 8,000 interrelated concepts
– an ontology-based English lexicon of 100,000 word senses
– an ontological parser which "translates" every sentence of the text into its text meaning representation
– an acquisition toolbox which ensures the homogeneity of the ontological concepts and lexical entries produced by different acquirers of limited training
OntoSem Lexicon Example

Bow

    (bow-n1
      (cat n)
      (anno (def "instrument for archery"))
      (syn-struc ((root $var0) (cat n)))
      (sem-struc (bow)))

    (bow-n2
      (cat n)
      (anno (def "part of string-instruments"))
      (syn-struc ((root $var0) (cat n)))
      (sem-struc (stringed-instrument-bow)))
Lexicon (Bow)

    (bow-v1
      (cat v)
      (anno (def "to give in to someone or something"))
      (syn-struc ((subject ((root $var2) (cat np)))
                  (root $var0) (cat v)
                  (pp-adjunct ((root to)
                               (cat prep)
                               (obj ((root $var3) (cat np)))))))
      (sem-struc (yield-to
                   (agent (value ^$var2))
                   (caused-by (value ^$var3)))))
QDEX

QDEX extracts all possible queries that can be asked of a Web page, at various lengths and in various forms
Queries (sequences) become gateways to the originating documents, paragraphs and sentences during retrieval
QDEX vs Inverted Index

An inverted index has a huge "active" data set prior to a query from the user.
Enriching this data set with semantic equivalences (concept relations) will further increase the operational burden in an exponential manner.
QDEX has a tiny active set for each query, and semantic associations can be easily handled on the fly.
QDEX combinatorics

The critical point in the QDEX system is to be able to decompose sentences into a handful of meaningful sequences without getting lost in the combinatorial explosion.
For example, a sentence with 8 significant words can generate over a billion sequences (of 1, 2, 3, 4, 5, and 6 words), of which only a few dozen make sense by human inspection.
The challenge is how to reduce a billion possibilities to the few dozen that make sense. hakia uses OntoSem technology to meet this challenge.
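To make the combinatorics concrete, here is a minimal C++ sketch (not hakia's code, whose decomposition is proprietary) that enumerates the order-preserving word subsequences of length 1 to 6 of an 8-word sentence; the space grows far larger once reorderings and semantic paraphrases are counted as well.

    #include <iostream>
    #include <string>
    #include <vector>

    // Brute-force enumeration of the order-preserving word subsequences of
    // length 1..6 that a QDEX-style decomposer must prune.
    static void enumerate(const std::vector<std::string>& words, size_t start,
                          std::vector<std::string>& current, size_t maxLen,
                          size_t& count) {
        if (!current.empty()) {
            ++count;                              // one candidate query sequence
            if (current.size() == maxLen) return; // do not grow past maxLen words
        }
        for (size_t i = start; i < words.size(); ++i) {
            current.push_back(words[i]);
            enumerate(words, i + 1, current, maxLen, count);
            current.pop_back();
        }
    }

    int main() {
        std::vector<std::string> sentence = {"hakia", "builds", "a", "meaning",
                                             "based", "semantic", "search", "engine"};
        std::vector<std::string> current;
        size_t count = 0;
        enumerate(sentence, 0, current, 6, count);
        std::cout << count << " candidate sequences\n";  // 246 for 8 words
    }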
Semantic Rank

A pool of relevant paragraphs comes from the QDEX system for the given query terms
Final relevancy is determined by an advanced sentence analysis and concept match between the query and the best sentence of each paragraph
Morphological and syntactic analyses are also performed
No keyword matching or Boolean algebra is involved
The credibility and age of the Web page are also taken into account
Powerset

Powerset Demo

NL Question on Wikipedia
What companies did IBM acquire?
Which company did IBM acquire in 1989?
Google query on Wikipedia
Same queries, poorer results
Try yourself

Who acquired IBM?
IBM acquisitions 1996
IBM acquisitions
What do liberal democrats say about healthcare
– 1.4 million matches
Problems

The parser from Xerox is a quite sophisticated constituent parser:
– it produces all possible parse trees
– fairly slow
Workaround: index only the most relevant portion of the Web
Reality

Semantic Document Analysis

Question Answering
– return precise answers to natural language queries
Relation Extraction
Intent Mining
– assess the attitude of the document author with respect to a given subject
– opinion mining: the attitude is a positive or negative opinion
Semantic Retrieval Approaches

Used in QA, Opinion Retrieval, etc.
Typical 2-stage approach:
1. perform IR and rank by topic relevance
2. postprocess results with filters and rerank
Generally slow:
– requires several minutes to process each query
Single-stage approach

Single-stage approach:
– enrich the index with opinion tags
– perform normal retrieval with a custom ranking function
Proved effective at the TREC 2006 Blog Opinion Mining Task
Enriched Index for TREC Blog

Overlay words with tags

[Figure: the sentence "music is a touch lame" at positions 1-5, with tags overlaid at the same positions: music → soundtrack, ART; touch → little, bit; lame → weak, plate, NEGATIVE]
Enhanced Queries

music NEGATIVE:lame
music NEGATIVE:*

Achieved 3rd best P@5 at TREC Blog Track 2006
Enriched Inverted Index

Inverted Index

Stored compressed
– ~1 byte per term occurrence
Efficient intersection operation (sketched below)
– O(n) where n is the length of the shortest postings list
– using skip lists further reduces cost
Size: ~1/8 of the original text
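A minimal sketch of the underlying operation (a generic two-pointer merge, not IXE's implementation): the cost is linear in the postings scanned, and skip pointers, or the galloping of the next slide, avoid touching every posting of the longer list.

    #include <vector>

    // Sketch (not IXE code): intersection of two sorted posting lists by
    // linear merge; each step advances at least one cursor.
    std::vector<int> intersect(const std::vector<int>& a,
                               const std::vector<int>& b) {
        std::vector<int> out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])      ++i;   // a skip list would jump ahead here
            else if (b[j] < a[i]) ++j;
            else { out.push_back(a[i]); ++i; ++j; }  // common docID
        }
        return out;
    }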
Small Adaptive Set Intersection

[Figure: posting lists for the query terms world, wide, web]

    world: 3  9  12  20  40  47
    wide:  1  8  10  25  40
    web:   2  4   6  21  30  35  40  41

(the only docID common to all three lists is 40)
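A sketch of the adaptive strategy under its usual interpretation (small adaptive set intersection in the style of Demaine et al.; IXE's actual code may differ): drive the scan from the shortest list and gallop into the others instead of scanning them linearly.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Gallop (doubling search) from the saved position lo to find target;
    // lo persists across calls so each list is traversed at most once.
    static bool contains(const std::vector<int>& list, int target, size_t& lo) {
        size_t step = 1, hi = lo;
        while (hi < list.size() && list[hi] < target) {  // gallop forward
            lo = hi;
            hi += step;
            step *= 2;
        }
        hi = std::min(hi + 1, list.size());
        lo = std::lower_bound(list.begin() + lo, list.begin() + hi, target)
             - list.begin();
        return lo < list.size() && list[lo] == target;
    }

    int main() {
        std::vector<std::vector<int>> lists = {
            {1, 8, 10, 25, 40},             // wide (shortest: drives the scan)
            {3, 9, 12, 20, 40, 47},         // world
            {2, 4, 6, 21, 30, 35, 40, 41},  // web
        };
        std::vector<size_t> pos(lists.size(), 0);
        for (int doc : lists[0]) {
            bool inAll = true;
            for (size_t i = 1; i < lists.size() && inAll; ++i)
                inAll = contains(lists[i], doc, pos[i]);
            if (inAll) std::printf("match: %d\n", doc);  // prints 40
        }
    }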
IXE Search Engine Library

C++ OO architecture
Fast indexing
– sort-based inversion
Fast search
– efficient algorithms and data structures
– query compiler
  • Small Adaptive Set Intersection
– suffix array with supra index
– memory-mapped index files
Programmable API library
Template metaprogramming
Object Store database
IXE Performance

TREC Terabyte 2005:
– 2nd fastest
– 2nd best P@5
Query Processing

Query compiler
– one cursor on posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– returns the first result r >= min
A single operator for all kinds of queries: e.g. proximity
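A sketch of how one operator suffices (the class and member names are taken from the slide; the bodies are an assumed reconstruction, not IXE source): conjunction becomes a fixpoint of next(min) calls over the children.

    #include <vector>

    struct Result { int doc = 0; };

    struct QueryCursor {
        virtual bool next(Result& min) = 0;   // advance to first result >= min
        virtual ~QueryCursor() = default;
    };

    // Leapfrog conjunction: keep advancing children until they all agree on
    // the same doc (assumes a cursor asked again for the doc it just
    // returned yields that same doc).
    struct CursorAnd : QueryCursor {
        std::vector<QueryCursor*> children;
        bool next(Result& min) override {
            for (size_t i = 0; i < children.size(); ) {
                Result r = min;
                if (!children[i]->next(r)) return false;  // a child is exhausted
                if (r.doc > min.doc) {   // candidate moved: restart the round
                    min = r;
                    i = 0;
                } else {
                    ++i;                 // child matched the current candidate
                }
            }
            return true;                 // all children match min.doc
        }
    };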
IXE Composability

[Class diagram: Collection<DocInfo> and Collection<PassageDoc> parameterize the document store (DocInfo holds name, date, size; PassageDoc adds text and passage boundaries), while Cursor, QueryCursor and PassageQueryCursor all expose the same next() interface]
Passage Retrieval

Documents are split into passages
Matches are searched in passages ± n nearby
Results are ranked passages
Efficiency requires a special store for passage boundaries
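One plausible realization of the boundary store (an assumption for illustration, not IXE's actual layout): keep the sorted passage start offsets of each document, so mapping a match position to its passage is a binary search.

    #include <algorithm>
    #include <vector>

    // Sketch: boundaries[i] is the start offset of passage i (boundaries[0] == 0).
    // Returns the index of the passage containing position pos.
    int passageOf(const std::vector<int>& boundaries, int pos) {
        auto it = std::upper_bound(boundaries.begin(), boundaries.end(), pos);
        return int(it - boundaries.begin()) - 1;
    }

    // e.g. with boundaries {0, 120, 310}, a match at position 150 is in passage 1.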
QA Using Dependency Relations

Build dependency trees for both question and answer
Determine the similarity of corresponding paths in the dependency trees of question and answer
PiQASso Answer Matching

[Pipeline figure]
Question: "What metal has the highest melting point?"
Candidate: "Tungsten is a very dense material and has the highest melting point of any metal."
1. Parsing
2. Answer type check (expected type: SUBSTANCE)
3. Relation extraction: <tungsten, material, pred>, <tungsten, has, subj>, <point, has, obj>, …
4. Matching distance
5. Distance filtering
6. Popularity ranking
→ ANSWER: Tungsten
QA Using Dependency Relations

Further developed by Cui et al., NUS
Score computed by a statistical translation model
Second best at TREC 2004
Wikipedia Experiment

Tagged Wikipedia with:
– POS
– LEMMA
– NE (WSJ, IEER)
– WN Super Senses
– Anaphora
– Parsing (head, dependency)
Tools Used

SST tagger [Ciaramita & Altun]
DeSR dependency parser [Attardi & Ciaramita]
– fast: 200 sentences/sec
– accurate: 90% UAS
Dependency Parsing

Produces dependency trees
Word-word dependency relations
Far easier to understand and to annotate

[Figure: dependency tree of "Rolls-Royce Inc. said it expects its sales to remain steady", with SUBJ, OBJ, MOD and TO arcs]
Classifier-based Shift-Reduce Parsing

[Figure: parsing "He/PP saw/VVD a/DT girl/NN with/IN a/DT telescope/NNS ./SENT"; at each step a classifier chooses a Shift, Left or Right action given the top of the stack and the next input token]
CoNLL 2007 Results

    Language    UAS     LAS
    Catalan     92.20   87.64
    Chinese     86.73   86.86
    English     86.99   85.85
    Italian     85.54   81.34
    Czech       83.40   77.37
    Turkish     83.56   76.87
    Arabic      82.53   72.66
    Hungarian   81.81   76.81
    Greek       80.75   73.92
    Basque      76.86   69.84
EvalIta 2007 Results

    Collection    UAS     LAS
    Cod. Civile   91.37   79.13
    Newspaper     85.49   76.62

Best statistical parser
Experiment

Experimental data sets
Wikipedia
Yahoo! Answers
English Wikipedia Indexing

Original size: 4.4 GB
Number of articles: 1,400,000
Tagging time: ~3 days (6 days with previous tools)
Parsing time: 40 hours
Indexing time: 9 hours (8 days with UIMA + Lucene)
Index size: 3 GB
Metadata: 12 GB
Scaling Indexing

Highly parallelizable
Using Hadoop in stream mode
Example (partial)
TERM POS LEMMA WNSS HEAD DEP
The DT the 0 2 NMOD
Tories NNPS tory B-noun.person 3 SUB
won VBD win B-verb.competition 0 VMOD
this DT this 0 5 NMOD
election NN election B-noun.act 3 OBJ
Stacked View

            1     2               3                   4     5
    TERM    The   Tories          won                 this  election
    POS     DT    NNPS            VBD                 DT    NN
    LEMMA   the   tory            win                 this  election
    WNSS    0     B-noun.person   B-verb.competition  0     B-noun.act
    HEAD    2     3               0                   5     3
    DEP     NMOD  SUB             VMOD                NMOD  OBJ
Implementation

Special version of Passage Retrieval
Tags are overlaid on words (see the sketch below)
– dealt with as terms in the same position as the corresponding word
– not counted, to avoid skewing TF/IDF
– given an ID in the lexicon
Retrieval is fast:
– a few msec per query on a 10 GB index
Provided as both a Linux library and a Windows DLL
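A toy sketch of the overlay idea (an in-memory stand-in, not the IXE data structures): the tag is indexed as an extra term at the same position as the word it annotates, so a query like NEGATIVE:lame reduces to a same-position check.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // term -> positions within one document
        std::map<std::string, std::vector<int>> index = {
            {"music", {1}}, {"is", {2}}, {"a", {3}}, {"touch", {4}}, {"lame", {5}},
            {"NEGATIVE", {5}},           // overlaid on "lame": same position
        };
        // query NEGATIVE:lame -- matches iff the posting lists share a position
        for (int p : index["NEGATIVE"])
            for (int q : index["lame"])
                if (p == q)
                    std::printf("NEGATIVE:lame matches at position %d\n", p);
    }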
Java Interface

Generated using SWIG
Results accessible through a ResultIterator
The list of terms or tags for a sentence is generated on demand
Proximity queries

Did France win the World Cup?
proximity 15 [MORPH/win:* DEP/SUB:france 'world cup']

Born in the French territory of New Caledonia, he was a vital player in the French team that won the 1998 World Cup and was on the squad, but played just one game, as France won Euro 2000.
France repeated the feat of Argentina in 1998, by taking the title as they won their home 1998 World Cup, beating Brazil.
Both England (1966) and France (1998) won their only World Cups whilst playing as host nations.
Proximity queries

Who won the World Cup in 1998?
proximity 13 [MORPH/win:* DEP/SUB:* 'world cup' WSJ/DATE:1998]

With the French national team, Dugarry won World Cup 1998 and Euro 2000.
He captained Arsenal and won the World Cup with France in 1998.

Did France win the World Cup in 2002?
proximity 30 [MORPH/win:* DEP/SUB:france 'world cup' WSJ/DATE:2002]

No result.

Who won it in 2002?
proximity 6 [MORPH/win:* DEP/SUB:* 'world cup' 2002]

He has 105 caps for Brazil, and helped his country win the World Cup in 2002 after finishing second in 1998.
2002 - Brazil wins the Football World Cup, becoming the first team to win the trophy 5 times
Dependency Queries

deprel [ pattern headPattern ]

Semantics: the clause matches any document that contains a match for pattern whose head matches headPattern

Implementation:
    search for pattern
    for each match at (doc, pos):
        find h = head(doc, pos)
        find a match for headPattern at (doc, h±2)
Finding heads

How to find head(doc, pos)?
Solution: store the HEAD positions in a special posting list.
A posting list stores the positions where a term occurs in a document.
The HEADS posting list stores the heads of each term in a document.
Finding Heads

To retrieve head(doc, pos), one accesses the posting list of HEADS for doc and extracts the pos-th item.
Posting lists are efficient since they are stored compressed on disk and accessed through memory mapping.
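Putting the two slides together, a minimal sketch (in-memory stand-ins for the compressed, memory-mapped posting lists; names like deprelMatch are mine, not IXE's):

    #include <cstdlib>
    #include <vector>

    struct Doc {
        std::vector<int> heads;   // HEADS posting list: heads[pos] = head of token pos
    };

    // head(doc, pos): the pos-th entry of the document's HEADS posting list
    int head(const Doc& doc, int pos) { return doc.heads[pos]; }

    // deprel [ pattern headPattern ]: true if some occurrence of pattern has
    // its head within +-2 positions of an occurrence of headPattern (the
    // slack absorbs multi-word matches, per the h±2 on the earlier slide).
    bool deprelMatch(const Doc& doc,
                     const std::vector<int>& patternPositions,
                     const std::vector<int>& headPatternPositions) {
        for (int pos : patternPositions) {
            int h = head(doc, pos);
            for (int hp : headPatternPositions)
                if (std::abs(hp - h) <= 2) return true;
        }
        return false;
    }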
Dependency Paths

deprel [ pattern0 pattern1 … patterni ]

Note: opposite direction from XPath

Multiple Tags

DEP/SUB:MORPH/insect:*
Dependency queries

Who won the elections?

deprel [ election won ]
deprel [ DEP/OBJ:election MORPH/win:* ]

The Scranton/Shafer team won the election over Philadelphia mayor Richardson Dilworth and Shafer became the state's lieutenant
Collect

What are the causes of death?
deprel [ from MORPH/die:* ]

She died from throat cancer in Sherman Oaks, California.
Wilson died from AIDS.
Demo

Deep Search on Wikipedia
– Web interface
– queries with tags and deprel
Browsing of Deep Search results
– sentences are collected
– a graph of sentences/entities is created
  • WebGraph [Boldi-Vigna]
– results clustered through most frequent entities
Issues

Dependency relations are crude for English (30 in total)
– SUB, OBJ, NMOD
Better for Catalan (168)
– distinguish time/location/cause adverbials
The relation might not be direct
– e.g. "die from cancer"
Queries can't express the SUB/OBJ relationship
Semantic Relations?

The movie is not a masterpiece

Target-Opinion

Or a few general relation types?
Directly/Indirectly
Affirmative/Negative/Dubitative
Active/Passive
Translating Queries

Compile the NL query into query syntax
Learn from examples, e.g. Yahoo! Answers
Generic Quadruple

(Subject, Object, Verb, Mode)
Support searching for quadruples
Rank based on distance

[Figure: S, O and M nodes linked to a central V node]
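A hypothetical sketch of distance-based ranking for quadruple matches (the slides do not specify the actual formula; this is one plausible reading, with all names mine):

    #include <cstdlib>

    // A (Subject, Object, Verb, Mode) match within one sentence.
    struct Quadruple {
        int subjPos, objPos, verbPos, modePos;   // token positions
    };

    // Smaller spread of the matched tokens around the verb -> better rank.
    double score(const Quadruple& q) {
        int spread = std::abs(q.subjPos - q.verbPos)
                   + std::abs(q.objPos  - q.verbPos)
                   + std::abs(q.modePos - q.verbPos);
        return 1.0 / (1.0 + spread);
    }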
Related Work

Chakrabarti

Proposes to use proximity queries
On a Web index built with Lucene and UIMA
We are hiring

Three projects starting:
1. Semantic Search on Italian Wikipedia
   2 research fellowships (assegni di ricerca), Fond. Cari Pisa
2. Deep Search
   2 PostDocs, Yahoo! Research
3. Machine Translation
   2 PostDocs
Questions

Discussion

Are there other uses than search?
– better query refinement
– semantic clustering
– Vanessa Murdock's aggregated result visualization
Interested in getting access to the resource for experimentation?
Shall relation types be learned?
Will it scale?