Semantic Search tutorial at SemTech 2012

1. Semantic Search Tutorial: Introduction
Peter Mika | Yahoo! Research, Spain
Thanh Tran | Institute AIFB, KIT, Germany

2. About the speakers
Peter Mika: Senior Research Scientist and head of the Semantic Search group at Yahoo! Research in Barcelona. Semantic Search, Web Object Retrieval, Natural Language Processing.
Tran Duc Thanh: Assistant Professor at AIFB, Karlsruhe Institute of Technology, and head of the Semantic Search group. Semantic Search, Semantic / Linked Data Management.

3. Agenda
Introduction (10 min)
Semantic Web data (50 min): the RDF data model; publishing RDF; crawling and indexing RDF data
Query processing (40 min)
Ranking (30 min)
Result presentation (15 min)
Semantic Search evaluation (15 min)
Questions (5 min)

4. Why Semantic Search? I.
"We are at the beginning of search." (Marissa Mayer)
Large classes of queries have been solved, e.g. navigational queries, through heavy investment in computational power.
The remaining queries are hard, are not solvable by brute force, and require a deep understanding of the world and human cognition.
Background knowledge and metadata can help to address poorly solved queries.

5. Poorly solved information needs
Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.
Ambiguous searches: paris hilton
Long tail queries: george bush (and I mean the beer brewer in Arizona)
Multimedia search: paris hilton sexy
Imprecise or overly precise searches: jim hendler; pictures of strong adventures people
Precise searches for descriptions: countries in africa; 32 year old computer scientist living in barcelona; reliable digital camera under 300 dollars

6. Example: multiple interpretations

7. Why Semantic Search? II.
The Semantic Web is now a reality: large amounts of data are published in RDF; the data is heterogeneous and of varying quality; users are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain.
Searching data instead of, or in addition to, searching documents: direct answers; novel search tasks.

8. Example: direct answers in search
[Screenshots with callouts: points of interest in Vienna, Austria; information from the Knowledge Graph; faceted search for Shopping results; information box with content from and links to Yahoo! Travel.]
Since August 2010, regular search results are "Powered by Bing".

9. Novel search tasks
Aggregation of search results, e.g. price comparison across websites.
Analysis and prediction, e.g. world temperature by 2020.
Semantic profiling: recommendations based on particular interests.
Semantic log analysis: understanding user behavior in terms of objects.
Support for complex tasks, e.g. booking a vacation using a combination of services.

10. Contextual (pervasive, ambient) search
Yahoo! Connected TV: widget engine embedded into the TV.
Yahoo! IntoNow: recognize audio and show related content.

11. Interactive search and task completion

12. Document retrieval and data retrieval
Information Retrieval (IR) supports the retrieval of documents (document retrieval). Representation is based on lightweight, syntax-centric models. These work well for topical search, but not so well for more complex information needs. Web scale.
Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval). More expressive models allow for complex queries and retrieve concrete answers that precisely match queries: not just matching and filtering, but also joins. Limitations in scalability.

13. Combination of document and data retrieval
Documents with metadata: metadata may be embedded inside the document. "I'm looking for documents that mention countries in Africa."
Data retrieval: structured data, but with searchable text fields. "I'm looking for directors who have directed movies where the synopsis mentions dinosaurs."

14. Semantic Search
Target: (a combination of) document and data retrieval.
Semantic search is a retrieval paradigm that exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content, and that incorporates the intent of the query and the meaning of content into the search process (semantic models).
There is a wide range of semantic search systems. They employ different semantic models, possibly at different steps of the search process and in order to support different tasks.

15. Semantic models
Semantics is concerned with the meaning of the resources made available for search. There are various representations of meaning.
Linguistic models: models of relationships among words. Taxonomies, thesauri, dictionaries of entity names. Inference along linguistic relations, e.g. broader/narrower terms.
Conceptual models: models of relationships among objects. Ontologies capture entities in the world and their relationships. Inference along domain-specific relations.
We will focus on conceptual models in this tutorial. In particular, the RDF/OWL conceptual model for

16. Semantic Search: a process view
[Diagram. Query Construction: keywords, forms, NL, formal language. Query Processing: IR-style matching & ranking, DB-style precise matching, KB-style matching & inferences. Result Presentation: query visualization, document and data presentation, summarization. Query Refinement: implicit feedback, explicit feedback, incentives. Underlying layers: Knowledge Representation (semantic models) over resources; Document Representation over documents.]

17. Semantic Search systems
For data / document retrieval, semantic search systems might combine a range of techniques, ranging from statistics-based IR methods for ranking, through database methods for efficient indexing and query processing, up to complex reasoning techniques for making inferences!

18. Example: Information Workbench
Addresses the lifecycle of interacting with the Web of Data: integration of data sources; content generation by the end user; search and exploration; visualization; publishing.
Integrated management of heterogeneous data sources: structured and unstructured; published and user-generated; static and dynamic; open domain.
[Figure labels: Wikipedia; DBpedia, Yago; Earthquake (Data.gov); user-generated; structured; dynamic.]

19. Data Sources in the Application
Entire English Wikipedia.
Data from Linked Open Data: DBpedia, YAGO.
Data from Data.gov (US Government), e.g. live data about earthquakes.
Many more.

20. Semantic Search
Hybrid search: structured queries combined with keywords across structured and unstructured data sources.
Query interpretation: translation of keywords into hybrid queries.
Keyword search / query interpretation combined with faceted search: an iterative refinement process based on keywords and operations on facets.

21. Search, Refinement and Navigation
[Screenshot with callouts: keywords; query translations; term completions; facets.]

22. Result Inspection, Analysis and Browsing

23. Semantic Web data

24. Data on the Web
Data on the Web is not directly accessible. Most web pages are generated from databases, but formatted for human consumption. APIs offer limited views over data.
Two solutions: extraction using Information Extraction (IE) techniques (out of scope for this tutorial); relying on publishers to expose structured data using standard Semantic Web formats.

25. Information extraction
Natural Language Processing: named entity recognition and disambiguation, sentiment analysis etc.
Extraction of information about entities: Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007. Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.
Extraction from HTML tables: Cafarella et al. WebTables: Exploring the Power of Tables on the Web, VLDB 2008.
Wrapper induction: Kushmerick et al. Wrapper Induction for Information Extraction, IJCAI 1997.
Text extraction.
Filling web forms automatically (form-filling): Madhavan et al. Google's Deep-Web Crawl, VLDB 2008.

26. Semantic Web
Sharing data across the Web.
Standard data model: RDF.
A number of syntaxes (file formats): RDF/XML, RDFa.
Powerful, logic-based languages for schemas: OWL, RIF.
Query languages and protocols: HTTP, SPARQL.

27. Resource Description Framework (RDF)
Each resource (thing, entity) is identified by a URI. URIs are globally unique identifiers; URLs are a subset of URIs. URIs are often abbreviated using namespaces, e.g. example:roi = http://example.org/roi.
RDF represents knowledge as a set of triples. Each triple is a single fact about the entity (an attribute or a relationship). A set of triples forms an RDF graph.
[Figure: an RDF document in which example:roi has type foaf:Person and name "Roi Blanco".]

28. Linking resources
[Figure: example:roi (type foaf:Person, name "Roi Blanco") on Roi's homepage, linked to the Friend-of-a-Friend ontology and to the resources #roi2 and #peter on Yahoo!'s website through knows, sameAs, worksWith and type edges.]

29. Ontologies
Ontologies are the schemas for RDF graphs. They define the intended meaning of certain classes and relationships in a domain, e.g. the FOAF ontology defines the foaf:Person class and relationships such as foaf:knows.
Ontologies are published in standard languages such as OWL. The language defines the constructs for modeling and their meaning, e.g. subClassOf, sameAs.
Tools for editing ontologies include Protégé and TopBraid Composer.
Reusing and extending well-known ontologies helps to interpret unknown data, e.g. if X is subClassOf foaf:Person, and A is of type X, then we know that A is a person.

30. Example: schema.org
Agreement on a shared set of schemas for common types of web content, with Bing, Google, and Yahoo! as initial supporters.
Similar in intent to sitemaps.org (2006): use a single format to communicate the same information to all three search engines.
Support for microdata.
schema.org covers areas of interest to all search engines: business listings (local), creative works (video), recipes, reviews.
User-defined extensions.
Each search engine continues to develop its products.

31. Documentation and OWL ontology

32. Publishing RDF
Interlinked RDF documents (Linked Data): each document describes a single resource, with URIs pointing to related resources. Common RDF file formats are RDF/XML and Turtle. Mostly implemented as a wrapper around a database or Web service.
Embedding RDF inside HTML: RDFa, microdata.
SPARQL endpoints: triple stores are databases for managing RDF data. SPARQL is a standard protocol and query language for accessing triple stores using HTTP.

33. Linked Data
Grass-roots community effort to (re)publish open datasets in RDF.
Centered around the DBpedia project.
Directory of datasets at linkeddata.org, thedatahub.org.

34. Metadata in HTML (anno 1995)
[Figure with callouts: "What does this term mean?"; "How is this data related?"]

35. Microformats (anno 2003)
Mark-up for data in HTML pages, reusing existing HTML attributes (class, rel).
Microformats exist for a limited set of objects: persons, organizations, reviews, events etc., see microformats.org.
No formal schemas: limited reuse and extensibility of schemas; unclear which combinations are allowed.
No identifiers for entities: no interlinking between entities.
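
The triple model from slide 27 and the subClassOf inference from slide 29 can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: triples are plain string tuples rather than a real triple store, and the class `example:Researcher` and the helper names `match` and `types_of` are made up for illustration.

```python
# Each RDF triple is a (subject, predicate, object) tuple of plain strings.
# example:Researcher is a hypothetical class, used only for this illustration.
TRIPLES = {
    ("example:roi", "rdf:type", "example:Researcher"),
    ("example:roi", "foaf:name", "Roi Blanco"),
    ("example:Researcher", "rdfs:subClassOf", "foaf:Person"),
}


def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def types_of(triples, resource):
    """Collect the classes of a resource, following rdfs:subClassOf transitively."""
    found = set()
    frontier = [t[2] for t in match(triples, s=resource, p="rdf:type")]
    while frontier:
        cls = frontier.pop()
        if cls not in found:
            found.add(cls)
            # Every superclass of a known class is also a class of the resource.
            frontier.extend(t[2] for t in match(triples, s=cls, p="rdfs:subClassOf"))
    return found
```

Calling `types_of(TRIPLES, "example:roi")` yields both the asserted class and `foaf:Person`, which is the "A is a person" conclusion from the ontology slide.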

36. RDFa and RDFa Lite (2008-2012)
W3C standards for embedding RDF data in HTML documents: a set of new HTML attributes to be used in head or body, and a specification of how to extract the data from these attributes.
RDFa is just a syntax; you have to choose a vocabulary separately.
RDFa family: RDFa 1.0 has been a Recommendation since October 2008. RDFa 1.1 is a small update on RDFa to make it easier to use. RDFa Primer (Working Draft, May 8, 2012). RDFa 1.1 Lite is a subset of RDFa 1.1 with the most

37. Example: Facebook's Open Graph Protocol
The "Like" button provides publishers with a way to promote their content on Facebook and build communities. Likes show up in profiles and in the news feed. Site owners can later reach users who have liked an object. The Facebook Graph API allows 3rd party developers to access the data.
Open Graph Protocol is an RDFa-based format for describing the object that the user "Likes".

38. Example: Facebook's Open Graph Protocol
RDF vocabulary to be used in conjunction with RDFa. It simplifies the work of developers by restricting the freedom in RDFa.
Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment.
Only HTML accepted.
[Markup example: The Rock (1996) ...]

39. Microdata (HTML5)
Currently under standardization at the W3C. Originally part of the HTML5 spec, but now a separate document.
Comparable to RDFa 1.1 Lite. Key extensibility features (such as multiple types) are missing.
HTML5 also has a number of semantic elements.

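<div itemscope>
 <p>My name is <span itemprop="name">Neil</span>.</p>
 <p>My band is called <span itemprop="band">Four Parts Water</span>.</p>
 <p>I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.</p>
</div>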
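
To make the extraction side concrete, the following sketch pulls `itemprop` values out of markup like the example above using Python's standard `html.parser`. It is a simplification, not a conforming microdata parser: it only reads text content (so it ignores the `datetime` attribute that the microdata rules prefer for `time` elements), it skips void elements such as `meta`, and it flattens nested items. The class name `ItempropExtractor` is made up for this illustration.

```python
from html.parser import HTMLParser

# Elements without an end tag; ignored here because they carry no text content.
VOID_ELEMENTS = {"meta", "link", "img", "br", "hr", "input"}


class ItempropExtractor(HTMLParser):
    """Collects {itemprop name: text content} from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._stack = []  # open elements as (tag, itemprop or None, text pieces)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self._stack.append((tag, dict(attrs).get("itemprop"), []))

    def handle_data(self, data):
        # Text belongs to every open element that declares an itemprop.
        for _, prop, pieces in self._stack:
            if prop is not None:
                pieces.append(data)

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _, _ in self._stack]:
            return  # stray end tag, nothing to close
        while self._stack:
            open_tag, prop, pieces = self._stack.pop()
            if prop is not None:
                self.properties[prop] = "".join(pieces).strip()
            if open_tag == tag:
                break


def extract_itemprops(html):
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties
```

A crawler would run something like `extract_itemprops(page_html)` after fetching the page, which matches the "data is extracted after crawling" workflow for embedded metadata.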

40. Current state of metadata on the Web
31% of webpages and 5% of domains contain some metadata, based on an analysis of the Bing Crawl (US crawl, January 2012).
RDFa is the most common format. By URL: 25% RDFa, 7% microdata, 9% microformat. By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat.
Adoption is stronger among large publishers, especially for RDFa and microdata.
See also: P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012. H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012.

41. Exponential growth in RDFa data
Five-fold increase between March 2009 and October 2010. Another five-fold increase between October 2010 and January 2012.
[Chart: percentage of URLs with embedded metadata in various formats.]

42. Crawling and indexing

43. Crawling the Semantic Web
Linked Data: similar to HTML crawling, but the crawler needs to parse RDF/XML (and other formats) to extract the URIs to be crawled. Semantic Sitemap/VOID descriptions.
RDFa: same as HTML crawling, but data is extracted after crawling. Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.
SPARQL endpoints: endpoints are not linked and need to be discovered by other means. Semantic Sitemap/VOID descriptions.

44. Data fusion
Ontology matching: widely studied in Semantic Web research, see e.g. the list of publications at ontologymatching.org. Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies.
Entity resolution: logic-based approaches in the Semantic Web. Studied as record linkage in the database literature: machine learning based approaches, focusing on attributes. Graph-based approaches, see e.g. the work of Lisa Getoor, are applicable to RDF data and improve over purely attribute-based matching.
Blending: merging objects that represent the same real-world entity and reconciling information from multiple sources.

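
As a rough illustration of entity resolution followed by blending, the sketch below compares two records by the token overlap of their attribute values and merges them when they look like the same entity. This is a deliberately naive stand-in (token Jaccard with a hand-picked threshold), not one of the learned or graph-based approaches mentioned on the data fusion slide; the function names, the threshold of 0.5 and the sample records are assumptions made for the example.

```python
def tokens(record):
    """Lower-cased tokens of all attribute values of a record (a plain dict)."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_entity(r1, r2, threshold=0.5):
    """Attribute-based matching: treat records as co-referent above the threshold."""
    return jaccard(tokens(r1), tokens(r2)) >= threshold


def blend(r1, r2):
    """Merge two records describing one entity; r1 wins when both set a property."""
    merged = dict(r2)
    merged.update(r1)
    return merged
```

In practice the threshold would be tuned on labeled pairs, and conflicting values would be reconciled with more care than "first source wins".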
45. Data quality assessment and curation
Heterogeneity and quality of data are an even larger issue. Quality ranges from well-curated data sets (e.g. Freebase) to microformats. In the worst case, the data becomes a graph of words. Short amounts of text are prone to mistakes in data entry or extraction, for example a mistake in a phone number or state code.
Quality assessment and data curation: quality varies from data created by experts to user-generated content.
Automated data validation: against known-good data or using triangulation; validation against the ontology or using probabilistic models.
Data validation by trained professionals or crowdsourcing: sampling data for evaluation.

46. Indexing
Search requires matching and ranking. Matching selects a subset of the elements to be scored.
The goal of indexing is to speed up matching. Retrieval needs to be performed in milliseconds; without an index, retrieval would require streaming through the collection.
The type of index depends on the query model to support: DB-style indexing or IR-style indexing.

47. IR-style indexing
Index data as text: create virtual documents from data. One virtual document per subgraph, resource or triple; typically per resource.
Key differences to text retrieval: RDF data is structured. Minimally, queries on property values are required.

48. Horizontal index structure
Two fields (indices): one for terms, one for properties. For each term, store the property at the same position in the property index.
Positions are required even without phrase queries.
The query engine needs to support the alignment operator.
Dictionary size is the number of unique terms + the number of properties.
Number of occurrences is the number of tokens * 2.

49. Vertical index structure
One field (index) per property.
Positions are not required, but are useful for phrase queries.
The query engine needs to support fields.
Dictionary size is the number of unique terms.
Number of occurrences is the number of tokens.
The number of fields is a problem for merging and for query performance.

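
The two index layouts can be compared on a toy collection. The sketch below is an in-memory approximation, assuming whitespace tokenization and Python dictionaries in place of a real inverted-index engine; the sample resources and all function names are invented for the example. The "alignment operator" is modeled as an intersection of (resource, position) pairs.

```python
from collections import defaultdict

# Toy data: resource -> property -> value (values and properties are made up).
RESOURCES = {
    "example:roi": {"foaf:name": "Roi Blanco", "example:city": "Barcelona"},
    "example:peter": {"foaf:name": "Peter Mika", "example:city": "Barcelona"},
}


def tokenize(text):
    return text.lower().split()


def build_vertical(resources):
    """Vertical layout: one inverted index (field) per property."""
    index = defaultdict(lambda: defaultdict(set))  # property -> term -> {resource}
    for rid, props in resources.items():
        for prop, value in props.items():
            for term in tokenize(value):
                index[prop][term].add(rid)
    return index


def build_horizontal(resources):
    """Horizontal layout: a term field and a property field sharing positions."""
    terms = defaultdict(list)       # term -> [(resource, position)]
    properties = defaultdict(list)  # property -> [(resource, position)]
    for rid, props in resources.items():
        position = 0
        for prop, value in props.items():
            for term in tokenize(value):
                terms[term].append((rid, position))
                properties[prop].append((rid, position))
                position += 1
    return terms, properties


def search_vertical(index, prop, term):
    """Field lookup: resources whose property value contains the term."""
    return set(index.get(prop, {}).get(term, set()))


def search_horizontal(terms, properties, prop, term):
    """Alignment: term and property must occur at the same (resource, position)."""
    aligned = set(terms.get(term, [])) & set(properties.get(prop, []))
    return {rid for rid, _ in aligned}
```

On this toy data the horizontal index stores every token twice (once per field), while the vertical index stores each token once but needs one field per property, which mirrors the trade-off stated on the two slides.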
50. Distributed indexing
MapReduce is ideal for building inverted indices. Map creates (term, {doc1}) pairs. Reduce collects all docs for the same term: (term, {doc1, doc2}). Sub-indices are merged separately.
Term-partitioned indices.
Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.

51. Query Processing / Matching

52. Structure
Taxonomy of search approaches.
Query processing / matching techniques for Semantic Search: types of semantic data; formalisms for querying semantic data.
Approaches. General task: hybrid graph pattern matching. Matching keyword query against text. Matching structured query against structured data. Matching keyword query against structured data. Matching structured query against text (a hybrid case).
Main tasks, challenges and opportunities.

53. Taxonomy of search approaches (1)
The search problem: a collection of resources, called data; information needs expressed as queries. Search is the task of efficiently computing results from data that are relevant to queries.
Document data retrieval vs. structured data retrieval: differences in query and data representation and in matching. Structured data retrieval: efficiently retrieve structured data that exactly match formal information needs expressed as structured queries. Document retrieval: effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance).
Semantic search: ranked retrieval of document and

54. Taxonomy of search approaches (2)
Query engines (of databases): exact; complete; sound.
Search engines (stand-alone / database extensions): approximate; not complete; not sound; ranked; best effort; top-k.
[Diagram: Query, Matching, Data.]
Query processing mainly focuses on efficiency of matching, whereas ranking deals with degree of matching (relevance)!

55. Query processing for Semantic Search (1)
Resources are represented by semantic data, ranging from structured data with well-defined schemas, through semi-structured data with incomplete or no schemas and data that largely comprise text, to hybrid / embedded data.
Information needs of varying complexity are captured using different formalisms and querying paradigms: natural language texts and keywords; form-based inputs; formal structured queries. (Search is an end-user oriented paradigm and requires natural, intuitive querying interfaces.)
Semantic search: efficiently computing results (query processing) from data that are relevant to queries (ranking).

56. Query processing for Semantic Search (2)
[Diagram. Query side: keywords; NL questions; form- / facet-based inputs; structured queries (SPARQL). Matching, subject to ambiguities. Data side: RDF data embedded in text (RDFa); semi-structured RDF data; structured RDF data; OWL ontologies with rich, formal semantics.]
Ambiguities: confidence degree, truth/trust.

57. Query processing for Semantic Search (3)
[Diagram: a two-by-two grid of query type (unstructured vs. structured) against data type (textual vs. structured). Keyword query on textual data, e.g. Web search. Structured query on textual data, e.g. semantic extension for search systems. Keyword query on structured data, e.g. search extensions for databases. Structured query on structured data, e.g. standard querying interface for databases / RDF stores.]
Search systems target groups of different users, information needs, and types of data.
Query processing for semantic search is a hybrid combination of techniques!

58. Types of data models (1)
Textual: bag-of-words. Represents documents, text in structured data, ..., real-world objects (captured as structured data).
Lacks the structure in text, e.g. linguistic structure, hyperlinks, (positional information), and the structure in structured data representation.
[Figure: the sample text "In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum." next to its term (statistics) list: combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex.]

59. Types of data models (2)
Textual.
Structured: Resource Description Framework (RDF). Represents real-world objects, services, applications, ..., documents. Resource attribute values and relationships between resources. Schema.
[Figure: Picture, creator, Person, Bob.]

60. Types of data models (3)
Textual. Structured. Hybrid: RDF data embedded in text (RDFa).

61. Types of data models RDFa (1)

[Markup example] adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/

62. Types of semantic data RDFa (2)
"Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:"
Content adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/

63. Types of semantic data - conclusion
Semantic data in general can be conceived as a graph with text and structured data items as nodes, and with edges that represent different types of relationships, including explicit semantic relationships and vaguely specified ones such as hyperlinks!

64. Formalisms for querying semantic data (1)
Example information need: "Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT."
Unstructured queries. Fully-structured queries. Hybrid queries: unstructured + structured.

65. Formalisms for querying semantic data (2)
Example information need: "Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT."
Unstructured: NL; keywords: shared apartment Berlin Alice

66. Formalisms for querying semantic data (3)
Example information need: "Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT."
Unstructured. Fully-structured: SPARQL (BGP, filter, optional, union, select, construct, ask, describe).
PREFIX ns: SELECT ?x WHERE { ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" }

67. Formalisms for querying semantic data (4)
Fully-structured. Unstructured. Hybrid: content and structure constraints.
Keywords: shared apartment Berlin Alice
Triple patterns: ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"

68. Formalisms for querying semantic data (5)
Fully-structured. Unstructured. Hybrid: content and structure constraints.
Keywords: shared apartment Berlin Alice
Triple patterns: ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"

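
To show what evaluating the fully-structured query involves, here is a small sketch of basic graph pattern (BGP) matching over an in-memory list of triples. It is a naive nested-loop evaluation for illustration only, not how a SPARQL engine is implemented; the data set, the `ex:` identifiers and the function names are assumptions made for the example.

```python
# Hypothetical data graph; only the ns: predicates come from the slides.
DATA = [
    ("ex:bob", "ns:knows", "ex:alice"),
    ("ex:alice", "ns:name", "Alice"),
    ("ex:bob", "ns:knows", "ex:thanh"),
    ("ex:thanh", "ns:works", "ex:kit"),
    ("ex:kit", "ns:name", "KIT"),
    ("ex:peter", "ns:knows", "ex:alice"),
]

# The triple patterns of the example query; terms starting with "?" are variables.
QUERY = [
    ("?x", "ns:knows", "?y"),
    ("?y", "ns:name", "Alice"),
    ("?x", "ns:knows", "?z"),
    ("?z", "ns:works", "?v"),
    ("?v", "ns:name", "KIT"),
]


def is_var(term):
    return term.startswith("?")


def unify(pattern, triple, binding):
    """Extend the binding so that the pattern equals the triple, else return None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if extended.setdefault(p, t) != t:
                return None  # variable already bound to a different value
        elif p != t:
            return None      # constant does not match
    return extended


def evaluate(patterns, data):
    """Join the patterns one by one, carrying partial variable bindings along."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in data:
                extended = unify(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings
```

Here `evaluate(QUERY, DATA)` binds `?x` to `ex:bob` only: `ex:peter` also knows Alice, but knows nobody who works at KIT, so the conjunction of patterns filters him out.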
69. Formalisms for querying semantic data - conclusion
Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!

70. Processing hybrid graph patterns (1)
Example information need: "Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT."
Hybrid query: keywords "apartment shared Berlin Alice" plus the triple patterns ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
[Figure: the query next to a data graph with nodes such as Bob, Alice, Peter, Thanh, KIT, FluidOps, Germany, Semantic Search, sunset.jpg ("Beautiful Sunset"), 34, 2009, "trouble with bob", and the text "Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:".]

71. Processing hybrid graph patterns (2)
Matching hybrid graph patterns against data.

72. Matching keyword query against text
Retrieve documents: inverted list (inverted index), mapping each keyword to its list of postings.
AND-semantics: top-k join.
[Figure: the posting lists for "shared", "Berlin" and "Alice" are joined on the document (D1): shared = berlin = alice.]

73. Matching structured query against structured data
Retrieve data for triple patterns: index on tables; multiple redundant indexes to cover different access patterns.
Join (conjunction of triples): blocking, e.g. linear merge join (requires sorted input); non-blocking, e.g. symmetric hash-join; materialized join indexes.
[Figure: the patterns ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name KIT are answered from an SP-index (Per1 ns:works ?v) and a PO-index (?v ns:name KIT), and the results "Per1 ns:works Ins1" and "Ins1 ns:name KIT" are joined.]

74. Matching keyword query against structured data
Retrieve keyword elements using an inverted index, mapping each keyword to its list of matching elements.
Exploration / join: data indexes for triple lookup; materialized index (paths up to graphs); top-k Steiner tree search, top-k subgraph exploration.
[Figure: the keywords Alice, Bob, KIT are matched to data elements and connected through "Alice ns:knows Bob", "Bob ns:works Inst1" and "Inst1 ns:name KIT".]

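
The two join styles named on slides 72 and 73 can be sketched as follows. Both functions are illustrative simplifications with invented names: the merge join intersects sorted posting lists (which gives the AND-semantics for keyword queries), and the symmetric hash join consumes an interleaved stream of tuples from both inputs and emits results as soon as a partner arrives, without waiting for either input to finish.

```python
from collections import defaultdict


def merge_intersect(left, right):
    """Linear merge join of two sorted posting lists of document ids."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def symmetric_hash_join(stream):
    """Non-blocking join over an interleaved stream of (side, key, payload).

    side is "L" or "R"; each arriving tuple is stored in its own hash table
    and immediately probed against the table of the other side.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, payload in stream:
        other = "R" if side == "L" else "L"
        tables[side][key].append(payload)
        for partner in tables[other].get(key, []):
            yield (payload, partner) if side == "L" else (partner, payload)
```

For a three-keyword query the merge step is simply applied twice, e.g. `merge_intersect(merge_intersect(shared, berlin), alice)`. The hash join keeps both inputs in memory, which is the price paid for early result reporting.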
75. Matching structured query against text
Based on offline IE (see Peter's slides).
Based on online IE, i.e., retrieval is as follows: derive keywords to retrieve relevant documents; on-the-fly information extraction, i.e., phrase pattern matching "X name Y"; retrieve extracted data for the structured part; retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
[Figure: the triple patterns of the example query are turned into text patterns over "knows", "name" and "KIT".]

76. Matching structured query against text
Index: inverted index for document retrieval and pattern matching; join index, i.e. an inverted index for storing materialized joins between keywords; neighborhood indexes for phrase patterns.
[Figure: the triple patterns of the example query matched against the index entries "KIT", "name", "knows".]

77. Query processing: main tasks
Retrieval: documents, data; query elements, triples, paths, graphs; inverted index, ..., but also other (B+ tree).
Index: documents, triples, materialized paths.
Join: different join implementations, efficiency depends on availability of indexes; non-blocking join is good for early result reporting and for unpredictable Linked Data / data
[Diagram: Query, Matching, Data.]

78. Query processing: more tasks
More complex queries: disjunction, aggregation, grouping, analytics.
Join order optimization.
Approximate: approximate the search space; approximate the results (matching, join).
Parallelization.
Top-k: use only some entries in the input streams to produce k results.
Multiple sources: federation, routing; on-the-fly mapping, similarity join.
Hybrid.
[Diagram: Query, Matching, Data.]

79. Query processing on the Web - research challenges and opportunities
Challenges: large amount of semantic data; data inconsistent, redundant, and of low quality; large amount of data embedded in text; large amount of sources; large amount of links between sources.
Opportunities: optimization, parallelization; approximation; hybrid querying and data management; federation, routing; online schema mappings; similarity join.

80. Ranking

81. Structure
Problem definition. Types of ambiguities. Ranking paradigms. Model construction: content-based; structure-based.

82. Ranking: problem definition
Ambiguities arise when the representation is incomplete / imprecise. There are ambiguities at the level of elements (content ambiguity) and of the structure between elements (structure ambiguity).
[Diagram: Query, Matching, Data.]
Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.

83. Content ambiguity
[Figure: the hybrid query and data graph of slide 70.]
What is meant by "Berlin" in the query? What is meant by "KIT" in the query? What is meant by "Berlin" in the data? What is meant by "KIT" in the data?
A city with the name Berlin? A person? A research group? A university? A location?

84. Structure ambiguity
[Figure: the hybrid query and data graph of slide 70.]
What is the connection between "Berlin" and "Alice"? Friend? Co-worker?
What is meant by "works"? Works at? Employed?

85. Ambiguity
Recall: query processing is matching at the level of syntax and semantics.
Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches. Syntactic, e.g. works vs. works at. Semantic, e.g. works vs. employ.
"Aboutness", i.e., containing some elements which represent the correct interpretation. Ambiguities arise when matching elements of different granularities: does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j? E.g. "Berlin" vs. "we went to the same university, and also, we shared an apartment in Berlin in 2008".
Strictly speaking, ranking is performed after syntactic / semantic matching is done!

86. Features: What to use to deal with ambiguities?
What is meant by "Berlin"? What is the connection between "Berlin" and "Alice"?
Content features. Frequencies of terms: d is more likely to be "about" a query term k when d mentions k more often (probabilistic IR). Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis).
Structure features. Consider relevance at the level of fields. Link-based popularity.

87. Ranking paradigms
Explicit relevance model. Foundation: probability ranking principle.
Rank results by the posterior probability (odds) of being observed in the relevant class: $\frac{P(D \mid R)}{P(D \mid N)}$, with $P(D \mid R) = \prod_{w \in D} P(w \mid R) \prod_{w \notin D} (1 - P(w \mid R))$.
$P(w \mid R)$ varies in different approaches, e.g., binary independence model, 2-Poisson model, relevance model.
Relevance model: $P(w \mid R) \approx P(w \mid q_1, \dots, q_k) \propto \sum_{m \in M} P(m)\, P(w \mid m) \prod_{i=1}^{k} P(q_i \mid m)$.

88. Ranking paradigms
No explicit notion of relevance: similarity between the query and the document model.
Vector space model (cosine similarity): $Sim(q, d) = Cos((w_{1,d}, \dots, w_{t,d}), (w_{1,q}, \dots, w_{t,q}))$.
Language models (KL divergence): $Sim(q, d) = -KL(q \,\|\, d) = -\sum_{t \in V} P(t \mid q) \log \frac{P(t \mid q)}{P(t \mid d)}$.

89. Model construction
How to obtain relevance models? Weights for query / document terms? Language models for documents / queries?

90. Content-based model construction
Document statistics, e.g. term frequency, document length.
Collection statistics, e.g. inverse document frequency, background language models.
When is an object more likely "about" Berlin? When it contains a relatively high number of mentions of the term "Berlin", and when the number of mentions of this term in the overall collection is relatively low.
Term weight: $w_{t,d} = \frac{tf}{|d|} \cdot idf$.
Smoothed language model: $P(t \mid d) = \lambda \frac{tf}{|d|} + (1 - \lambda) P(t \mid C)$.

91. Structure-based model construction
Consider the structure of objects during content-based modeling, i.e., obtain a structured content-based model.
Content-based model for structured objects, documents and for general tuples: $P(t \mid d) = \sum_{f \in F_d} \alpha_f P(t \mid f)$, where $\alpha_f$ is the weight of field $f$.
When is an object more likely "about" Berlin? When one of its (important) fields contains a relatively high number of mentions of the term "Berlin".

92. Structure-based model construction
PageRank: link analysis algorithm measuring the relative importance of nodes. A link counts as a vote of support. The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links).
ObjectRank: types and semantics of links vary in a structured data setting. An authority transfer schema graph specifies connection strengths. Recursively compute the authority transfer data graph.
When is an object about Berlin more important than another? When a relatively large number of objects are linked to it.

93. Taxonomy of ranking approaches
Explicitly vs. non-explicitly relevance-based. Content-based ranking. Structure-based ranking. Content-and-structure-based ranking.

94. Result Presentation

95. Search interface
Input and output functionality: helping the user to formulate complex queries; presenting the results in an intelligent manner.
Semantic Search brings improvements in:
Query formulation.
Snippet generation.
Suggesting related entities.
Adaptive and interactive presentation: presentation adapts to the kind of query and results presented; object results can be actionable, e.g. buy this product.
Aggregated search: grouping similar items, summarizing results in various ways; filtering (facets), possibly across different dimensions.
Task completion: help the user to fulfill the task by placing the query in a task context.

96. Query formulation
"Snap-to-grid": suggest the most likely interpretation of the query, given the ontology or a summary of the data, while the user is typing or after issuing the query.
Example: Freebase suggest, TrueKnowledge.

97. Enhanced results / Rich Snippets
Use mark-up from the webpage to generate search snippets. Originally invented at Yahoo! (SearchMonkey).
Google, Yahoo!, Bing and Yandex now consume schema.org markup. Validators are available from Google and Bing.

98. Other result presentation tasks
Select the most relevant resources within an RDF document: Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010.
For each resource, rank the properties to be displayed.
Natural Language Generation (NLG): verbalize, explain results.

99. Aggregated search: facets

100. Aggregated search: Sig.ma

101. Related entities
Related actors and movies.

102. Adaptive presentation: housing search

103. Evaluation

104. Semantic Search challenge (2010/2011)
Two tasks.
Entity Search: queries where the user is looking for a single real-world object. Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.
List search (new in 2011): queries where the user is looking for a class of objects.
Billion Triples Challenge 2009 dataset.
Evaluated using Amazon's Mechanical Turk: Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010. Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011.

105. Evaluation form

106. Catching the bad guys
Payment can be rejected for workers who try to game the system. An explanation is commonly expected, though cheaters rarely complain.
We opted to mix control questions into the real results: gold-win cases that are known to be perfect, and gold-loose cases that are known to be bad.
Metrics: avg. and std. dev. on gold-win and gold-loose results; time to complete.

Worker     Known bad      Real           Known good     Total   Time to complete
           N    Mean      N    Mean      N    Mean      N       (sec)
badguy     20   2.556     200  2.738     20   2.684     240     29.6
goodguy    13   1         130  2.038     13   3         156     95
whoknows   1    1         21   1.571     2    3         24      83.5

107. Other evaluations
TREC Entity Track.
Related Entity Finding: entities related to a given entity through a particular relationship. Retrieval over documents (ClueWeb 09 collection). Example: (homepages of) airlines that fly Boeing 747.
Entity List Completion: given some elements of a list of entities, complete the list.
Question Answering over Linked Data: retrieval over specific datasets (DBpedia and MusicBrainz). Full natural language questions of different forms. Correct results defined by an equivalent SPARQL query. Example: Give me all actors starring in Batman Begins.

108. Resources

109. Resources
Books: Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press. 2011.
Survey papers: Thanh Tran, Peter Mika. Survey of Semantic Search Approaches. Under submission, 2012.
Conferences and workshops: ISWC, ESWC, WWW, SIGIR, CIKM, SemTech. Semantic Search workshop series. Exploiting Semantic Annotations in Information Retrieval (ESAIR). Entity-oriented Search (EOS) workshop.
