Information Searching


  • 8/3/2019 Information Searching

    1/30

    8/8/2011 1

    Information Searching

    Information Search

    Traditional Search

    Web Search

    Metadata based Search

    Semantic Search

    Traditional Search

    A collection of documents is a set of documents related to a specific context of interest

    The indexing process is applied to the full text of the documents

    Classical Search

    Web Information Searching: Search Engine Architecture

    Web Information Searching

    Web Searching & Information Retrieval, IEEE Web Engineering, 2004

    Search engines index each web page by representing it as a set of weighted keywords

    Robots or spiders crawl through the web and pick up useful pages for the search engine

    Indexing of these pages includes:

    Removing all frequent or non-significant words (stop words such as "and", "be")

    Stemming, which removes derivational suffixes and retains the root (thinking, thinkers, thinks all reduce to think)

    Pages found are represented by a set of weighted keywords
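
The two indexing steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the suffix list are tiny hypothetical stand-ins, and the suffix stripping is a naive rule, not the Porter stemmer or any specific engine's method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real engines use larger ones.
STOP_WORDS = {"and", "be", "the", "a", "of", "to", "is", "in"}

# Naive suffix list for illustration only; this is not the Porter algorithm.
SUFFIXES = ("ing", "ers", "er", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> Counter:
    """Tokenize, drop stop words, stem, and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)


terms = index_terms("Thinkers and thinking: the thinker thinks")
```

Here all four content words collapse onto the single root `think`, so the raw count per root can later feed a weighting scheme.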

    Crawler/ Robot/ Spider

    A crawler is a program, controlled by a crawl module, that browses the web

    It collects documents by recursively fetching links from a set of start pages; the retrieved pages (or parts of them) are compressed and stored in a page repository

    URLs and their links form a web graph, which the crawler control module can use to decide on further crawling

    To save space, a docID represents each page in the index
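
A minimal sketch of the crawl loop, assuming a hypothetical in-memory `WEB` dictionary in place of real HTTP fetching and link parsing. It uses a queue (breadth-first) rather than literal recursion, assigns compact integer docIDs, and records the web graph as it goes; the `max_pages` limit is an assumed stand-in for the crawl control module.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler would fetch pages over HTTP and parse the links out.
WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(start_pages, max_pages=100):
    """Breadth-first crawl; returns (doc_ids, web_graph)."""
    doc_ids = {}      # URL -> compact integer docID
    web_graph = {}    # docID of a page -> URLs it links to
    frontier = deque(start_pages)
    while frontier and len(doc_ids) < max_pages:
        url = frontier.popleft()
        if url in doc_ids or url not in WEB:
            continue  # already visited, or page could not be fetched
        doc_ids[url] = len(doc_ids)
        links = WEB[url]
        web_graph[doc_ids[url]] = links
        frontier.extend(links)
    return doc_ids, web_graph


doc_ids, web_graph = crawl(["a.example/"])
```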

    The indexer processes the pages collected by the crawler

    It decides which pages to index; duplicate documents are discarded

    An inverted index is built, which contains for each word a sorted list of couples (such as docID and position in the document)
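
The inverted index described above might look like the following sketch. The function name and the exact-text duplicate check are assumptions made for illustration; production indexers use more robust duplicate detection.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """pages: dict docID -> text. Returns word -> sorted list of (docID, position)."""
    seen_texts = set()
    index = defaultdict(list)
    for doc_id in sorted(pages):
        text = pages[doc_id]
        if text in seen_texts:
            continue  # exact duplicate of an earlier page: discard
        seen_texts.add(text)
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    # Postings are appended in increasing docID and position order,
    # so every list is already sorted.
    return dict(index)


pages = {0: "web search engine", 1: "search the web", 2: "web search engine"}
index = build_inverted_index(pages)
```

Document 2 repeats document 0 verbatim, so it contributes no postings.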

    Query Engine

    The query engine processes user queries and returns matching answers using a ranking algorithm

    The algorithm produces a numerical score expressing the importance of each answer with respect to the query

    Utility data structures contain lists of related pages, which can facilitate search

    Various query-independent and query-dependent data are used to decide ranking (date of modification, site, number of links to other pages, or the actual content of documents)

    Query-dependent criteria include the cosine measure of similarity in the vector space model

    TIR(1)

    Classical models: Boolean, Vector, Probabilistic

    The vector model is the most popular

    Documents and queries are represented as vectors

    TIR (2)

    Weighting algorithm: tf-idf (term frequency, inverse document frequency)

    The weight $w_{ij}$ corresponding to the $i$-th component of the vector representation of document $d_j$ is given by

    $w_{ij} = tf_{ij} \cdot idf_i$

    where $tf_{ij} = f_{ij} / \max_l f_{lj}$, $f_{ij}$ is the frequency of term $k_i$ in document $d_j$, and the maximum is computed over all terms mentioned in $d_j$

    $idf_i$, the inverse document frequency of term $k_i$, is given by $idf_i = \log(N / n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents containing $k_i$

    The relevance ranking is $Sim(d_j, q) = \cos(\theta) = (d_j \cdot q) / (|d_j| \, |q|)$

    Thus very frequent terms receive a low weight, while uncommon terms appearing in few documents receive a high weight

    The model relies on the assumption that the index terms are independent
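
As a worked illustration of the tf-idf weights and the cosine ranking, the sketch below computes both for three toy documents. The documents, the single-term query vector and the function names are hypothetical; every document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)                                             # N
    doc_freq = Counter(term for doc in docs for term in set(doc))  # n_i
    vectors = []
    for doc in docs:
        freq = Counter(doc)                # f_ij
        max_freq = max(freq.values())      # max_l f_lj
        vectors.append({
            term: (f / max_freq) * math.log(n_docs / doc_freq[term])
            for term, f in freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [["web", "search", "web"], ["semantic", "search"], ["web", "crawler"]]
vectors = tf_idf_vectors(docs)
query = {"web": 1.0}  # hypothetical query vector with one unit-weight term
scores = [cosine(vec, query) for vec in vectors]
```

The first document ranks highest for the query "web" because the term dominates its vector; the second document does not contain the term and scores zero.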

    Web Searching (2)

    The weighting procedure considers:

    If a term appears more frequently than other terms, its associated weight can be increased

    If a term appears within many pages, its weight is decreased (it may not be useful in discriminating items)

    Usually greater weights are assigned to short pages than to longer ones

    The inverted file is updated so that, for each keyword, the system can find a list of all web pages (with associated weights) indexed under this term

    The degree of similarity can be calculated using this data
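
One way to picture how the weighted inverted file supports similarity computation is the sketch below. The file contents and weights are invented, and summing the stored weights per page is a simplification (a dot product with unit query weights), not necessarily the scoring any particular engine uses.

```python
from collections import defaultdict

# Hypothetical weighted inverted file: keyword -> list of (page, weight).
inverted_file = {
    "semantic": [("p1", 0.8), ("p3", 0.3)],
    "search": [("p1", 0.2), ("p2", 0.6), ("p3", 0.5)],
}


def score_pages(query_terms, inverted_file):
    """Sum the stored weights of every query term per page, best page first."""
    scores = defaultdict(float)
    for term in query_terms:
        for page, weight in inverted_file.get(term, []):
            scores[page] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = score_pages(["semantic", "search"], inverted_file)
```

Only the postings of the query terms are touched, which is why the inverted file makes ranking feasible on large collections.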

    Web searching (3)

    To improve search:

    Giving more credit to words appearing in title field

    Considering the distance between search keywords appearing within a page

    Using different models for assigning weights: probabilistic or language based
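
The first two refinements can be sketched as follows. The `title_credit` factor of 2.0 and both function names are assumptions chosen for illustration; the slides do not specify how much extra credit title words receive or how distance is scored.

```python
def title_boosted_score(query_terms, title, body, title_credit=2.0):
    """Count term matches, giving title matches extra credit (assumed factor)."""
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    for term in query_terms:
        score += title_credit * title_words.count(term) + body_words.count(term)
    return score


def min_distance(term_a, term_b, body):
    """Smallest gap in word positions between two terms, or None if one is absent."""
    words = body.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


body = "search engines index the web"
score = title_boosted_score(["semantic", "search"], "Semantic Search", body)
gap = min_distance("search", "web", body)
```

A ranking function could then favour pages with a high boosted score and a small keyword gap.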

    TIR (3)

    Problems with TIR

    Keyword based search

    Measure of relevancy of retrieved document

    Semantic Metadata

    Data which may be associated explicitly or implicitly with a given piece of content, and whose relevance for that content is determined by its ontological position (its context) within the domain of knowledge

    Helps in classification and high-precision searching

    Named entity recognition involves finding items of potential interest within a piece of text (person, place, thing, event); these are stored in the ontology

    Metadata is a snapshot of the document's relevant information

    Metadata contained within the snapshot references the instances of the named entities, which are stored in the ontology

    Relevant Information

    Types of Specs and Standards (or MetaModels)

    Domain Independent: MCF (Meta Content Framework), RDF, MOF (Meta Object Facility), Dublin Core

    Media Specific: MPEG-4, MPEG-7, VoiceXML

    Domain/Industry Specific (metamodels): MARC, DCMI, METS (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing Requirements for Industry Standard Metadata)

    Application Specific: ICE, Information & Content Exchange (communication between sender and receiver)

    Exchange/Sharing: XCM, XMI

    Other Models: RDFS, namespaces, ontologies (DAML, OIL)

    Dublin Core Metadata Initiative (DCMI), 1995-96

    Simple element set designed for domain-independent resource description

    15 elements are defined by this standard

    International, interdisciplinary, W3C community consensus

    Semantic interface among resource description communities (a very limited form of semantics)

    DCMI (2002): http://dublincore.org/documents/usageguide/elements.shtml

    Title: name given to the resource

    Contributor: entity responsible for making contributions to the content of the resource

    Creator:

    Publisher: entity responsible for making the resource available

    Subject & keywords: topic of the content of the resource

    Description: an account of the content of the resource

    Format: data representation of the resource

    Resource identifier: unambiguous reference

    Language

    Rights: copyright notice/statement

    Date, Type, Source, Relation, Coverage
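
To make the element set concrete, the sketch below stores a Dublin Core description as a plain Python dictionary and checks its keys against the 15 element names. The record values and the `validate_record` helper are hypothetical; real deployments usually serialize such records as XML or RDF rather than as a dictionary.

```python
# The 15 element names of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def validate_record(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - DC_ELEMENTS)


# Hypothetical record; all values are invented for illustration.
record = {
    "title": "Information Searching",
    "subject": "information retrieval; semantic search",
    "type": "Text",
    "language": "en",
}

unknown = validate_record(record)
```

Because every element is optional and repeatable in Dublin Core, the check only rejects unknown names and does not require any particular element to be present.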

    DDI Data Documentation Initiative

    DDI Data Documentation Initiative:

    Technical documentation of social, behavioural, and economic data

    SDMX Statistical Data and Metadata Exchange: used by statisticians for the exchange of time series data

    Creating and Serving Metadata to Power the Life-cycle of Content

    [Figure: stages of the content life-cycle and the question each stage answers]

    Produce / Aggregate: Where is the content? Whose is it?

    Catalog / Index: What is this content about?

    Integrate / Syndicate: What other content is it related to?

    Personalize: What is the right content for this user?

    Interactive Marketing: What is the best way to monetize this interaction?

    Broadcast, Wireline, Wireless, Interactive TV

    Taalee Semantic MetaBase, Taalee Content Applications, Taalee Infrastructure Services

    Intelligent Search

    Intelligent Search using Ontologies

    [Figure labels: User, Query, Ontology, Mediator 1: Ontology 1, Mediator 2: Ontology 2, Mediator 3: Ontology 3, Answer]

    SWIR

    Use of the Vector Space Model for the Semantic Web: at the semantic level, documents can be represented as vectors in a hyperspace defined by the set of all ontology concepts

    The weight of a concept is the relative importance of that concept

    SWIR needs:

    A good domain ontology

    An understanding of the semantic relationships among ontological concepts
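
A small sketch of the concept-space idea: each axis is an ontology concept rather than a keyword, and similarity is again the cosine of the angle between vectors. The four concept names and all weights are hypothetical.

```python
import math

# Hypothetical ontology concepts defining the axes of the vector space.
CONCEPTS = ["Person", "Place", "Event", "Organization"]


def concept_vector(concept_weights):
    """Map {concept: weight} onto a dense vector ordered like CONCEPTS."""
    return [concept_weights.get(c, 0.0) for c in CONCEPTS]


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


doc = concept_vector({"Person": 0.9, "Event": 0.4})
query = concept_vector({"Person": 1.0})
similarity = cosine(doc, query)
```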

    SWIR (3)

    Weights are assigned to links based on certain properties of the ontology, representing the strength of the relation

    The spread activation technique is used to find related concepts in the ontology, given some initial set of concepts and initial weights

    SWIR (4) : Weighting Algorithm

    By analogy with the tf-idf strategy of traditional IR, the first measure gives the degree of similarity between two related concept instances in a relation, and the second measure gives the specificity of the concept relation

    The cluster measure for concept instances $C_j$ and $C_k$ is given by:

    $W(C_j, C_k) = n_{ijk} / n_{ij}$

    where $n_{ij}$ represents that concepts $C_j$ and $C_i$ are related, and $n_{ijk}$ represents that both $C_j$ and $C_k$ are related to concept $C_i$. $W(C_j, C_k)$ therefore represents the percentage of the concepts related to $C_j$ that are also related to $C_k$

    This measure reflects the fact that concepts sharing common relations are semantically similar
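
The cluster measure can be illustrated with sets of related concepts. The sketch assumes that $n_{ijk}$ and $n_{ij}$ are counted over all concepts $C_i$, which is what makes the ratio a percentage; the relation data and the function name are hypothetical.

```python
# Hypothetical relation data: concept -> set of concepts it is related to.
RELATED = {
    "Cj": {"A", "B", "C", "D"},
    "Ck": {"B", "C", "E"},
}


def cluster_measure(cj, ck, related):
    """Share of the concepts related to cj that are also related to ck.

    Assumes n_ijk and n_ij are summed over all concepts C_i.
    """
    n_ij = len(related[cj])                  # concepts related to cj
    n_ijk = len(related[cj] & related[ck])   # concepts related to both
    return n_ijk / n_ij if n_ij else 0.0


w_jk = cluster_measure("Cj", "Ck", RELATED)
w_kj = cluster_measure("Ck", "Cj", RELATED)
```

Under this reading the measure is not symmetric: two of the four concepts related to `Cj` are shared, but two of the three related to `Ck` are.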

    SWIR (5)

    The specificity measure is given by:

    $W(C_j, C_k) = 1 / n_k$

    where $n_k$ is the number of instances of the given relation type that have $k$ as their destination node

    The actual measure is the product of the cluster and specificity measures
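
A short sketch of the specificity measure and the combined link weight, assuming relation instances are stored as (source, relation type, destination) triples. The triples, names and the cluster value passed in are hypothetical.

```python
# Hypothetical relation instances: (source, relation_type, destination).
RELATIONS = [
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("carol", "worksFor", "acme"),
    ("carol", "worksFor", "initech"),
    ("alice", "livesIn", "springfield"),
]


def specificity(relation_type, destination, relations):
    """1 / n_k, where n_k counts instances of relation_type ending at destination."""
    n_k = sum(
        1 for _, rel, dst in relations
        if rel == relation_type and dst == destination
    )
    return 1.0 / n_k if n_k else 0.0


def link_weight(cluster, relation_type, destination, relations):
    """Product of a precomputed cluster measure and the specificity measure."""
    return cluster * specificity(relation_type, destination, relations)


weight = link_weight(0.5, "worksFor", "initech", RELATIONS)
```

A destination reached by many instances of the same relation (here `acme`) is less specific, so links into it receive a lower weight.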

    SWIR (6): Spread Activation Algorithm

    Given an initial set of concepts, the algorithm obtains a set of closely (semantically) related concepts by navigating through the linked concepts in the graph

    The algorithm takes as its starting point an initial set of instances in the ontology, each having an initial activation value

    Constrained spread activation applies constraints such as maximum path length and fan-out to the propagation
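
The propagation described above can be sketched as follows. This is one possible reading, not the specific algorithm behind the slides: activation is multiplied by the link weight at each hop, nodes whose fan-out exceeds a limit do not propagate, and the number of hops is capped. The graph, the default limits and the function name are assumptions.

```python
def spread_activation(graph, initial, max_path_length=2, max_fan_out=3):
    """Constrained spreading activation over a weighted concept graph.

    graph:   {node: {neighbour: link_weight}}
    initial: {node: initial activation value}
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_path_length):            # path-length constraint
        next_frontier = {}
        for node, value in frontier.items():
            neighbours = graph.get(node, {})
            if len(neighbours) > max_fan_out:   # fan-out constraint: skip hubs
                continue
            for neighbour, weight in neighbours.items():
                delta = value * weight
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + delta
        for node, delta in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + delta
        frontier = next_frontier
    return activation


graph = {
    "semantic_web": {"ontology": 0.5, "rdf": 0.5},
    "ontology": {"owl": 0.5},
    "rdf": {},
}
result = spread_activation(graph, {"semantic_web": 1.0})
```

Concepts whose final activation exceeds some threshold would be returned as related to the initial set; the path-length cap also guarantees termination on cyclic graphs.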

    Conclusion (1)

    Traditional information retrieval: small, static, homogeneous, centrally located, monolingual document collections

    Web information retrieval: huge volumes of data that are volatile, heterogeneous, distributed and multilingual

    Semantic web information retrieval is ontology-based intelligent information retrieval

    Various semantic search strategies are explored

    Two major differences:

    Keyword vs. concept

    Response time as a part of the relevancy measure

    The most successful semantic search algorithms are the Vector Space Model and the hybrid approach, which combines a classical technique with the spread activation algorithm

    Conclusion (2)

    The concepts that form the basis of the semantic domain model are not orthogonal. This issue can be addressed by reassigning the weights of concept links based on the relationship graph of the ontology concepts

    The spread activation algorithm has been used to deduce relationships from a given set of relationships

    SIR has been visualized as a 4-layer process: keywords, indexed keywords, semantic concepts, relationships

    References

    Berners-Lee T., Hendler J., Lassila O., The Semantic Web, Scientific American, 2001, 284: 35-43

    Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval, 1st edition, Addison-Wesley, 1999

    Pokorny J., Web Searching and Information Retrieval, Web Engineering, July/August 2004, 43-48