Ir 01

  1. Lecture 01: Information Retrieval
  2. About the Course
     Book: An Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2009. Other materials may be considered depending on the subject.
     Principal objective of this course: to introduce students to Information Retrieval concepts, paradigms, and techniques, with an emphasis on string-based and semantics-based IR techniques.
  3. About the Course
     Grading & Assessment:
     - First Exam: 20%
     - Second Exam: 20%
     - Final Exam: 35%
     - Other Activities: 10%
     - Major Assignment: 15%
     For the major assignment, you are to build a prototype of a search engine that employs both text-based and semantics-based techniques to retrieve the results most relevant to users' queries. The search space will be a collection of documents, plus a collection of images associated with textual descriptions.
  4. Course Topics
     Part 01: Introduction
     - What is IR?
     - Examples of IR Systems
     - Other topics related to IR
     - Models of IR
     Part 02: Boolean Retrieval
     - What is Boolean IR?
     - Term-Document Incidence Matrices
     - Terminology and Notations
  5. Course Topics
     Part 03: Indexing
     - Building Indexes
     - Semantic Networks
     Part 04: Retrieval
     - Scoring, Ranking
     - Relevance Feedback
     - Precision/Recall
  6. Course Topics
     Part 05: Exploiting Ontologies in IR
     - Ontologies
     - Traditional vs. Semantics-based IR techniques
  7. Introduction: What is IR
     Information Retrieval: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
     Unstructured Data: data that does not have a clear, semantically overt, easy-for-a-computer structure, e.g. textual information in web pages.
     Semistructured Data: data that has a partially clear, semantically overt, easy-for-a-computer structure, e.g. documents with separate title and body fields, which make it possible to find a document where the title contains "Java" and the body contains "threading".
  8. Introduction: What is IR
     Structured Data: data that has a clear, semantically overt, easy-for-a-computer structure, e.g. relational databases.
  9. Introduction: What is IR
     A look back: in the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval systems. Online booking systems?
     Following this period, and after relentless optimization of IR, the field of information retrieval has moved from being a primarily academic discipline to being the basis underlying most people's preferred means of information access.
  10. Introduction: What is IR
      Information retrieval did not begin with the Web. The field began with scientific publications and library records, but soon spread to other forms of content, particularly those of information professionals, such as journalists, lawyers, and doctors.
  11. Introduction: Other Topics Related to IR
      - Cross-language IR
      - Multimedia IR
      - Speech retrieval
      - User interfaces for IR
      - Ontology and Semantics-based IR
      - Natural Language Processing (NLP) techniques
      - Dynamic IR
      - Online Advertising !?
  12. Introduction: Other Topics Related to IR
      The field of information retrieval also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents.
      - Clustering: given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents.
      - Classification: given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then hoping to be able to classify new documents automatically.
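The classification workflow described above (label some documents by hand, then classify new ones automatically) can be sketched as follows. This is a deliberately naive word-overlap classifier written for illustration, not a method prescribed by the course; the training sentences, the class names, and the small stopword list are all assumptions.

```python
from collections import Counter, defaultdict

# Manually labelled training documents (hypothetical examples).
LABELLED = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
]

# Very small, hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "as", "in", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def train(labelled):
    """Count how often each word occurs in the training documents of each class."""
    counts = defaultdict(Counter)
    for text, label in labelled:
        counts[label].update(tokenize(text))
    return counts


def classify(counts, text):
    """Assign the class whose training words overlap most with the text.

    Ties are resolved arbitrarily (first class seen); a real system would
    need a better score and a way to answer "no class".
    """
    scores = {label: sum(c[w] for w in tokenize(text)) for label, c in counts.items()}
    return max(scores, key=scores.get)


model = train(LABELLED)
print(classify(model, "a goal in the final match"))
print(classify(model, "interest rates and the market"))
```

Clustering differs in that no labelled examples are given: the grouping has to be derived from the document contents alone.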
  13. Introduction: Classification of IR systems
      Scale-based Classification of IR systems: distinguishing between information retrieval systems according to the scale at which they operate.
      1. Web search: the search is conducted over billions of documents stored on millions of computers.
         Issues to consider:
         1. Needing to gather documents for indexing.
         2. Being able to build systems that work efficiently at this enormous scale.
         3. Handling particular aspects of the web, such as the exploitation of hypertext and page ranking, given the commercial importance of the web.
  14. Introduction: Classification of IR systems
      2. Personal Information Retrieval: integrating information retrieval into consumer operating systems.
         Issues to consider:
         1. Handling the broad range of document types on a typical personal computer.
         2. Making the search system maintenance-free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner.
  15. Introduction: Classification of IR systems
      3. Enterprise, Institutional, and Domain-specific Search: a corporation's documents will typically be stored on centralized file systems, and one or a handful of dedicated machines will provide search over the collection.
         Issues to consider:
         1. Handling the broad range of document types on a centralized computer.
         2. Scale and efficiency of the IR system.
         3. Maintenance of the search system.
  16. Introduction: Classification of IR systems
      Technique-based Classification of IR systems: distinguishing between information retrieval systems according to the search technique that they employ.
      1. Keyword-based search: string matching algorithms are employed to find documents relevant to the user's query.
         Issues to consider:
         1. Precision and Recall of the search algorithm.
         2. Gap between the textual information contained in the document collections and the user's information need.
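Both issues listed for keyword-based search can be illustrated with a small sketch. It uses the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved). The three documents and the relevance judgements are invented for the example.

```python
def keyword_search(docs, query):
    """Return ids of documents that contain every query term as an exact word."""
    terms = query.lower().split()
    return {
        doc_id
        for doc_id, text in docs.items()
        if all(t in text.lower().split() for t in terms)
    }


def precision_recall(retrieved, relevant):
    """Precision = hits / number retrieved; recall = hits / number relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection and relevance judgements for the need "rent a car".
DOCS = {
    1: "cheap car rental in the city",
    2: "automobile hire at low prices",
    3: "car racing results",
}
RELEVANT = {1, 2}

retrieved = keyword_search(DOCS, "car")
print(retrieved, precision_recall(retrieved, RELEVANT))
```

In this toy case the query "car" retrieves documents 1 and 3, so precision and recall are both 0.5: document 3 matches the string but not the need, while document 2 satisfies the need but uses the word "automobile". That mismatch is the gap between the text and the user's information need.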
  17. Introduction: Classification of IR systems
      2. Semantics-based search: semantic aspects of the user's query are derived in an attempt to find documents relevant to the user's query.
         Issues to consider:
         1. Precision and Recall of the search algorithm.
         2. Lack of Semantic Resources.
         3. Incompleteness of Background Knowledge represented in existing Semantic Resources.
         4. Semantic Heterogeneity problem between existing Semantic Resources.
         5. Lack of Multi-lingual Semantic Resources.
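One simple way to picture semantics-based search is query expansion against a semantic resource. The sketch below assumes a tiny hand-made synonym table as a stand-in for such a resource; the table, the documents, and the function names are hypothetical, and real systems would typically draw on an ontology or thesaurus instead.

```python
# Hypothetical hand-made synonym table standing in for a semantic resource.
SYNONYMS = {
    "car": {"automobile"},
    "rental": {"hire"},
}


def expand(term):
    """Return the term together with any synonyms known to the resource."""
    return {term} | SYNONYMS.get(term, set())


def semantic_search(docs, query):
    """Return ids of documents that contain, for every query term, the term or one of its synonyms."""
    groups = [expand(t) for t in query.lower().split()]
    result = set()
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(words & group for group in groups):
            result.add(doc_id)
    return result


# Hypothetical collection.
DOCS = {
    1: "cheap car rental in the city",
    2: "automobile hire at low prices",
    3: "car racing results",
}

print(semantic_search(DOCS, "car rental"))
print(semantic_search(DOCS, "bike rental"))
```

The second query returns nothing because "bike" has no entry in the table, a small-scale version of the incompleteness issue listed above.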
  18. Introduction: Classification of IR systems
      3. Hybrid Approaches: keyword-based search is enriched with semantics-based search to retrieve results more relevant to the user's information needs.
         Issues to consider:
         1. Precision and Recall of the search algorithm.
         2. Lack of Semantic Resources.
         3. Priority of the employed techniques.
         4. Incompleteness of Background Knowledge represented in existing Semantic Resources.
         5. Types of queries that the system can handle (Single-term vs. Verbose queries).
         6. Lack of Multi-lingual Semantic Resources.
      Research is very active in this area. Example: DBpedia-based search engine (June 2015).
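A hybrid approach and the "priority of the employed techniques" issue can be sketched as a weighted score: an exact keyword match counts fully, and a match found only through the semantic resource counts with a lower weight. The synonym table, the documents, and the `semantic_weight` parameter are assumptions made for this illustration, not a description of any particular system.

```python
# Hypothetical synonym table standing in for a semantic resource.
SYNONYMS = {"car": {"automobile"}, "rental": {"hire"}}


def hybrid_score(text, query, semantic_weight=0.5):
    """Score 1.0 per query term matched exactly, semantic_weight per term matched only via a synonym."""
    words = set(text.lower().split())
    score = 0.0
    for term in query.lower().split():
        if term in words:
            score += 1.0
        elif words & SYNONYMS.get(term, set()):
            score += semantic_weight
    return score


def rank(docs, query, semantic_weight=0.5):
    """Return ids of documents with a non-zero score, best first (ties broken by id)."""
    scored = [(hybrid_score(text, query, semantic_weight), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


# Hypothetical collection.
DOCS = {
    1: "cheap car rental in the city",
    2: "automobile hire at low prices",
    3: "car racing results",
}

print(rank(DOCS, "car rental"))
print(rank(DOCS, "car rental", semantic_weight=0.25))
```

Lowering `semantic_weight` from 0.5 to 0.25 moves document 2 (matched only through synonyms) below document 3 (one exact match), which shows how the relative priority of the two techniques changes the ranking.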