Organizational issues Course overview

Web Technologies (Technologien für das Internet I)

Foundations of Information Retrieval
http://www2.kbs.uni-hannover.de/internet1.html

Introduction

Prof. Wolfgang Nejdl, Elena Demidova

Institut für Verteilte Systeme
L3S Research Center

Leibniz Universität Hannover

13 October 2011

Prof. Wolfgang Nejdl, Elena Demidova: Introduction (with slides of Hinrich Schütze)

Plan for today

Organizational issues

Course overview

Outline

1 Organizational issues

2 Course overview

Information for the ITIS students

A transmission of the lectures is available on request to Internet Technologies and Information Systems (ITIS) students located outside Hannover at https://webconf.vc.dfn.de/foundations_of_ir/. If this applies to you, please request a password by email from Elena Demidova, demidova (at) L3S.de

Lecture and exam dates

We meet every Thursday, 11:30 to (ca.) 14:00. The exercise sessions follow the lectures (do not be late!).

The lectures start next week (October 20).

The exercise sessions start the week after (October 27).

StudIP: please register for the exercise sessions (mailing list).

Exam: 90-minute written exam on March 5, 2012.

Prerequisites for the ITIS students will be clarified later.

Literature

Selected chapters of the IIR book: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://www-nlp.stanford.edu/IR-book/

Outlook for the lectures

We will look at the algorithms and technologies that modern search engines use to satisfy the information needs of users from within large document collections (usually stored on computers).

IIR 01: Boolean retrieval

Design and data structures of a simple information retrieval system

Queries are Boolean expressions, e.g., Caesar AND Brutus

The search engine returns all documents that satisfy the Boolean expression.

Inverted index:

Brutus −→ 1 2 4 11 31 45 173 174

Caesar −→ 1 2 4 5 6 16 57 132 ...

(The terms on the left form the dictionary; the docID lists on the right are the postings.)
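
A Boolean AND query over the inverted index above can be answered by intersecting two sorted postings lists. The following Python sketch shows one common way to do this with a linear merge; it is an illustration, not necessarily the implementation covered in the lecture. The function name `intersect` and the dictionary layout are assumptions, and the Caesar list holds only the docIDs visible above because the slide truncates it.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists and keep the docIDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Dictionary (terms) mapped to postings (sorted docIDs), as shown above.
# The Caesar list is truncated in the source; only the visible docIDs are used.
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
}

result = intersect(index["Brutus"], index["Caesar"])
print(result)  # [1, 2, 4]
```

Because both lists are sorted, the merge touches each posting at most once, so the cost is linear in the combined length of the two lists.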

IIR 02: The term vocabulary and postings lists

Phrase queries: Stanford University

Proximity: find Gates near Microsoft

We need an index that captures position information for phrase queries and proximity queries.
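
To make the idea of position information concrete, here is a minimal Python sketch of a positional index. It assumes whitespace tokenization and lowercasing; the names `build_positional_index` and `within` and the toy documents are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; docs maps doc_id to text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def within(index, first, second, k, ordered=False):
    """Return sorted doc_ids where `second` occurs within k positions of `first`.

    With ordered=True, `second` must come after `first`, so k=1 with
    ordered=True answers a two-word phrase query.
    """
    hits = []
    postings1 = index.get(first, {})
    postings2 = index.get(second, {})
    for doc_id in postings1.keys() & postings2.keys():
        for p1 in postings1[doc_id]:
            if any(
                (0 < p2 - p1 <= k) if ordered else (0 < abs(p2 - p1) <= k)
                for p2 in postings2[doc_id]
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical toy collection, for illustration only.
docs = {
    1: "she studied at stanford university last year",
    2: "the university near stanford avenue",
    3: "gates founded microsoft",
    4: "microsoft was led by bill gates for many years",
}
index = build_positional_index(docs)

phrase_hits = within(index, "stanford", "university", k=1, ordered=True)
near_hits = within(index, "gates", "microsoft", k=5)
```

Here `phrase_hits` contains only document 1, since document 2 mentions both words but not adjacent and in order, while `near_hits` contains documents 3 and 4. An index that stores only docIDs could not tell documents 1 and 2 apart.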

IIR 03: Dictionaries and tolerant retrieval

Bigram index (excerpt): each bigram points to the vocabulary terms that contain it.

bo −→ aboard about boardroom border

or −→ border lord morbid sordid

rd −→ aboard ardent boardroom border
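
The bigram lists above can be used to find vocabulary terms close to a misspelled query. The sketch below assumes bigrams without word-boundary markers and a simple "shares at least two bigrams" rule; the function names, the threshold, and the query `bord` are illustrative assumptions, and the vocabulary is limited to the terms visible above.

```python
from collections import defaultdict


def bigrams(term):
    """Return the set of character bigrams of a term (no boundary markers)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def build_bigram_index(vocabulary):
    """Map each bigram to the sorted list of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return {gram: sorted(terms) for gram, terms in index.items()}


def candidates(index, query, min_shared=2):
    """Return terms sharing at least min_shared bigrams with the query."""
    counts = defaultdict(int)
    for gram in bigrams(query):
        for term in index.get(gram, []):
            counts[term] += 1
    return sorted(t for t, c in counts.items() if c >= min_shared)


# Vocabulary limited to the terms visible in the lists above.
vocabulary = ["aboard", "about", "ardent", "boardroom",
              "border", "lord", "morbid", "sordid"]
index = build_bigram_index(vocabulary)

print(index["bo"])  # ['aboard', 'about', 'boardroom', 'border']
matches = candidates(index, "bord")
```

For the query `bord`, the terms sharing at least two of its bigrams (bo, or, rd) are aboard, boardroom, border, lord and sordid. In this sketch the full `rd` list also contains lord and sordid, which suggests the lists shown above are excerpts.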

IIR 06: Scoring, term weighting and the vector space model

Ranking search results

Boolean queries only give inclusion or exclusion of documents.

For ranked retrieval, we measure the proximity from query to each document.

One formalism for doing this: the vector space model
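
As a rough illustration of the vector space model, the following Python sketch represents the query and each document as a term-frequency vector and ranks documents by cosine similarity. Raw term frequencies are a simplifying assumption (weighting schemes such as tf-idf are common in practice), and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = tf_vector(query)
    scores = {doc_id: cosine(q, tf_vector(text)) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "caesar was killed by brutus",
    "d2": "brutus and caesar and calpurnia",
    "d3": "the weather in rome",
}
ranking = rank("brutus caesar", docs)
```

Unlike a Boolean query, which would treat d1 and d2 as equally good matches, the ranking orders them by score (d1 before d2) and places d3 last with a score of 0.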

IIR 08: Evaluation and dynamic summaries

Benchmarks (e.g. TREC = Text Retrieval Conference)

Measures (Precision, Recall, etc.)
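
For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A small Python sketch, with hypothetical docIDs and relevance judgments:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query from collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: the system returned 4 documents, 3 of them relevant,
# and the collection holds 6 relevant documents in total.
precision, recall = precision_recall(
    retrieved=[1, 2, 4, 7],
    relevant=[1, 2, 4, 11, 31, 45],
)
print(precision, recall)  # 0.75 0.5
```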

IIR 09: Relevance feedback and query expansion

IIR 10: XML retrieval

Semi-structured / structured documents vs. unstructured documents

Can we utilize the structure of the data in IR systems?

Databases support search for numerical range and exact match, e.g., salary < 60,000 and manager=Smith.

If your data is structured and you only need precise queries like this (numerical, exact match etc), do not use an IR system.
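
To illustrate what such a precise query looks like in a database, here is a small sketch using Python's built-in `sqlite3` module. The `employees` table and its rows are hypothetical; only the condition itself comes from the example above.

```python
import sqlite3

# In-memory database with a hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER, manager TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Jones", 55000, "Smith"),
        ("Miller", 72000, "Smith"),
        ("Brown", 48000, "Taylor"),
    ],
)

# Numerical range plus exact match: every row either satisfies the
# condition or it does not, so there is nothing to rank.
rows = conn.execute(
    "SELECT name FROM employees WHERE salary < ? AND manager = ? ORDER BY name",
    (60000, "Smith"),
).fetchall()
names = [name for (name,) in rows]
conn.close()
```

Only Jones satisfies both conditions. The answer is an exact set with no notion of partial match or relevance, which is the situation where a database fits better than an IR system.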

IIR 13: Text classification and Naive Bayes

Text classification = assigning documents automatically to predefined classes

Examples:

a. Language (English vs. French)
b. Location
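
The language example can be sketched with a small multinomial Naive Bayes classifier in Python. This is a minimal illustration assuming whitespace tokenization and add-one smoothing; the function names and the four training sentences are hypothetical and far too few for real use.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Estimate class counts and per-class term counts from (text, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denominator = sum(term_counts[label].values()) + len(vocabulary)
        for token in text.lower().split():
            if token in vocabulary:  # ignore terms never seen in training
                score += math.log((term_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, for illustration only.
training = [
    ("the cat is on the table", "English"),
    ("where is the train station", "English"),
    ("le chat est sur la table", "French"),
    ("la gare est dans la ville", "French"),
]
model = train(training)

print(classify("the station is here", model))  # English
print(classify("la gare est ici", model))      # French
```

Working with log probabilities avoids numerical underflow when many term probabilities are multiplied, and add-one smoothing keeps a term unseen in one class from driving that class's probability to zero.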

IIR 16: Flat clustering

IIR 19: Web search

Unusual and diverse documents

Unusual and diverse users and information needs

Beyond terms and text: exploit link analysis, user data

How do web search engines work?

How can we make them better?

IIR 20: Crawling

Figure: crawler architecture. Components shown: www, DNS, fetch, parse, content seen?, doc FPs, robots templates, URL filter, dup URL elim, URL set, URL frontier.
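
The components named in the crawler figure can be tied together in a simplified crawl loop. The Python sketch below runs against a hypothetical in-memory "web" instead of the network, so DNS resolution and HTTP fetching are replaced by a dictionary lookup and robots handling is omitted. The URLs, the filter rule, and the function name `crawl` are assumptions made for illustration.

```python
from collections import deque
import hashlib

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# It stands in for the DNS lookup and HTTP fetch of a real crawler.
WEB = {
    "http://a.example/": ("home page", ["http://a.example/x", "http://b.example/"]),
    "http://a.example/x": ("page x", ["http://a.example/", "ftp://a.example/file"]),
    "http://b.example/": ("home page", ["http://a.example/x"]),  # duplicate content
}


def crawl(seeds):
    """Crawl WEB breadth-first; return URLs whose content was newly seen."""
    frontier = deque(seeds)          # URL frontier
    seen_urls = set(seeds)           # URL set used for duplicate URL elimination
    fingerprints = set()             # doc FPs used by the "content seen?" test
    indexed = []
    while frontier:
        url = frontier.popleft()
        if url not in WEB:           # fetch failed: nothing to parse
            continue
        text, links = WEB[url]       # fetch + parse
        fp = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fp in fingerprints:       # content seen? -> skip the duplicate
            continue
        fingerprints.add(fp)
        indexed.append(url)
        for link in links:
            if not link.startswith("http://"):   # URL filter (assumed rule)
                continue
            if link in seen_urls:                # dup URL elim
                continue
            seen_urls.add(link)
            frontier.append(link)
    return indexed


pages = crawl(["http://a.example/"])
```

In this run the crawler keeps `http://a.example/` and `http://a.example/x`, drops `http://b.example/` because its content fingerprint was already seen, and filters out the `ftp://` link before it reaches the frontier.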

IIR 21: Link analysis / PageRank

Questions?

Thank you for your attention!
