LIS618 lecture 1 Thomas Krichel 2002-09-15. Organization homepage

LIS618 lecture 1

Thomas Krichel

2002-09-15

Organization

• homepage http://wotan.liu.edu/home/krichel/lis618n02a

• Contents to be discussed today.

• Send mail to [email protected]
– Your name
– Your secret word for grades delivery

• Interrupt me with as many questions as possible!

• Ask for breaks!


Proposed Organization

• Normal lecture

• Quiz at the beginning of every lecture.

• Main quiz next week (25% of grade)

• Search exercise 55%

• Other quizzes 10%

• Formal syllabus to be made early next week!

Search exercise

• find victim

• conduct interview about an information need experienced by the victim, write down expectations

• search in Dialog and on web

• discuss results with the victim

• write essay, no longer than 7 pages.

Structure of talk

• First talk about me, then about you and the course

• General round trip on theoretical matters.
– Context of database searching
– Database searching and information retrieval
– The retrieval process
– Information retrieval models
– Retrieval performance evaluation
– Query languages

• Logging on to Dialog

• Web searching exercise (if time permits)

About me

• Born 1965, in Völklingen (Germany)

• Studied economics and social sciences at the Universities of Toulouse, Paris, Exeter and Leicester.

• PhD in theoretical macroeconomics

• Lecturer in Economics at the University of Surrey between 1993 and 2001

• Since 2001 assistant professor at the Palmer School

Why?

• During my research assistantship period (1990 to 1993), I was constantly frustrated by difficult access to scientific literature.

• At the same time, I discovered easy access to freely downloadable software over the Internet.

• I decided to work towards downloadable scientific documents. This led to my library career (eventually).

Steps taken I

• 1993 founded the NetEc project at http://netec.mcc.ac.uk, later available at http://netec.ier.hit-u.ac.jp as well as at http://netec.wustl.edu.

• These are networking projects targeted at the economics community. The bulk is
– Information about working papers
– Downloadable working papers
– Journal articles were added later


Steps taken II

• Set up RePEc, a digital library for economics research. It catalogs
– Research documents
– Collections of research documents
– Researchers themselves
– Organizations that are important to the research process

• Decentralized collection, model for the Open Archives Initiative

Steps taken III

• Co-founder of Open Archives Initiative

• Work on the Academic Metadata Format

• Co-founded rclis (Research in Computing, Library and Information Science), a RePEc clone for that field

summary

• There are three basic types of models in classic information retrieval.

• Extensions of these types are a matter of research concern and require good mathematical skills.

• All classic models treat documents as individual pieces.

Database searching (DS)

• subset of the subject of information retrieval (IR)

• DS is mainly thought of as applying to large structured databases, as opposed to web searching

• for those, a general knowledge of what databases are seems useful

• Concentrate on textual databases

traditional social model

• user goes to a library

• describes problem to the librarian

• librarian does the search
– without the user present
– with the user present

• hands over the result to the user

• user fetches full-text or asks a librarian to fetch the full text.

economic rationale for traditional model

• In olden days the cost of telecommunication was high.

• database use costs
– cost of communication
– cost of access time to the database

• the traditional model controls an upper bound on costs

disintermediation

• with the cost of access time gone, the traditional model is under threat

• there is disintermediation, where the librarian loses her role

• but that may not be good news for information retrieval results
– user knows subject matter best
– librarian knows searching best

Web searching

• IR has received a lot of impetus through the web, which poses unprecedented search challenges.

• with more and more data appearing on the web DS may be a subject in decline, because it is primarily concerned with non-web databases

Main theory part

• Literature: "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

• Don't buy it. It is not a good book.

before the IR process

• provider
– define data that is available
  • documents that can be used
  • document operations
  • document structure
– index

• user
– user need
– IR system familiarity

the IR process

• query expresses user need in a query language

• processing of query yields retrieved documents

• calculation of relevance ranking

• examination of retrieved documents

• possible relevance cycle

main problem

• user is not an expert at the formulation of a query

• garbage in, garbage out: the retrieval yields poor results

• ways out
– design very intuitive interface
– give expert guidance

key aid: index

• index term is a part of the document that has a meaning on its own (usually a noun)

• retrieval based on index terms raises questions
– semantics in query or document is lost
– matching done in imprecise space of index terms

• predicting relevance is a central problem

• the IR model determines the process of relevance ranking

taxonomy of classic IR models

• Boolean, or set-theoretic
– fuzzy set models
– extended Boolean

• vector, or algebraic
– generalized vector model
– latent semantic indexing
– neural network model

• probabilistic
– inference network
– belief network

basic concepts: index term

• an index term is a word whose semantics help to remember the document's main themes.

• nouns are mainly used

• if all words are index terms, the logical view of the document is full text
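A minimal sketch of the full-text logical view, where every word of a document counts as an index term. The two sample documents and the function names `full_text_terms` and `build_index` are invented for illustration; they are not part of the lecture or of any real system.

```python
import re

# Hypothetical two-document collection, invented for illustration.
documents = {
    1: "The librarian searches the database for the user.",
    2: "The user describes the problem to the librarian.",
}

def full_text_terms(text):
    """Treat every word as an index term (the full-text logical view)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def build_index(docs):
    """Map each index term to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in full_text_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(documents)
print(sorted(index["librarian"]))  # ids of documents containing "librarian"
print(sorted(index["database"]))   # ids of documents containing "database"
```

Restricting `full_text_terms` to nouns only would give the narrower logical view that the slide describes as the usual case.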

basic concept: weight of index term

• given all nouns, not all appear to have the same relevance to the text

• sometimes, we can have a simple measure of the importance of a term, example?

• more generally, for each indexing term and each document we can associate a weight with the term and the document.

• usually, if the document does not contain the term, its weight is zero
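One simple candidate for such a weight is the raw count of the term in the document. The sketch below assumes that choice; it is only one of many possible weighting schemes, and the sample document, the vocabulary, and the name `term_weights` are hypothetical.

```python
from collections import Counter

# Hypothetical document, already reduced to its index terms.
doc_terms = ["database", "search", "database", "librarian", "database"]

def term_weights(terms, vocabulary):
    """Weight each vocabulary term by its raw count in the document.

    Terms that do not occur in the document get weight 0.
    """
    counts = Counter(terms)
    return {term: counts[term] for term in vocabulary}

vocabulary = ["database", "librarian", "search", "web"]
weights = term_weights(doc_terms, vocabulary)
print(weights)  # {'database': 3, 'librarian': 1, 'search': 1, 'web': 0}
```

Note that "web" is in the vocabulary but not in the document, so its weight is zero.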

basic concept: mutual term independence

• Thinking of the weight of a term as a function of the document and the term only implies that it is independent of other terms.

• This is an important oversimplification.

• But it allows for fast computation.

• No study has shown that dropping the independence assumption brings a significant performance gain.

Boolean model

• in the Boolean model, the weight of an index term for a document is 1 if the term appears in the document, and 0 otherwise.

• This allows query terms to be combined with the Boolean operators AND, OR, and NOT

• thus powerful queries can be written

example: a AND (b OR NOT c)

• 1: a b c

• 2: a b

• 3: a c

• 4: c b

• 5: c

• 6: b

• 7: a
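The example can be checked mechanically. The sketch below assumes that the seven numbers and the seven term lists on this slide pair up in order (document 1 contains a, b and c, and so on); the function name `matches` is made up for illustration.

```python
# The seven documents from the example, each reduced to its set of index terms.
docs = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"c", "b"},
    5: {"c"},
    6: {"b"},
    7: {"a"},
}

def matches(terms):
    """Evaluate a AND (b OR NOT c) with binary (0/1) term weights."""
    a, b, c = "a" in terms, "b" in terms, "c" in terms
    return a and (b or not c)

retrieved = [doc_id for doc_id, terms in docs.items() if matches(terms)]
print(retrieved)  # [1, 2, 7]
```

Document 3 is rejected because it contains c but not b, and documents 4 to 6 are rejected because they lack a.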

advantages of Boolean model

• supposedly easy to grasp by the user

• precise semantics of queries

• implemented in the majority of commercial systems

• why is it set-theoretic?
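One way to see the set-theoretic label: each term stands for the set of documents that contain it, and AND, OR and NOT become intersection, union and complement. The sketch below reuses the seven documents from the example a AND (b OR NOT c), again assuming that numbers and term lists pair up in order; `posting` is a hypothetical helper name.

```python
# Documents from the example query a AND (b OR NOT c).
docs = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"c", "b"},
    5: {"c"},
    6: {"b"},
    7: {"a"},
}
all_docs = set(docs)

def posting(term):
    """Return the set of ids of documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

A, B, C = posting("a"), posting("b"), posting("c")

# AND is intersection, OR is union, NOT is complement within the collection.
result = A & (B | (all_docs - C))
print(sorted(result))  # [1, 2, 7]
```

A document is either in the result set or not, which is also why the model offers no ranking.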

problems of Boolean model

• sharp distinction between relevant and irrelevant documents

• no ranking possible

• users find it difficult to formulate Boolean queries

http://openlib.org/home/krichel

Thank you for your attention!
