LIS618 lecture 1
Thomas Krichel
2002-09-15
Organization
• homepage http://wotan.liu.edu/home/krichel/lis618n02a
• Contents to be discussed today.
• Send mail to [email protected]
– Your name
– Your secret word for grades delivery
• Interrupt me with as many questions as possible!
• Ask for breaks!
Proposed Organization
• Normal lecture
• Quiz at the beginning of every lecture.
• Main quiz next week (25% of grade)
• Search exercise 55%
• Other quizzes 10%
• Formal syllabus to be made early next week!
Search exercise
• find victim
• conduct interview about an information need experienced by the victim, write down expectations
• search in Dialog and on web
• discuss results with the victim
• write essay, no longer than 7 pages.
Structure of talk
• First talk about me, then about you and the course
• General round trip on theoretical matters.
– Context of database searching
– Database searching and information retrieval
– The retrieval process
– Information retrieval models
– Retrieval performance evaluation
– Query languages
• Logging on to Dialog
• Web searching exercise (if time permits)
About me
• Born 1965, in Völklingen (Germany)
• Studied economics and social sciences at the Universities of Toulouse, Paris, Exeter and Leicester.
• PhD in theoretical macroeconomics
• Lecturer in Economics at the University of Surrey between 1993 and 2001
• Since 2001 assistant professor at the Palmer School
Why?
• During my research assistantship (1990 to 1993), I was constantly frustrated by difficult access to scientific literature.
• At the same time, I discovered easy access to freely downloadable software over the Internet.
• I decided to work towards downloadable scientific documents. This led to my library career (eventually).
Steps taken I
• 1993 founded the NetEc project at http://netec.mcc.ac.uk, later available at http://netec.ier.hit-u.ac.jp as well as at http://netec.wustl.edu.
• These are networking projects targeted at the economics community. The bulk is
– Information about working papers
– Downloadable working papers
– Journal articles were added later
Steps taken II
• Set up RePEc, a digital library for economics research. Catalogs
– Research documents
– Collections of research documents
– Researchers themselves
– Organizations that are important to the research process
• Decentralized collection, model for the open archives initiative
Steps taken III
• Co-founder of Open Archives Initiative
• Work on the Academic Metadata Format
• Co-founded rclis, a RePEc clone for Research in Computing, Library and Information Science
summary
• There are three basic types of models in classic information retrieval.
• Extensions of these types are a matter of research concern and require good mathematical skills.
• All classic models treat documents as individual pieces.
Database searching (DS)
• subset of the subject of information retrieval (IR)
• DS is mainly thought of as applying to large structured databases, as opposed to web searching
• for those, a general knowledge of what databases are seems useful
• Concentrate on textual databases
traditional social model
• user goes to a library
• describes problem to the librarian
• librarian does the search
– without the user present
– with the user present
• hands over the result to the user
• user fetches full-text or asks a librarian to fetch the full text.
economic rationale for traditional model
• In olden days the cost of telecommunication was high.
• database use costs
– cost of communication
– cost of access time to the database
• the traditional model controls an upper bound on costs
disintermediation
• with the cost of access time gone, the traditional model is under threat
• there is disintermediation, where the librarian loses her role
• but that may not be good news for information retrieval results
– user knows subject matter best
– librarian knows searching best
Web searching
• IR has received a lot of impetus through the web, which poses unprecedented search challenges.
• with more and more data appearing on the web DS may be a subject in decline, because it is primarily concerned with non-web databases
Main theory part
• Literature: "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
• Don't buy it. It is not a good book.
before the IR process
• provider
– define data that is available
  • documents that can be used
  • document operations
  • document structure
– index
• user
– user need
– IR system familiarity
the IR process
• query expresses user need in a query language
• processing of query yields retrieved documents
• calculation of relevance ranking
• examination of retrieved documents
• possible relevance cycle
main problem
• user is not an expert at the formulation of a query
• garbage in, garbage out: the retrieval yields poor results
• ways out
– design very intuitive interface
– give expert guidance
key aid: index
• index term is a part of the document that has a meaning on its own (usually a noun)
• retrieval based on index terms raises questions
– semantics in query or document is lost
– matching done in imprecise space of index terms
• predicting relevance is a central problem
• the IR model determines the process of relevance ranking
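To make the idea of an index concrete, here is a minimal sketch in Python. It is an illustration only: the three short documents are invented, and every word is taken as an index term, which is simpler than restricting the index to nouns.

```python
from collections import defaultdict

# Hypothetical toy collection; the texts are invented for illustration.
documents = {
    1: "library catalog search",
    2: "web search engine",
    3: "library database access",
}

def build_index(docs):
    """Map each index term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Simplifying assumption: every whitespace-separated word is an index term.
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

index = build_index(documents)
print(sorted(index["search"]))   # ids of documents containing "search"
print(sorted(index["library"]))  # ids of documents containing "library"
```

Note that the index only records which terms occur where. Word order and context are gone, which is the loss of semantics mentioned above.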
taxonomy of classic IR models
• Boolean, or set-theoretic
– fuzzy set models
– extended Boolean
• vector, or algebraic
– generalized vector model
– latent semantic indexing
– neural network model
• probabilistic
– inference network
– belief network
basic concepts: index term
• an index term is a word whose semantics help to remember the document's main themes.
• nouns are mainly used
• if all words are index terms, the logical view of the document is full text
basic concept: weight of index term
• given all nouns, not all appear to have the same relevance to the text
• sometimes, we can have a simple measure of the importance of a term, example?
• more generally, for each indexing term and each document we can associate a weight with the term and the document.
• usually, if the document does not contain the term, its weight is zero
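One possible answer to the "example?" question is a raw count of how often the term occurs in the document. The sketch below assumes this counting scheme purely for illustration; the function name `term_weights` and the sample text are made up, and real systems typically use more refined weights.

```python
from collections import Counter

def term_weights(text, index_terms):
    """Weight of each index term in one document, here simply its number of occurrences."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for absent words, so a missing term gets weight zero.
    return {term: counts[term] for term in index_terms}

# Hypothetical document text and index terms.
doc = "database searching and database access in a library"
weights = term_weights(doc, ["database", "library", "web"])
print(weights)
```

The weight depends only on the term and the document, so the same function can be applied to each document of a collection in turn.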
basic concept: mutual term independence
• Thinking of the weight of a term as a function of the document and the term only implies that it is independent of other terms.
• This is an important oversimplification.
• But it allows for fast computation.
• No study has shown that dropping the independence assumption brings a significant performance gain.
Boolean model
• in the Boolean model, the weight of an index term for a document is 1 if the term appears in the document. It is 0 otherwise.
• This allows query terms to be combined with the Boolean operators AND, OR, and NOT
• thus powerful queries can be written
example: a AND (b OR NOT c)
• 1: a b c
• 2: a b
• 3: a c
• 4: c b
• 5: c
• 6: b
• 7: a
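The example can be worked through with plain set operations, which also hints at why the model is called set-theoretic. This is a sketch, not how Dialog or any particular system evaluates queries; it assumes the seven documents contain the terms {a, b, c}, {a, b}, {a, c}, {c, b}, {c}, {b} and {a} respectively, and the helper `having` is a hypothetical name.

```python
# Terms contained in documents 1 to 7 of the example.
docs = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"c", "b"},
    5: {"c"},
    6: {"b"},
    7: {"a"},
}
all_ids = set(docs)

def having(term):
    """Set of ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# AND is intersection, OR is union, NOT is the complement within the collection.
result = having("a") & (having("b") | (all_ids - having("c")))
print(sorted(result))  # [1, 2, 7]
```

Documents 1 and 2 match because they contain both a and b. Document 7 matches because it contains a and lacks c. Document 3 contains a but fails the bracket, since it has c and no b.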
advantages of Boolean model
• supposedly easy to grasp by the user
• precise semantics of queries
• implemented in the majority of commercial systems
• why is it set-theoretic?
problems of Boolean model
• sharp distinction between relevant and irrelevant documents
• no ranking possible
• users find it difficult to formulate Boolean queries
http://openlib.org/home/krichel
Thank you for your attention!