LIS618 lecture 0 Thomas Krichel 2004-01-24

LIS618 lecture 0

Thomas Krichel

2004-01-24

today's lecture

• A look at the course home page http://wotan.liu.edu/home/krichel/lis618w04s

• administrative stuff

• historical matters about the course

• about me

• business of database searching

• indexes

• the Boolean information retrieval model

• practice example on Dialog

Organization

• homepage http://wotan.liu.edu/home/krichel/lis618w04s

• Contents to be discussed today.

• Send mail to [email protected]
  – Your name
  – Your secret word for grades delivery

• Interrupt me with as many questions as possible!

• Ask for breaks!

Proposed Organization

• Normal lecture

• Quiz at the beginning of every lecture
  – Factually oriented, around 15 minutes
  – Remove worst performance
  – Average to form 50%

• Search exercise 50%

• I may make some adjustment to the syllabus this week.

Search exercise

• Find victim of an information need

• Best to take someone you know in a professional capacity

• Conduct interview about an information need experienced by the victim, write down expectations

• Search in formal database and on web

• Discuss results with the victim

• Write essay, no longer than 5 pages.

about the course

• This course is new wine in an old bottle

• Officially a merger of
  – lis566 information resources on the Internet
    • mailing lists
    • usenet news
    • web searching
  – lis618 database searching
    • access and use of commercial databases

mix of theory and practice

• I am not a database search practitioner.

• Each database is different, practical skills are not easily transferable.

• Thus my emphasis in the course is more on theory.

• In the past, I taught theory first, then practice.

• This year I will try to mix: some theory and some practice in every session.

What databases?

• Dialog has been the traditional database covered.
  – They were the market leaders in online databases in the past.
  – Nowadays the field is much more open.
  – They remain a very good teaching tool for command-based database searching.

• Nexis: a news database I have covered every year.

• Google: a well-known search engine that I started to cover two years ago.

other stuff

• Other databases that I have covered in the past
  – OCLC FirstSearch
  – Factiva (briefly)
  – WestLaw (external speaker)

• Other important developments, not yet covered in previous editions
  – Peer-to-peer networks
  – an introduction to reference linking using OpenURL

About me

• Born 1965, in Völklingen (Germany)

• Studied economics and social sciences at the Universities of Toulouse, Paris, Exeter and Leicester.

• PhD in theoretical macroeconomics

• Lecturer in Economics at the University of Surrey between 1993 and 2001

• Since 2001 assistant professor at the Palmer School

Why?

• During my research assistantship period (1990 to 1993), I was constantly frustrated by difficult access to scientific literature.

• At the same time, I discovered easy access to freely downloadable software over the Internet.

• I decided to work towards downloadable scientific documents. This led to my library career (eventually).

Steps taken I

• 1993 founded the NetEc project at http://netec.mcc.ac.uk, later available at http://netec.ier.hit-u.ac.jp as well as at http://netec.wustl.edu.

• These are networking projects targeted to the economics community. The bulk is
  – Information about working papers
  – Downloadable working papers
  – Journal articles were added later

Steps taken II

• Set up RePEc, a digital library for economics research. Catalogs
  – Research documents
  – Collections of research documents
  – Researchers themselves
  – Organizations that are important to the research process

• Decentralized collection, model for the open archives initiative

Steps taken III

• Co-founder of Open Archives Initiative

• Work on the Academic Metadata Format

• Co-founded rclis (Research in Computing, Library and Information Science), a RePEc clone for these fields

• Currently working on the Konz project. It uses a database of titles of papers published in journals and tries to find them on the Internet.

my interest in databases

• Main emphasis of course is still on databases.

• From my point of view I have two interests in database searching
  – As a provider, I must understand how people search in order to provide some data that they can use and will use.
  – As an economist, I have a strong interest in information as a commodity. The database market is an important market place.

database searching (DS)

• Subset of the subject of information retrieval (IR)

• DS is mainly thought of as applying to large structured databases, as opposed to web searching.

• For those, a general knowledge of what databases are seems useful.

• Concentrate on textual databases

traditional social model

• User goes to a library

• Describes problem to the librarian

• Librarian does the search
  – without the user present
  – with the user present

• Hands over the result to the user

• User fetches full-text or asks a librarian to fetch the full text.

economic rationale for traditional model

• In olden days the cost of telecommunication was high.

• Database use costs
  – cost of communication
  – cost of access time to the database

• The traditional model puts an upper limit on the costs.

disintermediation

• With the cost of access time gone, the traditional model is under threat

• There is disintermediation where the librarian loses her role of doing the search.

• But that may not be good news for information retrieval results
  – user knows subject matter best
  – librarian knows searching best

Web searching

• IR has received a lot of impetus through the web, which poses unprecedented search challenges.

• With more and more data appearing on the web, DS may be a subject in decline
  – It is primarily concerned with non-web databases
  – There are more and more web-based methods of searching

Public access vs quality

• Now the public at large is able to do online searching.

• At the same time need for quality answers has grown.

• Quality-filtered services will become more important.

• In the current databases, there is a lot that would already be available for free, mixed with quality-controlled stuff.

• Publishers have direct offerings and intermediated vending is in decline.

Main theory part

• Literature:
  – "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.
  – "Information Retrieval in the Digital Age" by Heting Chu.

• You don't need to buy the books. You had better spend your time practicing on databases rather than reading books.

before the IR process

• provider
  – define data that is available
    • documents that can be used
    • document operations
    • document structure
  – index

• user
  – user need
  – IR system familiarity

the IR process

• Query expresses user need in a query language

• Processing of query yields retrieved documents

• Calculation of relevance ranking

• Examination of retrieved documents

• Possible return to the start, another query.

main problem

• User is not an expert at the formulation of a query

• Garbage in, garbage out: the retrieval yields poor results

• Ways around that problem
  – design very intuitive interface for the query
  – give expert guidance

taxonomy of classic IR models

• Boolean, or set-theoretic
  – fuzzy set models
  – extended Boolean

• Vector, or algebraic
  – generalized vector model
  – latent semantic indexing
  – neural network model

• Probabilistic
  – inference network
  – belief network

summary

• There are three basic types of models in classic information retrieval.

• Extensions of these types are a matter of research concern and require good mathematical skills.

• All classic models treat documents as individual pieces.

key aid: index

• An index is a list of terms, with a list of locations where the term is to be found.

• The way to express locations usually depends on the form that the indexed data takes.
  – for a book, it is usually the page number, e.g. "shmoo 34, 75"
  – for computer files it is usually the name of the file plus the number of the byte where the indexed term starts, e.g. "krichel index.html 34, cv.html 890 1209"

• There is usually more than one location of the term.
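
The file-based index described above can be sketched in a few lines of Python. This is only an illustration: the file names and contents are made up, and the function `build_index` is a hypothetical helper, not part of any system covered in the course. It records character offsets, which coincide with byte offsets only for plain ASCII text.

```python
import re
from collections import defaultdict

def build_index(files):
    """Map each term to {file name: [offsets where the term starts]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in files.items():
        # Every run of word characters counts as a term (a simplification).
        for match in re.finditer(r"\w+", text.lower()):
            index[match.group()][name].append(match.start())
    return index

# Made-up sample files, for illustration only.
files = {
    "index.html": "Home page of Thomas Krichel",
    "cv.html": "Krichel studied economics. Krichel teaches LIS618.",
}
index = build_index(files)
print(dict(index["krichel"]))
```

The entry for "krichel" lists one location in the first file and two in the second, which is the "more than one location" case.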

key aid: index terms

• The index term is a part of the document that has a meaning on its own.

• It is usually a noun.

• Retrieval based on index terms raises questions
  – semantics in query or document is lost
  – matching done in imprecise space of index terms

• Predicting relevance is a central problem

• The IR model determines the process of relevance ranking

basic concept: weight of index term

• Given all nouns, not all appear to have the same relevance to the text

• Sometimes, we can have a simple measure of the importance of a term, example?

• More generally, for each indexing term and each document we can associate a weight with the term and the document.

• Usually, if the document does not contain the term, its weight is zero
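
One possible simple measure, offered here as an assumption rather than as the answer intended in the lecture, is the raw count of a term in the document. A minimal sketch, where `term_weights` is a hypothetical helper and the sample text is made up:

```python
from collections import Counter

def term_weights(text):
    """Weight each term by its raw count in the document (one simple choice)."""
    return Counter(text.lower().split())

doc = "index terms and index weights"
weights = term_weights(doc)
print(weights["index"])    # term occurs twice
print(weights["boolean"])  # absent term, so the weight is zero
```

A `Counter` returns zero for a term it has never seen, which matches the convention that a document not containing the term gives it weight zero.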

Boolean model

• In the Boolean model, the weight of any index term for any document is 1 if the term appears in the document, and 0 otherwise.

• This allows query terms to be combined with the Boolean operators AND, OR, and NOT

• Thus powerful queries can be written
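
A minimal sketch of the Boolean model, assuming three made-up documents and whitespace-separated words as index terms. The helper names `weight` and `matching` are hypothetical. With 0/1 weights, each term defines a set of documents, and AND, OR and NOT become set intersection, union and difference.

```python
# Made-up documents, for illustration only.
docs = {
    1: "current awareness in digital libraries",
    2: "digital library software",
    3: "current events",
}

def weight(term, doc_id):
    """Boolean weight: 1 if the term appears in the document, 0 otherwise."""
    return 1 if term in docs[doc_id].split() else 0

def matching(term):
    """The set of documents in which the term has weight 1."""
    return {d for d in docs if weight(term, d) == 1}

all_docs = set(docs)
result_and = matching("current") & matching("digital")   # current AND digital
result_or = matching("library") | matching("libraries")  # library OR libraries
result_not = all_docs - matching("digital")              # NOT digital
print(result_and, result_or, result_not)
```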

Classic implementation: Dialog

• The documents that I have used

– http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf

– http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf

– http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf

– http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf

• I am also told that there are others at http://gep.dialog.com/instruction/

Dialog is a databank

• over 500 databases

• these are also known as files and cover
  – references and abstracts for published literature
  – business information and financial data
  – complete text of articles and news stories
  – statistical tables
  – directories

• DIALOG uses the Boolean model

DIALOG interface

• It is still rooted in "traditional" database systems.

• It has been dismissed as "dial-a-dog".

• It uses a command-driven interface.

• It is very complicated to learn fully.

• It is not suitable for the end-user.

• It therefore offers a valuable skill to the information professional.

Accessing DIALOG

• On the web, go to

• http://www.dialogweb.com/

• Enter username and password

• Forget about subaccount

• Then click on logon

• On the next screen go to command search

• "continue" at the next screen

two steps in DIALOG

• Step one: select databases (aka files) to look at

• Step two: perform searches on the selected databases

• You may wonder why one does not have one single step like in a search engine. Discuss.

sample search

• We want to know something about "current awareness in digital libraries"

• From dialogweb command search:
  – databases
  – social sciences and humanities
  – library and information science

• This leads you to http://www.dialogweb.com/cgi/logoff?mode=guided&url=/cgi/dwframe?href=search.html

This is database selection…

• At that screen you see a number of "files" with their number.

• You can select those that you want to search

• Then you click "begin database"

• and you get back to the command search

• "b numbers" it will say. That is the command to begin working with files.

Boolean search

• Do a number of searches
  – s current(N)awareness
  – s digital(N)library
  – s digital(N)libraries

• Each search retrieves a set of documents

• The sets can be combined
  – s s1 and (s2 or s3)
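
The search session above can be imitated in Python to see what the sets look like. This is a sketch under stated assumptions and not how Dialog works internally: the records are made up, `select` is a hypothetical stand-in for the `s` command, and `(N)` is approximated as "the two words are adjacent, in either order".

```python
# Made-up records standing in for a selected file.
records = {
    1: "current awareness services for digital libraries",
    2: "building a digital library",
    3: "current awareness in chemistry",
}

def adjacent(a, b, text):
    """True if words a and b occur next to each other, in either order."""
    words = text.lower().split()
    pairs = set(zip(words, words[1:]))
    return (a, b) in pairs or (b, a) in pairs

def select(a, b):
    """Imitates 's a(N)b': the set of record numbers that match."""
    return {rid for rid, text in records.items() if adjacent(a, b, text)}

s1 = select("current", "awareness")
s2 = select("digital", "library")
s3 = select("digital", "libraries")
s4 = s1 & (s2 | s3)   # imitates 's s1 and (s2 or s3)'
print(s1, s2, s3, s4)
```

Searching for both "library" and "libraries" and joining the two sets with OR is what catches record 1, which only uses the plural.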

What is the deal?

• There are two stages.

• At stage two we make Boolean queries.

• Each query splits the records into matching and non-matching records.

• The set of matching records is returned.

• It can be further searched or combined with other sets using Boolean operators.

• Try this at home.

http://openlib.org/home/krichel

Thank you for your attention!