8/28/2001 Information Organization and Retrieval SIMS 202 Information Organization and Retrieval Prof. Ray Larson & Prof. Warren Sack UC Berkeley SIMS Tues/Thurs 10:30-12:00 am Fall 2001 Lecture authors: Marti Hearst & Ray Larson



Today

• Introductions

• Course Overview

• Administrivia


Goals of the Course

• Learn about:
  – Design, development and use of information storage and retrieval systems
  – Practical and theoretical foundations of information organization and analysis
  – Evaluation of information access systems
  – Cognitive and user-centric considerations
  – Hands-on experience with information systems


Two Main Themes

Information Organization and Design

Information Retrieval and the Search Process


Web Search Questions

• What do people search for?

• How do people use search engines?
  – How often do people find what they are looking for?
  – How difficult is it for people to find what they are looking for?

• How can search engines be improved?


What Do People Search for on the Web?

• Study by Spink et al., Oct 98
  – www.shef.ac.uk/~is/publications/infres/paper53.html
  – Survey on Excite, 13 questions
  – Data for 316 surveys


What Do People Search for on the Web?

Topics
• Genealogy/Public Figure: 12%
• Computer related: 12%
• Business: 12%
• Entertainment: 8%
• Medical: 8%
• Politics & Government: 7%
• News: 7%
• Hobbies: 6%
• General info/surfing: 6%
• Science: 6%
• Travel: 5%
• Arts/education/shopping/images: 14%

Something is missing…


What do people search for on the web?

50,000 queries from Excite, 1997. Most frequent terms:

• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes


Why do these differ?

• Self-reporting survey

• The nature of language
  – Only a few ways to say certain things
  – Many different ways to express most concepts
    • UFO, Flying Saucer, Space Ship, Satellite
    • How many ways are there to talk about history?


Intranet Queries (Aug 2000)

• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid


Intranet Queries

• Summary of sample data from 3 weeks of UCB queries

– 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)

– 6.7% Schedule of classes or final exams (6222)

– 5.4% Summer Session (5041)

– 3.2% Extension (2932)

– 3.1% Academic Calendar (2846)

– 2.4% Directories (2202)

– 1.7% Career Center (1588)

– 1.7% Housing (1583)

– 1.5% Map (1393)

• Average query length over last 4 months: 1.8 words

• This suggests what is difficult to find from the home page
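
Tallies like the ones above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis that was actually run on the UCB logs: the sample queries and the helper names `normalize` and `summarize` are hypothetical, and the code assumes queries are logged with `+` between words, as in the Aug 2000 list.

```python
from collections import Counter

def normalize(query):
    """Lowercase a logged query and split it on '+' or whitespace (assumed log format)."""
    return query.lower().replace("+", " ").split()

def summarize(queries):
    """Return (frequency of each normalized query, average query length in words)."""
    counts = Counter(" ".join(normalize(q)) for q in queries)
    total_words = sum(len(normalize(q)) for q in queries)
    avg_len = total_words / len(queries) if queries else 0.0
    return counts, avg_len

# Hypothetical sample, not real log data.
sample = ["bearfacts", "bear+facts", "campus+map", "Bearfacts", "schedule+of+classes"]
counts, avg_len = summarize(sample)
print(counts.most_common(2))
print(round(avg_len, 2))
```

Note that `bearfacts` and `bear+facts` are still counted as different queries here. Deciding which spellings to merge is a separate judgment call, which is part of why the same service shows up several times in the list.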


An Example Search System: Cha-Cha

• A system for searching complex intranets

• Places retrieval results in context


An Example Search System: Cha-Cha

• Important design goals:

– Users at any level of computer expertise

– Browsers at any version level

– Computers of any speed


Search: Where to Start?

• Guess words?
  – Search engine plunges you into the middle of a site/collection
  – Too many or too few results
  – No context

• Use a directory?
  – If large, may be difficult/frustrating to navigate
  – Several ways to organize the information
  – May not reflect users’ needs

• Solution: Integrate Browsing and Search


How Cha-Cha Works

• Crawl entire Intranet
• Compute the shortest hyperlink path from a certain root page to every web page
• Index and compute metadata for the pages
  – Using Cheshire II
  – Run a user query.
  – Gather all the hits
  – Create a “directory” based on combining the shortest paths
  – Special graph algorithm removes redundant links and internal nodes
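
The shortest-path and "directory" steps can be sketched as follows. The lecture does not say how Cha-Cha computes shortest paths, so this sketch assumes breadth-first search, a standard choice for unweighted link graphs. The function names and the toy link graph are hypothetical, and the special graph algorithm that removes redundant links and internal nodes is not reproduced.

```python
from collections import deque

def shortest_paths(links, root):
    """Breadth-first search: map each reachable page to a shortest link path from root."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in paths:      # the first visit in BFS is a shortest path
                paths[target] = paths[page] + [target]
                queue.append(target)
    return paths

def build_directory(paths, hits):
    """Merge the shortest paths of the hit pages into one nested dict (a tree)."""
    tree = {}
    for hit in hits:
        node = tree
        for page in paths.get(hit, []):
            node = node.setdefault(page, {})
    return tree

# Hypothetical toy intranet: page -> pages it links to.
links = {
    "home": ["academics", "people"],
    "academics": ["courses", "people"],
    "courses": ["is202"],
    "people": ["faculty"],
    "faculty": ["is202"],
}
paths = shortest_paths(links, "home")
directory = build_directory(paths, ["is202", "faculty"])
print(paths["is202"])
print(directory)
```

When two paths of equal length exist, as for `is202` here, the sketch keeps whichever one the search reaches first; a real system would need a deliberate tie-breaking rule.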


Cha-Cha System Architecture

(diagram: crawl the web; store the documents)


Cha-Cha System Architecture

(diagram: crawl the web; store the documents; create files of metadata; Cheshire II)


Cha-Cha Metadata

• Information about web pages
  – Title
  – Length
  – Inlinks
  – Outlinks
  – Shortest Paths from a root home page

• Used to provide innovative search interface
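
One way to picture such a metadata record is as a small data structure. The sketch below is an assumption for illustration only: the class `PageMetadata`, its field names, the unit of `length`, and the helper `add_link` are hypothetical and do not describe the format Cha-Cha actually stores.

```python
from dataclasses import dataclass, field

@dataclass
class PageMetadata:
    """Hypothetical record holding the per-page fields listed above."""
    title: str
    length: int                                   # assumed unit: characters in the page
    inlinks: list = field(default_factory=list)   # pages that link to this page
    outlinks: list = field(default_factory=list)  # pages this page links to
    shortest_path: list = field(default_factory=list)  # one path from the root page

def add_link(metadata, source, target):
    """Record one hyperlink as an outlink of source and an inlink of target."""
    metadata[source].outlinks.append(target)
    metadata[target].inlinks.append(source)

# Hypothetical pages for illustration.
metadata = {
    "home": PageMetadata("Home", 1200, shortest_path=["home"]),
    "courses": PageMetadata("Courses", 800, shortest_path=["home", "courses"]),
}
add_link(metadata, "home", "courses")
print(metadata["courses"].inlinks)
```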


Cha-Cha System Architecture

(diagram: crawl the web; store the documents; create files of metadata; Cheshire II)


Cha-Cha System Architecture

(diagram: crawl the web; create a keyword index; store the documents; create files of metadata; Cheshire II)


Creating a Keyword Index

• For each document
  – Tokenize the document
    • Break it up into tokens: words, stems, punctuation
    • There are many variations on this
  – Record which tokens occurred in this document
    • Called an Inverted Index
    • Dictionary: a record of all the tokens in the collection and their overall frequency
    • Postings File: a list recording for each token, which document it occurs in and how often it occurs
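
The dictionary and postings file described above can be sketched in Python. This is a minimal illustration under stated assumptions, not how Cheshire II builds its indexes: the tokenizer is just one of the many possible variations (lowercased alphanumeric runs, no stemming), and the function names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """One simple tokenization choice: lowercase alphanumeric runs, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Return (dictionary, postings) for a {doc_id: text} collection."""
    dictionary = Counter()          # token -> overall frequency in the collection
    postings = defaultdict(dict)    # token -> {doc_id: frequency in that document}
    for doc_id, text in documents.items():
        for token, freq in Counter(tokenize(text)).items():
            dictionary[token] += freq
            postings[token][doc_id] = freq
    return dictionary, postings

# Hypothetical documents for illustration.
documents = {
    "d1": "Information retrieval and information organization",
    "d2": "Retrieval systems",
}
dictionary, postings = build_index(documents)
print(dictionary["information"])
print(postings["retrieval"])
```

The index is called "inverted" because it maps from tokens to documents, whereas the collection itself maps from documents to their tokens.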


Cha-Cha System Architecture

(diagram: Cheshire II; user query)


Responding to the User Query

• User searches on “pam samuelson”

• Search Engine looks up documents indexed with one or both terms in its inverted index

• Search Engine looks up titles and shortest paths in the metadata index

• User Interface combines the information and presents the results as HTML
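
The query steps above can be sketched end to end. Everything here is a simplified assumption: the postings and metadata are hypothetical toy data, the function names are invented for illustration, and hits are ordered by how many query terms they match only because the lecture has not yet said how documents are ranked.

```python
import html

def search(query, postings, metadata):
    """Return (doc_id, title, path) for pages containing one or both query terms."""
    matched = {}                                  # doc_id -> number of query terms found
    for term in query.lower().split():
        for doc_id in postings.get(term, {}):
            matched[doc_id] = matched.get(doc_id, 0) + 1
    # Placeholder ordering: most matching terms first, then doc_id for stability.
    ranked = sorted(matched, key=lambda d: (-matched[d], d))
    return [(d, metadata[d]["title"], metadata[d]["path"]) for d in ranked]

def render_html(results):
    """Combine titles and shortest paths into a simple HTML list."""
    items = []
    for doc_id, title, path in results:
        trail = " > ".join(path)
        items.append(f"<li>{html.escape(title)} ({html.escape(trail)})</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

# Hypothetical inverted index (token -> {doc_id: frequency}) and metadata index.
postings = {
    "pam": {"p1": 2, "p3": 1},
    "samuelson": {"p1": 3, "p2": 1},
}
metadata = {
    "p1": {"title": "Pam Samuelson", "path": ["home", "faculty", "p1"]},
    "p2": {"title": "Samuelson Lecture", "path": ["home", "events", "p2"]},
    "p3": {"title": "Staff: Pam", "path": ["home", "staff", "p3"]},
}
results = search("pam samuelson", postings, metadata)
page = render_html(results)
print(page)
```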


Cha-Cha System Architecture

(diagram: Cheshire II; user query)


Cha-Cha System Architecture

(diagram: Cheshire II; server accesses the databases)


Cha-Cha System Architecture

(diagram: Cheshire II; results shown to user)


Cha-Cha System Architecture

(diagram: Cheshire II; results shown to user; server accesses the databases; user query)


What hasn’t been explained here?

• How documents are ranked

• How queries are formed

• How shortest paths are computed

• How the system is built
  – … among other things!
  – This is just an introduction! Much more later.


Two Main Themes

Information Organization and Design

Information Retrieval and the Search Process


(Approximate) Course Schedule

• Retrieval
  – The Search Process
  – Content Analysis
    • Tokenization, Zipf’s Law, Lexical Associations
  – IR Implementation
  – Term weighting and document ranking
    • Vector space model
    • Probabilistic model
  – User Interfaces
    • Overviews, query specification, providing context, relevance feedback


Overview Example

• Web site design / Information Architecture
  – Incorporates many of the organizational issues we will be covering
  – Example taken from a study of professional designers, by Mark Newman


Information Architecture and Web Site Design

• Information design
  – structure, categories of information
• Navigation design
  – interaction with information structure
• Graphic design
  – visual presentation of information and navigation (color, typography, etc.)


Design Specialties

• Information Architecture
  – includes management and more responsibility for content
• User Interface Design
  – includes testing and evaluation


Web Site Design Process

Implementation

Design

Preliminary Design

Conceptualization

Needs Assessment


Design Process: Preliminary Design

(information/navigation design: schematic)


Design Process: Preliminary Design

(navigation design: storyboard)


Web Site Design Process

• Major design activities are:
  – Deciding on a set of categories that define the information content
  – Deciding how to represent these
  – Deciding on the navigation structure through the categorized content
    • Example: a movie listing website
• There are similarities and differences to:
  – Database design
  – Thesaurus design


(Approximate) Course Schedule

• Organization
  – Overview
  – Metadata and Markup
  – Controlled Vocabularies, Classification, Thesauri
  – Information Design
    • Thesaurus Design
    • Database Design


Assignments and Exams

• Approximately 9 short assignments (due within one week – ten days)
  – Sometimes “checked”, sometimes graded
• One Midterm exam (take-home open book)
• Final exam (during Finals week)
• Grading:
  – Assignments: 40%
    • Not evenly weighted
  – Final: 25%
  – Midterm: 25%
  – Class Participation: 10%


Readings

• Course Reader
  – Will be available in about a week (will announce)
• Textbooks
  – Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
  – The Organization of Information, Arlene G. Taylor, Libraries Unlimited, 1999


Homework (!)

• Read the handout (Borges and Dennett)
• Write one or two paragraphs on
  – What is information, according to your background or area of expertise?

• Due in class this Thursday, Aug 30.


What is Information?

• There is no “correct” definition
• Can involve philosophy, psychology, signal processing, physics
• Cookie Monster’s definition:
  – “news or facts about something”
• Oxford English Dictionary
  – information: informing, telling; thing told, knowledge, items of knowledge, news
  – knowledge: knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known


Next Time

• More on What is Information?
• And how much of it is out there?