8/28/2001 Information Organization and Retrieval
SIMS 202
Information Organization and Retrieval
Prof. Ray Larson & Prof. Warren Sack
UC Berkeley SIMS
Tues/Thurs 10:30-12:00 am
Fall 2001
Lecture authors: Marti Hearst & Ray Larson
Today
• Introductions
• Course Overview
• Administrivia
Goals of the Course
• Learn about:
– Design, development and use of information storage and retrieval systems
– Practical and theoretical foundations of information organization and analysis
– Evaluation of information access systems
– Cognitive and user-centric considerations
– Hands-on experience with information systems
Two Main Themes
Information Organization and Design
Information Retrieval and the Search Process
Web Search Questions
• What do people search for?
• How do people use search engines?
– How often do people find what they are looking for?
– How difficult is it for people to find what they are looking for?
• How can search engines be improved?
What Do People Search for on the Web?
• Study by Spink et al., Oct 98
– www.shef.ac.uk/~is/publications/infres/paper53.html
– Survey on Excite, 13 questions
– Data for 316 surveys
What Do People Search for on the Web?
Topics
• Genealogy/Public Figure: 12%
• Computer related: 12%
• Business: 12%
• Entertainment: 8%
• Medical: 8%
• Politics & Government: 7%
• News: 7%
• Hobbies: 6%
• General info/surfing: 6%
• Science: 6%
• Travel: 5%
• Arts/education/shopping/images: 14%
Something is missing…
What do people search for on the web?
50,000 queries from Excite, 1997. Most frequent terms:
• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes
Why do these differ?
• Self-reporting survey
• The nature of language
– Only a few ways to say certain things
– Many different ways to express most concepts
• UFO, Flying Saucer, Space Ship, Satellite
• How many ways are there to talk about history?
Intranet Queries (Aug 2000)
• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid
Intranet Queries
• Summary of sample data from 3 weeks of UCB queries
– 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
– 6.7% Schedule of classes or final exams (6222)
– 5.4% Summer Session (5041)
– 3.2% Extension (2932)
– 3.1% Academic Calendar (2846)
– 2.4% Directories (2202)
– 1.7% Career Center (1588)
– 1.7% Housing (1583)
– 1.5% Map (1393)
• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page
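Figures like the query counts and the average query length can come from a single pass over a query log. The following is a minimal sketch, assuming a hypothetical log with one query per entry and "+" between words (the form used in the Aug 2000 listing). The sample queries and the function name `summarize_queries` are illustrative only, not the real UCB data or tooling.

```python
from collections import Counter

def summarize_queries(queries):
    """Return (query counts, average query length in words)."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    # Words are separated by '+' in the logged form, e.g. "campus+map".
    total_words = sum(len(q.split("+")) * n for q, n in counts.items())
    avg_len = total_words / total if total else 0.0
    return counts, avg_len

# Hypothetical sample, for illustration only.
sample = ["bearfacts", "telebears", "campus+map", "bearfacts", "schedule+of+classes"]
counts, avg_len = summarize_queries(sample)
for query, n in counts.most_common(3):
    print(n, query)
print("average query length:", round(avg_len, 2))
```

Note that spelling variants such as "bearfacts" and "bear+facts" are counted separately here; grouping them into one category, as the summary does, would need an extra normalization step.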
An Example Search System: Cha-Cha
• A system for searching complex intranets
• Places retrieval results in context
An Example Search System: Cha-Cha
• Important design goals:
– Users at any level of computer expertise
– Browsers at any version level
– Computers of any speed
Search: Where to Start?
• Guess words?
– Search engine plunges you into the middle of a site/collection
– Too many or too few results
– No context
• Use a directory?
– If large, may be difficult/frustrating to navigate
– Several ways to organize the information
– May not reflect users’ needs
• Solution: Integrate Browsing and Search
How Cha-Cha Works
• Crawl entire Intranet
• Compute the shortest hyperlink path from a certain root page to every web page
• Index and compute metadata for the pages
– Using Cheshire II
– Run a user query
– Gather all the hits
– Create a “directory” based on combining the shortest paths
– Special graph algorithm removes redundant links and internal nodes
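The shortest-path and "directory" steps can be sketched as follows. This is an illustration under stated assumptions, not Cha-Cha's actual code: it assumes the crawl yields a dictionary mapping each page to the pages it links to, uses breadth-first search for the shortest hyperlink paths, and merges the paths of the hit pages into a nested dictionary. The names `shortest_paths` and `build_directory` and the sample link graph are hypothetical, and the pruning of redundant links and internal nodes is not shown.

```python
from collections import deque

def shortest_paths(links, root):
    """Breadth-first search: map each reachable page to its shortest path from root."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:      # first visit is along a shortest path
                parent[target] = page
                queue.append(target)
    paths = {}
    for page in parent:
        path, node = [], page
        while node is not None:           # walk back to the root
            path.append(node)
            node = parent[node]
        paths[page] = path[::-1]
    return paths

def build_directory(paths, hits):
    """Merge the shortest paths of the hit pages into one nested dict (a tree)."""
    tree = {}
    for hit in hits:
        node = tree
        for page in paths.get(hit, []):
            node = node.setdefault(page, {})
    return tree

# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["dept", "news"],
    "dept": ["faculty", "courses"],
    "news": ["courses"],
    "faculty": ["pam"],
}
paths = shortest_paths(links, "home")
directory = build_directory(paths, ["pam", "courses"])
print(directory)
```

Hits that share a prefix of their paths end up under the same branch, which is what lets the results appear in the context of the site's structure.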
Cha-Cha System Architecture
(diagram: crawl the web, store the documents)
Cha-Cha System Architecture
(diagram: crawl the web, store the documents, create files of metadata, Cheshire II)
Cha-Cha Metadata
• Information about web pages
– Title
– Length
– Inlinks
– Outlinks
– Shortest Paths from a root home page
• Used to provide innovative search interface
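One way to picture a per-page metadata record is sketched below. This is an assumption-laden illustration, not the format Cha-Cha or Cheshire II actually stores: the class `PageMetadata`, the helper `link_metadata`, and the sample pages are all hypothetical. Inlinks are derived by inverting the outlink lists; the shortest path is left empty here because it would be filled in by a separate path computation.

```python
from dataclasses import dataclass, field

@dataclass
class PageMetadata:
    url: str
    title: str
    length: int                                   # e.g. size of the page in bytes
    inlinks: list = field(default_factory=list)   # pages that link to this page
    outlinks: list = field(default_factory=list)  # pages this page links to
    shortest_path: list = field(default_factory=list)  # filled in separately

def link_metadata(pages, links):
    """pages: url -> (title, length); links: url -> list of target urls."""
    meta = {
        url: PageMetadata(url, title, length, outlinks=list(links.get(url, [])))
        for url, (title, length) in pages.items()
    }
    # Invert the outlinks to obtain each page's inlinks.
    for source, targets in links.items():
        for target in targets:
            if target in meta:
                meta[target].inlinks.append(source)
    return meta

# Hypothetical sample data.
pages = {"home": ("Home", 1200), "dept": ("Department", 800)}
links = {"home": ["dept"], "dept": ["home"]}
meta = link_metadata(pages, links)
print(meta["dept"])
```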
Cha-Cha System Architecture
(diagram: crawl the web, store the documents, create a keyword index, create files of metadata, Cheshire II)
Creating a Keyword Index
• For each document
– Tokenize the document
• Break it up into tokens: words, stems, punctuation
• There are many variations on this
– Record which tokens occurred in this document
• Called an Inverted Index
• Dictionary: a record of all the tokens in the collection and their overall frequency
• Postings File: a list recording for each token, which document it occurs in and how often it occurs
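The dictionary and postings file described above can be sketched in a few lines. This is a minimal illustration, not how Cheshire II builds its indexes: the tokenizer is one simple variation (lowercased runs of letters and digits, with no stemming and punctuation dropped), and the names `tokenize`, `build_index`, and the two sample documents are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """One of many possible tokenizers: lowercased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs: doc_id -> text. Returns (dictionary, postings)."""
    dictionary = Counter()         # token -> frequency in the whole collection
    postings = defaultdict(dict)   # token -> {doc_id: frequency in that document}
    for doc_id, text in docs.items():
        for token, freq in Counter(tokenize(text)).items():
            dictionary[token] += freq
            postings[token][doc_id] = freq
    return dictionary, postings

# Hypothetical two-document collection.
docs = {
    "d1": "Information retrieval and information organization",
    "d2": "Retrieval of web pages",
}
dictionary, postings = build_index(docs)
print(dictionary["information"])
print(postings["retrieval"])
```

Keeping the per-document frequency in the postings is what later allows term weighting and ranking, rather than a plain yes/no match.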
Cha-Cha System Architecture
(diagram: user query, Cheshire II)
Responding to the User Query
• User searches on “pam samuelson”
• Search Engine looks up documents indexed with one or both terms in its inverted index
• Search Engine looks up titles and shortest paths in the metadata index
• User Interface combines the information and presents the results as HTML
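The query steps above can be sketched as follows, assuming a postings structure (term -> document -> frequency) and a metadata table holding each page's title and shortest path. Everything here is hypothetical: the function names, the sample postings and pages, and in particular the ordering of hits by how many query terms they match, which is a placeholder since the lecture has not yet covered ranking.

```python
import html

def search(query, postings, metadata):
    """Return (doc_id, title, path) for documents containing one or both terms."""
    matched = {}
    for term in query.lower().split():
        for doc_id in postings.get(term, {}):
            matched[doc_id] = matched.get(doc_id, 0) + 1
    # Placeholder ordering: more matched terms first, then by document id.
    ranked = sorted(matched, key=lambda d: (-matched[d], d))
    return [(d, metadata[d]["title"], metadata[d]["path"]) for d in ranked]

def render(results):
    """Combine titles and paths into a simple HTML list."""
    rows = []
    for doc_id, title, path in results:
        trail = html.escape(" > ".join(path))
        rows.append("<li>" + trail + ": " + html.escape(title) + "</li>")
    return "<ul>" + "".join(rows) + "</ul>"

# Hypothetical index and metadata.
postings = {"pam": {"p1": 3}, "samuelson": {"p1": 2, "p2": 1}}
metadata = {
    "p1": {"title": "Pam Samuelson", "path": ["Home", "Faculty"]},
    "p2": {"title": "Samuelson Clinic", "path": ["Home", "Centers"]},
}
results = search("pam samuelson", postings, metadata)
page = render(results)
print(page)
```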
Cha-Cha System Architecture
(diagram: server accesses the databases, Cheshire II)
Cha-Cha System Architecture
(diagram: results shown to user, Cheshire II)
Cha-Cha System Architecture
(diagram: user query, server accesses the databases, results shown to user, Cheshire II)
What hasn’t been explained here?
• How documents are ranked
• How queries are formed
• How shortest paths are computed
• How the system is built
– … among other things!
– This is just an introduction! Much more later.
Two Main Themes
Information Organization and Design
Information Retrieval and the Search Process
(Approximate) Course Schedule
• Retrieval
– The Search Process
– Content Analysis
• Tokenization, Zipf’s Law, Lexical Associations
– IR Implementation
– Term weighting and document ranking
• Vector space model
• Probabilistic model
– User Interfaces
• Overviews, query specification, providing context, relevance feedback
Overview Example
• Web site design/Information Architecture
– Incorporates many of the organizational issues we will be covering
– Example taken from a study of professional designers, by Mark Newman
Information Architecture and Web Site Design
• Information design
– structure, categories of information
• Navigation design
– interaction with information structure
• Graphic design
– visual presentation of information and navigation (color, typography, etc.)
Design Specialties
• Information Architecture
– includes management and more responsibility for content
• User Interface Design
– includes testing and evaluation
Web Site Design Process
Implementation
Design
Preliminary Design
Conceptualization
Needs Assessment
Design Process: Preliminary Design
(information/navigation design: schematic)
Design Process: Preliminary Design
(navigation design: storyboard)
Web Site Design Process
• Major design activities are:
– Deciding on a set of categories that define the information content
– Deciding how to represent these
– Deciding on the navigation structure through the categorized content
• Example: a movie listing website
• There are similarities and differences to:
– Database design
– Thesaurus design
(Approximate) Course Schedule
• Organization
– Overview
– Metadata and Markup
– Controlled Vocabularies, Classification, Thesauri
– Information Design
• Thesaurus Design
• Database Design
Assignments and Exams
• Approximately 9 short assignments (due within one week to ten days)
– Sometimes “checked”, sometimes graded
• One Midterm exam (take-home open book)
• Final exam (during Finals week)
• Grading:
– Assignments: 40%
• Not evenly weighted
– Final: 25%
– Midterm: 25%
– Class Participation: 10%
Readings
• Course Reader
– Will be available in about a week (will announce)
• Textbooks
– Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
– The Organization of Information, Arlene G. Taylor, Libraries Unlimited, 1999
Homework (!)
• Read the handout (Borges and Dennett)
• Write one or two paragraphs on
– What is information, according to your background or area of expertise?
• Due in class this Thursday, Aug 30.
What is Information?
• There is no “correct” definition
• Can involve philosophy, psychology, signal processing, physics
• Cookie Monster’s definition:
– “news or facts about something”
• Oxford English Dictionary
– information: informing, telling; thing told, knowledge, items of knowledge, news
– knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known
Next Time
More on What is Information?
And How much of it is out there?