Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Final Exam Review

SIMS 202
Profs. Hearst & Larson

UC Berkeley SIMS
Fall 2000

Final Exam

Monday Dec 11
– 9:30-12:30
– Room 202 and 205

Bring
– Pens/pencils
– Calculator
– Notes/Books (optional)

Final Exam

Topics
– Comprehensive, but
– Emphasis on materials since the midterm

Types of questions
– Similar to those on homeworks and the midterm, but less time-consuming
– Probably a design problem.

Relationships among Language, Concepts, and Categories

Symbols and Language

Abstract concepts are difficult to express in a computer.

Combinations of abstract concepts are even more difficult to express:
– time
– shades of meaning
– social and psychological concepts
– causal relationships

Symbols and Language

As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinels outside unjust halls.

What is the relation between the symbols and the meaning?

Symbols and Language

Language only hints at meaning.

Most meaning of text lies within our minds and common understanding.
– “How much is that doggy in the window?”
» “how much” implies a social system of barter and trade (not the size of the dog)
» “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own
» “in the window” implies behind a store window, not really inside a window, requires notion of window shopping

Lexical Relations

Conceptual relations link concepts

Lexical relations link words

How do they differ? How are they similar?

Major Lexical Relations

Synonymy
Polysemy
Metonymy
Hyponymy/Hyperonymy
Meronymy
Antonymy

Relationships among Meanings

Homonymy: same word, different meanings
– bank (river bank) vs bank (financial institution)

Polysemy: same word, different senses of meaning
– slightly different concepts expressed similarly
– bank (institution vs building)

Synonyms: different words, related senses of meanings
– different ways to express similar concepts
– jail, prison, penitentiary

Defining Category Membership

Necessary and Sufficient Conditions:
– (This used to be a very influential definition of category membership; it is ok for math and logic but out-of-date for human categories)
– Every condition must be met.
– No other conditions can be required.

Can category membership be crisply defined?

What are the necessary and sufficient conditions for something to be a game?

Properties of Categorization

Family Resemblance
– Members of a category may be related to one another without all members having any property in common.
» Instead, they may share a large subset of traits.
» Some attributes are more likely given that others have been seen.
– Example: feathers, wings, twittering, ...
» Likely to be a bird, but not all features apply to “emu”
» Unlikely to see an association with “barks”

Properties of Categorization

Centrality
– Some members of a category may be “better examples” than others.
» Example: robins vs. chickens vs. emus
» Example: soccer vs. gambling vs. hopscotch

Properties of Categorization

Characteristic Features
– Perceived degree of category membership has to do with which features define the category.
– Members usually do not have ALL the necessary features, but have some subset.
– Those members that have more of the central features are seen as more central members.
– People have conceptions of typical members.
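The idea that members with more of the central features count as more central can be sketched as a simple overlap score. This is only an illustration: the feature sets below are hypothetical, and real typicality judgments are not this tidy.

```python
# Hypothetical characteristic features of the category "bird" (illustrative only).
BIRD_FEATURES = {"feathers", "wings", "flies", "sings", "small"}

# Hypothetical feature sets for three category members.
MEMBERS = {
    "robin": {"feathers", "wings", "flies", "sings", "small"},
    "chicken": {"feathers", "wings", "small"},
    "emu": {"feathers", "wings"},
}


def typicality(features, category_features):
    """Fraction of the category's characteristic features that a member has."""
    return len(features & category_features) / len(category_features)


def rank_by_typicality(members, category_features):
    """Return member names ordered from most to least central."""
    return sorted(
        members,
        key=lambda name: typicality(members[name], category_features),
        reverse=True,
    )


ranking = rank_by_typicality(MEMBERS, BIRD_FEATURES)
print(ranking)
```

No member is required to have every feature, which is the contrast with necessary and sufficient conditions.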

Three Psychologically Primary Levels

SUPERORDINATE   animal    furniture
BASIC LEVEL     dog       chair
SUBORDINATE     terrier   rocker

Children take longer to learn superordinate

Superordinate not associated with mental images or motor actions

How related to
– Hyponymy
– Hyperonymy

Characteristics of Basic-level Categories

Language
– People name things more readily at basic level.
– Name learned earliest in childhood.
– Languages have simpler names at basic level.
– Sounds like the “real name”.
– Name used more frequently.
» Strange to call a dime a coin, a metal object
– Names used in neutral context.
» There’s a dog on the porch.
» There’s a terrier on the porch.

Characteristics of Basic-level Categories

Concepts
– Things perceived more holistically at the basic level (rather than by parts).
– People interact with basic and more specific levels similarly.
– Things are remembered more readily at basic level.
– Folk biological categories correspond accurately to scientific biological categories only at the basic level.

Metadata

Metadata Topics

What is metadata?
Controlled vocabularies / indexing languages
Metadata standards
– Dublin Core
– XML
– etc
Thesaurus creation and use
Classification structure
– Descriptors vs subject headings
– Hierarchies vs facets

Metadata

Metadata is:
– “data about data” (term usage database systems)
– Information about Information
– Structures and Languages for the Description of Information Resources and their elements (components or features)
– “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)

Types of Metadata systems and standards

Naming and ID systems – URLs, ISBNs
Bibliographic description – MARC, Dublin Core, TEI, etc.
Music – SMDL
Images and objects – CIMI, VRA Core Categories
Numeric Data – DDI, SDSM
Geospatial Data – FGDC
Collections – EAD

Types of Indexing Languages

Uncontrolled Keyword Indexing
Indexing Languages
– Controlled, but not structured
Thesauri
– Controlled and Structured
Classification Systems
– Controlled, Structured, and Coded
Faceted Classification Systems

Controlled Vocabularies

Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.

What is a “Controlled Vocabulary”

“The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

Similarly, there are too many ways of expressing or explaining the topic of a document.

Controlled vocabularies are sets of rules for topic identification and indexing, together with a THESAURUS, which consists of a “lead-in vocabulary” and a limited and selective “Indexing Language”, sometimes with special coding or structures.

Uses of Controlled Vocabularies

Library Subject Headings, Classification and Authority Files.

Commercial Journal Indexing Services and databases

Yahoo, and other Web classification schemes

Online and Manual Systems within organizations
– SunSolve
– MacArthur

Indexing Languages

An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.

An Indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.

The Indexing Process

Concept identification
Term selection (via thesaurus)
Term assignment
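The three steps can be sketched in a few lines. This is a minimal illustration, not a real indexing system: the lead-in vocabulary, the function names, and the word-matching shortcut for concept identification are all assumptions made for the example.

```python
# Hypothetical lead-in vocabulary: entry terms mapped to the preferred indexing term.
LEAD_IN = {
    "jail": "prison",
    "penitentiary": "prison",
    "prison": "prison",
    "convict": "prisoner",
    "inmate": "prisoner",
    "prisoner": "prisoner",
}


def identify_concepts(text):
    """Step 1 (crude stand-in): treat each known word in the text as a concept."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [w for w in words if w in LEAD_IN]


def select_terms(concepts):
    """Step 2: map each concept to its preferred term via the thesaurus."""
    return [LEAD_IN[c] for c in concepts]


def assign_terms(document_text):
    """Step 3: attach the sorted set of preferred terms to the document."""
    return sorted(set(select_terms(identify_concepts(document_text))))


terms = assign_terms("The inmate was moved from the county jail to a state penitentiary.")
print(terms)
```

Whichever synonym the document uses, it ends up indexed under the same preferred term, which is the point of vocabulary control.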

Application: The Indexing Process (Manual)

[Flowchart. Box labels, with decision boxes branching on YES/NO:]
– Start
– Examine Document and Identify Significant Concepts
– Consider First Concept
– Does Thesaurus contain term for Concept?
– Preferred Term?
– Select Preferred Term
– Consider Preferred Term
– Is Term suitable?
– Consider any associated terms in Thesaurus (NT, BT)
– Would Concept be better represented by one of these terms?
– Prefer Alternative Term(s)
– Select Alternative term to represent Concept
– Can Concept be expressed combining terms?
– Consider Each of These Terms
– Establish Term Denoting Concept
– Admit New Term Into Thesaurus
– Is There Another Concept?
– Assign Terms to Document
– End

Adapted from ISO 5963, p.5

Metadata Standards

The problem

Proliferation of the forms of names
– Different names for the same person
– Different people with the same names

Bibliographic Description

MARC (Machine Readable Cataloging)
DUBLIN CORE
– Warwick Framework for Dublin Core Metadata
GILS (Government Information Locator Service)
RFC 1807 (Format for Bibliographic Records)
RDF (Resource Description Framework)

Images and Objects

Categories for the Description of Works of Art (Getty Art Institute)

Consortium for the Computer Interchange of Museum Information (CIMI)

RLG REACH Element Set (for Shared Description of Museum Objects)

VRA Core Categories (Visual Resources Association)

Collection Level Descriptors

EAD (Encoded Archival Description)

Z39.50 Profile for Access to Digital Collections

RSLP Collection Description (Research Support Libraries Programme)

Dublin Core

Simple metadata for describing internet resources.

For “Document-Like Objects”

15 Elements.

Dublin Core Elements

Title
Creator
Subject
Description
Publisher
Other Contributors
Date
Resource Type
Format
Resource Identifier
Source
Language
Relation
Coverage
Rights Management
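As a rough illustration of how a few of these elements might be filled in, the sketch below describes the Wynar book used as the running example later in this review and renders it as HTML meta tags. The dictionary layout, the function name, the "DC." name prefix, and shortening "Other Contributors" to "Contributor" are assumptions for the example, not a statement of the official encoding.

```python
from html import escape

# Values taken from the bibliographic record used elsewhere in this review.
# Element choice and naming are illustrative assumptions.
record = {
    "Title": "Introduction to cataloging and classification",
    "Creator": "Wynar, Bohdan S.",
    "Contributor": "Taylor, Arlene G.",
    "Publisher": "Libraries Unlimited",
    "Date": "1992",
    "Language": "eng",
}


def to_meta_tags(rec):
    """Render each element as one HTML <meta> tag (assumed 'DC.' prefix)."""
    return [
        f'<meta name="DC.{escape(name)}" content="{escape(value, quote=True)}">'
        for name, value in rec.items()
    ]


tags = to_meta_tags(record)
for tag in tags:
    print(tag)
```

Every element is optional and repeatable in spirit, so a real record could carry several Subject or Creator values; a flat dictionary is the simplest possible stand-in.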

Source

Label: SOURCE

The work, either print or electronic, from which this resource is derived, if applicable. For example, an HTML encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.

The Same Item in Different Metadata Systems

ISBD
Dublin Core
RFC 1807
TEI Header
MARC Record

ISBD Punctuation

Title Proper (GMD) = Parallel title : other title info / First statement of responsibility ; others. -- Edition information. -- Material. -- Place of Publication : Publisher Name, Date. -- Material designation and extent ; Dimensions of item. -- (Title of Series / Statement of responsibility). -- Notes. -- Standard numbers: terms of availability (qualifications).

Bibliographic Record

Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992. -- (Library science text series).

MARC Record (display)

ID:DCLC9124851-B RTYP:c ST:p FRN: MS:c EL: AD:06-20-91
CC:9110 BLT:am DCF:a CSC: MOD: SNR: ATC: UD:04-11-92
CP:cou L:eng INT: GPC: BIO: FIC:0 CON:b
PC:s PD:1992/ REP: CPI:0 FSI:0 ILC:a II:1
MMD: OR: POL: DM: RR: COL: EML: GEN: BSE:
010     9124851
020     0872878112 (cloth)
020     0872879674 (paper)
040     DLC$cDLC$dDLC
050 00  Z693$b.W94 1991
082 00  025.3$220
100 1   Wynar, Bohdan S.
245 10  Introduction to cataloging and classification /$cBohdan S. Wynar.
250     8th ed. /$bArlene G. Taylor.
260     Englewood, Colo. :$bLibraries Unlimited,$c1992.
300     xvii, 633 p. :$bill. ;$c24 cm.
440  0  Library science text series
504     Includes bibliographical references (p. 591-599) and index.
650  0  Cataloging.
650  0  Subject cataloging.
650  0  Classification$xBooks.
630 00  Anglo-American cataloguing rules.
700 10  Taylor, Arlene G.,$d1941-
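To make the subfield notation in the display above concrete, here is a small sketch that splits the data portion of one field into subfields. It assumes, as this particular display appears to do, that "$" followed by a one-character code starts a subfield and that a leading "$a" is left implicit; the function name is hypothetical and this is not a full MARC parser.

```python
def split_subfields(data, delimiter="$"):
    """Split the data portion of a displayed field into (code, value) pairs.

    Assumes an omitted leading subfield code means subfield "a".
    """
    parts = data.split(delimiter)
    subfields = []
    if parts[0]:
        # Text before the first delimiter is the implicit subfield a.
        subfields.append(("a", parts[0]))
    for part in parts[1:]:
        if part:
            # First character is the subfield code, the rest is its value.
            subfields.append((part[0], part[1:]))
    return subfields


imprint = split_subfields("Englewood, Colo. :$bLibraries Unlimited,$c1992.")
print(imprint)
```

For the 260 field this separates place, publisher, and date, which is what lets a system index or display each part on its own.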

Conditions of Authorship?

Single person or single corporate entity
Unknown or anonymous authors
– Fictitiously ascribed works
Shared responsibility
Collections or editorially assembled works
Works of mixed responsibility (e.g. translations)
Related Works

Name Authority Files

ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040     DLC$cDLC$dDLC$dOCoLC
053     PR6005.R517
100 10  Creasey, John
400 10  Cooke, M. E.
400 10  Cooke, Margaret,$d1908-1973
400 10  Cooper, Henry St. John,$d1908-1973
400 00  Credo,$d1908-1973
400 10  Fecamps, Elise
400 10  Gill, Patrick,$d1908-1973
400 10  Hope, Brian,$d1908-1973
400 10  Hughes, Colin,$d1908-1973
400 10  Marsden, James
400 10  Matheson, Rodney
400 10  Ranger, Ken
400 20  St. John, Henry,$d1908-1973
400 10  Wilde, Jimmy
500 10  $wnnnc$aAshe, Gordon,$d1908-1973

Different names for the same person

Name Authority Files

ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040     OCoLC$cOCoLC
100 10  Marric, J. J.,$d1908-1973
500 10  $wnnnc$aCreasey, John
663     Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670     OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670     LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670     Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)

Name authority files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040     DLC$cDLC$dDLC$dOCoLC
100 10  Butler, William Vivian,$d1927-
400 10  Butler, W. V.$q(William Vivian),$d1927-
400 10  Marric, J. J.,$d1927-
670     His The durable desperadoes, 1973.
670     His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670     His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

Different people writing with the same name
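The three authority records can be read as a lookup structure: the 100 field is the established heading, 400 fields are "see from" variants, and 500 fields are "see also" links. The sketch below uses a subset of the names from those records; the dictionary layout and function name are assumptions for illustration only.

```python
# A subset of the data in the three authority records; layout is an assumption.
AUTHORITY = {
    "Creasey, John": {
        "see_from": ["Cooke, M. E.", "Fecamps, Elise", "Marsden, James"],
        "see_also": ["Ashe, Gordon, 1908-1973"],
    },
    "Marric, J. J., 1908-1973": {
        "see_from": [],
        "see_also": ["Creasey, John"],
    },
    "Butler, William Vivian, 1927-": {
        "see_from": ["Butler, W. V. (William Vivian), 1927-", "Marric, J. J., 1927-"],
        "see_also": [],
    },
}


def find_headings(name):
    """Return every established heading whose own form or variant starts with name."""
    hits = []
    for heading, refs in AUTHORITY.items():
        candidates = [heading] + refs["see_from"]
        if any(candidate.startswith(name) for candidate in candidates):
            hits.append(heading)
    return hits


print(find_headings("Cooke, M. E."))   # one person, many names
print(find_headings("Marric, J. J."))  # one name, two people
```

The first lookup resolves a pseudonym to a single heading; the second returns two headings, and only the dates distinguish the two writers who used the name.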

Other Types of Controlled Vocabularies

Gazetteers (Geographic Names)
Code lists (e.g. LC Language Codes)
Subject Heading Lists
Classification Schemes
Thesauri

What is SGML/XML?

A. SGML stands for Standard Generalized Markup Language
– XML stands for Extensible Markup Language
B. What it is NOT:
– Not a visual document description
– Not an application specific markup
– Not proprietary

What is SGML/XML?

What it is:
– An international standard (SGML: ISO 8879:1986)
– A generic language for describing the structure of documents, and markup that can be used for those documents
– Intended for generating markup for content rather than form elements

XML is a simplified subset of SGML (being established by W3C)

XML

Extensible Markup Language
– a simplification of SGML, the Standard Generalized Markup Language
– instead of a fixed set of format-oriented tags like HTML, XML allows you to create the schema (whatever set of tags is needed) for your information type or application
– this makes any XML instance “self-describing” and easily understood by computers and people

Version 1.0 ratified by W3C in 2/98; backed by Microsoft, Sun, Netscape, many others

Source Dr. Robert J Glushko

HTML Airline Schedule Seen “By Computer”

<Title>Airline Schedule</Title>
<Body>
<H2>Flight Information</H2>
<H3>United Airlines #200</H3>
<UL>
<LI>San Francisco
<LI>9:30 AM
<LI>Honolulu
<LI>12:30 PM
<LI>$368.50
</UL>
</Body>

Source Dr. Robert J Glushko

Airline Schedule in XML

<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>

Source Dr. Robert J Glushko
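What “self-describing” buys a program can be shown with a short sketch that reads the airline schedule using Python's standard `xml.etree.ElementTree` module. The variable names are illustrative; the markup is the slide's example with straight quotes.

```python
import xml.etree.ElementTree as ET

SCHEDULE_XML = """<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>"""

root = ET.fromstring(SCHEDULE_XML)
segment = root.find("Segment")

# Each child element names its own content, so no positional guessing is needed.
flight = {child.tag: child.text for child in segment}
price = segment.find("Price")

print(flight["Origin"], "to", flight["Destination"])
print(price.get("Currency"), float(price.text))
```

With the HTML version, a program would have to guess that the first list item is the origin and the fifth is the price; here the tag names carry that information.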

SGML/XML Structure

An SGML document consists of three parts:
– The SGML Declaration
– The Document Type Definition (DTD)
– The Document Instance

An XML document requires only the document instance, but for effective processing a DTD is important.

Document Type Definitions

The DTD describes the structural elements and "shorthand" markup for a particular document type. It defines:

– Names of "legal" elements
– How many times elements can appear
– The order of elements in a document
– Whether markup can be omitted (SGML only)
– Contents of elements (i.e., nested structures)
– Attributes associated with elements
– Names of "entities"
– Short-hand conventions for element tags (SGML only)

DTD Components

The major components of a DTD are:
– Entity Declarations
– Element Declarations
– Attribute Declarations

Thesauri

A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
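One way to picture such a collection is as a table of descriptors, each carrying its links. The sketch below uses the common abbreviations UF (used for), BT (broader term), NT (narrower term), and RT (related term). The specific links are hypothetical and are not taken from any published thesaurus.

```python
# Hypothetical thesaurus fragment; the relationships are illustrative only.
THESAURUS = {
    "Athletics": {
        "UF": ["Sports"],
        "BT": [],
        "NT": ["Athletic Equipment", "Athletic Fields"],
        "RT": ["Athletes"],
    },
    "Athletic Equipment": {"UF": [], "BT": ["Athletics"], "NT": [], "RT": []},
    "Athletic Fields": {"UF": ["Playing Fields"], "BT": ["Athletics"], "NT": [], "RT": []},
    "Athletes": {"UF": [], "BT": [], "NT": [], "RT": ["Athletics"]},
}


def preferred(term):
    """Lead from any entry term to its descriptor, or None if unknown."""
    if term in THESAURUS:
        return term
    for descriptor, links in THESAURUS.items():
        if term in links["UF"]:
            return descriptor
    return None


def narrower_closure(descriptor):
    """Collect all narrower terms beneath a descriptor, depth first."""
    result = []
    for narrower in THESAURUS[descriptor]["NT"]:
        result.append(narrower)
        result.extend(narrower_closure(narrower))
    return result


print(preferred("Playing Fields"))
print(narrower_closure("Athletics"))
```

The UF links give vocabulary control, and the NT links let a search on a broad descriptor be widened to everything beneath it.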

Thesauri (cont.)

Examples:
– The ERIC Thesaurus of Descriptors
– The Art and Architecture Thesaurus
– The Medical Subject Headings (MESH) of the National Library of Medicine

Why develop a thesaurus?

To provide a conceptual structure or “space” for a body of information
– To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
– To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material).

Why develop a thesaurus?

To provide vocabulary (or terminological) control.
– When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with.

Preliminary considerations

What is used now?
– Continue using an existing thesaurus?
– Ad hoc modification of existing thesaurus?
– Develop a new well-structured thesaurus?

What is the scope and complexity of the subject field?

What kind of retrieval objects or data will be dealt with?

How exhaustive and specific is the desired description of objects?

Preliminary Considerations

The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus.
– It is better to plan for a larger and more comprehensive system than a smaller system that rapidly will become inadequate as the database grows.

Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists.

Development of a Thesaurus

1. Term Selection.
2. Merging and Development of Concept Classes.
3. Definition of Broad Subject Fields and Subfields.
4. Development of Classificatory structure.
5. Review, Testing, Application, Revision.

Flow of Work in Thesaurus Construction

[Flowchart, based on Soergel, pp 327-333. Box labels, with decision boxes branching on Yes/No:]
– Select Sources
– Assign codes
– Select Terms
– Record Selected Terms
– Sort Terms
– Merge identical Terms
– Define Broad Subject Fields
– Merge Terms in Same Concept class
– Sort Terms into Broad Subject Fields
– Define Subfields within one Subject Field
– Work out detailed structure of the Subject Field
– Select Preferred Terms
– All Subfields of Broad Subject finished?
– All Broad Subjects finished?
– Improve Class Structure
– Print Classified Index and review
– Discuss with Experts and Users
– Select descriptors and checklist items
– Produce Full Thesaurus and Check references
– Assign Notation
– Review and Test
– Many Modifications?
– Revise as needed

2. Merging and Development of Concept Classes

Sort Term DB into alphabetical order.

First Round: Merge information for Identical terms -- possibly pulling info from additional sources.

Second Round: Merge synonyms or terms in the same concept class.
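The two merging rounds can be sketched as follows. The working term database, the source labels, and the concept-class assignments are all hypothetical; in practice the second round is an editorial judgment rather than a table lookup.

```python
from collections import defaultdict

# Hypothetical working term database: (term, source) pairs from term selection.
TERM_DB = [
    ("Jail", "source A"),
    ("prison", "source A"),
    ("jail", "source B"),
    ("Penitentiary", "source C"),
    ("prison", "source C"),
]

# Hypothetical editorial decision: which terms fall in the same concept class.
CONCEPT_CLASS = {"jail": "prison", "penitentiary": "prison", "prison": "prison"}


def merge_identical(term_db):
    """First round: sort alphabetically and pool sources for identical terms."""
    merged = defaultdict(set)
    for term, source in sorted(term_db, key=lambda pair: pair[0].lower()):
        merged[term.lower()].add(source)
    return dict(merged)


def merge_concept_classes(merged):
    """Second round: pool terms and sources that share a concept class."""
    classes = defaultdict(lambda: {"terms": set(), "sources": set()})
    for term, sources in merged.items():
        key = CONCEPT_CLASS.get(term, term)
        classes[key]["terms"].add(term)
        classes[key]["sources"] |= sources
    return dict(classes)


first_round = merge_identical(TERM_DB)
second_round = merge_concept_classes(first_round)
print(first_round)
print(second_round)
```

Five raw entries collapse to three distinct terms after the first round and to a single concept class after the second.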

3. Definition of Broad Subject Fields and Subfields

Define Broad Subject fields and sort terms into these broad fields

Define subfields within each broad field and sort terms into these subfields.

Work out the detailed structure
– Select Preferred Terms
– Merge information for terms in the same concept class

Repeat these steps
– for each subfield within a broad field
– and for each broad field
– Until all terms have been consolidated and preferred terms selected

4. Development of Classificatory Structure

Produce preliminary version of classified index and update the working database.

Improve classificatory structure

Reality check: produce and distribute a version of the classified index. Distribute to users/experts.

5. Final Stages

Review
Testing
Application
Revision

Thesaurus Revision and Updates

There will always be new concepts, products, or expressions that need to be added to the thesaurus.
– Set a regular schedule of reviews and revisions.
– Collect complaints, problems, etc. and fold into revision of the thesaurus

Hierarchical vs. Faceted (Subject Heading vs. Descriptor) Category Systems

Assigning Headings vs. Descriptors

Subject headings
– assign one (or a few) complex heading(s) to the document

Descriptors
– Mix and match

How would we describe recipes using each technique?

Subject Heading vs. Descriptor

WILSONLINE
– Athletes
– Athletes -- Health & Hygiene
– Athletes -- Nutrition
– Athletes -- Physical Exams
– …
– Athletics
– Athletics -- Administration
– Athletics -- Equipment -- Catalogs
– …
– Sports -- Accidents and injuries
– Sports -- Accidents and injuries -- prevention

ERIC
– Athletes
– Athletic Coaches
– Athletic Equipment
– Athletic Fields
– Athletics
– …
– Sports psychology
– Sportsmanship

Subject Headings vs. Descriptors

Subject Headings
– Describe the contents of an entire document
– Designed to be looked up in an alphabetical index
» Look up document under its heading
– Few (1-5) headings per document

Descriptors
– Describe one concept within a document
– Designed to be used in Boolean searching
» Combine to describe the desired document
– Many (5-25) descriptors per document
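The contrast can be sketched with the recipe question raised earlier in this review. The recipes, headings, and descriptors below are hypothetical, as are the function names; the point is only the difference between looking up one precoordinated heading and combining descriptors at search time.

```python
# Hypothetical recipe collection, indexed two ways.
HEADINGS = {
    "Desserts -- Chocolate -- Baked": ["Brownies"],
    "Desserts -- Fruit -- Chilled": ["Lemon sorbet"],
    "Main dishes -- Chicken -- Baked": ["Roast chicken"],
}

DESCRIPTORS = {
    "Brownies": {"dessert", "chocolate", "baked"},
    "Lemon sorbet": {"dessert", "fruit", "chilled"},
    "Roast chicken": {"main dish", "chicken", "baked"},
}


def lookup_heading(heading):
    """Subject-heading access: find documents filed under one complex heading."""
    return HEADINGS.get(heading, [])


def boolean_and(*wanted):
    """Descriptor access: find documents carrying every requested descriptor."""
    return sorted(
        name for name, assigned in DESCRIPTORS.items() if set(wanted) <= assigned
    )


print(lookup_heading("Desserts -- Fruit -- Chilled"))
print(boolean_and("baked"))
print(boolean_and("baked", "dessert"))
```

Finding everything baked is a one-term query with descriptors, whereas with headings it would mean scanning every heading that ends in "Baked".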

Hierarchical Classification

– Each category is successively broken down into smaller and smaller subdivisions
– No item occurs in more than one subdivision
– Each level divided out by a “character of division”. Also known as a feature.
» Example: distinguish Literature based on:
Language
Genre
Time Period

Hierarchical Classification

[Tree diagram: Literature divides by language into English, French, Spanish, ...; each language divides by genre into Prose, Poetry, Drama, ...; each genre divides by period into 16th, 17th, 18th, 19th, ...]

Labeled Categories for Hierarchical Classification

LITERATURE
– 100 English Literature
» 110 English Prose
English Prose 16th Century
English Prose 17th Century
English Prose 18th Century
...
» 111 English Poetry
121 English Poetry 16th Century
122 English Poetry 17th Century
...
» 112 English Drama
130 English Drama 16th Century
…
– 200 French Literature

Faceted Classification

Create a separate, free-standing list for each characteristic of division (feature).

Combine features to create a classification.

Faceted Classification along with Labeled Categories

A Language
– a English
– b French
– c Spanish

B Genre
– a Prose
– b Poetry
– c Drama

C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century

Aa English Literature
AaBa English Prose
AaBaCa English Prose 16th Century
AbBbCd French Poetry 19th Century
BcCd Drama 19th Century
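Combining facets into a notation can be sketched directly from the three lists. The facet table is the slide's; the `decode` and `encode` function names and the two-character-per-facet parsing are assumptions for the example.

```python
# Facet letter -> (facet name, {value letter: label}), as listed on the slide.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Turn a notation such as 'AaBaCa' into its labels, two characters per facet."""
    if len(notation) % 2:
        raise ValueError("notation must be pairs of facet letter and value letter")
    labels = []
    for i in range(0, len(notation), 2):
        facet, value = notation[i], notation[i + 1]
        labels.append(FACETS[facet][1][value])
    return " ".join(labels)


def encode(**choices):
    """Build a notation from keyword choices, e.g. encode(genre='Drama')."""
    code = ""
    for letter, (name, values) in FACETS.items():
        wanted = choices.get(name.lower())
        if wanted is not None:
            by_label = {label: key for key, label in values.items()}
            code += letter + by_label[wanted]
    return code


print(decode("AaBaCa"))
print(decode("AbBbCd"))
print(encode(genre="Drama", period="19th Century"))
```

Any facet can be left out, so "Drama 19th Century" in any language needs no new category, which a strict hierarchy rooted in language cannot express.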

Important Question: How to use both types of classification structures?

How to look through them? How to use them in search?

Design of Information Architecture

Web Site Design Issues

Iteration is the Key to UI Design

[Diagram: cycle of Design, Prototype, Evaluate]

Iteration earlier in the design process is more cost-effective

Design Process: Discovery

[Diagram of stages: Discovery, Conceptualization, Preliminary Design, Design, Implementation]

Assess needs
– understand client’s expectations
– determine scope of project
– characteristics of users

Slide by Mark Newman

Design Process: Conceptualization

[Diagram of stages: Discovery, Conceptualization, Preliminary Design, Design, Implementation]

Begin defining site
– Take results from discovery and visualize solutions
– Early information design

Slide by Mark Newman

Design Process: Preliminary Design

[Diagram of stages: Discovery, Conceptualization, Preliminary Design, Design, Implementation]

Generate multiple (3-5) designs
– one will be selected for development
– navigation design
– early graphic design

Slide by Mark Newman

Design Process: Preliminary Design

Activities
– Sketching designs
– Creating mock-ups
– Quick and rough

Deliverables
– Schematics (a.k.a. templates)
– Site maps
– Mock-ups
– Presentations

Slide by Mark Newman

Design Process: Design

[Diagram of stages: Discovery, Conceptualization, Preliminary Design, Design, Implementation]

Iteration

[Diagram: cycle of Design, Prototype, Evaluate]

• Iteration at the level of development process
• And within design stage

Slide by Mark Newman

Design Process: Implementation

[Diagram of stages: Discovery, Conceptualization, Preliminary Design, Design, Implementation]

Prepare design for handoff
– Create final deliverable
– Specifications and prototypes
– As much detail as possible

Why Do We Prototype?

Get feedback on our design faster
– saves money
Experiment with alternative designs
Fix problems before code is written
Keep the design centered on the user

Slide by James Landay

Fidelity in Prototyping

Fidelity refers to the level of detail
High fidelity?
– prototypes look like the final product
Low fidelity?
– artists’ renditions with many details missing

Slide by James Landay

Low-fidelity Sketches

Page 90: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Slide by James Landay

Low-fidelity Sketches

Page 91: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Database Systems

Page 92: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Terms and Concepts

Database:

– A collection of similar records with relationships between the records. (Rowley)

– A Database is a collection of stored operational data used by the application systems of some particular enterprise. (C.J. Date)

Page 93: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

DBMS Benefits

Minimal Data Redundancy

Consistency of Data

Integration of Data

Sharing of Data

Ease of Application Development

Uniform Security, Privacy, and Integrity Controls

Data Accessibility and Responsiveness

Data Independence

Reduced Program Maintenance

Page 94: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Database Components

[Diagram: components of a database system, from Kroenke, Database Processing]

User Interface

Applications

Application Programs

DBMS

– Design tools: Table Creation, Form Creation, Query Creation, Report Creation, Procedural language compiler (4GL)

– Run time: Form processor, Query processor, Report Writer, Language Run time

Database contains:

– User’s Data

– Metadata

– Indexes

– Application Metadata

Page 95: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Terms and Concepts

Records

– The set of values for all attributes of a particular entity

– AKA “tuples” or “rows” in relational DBMS

File

– Collection of records

– Usually a physical file on OS

– May also be a “logical file” like a “Relation” or “Table” in relational DBMS

Page 96: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Terms and Concepts

Key

– an attribute or set of attributes used to identify or locate records in a file

Primary Key

– an attribute or set of attributes that uniquely identifies each record in a file
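
The difference between a key and a primary key can be sketched with Python's built-in sqlite3 module. The `employee` table and its columns are made up for illustration: `last_name` can be used to locate records but may repeat, while `ssn` is declared as the primary key and must be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ssn is the primary key: it uniquely identifies each record.
conn.execute("CREATE TABLE employee (ssn TEXT PRIMARY KEY, last_name TEXT)")
conn.execute("INSERT INTO employee VALUES ('111-11-1111', 'Smith')")
# A second Smith is fine: last_name locates records but is not unique.
conn.execute("INSERT INTO employee VALUES ('222-22-2222', 'Smith')")

try:
    # Reusing an existing ssn violates the primary key.
    conn.execute("INSERT INTO employee VALUES ('111-11-1111', 'Jones')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

smiths = conn.execute(
    "SELECT ssn FROM employee WHERE last_name = 'Smith' ORDER BY ssn"
).fetchall()
```

Here `duplicate_rejected` ends up true, and the lookup by `last_name` returns both Smith records.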

Page 97: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Terms and Concepts

Data Independence

– Physical representation and location of data and the use of that data are separated

» The application doesn’t need to know how or where the database has stored the data, but just how to ask for it.

» Moving a database from one DBMS to another should not have a material effect on application programs

» Recoding, adding fields, etc. in the database should not affect applications

Page 98: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Terms and Concepts

Metadata

– Data about data

» In DBMS means all of the characteristics describing the attributes of an entity, e.g.: name of attribute; data type of attribute; size of the attribute; format or special characteristics

– Characteristics of files or relations

» name, content, notes, etc.
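
As a small illustration of metadata, the sketch below asks SQLite (through Python's sqlite3 module) to describe a table rather than return its contents. The `employee` table is a made-up example; `PRAGMA table_info` and the `sqlite_master` catalog are SQLite-specific, and other DBMSs expose the same kind of information differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "ssn TEXT PRIMARY KEY, last_name TEXT NOT NULL, birthdate TEXT)"
)

# Each row describes one attribute: (cid, name, type, notnull, default, pk).
columns = conn.execute("PRAGMA table_info(employee)").fetchall()
metadata = [
    (name, col_type, bool(pk))
    for _cid, name, col_type, _notnull, _default, pk in columns
]

# The catalog also records which relations exist.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
```

No employee data has been stored at this point; everything in `metadata` and `tables` is data about data.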

Page 99: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Design

Determination of the needs of the organization

Development of the Conceptual Model of the database

– Typically using Entity-Relationship diagramming techniques

Construction of a Data Dictionary

Development of the Logical Model

Page 100: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Entity

An Entity is an object in the real world (or even imaginary worlds) about which we want or need to maintain information

– Persons (e.g.: customers in a business, employees, authors)

– Things (e.g.: purchase orders, meetings, parts, companies)

[Diagram: entity box labeled Employee]

Page 101: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Attributes

Attributes are the significant properties or characteristics of an entity that help identify it and provide the information needed to interact with it or use it. (This is the Metadata for the entities.)

[Diagram: Employee entity with attributes Name (First, Middle, Last), SSN, Age, Birthdate, Projects]

Page 102: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Relationships

Relationships are the associations between entities. They can involve one or more entities and belong to particular relationship types

Page 103: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Relationships

[Diagram: Student Attends Class]

[Diagram: Supplier Supplies project parts, a relationship among Supplier, Part, and Project]

Page 104: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Mapping to a Relational Model

Each entity in the ER Diagram becomes a relation.

A properly normalized ER diagram will indicate where intersection relations for many-to-many mappings are needed.

Relationships are indicated by common columns (or domains) in tables that are related.

We will examine the tables for the Acme Widget Company derived from the ER diagram
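
The mapping rules above can be sketched with the Student Attends Class relationship, using Python's sqlite3 module. Table names, column names, and sample rows are invented for illustration and are not the Acme Widget Company tables. Each entity becomes a relation, and the many-to-many relationship becomes an intersection relation (`attends`) that shares key columns with both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE class (class_id INTEGER PRIMARY KEY, title TEXT);
-- Intersection relation for the many-to-many Attends relationship.
CREATE TABLE attends (
    student_id INTEGER REFERENCES student(student_id),
    class_id INTEGER REFERENCES class(class_id),
    PRIMARY KEY (student_id, class_id)
);
INSERT INTO student VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO class VALUES (10, 'SIMS 202'), (20, 'Databases');
INSERT INTO attends VALUES (1, 10), (1, 20), (2, 10);
""")

# The relationship is recovered by matching the common columns.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM student s
    JOIN attends a ON a.student_id = s.student_id
    JOIN class c ON c.class_id = a.class_id
    ORDER BY s.name, c.title
""").fetchall()
```

A student can appear in many `attends` rows and so can a class, which is what a single shared column between `student` and `class` could not express.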

Page 105: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Normalization

Normalization theory is based on the observation that relations with certain properties are more effective in inserting, updating and deleting data than other sets of relations containing the same data

Normalization is a multi-step process beginning with an “unnormalized” relation

– Hospital example from Atre, S. Data Base: Structured Techniques for Design, Performance, and Management.

Page 106: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Normalization

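[Diagram: successive normal forms and the condition each one adds]

– First normal form: functional dependency of nonkey attributes on the primary key; atomic values only

– Second normal form: full functional dependency of nonkey attributes on the primary key

– Third normal form: no transitive dependency between nonkey attributes

– Boyce-Codd and higher: all determinants are candidate keys (Boyce-Codd); single multivalued dependency (fourth normal form)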
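
To make "functional dependency" concrete, here is a minimal Python sketch that checks whether a dependency is consistent with a set of sample rows. The `holds_fd` helper and the `assignments` data are invented for illustration. Sample data can only contradict a dependency, not prove it holds in general.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Remember the first rhs value seen for this lhs value.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical relation with primary key (emp, project).
assignments = [
    {"emp": "E1", "project": "P1", "emp_name": "Ana", "hours": 10},
    {"emp": "E1", "project": "P2", "emp_name": "Ana", "hours": 5},
    {"emp": "E2", "project": "P1", "emp_name": "Ben", "hours": 8},
]

key_determines_hours = holds_fd(assignments, ["emp", "project"], ["hours"])
emp_determines_name = holds_fd(assignments, ["emp"], ["emp_name"])
emp_determines_hours = holds_fd(assignments, ["emp"], ["hours"])
```

In this sample, `emp_name` depends on `emp` alone, which is only part of the key. That is not a full functional dependency, so a second normal form design would move `emp_name` into a separate employee relation.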

Page 107: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Relational Algebra Operations

Select

Project

Product

Union

Intersect

Difference

Join

Divide
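
Three of these operations (Select, Project, Join) can be sketched in plain Python, treating a relation as a list of dictionaries. The function names and sample relations are invented for illustration, and the join shown is a natural join on all shared attribute names.

```python
def select(relation, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Keep only the named attributes, dropping duplicate rows."""
    result = []
    for row in relation:
        projected = {a: row[a] for a in attributes}
        if projected not in result:
            result.append(projected)
    return result


def natural_join(left, right):
    """Combine rows that agree on every attribute the relations share."""
    result = []
    for l_row in left:
        for r_row in right:
            common = set(l_row) & set(r_row)
            if all(l_row[a] == r_row[a] for a in common):
                result.append({**l_row, **r_row})
    return result


suppliers = [
    {"supplier": "S1", "city": "Oakland"},
    {"supplier": "S2", "city": "Berkeley"},
]
supplies = [
    {"supplier": "S1", "part": "bolt"},
    {"supplier": "S2", "part": "nut"},
    {"supplier": "S2", "part": "bolt"},
]

# Which parts are supplied by suppliers located in Berkeley?
berkeley = select(suppliers, lambda row: row["city"] == "Berkeley")
berkeley_parts = project(natural_join(berkeley, supplies), ["part"])
```

Each operation takes relations and returns a relation, which is why they can be composed as in the last line.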

Page 108: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Effectiveness and Efficiency Issues for DBMS

Focus on the relational model

Any column in a relational database can be searched for values.

To improve efficiency, indexes using storage structures such as B-trees and hashing are used

But many useful functions are not indexable and require complete scans of the database
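
The contrast between an indexed lookup and a full scan can be observed in SQLite through Python's sqlite3 module. The table, index name, and queries below are invented for illustration, and the exact wording of `EXPLAIN QUERY PLAN` output differs between SQLite versions, so the sketch only collects the plan text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ssn TEXT, last_name TEXT, notes TEXT)")
conn.execute("CREATE INDEX idx_last_name ON employee(last_name)")


def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail.
    return " ".join(row[-1] for row in rows)


# An equality test on an indexed column can use the index.
indexed = plan("SELECT ssn FROM employee WHERE last_name = 'Smith'")

# A substring match has no usable index, so every row is examined.
scanned = plan("SELECT ssn FROM employee WHERE notes LIKE '%widget%'")
```

Printing `indexed` and `scanned` shows the first plan naming `idx_last_name` and the second describing a scan of the whole table.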

Page 109: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Advantages of RDBMS

Possible to design complex data storage and retrieval systems with ease (and without conventional programming).

Support for ACID transactions

– Atomic

– Consistent

– Isolated

– Durable
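
The "Atomic" property can be sketched with Python's sqlite3 module: a transfer between two accounts either happens completely or not at all. The `account` table and the amounts are invented for illustration. The `with conn:` block commits if it finishes normally and rolls back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 0)")
conn.commit()

try:
    with conn:  # one transaction: commit on success, roll back on error
        # The credit succeeds on its own...
        conn.execute("UPDATE account SET balance = balance + 150 WHERE name = 'B'")
        # ...but the debit would make A negative and violates the CHECK.
        conn.execute("UPDATE account SET balance = balance - 150 WHERE name = 'A'")
except sqlite3.IntegrityError:
    pass  # the whole transfer was rolled back, including the credit to B

balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed transfer, `balances` still shows the original amounts; the partial credit to B did not survive.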

Page 110: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Advantages of RDBMS

Support for very large databases

Automatic optimization of searching (when possible)

RDBMS have a simple view of the database that conforms to much of the data used in businesses.

Standard query language (SQL)

Page 111: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Disadvantages of RDBMS

Until recently, no support for complex objects such as documents, video, images, spatial or time-series data. (ORDBMS are adding support for these.)

Often poor support for storage of complex objects. (Disassembling the car to park it in the garage)

Still no efficient and effective integrated support for things like text searching within fields.

Page 112: Final Exam Review SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 2000

Study hard, and good luck!

Thank you for all the great work!