Information retrieval

[email protected]

• Vrije Universiteit Brussel

• Information- and Library Science, University of Antwerp(en),

Belgium

Lectures presented in universities in China, March 2003.

These slides are available from http://www.vub.ac.be/BIBLIO/nieuwenhuysen/presentations/

2

Contents / summary

of this presentation

1. About “information”

2. Databases and computerized information retrieval

3. Classifications, and thesaurus systems

4. Internet

5. World-Wide Web

6. Online access information sources and services!

3

About “information”

Information concepts

****

4

Our world: future trends

Future trends in our world

• Complexity

• Dynamics and evolution / Speed and acceleration

• Internationalization / Globalization

• Economic products less based on natural resources and more on “knowledge”

Answers / Requirements / Solutions / Reactions

• Knowledge and skills

• Adaptability / Flexibility

• Global co-operation / Mobility

• Education, research, exploitation of knowledge is important

***-

!? Question !?

Compare “information” for instance with “bananas”.


***- 5

6

Information versus other products = bits versus atoms

• The essential difference between information and other economic or natural products is that information on computers (such as databases) consists of bits (and bytes), while other economic / natural products (such as bananas) consist of atoms.

• This has many interrelated consequences.

***-

01010101101011010010

7

Information: some strange properties (Part 1)

• Information is never consumed and does not deteriorate. Nevertheless, information becomes obsolete; speed of delivery can be crucial. The context is important.

• There is no agreed measure of a unit of information.

• The price of an information item is not well linked to its value in a particular situation. Moreover, one cannot well quantify the benefit/value of information.

***-

8

Information: some strange properties (Part 2)

• One information item can be available to different persons at the same time. Information can be reproduced easily, which makes it cheap for wide consumption. However, copyright can keep the price high.

• Most digital information items (documents) can be changed, modified, falsified, manipulated… more easily than physical products/items. “Is this document real, authentic, original?”

***-

9***-

Information sources: people and documents

• Information sources come essentially in two formats:

» less formal: people communicating by

—telephone

—electronic mail,…

»more formal: documents such as

—hard copy documents

—electronic, digital documents; computer-based files

• Here we focus mainly on information that is stored in documents.

10

The flow of documentary information with primary and secondary sources

Author / Creator / Sender

Primary sources / systems: mainly journal articles / books / electronic mail / online sources /...

Secondary sources / systems: mainly reference works (printed, CD-ROM, online), library catalogues, including OPACs...

Reader / User / Receiver

****

11

The role of secondary information sources

• The secondary information flow is generated on the basis of the primary flow, mainly because the great amount of primary information lowers the chance of retrieving and using the appropriate information item.

• Secondary information tries to bring some order in the great chaos.

****

12

Various categorisations of documentary information sources

Information sources can be categorised in various ways. For instance:

****

•Primary

•Secondary

•Hard copy /not digital

•Digital

•Offline

•Online

•Text

•Image

•Sound

•Animation/video

•Software

•Data

•Interactive

•Books

•Serials

13

Retrospective searching versus current awareness: scheme

****

[Figure: time axis Past / Now / Future, showing retrospective searching and current awareness]

14

Information retrieval: evolution of storage and distribution media

****

• 1450 printing with reusable characters/fonts

• 1975 + online access databases; from the 1970s, the growing Internet

• 1985 + CD-ROM

• 1990 + World-Wide Web

(based on the Internet)

15

Information retrieval: end user or information intermediaries

End-user

Information intermediary (broker or library or ...)

Information

****

16

End user versus information intermediary

• People can retrieve information themselves, directly as so-called “end-users”.

• However,

»the information landscape is complex,

»it may take a lot of time to find the right information,

»it may be costly to search for information

• Therefore it may be wise to obtain the assistance of an expert information intermediary, such as a reference librarian or an information broker.

****

17

About “information”

Evaluating information sources

****

18

Documentary information sources: evaluate their quality

• We should always be critical when using information sources, in view of

»the widely varying degrees of quality of information sources, and of

»the costs associated with searching, finding, using information.

****

19

Documentary information sources: criteria to evaluate their quality (1)

• Is the information valid, reliable, trustworthy, genuine, authentic? Is the author honest? Is the source objective, not subjective, without cultural or political or ideological or commercial bias? Is the origin an individual or a company or an organisation? Is the publication sponsored by some company or organisation?

****

20

Documentary information sources: criteria to evaluate their quality (2)

• Is the information accurate, correct? Who is the author or producer? Does the source have an author or a producer with great expertise, a good reputation, good qualifications? Can the author be contacted for clarification or discussion? Was the information reviewed, edited, improved, corrected, censored, approved, verified, before publication? Do experts agree on the information provided?

****

21

Documentary information sources: criteria to evaluate their quality (3)

• Is the information source unique? Does it offer a great amount of primary information, which is not obtainable from other sources?

• Is the information complete? Is the work available in its entirety?

• Does the source offer a wide coverage? Is the source comprehensive, substantive?

• Is the information current enough, up to date? Is a publication date provided? Is an expiration date provided?

****

22

Documentary information sources: criteria to evaluate their quality (4)

• Does the document provide suitable references, so that you can verify statements and find older suitable information sources?

• Good clear format and lay-out of the information / User-friendly information system / Easy for users to orientate themselves within the resource and to find their way around it?

• Good user support / Good customer support?

• Is the type of distribution medium appropriate? (print, e-mail, online,...)

****

23

Documentary information sources: criteria to evaluate their quality (5)

• Is the information what you want? If not, then reassess your needs and consider other types of information as well.

****

24

Documentary information sources: criteria to evaluate their quality (6)

• Is the information suitable for your level of understanding of the subject? Is the document popular, suitable for the general public, for students, for professionals, for scholarly/academic use…? Does it report new, primary research (survey, experiment, observation, measurement, invention) or is it a review of sources published earlier?

• Does the information repeat or confirm what you already know, or is it complementary, contradictory, new?

****

25

About “information”

Computer- and network-based information

****

26

Information: from bits to meaningful information

Digital computer data = bits (0 or 1)

Program code, meaningful for and to be interpreted / executed by a suitable / compatible computer

Information = “documents”, meaningful for and to be interpreted by human beings

****
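The step from bits to something a human can read can be sketched in a few lines of Python. This is only an illustration: the bit pattern and the choice of ASCII as character encoding are assumptions made for the example, not part of the slides.

```python
def bits_to_text(bits: str, encoding: str = "ascii") -> str:
    """Interpret a string of '0'/'1' characters as 8-bit bytes and decode them."""
    if len(bits) % 8 != 0:
        raise ValueError("expected a multiple of 8 bits")
    # Cut the bit string into groups of 8 and turn each group into one byte.
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    # The same bytes only become readable text once an encoding is agreed upon.
    return data.decode(encoding)


# Hypothetical example: 16 bits that encode two ASCII characters.
example_bits = "0100100101010010"
print(bits_to_text(example_bits))
```

The same 16 bits could just as well be part of a program or an image; it is the agreed interpretation that turns them into a "document".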

27

Information: digitally stored and managed information

Categories of digital, computer-readable information / data, forming electronic “documents”, understandable by human beings:

• text / numbers / images / video / sounds

• combined: multimedia

****

28

Information: types of digital information

Digital information (0 / 1):

• Linear text / Hypertext

• Static images / Video

• Sound

• Multimedia / Hypermedia

• Programs for computers

****

29****

Some publication media compared

[Figure: printed, CD-ROM and online / networked publication media compared in terms of update speed and volume]

30

Publications on CD-ROM or online: advantages compared with hard copy

***-

• Can be cheaper to produce, to transport and to store.

• Can offer better search features.

• Can offer various output formats.

• Can offer fast and efficient “copy and paste” by the reader/user of information to other documents.

Taken together, these features allow more efficient access to large, high volume documents or databases.

31

Scientific publishing in Utopia: an ideal scheme

• Many authors

• Many editors / publishers

• Online remote access multimedia database server

• Many database search clients and user interfaces

• Many readers / users

• One global, international computer data communication network

• author = reader in science

****

32

!? Question !?

Indicate the differences between reality and that simplified, ideal scheme of the information flow.

****

33

!? Question !?

Which basic problems/difficulties hinder people in finding / accessing / using information?

****

34

Information retrieval: basic difficulties (Part 1)

****

• In many cases it is not completely clear to the user of an information retrieval system which information is in fact needed, required.

• In many cases the need for information cannot be expressed completely in the form of a query.

One of the reasons is that the complete context of the information need should ideally be expressed, including the knowledge and background of the searcher.

35

Information retrieval: basic difficulties (Part 2)

****

• Computer systems are artificial, but nevertheless most use human language in their interface with the human users, for instance in database search systems. This may cause difficulties related to language and vocabulary in particular. Some examples:

• People use different languages and different terms (vocabularies) to describe a similar concept.

• Concepts, vocabularies and meanings of words and terms may change over time.

• Meanings of words / terms may depend on their context.

36

Information retrieval: basic difficulties (Part 3)

****

• Many different and imperfect retrieval systems should or must be used.

»To retrieve and access the information that is in principle available, many different retrieval systems must be available and be mastered.

»Furthermore, perfect information retrieval software does not (yet) exist; scientific and technological evolution in the domain of information retrieval software has been fast since about 1970.

37

Information retrieval: basic difficulties (Part 4)

****

• Information overload

Users are often overwhelmed by the amount of available information and by the large influx of new information.

38

Information retrieval: basic difficulties (Part 5)

****

• The price (or inaccessibility) of particular information

A lot of information cannot be obtained at all, or at least not free of charge.

39

Information retrieval: browsing and searching as methods

• To make information available, the producer of an information system can offer to the user basically two different ways for retrieval of the right information from the system:

»by browsing or

»by searching.

***-

40

Information retrieval: browsing versus searching

• Browsing a logically ordered list of terms

»Logical order / Sorted by subject

»Table of contents

»Classification

»Hypertext-Hypermedia: jump from a page to a linked page

• Searching by submitting a search term to the system

»Alphabetical order / Not sorted by subject

»Alphabetical index

»Thesaurus

»Hypertext-Hypermedia: search built into a page

***-

41

Information retrieval: browsing systems

• In browsing systems, the user can follow some of the paths offered by the system.

• The information is ordered, according to subject for instance.

• The user does not have to use his own words to indicate his needs.

• To support organising and browsing of information items, some type of classification is applied in many cases.

***-

42

Information retrieval: examples of browsing systems

• Examples of browsing systems are

»a table of contents in the front part of a book,

»a set of books placed on shelves according to some classification system,

»a hypertext hierarchical directory on the WWW, or more generally all hypermedia systems.

***-

43

Information retrieval: search systems

• In search systems, the user has to express his need for information by formulating a query, normally in a natural language or in a more formal query language.

• In this case the information is normally not ordered according to some logic; in most cases it is a well-structured compilation of items of a similar form, such as the records of a database when a computer system is used.

***-

44

Information retrieval: examples of search systems

• Examples of search systems are

»the index (the register) in the back part of a book,

»a library or museum catalogue with a search interface,

»a search form on a web page.

***-

45

Information retrieval: pro and contra of browse systems

Advantages:

»Browsing is relatively easy for the user.

Difficulties for the user:

»The user can explore the information space only along paths built on the system designers' view of the world, not on his own view.

Difficulties for the producer:

»It is relatively costly to construct an information system based on browsing.

***-

46

Information retrieval: pro and contra of search systems

Advantages:

»Creation of keyword indexes for fast searching is relatively simple and cheap and can be automated.

Difficulties for the user:

»Searching is hindered by vocabulary / language problems.

»The users cannot always fully articulate their needs.

***-
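The remark that keyword indexes can be created automatically can be illustrated with a minimal sketch of an inverted file in Python. The three sample records and the simple tokenisation rule (lower-case runs of letters and digits) are assumptions chosen for this example; real retrieval software applies more refined rules.

```python
import re
from collections import defaultdict


def build_inverted_index(records):
    """Map each word to the set of record numbers in which it occurs."""
    index = defaultdict(set)
    for record_id, text in records.items():
        # Very simple tokenisation: lower-case runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(record_id)
    return index


# Hypothetical mini-database of three records.
records = {
    1: "Seawater pollution monitoring",
    2: "Pollution of rivers",
    3: "Monitoring fruit flies",
}
index = build_inverted_index(records)
print(sorted(index["pollution"]))
```

A search for one word then becomes a single look-up in the index instead of a scan of every record, which is why such indexes make searching fast.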

47

The information industry and the information market

The components of the information industry

****

48

The components of the information industry

• Authors

• Publishers

• Distributors

• Users

• Related organizations

****

49

The information industry and the information market


Overview and evolution

****

50

Increase in the number of scientific and technical serial publications

[Figure: number of scientific and technical serial publications (logarithmic scale from 1 to 1 000 000) plotted against the year, from 1650 to 2000]

****

51

The information market: growth in the database industry

[Figure: number of living databases, number of database producers and number of vendors (scale from 0 to 10 000), from 1975 to 1995]

****

Source: Williams, in: Gale Directory of Databases, 1998.

52

The information industry / market: future trends (Part 1)

• Growth in the production of databases.

• Less analogue / hard-copy production = more digital production, storage, and distribution of information.

• More integration of information types into multimedia and hypermedia.

****

53

The information industry / market: future trends (Part 2)

• Growth in the number of

»producers and distributors,

»end-users searching databases due to easier use and lower costs of information technology

****

54

Databases and computerized information retrieval

Introduction

****

55

What is a database?

A database is a collection of similar data records stored in a common file (or collection of files).

****

56

Types of databases: examples

Examples: The databases that form the basis for

»catalogues of books or other types of documents

»computerized bibliographies

»address directories

»a full text newspaper, newsletter, magazine, journal, and collections of these

»WWW and Internet search engines

» intranet search engines

» ...

****

57

Information retrieval and related activities: figure

[Figure: information retrieval (text retrieval and image retrieval) shown together with information management and presentation of information]

***-

58

Information retrieval: via a database to the user

***-

[Figure: information content → database (linear file and inverted file) → search engine → search interface → user]

59

Information retrieval: the basic processes in search systems

• Information problem → Representation → Query

• Text documents → Representation → Indexed documents

• Comparison of the query with the indexed documents → Retrieved, sorted documents

• Evaluation and feedback

***-

60

Information retrieval systems: many components make up a system

• Any retrieval system is built up of many more or less independent components.

• These components can be modified to increase the quality of the results more or less independently.

***-

61

Information retrieval systems: important components

***-

• the information content

• system to describe formal aspects of information items

• system to describe the subjects of information items

• concrete descriptions of information items = application of the used information description systems

• information storage and retrieval computer program(s)

• computer system used for retrieval

• type of medium or information carrier used for distribution

62

What determines the results of a search in a retrieval system?

• the information retrieval system ( = contents + system)

• the user of the retrieval system and the search strategy applied to the system

***-

Result of a search

63

Databases and computerized information retrieval

Text retrieval and language

***-

64

Text retrieval and language: a word is not a concept (a)

Problem: A word or phrase or term is not the same as a concept or subject or topic.

***-

[Figure: several words pointing to one concept]

65

Text retrieval and language: a word is not a concept (a’)

So, to ‘cover’ a concept in a search, to increase the recall of a search, the user of a retrieval system should consider an expansion of the query; that is: the user should also include other words in the query to “cover” the concept

***-

66

Text retrieval and language: a word is not a concept (a’’)


»synonyms!

»narrower terms, more specific terms (such as particular brand names); including terms with prefixes (for instance: viruses, retroviruses, rotaviruses,...)

»spelling variations (such as UK English versus US English); possible variations after transliteration

***-
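Query expansion as described on these slides can be sketched as follows. This is a minimal illustration in Python; the small table of variants is a hypothetical stand-in for a thesaurus or word list, and the function name `expand_query` is invented for the example.

```python
def expand_query(term, variants):
    """Return an OR query that covers the term and its known variants."""
    words = [term] + [w for w in variants.get(term, []) if w != term]
    # Phrases of more than one word are put between quotes.
    quoted = [f'"{w}"' if " " in w else w for w in words]
    return " OR ".join(quoted)


# Hypothetical variant table: narrower terms with prefixes, spelling variations.
variants = {
    "viruses": ["virus", "retroviruses", "rotaviruses"],
    "colour": ["color"],
}
print(expand_query("colour", variants))
```

A term without known variants is returned unchanged, so the function can be applied to every word of a query.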

!? Question !?

Which problems in text retrieval are illustrated by the following sentences?


***- 67

68

Time flies like an arrow.

Fruit flies like a banana.

?

***-Examples


71

Text retrieval and language: ambiguity of meaning (a)


• Problem: A word or phrase can have more than 1 meaning. Ambiguity of the meaning of a word is a problem for retrieval. This decreases the precision of many searches. The meaning can depend on the context. The meaning may depend on the region where the term is used.

***-

72

Text retrieval and language: ambiguity of meaning (a’)


»Example:

—Pascal the philosopher

—Pascal the computer language

***-

73

Text retrieval and language: ambiguity of meaning (a’’)

Problem: Ambiguity of meaning may be the cause of low precision.

***-

[Figure: one word pointing to several concepts]

74

A word is not a concept / A concept is not a word

1 word or term does/can not “cover” a concept = a concept cannot be “covered” by only 1 word or term; this may be the cause of low recall.

[Figure: several words pointing to one concept]

****
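Recall and precision, the two measures used on these slides, can be computed as in the following minimal Python sketch. The usual definitions are assumed: precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. The document identifiers are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # items that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 records retrieved, 3 records in the database are relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this example an ambiguous word would explain the two irrelevant records (lower precision), while a missing synonym would explain the relevant record that was not found (lower recall).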

75

A word is not a concept / A concept is not a word

Ambiguity of meaning may be the cause of low precision.

****

[Figure: one word pointing to several concepts]

76

Text retrieval and language: conclusions

• The use of terms and language to retrieve information from databases/collections/corpora causes many problems.

• These problems are not recognized or are underestimated by many users of search/retrieval systems; in other words, the power of retrieval systems is overestimated by many users.

• Much research and development is still needed to enhance text retrieval.

***-

77

Databases and computerized information retrieval

Hints on how to use information sources

****

78

Hints on how to use information sources: overview (Part 1)

• Know the purpose and motivation for each search.

• Do not be lazy: search on your own, before bothering experts with requests for advice.

• Plan your search in advance.

• Choose the best source(s) for each search.

• Use the right tools for each job (a suitable communication program for instance, in the case of online searches).

• Do not focus on a single source.

****

79

Hints on how to use information sources: overview (Part 2)

• Consider citation indexes besides subject-oriented databases, as useful secondary information sources.

• Use the available tools for subject searching well.

• Try to cope with the language problems.

• Match your search strategy with the type of source.

• In computer-based retrieval systems, combine search terms when appropriate, using

»Boolean operators

»proximity operators (for instance “near”,...)

****

80

Hints on how to use information sources: overview (Part 3)

• Work cost-effectively.

• Use special care when searching for names.

• Work iteratively.

• Keep a record of your work.

• Be critical: not all information is correct or useful.

• Stop searching when “enough is enough”

• Give up if necessary... (Not all questions have an answer.)

• ...

****

81

Hints on how to use information sources: subject searching

• When you search for information on a particular topic/subject: investigate if the database producer offers

»a subject classification scheme and/or

»a list of controlled/approved/accepted subject terms, and/or

»a subject thesaurus

• Exploit these, if they are available.

• In most cases you should find and use synonyms and narrower terms.

• Use broader and/or related terms, if appropriate.

****

82

Hints on how to use information sources: Boolean combinations (1)

Most text search systems understand the basic Boolean operators:

AND = obtain records that contain both search terms

OR = obtain records that contain one or both search terms

NOT = exclude records that contain a search term

****
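The effect of the three operators can be shown with Python sets, as a minimal sketch. Each search term is assumed to have a set of record numbers in which it occurs; the terms and numbers below are invented for the example.

```python
# Hypothetical postings: for each search term, the records that contain it.
postings = {
    "sea": {1, 2, 4},
    "pollution": {2, 3, 4},
    "review": {4, 5},
}

# AND = records that contain both search terms (set intersection).
sea_and_pollution = postings["sea"] & postings["pollution"]

# OR = records that contain one or both search terms (set union).
sea_or_pollution = postings["sea"] | postings["pollution"]

# NOT = exclude records that contain a search term (set difference).
sea_not_review = postings["sea"] - postings["review"]

print(sorted(sea_and_pollution), sorted(sea_or_pollution), sorted(sea_not_review))
```

AND narrows a result set, OR widens it, and NOT removes records from it.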

83

Hints on how to use information sources: Boolean combinations (2)

Most text search systems understand the basic Boolean operators typed in capital characters:

OR

AND

****

84

Hints on how to use information sources: Boolean combinations (3)

In the case of computer-based information sources, use Boolean combinations of search terms when appropriate and when possible.

****

(term x1 OR term x2 OR term x3)

(term y1 OR term y2 OR term y3)

(term z1 OR term z2 OR term z3)

AND AND AND ...
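The pattern above, OR between the terms of one concept and AND between the concepts, can be generated mechanically. The following Python sketch assumes a generic query syntax with parentheses and double quotes around phrases; concrete databases may use a different syntax. The topic and its terms are hypothetical.

```python
def build_query(concepts):
    """Combine terms with OR within a concept and with AND between concepts."""
    groups = []
    for terms in concepts:
        # Phrases of more than one word are put between quotes.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical topic with three concepts: online library catalogues.
concepts = [
    ["library", "libraries"],
    ["catalogue", "catalog", "OPAC"],
    ["online", "web based"],
]
print(build_query(concepts))
```

Keeping the term lists per concept separate from the query string makes it easier to adapt the same preparation to several databases.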

85

!? Question !? Task !? Problem !?

How many (and which) concepts do you see in a search for “general reviews about monitoring seawater pollution that is due to effluents”?

****

86

!? Exercise !? Task !? Problem !?

Prepare off-line, on paper, a suitable search query in a generic format, to find “general reviews about monitoring seawater pollution that is due to effluents”, as the basis for later, concrete searches in databases.

(Limit yourself to 1 of the concepts.)

****

87

Hints on how to use information sources: example of a search query

Example: Searching for the concept “sea” can or should involve, for instance, the following words in a Boolean OR combination: baltic OR bay OR bays OR coast OR coastal OR coastline OR coasts OR cove OR coves OR gulf OR mangrove OR mangroves OR marine OR mediterranean OR noordzee OR noordzeekust OR noordzeekusten OR ocean OR oceanic OR oceans OR reef OR reefs OR “saline-freshwater interface” OR sea OR seas OR seashore OR seawater OR seawaters OR shore OR shores

***-Example

88

!? Question !? Task !? Problem !?

What did you learn from the exercise on the formulation of a query?

****

89

Hints on how to use information sources: work iteratively

Work iteratively = search, investigate your results, refine your search, search again, and so on; do not try to find everything in 1 step, with 1 search.

****

Results

Query Searching

Feedback

90

“The ability to ask the right question is more than half the battle of finding the answer.”

Thomas J. Watson

****

?

91

Hints on how to use information sources: when to stop searching?

Develop a feel for the “curve of diminishing returns”:

If you spend too much time, effort, and/or money with too few benefits, you should stop.

****

[Figure: curve of payoff against time / effort / money, marked “Time to stop?”]

92

Knowledge organisation: classifications, and thesaurus systems


Introduction

****

93

• To organise knowledge / documents / books / reports / information / data / records / things / items / materials for more efficient storage and retrieval, some related, similar tools / systems / methods /approaches are used.

• Often but not yet always, this process is assisted by a computer system.

• Good systems are expanded and updated when the need arises.

• The organization system applied should ideally be clearly and immediately visible or even searchable on computer, by the user of the materials.

Knowledge organisation: introduction


****

94

• Various tools / systems / methods / approaches are available:

»Classification

»Taxonomy

»Thesaurus

»Ontology

»…

Knowledge organisation: some tools


***-

95

Knowledge organisation: classifications, and thesaurus systems


Classifications

****

96

Classification systems: introduction

• Classification systems present the subjects in a logical order, usually going from the more general to the more specific.

***-Examples

97

• Universal means here: covering all subjects

• Not just one but several competing systems exist. Examples

»Universal Decimal Classification = UDC

used mainly outside U.S.A.

»Dewey Decimal Classification = DDC

used mainly in U.S.A.

»Library of Congress Classification

used mainly in U.S.A.

» ...

Classification systems: examples of universal systems


****Examples

98

Knowledge organisation: classifications, and thesaurus systems


Thesaurus systems

****

99

Thesaurus: description

• Thesaurus (contents) =

»system to control a vocabulary (= words and phrases + their relations)

»the contents of this vocabulary

• Thesaurus program =

program to create, manage, modify and/or search a thesaurus using a computer

****

100

Thesaurus relations

Relations between a term and other terms:

• BT (= Broader Term): term(s) with broader meaning

• NT (= Narrower Term): term(s) with narrower meaning

• RT (= Related Term): other, related term(s)

• UF (= Use(d) For): synonym(s)

****
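The four thesaurus relations can be represented as a simple data structure. The following Python sketch is only an illustration: the class name `ThesaurusEntry`, the helper `search_terms` and the sample entry are assumptions made for this example and do not come from any particular thesaurus program.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One preferred term with its thesaurus relations."""
    term: str
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT
    used_for: list = field(default_factory=list)  # UF (synonyms)


def search_terms(entry):
    """Preferred term plus synonyms and narrower terms, for use in an OR query."""
    return [entry.term] + entry.used_for + entry.narrower


# Hypothetical entry, invented for illustration.
viruses = ThesaurusEntry(
    term="viruses",
    broader=["microorganisms"],
    narrower=["retroviruses", "rotaviruses"],
    related=["viral diseases"],
    used_for=["virus"],
)
print(search_terms(viruses))
```

Broader and related terms are kept out of `search_terms` on purpose, because adding them widens a search beyond the original concept.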

101

Thesaurus systems that cover all subjects


• General systems

• Universal systems

• Covering all subjects

• Broad and shallow systems

• Horizontal systems

***-

102

Thesaurus systems that cover all subjects: examples


• thesaurus system built into word processing software

• Library of Congress Subject Headings (LCSH)

• thesaurus system that runs on a pc; see for instance http://www.wordweb.co.uk/free/

• thesaurus systems that can be used free of charge through the WWW

»http://education.yahoo.com/reference/thesaurus/index.html

»http://thesaurus.plumbdesign.com/

***-Examples

103

!? Exercise !? Task !? Problem !?

Practice using a general thesaurus system that is built into your word processing program.

***-

!? Exercise !? Task !? Problem !?

Have a look at various global, general, universal thesaurus systems.

Consider which ones may be useful for your future online information searches.

**-- 104

105

Computer networks, data communication and Internet

Introduction

****

106

Data communication: a definition

• Interpersonal communication

» Telecommunication

—Broadcast

—Telephone

—Data communication

–Remote login

–File transfer

–Hypertext transfer

–Electronic mail

–...

****

107

Data communication: which types of ‘data’?

Digital information (0 / 1):

• Linear text / Hypertext

• Static images / Video

• Sound

• Multimedia / Hypermedia

• Programs for computers

****

108

Data communication: which types of ‘data’?

• The same types of data (information) that can be stored and managed on a computer can be transferred over computer networks to one or several other computers.

• So the networks form an important extension of the stand-alone computers.

• “The network is the computer”

****

109

Data communication: applications

• Hard-copy transfer (Fax)

• Online use of the processing power of a remote computer

• Online access to information sources !

»library catalogues,

»bookshop catalogues,

»publisher’s catalogues,

»campus-wide and community information systems,

»(text or multimedia) databases,

»network-based journals, ...

****

110

Data communication: problems, difficulties, limitations

• Low transfer speed

• Technical complexity

***-

111

Computer network protocols: definition

• When 2 computer systems communicate via a network, they do that by exchanging messages.

• The structure of network messages varies from network to network.

• Thus the message structure in a particular network is agreed upon a priori and is described in a set of rules, each defined in a protocol.
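What "a message structure agreed upon a priori" means can be illustrated with a toy protocol in Python. The layout below (a 1-byte message type and a 2-byte payload length, followed by the payload) is invented for this example and is not the format of any real network protocol.

```python
import struct

# Agreed header layout: network byte order, 1-byte type, 2-byte payload length.
HEADER_FORMAT = "!BH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def encode_message(msg_type: int, payload: bytes) -> bytes:
    """Sender side: put the agreed header in front of the payload."""
    return struct.pack(HEADER_FORMAT, msg_type, len(payload)) + payload


def decode_message(data: bytes):
    """Receiver side: read the header first, then exactly 'length' payload bytes."""
    msg_type, length = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    payload = data[HEADER_SIZE:HEADER_SIZE + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return msg_type, payload


wire = encode_message(1, b"hello")
print(decode_message(wire))
```

Sender and receiver can only understand each other because both use the same `HEADER_FORMAT`; that shared rule is the protocol.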

****

112

Computer networks, data communication and Internet

National Wide Area Networks

****

113

National Wide Area Networks

• Public access national packet switching networks

• Research computer networks

• Public access made available by Internet Service Providers

• ...

****

114

Computer networks, data communication and Internet

International computer networks

****

115

International computer networks: examples

• National public data communication networks linked together

• Internet

• FidoNet

• Bitnet / EARN

• Usenet

• ...

****Examples

116

Computer networks, data communication and Internet

The Internet data communication network

****

117

@

The Internet data communications network (Part 1)

• “Internet” is not well-defined.

• A network of smaller networks: the global collection of interconnected local area, regional and wide-area (national backbone) networks which use the TCP/IP suite of data communication protocols.

****

118****

The Internet data communications network (Part 2)

• Links computers of various types.

• Is constantly growing.

• The analogy of a superhighway has been used to describe the emerging system of networked computers.

• The Internet has no owner, and is not managed by one organization. @

119

The Internet: access from your Local Area Network

Your microcomputer

Local Area Network (LAN)

One of the national networks

The global Internet

****

120

Host computers in the Internet: definition

• A host (computer) is a domain name that has a unique IP address record associated with it.

• Could be any computer connected to the Internet by any means.

• For instance: www.vub.ac.be

****

@

121

Transmission Control Protocol / Internet Protocol (TCP/IP)

• the main suite of transport protocols used on the Internet for connectivity and transmission of data across heterogeneous systems

• “glue that holds the Internet together”

• an open standard

• available on most Unix systems, VMS and other minicomputer systems, many mainframe and supercomputing systems and some microcomputer and PC systems

****

122

Internet: growth in number of hosts worldwide: linear plot

[Figure: linear plot of the number of Internet hosts worldwide in January of each year, 1993 to 1998; vertical axis from 0 to 20 000 000 hosts.]

****

123

Internet Service Provider = ISP

****

Internet Service Providers provide their clients access to Internet + in many cases

»an email address / server

»space for a web site

»software tools to start

» training

» technical support

»an accessible location for a WWW site of the client

»assistance with WWW site design and promotion

124

World-Wide Web = WWW

Introduction

****

125

The WWW: example of a welcome page

****Example

126

URL = Uniform Resource Locator

• = draft standard for specifying an object on the Internet

• the structure is in most cases: protocol://computer_address[/path_name/file_name]

• examples:

» telnet://biblio.vub.ac.be

»ftp://ftp.vub.ac.be/

»gopher://gopher.vub.ac.be/

»http://www.vub.ac.be/BIBLIO/index.html

»news://news.server.edu/comp.infosystems.www

****

127

URL format / structure

1. The first part of a URL, before the colon “:”, specifies the access method = protocol

2. The second part of the URL, after the colon “:”, is interpreted specific to the access method. In general, two slashes after the colon indicate a machine /computer name.
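The two parts described above can be inspected with Python's standard `urllib.parse` module. This is a small sketch; the helper name `describe_url` and the dictionary keys are chosen here for illustration only.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into access method, computer name and path."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,    # the part before the colon
        "computer": parts.hostname,  # the machine name after the two slashes
        "path": parts.path,          # optional path name / file name
    }


for example in ("http://www.vub.ac.be/BIBLIO/index.html", "ftp://ftp.vub.ac.be/"):
    print(describe_url(example))
```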

****

128

!? Question !? Task !? Problem !?

What is the difference between Internet and the World-Wide Web?

****

129

The WWW is an application of Internet

****

• The World-Wide Web (WWW) is a service, an application of Internet.

• It is based on the Internet infrastructure.

• So the WWW is newer than the Internet. The concept of the WWW was created at the end of the 1980s when the Internet was already well established.

130

The WWW is an application of Internet: scheme

****

Data communication

Internet

WWW

131

The WWW: the essential elements

• Information delivery and access using hypertext/hypermedia documents/objects

»html documents

»http protocol: http clients http servers

• Integration of protocols in the Internet:

»http servers offering html documents including links to other http servers, telnet servers, ftp servers, nntp servers, gopher servers, ...

****

132

World-Wide Web = WWW

WWW client programs

****

133

WWW: client / browser programs

• To access the WWW, you run a browser program.

• The browser reads documents, and can fetch documents from other sources. Information providers set up hypermedia servers which browsers can get documents from.

• The browser can display hypertext documents. Hypertext is text with pointers to other text. The browsers let you deal with the pointers in a transparent way: select the pointer, and you are presented with the text that is pointed to.
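How a program can find the pointers in a hypertext document can be sketched with Python's standard `html.parser` module. The sample page and the base address below are invented for illustration; a real browser does far more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of all <a href="..."> pointers in an HTML text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative pointers against the page address.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page content, used only to demonstrate the parser.
SAMPLE_PAGE = (
    '<p>See the <a href="index.html">library page</a> and '
    '<a href="ftp://ftp.vub.ac.be/">the file archive</a>.</p>'
)

collector = LinkCollector("http://www.vub.ac.be/BIBLIO/")
collector.feed(SAMPLE_PAGE)
print(collector.links)
```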

****

134

WWW: examples of browsers for your own computer

Browsers are available for many computer platforms; in particular: browsers for Windows + Winsock:

»Netscape

»Microsoft Internet Explorer

»...

****

135

!? Question !? Task !? Problem !?

Browse the WWW, using an available

browser client program.

****

136

!? Question !? Task !? Problem !?

What came first: Internet or WWW? Explain.

***-

137

World-Wide Web = WWW

Saving information from a web

****

138

WWW: How to save information from a web?

Information displayed by your web browser/client program can be saved,

• by select, copy, paste in another document (and save)

• by saving a complete page to your disk

» in separate files (for instance 1 HTML file + some image files)

» in 1 file, using Microsoft Internet Explorer 5 or a later version

• by copying the information into an e-mail message that you send to your own e-mail account

****

139

!? Exercise !? Task !? Problem !?

Copy some text fragment from the WWW and paste it into another document on your computer.

****

140

!? Exercise !? Task !? Problem !?

Save a text from WWW to disk, as HTML,

using a browser program.

****

141

!? Exercise !? Task !? Problem !?

Display an HTML file that you have saved from the WWW to your disk, in a program for word processing.

Is the file displayed properly?

****

142

World-Wide Web = WWW

The success of WWW

****

143

WWW: growing number of WWW servers

[Figure: plot of the number of WWW servers per year, 1993 to 2000; vertical axis from 0 to 7 000 000 servers.]

****

144

WWW as popular method to access information from computers

****

• The WWW has quickly become the most popular medium to access information that resides on various computers that are connected to a computer network.

145

Online access information sources and services

Introduction

****

146

Internet based information sources: problems / difficulties (Part 1)

• Redundancy and overlap: on the one hand, there is too much information on some topics; in other words, the redundancy and overlap are high in many cases.

• Too few information sources: on the other hand, there are too few information sources on some topics.

****

147

Internet based information sources: problems / difficulties (Part 2)

• No order is imposed on most sources.

• Quality checks / quality controls are not performed.

• Related to this: it is not required to register new information offered.

• Is the information that you find real, honest, authentic?

****

148

Internet based information sources: how many? how much information?

In 2001:

• More than 10 terabytes (= 10 000 gigabytes) of text data

In 2002:

• More than 2000 million (= 2 billion) unique URLs in the total Internet

****

149

Online access information sources and services

Types of online access information systems

***-

150

Types of online access information systems: “free” versus “fee”

• A lot of the information on the Internet is available free of charge, but another part is only accessible when a fee is paid to the producer and / or the distributor.

• Some organisations pay these fees for some sources and then organise access, so that the members of the organisation can retrieve and exploit the information as if it were free of charge.

****

151

Types of online access information systems: “free” versus “fee”

****

Public access information sources free of charge

Fee-based online information services (NOT free of charge)

152

Types of online access information systems: “free” for members only

****

Public access information sources free of charge

Fee-based online information services (NOT free of charge)

Fee-based online information services, made accessible “free of charge”

by an institute to its members

153

Online access information sources and services

Dictionaries and encyclopaedias accessible through the WWW

****

154

Dictionaries and encyclopedias through the WWW: introduction

• Dictionaries and encyclopedias are the first choice among many types of information sources,

»when we do not need detailed information on a common topic

»when we want to prepare a more detailed search on an unfamiliar topic, by searching for the right spelling, synonyms, context,…

• Some dictionaries and encyclopedias are available through the WWW free of charge.

****

155

Dictionaries accessible through Internet and the WWW: example

• The American Heritage® Dictionary of the English Language

»Over 200,000 entries, 70,000 audio word pronunciations, 900 full-page color illustrations

»Available free of charge from http://education.yahoo.com/reference/dictionary/

****Example

156

Dictionaries accessible through Internet and the WWW: compilation

• A compilation/collection of dictionaries can be searched simultaneously and free of charge: http://www.onelook.com/

****Example

157

Encyclopedias accessible through Internet and the WWW: examples

• Encarta Concise Free Encyclopedia 

»http://encarta.msn.com/

»Available in English and in some other languages

****Example

158

Encyclopedias accessible through Internet and the WWW: examples

• Encyclopædia Britannica: only a small part is available free of charge, plus links to selected WWW sites

»http://www.britannica.com/

• Encyclopædia Britannica Concise

»http://education.yahoo.com/reference/encyclopedia/

****Example

159

Encyclopedias accessible through Internet and the WWW: examples

• The Canadian Encyclopedia (in English and in French):

»http://thecanadianencyclopedia.com/

****Example

160

Encyclopedias accessible through Internet and the WWW: examples

• Several encyclopedias and dictionaries have been integrated and are searchable simultaneously and free of charge through http://xrefer.com/

****Example

161

Encyclopedias accessible through Internet and the WWW: overviews

• A list / overview of encyclopedias on the Internet: http://www.internetoracle.com/encyclop.htm

• Other lists of encyclopedias on the Internet can be found as a part of more general directories of Internet-based information sources.

****Example

162

Online access information sources and services

Internet directories and indexes

****

163

Internet: meta-information about Internet information sources

• in printed manuals and guides:

- it is not always possible to get a copy fast

- it costs money to get a copy

- they are soon out of date

• offered on the WWW!:

+ directly available when we want to use the Internet

+ many systems are accessible free of charge

+ most systems are regularly updated

• (“intelligent agent” software on client PC)

****

164

Internet: subject-oriented meta-information offered via WWW

Information about information sources: in the form of

»subject guides = texts with references

»subject hypertext directories = subject guides

»key word indexes, generated automatically, for searching

»collections of links or forms to the above

»(multi-threaded search systems)

****

165

Internet global subject directories: introduction

• They are virtual libraries with open shelves, for browsing.

• They are compiled manually, by many people.

• They can be browsed following a tree structure or a more complicated variation.

• The most famous of these systems belong to the most popular and most visited sites on the WWW: e.g. Yahoo!
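Browsing along a tree structure can be sketched as walking through nested categories. The miniature directory below is a hypothetical example (the category names and the `_sites` key are assumptions made for illustration); real directories are much larger and may allow cross-links between branches.

```python
# Hypothetical miniature subject directory: each category maps to
# subcategories, and the special key "_sites" lists the selected sites.
DIRECTORY = {
    "Health": {
        "Medicine": {
            "Pediatrics": {
                "_sites": ["Pediatric Database (PEDBASE)"],
                "Organizations": {"_sites": ["American Academy of Pediatrics"]},
            }
        }
    }
}


def browse(directory, path):
    """Follow a category path and return (subcategories, sites) found there."""
    node = directory
    for category in path:
        node = node[category]  # raises KeyError for an unknown category
    subcategories = sorted(key for key in node if key != "_sites")
    return subcategories, node.get("_sites", [])


print(browse(DIRECTORY, ["Health", "Medicine", "Pediatrics"]))
```

Each step narrows the subject, which is why this style of access suits broad searches that are hard to express in words.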

****

166

Internet global subject directories: structure

The structure corresponds to a classification that is in most cases specific for the particular overview. In other words: the well-known and classical universal classification systems are not used in most Internet directories.

****

167

Internet global subject directories: limitations

• They cover only a small number of selected WWW sites, in comparison with the total number of sites that are accessible.

• They are suitable mainly for broad searches that can be difficult to formulate in words, but NOT for more specific searches that require combinations of several concepts.

****

168****

Internet global subject directories: searching directories with a query

• Many of the Internet directories include an index to search their contents with a query.

• However, then the assisting classification structure is not well exploited and the user should be aware of the problems and difficulties of information retrieval with natural language queries.

• Furthermore, the possibility to use the system in this way may be confusing, as these directories are not real full-text Internet indexes, like those provided by other search tools.

169

Internet global subject directories: Yahoo!

• A hypertext global subject directory can be found at http://www.yahoo.com/ and at many other sites, including http://www.yahoo.co.uk/

• Entries are NOT rated.

• Accessible free of charge.

****Example

170

Internet global subject directories: Yahoo! links in pediatrics

• Health > Medicine > Pediatrics:

• International Pediatric Chat - for professionals to share information and education regarding children's health care.

• National Med/Peds Residents' Association - organization for residents, practitioners and medical students interested in combined internal medicine and pediatrics.

• Neonatology Network - information and communication platform for neonatologists and pediatricians.

• Pediatria OnLine - qui si parla di bambini, fra pediatri e con le famiglie.

• Pediatric Critical Care

• Pediatric Database (PEDBASE) - containing descriptions of over 500 childhood illnesses.

• Pediatric Endocrinology Conference - LWPES/ESPE joint meeting occurring July 6-10 2001.

• Pediatric Endoscopic Photos - illustrating intestinal problems in children.

***-Example

171

Internet global subject directories: Yahoo! for pediatrics

• Health > Medicine > Pediatrics: link to a digital library (health sciences) for young patients

***-Example

172

Internet global subject directories: Yahoo! to pediatrics organisations

• Health > Medicine > Pediatrics > Organizations: link to the American Academy of Pediatrics

***-Example

173

Internet global subject directories: Yahoo! links to pediatrics schools

• Health > Medicine > Pediatrics > Schools, Departments, and Programs

• University of Rochester - partnership between pediatric residents and community-based agencies that serve children and their families.

• Michigan State University@

• Royal College of Paediatrics and Child Health - responsible for training, examinations, professional standards, and organisation of child health services for the UK.

• Tohoku University

• University of Alabama at Birmingham - programs and training opportunities in pediatrics. Also contains faculty information and sub-specialty descriptions.

• …

***-Example

174

Internet global subject directories: searching with a query in Yahoo! (1)

• The directory of Yahoo! can not only be browsed, but can also be searched with a query.

• However, in this way the hierarchical structure is not well exploited.

• For the formulation of a search query, Yahoo! can provide automatic assistance related to spelling and word variations. For instance, after a search for “Capetown”, Yahoo! answers: “Other Spellings: Try searching for cape town instead.”

***-Example

175

Internet global subject directories: searching with a query in Yahoo! (2)

• When such a query does not provide results, then Yahoo! uses a much larger external Internet index (not produced by Yahoo!) to execute a query based on textual search statements. The chosen Internet index has varied over time.

• This mechanism is not made very clear and may confuse the user.

***-Example

176

Internet global subject directories: BUBL link

• A hypertext global subject directory to more than 10 000 WWW sites for the higher education community can be found at http://bubl.ac.uk/link/

• Accessible free of charge.

***-Example

177

Internet global subject directories: Google directory

• A hypertext global subject directory can be found at http://directory.google.com/

• Accessible free of charge.

• Very similar to the Open Directory Project.

***-Example

178

Internet global subject directories: Open Directory Project

• A hypertext global subject directory can be found at http://www.dmoz.org/

• The contents are also used by the Google Directory system.

• Accessible free of charge.

***-Example

179

Internet global subject directories: Resource Discovery Network

• A collection of hypertext subject directories that focus on academic information sources can be found at http://www.rdn.ac.uk/

• Together these lead to more than 30 000 selected WWW sites.

• Accessible free of charge.

***-Example

180

!? Exercise !? Task !? Problem !?

Try to find Internet sources which are relevant for you, by using an Internet-based global subject directory.

****

181

Internet global subject directories: evaluation criteria (Part 1)

• Is usage free of charge?

• Wide coverage?

• Up to date? Frequent updates? Only few dead / broken links?

• Good coverage of the sources in that part of the world in which you are interested?

• Does the manager of the directory refuse to give priority to sites that want to pay to get a prominent place in the directory?

***-

182

Internet global subject directories: evaluation criteria (Part 2)

• Easy user interface?

• Short response times?

• Are mirror sites available closer to you for faster response?

• Good presentation, description of each site?

• Is a rating, appreciation, review offered for each listed site?

• Is translation of documents offered free of charge?

***-

183

Internet global subject directories: evaluation criteria (Part 3)

• Good documentation and online help?

• Good help desk available?

• High stability and reliability?

***-

184

Internet global subject directories: evaluation criteria (Part 4)

• Are other services offered from the same site or with the same interface? Is the subject directory integrated with other services? Additional services can be

»an Internet index or a WWW index or a gateway to such an index for searching with a query

»travel guides, flight and hotel reservations, maps,...

»WWW-based e-mail and e-mail address directories

»auctions through WWW

***-

185***-

Internet subject directories: non-global, more specific systems

a directory limited to sources in/of a country or region

a directory restricted to a specific subject domain (“portal”)

a global subject directory

the complete WWW

can lead to

186

Internet subject directories focusing on a specific subject domain

• Computer science & engineering: http://www.ub.lu.se/eel/

• Marine science and oceanography: http://oceanportal.org/

***-Examples

        

187

Internet indexes: automated search tools

• Several systems allow you to search for and locate many items (addressable resources) in the Internet in a more systematic, direct way than by only browsing/navigating.

• These systems do NOT search the contents of computers through the real Internet in real time and completely when a user makes a query. Searching in that way would be much too slow due to limitations in the technology.

****

188

Internet indexes: scheme of the mechanism

****

User searching for Internet based information

Internet client hardware and software

user interface to a search engine Internet information source

Internet index search engine Internet crawler and indexing system

database of Internet files, including an index

189

Internet indexes: description of the mechanism

Each of these search systems is based on:

• a database of links to pages / URLs, which can be retrieved by searching with queries through a big index that is built automatically on the basis of the contents (the texts) of these pages (to build this database and to keep it up to date, pages are continuously collected from the Internet by a “robot” computer software system)

• a search system with a user interface in a WWW form, to allow the user to search through that database
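The two components above can be sketched in a few lines: an index built automatically from page texts, and a search function that consults only that index. The page addresses and texts are invented for illustration, and the robot that would collect the pages is left out; real systems are far more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical pages that a robot might have collected earlier.
PAGES = {
    "http://example.org/a": "Library catalogues on the Internet",
    "http://example.org/b": "Searching the Internet with an index",
}


def tokenize(text):
    """Split a text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page addresses whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(PAGES)
print(search(index, "internet index"))
```

The query is answered from the prepared index alone, without contacting the pages themselves, which is what makes fast responses possible.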

****

190

Internet indexes: AltaVista

***-Example

The primary search interface can be found in the US:

http://www.altavista.com/

http://www.av.com/

(These addresses all lead to the same information.)

Mirror site in UK:

http://www.altavista.co.uk/

191

Internet indexes: AltaVista: features

• Allows full text searching of the WWW

• Allows advanced Boolean searching (in “Advanced” mode)

• Offers relevance ranking of search results

• Offers a link to an Internet subject directory (Looksmart)

• Offers links to systems to find images, sounds,… (multimedia) in the Internet

***-Example

192

Internet indexes: Fast = All the Web

***-Example

• The search interface can be found at: http://www.alltheweb.com/

• You can search the WWW and ftp servers.

• The database is one of the biggest.

• Not only HTML and plain text files, but also the full text of many Adobe PDF files is indexed.

193

Internet indexes: Google (Part 1)

• http://www.google.com/

• One of the most popular systems in 2001, 2002.

• For retrieval an algorithm is used that takes into account the links between WWW pages. A retrieved page is ranked higher when

»many sites/pages point to it

»“important” sites/pages point to it
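The two ranking conditions above can be illustrated with a simplified link-analysis sketch in the spirit of the published PageRank idea. This is not Google's actual implementation; the link graph, the damping value of 0.85 and the fixed number of iterations are assumptions chosen for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Estimate page importance from links by repeated redistribution of rank."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page receives a small base amount, independent of links.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, divided over the pages it points to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / len(pages)
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D point to C; C points to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = link_rank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ranks highest because many pages point to it, and A ranks above B and D because the "important" page C points to it.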

****Example

194

Internet indexes: Google (Part 2)

• Full text searching is possible of many files that are available through the WWW.

• Not only HTML and plain text pages are covered: the first part of many files in other formats is also indexed, such as Adobe PDF, Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Rich Text Format,…

****Example

195

!? Question !? Task !? Problem !?

In spite of the popularity of the Google Internet index, there are limitations in the search features.

Which limitations?

***-

196

Internet indexes: Google limitations

• Google does NOT offer/allow

»manual or automatic stemming, manual or automatic truncation

»automatic classification of WWW pages

***-Example

197

Internet indexes: Google additional features

• Besides a system to search for WWW pages, Google also offers

»a subject directory

»searching for images on the WWW

»searching an archive of Usenet messages + posting to Usenet groups

• Thus Google has become a great integrator / aggregator.

****Example

198

!? Exercise !? Task !? Problem !?

Read the manual and

make a search with Google.

***-

199

Internet indexes: MSN Web Search

• Offered free of charge by Microsoft.

• You can search for WWW content.

• Since 1998.

• Famous system, because the search interface can be found with the search functions that have been built into one of the most widespread Internet browsers, Microsoft Internet Explorer, and because it is offered by http://search.msn.com/

• Is based on an Internet index created by another company.

***-Example

200

Internet indexes: Scirus

• Allows you to search for manually selected scientific information (only) on the WWW, including access controlled sites, such as the peer-reviewed articles in the journals that are published in ScienceDirect by Elsevier.

• Offered free of charge by Elsevier.

• Is partly based on the Fast WWW search system that is also used by Alltheweb.

• The search interface: http://www.scirus.com

***-Example

201

Internet indexes: Scirus features

• Offers access to information ordered according to some classification system / taxonomy.

• Offers not only access to files in html format, but also to files in PDF, PostScript and other formats.

***-Example

202

Internet indexes: coverage / size of each index

The indexes grow and their “size ranking” is variable.

Biggest systems in 2002:

• Google !

• AltaVista

• (Fast =) All the Web (serving also Lycos)

• Systems based on the INKTOMI database of WWW pages, such as Hotbot, MSN Web search,…

****

203

!? Exercise !? Task !? Problem !?

Try to find Internet sources which are relevant for you, by using an Internet index.

****

204

Internet indexes: variations among various systems

• Besides their common aims and characteristics, we can nevertheless see differences, variations among the searchable Internet index systems.

• To illustrate these variations and to assist Internet users to make a decision on which search system to use, the following list of some features and evaluation criteria can be useful.

***-

205

Internet indexes: evaluation criteria (Part 1)

• Is usage free of charge?

• How complete is the coverage?

• Is the coverage good (or poor) for a particular geographic region?

• Is the coverage good (or poor) for a particular type of documents?

• Is the searchable database up to date? Is the database updated frequently? Do the search results contain only few dead (broken) links?

***-

206

Internet indexes: evaluation criteria (Part 2)

• Is spamming filtered out, to give other pages a better chance of turning up in the result set? Can the system cluster presumed duplicate documents in the results? Or does the system simply eliminate presumed duplicate documents from its database?

• Does the database system work with a full text indexing of each ASCII and HTML document that has a place in the database, so that full text searching is possible?
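The question about presumed duplicate documents can be made concrete with a small sketch that groups pages whose normalized text is identical. The sample pages are invented, and exact matching after normalization is only one simple assumption; real systems may also detect near-duplicates.

```python
import hashlib


def fingerprint(text):
    """Hash the text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_duplicates(pages):
    """Return lists of page addresses that share the same fingerprint."""
    clusters = {}
    for url, text in pages.items():
        clusters.setdefault(fingerprint(text), []).append(url)
    return [urls for urls in clusters.values() if len(urls) > 1]


# Hypothetical pages; the first two differ only in case and spacing.
SAMPLE = {
    "http://example.org/a": "Information  Retrieval slides",
    "http://mirror.example.org/a": "information retrieval slides",
    "http://example.org/b": "Library catalogue",
}
print(cluster_duplicates(SAMPLE))
```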

***-

207

Internet indexes: evaluation criteria (Part 3)

• Are the contents of meta-fields also indexed to make them searchable?

• Does the system index also the text in files on the web that consist of non-ASCII codes to make these also searchable and retrievable? For instance files in the format of the various versions of

»Microsoft Word

»Microsoft PowerPoint

»Adobe Acrobat (Portable Document Format)

***-

208

Internet indexes: evaluation criteria (Part 4)

• Field indexing, so that searching for the contents of a particular field is possible? for instance:

the HTML title, HTML keywords,

URL, date,

link, Java applet,

text, image file,

sound file, video file,...

***-

209

Internet indexes: evaluation criteria (Part 5)

• Does the system offer powerful search options like

»truncation?

»word stemming?

»Boolean search combinations?

»proximity searching?

»automatic translation of your search terms in several other languages?

»spelling check of your search terms?
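Two of the search options listed above, truncation and Boolean combinations, can be sketched on a tiny index that maps words to document numbers. The index contents and the convention of a trailing `*` for truncation are assumptions made for illustration.

```python
# Hypothetical index: each word maps to the numbers of the documents containing it.
INDEX = {
    "library": {1, 2},
    "libraries": {3},
    "catalogue": {2, 3},
    "internet": {1, 3},
}


def lookup(index, term):
    """Return the documents for a term; a trailing * means truncation."""
    if term.endswith("*"):
        stem = term[:-1]
        documents = set()
        for word, postings in index.items():
            if word.startswith(stem):
                documents |= postings
        return documents
    return set(index.get(term, set()))


# Boolean combinations map directly onto set operations.
print(lookup(INDEX, "librar*") & lookup(INDEX, "catalogue"))   # AND
print(lookup(INDEX, "library") | lookup(INDEX, "libraries"))   # OR
print(lookup(INDEX, "internet") - lookup(INDEX, "catalogue"))  # NOT
```

Without truncation, a search for "library" misses document 3, which contains only the plural form.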

***-

210

Internet indexes: evaluation criteria (Part 6)

• Can the results be limited to a certain time period? For instance based on the date

»of the file as noted by the server computer, or

»of the most recent indexing of the file

• Is the user interface easy to understand and efficient to use?

• Is a user interface offered in your own language?

• Does the system rank the items in the result set according to their presumed relevance?
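Ranking by presumed relevance can be sketched with a simple word-frequency weighting: a document scores higher when the query words make up a larger share of its text, and rarer words count more. The documents and the exact weighting formula are assumptions for illustration, not the method of any particular search system.

```python
import math
import re
from collections import Counter

# Hypothetical documents, used only to demonstrate the ranking.
DOCS = {
    "doc1": "internet index internet search",
    "doc2": "library catalogue search",
    "doc3": "internet library catalogue",
}


def words_of(text):
    """Split a text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_relevance(docs, query):
    """Return names of matching documents, highest presumed relevance first."""
    tokenized = {name: words_of(text) for name, text in docs.items()}
    total = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in set(words_of(query)):
            containing = sum(1 for other in tokenized.values() if term in other)
            if containing and counts[term]:
                term_share = counts[term] / len(words)     # frequency in this document
                rarity = math.log(1 + total / containing)  # rarer words weigh more
                score += term_share * rarity
        if score > 0:
            scores[name] = score
    return sorted(scores, key=lambda name: (-scores[name], name))


print(rank_by_relevance(DOCS, "internet"))
```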

***-

211

Internet indexes: evaluation criteria (Part 7)

• Possibility to combine Boolean retrieval with relevance ranking of results?

• Can the results be ordered according to date

»of the file as noted by the server computer, or

»of the most recent indexing of the file

• Can the results be ordered according to size?

• Can all the results (documents) from the same site be grouped together (clustered)?
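Grouping results from the same site can be sketched by clustering the result addresses on their computer name. The result list below is invented for illustration.

```python
from urllib.parse import urlsplit


def cluster_by_site(urls):
    """Group result URLs by host name, keeping the original order."""
    clusters = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        clusters.setdefault(host, []).append(url)
    return clusters


# Hypothetical result list.
RESULTS = [
    "http://www.vub.ac.be/BIBLIO/index.html",
    "http://www.google.com/",
    "http://www.vub.ac.be/",
]
for site, pages in cluster_by_site(RESULTS).items():
    print(site, pages)
```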

***-

212

Internet indexes: evaluation criteria (Part 8)

• Can the system rank the results (documents) on the basis of the number of WWW hyperlinks to that document?

• Does the system refrain from placing / ranking some results (documents) higher in the results list on the basis of payments by the producer of those documents to the search system company?

• Are advertisements / sponsored links / sponsored results clearly distinguished from normal (not sponsored) search results?

***-

213

Internet indexes: evaluation criteria (Part 9)

• Short response times?

• Are mirror sites available closer to you for faster response?

• Does the system offer a good presentation format of each result (document/page/item)? For instance: are search terms indicated / highlighted in the results?

• Good and detailed summary of each result available?

• Offers an analysis of words occurring in the results, which can help you to refine a search?

***-

214

Internet indexes: evaluation criteria (Part 10)

• Is translation of documents offered free of charge?

• High stability and reliability? No large variations/fluctuations in the results from identical searches at different times.

• Good documentation and online help?

• Good help desk available?

• Can the search system provide updated results through electronic mail, as a current awareness tool?

***-

215

Internet indexes: evaluation criteria (Part 11)

• Other services available besides the normal WWW index:

» index to news resources, that is more frequently updated?

»anonymous ftp file index?

»gopher index?

»searchable Usenet newsgroups archive?

»Internet subject directory?

»White pages = people finder = addresses = ...

»WWW-based e-mail and e-mail address directories

»auctions through WWW

***-

216

Internet indexes: evaluation criteria (Part 12)

• Is the search/query also submitted to another database to obtain more results? for instance: to a book database to obtain book descriptions besides WWW documents

***-

217

Internet indexes: evaluation criteria (Part 13)

• Are results (retrieved documents) grouped / classified / clustered by the search system on the basis of the subjects of the documents, and are these presented as groups / clusters / classes to the user of the search system, to assist the user in coping with the problems that can be caused, for instance, by multiple meanings of words used in a search query?

***-

218

!? Question !? Task !? Problem !?

Why do different Internet search engines (in most cases)

give different results for an identical search?

***-

219

Internet information sources

Coverage of Internet directories and Internet indexes

****

A global Internet index

A global Internet directory

220****

Global Internet search tools: a comparison

Global Internet directories

• Only a limited selection of Internet sources

• Browsing information sources is easy

• Good for broad searches

Global Internet indexes

• About 1/3 of the Internet is covered by an index

• Searching requires some skills and knowledge

• Good for specific, narrow searches

Multi-threaded search systems

• These get information from directories and indexes

• Searching requires some skills and knowledge

• Good when even 1 index does not yield information
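How a multi-threaded search system might combine what it gets from several directories and indexes can be sketched as merging ranked result lists. The round-robin interleaving and the sample lists are assumptions for illustration; actual systems may merge and rank in other ways, and the querying of the underlying systems is left out here.

```python
def merge_results(*result_lists):
    """Interleave ranked result lists round-robin, dropping duplicate URLs."""
    merged = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for position in range(longest):
        for results in result_lists:
            if position < len(results) and results[position] not in seen:
                seen.add(results[position])
                merged.append(results[position])
    return merged


# Hypothetical ranked results from two different search systems.
FROM_INDEX_1 = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
FROM_INDEX_2 = ["http://example.org/b", "http://example.org/d"]
print(merge_results(FROM_INDEX_1, FROM_INDEX_2))
```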

221

!? Question !? Task !? Problem !?

Which information on the Internet is not covered

by many searchable Internet indexes?

***-

222

Internet indexes cover only a part of the Internet: introduction (1)

***-

The “visible” part of the Internet

The “hidden, invisible” part of the Internet and the WWW (that is not searchable using a global index like AltaVista, Google...)

223

Internet indexes cover only a part of the Internet: introduction (2)

***-

Why can Internet indexes find only a part of what is in fact available through the Internet?

1. Quantitative technical limitations: Each Internet search system has indexed only a part of the static WWW pages that are available for indexing.

2. Qualitative technical limitations: Besides the static WWW pages that Internet search engines try to cover, many other, quite different sources exist, that are also available through the Internet, but that are not incorporated in those search engines.

224

Internet indexes cover only a part of the Internet: scheme

***-

[Scheme. Labels: Internet; WWW; telnet, ftp, ...; databases and file archives accessible through the Internet (CGI, ASP, ...); rapidly changing information, such as news; information accessible only when passwords are used; Word files; PDF files; static indexable texts in the WWW (= on HTTP server computers), covered partly by Internet indexes.]

225

Internet indexes cover only a part of the Internet: conclusion for users

When you want to retrieve information about a particular subject from the Internet, use not only WWW indexes but also other sources accessible through the Internet:

»databases! (book and journal bibliographies, library catalogues, archives of group messages, directories, atlases,…)

»rapidly changing information, such as news

» information accessible only when passwords are used

»anonymous ftp file archives

»e-mail based interest groups; Usenet newsgroups

***-

226****

Finding multimedia files on the Internet

Several public access search systems are available free of charge to search the Internet for multimedia files:

»images / pictures (artwork, photos, or both)

»sound / audio files (music, speeches,...)

»video

227****

Finding images on the Internet: introduction

• Several public access search systems are available free of charge to search for images / pictures (artwork, photos, or both) on the Internet.

• When searching for images, the search results from such a system offer not only links to the image files on the Internet, but also directly small versions of the images (so-called “thumbnails”).

228****Examples

Finding images on the Internet: examples of search engines

• http://alltheweb.com !!!

• http://gallery.yahoo.com/ !

• http://images.google.com/ !!!! or through http://www.google.com/

• http://multimedia.lycos.com/

• http://www.altavista.com/ !! (also audio and video; in the user interface, choose IMAGES instead of the normal text search)

• http://www.ditto.com/ !

229**** Examples

Finding images on the Internet: screen shot of a Google image search

230

!? Exercise !? Task !? Problem !?

Use a specialised search engine to find images

about a particular subject on the Internet.

***-

231

Online access information sources and services

Public access book databases

****

232

Public access book databases: introduction

• Even in this age of Internet-based information sources, a lot of information is still distributed in the form of printed books.

• The contents of most books are (still) not available on the Internet.

• Most Internet search tools do NOT allow you to find out about the existence of books that may be interesting for you.

• So, specific search tools to find books can be useful.

****

233

Public access book databases: an overview

• (Databases by publishers.)

• Databases by book distributors / bookshops!

• Online public access library catalogues

• (Databases of computer-based versions of books.)

****

234

Public access book databases provided by bookshops

• To find currently available books, the bibliographic databases assembled by big bookshops are interesting.

• Several offer a good coverage and are accessible free of charge.

****

235

Book databases accessible free of charge: examples in U.S.A.

• Amazon.com (US): http://www.amazon.com/ and http://www.amazon.co.uk/ (note: amazon, NOT amazone). Subject description is poor.

• Barnes and Noble (US): http://www.bn.com/

****Examples

236

Book databases accessible free of charge: examples in Europe

• Blackwell’s on the Internet (International, academic books): http://www.blackwell.co.uk/

• VLB for books in German: http://www.buchhandel.de/

• For books in French: http://www.chapitre.com

• Boeknet - De Nederlandse Internet Boekhandel (Dutch): http://www.boeknet.nl/

***-Examples

237

Book databases accessible free of charge: for old books

To find used, secondhand, rare, hard-to-find and out-of-print books around the world:

• abebooks http://www.abebooks.com/

• Virtual Book Shop http://www.bookshop.com/

***-Examples

238

Free public access bibliographic book database + price comparisons

• Even comparisons across the catalogues of bookshops (as well as of shops for music, movies and many other goods) are available free of charge.

• See for instance

»http://www.bookfinder.com/

»http://www.dealtime.com/

****

239

Example of an international public access dissertation database

• The dissertation database of UMI is available from: http://wwwlib.umi.com/dissertations/

• The most current two years are available without charge.

***-Examples

240

!? Question !? Task !? Problem !?

Search for titles of books which are relevant for you,

using an online database provided by a book publisher or bookshop.

****

241

Public access book databases: evaluation criteria (Part 1)

• Is usage free of charge?

• Wide coverage? Also for books in your preferred language?

• Specialized coverage for particular subjects?

• Up to date? Frequent updates?

• Abstracts, summaries, descriptions, tables of contents included?

• Full text indexing of each item in the database, so that full text searching is possible?

***-

242

Public access book databases: evaluation criteria (Part 2)

• Field indexing, so that searching for the contents of a particular field is possible? For instance:

»the title

»the date of publication

»the author

»the publisher

»the language
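Field indexing can be illustrated with a minimal sketch. The two book records and the function name `build_field_index` are invented for this example; the point is only that the same word can be found in one field without matching it in another.

```python
from collections import defaultdict

def build_field_index(records):
    """Map (field, word) pairs to the set of record numbers containing that word."""
    index = defaultdict(set)
    for number, record in enumerate(records):
        for field, value in record.items():
            for word in str(value).lower().split():
                index[(field, word)].add(number)
    return index

# Hypothetical book descriptions.
books = [
    {"title": "Information retrieval", "author": "Smith", "date": 2001, "language": "English"},
    {"title": "Smith on libraries", "author": "Jones", "date": 1998, "language": "English"},
]

index = build_field_index(books)
by_author = sorted(index[("author", "smith")])  # books written by Smith
by_title = sorted(index[("title", "smith")])    # books with "Smith" in the title
print(by_author, by_title)
```

Without field indexing, a search for "smith" would return both books and the user would have to separate them manually.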

***-

243

Public access book databases: evaluation criteria (Part 3)

• Does the database producer improve retrieval by

»adding subject terms, or

»classifying the books in categories?

• Powerful search options:

» truncation? stemming?

»Boolean search combinations? proximity searching,…?

»spelling check of your search terms?

»translation of your search terms into several other languages?
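Two of these search options, truncation and Boolean combinations, can be sketched as follows. The titles are invented, and the use of `*` as truncation symbol is an assumption for this example; real systems use various symbols and offer more options (stemming, proximity searching) that are not shown here.

```python
def build_index(titles):
    """Map each word to the set of title numbers in which it occurs."""
    index = {}
    for number, title in enumerate(titles):
        for word in title.lower().split():
            index.setdefault(word, set()).add(number)
    return index

def search(index, term):
    """Return title numbers for a term; a trailing * means right-hand truncation."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        hits = set()
        for word, numbers in index.items():
            if word.startswith(prefix):
                hits |= numbers
        return hits
    return set(index.get(term, set()))

# Hypothetical book titles.
titles = [
    "Library catalogues online",
    "Libraries and the Internet",
    "Internet search engines",
]

index = build_index(titles)
librar = search(index, "librar*")     # matches "library" and "libraries"
internet = search(index, "internet")

print(sorted(librar & internet))  # librar* AND internet
print(sorted(librar | internet))  # librar* OR internet
print(sorted(internet - librar))  # internet NOT librar*
```

The Boolean operators correspond directly to set operations on the lists of matching records.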

***-

244

Public access book databases: evaluation criteria (Part 4)

• Easy user interface?

• Is a user interface offered in your own language?

• Relevance ranking of results?

• Possibility to combine Boolean retrieval with relevance ranking of results?

• Can results be limited to a certain time period?

• Can the results be ordered according to date, size, origin,...?
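Limiting results to a time period and ordering them by date can be sketched in a few lines. The records and the function name `limit_to_period` are invented for this illustration.

```python
def limit_to_period(records, first_year, last_year):
    """Keep only the records published from first_year up to and including last_year."""
    return [r for r in records if first_year <= r["year"] <= last_year]

# Hypothetical search results.
records = [
    {"title": "A", "year": 1995},
    {"title": "B", "year": 2002},
    {"title": "C", "year": 2000},
]

recent = limit_to_period(records, 2000, 2003)
newest_first = sorted(recent, key=lambda r: r["year"], reverse=True)
ordered_titles = [r["title"] for r in newest_first]
print(ordered_titles)
```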

***-

245

Public access book databases: evaluation criteria (Part 5)

• Good presentation of each result?

• Does the system offer a current awareness service, sending information on new titles that may be of interest to you?

• Short response times?

***-

246

Public access book databases: evaluation criteria (Part 6)

• Are other services offered from the same site or with the same interface? Is the system integrated with other services? Additional services can be:

»searchable databases of videos, music CDs, CD-ROMs and DVDs, all of them also for sale

»a subject directory for browsing, besides the database with index for searching

»WWW-based e-mail and e-mail address directories

»auctions through WWW

***-

247

Online access information sources and services

Library Online Public Access Catalogues

= OPACs

****

248

Online Public Access Catalogues of libraries

****

• Mainly to find older books, the catalogues of libraries can be useful.

• Most are accessible online and free of charge.

249

Online Public Access Catalogues = OPACs: definition

***-

Online Public Access Catalogue:

a term used to describe any type of computerized library catalog offered to the public by online login

250

Online access library catalogues: The British Library

• Accessible online via WWW since 2000: http://blpc.bl.uk/

• Access free of charge

***-Example

251

Online access library catalogues: The British Library: screenshot

***-Example

252

Online access information sources and services

Fee-based online public access information services

****

253

Types of online access information systems: “free” versus “fee”

• A lot of the information on the Internet is available free of charge, but another part is only accessible when a fee is paid to the producer and / or the distributor.

• Some organisations pay these fees for some sources and then organise access, so that the members of the organisation can retrieve and exploit the information as if it were free of charge.

• The first commercial computer systems that made information available online appeared around 1975.

• Most of them are now also available through the Internet.

****

254

Fee-based online access services: examples (Part 1)

Name (location of the computer(s))

• America On Line (U.S.A.)

• OCLC (U.S.A.)

• Ovid Technologies (U.S.A.)

• CompuServe (U.S.A.)

• Cambridge (U.S.A., Taiwan, UK)

• Data-Star (Switzerland)

• Dialog (U.S.A.)

• EBSCO (U.S.A.)

***-Examples

255

Fee-based online access services: examples (Part 2)

Location of the computer(s)

U.S.A.; U.S.A.; U.S.A.; U.S.A., The Netherlands, ...; Germany - U.S.A. - Japan; The Netherlands; ...

Name

Elsevier ScienceDirect; Factiva; ISI (Web of Science, JCR,…); LexisNexis; MSN (Microsoft); Prodigy; Silver Platter; STN; Swets (e-journals); ...

***-Examples

256

Online information services: various names for similar systems

• (fee-based) online (access) information service

• (fee-based) online (access) computer service

• databank

• database vendor

• host computer

• aggregator

• ...

***-

257

Online information services: total size of their databases

In 1999:

The big host systems and the public access WWW pages offer a comparable quantity of information:

• WWW offered about 8 terabytes (= 8 000 gigabytes) of text data

(according to Lawrence and Lee Giles, Nature, 1999, Vol. 400, pp. 107-109.)

• Dialog offered about 9 terabytes (= 9 000 gigabytes) (in 1998)

»6 billion pages of text

»3 million images
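As a rough plausibility check of the Dialog figures above, the average size of a page can be estimated. The calculation assumes decimal units (1 terabyte = 1 000 gigabytes, as on this slide) and assigns the whole 9 terabytes to the text pages, so the result is an upper bound, because the images are included in the total.

```python
TERABYTE = 10**12  # decimal terabyte, matching "9 terabytes = 9 000 gigabytes"

dialog_bytes = 9 * TERABYTE   # total size quoted for Dialog (1998)
dialog_pages = 6 * 10**9      # "6 billion pages of text"

bytes_per_page = dialog_bytes / dialog_pages
print(bytes_per_page)  # 1500.0
```

About 1 500 characters per page is a plausible order of magnitude for a page of plain text.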

****

258

Database hosts / distributors: evaluation criteria (Part 1)

• Contract required?

• Payment in advance required?

• Stability / history / evolution / future of host?

• Low costs of data communication?

• Many databases available?

• Whole records available (or only parts)?

• Frequent updates?

• Whole database available? As one file or fragmented?

***-

259

Database hosts / distributors: evaluation criteria (Part 2)

• Price of access? Price of information?

• Powerful search options: truncation, Boolean combinations, proximity searching,…?

• Can the indexes of more than one database be searched simultaneously?

• Speed of retrieval?

• Relevance ranking of results?

• Fast response? Accuracy of data communication?

• Clear output format?

***-

260

Database hosts / distributors: evaluation criteria (Part 3)

• Online indication of costs?

• Easy user interface?

• Practice free of charge?

• Good manuals, documentation and online help?

• Training courses available? Quality?

• Good help desk available?

• Gateway service offered?

• ...

***-

261

Databases of online public access databases

• Example

»Gale directory of databases !

• Their coverage:

»online access databases

»(databases accessible on CD-ROM)

»...

***-

262

Databases of databases: Gale

• Produced in U.S.A.

• Not free of charge

• Available in various formats:

»printed

»on CD-ROM

»online via the host systems Data-Star, Dialog, with a payment required for each use

»online through the Internet through various hosts, for a fixed price per year to be paid in advance

***-

263

Online access information sources and services

Online access databases about journal articles

****

264

Online access databases about journal articles: overview

• Thousands of fee-based online access databases offer bibliographies or full-texts of journal articles in particular subject domains and published by many publishers.

• Many publishers (for instance Emerald, Elsevier) offer searchable bibliographies, but only of their own publications.

• Only a few large databases offer free-of-charge access to bibliographies of articles published in journals from many publishers.

****

265

Online access databases about journal articles: Ingenta (1)

• Ingenta Journals allows you to search a bibliographic database of millions of journal articles, including titles, authors and in many cases abstracts.

• Searching is free of charge.

***-Example

266

Online access databases about journal articles: Ingenta (2)

• Payment is required to receive the full text of an article.

• Ingenta acquired Uncover in 2000.

• Available from

»http://www.ingenta.co.uk/

»http://www.ingenta.com/

***-Example

267

Online access databases about journal articles: Article@INIST

• Article@INIST allows you to search a bibliographic database (NOT full-text) of journal articles, journal issues, books, reports, conferences and doctoral dissertations at the Institut de l'Information Scientifique et Technique, France.

• Searching is free of charge.

• Available from http://form.inist.fr/public/eng/conslt.htm

• Payment is required to receive the full text of an article.

****Example

268

Online access databases about journal articles: Infotrieve

• Infotrieve allows you to search free of charge in a bibliographic database of the articles of more than 20 000 journal titles and conference proceedings, NOT full-text.

• Available from http://www3.infotrieve.com/

• Payment is required to receive the full text of a document.

• Current awareness services are also offered free of charge: the tables of contents of new issues of the journals that you have selected are sent to you by email.

***-Example

269

!? Question !? Task !? Problem !?

Search for titles of journal articles which are relevant for you,

in a database provided free of charge.

***-

270

Online access information sources and services

Electronic newsletters and journals

***-

271

Electronic newsletters and journals: introduction

***-

• Since the end of the 1990s, electronic journals have become a new communication medium that cannot be neglected.

Author / Sender Editor Reader / Receiver

272

Online access information sources and services

Conclusion

***-

273

Online access information: future trends

• An increasing amount of information becomes available online.

• A growing amount of this online information becomes available free of charge.

• The quality of server and client software is improving.

A consequence is:

• An increasing number of end-users searching for information online.

****

274

Online access information: conclusion

• In the case of simple information needs, the WWW and the search tools can work like “magic”.

• However, in the case of more complicated information needs, there is still no “magic button” that brings you immediately to all the required information.

****