Vocabulary & languages in searching

Connection: indexing & searching

© Tefko Saracevic

Basic assertion

Indexing and searching: inexorably connected
– you cannot search what was not first indexed in some manner or other
– indexing of documents or objects is done so that they are searchable
  • there are many ways to do indexing
– to index, one needs an indexing language
  • there are many indexing languages
– even taking every word in a document is an indexing language

Knowing searching is knowing indexing

General definitions

Vocabulary [Encarta Dictionary]

“1. words known
LANGUAGE - all the words used by or known to a particular person or group, or contained in a language as a whole”

Language

“1. speech of group
the speech of a country, region, or group of people, including its diction, syntax, and grammar

2. system of communication
a system of communication with its own set of conventions or special words”

From general to specific

• These general definitions are valid for application in indexing & searching to define
– index terms
– indexing vocabulary
– indexing language
– search terms
– search vocabulary
– query (request, search) language

Specific

Index term
a word or phrase that denotes (describes) a concept & connotes (implies) a class

Example: index term “table” describes a [image] and implies many kinds of tables: [images] for which, if desired, we may have more specific index terms

Specific ...

Indexing vocabulary
a set of index terms used in a domain or for a set of documents or objects
  • it could be even a single document or object, e.g. a book

Indexing language
an indexing vocabulary together with rules – syntax, grammar – for their application and use

Specific ...

Search terms
a counterpart to index terms, also denoting a concept and connoting a class for a search

Search vocabulary
a set of search terms in a domain or available in a system

Query language
a search vocabulary together with rules for their use in searching

More

“An index language is the language used to describe documents and requests.

The elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.

The vocabulary of an index language may be controlled or uncontrolled.”

(van Rijsbergen, 1979)

Controlled vocabulary

• Predetermined
– indicates what terms are to be used in indexing
– may show definitions of and relations between terms
  • examples: thesaurus, subject heading list, classification
• Also indicates terms that may be selected for searching
• An indexing AND a searching tool
• Human constructed
– and costly to construct and use

Uncontrolled vocabulary

• Derived from documents
– nowadays automatically
  • using various ways or algorithms
– constant issue: which way is “better”
• Used to construct inverted indexes
  • a concordance, such as one of the Bible, indicating the place and position of each word mentioned in the text, is an inverted index
  • monks used to do it in 12th century, computers do it today
• Inverted indexes are used for free text searching

Controlled vs. free text searching

• Endless source of debate & controversy

• But each has its place for a given circumstance & retrieval goal

• Each has strengths & weaknesses

• can you list or find a list comparing them?

• Users mostly use free text searching

• Professional searchers use both as warranted

• As an option: KNOW THY CONTROLLED VOCABULARY

Inverted indexes

Useful to know how they function to understand search & retrieval. Steps:

1. Each document is indexed
– every word in a document is taken as an index term, with the exception of stop words
– position in text is noted
2. Indexes for all documents are merged
  • index terms are arranged alphabetically in the bowels of the system
  • under each index term are the numbers of the documents in which it appears & its position in the text of each document
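The two steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not how any particular retrieval system is implemented; the function names `index_document` and `merge_indexes`, the short stop-word list and the two sample documents are all assumptions made for the example.

```python
from collections import defaultdict

# Illustrative stop-word list (an assumption; real systems use longer lists).
STOP_WORDS = {"a", "of", "in", "the"}


def index_document(text):
    """Step 1: map each non-stop word to its 1-based positions in the text.

    Stop words are not indexed, but they still count when positions are
    numbered. Punctuation is not handled in this sketch.
    """
    positions = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in STOP_WORDS:
            positions[word].append(position)
    return dict(positions)


def merge_indexes(documents):
    """Step 2: merge per-document indexes into one alphabetical inverted index.

    `documents` maps a document number to its text. The result maps each
    index term to a list of (document number, position) pairs.
    """
    merged = defaultdict(list)
    for doc_number in sorted(documents):
        for term, positions in index_document(documents[doc_number]).items():
            for position in positions:
                merged[term].append((doc_number, position))
    # Arrange the index terms alphabetically, as in step 2.
    return {term: merged[term] for term in sorted(merged)}


# Hypothetical sample documents.
sample_documents = {
    1: "Digital libraries in the classroom",
    2: "History of public libraries",
}
inverted_index = merge_indexes(sample_documents)
for term, postings in inverted_index.items():
    print(term, postings)
```

Keeping the position of every word, and not only the document number, is what later makes phrase searching possible.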

So, when you search

for digital AND libraries:
1. computer takes all documents under digital
2. and all documents under libraries
3. compares to “see” which documents have both terms and then
4. provides you the list of those documents in a default format, or you may choose a format

• This is also called “coordinate indexing”
– coordination is done at time of searching
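The comparison in step 3 amounts to intersecting two lists of document numbers. Below is a minimal Python sketch, assuming each term's document numbers are kept as a sorted list; the function name `intersect` and the document numbers are made up for illustration.

```python
def intersect(postings_a, postings_b):
    """Return document numbers present in both sorted posting lists."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # Document appears under both terms: keep it.
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical document numbers listed under each index term.
index = {
    "digital": [2, 5, 8, 13],
    "libraries": [1, 2, 8, 9, 21],
}

# digital AND libraries
hits = intersect(index["digital"], index["libraries"])
print(hits)  # [2, 8]
```

Walking both sorted lists in step means each list is read only once, which is one reason an index kept in order is convenient for Boolean searching.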

Variation: when you search

for digital (WITH) libraries or “digital libraries”, i.e. as a phrase
1. computer goes through the same steps as before but then also
2. “looks” for documents where digital is positioned right before libraries
  • remember: computer “knows” position of each term in each document, each sentence

• So searching for a phrase is a form of searching of terms connected with AND but in a given sequence
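A phrase search can be sketched as an AND search followed by a position check. The Python below is an illustration under assumptions: the index layout (term, then document number, then list of positions), the function name `phrase_search` and all the numbers are hypothetical.

```python
# Hypothetical positional index: term -> {document number: [positions]}.
positional_index = {
    "digital": {1: [1], 2: [3], 3: [6]},
    "libraries": {1: [2], 2: [7], 4: [1]},
}


def phrase_search(index, first, second):
    """Return documents where `second` occurs right after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Step 1: same as AND, keep documents that contain both terms.
    candidates = sorted(first_postings.keys() & second_postings.keys())
    # Step 2: keep documents where the two positions are consecutive.
    matches = []
    for doc_number in candidates:
        following = set(second_postings[doc_number])
        if any(pos + 1 in following for pos in first_postings[doc_number]):
            matches.append(doc_number)
    return matches


print(phrase_search(positional_index, "digital", "libraries"))  # [1]
```

Document 2 contains both words and so would satisfy digital AND libraries, but the words are four positions apart, so it is not returned for the phrase.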

Example of inverted file

Doc #   Text
1       Slow brown truck arrived
2       Shipment of brownies damaged in a fire
3       Delivery of brownies arrived in a slow truck
4       Shipment of brownies arrived in a truck

For simplicity documents have one sentence. Stop words: “a,” “of,” “in.”

Inverted index

Term        (doc number:position)
arrived     (1:4), (3:4), (4:4)
brown       (1:2)
brownies    (2:3), (3:3), (4:3)
damaged     (2:4)
delivery    (3:1)
fire        (2:7)
shipment    (2:1), (4:1)
slow        (1:1), (3:7)
truck       (1:3), (3:8), (4:7)

Search for slow AND truck gets as results documents 1 and 3, since both contain slow and truck.

Search for slow (w) truck retrieves only document 3, in which slow is 7th and truck is 8th, so they are right next to each other. Doc 1 has both words, but not next to each other, thus it is not retrieved.
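The two results can be checked against the postings for slow and truck in the inverted index above. This is a small Python sketch; the variable names are assumptions, and the postings are typed in as (doc number, position) pairs.

```python
# Postings copied from the inverted index above: term -> [(doc, position)].
postings = {
    "slow": [(1, 1), (3, 7)],
    "truck": [(1, 3), (3, 8), (4, 7)],
}

slow_docs = {doc for doc, _ in postings["slow"]}
truck_docs = {doc for doc, _ in postings["truck"]}

# slow AND truck: documents listed under both terms.
and_result = sorted(slow_docs & truck_docs)

# slow (w) truck: truck must sit at the position right after slow.
truck_positions = set(postings["truck"])
adjacent_result = sorted(
    {doc for doc, pos in postings["slow"] if (doc, pos + 1) in truck_positions}
)

print(and_result)       # [1, 3]
print(adjacent_result)  # [3]
```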

Thesaurus

• Good old Peter Mark Roget had a most useful idea & did a great job

• Following this idea, the thesaurus became THE major tool for controlled vocabulary in information retrieval (IR)
– starting in the 1950s & to this day many IR thesauri have been developed
– all have a similar structure & function
– but they are difficult & costly to construct

What is a thesaurus?

“For writers, it is a tool like Roget’s one with words grouped and classified to help select the best word to convey a specific nuance of meaning.

For indexers and searchers, it is an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus.”

(Milstead, 2000)

more…

“A thesaurus to an information scientist is a controlled set of the terms used to index information in a database, and therefore also to search for information in that database so the same concepts are represented by the same term.”

(Batty, 1998)

Basic thesaurus components

• For each entry thesaurus has a classification grid:
– Descriptor (DE) – an index term that has
  • Scope note (SN) – context in which used
  • Broader terms (BT) – higher in a hierarchy
  • Narrower terms (NT) – lower in a hierarchy
  • Related terms (RT) – other connected descriptors
  • Used for (UF) – synonyms that are not descriptors
– Note: not all of these may be present for every descriptor
• A searcher or indexer can use these as a guide for selection/rejection & for browsing to get ideas
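One way to picture these components is as a record with one required field (DE) and several optional ones. The Python sketch below is an assumption-laden illustration: the class name `ThesaurusEntry`, the helper `authorized_term` and the sample entry are invented for the example and are not taken from ERIC or any other real thesaurus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One descriptor with the components listed above; all but DE optional."""
    descriptor: str                                           # DE
    scope_note: Optional[str] = None                          # SN
    broader_terms: List[str] = field(default_factory=list)    # BT
    narrower_terms: List[str] = field(default_factory=list)   # NT
    related_terms: List[str] = field(default_factory=list)    # RT
    used_for: List[str] = field(default_factory=list)         # UF


# Made-up entry for illustration only; not taken from any real thesaurus.
entry = ThesaurusEntry(
    descriptor="automobiles",
    scope_note="Passenger road vehicles",
    broader_terms=["motor vehicles"],
    narrower_terms=["sports cars"],
    related_terms=["highways"],
    used_for=["cars"],
)


def authorized_term(entries, word):
    """Map a word to its descriptor, following UF (synonym) references."""
    for candidate in entries:
        if word == candidate.descriptor or word in candidate.used_for:
            return candidate.descriptor
    return None


print(authorized_term([entry], "cars"))  # automobiles
```

The lookup shows the selection/rejection role: a searcher who types a synonym is pointed to the descriptor that was actually used in indexing.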

Examples of thesauri

• Thesauri have been constructed for a great many domains, from A to Z – here are some lists
  • international & multilingual thesauri
  • online thesauri
  • among them ERIC Thesaurus (we use it as an example)
– BUT: different thesauri may and do treat the same descriptor (index term) differently
  • having different, more or fewer narrower, broader, related terms
  • thus it is dangerous to use them interchangeably

Standard structure

With variations on the theme, thesauri have a similar conceptual structure to guide searcher or indexer:

Descriptor - DE
Broader terms - BT
Narrower terms - NT
Related terms - RT
Used for - UF (synonyms)
Scope note - SN

Note: Not every descriptor has to have all of these

Same thesaurus but …

• Examples of ERIC (Educational Resources Information Center) thesaurus as used differently in different systems:
1. ERIC's own system
2. ERIC file on DIALOG (begin 1)
3. ERIC file on OVID (accessible through RUL)

• Notice how each uses thesaurus displays & search in its own way, but the principles are still the same

• Oh well…

ERIC online thesaurus on ERIC

• Allows for
– searching for words that are included in descriptors, by category or in all categories
– browsing alphabetically
– browsing in one of about 40 categories
• Search for library in all categories found 76 descriptors that have “library” included
• Out of these, we selected library education

ERIC online thesaurus on ERIC: descriptor library education

ERIC thesaurus on DIALOG

• In a convoluted way, the ERIC thesaurus (and other ones) can be displayed on DIALOG (and other vendors, such as OVID)

• How?
– begin in file 1 – ERIC
– then expand a desired term – here we used term library

– you will see under R that certain terms have related terms – meaning that these are thesaurus entries

– then expand on one of those to see related terms

– then you can browse & choose which ones to use in search

• And here are Print Screens of the process

going …

Expand library

going …

45237 items have library

RT indicates related terms

This one has 14 related terms

going …

We now choose descriptor LIBRARY ADMINISTRATION and expand on that one

Neat trick:

You can expand on expand & get related terms

going …

14 related terms for this one are listed

These are now R terms of various types

Can expand on this one to see other RT

You can also select any of these to search

going …

We have now selected r10 – library expenditures

going …

Now we can view some items in a chosen format

or we can further modify this search: add, refine, …

And this is what we got

gone

This is one of the items we got

Descriptors used for this item

Descriptors with * are major

Additional index terms

ERIC thesaurus on OVID (accessed through RUL)

For library ask to map as thesaurus term

going …

There are more down there but we choose this one to expand

going …

Entries for descriptor Electronic Libraries

Continue to search for

AND

going …

Retrieved & ready to display

gone

Choose format you want for this item

Relevance feedback

• Method for using information in items judged relevant to further refine or change the search
– e.g. in relevant items we can browse titles, descriptors, identifiers, abstracts … to get leads for further search terms & tactics

• in some advanced systems this may be done automatically
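As a rough idea of what "automatically" could mean, the Python sketch below counts the descriptors attached to items judged relevant and proposes the most frequent ones that are not yet in the query. It is a simplified assumption, not the method of any specific system; the function name `suggest_terms`, the record layout and the sample records are hypothetical.

```python
from collections import Counter


def suggest_terms(relevant_items, query_terms, how_many=3):
    """Rank descriptors of relevant items that are not already in the query."""
    counts = Counter()
    for item in relevant_items:
        for descriptor in item["descriptors"]:
            if descriptor not in query_terms:
                counts[descriptor] += 1
    return [term for term, _ in counts.most_common(how_many)]


# Hypothetical records that a searcher judged relevant.
relevant_items = [
    {"title": "Item A",
     "descriptors": ["library administration", "library expenditures", "budgets"]},
    {"title": "Item B",
     "descriptors": ["library expenditures", "budgets"]},
    {"title": "Item C",
     "descriptors": ["library expenditures", "academic libraries"]},
]

suggestions = suggest_terms(relevant_items, query_terms={"library expenditures"})
print(suggestions)
```

Here "budgets" comes first because it occurs in two of the three relevant items; the searcher still decides whether to add it to the search.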

Query expansion

• Method for adding, modifying, changing search terms in query
– to broaden, narrow, focus, change … terms
• Many sources can be used
– relevance feedback, thesauri, dictionaries, textbooks, documents, catalogs, & people: users, colleagues, your own mind & experience

• Some systems suggest terms for query expansion
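When a thesaurus is the source, broadening a query can be sketched as OR-ing each search term with its synonyms and narrower terms. The Python below is an illustration under assumptions: the thesaurus fragment, the function name `expand_query` and the generic Boolean output syntax are made up and do not reproduce the query language of DIALOG, OVID or any other system.

```python
# Hypothetical thesaurus fragment: descriptor -> synonyms (UF), narrower terms (NT).
thesaurus = {
    "automobiles": {"UF": ["cars"], "NT": ["sports cars"]},
}


def expand_query(terms, thesaurus, relations=("UF", "NT")):
    """Broaden each search term by OR-ing it with its thesaurus terms."""
    groups = []
    for term in terms:
        alternatives = [term]
        for relation in relations:
            alternatives.extend(thesaurus.get(term, {}).get(relation, []))
        quoted = ['"%s"' % alt for alt in alternatives]
        groups.append("(" + " OR ".join(quoted) + ")")
    # The original terms stay connected with AND.
    return " AND ".join(groups)


print(expand_query(["automobiles", "safety"], thesaurus))
# ("automobiles" OR "cars" OR "sports cars") AND ("safety")
```

Passing `relations=("UF",)` would add synonyms only, which broadens the search less than also including narrower terms.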

Conclusion

• At the base of all searching are
– terms
– vocabularies
– languages
– but a variety exists
• In reality in searching there is no completely controlled or uncontrolled vocabulary
– matter of degree
– & most importantly, matter of mastery

symbolically: controlled & free vocabulary

thank you!