LIS618 lecture 4 Thomas Krichel 2003-10-19



Structure

• Document preprocessing

• Practice: Nexis
  – document preprocessing
  – segment theory and practice

• Practice: Factiva


document preprocessing

• There are some operations that may be done to the documents before indexing
  – lexical analysis
  – stemming of words
  – elimination of stop words
  – selection of index terms
  – construction of term categorization structures
  We will look at those in turn.

• In many cases, document preprocessing is not well documented by the provider.

• But searchers need to be aware of these operations.


lexical analysis

• divides a stream of characters into a stream of words

• seems easy enough but….

• should we keep numbers?

• hyphens. compare "state-of-the-art" with "b-52"

• removal of punctuation, but "333B.C."

• casing. compare "bank" and "Bank"
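
The questions above become concrete as soon as one writes a tokenizer. The following Python sketch is only an illustration: the function `tokenize` and its three options are made up for this example and do not describe any provider's lexical analysis.

```python
import re

def tokenize(text, keep_numbers=True, split_hyphens=True, fold_case=True):
    """Split a character stream into words under explicit policy choices."""
    if fold_case:
        text = text.lower()              # "Bank" and "bank" become one word
    if split_hyphens:
        text = text.replace("-", " ")    # "state-of-the-art" becomes four words
    # A word is a run of letters and digits, optionally joined by hyphens;
    # all other characters (including punctuation) act as separators.
    words = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    if not keep_numbers:
        words = [w for w in words if not w.isdigit()]
    return words
```

With the defaults, `tokenize("b-52")` gives `["b", "52"]`, which loses the aircraft name, and `tokenize("333B.C.")` gives `["333b", "c"]`, which loses the date. Each option fixes one case and breaks another.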


stemming

• in general, users search for the occurrence of a term irrespective of grammar

• plural, gerund forms, past tense can be subject to stemming

• important algorithm by Porter

• evidence about the effect of stemming on information retrieval is mixed

• stemming is relatively rare these days.
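
To show what stemming does, here is a deliberately naive suffix stripper in Python. It is much cruder than Porter's algorithm and is not a version of it; the suffix list and the `min_stem` threshold are assumptions chosen for this example.

```python
# Illustrative suffix rules, tried in order. This is NOT Porter's rule set.
SUFFIXES = ["ing", "ed", "es", "s"]

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

Here "searching", "searched" and "searches" all reduce to "search", which is the intended effect. But `naive_stem("news")` returns "new", an example of the kind of conflation that may explain why the evidence on stemming is mixed.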


elimination of stop words

• some words carry no meaning and should be eliminated

• in fact any word that appears in 80% of all documents is pretty much useless, but

• consider a searcher for "to be or not to be".

• It is better to reduce the index weight of terms that appear very frequently
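
One common way to reduce the weight of frequent terms is inverse document frequency. The slides do not name a formula, so the Python sketch below is an assumption: it uses log(N / df), where N is the number of documents and df the number of documents containing the term.

```python
import math

def idf_weights(documents):
    """Return an inverse document frequency weight for every term.

    documents: a list of word lists (already tokenized).
    A term that occurs in every document gets weight 0.
    """
    n_docs = len(documents)
    doc_freq = {}
    for words in documents:
        for term in set(words):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}
```

A word like "to" that appears everywhere ends up with weight zero, yet it stays in the index, so a phrase search for "to be or not to be" is still possible.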


index term selection

• In printed indexes, we use nouns only

• some nouns that appear heavily together can be considered to be one index term, such as "computer science"

• Dialog deals with this through phrase indexing.

• Most web engines, however, index all words, and all of them individually.


thesauri

• a list of words and for each word, a list of related words
  – synonyms
  – broader terms
  – narrower terms

• used
  – to provide a consistent vocabulary for indexing and searching
  – to assist users with locating terms for query formulation
  – to allow users to broaden or narrow a query


use of thesauri

• Thesauri are limited to experimental systems, or some high-quality systems, see http://www.sosig.ac.uk/roads/cgi-bin/thesaurus.pl for an example, or look at Nexis

• It can be confusing to users.

• Frequently the relationship between terms in the query is badly served by the relationships in the thesaurus. Thus thesaurus expansion of an initial query (if performed automatically) can lead to bad results.
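
A thesaurus and automatic query expansion can be sketched in a few lines of Python. The entries in `THESAURUS` and the function `expand` are invented for illustration; they are not taken from any real thesaurus.

```python
# A toy thesaurus: each term maps to related terms, grouped by relation type.
# All entries are invented for this example.
THESAURUS = {
    "car": {"synonyms": ["automobile"], "broader": ["vehicle"], "narrower": ["taxi"]},
    "bank": {"synonyms": [], "broader": ["financial institution"], "narrower": ["savings bank"]},
}

def expand(query_terms, relations=("synonyms",)):
    """Return the query terms plus their related terms for the chosen relations."""
    expanded = []
    for term in query_terms:
        expanded.append(term)
        entry = THESAURUS.get(term, {})
        for relation in relations:
            expanded.extend(entry.get(relation, []))
    return expanded
```

Expanding `["river", "bank"]` with the "broader" relation adds "financial institution" to the query. The expansion looks at each term alone and ignores how the terms relate to each other in the query, which is how automatic expansion leads to bad results.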


Back to Nexis: word limits

• The following are always considered word limits
  – hyphens
  – slashes
  – parentheses
  – spaces


plurals

• Nexis indexes plural and possessive forms as the singular.

• But in power search, you can use the following
  – PLURAL (term) only the plural of term
  – SINGULAR (term) only the singular of term
  – ALLCAPS (term) only capitals of term
  – NOCAPS (term) no capitals of term
  – CAPS (term) capitalized term only


Document preprocessing in Nexis

• ampersand: if it is surrounded by blanks, Nexis treats it as "and". If it is not, Nexis treats it as a normal character, e.g. company(at&t).

• apostrophe: works if not followed by "s", in which case it is a possessive

• at-sign: used for sections in case law, ignored otherwise, e.g. in email addresses: presidentwhitehouse.com


• colon and comma are read as a space unless adjacent characters are numbers.

• hyphen, / and \ are read as a space

• percent and pound sign mean themselves and are not equivalent to anything.

• " ? $ ; are all ignored

• ® is replaced by the word "R", ™ is replaced by the word "TM".
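
The character rules on these slides can be modelled roughly in Python. This is a reading of the slides, not Nexis's implementation: the function name `nexis_like_normalize` is made up, the apostrophe and at-sign rules are left out, and treating "adjacent characters are numbers" as "a digit on both sides" is an assumption.

```python
import re

def nexis_like_normalize(text):
    """Apply the character rules listed on the slides to a string (rough model)."""
    # An ampersand surrounded by blanks is treated as "and"; "at&t" is untouched.
    text = text.replace(" & ", " and ")
    # Colon and comma become a space unless both neighbours are digits
    # (the "both sides" reading is an assumption).
    text = re.sub(r"(?<!\d)[:,]|[:,](?!\d)", " ", text)
    # Hyphen, slash and backslash are read as a space.
    text = re.sub(r"[-/\\]", " ", text)
    # These characters are ignored.
    text = re.sub(r'["?$;]', "", text)
    # Registered and trademark signs are replaced by the words R and TM.
    text = text.replace("\u00ae", " R ").replace("\u2122", " TM ")
    # Collapse the runs of blanks produced by the steps above.
    return " ".join(text.split())
```

Under this model "1,000 users at 10:30, today" becomes "1,000 users at 10:30 today": the comma and colon between digits survive, the comma before the blank does not.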


equivalents

• Nexis has a number of "equivalents" where, depending on sources, it replaces one with the other. Contrary to their claims they also work in quick search

• First (second, third, etc.) is 1st (2nd, 3rd, etc.)

• Monday (All days ex. Sunday) is Mon (Tues, Weds, etc.)

• January (Abbreviations work) is Jan (Feb, Mar, etc.)

• One (all numbers < 20) is 1 (2, 3, etc.)

• and is &

• company is co

• corporation is corp

• incorporated is inc
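
The effect of equivalents can be pictured as a lookup table. The pairs in the Python sketch below come from the slide, but mapping every form to one canonical form is only an illustrative design, not a description of how Nexis stores its index; the names `EQUIVALENTS`, `canonical` and `same_term` are made up.

```python
# A few pairs from the slide; the table is deliberately incomplete.
EQUIVALENTS = {
    "1st": "first", "2nd": "second", "3rd": "third",
    "mon": "monday", "jan": "january",
    "1": "one", "2": "two", "3": "three",
    "&": "and", "co": "company", "corp": "corporation", "inc": "incorporated",
}

def canonical(word):
    """Map a word to its canonical equivalent, if it has one."""
    word = word.lower()
    return EQUIVALENTS.get(word, word)

def same_term(a, b):
    """True if two words would match each other under the equivalents table."""
    return canonical(a) == canonical(b)
```

A search for "corp" then also finds "corporation", while "sun" and "sunday" stay distinct because Sunday is excluded from the equivalents.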


noise and reserved words

• Noise words are common words
  – in power search, noise words are ignored and replaced by a space
  – in quick search, you can use phrases
  – there is no list of noise words

• Reserved words are
  – and
  – or
  – not
  They are used in Boolean expressions. They are not indexed.


Nexis segments

• Nexis does some document preprocessing for characters, discussed in a later slide.

• The processed document has a number of field/value pairs that are called segments.

• Not every source has every segment.

• I make a distinction between
  – native
  – smart-indexed
  segments.


some segments in legal docs

• CITE
• CLASS
• DATE common search for any date field
• FIRST-ACTION date
• HISTORY
• ISSUED-BY
• LAST-ACTION date
• NAME
• REFERENCES
• TEXT full text
• TITLE same as name
• TYPE


typical segments in news

• BYLINE
• CORRECTION
• CORRECTION-DATE
• DATE
• DATELINE (not a date)
• GRAPHIC
• HEADLINE
• HIGHLIGHT
• LEAD
• HLEAD is HEADLINE, HIGHLIGHT, & LEAD
• PUBLICATION name and copyright
• SECTION
• SERIES
• SOURCE
• TICKER
• TYPE


typical smart-indexed segments

• CITY
• COMPANY
• COUNTRY
• GEOGRAPHIC
• INDUSTRY
• KEYWORD
• ORGANIZATION
• PERSON
• PRODUCT
• SUBJECT
• TICKER
• TYPE
• TERMS includes all these


segment search

• You can place query terms and connectors in a segment and then search for it. Example:

hlead((drug or substance) w/10 abuse)
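
What this query asks for can be spelled out in Python by modelling a document as a set of segments. This is a sketch under assumptions: w/10 is read as "within 10 words, in either order", the "BODY" segment name is hypothetical, and the helper names `within` and `hlead` are made up for this example.

```python
def within(words, left_terms, right_term, n):
    """True if any of left_terms occurs within n words of right_term."""
    left_pos = [i for i, w in enumerate(words) if w in left_terms]
    right_pos = [i for i, w in enumerate(words) if w == right_term]
    return any(abs(i - j) <= n for i in left_pos for j in right_pos)

def hlead(doc):
    """HLEAD combines the HEADLINE, HIGHLIGHT and LEAD segments."""
    parts = [doc.get(name, "") for name in ("HEADLINE", "HIGHLIGHT", "LEAD")]
    return " ".join(parts).lower().split()

# A made-up news document; "BODY" is a hypothetical segment name.
doc = {
    "HEADLINE": "Drug policy under review",
    "LEAD": "Officials said abuse of prescription medicine is rising",
    "BODY": "Substance abuse programs were not mentioned in the opening paragraphs",
}

# hlead((drug or substance) w/10 abuse)
match = within(hlead(doc), {"drug", "substance"}, "abuse", 10)
```

The document matches because "drug" and "abuse" are six words apart inside HLEAD. A document that mentions both words only in its body would not match, which is the point of restricting the search to a segment.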


using segments for news

• uses power search expressions, plus

• hlead (expression) for headline, highlight and lead

• headline (expression)

• company (expression) for a company

• byline (expression) for the author

• show (expression) for a television show transcript

expression is a Boolean expression or simple keyword.


power search for legal data

• uses power search expressions, plus

• name (expression) for the name of a party

• cite (expression) for a citation expression for case law

• title (expression) for the title of a law article

expression is a Boolean expression or simple keyword


Search forms

• There are special forms for
  – News
  – Company reports
  – Market indicators
  – Portfolio
  – News and quotes about companies


Personal news alert

• do a search

• then click on “track in personal news” to get to a screen where you can enter
  – periodicity
  – what documents to be sent
  – subject

• This works for real estate for me.


Real time news

• This uses a different query language
  – terms are implicitly ANDed
  – explicit AND and OR allowed
  – phrases have to be put in quotes
  – * stands for any number of characters, not just one as in power search
  – parentheses can be used

• My experience with this has been poor.


Summary on Nexis

• Nexis has a rich set of resources.

• It can be searched by inexperienced users, but they are likely to get poor results.

• Learning about its features can get you quite far. However, the features are not well documented online. There is not enough detail.


Factiva

• Nexis is news with legal "stuff".

• Factiva is news with business "stuff".

• It will only work with Microsoft Internet Explorer! This violates the most important rule of web site design.

• This is because they use ASP technology. A bad choice!


Login to factiva

• We have a public account that will serve up to 30 users concurrently up to 2003-12-31
  – user id: mls003
  – password: transcripts
  – name space: 16

• https://global.factiva.com/factivalogin/login.asp has the login

• Sessions time out after 30 minutes.


More on Factiva

• http://www.factiva.com/factiva has
  – downloadable brochures
  – case studies
  – white papers
  – product tour

• I looked at the brochure "Inside-Out". It is well written, and I have ordered copies.


Free text search

• similar to Nexis power search

• operators "and", "or", "not", "w/i", "near/n" where n is a number.

• /f/n requires the preceding expression to be in the first n words in the full text.

• "same" stands for same paragraph

• "atleastn" requires at least n occurrences.

• "wc" is a word count, use <, > and then a number, e.g. wc<1000.
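
The meaning of three of these operators can be written down as small Python functions. They only mirror the descriptions on this slide; the function names are made up and nothing here reproduces Factiva's actual query syntax or engine.

```python
def atleast(words, term, n):
    """atleastn: the term occurs at least n times."""
    return words.count(term) >= n

def in_first(words, term, n):
    """/f/n: the term occurs within the first n words of the full text."""
    return term in words[:n]

def word_count_below(words, limit):
    """wc<limit: the document has fewer than limit words."""
    return len(words) < limit

# A made-up one-line document, already lower-cased and split into words.
text = "merger talks continue as the merger deadline nears"
words = text.lower().split()
```

For this text, `atleast(words, "merger", 2)` and `in_first(words, "talks", 2)` hold, while `in_first(words, "deadline", 3)` does not.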


but as well

• You can add codes from indexing terms.

• Note that the + shows that there is more. When you press the triangle the code is dropped into the text box.


http://openlib.org/home/krichel

Thank you for your attention!