Chapter 2. Extracting Lexical Features
January 23, 2007
Artificial Intelligence Laboratory, 조선호
Text: FINDING OUT ABOUT, pp. 39-59


Preview

2.1 Building Useful Tools
2.2 Inter-document Parsing
2.3 Intra-document Parsing
  2.3.1 Stemming and Other Morphological Processing
  2.3.2 Noise Words
  2.3.3 Summary
2.4 Example Corpora
2.5 Implementation
  2.5.1 Basic Algorithm
  2.5.2 Fine Points
  2.5.3 Software Libraries

2.1 Building Useful Tools

Introduces an example IR system.

The three main phases of search engine development:

1. First phase: convert an arbitrary pile of textual objects into a well-defined corpus of documents, each indexed by the string of terms it contains.
2. Second phase: build efficient data structures that invert the index relation.
   ☞ This makes it possible to find all documents that contain particular keywords (more useful than finding all keywords contained in a particular document).
3. Third phase: match queries against the index to retrieve the documents most similar to the query.
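A minimal sketch of the second and third phases, assuming the first phase has already reduced each document to a list of index terms. The document IDs, the toy corpus, and the overlap-count scoring are illustrative assumptions, not part of the text.

```python
from collections import defaultdict

# Phase 1 output (assumed): each document already reduced to its index terms.
# Document IDs and contents are made up for illustration.
corpus = {
    "d1": ["lexical", "feature", "extraction"],
    "d2": ["inverted", "index", "feature"],
    "d3": ["query", "matching", "index"],
}

def invert(corpus):
    """Phase 2: turn the document -> terms relation into term -> documents."""
    inverted = defaultdict(set)
    for doc_id, terms in corpus.items():
        for term in terms:
            inverted[term].add(doc_id)
    return inverted

def match(inverted, query_terms):
    """Phase 3: rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in inverted.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

inverted = invert(corpus)
hits = match(inverted, ["index", "feature"])
print(hits)
```

Counting shared terms is only the simplest possible similarity measure; it stands in here for whatever matching function a real system would use.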

Extracting lexical features: used mainly in the first and second phases.
- The goal is to extract a set of features that carry meaning for the later analysis.
- The specification of the unit feature set obtained through this step is important.
- Level of analysis: documents, words, roots, characters, ...

2.2 Inter-document Parsing

The stage that turns a corpus (an arbitrary “pile of text”) into individually retrievable documents.

Examples: AI theses (AIT) and email

Multiple text fields
- Implemented by concatenation

Use of annotations
- Used as proxies in the hitlist
- Used for special emphasis

Pre-filters for special document classes
- deTeX
- HTML, XML parsers (SAX, DOM)

Email is an example of structural information carried by syntax.
Ex) markup languages: TeX, XML, HTML
☞ Filters exist that extract the meaningful text.
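A pre-filter of this kind can be sketched with Python's standard `html.parser` module. The class name `TextExtractor` and the helper `html_to_text` are hypothetical names chosen for this example, and skipping `script` and `style` bodies is an added assumption about what counts as meaningful text.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data, dropping tags and script/style bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Lexical</h1><p>features &amp; tokens</p></body></html>"
)
text = html_to_text(sample)
print(text)
```

The output of such a filter is the plain character stream that intra-document parsing then works on.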

2.3 Intra-document Parsing

A file can be assumed to be simply a stream of characters.

Process a string of characters
- Assemble characters into tokens (tokenizer)
- Choose tokens to index

Lexical analyzer generator
- Ex) Lex / yacc
- Basic idea is a finite state machine
- Triples of input state, transition token, output state
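The finite-state idea can be illustrated with a hand-written table of (input state, character class) → output state entries. This is a small sketch, not what Lex generates; the state names, the three character classes, and the lowercasing of tokens are assumptions made for the example.

```python
def char_class(ch):
    """Character classes play the role of the transition token on each edge."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Transition table: (input state, character class) -> output state.
TRANSITIONS = {
    ("start", "letter"): "word",
    ("start", "digit"): "number",
    ("start", "other"): "start",
    ("word", "letter"): "word",
    ("word", "digit"): "word",
    ("word", "other"): "start",
    ("number", "digit"): "number",
    ("number", "letter"): "word",
    ("number", "other"): "start",
}

def tokenize(text):
    """Run the state machine over text, emitting (token type, token) pairs."""
    tokens = []
    state = "start"
    buffer = []
    for ch in text:
        next_state = TRANSITIONS[(state, char_class(ch))]
        if next_state == "start":
            # A separator ends the current token, if there is one.
            if buffer:
                tokens.append((state, "".join(buffer).lower()))
                buffer = []
        else:
            buffer.append(ch)
        state = next_state
    if buffer:
        tokens.append((state, "".join(buffer).lower()))
    return tokens

tokens = tokenize("Page 39: Extracting lexical features, 2nd ed.")
print(tokens)
```

Punctuation and whitespace are discarded at this point, which is an instance of the information loss noted for lexical analysis: efficient, but a later search cannot recover what was thrown away.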

Lexical Analyzer
- Output of the lexical analyzer is a string of tokens
- Remaining operations are all on these tokens
- We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search
- Same lexical analysis for both documents and queries!

Stemming and Other Morphological Processing
- Conflation
- Stemming
  - Rewrite rules
  - Porter stemmer
- Other approaches
- Phrases

Stemming

- Additional processing at the token level
- We covered this earlier this semester
- Turn words into a canonical form:
  - “cars” into “car”
  - “children” into “child”
  - “walked” into “walk”
- Decreases the total number of different tokens to be processed
- Decreases the precision of a search, but increases its recall

Conflation

Stemming

In stemming, suffixes are removed.

The following pairs are examples of reducing plural forms to singular forms.

- WOMAN / WOMEN

- LEAF / LEAVES

- FERRY / FERRIES

- ALUMNUS / ALUMNI

- DATUM / DATA
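The pairs above can be handled by a few ordered suffix rewrite rules plus an exception table for irregular forms. The sketch below is a toy illustration, not the Porter stemmer; the rule list, the `EXCEPTIONS` table, and the minimum-length guard are assumptions chosen to cover these examples.

```python
# Irregular plurals that no suffix rule captures (illustrative exception table).
EXCEPTIONS = {"women": "woman", "alumni": "alumnus", "data": "datum"}

# Ordered rewrite rules: (suffix to strip, replacement). The first match wins,
# so the more specific suffixes are listed before the bare "s" rule.
RULES = [
    ("ies", "y"),    # ferries -> ferry
    ("ves", "f"),    # leaves  -> leaf
    ("sses", "ss"),  # classes -> class
    ("s", ""),       # cars    -> car
]

def singularize(word):
    """Reduce a plural to a singular form using the rules above."""
    token = word.lower()
    if token in EXCEPTIONS:
        return EXCEPTIONS[token]
    if token.endswith("ss"):
        # Words such as "class" are already singular.
        return token
    for suffix, replacement in RULES:
        # Require at least two characters to remain before the suffix.
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: len(token) - len(suffix)] + replacement
    return token

plurals = ["WOMEN", "LEAVES", "FERRIES", "ALUMNI", "DATA"]
print([singularize(w) for w in plurals])
```

Rules this crude over-apply (for example, they would turn "gloves" into "glof"), which is one reason practical stemmers use longer rule sets with conditions on the remaining stem.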

Rewrite rules

Porter stemmer Rules

Rule matching

Other approaches

Phrases

Noise Words

- a.k.a. stop words, negative dictionaries
- Function words that contribute little or nothing to meaning
- Very frequent words
  - If a word occurs in every document, it is not useful in choosing among documents
  - However, we need to be careful, because this is corpus-dependent
- Often implemented as a discrete list
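Because the useful stop list is corpus-dependent, one option is to derive it from document frequencies rather than hard-code it. The sketch below assumes documents are already tokenized; the function names and the 0.8 threshold are illustrative choices, not values from the text.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a token at most once
    return df

def derive_stop_words(docs, max_doc_ratio=0.8):
    """Tokens that appear in more than max_doc_ratio of the documents."""
    df = document_frequency(docs)
    threshold = max_doc_ratio * len(docs)
    return {token for token, count in df.items() if count > threshold}

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

docs = [
    "the index of the corpus".split(),
    "the stemmer reduces the tokens".split(),
    "the query matches the index".split(),
    "a corpus of the email messages".split(),
]
stop_words = derive_stop_words(docs)
filtered = remove_stop_words(docs[0], stop_words)
print(stop_words, filtered)
```

In this toy corpus only "the" crosses the threshold; "of" survives because it appears in just half of the documents, which shows how the resulting list depends on the corpus at hand.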

Summary

Text document is represented by the words it contains (and their occurrences)
- e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
- Highly efficient
- Makes learning far simpler and easier
- Order of words is not that important for certain applications

Stemming
- Reduces dimensionality
- Identifies a word by its root
- e.g., flying, flew → fly

Stop words
- Identifies the most common words that are unlikely to help with text mining
- e.g., “the”, “a”, “an”, “you”
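The words-and-occurrences representation from this summary can be sketched with `collections.Counter`. Splitting on whitespace and lowercasing are simplifying assumptions that stand in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text by its words and their occurrence counts."""
    return Counter(text.lower().split())

original = bag_of_words("Lord of the rings")
shuffled = bag_of_words("the rings of Lord")

# Word order is discarded: both strings map to the same representation.
print(original == shuffled)
print(sorted(original.items()))
```

Two texts with the same words in a different order become indistinguishable, which is the price of the efficiency noted above.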

2.4 Example Corpora

We are assuming a fixed corpus. Some sample corpora:
- AIT
- Email. Anyone’s email.
- Reuters corpus
- Brown corpus

Will contain textual fields, maybe structured attributes
- Textual: free, unformatted, no meta-information. NLP mostly needed here
- Structured: additional information beyond the content

AI Theses (AIT)

AIT year Distribution

Structured Fields for Email

An email message
- Header: From, To, Cc, Subject, Date
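The split between structured header fields and free text can be sketched with Python's standard `email` package. The message below is invented for illustration, and grouping Subject with the body as text fields follows the next slide rather than anything in the library.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# A made-up message; addresses and content are purely illustrative.
raw = """From: Alice Example <alice@example.org>
To: bob@example.org
Cc: carol@example.org
Subject: Indexing question
Date: Tue, 23 Jan 2007 10:15:00 +0900

Which stemmer should we use?
"""

msg = message_from_string(raw)

# Structured attributes: typed values extracted from the header.
structured = {
    "from": getaddresses([msg["From"]]),
    "to": getaddresses([msg["To"]]),
    "cc": getaddresses([msg.get("Cc", "")]),
    "date": parsedate_to_datetime(msg["Date"]),
}

# Textual fields: free text that goes on to tokenizing and stemming.
text_fields = {"subject": msg["Subject"], "body": msg.get_payload()}

print(structured["from"], structured["date"].isoformat())
print(text_fields)
```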

Text fields for Email

Subject
- Format is structured, content is arbitrary.
- Captures most critical part of content.
- Proxy for content, but may be inaccurate.

Body of email
- Highly irregular, informal English.
- Entire document, not summary.
- Spelling and grammar irregularities.
- Structure and length vary.

2.5 Implementation

Indexing

- We have a tokenized, stemmed sequence of words
- Next step is to parse the document, extracting index terms
- Assume that each token is a word and we don’t want to recognize any more complex structures than single words.
- When all documents are processed, create the index

Basic algorithm

Figure 2.4 Basic Posting Data Structure

Basic Indexing Algorithm
- For each document in the corpus
  - Get the next token
  - Create or update an entry in a list: doc ID, frequency.
- For each token found in the corpus
  - Calculate #docs, total frequency
  - Sort by frequency
- Often called a “reverse index”, because it reverses the “words in a document” index to be a “documents containing words” index.
- May be built on the fly or created after indexing.
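The two loops of the basic algorithm can be sketched as follows, assuming each document arrives as a list of already tokenized and stemmed words. The nested-dictionary layout is an illustrative choice and is not meant to reproduce the posting structure of Figure 2.4.

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: doc ID -> list of tokens. Returns token -> {doc ID: frequency}."""
    postings = defaultdict(dict)
    for doc_id, tokens in corpus.items():
        for token in tokens:
            # Create or update the (doc ID, frequency) entry for this token.
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    return postings

def summarize(postings):
    """Per token: (token, #docs, total frequency), sorted by total frequency."""
    stats = [
        (token, len(docs), sum(docs.values()))
        for token, docs in postings.items()
    ]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(stats, key=lambda row: (-row[2], row[0]))

# Toy corpus with integer doc IDs, made up for illustration.
corpus = {
    1: ["index", "term", "index"],
    2: ["term", "query"],
    3: ["index", "query", "query", "query"],
}
postings = build_index(corpus)
table = summarize(postings)
print(table)
```

Here the summary statistics are computed after indexing; maintaining them during the first loop would be the "built on the fly" alternative.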

Refined Posting Data Structures

Minimizing OS dependencies

Fine Points
- Dynamic corpora (e.g., the web): requires incremental algorithms
- Higher-resolution data (e.g., character position)
  - Supports highlighting
  - Supports phrase searching
  - Useful in relevance ranking
- Giving extra weight to proxy text (typically by doubling or tripling frequency count)
- Document-type-specific processing
  - In HTML, want to ignore tags
  - In email, maybe want to ignore quoted material
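To illustrate how higher-resolution postings support phrase searching, the sketch below stores positions for every occurrence. As a simplifying assumption it records token positions rather than character positions; the function names and toy corpus are hypothetical.

```python
from collections import defaultdict

def build_positional_index(corpus):
    """corpus: doc ID -> list of tokens. Returns token -> {doc ID: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in corpus.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return sorted doc IDs in which the phrase tokens occur consecutively."""
    if not phrase or any(token not in index for token in phrase):
        return []
    first, rest = phrase[0], phrase[1:]
    hits = []
    for doc_id, starts in index[first].items():
        for start in starts:
            # Every later token must appear at the next consecutive position.
            if all(
                start + offset in index[token].get(doc_id, ())
                for offset, token in enumerate(rest, start=1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)

corpus = {
    1: "extracting lexical features from text".split(),
    2: "lexical analysis of features".split(),
    3: "features lexical".split(),
}
index = build_positional_index(corpus)
result = phrase_search(index, ["lexical", "features"])
print(result)
```

All three documents contain both words, but only the first contains them as an adjacent phrase, which a frequency-only posting could not tell apart.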

Basic Measures for Text Retrieval

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

[Figure: within the set of All Documents, the Relevant and Retrieved sets overlap in the region labeled Relevant & Retrieved]
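Both measures follow directly from set operations. In the sketch below the document IDs are invented, and returning 0.0 when a denominator is empty is a convention assumed for this example, not something the definitions specify.

```python
def precision_recall(relevant, retrieved):
    """Compute (precision, recall) from two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    overlap = relevant & retrieved  # relevant AND retrieved
    precision = len(overlap) / len(retrieved) if retrieved else 0.0
    recall = len(overlap) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d7", "d8", "d9"}
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Two of the five retrieved documents are relevant (precision 0.4), and two of the four relevant documents were retrieved (recall 0.5).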