Chapter 2. Extracting Lexical Features
January 23, 2007, Artificial Intelligence Lab, 조선호
Text: FINDING OUT ABOUT
Pages 39-59
Preview
2.1 Building Useful Tools
2.2 Inter-document Parsing
2.3 Intra-document Parsing
- 2.3.1 Stemming and Other Morphological Processing
- 2.3.2 Noise Words
- 2.3.3 Summary
2.4 Example Corpora
2.5 Implementation
- 2.5.1 Basic Algorithm
- 2.5.2 Fine Points
- 2.5.3 Software Libraries
2.1 Building Useful Tools
Introduces the example of an IR system.
The three main phases of search engine development:
1. First phase: convert an arbitrary pile of textual objects into a well-defined corpus of documents, each indexed by the string of terms it contains.
2. Second phase: build an efficient data structure that inverts the index relation.
☞ This makes it possible to find all documents that contain a particular keyword (more useful than finding all keywords contained in a particular document).
3. Third phase: match queries against the indices to retrieve the documents most similar to the query.
Extracting lexical features: used mainly in the first and second phases.
- The goal is to extract a set of features that carry meaning for the later analysis.
- The specification of this elementary feature set, which is the result of this step, is what matters.
Level of analysis: documents, words, roots, characters, ...
2.2 Inter-document Parsing
The step that turns a corpus (an arbitrary “pile of text”) into individual, retrievable documents.
Examples: AI theses (AIT) and email
Multiple text fields
- implemented as a concatenation, with annotations
- used as proxies in the hitlist
- used for special emphasis
Pre-filters for special document classes
- deTeX
- HTML, XML parsers (SAX, DOM)
Email is an example of structural information that follows from how the text is composed.
Ex) mark-up languages: TeX, XML, HTML
☞ Filters exist that extract the meaningful text.
2.3 Intra-document Parsing
A file can be assumed to be simply a stream of characters.
Process a string of characters
- assemble characters into tokens (tokenizer)
- choose tokens to index
Lexical analyzer generator
- Ex) Lex / yacc
- Basic idea is a finite state machine
- Triples of input state, transition token, output state
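The finite-state idea can be sketched in a few lines. This is a minimal illustration, not the output of Lex: the two state names, the `char_class` helper, and the rule that runs of letters and digits form tokens are all assumptions made for the example.

```python
# Hypothetical two-state tokenizer; states and character classes are
# illustrative assumptions, not taken from the slides.
OUTSIDE, IN_TOKEN = "outside", "in_token"


def char_class(ch):
    """Classify a character as token material or as a separator."""
    return "alnum" if ch.isalnum() else "other"


# Transition table: (input state, character class) -> output state
TRANSITIONS = {
    (OUTSIDE, "alnum"): IN_TOKEN,
    (OUTSIDE, "other"): OUTSIDE,
    (IN_TOKEN, "alnum"): IN_TOKEN,
    (IN_TOKEN, "other"): OUTSIDE,
}


def tokenize(text):
    """Assemble a stream of characters into lowercase tokens."""
    tokens, current, state = [], [], OUTSIDE
    for ch in text:
        next_state = TRANSITIONS[(state, char_class(ch))]
        if next_state == IN_TOKEN:
            current.append(ch.lower())
        elif state == IN_TOKEN:
            # Leaving a token: emit what has been collected so far.
            tokens.append("".join(current))
            current = []
        state = next_state
    if current:  # flush a token that runs to the end of the input
        tokens.append("".join(current))
    return tokens


print(tokenize("Finding Out About: text, tokens & queries!"))
# ['finding', 'out', 'about', 'text', 'tokens', 'queries']
```

Running the same `tokenize` function over both documents and queries keeps the two sides comparable.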
Lexical Analyzer
- Output of lexical analyzer is a string of tokens
- Remaining operations are all on these tokens
- We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search
- Same lexical analysis for both documents and queries!
Stemming and Other Morphological Processing
- Conflation
- Stemming
- Rewrite rules
- Porter stemmer
- Other approaches
- Phrases
Stemming
- Additional processing at the token level
- We covered this earlier this semester
- Turn words into a canonical form:
- “cars” into “car”
- “children” into “child”
- “walked” into “walk”
- Decreases the total number of different tokens to be processed
- Decreases the precision of a search, but increases its recall
Conflation
Stemming
In stemming, suffixes are removed.
The following are examples of reducing plural forms to the singular:
- WOMAN / WOMEN
- LEAF / LEAVES
- FERRY / FERRIES
- ALUMNUS / ALUMNI
- DATUM / DATA
Rewrite rules
Porter stemmer Rules
Rule matching
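As a rough illustration of suffix rewrite rules with first-match rule selection, here is a minimal sketch. It is not the Porter stemmer: the rule list, the `IRREGULAR` table, and the `MIN_STEM` length guard are hypothetical choices made to cover the examples in these slides.

```python
# Hypothetical rule set for illustration only; NOT the Porter stemmer.
# Irregular plurals cannot be handled by suffix rules, so they are looked up.
IRREGULAR = {
    "women": "woman",
    "alumni": "alumnus",
    "data": "datum",
    "children": "child",
}

# Ordered (suffix, replacement) rewrite rules; the first matching rule wins.
RULES = [
    ("ies", "y"),   # ferries -> ferry
    ("ves", "f"),   # leaves  -> leaf
    ("ed", ""),     # walked  -> walk
    ("ss", "ss"),   # class   -> class (blocks the plain "s" rule)
    ("s", ""),      # cars    -> car
]

MIN_STEM = 3  # assumed guard: keep at least this many characters before the suffix


def stem(word):
    """Return a canonical form of `word` using the rewrite rules above."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["WOMEN", "LEAVES", "FERRIES", "ALUMNI", "DATA", "cars", "walked"]:
    print(w, "->", stem(w))
```

Rule order matters here: placing `"ss"` before `"s"` is what keeps "class" from being cut to "clas".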
Other approaches
Phrases
Noise Words
- a.k.a. stop words, negative dictionaries
- Function words that contribute little or nothing to meaning
- Very frequent words
- If a word occurs in every document, it is not useful in choosing among documents
- However, need to be careful, because this is corpus-dependent
- Often implemented as a discrete list
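Both variants, a discrete list and a corpus-dependent list, can be sketched as follows. The `STOP_WORDS` set, the sample documents, and the `min_doc_fraction` parameter are assumptions for the example, not values from the text.

```python
from collections import Counter

# Hypothetical negative dictionary implemented as a discrete list.
STOP_WORDS = {"the", "a", "an", "of", "you"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


def corpus_stop_words(docs, min_doc_fraction=1.0):
    """Return tokens that occur in at least `min_doc_fraction` of the documents.

    With the default of 1.0 this finds words present in every document,
    which cannot help in choosing among documents of this corpus.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    threshold = min_doc_fraction * len(docs)
    return {t for t, df in doc_freq.items() if df >= threshold}


docs = [
    ["the", "thesis", "on", "neural", "networks"],
    ["the", "email", "on", "parsing"],
    ["the", "index", "of", "terms"],
]
print(remove_stop_words(docs[2]))  # ['index', 'terms']
print(corpus_stop_words(docs))     # {'the'}
```

The second function shows why the list is corpus-dependent: a word such as "thesis" would become a noise word in a corpus where every document is a thesis.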
Summary
Text document is represented by the words it contains (and their occurrences)
- e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
- Highly efficient
- Makes learning far simpler and easier
- Order of words is not that important for certain applications
Stemming
- Reduces dimensionality
- Identifies a word by its root
- e.g., flying, flew → fly
Stop words
- Identifies the most common words that are unlikely to help with text mining
- e.g., “the”, “a”, “an”, “you”
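The word-set representation in this summary can be sketched with a counter. Lowercasing and whitespace splitting are simplifying assumptions; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to its words and their occurrence counts, ignoring order."""
    return Counter(text.lower().split())


a = bag_of_words("Lord of the rings")
b = bag_of_words("rings of the Lord")
print(sorted(a))  # ['lord', 'of', 'rings', 'the']
print(a == b)     # True: word order is discarded
```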
2.4 Example Corpora
We are assuming a fixed corpus. Some sample corpora:
- AIT
- Email. Anyone’s email.
- Reuters corpus
- Brown corpus
Will contain textual fields, maybe structured attributes
- Textual: free, unformatted, no meta-information. NLP mostly needed here
- Structured: additional information beyond the content
AI Theses (AIT)
AIT year Distribution
Structured Fields for Email
An Email Message
- Header: From, To, Cc, Subject, Date
Text fields for Email
Subject
- Format is structured, content is arbitrary.
- Captures most critical part of content.
- Proxy for content, but may be inaccurate.
Body of email
- Highly irregular, informal English.
- Entire document, not summary.
- Spelling and grammar irregularities.
- Structure and length vary.
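Separating the structured header fields from the two text fields can be sketched with Python's standard `email` package. The sample message, its addresses, and the `split_fields` helper are hypothetical; only single-part plain-text messages are handled.

```python
from email import message_from_string

# Hypothetical sample message; addresses and content are made up.
RAW = """\
From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Draft of chapter 2
Date: Tue, 23 Jan 2007 10:00:00 +0900

hi bob, heres the draft.
> quoted earlier text
"""

STRUCTURED = ("From", "To", "Cc", "Date")


def split_fields(raw):
    """Split a raw message into structured attributes and text fields."""
    msg = message_from_string(raw)
    structured = {name: msg[name] for name in STRUCTURED}
    # Subject doubles as a proxy for the content; the body is the full text.
    text = {"subject": msg["Subject"], "body": msg.get_payload()}
    return structured, text


structured, text = split_fields(RAW)
print(structured["From"])  # alice@example.com
print(text["subject"])     # Draft of chapter 2
```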
2.5 Implementation
Indexing
- We have a tokenized, stemmed sequence of words
- Next step is to parse the document, extracting index terms
- Assume that each token is a word and we don’t want to recognize any more complex structures than single words.
- When all documents are processed, create the index
Basic algorithm
Figure 2.4 Basic Posting Data Structure
Basic Indexing Algorithm
For each document in the corpus:
- Get the next token
- Create or update an entry in a list (doc ID, frequency)
For each token found in the corpus:
- Calculate #docs and total frequency
- Sort by frequency
Often called a “reverse index” (or inverted index), because it reverses the “words in a document” index to be a “documents containing words” index.
May be built on the fly or created after indexing.
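The two loops of the basic algorithm can be sketched as below. This is a minimal in-memory version under assumed names (`build_index`, `postings`, `stats`); sorting by descending total frequency is an assumption, since the slide does not state the direction.

```python
from collections import defaultdict


def build_index(corpus):
    """Build postings and per-token statistics.

    corpus: dict mapping doc ID -> list of (already tokenized, stemmed) tokens.
    """
    postings = defaultdict(dict)  # token -> {doc ID: frequency}
    # First loop: create or update a (doc ID, frequency) entry per token.
    for doc_id, tokens in corpus.items():
        for token in tokens:
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    # Second loop: per token, compute (#docs, total frequency).
    stats = {t: (len(p), sum(p.values())) for t, p in postings.items()}
    # Sort tokens by total frequency, most frequent first (assumed order).
    ranked = sorted(stats, key=lambda t: stats[t][1], reverse=True)
    return dict(postings), stats, ranked


corpus = {
    1: ["index", "term", "index"],
    2: ["term", "query"],
    3: ["index", "query", "index"],
}
postings, stats, ranked = build_index(corpus)
print(postings["index"])  # {1: 2, 3: 2}
print(stats["index"])     # (2, 4)
print(ranked[0])          # index
```

The `postings` mapping is the "documents containing words" direction: looking up a token yields every document that contains it.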
Refined Posting Data Structures
Minimizing OS dependencies
Fine Points
- Dynamic corpora (e.g., the web): requires incremental algorithms
- Higher-resolution data (e.g., char position)
- Supports highlighting
- Supports phrase searching
- Useful in relevance ranking
- Giving extra weight to proxy text (typically by doubling or tripling frequency count)
- Document-type-specific processing
- In HTML, want to ignore tags
- In email, maybe want to ignore quoted material
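Several of these fine points can be combined in one sketch: documents are added one at a time (incremental), positions are stored with each posting, and proxy text gets extra weight. The names and `PROXY_WEIGHT = 2` are assumptions, and token positions stand in for the character positions mentioned above.

```python
from collections import defaultdict

PROXY_WEIGHT = 2  # assumption: double the frequency count for proxy text


def new_index():
    """token -> doc ID -> {"freq": weighted count, "positions": body offsets}."""
    return defaultdict(lambda: defaultdict(lambda: {"freq": 0, "positions": []}))


def index_document(index, doc_id, proxy_tokens, body_tokens):
    """Add one document to an existing index (incremental update)."""
    for token in proxy_tokens:
        index[token][doc_id]["freq"] += PROXY_WEIGHT
    for pos, token in enumerate(body_tokens):
        entry = index[token][doc_id]
        entry["freq"] += 1
        entry["positions"].append(pos)


def phrase_docs(index, first, second):
    """Doc IDs whose body has `second` immediately after `first`."""
    hits = set()
    if first not in index or second not in index:
        return hits
    for doc_id, entry in index[first].items():
        if doc_id in index[second]:
            following = set(index[second][doc_id]["positions"])
            if any(p + 1 in following for p in entry["positions"]):
                hits.add(doc_id)
    return hits


index = new_index()
index_document(index, 1, ["information", "retrieval"],
               ["information", "retrieval", "systems"])
index_document(index, 2, ["retrieval"], ["retrieval", "of", "information"])
print(index["retrieval"][1]["freq"])                     # 3 (2 proxy + 1 body)
print(phrase_docs(index, "information", "retrieval"))    # {1}
```

The stored positions are what make phrase search possible; the same offsets could drive highlighting of matches in the body text.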
Basic Measures for Text Retrieval
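Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
\[ \text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \]
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
\[ \text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \]
[Figure: Venn diagram within All Documents showing the Relevant set, the Retrieved set, and their overlap (Relevant & Retrieved)]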
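Precision and recall, as defined above, can be computed directly from the two sets. A minimal sketch; the document IDs are made up, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved doc IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r = precision_recall(relevant, retrieved)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 1/2).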