October 2005 CSA3180: Text Processing I 1
CSA3180: Natural Language Processing
Text Processing I
• Language Encoding Issues
• Common Corpora
• Handling Large Document Collections
• Applications: Anatomy of a Search Engine
• NLTK
Language Encoding Issues
• Different encoding methods
• Different languages
• Unicode Standard
• Further information:
– Unicode Consortium
– Jukka Korpela Tutorial: http://www.cs.tut.fi/~jkorpela/chars.html
Language Encoding Issues
• Character Repertoire – a set of distinct characters
• Character Code – a mapping between the characters of a repertoire and non-negative integers (code positions)
• Character Encoding – an algorithm for representing those code numbers as sequences of octets
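The three terms above can be told apart with a small Python sketch. It assumes Unicode as the character code and UTF-8 as the encoding; the three-character repertoire and the variable names are illustrative only.

```python
import unicodedata

# Character repertoire: a set of distinct characters (a tiny, made-up one).
# \u00e9 is "e with acute", \u0127 is the Maltese "h with stroke".
repertoire = {"a", "\u00e9", "\u0127"}

# Character code: each character is assigned a non-negative integer.
# Python exposes the Unicode code number through ord().
codes = {ch: ord(ch) for ch in repertoire}

# Character encoding: the code numbers are turned into sequences of octets.
octets = {ch: ch.encode("utf-8") for ch in repertoire}

for ch in sorted(repertoire):
    print(unicodedata.name(ch), f"U+{codes[ch]:04X}", octets[ch].hex())
```

Note that the code number and the octets differ as soon as a character falls outside ASCII: U+0127 is a single code number but two octets in UTF-8.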
Language Encoding Issues
• Encoding using octets
• Common Encodings:
– ASCII
– ISO Latin 1 (ISO 8859-1)
– ISO Latin 2 and Latin 3 Extensions (ISO 8859-2, ISO 8859-3; Latin 3 covers Maltese)
– Unicode & UTF-8
– ANSI (in practice usually a Windows code page)
– Cyrillic and Chinese Encodings
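Because these encodings all work on octets, an octet by itself does not identify a character; the encoding has to be known as well. A minimal sketch, using codec names from Python's standard library and an arbitrarily chosen octet value:

```python
octet = b"\xb1"  # a single octet with value 177

# The same octet decodes to different characters under different encodings.
for encoding in ("iso-8859-1", "iso-8859-3", "cp1252"):
    print(encoding, "->", octet.decode(encoding))

# ASCII only defines codes 0-127, so it cannot decode this octet at all.
try:
    octet.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Latin 3 stores the Maltese letter in one octet; UTF-8 needs two for it.
maltese_h = octet.decode("iso-8859-3")
utf8_octets = maltese_h.encode("utf-8")
print(len(octet), "octet in Latin 3,", len(utf8_octets), "octets in UTF-8")
```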
Language Encoding Issues
• Text encoding on the Web
• MIME Standard
– Content-Type: text/html; charset=iso-8859-1
– Used in Email and Web Servers
– Problems in implementation: few encodings properly supported
– UTF-8 recommended
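A receiver has to read the charset parameter before it can decode the body. The sketch below uses Python's standard email package to parse the header value; the helper name charset_from_content_type and the fallback to UTF-8 are assumptions made for illustration, not part of the MIME standard.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


charset = charset_from_content_type("text/html; charset=iso-8859-1")

# The body arrives as octets; the declared charset says how to decode them.
raw_body = b"caf\xe9"
text = raw_body.decode(charset)
print(charset, repr(text))
```

If the header carries no charset parameter, the helper falls back to the default, which is a guess and may mis-decode the body.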
Common Corpora
• WordNet
• TREC/ACE/TIDES Corpora
• Linguistic Data Consortium (LDC)
– GigaWord (News)
– Tree Banks
– MUC (Message Understanding Conference)
– TIPSTER (Information Retrieval)
Handling Large Document Collections
• Special issues involved in processing
• Hierarchical directory structures
• File indexes
• Batch processing – start, resume, pause, end
• Job scheduling
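One way to combine a hierarchical directory walk with a batch job that can be paused and resumed is to record finished files in a checkpoint file. This is a minimal sketch under stated assumptions: the function names, the JSON checkpoint format, and the max_docs "pause" parameter are all hypothetical, and the checkpoint file is expected to live outside the document directory.

```python
import json
from pathlib import Path


def iter_documents(root):
    """Yield every file below root in a stable (sorted) order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield path


def run_batch(root, checkpoint_file, process, max_docs=None):
    """Process files under root, skipping those already in the checkpoint.

    max_docs limits how many files one call handles, which stands in for
    pausing the job. Returns the number of files handled in this call.
    """
    checkpoint = Path(checkpoint_file)
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    handled = 0
    for path in iter_documents(root):
        key = str(path.relative_to(root))
        if key in done:
            continue  # finished in an earlier run
        if max_docs is not None and handled >= max_docs:
            break  # pause; the next call resumes from here
        process(path)
        done.add(key)
        # Save progress after every file so an interrupted run loses nothing.
        checkpoint.write_text(json.dumps(sorted(done)))
        handled += 1
    return handled
```

Calling run_batch repeatedly with the same checkpoint file continues where the previous call stopped. Rewriting the checkpoint after every file is simple but slow for very large collections; writing it every N files is a common trade-off.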
Applications
• "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Sergey Brin and Lawrence Page, 1998)
• Describes the internals of Google
• NLP in everyday life!
Next Sessions…
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• Please download and install!