Text statistics 7
Day 30 - 11/05/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
03-Nov-2014, NLP, Prof. Howard, Tulane University
Final project
Open Spyder
Review
ConditionalFreqDist
>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> # pair every word with the category it comes from
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
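To see what ConditionalFreqDist does with the (condition, sample) pairs, here is a minimal sketch that uses only the Python standard library. It is not NLTK's implementation; the function name build_cfd and the toy word lists are assumptions made for illustration, and the words are not real Brown corpus data.

```python
from collections import Counter, defaultdict

def build_cfd(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Toy stand-in for brown.words(categories=c); hypothetical data.
toy_words = {
    'news': ['the', 'jury', 'said', 'the', 'election'],
    'romance': ['she', 'said', 'the', 'word'],
}
cat_word = [(c, w) for c in toy_words for w in toy_words[c]]
toy_cfd = build_cfd(cat_word)

print(sorted(toy_cfd))          # the conditions
print(toy_cfd['news']['the'])   # count of 'the' under the 'news' condition
```

Each condition gets its own frequency distribution, so the same word can have a different count under each category.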
Conditional frequency distribution
A more interesting example
            can  could    may  might   must   will
news         93     86     66     38     50    389
religion     82     59     78     12     54     71
hobbies     268     58    131     22     83    264
sci fi       16     49      4     12      8     16
romance      74    193     11     51     45     43
humor        16     30      8      8      9     13
Conditions = categories, samples = modal verbs
# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies',
...        'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> # keep only the words that are modal verbs
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
cfd.tabulate()
                   can  could    may  might   must   will
news                93     86     66     38     50    389
religion            82     59     78     12     54     71
hobbies            268     58    131     22     83    264
science_fiction     16     49      4     12      8     16
romance             74    193     11     51     45     43
humor               16     30      8      8      9     13
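The layout of such a table can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code behind cfd.tabulate(); the function name tabulate and the column widths are assumptions, and the two rows of counts are copied from the cfd.tabulate() output above.

```python
from collections import Counter

def tabulate(cfd, conditions, samples):
    """Return a plain-text table: one row per condition, one column per sample."""
    label_width = max(len(c) for c in conditions)
    col_width = max(len(s) for s in samples) + 1
    header = ' ' * label_width + ''.join(s.rjust(col_width) for s in samples)
    rows = [header]
    for c in conditions:
        counts = ''.join(str(cfd[c][s]).rjust(col_width) for s in samples)
        rows.append(c.rjust(label_width) + counts)
    return '\n'.join(rows)

# Two rows copied from the slide; the dict of Counters stands in for a cfd.
counts = {
    'news': Counter({'can': 93, 'could': 86, 'may': 66,
                     'might': 38, 'must': 50, 'will': 389}),
    'humor': Counter({'can': 16, 'could': 30, 'may': 8,
                      'might': 8, 'must': 9, 'will': 13}),
}
modals = ['can', 'could', 'may', 'might', 'must', 'will']
table = tabulate(counts, ['news', 'humor'], modals)
print(table)
```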
cfd.plot()
Another example
The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
First try
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# title[:4] is the year at the start of each file name
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()
cfd2.plot()
Second try
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# the condition is the matching key k, not the raw word w
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
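The difference between testing for the whole word and testing for a prefix with startswith() can be seen on a handful of tokens. The sketch below uses only the standard library; the (year, word) tokens are hypothetical and are not taken from the inaugural corpus, and for simplicity both tests here use the key as the condition.

```python
from collections import Counter, defaultdict

keys = ['america', 'citizen']

# Hypothetical (year, word) tokens for illustration only.
tokens = [
    ('1789', 'Citizens'), ('1789', 'America'),
    ('2009', 'Americans'), ('2009', 'American'), ('2009', 'citizenship'),
]

exact = defaultdict(Counter)    # whole-word test: w.lower() == k
prefix = defaultdict(Counter)   # prefix test: w.lower().startswith(k)
for year, w in tokens:
    for k in keys:
        if w.lower() == k:
            exact[k][year] += 1
        if w.lower().startswith(k):
            prefix[k][year] += 1

print(dict(exact['america']))
print(dict(prefix['america']))
print(dict(prefix['citizen']))
```

The prefix test picks up inflected and derived forms such as 'Americans' and 'citizenship' that the whole-word test misses.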
cfd3.plot()
Stemming
Third try
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# keep a word only if its stem is one of the keys
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
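The idea of filtering by stem can be illustrated without NLTK by a deliberately crude stand-in. The function toy_stem below is hypothetical: it only lowercases a word and strips one plural 's', which is far simpler than the Snowball algorithm behind EnglishStemmer, so the real stemmer may return different stems. The word list is also made up for illustration.

```python
from collections import Counter

def toy_stem(word):
    """Lowercase and strip one plural 's' (toy rule, not the Snowball algorithm)."""
    w = word.lower()
    if len(w) > 3 and w.endswith('s') and not w.endswith('ss'):
        return w[:-1]
    return w

keys = ['america', 'citizen']

# Hypothetical tokens; not taken from the inaugural corpus.
words = ['Citizens', 'citizen', 'America', 'Americans', 'Congress']

# Count by stem, keeping only the words whose stem is one of the keys.
stem_counts = Counter(toy_stem(w) for w in words if toy_stem(w) in keys)
print(stem_counts)
```

Under this toy rule 'Citizens' and 'citizen' fall together, while 'Americans' reduces to 'american' and so does not match the key 'america'. Matching on stems is therefore stricter than matching on prefixes.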
cfd4.plot()
Next time