
Page 1: Text Corpora and Lexical Resources

Text Corpora and Lexical Resources

Chapter 2 of Natural Language Processing with Python

Page 2: Text Corpora and Lexical Resources

So far --
• We have learned the basics of Python
  – Reading and writing – interactive and files
  – Control structures: if, while, for, function and class definitions
  – Important data structures: lists, tuples, numeric (int and float)
  – Basic natural language processing techniques

Page 3: Text Corpora and Lexical Resources

Tonight
• Expanding the scope of textual information we can access
• Additional language constructions for working with text
• Reintroduce some Python structures for organizing programs

Page 4: Text Corpora and Lexical Resources

Text corpora
• A collection of text entities
  – Usually there is some unifying characteristic, but not always
  – Typical examples
    • All issues of a newspaper for a period of time
    • A collection of reports from a particular industry or standards body
  – More recent examples
    • The whole collection of posts to Twitter
    • All the entries in a blog or set of blogs

Page 5: Text Corpora and Lexical Resources

Check it out
• Go to http://www.gutenberg.org/
• Take a few minutes to explore the site.
  – Look at the top 100 downloads of yesterday
  – Can you characterize them? What do you think of this list?

Page 6: Text Corpora and Lexical Resources

Corpora in nltk
• The nltk includes part of the Gutenberg collection
• Find out which ones by:

>>> nltk.corpus.gutenberg.fileids()

• These are the texts of the Gutenberg collection that are downloaded with the nltk package.

Page 7: Text Corpora and Lexical Resources

Accessing other texts
• We will explore the files loaded with nltk
• You may want to explore other texts also.
• From help(nltk.corpus):
  – If C{item} is one of the unique identifiers listed in the corpus module's C{items} variable, then the corresponding document will be loaded from the NLTK corpus package.
  – If C{item} is a filename, then that file will be read.

For now – just a note that we can use these tools on other texts that we download or acquire from any source.

Page 8: Text Corpora and Lexical Resources

Using the tools we saw before
• The particular texts we saw in Chapter 1 were accessed through aliases that simplified the interaction.
• Now, in the more general case, we have to do more.
• To get the list of words in a text:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')

• Now we have the form we had for the texts of Chapter 1 and can use the tools found there. Try:

>>> len(emma)

Note how frequently the Jane Austen books are used as examples.

Page 9: Text Corpora and Lexical Resources

Shortened reference
• Global context
  – Instead of citing the gutenberg corpus for each resource:

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

• So, nltk.corpus.gutenberg.words('austen-emma.txt') becomes just gutenberg.words('austen-emma.txt')

Page 10: Text Corpora and Lexical Resources

Other access options
• gutenberg.words('austen-emma.txt')
  – the words of the text
• gutenberg.raw('austen-emma.txt')
  – the original text, no separation into tokens (words). One long string.
• gutenberg.sents('austen-emma.txt')
  – the text divided into sentences

Page 11: Text Corpora and Lexical Resources

Some code to run
• Enter and run the code for counting characters, words, and sentences and finding the lexical diversity score of each text in the corpus.

import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print int(num_chars/num_words), int(num_words/num_sents), \
        int(num_words/num_vocab), fileid

Short, simple code. Already seeing some noticeable time to execute.

Page 12: Text Corpora and Lexical Resources

Modify the code
• Simple change – print out the total number of characters, words, and sentences for each text.
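One way to make the change – a minimal sketch that drops the ratios and prints the raw totals instead:

import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # total characters
    num_words = len(gutenberg.words(fileid))  # total words (tokens)
    num_sents = len(gutenberg.sents(fileid))  # total sentences
    print num_chars, num_words, num_sents, fileid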

Page 13: Text Corpora and Lexical Resources

The text corpus
• Take a look at your directory of nltk_data to see the variety of text materials accessible to you.
  – Some are not plain text and we cannot use them yet – but we will
  – Of the plain text, note the diversity
    • Classic published materials
    • News feeds, movie reviews
    • Overheard conversations, internet chat
  – All categories of language are needed to understand the language as it is defined and as it is used.

Page 14: Text Corpora and Lexical Resources

The Brown Corpus
• The first million-word electronic corpus of English
• Explore –
  – What are the categories?
  – Access words or sentences from one or more categories or fileids

>>> from nltk.corpus import brown
>>> brown.categories()
>>> brown.fileids(categories='<choose>')

Page 15: Text Corpora and Lexical Resources

Stylistics

• Enter this code and run it.
• What does it give you?
• What does it mean?

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],

Page 16: Text Corpora and Lexical Resources

Spot check
• Repeat the previous code, but look for the use of those same words in the categories for religion and government.
• Now analyze the use of the "wh" words in the news category and one other category of your choice. (who, what, where, when, why) A sketch of one approach follows.
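Reusing the pattern from the previous slide (the second category here, 'government', is just an example choice):

>>> from nltk.corpus import brown
>>> wh_words = ['who', 'what', 'where', 'when', 'why']
>>> for category in ['news', 'government']:
...     fdist = nltk.FreqDist([w.lower() for w in brown.words(categories=category)])
...     print category
...     for m in wh_words:
...         print m + ':', fdist[m],
...     print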

Page 17: Text Corpora and Lexical Resources

One step comparison
• Consider the following code:

import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

Enter and run it. What does it do?

Page 18: Text Corpora and Lexical Resources

Other corpora
• There is some information about the Reuters and Inaugural Address corpora also. Take a look at them with the online site. (5 minutes or so)

Page 19: Text Corpora and Lexical Resources

Spot Check
• Take a look at Table 2-2 for a list of some of the material available from the nltk project. (I cannot fit it on a slide in any meaningful way.)
• Confirm that you have downloaded all of these (when you ran nltk.download(), if you selected "all").
• Find them in your directory and explore.
  – How many languages are represented?
  – How would you describe the variety of content? What do you find most interesting/unusual/strange/fun?

Page 20: Text Corpora and Lexical Resources

Languages
• The Universal Declaration of Human Rights is available in 300 languages.

>>> from nltk.corpus import udhr
>>> udhr.fileids()

Page 21: Text Corpora and Lexical Resources

Organization of Corpora
• The organization will vary according to the type of corpus. Knowing the organization may be important for using the corpus.

Page 22: Text Corpora and Lexical Resources

Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of locally installed corpus
readme()                     the contents of the README file of the corpus

Table 2.3 – Basic Corpus Functionality in NLTK

Page 23: Text Corpora and Lexical Resources

from help(nltk.corpus.reader):

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

  - I{corpus}.words(): list of str
  - I{corpus}.sents(): list of (list of str)
  - I{corpus}.paras(): list of (list of (list of str))
  - I{corpus}.tagged_words(): list of (str,str) tuple
  - I{corpus}.tagged_sents(): list of (list of (str,str))
  - I{corpus}.tagged_paras(): list of (list of (list of (str,str)))
  - I{corpus}.chunked_sents(): list of (Tree w/ (str,str) leaves)
  - I{corpus}.parsed_sents(): list of (Tree with str leaves)
  - I{corpus}.parsed_paras(): list of (list of (Tree with str leaves))
  - I{corpus}.xml(): A single xml ElementTree
  - I{corpus}.raw(): unprocessed corpus contents

For example, to read a list of the words in the Brown Corpus, use C{nltk.corpus.brown.words()}:

>>> from nltk.corpus import brown
>>> print brown.words()

Types of information returned from typical corpus reader functions
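A quick illustration of two of these return types, using the Brown Corpus (the slices just keep the output short):

>>> from nltk.corpus import brown
>>> brown.words()[:3]         # list of str
>>> brown.tagged_words()[:3]  # list of (str, str) tuples: each word paired with its part-of-speech tag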

Page 24: Text Corpora and Lexical Resources

Spot check
• Choose a corpus and exercise some of the functions
  – Look at raw, words, sents, categories, fileids, encoding
• Repeat for a source in a different language.
• Work in pairs and talk about what you find and what you might want to look for.
  – Report out briefly
• One possible walkthrough is sketched below.
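Using the Brown Corpus as the example (the fileid 'cg22' is one of its files; any other would do):

>>> from nltk.corpus import brown
>>> brown.categories()             # the genre labels
>>> brown.fileids()[:5]            # a few of the files
>>> brown.words(fileids=['cg22'])  # the words of one file
>>> brown.sents(categories='news') # the sentences of one category
>>> brown.raw('cg22')[:60]         # a slice of the raw text
>>> brown.encoding('cg22')         # the encoding of one file (if known)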

Page 25: Text Corpora and Lexical Resources

Working with your own sources

• NLTK provides a great bunch of resources, but you will certainly want to access your own collections – other books you download, or files you create, etc.

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

You could get the list of files in any directory this way.

Page 26: Text Corpora and Lexical Resources

Other Corpus readers
• There are a number of different readers for different types of corpora.
• Many files in corpora are "marked up" in various ways, and the reader needs to understand the markings to return meaningful results.
• We will stick to the PlaintextCorpusReader for now.

Page 27: Text Corpora and Lexical Resources

Conditional Frequency Distribution

• When texts in a corpus are divided into categories, we may want to look at the characteristics by category – word use by author or over time, for example.

Figure 2.4: Counting Words Appearing in a Text Collection (a conditional frequency distribution)

Page 28: Text Corpora and Lexical Resources

Frequency Distributions
• A frequency distribution counts some occurrence, such as the use of a word or phrase.
• A conditional frequency distribution counts some occurrence separately for each of some number of conditions (author, date, genre, etc.)
• For example:

>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
170576

Think about this. What exactly is happening? What are those 170,576 things? Run the code, then enter just:

>>> genre_word

Page 29: Text Corpora and Lexical Resources

• For each genre ('news', 'romance')
• loop over every word in that genre
• produce the pairs showing the genre and the word
• What type of data is genre_word?

>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
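genre_word is a list of pairs (2-tuples) of strings. Peeking at both ends shows the news pairs at the front and the romance pairs at the back; the output should look like this:

>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]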

Page 30: Text Corpora and Lexical Resources

Spot check
• Refining the result
  – When you displayed genre_word, you may have noticed that some of the "words" are not words at all. They are punctuation marks.
  – Refine this code to eliminate the entries in genre_word in which the word is not all alphabetic.
  – Remove duplicate words that differ only in capitalization.

Work together. Talk about what you are doing. Share your ideas and insights. One possible refinement is sketched below.
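A minimal sketch of one way to do both refinements at once – keep only alphabetic tokens, and lowercase every word so that capitalization variants collapse into a single entry:

>>> genre_word = [(genre, word.lower())
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)
...     if word.isalpha()]
>>> len(genre_word)    # smaller than 170576 now that punctuation is gone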

Page 31: Text Corpora and Lexical Resources

Conditional Frequency Distribution

• From the list of pairs we created, we can generate a conditional frequency distribution of words by genre:

>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
>>> cfd.conditions()

Run these. Look at the results.

Page 32: Text Corpora and Lexical Resources

Look at the conditional distributions

>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him',
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193

Page 33: Text Corpora and Lexical Resources

Presenting the results
• Plotting and tabulating – concise representations of the frequency distributions
• Tabulate
  – With no parameters, simply tabulates all the conditions against all the values:

cfd.tabulate()

Page 34: Text Corpora and Lexical Resources

Look closely

>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))

Reading the pieces (remember list comprehensions?):
• the import gets the text
• (target, fileid[:4]) gives the two axes – the target word and the year taken from the filename
• the loops over inaugural.fileids() and inaugural.words(fileid) visit all the words in each file
• the loop over ['america', 'citizen'] with the startswith test narrows the word choice

Page 35: Text Corpora and Lexical Resources

Three elements
• For a conditional frequency distribution:
  – Two axes
    • condition or event, something of interest
    • some connected characteristic – a year, a place, an author, anything that is related in some way to the event
  – Something to count
    • For the condition and the characteristic, what are we counting? Words? Actions? What?
  – From the previous example
    • inaugural addresses
    • specific words
    • count the number of times that a form of either of those words occurred in that address

Page 36: Text Corpora and Lexical Resources

Spot check
• Run the code on the previous example.
• How many times was some version of "citizen" used in the 1909 inaugural address?
• How many times was "america" mentioned in 2009?
• Play with the code. What can you leave off and still get some meaningful output?
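Since the conditions are the target words and the samples are the four-digit years, both questions are direct lookups (run them to get the actual counts):

>>> cfd['citizen']['1909']
>>> cfd['america']['2009']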

Page 37: Text Corpora and Lexical Resources

Another case
• Somewhat simpler specification
• Distribution of word lengths across languages, with a restriction on which languages

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))

Page 38: Text Corpora and Lexical Resources

Now tabulate

• Only choose to tabulate some of the results.

>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...     samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275

Note – so far, I cannot do plots. I hope to get that fixed. If you can do plots, do try some of the examples.

Page 39: Text Corpora and Lexical Resources

Common methods for Conditional Frequency Distributions

• cfdist = ConditionalFreqDist(pairs) – create a conditional frequency distribution from a list of pairs
• cfdist.conditions() – alphabetically sorted list of conditions
• cfdist[condition] – the frequency distribution for this condition
• cfdist[condition][sample] – frequency for the given sample for this condition
• cfdist.tabulate() – tabulate the conditional frequency distribution
• cfdist.tabulate(samples, conditions) – tabulation limited to the specified samples and conditions
• cfdist.plot() – graphical plot of the conditional frequency distribution
• cfdist.plot(samples, conditions) – graphical plot limited to the specified samples and conditions
• cfdist1 < cfdist2 – test if samples in cfdist1 occur less frequently than in cfdist2
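A quick way to exercise a few of these, reusing the word-length cfd built from the udhr corpus on Page 37 (output omitted; run it to see):

>>> cfd.conditions()    # the six language names
>>> cfd['English']      # frequency distribution of English word lengths
>>> cfd['English'][5]   # how many 5-letter English words
>>> cfd.tabulate(conditions=['English'], samples=range(10))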

Page 40: Text Corpora and Lexical Resources

References
• This set of slides comes very directly from the book Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper. www.nltk.org