
Creating and Exploring Frequency Lists

Marco Baroni

Computational skills for text analysis

Outline

Introduction

Perl tables

Collecting frequency lists

Exploring frequency lists

Stop(-word) lists


Why frequency?

I Occurrence and co-occurrence counts (“frequency”) at the core of any statistical method to analyze/extract information from text

I Frequency analysis to study not only what is possible, but also what is common, “natural”

Collecting frequency lists

I To count words, start from tokenized text (so you know what a word is)

I We need a new type of variable to store a table in Perl


Tables in Perl

I Hash tables, or hashes: complex variables that contain a table where each key X (variable, number, string) is associated with a value Y (variable, number, string)

I A hash: %table
I A key: $i
I The corresponding value: $table{$i}

Hash table
Contents of a hash table named %hash

$x      $hash{$x}
cat     12
dog     23
mouse   7

Populate a table

I Create/update a table row with the key dog and the value 23:
  $hash{"dog"} = 23;

I The same, but now dog is in $word and 23 is in $freq:
  $hash{$word} = $freq;

I Incrementing the value corresponding to the $word key:
  $hash{$word} += 1;

I More compactly:
  $hash{$word}++;

Printing all rows of a table

foreach $key (keys %hash) {
  print "$key $hash{$key}\n";
}

# NB: keys returns the key list (an array!)

# NB2: a new kind of loop:
# foreach item (array) {}
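Putting the two previous slides together, a minimal runnable sketch (the cat/dog/mouse counts are the made-up ones from the example table above):

```perl
# build the example table
$hash{"cat"} = 12;
$hash{"dog"} = 23;
$hash{"mouse"} = 7;

# increment one value, as we will do for word counts
$hash{"mouse"}++;

# print every row of the table
foreach $key (keys %hash) {
    print "$key $hash{$key}\n";
}
```

Note that keys returns the keys in no particular order; sorting comes next.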


Collecting frequency lists
From tokenized input

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline

  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we print it
foreach $key (keys %freqtable) {
  print "$key $freqtable{$key}\n";
}

Sorting in alphabetical (ASCII) order

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline

  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we sort it and print it
foreach $key (sort (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}

Sorting by decreasing frequency
Weird syntax, just learn it as an idiom

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline

  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we sort it and print it (code on multiple lines
# just so it fits on the slide)
foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
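The sort block in isolation, on a toy table (counts made up for the example), to show what {$freqtable{$b} <=> $freqtable{$a}} does:

```perl
%freqtable = ("the" => 10, "meows" => 2, "cat" => 7);

# $a and $b are the two keys being compared; <=> is numerical
# comparison; putting $b before $a gives decreasing order
@by_freq = sort {$freqtable{$b} <=> $freqtable{$a}} (keys %freqtable);

print "@by_freq\n"; # the cat meows
```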

Frequency of 2-word sequences

$key        $freqtable{$key}
the cat     10
cat meows   2
a cat       7

Frequency of 2-word sequences

while (<>) {
  $input = $_;
  $input =~ s/\n//;
  push @sequence, $input; # add to sequence

  if (!defined($sequence[1])) { next; } # not enough items yet

  $curr_string = join " ", @sequence; # concatenate
  $freqtable{$curr_string}++; # increment sequence count

  shift @sequence; # remove first item in sequence
}

foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
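The same sliding-window bookkeeping on a toy token list instead of standard input, to make the role of @sequence concrete (tokens made up for the example):

```perl
@tokens = ("the", "cat", "meows", "the", "cat");

foreach $input (@tokens) {
    push @sequence, $input;               # add to sequence
    if (!defined($sequence[1])) { next; } # only one token so far
    $curr_string = join " ", @sequence;   # concatenate
    $freqtable{$curr_string}++;           # increment bigram count
    shift @sequence;                      # drop first item
}
# $freqtable{"the cat"} is now 2; the other bigrams occur once
```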

Longer sequences

$last_index = 2; # to control sequence size
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  push @sequence, $input;

  if (!defined($sequence[$last_index])) { next; }

  $curr_string = join " ", @sequence;
  $freqtable{$curr_string}++;

  shift @sequence;
}

foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}


Exploring frequency lists

I Filtering frequency lists with:
  I Regexps on the words they contain
  I Numerical conditions (such as: > “greater than”) on the frequency counts

I Store the Brown word and bigram frequency lists in two files, so that we can play with them

The general frame

while (<>) {
  $input = $_;
  $input =~ s/\n//;

  ($word1,$freq) = split " ", $input;
  # split is the inverse of join
  # and with (), perl treats $word1 and $freq as the
  # first and second elements of the array implicitly
  # created by split

  # how would you extend to bigrams, trigrams?

  if (CONDITIONS) {
    print "$input\n";
  }
}
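A quick sketch of split as the inverse of join, on a made-up frequency-list line:

```perl
$input = "love 386"; # a made-up line of a frequency list

# split on whitespace; with the (), the first piece goes
# to $word1 and the second to $freq
($word1, $freq) = split " ", $input;

# join glues the pieces back together with a separator
$rebuilt = join " ", ($word1, $freq);
print "$rebuilt\n"; # love 386
```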

Example: love forms

while (<>) {
  $input = $_;
  $input =~ s/\n//;

  ($word1,$freq) = split " ", $input;

  if ($word1 =~ /^lov(e[sd]?|ing)$/) {
    print "$input\n";
  }
}

Example: love forms that occur at least 50 times

while (<>) {
  $input = $_;
  $input =~ s/\n//;

  ($word1,$freq) = split " ", $input;

  if (($word1 =~ /^lov(e[sd]?|ing)$/) &&
      ($freq > 49)) {
    print "$input\n";
  }
}

Practice time

I Extract the frequency of all words that contain at least 5 characters; do not print words that occur less than 5 times

I Extract the frequency of words that contain at least 6 characters and end in -ment or -ion

I Extract the bigrams that contain a form of love as first or second element

I You will need to formulate an or condition (with ||) since the love form could be the first or the second word of the bigram


The need for more cleaning

I When looking at keywords in context qualitatively, it made sense to preserve the original context in which keywords appeared

I However, once we start extracting frequency lists, numbers, punctuation marks and function words (such as of and the) are annoying, and clutter the resulting lists

I You can weed out numbers, punctuation marks and other non-(fully)-alphabetical stuff from your tokenized corpus with one or more regular expressions

I A rather draconian approach might involve something like:
  if ($input =~ /[^a-zA-Z'\-]/) { next; }

I (This assumes that the newline character was stripped off from $input)

I You can also clean up the frequency lists, instead of the tokenized corpus: what’s the difference?

I However, function words do not contain easy-to-spot special characters
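A minimal loop built around the draconian condition above, run on a few made-up tokens (in the real program, the tokens would come from the corpus, one per line, newline stripped):

```perl
foreach $input ("cat", "don't", "42", "hello!", "re-write") {
    # skip any token containing a character that is not a
    # letter, an apostrophe or a hyphen
    if ($input =~ /[^a-zA-Z'\-]/) { next; }
    push @kept, $input;
}
print "@kept\n"; # cat don't re-write
```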

Function words

I Function words (prepositions, articles, auxiliary verbs. . . ) such as the, of, is are very, very frequent, and (for most purposes) not that interesting

I They also “block” the harvesting of more interesting co-occurrences

I In state of the nation, the interesting bigram is state nation, not state of, of the, the nation

I Solution: collect list of most frequent words from corpus (or other source), and write program to filter words from this stop list out of the corpus

Creating a stop-word list

I From the Brown single word frequency list, create a stop.txt file with all the words that occur N times or more in the Brown

I One word per line (no frequencies!)

I By looking at the sorted frequency list, I think that 500 occurrences might be a good threshold, but you can try other Ns, if you so wish

Removing stop-words

$stop_file = shift;
$tok_corpus = shift;

open STOP, $stop_file;
while (<STOP>) {
  $input = $_;
  $input =~ s/\n//;
  $stop_list{$input} = 1; # value is arbitrary, we will just
}                         # check if token is in %stop_list hash
close STOP;

open CORPUS, $tok_corpus;
while (<CORPUS>) {
  $input = $_;
  $input =~ s/\n//;
  if (!defined($stop_list{$input})) {
    print "$input\n";
  }
}
close CORPUS;

Removing stop-words
Opening, reading and closing files explicitly

I We need to follow an explicit procedure: get the file names from the command line arguments, open each file, read it one line at a time, close it

I We cannot use the while (<>) { ... } shortcut, because this time we need to juggle two input files (the stop word list, and the tokenized corpus)

I We encountered the full procedure in the Basics slides, but we forgot about it

I Note the funny filehandle variables, without $ and (by convention) UPPERCASED, that Perl uses to connect to external files

Removing stop-words
Checking set membership in Perl

I We just store the set members (here, the stop words) as keys of a hash table with arbitrary values (here, 1s)

I We can then efficiently check if an item is in the set simply by checking whether we have a value corresponding to that item in the hash table: defined($set{$item})

I Note that instead of checking if a word is not in the stop list in order to print it, we could have checked if it is in the stop list, and in that case issued a next command to move to the next line of the tokenized file
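The set-membership trick in a nutshell (stop words and tokens made up for the example):

```perl
# store the set members as hash keys; the value 1 is arbitrary
foreach $w ("the", "of", "is") { $stop_list{$w} = 1; }

# keep only tokens that are not in the set
foreach $token ("the", "cat", "is", "cute") {
    if (!defined($stop_list{$token})) { push @kept, $token; }
}
print "@kept\n"; # cat cute
```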

Filtering the Brown

I Run the program to remove stop words on the tokenized Brown

I It will need two input arguments: the name of the stop list, and the name of the tokenized Brown file

I Extract a bigram frequency list from the filtered tokenized file you created, and compare it to the frequency list you obtained before stop-list filtering

I NB: it is also sometimes useful to perform keep-word filtering, i.e., to filter the corpus (or a frequency list) to preserve only words that are found in a certain list (might be relevant to your project)
