
Creating and Exploring Frequency Lists

Marco Baroni

Computational skills for text analysis

Outline

Introduction

Perl tables

Collecting frequency lists

Exploring frequency lists

Stop(-word) lists


Why frequency?

I Occurrence and co-occurrence counts (“frequency”) at the core of any statistical method to analyze/extract information from text

I Frequency analysis to study not only what is possible, but also what is common, “natural”

Collecting frequency lists

I To count words, start from tokenized text (so you know what a word is)

I We need a new type of variable to store a table in Perl


Tables in Perl

I Hash tables, or hashes: complex variables that contain a table where each key X (variable, number, string) is associated with a value Y (variable, number, string)

I A hash: %table
I A key: $i
I The corresponding value: $table{$i}

Hash table
Contents of a hash table named %hash

$x      $hash{$x}
cat     12
dog     23
mouse   7

Populate a table

I Create/update a table row with the key dog and the value 23:
  $hash{"dog"} = 23;

I The same, but now dog is in $word and 23 is in $freq:
  $hash{$word} = $freq;

I Incrementing the value corresponding to the $word key:
  $hash{$word} += 1;

I More compactly:
  $hash{$word}++;

Printing all rows of a table

foreach $key (keys %hash) {
  print "$key $hash{$key}\n";
}

# NB: keys returns the key list (an array!)

# NB2: a new kind of loop:
# foreach item (array) {}
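Putting the two previous slides together, a minimal runnable sketch (the cat/dog/mouse counts are the made-up ones from the example table above):

```perl
# build the example table
$hash{"cat"} = 12;
$hash{"dog"} = 23;
$hash{"mouse"} = 7;

# increment one value, as we will do for word counts
$hash{"mouse"}++;

# print every row of the table
foreach $key (keys %hash) {
    print "$key $hash{$key}\n";
}
```

Note that keys returns the keys in no particular order; sorting comes next.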


Collecting frequency lists
From tokenized input

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline

  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we print it
foreach $key (keys %freqtable) {
  print "$key $freqtable{$key}\n";
}

Sorting in alphabetical (ASCII) order

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline

  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we sort it and print it
foreach $key (sort (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}

Sorting by decreasing frequency
Weird syntax, just learn it as an idiom

while (<>) {
  $input = $_;
  $input =~ s/\n//; # remove the newline

  # increment count for word in $input:
  $freqtable{$input}++;
}
# we traversed the whole input, the list is ready

# we sort it and print it (code on multiple lines
# just so it fits on the slide)
foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
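The sort block in isolation, on a toy table (counts made up for the example), to show what {$freqtable{$b} <=> $freqtable{$a}} does:

```perl
%freqtable = ("the" => 10, "meows" => 2, "cat" => 7);

# $a and $b are the two keys being compared; <=> is numerical
# comparison; putting $b before $a gives decreasing order
@by_freq = sort {$freqtable{$b} <=> $freqtable{$a}} (keys %freqtable);

print "@by_freq\n"; # the cat meows
```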

Frequency of 2-word sequences

$key        $freqtable{$key}
the cat     10
cat meows   2
a cat       7

Frequency of 2-word sequences

while (<>) {
  $input = $_;
  $input =~ s/\n//;
  push @sequence, $input; # add to sequence

  if (!defined($sequence[1])) { next; } # not enough items yet

  $curr_string = join " ", @sequence; # concatenate
  $freqtable{$curr_string}++; # increment sequence count

  shift @sequence; # remove first item in sequence
}

foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}
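The same sliding-window bookkeeping on a toy token list instead of standard input, to make the role of @sequence concrete (tokens made up for the example):

```perl
@tokens = ("the", "cat", "meows", "the", "cat");

foreach $input (@tokens) {
    push @sequence, $input;               # add to sequence
    if (!defined($sequence[1])) { next; } # only one token so far
    $curr_string = join " ", @sequence;   # concatenate
    $freqtable{$curr_string}++;           # increment bigram count
    shift @sequence;                      # drop first item
}
# $freqtable{"the cat"} is now 2; the other bigrams occur once
```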

Longer sequences

$last_index = 2; # to control sequence size
while (<>) {
  $input = $_;
  $input =~ s/\n//;
  push @sequence, $input;

  if (!defined($sequence[$last_index])) { next; }

  $curr_string = join " ", @sequence;
  $freqtable{$curr_string}++;

  shift @sequence;
}

foreach $key
  (sort {$freqtable{$b} <=> $freqtable{$a}}
  (keys %freqtable)) {
  print "$key $freqtable{$key}\n";
}


Exploring frequency lists

I Filtering frequency lists with:
  I Regexps on the words they contain
  I Numerical conditions (such as: > “greater than”) on the frequency counts

I Store the Brown word and bigram frequency lists in two files, so that we can play with them

The general frame

while (<>) {
  $input = $_;
  $input =~ s/\n//;

  ($word1,$freq) = split " ", $input;
  # split is the inverse of join
  # and with (), perl treats $word1 and $freq as the
  # first and second elements of the array implicitly
  # created by split

  # how would you extend to bigrams, trigrams?

  if (CONDITIONS) {
    print "$input\n";
  }
}
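A quick sketch of split as the inverse of join, on a made-up frequency-list line:

```perl
$input = "love 386"; # a made-up line of a frequency list

# split on whitespace; with the (), the first piece goes
# to $word1 and the second to $freq
($word1, $freq) = split " ", $input;

# join glues the pieces back together with a separator
$rebuilt = join " ", ($word1, $freq);
print "$rebuilt\n"; # love 386
```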

Example: love forms

while (<>) {
  $input = $_;
  $input =~ s/\n//;

  ($word1,$freq) = split " ", $input;

  if ($word1 =~ /^lov(e[sd]?|ing)$/) {
    print "$input\n";
  }
}

Example: love forms that occur at least 50 times

while (<>) {
  $input = $_;
  $input =~ s/\n//;

  ($word1,$freq) = split " ", $input;

  if (($word1 =~ /^lov(e[sd]?|ing)$/) &&
      ($freq > 49)) {
    print "$input\n";
  }
}

Practice time

I Extract the frequency of all words that contain at least 5 characters; do not print words that occur less than 5 times

I Extract the frequency of words that contain at least 6 characters and end in -ment or -ion

I Extract the bigrams that contain a form of love as first or second element

I You will need to formulate an or condition (with ||) since the love form could be the first or the second word of the bigram


The need for more cleaning

I When looking at keywords in context qualitatively, it made sense to preserve the original context in which keywords appeared

I However, once we start extracting frequency lists, numbers, punctuation marks and function words (such as of and the) are annoying, and clutter the resulting lists

I You can weed out numbers, punctuation marks and other non-(fully)-alphabetical stuff from your tokenized corpus with one or more regular expressions

I A rather draconian approach might involve something like:
  if ($input =~ /[^a-zA-Z'\-]/) { next; }

I (This assumes that the newline character was stripped off from $input)

I You can also clean up the frequency lists, instead of the tokenized corpus: what’s the difference?

I However, function words do not contain easy-to-spot special characters
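A minimal loop built around the draconian condition above, run on a few made-up tokens (in the real program, the tokens would come from the corpus, one per line, newline stripped):

```perl
foreach $input ("cat", "don't", "42", "hello!", "re-write") {
    # skip any token containing a character that is not a
    # letter, an apostrophe or a hyphen
    if ($input =~ /[^a-zA-Z'\-]/) { next; }
    push @kept, $input;
}
print "@kept\n"; # cat don't re-write
```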

Function words

I Function words (prepositions, articles, auxiliary verbs. . . ) such as the, of, is are very, very frequent, and (for most purposes) not that interesting

I They also “block” the harvesting of more interesting co-occurrences

I In state of the nation, the interesting bigram is state nation, not state of, of the, the nation

I Solution: collect list of most frequent words from corpus (or other source), and write program to filter words from this stop list out of the corpus

Creating a stop-word list

I From the Brown single word frequency list, create a stop.txt file with all the words that occur N times or more in the Brown

I One word per line (no frequencies!)

I By looking at the sorted frequency list, I think that 500 occurrences might be a good threshold, but you can try other Ns, if you so wish

Removing stop-words

$stop_file = shift;
$tok_corpus = shift;

open STOP, $stop_file;
while (<STOP>) {
  $input = $_;
  $input =~ s/\n//;
  $stop_list{$input} = 1; # value is arbitrary, we will just
}                         # check if token is in %stop_list hash
close STOP;

open CORPUS, $tok_corpus;
while (<CORPUS>) {
  $input = $_;
  $input =~ s/\n//;
  if (!defined($stop_list{$input})) {
    print "$input\n";
  }
}
close CORPUS;

Removing stop-words
Opening, reading and closing files explicitly

I We need to follow an explicit procedure: get the file names from the command line arguments, open each file, read it one line at a time, close it

I We cannot use the while (<>) { ... } shortcut, because this time we need to juggle two input files (the stop word list, and the tokenized corpus)

I We encountered the full procedure in the Basics slides, but we forgot about it

I Note the funny filehandle variables, without $ and (by convention) UPPERCASED, that Perl uses to connect to external files

Removing stop-words
Checking set membership in Perl

I We just store the set members (here, the stop words) as keys of a hash table with arbitrary values (here, 1s)

I We can then efficiently check if an item is in the set simply by checking whether we have a value corresponding to that item in the hash table: defined($set{$item})

I Note that instead of checking if a word is not in the stop list in order to print it, we could have checked if it is in the stop list, and in that case issued a next command to move to the next line of the tokenized file
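The set-membership trick in a nutshell (stop words and tokens made up for the example):

```perl
# store the set members as hash keys; the value 1 is arbitrary
foreach $w ("the", "of", "is") { $stop_list{$w} = 1; }

# keep only tokens that are not in the set
foreach $token ("the", "cat", "is", "cute") {
    if (!defined($stop_list{$token})) { push @kept, $token; }
}
print "@kept\n"; # cat cute
```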

Filtering the Brown

I Run the program to remove stop words on the tokenized Brown

I It will need two input arguments: the name of the stop list, and the name of the tokenized Brown file

I Extract a bigram frequency list from the filtered tokenized file you created, and compare it to the frequency list you obtained before stop-list filtering

I NB: it is also sometimes useful to perform keep-word filtering, i.e., to filter the corpus (or a frequency list) to preserve only words that are found in a certain list (might be relevant to your project)
