
Page 1


School of Computing, FACULTY OF ENGINEERING

Word-counts, visualizations and N-grams

Eric Atwell, Language Research Group

(with thanks to Katja Markert, Marti Hearst, and other contributors)

Page 2

Reminder

Tokenization - by whitespace, regular expressions

Problems: "It's", "data-base", "New York", …

Jabberwocky shows we can break even nonsense words into morphemes

Morpheme types: root/stem, affix, clitic

Derivational vs. Inflectional

Regular vs. Irregular

Concatenative vs. Templatic (root-and-pattern)

Morphological analysers: Porter stemmer, Morphy, PC-Kimmo

Morphology by lookup: CatVar, CELEX, OALD++

MorphoChallenge: Unsupervised Machine Learning of morphology
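To make the stemmer-vs-lookup distinction concrete, here is a minimal sketch contrasting NLTK's rule-based Porter stemmer with WordNet's dictionary-backed morphy lookup (the example words are illustrative, and the WordNet data must have been downloaded via nltk.download):

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
for word in ["running", "flies", "denied", "corpora"]:
    # Porter strips suffixes by rule; morphy looks the form up in
    # WordNet and returns None for words it does not know
    print(word, "->", stemmer.stem(word), "/", wordnet.morphy(word))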

Page 3

Counting Token Distributions

Useful for lots of things

One cute application: see who talks where in a novel

• Idea comes from Eick et al., who did it with The Jungle Book by Kipling

Page 4

SeeSoft Visualization of Jungle Book Characters, from Eick, Steffen, and Sumner '92

Page 5

The FreqDist Data Structure

Purpose: collect counts and frequencies for some phenomenon

• Initialize a new FreqDist:

>>> import nltk

>>> from nltk.probability import FreqDist

>>> fd = FreqDist()

• When in a counting loop:

fd.inc('item of interest')

• When done counting:

fd.N() # total number of tokens counted (N = number)

fd.B() # number of unique tokens (types; B = buckets)

fd.samples() # list of the distinct tokens seen (there are B)

fd.Nr(10) # number of samples that occurred 10 times

fd.count('red') # number of times the token 'red' was seen

fd.freq('red') # relative frequency of 'red'; that is, fd.count('red')/fd.N()

fd.max() # which token had the highest count

fd.sorted_samples() # show the samples in decreasing order of frequency
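The method names above come from an older NLTK release. In current NLTK (3.x), FreqDist subclasses collections.Counter, so the equivalent calls look like this (a minimal runnable sketch):

import nltk
from nltk.probability import FreqDist

fd = FreqDist()
for token in "the cat sat on the mat".split():
    fd[token] += 1      # replaces fd.inc(token)

fd.N()                  # total tokens counted: 6
fd.B()                  # distinct tokens: 5
fd['the']               # replaces fd.count('the'): 2
fd.freq('the')          # relative frequency: 2/6
fd.max()                # most frequent token: 'the'
fd.most_common()        # replaces fd.sorted_samples()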

Page 6

FreqDist() in action

Page 7

Word Lengths by Language

Page 8

Word Lengths by Language
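The charts are not reproduced in this transcript, but a comparison like this can be computed with a ConditionalFreqDist over the Universal Declaration of Human Rights corpus, as in the NLTK book (a sketch; the language choices are illustrative, and the udhr data must be downloaded):

import nltk
from nltk.corpus import udhr

languages = ['English', 'German_Deutsch', 'Hungarian_Magyar']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)   # one cumulative word-length curve per language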

Page 9

How to determine the characters?

Who are the main characters in a story?

Simple solution: look for words that begin with capital letters; count how often each occurs.

Then show the most frequent.
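A minimal sketch of this heuristic on a Project Gutenberg text (the text is an illustrative choice; sentence-initial capitals and titles such as 'Mr' add noise):

import nltk
from nltk.corpus import gutenberg

fd = nltk.FreqDist(w for w in gutenberg.words('austen-emma.txt')
                   if w.isalpha() and w[0].isupper())
fd.most_common(10)          # dominated by names and titles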

Page 10

Who are the main characters? And where in the story?
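The visualization itself is missing from this transcript; one way to reproduce the idea is NLTK's dispersion plot, which marks each occurrence of a word by its offset in the text (a sketch; requires matplotlib, and the names are illustrative):

import nltk
from nltk.corpus import gutenberg

text = nltk.Text(gutenberg.words('austen-emma.txt'))
text.dispersion_plot(['Emma', 'Harriet', 'Knightley', 'Churchill'])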

Page 11

Language Modeling

N-gram modelling: a fundamental concept in NLP

Main idea:

• For a given language, some words are more likely than others to follow each other; and

• You can predict (with some degree of accuracy) the probability that a given word will follow another word.

• This works for words; also for Parts-of-Speech, prosodic features, dialogue acts, …

Page 12

Next Word Prediction

From a NY Times story...

• Stocks ...

• Stocks plunged this ….

• Stocks plunged this morning, despite a cut in interest rates

• Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...

• Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began

Page 13

Human Word Prediction

Clearly, at least some of us have the ability to predict future words in an utterance.

How?

• Domain knowledge

• Syntactic knowledge

• Lexical knowledge

Page 14

Simple Statistics Does a Lot

A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques

In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)
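The step left implicit here: the probability of a sequence decomposes by the chain rule, and an N-gram model approximates each factor using only the previous N-1 words:

P(w_1 … w_n) = P(w_1) · P(w_2 | w_1) · … · P(w_n | w_1 … w_{n-1})
             ≈ ∏_i P(w_i | w_{i-N+1} … w_{i-1})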

Page 15

N-Gram Models of Language

Use the previous N-1 words in a sequence to predict the next word

How do we train these models?

• Very large corpora

Page 16

Simple N-Grams

Assume a language has V word types in its lexicon: how likely is word x to follow word y?

• Simplest model of word probability: 1/V

• Alternative 1: estimate likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)

popcorn is more likely to occur than unicorn

• Alternative 2: condition the likelihood of x occurring in the context of previous words (bigrams, trigrams,…)

mythical unicorn is more likely than mythical popcorn
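The estimate behind Alternative 2, implicit in the slide, is the maximum-likelihood ratio of corpus counts (shown for bigrams):

P(w_n | w_{n-1}) = Count(w_{n-1} w_n) / Count(w_{n-1})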

Page 17

Computing Next Words
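The demo from this slide is not reproduced in the transcript; here is a minimal sketch of computing likely next words from bigram counts, along the lines of the NLTK book's Genesis example (assumes the genesis corpus has been downloaded):

import nltk
from nltk.corpus import genesis

words = genesis.words('english-kjv.txt')
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
cfd['living'].most_common(5)   # words that most often follow 'living'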

Page 18

Auto-generate a Story

If it simply chooses the most probable next word given the current word, the generator loops – can you see why?

This is a bigram model

Better to take a longer history into account: trigram, 4-gram, …

(but will this guarantee no loops?)
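A sketch of the greedy generator described above, reusing cfd from the previous sketch; always taking the single most probable next word is exactly what produces the loop:

def generate_model(cfdist, word, num=15):
    for _ in range(num):
        print(word, end=' ')
        word = cfdist[word].max()   # greedy choice is deterministic, so it cycles

generate_model(cfd, 'living')

Sampling from cfd[word] instead of taking the maximum removes the determinism, but even a trigram or 4-gram history cannot guarantee loop-free output.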

Page 19

Applications

Why do we want to predict a word, given some preceding words?

• Rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition (ASR); see the sketch after this list

Theatre owners say popcorn/unicorn sales have doubled...

See for yourself: EBL has Dragon NaturallySpeaking ASR

• Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation:

The doctor recommended a cat scan.

El doctor recomendó una exploración del gato. (literally, "an examination of the cat")
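As promised above, a sketch of ranking alternative hypotheses: score each candidate by the product of bigram relative frequencies. The helper below is hypothetical (not from the slides), and a real ASR system would use log probabilities and smoothing rather than the zeros this naive version assigns to unseen bigrams:

import nltk

def bigram_score(cfd, tokens):
    # product of P(w_i | w_{i-1}) estimated as relative frequencies;
    # becomes 0.0 as soon as any bigram is unseen (hence smoothing)
    score = 1.0
    for prev, word in nltk.bigrams(tokens):
        score *= cfd[prev].freq(word)
    return score

# e.g., with a ConditionalFreqDist cfd trained on a suitable corpus:
# bigram_score(cfd, ['popcorn', 'sales']) vs. bigram_score(cfd, ['unicorn', 'sales'])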

Page 20

Comparing Modal Verb Counts

"can" and "will" are more frequent in the "skills and hobbies" genre

(Bob the Builder: "Yes we can!")

How to implement this? See the sketch below.
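One way to implement it, following the NLTK book's Brown corpus example (the genre and modal selections are illustrative; the "skills and hobbies" genre is filed as 'hobbies'):

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)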

Page 21

Comparing Modals

Page 22

Comparing Modals

Page 23

Reminder

FreqDist counts of tokens and their distribution can be useful

E.g. find the main characters in Gutenberg texts

E.g. compare word lengths in different languages

Humans can predict the next word …

N-gram models are based on counts in a large corpus

Auto-generate a story ... (but the greedy generator gets stuck in a loop)

Grammatical trends: modal verb distribution predicts genre