

CIS 192 Python Programming

Intro to Natural Language Processing

Raymond Yin

University of Pennsylvania

November 2, 2016



Outline

1 Updates
    Final Project Proposal

2 Natural Language Processing (NLP)
    Motivation
    Tokenization
    Counting Word Frequencies
    Generating Random Sentences
    Part of Speech Tagging
    Free Word Association
    Sentiment Analysis
    Next Steps



Final Project Proposal

Can work individually or with a partner
~10 hours of work per person
Email me and the TAs a 150-400 word description and team members by this Sunday
Demos during CIS Project Fair (Reading Days)



Natural Language Processing



Language is Hard

How can a computer:
    Recognize the parts of speech in a sentence?
    Tell whether a sentence is positive or negative?
    Figure out which words are most commonly used?
    Summarize text?



Some Applications of NLP

Sentiment analysis
Spam filtering
Plagiarism detection
Document categorization
Summarization
Text search
Much more...



Natural Language Toolkit (NLTK)

NLP toolkit for English in Python
Developed at Penn in 2001!
nltk.org


http://www.nltk.org/


Terminology

Corpus: a body of text
Token: each meaningful "entity" in a string
    Depending on context, tokens can be words, sentences, or paragraphs
Part of speech: a category that a word is assigned to
    noun, verb, adjective, ...
Stopwords: the most common words in a language, often filtered out before NLP tasks
    the, is, at, which, on, ...
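To make the terms above concrete, here is a minimal sketch in plain Python. It is not NLTK's tokenizer: the regex in `tokenize`, the tiny `STOPWORDS` set, and the toy corpus are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny hand-written stopword list, for illustration only.
STOPWORDS = {"the", "is", "at", "which", "on", "a", "of", "and"}

def tokenize(text):
    """Split text into lowercase word tokens (a crude regex tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def content_word_counts(corpus):
    """Count the tokens in a corpus after dropping stopwords."""
    tokens = tokenize(corpus)
    return Counter(t for t in tokens if t not in STOPWORDS)

# Toy corpus (hypothetical); a real corpus would be far larger.
corpus = "The cat sat on the mat. The dog sat on the cat."
counts = content_word_counts(corpus)
print(counts.most_common(2))
```

Here the corpus is the whole string, the tokens are its lowercase words, and the stopwords "the" and "on" never reach the counter.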



Word Tokenization

>>> import nltk
>>> nltk.word_tokenize('The mitochondria is the powerhouse of the cell.')
['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']



Sentence Tokenization

>>> sentences = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
...              "It was a great class...and I wasn't able to get "
...              "off the waitlist for CIS 677.")
>>> nltk.sent_tokenize(sentences)
['Prof. Sanjeev Khanna taught CIS 320 last spring.',
 "It was a great class...and I wasn't able to get off the waitlist for CIS 677."]



Counting Words in a Corpus

Before today:

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word in words:
...     counts[word] += 1

Better:

>>> from nltk import FreqDist
>>> counts = FreqDist(words)
>>> counts.most_common(10)  #=> [('the', 49), ...]

Neat!



Creating "random" sentences from a corpus

In probability theory, Markov chains are "memoryless"
    "Future state depends on current state only"
To create a "random" sentence:
    Take your current word
    Add a new word that typically appears after your current word
    Repeat!
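The steps above can be sketched as a first-order (bigram) Markov chain in plain Python. This is one possible sketch, not the lecture's own code: `build_chain`, `generate`, the toy token list, and the seed are hypothetical names and values chosen for illustration.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=10, rng=random):
    """Walk the chain from `start`, picking each next word at random."""
    words = [start]
    while len(words) < max_words:
        followers = chain.get(words[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        # Repeated followers appear several times in the list, so words
        # that follow more often are proportionally more likely.
        words.append(rng.choice(followers))
    return " ".join(words)

# Toy corpus (hypothetical); in practice, tokenize a real text first.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
chain = build_chain(tokens)
sentence = generate(chain, "the", max_words=8, rng=random.Random(192))
print(sentence)
```

Each step looks only at the current word, which is the "memoryless" property: the earlier words in the sentence do not influence the next choice.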



Part of Speech Tagging

Use nltk.pos_tag(list_of_tokens) to identify part of speech tags
nltk.help.upenn_tagset shows what each tag code means



Free Word Association

After hearing a word, what's the word that immediately comes to mind?
    "dog" → "cat" → "meow" → ...
How would we do this?


Simple way:
    For each token in our corpus, count the occurrences of surrounding tokens
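The "simple way" can be sketched as a co-occurrence count over a sliding window. The function names, the `window` parameter, and the toy token list below are assumptions for illustration; the slides do not specify how wide "surrounding" should be.

```python
from collections import Counter, defaultdict

def build_associations(tokens, window=2):
    """For each token, count the tokens within `window` positions of it."""
    neighbors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own neighbor
                neighbors[token][tokens[j]] += 1
    return neighbors

def associate(neighbors, word):
    """Return the most frequent neighbor of `word`, or None if unseen."""
    if word not in neighbors or not neighbors[word]:
        return None
    return neighbors[word].most_common(1)[0][0]

# Toy token stream (hypothetical), chosen so the chain is easy to follow.
tokens = "dog cat meow cat meow dog cat".split()
neighbors = build_associations(tokens, window=1)
print(associate(neighbors, "dog"))  # cat
print(associate(neighbors, "cat"))  # meow
```

On a real corpus, filtering stopwords first usually helps, since otherwise words like "the" tend to dominate every neighbor count.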



Sentiment Analysis

Is a particular text positive or negative (and to what degree)?
How might we do this?


Machine learning (last two lectures)
    Try to "learn" the sentiment-relevant features of text
    Needs lots of training data
    Data-driven approach
Rule-based methods
    "Rule of thumb": use heuristics to determine sentiment
    Need little training data
    Good for production: fast, but harder to create initially
    VADER: a popular rule-based model aimed at social media


http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
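To give a feel for the rule-based approach, here is a deliberately tiny sketch. It is not VADER: the `LEXICON` scores, the `NEGATIONS` set, and the single negation rule are all made up for illustration, whereas real tools rely on a large human-rated lexicon and many more heuristics.

```python
# Assumption: a tiny hand-made lexicon with invented scores.
LEXICON = {"great": 2.0, "good": 1.0, "love": 2.0,
           "bad": -1.0, "terrible": -2.0, "hate": -2.0}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum word scores, flipping the sign of the word after a negation."""
    score = 0.0
    negate = False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True  # remember to flip the next word's score
            continue
        value = LEXICON.get(word, 0.0)  # unknown words are neutral
        score += -value if negate else value
        negate = False
    return score

print(sentiment_score("This class is great!"))  # 2.0
print(sentiment_score("The food was not good."))  # -1.0
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment; the magnitude gives a rough sense of degree.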


Next Steps

CIS 526: Machine Translation
CIS 530: Computational Linguistics
Kaggle for large datasets, competitions
awesome-nlp: curated list of NLP resources on GitHub


http://mt-class.org/penn/

https://www.kaggle.com/datasets

https://github.com/keonkim/awesome-nlp
