CIS 192 Python Programming
Intro to Natural Language Processing
Raymond Yin
University of Pennsylvania
November 2, 2016
Raymond Yin (University of Pennsylvania) CIS 192 November 2, 2016 1 / 27
Outline
1 Updates
    Final Project Proposal
2 Natural Language Processing (NLP)
    Motivation
    Tokenization
    Counting Word Frequencies
    Generating Random Sentences
    Part of Speech Tagging
    Free Word Association
    Sentiment Analysis
    Next Steps
Final Project Proposal
Can work individually or with a partner
~10 hours of work per person
Email me and the TAs a 150-400 word description and team members by this Sunday
Demos during CIS Project Fair (Reading Days)
Natural Language Processing
source: researchperspectives.org
Language is Hard
How can a computer:
    Recognize parts of speech of sentences?
    Tell whether a sentence is positive or negative?
    Figure out what words are most commonly used?
    Summarize text?
Some Applications of NLP
Sentiment analysis
Spam filtering
Plagiarism detection
Document categorization
Summarization
Text search
Much more...
Natural Language Tool Kit (NLTK)
NLP toolkit for English in Python
Developed at Penn in 2001!
nltk.org
Terminology
Corpus: a body of text
Token: each meaningful "entity" in a string
    Depending on context, tokens can be words, sentences, or paragraphs
Part of Speech: categories that words are assigned to
    noun, verb, adjective, ...
Stopwords: most common words in a language, filtered out before NLP tasks
    the, is, at, which, on, ...
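To make the stopword idea concrete, here is a minimal sketch that filters a token list against a tiny hand-written stopword set. The set below contains only the five examples listed above and is an assumption for illustration; a real stopword list is much longer.

```python
# Hypothetical, deliberately tiny stopword set (just the examples from the slide).
STOPWORDS = {"the", "is", "at", "which", "on"}

tokens = ["The", "mitochondria", "is", "the", "powerhouse", "of", "the", "cell"]

# Compare in lowercase so that "The" and "the" are treated the same.
content_words = [t for t in tokens if t.lower() not in STOPWORDS]
print(content_words)
```

Note that "of" survives here only because it is missing from this toy set.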
Word Tokenization
>>> import nltk
>>> nltk.word_tokenize('The mitochondria is the powerhouse of the cell.')
['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']
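For intuition about what a word tokenizer does, here is a rough stdlib-only sketch. It assumes a single regular-expression rule (runs of word characters, or one punctuation mark) and is not how NLTK tokenizes internally; the function name simple_word_tokenize is made up for this example.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("The mitochondria is the powerhouse of the cell.")
print(tokens)
```

A rule this simple splits contractions such as "wasn't" into three pieces, which is one reason to reach for a real tokenizer.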
Sentence Tokenization
>>> sentences = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
...              "It was a great class...and I wasn't able to get off the waitlist for CIS 677.")
>>> nltk.sent_tokenize(sentences)
['Prof. Sanjeev Khanna taught CIS 320 last spring.',
 "It was a great class...and I wasn't able to get off the waitlist for CIS 677."]
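For contrast, here is a naive splitter that breaks after every ".", "!" or "?" followed by whitespace. It is only a sketch (the name naive_sent_split is hypothetical), but it shows why sentence tokenization is harder than it looks: the abbreviation "Prof." is wrongly treated as a sentence end.

```python
import re

def naive_sent_split(text):
    # Split on whitespace that directly follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text)

pieces = naive_sent_split("Prof. Sanjeev Khanna taught CIS 320 last spring. It was a great class.")
for piece in pieces:
    print(repr(piece))
```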
Counting Words in a Corpus
Before today:
>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word in words:
...     counts[word] += 1
Better:
>>> from nltk import FreqDist
>>> counts = FreqDist(words)
>>> counts.most_common(10)  #=> [('the', 49), ...]
Neat!
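If NLTK is not installed, collections.Counter from the standard library gives the same most_common interface. The word list below is made up for illustration.

```python
from collections import Counter

# Toy corpus, already split into word tokens.
words = "the cat sat on the mat and the dog sat too".split()

counts = Counter(words)
top_two = counts.most_common(2)  # list of (word, count) pairs, highest first
print(top_two)
```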
Creating "random" sentences from a corpus
In probability theory, Markov chains are "memoryless"
    "Future state depends on current state only"
To create a "random" sentence:
    Take your current word
    Add a new word that typically appears after your current word
    Repeat!
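The three steps above can be sketched as a first-order Markov chain over words. This is one possible implementation, not necessarily the one used in lecture; build_chain, generate, and the toy corpus are names and data invented for the example.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    # Map each word to the list of words seen right after it.
    # Repeats are kept, so frequent followers are picked more often.
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=10, seed=None):
    rng = random.Random(seed)
    word = start
    sentence = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:  # dead end: the word was never followed by anything
            break
        word = rng.choice(followers)  # depends only on the current word
        sentence.append(word)
    return sentence

corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)
sentence = generate(chain, "the", max_words=8, seed=0)
print(" ".join(sentence))
```

Storing followers in a list with duplicates is a simple way to sample in proportion to frequency without computing probabilities explicitly.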
Part of Speech Tagging
Use nltk.pos_tag(list_of_tokens) to identify part of speech tags
nltk.help.upenn_tagset shows what each tag code means
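To show the shape of tagger output without requiring NLTK's models, here is a toy lookup tagger. It returns a list of (token, tag) pairs, the same shape nltk.pos_tag returns, using Penn Treebank tag codes. The LEXICON dictionary and lookup_pos_tag function are hypothetical; a real tagger uses context and a trained model rather than a fixed table.

```python
# Hypothetical hand-written lexicon: word -> Penn Treebank tag.
# DT = determiner, NN = singular noun, VBD = past-tense verb.
LEXICON = {"the": "DT", "dog": "NN", "cat": "NN", "chased": "VBD", ".": "."}

def lookup_pos_tag(tokens, default="NN"):
    # Unknown words fall back to a default tag (noun is a common guess).
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

tagged = lookup_pos_tag(["The", "dog", "chased", "the", "cat", "."])
print(tagged)
```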
Free Word Association
After hearing a word, what's the word that immediately comes to mind?
"dog" → "cat" → "meow" → ...
How would we do this?
Simple way:
    For each token in our corpus, count the occurrences of surrounding tokens
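A minimal sketch of this counting approach: for every token, tally the tokens that appear within a fixed window around it, then answer with the most frequent neighbor. The window size, the function names, and the toy corpus (assumed to have stopwords already removed) are all illustrative choices.

```python
from collections import Counter, defaultdict

def build_associations(tokens, window=2):
    # assoc[word] counts how often each other token appears near word.
    assoc = defaultdict(Counter)
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                assoc[token][tokens[j]] += 1
    return assoc

def associate(assoc, word):
    # Return the most frequent neighbor, or None for an unseen word.
    neighbors = assoc.get(word)
    return neighbors.most_common(1)[0][0] if neighbors else None

corpus = "dog cat meow dog cat purr dog bark cat meow".split()
assoc = build_associations(corpus, window=1)
print(associate(assoc, "dog"))
```

On an unfiltered corpus the top neighbor of almost every word would be a stopword such as "the", so filtering stopwords first matters here.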
Sentiment Analysis
Is some particular text positive or negative (and to what degree)?
How might we do this?
Machine learning (last two lectures)
    Try to "learn" the sentiment-relevant features of text
    Needs lots of training data
    Data-driven approach
Rule-based methods
    "Rule of thumb": uses heuristics to determine sentiment
    Needs little training data
    Good for production: fast, but harder to create initially
    VADER: popular rule-based model aimed at social media
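As a rough illustration of the rule-based idea, the sketch below scores text with a tiny hand-written word list and one heuristic: a negation word flips the polarity of the word right after it. This is not VADER, whose lexicon and rules are far richer; every word list and name here is made up for the example.

```python
# Hypothetical mini-lexicons; a real rule-based model ships a much larger one.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}
NEGATIONS = {"not", "never", "no"}

def rule_based_sentiment(text):
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATIONS:
            negate = True  # flip the polarity of the next word only
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1 or 0
        score += -polarity if negate else polarity
        negate = False
    return score  # > 0 positive, < 0 negative, 0 neutral

print(rule_based_sentiment("It was a great class!"))
print(rule_based_sentiment("The waitlist was not good."))
```

No training data is needed, but every rule has to be written and tuned by hand, which matches the trade-off listed above.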
Next Steps
CIS 526: Machine Translation
CIS 530: Computational Linguistics
Kaggle for large datasets, competitions
awesome-nlp: curated list of NLP resources on GitHub