Introduction to Natural Language Processing: fiction vs reality; complexities of NLP; NLP with Python: NLTK, Gensim, TextBlob (stopword removal, part-of-speech tagging, TFxIDF, text categorization, sentiment analysis); what's next.
Understanding human language with Python
Alyona Medelyan
Who am I?
Alyona Medelyan
▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the open source keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development
aka @zelandiya
Agenda
▪ State of NLP: Recap on fiction vs reality. Are we there yet?
▪ NLP Complexities: Why is understanding language so complex?
▪ NLP using Python: NLTK, Gensim, TextBlob & Co
▪ Building NLP applications: A little bit of data science
▪ Other NLP areas: And what’s coming next
State of NLP
Fiction versus Reality
He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” - Wikipedia
Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”
“by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)
Word Lens: “augmented reality translation”
Two girls use Google Translate to call a real Indian restaurant and order in Hindi… How did it go?
www.youtube.com/watch?v=wxDRburxwz8
The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (From Memory Alpha Wiki)
Let’s try out Google
“Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”
Siri doesn’t seem to be as “available”
NLP Complexities
Why is understanding language so complex?
Word segmentation complexities
▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。 (Chinese, written without spaces between words: “The many developing countries unanimously support this goal and have each put forward their detailed expectations.”)
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
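To make the segmentation problem concrete, here is a minimal sketch that applies plain whitespace splitting to the two example sentences. Whitespace splitting is only a naive baseline chosen for illustration, not a real word segmenter, and the variable names are assumptions.

```python
# Whitespace tokenization: a naive baseline, not a real word segmenter.
english = "The first hot dogs were sold by Charles Feltman on Coney Island in 1870."
chinese = "广大发展中国家一致支持这个目标,并提出了各自的期望细节。"

english_tokens = english.split()
chinese_tokens = chinese.split()

print(len(english_tokens))  # 14 whitespace-separated tokens
print(len(chinese_tokens))  # 1: the whole sentence comes back as a single "word"

# Even in English, whitespace splits multiword units apart.
print("hot dogs" in english_tokens)  # False: "hot" and "dogs" are separate tokens
```

The Chinese sentence has no spaces to split on at all, and the English one still breaks units such as "hot dogs" and "Coney Island" into separate tokens.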
Disambiguation complexities
Flying planes can be dangerous
NLP using Python
NLTK, Gensim, TextBlob & Co
[Diagram: what can be extracted from text: sentiment; keywords, tags; genre; categories, taxonomy terms; entities (names, patterns, biochemical entities, …)]
What can we do with text?
NLTK: Python platform for NLP
How to get to the core words? Remove Stopwords with NLTK
even the acting in transcendence is solid , with the dreamy depp turning in a typically strong performance
i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> words = ['the', 'acting', 'in', 'transcendence', 'is', 'solid', 'with', 'the', 'dreamy', 'depp']
>>> print([word for word in words if word not in stop])
['acting', 'transcendence', 'solid', 'dreamy', 'depp']
Getting closer to the meaning: Part of Speech tagging with NLTK
Flying planes can be dangerous
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
[('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('dangerous', 'JJ')]
✓
Keyword scoring: TFxIDF
TF: relative frequency of a term t in a document d
IDF: the inverse proportion of documents d in collection D mentioning term t
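The two definitions above can be sketched in a few lines of plain Python. This is an illustrative toy, not Gensim's implementation: the function names `tf`, `idf` and `tf_idf`, the tiny `docs` collection, and the choice of a base-2 logarithm for the inverse proportion are all assumptions made for the example.

```python
import math
from collections import Counter

def tf(term, document):
    """Relative frequency of `term` in one tokenized document."""
    return Counter(document)[term] / len(document)

def idf(term, collection):
    """Log (base 2, an assumption) of the inverse proportion of documents containing `term`."""
    containing = sum(1 for document in collection if term in document)
    if containing == 0:
        return 0.0  # term never seen: no evidence, so score it as uninformative
    return math.log2(len(collection) / containing)

def tf_idf(term, document, collection):
    return tf(term, document) * idf(term, collection)

# Hypothetical toy collection of tokenized documents
docs = [
    ["a", "solid", "film"],
    ["a", "dreamy", "film"],
    ["a", "violent", "comedy"],
    ["a", "film", "about", "planes"],
]

print(idf("a", docs))                   # 0.0: appears in every document
print(idf("comedy", docs))              # 2.0: appears in 1 of 4 documents
print(tf_idf("comedy", docs[2], docs))  # 1/3 * 2.0
```

A term that occurs everywhere scores zero, while a term concentrated in few documents scores high, which is why TFxIDF is useful for ranking candidate keywords.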
TFxIDF with Gensim
from nltk.corpus import movie_reviews
from gensim import corpora, models

# Collect one list of tokens per review
texts = []
for fileid in movie_reviews.fileids():
    texts.append(movie_reviews.words(fileid))

# Map tokens to ids, turn each review into a bag of words, then fit TFxIDF
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
TFxIDF with Gensim (Results)
for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    my_id = dictionary.token2id.get(word)
    print('%s\t%s' % (word, tfidf.idfs[my_id]))

film 0.190174003903
movie 0.364013496254
comedy 1.98564470702
violence 3.2108967825
jolie 6.96578428466
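One way to read these numbers is to convert each idf back into an approximate document count. The sketch below assumes the corpus holds 2000 reviews and that idf is computed as log2(total_docs / doc_freq) with no smoothing. Both are assumptions about the setup rather than something the slides state, and `estimated_doc_freq` is a hypothetical helper.

```python
# idf values copied from the results above
idfs = {
    "film": 0.190174003903,
    "movie": 0.364013496254,
    "comedy": 1.98564470702,
    "violence": 3.2108967825,
    "jolie": 6.96578428466,
}

# Assumption: 2000 reviews in the corpus and idf = log2(total_docs / doc_freq)
TOTAL_DOCS = 2000

def estimated_doc_freq(idf, total_docs=TOTAL_DOCS):
    """Invert idf = log2(total_docs / doc_freq) to estimate doc_freq."""
    return total_docs / 2 ** idf

for word, idf in idfs.items():
    print(word, round(estimated_doc_freq(idf)))
```

Under these assumptions "jolie" works out to about 16 of the 2000 reviews, while "film" appears in the large majority of them, which matches the intuition that rare, specific terms get the highest idf.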
Where does this text belong? Text Categorization with NLTK
Entertainment
Politics
TVNZ: “Obama and Hangover star trade insults in interview”
>>> train_set = [(document_features(d), c) for (d, c) in categorized_documents]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> doc_features = document_features(new_document)
>>> category = classifier.classify(doc_features)
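The snippet above relies on a `document_features` helper that the slides do not show. A minimal sketch of what such a helper might look like is given below, assuming bag-of-words presence features over a fixed vocabulary; the `VOCABULARY` list and the feature naming are hypothetical choices for illustration, not the author's code.

```python
# Hypothetical vocabulary; in practice it might be the most frequent corpus words.
VOCABULARY = ["obama", "election", "star", "movie", "interview"]

def document_features(document):
    """Map a list of tokens to {feature name: bool}, one feature per vocabulary word."""
    document_words = set(word.lower() for word in document)
    return dict(
        ("contains(%s)" % word, word in document_words)
        for word in VOCABULARY
    )

headline = "Obama and Hangover star trade insults in interview".split()
features = document_features(headline)
print(features["contains(obama)"])     # True
print(features["contains(election)"])  # False
```

Dictionaries of this shape are what `nltk.NaiveBayesClassifier.train` expects as the first element of each (features, label) pair.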
Sentiment analysis with TextBlob
>>> from textblob import TextBlob
>>> blob = TextBlob("I love this library")
>>> blob.sentiment
Sentiment(polarity=0.5, subjectivity=0.6)
for review in transcendence:
    blob = TextBlob(open(review).read())
    print('%s %s' % (review, blob.sentiment.polarity))

../data/transcendence_1star.txt 0.0170799124247
../data/transcendence_5star.txt 0.0874591503268
../data/transcendence_8star.txt 0.256845238095
../data/transcendence_10star.txt 0.304310344828
Building NLP applications
A little bit of data science
Keyword extraction in 3h: Understanding a movie review
bellboy, jennifer beals, four rooms, beals, rooms, tarantino, madonna, antonio banderas, valeria golino
…four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy … ruins every joke in the film …
github.com/zelandiya/KiwiPyCon-NLP-tutorial
Keyword extraction on 2000 movie reviews: What makes a successful movie?
Negative: van damme, zeta – jones, smith, batman, de palma, eddie murphy, killer, tommy lee jones, wild west, mars, murphy, ship, space, brothers, de bont ...
Positive: star wars, disney, war, de niro, jackie, alien, jackie chan, private ryan, truman show, ben stiller, cameron, science fiction, cameron diaz, fiction, jack ...
How can NLP help a beer drinker?
Sweaty Horse Blanket: Processing the Natural Language of Beer
by Ben Fields
vimeo.com/96809735
Other NLP areas
What’s coming next?
Filling the gaps in machine understanding
[Figure: the sentence “… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …” with its entity mentions linked to Freebase identifiers /m/0d3k14, /m/044sb, /m/0d3k14]
What’s next?
Vs.
Conclusions: Understanding human language with Python
deeplearning.net/software/theano
scikit-learn.org/stable
NLTK: nltk.org
Are we there yet?
@zelandiya #nlproc
radimrehurek.com/gensim
textblob.readthedocs.org