Understanding human language with Python Alyona Medelyan

KiwiPyCon 2014 talk - Understanding human language with Python


Introduction to Natural Language Processing: - Fiction vs Reality - Complexities of NLP - NLP with Python: NLTK, Gensim, TextBlob (stopword removal, part-of-speech tagging, TF-IDF, text categorization, sentiment analysis) - What's next


Page 1: KiwiPyCon 2014 talk - Understanding human language with Python

Understanding human language with Python

Alyona Medelyan

Page 2: KiwiPyCon 2014 talk - Understanding human language with Python

Who am I?

Alyona Medelyan

▪ In Natural Language Processing since 2000

▪ PhD in NLP & Machine Learning from Waikato

▪ Author of the open source keyword extraction algorithm Maui

▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”

▪ Past: Chief Research Officer at Pingar

▪ Now: Founder of Entopix, NLP consultancy & software development

aka @zelandiya

Page 3: KiwiPyCon 2014 talk - Understanding human language with Python

Agenda

State of NLP - Recap on fiction vs reality: Are we there yet?

NLP Complexities - Why is understanding language so complex?

NLP using Python - NLTK, Gensim, TextBlob & Co

Building NLP applications - A little bit of data science

Other NLP areas - And what’s coming next

Page 4: KiwiPyCon 2014 talk - Understanding human language with Python

State of NLP

Fiction versus Reality

Page 5: KiwiPyCon 2014 talk - Understanding human language with Python

He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” - Wikipedia

Page 6: KiwiPyCon 2014 talk - Understanding human language with Python

Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”

Page 7: KiwiPyCon 2014 talk - Understanding human language with Python

“by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)

Page 8: KiwiPyCon 2014 talk - Understanding human language with Python

Word Lens: “augmented reality translation”

Page 9: KiwiPyCon 2014 talk - Understanding human language with Python

Two girls use Google Translate to call a real Indian restaurant and order in Hindi… How did it go? www.youtube.com/watch?v=wxDRburxwz8

Page 10: KiwiPyCon 2014 talk - Understanding human language with Python

The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (From Memory Alpha Wiki)

Page 11: KiwiPyCon 2014 talk - Understanding human language with Python

Let’s try out Google

Page 12: KiwiPyCon 2014 talk - Understanding human language with Python

“Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”

Page 13: KiwiPyCon 2014 talk - Understanding human language with Python

Siri doesn’t seem to be as “available”

Page 14: KiwiPyCon 2014 talk - Understanding human language with Python

NLP Complexities

Why is understanding language so complex?

Page 15: KiwiPyCon 2014 talk - Understanding human language with Python
Page 16: KiwiPyCon 2014 talk - Understanding human language with Python

Word segmentation complexities

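▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。
▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。
  (English: “The developing countries unanimously support this goal and have put forward their own expectations in detail.”)
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.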
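
Chinese text has no spaces between words, so a segmenter has to decide where each word ends. One classic baseline is greedy maximum matching against a word list. The sketch below is only an illustration of that baseline, not something shown in the talk: the `max_match` function and the tiny `vocabulary` are assumptions made up for this one sentence.

```python
def max_match(text, vocabulary):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    longest = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop advances.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary, hand-picked for this sentence (an assumption).
vocabulary = {"广大", "发展中国家", "发展", "中国", "国家", "一致", "支持", "这个", "目标"}

segmented = max_match("广大发展中国家一致支持这个目标", vocabulary)
print(" / ".join(segmented))
```

The ambiguity is visible in the vocabulary itself: 发展中国家 ("developing countries") also contains 发展 ("develop"), 中国 ("China") and 国家 ("country"). Preferring the longest match happens to pick the intended reading here, but it is a heuristic and can fail on other sentences.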

Page 17: KiwiPyCon 2014 talk - Understanding human language with Python

Disambiguation complexities

Flying planes can be dangerous

Page 18: KiwiPyCon 2014 talk - Understanding human language with Python

NLP using Python

NLTK, Gensim, TextBlob & Co

Page 19: KiwiPyCon 2014 talk - Understanding human language with Python

What can we do with text?

[Diagram: raw text goes in; what comes out is sentiment, keywords, tags, genre, categories, taxonomy terms, and entities (names, patterns, biochemical entities, …)]

Page 20: KiwiPyCon 2014 talk - Understanding human language with Python

NLTK: Python platform for NLP

Page 21: KiwiPyCon 2014 talk - Understanding human language with Python

How to get to the core words? Remove Stopwords with NLTK

even the acting in transcendence is solid , with the dreamy depp turning in a typically strong performance

i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does

>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> words = ['the', 'acting', 'in', 'transcendence', 'is', 'solid', 'with', 'the', 'dreamy', 'depp']
>>> print([word for word in words if word not in stop])
['acting', 'transcendence', 'solid', 'dreamy', 'depp']

Page 22: KiwiPyCon 2014 talk - Understanding human language with Python

Getting closer to the meaning: Part of Speech tagging with NLTK

Flying planes can be dangerous

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
[('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('dangerous', 'JJ')]

Page 23: KiwiPyCon 2014 talk - Understanding human language with Python

Keyword scoring: TFxIDF

TF: relative frequency of a term t in a document d

IDF: the inverse proportion of documents d in collection D mentioning term t
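
The two definitions above can be turned into a few lines of plain Python. This is a minimal sketch under stated assumptions: the function names `tf`, `idf` and `tf_idf`, the base-2 logarithm, and the four toy documents are all illustrative choices, and real libraries differ in their exact weighting and normalisation.

```python
import math
from collections import Counter


def tf(term, doc):
    """Relative frequency of `term` in one tokenised document."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# Toy collection, invented for illustration.
docs = [
    ["the", "film", "was", "a", "solid", "comedy"],
    ["the", "film", "had", "too", "much", "violence"],
    ["the", "film", "was", "long"],
    ["a", "dull", "film"],
]

for term in ["film", "comedy"]:
    print(term, idf(term, docs), tf_idf(term, docs[0], docs))
```

A word that occurs in every document ("film") gets an IDF of zero, so it can never score as a keyword, while a word confined to one document ("comedy") is weighted up.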

Page 24: KiwiPyCon 2014 talk - Understanding human language with Python

TFxIDF with Gensim

from nltk.corpus import movie_reviews
from gensim import corpora, models

# Collect each review as a list of word tokens
texts = []
for fileid in movie_reviews.fileids():
    texts.append(list(movie_reviews.words(fileid)))

# Map words to ids, convert reviews to bag-of-words vectors, fit TF-IDF
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

Page 25: KiwiPyCon 2014 talk - Understanding human language with Python

TFxIDF with Gensim (Results)

for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    my_id = dictionary.token2id.get(word)
    print(word, '\t', tfidf.idfs[my_id])

film 0.190174003903
movie 0.364013496254
comedy 1.98564470702
violence 3.2108967825
jolie 6.96578428466

Page 26: KiwiPyCon 2014 talk - Understanding human language with Python

Where does this text belong? Text Categorization with NLTK

Entertainment

Politics

TVNZ: “Obama and Hangover star trade insults in interview”

>>> import nltk
>>> train_set = [(document_features(d), c) for (d, c) in categorized_documents]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> doc_features = document_features(new_document)
>>> category = classifier.classify(doc_features)
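
The snippet above relies on a `document_features` helper and a `categorized_documents` list that the slide does not show. One common choice is a bag-of-words presence feature per vocabulary word; the sketch below assumes that choice, and `VOCABULARY` plus the two toy documents are invented for illustration.

```python
# Hypothetical vocabulary; in practice this is often the most frequent
# words of the training corpus.
VOCABULARY = ["obama", "election", "star", "movie"]


def document_features(document_words):
    """Map a tokenised document to {feature name: bool}, one feature per
    vocabulary word, recording whether that word occurs in the document."""
    present = set(document_words)
    return {"contains({})".format(word): (word in present) for word in VOCABULARY}


# Toy labelled documents (assumptions, not data from the talk).
categorized_documents = [
    (["obama", "wins", "election"], "Politics"),
    (["movie", "star", "wins", "award"], "Entertainment"),
]

train_set = [(document_features(d), c) for (d, c) in categorized_documents]
print(train_set[0])
```

A list of `(feature dict, label)` pairs in this shape is what `nltk.NaiveBayesClassifier.train` expects as input.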

Page 27: KiwiPyCon 2014 talk - Understanding human language with Python

Sentiment analysis with TextBlob

>>> from textblob import TextBlob
>>> blob = TextBlob("I love this library")
>>> blob.sentiment
Sentiment(polarity=0.5, subjectivity=0.6)

for review in transcendence:
    blob = TextBlob(open(review).read())
    print(review, blob.sentiment.polarity)

../data/transcendence_1star.txt 0.0170799124247
../data/transcendence_5star.txt 0.0874591503268
../data/transcendence_8star.txt 0.256845238095
../data/transcendence_10star.txt 0.304310344828
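
To give a feel for where a polarity score can come from, here is a toy lexicon-based scorer that averages the scores of the opinion words it recognises. It is not TextBlob's implementation: the `LEXICON` values and the `polarity` function are made up for illustration, and a real analyser uses a much larger lexicon and also handles negation and intensifiers.

```python
# Hypothetical word scores in [-1, 1]; invented for this sketch.
LEXICON = {"love": 0.5, "solid": 0.3, "strong": 0.4, "dull": -0.4, "ruins": -0.6}


def polarity(text):
    """Average the lexicon scores of known words; 0.0 if none are found."""
    scores = [LEXICON[word] for word in text.lower().split() if word in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


for sentence in ["I love this library", "solid acting but a dull plot"]:
    print(sentence, polarity(sentence))
```

Averaging explains why long, mixed reviews tend to drift towards zero: positive and negative words cancel each other out, much like the narrow spread of scores in the results above.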

Page 28: KiwiPyCon 2014 talk - Understanding human language with Python

Building NLP applications

A little bit of data science

Page 29: KiwiPyCon 2014 talk - Understanding human language with Python

Keyword extraction in 3h: Understanding a movie review

bellboy, jennifer beals, four rooms, beals, rooms, tarantino, madonna, antonio banderas, valeria golino

…four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy … ruins every joke in the film …

github.com/zelandiya/KiwiPyCon-NLP-tutorial

Page 30: KiwiPyCon 2014 talk - Understanding human language with Python

Keyword extraction on 2000 movie reviews: What makes a successful movie?

Negative: van damme, zeta-jones, smith, batman, de palma, eddie murphy, killer, tommy lee jones, wild west, mars, murphy, ship, space, brothers, de bont ...

Positive: star wars, disney, war, de niro, jackie, alien, jackie chan, private ryan, truman show, ben stiller, cameron, science fiction, cameron diaz, fiction, jack ...

Page 31: KiwiPyCon 2014 talk - Understanding human language with Python

How can NLP help a beer drinker?

Sweaty Horse Blanket: Processing the Natural Language of Beer, by Ben Fields
vimeo.com/96809735

Page 32: KiwiPyCon 2014 talk - Understanding human language with Python
Page 33: KiwiPyCon 2014 talk - Understanding human language with Python

Other NLP areas

What’s coming next?

Page 34: KiwiPyCon 2014 talk - Understanding human language with Python

Filling the gaps in machine understanding

/m/0d3k14

/m/044sb

/m/0d3k14

… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …

Freebase

Page 35: KiwiPyCon 2014 talk - Understanding human language with Python

What’s next?

Vs.

Page 36: KiwiPyCon 2014 talk - Understanding human language with Python

Conclusions: Understanding human language with Python

deeplearning.net/software/theano

scikit-learn.org/stable

NLTK: nltk.org

Are we there yet?

@zelandiya #nlproc

radimrehurek.com/gensim

textblob.readthedocs.org