Natural Language Processing in practice

Vsevolod Dyomkin, "Natural Language Processing in Practice"


"AI&BigData Lab" conference, April 12, 2014



Natural Language Processing in practice


Topics

* Overview of NLP
* Getting Data
* Models & Algorithms
* Building an NLP system
* A practical example


A bit about me

* Lisp programmer
* Architect and research lead at Grammarly (3+ years of NLP work)
* Teacher at KPI: Operating Systems
* Links:
  http://lisp-univ-etc.blogspot.com
  http://github.com/vseloved
  http://twitter.com/vseloved


A bit about Grammarly

(c) xkcd

The best English language writing enhancement app:
Spellcheck - Grammar check - Style improvement - Synonyms and word choice - Plagiarism check


What is NLP?

Transforming free-form text into structured data and back

Intersection of Comp Sci & Linguistics & Software Eng

Based on Algorithms, Machine Learning, and Statistics


Popular NLP problems

* Spam Filtering
* Spelling Correction
* Sentiment Analysis
* Question Answering
* Machine Translation
* Text Summarization
* Search (also IR)

http://www.paulgraham.com/spam.html
http://norvig.com/spell-correct.html

(c) gettyimages


Levels of NLP

* data & tools
* models
* production-ready systems


Role of Linguistics


NLP Data

* structured
* semi-structured
* unstructured

“Data is ten times more powerful than algorithms.”

-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs


Kinds of data

* Dictionaries
* Corpora
* User Data


Where to get data?

* Linguistic Data Consortium http://www.ldc.upenn.edu/
* Google ngrams, book ngrams, syntactic ngrams
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites: Stanford, Oxford, CMU, ...


Create your own!

* Linguists
* Crowdsourcing
* By-product

-- Jonathan Zittrain http://goo.gl/hs4qB


Tools

* analysis tools
* processing tools

* Unix command line
* XML processing
* Map-reduce systems
* R, Python, Lisp

(c) O'Reilly Media


Algorithms

* Dynamic Programming
* Search Algorithms
* Tree Algorithms
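
As an illustration of the dynamic programming item, here is a minimal Python sketch of Levenshtein edit distance, a classic use of the technique in NLP (for example, when ranking spelling suggestions). The function name `edit_distance` and the unit cost for every operation are assumptions made for this example; nothing here is taken from the slides.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via dynamic programming, one table row at a time."""
    # previous[j] = distance between the empty prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # current[0] = cost of deleting the first i characters of source
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

For instance, `edit_distance("kitten", "sitting")` returns 3: two substitutions and one insertion. Keeping only two rows keeps memory linear in the length of `target`.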


Beyond Algorithms

* CKY constituency parsing
* Noisy channel spelling correction
* TF-IDF document classification
* Bayesian filtering
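
To make the TF-IDF item more concrete, a small Python sketch of the weighting step follows. It assumes one common variant: term frequency normalized by document length, and inverse document frequency computed as log(N / df). Other variants exist, and the classifier that would consume these weights is not shown. The name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    doc_count = len(documents)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({
            # tf = share of the document taken by the term,
            # idf = log(N / df); a term present in every document gets 0.
            term: (count / len(doc)) * math.log(doc_count / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights
```

With this variant, a word that appears in every document carries no weight, while a word concentrated in a few documents is weighted up, which is why the scheme is useful for telling documents apart.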


Models

* generative vs discriminative
* statistical vs rule-based


Language Models

Ngrams

Generative ML models:

* Bayesian inference (bag-of-words model)
* Hidden Markov model (sequence model)
* Neural networks (holistic model)

LM + Domain Model
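
The ngram idea can be sketched as a bigram language model with add-one smoothing. This is an illustrative Python sketch, not the model used in the talk; the class name `BigramModel`, the `<s>` and `</s>` boundary markers, and the choice of smoothing are all assumptions.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.contexts = Counter()  # how often each token is followed by something
        self.bigrams = Counter()   # how often each (previous, word) pair occurs
        for sentence in sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        # Everything that can be predicted: training words plus the end marker.
        self.vocab = {tok for s in sentences for tok in s} | {"</s>"}

    def prob(self, prev, word):
        """P(word | prev), smoothed so unseen pairs get a small non-zero value."""
        return (self.bigrams[(prev, word)] + 1) / (self.contexts[prev] + len(self.vocab))

    def log_prob(self, sentence):
        """Log-probability of a whole tokenized sentence."""
        tokens = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))
```

A model trained on a few sentences will score a word order it has seen higher than a scrambled one, which is the property that tasks such as spelling correction or language detection build on.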


Discriminative Models

* Heuristic
* Maximum Entropy
* “Advanced” LM Models


Going Into Prod

* Translate real-world requirements into a measurable goal
* Pre- and post-processing
* Don't trust research results
* Gather user feedback
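
One possible reading of "a measurable goal" is a target on precision, recall, or F1 over a labeled test set. The slides do not name a metric, so this choice is an assumption, as is the helper name `precision_recall_f1`. A minimal Python sketch:

```python
def precision_recall_f1(predicted, expected):
    """Compare a set of predicted items against the set of expected ones."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Precision: what share of the predictions were right.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the expected items were found.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A requirement such as "users should rarely see a wrong suggestion" then becomes a threshold on precision, and "most real errors should be caught" becomes a threshold on recall.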


Practical Example: Language Detection


Idea

Standard approach: character LM

Let's try an alternative: word LM

Data – from Wiktionary
Test data – from Wikipedia
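
A heavily simplified Python sketch of the word-based idea: each language is represented only by a set of known words, and a text is assigned to the language that covers the largest share of its tokens. The toy word lists, the `WORDS` table, and the `detect_language` function are hypothetical stand-ins; a real system would load vocabularies from Wiktionary dumps and would likely weight words rather than treat them uniformly.

```python
import re

# Hypothetical toy vocabularies standing in for Wiktionary word lists.
WORDS = {
    "en": {"the", "cat", "is", "on", "table", "and"},
    "de": {"die", "katze", "ist", "auf", "dem", "tisch", "und"},
    "fr": {"le", "chat", "est", "sur", "la", "table", "et"},
}


def detect_language(text, words_by_lang=WORDS):
    """Return the language whose vocabulary covers most tokens, or None."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return None
    # Score = share of tokens that are known words of the language.
    scores = {
        lang: sum(token in vocab for token in tokens) / len(tokens)
        for lang, vocab in words_by_lang.items()
    }
    best = max(scores, key=scores.get)
    # If no token is known in any language, refuse to guess.
    return best if scores[best] > 0 else None
```

Held-out text, such as the Wikipedia articles mentioned above, can then be used to check how often the guess matches the known language of each article.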


Practical ML System

* Training


ML System

* Training
* Evaluation


ML System

* Training
* Evaluation
* Production


Thanks!

Questions?

Vsevolod Dyomkin
@vseloved