“Access to computers — and anything which might teach you something about the way the world works — should be unlimited and total. Always yield to the Hands-On Imperative!”
– Hacker Ethics
Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.). New York: Penguin Books. ISBN 0141000511. OCLC 47216793.
Agenda
• Computational Social Science
• Natural Language Processing
• Word Vector Representations
• Comparing different Wikipedia revisions
• Random Indexing
• word2vec patent
Computational Social Science / Digital Humanities
• combines computer science & social sciences
• makes new research possible, e.g. the analysis of massive social networks and of the content of millions of books
immersion.media.mit.edu
D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI: 10.1145/2184319.2184336
Massive-scale automated analysis of news-content
• 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus)
• automatically annotated into 15 topic areas
• the topics were compared with regard to readability, linguistic subjectivity and gender imbalances
I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content: topics, style and gender’, Digital Journalism, vol. 1, no. 1, 2013. DOI: 10.1080/21670811.2012.714928
“Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)”
Linguistic Subjectivity: Adjectives (Part-of-Speech Tagging) & SentiWordNet
Male-to-Female Ratio: Named Entity Recognition
“Gender bias in sports coverage (...) females only account for between
only 7 and 25 per cent of coverage”
Machine Learning, Text Processing, Topic Modeling: scikit-learn, gensim, Natural Language Toolkit, spaCy, word2vec
Visualization: d3.js, Google Chart API, Highcharts
Part-of-Speech Tagging: Identifying nouns, verbs, adjectives…
>>> import nltk
>>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'), ('Norway', 'NNP'), ('.', '.')]
NN*  Noun
VB*  Verb
JJ*  Adjective
RB*  Adverb
DT   Determiner
IN   Preposition
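The tagged output above can be reduced to a simple adjective share, in the spirit of the adjective-based subjectivity measure mentioned earlier. This is only a rough sketch: the `adjective_share` helper is hypothetical, the tagged tokens are copied from the NLTK output above rather than recomputed, and SentiWordNet is not used at all.

```python
# Tagged tokens copied from the NLTK output shown above (not recomputed here).
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
          ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
          ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
          ('Norway', 'NNP'), ('.', '.')]


def adjective_share(tagged_tokens):
    """Return the fraction of word tokens whose tag starts with 'JJ'.

    Hypothetical helper: a crude stand-in for an adjective-based
    subjectivity score. Punctuation tokens are ignored.
    """
    # Word tags start with a letter; punctuation tags such as '.' do not.
    words = [(tok, tag) for tok, tag in tagged_tokens if tag[:1].isalpha()]
    if not words:
        return 0.0
    adjectives = [tok for tok, tag in words if tag.startswith('JJ')]
    return len(adjectives) / len(words)


share = adjective_share(tagged)
print(f"adjective share: {share:.3f}")
```

On a real corpus the same function would be applied to the output of `nltk.pos_tag` for each article and averaged per topic.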
Named Entity Recognition: Identifying people, organizations, locations…
>>> import nltk
>>> text = "New York City is the largest city in the United States."
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the', 'DT'), Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]), ('.', '.')])
ORGANIZATION  Georgia-Pacific Corp., WHO
PERSON        Eddy Bonte, President Obama
LOCATION      Murray River, Mount Everest
DATE          June, 2008-06-29
TIME          two fifty a m, 1:30 p.m.
MONEY         GBP 10.40
PERCENT       twenty pct, 18.75 %
FACILITY      Washington Monument, Stonehenge
GPE           South East Asia, Midlothian (GPE = geo-political entity)
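To work with the recognised entities, the labelled subtrees have to be pulled out of the chunk tree. The following is a minimal sketch, assuming the tree from the output above is rebuilt by hand (so no NLTK models need to be downloaded); `extract_entities` is a hypothetical helper name, not part of NLTK.

```python
from nltk import Tree

# Chunk tree copied from the ne_chunk output shown above.
tree = Tree('S', [
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]),
    ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'),
    ('in', 'IN'), ('the', 'DT'),
    Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
    ('.', '.'),
])


def extract_entities(chunk_tree):
    """Return (entity text, entity type) pairs from a chunk tree.

    Hypothetical helper: every subtree whose label is not the sentence
    root 'S' is treated as one named entity.
    """
    entities = []
    for subtree in chunk_tree.subtrees(lambda t: t.label() != 'S'):
        # Leaves are (token, tag) tuples; join the tokens into one string.
        text = ' '.join(token for token, _tag in subtree.leaves())
        entities.append((text, subtree.label()))
    return entities


entities = extract_entities(tree)
print(entities)
```

Counting how often PERSON entities of each gender occur would need an additional name-to-gender lookup, which is not shown here.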
“You shall know a word by the company it keeps”
– J. R. Firth, 1957
Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.
word2vec: Vectors can encode relationships
[Figure: word vectors for MAN, WOMAN, UNCLE, AUNT, KING, QUEEN]
[Figure: word vectors for KING, KINGS, QUEEN, QUEENS]
T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781
England is to London as Germany is to ?
598.7ms [["Berlin",0.563393235206604],["Dusseldorf",0.5625754594802856],["Munich",0.5460122227668762],["Budapest",0.5285829901695251],["Düsseldorf",0.5266501903533936]]
England is to Cameron as Germany is to ?
556.8ms [["Merkel",0.5016422867774963],["Schroeder",0.49941977858543396],["Klaus",0.4981233477592468],["Schröder",0.4947296977043152],["Peer_Nils",0.492642343044281]]
Link: http://radimrehurek.com/2014/02/word2vec-tutorial/
word2vec: Analogy puzzles
wake is to woken as be is to ?
929.9ms [["been",0.41698968410491943],["tobe",0.40402814745903015],["are",0.3866569399833679],["being",0.3746173679828644],["notbe",0.36837878823280334]]
fast is to fastest as slow is to ?
806.2ms [["slowest",0.7025301456451416],["slower",0.6236234307289124],["slowed",0.5842559337615967],["slowing",0.5462259650230408],["quickest",0.5290436744689941]]
Scotland is to haggis as Germany is to ?
793.5ms [["Currywurst",0.5284685492515564],["schnitzel",0.5208959579467773],["wursts",0.5166285037994385],["sauerkraut",0.512742817401886],["stollen",0.5095855593681335]]
communism is to Karl_Marx as capitalism is to ?
544.7ms [["Capitalism",0.5884973406791687],["capitalist",0.5700926184654236],["Friedrich_Hayek",0.5352163314819336],["Milton_Friedman",0.5348755121231079],["John_Maynard_Keynes",0.5335651636123657]]
Link: https://radimrehurek.com/gensim/models/word2vec.html
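The analogy puzzles above come down to vector arithmetic: for "a is to b as c is to ?", rank all other words by cosine similarity to b - a + c. The sketch below illustrates that idea in plain Python. The three-dimensional vectors are hand-made toy values (an assumption for illustration, not trained word2vec vectors), and `analogy` is a hypothetical helper. With a trained model, gensim offers a comparable query through `most_similar(positive=[...], negative=[...])`.

```python
import math

# Hand-made toy vectors (assumption: real word2vec vectors are learned
# from text and typically have hundreds of dimensions).
vectors = {
    "England": [1.0, 0.0, 0.0],
    "London":  [1.0, 1.0, 0.0],
    "Germany": [0.0, 0.0, 1.0],
    "Berlin":  [0.0, 1.0, 1.0],
    "Munich":  [0.0, 0.6, 1.0],
    "haggis":  [0.5, -1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, vectors, topn=3):
    """Answer 'a is to b as c is to ?' by ranking words near b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = [(word, cosine(target, vec))
                  for word, vec in vectors.items()
                  if word not in (a, b, c)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy("England", "London", "Germany", vectors)
for word, score in result:
    print(f"{word}\t{score:.3f}")
```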
spaCy: Dependency-based word representations by Levy and Goldberg
Gensim: word2vec by Mikolov et al.
5 words context window
2 words context window
Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
Machine Translation
T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available: http://arxiv.org/abs/1309.4168
1. Find word representations: word2vec
2. Dimensionality reduction: t-SNE
3. Visualisation: JSON
Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/
linguistics
Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Link: https://github.com/mbostock/d3/wiki/Gallery
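The three steps of this pipeline can be sketched end to end as follows. This is an illustration under stated assumptions, not the code behind the slides: random vectors stand in for trained word2vec vectors so that the example runs without a model, the vocabulary (`word0`, `word1`, …) is made up, and the JSON layout with `word`, `x` and `y` keys is a hypothetical format that a d3.js scatter plot could read.

```python
import json

import numpy as np
from sklearn.manifold import TSNE

# Step 1 (stand-in): random 50-dimensional vectors instead of trained
# word2vec vectors. This is an assumption to keep the sketch runnable.
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(30)]
vectors = rng.normal(size=(len(words), 50))

# Step 2: reduce to two dimensions with t-SNE. The perplexity must be
# smaller than the number of samples, hence the small value used here.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(vectors)

# Step 3: serialise as JSON for a browser-based visualisation.
# float() converts NumPy scalars, which json cannot serialise directly.
points = [{"word": word, "x": float(x), "y": float(y)}
          for word, (x, y) in zip(words, coords)]
payload = json.dumps(points)
print(payload[:80])
```

With a trained model, the rows of `vectors` would be the learned word vectors for the words of interest.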
Google’s word2vec patent: Reactions from the community
• “Pretty much every time Google has engaged in patent infringement litigation, it has been against someone who has brought an infringement suit against them first. (…) it keeps inventions they are using out of the hands of patent trolls”
• “Idiotic. I'm surprised they didn't just patent matrix algebra”
• “fuck software patents”
https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/
• Omer Levy:
  • The novelty claim in this patent is somewhat bogus
  • word2vec is doing more or less what the NLP research community has been doing for the past 25 years
  • much of the improvement in performance stems from preprocessing "hacks" and hyperparameter settings
  • word2vec is a brilliantly efficient implementation of decade-old ideas
Google’s word2vec patent: What does it change for NLP practitioners?
• “Likely nothing. It's probably one of the thousands of overly broad "defensive" patents held by companies.”
• “Didn't it have an Apache open source license before-hand?”
Random Indexing: Incremental word space model
• all information is stored in vectors
• each word has a hash key:
  • an n-dimensional vector
  • most dimensions are 0
  • a small number k of randomly distributed -1 or +1 values
  • the dimension of the vectors is much smaller than the number of contexts
• Every time you see a word w_i, add the hash keys of the words in the context window w_{i-3}, …, w_{i+3} to the word’s context vector v_i
• After a number of occurrences, the context vector holds information about a word’s distribution
• this acts as a dimensionality reduction that is computationally less costly than methods like PCA
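The update rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a tuned implementation: the constants `DIM`, `NONZERO` and `WINDOW` are toy values chosen for the example, the function names are hypothetical, and what the slides call a hash key is usually called an index vector in the Random Indexing literature.

```python
import random
from collections import defaultdict

DIM = 20      # dimensionality of all vectors (toy value)
NONZERO = 4   # k: number of non-zero entries per hash key (toy value)
WINDOW = 3    # context window: three words to the left and to the right


def make_hash_key(rng):
    """Sparse random vector: NONZERO entries are -1 or +1, the rest are 0."""
    key = [0] * DIM
    for position in rng.sample(range(DIM), NONZERO):
        key[position] = rng.choice((-1, 1))
    return key


def random_indexing(tokens, seed=0):
    """Build context vectors in one pass over a token sequence."""
    rng = random.Random(seed)
    hash_keys = {}
    context = defaultdict(lambda: [0] * DIM)

    def key_for(word):
        # Hash keys are created lazily, the first time a word is needed.
        if word not in hash_keys:
            hash_keys[word] = make_hash_key(rng)
        return hash_keys[word]

    for i, word in enumerate(tokens):
        start, stop = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a word is not part of its own context
            neighbour_key = key_for(tokens[j])
            vector = context[word]
            for d in range(DIM):
                vector[d] += neighbour_key[d]
    return hash_keys, dict(context)


tokens = "you shall know a word by the company it keeps".split()
hash_keys, context_vectors = random_indexing(tokens)
print(context_vectors["word"])
```

Because each occurrence only adds a handful of sparse vectors, the model can be updated incrementally as new text arrives, without recomputing a full co-occurrence matrix.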
Image Captioning
Fei-Fei Li & Andrej Karpathy, Stanford University, CS231n, http://cs231n.stanford.edu/syllabus.html
Image Question Answering
H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, ‘Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering’, CoRR, vol. abs/1505.05612, 2015 [Online]. Available: http://arxiv.org/abs/1505.05612
Hacking Human Language
Hendrik Heuer
London
[email protected]
http://hen-drik.de
@hen_drik
Thanks to Andrii, Jussi & Roelof
Slides: https://tinyurl.com/pydata-language