Machine learning and text analytics
XXXVII Heidelberg Physics Graduate Days
Heidelberg, October 13th, 2016
Introduction
Michael Hecht
» Senior Consultant (since 2011 working for d-fine)
» PhD in theoretical physics (Topological String Theory) in
Munich (@LMU) and Boston (@Harvard U.)
» Machine learning expert
» Hobby data scientist and Kaggle competitor
(best ranking: 18th in Kaggle's world ranking)
Todor Dobrikov
» Senior Manager (since 2007 working for d-fine)
» Diploma in mathematics with computer science (@TUD)
and MSc in mathematical finance (@Oxford)
» Expert in rating model development and credit portfolio
modelling
» Establishing machine learning in d-fine’s projects
Agenda
» Text analytics and NLP
» Machine Learning
» Deep Learning for NLP
› Overview
› Word embeddings
› NLP (almost) from scratch
› Outlook
» ChatAnalytics
› Dashboard for surveillance of trader communication
Text analytics and NLP
The rapidly increasing amount of information is a challenge in banking and
forces financial institutions to redefine processes and concepts (1/2)
» We receive more and more information from various sources, e.g. news providers, homepages, chats, tweets and many more.
» All these contributions contain some information and, taken together, are valuable in banking, because
› it is almost orthogonal to traditional information,
› it is up to date,
› it is hard to manipulate all sources at once.
» However, it is impossible, or at least inefficient, to monitor all available contributions manually in real time.
» Moreover, text is unstructured: there are countless ways to convey the same information in words from the sender to the receiver. This makes the quantitative analysis of text challenging.
The rapidly increasing amount of information is a challenge in banking and
forces financial institutions to redefine processes and concepts (2/2)
(Figure: total and daily number of news items for several news providers, with data from 2000, 2004, 2007, 2010 and 2012 onwards. Right panel: articles, words, bytes and entities.)
» 13,000,000 news articles
» times about 400 words each
» = 5,200,000,000 words in total
» times ca. 4 characters per word
» = about 20,000,000,000 characters
» times 8 bit per character
» gives roughly 19 gigabytes of unstructured raw business-news data, which have to be analysed, e.g. for
› stock-listed companies (S&P 500 or STOXX Europe 800)
› local small and medium-sized enterprises
› sovereigns
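A quick back-of-envelope check of these figures (a sketch; the per-article and per-word averages are rough assumptions, not measured values):

```python
# Back-of-envelope check of the quoted data volume.
articles = 13_000_000
words = articles * 400          # ~400 words per article -> 5.2e9 words
chars = words * 4               # ~4 characters per word -> ~2.1e10 characters
bytes_total = chars             # 8 bit = 1 byte per character
print(f"{bytes_total / 1024**3:.1f} GiB")   # -> ~19 GiB of raw news text
```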
For financial institutions, two conclusions can be drawn from the rapidly
increasing amount of available information
Banks will have to integrate text-based information in their processes and models.
1. Decision aids in manual processes
Support manual processes with dashboards and reports for text-based information, such that the analysts' attention is guided to important events, documents, or passages within documents.
» Develop a model to analyse text
» Visualize and highlight features from text and other information (e.g. market, balance-sheet and behavioural data)
» Decisions are made by the analyst, who evaluates the text-based information manually
2. Integration into internal models
Integrate text-based information in internal models, so that manual effort is only caused in model development and validation.
» Develop a model to analyse text
» Combine the text features and other information (e.g. market, balance-sheet and behavioural data)
» Decisions are based on the model result without analysing the text manually
The simplest method to analyse text is to compare individual words with word lists for certain categories
Example of a news article:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
Challenges in the field of text analytics:
» Define lists of "positive" and "negative" words:
› Negative: BANKRUPTCY, DEFAULT, DANGER, DEVALUE, DOWNGRADE, FRAUD, etc.
› Positive: BENEFIT, BOOST, GAIN, BEST, IMPROVE, OPPORTUNITY, PROGRESS, WIN, etc.
» Dictionaries for different contexts (general, financial, political, …) are available online
» Dictionaries imply high maintenance costs: e.g. if one word is added, all tenses and all cases must be added too (see also stemming)
» Many words are ambiguous and occur on more than one list (e.g. fine, company, sound)
» Calculate the ratios: here Positive = p/(n+p) = 100% and Negative = n/(n+p) = 0%
Simple methods fail to identify the correct sentiment of a news article conditional on a company, so that
more sophisticated methods are needed.
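A minimal Python sketch of this dictionary approach (the word lists are illustrative excerpts, and the naive tokenization is part of the problem the following slides address):

```python
POSITIVE = {"benefit", "boost", "gain", "best", "improve",
            "opportunity", "progress", "win"}
NEGATIVE = {"bankruptcy", "default", "danger", "devalue", "downgrade", "fraud"}

def sentiment_ratios(text):
    """Return the ratios p/(n+p) and n/(n+p) of positive/negative word hits."""
    tokens = [t.strip('.,:;!?"').lower() for t in text.split()]
    p = sum(t in POSITIVE for t in tokens)
    n = sum(t in NEGATIVE for t in tokens)
    return (p / (n + p), n / (n + p)) if n + p else (None, None)

headline = "Boeing bests EADS in surprise U.S. aerial tanker win"
print(sentiment_ratios(headline))  # -> (1.0, 0.0): Positive = 100%, Negative = 0%
# note: 'bests' is missed because only 'best' is listed -> see stemming
```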
Many dictionaries are available online, each with a different focus and built with different methods
1. General Inquirer
General purpose, 182 categories (e.g. Positive, Negative, Hostile, Strong, Power, Weak, Active, Passive); the dictionary also contains part-of-speech tags for each word (e.g. Noun, CONJ, DET, PREP); available via http://www.wjh.harvard.edu/~inquirer/
2. Loughran and McDonald Sentiment Word Lists
Financial / economic background (constructed in 2009 with 10-K filings), 6 categories (Litigious, Negative, Positive, Strong, Uncertainty and Weak); available via http://www3.nd.edu/~mcdonald/Word_Lists.html
3. Subjectivity Lexicon
General purpose, contains 3 categories (positive, neutral and negative); available via http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
4. Diction 5 / 7
Includes 33 word categories (e.g. Accomplishment, Aggression, Centrality) and 6 variables based on count ratios in the word categories; the software is proprietary (see http://www.dictionsoftware.com/)
5. Linguistic Inquiry and Word Counts
Social and psychological background, 64 hierarchical word lists and summary statistics; the software is proprietary (see http://liwc.wpengine.com/)
6. Build your own
Based on expert knowledge or on a training set: find the words with the strongest discriminant power -> machine learning
Incorporating interaction between words and the position of a word relative
to the company name improves the text analysis
Example of a news article:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
Challenges in the field of text analytics:
» Define a weight for each word that depends on the word distance to the company name (and on the distance to the first word of the article).
» Do not consider words in isolation, but
› consider n-grams (sequences of words; this also works for negations),
› search for strong and weak words via appropriate dictionaries and close to the company name,
› consider the function of a word within the sentence -> see also part-of-speech tagging
Combining different word lists improves the text analysis.
The idea of Part-of-Speech tagging is to find the most likely sequence of
tags and analyse the words tag-specific
Example of a news article:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
Challenges in the field of text analytics:
» The word 'deal' has different functions in English:
› it can be a verb,
› it can be a noun or
› an interjection
» The unconditional probability that it serves as a noun is 65%, according to the General Inquirer.
» Given that 'deal' is preceded by the word 'the', which certainly serves as an article, the noun is most likely meant here
» Given a word sequence, we use the sequence of tags that is most likely and then interpret the word
» Hidden Markov Models or Maximum Entropy Models may be applied for POS tagging
General Inquirer entry for 'deal':
› Positive, NOUN, 65%: idiom-noun: 'a great deal,' 'a good deal,' etc.; an indefinite but large quantity
› Active, SUPV, 34%: verb: to take action with respect to something or someone, to handle
› Negative, INTJ, 1%: idiom-noun: 'big deal'; sarcastic admiration
Part-of-speech tagging improves the text analysis.
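For illustration, a short sketch of POS tagging with NLTK (note that NLTK's default tagger is an averaged perceptron rather than an HMM or maximum-entropy model; the 'punkt' and 'averaged_perceptron_tagger' resources must be downloaded first):

```python
import nltk

sentence = "Analysts had expected Europe's EADS to win the deal."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# 'deal' is tagged NN (noun), as the preceding article 'the' suggests
```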
Other news articles provide important context and make it possible to differentiate between sentiment and information
Example news articles:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
------------------------------------------------------------
Boeing wins U.S. tanker competition:
Pentagon
Thu Feb 24, 2011 5:09pm EST
Boeing Co has won a contract to build new
refuelling planes for the U.S. Air Force, […]
Challenges in the field of text analytics:
» Define a similarity measure to compare two news articles with each other.
» Define
› a time interval (e.g. 24 hours) or
› a number of news articles (e.g. 40)
and compare all news items within this interval, or the n preceding news articles, with the current article
» Articles that are very similar to previously published articles should receive a small weight
» The similarity measure can be defined via the vector-space representation of text, as the angle between two news articles, i.e. between the corresponding vectors.
» The vector-space representation and a similarity measure offer the opportunity to apply a kNN approach to classify news articles.
Sentiment is not identical to information.
Other news articles provide important context and make it possible to differentiate between sentiment and information
Challenges in the field of text analytics:
» Weight words according to their position in the news article, to make the comparison sensitive to the news structure and to give more attention to the beginning of a news item
» Problem: since there are about 13,000 English words, the word/document matrix quickly becomes high-dimensional and very sparse.
» Dismiss stop words (a, for, there, this, …)
» Identify word stems (standing and stands)
» Identify synonyms and n-grams (article and news)
News 1 (word weights in brackets):
This (1.00) news (0.99) stands (0.97) for (0.95) a (0.92) simple (0.90) wildcard (0.87) for (0.83) more (0.79) meaningful (0.74) news (0.70)
News 2 (word weights in brackets):
Here (1.00) is (0.99) another (0.97) article (0.95) without (0.92) meaningful (0.90) content (0.87) standing (0.83) for (0.79) some (0.74) news (0.70)
cos(x, y) = ⟨x, y⟩ / (‖x‖ · ‖y‖) = 3.2552 / 9.66 = 0.3370

Word-weight vectors of the two news items:
Word        News 1  News 2
a           0.92    0
another     0       0.97
article     0       0.95
content     0       0.87
for         1.78    0.79
here        0       1.00
is          0       0.99
meaningful  0.74    0.90
more        0.79    0
news        1.69    0.70
simple      0.90    0
some        0       0.74
standing    0       0.83
stands      0.97    0
this        1.00    0
wildcard    0.87    0
without     0       0.92
A simple norm can be defined for text, which allows documents to be compared.
kNN may serve as an alternative classification algorithm, given labelled training data.
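A minimal sketch of the cosine-similarity computation on the two weighted word vectors above:

```python
import math

news1 = {"this": 1.00, "news": 1.69, "stands": 0.97, "for": 1.78, "a": 0.92,
         "simple": 0.90, "wildcard": 0.87, "more": 0.79, "meaningful": 0.74}
news2 = {"here": 1.00, "is": 0.99, "another": 0.97, "article": 0.95,
         "without": 0.92, "meaningful": 0.90, "content": 0.87,
         "standing": 0.83, "for": 0.79, "some": 0.74, "news": 0.70}

def cosine(x, y):
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())          # <x, y>
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))   # ||v||
    return dot / (norm(x) * norm(y))

print(round(cosine(news1, news2), 4))  # ~0.33: the articles are not very similar
```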
The Porter Stemmer consists of simple rules that cut a word down to its stem
Examples of stemming rules:
Step 1a IES -> I ponies -> poni
SSES -> SS caresses -> caress
SS -> SS caress -> caress
S -> ‘’ cats -> cat
Step 1b (m>0) EED -> EE feed -> feed
agreed -> agree
(*v*) ED -> ‘’ plastered -> plaster
bled -> bled
(*v*) ING -> ‘’ motoring -> motor
sing -> sing
Step 1c (*v*) Y -> I happy -> happi
sky -> sky
Step 2 (m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
Step 3 (m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> ‘’ formative -> form
(m>0) ALIZE -> AL formalize -> formal
Step 4 (m>1) AL -> ‘’ revival -> reviv
(m>1) ANCE -> ‘’ allowance -> allow
(m>1) ENCE -> ‘’ inference -> infer
Step 5a (m>1) E -> ‘’ probate -> probat
rate -> rate
Step 5b (m > 1 and *d and *L)
-> single letter controll -> control
Remarks on the Porter Stemmer:
» Every word can be represented as C?(VC){m}V?, where
› C is a sequence of consonants
› V is a sequence of vowels
› (.){m} denotes an m-fold repetition of the expression in the brackets
› ? denotes optionality of the preceding expression
› * denotes a wildcard
» With this notation, there are five steps to take to cut a word down to its stem (see the excerpt of the rules above)
» However, word stems are not always real words, and stemming rules may fail for some words (e.g. European / Europe or matrices / matrix)
» Clever stemming reduces the dimension by a factor of 10!
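In practice the Porter stemmer comes off the shelf, e.g. in NLTK; a quick sketch reproducing some of the rule examples above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["ponies", "caresses", "cats", "plastered",
             "motoring", "happy", "conditional", "allowance"]:
    print(word, "->", stemmer.stem(word))
# e.g. ponies -> poni, caresses -> caress, motoring -> motor
```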
Based on a training set, algorithms may identify n-grams and synonyms to reduce the dimensionality of the word space and improve the accuracy
Identifying synonyms:
» Synonyms in a given context possess similar meanings and can be seen as substitutes
» Hence it is likely that the words surrounding two synonyms are the same
» Consider two words as synonyms if the sets of words used with them have a sufficiently large intersection
» Depending on the context, 'apple' and 'fruit' or 'Linde' and 'Baum' (German for 'linden' and 'tree') are synonyms, but not always!
Identifying n-grams and composed words:
» We can use the data itself to decide whether words stick together, using a PCA
» PCA performs an eigenvalue decomposition of the data covariance matrix
» We can use the eigenvalues and eigenvectors to reduce the dimension of our problem
» Let x and y be the counts of two related words, e.g. 'social' & 'media' or 'machine' & 'learning'
» Every data point represents one message
(Figure: scatter plot of word counts; plotted word labels include article, story, write, news, read, journal, another word, line, produce, price, book.)
We can apply those methods even to languages we know nothing about (e.g. Chinese, Arabic, Klingon, …).
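A minimal numpy sketch of the PCA idea above, on hypothetical per-message counts of 'machine' (x) and 'learning' (y):

```python
import numpy as np

# toy per-message counts of the words 'machine' and 'learning'
counts = np.array([[3, 3], [1, 1], [4, 5], [0, 1], [2, 2], [5, 4]], float)

cov = np.cov(counts, rowvar=False)       # 2x2 data covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (ascending order)
print(eigvals)                           # one dominant eigenvalue ...
print(eigvecs[:, -1])                    # ... with principal direction ~ (1,1)/sqrt(2):
                                         # the two words 'stick together' as an n-gram
```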
Combining the previous concepts, we consider five independent indicators
and aggregate them geometrically to one overall signal
(Figure: the five concept dimensions, illustrated for News 1 (negative) and News 2 (positive).)
» Information: comparison of the message with messages published before or on the same day, in order to recognize recurring news stories. If the information was already known, we adjust the message by assigning it a lower weight.
» Relevance: measures how much the message focuses on the assessed company.
» Sentiment: assigns a positive (green) or negative (red) implication of the news story for the company under consideration.
» Certainty: measures whether the article contains final and certain information, or whether the information presented is speculation about expected future developments.
» Readability: measures the complexity of the language. Complex expressions and complicated words are more prone to be misinterpreted by the algorithm. To a certain extent this also applies to humans…
The model is easier to test and calibrate when the dimensions are kept separate at first.
Machine Learning
Machine Learning is not a new hype
» Machine learning is not new. Early inventions were driven by the military.
» The Internet age: IBM, Google, Amazon and Facebook are leading a renaissance of machine learning.
(Figure: Google searches on big data, machine learning and data science.)
Milestones from 1946 to 2016, including the "Artificial Intelligence Winter":
» 1946: US Army's general-purpose computer ENIAC
» 1958: US Office of Naval Research: Perceptron artificial neural network
» Learning machine by Arthur Samuel plays checkers
» 1985: rediscovery of the backpropagation algorithm for neural networks
» Statistical approaches: support vector machines
» 1997: IBM's Deep Blue beats Garry Kasparov
» Random forests
» Large-scale perceptrons
» 2011: IBM's Watson wins Jeopardy!
» 2016: Google's AlphaGo
ML is no longer limited to artificial-intelligence researchers and born-digital
companies like Amazon, Google, and Facebook
(Mind map of machine learning applications:)
marketing & logistics, finance, product placement & pricing, science, modelling, prediction, text analytics, fraud detection, robo-advisors, personal assistance (Google Now / Siri), social marketing, filtering, image detection, search engines, autonomous systems, automated trading
Machine learning applied by d-fine
» Big data analytics and machine learning are global trends, relevant not only for marketing by the big internet technology companies such as Amazon, Google and Facebook
» Automated data analysis and predictions are also relevant in sales, logistics, risk management, customer support, human resources, operations, health care, insurance, life sciences, electric grid distribution, manufacturing, …
Example applications across banking, insurance, energy and health care: customer support, automated trading, cost prediction, fraud detection, treatment optimization, quality management, risk scoring, computer-assisted diagnosis, product pricing, loss given default, claims prediction, product engineering
Why is this also relevant for d-fine, a consultancy with clients mainly in banking, insurance and energy?
How to get started with machine learning
Machine Learning is useful when…
» humans are unable to explain their expertise
» the solution changes over time
» the solution needs to be adapted to particular cases
» human expertise does not exist
Where Machine Learning is applied…
» speech recognition, text recognition, handwriting detection, fraud detection (too complex to formulate)
» financial modelling, credit scoring, routing on a computer network, horse betting (time-varying solution needed)
» filtering, user biometrics recognition, personal advertising, market basket analysis, web search (strong adaptation needed)
» autonomous systems, navigating on Mars (no expertise available)
» …
Machine Learning is especially useful when the functional dependencies between input data and desired output data are too complex, change over time or are unknown.
What exactly is machine learning?
Machine Learning takes the task of defining exact functional dependencies out of the act of programming.
» Traditional programming: input data and a program go into the computer; output data comes out.
» Machine learning: input data and output data go into the computer; a program comes out.
With machine learning, the software programs its functionality by itself.
Machine learning basics – supervised learning
» One form of machine learning is supervised learning, where a training algorithm is fed with example pairs of input and output variables to determine the predictor.
» Example: supervised learning
› input feature x
› target variable y
› training set: a list of pairs (x, y)_i of features and target variables
» When the target variable is continuous, the learning problem is called a regression problem; when the target variable is discrete, it is called a classification problem.
(Visualization: training set (x, y)_i -> learning algorithm -> predictor f(x), mapping x to a predicted y.)
The goal of supervised learning is to find an optimal predictor based on labelled training data.
Classification of machine learning concepts
Machine Learning concepts can be classified by the grade of supervision and the learning depth.
» Supervised concepts: labelled training data, error estimation possible
» Unsupervised concepts: no labelling, structures determined by the algorithm
» Deep learning: higher-level abstractions by multiple processing layers, (pseudo-) artificial intelligence
Deep learning:
» Artificial neural networks
› backpropagation
› recurrent neural networks
› convolutional neural networks
» Survey propagation
» Deep belief networks
» Sparse autoencoders
Shallow learning:
» Generalized linear regression models
» Perceptron
» Support vector machines
» Boosting
» Gaussian mixture models
» Restricted Boltzmann machines
» Sparse coding
» Autoencoder neural networks
ML concepts can be classified by learning depth and grade of supervision; however, the transition between the regimes is fluid.
Example: linear regression
Let us assume we want to predict the price of a real-estate property from its living area or other properties.
» Feature variables: a vector x = (x1, …, xi, …) built up of properties like living area, number of bedrooms, …
» Target variable: the price of the real estate
» Training set: collected data, e.g.
Object  bedrooms  living area  price p
A       3         120 m²       1,100 €
B       1         40 m²        330 €
C       2         80 m²        600 €
» Predictor ansatz: p(x) = b0 + b1·x1 + b2·x2 + …, where b0 is the intercept and the bi (for i > 0) are the weights
» Learning algorithm: b = (b0, …, bi, …) can be determined with the least-mean-squares concept and gradient descent
The problem is to determine b in a way that makes the best possible prediction.
Least mean squares
Stochastic gradient descent
» Error estimation for the training set (x_j, y_j) by least squares:
J(b) = (1/2) Σ_{j=1}^{m} (p_b(x_j) − y_j)²
» Start with an initial b and apply the gradient descent rule:
update b_i := b_i − α · ∂J(b)/∂b_i for each i simultaneously
» For a single observation we find the derivative ∂J(b)/∂b_i = (p_b(x) − y) · x_i
» Algorithm for the whole training set: stochastic gradient descent
Loop until converged {
  Loop for j = 1 to m {
    b_i := b_i − α (p_b(x_j) − y_j) x_{j,i} for every i
  }
}
» Example: the local minimum found by gradient descent depends on the initial choice of b and on the learning rate α.
Gradient descent leads to a local minimum with learning rate α.
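A minimal numpy sketch of stochastic gradient descent for the linear predictor on the toy real-estate data from the previous slide (learning rate and iteration count are illustrative):

```python
import numpy as np

X = np.array([[1, 3, 120], [1, 1, 40], [1, 2, 80]], float)  # x0=1, bedrooms, area
y = np.array([1100.0, 330.0, 600.0])                        # price

b, alpha = np.zeros(3), 1e-5
for _ in range(20000):                    # "loop until converged"
    for xj, yj in zip(X, y):              # loop over the training set
        b -= alpha * (xj @ b - yj) * xj   # b_i := b_i - a (p_b(x_j) - y_j) x_i
print(b, X @ b)                           # fitted weights and predictions
```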
Example: logistic regression
A classification problem in finance is the prediction of the default probability of a borrower.
» Feature variables: a vector x built up of personal data, past credit history or current payment behaviour of the debtor
» Target variable: the binary default variable Y (Y = 1: default, Y = 0: no default)
» Training set: historic data, e.g.
Name  age  income    default
A     32   46,000 €  0
B     26   31,000 €  1
C     54   60,000 €  0
» Predictor ansatz: P(Y = 1 | x) = f_L(b0 + b1·x1 + …) = f_L(b·x), where f_L(s) = 1 / (1 + exp(−s)). This ansatz ensures that the probability is confined to the interval [0, 1].
» Learning algorithm: e.g. maximum likelihood estimation (MLE) and gradient ascent
The logistic predictor has to be trained individually for each bank's portfolio and has to be recalibrated yearly.
Maximum likelihood estimation
Assume we have a training set of n independent observations with an identical underlying distribution depending on a fit vector b. Then the joint density function is the product of the individual density functions:
f(x_1, …, x_n; b) = ∏_{i=1}^{n} f_{X_i}(x_i; b) =: L(b; x_1, …, x_n)
The trick is to view this as a function of b and to consider the x_i as constant parameters. This is called the likelihood function L(b; x_1, …, x_n). The best fit to the data is reached when this likelihood is maximized. However, it is often more convenient to maximize the logarithmic likelihood
l(b) = log L(b; x_1, …, x_n)
To illustrate this, let us first assume that feature and target variables are connected by
y_i = b·x_i + ε_i
where ε_i is random noise with a Gaussian distribution
f(ε_i) = 1/√(2πσ²) · exp(−ε_i² / (2σ²)), which implies f(y_i | x_i; b) = 1/√(2πσ²) · exp(−(y_i − b·x_i)² / (2σ²))
Maximum likelihood estimation
We can now evaluate the log likelihood:
l(b) = log ∏_{i=1}^{n} 1/√(2πσ²) · exp(−(y_i − b·x_i)² / (2σ²))
     = Σ_{i=1}^{n} log [ 1/√(2πσ²) · exp(−(y_i − b·x_i)² / (2σ²)) ]
     = n · log(1/√(2πσ²)) − 1/(2σ²) · Σ_{i=1}^{n} (y_i − b·x_i)²
Maximizing this expression leads again to the least-mean-squares expression (1/2) Σ_{i=1}^{n} (y_i − b·x_i)² that we used to optimize the linear regression problem.
Coming back to our logistic regression problem, knowing that
P(Y = 1 | x) = f_L(b·x) and P(Y = 0 | x) = 1 − f_L(b·x),
we can construct an individual density function as
P(y | x) = (f_L(b·x))^y · (1 − f_L(b·x))^(1−y)
The log likelihood is then
l(b) = log L(b) = Σ_{i=1}^{n} [ y_i · log f_L(b·x_i) + (1 − y_i) · log(1 − f_L(b·x_i)) ]
To maximize the log likelihood we analogously use the gradient ascent rule
b_j := b_j + α · ∂l(b)/∂b_j
Using the differentiation rule of the logistic function, f_L′(s) = f_L(s) · (1 − f_L(s)), the derivative (per observation) evaluates to
∂l(b)/∂b_j = (y − f_L(b·x)) · x_j
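The resulting gradient-ascent training loop, as a minimal numpy sketch on the toy default data from the previous slide (income in k€; hyperparameters are illustrative):

```python
import numpy as np

X = np.array([[1, 32, 46], [1, 26, 31], [1, 54, 60]], float)  # x0=1, age, income
y = np.array([0.0, 1.0, 0.0])                                 # default indicator

f_L = lambda s: 1.0 / (1.0 + np.exp(-s))                      # logistic function

b, alpha = np.zeros(3), 0.01
for _ in range(5000):
    for xi, yi in zip(X, y):
        b += alpha * (yi - f_L(b @ xi)) * xi   # b_j := b_j + a (y - f_L(b.x)) x_j
print(f_L(X @ b))                              # fitted default probabilities
```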
The perceptron
We end up again with a stochastic gradient ascent rule
b_j := b_j + α · (y − f_L(b·x)) · x_j
where f_L(s) is a non-linear function.
Interchanging the logistic function f_L(s) in this algorithm with the step function
H(s) = 1 for s ≥ 0, H(s) = 0 for s < 0,
brings us to the so-called perceptron and the perceptron learning rule
b_j := b_j + α · (y − H(b·x)) · x_j
Historically it was thought that a perceptron resembles the way a human neuron works, as it transfers a signal (feature variable) to a non-zero output y = 1 only when the signal exceeds a certain threshold defined by the weights.
A single perceptron connected to two input variables can realize the logic OR function.
(Diagram: inputs x1 and x2 with weights b1 and b2 plus bias b0 feed the output y = H(b0 + b1·x1 + b2·x2).)
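A minimal numpy sketch of the perceptron learning rule realizing this OR function (learning rate and epoch count are illustrative):

```python
import numpy as np

H = lambda s: 1.0 if s >= 0 else 0.0            # step function

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], float)  # x0=1 (bias)
y = np.array([0.0, 1.0, 1.0, 1.0])              # OR(x1, x2)

b, alpha = np.zeros(3), 0.1
for _ in range(20):                             # a few passes suffice here
    for xi, yi in zip(X, y):
        b += alpha * (yi - H(b @ xi)) * xi      # perceptron learning rule
print([H(b @ xi) for xi in X])                  # -> [0.0, 1.0, 1.0, 1.0]
```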
Deep learning networks
» Artificial neurons can have an arbitrary transfer function φ(Σ_j ω_ij·x_j) and an unlimited number of inputs.
» A layered network of connected neurons is a powerful tool to predict complex non-linear dependencies in data.
But how can a neural network be trained?
» One-layer network training: perceptron learning rule
» A training method for multi-layer networks is backpropagation:
1. Propagation of the input features to the output variables o
2. Estimation of the mean square error E = (1/2)·(t − o)², where t is the training target variable
3. Back-propagation of the error through the network, with weight adjustments corresponding to gradient descent
In detail: the weight ω_ij between the i-th and the j-th neuron has to be updated via
ω_ij := ω_ij − α · ∂E/∂ω_ij = ω_ij + α · δ_j · o_i, where
δ_j = φ′(net_j) · (t_j − o_j), if j is an output neuron,
δ_j = φ′(net_j) · Σ_k δ_k·ω_jk, if j is a hidden neuron,
and net_j = Σ_i ω_ij·o_i denotes the weighted input of neuron j.
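A minimal numpy sketch of these update rules for a network with one hidden layer (logistic activation; XOR as the classic toy task; whether and how fast it converges depends on the random initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda s: 1.0 / (1.0 + np.exp(-s))        # activation, phi' = phi (1 - phi)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([[0.0], [1.0], [1.0], [0.0]])      # XOR: needs a hidden layer
ones = np.ones((4, 1))

W1 = rng.normal(size=(3, 4))                    # input (+bias) -> 4 hidden neurons
W2 = rng.normal(size=(5, 1))                    # hidden (+bias) -> 1 output neuron
alpha = 0.5
for _ in range(20000):
    h = phi(np.hstack([X, ones]) @ W1)          # 1. forward propagation
    o = phi(np.hstack([h, ones]) @ W2)
    delta_o = (t - o) * o * (1 - o)             # 2./3. output delta = phi' (t - o)
    delta_h = (delta_o @ W2[:4].T) * h * (1 - h)    # delta back-propagated to hidden
    W2 += alpha * np.hstack([h, ones]).T @ delta_o  # w_ij := w_ij + a delta_j o_i
    W1 += alpha * np.hstack([X, ones]).T @ delta_h
print(o.round(2))                               # typically ~ [[0], [1], [1], [0]]
```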
Bias–variance tradeoff
» Fitting with high-order polynomials b0 + b1·x + b2·x² + … + b5·x⁵ leads to a lower training error compared to a simple linear model.
» The bias is the error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs → underfitting.
» The variance is the error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modelling the random noise in the training data rather than the intended outputs.
» With more and more complex models and more parameters, we tend to overfit the noise and mask the true signal.
» Model selection: we want to choose the best trade-off between bias and variance.
Model selection – basics
Goal: pick the "best" model.
» First idea: find the model with the smallest training error.
» This does not work: it will always prefer high-variance models with the maximum number of parameters!
» New idea: simple cross-validation.
» Split the data into a training set with ca. 70% of the data and a validation set with 30% of the data.
» Fit the models to the training set and measure the error on the validation set.
» Problem: we lose 30% of our data.
» Problem 2: depending on the split, the validation MSE can differ in level.
» For our polynomial problem we plot the validation MSE.
» In accordance with our intuition, we see that degree 2 seems to be a good choice, with not much more to gain at higher orders.
We have found a way to quantify the bias–variance tradeoff.
Model selection – k-fold cross validation
Goal: pick the "best" model.
» How can we recycle more of our data?
1. Split the data into k subsets.
2. Fit the models using k−1 subsets and measure the MSE on the remaining subset.
3. Average the results over all possible choices of the k−1 subsets.
» We typically see a U-shape.
» Shown here: three shapes for different problems; the brown line is the problem we presented before.
Taking the bottom of each U-curve leads to the best model in terms of the bias–variance tradeoff.
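A short scikit-learn sketch of k-fold cross-validation over polynomial degrees (the toy data is illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, (100, 1))
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0, 1, 100)  # noisy quadratic

for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(degree, round(-scores.mean(), 3))  # the U-shape bottoms out near degree 2
```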
Deep Learning for NLP
Overview
"Any sufficiently advanced technology is indistinguishable from magic." (Arthur C. Clarke's third law, "Hazards of Prophecy: The Failure of Imagination", 1973)
Deep learning fuels lots of recent technological advancements and is applied
in the real world
Examples: automated logistics, product recommendations and tailored advertising, face recognition, community detection, Siri (Cortana, Alexa, …), pattern recognition, optimization of computing centers, self-driving cars.
Machine learning is being applied in almost all of the major tech companies.
Example 1 – Image recognition (1/2)
1http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
» Deep learning algorithms are capable of learning to recognize objects in pictures
» Standardized test for this kind of problem:
ImageNet (http://image-net.org/)
› Training set: 1.2 million images in 1000 categories
› Test set: 150k images
› Each year: Large Scale Visual Recognition Challenge (ILSVRC)
› Winner 2016 in category “Object localization” had an error of 2.9%
» Humans achieve an error rate of
4%-15% on this task1
Deep Learning techniques outperform humans at an image recognition task.
Example 1 – Image recognition (2/2)
Performance improved by more than 87% over the last 6 years.
Example 2 – Google Deepmind’s AlphaGo1
1https://deepmind.com/research/alphago/
» Combining techniques from deep learning and traditional search theory (Monte Carlo tree search), DeepMind was able to build an algorithm capable of superhuman performance at the game of Go
» In March 2016 AlphaGo defeated one of the top human Go players, Lee Sedol, 4:1 in a five-game match, and the reigning three-time European champion Fan Hui 5:0
» This is remarkable, as the branching factor of Go is about 250 (compare chess with a branching factor of ~35)
» In a first step, the algorithm was trained on games of human experts with supervised learning techniques
» In a second step, the algorithm improved itself through a large number of games of self-play, using techniques from (deep) reinforcement learning
» It used a new kind of hardware, so-called Tensor Processing Units (TPUs)
Deep Learning systems outperform humans in games of perfect information
Example 3 – Google’s “show and tell” algorithm1 (2016)
1Vinyals, Oriol, et al. "Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge." (2016) 2https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
» Deep learning algorithms are not only able to recognise objects
» They can even automatically generate captions for images and learn to combine objects
» Recently Google open-sourced its model "Show and Tell"2 for image captioning
» Examples:
Example 4 – Neural style transfer
1Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015)
» Deep learning algorithms are capable of creating works of art
» Given a "style" image and a "target" image, there are algorithms that transfer the selected style to the target image1
» This is even possible for videos! (Example: Youtube)
Example 5 – mortgage risk1
1Giesecke, Kay, J. Sirignano, and A. Sadhwani. Deep Learning for Mortgage Risk. Working Paper, Stanford University, 2016.
» Analysis of mortgage risk data for 120 million loans originated in the US between 1995 and 2014
» This amounts to ~3.5B monthly observations with 300 feature variables (e.g. FICO score, interest rate, income, …) and performance status (30, 60, 90+ days late, foreclosure, REO, paid off)
» Monthly loan performance for the retail loans was provided by CoreLogic
» Additionally, local (zip code, …) and national economic factors were collected from various sources
» Modelled were the transitions between loan performance states, evaluated out-of-sample:
(Figures: out-of-sample AUCs for month-ahead prediction, single model and ensemble.)
Example 6 – parsing of natural language: Google’s Parsey McParseface1
1Andor, Daniel, et al. "Globally normalized transition-based neural networks."arXiv preprint arXiv:1603.06042 (2016), https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
% correct head assignments in the tree (dependency parsing):
Model                    News   Web    Questions
Martins et al. (2013)    93.10  88.23  94.21
Zhang & McDonald (2014)  93.32  88.65  93.37
Weiss et al. (2015)      93.91  89.29  94.17
Andor et al. (2016)*     94.44  90.17  95.40
Parsey McParseface       94.15  89.08  94.77

Per-token accuracy in % (POS tagging):
Model                    News   Web    Questions
Ling et al. (2015)       97.78  94.03  96.18
Andor et al. (2016)*     97.77  94.80  96.86
Parsey McParseface       97.52  94.24  96.45
» Natural language is hard to parse for machines, e.g. because of prepositional-phrase attachment ambiguity
» Professional human linguists trained on this kind of task agree on 96%-97% of parsing decisions
» Parsey McParseface achieves an accuracy of about 94%!
Example 7 – predicting banking distress1
» Gathered 6.6M articles (3.4B words) from Reuters online in the period 2007 – 2014 (Q3)
» Monitored news items on 101 different large European banks
» The model was first pre-trained unsupervised on raw news items
» In a second step, training proceeded on 243 distress events such as government interventions, state aid, direct failures and distressed mergers
» The model yields an AUC of ~71%
» From the trained network a country-specific stress index was extracted:
1Rönnqvist, Samuel, and Peter Sarlin. "Identifying bank stress by deep learning of news." Workshop New Challenges in Neural Computation 2015. 2015, Rönnqvist, Samuel, and Peter Sarlin. "Detect & Describe: Deep learning of bank stress in the news." Computational Intelligence, 2015 IEEE Symposium Series on. IEEE, 2015, Rönnqvist, Samuel, and Peter Sarlin. "Bank distress in the news: Describing events through deep learning." arXiv preprint arXiv:1603.05670 (2016).
(Figures: stress index for Germany over time; posterior of the stress index over time.)
(Deep) neural networks
1See e.g. Bengio, Yoshua, et al. "Towards biologically plausible deep learning." arXiv preprint arXiv:1502.04156 (2015) for a discussion.
» Neural networks can be understood as a mathematical model of the neurons in brains1 (for all we know, this model is too simplistic to account for the processes in a real brain)
» Neurons "fire" electrical impulses along so-called axons to other neurons, thereby producing a dense and complicated web of interacting units
» In the mathematical model of a neuron, the signal x from another neuron first undergoes an affine transformation (whose parameters are called weights), which is then the input to a non-linear function called an activation function
» By building a network of such mathematical neurons one obtains a so-called neural network
General overview of neural networks
1There are also other types, like Hopfield networks or (deep) Boltzmann machines, which are not discussed here 2http://playground.tensorflow.org
» Typical1 neural network models come with various so-called layers
» There are various activation functions used in the literature and in applications
Nowadays neural networks are built with hundreds of layers, leading to a high capacity2.
Learning in neural networks
1See e.g. http://sebastianruder.com/optimizing-gradient-descent/ for a good overview.
» A neural network is parametrized by its topology, its activation functions and its weights
» In a real-world application one chooses a topology and the types of activation functions
» The weights w are then derived from training data by optimizing an objective function J(w)
» Example: supervised learning task. Given a set of N observations x_i with labels y_i, the weights w are fixed by
w = argmin_w Σ_i J(w; x_i, y_i) = argmin_w Σ_i J(NN(w, x_i); y_i)
where NN denotes the output of the neural network (or of almost any other ML algorithm)
› The learning problem is therefore mapped to an optimization problem for an, in general, non-convex objective function
› For neural networks this optimization problem is almost always attacked with the 17th-century technique of gradient descent, with some modern twists1
Types of neural networks
» Sequential (stateful) models
› These types of neural networks can be used to model sequential data of varying length, such as natural language
› Often in these types of networks a "cell" is repeated over and over
› By unrolling (possibly to infinity), these networks can formally be transformed into static ones
» Static (stateless) models
› These types of networks come with a fixed shape and number of layers and neurons
› They are often used for supervised learning tasks like image recognition
› Example: GoogLeNet (09/2014), which won the ILSVRC 2014 with almost 5 million parameters
Neural networks can be seen as "differentiable" versions of models for static and sequential data.
General properties of neural networks
1Can be weakened to measurable
» Static neural networks with at least one hidden layer are universal function approximators
» In other words, neural networks can in principle approximate any continuous1 function
» Special types of sequential neural networks called "recurrent neural networks" are Turing complete, which means that they can simulate arbitrary programs
These general results should be taken with a grain of salt, as they are only guaranteed with infinite resources and nothing is said about learnability from data.
There are several deep neural architectures for (un)supervised learning tasks
Convolutional (CNN):
» The static-length input features are "convolved"
» Each layer learns a more abstract representation of the data
» Equivalent to renormalization-group flow in physics
» Mainly used for static data such as images
Recursive:
» The input features come in a natural hierarchy or tree-like structure
» The neural net is applied all along the hierarchy
» Mainly used for hierarchical and tree-like data such as language
Boltzmann machines (and variants):
» The neural net consists of a visible and a hidden part
» The hidden part can learn arbitrarily complex representations of the inputs
» Comparable to HMMs
» Mainly used for static data
Recurrent (RNN):
» The input features are sequential and of different length
» The neural net is recurring
» The neural net can learn long-term dependencies (cf. LSTMs, GRUs)
» Mainly used for sequential data such as language and time series
Architectures can be hundreds of layers deep!
Deep neural networks in their naïve form have various problems
1Hochreiter, Sepp. "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universität München (1991): 91.
» Overfitting
› Deep neural networks typically have millions of free parameters
› Without care this typically leads to the overfitting phenomenon
» Lots of (labelled) data is needed for training
› Without expert knowledge, which could be built either into the topology of the net or into constraints on the weights and/or the objective function, lots of labelled data is needed to bring deep neural nets into a regime of good generalization behaviour
» Vanishing gradient problem1
› Neural networks are usually trained by various incarnations of gradient descent, e.g.
w_{i+1} = w_i − γ · ∂J(w)/∂w |_{w_i}
› By the chain rule, this leads to products of derivatives of the activation functions
› As these factors mostly take values in [−1, 1], the products become very small for deep networks
» Slow training
› Optimizing a non-convex function with millions of terms and millions of variables is computationally very expensive
› Without special hardware the training of deep neural nets is not feasible
What rescues deep learning?
1Nonetheless, remarkable progress was made during the 90s by people like Jürgen Schmidhuber (ETH), Geoffrey Hinton (Google), Yann LeCun (Facebook), Yoshua Bengio (U. Montreal), Andrew Ng (Baidu) and others.
» Deep neural nets are hard to train due to what is known as the "vanishing/exploding gradient problem"
» In the 90s this (among other things) led to a period called the AI winter and almost to an abandonment of the idea of neural nets1. Progress during the last 10 years has made it possible to train very deep nets with hundreds of layers
» Responsible for this progress are mainly:
› growth of available computing power: clusters of (C,G,T)PUs
› availability of large amounts of (labelled) data
› methodological breakthroughs (pre-training, dropout, LSTMs/GRUs, ReLUs, stochastic-depth training, convnets, …)
» With these techniques deep neural nets have reached super-human abilities in many areas, including image recognition, geolocating images, game playing, sentiment analysis, …
» There are several software frameworks available for the training of deep neural nets:
Framework       Developer         Language(s)
TensorFlow      Google            C++, Python
Theano          U. Montreal       Python
Torch           Collobert et al.  C, Lua
CNTK            Microsoft         C++
Caffe           U. Berkeley       C++, Python
MXNet           DMLC              C++, Python, R, Julia, …
H2O             H2O.ai            Java, Scala, Python, R
Deeplearning4j  Skymind           Java, Scala, C
neon            Nervana           Python
Why should neural networks be deep after all?
Example: natural language (1/2)
Lin, Henry, and Max Tegmark. "Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language." arXiv preprint arXiv:1606.06737(2016), Lin, Henry W., and Max Tegmark. "Why does deep and cheap learning work so well?." arXiv preprint arXiv:1608.08225 (2016).
» Studying the empirical mutual information (a kind of two-point function) between symbols in natural written language unveils a power-law behaviour!
» These long-range interactions cannot, even in theory, be modelled by simple shallow models like hidden Markov models (HMMs)
Why should neural networks be deep after all?
Example: natural language (2/2)
1Mehta, Pankaj, and David J. Schwab. "An exact mapping between the variational renormalization group and deep learning." arXiv preprint arXiv:1410.3831 (2014), https://charlesmartin14.wordpress.com/2015/04/01/why-deep-learning-works-ii-the-renormalization-group/
» This would explain two empirically observed properties of deep neural networks:
› These types of models are able to extract high-level features from microscopic data (e.g. raw pixels to categories of objects), as the data flow to fixed points under the RG flow (universality)
› The "two-point functions" of deep neural networks generally exhibit a power-law decay near their critical points
» Nonetheless, deep models can sometimes be approximated by simpler shallow models!
» Hallucinating Wikipedia entries (more on this in a moment) with a deep recurrent neural architecture captures the long-range interactions present in natural language
» This is no accident, as it has been argued that deep neural architectures are related to a well-known set of ideas in physics, namely the renormalization group (RG)
Deep Learning can be mapped onto the Renormalization Group known from physics.
Word embeddings
Problems with discrete word representations
Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12.Aug (2011): 2493-2537.
Context examples for the word "banking":
"… saying that Europe needs unified banking regulation to replace the hodgepodge …"
"… government debt problems turning into banking crises as has happened in …"
The surrounding words will come to represent "banking".
» Manually curated word resources are great but miss nuances, e.g. among synonyms: adept, expert, good, practiced, proficient, skillful?
» They miss new words (impossible to keep up to date): wicked, badass, nifty, crack, ace, wizard, genius, ninja
» They are subjective and require human labor to create and adapt
» They make it hard to compute accurate word similarity
» Instead: use the distributional hypothesis: you can get a lot of value by representing a word by means of its neighbours
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
Distributional similarity is one of the most successful ideas of modern statistical NLP.
Word embeddings (1/4)
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
» Window-based co-occurrence matrix (window size 1) for the example corpus:
I like deep learning.
I like NLP.
I enjoy flying.

Counts    I  like  enjoy  deep  learning  NLP  flying  .
I         0  2     1      0     0         0    0       0
like      2  0     0      1     0         1    0       0
enjoy     1  0     0      0     0         0    1       0
deep      0  1     0      0     1         0    0       0
learning  0  0     0      1     0         0    0       1
NLP       0  1     0      0     0         0    0       1
flying    0  0     1      0     0         0    0       1
.         0  0     0      0     1         1    1       0

» Idea: instead of capturing co-occurrence counts, directly predict the surrounding words of every word
» This idea was used by Mikolov et al. to embed a corpus of words into a high-dimensional vector space
» The algorithm is called word2vec
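For illustration, training word2vec on the toy corpus above with gensim (a sketch assuming gensim 4.x, where the dimension parameter is called vector_size; on three sentences the resulting vectors are of course meaningless):

```python
from gensim.models import Word2Vec

corpus = [["i", "like", "deep", "learning", "."],
          ["i", "like", "nlp", "."],
          ["i", "enjoy", "flying", "."]]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["nlp"][:5])            # first entries of the learned vector
print(model.wv.most_similar("nlp"))   # nearest neighbours in the embedding space
```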
Word embeddings (2/4): word2vec
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
(Figures: the continuous bag-of-words (CBOW) model and the skip-gram model.)
Word embeddings (3/4): word2vec
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013. Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12.Aug (2011): 2493-2537.
» Idea: predict the surrounding words in a window of length m around every word
» Objective function: maximize the log probability of any context word given the current center word (see below)
» The simplest model for the probabilities P(w_{t+j} | h) is given by "dynamic" logistic regression
» The optimization of this objective function is not feasible for normal corpora
» word2vec uses a clever Monte Carlo algorithm called noise-contrastive training to circumvent this problem (objective functions sketched below)
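The formulas on this slide were figures and did not survive extraction; for reference, the skip-gram objective and the negative-sampling surrogate as given in Mikolov et al. read:

```latex
% Skip-gram: maximize the average log probability of the context words
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0}
            \log p\left(w_{t+j} \mid w_t\right)

% Negative sampling: one observed pair (w_O, w_I) against k noise words from P_n
\log \sigma\left({v'}_{w_O}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[\log \sigma\left(-{v'}_{w_i}^{\top} v_{w_I}\right)\right]
```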
Word embeddings (4/4): GloVe1
1Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014. Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12.Aug (2011): 2493-2537.
» Idea: improve word2vec by using co-occurrence statistics, which leads to a new objective function (see below)
» GloVe is generally faster to train than word2vec
» It works well even on small corpora
» It can be scaled to large corpora
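The new objective function was a figure on the slide; from Pennington et al. it reads:

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right)
    \left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
                     1 & \text{otherwise} \end{cases}
```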
Examples of GloVe embeddings (1/3)
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014.
v(king) − v(man) + v(woman) ≃ v(queen)
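Such analogy queries can be reproduced with gensim and pre-trained GloVe vectors (a sketch; the vector file name is illustrative and the vectors must first be converted to word2vec format):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("glove.6B.300d.word2vec.txt")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', ...)]
```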
Examples of GloVe embeddings (2/3)
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014.
GloVe as a preprocessing step on one of our projects
» Project: detection of insider trading in the chats of traders
» Training data:
› 5 years of traders' chat history from 2011-2016 (unstructured)
› amounting to around 1.4 million chats
› around 86 GB of raw XML files
» The data was not cleaned (apart from cutting persistent chat rooms)
» Training on a Microsoft Azure N12 instance with two Nvidia K80s (~4992 CUDA cores per card and up to 2.91 Tflops double precision)
» Transfer to Microsoft Azure in a hashed format (quite a challenge …)
» Training was terminated after 20h
» Examples of learned vectors (nearest neighbours by similarity):

word       similarity      word          similarity
traders    1.00            fraud         1.00
investors  0.87            charges       0.85
dealers    0.84            bribery       0.85
stocks     0.82            alleged       0.84
markets    0.81            corruption    0.82
prices     0.80            embezzlement  0.81
The new star? fasttext by Facebook Research1
1Joulin, Armand, et al. "Bag of Tricks for Efficient Text Classification." arXiv preprint arXiv:1607.01759 (2016).
» Recently Facebook published a new algorithm for efficient word embeddings: fastText
» Amazingly, no deep architecture is used
» Instead, the model is a linear softmax classifier on n-gram features x_i
» The objective function is simply a weighted log-likelihood (see below)
» Normally this problem has a computational complexity of O(k·h), where k is the number of classes and h the embedding dimension
» By utilising the hierarchical softmax together with an efficient mapping of the n-grams via the "hashing trick", the complexity can be reduced to O(h·log2(k))
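The weighted log-likelihood minimized by fastText (from Joulin et al.), with x_n the normalized bag of n-gram features of the n-th document, A the embedding (look-up) matrix, B the output weights and f the softmax:

```latex
-\frac{1}{N} \sum_{n=1}^{N} y_n \log\left( f\left(B A x_n\right) \right)
```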
NLP (almost) from scratch
Recurrent neural networks (RNNs)
1Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
» Recurrent neural networks process sequences by passing a hidden state from one time step to the next
» Problem: long-term dependencies are hard to learn, as gradients vanish over long sequences
» Long short-term memory (LSTM) cells¹ to the rescue (see the gate equations below)
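The standard LSTM cell update, in the notation of the cited blog post (σ is the logistic sigmoid, ⊙ elementwise multiplication, [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input):

```latex
% LSTM cell: forget, input and output gates control the cell state C_t.
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            % input gate
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     % candidate cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   % new cell state
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            % output gate
h_t = o_t \odot \tanh(C_t)                        % new hidden state
```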
Training a language model at the character level
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
» With RNNs it is possible to train a language model character by character!
» These models automatically learn punctuation and other characteristics of the underlying text
» The models are language-agnostic and can be used for various tasks
» For example, they can generate text in a certain style, act as a spelling corrector, or read and "fantasize" house numbers, … (a minimal sampling sketch follows below)
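A minimal sketch (pure numpy, in the spirit of the cited char-rnn post; the weight matrices and the vocabulary size are assumed to come from a previously trained vanilla RNN, so this is not the slides' own code) of sampling text character by character:

```python
# Sample a character sequence from a trained vanilla RNN language model.
import numpy as np

def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by, vocab_size):
    """Sample n character indices, starting from the one-hot seed_ix."""
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # recurrent state update
        y = Why @ h + by                      # unnormalised scores per character
        p = np.exp(y) / np.sum(np.exp(y))     # softmax -> next-char distribution
        ix = np.random.choice(vocab_size, p=p.ravel())
        x = np.zeros((vocab_size, 1))         # feed the sampled char back in
        x[ix] = 1
        ixes.append(ix)
    return ixes
```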
Examples of hallucinated text
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[Figure: hallucinated Linux source code (left) and algebraic geometry generated from LaTeX sources (right)]
How does the language model learn?
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[Figure: text samples drawn from the model after ~100, ~300, ~500, ~700, ~1200 and ~2000 training iterations]
» LSTM trained on Leo Tolstoy's War and Peace, sampled at various stages of training
Outlook
Outlook
Kumar, Ankit, et al. "Ask me anything: Dynamic memory networks for natural language processing." arXiv preprint arXiv:1506.07285 (2015).
» Attention and memory mechanisms ("Neural Turing Machines" and "dynamic memory networks")
» Sequence-to-sequence models
» Can the topology of a neural network also be learned from data (e.g. Bayesian neural networks)?
» What about ensembles of neural networks (as with e.g. random forests)? (Dropout, …)
» What is the relation between deep learning and strong AI? (OpenAI, DeepMind)
» What is the relation between deep learning and biological brains?
Outlook: Attention and dynamic memory networks
Kumar, Ankit, et al. "Ask me anything: Dynamic memory networks for natural language processing." arXiv preprint arXiv:1506.07285 (2015).
Karpathy, Andrej. "Connecting Images and Natural Language." PhD thesis, 2016.
ChatAnalytics
An overview of a d-fine project
Penalties for misconduct by traders can hit banks hard, especially when internal control mechanisms fail
Penalties from the latest scandals
» Libor scandal:
› Deutsche Bank: $2.5bn
› UBS: $1.5bn
» Manipulation of the foreign-exchange market:
› JPMorgan, Citigroup, Barclays, RBS and UBS: $5.6bn in total

Fines over the last years
[Chart: how much the banks pay in penalties and legal costs, in billions of dollars]

Communication channels of traders
» 325,000 Bloomberg Professional users send 200B e-mails and 15-20B instant messages a day!
» "If a single function can be said to justify the $20,000 annual cost of a Bloomberg terminal, it is probably Instant Bloomberg, which the company accurately describes as 'the dominant chat tool used by the global financial community'" (FT)
Banking supervisors and compliance departments advocate stricter and, due to the high volume, automated control mechanisms to identify misbehaviour by traders.
Traders communicate mainly over Instant Bloomberg, where they sometimes talk openly about market manipulation and more
» UBS-trader: "i was frontrunning EVERY single offer in
usdjpy and eurjpy." (Tagesschau.de)
» UBS-trader: "(...) i was frontrunning EVERY SINGLE
ODA and i mean EVERY haha” (Tagesschau.de)
» UBS-trader: A zu UBS-Händler B: "das ding ist wir dürfen
nicht mehr front runnen, compliance sitzt uns am arsch".
(Tagesschau.de)
» According to Barclays: “Dude. I owe you big time! Come
over one day after work and I’m opening a bottle of
Bollinger” (Bloomberg)
»
BankWTrader 15:46:53 i’d prefer we join forces
BankYTrader 15:46:56 perfick
BankYTrader 15:46:59 lets do this…
BankYTrader 15:47:11 lets double team them
BankWTrader 15:47:12 YESssssssssssss
BankWTrader 16:03:25 sml rumour we haven’t lost it
BankYTrader 16:03:45 we do dollarrr
Examples of chats from the latest scandals Challenges for ChatAnalytics
» Different languages and styles:
Trader A: wären wir weit weg mit dem Preis? [would we be far off with the price?]
Trader B: quite a bit away
Trader A: dann kommen wir nicht hin [then we will not get there]
» Deception:
Trader A: Selling Wolfsburg for 134.2, do you care?
Trader B: Nope, what about Stuttgart?
("Blue horseshoe loves Anacott Steel")
» Humour and sarcasm:
Trader A: Should I say frontrun to greet our complience?
Trader B: If u don't say frontrunner
» Abbreviations and errors:
Trader A: sry i dont kn w bro
Trader B: v nice mate
The surveillance of traders is an application of TextAnalytics
The structure of chats is richer than that of news articles, and so is their analysis

NLP processing of "raw" chat logs
» Language detection
» POS-tagging
» Annotations
» Features, e.g. # of verbs, # of emoticons, …

Resulting feature matrix and (sparse) labels:

Chat   Feature 1   Feature 2   …
1      5           1.2
2      4           3.5
3      10          7.8
…      …           …           …

Chat   Suspicious?
1      No
2      Yes
3      No
…      …

Supervised learning
» Naive Bayes
» k-NN
» SVMs
» Gradient boosted trees
» Random forests
» Deep NNs (CNNs)

Unsupervised learning
» Embeddings: GloVe, word2vec, LDA2vec
» Topological data analysis
» Deep RNNs
» Deep auto-encoders

Model predictions
» Review of the predictions by a human analyst
Combining machine learning with the experience of human analysts allows the evaluation of complex, unstructured and nearly unlabelled data. A minimal sketch of the supervised part of such a pipeline follows below.
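A minimal sketch (hypothetical feature values and labels, not the project code, which is implemented in R) of the supervised step: chats mapped to numeric features, then scored by a classifier and routed to an analyst:

```python
# Train a classifier on analyst-labelled chat features and score new chats.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature matrix: one row per chat, e.g. [# of verbs, # of emoticons].
X = np.array([[5, 1.2],
              [4, 3.5],
              [10, 7.8]])
# Labels from human analysts: 1 = suspicious, 0 = not suspicious.
y = np.array([0, 1, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# High suspicion scores are forwarded to a human analyst for review.
new_chats = np.array([[6, 2.0]])
print(clf.predict_proba(new_chats)[:, 1])
```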
The architecture of ChatAnalytics:
We love open-source software
Architecture: 1 Kernel → 2 Database → 3 Presentation

1. Kernel
Description
» Analysis of all chat communication
» Development of statistical models
» Scoring of suspicious chats
Used technologies
» The kernel is implemented in R
» Neural nets were trained with TensorFlow

2. Database
Description
» Storage of results from the kernel
» Gathering of feedback from analysts
» Parametrization of productive models
» Configuration of kernel, database and frontend
Used technologies
» As the database, the client demanded Oracle

3. Presentation
Description
» Presentation of results from the DB layer
» Gathering of feedback from analysts regarding the results of the kernel
Used technologies
» Results are presented via a web application written with R/Shiny
Dashboard for surveillance of trader communication
d-fine
Frankfurt
Munich
London
Vienna
Zurich
Headquarters
d-fine GmbH
An der Hauptwache 7
60313 Frankfurt/Main
Germany
Tel +49 69 90737-0
Fax +49 69 90737-200
www.d-fine.com
Contact
Todor Dobrikov
Senior Manager
Tel +49 69 90737-447
Mobile +49 162 2631320
E-Mail [email protected]
Michael Hecht
Senior Consultant
Tel +49 89 7908617-0
Mobile +49 162 2631431
E-Mail [email protected]