Machine learning and text analytics
XXXVII Heidelberg Physics Graduate Days
Heidelberg, October 13th, 2016
Introduction
Michael Hecht
» Senior Consultant (since 2011 working for d-fine)
» PhD in theoretical physics (Topological String Theory) in
Munich (@LMU) and Boston (@Harvard U.)
» Machine learning expert
» Hobby data scientist and Kaggle competitor
(best ranking: 18th in Kaggle's world ranking)
Todor Dobrikov
» Senior Manager (since 2007 working for d-fine)
» Diploma in mathematics with computer science (@TUD)
and MSc in mathematical finance (@Oxford)
» Expert in rating model development and credit portfolio
modelling
» Establishing machine learning in d-fine’s projects
Agenda
» Text analytics and NLP
» Machine Learning
» Deep Learning for NLP
› Overview
› Word embeddings
› NLP (almost) from scratch
› Outlook
» ChatAnalytics
› Dashboard for surveillance of trader communication
Text analytics and NLP
The rapidly increasing amount of information is a challenge in banking and
forces financial institutions to redefine processes and concepts (1/2)
» We receive more and more information from various sources, e.g. news providers, homepages, chats, tweets and many more.
» All these contributions contain some information and, taken together, are valuable in banking, because
› it is almost orthogonal to traditional information,
› it is up to date,
› it is hard to manipulate all sources at once.
» However, it is impossible, or at least inefficient, to monitor all available contributions manually in real time.
» Moreover, text is unstructured: there are countless ways to convey the same information in words from the sender to the receiver. This makes the quantitative analysis of text challenging.
The rapidly increasing amount of information is a challenge in banking and
forces financial institutions to redefine processes and concepts (2/2)
(Figure: total and daily number of news items for several news providers, with data from 2000, 2004, 2007, 2010 and 2012 onwards. Right panel: articles, words, bytes and entities.)
» 13,000,000 news articles
» times about 400 words each
» = 5,200,000,000 words in total
» times ca. 4 characters per word
» = about 20,000,000,000 characters
» times 8 bit per character
» gives roughly 19 gigabytes of unstructured raw business-news data, which have to be analysed, e.g. for
› stock-listed companies (S&P 500 or STOXX Europe 800)
› local small and medium-sized enterprises
› sovereigns
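A quick back-of-envelope check of these figures (a sketch; the per-article and per-word averages are rough assumptions, not measured values):

```python
# Back-of-envelope check of the quoted data volume.
articles = 13_000_000
words = articles * 400          # ~400 words per article -> 5.2e9 words
chars = words * 4               # ~4 characters per word -> ~2.1e10 characters
bytes_total = chars             # 8 bit = 1 byte per character
print(f"{bytes_total / 1024**3:.1f} GiB")   # -> ~19 GiB of raw news text
```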
For financial institutions, two conclusions can be drawn from the rapidly
increasing amount of available information
Banks will have to integrate text-based information in their processes and models.
1. Decision aids in manual processes
Support manual processes with dashboards and reports for text-based information, such that the analysts' attention is guided to important events, documents, or passages within documents.
» Develop a model to analyse text
» Visualize and highlight features from text and other information (e.g. market, balance-sheet and behavioural data)
» Decisions are made by the analyst, who evaluates the text-based information manually
2. Integration into internal models
Integrate text-based information in internal models, so that manual effort is only caused in model development and validation.
» Develop a model to analyse text
» Combine the text features and other information (e.g. market, balance-sheet and behavioural data)
» Decisions are based on the model result without analysing the text manually
The simplest method to analyse text is to compare individual words with word lists for certain categories
Example of a news article:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
Challenges in the field of text analytics:
» Define lists of "positive" and "negative" words:
› Negative: BANKRUPTCY, DEFAULT, DANGER, DEVALUE, DOWNGRADE, FRAUD, etc.
› Positive: BENEFIT, BOOST, GAIN, BEST, IMPROVE, OPPORTUNITY, PROGRESS, WIN, etc.
» Dictionaries for different contexts (general, financial, political, …) are available online
» Dictionaries imply high maintenance costs: e.g. if one word is added, all tenses and all cases must be added too (see also stemming)
» Many words are ambiguous and occur on more than one list (e.g. fine, company, sound)
» Calculate the ratios: here Positive = p/(n+p) = 100% and Negative = n/(n+p) = 0%
Simple methods fail to identify the correct sentiment of a news article conditional on a company, so that
more sophisticated methods are needed.
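A minimal Python sketch of this dictionary approach (the word lists are illustrative excerpts, and the naive tokenization is part of the problem the following slides address):

```python
POSITIVE = {"benefit", "boost", "gain", "best", "improve",
            "opportunity", "progress", "win"}
NEGATIVE = {"bankruptcy", "default", "danger", "devalue", "downgrade", "fraud"}

def sentiment_ratios(text):
    """Return the ratios p/(n+p) and n/(n+p) of positive/negative word hits."""
    tokens = [t.strip('.,:;!?"').lower() for t in text.split()]
    p = sum(t in POSITIVE for t in tokens)
    n = sum(t in NEGATIVE for t in tokens)
    return (p / (n + p), n / (n + p)) if n + p else (None, None)

headline = "Boeing bests EADS in surprise U.S. aerial tanker win"
print(sentiment_ratios(headline))  # -> (1.0, 0.0): Positive = 100%, Negative = 0%
# note: 'bests' is missed because only 'best' is listed -> see stemming
```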
Many dictionaries are available online, each with a different focus and built with different methods
1. General Inquirer
General purpose, 182 categories (e.g. Positive, Negative, Hostile, Strong, Power, Weak, Active, Passive); the dictionary also contains part-of-speech tags for each word (e.g. Noun, CONJ, DET, PREP); available via http://www.wjh.harvard.edu/~inquirer/
2. Loughran and McDonald Sentiment Word Lists
Financial / economic background (constructed in 2009 with 10-K filings), 6 categories (Litigious, Negative, Positive, Strong, Uncertainty and Weak); available via http://www3.nd.edu/~mcdonald/Word_Lists.html
3. Subjectivity Lexicon
General purpose, contains 3 categories (positive, neutral and negative); available via http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
4. Diction 5 / 7
Includes 33 word categories (e.g. Accomplishment, Aggression, Centrality) and 6 variables based on count ratios in the word categories; the software is proprietary (see http://www.dictionsoftware.com/)
5. Linguistic Inquiry and Word Counts
Social and psychological background, 64 hierarchical word lists and summary statistics; the software is proprietary (see http://liwc.wpengine.com/)
6. Build your own
Based on expert knowledge or on a training set: find the words with the strongest discriminant power -> machine learning
Incorporating interaction between words and the position of a word relative
to the company name improves the text analysis
Example of a news article:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
Challenges in the field of text analytics:
» Define a weight for each word that depends on the word distance to the company name (and on the distance to the first word of the article).
» Do not consider words in isolation, but
› consider n-grams (sequences of words; this also works for negations),
› search for strong and weak words via appropriate dictionaries and close to the company name,
› consider the function of a word within the sentence -> see also part-of-speech tagging
Combining different word lists improves the text analysis.
The idea of Part-of-Speech tagging is to find the most likely sequence of
tags and analyse the words tag-specific
Example of a news article:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
Challenges in the field of text analytics:
» The word 'deal' has different functions in English:
› it can be a verb,
› it can be a noun or
› an interjection
» The unconditional probability that it serves as a noun is 65%, according to the General Inquirer.
» Given that 'deal' is preceded by the word 'the', which certainly serves as an article, the noun is most likely meant here
» Given a word sequence, we use the sequence of tags that is most likely and then interpret the word
» Hidden Markov Models or Maximum Entropy Models may be applied for POS tagging
General Inquirer entry for 'deal':
› Positive, NOUN, 65%: idiom-noun: 'a great deal,' 'a good deal,' etc.; an indefinite but large quantity
› Active, SUPV, 34%: verb: to take action with respect to something or someone, to handle
› Negative, INTJ, 1%: idiom-noun: 'big deal'; sarcastic admiration
Part-of-speech tagging improves the text analysis.
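For illustration, a short sketch of POS tagging with NLTK (note that NLTK's default tagger is an averaged perceptron rather than an HMM or maximum-entropy model; the 'punkt' and 'averaged_perceptron_tagger' resources must be downloaded first):

```python
import nltk

sentence = "Analysts had expected Europe's EADS to win the deal."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# 'deal' is tagged NN (noun), as the preceding article 'the' suggests
```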
Other news articles provide important context and make it possible to differentiate between sentiment and information
Example news articles:
Boeing bests EADS in surprise U.S. aerial tanker win
Thu Feb 24, 2011 6:52pm EST
Boeing Co was the "clear winner" in a U.S.
Air Force tanker competition, the Pentagon
said on Thursday, surprising analysts who
had expected Europe's EADS to win the
deal. […]
------------------------------------------------------------
Boeing wins U.S. tanker competition:
Pentagon
Thu Feb 24, 2011 5:09pm EST
Boeing Co has won a contract to build new
refuelling planes for the U.S. Air Force, […]
Challenges in the field of text analytics:
» Define a similarity measure to compare two news articles with each other.
» Define
› a time interval (e.g. 24 hours) or
› a number of news articles (e.g. 40)
and compare all news items within this interval, or the n preceding news articles, with the current article
» Articles that are very similar to previously published articles should receive a small weight
» The similarity measure can be defined via the vector-space representation of text, as the angle between two news articles, i.e. between the corresponding vectors.
» The vector-space representation and a similarity measure offer the opportunity to apply a kNN approach to classify news articles.
Sentiment is not identical to information.
Other news articles provide important context and make it possible to differentiate between sentiment and information
Challenges in the field of text analytics:
» Weight words according to their position in the news article, to make the comparison sensitive to the news structure and to give more attention to the beginning of a news item
» Problem: since there are about 13,000 English words, the word/document matrix quickly becomes high-dimensional and very sparse.
» Dismiss stop words (a, for, there, this, …)
» Identify word stems (standing and stands)
» Identify synonyms and n-grams (article and news)
News 1 (word weights in brackets):
This (1.00) news (0.99) stands (0.97) for (0.95) a (0.92) simple (0.90) wildcard (0.87) for (0.83) more (0.79) meaningful (0.74) news (0.70)
News 2 (word weights in brackets):
Here (1.00) is (0.99) another (0.97) article (0.95) without (0.92) meaningful (0.90) content (0.87) standing (0.83) for (0.79) some (0.74) news (0.70)
cos(x, y) = ⟨x, y⟩ / (‖x‖ · ‖y‖) = 3.2552 / 9.66 = 0.3370

Word-weight vectors of the two news items:
Word        News 1  News 2
a           0.92    0
another     0       0.97
article     0       0.95
content     0       0.87
for         1.78    0.79
here        0       1.00
is          0       0.99
meaningful  0.74    0.90
more        0.79    0
news        1.69    0.70
simple      0.90    0
some        0       0.74
standing    0       0.83
stands      0.97    0
this        1.00    0
wildcard    0.87    0
without     0       0.92
A simple norm can be defined for text, which allows documents to be compared.
kNN may serve as an alternative classification algorithm, given labelled training data.
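A minimal sketch of the cosine-similarity computation on the two weighted word vectors above:

```python
import math

news1 = {"this": 1.00, "news": 1.69, "stands": 0.97, "for": 1.78, "a": 0.92,
         "simple": 0.90, "wildcard": 0.87, "more": 0.79, "meaningful": 0.74}
news2 = {"here": 1.00, "is": 0.99, "another": 0.97, "article": 0.95,
         "without": 0.92, "meaningful": 0.90, "content": 0.87,
         "standing": 0.83, "for": 0.79, "some": 0.74, "news": 0.70}

def cosine(x, y):
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())          # <x, y>
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))   # ||v||
    return dot / (norm(x) * norm(y))

print(round(cosine(news1, news2), 4))  # ~0.33: the articles are not very similar
```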
The Porter Stemmer consists of simple rules that cut a word down to its stem
Examples of stemming rules:
Step 1a IES -> I ponies -> poni
SSES -> SS caresses -> caress
SS -> SS caress -> caress
S -> ‘’ cats -> cat
Step 1b (m>0) EED -> EE feed -> feed
agreed -> agree
(*v*) ED -> ‘’ plastered -> plaster
bled -> bled
(*v*) ING -> ‘’ motoring -> motor
sing -> sing
Step 1c (*v*) Y -> I happy -> happi
sky -> sky
Step 2 (m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
Step 3 (m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> ‘’ formative -> form
(m>0) ALIZE -> AL formalize -> formal
Step 4 (m>1) AL -> ‘’ revival -> reviv
(m>1) ANCE -> ‘’ allowance -> allow
(m>1) ENCE -> ‘’ inference -> infer
Step 5a (m>1) E -> ‘’ probate -> probat
rate -> rate
Step 5b (m > 1 and *d and *L)
-> single letter controll -> control
Remarks on the Porter Stemmer:
» Every word can be represented as C?(VC){m}V?, where
› C is a sequence of consonants
› V is a sequence of vowels
› (.){m} denotes an m-fold repetition of the expression in the brackets
› ? denotes optionality of the preceding expression
› * denotes a wildcard
» With this notation, there are five steps to take to cut a word down to its stem (see the excerpt of the rules above)
» However, word stems are not always real words, and stemming rules may fail for some words (e.g. European / Europe or matrices / matrix)
» Clever stemming reduces the dimension by a factor of 10!
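In practice the Porter stemmer comes off the shelf, e.g. in NLTK; a quick sketch reproducing some of the rule examples above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["ponies", "caresses", "cats", "plastered",
             "motoring", "happy", "conditional", "allowance"]:
    print(word, "->", stemmer.stem(word))
# e.g. ponies -> poni, caresses -> caress, motoring -> motor
```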
Based on a training set, algorithms may identify n-grams and synonyms to reduce the dimensionality of the word space and improve the accuracy
Identifying synonyms:
» Synonyms in a given context possess similar meanings and can be seen as substitutes
» Hence it is likely that the words surrounding two synonyms are the same
» Consider two words as synonyms if the sets of words used with them have a sufficiently large intersection
» Depending on the context, 'apple' and 'fruit' or 'Linde' and 'Baum' (German for 'linden' and 'tree') are synonyms, but not always!
Identifying n-grams and composed words:
» We can use the data itself to decide whether words stick together, using a PCA
» PCA performs an eigenvalue decomposition of the data covariance matrix
» We can use the eigenvalues and eigenvectors to reduce the dimension of our problem
» Let x and y be the counts of two related words, e.g. 'social' & 'media' or 'machine' & 'learning'
» Every data point represents one message
(Figure: scatter plot of word counts; plotted word labels include article, story, write, news, read, journal, another word, line, produce, price, book.)
We can apply those methods even to languages we know nothing about (e.g. Chinese, Arabic, Klingon, …).
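A minimal numpy sketch of the PCA idea above, on hypothetical per-message counts of 'machine' (x) and 'learning' (y):

```python
import numpy as np

# toy per-message counts of the words 'machine' and 'learning'
counts = np.array([[3, 3], [1, 1], [4, 5], [0, 1], [2, 2], [5, 4]], float)

cov = np.cov(counts, rowvar=False)       # 2x2 data covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (ascending order)
print(eigvals)                           # one dominant eigenvalue ...
print(eigvecs[:, -1])                    # ... with principal direction ~ (1,1)/sqrt(2):
                                         # the two words 'stick together' as an n-gram
```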
Combining the previous concepts, we consider five independent indicators
and aggregate them geometrically to one overall signal
(Figure: the five concept dimensions, illustrated for News 1 (negative) and News 2 (positive).)
» Information: comparison of the message with messages published before or on the same day, in order to recognize recurring news stories. If the information was already known, we adjust the message by assigning it a lower weight.
» Relevance: measures how much the message focuses on the assessed company.
» Sentiment: assigns a positive (green) or negative (red) implication of the news story for the company under consideration.
» Certainty: measures whether the article contains final and certain information, or whether the information presented is speculation about expected future developments.
» Readability: measures the complexity of the language. Complex expressions and complicated words are more prone to be misinterpreted by the algorithm. To a certain extent this also applies to humans…
The model is easier to test and calibrate when the dimensions are kept separate at first.
Machine Learning
Machine Learning is not a new hype
» Machine learning is not new. Early inventions were driven by the military.
» The Internet age: IBM, Google, Amazon and Facebook are leading a renaissance of machine learning.
(Figure: Google searches on big data, machine learning and data science.)
Milestones from 1946 to 2016, including the "Artificial Intelligence Winter":
» 1946: US Army's general-purpose computer ENIAC
» 1958: US Office of Naval Research: Perceptron artificial neural network
» Learning machine by Arthur Samuel plays checkers
» 1985: rediscovery of the backpropagation algorithm for neural networks
» Statistical approaches: support vector machines
» 1997: IBM's Deep Blue beats Garry Kasparov
» Random forests
» Large-scale perceptrons
» 2011: IBM's Watson wins Jeopardy!
» 2016: Google's AlphaGo
ML is no longer limited to artificial-intelligence researchers and born-digital
companies like Amazon, Google, and Facebook
(Mind map of machine learning applications:)
marketing & logistics, finance, product placement & pricing, science, modelling, prediction, text analytics, fraud detection, robo-advisors, personal assistance (Google Now / Siri), social marketing, filtering, image detection, search engines, autonomous systems, automated trading
Machine learning applied by d-fine
» Big data analytics and machine learning are global trends, relevant not only for marketing by the big internet technology companies such as Amazon, Google and Facebook
» Automated data analysis and predictions are also relevant in sales, logistics, risk management, customer support, human resources, operations, health care, insurance, life sciences, electric grid distribution, manufacturing, …
Example applications across banking, insurance, energy and health care: customer support, automated trading, cost prediction, fraud detection, treatment optimization, quality management, risk scoring, computer-assisted diagnosis, product pricing, loss given default, claims prediction, product engineering
Why is this also relevant for d-fine, a consultancy with clients mainly in banking, insurance and energy?
How to get started with machine learning
Machine Learning is useful when…
» humans are unable to explain their expertise
» the solution changes over time
» the solution needs to be adapted to particular cases
» human expertise does not exist
Where Machine Learning is applied…
» speech recognition, text recognition, handwriting detection, fraud detection (too complex to formulate)
» financial modelling, credit scoring, routing on a computer network, horse betting (time-varying solution needed)
» filtering, user biometrics recognition, personal advertising, market basket analysis, web search (strong adaptation needed)
» autonomous systems, navigating on Mars (no expertise available)
» …
Machine Learning is especially useful when the functional dependencies between input data and desired output data are too complex, change over time or are unknown.
What exactly is machine learning?
Machine Learning takes the task of defining exact functional dependencies out of the act of programming.
» Traditional programming: input data and a program go into the computer; output data comes out.
» Machine learning: input data and output data go into the computer; a program comes out.
With machine learning, the software programs its functionality by itself.
Machine learning basics – supervised learning
» One form of machine learning is supervised learning, where a training algorithm is fed with example pairs of input and output variables to determine the predictor.
» Example: supervised learning
› input feature x
› target variable y
› training set: a list of pairs (x, y)_i of features and target variables
» When the target variable is continuous, the learning problem is called a regression problem; when the target variable is discrete, it is called a classification problem.
(Visualization: training set (x, y)_i -> learning algorithm -> predictor f(x), mapping x to a predicted y.)
The goal of supervised learning is to find an optimal predictor based on labelled training data.
Classification of machine learning concepts
Machine Learning concepts can be classified by the grade of supervision and the learning depth.
» Supervised concepts: labelled training data, error estimation possible
» Unsupervised concepts: no labelling, structures determined by the algorithm
» Deep learning: higher-level abstractions by multiple processing layers, (pseudo-) artificial intelligence
Deep learning:
» Artificial neural networks
› backpropagation
› recurrent neural networks
› convolutional neural networks
» Survey propagation
» Deep belief networks
» Sparse autoencoders
Shallow learning:
» Generalized linear regression models
» Perceptron
» Support vector machines
» Boosting
» Gaussian mixture models
» Restricted Boltzmann machines
» Sparse coding
» Autoencoder neural networks
ML concepts can be classified by learning depth and grade of supervision; however, the transition between the regimes is fluid.
Example: linear regression
Let us assume we want to predict the price of a real-estate property from its living area or other properties.
» Feature variables: a vector x = (x1, …, xi, …) built up of properties like living area, number of bedrooms, …
» Target variable: the price of the real estate
» Training set: collected data, e.g.
Object  bedrooms  living area  price p
A       3         120 m²       1,100 €
B       1         40 m²        330 €
C       2         80 m²        600 €
» Predictor ansatz: p(x) = b0 + b1·x1 + b2·x2 + …, where b0 is the intercept and the bi (for i > 0) are the weights
» Learning algorithm: b = (b0, …, bi, …) can be determined with the least-mean-squares concept and gradient descent
The problem is to determine b in a way that makes the best possible prediction.
Least mean squares
Stochastic gradient descent
» Error estimation for the training set (x_j, y_j) by least squares:
J(b) = (1/2) Σ_{j=1}^{m} (p_b(x_j) − y_j)²
» Start with an initial b and apply the gradient descent rule:
update b_i := b_i − α · ∂J(b)/∂b_i for each i simultaneously
» For a single observation we find the derivative ∂J(b)/∂b_i = (p_b(x) − y) · x_i
» Algorithm for the whole training set: stochastic gradient descent
Loop until converged {
  Loop for j = 1 to m {
    b_i := b_i − α (p_b(x_j) − y_j) x_{j,i} for every i
  }
}
» Example: the local minimum found by gradient descent depends on the initial choice of b and on the learning rate α.
Gradient descent leads to a local minimum with learning rate α.
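A minimal numpy sketch of stochastic gradient descent for the linear predictor on the toy real-estate data from the previous slide (learning rate and iteration count are illustrative):

```python
import numpy as np

X = np.array([[1, 3, 120], [1, 1, 40], [1, 2, 80]], float)  # x0=1, bedrooms, area
y = np.array([1100.0, 330.0, 600.0])                        # price

b, alpha = np.zeros(3), 1e-5
for _ in range(20000):                    # "loop until converged"
    for xj, yj in zip(X, y):              # loop over the training set
        b -= alpha * (xj @ b - yj) * xj   # b_i := b_i - a (p_b(x_j) - y_j) x_i
print(b, X @ b)                           # fitted weights and predictions
```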
Example: logistic regression
A classification problem in finance is the prediction of the default probability of a borrower.
» Feature variables: a vector x built up of personal data, past credit history or current payment behaviour of the debtor
» Target variable: the binary default variable Y (Y = 1: default, Y = 0: no default)
» Training set: historic data, e.g.
Name  age  income    default
A     32   46,000 €  0
B     26   31,000 €  1
C     54   60,000 €  0
» Predictor ansatz: P(Y = 1 | x) = f_L(b0 + b1·x1 + …) = f_L(b·x), where f_L(s) = 1 / (1 + exp(−s)). This ansatz ensures that the probability is confined to the interval [0, 1].
» Learning algorithm: e.g. maximum likelihood estimation (MLE) and gradient ascent
The logistic predictor has to be trained individually for each bank's portfolio and has to be recalibrated yearly.
Maximum likelihood estimation
Assume we have a training set of n independent observations with an identical underlying distribution depending on a fit vector b. Then the joint density function is the product of the individual density functions:
f(x_1, …, x_n; b) = ∏_{i=1}^{n} f_{X_i}(x_i; b) =: L(b; x_1, …, x_n)
The trick is to view this as a function of b and to consider the x_i as constant parameters. This is called the likelihood function L(b; x_1, …, x_n). The best fit to the data is reached when this likelihood is maximized. However, it is often more convenient to maximize the logarithmic likelihood
l(b) = log L(b; x_1, …, x_n)
To illustrate this, let us first assume that feature and target variables are connected by
y_i = b·x_i + ε_i
where ε_i is random noise with a Gaussian distribution
f(ε_i) = 1/√(2πσ²) · exp(−ε_i² / (2σ²)), which implies f(y_i | x_i; b) = 1/√(2πσ²) · exp(−(y_i − b·x_i)² / (2σ²))
Maximum likelihood estimation
We can now evaluate the log likelihood:
l(b) = log ∏_{i=1}^{n} 1/√(2πσ²) · exp(−(y_i − b·x_i)² / (2σ²))
     = Σ_{i=1}^{n} log [ 1/√(2πσ²) · exp(−(y_i − b·x_i)² / (2σ²)) ]
     = n · log(1/√(2πσ²)) − 1/(2σ²) · Σ_{i=1}^{n} (y_i − b·x_i)²
Maximizing this expression leads again to the least-mean-squares expression (1/2) Σ_{i=1}^{n} (y_i − b·x_i)² that we used to optimize the linear regression problem.
Coming back to our logistic regression problem, knowing that
P(Y = 1 | x) = f_L(b·x) and P(Y = 0 | x) = 1 − f_L(b·x),
we can construct an individual density function as
P(y | x) = (f_L(b·x))^y · (1 − f_L(b·x))^(1−y)
The log likelihood is then
l(b) = log L(b) = Σ_{i=1}^{n} [ y_i · log f_L(b·x_i) + (1 − y_i) · log(1 − f_L(b·x_i)) ]
To maximize the log likelihood we analogously use the gradient ascent rule
b_j := b_j + α · ∂l(b)/∂b_j
Using the differentiation rule of the logistic function, f_L′(s) = f_L(s) · (1 − f_L(s)), the derivative (per observation) evaluates to
∂l(b)/∂b_j = (y − f_L(b·x)) · x_j
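The resulting gradient-ascent training loop, as a minimal numpy sketch on the toy default data from the previous slide (income in k€; hyperparameters are illustrative):

```python
import numpy as np

X = np.array([[1, 32, 46], [1, 26, 31], [1, 54, 60]], float)  # x0=1, age, income
y = np.array([0.0, 1.0, 0.0])                                 # default indicator

f_L = lambda s: 1.0 / (1.0 + np.exp(-s))                      # logistic function

b, alpha = np.zeros(3), 0.01
for _ in range(5000):
    for xi, yi in zip(X, y):
        b += alpha * (yi - f_L(b @ xi)) * xi   # b_j := b_j + a (y - f_L(b.x)) x_j
print(f_L(X @ b))                              # fitted default probabilities
```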
The perceptron
We end up again with a stochastic gradient ascent rule
b_j := b_j + α · (y − f_L(b·x)) · x_j
where f_L(s) is a non-linear function.
Interchanging the logistic function f_L(s) in this algorithm with the step function
H(s) = 1 for s ≥ 0, H(s) = 0 for s < 0,
brings us to the so-called perceptron and the perceptron learning rule
b_j := b_j + α · (y − H(b·x)) · x_j
Historically it was thought that a perceptron resembles the way a human neuron works, as it transfers a signal (feature variable) to a non-zero output y = 1 only when the signal exceeds a certain threshold defined by the weights.
A single perceptron connected to two input variables can realize the logic OR function.
(Diagram: inputs x1 and x2 with weights b1 and b2 plus bias b0 feed the output y = H(b0 + b1·x1 + b2·x2).)
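A minimal numpy sketch of the perceptron learning rule realizing this OR function (learning rate and epoch count are illustrative):

```python
import numpy as np

H = lambda s: 1.0 if s >= 0 else 0.0            # step function

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], float)  # x0=1 (bias)
y = np.array([0.0, 1.0, 1.0, 1.0])              # OR(x1, x2)

b, alpha = np.zeros(3), 0.1
for _ in range(20):                             # a few passes suffice here
    for xi, yi in zip(X, y):
        b += alpha * (yi - H(b @ xi)) * xi      # perceptron learning rule
print([H(b @ xi) for xi in X])                  # -> [0.0, 1.0, 1.0, 1.0]
```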
Deep learning networks
» Artificial neurons can have an arbitrary transfer function φ(Σ_j ω_ij·x_j) and an unlimited number of inputs.
» A layered network of connected neurons is a powerful tool to predict complex non-linear dependencies in data.
But how can a neural network be trained?
» One-layer network training: perceptron learning rule
» A training method for multi-layer networks is backpropagation:
1. Propagation of the input features to the output variables o
2. Estimation of the mean square error E = (1/2)·(t − o)², where t is the training target variable
3. Back-propagation of the error through the network, with weight adjustments corresponding to gradient descent
In detail: the weight ω_ij between the i-th and the j-th neuron has to be updated via
ω_ij := ω_ij − α · ∂E/∂ω_ij = ω_ij + α · δ_j · o_i, where
δ_j = φ′(net_j) · (t_j − o_j), if j is an output neuron,
δ_j = φ′(net_j) · Σ_k δ_k·ω_jk, if j is a hidden neuron,
and net_j = Σ_i ω_ij·o_i denotes the weighted input of neuron j.
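A minimal numpy sketch of these update rules for a network with one hidden layer (logistic activation; XOR as the classic toy task; whether and how fast it converges depends on the random initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda s: 1.0 / (1.0 + np.exp(-s))        # activation, phi' = phi (1 - phi)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([[0.0], [1.0], [1.0], [0.0]])      # XOR: needs a hidden layer
ones = np.ones((4, 1))

W1 = rng.normal(size=(3, 4))                    # input (+bias) -> 4 hidden neurons
W2 = rng.normal(size=(5, 1))                    # hidden (+bias) -> 1 output neuron
alpha = 0.5
for _ in range(20000):
    h = phi(np.hstack([X, ones]) @ W1)          # 1. forward propagation
    o = phi(np.hstack([h, ones]) @ W2)
    delta_o = (t - o) * o * (1 - o)             # 2./3. output delta = phi' (t - o)
    delta_h = (delta_o @ W2[:4].T) * h * (1 - h)    # delta back-propagated to hidden
    W2 += alpha * np.hstack([h, ones]).T @ delta_o  # w_ij := w_ij + a delta_j o_i
    W1 += alpha * np.hstack([X, ones]).T @ delta_h
print(o.round(2))                               # typically ~ [[0], [1], [1], [0]]
```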
Bias–variance tradeoff
» Fitting with high-order polynomials b0 + b1·x + b2·x² + … + b5·x⁵ leads to a lower training error compared to a simple linear model.
» The bias is the error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs → underfitting.
» The variance is the error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modelling the random noise in the training data rather than the intended outputs.
» With more and more complex models and more parameters, we tend to overfit the noise and mask the true signal.
» Model selection: we want to choose the best trade-off between bias and variance.
Model selection – basics
Goal: pick the "best" model.
» First idea: find the model with the smallest training error.
» This does not work: it will always prefer high-variance models with the maximum number of parameters!
» New idea: simple cross-validation.
» Split the data into a training set with ca. 70% of the data and a validation set with 30% of the data.
» Fit the models to the training set and measure the error on the validation set.
» Problem: we lose 30% of our data.
» Problem 2: depending on the split, the validation MSE can differ in level.
» For our polynomial problem we plot the validation MSE.
» In accordance with our intuition, we see that degree 2 seems to be a good choice, with not much more to gain at higher orders.
We have found a way to quantify the bias–variance tradeoff.
Model selection – k-fold cross validation
Goal: pick the "best" model.
» How can we recycle more of our data?
1. Split the data into k subsets.
2. Fit the models using k−1 subsets and measure the MSE on the remaining subset.
3. Average the results over all possible choices of the k−1 subsets.
» We typically see a U-shape.
» Shown here: three shapes for different problems; the brown line is the problem we presented before.
Taking the bottom of each U-curve leads to the best model in terms of the bias–variance tradeoff.
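A short scikit-learn sketch of k-fold cross-validation over polynomial degrees (the toy data is illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, (100, 1))
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0, 1, 100)  # noisy quadratic

for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(degree, round(-scores.mean(), 3))  # the U-shape bottoms out near degree 2
```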
Deep Learning for NLP
Overview
"Any sufficiently advanced technology is indistinguishable from magic." (Arthur C. Clarke's third law, "Hazards of Prophecy: The Failure of Imagination", 1973)
Deep learning fuels lots of recent technological advancements and is applied
in the real world
Examples: automated logistics, product recommendations and tailored advertising, face recognition, community detection, Siri (Cortana, Alexa, …), pattern recognition, optimization of computing centers, self-driving cars.
Machine learning is being applied in almost all of the major tech companies.
Example 1 – Image recognition (1/2)
1http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
» Deep learning algorithms are capable of learning to recognize objects in pictures
» Standardized test for this kind of problem:
ImageNet (http://image-net.org/)
› Training set: 1.2 million images in 1000 categories
› Test set: 150k images
› Each year: Large Scale Visual Recognition Challenge (ILSVRC)
› Winner 2016 in category “Object localization” had an error of 2.9%
» Humans achieve an error rate of
4%-15% on this task1
Deep Learning techniques outperform humans at an image recognition task.
Example 1 – Image recognition (2/2)
Performance improved by more than 87% over the last 6 years.
Example 2 – Google Deepmind’s AlphaGo1
1https://deepmind.com/research/alphago/
» Combining techniques from deep learning and traditional search theory (Monte Carlo tree search), DeepMind was able to build an algorithm capable of superhuman performance at the game of Go
» In March 2016 AlphaGo defeated one of the top human Go players, Lee Sedol, 4:1 in a five-game match, and the reigning three-time European champion Fan Hui 5:0
» This is remarkable, as the branching factor of Go is about 250 (compare chess with a branching factor of ~35)
» In a first step, the algorithm was trained on games of human experts with supervised learning techniques
» In a second step, the algorithm improved itself through a large number of games of self-play, using techniques from (deep) reinforcement learning
» It used a new kind of hardware, so-called Tensor Processing Units (TPUs)
Deep Learning systems outperform humans in games of perfect information
Example 3 – Google’s “show and tell” algorithm1 (2016)
1Vinyals, Oriol, et al. "Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge." (2016) 2https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
» Deep learning algorithms are not only able to recognise objects
» They can even automatically generate captions for images and learn to combine objects
» Recently Google open-sourced its model "Show and Tell"2 for image captioning
» Examples:
Example 4 – Neural style transfer
1Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015)
» Deep learning algorithms are capable of creating works of art
» Given a "style" image and a "target" image, there are algorithms that transfer the selected style to the target image1
» This is even possible for videos! (Example: Youtube)
Example 5 – mortgage risk1
1Giesecke, Kay, J. Sirignano, and A. Sadhwani. Deep Learning for Mortgage Risk. Working Paper, Stanford University, 2016.
» Analysis of mortgage risk data for 120 million loans originated in the US between 1995 and 2014
» This amounts to ~3.5B monthly observations with 300 feature variables (e.g. FICO score, interest rate, income, …) and performance status (30, 60, 90+ days late, foreclosure, REO, paid off)
» Monthly loan performance for the retail loans was provided by CoreLogic
» Additionally, local (zip code, …) and national economic factors were collected from various sources
» Modelled were the transitions between loan performance states, evaluated out-of-sample:
(Figures: out-of-sample AUCs for month-ahead prediction, single model and ensemble.)
Example 6 – parsing of natural language: Google’s Parsey McParseface1
1Andor, Daniel, et al. "Globally normalized transition-based neural networks."arXiv preprint arXiv:1603.06042 (2016), https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
% correct head assignments in the tree (dependency parsing):
Model                    News   Web    Questions
Martins et al. (2013)    93.10  88.23  94.21
Zhang & McDonald (2014)  93.32  88.65  93.37
Weiss et al. (2015)      93.91  89.29  94.17
Andor et al. (2016)*     94.44  90.17  95.40
Parsey McParseface       94.15  89.08  94.77

Per-token accuracy in % (POS tagging):
Model                    News   Web    Questions
Ling et al. (2015)       97.78  94.03  96.18
Andor et al. (2016)*     97.77  94.80  96.86
Parsey McParseface       97.52  94.24  96.45
» Natural language is hard to parse for machines, e.g. because of prepositional-phrase attachment ambiguity
» Professional human linguists trained on this kind of task agree on 96%-97% of parsing decisions
» Parsey McParseface achieves an accuracy of about 94%!
Example 7 – predicting banking distress1
» Gathered 6.6M articles (3.4B words) from Reuters online in the period 2007 – 2014 (Q3)
» Monitored news items on 101 different large European banks
» The model was first pre-trained unsupervised on raw news items
» In a second step, training proceeded on 243 distress events such as government interventions, state aid, direct failures and distressed mergers
» The model yields an AUC of ~71%
» From the trained network a country-specific stress index was extracted:
1Rönnqvist, Samuel, and Peter Sarlin. "Identifying bank stress by deep learning of news." Workshop New Challenges in Neural Computation 2015. 2015, Rönnqvist, Samuel, and Peter Sarlin. "Detect & Describe: Deep learning of bank stress in the news." Computational Intelligence, 2015 IEEE Symposium Series on. IEEE, 2015, Rönnqvist, Samuel, and Peter Sarlin. "Bank distress in the news: Describing events through deep learning." arXiv preprint arXiv:1603.05670 (2016).
(Figures: stress index for Germany over time; posterior of the stress index over time.)
(Deep) neural networks
1See e.g. Bengio, Yoshua, et al. "Towards biologically plausible deep learning." arXiv preprint arXiv:1502.04156 (2015) for a discussion.
» Neural networks can be understood as a mathematical model of the neurons in brains1 (for all we know, this model is too simplistic to account for the processes in a real brain)
» Neurons "fire" electrical impulses along so-called axons to other neurons, thereby producing a dense and complicated web of interacting units
» In the mathematical model of a neuron, the signal x from another neuron first undergoes an affine transformation (whose parameters are called weights), which is then the input to a non-linear function called an activation function
» By building a network of such mathematical neurons one obtains a so-called neural network
General overview of neural networks
1There are also other types, like Hopfield networks or (deep) Boltzmann machines, which are not discussed here 2http://playground.tensorflow.org
» Typical1 neural network models come with various so-called layers
» There are various activation functions used in the literature and in applications
Nowadays neural networks are built with hundreds of layers, leading to a high capacity2.
Learning in neural networks
1See e.g. http://sebastianruder.com/optimizing-gradient-descent/ for a good overview.
» A neural network is parametrized by its topology, its activation functions and its weights
» In a real-world application one chooses a topology and the types of activation functions
» The weights w are then derived from training data by optimizing an objective function J(w)
» Example: supervised learning task. Given a set of N observations x_i with labels y_i, the weights w are fixed by
w = argmin_w Σ_i J(w; x_i, y_i) = argmin_w Σ_i J(NN(w, x_i); y_i)
where NN denotes the output of the neural network (or of almost any other ML algorithm)
› The learning problem is therefore mapped to an optimization problem for an, in general, non-convex objective function
› For neural networks this optimization problem is almost always attacked with the 17th-century technique of gradient descent, with some modern twists1
Types of neural networks
» Sequential (stateful) models
› These types of neural networks can be used to model sequential data of varying length, such as natural language
› Often in these types of networks a "cell" is repeated over and over
› By unrolling (possibly to infinity), these networks can formally be transformed into static ones
» Static (stateless) models
› These types of networks come with a fixed shape and number of layers and neurons
› They are often used for supervised learning tasks like image recognition
› Example: GoogLeNet (09/2014), which won the ILSVRC 2014 with almost 5 million parameters
Neural networks can be seen as "differentiable" versions of models for static and sequential data.
General properties of neural networks
1Can be weakened to measurable
» Static neural networks with at least one hidden layer are universal function approximators
» In other words, neural networks can in principle approximate any continuous1 function
» Special types of sequential neural networks called "recurrent neural networks" are Turing complete, which means that they can simulate arbitrary programs
These general results should be taken with a grain of salt, as they are only guaranteed with infinite resources and nothing is said about learnability from data.
There are several deep neural architectures for (un)supervised learning tasks
Convolutional (CNN):
» The static-length input features are "convolved"
» Each layer learns a more abstract representation of the data
» Equivalent to renormalization-group flow in physics
» Mainly used for static data such as images
Recursive:
» The input features come in a natural hierarchy or tree-like structure
» The neural net is applied all along the hierarchy
» Mainly used for hierarchical and tree-like data such as language
Boltzmann machines (and variants):
» The neural net consists of a visible and a hidden part
» The hidden part can learn arbitrarily complex representations of the inputs
» Comparable to HMMs
» Mainly used for static data
Recurrent (RNN):
» The input features are sequential and of different length
» The neural net is recurring
» The neural net can learn long-term dependencies (cf. LSTMs, GRUs)
» Mainly used for sequential data such as language and time series
Architectures can be hundreds of layers deep!
Deep neural networks in their naïve form have various problems
1Hochreiter, Sepp. "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universität München (1991): 91.
» Overfitting
› Deep neural networks typically have millions of free parameters
› Without care this typically leads to the overfitting phenomenon
» Lots of (labelled) data is needed for training
› Without expert knowledge, which could be built either into the topology of the net or into constraints on the weights and/or the objective function, lots of labelled data is needed to bring deep neural nets into a regime of good generalization behaviour
» Vanishing gradient problem1
› Neural networks are usually trained by various incarnations of gradient descent, e.g.
w_{i+1} = w_i − γ · ∂J(w)/∂w |_{w_i}
› By the chain rule, this leads to products of derivatives of the activation functions
› As these factors mostly take values in [−1, 1], the products become very small for deep networks
» Slow training
› Optimizing a non-convex function with millions of terms and millions of variables is computationally very expensive
› Without special hardware the training of deep neural nets is not feasible
What rescues deep learning?
1Nonetheless, remarkable progress was made during the 90s by people like Jürgen Schmidhuber (ETH), Geoffrey Hinton (Google), Yann LeCun (Facebook), Yoshua Bengio (U. Montreal), Andrew Ng (Baidu) and others.
» Deep neural nets are hard to train due to what is known as the "vanishing/exploding gradient problem"
» In the 90s this (among other things) led to a period called the AI winter and almost to an abandonment of the idea of neural nets1. Progress during the last 10 years has made it possible to train very deep nets with hundreds of layers
» Responsible for this progress are mainly:
› growth of available computing power: clusters of (C,G,T)PUs
› availability of large amounts of (labelled) data
› methodological breakthroughs (pre-training, dropout, LSTMs/GRUs, ReLUs, stochastic-depth training, convnets, …)
» With these techniques deep neural nets have reached super-human abilities in many areas, including image recognition, geolocating images, game playing, sentiment analysis, …
» There are several software frameworks available for the training of deep neural nets:
Framework       Developer         Language(s)
TensorFlow      Google            C++, Python
Theano          U. Montreal       Python
Torch           Collobert et al.  C, Lua
CNTK            Microsoft         C++
Caffe           U. Berkeley       C++, Python
MXNet           DMLC              C++, Python, R, Julia, …
H2O             H2O.ai            Java, Scala, Python, R
Deeplearning4j  Skymind           Java, Scala, C
neon            Nervana           Python
Why should neural networks be deep after all?
Example: natural language (1/2)
Lin, Henry, and Max Tegmark. "Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language." arXiv preprint arXiv:1606.06737(2016), Lin, Henry W., and Max Tegmark. "Why does deep and cheap learning work so well?." arXiv preprint arXiv:1608.08225 (2016).
» Studying the empirical mutual information (a kind of two-point function) between symbols in natural written language unveils a power-law behaviour!
» These long-range interactions cannot, even in theory, be modelled by simple shallow models like hidden Markov models (HMMs)
Why should neural networks be deep after all?
Example: natural language (2/2)
1Mehta, Pankaj, and David J. Schwab. "An exact mapping between the variational renormalization group and deep learning." arXiv preprint arXiv:1410.3831 (2014), https://charlesmartin14.wordpress.com/2015/04/01/why-deep-learning-works-ii-the-renormalization-group/
» This would explain two empirically observed properties of deep neural networks:
› These types of models are able to extract high-level features from microscopic data (e.g. raw pixels to categories of objects), as the data flow to fixed points under the RG flow (universality)
› The "two-point functions" of deep neural networks generally exhibit a power-law decay near their critical points
» Nonetheless, deep models can sometimes be approximated by simpler shallow models!
» Hallucinating Wikipedia entries (more on this in a moment) with a deep recurrent neural architecture captures the long-range interactions present in natural language
» This is no accident, as it has been argued that deep neural architectures are related to a well-known set of ideas in physics, namely the renormalization group (RG)
Deep Learning can be mapped onto the Renormalization Group known from physics.
Word embeddings
Problems with discrete word representations
Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12.Aug (2011): 2493-2537.
Context examples for the word "banking":
"… saying that Europe needs unified banking regulation to replace the hodgepodge …"
"… government debt problems turning into banking crises as has happened in …"
The surrounding words will come to represent "banking".
» Manually curated word resources are great but miss nuances, e.g. among synonyms: adept, expert, good, practiced, proficient, skillful?
» They miss new words (impossible to keep up to date): wicked, badass, nifty, crack, ace, wizard, genius, ninja
» They are subjective and require human labor to create and adapt
» They make it hard to compute accurate word similarity
» Instead: use the distributional hypothesis: you can get a lot of value by representing a word by means of its neighbours
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
Distributional similarity is one of the most successful ideas of modern statistical NLP.
Word embeddings (1/4)
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
» Window-based co-occurrence matrix (window size 1) for the example corpus:
I like deep learning.
I like NLP.
I enjoy flying.

Counts    I  like  enjoy  deep  learning  NLP  flying  .
I         0  2     1      0     0         0    0       0
like      2  0     0      1     0         1    0       0
enjoy     1  0     0      0     0         0    1       0
deep      0  1     0      0     1         0    0       0
learning  0  0     0      1     0         0    0       1
NLP       0  1     0      0     0         0    0       1
flying    0  0     1      0     0         0    0       1
.         0  0     0      0     1         1    1       0

» Idea: instead of capturing co-occurrence counts, directly predict the surrounding words of every word
» This idea was used by Mikolov et al. to embed a corpus of words into a high-dimensional vector space
» The algorithm is called word2vec
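For illustration, training word2vec on the toy corpus above with gensim (a sketch assuming gensim 4.x, where the dimension parameter is called vector_size; on three sentences the resulting vectors are of course meaningless):

```python
from gensim.models import Word2Vec

corpus = [["i", "like", "deep", "learning", "."],
          ["i", "like", "nlp", "."],
          ["i", "enjoy", "flying", "."]]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["nlp"][:5])            # first entries of the learned vector
print(model.wv.most_similar("nlp"))   # nearest neighbours in the embedding space
```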
Word embeddings (2/4): word2vec
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
(Figures: the continuous bag-of-words (CBOW) model and the skip-gram model.)
Word embeddings (3/4): word2vec
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013. Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12.Aug (2011): 2493-2537.
» Idea: predict the surrounding words in a window of length m around every word
» Objective function: maximize the log probability of any context word given the current center word (see below)
» The simplest model for the probabilities P(w_{t+j} | h) is given by "dynamic" logistic regression
» The optimization of this objective function is not feasible for normal corpora
» word2vec uses a clever Monte Carlo algorithm called noise-contrastive training to circumvent this problem (objective functions sketched below)
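The formulas on this slide were figures and did not survive extraction; for reference, the skip-gram objective and the negative-sampling surrogate as given in Mikolov et al. read:

```latex
% Skip-gram: maximize the average log probability of the context words
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0}
            \log p\left(w_{t+j} \mid w_t\right)

% Negative sampling: one observed pair (w_O, w_I) against k noise words from P_n
\log \sigma\left({v'}_{w_O}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[\log \sigma\left(-{v'}_{w_i}^{\top} v_{w_I}\right)\right]
```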
Word embeddings (4/4): GloVe1
1Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014. Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12.Aug (2011): 2493-2537.
» Idea: improve word2vec by using co-occurrence statistics, which leads to a new objective function (see below)
» GloVe is generally faster to train than word2vec
» It works well even on small corpora
» It can be scaled to large corpora
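The new objective function was a figure on the slide; from Pennington et al. it reads:

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right)
    \left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
                     1 & \text{otherwise} \end{cases}
```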
Examples of GloVe embeddings (1/3)
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014.
v(king) − v(man) + v(woman) ≃ v(queen)
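Such analogy queries can be reproduced with gensim and pre-trained GloVe vectors (a sketch; the vector file name is illustrative and the vectors must first be converted to word2vec format):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("glove.6B.300d.word2vec.txt")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', ...)]
```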
Examples of GloVe embeddings (2/3)
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014.
GloVe as a preprocessing step on one of our projects
» Project: detection of insider trading in the chats of traders
» Training data:
› 5 years of traders' chat history from 2011-2016 (unstructured)
› amounting to around 1.4 million chats
› around 86 GB of raw XML files
» The data was not cleaned (apart from cutting persistent chat rooms)
» Training on a Microsoft Azure N12 instance with two Nvidia K80s (~4992 CUDA cores per card and up to 2.91 Tflops double precision)
» Transfer to Microsoft Azure in a hashed format (quite a challenge …)
» Training was terminated after 20h
» Examples of learned vectors (nearest neighbours by similarity):

word       similarity      word          similarity
traders    1.00            fraud         1.00
investors  0.87            charges       0.85
dealers    0.84            bribery       0.85
stocks     0.82            alleged       0.84
markets    0.81            corruption    0.82
prices     0.80            embezzlement  0.81
The new star? fasttext by Facebook Research1
1Joulin, Armand, et al. "Bag of Tricks for Efficient Text Classification." arXiv preprint arXiv:1607.01759 (2016).
» Recently Facebook published a new algorithm for efficient word embeddings: fastText
» Amazingly, no deep architecture is used
» Instead, the model is a linear softmax classifier on n-gram features x_i
» The objective function is simply a weighted log-likelihood (see below)
» Normally this problem has a computational complexity of O(k·h), where k is the number of classes and h the embedding dimension
» By utilising the hierarchical softmax together with an efficient mapping of the n-grams via the "hashing trick", the complexity can be reduced to O(h·log2(k))
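The weighted log-likelihood minimized by fastText (from Joulin et al.), with x_n the normalized bag of n-gram features of the n-th document, A the embedding (look-up) matrix, B the output weights and f the softmax:

```latex
-\frac{1}{N} \sum_{n=1}^{N} y_n \log\left( f\left(B A x_n\right) \right)
```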
NLP (almost) from scratch
Recurrent neural networks (RNNs)
1Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
» Recurrent neural networks process sequences by passing a hidden state from one time step to the next
» Problem: long-term dependencies are hard to learn, as gradients vanish over long sequences
» Long short-term memory (LSTM) cells¹ to the rescue (see the gate equations below)
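The standard LSTM cell update, in the notation of the cited blog post (σ is the logistic sigmoid, ⊙ elementwise multiplication, [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input):

```latex
% LSTM cell: forget, input and output gates control the cell state C_t.
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            % input gate
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     % candidate cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   % new cell state
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            % output gate
h_t = o_t \odot \tanh(C_t)                        % new hidden state
```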
Training a language model at the character level
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
» With RNNs it is possible to train a language model character by character!
» These models automatically learn punctuation and other characteristics of the underlying text
» The models are language-agnostic and can be used for various tasks
» For example, they can generate text in a certain style, act as a spelling corrector, or read and "fantasize" house numbers, … (a minimal sampling sketch follows below)
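A minimal sketch (pure numpy, in the spirit of the cited char-rnn post; the weight matrices and the vocabulary size are assumed to come from a previously trained vanilla RNN, so this is not the slides' own code) of sampling text character by character:

```python
# Sample a character sequence from a trained vanilla RNN language model.
import numpy as np

def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by, vocab_size):
    """Sample n character indices, starting from the one-hot seed_ix."""
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # recurrent state update
        y = Why @ h + by                      # unnormalised scores per character
        p = np.exp(y) / np.sum(np.exp(y))     # softmax -> next-char distribution
        ix = np.random.choice(vocab_size, p=p.ravel())
        x = np.zeros((vocab_size, 1))         # feed the sampled char back in
        x[ix] = 1
        ixes.append(ix)
    return ixes
```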
Examples of hallucinated text
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[Figure: hallucinated Linux source code (left) and algebraic geometry generated from LaTeX sources (right)]
How does the language model learn?
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[Figure: text samples drawn from the model after ~100, ~300, ~500, ~700, ~1200 and ~2000 training iterations]
» LSTM trained on Leo Tolstoy's War and Peace, sampled at various stages of training
Outlook
Outlook
Kumar, Ankit, et al. "Ask me anything: Dynamic memory networks for natural language processing." arXiv preprint arXiv:1506.07285 (2015).
» Attention and memory mechanisms ("Neural Turing Machines" and "dynamic memory networks")
» Sequence-to-sequence models
» Can the topology of a neural network also be learned from data (e.g. Bayesian neural networks)?
» What about ensembles of neural networks (as with e.g. random forests)? (Dropout, …)
» What is the relation between deep learning and strong AI? (OpenAI, DeepMind)
» What is the relation between deep learning and biological brains?
Outlook: Attention and dynamic memory networks
Kumar, Ankit, et al. "Ask me anything: Dynamic memory networks for natural language processing." arXiv preprint arXiv:1506.07285 (2015).
Karpathy, Andrej. "Connecting Images and Natural Language." PhD thesis, 2016.
ChatAnalytics
An overview of a d-fine project
Penalties for misconduct by traders can hit banks hard, especially when internal control mechanisms fail
Penalties from the latest scandals
» Libor scandal:
› Deutsche Bank: $2.5bn
› UBS: $1.5bn
» Manipulation of the foreign-exchange market:
› JPMorgan, Citigroup, Barclays, RBS and UBS: $5.6bn in total

Fines over the last years
[Chart: how much the banks pay in penalties and legal costs, in billions of dollars]

Communication channels of traders
» 325,000 Bloomberg Professional users send 200B e-mails and 15-20B instant messages a day!
» "If a single function can be said to justify the $20,000 annual cost of a Bloomberg terminal, it is probably Instant Bloomberg, which the company accurately describes as 'the dominant chat tool used by the global financial community'" (FT)
Banking supervisors and compliance departments advocate stricter and, due to the high volume, automated control mechanisms to identify misbehaviour by traders.
Traders communicate mainly over Instant Bloomberg, where they sometimes talk openly about market manipulation and more
» UBS-trader: "i was frontrunning EVERY single offer in
usdjpy and eurjpy." (Tagesschau.de)
» UBS-trader: "(...) i was frontrunning EVERY SINGLE
ODA and i mean EVERY haha” (Tagesschau.de)
» UBS-trader: A zu UBS-Händler B: "das ding ist wir dürfen
nicht mehr front runnen, compliance sitzt uns am arsch".
(Tagesschau.de)
» According to Barclays: “Dude. I owe you big time! Come
over one day after work and I’m opening a bottle of
Bollinger” (Bloomberg)
»
BankWTrader 15:46:53 i’d prefer we join forces
BankYTrader 15:46:56 perfick
BankYTrader 15:46:59 lets do this…
BankYTrader 15:47:11 lets double team them
BankWTrader 15:47:12 YESssssssssssss
BankWTrader 16:03:25 sml rumour we haven’t lost it
BankYTrader 16:03:45 we do dollarrr
Examples of chats from the latest scandals Challenges for ChatAnalytics
» Different languages and styles:
Trader A: wären wir weit weg mit dem Preis? [would we be far off with the price?]
Trader B: quite a bit away
Trader A: dann kommen wir nicht hin [then we will not get there]
» Deception:
Trader A: Selling Wolfsburg for 134.2, do you care?
Trader B: Nope, what about Stuttgart?
("Blue horseshoe loves Anacott Steel")
» Humour and sarcasm:
Trader A: Should I say frontrun to greet our complience?
Trader B: If u don't say frontrunner
» Abbreviations and errors:
Trader A: sry i dont kn w bro
Trader B: v nice mate
The surveillance of traders is an application of TextAnalytics
The structure of chats is richer than that of news articles, and so is their analysis

NLP processing of "raw" chat logs
» Language detection
» POS-tagging
» Annotations
» Features, e.g. # of verbs, # of emoticons, …

Resulting feature matrix and (sparse) labels:

Chat   Feature 1   Feature 2   …
1      5           1.2
2      4           3.5
3      10          7.8
…      …           …           …

Chat   Suspicious?
1      No
2      Yes
3      No
…      …

Supervised learning
» Naive Bayes
» k-NN
» SVMs
» Gradient boosted trees
» Random forests
» Deep NNs (CNNs)

Unsupervised learning
» Embeddings: GloVe, word2vec, LDA2vec
» Topological data analysis
» Deep RNNs
» Deep auto-encoders

Model predictions
» Review of the predictions by a human analyst
Combining machine learning with the experience of human analysts allows the evaluation of complex, unstructured and nearly unlabelled data. A minimal sketch of the supervised part of such a pipeline follows below.
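A minimal sketch (hypothetical feature values and labels, not the project code, which is implemented in R) of the supervised step: chats mapped to numeric features, then scored by a classifier and routed to an analyst:

```python
# Train a classifier on analyst-labelled chat features and score new chats.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature matrix: one row per chat, e.g. [# of verbs, # of emoticons].
X = np.array([[5, 1.2],
              [4, 3.5],
              [10, 7.8]])
# Labels from human analysts: 1 = suspicious, 0 = not suspicious.
y = np.array([0, 1, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# High suspicion scores are forwarded to a human analyst for review.
new_chats = np.array([[6, 2.0]])
print(clf.predict_proba(new_chats)[:, 1])
```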
The architecture of ChatAnalytics:
We love open-source software
Architecture: 1 Kernel → 2 Database → 3 Presentation

1. Kernel
Description
» Analysis of all chat communication
» Development of statistical models
» Scoring of suspicious chats
Used technologies
» The kernel is implemented in R
» Neural nets were trained with TensorFlow

2. Database
Description
» Storage of results from the kernel
» Gathering of feedback from analysts
» Parametrization of productive models
» Configuration of kernel, database and frontend
Used technologies
» As the database, the client demanded Oracle

3. Presentation
Description
» Presentation of results from the DB layer
» Gathering of feedback from analysts regarding the results of the kernel
Used technologies
» Results are presented via a web application written with R/Shiny
Dashboard for surveillance of trader communication
d-fine
Frankfurt
Munich
London
Vienna
Zurich
Headquarters
d-fine GmbH
An der Hauptwache 7
60313 Frankfurt/Main
Germany
Tel +49 69 90737-0
Fax +49 69 90737-200
www.d-fine.com
Contact
Todor Dobrikov
Senior Manager
Tel +49 69 90737-447
Mobile +49 162 2631320
E-Mail [email protected]
Michael Hecht
Senior Consultant
Tel +49 89 7908617-0
Mobile +49 162 2631431
E-Mail [email protected]