The Power of Words Comparing different approaches to text analysis Christian Winkler

The Power of Words - Mcubed AI London … · Pipeline for text analytics Spidering Extraction of text and metadata Normalization Determine language Synonyms Outlier detection Feature


Page 1

The Power of Words

Comparing different approaches to text analysis

Christian Winkler

Page 2

Who I am

Dr. Christian Winkler

Founder, Machine Learning Expert

Photo credit: Urs Wehrli

Page 3

The challenge:

Information overload with text

12 million contracts

30,000 customer emails

1 million archived documents

100,000 documented change requests…

How can Machine Learning help?

Page 4

Pipeline for text analytics

1) Collect content automatically

2) Cleaning & linguistics

3) Statistics & QA

4) Data-driven insights

5) Visualization & reporting

Steps along the pipeline, in slide order: spidering, extraction of text and metadata, normalization, determine language, synonyms, outlier detection, feature extraction, regression, clustering, overall structure, word combinations, categories, timelines, semantics

Page 5

Preparation of text: Cleaning & Linguistics

Tokenization

Stopword removal

Normalization

Lemmatization

Named Entity Recognition

Part of speech tagging
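
The first three steps can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in the talk: the stopword list is a tiny hypothetical sample, normalization is reduced to lowercasing, and lemmatization, named entity recognition and part-of-speech tagging are left out because they usually need a trained linguistic model.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "but", "does", "is", "not", "of", "on", "the", "to"}


def tokenize(text):
    """Split text into word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text)


def normalize(tokens):
    """Normalize tokens by lowercasing them."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def preprocess(text):
    """Tokenize, normalize, then remove stopwords."""
    return remove_stopwords(normalize(tokenize(text)))


print(preprocess("Pete likes London, but not Paris."))
```

Treating "not" as a stopword removes the negation from the sentence, so the stopword list should be chosen with the downstream task in mind.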

Page 6

Bag-of-Words vectorization

Only consider word frequencies

• Term frequency (TF, TF-IDF)

• Simple, but robust

• Basis for many algorithms (retrieval, classification, topic modeling)

Disadvantages

• Oversimplified language model

• Syntactical and relational information is lost

Improvements e.g. via n-grams

Documents

D1: "Pete likes London. Pete likes Paris."

D2: "Pete does not like London."

D3: "Pete likes London, but not Paris."

      Pete  likes  London  Paris  does  not  like  but
D1    2     2      1       1      0     0    0     0
D2    1     0      1       0      1     1    1     0
D3    1     1      1       1      0     1    0     1
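
The counts for the three example documents can be reproduced in a few lines. The sketch below uses only the standard library; the column order (first appearance), the lowercasing and the plain log(N / df) form of the inverse document frequency are choices made for this illustration, and libraries typically use smoothed variants.

```python
import math
import re
from collections import Counter

docs = {
    "D1": "Pete likes London. Pete likes Paris.",
    "D2": "Pete does not like London.",
    "D3": "Pete likes London, but not Paris.",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# Word counts per document.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Vocabulary in order of first appearance.
vocab = []
for text in docs.values():
    for token in tokenize(text):
        if token not in vocab:
            vocab.append(token)

# Bag-of-words matrix: one row per document, one column per vocabulary word.
bow = {name: [c[word] for word in vocab] for name, c in counts.items()}

# Inverse document frequency: log(N / number of documents containing the word).
n_docs = len(docs)
idf = {
    word: math.log(n_docs / sum(1 for c in counts.values() if word in c))
    for word in vocab
}

# TF-IDF: raw term frequency multiplied by the IDF of the word.
tfidf = {
    name: [row[i] * idf[word] for i, word in enumerate(vocab)]
    for name, row in bow.items()
}

print(vocab)
for name in docs:
    print(name, bow[name])
```

A word such as "pete" that occurs in every document gets an IDF of zero, so TF-IDF removes it from the comparison, while rare words such as "but" are weighted up.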

Page 7

Topic Modeling

Page 8

Basic idea

Look for hidden/latent structure

1) What are candidates for topics?

2) How are they distributed in document space?

[Figure: matrix of topics (Topic 1, Topic 2, Topic 3, ..., Topic k) against documents (doc 1, doc 2, ..., doc n)]

Page 9

How Topic Modeling works

Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html

Topic modelling transforms the matrix

• Re-arrange features (words) and documents

• Find blocks

• Words in blocks constitute topics

• Documents in blocks belong to topic
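
The block idea can be made concrete with a deliberately simplified stand-in. The sketch below is not a real topic model: it finds hard, non-overlapping blocks by grouping documents that are linked through shared words (connected components), whereas actual topic models work with soft, overlapping blocks. The corpus and the function name `find_blocks` are invented for this illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: each document is reduced to its set of words.
docs = {
    "doc 1": {"goal", "match", "team"},
    "doc 2": {"bank", "stock", "market"},
    "doc 3": {"team", "coach", "match"},
    "doc 4": {"market", "shares", "bank"},
}


def find_blocks(docs):
    """Group documents that are linked through shared words."""
    word_to_docs = defaultdict(set)
    for name, words in docs.items():
        for word in words:
            word_to_docs[word].add(name)

    blocks, seen = [], set()
    for start in docs:
        if start in seen:
            continue
        block_docs, block_words, stack = set(), set(), [start]
        while stack:
            name = stack.pop()
            if name in block_docs:
                continue
            block_docs.add(name)
            block_words |= docs[name]
            # Follow every word of this document to the other documents using it.
            for word in docs[name]:
                stack.extend(word_to_docs[word] - block_docs)
        seen |= block_docs
        blocks.append((sorted(block_docs), sorted(block_words)))
    return blocks


for block_docs, block_words in find_blocks(docs):
    print(block_docs, block_words)
```

Each returned pair corresponds to one block of the re-arranged matrix: the words of the block play the role of a topic, and the documents of the block are the ones assigned to it.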

Page 10

Example topic modeling: Reuters News

Page 11

Summary topic modeling

Fast summary for digital marketing, product design, personalization

Detect what people are talking (writing) about

Find hidden (niche) structure

Result: Latent structure of dataset

Page 12

Word Embeddings - Vectorizing words

Page 13

Distributional Hypothesis (Firth, 1957)

"You shall know a word by the company it keeps."

Example: What is "tezgüino"? What is similar to "tezgüino"?

A bottle of ____ is on the table.

Everybody likes ____.

Don’t have ____ before you drive.

We make ____ out of corn.

EISENSTEIN, JACOB: Natural Language Processing. Georgia Tech, 2018. Ch. 14

Page 14

Semantic similarity

My cat eats fish on Saturday

His cat eats turkey on Tuesday

My dog eats meat on Sunday

His dog eats turkey on Monday

Syntagmatic axis (along the sentence)

Paradigmatic axis (across the sentences)

Similar contexts (paradigmatic): cat ≈ dog

Co-occurrence (syntagmatic): cat ≈ eats
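
Both kinds of similarity can be measured on the four example sentences. In this sketch, paradigmatic similarity is approximated by the Jaccard overlap of the direct neighbours of two words, and syntagmatic similarity by counting how often two words share a sentence; the window of one word and both measures are assumptions chosen for brevity.

```python
from collections import defaultdict
from itertools import combinations

sentences = [
    "My cat eats fish on Saturday",
    "His cat eats turkey on Tuesday",
    "My dog eats meat on Sunday",
    "His dog eats turkey on Monday",
]
tokens = [sentence.lower().split() for sentence in sentences]

# Paradigmatic view: collect the direct left and right neighbours of every word.
contexts = defaultdict(set)
for sent in tokens:
    for i, word in enumerate(sent):
        if i > 0:
            contexts[word].add(sent[i - 1])
        if i < len(sent) - 1:
            contexts[word].add(sent[i + 1])


def context_similarity(a, b):
    """Jaccard overlap of the neighbour sets of two words."""
    union = contexts[a] | contexts[b]
    return len(contexts[a] & contexts[b]) / len(union) if union else 0.0


# Syntagmatic view: count how often two words occur in the same sentence.
cooccurrence = defaultdict(int)
for sent in tokens:
    for a, b in combinations(sorted(set(sent)), 2):
        cooccurrence[(a, b)] += 1


def cooccurrence_count(a, b):
    """Number of sentences that contain both words."""
    return cooccurrence[tuple(sorted((a, b)))]


print("context similarity cat/dog:", context_similarity("cat", "dog"))
print("co-occurrence cat/eats:", cooccurrence_count("cat", "eats"))
print("co-occurrence cat/dog:", cooccurrence_count("cat", "dog"))
```

"cat" and "dog" never appear together, yet they have identical neighbours, which is exactly the signal that word embeddings exploit.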

Page 15

Basic idea of word embeddings

Learn semantics via context

Princess  0.93  0.01  0.93  0.12  ...
Woman     0.08  0.03  0.98  0.51  ...
Queen     0.99  0.05  0.97  0.72  ...
King      0.96  0.99  0.03  0.64  ...
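
Once words are vectors, their similarity can be computed numerically, for example with the cosine of the angle between them. The sketch below uses only the four components visible on the slide (the remaining ones are elided there), so the resulting numbers are illustrative rather than the similarities of the full vectors.

```python
import math

# First four components as shown on the slide; further components are elided there.
vectors = {
    "Princess": [0.93, 0.01, 0.93, 0.12],
    "Woman": [0.08, 0.03, 0.98, 0.51],
    "Queen": [0.99, 0.05, 0.97, 0.72],
    "King": [0.96, 0.99, 0.03, 0.64],
}


def cosine(a, b):
    """Cosine similarity: inner product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


for word in ("Princess", "Woman", "King"):
    print(f"Queen vs {word}: {cosine(vectors['Queen'], vectors[word]):.2f}")
```

On these truncated vectors "Queen" comes out closest to "Princess", then "Woman", then "King".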

Page 16

word2vec training (Continuous Bag of Words)

"The man who passes the sentence should swing the sword." - Ned Stark

Predict word (red) via context (green)

Order of context is ignored (bag of words)

Rows of weight matrix W yield n-dimensional word vectors

[Figure: neural network. The one-hot-encoded context words "sentence", "should", "the" and "sword" feed via the weight matrix W into an N-dimensional hidden layer, which predicts the one-hot-encoded target word "swing".]
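
How the training examples and the hidden layer arise can be sketched without a deep learning library. The code below only builds the (context, target) pairs and computes the hidden layer for one of them; the output layer and the training loop are omitted, and the window size of 2, the hidden size of 5 and the random initialisation are assumptions of this sketch.

```python
import random
import re

sentence = "The man who passes the sentence should swing the sword."
tokens = re.findall(r"[a-z]+", sentence.lower())
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}


def cbow_pairs(tokens, window=2):
    """Pair every target word with up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


pairs = cbow_pairs(tokens)

# Weight matrix W: one row per vocabulary word, N columns (random start values).
random.seed(0)
N = 5
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in vocab]


def hidden_layer(context):
    """Multiplying a one-hot vector by W selects a row; CBOW averages those rows."""
    rows = [W[index[word]] for word in context]
    return [sum(column) / len(rows) for column in zip(*rows)]


context, target = pairs[tokens.index("swing")]
print(context, "->", target)
print(hidden_layer(context))
```

Because the rows are averaged, shuffling the context words gives the same hidden layer, which is why the order of the context is ignored. After training, row `W[index[word]]` is the word vector of `word`.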

Page 17

Example

Training with 68 subreddits, each containing 1,000 posts about TV shows

25 epochs, takes a few minutes

Page 18

"Simpsons"

Without direct relation to TV shows

Page 19

Similarities of relationships

Page 20

Similarities of relationships

x = Kirk + Marge - Homer

[Figure: vector diagram with the points Homer, Marge, Kirk and x]
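
The arithmetic behind such an analogy is plain vector addition followed by a nearest-neighbour search. The sketch below uses invented two-dimensional toy vectors, not the vectors trained in the talk; the word list, the values and the function name `analogy` are all assumptions, so the result only demonstrates the mechanics.

```python
import math

# Invented 2-d toy vectors (not trained): axis 0 separates the two families,
# axis 1 separates husband from wife.
vectors = {
    "Homer": [1.0, 1.0],
    "Marge": [1.0, -1.0],
    "Bart": [1.0, 0.5],
    "Kirk": [-1.0, 1.0],
    "Luann": [-1.0, -1.0],
}


def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def analogy(plus_a, plus_b, minus):
    """Return the word closest to plus_a + plus_b - minus, excluding the inputs."""
    x = [a + b - c for a, b, c in zip(vectors[plus_a], vectors[plus_b], vectors[minus])]
    candidates = [word for word in vectors if word not in (plus_a, plus_b, minus)]
    return max(candidates, key=lambda word: cosine(x, vectors[word]))


print(analogy("Kirk", "Marge", "Homer"))
```

Excluding the three input words from the candidates is a common convention, since one of them is often the nearest neighbour of x itself.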

Page 21

Visualization

t-SNE or PCA

Interesting, but only clusters are relevant; difference vectors are incorrectly mapped

Page 22

fastText & GloVe

fastText (Open Source software from Facebook)

• Uses character n-grams

• Useful for spell checking

• Pre-trained models for detecting language

GloVe

• Uses global co-occurrence matrix

• Focus on non-local semantic similarities

• Sometimes better similarity compared to word2vec

• Less popular
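
Why character n-grams make a model tolerant of spelling errors can be shown directly. The sketch below extracts n-grams from a word wrapped in the boundary markers "<" and ">", as described for fastText; the default lengths of 3 to 6 and the helper `shared_fraction` are assumptions of this illustration, and no fastText code is involved.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in the boundary markers < and >."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def shared_fraction(a, b):
    """Jaccard overlap of the n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("where", 3, 3))
print(shared_fraction("london", "londn"))
print(shared_fraction("london", "paris"))
```

A misspelled word still shares many n-grams with the correct spelling, so a vector built from n-gram vectors stays close to the original, even if the misspelling never occurred in the training data.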

Page 23

word2vec

• Google Research

fastText

• Facebook Research

• Character n-grams

• Classification, fault-tolerant

GloVe

• Stanford University

• Different method (co-occurrence matrix)

Page 24

fastText: Language detection

[Figure: chart with a logarithmic scale]

Page 25

Summary word embeddings

• Find new trends

• Build semantic search engines

Detect changes over time

Vectors have a similarity matrix (inner product)

Summary: Understand semantic context of words

Page 26

ELMo: Contextualized embeddings

Page 27

Basic idea

Words have different meanings depending on context

Context helps to improve word representation in vector space

He talked to the pole.

He climbed the pole.

The pole wore a blue jacket.

The pole consisted of metal.

The pole was shaking.

She turned to the pole.

Page 28

ELMo

[Figure: stacked LSTM cells over the sentence "He climbed the pole". Start word vectors (WV) pass through an LSTM layer to give partly contextualized WV, then through a further LSTM layer to give fully contextualized WV.]

Page 29

New features

Context-aware training

• Use bi-directional LSTMs

• Layers represent different levels of abstraction

• Character n-grams with CNN as starting point for (uncontextualized) word vectors

• Also well suited for short texts

Disadvantages compared to word2vec

• Words cannot be mapped to a unique vector

• Much, much, much slower in training and evaluation

Semantic sentence representation

• E.g. via sum of contextualized word vectors

• Improvements via TF-IDF

• Often used for sentiment analysis
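
The sentence representation mentioned in the last block can be sketched as an IDF-weighted sum of word vectors. The two-dimensional vectors below are invented stand-ins: real ELMo vectors are contextualized, differ for each occurrence of a word and have far more dimensions. The sentences and function names are likewise assumptions of this illustration.

```python
import math
from collections import Counter

# Invented 2-d stand-ins for word vectors (an assumption, not ELMo output).
vectors = {
    "the": [0.5, 0.5],
    "stocks": [1.0, 0.0],
    "shares": [0.9, 0.1],
    "rise": [0.8, 0.2],
    "storm": [0.0, 1.0],
    "hits": [0.1, 0.9],
    "coast": [0.2, 0.8],
}

sentences = {
    "a": ["the", "stocks", "rise"],
    "b": ["the", "shares", "rise"],
    "c": ["the", "storm", "hits", "the", "coast"],
}

# IDF weights: words that occur in every sentence get weight zero.
doc_freq = Counter(word for tokens in sentences.values() for word in set(tokens))
idf = {word: math.log(len(sentences) / df) for word, df in doc_freq.items()}


def sentence_vector(tokens):
    """Sum the word vectors of a sentence, each weighted by its IDF."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in tokens:
        for d in range(dims):
            total[d] += idf[word] * vectors[word][d]
    return total


def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


embedded = {name: sentence_vector(tokens) for name, tokens in sentences.items()}
print("a vs b:", round(cosine(embedded["a"], embedded["b"]), 3))
print("a vs c:", round(cosine(embedded["a"], embedded["c"]), 3))
```

Summing over all occurrences accounts for term frequency, and the IDF factor keeps frequent words such as "the" from dominating the sentence vector.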

Page 30

Example

ELMo well suited for short sentences

Training with headlines from Reuters World News

Calculate ELMo sentence embeddings

Find semantically similar news

Page 31

BERT: Transfer Learning

Page 32

Transfer Learning

Classical ML

A model is trained for exactly one task, starting without prior knowledge. A lot of training data is needed to get good results.

[Figure: Data Set 1 → Model 1 → Task 1; Data Set 2 → Model 2 → Task 2]

Transfer Learning

A base model (trained with a very large dataset) is retrained for a more specific task with a comparatively small dataset.

[Figure: Base Data Set → Base Model → Base Task; Data → Fine-tuning → Improved Model → Task]

Page 33

Transfer Learning

BERT: pretrained base model, trained on billions of words (Wikipedia, books)

Task 1: Masked LM

Task 2: Next Sentence Prediction

BERT-Base, Multilingual Cased: 104 languages, 12 layers, 768 hidden units, 110 million parameters, 4 days on 4-16 TPUs, 2 GB size

Fine-tuning of the pretrained base model

• Classification data → classification model

• Question-answer data → QA model

• Named entity data → NER model

SQuAD: 150,000 QA pairs, training takes 1 h on a Cloud TPU
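
BERT's first pre-training task, the masked language model, hides some of the input tokens and trains the model to predict them from the rest. The sketch below shows how such training examples could be built. The 15% rate follows the BERT paper; the function name, the seed handling and the simplification of always inserting "[MASK]" are assumptions of this sketch (the original procedure sometimes keeps the token or substitutes a random one).

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and remember the originals."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # what the model has to predict at this position
        masked[pos] = "[MASK]"
    return masked, labels


tokens = "the man who passes the sentence should swing the sword".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

No manual labelling is needed for this task, since the labels come from the text itself. That is what makes pre-training on billions of words feasible.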

Page 34

Question answering

Could use the self-posts as a data basis

However, we don't know what is actually inside

Therefore use Wikipedia article https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series

Text-only version 57 kB

Page 35

Question Answering

Page 36

State-of-the-Art Text Mining

Classical Text Mining

Document vectors (bag-of-words / TF-IDF) → classification, topic modeling

Embeddings

Word or document vectors → semantic similarity

(Pre-Trained) Embeddings and Deep Learning for complex tasks

Document vectors (fixed-length sequences) → embedding → deep neural net → sentiment analysis, knowledge extraction, question answering

Page 37

Commercial use of these methods

Data-driven approach to collect knowledge from different sources

Sources: technical documentation, enterprise wikis, change requests, scientific publications…

Cost driver

Knowledge silos

Detect game-changing technologies early

Technical debt