
Page 1

Visualizing Data using t-SNE: An Intuitive Introduction

Simon Carbonnelle

Université Catholique de Louvain, ICTEAM
12th of May, 2016

Page 2

Visualization and Dimensionality Reduction

Intuition behind t-SNE

Visualizing representations


Page 4

Visualization is key to understanding data easily.

Data: house areas in m² and prices in thousands of euros.

Area    Price
43.69   298.71
28.82   308.
102.22  426.68
36.32   307.53
48.35   315.4

Area    Price
59.04   324.48
90.13   373.8
59.24   325.71
94.89   396.69
27.72   313.53

Area    Price
65.2    323.43
92.38   379.56
77.86   337.77
73.48   349.15
52.19   311.86

Question
Is the relation linear?

Page 6

Dimensionality Reduction is a helpful tool for visualization.

• Dimensionality reduction algorithms map high-dimensional data to a lower dimension while preserving structure.
• They are used for visualization, performance, and fighting the curse of dimensionality.
• A ton of algorithms exist.
• t-SNE is specialised for visualization...
• ...and has gained a lot of popularity (see the usage sketch below).
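Because the algorithm is available off the shelf, trying it takes only a few lines. A minimal usage sketch with scikit-learn; the digits dataset and the hyperparameter values are illustrative choices, not from the talk:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 1797 handwritten digits, 64 dimensions (8x8 pixel intensities).
X, labels = load_digits(return_X_y=True)

# Map to 2-D; perplexity is the main knob (discussed later in the talk).
Y = TSNE(n_components=2, perplexity=30.0).fit_transform(X)

plt.scatter(Y[:, 0], Y[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE map of handwritten digits")
plt.show()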


Page 8

Visualization and Dimensionality Reduction

Intuition behind t-SNE

Visualizing representations

Page 9

Dimensionality Reduction techniques solve optimization problems.

X = \{x_1, x_2, \ldots, x_n \in \mathbb{R}^h\} \;\rightarrow\; Y = \{y_1, y_2, \ldots, y_n \in \mathbb{R}^l\}

\min_Y C(X, Y)

Three approaches for Dimensionality Reduction:

• Distance preservation
• Topology preservation
• Information preservation

t-SNE is distance-based, but tends to preserve topology.


Page 11

SNE computes pair-wise similarities.

SNE converts Euclidean distances into similarities that can be interpreted as probabilities.

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}

p_{i|i} = 0, \quad q_{i|i} = 0

Hence the name Stochastic Neighbor Embedding...

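The definitions above translate directly into NumPy. A naive O(n²) sketch that builds the p_{j|i} matrix, assuming the bandwidths σ_i are given (how they are chosen is the subject of a later slide):

import numpy as np

def conditional_p(X, sigmas):
    """p_{j|i} for all pairs; X is (n, h), sigmas is (n,)."""
    # Squared Euclidean distances ||x_i - x_j||^2 between all pairs.
    sq_d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_d / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                  # p_{i|i} = 0 by convention
    return P / P.sum(axis=1, keepdims=True)   # normalize each row over k != i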

Page 12

Pair-wise similarities should stay the same.

p_{j|i} \;\Leftrightarrow\; q_{j|i}


Page 17

The Kullback-Leibler divergence measures the faithfulness with which q_{j|i} models p_{j|i}.

• P_i = \{p_{1|i}, p_{2|i}, \ldots, p_{n|i}\} and Q_i = \{q_{1|i}, q_{2|i}, \ldots, q_{n|i}\} are the distributions on the neighbors of data point i.
• The Kullback-Leibler divergence (KL) compares two distributions:

C = \sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}

• KL divergence is asymmetric.
• KL divergence is always positive.
• We have our minimization problem, \min_Y C(X, Y) (spelled out in code below).
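As code, the cost is essentially one line of NumPy. A sketch assuming P and Q are the (n, n) matrices of conditional probabilities p_{j|i} and q_{j|i} with zero diagonals, built as on page 11:

import numpy as np

def sne_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) = sum_i sum_j p_{j|i} log(p_{j|i}/q_{j|i})."""
    mask = P > 0   # terms with p_{j|i} = 0 contribute nothing, by convention
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))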

Page 18

Some remaining questions.

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \quad q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}

1. Why a radial basis function (the exponential)?
2. Why probabilities?
3. How do you choose σ_i?

Page 19

Answer to question 1: the radial basis function makes the similarity focus on the local geometry around each point. This is why t-SNE can be interpreted as topology-based.

Page 20

Answer to question 2: a small distance does not necessarily mean proximity on the manifold. Probabilities are appropriate to model this uncertainty.


Page 22

The entropy of P_i increases with σ_i.

Entropy:

H(P) = -\sum_i p_i \log_2 p_i

Page 23

Perplexity: a smooth measure of the effective number of neighbors.

Perplexity:

\mathrm{Perp}(P) = 2^{H(P)}

Examples from the two figures: an entropy of 1.055 gives a perplexity of 2.078; an entropy of 3.800 gives a perplexity of 13.929.
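In practice each σ_i is found by searching for the bandwidth whose perplexity matches a single user-chosen target. A sketch of that search (bisection in log-space; the bounds, tolerance, and iteration cap are illustrative):

import numpy as np

def sigma_for_perplexity(sq_d_i, target, tol=1e-5, max_iter=50):
    """sq_d_i: squared distances from point i to the *other* points."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = np.sqrt(lo * hi)                 # geometric midpoint
        p = np.exp(-sq_d_i / (2.0 * sigma ** 2))
        p = p / max(p.sum(), 1e-300)             # the distribution P_i
        perp = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))   # Perp = 2^H
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma                           # too flat: shrink sigma
        else:
            lo = sigma                           # too peaked: grow sigma
    return sigma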

Page 24

From SNE to t-SNE.

SNE
Model:
p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \quad q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}
Cost function:
C = \sum_i \mathrm{KL}(P_i \| Q_i)
Gradient:
\frac{dC}{dy_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)

⇒ Symmetric SNE (faster computation)
Model:
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \quad q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)}
Cost function:
C = \mathrm{KL}(P \| Q)
Gradient:
\frac{dC}{dy_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)

⇒ t-SNE (even faster computation, better behaviour)
Model:
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \quad q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}
Cost function:
C = \mathrm{KL}(P \| Q)
Gradient:
\frac{dC}{dy_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}
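The t-SNE column above fits in a small function. A sketch assuming P is the symmetrized (n, n) joint-probability matrix and Y the current (n, 2) embedding:

import numpy as np

def tsne_gradient(P, Y):
    diff = Y[:, None, :] - Y[None, :, :]            # y_i - y_j, shape (n, n, 2)
    w = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # Student-t kernel (1 + ||.||^2)^-1
    np.fill_diagonal(w, 0.0)
    Q = w / w.sum()                                 # joint q_ij
    # dC/dy_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1
    return 4.0 * np.sum(((P - Q) * w)[:, :, None] * diff, axis=1)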


Page 27

The "crowding problem"

There is much more space in high dimensions: moderate pairwise distances cannot all be reproduced in two dimensions, so points get crushed together in the map.

Page 28

Mismatched tails can compensate for mismatched dimensionalities.

The Student-t distribution has heavier tails than the Gaussian, so moderate high-dimensional distances can be modelled by larger distances in the map.

Page 29

Last but not least: Optimization

\min_Y C(X, Y)

C = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}

• Non-convex.
• Gradient descent + momentum + adaptive learning rate:

Y^{(t)} = Y^{(t-1)} + \eta^{(t)} \frac{\delta C}{\delta Y} + \alpha(t)\,(Y^{(t-1)} - Y^{(t-2)})

• Two tricks: early compression and early exaggeration (see the sketch below).
• Illustration: Colah's blog.
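Put together, the update rule becomes a short loop. A sketch written as a descent step (the slide's formula folds the sign into η); the schedules follow the original paper's suggestions, and tsne_gradient is the function sketched earlier:

import numpy as np

def optimize(P, n, n_iter=1000, eta=100.0):
    Y = 1e-4 * np.random.randn(n, 2)        # small random initialization
    Y_prev = Y.copy()
    for t in range(n_iter):
        P_eff = 4.0 * P if t < 50 else P    # early exaggeration: inflate p_ij at first
        alpha = 0.5 if t < 250 else 0.8     # momentum schedule
        grad = tsne_gradient(P_eff, Y)
        Y, Y_prev = Y - eta * grad + alpha * (Y - Y_prev), Y
    return Y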


Page 31

Visualization and Dimensionality Reduction

Intuition behind t-SNE

Visualizing representations

Page 32

Mapping raw data to distributed representations.

• Feature engineering is often laborious.
• The new tendency is to learn adequate features or representations automatically.
• Ultimate goal: enable AI to extract useful features from raw sensory data.

t-SNE can be used to make sense of the learned representations!

Page 33

Using t-SNE to explore a Word embedding.

• The system outputs 1 if the central word is in the right context, 0 otherwise.
• The algorithm learns the representation and the classification simultaneously.

From Machine Learning to Machine Reasoning, L. Bottou (2011)

Goal
The representation captures syntactic and semantic similarity.

Page 34

Using t-SNE to explore a Word embedding.

http://colah.github.io/
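To reproduce this kind of picture, the learned word vectors can be fed straight to t-SNE. A sketch assuming hypothetical inputs: vectors, an (n_words, d) array of embeddings, and words, the matching list of tokens:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

Y = TSNE(n_components=2, perplexity=30.0).fit_transform(vectors)

plt.figure(figsize=(10, 10))
plt.scatter(Y[:, 0], Y[:, 1], s=3)
for (x, y), word in zip(Y[:500], words[:500]):   # label a subset for legibility
    plt.annotate(word, (x, y), fontsize=7)
plt.show()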

Page 35

Explore a Wikipedia article embedding.

http://colah.github.io/


Page 36

Exploring game state representations.
Google DeepMind plays Atari games.

Playing Atari with Deep Reinforcement Learning, V. Mnih et al.

Goal

Learning to play Space Invaders from score feedback and raw pixel values.

Page 37

Exploring game state representations.
Google DeepMind plays Atari games.

• A representation is learned with a convolutional neural network.
• From 84×84×4 = 28,224 pixel values to 512 neurons.
• The network predicts the expected score if a certain action is taken.

Human-level control through deep reinforcement learning, V. Mnih et al. (Nature, 2015)

Page 38

Exploring game state representations.
Google DeepMind plays Atari games.

Human-level control through deep reinforcement learning, V. Mnih et al. (Nature, 2015)

Page 39

Using t-SNE to explore image representations.
Classifying dogs and cats.

https://indico.io/blog/visualizing-with-t-sne/

• Each data point is an image of a dog or a cat.
• Red = cats, blue = dogs.

Page 40

Using t-SNE to explore image representations.
Classifying dogs and cats.

Representation
A convolutional net trained for Image Classification (1000 classes).

https://indico.io/blog/visualizing-with-t-sne/
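The same recipe applies to images: push them through a trained network, keep the penultimate-layer features, and run t-SNE on those. A sketch with an ImageNet-trained torchvision model standing in for the network used in the blog post (an assumption), where images is a hypothetical preprocessed (n, 3, 224, 224) tensor:

import torch
from torchvision import models
from sklearn.manifold import TSNE

net = models.resnet18(weights="IMAGENET1K_V1")   # 1000-class ImageNet CNN (illustrative)
net.fc = torch.nn.Identity()                     # drop classifier, keep 512-d features
net.eval()

with torch.no_grad():
    feats = net(images).numpy()                  # (n, 512) representation

Y = TSNE(n_components=2, perplexity=30.0).fit_transform(feats)
# Color the 2-D points by true label (dog/cat) to inspect the representation.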


Page 42

Conclusion

• The t-SNE algorithm reduces dimensionality while preserving local similarity.
• The t-SNE algorithm has been built heuristically.
• t-SNE is commonly used to visualize representations.