Clustering Semantically Similar Words

  • Clustering Semantically Similar Words

    DSW Camp & Jam, December 4th, 2016

    Bayu Aldi Yansyah

  • Our Goals / Overview
    - Understand step-by-step how to cluster words based on their
      semantic similarity
    - Understand how deep learning models are applied to Natural
      Language Processing

  • I Assume / Overview
    - You understand the basics of Natural Language Processing and
      Machine Learning
    - You are familiar with artificial neural networks

  • Outline / Overview
    1. Introduction to Word Clustering
    2. Introduction to Word Embedding
       - Feed-forward Neural Net Language Model
       - Continuous Bag-of-Words Model
       - Continuous Skip-gram Model
    3. Similarity metrics
       - Cosine similarity
       - Euclidean similarity
    4. Clustering algorithm: Consensus clustering

  • 1. WORD CLUSTERING / INTRODUCTION
    - Word clustering is a technique for partitioning a set of words into
      subsets of semantically similar words.
    - Suppose we have a set of words W = {w_1, w_2, ..., w_n}, n ∈ ℕ; our
      goal is to find clusters C = {C_1, C_2, ..., C_k}, k ∈ ℕ, where:
      - w_c is the centroid of cluster C_j,
      - similarity(w_i, w_j) is a function that measures the similarity
        score of two words, and
      - θ is a threshold value: if similarity(w_i, w_c) ≥ θ, then w_i and
        w_c are semantically similar.
    - For w_i ∈ C_a and w_j ∈ C_b with a ≠ b it holds that
      similarity(w_i, w_j) < θ, so each cluster is
      C_j = {w_i | similarity(w_i, w_c) ≥ θ} and the clusters are
      disjoint: C_a ∩ C_b = ∅ for a ≠ b.
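
For reference, the criterion above can be written compactly; this is a LaTeX restatement of the reconstruction on this slide (w_{c_j} denotes the centroid of cluster C_j), not a formula quoted verbatim from the deck:

```latex
% Restatement of the word clustering criterion (reconstruction, not verbatim).
\[
  W = \{w_1, w_2, \dots, w_n\}, \qquad C = \{C_1, C_2, \dots, C_k\}
\]
\[
  C_j = \{\, w_i \in W \mid \mathrm{similarity}(w_i, w_{c_j}) \ge \theta \,\},
  \qquad C_a \cap C_b = \emptyset \quad \text{for } a \ne b
\]
```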

  • 1. WORD CLUSTERING / INTRODUCTION
    In order to perform word clustering, we need to:
    1. Represent each word as a semantic vector, so we can compute their
       similarity and dissimilarity scores.
    2. Find the centroid w_c of each cluster.
    3. Choose the similarity metric similarity(w_i, w_j) and the threshold
       value θ.
    (A minimal code sketch of these steps follows below.)
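
A minimal Python sketch of these three steps, assuming the word vectors are already available as NumPy arrays and using cosine similarity with a hand-picked threshold θ; the greedy centroid selection here is only illustrative (the deck's actual clustering algorithm, consensus clustering, comes later in the outline):

```python
import numpy as np

def cosine_similarity(a, b):
    """similarity(w_i, w_j) as the cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def threshold_cluster(word_vectors, theta=0.7):
    """Greedy illustration: each unassigned word starts a cluster (its centroid),
    and every remaining word with similarity >= theta joins that cluster."""
    clusters = []
    unassigned = set(word_vectors)
    while unassigned:
        centroid = unassigned.pop()          # step 2: pick a centroid w_c
        members = {centroid}
        for w in list(unassigned):
            # step 3: apply the chosen metric and threshold
            if cosine_similarity(word_vectors[centroid], word_vectors[w]) >= theta:
                members.add(w)
                unassigned.remove(w)
        clusters.append(members)
    return clusters

# toy 2-dimensional "embeddings" (step 1 would normally come from a trained model)
vectors = {
    "good":  np.array([0.9, 0.1]),
    "great": np.array([0.85, 0.15]),
    "cat":   np.array([0.1, 0.9]),
}
print(threshold_cluster(vectors, theta=0.8))
```

With the toy vectors above, "good" and "great" land in one cluster and "cat" in its own; real word vectors would come from one of the embedding models introduced next.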

  • Semantic Similarity
    Words are semantically similar if they mean the same thing, are
    opposites of each other, are used in the same way, are used in the
    same context, or one is a type of another. (Gomaa and Fahmy, 2013)

  • 2. WORD EMBEDDING / INTRODUCTION
    - Word embedding is a technique to represent a word as a vector.
    - The result of word embedding is frequently referred to as a word
      vector or a distributed representation of words.
    - There are 3 main approaches to word embedding:
      1. Neural network model based
      2. Dimensionality reduction based
      3. Probabilistic model based
    - We focus on (1).
    - The idea of these approaches is to learn vector representations of
      words in an unsupervised manner.

  • 2. WORD EMBEDDING / INTRODUCTION
    - Some neural network models that can learn representations of words
      are:
      1. Feed-forward Neural Net Language Model by Bengio et al. (2003).
      2. Continuous Bag-of-Words Model by Mikolov et al. (2013).
      3. Continuous Skip-gram Model by Mikolov et al. (2013).
    - We will compare these 3 models.
    - Fun fact: the last two models are highly inspired by no. 1.
    - Only the Feed-forward Neural Net Language Model is considered a deep
      learning model.
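
As a practical aside (not part of the deck), both Mikolov et al. (2013) models are available in the gensim library's Word2Vec class, where the sg flag switches between CBOW and skip-gram; parameter names below assume gensim 4.x (older releases use size instead of vector_size), and the tiny corpus is invented for illustration:

```python
from gensim.models import Word2Vec

# tiny invented corpus; in practice this would be a large tokenized text collection
sentences = [
    ["keren", "sale", "stock", "bisa", "bayar", "dirumah"],
    ["sale", "stock", "bisa", "bayar", "di", "rumah"],
]

# sg=0 -> Continuous Bag-of-Words, sg=1 -> Continuous Skip-gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["sale"].shape)                    # (50,) word vector from the embedding matrix
print(skipgram.wv.most_similar("sale", topn=2)) # nearest words by cosine similarity
```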

  • 2. WORD EMBEDDING / COMPARING NEURAL NETWORK MODELS
    - We will use the notation from Collobert et al. (2011) to represent
      the models. This helps us compare the models easily.
    - Any feed-forward neural network with L layers can be seen as a
      composition of functions f_θ^l(x), one for each layer l:
      f_θ(x) = f_θ^L(f_θ^{L-1}(... f_θ^1(x) ...))
    - With parameters for each layer l:
      θ = (θ^1, θ^2, ..., θ^L)
    - Usually each layer has a weight W^l and a bias b^l, so
      θ^l = (W^l, b^l).
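
A small sketch of this notation in Python: each layer l is a function with its own parameters θ^l = (W^l, b^l), and the whole network f_θ is just their composition. Layer sizes and the use of tanh everywhere are arbitrary choices for the sketch:

```python
import numpy as np

def make_layer(in_dim, out_dim, activation=np.tanh):
    """One layer l with its own parameters theta_l = (W_l, b_l)."""
    W = np.random.randn(in_dim, out_dim) * 0.01
    b = np.zeros(out_dim)
    return lambda x: activation(x @ W + b)

# f_theta(x) = f_L(f_{L-1}(... f_1(x) ...)) for an arbitrary 3-layer network
layers = [make_layer(10, 8), make_layer(8, 8), make_layer(8, 4)]

def f_theta(x):
    for f_l in layers:   # apply the composition layer by layer
        x = f_l(x)
    return x

print(f_theta(np.random.randn(10)).shape)  # (4,)
```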

  • 2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL (Bengio et al., 2003)
    - The training data is a sequence of words w_1, w_2, ..., w_T, with
      each w_t ∈ V.
    - The model tries to predict the next word w_t based on the previous
      context (the previous n words: w_{t-1}, w_{t-2}, ..., w_{t-n}).
      (Figure 2.1.1)
    - The model consists of 4 layers: input layer, projection layer,
      hidden layer(s) and output layer. (Figure 2.1.2)
    - Known as NNLM.

    Figure 2.1.1: predicting the next word w_t from the previous four
    words (Indonesian example sentence: "Keren Sale Stock bisa [w_t]
    dirumah ...").

  • 2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL / COMPOSITION OF FUNCTIONS: INPUT LAYER
    - x_{t-1}, x_{t-2}, ..., x_{t-n} are the 1-of-|V| (one-hot-encoded)
      vectors of w_{t-1}, w_{t-2}, ..., w_{t-n}; n is the number of
      previous words.
    - The input layer just acts as a placeholder here.

    ŷ_t = f_θ(x_{t-1}, ..., x_{t-n})
    Output layer:                     f_θ^4(x) = ŷ_t = softmax(W_4^T f_θ^3(x) + b_4)
    Hidden layer:                     f_θ^3(x) = tanh(W_3^T f_θ^2(x) + b_3)
    Projection layer:                 f_θ^2(x) = W_2^T f_θ^1(x)
    Input layer for the i-th example: f_θ^1(x) = (x_{t-1}, x_{t-2}, ..., x_{t-n})

  • 2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL / COMPOSITION OF FUNCTIONS: PROJECTION LAYER
    - The idea of this layer is to project the |V|-dimensional vectors
      down to a smaller dimension m.
    - W_2 is the |V| × m matrix, also known as the embedding matrix, where
      each row is a word vector.
    - Unlike the hidden layer, there is no non-linearity here.
    - This layer is also known as the shared word features layer.
    (Composition of functions as listed on the input-layer slide.)

  • 2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL / COMPOSITION OF FUNCTIONS: HIDDEN LAYER
    - W_3 is the (n·m) × h matrix, where h is the number of hidden units.
    - b_3 is an h-dimensional vector.
    - The activation function is the hyperbolic tangent.
    (Composition of functions as listed on the input-layer slide.)

  • 2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL / COMPOSITION OF FUNCTIONS: OUTPUT LAYER
    - W_4 is the h × |V| matrix.
    - b_4 is a |V|-dimensional vector.
    - The activation function is softmax.
    - ŷ_t is a |V|-dimensional vector.
    (Composition of functions as listed on the input-layer slide.)

  • 2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL / LOSS FUNCTION
    - N is the number of training examples, and f_θ(...; w_t) denotes the
      predicted probability of the actual next word w_t.
    - The goal is to maximize this function (the average log-likelihood).
    - The neural network is trained using stochastic gradient ascent.

    L = (1/N) Σ_{i=1}^{N} log f_θ(x_{t-1}, ..., x_{t-n}; w_t)
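
Putting the four layers and the objective together, here is a minimal NumPy sketch of the NNLM forward pass and average log-likelihood as reconstructed above. The sizes are toy values and the direct input-to-output connections of Bengio et al.'s full model are omitted, so this is an illustration rather than a faithful reimplementation:

```python
import numpy as np

V, n, m, h = 10, 4, 2, 5          # |V|, context size, embedding dim, hidden units
rng = np.random.default_rng(0)

W2 = rng.normal(scale=0.1, size=(V, m))        # projection / embedding matrix
W3 = rng.normal(scale=0.1, size=(n * m, h))    # hidden layer weights
b3 = np.zeros(h)
W4 = rng.normal(scale=0.1, size=(h, V))        # output layer weights
b4 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_forward(context_ids):
    """context_ids: indices of w_{t-1}, ..., w_{t-n}. Returns P(w_t | context)."""
    f1 = np.eye(V)[context_ids]                # input layer: one-hot vectors
    f2 = (f1 @ W2).reshape(-1)                 # projection layer: concatenated word vectors, no non-linearity
    f3 = np.tanh(f2 @ W3 + b3)                 # hidden layer: tanh
    return softmax(f3 @ W4 + b4)               # output layer: softmax over |V|

def log_likelihood(examples):
    """Objective to maximize: (1/N) * sum_i log P(w_t | context) for each example."""
    return np.mean([np.log(nnlm_forward(ctx)[target]) for ctx, target in examples])

examples = [([1, 2, 3, 4], 5), ([2, 3, 4, 5], 6)]   # toy (context, next word) pairs
print(log_likelihood(examples))
```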

  • Figure 2.1.2: Flow of the tensor of the Feed-forward Neural Net
    Language Model with vocabulary size |V| and hyperparameters n = 4,
    m = 2 and h = 5.

  • 2.2. CONTINUOUS BAG-OF-WORDS MODEL (Mikolov et al., 2013)
    - The training data is a sequence of words w_1, w_2, ..., w_T, with
      each w_t ∈ V.
    - The model tries to predict the word w_t based on the surrounding
      context (c words from the left: w_{t-1}, w_{t-2}, ... and c words
      from the right: w_{t+1}, w_{t+2}, ...). (Figure 2.2.1)
    - There is no hidden layer in this model.
    - The projection layer is averaged across the input words.

    Figure 2.2.1: predicting the word w_t from the words surrounding it
    (Indonesian example sentence: "Keren Sale [w_t] bisa bayar
    dirumah ...").

  • 2.2. CONTINUOUS BAG-OF-WORDS MODEL / COMPOSITION OF FUNCTIONS: INPUT LAYER
    - x_{t+j} is the 1-of-|V| (one-hot-encoded) vector of w_{t+j}.
    - c is the number of words on the left and on the right.

    ŷ_t = f_θ(x_{t-c}, ..., x_{t-1}, x_{t+1}, ..., x_{t+c})
    Output layer:                     f_θ^3(x) = ŷ_t = softmax(W_3^T f_θ^2(x) + b_3)
    Projection layer:                 f_θ^2(x) = h = (1/2c) Σ_{-c ≤ j ≤ c, j ≠ 0} W_2^T f_θ^1(x_{t+j})
    Input layer for the i-th example: f_θ^1(x_{t+j}) = x_{t+j}

  • 2.2. CONTINUOUS BAG-OF-WORDS MODEL / COMPOSITION OF FUNCTIONS: PROJECTION LAYER
    - The difference from the previous model is that this model projects
      all the inputs to a single m-dimensional vector h.
    - W_2 is the |V| × m matrix, also known as the embedding matrix, where
      each row is a word vector.
    (Composition of functions as listed on the input-layer slide.)

  • 2.2. CONTINUOUS BAG-OF-WORDS MODEL / COMPOSITION OF FUNCTIONS: OUTPUT LAYER
    - W_3 is the m × |V| matrix.
    - b_3 is a |V|-dimensional vector.
    - The activation function is softmax.
    - ŷ_t is a |V|-dimensional vector.
    (Composition of functions as listed on the input-layer slide.)

  • 2.2. CONTINUOUS BAG-OF-WORDS MODEL / LOSS FUNCTION
    - N is the number of training examples, and f_θ(...; w_t) denotes the
      predicted probability of the actual word w_t.
    - The goal is to maximize this function (the average log-likelihood).
    - The neural network is trained using stochastic gradient ascent.

    L = (1/N) Σ_{i=1}^{N} log f_θ(x_{t-c}, ..., x_{t-1}, x_{t+1}, ..., x_{t+c}; w_t)
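
In the same spirit, a minimal NumPy sketch of the CBOW forward pass and objective as reconstructed above: the projection layer averages the 2c context word vectors and feeds them straight into the softmax output layer, with no hidden layer. Toy sizes, illustration only:

```python
import numpy as np

V, c, m = 10, 2, 2                 # |V|, context window per side, embedding dim
rng = np.random.default_rng(1)

W2 = rng.normal(scale=0.1, size=(V, m))   # embedding matrix, one word vector per row
W3 = rng.normal(scale=0.1, size=(m, V))   # output layer weights
b3 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids):
    """context_ids: indices of the 2c surrounding words. Returns P(w_t | context)."""
    one_hot = np.eye(V)[context_ids]          # input layer: one-hot vectors
    h = (one_hot @ W2).mean(axis=0)           # projection layer: average of the 2c word vectors
    return softmax(h @ W3 + b3)               # output layer: softmax, no hidden layer

# objective to maximize: mean log-probability of the centre word given its context
examples = [([1, 2, 4, 5], 3), ([2, 3, 5, 6], 4)]
print(np.mean([np.log(cbow_forward(ctx)[target]) for ctx, target in examples]))
```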

  • Figure 2.2.2: Flow of the tensor of the Continuous Bag-of-Words Model
    with vocabulary size |V| and hyperparameters c = 2, m = 2.

  • 2.3. CONTINUOUS SKIP-GRAM MODEL (Mikolov et al., 2013)
    - The training data is a sequence of words w_1, w_2, ..., w_T, with
      each w_t ∈ V.
    - The model tries to predict the surrounding context (c words from the
      left: w_{t-1}, w_{t-2}, ... and c words from the right: w_{t+1},
      w_{t+2}, ...) based on the word w_t. (Figure 2.3.1)

    Figure 2.3.1: predicting the surrounding words from the word w_t.

  • 2.3. CONTINUOUS SKIP-GRAM MODEL / COMPOSITION OF FUNCTIONS: INPUT LAYER
    - x_t is the 1-of-|V| (one-hot-encoded) vector of w_t.

    ŷ = f_θ(x_t)
    Output layer:     f_θ^3(x) = ŷ = softmax(W_3^T f_θ^2(x) + b_3)
    Projection layer: f_θ^2(x) = W_2^T f_θ^1(x)
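
To round off the comparison, a minimal NumPy sketch of the skip-gram forward pass under the same notation: the centre word's one-hot vector is projected to its word vector, and a softmax over |V| scores each possible surrounding word. Toy sizes, and the full softmax output is an illustration (Mikolov et al.'s implementation uses hierarchical softmax or negative sampling instead):

```python
import numpy as np

V, c, m = 10, 2, 2                 # |V|, context window per side, embedding dim
rng = np.random.default_rng(2)

W2 = rng.normal(scale=0.1, size=(V, m))   # embedding matrix
W3 = rng.normal(scale=0.1, size=(m, V))   # output layer weights
b3 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_forward(centre_id):
    """Returns P(context word | w_t), shared across the 2c context positions."""
    x = np.eye(V)[centre_id]                  # input layer: one-hot vector of w_t
    h = x @ W2                                # projection layer: the word vector of w_t
    return softmax(h @ W3 + b3)               # output layer: softmax over the vocabulary

# objective: mean log-probability of each surrounding word given the centre word
centre, context = 3, [1, 2, 4, 5]
print(np.mean([np.log(skipgram_forward(centre)[w]) for w in context]))
```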