TensorFlow + NLP: Language Vector Space Model (Word2Vec) Tutorial

Thai Word Embedding with Tensorflow



Page 1: Thai Word Embedding with Tensorflow

TensorFlow + NLP: Language Vector Space Model (Word2Vec) Tutorial

Page 2: Thai Word Embedding with Tensorflow

Goal of this tutorial

• Learn how to do NLP in Tensorflow

• Learn word embeddings that can extract relationships between discrete atomic symbols (words) from a text corpus.

Page 3: Thai Word Embedding with Tensorflow

Words in a Text Corpus

Page 4: Thai Word Embedding with Tensorflow

NLP in Deep Learning

• Word embeddings are needed for deep learning on text. Why?

• Image and audio data already provide useful information about the relationships between instances (pixels, audio frames).

• A pixel value of #FF0000 is very similar to #FE0000, since both are red. We can compute the difference automatically.

• Text does not provide useful information about the relationships between individual symbols.

• If 'cat' is represented as Id537 and 'dog' as Id143, the computer cannot tell how Id537 and Id143 are related.
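A small sketch of this contrast (the vectors below are made-up toy values, not trained embeddings): raw IDs carry no usable geometry, while embedding vectors support a similarity measure such as cosine similarity.

```python
# Toy contrast between raw symbol IDs and embedding vectors.
# The vectors are invented for illustration; real embeddings are learned.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# As raw IDs, |537 - 143| says nothing about how 'cat' relates to 'dog'.
cat_id, dog_id = 537, 143

# As (hypothetical) embedding vectors, similarity becomes computable.
cat = [0.90, 0.80, 0.10]
dog = [0.85, 0.75, 0.20]
car = [0.10, 0.05, 0.90]

print(cosine(cat, dog))  # high: related words
print(cosine(cat, car))  # low: unrelated words
```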

Page 5: Thai Word Embedding with Tensorflow
Page 6: Thai Word Embedding with Tensorflow

Vector Space Model

• Find the relationships between discrete symbols (in this case, words).

• Two proposed methods.

• Count-based method.

• Count how often each word co-occurs with its neighbor words in a large text corpus (e.g., Latent Semantic Analysis).

• Predictive method.

• Try to predict a word from its neighbors (e.g., the neural probabilistic language model).
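As a toy illustration of the count-based idea (a hypothetical one-line corpus and a one-word window, not LSA itself), co-occurrence counts can be tallied like this:

```python
from collections import Counter

# Hypothetical toy corpus; real count-based methods use large corpora.
corpus = "the cat sits on the mat the dog sits on the mat".split()
window = 1  # count one neighbor on each side

cooc = Counter()
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            cooc[(word, corpus[j])] += 1

print(cooc[("sits", "on")])   # 'sits' appears next to 'on' twice
print(cooc[("sits", "cat")])  # 'sits' appears next to 'cat' once
```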

Page 7: Thai Word Embedding with Tensorflow

Word2Vec

• A computationally efficient predictive model for learning word embeddings from raw text.

• Created by Tomas Mikolov's team at Google.

• Two flavors:

• Continuous Bag-of-Words (CBOW)

• Skip-Gram model

Page 8: Thai Word Embedding with Tensorflow

CBOW

• Continuous Bag-of-Words (CBOW)

• Predict target words from source context words.

• Input: "The cat sits on the ______"

• Output: mat

• Example (3-gram CBOW): (the, cat) => sits, (cat, sits) => on, (sits, on) => the, (on, the) => mat

• Better for small datasets.
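The pairing above can be generated mechanically. A minimal sketch in plain Python, following the slide's convention that the two preceding words form the context for the word that follows:

```python
sentence = "the cat sits on the mat".split()

# 3-gram CBOW pairs as on the slide: context = two preceding words,
# target = the next word.
pairs = [((sentence[i], sentence[i + 1]), sentence[i + 2])
         for i in range(len(sentence) - 2)]

print(pairs)
# [(('the', 'cat'), 'sits'), (('cat', 'sits'), 'on'),
#  (('sits', 'on'), 'the'), (('on', 'the'), 'mat')]
```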

Page 9: Thai Word Embedding with Tensorflow

Skip-Gram Model

• Predict source context words from target words.

• Input: sits

• Output: "The cat ____ on the mats"

• Example (1-skip 3-gram Skip-Gram): (the, sits) => cat, (cat, on) => sits, (sits, the) => on, (on, mats) => the

• Better for large datasets. We use this model in this tutorial.

Page 10: Thai Word Embedding with Tensorflow

Noise-Contrastive Training for Vector Space Model

• We use gradient descent on a binary logistic regression objective (a neural network) to model word relationships.

• To discriminate the real target words (those that occur in the skip-gram training pairs) from imaginary noise words (those that do not), we maximize the following objective function.
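The formula itself appears to have been an image lost in extraction. For reference, the standard skip-gram negative-sampling objective from Mikolov et al. (2013), which this kind of training maximizes, has the form:

```latex
J = \log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

where \(\sigma\) is the sigmoid function, \(v_{w_I}\) is the input vector of the target word, \(v'_{w_O}\) is the output vector of the real context word, and the \(w_i\) are \(k\) noise words drawn from a noise distribution \(P_n(w)\).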

Page 11: Thai Word Embedding with Tensorflow

Negative Sampling

Page 12: Thai Word Embedding with Tensorflow

Input

• Batch training. For example, window size = 9:

• "the quick brown fox jumped over the lazy dog"

• 1-skip 3-gram Skip-Gram = (the, brown) => quick, (quick, fox) => brown, (brown, jumped) => fox, ...

• Dataset: (quick, the), (quick, brown), (brown, quick), (brown, fox),...
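The dataset above can be generated with a short sketch (plain Python; one word of context on each side, emitting (target, context) pairs as on the slide):

```python
words = "the quick brown fox jumped over the lazy dog".split()

# Skip-gram: each word predicts its neighbors, so each (target, context)
# position pair inside the window becomes one training example.
dataset = []
for i, target in enumerate(words):
    for j in (i - 1, i + 1):          # one word to the left and right
        if 0 <= j < len(words):
            dataset.append((target, words[j]))

print(dataset[1:5])
# [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]
```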

Page 13: Thai Word Embedding with Tensorflow

Loop

• (quick, the), (quick, brown), (brown, quick), (brown, fox), ...

• For each loop iteration, randomly pick a word that is not in the window as the negative sample. Then a stochastic gradient descent step adjusts the weights to maximize the objective function above.
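The loop body can be sketched in NumPy (hypothetical vocabulary size, embedding dimension, and learning rate; per-pair logistic-loss gradients stand in for the full trainer):

```python
import numpy as np

# Sketch of one negative-sampling SGD step on a tiny, made-up setup.
rng = np.random.default_rng(0)
vocab_size, dim, lr = 10, 5, 0.1

# Two embedding tables, as in word2vec: input and output vectors.
emb_in = rng.normal(scale=0.1, size=(vocab_size, dim))
emb_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(target, context, num_neg=2):
    """One stochastic update for a (target, context) pair."""
    v = emb_in[target]
    # Label 1 for the real context word; label 0 for noise words drawn,
    # per the slide, from outside the window.
    samples = [(context, 1.0)]
    while len(samples) < 1 + num_neg:
        w = int(rng.integers(0, vocab_size))
        if w != context:
            samples.append((w, 0.0))
    grad_v = np.zeros(dim)
    for w, label in samples:
        g = sigmoid(emb_out[w] @ v) - label   # gradient of the logistic loss
        grad_v += g * emb_out[w]
        emb_out[w] -= lr * g * v
    emb_in[target] -= lr * grad_v

# Repeated updates should raise the model's probability for the real pair.
before = sigmoid(emb_out[3] @ emb_in[2])
for _ in range(50):
    sgd_step(target=2, context=3)
after = sigmoid(emb_out[3] @ emb_in[2])
print(before, after)
```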

Page 14: Thai Word Embedding with Tensorflow

TensorFlow code

Page 15: Thai Word Embedding with Tensorflow
Page 16: Thai Word Embedding with Tensorflow

10,000 Thai news articles

Page 17: Thai Word Embedding with Tensorflow

Clean Data

Page 18: Thai Word Embedding with Tensorflow

Step 0

Page 19: Thai Word Embedding with Tensorflow

Step 30,000

Page 20: Thai Word Embedding with Tensorflow

Step 0

Step 30,000

Page 21: Thai Word Embedding with Tensorflow