RNN & NLP Application - cs.kangwon.ac.krleeck/ML/RNN_NLP.pdf · LSTM RNN 응용 예 • Hand...

Preview:

Citation preview

RNN & NLP Application

강원대학교 IT대학

이창기

차례

• RNN

• Word Embedding

• NLP application

Recurrent Neural Network

• “Recurrent” property dynamical system over time

Bidirectional RNN

• Exploit future context as well as past

Long Short-Term Memory RNN • Vanishing Gradient Problem for RNN

• LSTM can preserve gradient information

LSTM Block Architecture

Gated Recurrent Unit (GRU)

• 𝑟𝑡 = 𝜎 𝑊𝑥𝑟𝑥𝑡 + 𝑊ℎ𝑟ℎ𝑡−1 + 𝑏𝑟

• 𝑧𝑡 = 𝜎 𝑊𝑥𝑥𝑥𝑡 + 𝑊ℎ𝑧ℎ𝑡−1 + 𝑏𝑧

• ℎ 𝑡 = 𝜙 𝑊𝑥ℎ𝑥𝑡 + 𝑊ℎℎ 𝑟𝑡 ⊙ ℎ𝑡−1 + 𝑏ℎ

• ℎ𝑡 = 𝑧𝑡ℎ𝑡 + 1 − 𝑧𝑡 ℎ 𝑡

• 𝑦𝑡 = 𝑔(𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦)

LSTM RNN 응용 예

• Hand Writing by Machine

• Music Composition

• Neural Machine Translation

• Image Caption Generation CNN + LSTM RNN

차례

• RNN

• Word Embedding

• NLP application

Deep Belief Network [Hinton06]

• Key idea

– Pre-train layers with an unsupervised learning algorithm in phases

– Then, fine-tune the whole network by supervised learning

• DBN are stacks of Restricted Boltzmann Machines (RBM)

10

Restricted Boltzmann Machine

• A Restricted Boltzmann machine (RBM) is a generative stochastic neural network that can learn a probability distribution over its set of inputs

• Major applications – Dimensionality reduction

– Topic modeling, …

11

Training DBN: Pre-Training

• 1. Layer-wise greedy unsupervised pre-training – Train layers in phase from the bottom layer

12

Training DBN: Fine-Tuning

• 2. Supervised fine-tuning for the classification task

13

The Back-Propagation Algorithm

Autoencoder

• Autoencoder is an NN whose desired output is the same as the input

– To learn a compressed representation (encoding) for a set of data.

– Find weight vectors A and B that minimize: Σi(yi-xi)

2

15 <겨울학교14 Deep Learning 자료 참고>

Stacked Autoencoders

• After training, the hidden node extracts features from the input nodes

• Stacking autoencoders constructs a deep network

16 <겨울학교14 Deep Learning 자료 참고>

텍스트의 표현 방식

• One-hot representation (or symbolic)

– Ex. [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]

– Dimensionality • 50K (PTB) – 500K (big vocab) – 3M (Google 1T)

– Problem • Motel [0 0 0 0 0 0 0 0 1 0 0] AND

• Hotel [0 0 0 0 0 0 1 0 0 0 0] = 0

• Continuous representation

– Latent Semantic Analysis, Random projection

– Latent Dirichlet Allocation, HMM clustering

– Neural Word Embedding • Dense vector

• By adding supervision from other tasks improve the representation

Neural Network Language Model (Bengio00,03)

Shared weights = Word embedding

• Idea – A word and its context is a

positive training sample

– A random word in that same context negative training sample

– Score(positive) > Score(neg.)

• Training complexity is high – Hidden layer output

– Softmax in the output layer

• Hierarchical softmax

• Negative sampling

• Ranking(hinge loss) Input Dim: 1 Dim: 2 Dim: 3 Dim: 4 Dim: 5

1 (boy) 0.01 0.2 -0.04 0.05 -0.3

2 (girl) 0.02 0.22 -0.05 0.04 -0.4

LT: |V|*d, Input(one hot): |V|*1 LTT I

Ranking-based Model (Collobert)

본 연구 추가

Shared weights = Word embedding

w(t-2)

w(t-1)

w(t) or wc(t)

s or sc

Negative sampling

s > sc

Word2Vec: CBOW, Skip-Gram

• Remove the hidden layer Speedup 1000x – Negative sampling

– Frequent word sampling

– Multi-thread (no lock)

• Continuous Bag-of-words (CBOW) – Predicts the current word given the con

text

• Skip-gram – Predicts the surrounding words given

the current word

– CBOW + DropOut/DropConnect

Shared weights

한국어 Word Embedding: NNLM

• Data – 세종코퍼스 원문 + Korean Wiki abstract + News data

• 2억 8000만 형태소

– Vocab. size: 60,000

• 모든 형태소 대상 (기호, 숫자, 한자, 조사 포함)

• 숫자 정규화 + 형태소/POS: 정부/NNG, 00/SN

• NNLM model – Dimension: 50

– Matlab 구현

– 학습시간: 16일

한국어 Word Embedding: NNLM

한국어 Word Embedding: Word2Vec(CBOW)

• Data – 2012-2013 News + Korean Wiki abstract:

• 9GB raw text

• 29억 형태소

– Vocab. size: 100,000 • 모든 형태소 대상 (기호, 숫자, 한자, 조사 포함)

• 숫자 정규화 + 영어 소문자 + 형태소/POS

• 기존 Word2Vec(CBOW, Skip-Gram) – 모델: CBOW > SKIP-Gram

– 학습 시간: 24분 Shared weights

한국어 Word Embedding: Word2Vec(CBOW)

Word Embedding 응용: Word Analogy

King – Man + Woman ≈ Queen

http://deeplearner.fz-qqq.net/

Bilingual Word Embedding

• Bilingual Word Embedding for PBMT (EMNLP13) – 중국어-영어 word aliment 이용

한영 Bilingual Word Embedding

한영 Bilingual Word Embedding – cont’d

차례

• RNN

• Word Embedding

• NLP application

Sequence Labeling – RNN, LSTM

Word embedding

Feature embedding

FFNN(or CNN), CNN+CRF (SENNA)

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

x(t-1) x(t ) x(t+1)

y(t+1) y(t-1) y(t )

RNN, CRF Recurrent CRF

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

x(t-1) x(t ) x(t+1)

y(t+1) y(t-1) y(t )

C(t) x(t ) h(t )

i (t )

f (t )

o(t )

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

LSTM RNN + CRF LSTM-CRF (KCC 15)

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

LSTM-CRF

• 𝑖𝑡 = 𝜎 𝑊𝑥𝑖𝑥𝑡 + 𝑊ℎ𝑖ℎ𝑡−1 + 𝑊𝑐𝑖𝑐𝑡−1 + 𝑏𝑖

• 𝑓𝑡 = 𝜎 𝑊𝑥𝑓𝑥𝑡 + 𝑊ℎ𝑓ℎ𝑡−1 + 𝑊𝑐𝑓𝑐𝑡−1 + 𝑏𝑓

• 𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡−1 + 𝑖𝑡 ⊙ tanh 𝑊𝑥𝑐𝑥𝑡 + 𝑊ℎ𝑐ℎ𝑡−1 + 𝑏𝑐

• 𝑜𝑡 = 𝜎 𝑊𝑥𝑜𝑥𝑡 + 𝑊ℎ𝑜ℎ𝑡−1 + 𝑊𝑐𝑜𝑐𝑡 + 𝑏𝑜

• ℎ𝑡 = 𝑜𝑡 ⊙ tanh(𝑐𝑡)

• 𝑦𝑡 = 𝑔(𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦)

• 𝑦𝑡 = 𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦

• 𝑠 𝐱, 𝐲 = 𝐴 𝑦𝑡−1, 𝑦𝑡 + 𝑦𝑡𝑇𝑡=1

• log 𝑃 𝐲 𝐱 = 𝑠 𝐱, 𝐲 − log exp(𝑠(𝐱, 𝐲′))𝐲′

GRU+CRF

• 𝑟𝑡 = 𝜎 𝑊𝑥𝑟𝑥𝑡 + 𝑊ℎ𝑟ℎ𝑡−1 + 𝑏𝑟

• 𝑧𝑡 = 𝜎 𝑊𝑥𝑧𝑥𝑡 + 𝑊ℎ𝑧ℎ𝑡−1 + 𝑏𝑧

• ℎ 𝑡 = 𝜙 𝑊𝑥ℎ𝑥𝑡 + 𝑊ℎℎ 𝑟𝑡 ⊙ ℎ𝑡−1 + 𝑏ℎ

• ℎ𝑡 = 𝑧𝑡 ⊙ ℎ𝑡−1 + 𝟏 − 𝑧𝑡 ⊙ ℎ 𝑡

• 𝑦𝑡 = 𝑔(𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦)

• 𝑦𝑡 = 𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦

• 𝑠 𝐱, 𝐲 = 𝐴 𝑦𝑡−1, 𝑦𝑡 + 𝑦𝑡𝑇𝑡=1

• log 𝑃 𝐲 𝐱 = 𝑠 𝐱, 𝐲 − log exp(𝑠(𝐱, 𝐲′))𝐲′

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

Bi-LSTM CRF

• Bidirectional LSTM+CRF

• Bidirectional GRU+CRF

• Stacked Bi-LSTM+CRF …

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

bh(t-1) bh(t ) bh(t+1)

Stacked LSTM CRF

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

bh(t-1) bh(t ) bh(t+1)

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

h2(t-1) h2(t ) h2(t+1)

LSTM CRF with Context words = CNN + LSTM CRF

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t-1) y(t )

x(t-2) x(t+2)

• Bi-LSTM CRF =~ LSTM CRF with Context > LSTM CRF

영어 개체명 인식 (KCC 15, Journal submitted)

영어 개체명 인식 (CoNLL03 data set) F1(dev) F1(test)

SENNA (Collobert) - 89.59

Structural SVM (baseline + Word embedding feature) - 85.58

FFNN (Sigm + Dropout + Word embedding) 91.58 87.35

RNN (Sigm + Dropout + Word embedding) 91.83 88.09

LSTM RNN (Sigm + Dropout + Word embedding) 91.77 87.73

GRU RNN (Sigm + Dropout + Word embedding) 92.01 87.96

CNN+CRF (Sigm + Dropout + Word embedding) 93.09 88.69

RNN+CRF (Sigm + Dropout + Word embedding) 93.23 88.76

LSTM+CRF (Sigm + Dropout + Word embedding) 93.82 90.12

GRU+CRF (Sigm + Dropout + Word embedding) 93.67 89.98

한국어 개체명 인식 (Journal submitted)

한국어 개체명 인식 (TV domain) F1(test)

Structural SVM (baseline) (basic + NE dic. + word cluster feature + morpheme feature)

89.03

FFNN (ReLU + Dropout + Word embedding) 87.70

RNN (Tanh + Dropout + Word embedding) 88.93

LSTM RNN (Tanh + Dropout + Word embedding) 89.38 (+0.35)

Bi-LSTM RNN (Tanh + Dropout + Word embedding) 89.21 (+0.18)

CNN+CRF (ReLU + Dropout + Word embedding) 90.06 (+1.03)

RNN+CRF (Sigm + Dropout + Word embedding) 90.52 (+1.49)

LSTM+CRF (Sigm + Dropout + Word embedding) 91.04 (+2.01)

GRU+CRF (Sigm + Dropout + Word embedding) 91.02 (+1.99)

NLU 실험 (ATIS data)

NLU (Airplane Traveling Information System data) F1(dev) F1(test)

FFNN (Sigm + Dropout + Word embedding) 93.75 92.13

RNN (Sigm + Dropout + Word embedding) 98.05 94.78

LSTM RNN (Sigm + Dropout + Word embedding) 98.27 94.85

GRU RNN (Sigm + Dropout + Word embedding) 98.11 94.82

CNN4+CRF (Sigm + Dropout + Word embedding) 96.66 95.56

RNN+CRF (Sigm + Dropout + Word embedding) 98.42 96.19

LSTM+CRF (Sigm + Dropout + Word embedding) 98.79 96.48

학습 속도

0

50

100

150

200

250

FFNN

RNN

LSTM RNN

GRU

SCRN

R-CRF

LSTM R-CRF

GRU-CRF

SCRN-CRF

Neural Architectures for NER (Arxiv16)

• LSTM-CRF model + Char-based Word Representation – Char: Bi-LSTM RNN

Neural Architectures for NER (Arxiv16)

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF (ACL16)

• LSTM-CRF model + Char-level Representation – Char: CNN

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF (ACL16)

NER with Bidirectional LSTM-CNNs (Arxiv16)

의미역 결정 (Semantic Role Labeling)

Korean PropBank

• Virginia Corpus – 군대 도메인, 54,500 단어 도메인이 맞지 않음

• Newswire Corpus – 신문 도메인

(1994/6/2~2000/3/20), 131,800 단어

– 구구조 기반 태깅을 의존 구조로 변환 4,882 문장

• Frame file – 2,749 xml files

• 용언의 root (not stem)

• 용언화 가능 명사 (deverbal noun)

• Frame file 예제

– 덧붙.01

– English Define = add

– Role set • ARG0: adder

• ARG1: thing added

• ARG2: added to

– Mapping1 • Rel = 덧붙이다

• Src = sbj, Trg = ARG0

• Src = obj, Trg = ARG1

• Src = comp, Trg=ARG2

한국어 의미역 결정 (SRL)

• 서술어 인식(PIC)

– 그는 르노가 3월말까지 인수제의 시한을 [갖고]갖.1 있다고 [덧붙였다]덧붙.1

• 논항 인식(AIC)

– 그는 [르노가]ARG0 [3월말까지]ARGM-TMP 인수제의 [시한을]ARG1 [갖고]갖.1 [있다고]AUX 덧붙였다

– [그는]ARG0 르노가 3월말까지 인수제의 시한을 갖고 [있다고]ARG1 [덧붙였다]

덧붙.1

의존 구문 분석

의미역 결정

딥러닝 기반 한국어 의미역 결정 – 한글 및 한국어 정보처리/동계학술대회 15

• Bidirectional LSTM+CRF

• Korean Word embedding

– Predicate word, argument word

– NNLM

• Feature embedding

– POS, distance, direction

– Dependency path, LCA

• Bi-LSTM+CRF 성능 (AIC)

– F1: 78.2% (+1.2)

– Backward LSTM+CRF: F1 77.6%

– S-SVM 성능 (KCC14) • 기본자질: F1 74.3%

• 기본자질+word cluster: 77.0% – 정보과학회 논문지 2015.02

x(t-1) x(t ) x(t+1)

h(t-1) h(t ) h(t+1)

y(t+1) y(t+1) y(t )

bh(t-1) bh(t ) bh(t+1)

Stacked Bi-LSTM CRF (한국어의미역결정) (정보과학회지 제출)

Syntactic information w/ w/o

Structural SVM

FFNN

Backward LSTM CRFs

Bidirectional LSTM CRFs

Stacked Bidirectional LSTM CRFs (2 layers)

Stacked Bidirectional LSTM CRFs (3 layers)

76.96

76.01

76.79

78.16

78.12

78.14

74.15

73.22

76.37

78.17

78.57

78.36

CNN for Sentence Classification (Sentiment Analysis)

• Convolutional NN – Convolution Layer

• Sparse Connectivity

• Shared Weights

• Multiple feature maps

– Sub-sampling Layer • Average/max pooling

• NxN1

• NLP (Sentence Classification)에 적용 – ACL14

– EMNLP14

Multiple feature maps

한국어 감성 분석 – CNN

• Mobile data – Train: 4543, Test: 500

• EMNLP14 모델(CNN) 적용 – Matlab으로 구현

– Word embedding • 한국어 10만 단어 + 도메인 특화 1420 단어

한국어 감성 분석 – CNN 실험 Data set Model Accuracy

Mobile Train: 4543 Test: 500

SVM (word feature) 85.58

CNN(relu,kernel3,hid50)+Word embedding (word feature)

91.20

CNN(relu,kernel3,hid50)+Random init. 89.00

LSTM RNN 기반 한국어 감성분석

• LSTM RNN-based encoding – Sentence embedding 입력

– Fully connected NN 출력

– GRU encoding 도 유사함

x(1) x(2 ) x(t)

h(1) h(2 ) h(t)

y

Data set Model Accuracy

Mobile Train: 4543 Test: 500

SVM (word feature) 85.58

CNN(relu,kernel3,hid50)+Word embedding (word feature)

91.20

GRU encoding + Fully connected NN 91.12

LSTM RNN encoding + Fully connected NN 90.93

Attention-based LSTM RNN encoding 기반 한국어 감성분석

• Bi-LSTM RNN-based encoding + Attention Mechanism

Data set Model Accuracy

Mobile Train: 4543 Test: 500

SVM (word feature) 85.58

CNN(relu,kernel3,hid50)+Word embedding (word feature)

91.20

GRU encoding + Fully connected NN 91.12

Attention-based GRU encoding + Fully connected NN

90.73

y

Attention Mechanism 결과

• 긍정 – 반응/nng/2 속도/nng/2 터치감/nng/2 만족/nng/2 하/xsa/19 고요/ec/25 </s>/44

– 카메라/nng/1 가/jks/2 깨끗이/mag/1 잘/mag/0 나오/vv/0 는/etm/1 듯/nnb/1 하/xsa/2 아/ec/1 참/mag/0 마음/nng/0 에/jkb/10 들/vv/12 었/ep/12 습니다/ef/12 ./sf/15 </s>/21

– 검정/nng/0 보다/jkb/0 화이트/nng/0 를/jko/1 선호/nng/1 하/xsv/3 는데/ec/3 다행/nng/2 이/jks/4 화이트/nng/2 가/jks/6 있/vv/4 어서/ec/5 디자인/nng/4 도/jx/8 맘/nng/8 에/jkb/10 들/vv/9 구요/ec/9 </s>/11

• 부정 – 화면/nng/10 이나/jc/12 해상도/nng/6 는/jx/12 않/nng/25 좋/va/1 아서/ec/8

</s>/23

– 화면/nng/5 자체/nng/9 가/jks/13 작/va/4 은데/ec/3 편하/va/1 겠/ep/7 냐/ec/16 </s>/38

– 메뉴/nng/5 무지/nng/0 느리/va/4 고요/ec/3 인터넷/nng/1 심심/xr/0 하/xsa/3 면/ec/4 렉먹/vv/7 고/ec/12 문자/nng/8 전송/nng/4 속도/nng/2 gg/sl/12 </s>/28

– 요금/nng/12 넘/vv/9 비싸/va/15 아/ec/19 </s>/43

GRU encoding + Deep Output

• GRU encoding + Deep Output – Sentence embedding

입력

– Fully connected NN 출력

• 실험 결과 성능 개선 없음

x(1) x(2 ) x(t)

h(1) h(2 ) h(t)

y

h'(t)

Attention-based GRU encoding + Deep Output

• Attention-based GRU encoding + Deep Output – Sentence embedding

입력

– Fully connected NN 출력

• 실험 결과 성능 개선 없음

h'

y

Stacked GRU encoding + MLP

• Stacked GRU encoding – Sentence embedding

입력

– Fully connected NN 출력

• 실험 중 – 일부 태스크에서 좋은 성능

을 보임

x(1) x(2 ) x(t)

h(1) h(2 ) h(t)

y

h’t)

Neural Machine Translation

T|S 777 항공편 은 3 시간 동안 지상 에 있 겠 습니다 . </s>

flight 0.5 0.4 0 0 0 0 0 0 0 0 0 0 0

777 0.3 0.6 0 0 0 0 0 0 0 0 0 0 0

is 0 0.1 0 0 0.1 0.2 0 0.4 0 0.1 0 0 0

on 0 0 0 0 0 0 0 0.7 0.2 0.1 0 0 0

the 0 0 0 0.2 0.3 0.3 0.1 0 0 0 0 0

ground 0 0 0 0.1 0.2 0.5 0.3 0 0 0 0 0 0

for 0 0 0 0.1 0.2 0.5 0.1 0.1 0 0 0 0 0

three 0 0 0 0.2 0.2 0.6 0 0 0 0 0 0 0

hours 0 0 0 0.1 0.3 0.5 0 0 0 0 0 0 0

. 0 0 0 0.4 0 0.1 0.2 0.1 0.1 0.1 0 0 0

</s> 0 0 0 0 0 0 0 0.1 0 0.1 0.1 0.3 0.3

Recurrent NN Encoder–Decoder for Statistical Machine Translation (EMNLP14)

GRU RNN Encoding GRU RNN Decoding Vocab: 15,000 (src, tgt)

Sequence to Sequence Learning with Neural Networks (NIPS14 – Google)

Source Voc.: 160,000 Target Voc.: 80,000 Deep LSTMs with 4 layers Train: 7.5 epochs (12M sentences, 10 days with 8-GPU machine)

Neural MT by Jointly Learning to Align and Translate (ICLR15)

GRU RNN + Alignment Encoding GRU RNN Decoding Vocab: 30,000 (src, tgt) Train: 5 days

NMT 모델의 장단점

• NMT의 장점 – Feature Engineering이 필요 없음 – End-to-end 방식의 단일 신경망 구조

• 단어 정렬, 번역 모델, 언어 모델 등이 필요 없음

– 디코더가 간단함 – 구문 분석 등이 필요 없음

• Pre-ordering, Syntax-based SMT

• NMT의 단점 – 출력 언어 단어 사전의 크기가 제한 됨

• 사전의 크기에 비례하여 학습/디코딩 시간이 필요 • 미등록어 처리가 필요함 • 최근 NMT 논문들은 이러한 미등록어 처리에 관한 연구가 대

부분임

문자 단위의 NMT 제안 (WAT15, 한글 및 한국어15)

• 기존의 NMT: 단어 단위의 인코딩-디코딩

– 미등록어 후처리 or NMT 모델의 수정 등이 필요

• 문자 단위의 NMT

– 입력 언어는 단어 단위로 인코딩

• 입력 언어를 문자 단위로 인코딩 성능 하락

– 출력 언어는 문자 단위로 디코딩

• 단어 단위: その/UN 結果/NCA を/PS 詳細/NCD …

• 문자 단위: そ/B の/I 結/B 果/I を/B 詳/B 細/I …

• 문자 단위 NMT의 장점

– 모든 문자를 사전에 등록 미등록어 문제해결

– 학습 및 디코딩 속도 향상

– 기존 NMT 모델의 수정이 필요 없음

– 미등록어 후처리 작업이 필요 없음

본 연구에서 확장한 NMT 모델

• Encoding (source language)

– Bidirectional GRU RNN 이용

– ℎ𝑡 = [ℎ𝑡 , ℎ𝑡].

• Decoding (target language)

– 𝑐 = ℎ0 (global context)

– 𝑐𝑡 = 𝑎𝑡𝑖ℎ𝑖 (local context)

– 𝑧𝑡 = 𝑠𝑖𝑔𝑚 𝑊𝑧𝑥𝐸𝑡(𝑦𝑡−1) + 𝑊𝑧𝑠𝑠𝑡−1 + 𝑏𝑧

– 𝑟𝑡 = 𝑠𝑖𝑔𝑚 𝑊𝑟𝑥𝐸𝑡(𝑦𝑡−1) + 𝑊𝑟𝑠𝑠𝑡−1 + 𝑏𝑟

– 𝑠 𝑡 = 𝑡𝑎𝑛ℎ 𝑊𝑠𝑥𝐸𝑡(𝑦𝑡−1) + 𝑊𝑠𝑠(𝑟𝑡𝑠𝑡−1) + 𝑊𝑠𝑐𝑐𝑡 + 𝑏𝑠

– 𝑠𝑡 = (1 − 𝑧𝑡)𝑠𝑡−1 + 𝑧𝑡𝑠 𝑡

– 𝒔′𝒕 = 𝒓𝒆𝒍𝒖 𝑾𝒔′𝒔𝒔𝒕 + 𝒃𝒚

– 𝑦𝑡 =

𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑾𝒚𝒔′𝒔′𝒕 + 𝑊𝑦𝑠𝑠𝑡 + 𝑊𝑦𝑦𝐸𝑡(𝑦𝑡−1) + 𝑾𝒚𝒄𝒄 + 𝑏𝑦

S’ S’

yt-1 yt

c

ASPEC J-to-E 실험

• ASPEC J-to-E data

• 성능 (Juman 이용 BLEU)

– PB SMT: 18.45

– HPB SMT: 18.72

– Tree-to-string SMT: 22.16

– NMT (Word-level decoding): 21.63

– NMT (Character-level decoding): 21.72

– NMT (Word+Character-level decoding): 23.76

ASPEC E-to-J 실험 (WAT15)

• ASPEC E-to-J data

• 성능 (Juman 이용 BLEU)

– PB SMT: 27.48

– HPB SMT: 30.19

– Tree-to-string SMT: 32.63

– NMT (Word-level decoding): 29.78

– NMT (Character-level decoding): 33.14 (4위) • RIBES 0.8073 (2위)

– Tree-to-String + NMT(Character-level) re-ranking • BLEU 34.60 (2위)

• Human 53.25 (2위)

JPO K-to-J 실험 (WAT15)

• JPO Patent K-to-J data

• 성능 (Juman 이용 BLEU)

– PB SMT: 69.22

– HPB SMT: 67.41

– NMT(Word-level): 61.52

– NMT(Character-level): 65.72

– PB SMT + NMT(Character-level) re-ranking • BLEU 71.38 (2위)

• Human 14.75 (1위)

ASPEC E-to-J 실험 – 예제

• This/DT:0 paper/NN:1 explaines/NNS:2 experimental/JJ:3 result/NN:4 according/VBG:5 to/TO:6 the/DT:7 model/NN:8 ./.:9 </s>:10

• こ/B:0 の/I:1 モ/B:2 デ/I:3 ル/I:4 に/B:5 よ/B:6 る/I:7 実/B:8 験/I:9 結/B:10 果/I:11 を/B:12 説/B:13 明/I:14 し/B:15 た/B:16 。/B:17 </s>:18

ASPEC J-to-E 실험 – 예제

• 又/CJ:0 メルト/NQC:1 フラクチャー/NCC:2 防止/NCS:3 策/NCC:4 に/PS:5 つい/VC:6 て/PJ:7 も/PC:8 述べ/VC:9 た/VX:10 </s>:11

• M/B:0 e/I:1 l/I:2 t/I:3 i/I:4 n/I:5 g/I:6 prevention/NN:7 measures/NNS:8 are/VBP:9 also/RB:10 described/VBN:11 ./.:12 </s>:13

JPO K-to-J 실험 – 예제

• ○/ETC:0 :/ETC:1 필름/NOUN:2 면/NOUN:3 과의/JOSA:4 접착/NOUN:5 이/JOSA:6 없/NORMALVERB:7 다/EOMI:8 ./ETC:9 </s>:10

• ○/B:0 :/B:1 フ/B:2 ィ/I:3 ル/I:4 ム/I:5 面/B:6 と/B:7 の/B:8 接/B:9 着/I:10 が/B:11 な/B:12 い/I:13 。/B:14 </s>:15

Input-feeding Approach (EMNLP15)

The attentional decisions are made independently, which is suboptimal.

In standard MT, a coverage set is often maintained during the translation process to keep track of which source words have been translated.

Effect: - We hope to make the model fully aware of

previous alignment choices - We create a very deep network spanning

both horizontally and vertically

Copying Mechanism or CopyNet (ACL16)

Abstractive Text Summarization (한글 및 한국어 16)

로드킬로 숨진 친구의 곁을 지키는 길고양이의 모습이 포착되었다.

RNN_search+input_feeding+CopyNet

End-to-End 한국어 형태소 분석 (동계학술대회16)

Attention + Input-feeding + Copying mechanism

Sequence-to-sequence 기반 한국어 구구조 구문 분석 (한글 및 한국어 16)

입력 예시 1 43/SN 국/NNG <sp> 참가/NNG

입력 예시 2 4 3 <SN> 국 <NNG> <sp> 참 가 <NNG>

43/SN 국/NNG + 참가/NNG

NP NP

NP

(NP (NP 43/SN + 국/NNG) (NP 참가/NNG))

GRU

GRU

GRU

GRU

GRU

GRU

x1

h1t-1

h2t-1

yt-1

h1t

h2t

yt

x2 xT

ct

Attention + Input-feeding

입력

선 생 <NNG> 님 <XSN> 의 <JKG> <sp> 이 야 기 <NNG> <sp> 끝 나 <VV> 자 <EC> <sp> 마 치 <VV> 는 <ETM> <sp>

종 <NNG> 이 <JKS> <sp> 울 리 <VV> 었 <EP> 다 <EF> . <SF>

정답 (S (S (NP_SBJ (NP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) (S

(NP_SBJ (VP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) )

RNN-search[7] (Beam size 10)

(S (VP (NP_OBJ (NP_MOD XX ) (NP_OBJ XX ) ) (VP XX ) ) (S (NP_SBJ (VP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) )

RNN-search + Input-feeding +

Dropout (Beam size 10)

(S (S (NP_SBJ (NP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) (S (NP_SBJ (VP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) )

Sequence-to-sequence 기반 한국어 구구조 구문 분석 (한글 및 한국어 16)

모델 F1

스탠포드 구문분석기[13] 74.65

버클리 구문분석기[13] 78.74

형태소 + <sp> RNN-search[7] (Beam size 10)

87.34(baseline)

87.65*(+0.31)

형태소의 음절 + 품사태그 + <sp>

RNN-search[7] (Beam size 10) 87.69(+0.35)

88.00*(+0.66)

RNN-search + Input-feeding (Beam size 10) 88.23(+0.89)

88.68*(+1.34)

RNN-search + Input-feeding + Dropout (Beam size 10)

88.78(+1.44)

89.03*(+1.69)

Pointer Network (NIPS15) • Travelling Salesman Problem: NP-hard • Pointer Network can learn approximate solutions: O(n^2)

상호참조해결을 위한 포인터 네트워크 모델 (Journal submitted)

• 입력: 단어(형태소) 열, 출발점(대명사, 한정사구(이 별자리 등))

– X = {A:0, B:1, C:2, D:3, <EOS>:4}, Start_Point=A:0

• 출력: 입력 단어 열의 위치(Pointer) 열 Entity

– Y = {A:0, C:2, D:3, <EOS>:4}

• 특징: End-to-end 방식의 대명사 상호참조 해결 (mention detection 과정 X)

A B C D <EOS> A C D <EOS>

Encoding Decoding

Hidden Layer

Projection Layer

Attention Layer

포인터 네트워크를 이용한 한국어 의존 구문 분석 (동계학술대회16)

Neural Responding Machine for Short-Text Conversation (ACL 15)

Neural Responding Machine – cont’d

실험 결과 (ACL 15)

관련연구: Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network

Models (arXiv 15) - HRED

관련연구: Attention with Intention for a Neural Network Conversation Model (arXiv 15) - AWI

관련연구: A Diversity-Promoting Objective Function for Neural Conversation Models

(arXiv 16) - MMI

관련연구: A Persona-Based Neural Conversation Model (arXiv 16)

Speaker model + MMI

강원대 딥러닝 기반 챗봇 모델

• Short-Text Conversation 가정

– 사용자의 현재 발화 정보만을 이용하여 대답을 생성

• GRU_encdec 모델

– Gate Recurrent Unit(GRU) 기반의 Encoder-Decoder 모델

• Stacked_GRU_encdec 모델

– 2 layer GRU_encdec 모델

• GRU_search 모델

– Attention 기반의 모델

• GRU_DO_search 모델

– Attention 기반의 모델 + Deep Output 모델

Subtitle 데이터 실험

• 학습데이타: 영화 대사 데이터 100만 문장

• 모델: GRU + enc-dec • what is two plus two ?

– A) one , two , three .

• how old are you ?

– A) 18 years .

• are you a follower or a leader ?

– A) yes , i am .

• my name is david . what is my name ?

– A) david palmer .

• what is the purpose of being intelligent ?

– A) it is difficult to explain .

• what happens if machines can think ?

– A) if they find anything , they 'll kill us .

고객센터 데이터 실험

• 학습데이타: 고객센터 데이터 18만 문장

• 모델: Stacked GRU + enc-dec • 세탁기가 전원이 안 들어와요?

– A) 네 아 그러세요

– A) 많이 불편하셨겠습니다

– A) 드럼이세요 아니면 일반 세탁기세요

• 언제 방문해 줄수 있나요?

– A) 언제 방문 드리면 편하실까요

– A) 언제 방문 드리면 편하시겠어요

• 출장비가 얼마죠?

– A) 출장 점검료는 15000원이구요

– A) 출장비가 만오천원이구요

Short-Text Conversation (동계학술대회16)

- Data: 클리앙 ‘아무거나 질문 게시판’ - 77,346 질문-응답 쌍 - 학습:개발:평가 = 8:1:1

이미지 캡션 생성 소개

• 이미지 내용 이해 이미지 내용을 설명하는 캡션 자동 생성 – 이미지 인식(이해) 기술 + 자연어처리(생성) 기술

• 활용 분야 – 이미지 검색

– 맹인들을 위한 사진 설명, 네비게이션

– 유아 교육, …

기존 연구 • Multimodal RNN (M-RNN) [2]

Baidu CNN + vanilla RNN

CNN: VGGNet • Neural Image Caption

generator (NIC) [4] Google CNN + LSTM RNN

CNN: GoogLeNet

• Deep Visual-Semantic alignments (DeepVS) [5] Stanford University RCNN + Bi-RNN

alignment (training) CNN + vanilla RNN

CNN: AlexNet

AlexNet, VGGNet

RNN을 이용한 이미지 캡션 생성 (동계학술대회 15)

• CNN + RNN

– CNN: VGGNet 15 번째 layer (4096 차원)

– RNN: GRU (LSTM RNN의 변형) • Hidden layer unit: 500, 1000 (Best)

• Multimodal layer unit: 500, 1000 (Best)

– Word embedding • SENNA: 50차원 (Best)

• Word2Vec: 300 차원

– Data set • Flickr 8K : 8000 이미지 * 이미지 캡션 5문장

– 6000 학습, 1000 검증, 1000 평가

• Flickr 30K : 31783 이미지 * 이미지 캡션 5문장

– 29000 학습, 1014 검증, 1000 평가

– 4가지 모델 실험 • GRU-DO1, GRU-DO2, GRU-DO3, GRU-DO4

GRU

Embedding

CNNMultimodal

Softmax

Wt

Wt+1

Image

GRU

Embedding

CNNMultimodal

Softmax

Wt

Wt+1

Image GRU

Embedding

CNNMultimodal

Softmax

Wt

Wt+1

Image

• GRU-DO1 • GRU-DO2

• GRU-DO3 • GRU-DO4

RNN을 이용한 이미지 캡션 생성 (동계학술대회 15)

GRU

Embedding

CNNMultimodal

Softmax

Wt

Wt+1

Image GRU

Embedding

CNNMultimodal

Softmax

Wt

Wt+1

Image GRU

Embedding

CNNMultimodal

Softmax

Wt

Wt+1

Image

Flickr 30K B-1 B-2 B-3 B-4

m-RNN (Baidu)[2] 60.0 41.2 27.8 18.7

DeepVS (Stanford)[5] 57.3 36.9 24.0 15.7

NIC (Google)[4] 66.3 42.3 27.7 18.3

Ours-GRU-DO1 63.01 43.60 29.74 20.14

Ours-GRU-DO2 63.24 44.25 30.45 20.58

Ours-GRU-DO3 62.19 43.23 29.50 19.91

Ours-GRU-DO4 63.03 43.94 30.13 20.21

Flickr 8K B-1 B-2 B-3 B-4

m-RNN (Baidu)[2] 56.5 38.6 25.6 17.0

DeepVS (Stanford)[5] 57.9 38.3 24.5 16.0

NIC (Google)[4] 63.0 41.0 27.0 -

Ours-GRU-DO1 63.12 44.27 29.82 19.34

Ours-GRU-DO2 61.89 43.86 29.99 19.85

Ours-GRU-DO3 62.63 44.16 30.03 19.83

Ours-GRU-DO4 63.14 45.14 31.09 20.94

Flickr30k 실험 결과

A black and white dog is jumping in the grass

A group of people in the snow

Two men are working on a roof

신규 데이터

A large clock tower in front of a building

A man in a field throwing a frisbee

A little boy holding a white frisbee

A man and a woman are playing with a sheep

한국어 이미지 캡션 생성

한 어린 소녀가 풀로 덮인 들판에 서 있다

건물 앞에 서 있는 한 남자

분홍색 개를 데리고 있는 한 여자와 한 여자

구명조끼를 입은 한 작은 소녀가 웃고 있다

GRU

Embedding

CNNMultimodal

Softmax

Wt

Wt+1

Image

Residual Network + 한국어 이미지 캡션 생성 (동계학술대회16)

Visual Question Answering

Facebook: Visual Q&A

Toronto COCO-QA dataset (ArXiv 15)

• QA dataset 자동 생성

– MS COCO (image caption) data set 이용

– Image caption으로부터 parser 등을 이용하여 질문과 정답을 자동 생성 • 정답은 single word라고 가정함 (classification problem)

• Object, Number, Color, Location types의 질문 생성 – Object: 70%, Color: 17%, Number: 7%, Location: 6%

– 123,287 images, 78,736 train questions, 38,948 questions

– Ex. What is there in front of the sofa? • Answer: table

VIS+LSTM model (Arxiv 15)

DPPnet (포항공대)

Stacked Attention Networks for Image QA (ArXiv 16)

Recommended