Learning and Knowledge Transfer with Memory Networks for Machine Comprehension


Mohit Yadav, Lovekesh Vig, Gautam Shroff

TCS Research New-Delhi

Presented by Kyo Kim

April 24, 2018


Overview

1 Motivation
2 Background
3 Proposed Method
4 Dataset and Experiment Results
5 Summary


Motivation


Problem

Obtaining high performance in "machine comprehension" requires an abundant human-annotated dataset.

Comprehension is measured by question-answering performance.

In a real-world dataset with a small amount of data, a wider range of vocabulary can be observed, and the grammatical structure is often complex.


High-level Overview of Proposed Method

1 A curriculum-based training procedure.
2 Knowledge transfer to increase performance on a dataset with less abundant labeled data.
3 A pre-trained memory network on a small dataset.


Background


End-to-end Memory Networks

1 Vectorize the problem tuple.
2 Retrieve the corresponding memory attention vector.
3 Use the retrieved memory to answer the question.


End-to-end Memory Networks Cont.

Vectorize the problem tuple

Problem tuple: $(q, C, S, s)$

$q$: question
$C$: context text
$S$: set of answer choices
$s$: correct answer ($s \in S$)

Question and context embedding matrix: $A \in \mathbb{R}^{p \times d}$

Query vector: $\vec{q} = A\,\Phi(q)$, where $\Phi$ is a bag-of-words representation

Memory vectors: $\vec{m}_i = A\,\Phi(c_i)$ for $i = 1, \dots, n$, where $n = |C|$ and $c_i \in C$
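A minimal sketch of this vectorization step in Python (assumptions, not the authors' code: numpy, a dict-based vocab mapping words to indices, illustrative shapes, and A stored as d x p so that A @ phi(...) lands in R^d):

    import numpy as np

    p, d = 5000, 64                     # vocabulary size and embedding dimension (assumed)
    A = 0.01 * np.random.randn(d, p)    # embedding matrix, stored d x p so A @ phi(.) is in R^d

    def phi(tokens, vocab):
        """Bag-of-words vector: count of each vocabulary word in the token list."""
        v = np.zeros(len(vocab))
        for t in tokens:
            if t in vocab:
                v[vocab[t]] += 1
        return v

    # q_vec = A @ phi(question_tokens, vocab)    # query vector ~q
    # m_i   = A @ phi(sentence_tokens, vocab)    # one memory vector ~m_i per c_i in C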


End-to-end Memory Networks Cont.

Retrieve the corresponding memory attention vector

Attention distribution: $a_i = \mathrm{softmax}(\vec{m}_i^{\top} \vec{q})$

Second memory vector: $\vec{r}_i = B\,\Phi(c_i)$, where $B$ is another embedding matrix similar to $A$

Aggregated vector: $\vec{r}_o = \sum_{i=1}^{n} a_i \vec{r}_i$

Prediction vector: $a_i = \mathrm{softmax}((\vec{r}_o + \vec{q})^{\top} U \Phi(s_i))$, where $U$ is the embedding matrix for the answers
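A minimal sketch of the retrieval and scoring steps, assuming the per-sentence vectors above are stacked into matrices (function names and shapes are illustrative, not the authors' code):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())            # subtract max for numerical stability
        return e / e.sum()

    def answer_scores(q_vec, M, R, U_ans):
        """q_vec: ~q, shape (d,); M: rows ~m_i, shape (n, d); R: rows ~r_i, shape (n, d);
        U_ans: rows U @ phi(s_i), one per answer choice, shape (|S|, d)."""
        a = softmax(M @ q_vec)             # attention a_i = softmax(~m_i^T ~q)
        r_o = a @ R                        # aggregated vector ~r_o = sum_i a_i ~r_i
        return softmax(U_ans @ (r_o + q_vec))   # one score per answer choice s_i

Answering then reduces to np.argmax over the returned scores, which is exactly the "pick the highest a_i" step on the next slide.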


End-to-end Memory Networks Cont.

Answer the question

Pick the $s_i$ that corresponds to the highest $a_i$.

Cross-entropy Loss

$$L(P, D) = -\frac{1}{N_D} \sum_{n=1}^{N_D} \Big[ a_n \cdot \log(a_n(P, D)) + (1 - a_n) \cdot \log(1 - a_n(P, D)) \Big]$$

where $a_n$ is the true 0/1 label for example $n$ and $a_n(P, D)$ is the model's predicted probability.
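In code this is ordinary binary cross-entropy (a sketch; labels holds the 0/1 indicators $a_n$ and preds the predicted probabilities $a_n(P, D)$):

    import numpy as np

    def cross_entropy(labels, preds, eps=1e-12):
        preds = np.clip(preds, eps, 1 - eps)        # guard against log(0)
        return -np.mean(labels * np.log(preds)
                        + (1 - labels) * np.log(1 - preds))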


Curriculum Learning

First proposed by Bengio et al. (2009)

Introduce samples with increasing "difficulty".

Yields better local minima, even under a non-convex loss.


Pre-training and Joint-training

Pre-training

Use a pre-trained model to initially guide the training process in a similar domain.

Joint-training

Exploit the similarity between two different domains by training the model in two different domains simultaneously.


Proposed Method


Curriculum Inspired Training (CIT)

Difficulty Measurement

$$SF(q, S, C, s) = \frac{\sum_{w \in \{q \cup S \cup C\}} \log(\mathrm{Freq}(w))}{\#\{q \cup S \cup C\}}$$

Partition the dataset into a fixed number of chapters of increasing difficulty.

Each chapter consists of $\bigcup_{i=1}^{\text{current chapter}} \mathrm{partition}[i]$.

The model is trained for a fixed number of epochs per chapter.
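A sketch of the scoring and chapter construction (assumptions: freq is a corpus word-frequency table covering every word, each example exposes its question/answer/context words, and a higher mean log-frequency is treated as easier):

    import numpy as np

    def sf_score(example, freq):
        """Mean log word frequency over the words of q, S, and C (higher = easier)."""
        words = set(example["q_words"]) | set(example["s_words"]) | set(example["c_words"])
        return sum(np.log(freq[w]) for w in words) / len(words)

    def make_chapters(examples, freq, num_chapters):
        """Sort easy-to-hard, split into partitions, and return cumulative chapters:
        chapter k = partition[1] u ... u partition[k]."""
        ordered = sorted(examples, key=lambda ex: -sf_score(ex, freq))
        size = -(-len(ordered) // num_chapters)     # ceiling division
        parts = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        return [sum(parts[:k + 1], []) for k in range(len(parts))]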


CIT Cont.

Loss Function

$$L(P, D, en) = -\frac{1}{N_D} \sum_{n=1}^{N_D} \Big[ a_n \cdot \log(a_n(P, D)) + (1 - a_n) \cdot \log(1 - a_n(P, D)) \Big] \cdot \mathbb{1}_{en \geq c(n) \cdot epc}$$

$en$: current epoch

$c(n)$: chapter number that example $n$ is assigned to

$epc$: epochs per chapter
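A vectorized sketch of this epoch-masked loss (assuming chapters is an array of 0-indexed chapter assignments $c(n)$, so that chapter 0 is active from the first epoch):

    import numpy as np

    def cit_loss(labels, preds, chapters, epoch, epochs_per_chapter, eps=1e-12):
        preds = np.clip(preds, eps, 1 - eps)
        ce = labels * np.log(preds) + (1 - labels) * np.log(1 - preds)
        mask = epoch >= chapters * epochs_per_chapter   # indicator 1_{en >= c(n) * epc}
        return -np.mean(ce * mask)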

Joint-Training

General Joint Loss Function

$$L(P, TD, SD) = 2\lambda \cdot L(P, TD) + 2(1 - \lambda) \cdot L(P, SD) \cdot F(N_{TD}, N_{SD})$$

$TD$: target dataset

$SD$: source dataset

$N_D$: number of examples in dataset $D$

$\lambda$: tunable weight parameter
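A direct sketch of this combination (loss_fn stands for any per-dataset loss, such as the cross_entropy or cit_loss sketches above; the batch packing is an assumption):

    def joint_loss(loss_fn, target_batch, source_batch, lam, F=1.0):
        """L(P, TD, SD) = 2*lam * L(P, TD) + 2*(1 - lam) * L(P, SD) * F(N_TD, N_SD)."""
        return (2 * lam * loss_fn(*target_batch)
                + 2 * (1 - lam) * loss_fn(*source_batch) * F)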


Loss Functions

Joint-training: $\lambda = \frac{1}{2}$ and $F(N_{TD}, N_{SD}) = 1$

$$L(P, TD, SD) = L(P, TD) + L(P, SD)$$

Weighted joint-training: $\lambda \in (0, 1)$ and $F(N_{TD}, N_{SD}) = \frac{N_{TD}}{N_{SD}}$

$$L(P, TD, SD) = 2\lambda \cdot L(P, TD) + 2(1 - \lambda) \cdot L(P, SD) \cdot \frac{N_{TD}}{N_{SD}}$$

Loss Functions Cont.

Curriculum joint-training: $\lambda = \frac{1}{2}$ and $F(N_{TD}, N_{SD}) = 1$

$$L(P, TD, SD) = L(P, TD, en) + L(P, SD, en)$$

Weighted curriculum joint-training: $\lambda \in (0, 1)$ and $F(N_{TD}, N_{SD}) = \frac{N_{TD}}{N_{SD}}$

$$L(P, TD, SD) = 2\lambda \cdot L(P, TD, en) + 2(1 - \lambda) \cdot L(P, SD, en) \cdot \frac{N_{TD}}{N_{SD}}$$

Source only: $\lambda = 0$ and $c \in \mathbb{R}^{+}$

$$L(P, TD, SD) = c \cdot L(P, SD)$$
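All of these variants reduce to choices of $\lambda$, $F$, and the base loss in the joint_loss sketch above (an illustrative configuration table, not the authors' code; lam = 0.7 is only a placeholder tuned value):

    def variant_config(name, n_td, n_sd, lam=0.7):
        """Return (base_loss, lam, F) per variant; lam = 0 keeps only the source term."""
        return {
            "joint":                     ("cross_entropy", 0.5, 1.0),
            "weighted_joint":            ("cross_entropy", lam, n_td / n_sd),
            "curriculum_joint":          ("cit_loss",      0.5, 1.0),
            "weighted_curriculum_joint": ("cit_loss",      lam, n_td / n_sd),
            "source_only":               ("cross_entropy", 0.0, 1.0),
        }[name]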


Dataset and Experiment Results


Dataset

Figure: Dataset used for experiments.


Experiment Results

Figure: The table has two major row groups. The upper rows are models that used only the target dataset; the lower rows are models that used both the target and source datasets.


Experiment Results

Figure: Categorical performance measurement on CNN-11K. The table has two major row groups. The upper rows are models that used only the target dataset; the lower rows are models that used both the target and source datasets.


Experiment Results

Figure: Knowledge transfer performance result.


Experiment Results

Figure: Loss convergence comparison between models trained with and without CIT.


Summary

MemNN is often used in QA.

Ordering the samples leads to better local minima.

Joint-training is useful for obtaining better performance on a small target dataset.

Using a pre-trained model improves performance.


The End

