Learning Multimodal Representations with Factorized Deep Generative Models
Yao-Hung Hubert Tsai∗†, Paul Pu Liang∗†, Amir Zadeh‡, Louis-Philippe Morency‡, and Ruslan Salakhutdinov†
†Machine Learning Department, ‡Language Technologies Institute, Carnegie Mellon University
∗equal contributions; {yaohungt,pliang,abagherz,morency,rsalakhu}@cs.cmu.edu
Multimodal Factorization Model
• Bayesian Network
  (Figure: generative network and inference network of the Bayesian network.)
• Notations
  – X_{1:M}: multimodal data from M modalities; Y: labels
  – X̂_{1:M}: generated multimodal data; Ŷ: generated labels
  – Z_{a_i}: modality-specific latent variables; F_·: factors
• Summary
  – Joint generative-discriminative objective for multimodal data.
  – Factorize the representation into independent sets of factors:
    ∗ Multimodal discriminative factors
    ∗ Modality-specific generative factors
• Neural Architecture
  – The encoder Q(Z_y | X_{1:M}) can be parametrized by any model that performs multimodal fusion.
• Contributions
  – SOTA performance on six multimodal datasets.
  – Flexible generation capabilities via independent factors.
  – Ability to reconstruct missing modalities.
  – Interpretation of multimodal interactions.
Generation, Inference, and Learning
• Generation
  – Factorization over the joint distribution:

    P(X_{1:M}, Y) = ∫_{F,Z} P(X_{1:M}, Y | F) P(F | Z) P(Z) dF dZ
                  = ∫_{F_y, F_{a_{1:M}}, Z_y, Z_{a_{1:M}}} ( P(Y | F_y) ∏_{i=1}^{M} P(X_i | F_{a_i}, F_y) ) ( P(F_y | Z_y) ∏_{i=1}^{M} P(F_{a_i} | Z_{a_i}) ) ( P(Z_y) ∏_{i=1}^{M} P(Z_{a_i}) ) dF dZ,

    with dF = dF_y ∏_{i=1}^{M} dF_{a_i} and dZ = dZ_y ∏_{i=1}^{M} dZ_{a_i}.
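To make the factorization concrete, below is a minimal ancestral-sampling sketch in PyTorch; the linear modules, latent dimensionalities, and M = 2 are illustrative assumptions, not the poster's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: M = 2 modalities; latent and factor dimensions chosen for illustration.
M, d_zy, d_za, d_f, d_x = 2, 16, 16, 32, 64

G_y   = nn.Linear(d_zy, d_f)                                        # factor map for F_y given Z_y
G_a   = nn.ModuleList([nn.Linear(d_za, d_f) for _ in range(M)])     # factor maps for F_ai given Z_ai
F_dec = nn.ModuleList([nn.Linear(2 * d_f, d_x) for _ in range(M)])  # decoders P(X_i | F_ai, F_y)
D     = nn.Linear(d_f, 1)                                           # label decoder P(Y | F_y)

# Ancestral sampling: Z ~ P(Z), F = G(Z), then decode each modality and the label.
z_y   = torch.randn(1, d_zy)
z_a   = [torch.randn(1, d_za) for _ in range(M)]
f_y   = G_y(z_y)
f_a   = [G_a[i](z_a[i]) for i in range(M)]
x_hat = [F_dec[i](torch.cat([f_a[i], f_y], dim=-1)) for i in range(M)]  # generated X̂_{1:M}
y_hat = D(f_y)                                                          # generated Ŷ
```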
• Inference
  – Joint-Distribution Wasserstein Distance
  – Approximation for intractable exact inference
  – Proposition 1: For any functions G_y: Z_y → F_y, G_{a_{1:M}}: Z_{a_{1:M}} → F_{a_{1:M}}, D: F_y → Y, and F_{1:M}: F_{a_{1:M}} × F_y → X_{1:M}, the Joint-Distribution Wasserstein distance is

    W_c(P_{X_{1:M}, Y}, P_{X̂_{1:M}, Ŷ}) = inf_{Q_Z = P_Z} E_{P_{X_{1:M}, Y}} E_{Q(Z | X_{1:M}, Y)} [ ∑_{i=1}^{M} c_{X_i}( X_i, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y)) ) + c_Y( Y, D(G_y(Z_y)) ) ],

    where P_Z is the prior over Z = [Z_y, Z_{a_{1:M}}] and Q_Z is the aggregated posterior of the proposed approximate inference distribution Q(Z | X_{1:M}, Y).
  – Generalized mean-field assumption:

    Q(Z | X_{1:M}, Y) := Q(Z | X_{1:M}) := Q(Z_y | X_{1:M}) ∏_{i=1}^{M} Q(Z_{a_i} | X_i).
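Under this factorized inference, the encoder splits into per-modality encoders for Z_{a_i} and a single fusion encoder for Z_y. A minimal sketch, where concatenation followed by an MLP stands in for whichever multimodal fusion model is actually used (all shapes are assumptions):

```python
import torch
import torch.nn as nn

M, d_x, d_za, d_zy = 2, 64, 16, 16

# Q(Z_ai | X_i): one encoder per modality.
enc_a = nn.ModuleList([nn.Linear(d_x, d_za) for _ in range(M)])
# Q(Z_y | X_{1:M}): any multimodal fusion model; concatenation + MLP as a placeholder.
enc_y = nn.Sequential(nn.Linear(M * d_x, 64), nn.ReLU(), nn.Linear(64, d_zy))

x   = [torch.randn(8, d_x) for _ in range(M)]     # a batch from each of the M modalities
z_a = [enc_a[i](x[i]) for i in range(M)]          # modality-specific generative codes
z_y = enc_y(torch.cat(x, dim=-1))                 # multimodal discriminative code
```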
• Relaxed Objective

    min_{F_{1:M}, G_{a_{1:M}}, G_y, D}  inf_{Q(Z|·) ∈ Q}  E_{P_{X_{1:M}, Y}} E_{Q(Z_{a_1} | X_1)} ⋯ E_{Q(Z_{a_M} | X_M)} E_{Q(Z_y | X_{1:M})} [ ∑_{i=1}^{M} c_{X_i}( X_i, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y)) ) + c_Y( Y, D(G_y(Z_y)) ) ] + λ MMD(Q_Z, P_Z),

    with P_Z a centered isotropic Gaussian N(0, I) and Z = [Z_y, Z_{a_{1:M}}].
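A minimal sketch of this relaxed objective as a training loss, with squared error standing in for the costs c_{X_i} and c_Y and a plug-in RBF-kernel MMD estimator against the N(0, I) prior; the kernel bandwidth, λ, and the squared-error costs are illustrative assumptions:

```python
import torch

def rbf_mmd(z_q, z_p, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples z_q ~ Q_Z and z_p ~ P_Z."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()

def mfm_loss(x, y, x_hat, y_hat, z_all, lam=1.0):
    # Reconstruction terms c_{X_i} for every modality, plus the label term c_Y.
    recon = sum(((xi - xi_hat) ** 2).mean() for xi, xi_hat in zip(x, x_hat))
    pred  = ((y - y_hat) ** 2).mean()
    # MMD between the aggregated posterior Q_Z and the isotropic Gaussian prior P_Z.
    prior = torch.randn_like(z_all)
    return recon + pred + lam * rbf_mmd(z_all, prior)
```

Here x and x_hat would be lists of per-modality inputs and reconstructions, y_hat the label-decoder output, and z_all the per-example concatenation [z_y, z_{a_1}, ..., z_{a_M}] produced by the encoders above.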
• Surrogate Inference for Missing Modalities
  – Φ: surrogate inference network

    Φ* = argmin_Φ E_{P_{X_{2:M}, X_1}} ( −log P_Φ(X_1 | X_{2:M}) ),
    with P_Φ(X_1 | X_{2:M}) := ∫ P(X_1 | Z_{a_1}, Z_y) Q_Φ(Z_{a_1} | X_{2:M}) Q_Φ(Z_y | X_{2:M}) dZ_{a_1} dZ_y.

  – Deterministic mappings in Q_Φ(·|·)
  – P_Φ(Y | X_{2:M}) := ∫ P(Y | Z_y) Q_Φ(Z_y | X_{2:M}) dZ_y
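With deterministic Q_Φ, the integral collapses to decoding from the surrogate codes, so Φ can be fit by reconstructing the missing modality from the observed ones. A sketch for M = 2 with X_1 missing; the linear encoders/decoder and the Gaussian-likelihood (squared-error) reading of −log P_Φ are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_x, d_za, d_zy = 64, 16, 16

# Surrogate encoders Q_Phi(Z_a1 | X_{2:M}) and Q_Phi(Z_y | X_{2:M}), deterministic.
phi_a1 = nn.Linear(d_x, d_za)          # with M = 2, X_{2:M} is a single observed modality
phi_y  = nn.Linear(d_x, d_zy)
dec_x1 = nn.Linear(d_za + d_zy, d_x)   # decoder P(X_1 | Z_a1, Z_y) from the trained MFM, kept fixed

opt = torch.optim.Adam(list(phi_a1.parameters()) + list(phi_y.parameters()), lr=1e-3)
x1, x2 = torch.randn(8, d_x), torch.randn(8, d_x)   # paired data (X_1, X_{2:M})

# -log P_Phi(X_1 | X_{2:M}) reduces to a reconstruction loss under a Gaussian decoder.
x1_hat = dec_x1(torch.cat([phi_a1(x2), phi_y(x2)], dim=-1))
loss = ((x1 - x1_hat) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```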
Controllable Generation
• Digits Dataset
  – Handwritten digits (MNIST [3]) + Street-View House Numbers (SVHN [5])
• Results (figure)
Multimodal Time Series Dataset
• Datasets in Human Multimodal Language
  – Multimodal personality trait recognition
    ∗ Movie reviews (POM [6])
  – Multimodal sentiment analysis
    ∗ Monologue opinions (CMU-MOSI [10])
    ∗ Online social reviews (ICT-MMMO [8])
    ∗ Product reviews and opinions (MOUD [7] / YouTube [4])
  – Multimodal emotion recognition
    ∗ Recorded dyadic dialogues (IEMOCAP [1])
• Multimodal Features
  – Language: pre-trained GloVe word embeddings
  – Visual: facial action units from Facet
  – Acoustic: MFCCs from COVAREP
  – Aligned by P2FA
• Results

Dataset: POM Personality Traits (Metric: r)
Task    Con     Pas     Voi     Dom     Cre     Viv     Exp     Ent     Res     Tru     Rel     Out     Tho     Ner     Per     Hum
SOTA2   0.359†  0.425†  0.166‡  0.235‡  0.358†  0.417†  0.450†  0.378‡  0.295   0.237   0.215‡  0.238   0.363†  0.258   0.344†  0.319†
SOTA1   0.395#  0.428#  0.193#  0.313#  0.367#  0.431#  0.452#  0.395#  0.333#  0.296#  0.255#  0.259#  0.381#  0.318#  0.377#  0.386#
MFM     0.431   0.450   0.197   0.411   0.380   0.448   0.467   0.452   0.368   0.212   0.309   0.333   0.404   0.333   0.334   0.408

Dataset: CMU-MOSI (Sentiment) | ICT-MMMO (Sentiment) | YouTube (Sentiment) | MOUD (Sentiment)
Metric   Acc_7   Acc_2   F1      MAE     r       | Acc_2   F1      | Acc_3   F1      | Acc_2   F1
SOTA2    34.1#   77.1‡   77.0‡   0.968‡  0.625‡  | 72.5∗   72.6∗   | 48.3‡   45.1†   | 81.1#   80.4#
SOTA1    34.7‡   77.4#   77.3#   0.965#  0.632#  | 73.8#   73.1#   | 51.7#   51.6#   | 81.1‡   81.2‡
MFM      36.2    78.1    78.1    0.951   0.662   | 81.3    79.2    | 53.3    52.4    | 82.1    81.7

Dataset: IEMOCAP Emotions
Task      Happy         Sad           Angry         Frustrated    Excited       Neutral
Metric    Acc_2  F1     Acc_2  F1     Acc_2  F1     Acc_2  F1     Acc_2  F1     Acc_2  F1
SOTA2     86.7‡  84.2§  83.4∗  81.7†  85.1   84.5§  79.5‡  76.6‡  89.6‡  86.3#  68.8§  67.1§
SOTA1     90.1#  85.3#  85.8#  82.8∗  87.0#  86.0#  80.3#  76.8#  89.8#  87.1‡  71.8#  68.5§
MFM       90.2   85.8   88.4   86.1   87.5   86.7   80.4   74.5   90.0   87.1   72.1   68.1
• Ablation Study
  – On CMU-MOSI (figure)
• Missing Modalities
  – On CMU-MOSI
Task                        X_· Reconstruction           Y Prediction
Metric                      MSE(ℓ)   MSE(a)   MSE(v)     Acc_7   Acc_2   F1      MAE     r
Purely Generative and Discriminative Baselines
  ℓ(anguage) missing        0.0411   -        -          19.4    59.6    59.7    1.386   0.225
  a(udio) missing           -        0.0533   -          34.0    73.5    73.4    1.024   0.615
  v(isual) missing          -        -        0.0220     33.7    75.4    75.4    0.996   0.634
Multimodal Factorization Model (MFM)
  ℓ(anguage) missing        0.0403   -        -          21.7    62.0    61.7    1.313   0.236
  a(udio) missing           -        0.0468   -          35.4    74.3    74.3    1.011   0.603
  v(isual) missing          -        -        0.0215     35.0    76.4    76.3    0.990   0.635
  all present               0.0391   0.0384   0.0182     36.2    78.1    78.1    0.951   0.662
Analyzing Multimodal Representations
• Information-Based Interpretation
  – Analysis of overall trends
  – Hilbert-Schmidt Independence Criterion [2, 9]:

    MI(F_·, X_i) = HSIC_norm(F_·, X_i) = tr(K_{F_·} H K_{X_i} H) / ( ‖H K_{F_·} H‖_F ‖H K_{X_i} H‖_F ),

  – Normalized ratios r_i = MI(F_y, X_i) / MI(F_{a_i}, X_i):

    Ratio       r_ℓ      r_v      r_a
    CMU-MOSI    0.307    0.030    0.107
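A small numpy sketch of this normalized HSIC estimate; the RBF kernel and its bandwidth are assumptions (any characteristic kernel could be used):

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """RBF kernel Gram matrix of a sample x with shape (n, d)."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic_norm(f, x, sigma=1.0):
    """Normalized HSIC: tr(K_F H K_X H) / (||H K_F H||_F ||H K_X H||_F)."""
    n = f.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kf, Kx = rbf_gram(f, sigma), rbf_gram(x, sigma)
    num = np.trace(Kf @ H @ Kx @ H)
    den = np.linalg.norm(H @ Kf @ H, 'fro') * np.linalg.norm(H @ Kx @ H, 'fro')
    return num / den

# Normalized ratio for modality i, as in the table above:
# r_i = hsic_norm(f_y, x_i) / hsic_norm(f_ai, x_i)
```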
• Gradient-Based Interpretation
  – Fine-grained analysis
  – Generated data x̂_i = [x̂_i^1, ⋯, x̂_i^t, ⋯, x̂_i^T], where

    x̂_i = F_i(f_{a_i}, f_y),  f_{a_i} = G_{a_i}(z_{a_i}),  f_y = G_y(z_y),
    z_{a_i} ∼ Q(Z_{a_i} | X_i = x_i),  z_y ∼ Q(Z_y | X_{1:M} = x_{1:M})

  – Gradient flow:

    ∇_{f_y}(x̂_i) := [ ‖∇_{f_y} x̂_i^1‖_F^2, ‖∇_{f_y} x̂_i^2‖_F^2, ⋯, ‖∇_{f_y} x̂_i^T‖_F^2 ].
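A sketch of computing this gradient flow with autograd: decode x̂_i from (f_{a_i}, f_y), take the Jacobian of each time step with respect to f_y, and record its squared Frobenius norm. The linear decoder emitting all T steps at once is an illustrative stand-in for F_i:

```python
import torch
import torch.nn as nn

T, d_f, d_x = 20, 32, 8
decoder = nn.Linear(2 * d_f, T * d_x)            # stand-in for F_i, emitting T time steps at once

f_a = torch.randn(d_f)
f_y = torch.randn(d_f)

def decode(fy):
    # x̂_i as a (T, d_x) sequence decoded from (f_ai, f_y).
    return decoder(torch.cat([f_a, fy], dim=-1)).view(T, d_x)

# Full Jacobian dx̂_i/df_y with shape (T, d_x, d_f); squared Frobenius norm per time step.
J = torch.autograd.functional.jacobian(decode, f_y)
flow = J.pow(2).sum(dim=(1, 2))                  # flow[t] = ||∇_{f_y} x̂_i^t||_F^2
```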
(Figure: gradient flows ∇_{f_y}(x̂_ℓ), ∇_{f_y}(x̂_v), ∇_{f_y}(x̂_a) over the language, visual, and acoustic modalities for time steps t = 1 to t = 20, on the example utterance "Umm, in a way, a lot of the themes in 'never let me go', which were very profound and deep."; segments are annotated as hesitancy, emphasis, and neutral, with a slight smile in the visual channel and otherwise uninformative regions.)
References
[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 2008.
[2] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63-77. Springer, 2005.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278-2324, 1998.
[4] L.-P. Morency, R. Mihalcea, and P. Doshi. Towards multimodal sentiment analysis: Harvesting opinions from the web. In ICMI, pages 169-176. ACM, 2011.
[5] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[6] S. Park, H. S. Shim, M. Chatterjee, K. Sagae, and L.-P. Morency. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. ICMI '14, 2014.
[7] V. Perez-Rosas, R. Mihalcea, and L.-P. Morency. Utterance-level multimodal sentiment analysis. In Association for Computational Linguistics (ACL), Aug. 2013.
[8] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46-53, 2013.
[9] D. Wu, Y. Zhao, Y.-H. H. Tsai, M. Yamada, and R. Salakhutdinov. "Dependency bottleneck" in auto-encoding architectures: An empirical study. arXiv preprint arXiv:1802.05408, 2018.
[10] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 2016.